arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11696 2026-05-13 cs.CV cs.AI cs.GR

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli, Jeppe Revall Frisvad

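AI summary WildRelight is the first real-world dataset designed specifically for single-image relighting. It contains high-resolution outdoor scenes paired with high-dynamic-range environment maps and is used to evaluate how existing methods perform in real environments. The dataset reveals that current state-of-the-art models trained on synthetic data suffer from severe domain shift in the real world. The work proposes a physics-guided inference framework that combines diffusion posterior sampling with temporal sampling-aware test-time adaptation, aligning synthetic models with real scenes on the fly and offering a new approach to the sim-to-real challenge.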

Comments Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: https://lez-s.github.io/wildrelight_proj/

Abstract

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

2605.11695 2026-05-13 cs.CV cs.AI

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

Mikako Ochiai, Masatoshi Nagano, Tadahiro Taniguchi

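AI summary This paper studies the communication that emerges between heterogeneous visual agents through decentralized learning, asking which visual information can be shared when agents have different visual representations. The agents exchange only discrete token sequences and update their own models from local perceptual evidence, without relying on a shared communicative objective. Experiments show that this form of communication yields visually informative shared token sequences that outperform a no-communication baseline on cross-agent alignment, visual-feature prediction, and image-text retrieval, and reveal how visual-encoder heterogeneity affects the content and symmetry of the resulting language.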

Abstract

Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
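The listener-side acceptance step can be illustrated with a minimal sketch. It assumes the listener scores each caption with a log-probability computed from its own visual features and accepts a proposal with probability min(1, p(proposal) / p(current)); the function names and the scoring inputs are hypothetical and are not taken from the paper.

```python
import math
import random

def mh_accept(logp_proposal, logp_current, rng):
    """MH-style acceptance: accept with probability min(1, p_prop / p_cur).

    Both log-probabilities are assumed to come from the listener's own
    (private) visual features; how they are computed is left abstract here.
    """
    log_ratio = logp_proposal - logp_current
    if log_ratio >= 0:
        return True  # proposal is at least as likely: always accept
    return rng.random() < math.exp(log_ratio)

def acceptance_rate(logp_proposal, logp_current, trials=10000, seed=0):
    """Empirical acceptance frequency over repeated independent draws."""
    rng = random.Random(seed)
    hits = sum(mh_accept(logp_proposal, logp_current, rng) for _ in range(trials))
    return hits / trials
```

Under this rule a proposal that the listener finds four times less likely than its current caption is accepted roughly a quarter of the time, so poorly grounded proposals are usually, but not always, rejected.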

2605.11694 2026-05-13 cs.LG

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

Michael Lu, Max Qiushi Lin, Mo Chen, Sharan Vaswani

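AI summary This paper studies policy optimization for infinite-horizon discounted constrained Markov decision processes (CMDPs), focusing on practical settings where a single final policy must be deployed. Because existing theoretical guarantees typically hold for mixture policies and are hard to apply directly, the authors adopt the augmented Lagrangian (AL) method combined with projected Q-ascent (PQA) to build a general framework with provable last-iterate convergence. The approach applies to tabular CMDPs, extends to log-linear policies and complex non-linear policies, and is validated on continuous control tasks.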

Abstract

We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This leads to a practical mismatch where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian ($\texttt{AL}$) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the $\texttt{AL}$ sub-problem with projected Q-ascent ($\texttt{PQA}$). Combining the theoretical guarantees of $\texttt{PQA}$ and the standard $\texttt{AL}$ analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of $\texttt{PQA}$ can achieve last-iterate convergence with comparable guarantees as prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
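To make the inexact augmented Lagrangian idea concrete, here is a toy sketch on a scalar convex problem rather than a CMDP: minimize f(x) = (x - 2)^2 subject to x - 1 <= 0. The inner sub-problem is solved inexactly by plain gradient descent, standing in for the paper's PQA step; the problem, step sizes, and function name are assumptions chosen for illustration only.

```python
def augmented_lagrangian_min(rho=1.0, outer_iters=50, inner_steps=200, lr=0.1):
    """Inexact augmented Lagrangian on: min (x - 2)^2  s.t.  x - 1 <= 0.

    AL function for an inequality constraint g(x) <= 0:
        f(x) + (max(0, lam + rho * g(x))^2 - lam^2) / (2 * rho)
    """
    x, lam = 0.0, 0.0
    for _ in range(outer_iters):
        # Inexact inner solve: a fixed number of gradient steps in x,
        # warm-started from the previous outer iterate.
        for _ in range(inner_steps):
            shifted = max(0.0, lam + rho * (x - 1.0))
            grad = 2.0 * (x - 2.0) + shifted  # d/dx of the AL function
            x -= lr * grad
        # Multiplier update, kept non-negative.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam
```

On this toy problem the final iterate itself (not an average over iterates) approaches the constrained optimum x = 1 with multiplier 2, which is the kind of last-iterate behavior the abstract is concerned with.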

2605.11693 2026-05-13 cs.AI

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

Abid Ali, Diego Molla-Aliod, Usman Naseem

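AI summary To address shortcomings of existing evaluation methods for multimodal summarization, this study proposes MM-Eval, a unified evaluation framework that jointly measures text quality, image-text alignment, and visual diversity. By combining factual consistency, semantic coherence, image relevance, and visual diversity metrics, MM-Eval evaluates multimodal summaries more comprehensively and accurately. Experiments show that the framework outperforms traditional heuristic methods and provides an interpretable, reference-weak solution for comparative evaluation of multimodal summarization systems.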

Comments Accepted to Findings of ACL 2026

Abstract

Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. Our analysis reveals a text-dominant hierarchy in this setting, where factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.

2605.11691 2026-05-13 cs.LG

Compositional Neural Operators for Multi-Dimensional Fluid Dynamics

Hamda Hmida, Hsiu-Wen Chang, Youssef Mesri

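AI summary This paper proposes Compositional Neural Operators (CompNO), a framework for two-dimensional fluid dynamics aimed at solving partial differential equations efficiently. The method decomposes complex physical equations into pretrained foundation blocks, such as convection, diffusion, and Poisson solvers, and combines them through an adaptation block that learns nonlinear interactions. Experiments show that the method is more flexible and interpretable when adapting to new physical systems and reuses pretrained blocks effectively.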

Comments Published as a conference paper at ICLR 2026

Abstract

Partial differential equations (PDEs) govern diverse physical phenomena, yet high-fidelity numerical solutions are computationally expensive and Machine Learning approaches lack generalization. While Scientific Foundation Models (SFMs) aim to provide universal surrogates, typical encoding-decoding approaches suffer from high pretraining costs and limited interpretability. In this paper, we propose Compositional Neural Operators (CompNO) for 2D systems, a framework that decomposes complex PDEs into a library of Foundation Blocks. Each block is a specialized Neural Operator pretrained on elementary physics. This modular library contains convection, diffusion, and nonlinear convection blocks as well as a Poisson Solver, enabling the framework to address the pressure-velocity coupling. These experts are assembled via an Adaptation Block featuring an Aggregator. This aggregator learns nonlinear interactions by minimizing data loss and physics-based residuals derived from governing equations. The proposed approach has been evaluated on the Convection-Diffusion equation, the Burgers' equation, and the Incompressible Navier-Stokes equation. Our results demonstrate that learning from elementary operators significantly improves adaptability, enhances model interpretability, and facilitates the reuse of pretrained blocks when adapting to new physical systems.

2605.11689 2026-05-13 cs.LG cs.CL

Slicing and Dicing: Configuring Optimal Mixtures of Experts

Margaret Li, Sneha Kudugunta, Danielle Rothermel, Luke Zettlemoyer

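AI summary This paper systematically studies the core design choices of Mixture-of-Experts (MoE) architectures in large language models, including expert count, granularity, shared experts, and load balancing, and analyzes their effect on model performance across more than 2,000 pretraining runs. It finds that performance keeps improving as total MoE parameters grow, and that the optimal expert size depends mainly on the active parameter count rather than the total parameter count. Expert count and granularity are the factors that matter most for model quality, while other settings such as shared experts and load-balancing mechanisms have comparatively small effects.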

Abstract

Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size, and load-balancing mechanisms. First, we find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Second, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts, and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity; other choices have minimal effect on final quality.
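The relationship between expert count, expert size, and active parameters can be illustrated with a small bookkeeping sketch. It assumes each expert is a two-matrix feed-forward block without biases and reads the "ratio" as total expert parameters divided by active expert parameters; both are assumptions made for illustration, and the function name is hypothetical.

```python
def moe_ffn_params(d_model, d_expert, num_experts, top_k):
    """Parameter counts for one MoE feed-forward layer (router excluded).

    Assumes each expert has an up-projection (d_model x d_expert) and a
    down-projection (d_expert x d_model), with no bias terms.
    """
    per_expert = 2 * d_model * d_expert
    total = num_experts * per_expert   # parameters stored
    active = top_k * per_expert        # parameters used per token
    return {"total": total, "active": active, "ratio": total / active}

coarse = moe_ffn_params(d_model=1024, d_expert=256, num_experts=128, top_k=1)
# Finer granularity: halve each expert, double both the count and top_k.
fine = moe_ffn_params(d_model=1024, d_expert=128, num_experts=256, top_k=2)
```

In this accounting, `coarse` and `fine` store and activate the same number of parameters, so granularity can be varied while total and active counts are held fixed.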

2605.11688 2026-05-13 cs.LG cs.AI cs.MA

Shaping Zero-Shot Coordination via State Blocking

Mingu Kang, Sunwoo Lee, Yonghyeon Jo, Seungyul Han

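AI summary This paper studies zero-shot coordination (ZSC): enabling agents to cooperate with partners they have never interacted with, which is essential for real-world multi-agent systems and human-AI collaboration. To address the limited generalization of existing methods to unseen partners, the authors propose State-Blocked Coordination (SBC), a framework that generates diverse interaction scenarios in virtual environments so that agents encounter a range of suboptimal partner policies during training, improving their zero-shot coordination ability. Experiments show that SBC achieves superior coordination performance on multiple benchmarks, with a notable advantage when cooperating with human partners.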

Comments 9 technical pages followed by references and appendix

Abstract

Zero-shot coordination (ZSC) aims to enable agents to cooperate with independently trained partners without prior interaction, a key requirement for real-world multi-agent systems and human-AI collaboration. Existing approaches have largely emphasized increasing partner diversity during training, yet such strategies often fall short of achieving reliable generalization to unseen partners. We introduce State-Blocked Coordination (SBC), a simple yet effective framework that improves ZSC by inducing diverse interaction scenarios without direct environment modification. Specifically, SBC generates a family of virtual environments through state blocking, allowing agents to experience a wide range of suboptimal partner policies. Across multiple benchmarks, SBC demonstrates superior performance in zero-shot coordination, including strong generalization to human partners.

2605.11687 2026-05-13 cs.AI

Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI

Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis, Dimosthenis Kyriazis

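AI summary Responding to the financial sector's need for trustworthy AI explanations, this study proposes an explainable-AI architecture that is persistent, cross-validated across multiple methods, and conversationally accessible. Its core methods store the outputs of multiple XAI methods as retrievable persistent objects, use retrieval-augmented generation to compare and synthesize multi-method explanations, and introduce automated checks to assess explanation reliability. The architecture is validated on a financial sentiment analysis task, where it significantly improves the accuracy and trustworthiness of explanations.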

Comments 5 pages

Abstract

Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts -- LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps -- as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.

2605.11685 2026-05-13 cs.CL

Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

Zeguan Xiao, Xuanzhe Xu, Yun Chen, Yong Wang, Jian Yang, Yanqing Hu, Guanhua Chen

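AI summary This paper studies the robustness of large language model (LLM) unlearning against relearning attacks. It finds that existing unlearning methods mainly optimize the dominant components and leave the minor components largely unmodified, so an attacker can quickly recover the forgotten knowledge by adjusting the dominant components. Based on an analysis of the spectral structure of representations, the authors propose Minor Component Unlearning (MCU), which performs unlearning along these more robust directions, significantly improving resistance to relearning attacks; its effectiveness is validated on multiple datasets.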

Abstract

Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods including sharpness-aware minimization.

2605.11684 2026-05-13 cs.LG eess.SP math.PR stat.AP

Partial Model Sharing Improves Byzantine Resilience in Federated Conformal Prediction

Ehsan Lari, Reza Arablouei, Stefan Werner

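AI summary This paper proposes a Byzantine-robust federated conformal prediction method based on partial model sharing, which exchanges only a subset of model parameters each round to improve security and communication efficiency. The method strengthens robustness in both the training and calibration stages: during training, partial sharing limits the attack surface and reduces the influence of malicious updates, while during calibration, histogram-based characterization vectors are used for anomaly detection and conformal quantile estimation. Experiments show that under a variety of Byzantine attack scenarios the method achieves coverage closer to the nominal level with significantly tighter prediction intervals, providing a more efficient and robust solution for federated uncertainty quantification.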

Comments 5 pages, 4 figures, Accepted for presentation at the 34th European Signal Processing Conference (EUSIPCO 2026) in Bruges, Belgium

Abstract

We propose a Byzantine-resilient federated conformal prediction (FCP) method that leverages partial model sharing, where only a subset of model parameters is exchanged each round. Unlike existing robust FCP approaches that primarily harden the calibration stage, our method protects both the federated training and conformal calibration phases. During training, partial sharing inherently restricts the attack surface and attenuates poisoned updates while reducing communication. During calibration, clients compress their non-conformity scores into histogram-based characterization vectors, enabling the server to detect Byzantine clients via distance-based maliciousness scores and to estimate the conformal quantile using only benign contributors. Experiments across diverse Byzantine attack scenarios show that the proposed method achieves closer-to-nominal coverage with substantially tighter prediction intervals than standard FCP, establishing a robust and communication-efficient approach to federated uncertainty quantification.
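The calibration stage described above can be sketched in a few lines. The sketch assumes non-conformity scores lie in [0, 1], uses the mean Euclidean distance to the other clients' histograms as the maliciousness score, and applies the standard split-conformal quantile; the binning, the distance rule, and all function names are illustrative assumptions, not the paper's exact definitions.

```python
import math

def histogram_vector(scores, bins=10, lo=0.0, hi=1.0):
    """Compress a client's scores into a normalized histogram."""
    counts = [0] * bins
    for s in scores:
        idx = int((s - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp edge values
    return [c / len(scores) for c in counts]

def maliciousness_scores(vectors):
    """Mean distance from each client's vector to every other client's."""
    result = []
    for i, v in enumerate(vectors):
        dists = [math.dist(v, w) for j, w in enumerate(vectors) if j != i]
        result.append(sum(dists) / len(dists))
    return result

def conformal_quantile(scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest score.

    The rank is clamped to n here; the textbook rule returns +infinity
    when the rank exceeds n.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]
```

A server following this sketch would drop the clients with the largest maliciousness scores and compute the quantile from the remaining contributors only.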

2605.11683 2026-05-13 cs.CV

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Kaixuan He, Song Chen, Yi Kang

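AI summary Vision Transformers (ViTs) are computationally expensive because of the quadratic complexity of self-attention. To address this, the paper proposes DORA, a reinforcement-learning-based dynamic online inference framework for adaptive token merging in ViTs. DORA models token merging as a Markov decision process in which a lightweight RL agent decides the merging strategy from the current feature state and layer context, and the agent is optimized with a non-linear knowledge-distillation penalty to balance computational efficiency and feature fidelity. Experiments show that DORA outperforms existing methods across multiple ViT scales, delivering significant computational speedups with minimal accuracy loss.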

Comments Preprint. Under review

Abstract

Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.

2605.11680 2026-05-13 cs.CV

ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes

Shivam Kumar

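AI summary This paper introduces ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction, in which a model generates an executable drawing program from a rendered image and the result is compared with the target image. Samples are generated with a reproducible (seeded) random number generator, which supports creating fresh uncontaminated test sets; the benchmark includes 150 samples across difficulty tiers and is scored with several metrics. Experiments show that current state-of-the-art models still perform poorly on exact match, indicating that the benchmark leaves substantial room for improvement.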

Comments 14 pages, 5 figures, 2 tables. Code, data, and artifacts: https://github.com/shivamk3r/shape-code-bench ; archival release: https://doi.org/10.5281/zenodo.20132286

Abstract

We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
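The image-level metrics named above can be sketched for binary rasters. This is a minimal illustration that treats exact match as pixel-for-pixel equality and scores an empty union as IoU 1.0; both conventions, and the function name, are assumptions rather than the benchmark's documented definitions.

```python
def score_render(pred, target):
    """Compare two equally sized binary rasters (1 = foreground)."""
    flat_p = [px for row in pred for px in row]
    flat_t = [px for row in target for px in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("rasters must have the same number of pixels")
    matches = sum(p == t for p, t in zip(flat_p, flat_t))
    inter = sum(1 for p, t in zip(flat_p, flat_t) if p and t)
    union = sum(1 for p, t in zip(flat_p, flat_t) if p or t)
    return {
        "exact_match": matches == len(flat_t),
        "pixel_accuracy": matches / len(flat_t),
        # Assumed convention: two empty canvases count as a perfect IoU.
        "foreground_iou": inter / union if union else 1.0,
    }

target = [[0, 1], [1, 1]]
pred = [[0, 1], [1, 0]]
result = score_render(pred, target)
```

A single wrong pixel is enough to fail exact match while pixel accuracy and IoU stay high, which mirrors the abstract's observation that small parameter errors keep exact match low.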

2605.11678 2026-05-13 cs.AI

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

Seungwoo Roh, Huiyeong Kim, Jong-Chan Kim

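AI summary This paper proposes OOM-Free Alpamayo, a framework that uses CPU-GPU memory swapping to run vision-language-action (VLA) models efficiently on VRAM-constrained GPUs without modifying the model architecture. Through layer-wise memory management, pipelined parameter transfer, and a resident-layer decision policy, the method significantly reduces VRAM usage and improves inference speed. Experiments show up to a 3.55x speedup over existing methods on NVIDIA's Alpamayo-R1-10B model while maintaining full BF16 precision.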

Comments Submitted to IEEE RTCSA on March 26, 2026 (KST); Accepted on May 4, 2026 (KST)

Abstract

End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory, far exceeding the 12-16GB available on commodity GPUs. We present a framework that enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer-compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration (both the number and placement of resident layers) from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
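The three stages can be compared with a deliberately simplified latency model. It assumes that each layer's parameters are prefetched while the previous layer computes, that only one transfer is in flight at a time, and that resident layers cost no transfer time; this is an illustrative model with hypothetical names, not the paper's performance prediction model.

```python
def sequential_latency(transfer, compute):
    """Load each layer, then run it, with no overlap."""
    return sum(t + c for t, c in zip(transfer, compute))

def pipelined_latency(transfer, compute, resident=frozenset()):
    """Overlap layer i's compute with the transfer of layer i + 1.

    Layers listed in `resident` stay on the GPU, so their transfer
    time is treated as zero.
    """
    if not compute:
        return 0
    eff = [0 if i in resident else t for i, t in enumerate(transfer)]
    total = eff[0]  # the first transfer cannot be hidden
    for i, c in enumerate(compute):
        next_transfer = eff[i + 1] if i + 1 < len(eff) else 0
        total += max(c, next_transfer)
    return total

transfer = [2, 2, 2]   # per-layer host-to-device time (arbitrary units)
compute = [3, 1, 3]    # per-layer execution time
```

With these numbers, pipelining hides the second layer's transfer behind the first layer's compute, but the third layer's transfer is longer than the second layer's compute; pinning that third layer on the GPU removes the leftover stall.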

2605.11674 2026-05-13 cs.RO

A Proprioceptive-Only Benchmark for Quadruped State Estimation: ATE, RPE, and Runtime Trade-offs Between Filters and Smoothers

Ylenia Nisticò, João Carlos Virgolino Soares, Joan Solà, Claudio Semini

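AI summary This paper compares three state-of-the-art proprioceptive state estimators for quadruped robots (MUSE, IEKF, and IS), evaluating their long-term and short-term accuracy and computational efficiency on the CYN-1 sequence of the GrandTour dataset. IEKF and IS outperform MUSE on long-term trajectory error, while short-term errors differ little across methods, and the methods trade off accuracy against computational latency. The study gives a clear reference on performance and computational cost for choosing a quadruped state estimator, and all evaluation code is open-sourced for reproducibility.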

Comments Submitted to IEEE Robotics and Automation Practice

Abstract

We compare three state-of-the-art proprioceptive state estimators for quadruped robots: MUSE [1], the Invariant Extended Kalman Filter (IEKF) [2], and the Invariant Smoother (IS) [3], on the CYN-1 sequence of the GrandTour Dataset [4]. Our goal is to give practitioners clear guidance on accuracy and computation time: we report long-term accuracy (Absolute Trajectory Error, ATE), short-term accuracy (translational and rotational Relative Pose Error, RPE), and per-update computation time on a fixed hardware/software stack. On this dataset, RPEs are broadly similar across methods, while IEKF and IS achieve a lower ATE than MUSE. Runtime results highlight the accuracy-latency trade-offs across the three approaches. In the discussion, we outline the evaluation choices used to ensure a fair comparison and analyze factors that influence short-horizon metrics. Overall, this study provides a concise snapshot of accuracy and cost, helping readers choose an estimator that fits their application constraints, with all evaluation code and documentation released open-source at https://github.com/iit-DLSLab/state_estimation_benchmark for full reproducibility.
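The difference between the two accuracy metrics can be shown with a translation-only sketch. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame, and it ignores orientation entirely, so it is a simplification of the full SE(3) definitions; the function names are hypothetical and unrelated to the released benchmark code.

```python
import math

def ate_rmse(est, gt):
    """RMSE of absolute position error over the whole trajectory."""
    sq = [sum((e - g) ** 2 for e, g in zip(p, q)) for p, q in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))

def rpe_trans_rmse(est, gt, delta=1):
    """RMSE of the error in relative displacement over `delta` steps."""
    sq = []
    for i in range(len(est) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        sq.append(sum((u - v) ** 2 for u, v in zip(d_est, d_gt)))
    return math.sqrt(sum(sq) / len(sq))

# Ground truth moves along x; the estimate drifts sideways by 1 per step.
gt = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
est = [(0, 0, 0), (1, 1, 0), (2, 2, 0)]
```

With a constant per-step drift the relative error stays fixed while the absolute error accumulates, which is one way estimators can show similar RPE yet different ATE.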

2605.11672 2026-05-13 cs.AI cs.DB

A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination

Vinu Ellampallil Venugopal

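AI summary Inspired by the CAP theorem in distributed systems, this paper proposes a CAP-like trilemma for large language models (LLMs): under semantic underdetermination, a model cannot simultaneously guarantee strong correctness, strict non-bias, and high utility. When a prompt does not determine a unique answer, a useful response requires the model to introduce some selection criterion; if that criterion is neither supplied by the user nor justified by the premises, the response may be biased. Conversely, a model that avoids unsupported preferences may preserve correctness and non-bias but sacrifices utility. The study argues that some LLM failures may stem from the semantic underdetermination of the task itself rather than from limits of model capability.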

Abstract

The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness--non-bias--utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.

2605.11666 2026-05-13 cs.LG cs.AI

Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

Liqin Ye, Yanbin Yin, Michael Galarnyk, Yuzhao Heng, Sudheer Chava, Chao Zhang

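AI summary This paper proposes Evolutionary Task Discovery (EvoTD), a framework that improves the reasoning ability of large language models through structured evolutionary operators. The method treats data synthesis as a directed search over a dual-axis manifold of algorithmic skills and complexity attributes, introducing a crossover operator that increases the diversity of skill compositions and a parametric mutation operator that scales structural constraints to promote robust generalization. Experiments show that EvoTD effectively expands the model's reasoning frontier and generalizes well across model architectures and pretraining regimes.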

Abstract

The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evolutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code at https://github.com/liqinye/EvoTD.

2605.11665 2026-05-13 cs.RO

Nautilus: From One Prompt to Plug-and-Play Robot Learning

Yufeng Jin, Jianfei Guo, Xiaogang Jia, Yu Deng, Zechu Li, Han Liu, Weiran Liao, Vignesh Prasad, Mathias Franzius, Gerhard Neumann, Georgia Chalvatzaki

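AI summary Robot learning research is currently fragmented across policy families, benchmarks, and real robot systems, leaving implementations entangled and hard to port or reuse. To address this, the paper proposes NAUTILUS, an open-source framework that turns a single user instruction (for example, "Evaluate policy A with benchmark B") into executable reproduction, evaluation, fine-tuning, and deployment workflows. Through unified interfaces, typed contracts, and automated validation, NAUTILUS flexibly integrates existing and user-defined policies, simulators, benchmarks, and real robots, significantly reducing the engineering burden of cross-family reproduction and evaluation.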

Abstract

Robot learning research is fragmented across policy families, benchmark suites, and real robots; each implementation is entangled with the others in a complex combination matrix, making it an engineering nightmare to port any single element. General-purpose coding agents may occasionally bridge specific setups, but cannot close this gap at scale because they lack the procedural priors and validation practices that characterize robotics research workflows. We propose NAUTILUS, an open-source harness that turns a single user prompt -- for example, "Evaluate policy A with benchmark B" -- into ready-to-use reproduction, evaluation, fine-tuning, and deployment workflows. NAUTILUS provides: plug-and-play agent skill sets with distilled priors from robotics research; typed contracts among policies, simulators/benchmarks, and real-world robots; unified interfaces and execution environments; and a trustworthy agentic coding workflow with explicit, automated validation, and testing at each milestone. NAUTILUS can not only automatically generate the required adapters and containers for existing implementations, but also wrap and onboard new or user-provided policies, simulators/benchmarks, and robots, all connected via a uniform interface. This expands cross-validation coverage without hand-written glue code. Like a nautilus shell that grows by adding chambers, NAUTILUS scales by extending its execution in chambered units, making it a research harness for scalability rather than a hand-curated framework, and aiming to reduce the engineering burden of cross-family reproduction and evaluation in the ever-growing robot learning ecosystem.

2605.11663 2026-05-13 cs.CL

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

Kyosuke Takami, Yuka Tateisi, Satoshi Sekine, Yusuke Miyao

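AI summary Based on data from Japan's National Assessment of Academic Ability, this study builds a multimodal benchmark dataset covering Science, Mathematics, and Japanese Language, containing real exam items, diagrams, and response distributions aggregated from about 900,000 students. The dataset preserves the structure and content of real exams and supports comparing human and model performance under a unified evaluation framework. The study evaluates multimodal large language models with exact-match accuracy and character-level F1, further analyzes the reliability of automatic scoring, and provides a reproducible benchmark for multimodal educational reasoning that supports future work on model evaluation and explainability in authentic assessment settings.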

Abstract

Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N $\approx$ 900{,}000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark
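The two scoring metrics mentioned above can be sketched as follows. Character-level F1 is computed here from the multiset overlap of characters, in the style of token-level F1 for question answering, with no text normalization; the paper's exact normalization and matching rules are not stated in the abstract, so this definition and the function names are assumptions.

```python
from collections import Counter

def exact_match_accuracy(preds, refs):
    """Fraction of predictions identical to their reference string."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def char_f1(pred, ref):
    """Character-level F1 based on multiset overlap of characters."""
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the overlap is counted per character rather than per whitespace-separated token, the same function applies to Japanese answers, where word boundaries are not marked by spaces.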

2605.11659 2026-05-13 cs.CV cs.AI

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li, Ruixuan Li

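AI summary This paper studies how to adapt large models such as CLIP to a target domain with only a few samples when source-domain data is unavailable, i.e. source-free cross-domain few-shot learning (CDFSL). It finds that adapter-based methods (e.g., LoRA) outperform prompt-based methods in CDFSL, and that this advantage comes from rectifying the attention of the visual CLS token, which strengthens modality alignment and class separation. Building on this finding, the authors propose Semantic Probe, a general attention rectification framework that improves both adapter- and prompt-based methods in CDFSL and achieves state-of-the-art results on multiple benchmarks.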

English abstract

Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find that adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover that LoRA's superiority stems from rectifying the collapsed attention of the visual CLS token, which enhances modality alignment and class separation by focusing on text-related visual regions. Further, we find that the textual EOS token exhibits much better attention to visual samples, and that CLIP's standard contrastive loss only weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Code will be released.

2605.11636 2026-05-13 cs.AI

Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Xinbo Gao, Jing Zhang

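AI summary This paper proposes Seirênes, an adversarial self-play reinforcement learning framework that turns the failures of large language models when reasoning in complex contexts into a training signal, improving their robustness. A single model both generates distracting contexts and solves the tasks, which forces it to identify the core logic amid noise and strengthens its deeper reasoning ability. Experiments show that Seirênes achieves significant gains on several mathematical reasoning benchmarks and can effectively expose reasoning blind spots in top-tier closed-source models.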

English abstract

We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4--5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.

2605.11634 2026-05-13 cs.CV cs.AI

Unlocking UML Class Diagram Understanding in Vision Language Models

Artem Naboichenko, René Peinl

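AI summary Although vision language models (VLMs) have made notable progress across applications, they still fall short in understanding structured visual content such as diagrams, and UML class diagram understanding in the computer science domain has received little study. This paper proposes a visual question answering benchmark based on UML class diagrams that is both challenging and manageable, and builds a large-scale training dataset of 16,000 image-question-answer triples. Experiments show that a LoRA-based fine-tune outperforms Qwen 3.5 27B, a recent and widely used model, on this task.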

English abstract

Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regarding diagrams compared to photos. Although progress has been made on bar charts, line charts, and similar diagrams, there is still little research on other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams that is both challenging and manageable. We further construct a large-scale training dataset with 16,000 image-question-answer triples and show that a LoRA-based fine-tune easily outperforms Qwen 3.5 27B, a recent VLM that performs well on many other benchmarks.

2605.11633 2026-05-13 cs.AI

Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

Junjue Wang, Weihao Xuan, Heli Qi, Pengyu Dai, Kunyi Liu, Hongruixuan Chen, Zhuo Zheng, Junshi Xia, Stefano Ermon, Naoto Yokoya

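AI summary This paper presents DORA, a benchmark for evaluating the end-to-end capabilities of large language model agents in disaster emergency response. Through 515 expert-designed tasks covering 45 real disaster events, spanning disaster perception, spatial analysis, evacuation planning, and multimodal report generation, it tests agents' reasoning and operation over heterogeneous geospatial data. The experiments reveal three major challenges for current LLM agents in disaster response: insufficient domain grounding, difficulty with tool selection and argument grounding, and fragile long-pipeline reasoning, offering a useful reference for building more reliable disaster-response systems.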

Comments DORA stress-tests LLM agents on real-world disaster operations that demand comprehensive orchestration of 108 specialized tools over heterogeneous geospatial data

English abstract

Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.

2605.11629 2026-05-13 cs.CL

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

Yuanhao Yue, Chengyu Wang, Yuanjie Lyu, Lei Shen, Jun Huang

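AI summary Multimodal large language models have recently shown strong chain-of-thought reasoning on vision-language tasks, but latency and resource constraints limit their deployment in real systems. To address this, the paper proposes OmniThoughtVis, a scalable data curation and knowledge distillation framework for transferring the multimodal reasoning ability of large models to smaller, more deployable ones. The method generates structured reasoning traces and combines several strategies to ensure data quality, and it substantially improves the reasoning performance of small models on multiple benchmarks, showing its practical value.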

English abstract

Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.

2605.11628 2026-05-13 cs.CV

Single-Shot HDR Recovery via a Video Diffusion Prior

Chinmay Talegaonkar, Jinshi He, Christopher McKenna, Nicholas Antipa

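AI summary This paper proposes a single-shot high dynamic range (HDR) image recovery method based on a video diffusion prior, addressing the trade-off between fidelity and model complexity in existing methods. It recasts HDR reconstruction as a conditional video generation task, generating an exposure sequence and fusing it into the final HDR image, which improves the accuracy and interpretability of the reconstruction. Experiments show that the method outperforms existing methods on several evaluation metrics, is preferred in human evaluation, and extends to other image reconstruction tasks.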

English abstract

Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
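The fusion step described above, combining an exposure bracket with per-pixel weights, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function `fuse_bracket` is hypothetical, the frames are assumed to be linear (not gamma-encoded) with known relative exposure times, and the weights are passed in as an array standing in for the output of the light-weight UNet.

```python
import numpy as np


def fuse_bracket(frames, weights, exposure_times, eps=1e-8):
    """Fuse K exposures into one HDR image with per-pixel weights.

    frames:         (K, H, W, C) linear LDR frames (assumption: linear response)
    weights:        (K, H, W) non-negative per-pixel weights, e.g. from a small network
    exposure_times: (K,) relative exposure time of each frame (assumption)
    """
    frames = np.asarray(frames, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    times = np.asarray(exposure_times, dtype=np.float64)

    # Bring every frame to a common radiance scale.
    radiance = frames / times[:, None, None, None]
    # Normalize the weights so they sum to (almost exactly) one at each pixel.
    norm = weights / (weights.sum(axis=0, keepdims=True) + eps)
    # Weighted average over the exposure axis.
    return (norm[..., None] * radiance).sum(axis=0)
```

Because the output is an explicit weighted average of the generated exposures, each HDR pixel can be traced back to the frames that contributed to it, which matches the interpretability argument made in the abstract.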

2605.11625 2026-05-13 cs.AI

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

Zhaomeng Zhou, Lan Zhang, Junyang Wang, Mu Yuan, Junda Lin

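AI summary This paper studies how to make large reasoning models perform adaptive reasoning more efficiently under limited compute. The authors propose Budget-Efficient Thinking (BET), a two-stage framework that combines a behavioral cold start with an investment-cost-aware reward so that the model allocates compute according to the expected return of reasoning rather than problem difficulty. BET teaches the model to answer easy problems quickly, give up early on unsolvable ones, and reserve enough compute for hard but solvable ones, substantially reducing reasoning cost and improving overall performance across multiple benchmarks.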

Comments 24 pages, 6 figures, 11 tables

English abstract

Large reasoning models (LRMs) improve problem solving through extended reasoning, but often misallocate test-time compute. Existing efficiency methods reduce cost by compressing reasoning traces or conditioning budget on perceived difficulty, yet largely overlook solvability. As a result, they may spend large budgets on queries beyond the model's capability while compressing hard-but-solvable queries that require deeper reasoning. In this work, we formulate adaptive reasoning as a computational investment under uncertainty, where budget should follow the expected return of reasoning rather than perceived difficulty alone. To instantiate this principle, we propose Budget-Efficient Thinking (BET), a two-stage framework that combines behavioral cold-start with GRPO under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, BET learns three behaviors: (1) short solve, answering easy queries concisely; (2) nice fold, abstaining early when continued reasoning has near-zero expected return; and (3) hero call, preserving sufficient compute for hard-but-solvable queries. Across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% on average while achieving overall performance improvements, and transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning with comparable efficiency gains.

2605.11622 2026-05-13 cs.CV

RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction

Yaxuan Song, Jianan Fan, Tianyi Wang, Qiuyue Hu, Hang Chang, Heng Huang, Weidong Cai

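AI summary This paper proposes RNA-FM, a generative model for predicting genome-wide RNA sequencing (RNA-seq) data from histopathology whole-slide images (WSIs). It formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a morphology-conditioned velocity field that maps a simple prior distribution to the target gene expression distribution, which better captures biological heterogeneity and predictive uncertainty. By incorporating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation, and experiments show it outperforms existing methods in both performance and biological meaningfulness.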

Comments 15 pages, 13 tables, 3 figures. Accepted by the Forty-Third International Conference on Machine Learning (ICML2026). Code is available at https://github.com/YXSong000/RNA-FM

English abstract

Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
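For readers unfamiliar with flow matching, the "velocity field that maps a simple prior to the target distribution" can be written out in its most common form. The notation below is ours, and the straight-line interpolation path is an assumption; the abstract does not state which probability path RNA-FM uses.

```latex
% x_0 \sim \mathcal{N}(0, I): sample from the simple prior
% x_1: observed gene expression vector,  c: morphology features from the WSI
% Assumed straight-line path between prior sample and target:
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \sim \mathcal{U}[0, 1]

% Conditional flow-matching objective for the velocity network v_\theta:
\mathcal{L}(\theta) =
  \mathbb{E}_{t,\, x_0,\, (x_1, c)}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2

% Sampling: integrate the learned ODE from t = 0 to t = 1
\frac{\mathrm{d} x_t}{\mathrm{d} t} = v_\theta(x_t, t, c), \qquad x_0 \sim \mathcal{N}(0, I)
```

Drawing several prior samples for the same slide features yields several plausible expression profiles, which is how a generative formulation of this kind can express the predictive uncertainty that deterministic regression cannot.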

2605.11618 2026-05-13 cs.RO

Sampling-Based Follow-the-Leader Motion Planning for Manipulator-Mounted Continuum Robots

Chengnan Shentu, Nicholas Baldassini, Oluwagbotemi D. Iseoluwa, Radian Gondokaryono, Jessica Burgner-Kahrs

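AI summary This paper studies follow-the-leader (FTL) motion planning for manipulator-mounted continuum robots and proposes a sampling-based planner that jointly considers the robot configuration and the base pose provided by the manipulator. The method computes the base pose directly through a geometric construction, avoiding iterative optimization during online planning and improving efficiency, and it guarantees resolution-complete shape search and convergent tip tracking. Experiments show a 100% success rate and high path accuracy across multiple test scenarios.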

English abstract

Follow-the-leader (FTL) motion exploits the unique morphology of continuum robots (CRs) to navigate confined spaces by having the body retrace the path of the tip. While extensively studied, existing FTL methods typically assume a fixed base or a single degree-of-freedom insertion mechanism, limiting their applicability to practical systems in which CRs are mounted on robotic manipulators with fully actuated SE(3) base pose. This paper presents a sampling-based motion planner for FTL motion of manipulator-mounted CRs that jointly considers robot configuration and base pose. The key idea is to decouple global shape search from base pose determination by computing the base pose through a closed-form geometric construction, thereby avoiding iterative optimization during online planning. The approach supports general forward models and enables efficient planning by shifting the majority of computation offline. We establish theoretical guarantees including resolution complete shape search and converging tip tracking throughout waypoint traversal and interpolation. Experiments on 120 simulated paths over 3 test classes demonstrate 0% tip error and 1.9% mean shape deviation (w.r.t. robot length) at 100% success rate. We validate the practicality of our approach on a 6-DOF tendon-driven CR mounted on a serial manipulator. Code and visualization available at https://continuumroboticslab.github.io/sb-ftl-cr-planner/.

2605.11616 2026-05-13 cs.CV

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Qirui Wang, Jingyi He, Yining Pan, Xulei Yang, Shijie Li

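AI summary This study addresses the grounding of 3D functional affordances, i.e. accurately localizing the specific regions of objects that support interaction, such as handles or buttons, with vision-language models. It proposes AFFORDMEM, a framework that uses two memory mechanisms, cross-scene and in-scene, and needs no model fine-tuning or target-scene annotation, building a reusable memory bank from source scenes to assist grounding. Experiments show that the method significantly improves grounding accuracy on SceneFun3D, supporting its effectiveness for fine-grained localization and spatial-relation understanding.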

English abstract

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. The project homepage is available at the project page.

2605.11613 2026-05-13 cs.LG cs.AI

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

Guobin Shen, Lei Huang, Xiang Cheng, Chenxiao Zhao, Jindong Li, Dongcheng Zhao, Xing Yu

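AI summary This paper studies how to move from generic correlation to input-specific credit assignment when using self-distillation in on-policy training. The authors show that the standard self-distillation reward is essentially the pointwise mutual information (pMI) between the response and the feedback, and further decompose it into an input-specific part and a generic-shortcut part. Based on this, they propose CREDIT, which isolates the input-specific reward component through a contrastive approach, improving performance on multiple tasks at minimal compute overhead.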

English abstract

On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
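The claim that the token rewards sum to a pointwise mutual information can be made concrete with a short derivation. The notation is ours, and the exact form of the token reward is an assumption (feedback-conditioned teacher log-probability minus student log-probability); the paper may define it differently.

```latex
% x: input,  f: environment feedback,  y = (y_1, \dots, y_T): sampled response
% \pi: the language model, used both with and without the feedback in context
% Assumed token reward:
r_t = \log \pi(y_t \mid x, f, y_{<t}) - \log \pi(y_t \mid x, y_{<t})

% If the feedback-conditioned model is the Bayesian posterior, Bayes' rule
% turns each reward into a filtering increment on the feedback:
r_t = \log p(f \mid x, y_{\le t}) - \log p(f \mid x, y_{<t})

% Summing over the trajectory telescopes:
\sum_{t=1}^{T} r_t
  = \log \frac{p(f \mid x, y)}{p(f \mid x)}
  = \log \frac{\pi(y \mid x, f)}{\pi(y \mid x)}
  = \mathrm{pMI}(y ; f \mid x)
```

Each token is therefore credited with how much it shifts the model's belief about the feedback, which explains why a response can score well through generic shortcuts that predict the feedback regardless of the particular input.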

2605.11612 2026-05-13 cs.CL cs.AI

When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

Ziyu Liu, Tao Li, Tianjie Ni, Xiaolong Lan, Wengang Ma, Tao Yang, Guohua Wang, Junjiang He

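AI summary This study proposes Paraesthesia, a new backdoor attack on large language models that uses emotion as a dynamic trigger to achieve stealthy attacks. Unlike traditional backdoor attacks based on fixed trigger words, Paraesthesia exploits the fact that emotional styles form separate clusters in representation space, embedding emotion as the trigger signal in the training data so that the model produces a predefined malicious output when it encounters inputs with the specific emotion at inference time. Experiments show that the method reaches an attack success rate of up to about 99% across multiple tasks and models while preserving normal model functionality.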

English abstract

Backdoor vulnerabilities are widespread in the fine-tuning of large language models (LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of an LLM, emotion can be decoupled from semantics, forming a distinct cluster separate from the original neutral text. Therefore, we use the emotional factor as the backdoor trigger and propose Paraesthesia, a parasitic emotion-style dynamic backdoor attack. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two components: the quantification and rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99% across both task types and four different models, while maintaining the clean utility of the models.