arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 4056
2605.09678 2026-05-12 cs.AI

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

Ryan Albright, Golam Md Muktadir, Zarif Ikram, S M Jubaer, Mehrab Hossain, Dianbo Liu

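AI summary: This paper proposes Absurd World, a benchmarking framework for testing the logical reasoning ability of large language models (LLMs). The method decomposes real-world problems into symbols, actions, sequences, and events, and automatically alters these elements to build scenarios that are logically consistent yet absurd, testing whether an LLM can reason without relying on real-world patterns while the logic of the task stays unchanged. Experiments show that Absurd World is an effective tool for evaluating the robustness of LLM logical reasoning.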

Abstract

Although large language models (LLMs) are extremely powerful and versatile across tasks, their thinking capabilities are often scrutinized because they sometimes fail to solve problems that humans can solve systematically. However, recent literature focuses on breaking LLM reasoning with increasingly complex problems, and whether an LLM is robust in simple logical reasoning remains underexplored. This paper proposes Absurd World, a benchmarking framework that tests LLMs against altered realism, where scenarios are logically coherent and humans can easily solve the tasks. Absurd World breaks a real-world model into symbols, actions, sequences, and events, which are automatically altered to create absurd worlds where the logic to solve the tasks remains the same. The paper evaluates a large collection of models with simple and advanced prompting techniques and shows that the framework is an effective tool for determining LLMs' ability to think logically while ignoring the patterns learned from the real world. One can use this framework to extensively test an LLM against a real-world problem to verify whether the LLM's reasoning capability is robust against variations of the task.
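To make the idea of altering symbols while keeping the task logic fixed more concrete, here is a minimal Python sketch. It is not the paper's pipeline: the function name `absurdify`, the example task, and the substitution table are all hypothetical, and only the "symbols" part of the decomposition is illustrated.

```python
import re

def absurdify(text: str, symbol_map: dict[str, str]) -> str:
    """Replace whole-word symbols in one pass so swaps never chain into each other."""
    alternatives = "|".join(re.escape(key) for key in symbol_map)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: symbol_map[match.group(1)], text)

# Hypothetical task and substitution table, invented for illustration only.
task = "Put the kettle on the stove, then pour the water into the cup."
swap = {"kettle": "shoe", "stove": "cloud", "water": "gravel", "cup": "piano"}

print(absurdify(task, swap))
# Put the shoe on the cloud, then pour the gravel into the piano.
```

The sequence of steps (place, then pour) is untouched, so a solver that follows the stated logic should be unaffected, while a solver leaning on real-world co-occurrence patterns may not be.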

2605.09677 2026-05-12 cs.CV

VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement

Qingyu Xian, Hao Cheng, Berend Jan van der Zwaag, Rolands Kromanis, Ozlem Durmaz Incel

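AI summary: This paper proposes VFM-SDM, a structural displacement measurement framework based on vision foundation models (VFMs) that achieves non-contact, multi-directional structural displacement measurement without task-specific training, on-site markers, or calibration. The method combines VFM-inferred camera parameter estimation with point tracking, reconstructs displacements via triangulation, and introduces structural geometry constraints to improve the physical plausibility and consistency of the estimates. Experimental results show that the framework achieves high measurement accuracy and stability in real-world scenes, offering a new approach to automated, scalable structural health monitoring.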

Abstract

Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.

2605.09676 2026-05-12 cs.LG cs.AI nlin.CD

ChaosNetBench: Benchmarking Spatio-Temporal Graph Neural Networks on Chaotic Lattice Dynamics

Henok Tenaw Moges, Charalampos Skokos, Deshendran Moodley

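AI summary: This paper proposes ChaosNetBench (CNB), a synthetic benchmark dataset and evaluation framework for assessing spatio-temporal graph neural networks (STGNNs) under controlled multidimensional chaotic dynamics. CNB is built on a lattice of coupled standard maps, allows local chaos strength, coupling strength, and system size to be tuned independently, and provides known topology and dynamics for 96 system instances and 9,600 trajectories. The study introduces chaos indicators and an evaluation protocol and, by comparing 13 architectures, reveals the advantage of STGNNs over non-graph models when dealing with different levels of local and global chaos.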

Comments 24 pages, 11 figures

Abstract

Spatio-temporal graph neural networks (STGNNs) are widely used for short-term forecasting in dynamic physical systems such as traffic and weather. However, the prevailing evaluation practice uses real-world benchmark datasets in a single domain with a single fixed holdout split, making it difficult to compare architectures across different dynamical regimes. We introduce ChaosNetBench (CNB), a synthetic benchmark dataset and evaluation framework for studying STGNN performance under controlled multidimensional chaotic dynamics. CNB is built on a lattice of coupled standard maps with independently tunable local chaos ($K$), coupling strength ($\varepsilon$), and system size ($N$), providing known topology and known dynamics across 96 system instances and 9,600 trajectories. We introduce chaos indicators, evaluation metrics, and a protocol to analyze and compare the capacity of STGNN architectures to deal with different levels of local and global chaos. We illustrate the usage of the framework by analyzing 13 architectures (5 STGNNs and 8 non-graph baselines). The results reveal a regime-dependent transition in which non-graph baselines (TCN, N-BEATS, iTransformer) remain competitive when there is low local chaos, while STGNNs (e.g., Graph WaveNet, D2STGNN, STAEformer) are generally more resilient to higher levels of local and global chaos. CNB provides a practical, reusable testbed for systematically comparing and analyzing the capacity of STGNN architectures to handle different levels of local and global chaos.
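The abstract does not give the map equations, so the following Python sketch uses one common form of a ring of coupled standard maps as an assumption: each site receives the usual kick `K * sin(theta_i)` plus a nearest-neighbour coupling term scaled by `eps`. The function names `step` and `simulate`, the ring topology, and the sign convention of the coupling are illustrative choices, not CNB's definition.

```python
import math

def step(theta, p, K, eps):
    """Advance a ring of N coupled standard maps by one iteration."""
    n = len(theta)
    new_p = []
    for i in range(n):
        left = theta[(i - 1) % n]
        right = theta[(i + 1) % n]
        kick = K * math.sin(theta[i])  # local chaos term
        coupling = eps * (math.sin(right - theta[i]) + math.sin(left - theta[i]))
        new_p.append(p[i] + kick + coupling)
    # Angles are advanced with the updated momenta and wrapped to [0, 2*pi).
    new_theta = [(theta[i] + new_p[i]) % (2.0 * math.pi) for i in range(n)]
    return new_theta, new_p

def simulate(theta0, p0, K, eps, steps):
    """Return the list of (theta, p) states, including the initial one."""
    states = [(list(theta0), list(p0))]
    theta, p = list(theta0), list(p0)
    for _ in range(steps):
        theta, p = step(theta, p, K, eps)
        states.append((theta, p))
    return states

trajectory = simulate([0.1, 2.0, 4.0, 5.5], [0.0, 0.0, 0.0, 0.0], K=1.5, eps=0.05, steps=100)
```

Because `K`, `eps`, and the lattice size enter separately, each can be swept on its own, which is the property the benchmark relies on.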

2605.09675 2026-05-12 cs.AI cs.MA

CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

Timothy Ossowski, Xinchi Liu, Danyal Maqbool, Vaibhav Dhanuka, Sheng Zhang, Hoifung Poon, Majid Afshar, Tyler Bradshaw, Junjie Hu

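AI summary: This paper proposes CodeClinic, a benchmark built on MIMIC-IV for evaluating whether large language models can automatically generate and compose reusable clinical skills for clinical reasoning tasks instead of relying on fixed tool libraries. The benchmark contains two complementary tasks, longitudinal ICU surveillance and compositional information seeking, which evaluate structured decision-making and multi-step reasoning respectively. The study also proposes an offline autoformalization pipeline that converts natural-language clinical guidelines into verifiable Python skill libraries through iterative refinement, significantly improving reasoning consistency and reducing per-query computational cost.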

Abstract

Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.

2605.09672 2026-05-12 cs.RO

MVB-Grasp: Minimum-Volume-Box Filtering of Diffusion-based Grasps for Frontal Manipulation

Bibek Poudel, Abdul Basit, Muhammad Shafique

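AI summary: Targeting frontal grasping with low-cost robotic arms in constrained workspaces, this paper proposes MVB-Grasp, a grasp filtering method based on the minimum-volume bounding box (MVBB) that effectively improves grasp success rates. The method introduces a geometric prior, uses the face normals of the oriented bounding box for fast filtering, and fuses learned discriminator scores with face-alignment geometry to refine grasp candidates. Experiments show that on the Unitree Z1 arm MVB-Grasp achieves a success rate 2.4 times that of the conventional method, validating its effectiveness for grasping in constrained spaces.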

Comments 8 pages, 12 figures, accepted to IJCNN 2026

Abstract

State-of-the-art 6-DoF grasp generators excel on tabletop benchmarks with overhead cameras but struggle in frontal grasping scenarios on low-cost manipulators with constrained workspaces, where kinematic limits and approach-direction constraints cause high failure rates. We address this challenge for the Unitree Z1 arm by proposing MVB-Grasp, a novel grasping stack that injects a Minimum Volume Bounding Box (MVBB) geometric prior into diffusion-based grasp generation to dramatically improve success rates in frontal, workspace-constrained settings. Our key scientific contributions are threefold: (i) an MVBB-based geometric filter that exploits oriented bounding-box face normals to reject grasps approaching through the table or misaligned with accessible object faces in O(N) time; (ii) a combined re-scoring function that blends learned discriminator scores with face-alignment geometry (α = 0.85), specifically calibrated for the Z1's frontal workspace and kinematic constraints; and (iii) a systematic MuJoCo evaluation protocol measuring grasp success across object types, distances, lateral positions, and pitch orientations to validate embodiment-specific performance. We implement MVB-Grasp on a Unitree Z1 arm with an Intel RealSense D405 camera, integrating YOLOv8 object detection, GraspGen for candidate generation, Principal Component Analysis (PCA)-based MVBB fitting, and inverse-kinematics trajectory planning. Experiments across 81 MuJoCo episodes (cylinder, asymmetric box, water bottle) demonstrate that MVB-Grasp achieves 59.3% success versus 24.7% for vanilla GraspGen, a 2.4x improvement, by filtering geometrically infeasible candidates and prioritizing face-aligned grasps suited to the Z1's frontal approach constraints. Real-world trials confirm that the MVBB prior substantially improves grasp reliability on constrained, low-cost manipulators without requiring model retraining.
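The filter-then-rescore step in (i) and (ii) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract gives α = 0.85 but not the blending formula, so the code assumes α weights the discriminator score and 1 − α weights the alignment term; the alignment measure, the `min_align` threshold, and all names are hypothetical.

```python
from dataclasses import dataclass

Vec = tuple[float, float, float]

def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))

@dataclass
class Grasp:
    approach: Vec      # unit vector along which the gripper moves toward the object
    disc_score: float  # learned discriminator score in [0, 1]

def face_alignment(approach: Vec, accessible_normals: list[Vec]) -> float:
    """Best cosine between the approach direction and any inward face direction."""
    return max(dot(approach, (-n[0], -n[1], -n[2])) for n in accessible_normals)

def filter_and_rescore(grasps, accessible_normals, alpha=0.85, min_align=0.5):
    """Drop misaligned grasps in one linear pass, then rank the survivors."""
    kept = []
    for g in grasps:
        align = face_alignment(g.approach, accessible_normals)
        if align < min_align:  # e.g. approaching from below, through the table
            continue
        kept.append((alpha * g.disc_score + (1.0 - alpha) * align, g))
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept

# Outward normals of the faces assumed reachable: front (facing the robot) and top.
normals = [(-1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
candidates = [Grasp((1.0, 0.0, 0.0), 0.6), Grasp((0.0, 0.0, 1.0), 0.9)]
ranked = filter_and_rescore(candidates, normals)
```

In this toy input the second candidate has the higher discriminator score but approaches upward from under the object, so the geometric filter removes it before scoring. The filter pass is linear in the number of candidates because a box has at most six faces; the final sort adds the usual logarithmic factor.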

2605.09670 2026-05-12 cs.RO cs.CV

Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

Aws Khalil, Jaerock Kwon

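AI summary: This paper studies the generative capability of predictive display for vision-based teleoperation systems, which aims to mitigate the effects of communication latency by generating future visual states. The authors propose a zero-shot benchmark, with no task-specific fine-tuning, that evaluates several off-the-shelf generative video models on short-horizon predictive display. Experiments show that existing models cannot simultaneously meet the requirements of predictive display in prediction accuracy, inference latency, and error stability, revealing a performance gap between general-purpose generative video models and predictive display applications in teleoperation.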

Abstract

Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD

2605.09667 2026-05-12 cs.CV cs.AI

S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes

Albert Heruth

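AI summary: This paper proposes S2P-Net, a compact deep learning architecture for rotation-invariant object recognition in low-data regimes that guarantees mathematical rotation invariance without data augmentation. The network combines spectral-domain and spatial-domain information and strengthens its robustness to rotation through a polar coordinate transform. Compared with conventional convolutional neural networks, S2P-Net shows better recognition performance in few-sample scenarios, offering a new approach to rotation-invariant object recognition under low-data conditions.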

Comments 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request

Abstract

We present S2P-Net (Spectral-Spatial Polar Network), a compact deep learning architecture that achieves mathematically guaranteed rotation invariance without data augmentation. In this paper, we also compare it with other neural network architectures (CNNs). Have a look at the results and feel free to contact me with any questions. This is my first paper :) Made by Hackbert

2605.09666 2026-05-12 cs.CV cs.AI

Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models

Abdul Basit, Ashir Rashid, Muhammad Abdullah Hanif, Muhammad Shafique

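AI summary: This paper examines the shortcomings of how multiple sclerosis (MS) lesion segmentation models are evaluated, noting that most current evaluations use the Dice score and fail to adequately account for lesion-level detection and segmentation performance or for model behavior on cases that are complex or hard for human annotators to judge. The authors analyze in detail the features neurologists look for in brain MRI scans, propose evaluation metrics that better match practical needs, and analyze existing state-of-the-art models on two open-source datasets to assess their suitability for real-world clinical settings.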

Comments 8 pages, 5 figures, Accepted to IJCNN 2026

Abstract

Multiple Sclerosis (MS) is a chronic autoimmune disease that can significantly reduce the quality of life of a patient. Existing treatment options can only help slow down the progression of the disease. Therefore, early detection and precise monitoring of disease progression are important. Deep learning offers state-of-the-art models for detecting and segmenting MS lesions in brain MRI scans. However, most of these models are evaluated using the Dice score, without accounting for lesion-wise detection and segmentation performance or other metrics that quantify model performance in cases that are complex or confusing for human annotators, or in cases that are essential for disease detection and progression monitoring. In this paper, we highlight the need to rethink the evaluation of MS lesion segmentation models. In this context, we first present problem fingerprinting in detail to highlight what neurologists look for in brain MRI scans for MS detection and progression monitoring, and which metrics are required to properly quantify model performance in these contexts. Additionally, we present an analysis of state-of-the-art models on two open-source datasets using these metrics to highlight their usability for real-world deployment in hospitals.

2605.09665 2026-05-12 cs.LG cs.AI cs.CL

Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies

Jingze Song, Zihao Chen, Wenqing Chen, Zibin Zheng

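AI summary: This paper studies how to select training data efficiently for instruction tuning and proposes a joint task-model adaptation framework that learns multi-indicator weights to optimize data selection. Using in-context learning signals on small validation sets, the method determines the optimal weight configuration without full-scale fine-tuning, enabling efficient, high-fidelity data evaluation. Experiments show that the method performs comparably to or better than full-dataset fine-tuning across multiple benchmarks and model families, and reveal a trade-off between semantic diversity and logical complexity in reasoning tasks.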

Comments This work has been accepted at IJCAI 2026

Abstract

Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
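As a rough illustration of weighting several indicators and keeping the top-scoring fraction, the Python sketch below scores each sample by a weighted sum and searches weight settings against a cheap proxy. The plain grid search is a stand-in chosen for brevity, not the paper's learning procedure, and the indicator names, the `proxy` callable (which would be the ICL signal on a small validation set), and the toy data are all hypothetical.

```python
from itertools import product

def select(samples, weights, fraction=0.3):
    """Keep the top `fraction` of samples by weighted indicator score."""
    def score(sample):
        return sum(w * sample["indicators"][name] for name, w in weights.items())
    ranked = sorted(samples, key=score, reverse=True)
    keep = max(1, round(len(samples) * fraction))
    return ranked[:keep]

def search_weights(samples, indicator_names, proxy, grid=(0.0, 0.5, 1.0)):
    """Grid-search weight vectors, scoring each selected subset with `proxy`."""
    best_weights, best_score = None, float("-inf")
    for combo in product(grid, repeat=len(indicator_names)):
        if sum(combo) == 0:
            continue  # an all-zero weighting ranks nothing
        weights = dict(zip(indicator_names, combo))
        candidate = proxy(select(samples, weights))
        if candidate > best_score:
            best_weights, best_score = weights, candidate
    return best_weights

# Toy pool: "quality" rises with the id while "diversity" falls.
pool = [{"id": i, "indicators": {"quality": i / 10, "diversity": (9 - i) / 10}}
        for i in range(10)]
top = select(pool, {"quality": 1.0, "diversity": 0.0})

# Hypothetical proxy that happens to reward diverse subsets.
best = search_weights(pool, ["quality", "diversity"],
                      proxy=lambda subset: sum(s["indicators"]["diversity"] for s in subset))
```

The point of the sketch is that the best weights depend on the proxy, which in the paper's setting depends on both the downstream task and the model being tuned.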

2605.09663 2026-05-12 cs.LG cs.AI

Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation

Julien Lafrance, Richard Khoury, Véronique Tremblay

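AI summary: In dynamic environments, machine learning classifiers often degrade because of concept drift, and conventional evaluation methods struggle to reflect the causal dependencies in the data-generating process. This paper proposes Causal Parametric Drift Simulation, a digital twin framework based on structural causal models that uses precise causal interventions to reveal a classifier's latent vulnerabilities before deployment. Experiments show that the method uncovers hidden problems that standard statistical monitoring cannot identify, providing a new and effective tool for evaluating classifier robustness.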

Comments 34 pages, 13 figures, 14 tables

Abstract

Machine learning classifiers in dynamic environments face concept drift -- changes in the data-generating process that degrade performance. Conventional evaluation via static test sets or noise perturbations fails to preserve causal dependencies in tabular data, often producing causally invalid assessments. Post-hoc tools like SHAP and LIME offer correlational insights that may not reflect the causal mechanisms driving model failure. We propose a framework that complements existing drift detection by leveraging Structural Causal Models as "Digital Twins" of data-generating processes, enabling precise causal interventions while preserving structural dependencies. Our technique, Causal Parametric Drift Simulation, stress-tests classifiers to identify vulnerabilities before deployment. Experiments on the Open Sourcing Mental Illness (OSMH) dataset demonstrate that this approach exposes latent vulnerabilities invisible to standard statistical monitors.
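A toy version of parametric drift on a structural causal model may help fix the idea. Everything below is an assumption made for illustration: the three-variable linear SCM, the variable names, the drifting coefficient `beta`, and the fixed threshold classifier are invented and have nothing to do with the dataset used in the paper.

```python
import random

def sample_scm(n, beta=1.0, seed=0):
    """Draw rows from a toy SCM: stress -> sleep -> label, plus stress -> label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stress = rng.gauss(0.0, 1.0)
        sleep = -beta * stress + rng.gauss(0.0, 0.5)  # mechanism whose parameter drifts
        label = int(stress - sleep + rng.gauss(0.0, 0.5) > 0)
        rows.append((stress, sleep, label))
    return rows

def classifier(stress, sleep):
    """Fixed rule that uses only `sleep`, a shortcut that works when beta = 1."""
    return int(sleep < 0)

def accuracy(rows):
    return sum(classifier(s, z) == y for s, z, y in rows) / len(rows)

baseline = accuracy(sample_scm(5000, beta=1.0))
drifted = accuracy(sample_scm(5000, beta=-1.0))  # intervene on the structural parameter
print(f"baseline={baseline:.2f} drifted={drifted:.2f}")
```

In this toy model, flipping the sign of `beta` leaves the marginal distribution of every single feature unchanged (both `stress` and `sleep` stay zero-mean Gaussians with the same variance), so a per-feature statistical monitor would see nothing, yet the classifier's accuracy drops because the relationship it relied on has changed.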

2605.09662 2026-05-12 cs.CV

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

Alessio Mazzucchelli, Maria Naranjo-Almeida, Jorge Bustos-Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer, Jordi Sanchez-Riera, Adrian Penate-Sanchez

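AI summary: This paper proposes BEA-GS, a new Gaussian Splatting method that goes beyond radiance supervision to achieve more precise object extraction. The method introduces two new loss functions that optimize the geometry of visible and non-visible Gaussians respectively, so that the geometry aligns more accurately with semantic boundaries. Experiments show that the method achieves the best boundary segmentation to date across multiple datasets, substantially improving the precision of object-level editing and asset extraction.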

Comments CVPR 2026 Highlight

Abstract

Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the scene's geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene's geometry even when little transmittance reaches a Gaussian (partial or non-visible). Exhaustive comparisons with 12 state-of-the-art methods across 4 datasets, using 6 metrics, demonstrate that our approach produces overall the best boundary segmentation to date.

2605.09661 2026-05-12 cs.CL cs.AI

MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

Huy Hoang Ha, Benoit Favre, Francois Portet

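AI summary: This paper proposes MedMeta, a new benchmark for evaluating the ability of large language models (LLMs) to synthesize meta-analysis conclusions from the abstracts of medical studies. The benchmark contains 81 meta-analyses from PubMed and evaluates models under two workflows: retrieval-augmented generation with ground-truth abstracts (Golden-RAG) and a parametric-only approach that relies solely on internal knowledge. The study finds that Golden-RAG, grounded in external information, significantly outperforms the parametric-only approach, that domain-specific fine-tuning brings limited benefit, and that current models handle negated evidence poorly, underscoring both the importance of RAG systems for clinical applications and the shortcomings of existing models.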

Abstract

Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018--2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson's r correlation (0.81) and Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.
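The two agreement statistics named in the abstract, Pearson's r and Bland-Altman analysis, are standard and can be computed in a few lines. The Python sketch below uses only the standard library; the judge and expert ratings are made-up numbers for illustration and are not data from the paper.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) using the usual 1.96 * SD limits."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread

# Made-up 1-to-5 ratings, for illustration only.
judge = [2.5, 3.0, 4.0, 1.5, 3.5, 2.0]
expert = [3.0, 3.0, 3.5, 2.0, 4.0, 2.0]

r = pearson_r(judge, expert)
bias, lower, upper = bland_altman(judge, expert)
print(f"r={r:.2f} bias={bias:.2f} limits=({lower:.2f}, {upper:.2f})")
```

The two statistics answer different questions: a high r only says the ratings move together, while a bias near zero with narrow limits says the judge is not systematically higher or lower than the experts.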

2605.09659 2026-05-12 cs.RO

ASACK: Adaptive Safe Active Continual Koopman Learning for Uncertain Systems with Contractive Guarantees

Chandan Kumar Sah, Rajpal Singh, Jishnu Keshavan

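AI summary: This paper proposes ASACK, an adaptive safe active continual Koopman learning framework for safe control of uncertain systems subject to model uncertainty and distributional shift. The method learns an autoencoder-based Koopman model offline and corrects it online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shift and model uncertainty. To improve data efficiency, the method incorporates an active learning strategy that steers the system to collect informative data while accomplishing task objectives, integrates the active learning objectives and safety constraints into a nonconvex optimization problem, and obtains formal safety guarantees through a robust MPC framework. Experiments show that the method outperforms existing state-of-the-art approaches.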

Abstract

Koopman operator theory provides a powerful framework for representing nonlinear dynamics through a linear operator acting on lifted observables, enabling the use of linear control techniques for nonlinear systems. However, Koopman models are typically learned from data and often degrade in performance under model uncertainty and distributional shifts between training and deployment. Although several works have explored online adaptation to address this issue, many rely on neural network-based updates that introduce significant computational overhead and lack formal safety guarantees, limiting their suitability for real-time and safety-critical robotic applications. In this work, we propose a unified framework for continual adaptive Koopman learning that enables safe and efficient online refinement of learned models during task execution. An autoencoder-based Koopman model is first learned offline and subsequently refined online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shifts and model uncertainty. To improve data efficiency and accelerate model refinement, the adaptation mechanism is integrated with an active learning strategy that drives the system to collect informative data while accomplishing task objectives. The resulting control problem is formulated as a nonconvex optimization problem incorporating both active learning objectives and safety constraints. We further derive theoretical bounds on model approximation error and show how these bounds can be incorporated within a robust Model Predictive Control (MPC) framework to provide formal safety guarantees. The proposed approach unifies learning, excitation, and safety within a single control framework without sacrificing real-time feasibility. Extensive simulation and experimental studies demonstrate superior performance compared to state-of-the-art baselines.

2605.09656 2026-05-12 cs.RO

ORICF -- Open Robotics Inference and Control Framework

Andrés Meseguer Valenzuela, Luís Miguel Bartolín Arnau

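AI summary: This paper proposes ORICF, an open robotics inference and control framework that addresses the high computational overhead, latency, and energy consumption of current AI in robotic applications. The framework is modular, declarative, and model-agnostic, and lets models, hardware, and data channels be changed through lightweight YAML configuration without code modification. In experiments on a mobile robot combining speech recognition, a large language model, and an object detection model, the study verifies that edge deployment with ORICF significantly reduces robot-side compute load and energy consumption while preserving modularity and reproducibility.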

Comments Accepted in ICRA26 Workshop: 8th International Workshop on Robotics Software Engineering (RoSE 26)

Abstract

Recent advances in artificial intelligence (AI) have enabled effective perception and language models for robots, but their deployment remains computationally expensive, increasing latency and energy use. This work presents the Open Robotics Inference and Control Framework (ORICF), a modular, declarative, and model-agnostic platform for composing multimodal robotic inference pipelines. ORICF integrates input/output (I/O) adapters, pluggable inference back ends, and post-processing logic, while lightweight YAML specifications allow models, hardware targets, and data channels to be changed without code modification. The framework also supports edge offloading, i.e., executing inference on nearby external computers instead of onboard the robot. ORICF is evaluated on a mobile robot that answers spoken queries about people detected in its camera stream by combining automatic speech recognition (ASR), a large language model (LLM), and a convolutional neural network (CNN) detector through Robot Operating System 2 (ROS2). Compared with onboard execution, ORICF-based edge deployment reduces robot-side compute utilization by up to 83.16% and estimated energy consumption by 65.8%, while preserving modularity and reproducibility.

2605.09650 2026-05-12 cs.AI cs.LG

Workspace Optimization: How to Train Your Agent

Elad Sarafian, Gal Kaplun, Ron Banner, Daniel Soudry, Boris Ginsburg

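AI summary: This paper studies how to improve an agent's performance on complex multi-turn tasks by optimizing its "workspace". The authors argue that when the weights of a frontier language model cannot be adapted, training should proceed through a structured external workspace, a process they call workspace optimization. To this end they design DreamTeam, a system in which multiple agents cooperate to build an executable world model; on ARC-AGI-3 it achieves a higher score than the previous best method while using fewer environment actions.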

Abstract

Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.

2605.09649 2026-05-12 cs.LG

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, Rex Ying

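AI summary: This paper studies how better key-value (KV) cache management can improve model performance in long-context inference. The authors propose a global retention-based KV cache eviction method that learns each token's future utility and evicts selectively under a unified memory budget, improving generation quality while reducing memory consumption. Experiments show that, across several long-context language and vision-language reasoning tasks, the method effectively reduces KV memory usage while matching or surpassing full-cache inference.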

Comments A learnable KV eviction method for large language models

Abstract

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language, vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
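The difference between a single global eviction policy and a fixed per-head budget can be shown with a small Python sketch. The utility scores here are plain numbers supplied by hand; in the paper they come from learned retention gates and a shared scoring projection, which this sketch does not model. The function names and the `(layer, head, position)` key layout are assumptions.

```python
import heapq
from collections import defaultdict

def global_keep(scores, budget):
    """Keep the `budget` highest-utility entries across all layers and heads."""
    kept = heapq.nlargest(budget, scores.items(), key=lambda item: item[1])
    return {key for key, _ in kept}

def per_head_keep(scores, per_head_budget):
    """Baseline for contrast: every (layer, head) keeps its own top entries."""
    groups = defaultdict(list)
    for (layer, head, pos), utility in scores.items():
        groups[(layer, head)].append((utility, pos))
    kept = set()
    for (layer, head), entries in groups.items():
        for _, pos in heapq.nlargest(per_head_budget, entries):
            kept.add((layer, head, pos))
    return kept

# Keys are (layer, head, position); the values are hypothetical calibrated utilities.
scores = {(0, 0, 0): 0.9, (0, 0, 1): 0.8, (1, 0, 0): 0.1, (1, 0, 1): 0.2}

print(sorted(global_keep(scores, budget=2)))             # both slots go to layer 0
print(sorted(per_head_keep(scores, per_head_budget=1)))  # one slot per head
```

With the same total capacity of two entries, the global policy spends both on the head whose tokens matter most, while the per-head baseline is forced to keep a low-utility entry. This only makes sense if scores are comparable across heads, which is the role the abstract assigns to the shared calibration.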

2605.09644 2026-05-12 cs.CV

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Zichen Zou, Xiaosong Jia, Zuxuan Wu, Yu-Gang Jiang

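AI summary: This paper proposes RetrieveVGGT, a training-free framework that addresses the memory overflow and quality degradation that Transformer-based 3D reconstruction suffers on long sequences because of the high complexity of attention. By casting context construction as a retrieval problem, RetrieveVGGT retrieves only a small number of relevant frames at each step, keeping memory overhead controllable, and uses the similarity between queries and keys in VGGT as the relevance indicator, with no additional training. The method also introduces Segment Sampling and a spatial memory mechanism based on camera poses, which further improve information diversity and localization accuracy; experiments show that it outperforms several existing methods.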

Abstract

Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
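Retrieval by query-key similarity can be sketched with plain lists. The abstract says the similarity between current-frame queries and cached history-frame keys is used as the relevance signal, but not how token-level similarities are pooled into a per-frame score; the mean dot product below is an assumption, as are the function names and the toy vectors. Segment Sampling and the pose-aware memory are not modelled.

```python
def frame_relevance(queries, keys):
    """Mean dot product between every current-frame query and every cached key."""
    total = 0.0
    for q in queries:
        for k in keys:
            total += sum(qi * ki for qi, ki in zip(q, k))
    return total / (len(queries) * len(keys))

def retrieve(queries, cache, k):
    """Return the ids of the k cached frames most similar to the current frame."""
    ranked = sorted(cache, key=lambda fid: frame_relevance(queries, cache[fid]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D token vectors: frame id -> list of cached key vectors.
cache = {
    0: [[1.0, 0.0]],
    1: [[0.0, 1.0]],
    2: [[0.5, 0.0]],
}
current_queries = [[1.0, 0.0]]

selected = retrieve(current_queries, cache, k=2)
print(selected)  # [0, 2]
```

Because `k` is fixed, the number of frames passed to attention at each step stays constant no matter how long the stream grows, which is what keeps the attention context bounded.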

2605.09640 2026-05-12 cs.CV cs.LG

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu

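AI summary: This paper studies how to overcome catastrophic forgetting in visual continual learning and proposes RaPO, a new method based on reinforcement fine-tuning. The authors find that existing methods such as GRPO still forget significantly in class-incremental and domain-incremental learning, and trace the root cause to trajectory-level policy drift. RaPO therefore introduces a retention reward and cross-task advantage normalization, which effectively mitigate the forgetting caused by policy drift; experiments show superior performance across several continual learning settings, providing a systematic exploration of reinforcement fine-tuning in visual continual learning.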

详情
英文摘要

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
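
The two components named in this abstract can be sketched as follows. This is a hedged illustration only: the abstract does not give the formulas, so the `exp(-kl)` shaping term, the `beta` weight, the momentum value, and the class and function names are all assumptions. The sketch adds a bonus that shrinks as a rollout drifts from the previous-task policy, and normalizes rewards with statistics that persist across task boundaries.

```python
import math


def shaped_rewards(task_rewards, kl_drifts, beta=0.5):
    """Add a retention bonus that decays with KL drift (assumed form: beta * exp(-kl))."""
    return [r + beta * math.exp(-kl) for r, kl in zip(task_rewards, kl_drifts)]


class EmaAdvantageNormalizer:
    """Normalizes rewards with an exponential moving average of mean and variance.

    The statistics are never reset, so they carry over from one task to the next.
    """

    def __init__(self, momentum=0.9, eps=1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.var = None

    def advantages(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var
        std = math.sqrt(self.var + self.eps)
        return [(r - self.mean) / std for r in rewards]


# Two rollouts with identical task reward but different drift from the old policy.
rewards = shaped_rewards([1.0, 1.0], kl_drifts=[0.0, 2.0])
normalizer = EmaAdvantageNormalizer()
first_task = normalizer.advantages(rewards)
second_task = normalizer.advantages([0.0, 0.0])
```

With equal task rewards, the low-drift rollout receives the larger advantage, which matches the abstract's description of preferentially reinforcing knowledge-preserving rollouts within a group.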

2605.09638 2026-05-12 cs.LG

Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh

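AI Summary This work proposes Plan2Cleanse, a test-time backdoor defense framework for detecting and mitigating backdoor attacks on deep reinforcement learning models. By recasting backdoor detection as a planning problem, the method uses Monte Carlo Tree Search to efficiently identify and neutralize backdoor trigger sequences without retraining the model. Experiments show that Plan2Cleanse substantially improves trigger detection success rates and task performance across multiple environments, supporting its effectiveness in real deployments.

Comments Published in Transactions on Machine Learning Research (TMLR)

Details
English abstract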

Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.

2605.09636 2026-05-12 cs.AI

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

Zhen Hang, Yushan Yashengjiang, Junhui Li, Huanshuo Dong, Yang Wei, Zhezheng Hao, Jiangtao Ma, Songlin Bai, Haozhong Kai, Xihang Yue, Gangzong Si, Dongming Jiang, Chao Yao, Zhanhua Hu, Jiangqing Zhang, Pengwei Liu, Yaomin Shen, Xingyu Ren, Lei Liu, Zikang Xu, Han Li, Qingsong Yao, Hande Dong, Hong Wang

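AI Summary PDEAgent-Bench is presented as the first multi-metric, multi-library benchmark for partial differential equation (PDE) solver generation, designed to evaluate the ability to automatically generate numerical solver code from PDE descriptions. The benchmark contains 645 instances covering 6 mathematical categories and 11 PDE families, supports mainstream finite-element libraries such as DOLFINx, Firedrake, and deal.II, and evaluates generated code in stages for executability, numerical accuracy, and computational efficiency. Experiments show that current large language models and code agents can generate runnable code, but their performance drops markedly once accuracy and efficiency requirements must be met, highlighting the difficulty of PDE solver generation and the shortcomings of existing methods.

Details
English abstract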

PDE-to-solver code generation aims to automatically synthesize executable numerical solvers from partial differential equation (PDE) specifications. This task requires not only understanding the mathematical structure of PDEs, but also selecting appropriate discretization schemes and solver configurations, and correctly implementing the resulting formulations in finite-element method (FEM) libraries. Existing code generation benchmarks mainly evaluate syntactic correctness or success on predefined test cases. To our knowledge, there is currently no publicly available benchmark specifically for PDE-to-solver code generation, and general-purpose code benchmarks do not fully capture the unique challenges of numerical PDE solution, such as ensuring solver accuracy, efficiency, and compatibility with professional FEM libraries. We introduce PDEAgent-Bench, to the best of our knowledge, the first multi-metric, multi-library benchmark for PDE-to-solver code generation. PDEAgent-Bench contains 645 instances across 6 mathematical categories and 11 PDE families, and supports the common FEM libraries DOLFINx, Firedrake, and deal.II. Each instance provides an agent-facing problem specification, a reference solution on a prescribed evaluation grid, and case-specific accuracy and runtime targets. PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks. Experiments with representative LLMs and code agents show that models can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced. These results indicate that current agents remain limited in producing numerically reliable and efficient PDE solvers, and that PDEAgent-Bench provides a reproducible testbed grounded in the practical requirements of numerical PDE solving.
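
The staged evaluation described above (executability, then accuracy, then efficiency) could look roughly like the sketch below. It is not the benchmark's actual harness: the `staged_evaluate` name, the relative L2 error metric, the result strings, and the idea of passing the solver as a Python callable are assumptions made for illustration.

```python
import math
import time


def staged_evaluate(solver, reference, error_tol, time_limit_s):
    """Run a candidate solver through three sequential gates.

    `solver` is a hypothetical zero-argument callable returning solution values
    on the same evaluation grid as `reference`.
    """
    # Stage 1: executability. Any exception or wrong output size fails here.
    start = time.perf_counter()
    try:
        values = list(solver())
    except Exception:
        return "failed: execution"
    elapsed = time.perf_counter() - start
    if len(values) != len(reference):
        return "failed: execution"

    # Stage 2: numerical accuracy, measured as relative L2 error (an assumption).
    ref_norm = math.sqrt(sum(r * r for r in reference))
    err = math.sqrt(sum((v - r) ** 2 for v, r in zip(values, reference)))
    rel_err = err / ref_norm if ref_norm > 0 else err
    if rel_err > error_tol:
        return "failed: accuracy"

    # Stage 3: computational efficiency against a case-specific runtime target.
    if elapsed > time_limit_s:
        return "failed: efficiency"
    return "passed"


reference = [0.0, 1.0, 2.0]
result = staged_evaluate(lambda: [0.0, 1.0, 2.001], reference, 1e-2, 60.0)
```

Because the gates are sequential, a solver that runs but is inaccurate is reported at the accuracy stage and never reaches the runtime check, which is consistent with the pass-rate drop the abstract reports once accuracy and efficiency are enforced.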

2605.09635 2026-05-12 cs.CL

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Hao Liang, Qihan Lin, Zhaoyang Han, Xiaochen Ma, Zhen Hao Wong, Meiyi Qiang, Linzhuang Sun, Wentao Zhang

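AI Summary This study proposes K12-KGraph, a curriculum-aligned knowledge graph for evaluating and training large language models for education. The graph is extracted from People's Education Press textbooks, covers mathematics, physics, chemistry, and biology, and contains seven node types and nine relation types; it is used to build the multi-task benchmark K12-Bench and the training dataset K12-Train. Experiments show that supervision grounded in curriculum structure performs better under a limited fine-tuning data budget, significantly improving model performance on education-related tasks.

Details
English abstract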

Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
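
For a multi-select benchmark, exact match is usually taken to mean that the predicted option set equals the gold option set, with no partial credit. The sketch below assumes that definition and a hypothetical encoding of each answer as a string of option letters; the abstract does not spell out the scoring details.

```python
def exact_match_rate(predictions, golds):
    """Fraction of questions whose predicted option set equals the gold set.

    Each answer is a string of option letters such as "AC" (an assumed encoding);
    the order of letters is ignored.
    """
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# "AB" vs "BA" matches; "A" vs "AB" is a partial answer and scores zero.
rate = exact_match_rate(["AB", "A", "CD"], ["BA", "AB", "CD"])
```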

2605.09634 2026-05-12 cs.CL

Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness

Erfan Loweimi, Sofia de la Fuente Garcia, Samira Loveymi, Hadi Daneshvar, Saturnino Luz

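AI Summary This study examines the reliability of large language models (LLMs) for mental health screening, focusing on model consistency, robustness to automatic speech recognition (ASR), and evidence faithfulness. It evaluates three models, Phi-4, Gemma-2-9B, and Llama-3.1-8B, on real speech data and finds that Phi-4 and Gemma-2-9B show excellent intra-model consistency and robustness to ASR errors, whereas Llama-3.1-8B is less stable. The study also reveals an inconsistency between model scores and their keyword evidence, which poses a challenge for interpretability in clinical use.

Details
English abstract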

LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.

2605.09633 2026-05-12 cs.RO cs.SY eess.SY

Minimizing Worst-Case Weighted Latency for Multi-Robot Persistent Monitoring: Theory and RL-Based Solutions

Weizhen Wang, Ziheng Wang, Jianping He, Xinping Guan, Xiaoming Duan

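AI Summary This paper studies multi-robot persistent monitoring on weighted graphs, aiming to design robot trajectories that minimize the worst-case weighted latency across all nodes over an infinite time horizon. Because the conventional worst-case latency objective cannot distinguish strategies with poor transient performance but good asymptotic performance, the authors propose a family of tail-performance objectives and establish a theoretical framework for the corresponding optimization problems. Building on these results, they construct an equivalent event-driven Markov decision process (TWLO-MDP), develop reinforcement-learning-based solution methods, and introduce a multi-robot monitoring benchmark (M2Bench); experiments show that the method effectively reduces the worst-case weighted latency and outperforms existing methods.

Details
English abstract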

We study multi-robot persistent monitoring on weighted graphs, where node weights encode monitoring priorities and edge weights encode travel distances. The goal is to design joint robot trajectories that minimize the worst-case weighted latency across all nodes over an infinite time horizon. The widely adopted worst-case latency objective evaluates team performance over the entire time horizon and therefore may fail to distinguish strategies with poor transient behavior but strong asymptotic performance. To address this limitation, we propose a family of tail-performance objectives that generalize the standard objective and study the resulting functional optimization problems. We establish several key theoretical properties, including the existence of optimal strategies, relationships among the proposed objectives and their corresponding optimization problems, approximation by periodic solutions to arbitrary accuracy, and reductions to event-driven decision models with discretized waiting times. Building on these results, we construct an equivalent event-driven Markov decision process (MDP), called the Tail Worst-case Latency-Optimizing Markov Decision Process (TWLO-MDP), which reformulates the tail-performance objective as a standard average-reward criterion. We then develop reinforcement-learning-based solution methods for the TWLO-MDP and introduce the multi-robot monitoring benchmark (M2Bench), a unified platform that supports the evaluation and comparison of heuristic and learning-based monitoring algorithms. Experiments on synthetic and realistic monitoring scenarios show that our methods effectively reduce the worst-case weighted latency and outperform representative baselines.

2605.09630 2026-05-12 cs.CL cs.LG

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez

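AI Summary This paper studies efficient patch-based inference for byte-level language models and proposes Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes observed so far and refresh patch-level context, reducing the prediction lag that grows with patch size. The method improves model quality at the same patch size while substantially reducing the key-value cache and inference compute, offering a new direction for efficient language model design.

Comments 23 pages, 15 figures

Details
English abstract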

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at 16 bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a 16× smaller KV cache over patches and 3-4× less inference compute.
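
The entropy-based trigger mentioned in this abstract can be sketched in a few lines. This is an illustration under assumptions: the function names, the use of bits as the unit, and a fixed scalar threshold are not taken from the paper. The sketch computes the Shannon entropy of each next-byte distribution and marks the positions where a scratchpad would be inserted.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def scratchpad_positions(next_byte_dists, threshold_bits):
    """Indices whose prediction entropy exceeds the threshold (hypothetical rule)."""
    return [
        i for i, dist in enumerate(next_byte_dists)
        if entropy_bits(dist) > threshold_bits
    ]


# Toy two-symbol distributions: certain, maximally uncertain, fairly confident.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
strict = scratchpad_positions(dists, threshold_bits=0.5)
loose = scratchpad_positions(dists, threshold_bits=0.1)
```

Lowering the threshold triggers more scratchpads and therefore spends more compute, which is one way to read the abstract's remark about post-hoc adjustment of inference-time compute.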

2605.09628 2026-05-12 cs.CV

DegBins: Degradation-Driven Binning for Depth Super-Resolution

Zhiqiang Yan, Zhengxue Wang, Jian Yang, Gim Hee Lee

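AI Summary Depth super-resolution (DSR) aims to recover a high-resolution depth map from a low-resolution one. Conventional methods typically learn the residual between high and low resolution in a low-dimensional feature space, but struggle to model spatially varying degradations accurately. This paper proposes DegBins, a new DSR framework that uses degradation-driven binning to recast the regression problem as a hybrid classification-regression problem, represents the residual depth more flexibly as a weighted combination of discrete depth bins, and models the degradation relationship in a high-dimensional feature space so that bin ranges and probability distributions adapt. Experiments show that DegBins outperforms existing methods on multiple benchmark datasets, with higher accuracy and robustness.

Comments 9 pages

Details
English abstract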

Depth super-resolution (DSR) aims to recover a high-resolution (HR) depth map from its low-resolution (LR) counterpart. With color image guidance, this task is typically formulated as learning the residual between HR and LR in a low-dimensional feature space. However, this additive formulation is insufficient to accurately capture the complex relationship between HR and LR, especially under spatially varying degradations. In this paper, we introduce DegBins, a novel DSR framework that leverages degradation-driven binning to adaptively enhance residual modeling. Specifically, DegBins reformulates the regression-based DSR as a hybrid classification-regression problem, where the residual depth is represented as a linear combination of discrete depth bins weighted by their learned probability distribution, yielding more flexible and expressive representations. Furthermore, DegBins models the degradation relationship between HR and LR in a high-dimensional feature space, enabling adaptive bin range adjustment and probability optimization conditioned on local degradation characteristics. To progressively improve reconstruction quality, DegBins adopts a multi-stage refinement scheme, where each stage performs finer-grained bin partitioning and probability updating based on the former estimation. This coarse-to-fine design facilitates more accurate depth recovery, particularly in regions with severe degradations or complex structural variations. Extensive experiments across five benchmarks demonstrate that DegBins consistently outperforms existing state-of-the-art methods in terms of accuracy, robustness, and generalization.
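
The hybrid classification-regression formulation can be made concrete for a single pixel. The sketch below is an assumption-laden illustration, not the DegBins architecture: uniform bin centers, a softmax over per-bin logits, and the function names are all chosen here for clarity. The residual is the probability-weighted sum of bin centers, and the refined depth is the upsampled low-resolution value plus that residual.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(low, high, num_bins):
    """Centers of `num_bins` uniform bins covering the range [low, high]."""
    width = (high - low) / num_bins
    return [low + (i + 0.5) * width for i in range(num_bins)]


def residual_from_bins(logits, low, high):
    """Residual depth as a linear combination of bin centers weighted by probabilities."""
    probs = softmax(logits)
    centers = bin_centers(low, high, len(logits))
    return sum(p * c for p, c in zip(probs, centers))


def refine_depth(upsampled_lr_depth, logits, low, high):
    """Refined depth for one pixel: upsampled LR depth plus the binned residual."""
    return upsampled_lr_depth + residual_from_bins(logits, low, high)


uniform = residual_from_bins([0.0, 0.0, 0.0, 0.0], -1.0, 1.0)
peaked = residual_from_bins([0.0, 0.0, 0.0, 50.0], -1.0, 1.0)
```

A coarse-to-fine scheme like the one the abstract describes could call `residual_from_bins` again with a narrower `[low, high]` range around the previous estimate, though how the paper adapts the range is not specified here.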

2605.09622 2026-05-12 cs.CV cs.AI

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao

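AI Summary Voxel-wise dose prediction is a critical yet challenging task in radiotherapy planning, and existing models often fail to generalize across clinical settings. This paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework that transfers knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. The method introduces a flexible conditional generation mechanism based on modality embeddings, combined with a clinically oriented reinforcement learning post-training strategy, and significantly improves dose prediction accuracy and image quality over the current best model.

Comments Accepted by CVPR 2026 main conference. Compared to the CVPR version, minor updates are included here (e.g., the main text and appendix are combined; the timing scenario in the appendix is clarified)

Details
English abstract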

Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP-HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.

2605.09618 2026-05-12 cs.CL cs.CY

Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols

Julia Hu, Alfred Shen, Kumar Lakshmipathi

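AI Summary This study examines how language models perform under different reasoning strategies, namely direct answering, multi-sample voting, and multi-agent debate, to determine which is most effective under a limit on generated length. Experiments with several models on MuSiQue and GSM8K show that the best strategy varies by model and dataset and is hard to select using simple ex-ante signals such as vote entropy. The study finds that vote entropy predicts only where debate is safe, not where debate is needed, indicating that current debate mechanisms remain limited in practice.

Comments 14 pages, 5 figures. Technical report / preprint

Details
English abstract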

When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
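
A vote-entropy threshold controller of the kind this abstract evaluates might be sketched as below. The function names, the entropy unit (bits), and the threshold value are assumptions; the abstract does not give the exact rule. The sketch also shows the structural limitation the authors report: a unanimous vote has zero entropy, so it is never routed to debate even when the unanimous answer is wrong.

```python
import math
from collections import Counter


def vote_entropy(answers):
    """Shannon entropy, in bits, of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def route(answers, threshold_bits):
    """Escalate to debate when the votes disagree enough; otherwise return the majority."""
    if vote_entropy(answers) > threshold_bits:
        return ("debate", None)
    majority, _ = Counter(answers).most_common(1)[0]
    return ("vote", majority)


split = route(["12", "12", "15"], threshold_bits=0.5)
unanimous = route(["12", "12", "12"], threshold_bits=0.5)
```

In this example the 2-1 split is sent to debate, while the unanimous case is answered by the vote regardless of whether "12" is correct.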

2605.09614 2026-05-12 cs.CV

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai, Weishu Zhao, Shiyu Liang

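AI Summary This paper studies the fading of visual information in long-chain multimodal reasoning. It presents an information-theoretic analysis that derives a lower bound on the downstream visual gain of an intervention point, and uses it to design reflection-anchor policy optimization (RAPO). RAPO selects high-entropy reflection anchors and optimizes a finite-window KL surrogate, strengthening the propagation and retention of visual information during generation. Experiments show that RAPO substantially outperforms existing methods on multiple vision-language model benchmarks, and mechanism analyses show that it increases contrastive visual-dependence signals along generated trajectories.

Comments Under Review

Details
English abstract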

Long chain-of-thought (CoT) reasoning improves large vision-language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.

2605.09613 2026-05-12 cs.RO cs.CV

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Narsimha Menga, Parikshit Sakurikar, Amirreza Rouhi, Satya Sai Reddy, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

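AI Summary This study presents SABER, a high-fidelity action dataset for adapting robot vision-language-action (VLA) models to real retail settings. SABER is built from many hours of real in-store capture and records fine-grained hand activity, whole-body motion, and scene dynamics of people in retail environments, without staging or teleoperation. The dataset includes several action representations and is validated on a real robot system, where it significantly raises the completion rate of complex retail tasks, showing the key role of high-quality data in improving robot performance.

Details
English abstract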

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks, more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

2605.09611 2026-05-12 cs.CL

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

Sietse Schelpe

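AI Summary This paper presents an empirical analysis of byte-exact chunk-level deduplication in retrieval-augmented generation (RAG) pipelines, studying its context reduction and quality impact across application scenarios. Experiments on academic, enterprise, and multi-turn conversational scenarios show that deduplication removes up to 80.34% of redundancy, and an evaluation using models from multiple vendors finds no measurable quality degradation. The study shows that substantial inference compute savings can be achieved deterministically without sacrificing model quality.

Comments Preprint. Implementation and open-source community version available at: https://github.com/corbenic/merlin-community - https://zenodo.org/records/20090712

Details
English abstract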

This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict <5% Wilson 95% upper-bound MAT threshold in both the clean and high-redundancy RAG regimes. This work demonstrates that substantial inference compute savings can be achieved deterministically without compromising evaluation-grade model quality.
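
Byte-exact chunk-level deduplication can be sketched with a hash set over the encoded chunks. This is a minimal illustration, not the linked implementation: the function names, the choice of SHA-256, UTF-8 encoding, and keeping the first occurrence are assumptions. Two chunks count as duplicates only if their bytes are identical, so a change in case or whitespace makes them distinct.

```python
import hashlib


def dedup_chunks(chunks):
    """Keep the first occurrence of each byte-identical chunk, preserving order."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def byte_reduction(chunks, kept):
    """Fraction of context bytes removed by deduplication."""
    total = sum(len(c.encode("utf-8")) for c in chunks)
    if total == 0:
        return 0.0
    remaining = sum(len(c.encode("utf-8")) for c in kept)
    return 1.0 - remaining / total


retrieved = ["alpha", "beta", "alpha", "Alpha"]
kept = dedup_chunks(retrieved)
reduction = byte_reduction(retrieved, kept)
```

Here the second "alpha" is dropped while "Alpha" survives, removing 5 of 19 bytes. Because the output depends only on the input bytes, the procedure is deterministic, in line with the abstract's claim.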