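arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.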

2605.09678 2026-05-12 cs.AI

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

Ryan Albright, Golam Md Muktadir, Zarif Ikram, S M Jubaer, Mehrab Hossain, Dianbo Liu

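AI summary: This paper proposes Absurd World, a benchmark framework for testing the logical reasoning ability of large language models (LLMs). The method decomposes real-world problems into symbols, actions, sequences, and events, then automatically modifies these elements to build scenarios that are logically self-consistent but absurd. Because the underlying task logic is unchanged, this tests whether an LLM can reason without leaning on real-world patterns. Experiments indicate that Absurd World is an effective tool for evaluating the robustness of LLM logical reasoning.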

2605.09677 2026-05-12 cs.CV

VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement

Qingyu Xian, Hao Cheng, Berend Jan van der Zwaag, Rolands Kromanis, Ozlem Durmaz Incel

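AI summary: This paper proposes VFM-SDM, a structural displacement measurement framework based on vision foundation models (VFMs) that performs non-contact, multi-directional displacement measurement without task-specific training, on-site markers, or calibration. The method combines VFM-inferred camera parameter estimation with point tracking, reconstructs displacement by triangulation, and introduces structural geometric constraints to improve the physical plausibility and consistency of the estimates. Experimental results show high measurement accuracy and stability in real-world scenes, offering a new direction for automated, scalable structural health monitoring.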

2605.09676 2026-05-12 cs.LG cs.AI nlin.CD

ChaosNetBench: Benchmarking Spatio-Temporal Graph Neural Networks on Chaotic Lattice Dynamics

Henok Tenaw Moges, Charalampos Skokos, Deshendran Moodley

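AI summary: This paper proposes ChaosNetBench (CNB), a synthetic benchmark dataset and evaluation framework for assessing spatio-temporal graph neural networks (STGNNs) under controlled multi-dimensional chaotic dynamics. CNB is built on lattices of coupled standard maps and allows local chaos strength, coupling strength, and system size to be tuned independently; it provides 96 system instances and 9,600 trajectories with known topology and dynamics. The study introduces chaos indicators and an evaluation protocol and, by comparing 13 architectures, shows that STGNNs outperform non-graph models across different levels of local and global chaos.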

Comments 24 pages, 11 figures

2605.09675 2026-05-12 cs.AI cs.MA

CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

Timothy Ossowski, Xinchi Liu, Danyal Maqbool, Vaibhav Dhanuka, Sheng Zhang, Hoifung Poon, Majid Afshar, Tyler Bradshaw, Junjie Hu

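AI summary: This paper proposes CodeClinic, a benchmark built on MIMIC-IV for evaluating whether large language models can automatically generate and compose reusable clinical skills in clinical reasoning tasks instead of relying on a fixed tool library. The benchmark contains two complementary tasks, long-horizon ICU monitoring and compositional information retrieval, which evaluate structured decision-making and multi-step reasoning, respectively. The study also proposes an offline autoformalization pipeline that converts natural-language clinical guidelines into a verifiable Python skill library through iterative refinement, significantly improving reasoning consistency and reducing per-query compute.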

2605.09672 2026-05-12 cs.RO

MVB-Grasp: Minimum-Volume-Box Filtering of Diffusion-based Grasps for Frontal Manipulation

Bibek Poudel, Abdul Basit, Muhammad Shafique

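AI summary: Targeting frontal grasping with low-cost manipulators in constrained workspaces, this paper proposes MVB-Grasp, a grasp filtering method based on the minimum-volume bounding box (MVBB) that improves grasp success rates. The method injects a geometric prior, filters candidates quickly using the face normals of the oriented bounding box, and fuses learned discriminator scores with face-alignment geometry to re-rank grasp candidates. Experiments with the Unitree Z1 arm show a success rate about 2.4x that of vanilla GraspGen (59.3% versus 24.7%), supporting the method's effectiveness for grasping in constrained spaces.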

Comments 8 pages, 12 figures, accepted to IJCNN 2026

English abstract

State-of-the-art 6-DoF grasp generators excel on tabletop benchmarks with overhead cameras but struggle in frontal grasping scenarios on low-cost manipulators with constrained workspaces, where kinematic limits and approach-direction constraints cause high failure rates. We address this challenge for the Unitree Z1 arm by proposing MVB-Grasp, a novel grasping stack that injects a Minimum Volume Bounding Box (MVBB) geometric prior into diffusion-based grasp generation to dramatically improve success rates in frontal, workspace-constrained settings. Our key scientific contributions are threefold: (i) an MVBB-based geometric filter that exploits oriented bounding-box face normals to reject grasps approaching through the table or misaligned with accessible object faces in O(N) time; (ii) a combined re-scoring function that blends learned discriminator scores with face-alignment geometry (α = 0.85), specifically calibrated for the Z1's frontal workspace and kinematic constraints; and (iii) a systematic MuJoCo evaluation protocol measuring grasp success across object types, distances, lateral positions, and pitch orientations to validate embodiment-specific performance. We implement MVB-Grasp on a Unitree Z1 arm with an Intel RealSense D405 camera, integrating YOLOv8 object detection, GraspGen for candidate generation, Principal Component Analysis (PCA)-based MVBB fitting, and inverse-kinematics trajectory planning. Experiments across 81 MuJoCo episodes (cylinder, asymmetric box, water bottle) demonstrate that MVB-Grasp achieves 59.3% success versus 24.7% for vanilla GraspGen, a 2.4x improvement, by filtering geometrically infeasible candidates and prioritizing face-aligned grasps suited to the Z1's frontal approach constraints. Real-world trials confirm that the MVBB prior substantially improves grasp reliability on constrained, low-cost manipulators without requiring model retraining.
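
The filter-then-re-score idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives α = 0.85 but not the blend formula, so the convex combination below, the table-normal test, and every function name here are hypothetical.

```python
ALPHA = 0.85  # value quoted in the abstract; how it enters the blend is assumed


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def face_alignment(approach, face_normals):
    """Cosine between the approach direction and the best-matching inward
    face normal, clipped to [0, 1]. Inputs are assumed to be unit vectors."""
    return max(0.0, max(-dot(approach, n) for n in face_normals))


def approaches_through_table(approach, table_normal=(0.0, 0.0, 1.0)):
    """Assumed test: moving along +table_normal means coming from below."""
    return dot(approach, table_normal) > 0.0


def rescore(disc_score, alignment, alpha=ALPHA):
    """Assumed convex blend of discriminator score and face alignment."""
    return alpha * disc_score + (1.0 - alpha) * alignment


def filter_and_rank(grasps, face_normals):
    """grasps: list of (approach_unit_vector, discriminator_score).
    The rejection pass touches each candidate once; the final sort is extra."""
    kept = []
    for approach, disc in grasps:
        if approaches_through_table(approach):
            continue
        score = rescore(disc, face_alignment(approach, face_normals))
        kept.append((score, approach))
    return sorted(kept, reverse=True)


# Outward face normals of an axis-aligned box (hypothetical object).
BOX_NORMALS = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
               (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
CANDIDATES = [((-1.0, 0.0, 0.0), 0.5),   # head-on toward the +x face
              ((0.0, 0.0, 1.0), 0.9),    # from under the table, rejected
              ((0.0, -1.0, 0.0), 0.4)]   # head-on toward the +y face
ranked = filter_and_rank(CANDIDATES, BOX_NORMALS)
```

Under these assumptions the candidate with the highest raw discriminator score is discarded by the geometric test before re-scoring, which is the behavior the abstract attributes to the MVBB prior.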

2605.09670 2026-05-12 cs.RO cs.CV

Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

Aws Khalil, Jaerock Kwon

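AI summary: This paper studies generative predictive display for vision-based teleoperation, which aims to mitigate the effects of communication latency by generating future visual states. The authors present a zero-shot benchmark, with no task-specific fine-tuning, that evaluates several off-the-shelf generative video models on short-horizon predictive display. Experiments show that no existing model simultaneously meets the requirements for prediction accuracy, inference latency, and error stability, revealing a performance gap between general-purpose generative video models and predictive display in teleoperation.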

English abstract

Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
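
The accuracy metric named in the abstract, mean absolute difference tracked across the prediction horizon, could be computed along these lines. This is a sketch assuming frames are plain 2D lists of pixel intensities; the function names are hypothetical and the benchmark's actual pipeline is not described here.

```python
def mean_abs_diff(pred, target):
    """Mean absolute per-pixel difference between two equally sized frames."""
    diffs = [abs(p - t)
             for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


def per_step_errors(pred_frames, true_frames):
    """Error at each step of a rollout, in prediction order."""
    return [mean_abs_diff(p, t) for p, t in zip(pred_frames, true_frames)]


def rollout_error(pred_frames, true_frames):
    """Average of the per-step errors over the whole rollout."""
    errors = per_step_errors(pred_frames, true_frames)
    return sum(errors) / len(errors)


# Two-step toy rollout on 2x2 frames.
truth = [[[1, 2], [4, 4]], [[1, 2], [4, 4]]]
preds = [[[0, 2], [4, 6]], [[1, 2], [4, 4]]]
errors = per_step_errors(preds, truth)
```

Keeping the per-step list, not only the average, is what lets an evaluation inspect how error evolves over the horizon.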

2605.09667 2026-05-12 cs.CV cs.AI

S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes

Albert Heruth

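AI summary: This paper proposes S2P-Net, a compact deep network architecture for rotation-invariant object recognition in low-data regimes that guarantees mathematical rotation invariance without data augmentation. The network combines spectral and spatial information and uses a polar transform to strengthen robustness to rotation. Compared with conventional convolutional neural networks, S2P-Net achieves better recognition performance in few-sample settings, offering a new approach to rotation-invariant recognition under low-data conditions.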

Comments 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request

2605.09666 2026-05-12 cs.CV cs.AI

Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models

Abdul Basit, Ashir Rashid, Muhammad Abdullah Hanif, Muhammad Shafique

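AI summary: This paper examines shortcomings in how multiple sclerosis (MS) lesion segmentation models are evaluated. It points out that most current evaluations rely on the Dice score, which does not adequately capture lesion-level detection and segmentation performance or model behavior on cases that are complex or hard for human annotators to judge. The authors analyze in detail the features neurologists look for in brain MRI scans, propose evaluation metrics that better match practical needs, and analyze existing state-of-the-art models on two open-source datasets to assess their suitability for real clinical settings.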

Comments 8 pages, 5 figures, Accepted to IJCNN 2026
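
A toy example of why a voxel-level Dice score can hide lesion-level misses. This is an illustration only: the lesion-recall function below is a hypothetical stand-in, not one of the metrics proposed in the paper, and masks are represented as sets of voxel coordinates for simplicity.

```python
def dice(pred, truth):
    """Dice = 2|P & T| / (|P| + |T|) over sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def lesion_recall(pred, lesions):
    """Fraction of ground-truth lesions touched by at least one predicted voxel."""
    hit = sum(1 for lesion in lesions if lesion & pred)
    return hit / len(lesions)


# One large lesion (100 voxels) and one small lesion (4 voxels).
big = {(x, y, 0) for x in range(10) for y in range(10)}
small = {(50, 50, 0), (50, 51, 0), (51, 50, 0), (51, 51, 0)}

# A prediction that segments the large lesion perfectly and misses the small one.
pred = set(big)

voxel_dice = dice(pred, big | small)         # 200 / 204, about 0.98
recall = lesion_recall(pred, [big, small])   # 0.5
```

The prediction scores about 0.98 on Dice while detecting only one of two lesions, which is the kind of gap the summary describes.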

2605.09665 2026-05-12 cs.LG cs.AI cs.CL

Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies

Jingze Song, Zihao Chen, Wenqing Chen, Zibin Zheng

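AI summary: This paper studies how to select training data efficiently for instruction tuning and proposes a joint task-model adaptation framework that learns multi-indicator weights to optimize data selection. The method uses in-context learning signals on a small validation set to determine the optimal weight configuration without large-scale fine-tuning, enabling efficient, high-fidelity data evaluation. Experiments show results comparable to or better than full-data fine-tuning across multiple benchmarks and model families, and reveal a trade-off between semantic diversity and logical complexity in reasoning tasks.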

Comments This work has been accepted at IJCAI 2026

2605.09663 2026-05-12 cs.LG cs.AI

Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation

Julien Lafrance, Richard Khoury, Véronique Tremblay

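AI summary: In dynamic environments, machine learning classifiers often degrade because of concept drift, and conventional evaluation methods struggle to reflect the causal dependencies in the data-generating process. This paper proposes Causal Parametric Drift Simulation, a digital twin framework based on structural causal models that uses precise causal interventions to expose a classifier's latent vulnerabilities before deployment. Experiments show that the method finds hidden problems that standard statistical monitoring cannot detect, providing a new and effective tool for evaluating classifier robustness.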

Comments 34 pages, 13 figures, 14 tables

2605.09662 2026-05-12 cs.CV

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

Alessio Mazzucchelli, Maria Naranjo-Almeida, Jorge Bustos-Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer, Jordi Sanchez-Riera, Adrian Penate-Sanchez

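AI summary: This paper proposes BEA-GS, a new Gaussian splatting method that goes beyond radiance supervision to achieve more precise object extraction. It introduces two new loss functions that optimize the geometry of visible and non-visible Gaussians, respectively, so that they align more accurately with semantic boundaries. Experiments show state-of-the-art boundary segmentation on multiple datasets, significantly improving the precision of object-level editing and asset extraction.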

Comments CVPR 2026 Highlight

2605.09661 2026-05-12 cs.CL cs.AI

MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

Huy Hoang Ha, Benoit Favre, Francois Portet

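AI summary: This paper proposes MedMeta, a new benchmark for evaluating the ability of large language models (LLMs) to synthesize meta-analysis conclusions from abstracts of medical studies. The benchmark contains 81 meta-analyses from PubMed and evaluates models through two pipelines: retrieval-augmented generation over the ground-truth abstracts (Golden-RAG) and a parametric approach that relies only on internal knowledge. The study finds that Golden-RAG, which uses external information, significantly outperforms the parametric approach, that domain fine-tuning brings limited gains, and that current models handle negative evidence poorly, underscoring both the importance of RAG systems for clinical applications and the shortcomings of existing models.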

2605.09659 2026-05-12 cs.RO

ASACK: Adaptive Safe Active Continual Koopman Learning for Uncertain Systems with Contractive Guarantees

Chandan Kumar Sah, Rajpal Singh, Jishnu Keshavan

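AI summary: This paper proposes ASACK, an adaptive, safe, active, continual Koopman learning framework for safe control of uncertain systems subject to model uncertainty and distribution shift. An autoencoder-based Koopman model is learned offline and then corrected online through a contractive adaptation law, which gives theoretical convergence guarantees under distribution shift and model uncertainty. To improve data efficiency, the method adds an active learning strategy that steers the system toward informative data while it accomplishes task objectives, folds the active learning objective and safety constraints into a nonconvex optimization problem, and obtains formal safety guarantees through a robust MPC framework. Experiments show performance superior to state-of-the-art baselines.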

English abstract

Koopman operator theory provides a powerful framework for representing nonlinear dynamics through a linear operator acting on lifted observables, enabling the use of linear control techniques for nonlinear systems. However, Koopman models are typically learned from data and often degrade in performance under model uncertainty and distributional shifts between training and deployment. Although several works have explored online adaptation to address this issue, many rely on neural network-based updates that introduce significant computational overhead and lack formal safety guarantees, limiting their suitability for real-time and safety-critical robotic applications. In this work, we propose a unified framework for continual adaptive Koopman learning that enables safe and efficient online refinement of learned models during task execution. An autoencoder-based Koopman model is first learned offline and subsequently refined online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shifts and model uncertainty. To improve data efficiency and accelerate model refinement, the adaptation mechanism is integrated with an active learning strategy that drives the system to collect informative data while accomplishing task objectives. The resulting control problem is formulated as a nonconvex optimization problem incorporating both active learning objectives and safety constraints. We further derive theoretical bounds on model approximation error and show how these bounds can be incorporated within a robust Model Predictive Control (MPC) framework to provide formal safety guarantees. The proposed approach unifies learning, excitation, and safety within a single control framework without sacrificing real-time feasibility. Extensive simulation and experimental studies demonstrate superior performance compared to state-of-the-art baselines.

2605.09656 2026-05-12 cs.RO

ORICF -- Open Robotics Inference and Control Framework

Andrés Meseguer Valenzuela, Luís Miguel Bartolín Arnau

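AI summary: This paper proposes ORICF, an open robotics inference and control framework that targets the high computational overhead, latency, and energy consumption of current AI in robotics applications. The framework is modular, declarative, and model-agnostic, and lets users change models, hardware, and data channels through lightweight YAML configuration without modifying code. Experiments on a mobile robot combining speech recognition, a large language model, and an object detection model show that, with edge-computing deployment, ORICF significantly reduces on-robot compute load and energy consumption while preserving modularity and reproducibility.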

Comments Accepted in ICRA26 Workshop: 8th International Workshop on Robotics Software Engineering (RoSE 26)

2605.09650 2026-05-12 cs.AI cs.LG

Workspace Optimization: How to Train Your Agent

Elad Sarafian, Gal Kaplun, Ron Banner, Daniel Soudry, Boris Ginsburg

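AI summary: This paper studies how optimizing an agent's "workspace" can improve its performance on complex multi-turn tasks. The authors argue that when the weights of frontier language models are hard to adjust, training should instead act on a structured external workspace, a process they call "workspace optimization". To that end they design DreamTeam, a system in which multiple agents collaborate to build an executable world model; on ARC-AGI-3 it achieves a higher task solve rate than the previous best method while using fewer environment actions.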

2605.09649 2026-05-12 cs.LG

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, Rex Ying

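AI summary: This paper studies how better key-value (KV) cache management can improve long-context inference. The authors propose a KV cache eviction method based on a global retention mechanism: it learns the future usefulness of each token and evicts selectively under a unified memory budget, reducing memory use while improving generation quality. Experiments on several long-context language and vision-language reasoning tasks show that the method lowers KV memory usage while matching or exceeding full-cache inference.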

Comments A learnable KV eviction method for large language models
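
Eviction under a single global budget can be pictured with a small sketch. It assumes a usefulness score per cached entry is already available (the learned scorer itself is out of scope here), and the `(layer, position)` keying and function name are hypothetical, not taken from the paper.

```python
def evict_to_budget(scores, budget):
    """Keep the `budget` highest-scoring cache entries across all layers.

    scores: dict mapping (layer, position) -> predicted future usefulness.
    Returns (kept, evicted) as sets of keys. Ties are broken by key order
    so the result is deterministic.
    """
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    kept = set(ranked[:budget])
    evicted = set(ranked[budget:])
    return kept, evicted


# Toy cache: two layers, three positions each, one shared budget of 3 entries.
toy_scores = {
    (0, 0): 0.9, (0, 1): 0.1, (0, 2): 0.4,
    (1, 0): 0.8, (1, 1): 0.7, (1, 2): 0.2,
}
kept, evicted = evict_to_budget(toy_scores, budget=3)
```

With one shared budget, a layer whose entries score higher keeps more of them (here layer 1 keeps two entries and layer 0 keeps one), unlike a fixed per-layer quota.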

2605.09644 2026-05-12 cs.CV

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Zichen Zou, Xiaosong Jia, Zuxuan Wu, Yu-Gang Jiang

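AI summary: This paper proposes RetrieveVGGT, a training-free framework that addresses the out-of-memory failures and quality degradation that Transformer-based 3D reconstruction suffers on long sequences because of the cost of attention. By recasting context construction as a retrieval problem, RetrieveVGGT retrieves only a small number of relevant frames at each step, keeping memory overhead bounded, and uses query-key similarity inside VGGT as the relevance measure, so no extra training is needed. The method also introduces segment-wise sampling and a camera-pose-based spatial memory mechanism to further improve information diversity and localization accuracy; experiments show it outperforms several existing methods.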

2605.09640 2026-05-12 cs.CV cs.LG

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu

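AI summary: This paper studies how to overcome catastrophic forgetting in visual continual learning and proposes RaPO, a new reinforcement fine-tuning method. The authors find that existing methods such as GRPO still forget substantially under class-incremental and domain-incremental learning, and trace the cause to trajectory-level policy drift. RaPO mitigates this drift-induced forgetting by introducing a retention reward and cross-task advantage normalization. Experiments show strong performance across multiple continual learning settings, and the work offers a systematic exploration of reinforcement fine-tuning for visual continual learning.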

English abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
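
The abstract describes CTAN as a persistent exponential moving average of reward statistics that survives task boundaries. One way such a normalizer might look is sketched below; the decay value, the choice of tracking mean and variance, and the class name are all assumptions, since the abstract does not give the update rule.

```python
import math


class RunningRewardStats:
    """EMA of reward mean and variance that is never reset between tasks."""

    def __init__(self, decay=0.9):
        self.decay = decay  # hypothetical value
        self.mean = None
        self.var = None

    def update(self, rewards):
        """Fold one group's reward statistics into the running estimates."""
        m = sum(rewards) / len(rewards)
        v = sum((r - m) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            self.mean, self.var = m, v
        else:
            d = self.decay
            self.mean = d * self.mean + (1 - d) * m
            self.var = d * self.var + (1 - d) * v

    def advantages(self, rewards, eps=1e-8):
        """Normalize rewards with the running statistics, not the group's own."""
        std = math.sqrt(self.var)
        return [(r - self.mean) / (std + eps) for r in rewards]


stats = RunningRewardStats(decay=0.5)
stats.update([0.0, 1.0])   # last group of task 1: mean 0.5, variance 0.25
stats.update([1.0, 1.0])   # first group of task 2: mean 1.0, variance 0.0
adv = stats.advantages([0.75, 1.0])
```

Because the statistics carry over, the first group of a new task is normalized against history. A purely per-group normalizer would divide by a zero standard deviation for the all-equal group above.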

2605.09638 2026-05-12 cs.LG

Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh

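AI summary: This study proposes Plan2Cleanse, a test-time backdoor defense framework for detecting and mitigating backdoor attacks on deep reinforcement learning models. The method casts backdoor detection as a planning problem and uses Monte Carlo tree search to efficiently identify and neutralize backdoor trigger sequences without retraining the model. Experiments show that Plan2Cleanse significantly improves trigger detection success rates and task performance across multiple environments, supporting its effectiveness in practical deployment.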

Comments Published in Transactions on Machine Learning Research (TMLR)

2605.09636 2026-05-12 cs.AI

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

Zhen Hang, Yushan Yashengjiang, Junhui Li, Huanshuo Dong, Yang Wei, Zhezheng Hao, Jiangtao Ma, Songlin Bai, Haozhong Kai, Xihang Yue, Gangzong Si, Dongming Jiang, Chao Yao, Zhanhua Hu, Jiangqing Zhang, Pengwei Liu, Yaomin Shen, Xingyu Ren, Lei Liu, Zikang Xu, Han Li, Qingsong Yao, Hande Dong, Hong Wang

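AI summary: PDEAgent-Bench is, to the authors' knowledge, the first multi-metric, multi-library benchmark for partial differential equation (PDE) solver generation; it evaluates the ability to automatically generate numerical solver code from PDE descriptions. The benchmark contains 645 instances spanning 6 mathematical categories and 11 PDE families, supports mainstream finite-element libraries such as DOLFINx, Firedrake, and deal.II, and evaluates generated code in stages for executability, numerical accuracy, and computational efficiency. Experiments show that current large language models and code agents can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced, highlighting the difficulty of the task and the limits of existing methods.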

English abstract

PDE-to-solver code generation aims to automatically synthesize executable numerical solvers from partial differential equation (PDE) specifications. This task requires not only understanding the mathematical structure of PDEs, but also selecting appropriate discretization schemes and solver configurations, and correctly implementing the resulting formulations in finite-element method (FEM) libraries. Existing code generation benchmarks mainly evaluate syntactic correctness or success on predefined test cases. To our knowledge, there is currently no publicly available benchmark specifically for PDE-to-solver code generation, and general-purpose code benchmarks do not fully capture the unique challenges of numerical PDE solution, such as ensuring solver accuracy, efficiency, and compatibility with professional FEM libraries. We introduce PDEAgent-Bench, to the best of our knowledge, the first multi-metric, multi-library benchmark for PDE-to-solver code generation. PDEAgent-Bench contains 645 instances across 6 mathematical categories and 11 PDE families, and supports the common FEM libraries DOLFINx, Firedrake, and deal.II. Each instance provides an agent-facing problem specification, a reference solution on a prescribed evaluation grid, and case-specific accuracy and runtime targets. PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks. Experiments with representative LLMs and code agents show that models can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced. These results indicate that current agents remain limited in producing numerically reliable and efficient PDE solvers, and that PDEAgent-Bench provides a reproducible testbed grounded in the practical requirements of numerical PDE solving.
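
The staged evaluation described in the abstract, where a solver must pass executability, then accuracy, then efficiency, can be sketched as a short gate sequence. The field names, the error measure, and the verdict strings below are hypothetical; the benchmark's real interface is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class CaseTargets:
    """Per-case thresholds (hypothetical field names)."""
    max_error: float      # accuracy target against the reference solution
    max_runtime_s: float  # runtime target in seconds


def staged_verdict(ran_ok, error, runtime_s, targets):
    """Apply the three checks in order and report the first one that fails."""
    if not ran_ok:
        return "fail:executability"
    if error > targets.max_error:
        return "fail:accuracy"
    if runtime_s > targets.max_runtime_s:
        return "fail:efficiency"
    return "pass"


targets = CaseTargets(max_error=1e-3, max_runtime_s=10.0)
verdicts = [
    staged_verdict(False, 0.0, 0.0, targets),    # crashed
    staged_verdict(True, 5e-2, 1.0, targets),    # runs, but inaccurate
    staged_verdict(True, 1e-4, 42.0, targets),   # accurate, but too slow
    staged_verdict(True, 1e-4, 2.0, targets),    # clears all three stages
]
```

Because the gates are sequential, a pass rate computed after each stage can only stay the same or drop, which matches the drop the abstract reports once accuracy and efficiency are enforced.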

2605.09635 2026-05-12 cs.CL

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Hao Liang, Qihan Lin, Zhaoyang Han, Xiaochen Ma, Zhen Hao Wong, Meiyi Qiang, Linzhuang Sun, Wentao Zhang

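AI summary: This study proposes K12-KGraph, a curriculum-aligned knowledge graph for evaluating and training large language models for education. The graph is extracted from People's Education Press textbooks, covers multiple subjects including mathematics, physics, chemistry, and biology, and contains seven node types and nine relation types; it is used to build the multi-task benchmark K12-Bench and the training dataset K12-Train. Experiments show that supervision grounded in curriculum structure performs better when educational resources are limited and significantly improves model performance on education-related tasks.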

2605.09634 2026-05-12 cs.CL

Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness

Erfan Loweimi, Sofia de la Fuente Garcia, Samira Loveymi, Hadi Daneshvar, Saturnino Luz

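AI summary: This study examines the reliability of large language models (LLMs) for mental health screening, focusing on model consistency, robustness to automatic speech recognition (ASR) errors, and evidence faithfulness. It evaluates three models, Phi-4, Gemma-2-9B, and Llama-3.1-8B, on real speech data and finds that Phi-4 and Gemma-2-9B show strong intra-model consistency and robustness to ASR errors, whereas Llama-3.1-8B is less stable. The study also reveals mismatches between model scores and the keywords cited as evidence, which poses a challenge for interpretability in clinical use.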

2605.09633 2026-05-12 cs.RO cs.SY eess.SY

Minimizing Worst-Case Weighted Latency for Multi-Robot Persistent Monitoring: Theory and RL-Based Solutions

Weizhen Wang, Ziheng Wang, Jianping He, Xinping Guan, Xiaoming Duan

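AI summary: This paper studies persistent monitoring with multiple robots on a weighted graph, where the goal is to design robot trajectories that minimize the worst-case weighted latency over all nodes on an infinite time horizon. Because the conventional worst-case latency objective cannot distinguish policies whose transient performance is poor but whose asymptotic performance is good, the authors propose a class of tail-performance objectives and establish a theoretical framework for the resulting optimization problems. Building on these results, they construct an equivalent event-driven Markov decision process (TWLO-MDP), develop reinforcement-learning-based solvers, and introduce a multi-robot monitoring benchmark (M2Bench); experiments show that the method effectively reduces worst-case weighted latency and outperforms existing methods.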

2605.09630 2026-05-12 cs.CL cs.LG

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez

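AI summary: This paper studies efficient patch-based inference in byte-level language models and proposes Scratchpad Patching (SP), which inserts temporary scratchpads within each patch to aggregate the bytes observed so far and refresh the patch-level context, reducing the prediction lag that grows with patch size. The method improves model performance at the same patch size and substantially reduces KV cache size and inference compute, offering a new direction for efficient language model design.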

Comments 23 pages, 15 figures

2605.09628 2026-05-12 cs.CV

DegBins: Degradation-Driven Binning for Depth Super-Resolution

Zhiqiang Yan, Zhengxue Wang, Jian Yang, Gim Hee Lee

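AI summary: Depth super-resolution (DSR) aims to recover a high-resolution depth map from a low-resolution one. Conventional methods usually learn the residual between high and low resolution in a low-dimensional feature space but struggle to model spatially varying degradation accurately. This paper proposes DegBins, a new DSR framework whose degradation-driven binning strategy recasts the regression problem as a hybrid classification-regression problem: residual depth is represented more flexibly as a weighted combination of discrete depth bins, and degradation is modeled in a high-dimensional feature space so that bin ranges and probability distributions adapt to it. Experiments show that DegBins outperforms existing methods on multiple benchmark datasets, with higher accuracy and robustness.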

Comments 9 pages

2605.09622 2026-05-12 cs.CV cs.AI

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao

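AI summary: Voxel-level dose prediction is a key but challenging task in radiotherapy planning, and existing models often fail to generalize across clinical scenarios. This paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework that transfers knowledge from pretrained video diffusion models to achieve efficient, clinically meaningful dose prediction. The method introduces a flexible conditional generation mechanism based on modality embeddings and combines it with a clinically guided reinforcement learning post-training strategy, significantly improving dose prediction accuracy and image quality over the current state-of-the-art models.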

Comments Accepted by CVPR 2026 main conference. Compared to the CVPR version, minor updates are included here (e.g., main text and appendix combined; timing scenario clarified in the appendix)

2605.09618 2026-05-12 cs.CL cs.CY

Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols

Julia Hu, Alfred Shen, Kumar Lakshmipathi

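AI summary: This study examines how language models perform under different reasoning protocols, namely direct answering, multi-sample voting, and multi-agent debate, to determine which is most effective under a cap on generated tokens. Experiments with two 8B models on MuSiQue and GSM8K show that the best protocol varies by model and dataset and is hard to select using simple ex-ante signals such as vote entropy. The study finds that vote entropy predicts only where debate is safe, not where debate is needed, indicating that current debate mechanisms still have limitations in practice.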

Comments 14 pages, 5 figures. Technical report / preprint

English abstract

When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
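
The vote-entropy threshold controller mentioned in the abstract can be illustrated with a short sketch. The threshold value and the function names are hypothetical, and the routing direction (debate only when the samples disagree) is inferred from the abstract's statement that high entropy reduces debate backfire.

```python
import math
from collections import Counter


def vote_entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    n = len(answers)
    probs = [c / n for c in Counter(answers).values()]
    return sum(-p * math.log2(p) for p in probs)


def majority(answers):
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def route(answers, threshold=0.5):
    """Send an example to debate only if its sampled votes disagree enough."""
    return "debate" if vote_entropy(answers) > threshold else "vote"


unanimous = ["12", "12", "12"]   # entropy 0 bits
split = ["12", "12", "15"]       # entropy about 0.918 bits
scattered = ["12", "15", "7"]    # entropy log2(3), about 1.585 bits
decisions = [route(a) for a in (unanimous, split, scattered)]
```

A unanimous vote has zero entropy whether it is right or wrong, so this controller can never route the unanimous-but-wrong cases to debate. By the abstract's own count those are 31 of the 47 debate-helpful examples, about 66%.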

2605.09614 2026-05-12 cs.CV

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai, Weishu Zhao, Shiyu Liang

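AI summary: This paper studies the decay of visual information in long-chain multimodal reasoning. It presents an information-theoretic analysis that derives a lower bound on the downstream visual gain of an intervention point and, based on this bound, designs RAPO, a policy optimization method built on reflection anchors. RAPO selects high-entropy reflection anchors and optimizes a finite-window KL-divergence surrogate, strengthening the propagation and retention of visual information during generation. Experiments show that RAPO significantly outperforms existing methods on multiple vision-language model benchmarks, and mechanistic analysis indicates that it strengthens the contrastive signal of visual dependence along the generation trajectory.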

Comments Under Review

2605.09613 2026-05-12 cs.RO cs.CV

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Narsimha Menga, Parikshit Sakurikar, Amirreza Rouhi, Satya Sai Reddy, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

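AI summary: This study proposes SABER, a high-fidelity action dataset for adapting robot vision-language-action (VLA) models to real-world retail settings. Built from many hours of natural in-store capture, SABER records fine-grained hand activity, whole-body motion, and scene dynamics of people in retail environments, with no staging or teleoperation. The dataset provides multiple action representations and was validated on a real robot system, where it significantly improved completion rates on complex retail tasks, demonstrating the key role of high-quality data in improving robot performance.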

English abstract

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

2605.09611 2026-05-12 cs.CL

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

Sietse Schelpe

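AI summary: This paper presents an empirical analysis of byte-exact, chunk-level deduplication in retrieval-augmented generation (RAG) pipelines, studying how much context it removes and how it affects quality across application scenarios. Experiments in academic, enterprise, and multi-turn conversational settings find a redundancy reduction of up to 80.34%, and evaluation by multiple models confirms that deduplication introduces no measurable quality degradation. The study shows that significant inference compute savings can be obtained deterministically without sacrificing model quality.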

Comments Preprint. Implementation and open-source community version available at: https://github.com/corbenic/merlin-community - https://zenodo.org/records/20090712
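
Byte-exact chunk-level deduplication can be illustrated with a few lines of standard-library Python. This is a generic sketch, not the linked implementation: hashing each chunk's UTF-8 bytes with SHA-256 is an assumption made here to keep the seen-set small, and the function names are hypothetical.

```python
import hashlib


def dedup_chunks(chunks):
    """Drop chunks whose bytes repeat an earlier chunk, keeping first-seen order."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def byte_reduction(chunks, kept):
    """Fraction of context bytes removed by deduplication."""
    before = sum(len(c.encode("utf-8")) for c in chunks)
    after = sum(len(c.encode("utf-8")) for c in kept)
    return 1 - after / before


retrieved = [
    "Refund policy: 30 days.",
    "Shipping takes 3-5 days.",
    "Refund policy: 30 days.",    # exact repeat, dropped
    "Refund policy: 30 days. ",   # trailing space differs, kept
]
kept = dedup_chunks(retrieved)
saved = byte_reduction(retrieved, kept)
```

Because matching is on exact bytes, a chunk that differs by a single trailing space survives, so the procedure is deterministic and never merges merely similar text.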