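arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.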

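2606.09409 2026-06-09 cs.AI cs.CL cs.LG New submission

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Mina Remeli, Moritz Hardt

Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Tübingen AI Center

AI summary: By converting benchmarks into generative evaluations, this paper finds that model rankings from pairwise comparisons aggregated with Elo agree closely with rankings based on ground-truth accuracy (Spearman correlation > 0.9). Style and judge bias have only minor effects, but repetition after the final answer (echo) is a causal driver of judge preference.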

Comments: Accepted at ICML'26

Abstract

Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
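
The comparison described in the abstract can be explored with a small simulation. The sketch below is not the authors' code: the judge model (prefer the correct answer when exactly one candidate is correct, otherwise flip a coin), the Elo constants, and every function name are assumptions made for illustration. It computes sequential Elo ratings from pairwise outcomes and the Spearman correlation between those ratings and the underlying accuracies.

```python
import itertools
import random


def elo_ratings(matches, k=16.0, base=1000.0):
    """Sequential Elo updates over (winner, loser) pairs."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Probability that the eventual winner was expected to win.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    return ratings


def ranks(values):
    """Rank values in descending order (1 = best); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def simulate_matches(accuracy, n_questions=500, seed=0):
    """Hypothetical judge: picks the correct answer if exactly one is correct."""
    rng = random.Random(seed)
    models = list(accuracy)
    matches = []
    for _ in range(n_questions):
        correct = {m: rng.random() < accuracy[m] for m in models}
        for a, b in itertools.combinations(models, 2):
            if correct[a] != correct[b]:
                winner, loser = (a, b) if correct[a] else (b, a)
            else:
                # Both correct or both incorrect: coin flip (an assumption).
                winner, loser = (a, b) if rng.random() < 0.5 else (b, a)
            matches.append((winner, loser))
    return matches


if __name__ == "__main__":
    accuracy = {"model_a": 0.8, "model_b": 0.6, "model_c": 0.4, "model_d": 0.2}
    ratings = elo_ratings(simulate_matches(accuracy))
    names = list(accuracy)
    rho = spearman([accuracy[n] for n in names], [ratings[n] for n in names])
    print(ratings, rho)
```

In this toy setup, judgments on pairs where both answers are correct (or both incorrect) carry no information, which mirrors the situation the abstract highlights; a biased judge could be modeled by replacing the coin flip.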

2606.09400 2026-06-09 cs.CV New submission

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

Bastian Wittmann, Chinmay Prabhakar, Suprosanna Shit, Bjoern Menze

Affiliations: Department of Quantitative Biomedicine, University of Zurich

AI summary: Proposes vesselFM-CT, a model trained with an iterative multi-step process and the TubeLoss loss function that segments all vessels in CT images, from the largest vessels down to minuscule mesenteric vessels. It outperforms the baselines and supports system-level cardiovascular analysis.

Abstract

The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variations in radius, length, topological properties, and branching patterns. This heterogeneity, together with location-specific anatomical background variations, poses a significant challenge for robust, large-scale analysis of the entire cardiovascular system. As a result, most research has focused on narrow, isolated segments of the vascular network. While such targeted studies provide valuable insights, they inherently limit the ability to assess the systemic health and functional integrity of the vascular network as a whole. In this work, we aim to bridge this gap to advance both clinical diagnostics and our fundamental understanding of vascular physiology. We propose the task of segmenting all vessels in CT images, ranging from the largest components of the cardiovascular system to even minuscule mesenteric vessels. To this end, we introduce vesselFM-CT, the first model capable of robustly segmenting all blood vessels in 3D CT images. VesselFM-CT is trained via an iterative, multi-step process and optimizes our proposed TubeLoss loss function, effectively addressing the inherent heterogeneity of the cardiovascular system. We demonstrate that vesselFM-CT outperforms all baselines and enables automated, precise extraction of the cardiovascular system from CT images, thereby unlocking a wide range of clinical and technical perspectives, including automated disease classification and synthetic CT image generation.

2606.09399 2026-06-09 cs.AI New submission

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

Radeen Mostafa, Sawradip Saha

Affiliations: RunAgent AI

AI summary: Proposes SuperBrowser, an autonomous web-navigation agent that mimics the perception-cognition-action triad of human browsing. It reaches an 89.47% success rate on the Mind2Web Hard benchmark, ahead of existing open/research agents.

Comments: 31 pages, 8 figures, preprint/work in progress

Abstract

We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain (an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions) separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.
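
The Ledger described in the abstract can be pictured as a small bounded data structure. The sketch below is hypothetical: the field names, the `record_action` and `to_prompt` methods, and the rendering format are assumptions, and the six-phase eviction loop is not modeled. It only shows how keeping "the last three actions" falls out of a fixed-length queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical agent memory: goal, last three actions, facts, dead-ends."""

    goal: str
    # A deque with maxlen=3 silently drops the oldest action on overflow.
    recent_actions: deque = field(default_factory=lambda: deque(maxlen=3))
    facts: list = field(default_factory=list)
    dead_ends: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)

    def record_action(self, action: str) -> None:
        self.recent_actions.append(action)

    def to_prompt(self) -> str:
        """Render only the retained state for the next model call."""
        lines = [f"Goal: {self.goal}"]
        lines.append("Recent actions: " + "; ".join(self.recent_actions))
        if self.facts:
            lines.append("Facts: " + "; ".join(self.facts))
        if self.dead_ends:
            lines.append("Dead ends: " + "; ".join(self.dead_ends))
        return "\n".join(lines)


if __name__ == "__main__":
    ledger = Ledger(goal="book a flight")
    for step in ["open site", "click search", "type city", "press enter"]:
        ledger.record_action(step)
    print(ledger.to_prompt())
```

Because the prompt is rebuilt from the ledger at each step, older actions never re-enter the context, which is one simple way to read the abstract's claim that only goal-relevant state is kept.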

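2606.09396 2026-06-09 cs.CL cs.LG New submission

PriFT: Prior-Support Guided Supervised Fine-Tuning

Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard

Affiliations: EPFL

AI summary: Proposes PriFT, which computes token weights from a frozen pretrained model to avoid the self-reinforcing dynamics induced by weighting with the online model. It achieves the best results among SFT baselines on mathematical reasoning, code generation, and medical question answering, and provides a better initialization for subsequent RL.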

Comments: The first two authors contributed equally to this work

Abstract

Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens is less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently being fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, switching the source of the reweighting signal from the online model to the pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
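
To make the two instantiations concrete, here is a minimal sketch of token reweighting with a frozen reference. It is an interpretation of the abstract, not the paper's implementation: the function names, the weight normalization, and the top-mass selection rule (a nucleus-style cut-off with a hypothetical `mass` parameter) are all assumptions.

```python
import math


def prior_prob_weights(ref_target_probs):
    """PriFT-prob-style weights (assumed form): the probability that the
    frozen pretrained model assigns to each target token."""
    return list(ref_target_probs)


def within_prior_mass(ref_dist, target, mass=0.9):
    """PriFT-mass-style selection (assumed form): keep the target token if it
    lies in the smallest high-probability set whose total mass reaches `mass`
    under the frozen pretrained distribution."""
    cumulative = 0.0
    for token, p in sorted(ref_dist.items(), key=lambda kv: -kv[1]):
        if token == target:
            return True
        cumulative += p
        if cumulative >= mass:
            return False
    return False


def weighted_sft_loss(policy_logprobs, weights):
    """Weighted token-level negative log-likelihood, normalized by the
    total weight (the normalization is an assumption)."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return -sum(w * lp for w, lp in zip(weights, policy_logprobs)) / total


if __name__ == "__main__":
    # Log-probabilities of the target tokens under the model being fine-tuned.
    policy_logprobs = [math.log(0.5), math.log(0.25)]
    # Probabilities of the same tokens under the frozen pretrained model.
    weights = prior_prob_weights([0.7, 0.05])
    print(weighted_sft_loss(policy_logprobs, weights))
```

The point of the design is that `weights` is computed once from the frozen reference, so it does not drift as the fine-tuned model's distribution changes during training.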

2606.09393 2026-06-09 cs.CV New submission

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Penghui Yang, Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Yibin Wang, Yujie Zhou, Jiazi Bu, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin

Affiliations: Tsinghua University; University of Science and Technology of China; Microsoft; Shanghai AI Laboratory; Shanghai Innovation Institute; Alibaba Cloud; The Chinese University of Hong Kong

AI summary: Proposes CapRL++, a framework that uses reinforcement learning with verifiable rewards (RLVR) to optimize multimodal captioning. The reward is the accuracy with which a non-visual language model answers questions from the caption alone, which improves dense caption quality and surpasses traditional supervised fine-tuning on more than 20 benchmarks.

Comments: 26 pages, 10 figures. Project page: https://github.com/InternLM/CapRL. arXiv admin note: text overlap with arXiv:2509.22647

Abstract

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
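
The decoupled reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the CapRL++ code: `caption_reward` and the question format are hypothetical, and `keyword_answerer` is a toy stand-in for the vision-free LLM that would answer the multiple-choice questions in practice.

```python
def caption_reward(caption, questions, answer_fn):
    """Reward = fraction of multiple-choice questions answered correctly
    by an answerer that sees only the caption, never the image."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for q in questions
        if answer_fn(caption, q["question"], q["options"]) == q["answer"]
    )
    return correct / len(questions)


def keyword_answerer(caption, question, options):
    """Toy stand-in for a vision-free LLM: choose the first option whose
    text appears in the caption, else guess the first option."""
    text = caption.lower()
    for key, option in options.items():
        if option.lower() in text:
            return key
    return next(iter(options))


if __name__ == "__main__":
    questions = [
        {"question": "Which animal is shown?",
         "options": {"A": "cat", "B": "dog"}, "answer": "B"},
        {"question": "What color is the ball?",
         "options": {"A": "red", "B": "blue"}, "answer": "A"},
    ]
    print(caption_reward("A dog chases a red ball.", questions, keyword_answerer))
    print(caption_reward("A photo.", questions, keyword_answerer))
```

A detailed caption lets the answerer resolve both questions, while an uninformative one leaves it guessing, so the reward tracks how much usable visual content the caption carries without comparing it to a reference caption.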

2606.09392 2026-06-09 cs.AI New submission

From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction

Shuhao Li, Weidong Yang, Yue Cui, Zizhuo Xu, Lipeng Ma, Fan Zhang, Xiaofang Zhou

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University; Tongyi Lab, Alibaba Group; The Hong Kong University of Science and Technology; Guangzhou University

AI summary: To address the difficulty of making fine-grained predictions from coarsely sampled data, proposes the Spatial-Temporal Refinement Predictor (STRP), which models space and time efficiently with Tree Convolution and Inverse Dilated Convolution and significantly outperforms existing methods on six datasets.

Abstract

Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic data systems collect and store observations at fixed, coarse-grained temporal intervals to reduce storage and computation costs. However, such coarse-grained data severely limits downstream applications that require predictions at a finer temporal granularity. Collecting and maintaining fine-grained traffic data across all locations and time periods would impose a substantial burden on database storage and preprocessing pipelines. To address this temporal granularity mismatch, we formulate a novel problem: predicting fine-grained future traffic using coarse-grained sampled data. We propose the Spatial-Temporal Refinement Predictor (STRP), a granularity-aware framework for spatio-temporal data systems. STRP integrates two components: Tree Convolution for efficient and interpretable spatial dependency modeling, and Inverse Dilated Convolution for progressive temporal extrapolation. STRP supports two practical prediction settings: window-based and duration-based, to handle different forms of granularity mismatch. Experiments on six benchmark datasets show that STRP significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Our work offers a practical and interpretable approach to managing granularity mismatches in spatio-temporal traffic data systems.

2606.09390 2026-06-09 cs.CV cs.AI cs.RO New submission

Real-time body pose non-verbal communication with a consistency-based reliability measure

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

Affiliations: National University of Science and Technology "Politehnica" Bucharest; Simion Stoilow Institute of Mathematics of the Romanian Academy; NORCE Norwegian Research Centre AS

AI summary: Studies the recognition of communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and achieves real-time performance on an embedded GPU.

Abstract

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.

2606.09389 2026-06-09 cs.CL New submission

LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks

Yifan Chen, Haitao Li, Yiran Hu, Kaisong Song, Jun Lin, Yueyue Wu, Qingyao Ai, Min Zhang, Yiqun Liu

Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; University of Waterloo; Alibaba Group

AI summary: Proposes the LexRubric benchmark, comprising 649 Chinese legal consultation and judicial examination instances and 12,337 atomic scoring criteria. It evaluates the reliability of LLMs on open-ended legal tasks under a six-dimensional framework and finds that current models still struggle.

Abstract

As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal questions remain challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.

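2606.09388 2026-06-09 cs.LG New submission

Distilling Safe LLM Systems via Soft Prompts for On Device Settings

Motasem Alfarra, Cristina Pinneri, Dana Kianfar, Mohammed Almousa, Christos Louizos

Affiliations: Qualcomm AI Research

AI summary: To address the challenge of deploying safe large language models (LLMs) on resource-constrained devices, proposes a safety-alignment method based on soft prompts and distillation training that achieves a superior safety-usefulness trade-off with minimal additional compute.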

Journal ref: 42nd Conference on Uncertainty in Artificial Intelligence 2026
Comments: Accepted to UAI 2026

Abstract

Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.
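
For readers unfamiliar with the two divergences named in the abstract, the sketch below computes them for discrete next-token distributions. It is a generic illustration rather than the paper's training code: treating the guarded dual-model system as the teacher and the soft-prompted LLM as the student, the KL direction, and the `sequence_loss` helper are all assumptions.

```python
import math


def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0.
    `eps` guards against division by zero when q_i = 0."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0.0)


def sequence_loss(teacher_dists, student_dists, divergence):
    """Average a per-token divergence over all positions in a sequence."""
    values = [divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical next-token distributions over a 3-token vocabulary.
    teacher = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]  # guarded system (assumed)
    student = [[0.6, 0.30, 0.10], [0.2, 0.6, 0.2]]  # LLM + soft prompt (assumed)
    print(sequence_loss(teacher, student, total_variation))
    print(sequence_loss(teacher, student, kl_divergence))
```

Total variation is bounded in [0, 1] and symmetric, while KL is unbounded and asymmetric, so the two objectives penalize a student that assigns near-zero probability to a teacher-supported token very differently.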

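2606.09383 2026-06-09 cs.CV New submission

An Opticalmechanics Framework for Dynamic Estimation of Multibody Systems

Banglei Guan, Xuanyu Bai, Qingquan Chen, Zibin Liu, Dongcai Tan, Zhenbao Yu, Yang Shang, Qifeng Yu

Affiliations: National University of Defense Technology

AI summary: Proposes an opticalmechanics kinematic-dynamic integrated framework that combines image-measured kinematic quantities with genetic-algorithm optimization to estimate joint torque without contact. Experiments show a wrist joint torque error of 0.46 Nm.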

Comments: 10 pages, 12 figures

Abstract

Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque sensors and controlled laboratory environments. To address this issue, this study proposes an opticalmechanics kinematic-dynamic integrated estimation framework for multibody systems. Specifically, a constrained multibody model is established to describe the system dynamics, while image-measured kinematic quantities are used as non-contact inputs for dynamic estimation. The unknown joint torque is then identified through a genetic-algorithm-based optimization by minimizing the discrepancy between model-predicted and image-measured kinematic quantities. Experimental validation on an air-bearing platform showed that the wrist joint torque estimated from image data achieved a mean absolute error of 0.46 Nm compared with sensor measurements. In the forward prediction test, the model-predicted angular velocity achieved a mean absolute error of 0.006 rad/s relative to the image-measured results. This study demonstrates the potential of combining image measurement and mechanical modeling for non-contact dynamic estimation in scenarios where direct force and torque measurement is difficult.

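2606.09381 2026-06-09 cs.RO New submission

ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration

Yuying Zhang, Francesco Verdoja, Wenyan Yang, Ville Kyrki

Affiliations: Aalto University

AI summary: Proposes ReGIL, a framework that treats a single demonstration as an external memory and uses retrieval to guide exploration, generate the regularization buffer, and construct rewards. It markedly improves success rate and training efficiency on the LIBERO and Meta-World benchmarks and on real-robot tasks.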

Abstract

Learning robot manipulation policies with deep neural networks from a single demonstration remains highly challenging, as even small deviations from the demonstrated trajectory can quickly compound into failure, while collecting substantial online interaction data is costly. We propose ReGIL, a retrieval-guided imitation learning framework that treats a single demonstration as an external memory. ReGIL repeatedly queries this static memory throughout training to simultaneously guide exploration, generate the regularization buffer, and construct rewards. Specifically, it computes rewards through local temporal alignment between the current trajectory and the retrieved segment, providing step-wise and informative feedback for policy improvement. We evaluate ReGIL on robotic manipulation tasks from the LIBERO and Meta-World benchmarks under the single demonstration setting. ReGIL outperforms prior baselines in both success rate and training efficiency. In real-robot experiments, using only one demonstration and less than one hour of online training, ReGIL achieves over 75% success rate across three manipulation tasks with randomness in both initial robot pose and target position. These results demonstrate that leveraging the single demonstration as reusable memory can provide more than static supervision for efficient robot learning. More details can be found on our website: https://regil2026.github.io/

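2606.09380 2026-06-09 cs.LG cs.AI cs.CL New submission

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

Affiliations: University of Cambridge; Mistral AI

AI summary: Proposes Reasoning Arena, a framework that uses trace tournaments to turn non-diverse reward groups, which carry no gradient signal, into relative reward signals and integrates them into reinforcement learning efficiently with a Bradley-Terry model. It improves competition mathematics and coding benchmarks by 7.6% on average and accelerates training by 27% to 41%.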

Comments: 9 pages, 6 figures, 2 tables (17 pages including references and appendices)

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
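
Two mechanics in this abstract lend themselves to a short illustration: why identical rewards give zero group-relative advantage, and how a Bradley-Terry model can be fit on an incomplete set of comparisons. The sketch below is a generic version, not the authors' implementation: the function names are hypothetical, the fit uses the standard minorization-maximization update, and the judge that would produce the comparisons is not modeled.

```python
from collections import defaultdict


def group_advantages(rewards):
    """Group-relative advantage: standardize rewards within one prompt's group.
    If every reward is identical, all advantages are zero (no gradient signal)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    if var == 0.0:
        return [0.0] * len(rewards)
    std = var ** 0.5
    return [(r - mean) / std for r in rewards]


def fit_bradley_terry(comparisons, n_iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs using the
    minorization-maximization update. Only observed pairs are used, so the
    comparison graph may be incomplete. Assumes winner != loser."""
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))
    strength = {i: 1.0 for i in items}
    for _ in range(n_iters):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength


if __name__ == "__main__":
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all zeros
    # Traces a and c are never compared directly; b acts as a shared anchor.
    comparisons = [("a", "b")] * 3 + [("b", "a")] + [("b", "c")] * 3 + [("c", "b")]
    print(fit_bradley_terry(comparisons))
```

In the example, traces `a` and `c` receive a relative ordering even though they never meet, which is the property that lets anchor-based comparisons avoid evaluating every pair.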

2606.09378 2026-06-09 cs.CV New submission

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Zhiwei Wang, Tao Huang, Wentao Jiang, Muyi Li, Jianxin Liu, Jian Chen, Jie Zou, Yong Luo, Bo Du, Jing Zhang

Affiliations: School of Computer Science, Wuhan University, China; The Central Hospital of Wuhan, China; School of Computer Science, Hubei University of Technology, China

AI summary: Proposes Echo-DM, a framework that combines conditional latent diffusion with region-aware fusion to remove artificial markers from ultrasound images without masks while preserving anatomical fidelity.

Comments: 18 pages, 4 figures

Abstract

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, where a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, which demonstrates that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, demonstrate superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, while providing favorable quality-efficiency trade-offs across deployment settings. Data, code and models will be released at https://github.com/MiliLab/Echo-DM.

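2606.09376 2026-06-09 cs.CL New submission

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Juan S. Santillana

Affiliations: Globant

AI summary: Reference-free faithfulness metrics measure only precision and ignore recall. To address this, the paper uses complete oracles (Formula 1 races and NOAA weather forecasts) to measure coverage, and designs a composite metric combining precision and coverage together with a verifier-guided generation method.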

Comments: 8 pages, multilingual (EN/ES/PT). A reference-free faithfulness metric adding recall (coverage) against a complete structured oracle: precision-only rewards abstention; requiring coverage reorders models. Code: https://github.com/vectrayx/precision-is-not-faithfulness Demo: https://huggingface.co/spaces/jsantillana/faithful-strategy-engineer-f1

AI中文摘要

超越人类：使用迁移学习的多物种动物面部识别

Maria De Marsico, Anil K. Jain, Annalaura Miglino

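Affiliations: Sapienza University of Rome; Michigan State University; University of Salerno

AI summary: The study uses transfer learning (FaceNet and Vision Transformer) for multi-species animal face recognition, validated on dog, primate, and cattle datasets. Dogs yield the best results (96.85% mean verification accuracy), and some settings outperform existing methods.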

Comments: This paper extends the work published in the proceedings of CAIP 2025 conference: 'Adapting to the Wild: From Human Face to Animal Face Recognition' by De Marsico, M., Jain, A. K., Miranda, M., & Orlando, A

AI中文摘要

个体动物识别可用于寻找丢失或被盗的宠物、追踪濒危物种个体以及识别拥挤农场中的动物。目前的识别技术主要使用物理设备（如微芯片），通常不切实际且难以应用。这些可以通过动物面部进行远程识别来替代；如果足够准确，它具有多个优势：非侵入性、可远距离工作、难以伪造，例如在食品工业中用病畜替换健康畜的情况。现有的少数数据集具有足够的每个主体图像并标注了单个动物身份，但不足以训练当前的深度学习架构。我们转而研究迁移学习的可能性，利用预训练网络模型作为骨干。我们的实验比较了专门在大型人脸数据库上训练的FaceNet和在ImageNet（即对象类别）上预训练的Vision Transformer（ViT）。我们使用了三种非常不同的动物的面部数据集：狗、灵长类（狐猴、金丝猴和黑猩猩）和牛。我们报告了结果，并对每个数据集与当前最优（SOTA）专门训练的深度网络进行了比较。三个数据集的捕获条件不同。图像质量（分辨率、运动模糊、不同姿态等）从狗到牛到灵长类依次下降。最佳性能在狗上实现，ViT达到了96.85%的平均验证准确率和84.34%的Rank-1识别率。濒危灵长类的结果仍然令人鼓舞，但性能因动物类别和任务（验证或识别）而异，并不总是优于SOTA。对于牛，ViT结果优于SOTA，而FaceNet仍然具有竞争力。

英文摘要

Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.


2606.09351 2026-06-09 cs.CL stat.ME New submission

In-Context Learning for the Imputation of Public Opinion Data with Large Language Models

基于上下文学习的民意数据插补方法

Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch

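Affiliations: LMU Munich; Munich Center for Machine Learning; University of Mannheim; Institute for Employment Research (IAB); University of Maryland, College Park

AI summary: Proposes imputing missing survey data through in-context learning (ICL). Evaluated on 150 opinion variables, the approach achieves lower absolute error than MICE PMM under all missingness mechanisms, with the clearest advantage under non-random missingness (MNAR).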

AI中文摘要

大型语言模型已被广泛评估为个体调查响应的模拟器。然而，在实践中，完全未观测到的响应很少见；主要问题是部分无响应。插补旨在通过填充这些缺失值来恢复调查数据集的整体结构。它有自己的明确定义的评估标准，并且与预测有根本区别。我们提出通过上下文学习（ICL）来插补缺失的调查数据。我们在美国趋势面板的15波调查中，针对150个意见变量，系统评估了不同缺失机制（MCAR、MAR、MNAR）下的ICL设计选择。与成熟的数据插补统计方法（如MICE PMM）相比，我们的ICL方法在所有缺失机制下均持续降低了绝对误差，在非随机缺失（MNAR）下收益最大。值得注意的是，性能最佳的配置（gpt-oss-120b，100个上下文示例）实现了接近名义水平的总体覆盖率（接近95%），置信区间比MICE PMM窄2到5倍。我们发布了一个具有类似sklearn API的Python包，以便使用本地和专有LLM轻松部署我们的方法。

英文摘要

Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
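The three missingness mechanisms named in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: `missing_mask` is a hypothetical helper, and the 1.5x / 0.5x weighting around the median is an arbitrary choice for illustration, not the authors' evaluation protocol.

```python
import random
from statistics import median


def missing_mask(values, covariate, mechanism, rate=0.3, seed=0):
    """Return a list of booleans; True marks a value to be set missing.

    Hypothetical helper for illustration only.
    """
    if len(values) != len(covariate):
        raise ValueError("values and covariate must have the same length")
    rng = random.Random(seed)
    if mechanism == "MCAR":
        # Missing completely at random: independent of all data.
        probs = [rate] * len(values)
    elif mechanism == "MAR":
        # Missing at random: depends only on an observed covariate.
        cut = median(covariate)
        probs = [min(1.0, 1.5 * rate) if c > cut else 0.5 * rate
                 for c in covariate]
    elif mechanism == "MNAR":
        # Missing not at random: depends on the value that goes missing.
        cut = median(values)
        probs = [min(1.0, 1.5 * rate) if v > cut else 0.5 * rate
                 for v in values]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [rng.random() < p for p in probs]
```

Under MNAR, high responses are masked more often than low ones, so the observed data alone give a biased picture of the variable; this is the setting in which the abstract reports the largest gains.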


2606.09350 2026-06-09 cs.RO cs.CV New submission

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

驯服感知抖动：面向可靠运动分类的不确定性感知激光雷达目标检测

Cornelius Schröder, Žygimantas Marcinkus, Markus Lienkamp

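Affiliations: Technical University of Munich; Institute for Automotive Engineering, Munich Institute of Robotics and Machine Intelligence, School of Engineering and Design

AI summary: Proposes a deployment-friendly strategy that uses uncertainty estimation and statistical testing to reduce false dynamic predictions for static objects, substantially cutting false positives and unnecessary stops in real-world driving.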

AI中文摘要

可靠的运动分类对于自动驾驶至关重要，因为对静态物体的错误动态预测可能会级联导致不必要的规划器干预。不稳定的边界框预测会导致跟踪中产生虚假的速度估计和错误预测的轨迹。我们提出了一种部署友好的缓解策略，该策略通过偶然不确定性估计增强3D目标检测器，并在短观测窗口上应用双样本z检验来区分真实运动和抖动。该方法集成到Autoware中，仅需最小改动，并重用现有数据关联以最小化计算开销。实验结果表明，在nuScenes上与速度阈值法性能相当，但在真实道路测试中，虚假动态预测和不必要停车显著减少，这是因为记录数据中存在中间抖动带，而仅基于速度的规则会误分类。这表明，不确定性感知检测和轻量级统计测试可以在噪声更大的真实环境中为自动驾驶带来实际性能提升。

英文摘要

Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
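As a rough illustration of the two-sample z-test mentioned in the abstract, the sketch below compares the mean position in the first and second half of a short observation window, using per-detection variances as a stand-in for the detector's aleatoric uncertainty. The function name `is_moving`, the half-window split, the 1-D position, and the `alpha` threshold are all assumptions made for this example, not the authors' Autoware integration.

```python
import math
from statistics import NormalDist


def is_moving(positions, variances, alpha=0.05):
    """Two-sample z-test on early vs. late detections (hypothetical helper).

    positions: 1-D object positions in metres, ordered in time.
    variances: predicted variance of each position in square metres.
    Returns (moving, z) where moving is True if the shift is significant.
    """
    n = len(positions)
    if n < 4 or n != len(variances):
        raise ValueError("need at least 4 positions with matching variances")
    half = n // 2
    early_p, late_p = positions[:half], positions[half:]
    early_v, late_v = variances[:half], variances[half:]

    mean_early = sum(early_p) / len(early_p)
    mean_late = sum(late_p) / len(late_p)
    # Variance of a sample mean under independent noise: sum(var_i) / n^2.
    var_early = sum(early_v) / len(early_v) ** 2
    var_late = sum(late_v) / len(late_v) ** 2

    std_err = math.sqrt(var_early + var_late)
    if std_err == 0:
        raise ValueError("variances must not all be zero")
    z = (mean_late - mean_early) / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha, z
```

The point of the design is that a displacement is judged relative to the detector's own reported noise, so a jittering box with high uncertainty is not mistaken for motion in the way a fixed speed threshold might be.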


2606.09348 2026-06-09 cs.LG cs.CL New submission

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

PBSD: 特权贝叶斯自蒸馏用于长程信用分配

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

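Affiliations: School of AI, Shanghai Jiao Tong University; XYZ AI Lab

AI summary: Proposes PBSD, which uses Bayes-calibrated self-distillation to turn sparse final rewards into fine-grained turn-level credit signals, addressing credit assignment in long-horizon agentic tasks. Experiments show gains in both in-domain and out-of-domain settings and improved generalization.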

AI中文摘要

长程智能体任务对基于结果的强化学习提出了根本性的信用分配挑战：轨迹级奖励验证最终正确性，但很少指导哪些中间推理步骤或工具交互对结果有贡献。在多轮搜索智能体中，这一困难尤为突出，因为成功轨迹可能包含误导性动作，而失败轨迹可能包含有价值的证据收集步骤。我们提出PBSD（特权贝叶斯自蒸馏），一种在稀疏最终奖励下进行细粒度信用分配的贝叶斯校准自蒸馏方法。PBSD通过验证答案的后验与先验概率比来衡量轨迹质量，并应用贝叶斯规则将这个难以估计的答案侧比率转化为标准学生模型与特权答案条件教师模型之间的易处理似然比。对该贝叶斯证据分数的自回归分解产生轮级信号，识别每个中间轮次是支持还是破坏已验证结果。因此，PBSD提供了一种原则性且优雅的重新加权方案，将稀疏结果监督转化为贝叶斯校准的轮级信用信号，同时完全兼容标准策略优化。实验表明，PBSD在领域内和领域外设置中均持续提升性能，并有效将知识从短上下文训练迁移到长上下文推理，表明其细粒度信用分配机制促进了更有效的策略学习并带来更好的泛化。

英文摘要

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
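The Bayes step described in the abstract can be written out as follows. The notation is assumed here for illustration ($q$ for the question, $a^\star$ for the verified answer, $\tau = (\tau_1, \dots, \tau_T)$ for the turns generated by the policy, $\pi$ for the model), and the paper's exact formulation, in particular its treatment of tool observations, may differ.

```latex
\frac{\pi(a^\star \mid q, \tau)}{\pi(a^\star \mid q)}
  = \frac{\pi(\tau \mid q, a^\star)}{\pi(\tau \mid q)},
\qquad
\log \frac{\pi(\tau \mid q, a^\star)}{\pi(\tau \mid q)}
  = \sum_{t=1}^{T} \log
    \frac{\pi(\tau_t \mid q, a^\star, \tau_{<t})}
         {\pi(\tau_t \mid q, \tau_{<t})}.
```

The first identity is Bayes' rule applied to the answer and the trajectory; the second is the chain rule over turns. In each summand the numerator plays the role of the answer-conditioned teacher and the denominator that of the standard student, so a positive term suggests that turn $t$ supports the verified outcome and a negative term suggests it undermines it.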


2606.09340 2026-06-09 cs.LG New submission

Thresholded Local Hyper-Flow Diffusion

阈值化局部超流扩散

Meher Chaitanya, Sebastian Dalleiger, Luana Ruiz

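Affiliations: KTH Royal Institute of Technology; Johns Hopkins University

AI summary: Proposes the TL-HFD algorithm, which keeps diffusion local for seeded hypergraph clustering by maintaining a local active region and using thresholded boundary activation. The local update is proven equivalent to the global update, and finite-time dual suboptimality bounds are given.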

AI中文摘要

局部超流扩散（HFD）为一般子模超图中的种子聚类提供了与边大小无关的Cheeger型保证，但现有的HFD求解器在每次迭代中不保持中间计算的局部性。我们引入了阈值化局部HFD（TL-HFD），这是一种一阶方法，它维护种子周围的活动区域，对该区域及其直接边界执行投影次梯度更新，并通过阈值化（top-k）边界激活进行扩展。我们证明了局部更新是精确的：限制在活动区域及其边界上的度预条件投影次梯度步骤与无限制的全局更新一致。我们为精确和阈值化更新建立了有限时间对偶次优性，将后者视为具有显式跳过边界误差的不精确投影次梯度步骤。我们进一步推导了一个加性激活体积界，由实现的局部次梯度范数和新激活顶点中的最小边界推动控制，并将具有局部支持的近似对偶最优性转化为早期停止迭代的鲁棒扫描切割保证。对于一般子模切割成本，每次迭代在扫描区域中是局部的，并且在超边原语中是对 oracle 敏感的。实验上，TL-HFD通常匹配或优于HFD，同时激活更少的体积，在扩散倾向于吸收非目标顶点的噪声实例上获得最大收益。

英文摘要

Local Hyper-Flow Diffusion (HFD) gives an edge-size-independent Cheeger-type guarantee for seeded clustering in general submodular hypergraphs, but existing HFD solvers do not keep intermediate computation local at every iteration. We introduce Thresholded Local HFD (TL-HFD), a first-order method that maintains an active region around the seeds, performs projected subgradient updates on that region and its immediate boundary, and expands via thresholded (top-k) boundary activation. We prove that the local update is exact: the degree-preconditioned projected subgradient step restricted to the active region and its boundary coincides with the unrestricted global update. We establish finite-time dual suboptimality for both exact and thresholded updates, treating the latter as inexact projected subgradient steps with explicit skipped-boundary error. We further derive an additive activated-volume bound controlled by realized local subgradient norms and the minimum boundary-push among newly activated vertices, and translate approximate dual optimality with localized support into a robust sweep-cut guarantee for early-stopped iterates. For general submodular cut-costs, each iteration is local in the scanned region and oracle-sensitive in the hyperedge primitive. Empirically, TL-HFD often matches or improves over HFD while activating less volume, with the largest gains on noisy instances where diffusion tends to absorb non-target vertices.


2606.09338 2026-06-09 cs.CL New submission

Multi-Hop Knowledge Composition is Bound by Pretraining Exposure

多跳知识组合受限于预训练暴露

Yannis Karmim, Luis Marti, Djamé Seddah, Valentin Barrière

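Affiliations: Inria, Paris, France; Inria, Chile; Dept. of Computer Science, Universidad de Chile

AI summary: The study finds that large language models fail at implicit multi-hop reasoning even when 1-hop accuracy reaches 97%; the cause is the lack of compositional contexts during pretraining rather than missing knowledge.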

AI中文摘要

大语言模型在隐式多跳推理上失败：当模型能正确回答“$X$出生于何时？”和“$Y$最亲密的朋友是谁？”，但在单次前向传播中回答“$Y$最亲密的朋友出生于何时？”时却失败，即使这两个事实都被完美记忆且可单独检索。我们在受控自然语言环境中研究这一失败，严格区分预训练期间暴露于组合上下文的个体和从未出现在任何此类上下文中的个体。我们确认，即使单跳准确率达到97%，组合失败仍然存在，从而将这一差距确定为预训练失败而非知识缺失。我们提出并测试了九种以数据为中心的增强格式，发现组合预训练可以迁移到暴露个体的未见问题，但从未迁移到未参与组合预训练的个体，这表明预训练期间暴露于组合上下文是隐式多跳推理的必要条件。

英文摘要

Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We study this failure in a controlled natural language setting with a strict separation between individuals exposed to compositional contexts during pretraining and those that never appear in any such context. We confirm that compositional failure persists even at 97% 1-hop accuracy, establishing the gap as a pretraining failure rather than a knowledge absence. We propose and test nine data-centric augmentation formats and find that compositional pretraining transfers to unseen questions for exposed individuals, but never to individuals absent from compositional pretraining, suggesting that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning.


2606.09334 2026-06-09 cs.CL New submission

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

提示工程在最小编辑乌克兰语语法错误纠正中能走多远？

Kateryna Karpo, Artem Chernodub

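Affiliations: Ukrainian Catholic University; YouScan; Zendesk

AI summary: Evaluates 11 commercial LLMs on minimal-edit Ukrainian grammatical error correction and finds that Gemini 3.1-Pro, with minimal-edit prompts and a few-shot strategy, reaches F0.5=69.22, narrowing the gap to fine-tuned SOTA.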

AI中文摘要

微调大型语言模型在乌克兰语语法错误纠正中占主导地位，而通过API访问的LLM在最小编辑基准上几乎未经过测试。我们在UNLP 2023 GEC-only基准上评估了来自四个提供商的11个商业LLM和一个开源乌克兰语模型，比较了零样本、少样本、最小编辑和LLM辅助提示优化策略。我们的最佳配置（Gemini 3.1-Pro）达到了F0.5=69.22，缩小了与微调SOTA（F0.5=73.14）超过90%的差距。对于零样本提示，只有Claude模型受益于乌克兰语指令。然而，所有模型的最佳总体结果使用了乌克兰语最小编辑提示，其语言特定规则需要精确的乌克兰语表达。在最小编辑+少样本基础上进行LLM辅助提示优化获得了最高分数。详细的最小编辑指令在标点和大小写错误上带来了最大收益，但导致模型放弃了几个低频类别。深入错误分析，我们识别了与乌克兰语特定语言现象相关的五种重复过度纠正模式。代码、提示和输出已公开。

英文摘要

Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.
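For readers unfamiliar with the F0.5 score reported above, the sketch below shows the general F-beta formula, which with beta = 0.5 weights precision more heavily than recall. The function name and the precision/recall values in the usage note are hypothetical; the abstract reports only the final F0.5 figures.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero.
        return 0.0
    return (1 + b2) * precision * recall / denom
```

With a hypothetical precision of 0.8 and recall of 0.4, `f_beta(0.8, 0.4)` is about 0.667 while the balanced F1 (`beta=1`) is about 0.533, which is why a system that makes few but reliable edits is rewarded on minimal-edit benchmarks.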


2606.09327 2026-06-09 cs.LG cs.AI New submission

A Universal Dense Football Event Representation Based on TabTransformer

基于TabTransformer的通用密集足球事件表示

Weiran Yang, Daniel Memmert, Maximilian Klemp-Weins

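Affiliation: Institute of Exercise Training and Sport Informatics, German Sport University Cologne

AI summary: Proposes a TabTransformer-based model that learns embedding vectors for categorical features to produce dense football event representations, which outperform baselines on downstream tasks.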

Comments: 12 pages, 1 figure. Preprint submitted to the 13th Workshop on Machine Learning and Data Mining for Sports Analytics (MLSA 2026)

AI中文摘要

足球事件数据为团队运动中球员动作的定量分析提供了丰富的时空来源。这些数据集包含异构特征，将连续的位置坐标与分类变量（如动作类型、动作结果和身体部位）相结合。此类数据已应用于体育分析中的比赛结果预测、球员评估和战术模式识别。然而，现有方法主要使用独热或序数嵌入表示来编码分类特征，忽略了动作描述符的内在语义。Transformer是一种基于自注意力的深度神经网络架构，能够捕获输入特征在任意位置之间的依赖关系。我们提出并实现了一个基于Transformer的模型，以学习分类事件特征之间的潜在依赖关系，并生成足球事件的密集表示。通过将分类特征编码为学习到的嵌入向量，在预训练期间捕获了特定于运动的动作语义，使得表示能够支持下游任务，如动作价值估计和比赛风格识别。实证评估表明，在下游预测任务中，嵌入表示在概率校准方面优于任务特定基线，如Brier分数所衡量的。

英文摘要

Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.
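The Brier score used above to measure probability calibration is the mean squared difference between predicted probabilities and binary outcomes; lower is better. The sketch below is a minimal version for binary events, with a hypothetical function name and made-up example values.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

A model that always predicts 0.5 scores 0.25 regardless of the outcomes, which gives a convenient reference point when comparing embedding-based predictors against task-specific baselines.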


2606.09323 2026-06-09 cs.AI cs.DB New submission

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

TRL-Bench：标准化跨范式的表格编码器表示级评估

Wei Pang, Xiangru Jian, Hehan Li, Zhixuan Yu, Alex Xue, Jinyang Li, Zhengyuan Dong, Xinjian Zhao, Hao Xu, Chao Zhang, Reynold Cheng, M. Tamer Özsu, Tianshu Yu

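Affiliations: The Chinese University of Hong Kong, Shenzhen; University of Waterloo; The University of Hong Kong; The University of Sydney; Université Lyon 1

AI summary: Proposes TRL-Bench, which standardizes downstream conditions to evaluate tabular encoders at three granularities (column/table, row, and compositional data-lake table enrichment), revealing that encoder quality is capability-specific rather than captured by a single ranking.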

AI中文摘要

表格编码器通常在特定任务的全流程管道中进行评估，因此即使处理相似的表格信号，来自不同训练范式的模型也难以直接比较。我们引入了TRL-Bench，一个多粒度表格表示学习基准，用于标准化跨范式的表示级评估：每个编码器通过其支持的封装器导出行、列或表嵌入，共享的轻量级探测头在三个套件中对其进行探测：TRL-CTbench（列/表）、TRL-Rbench（行）和TRL-DLTE（涵盖所有三种粒度的组合数据湖表增强）。为支持这一标准化设置，我们发布了精选的基准资产和任务重构，包括50个OpenML表格（含123个验证目标）、16个行对链接重写以及一个由1379个父表衍生的47772表DLTE湖。在20个模型和16个任务上的实验表明，一旦下游条件标准化，编码器质量是能力特定的，而非由单一排行榜决定。在TRL-CTbench中，通用文本编码器通常在具有强表面文本信号的任务上领先，而表格专用编码器在其预训练目标与任务对齐时获胜。在TRL-Rbench中，表内预测和跨表链接偏好不同的训练机制，原子链接性能与DLTE管道中的行匹配阶段强相关。在TRL-DLTE中，最强管道结合了能力匹配的专用编码器而非重复使用单一编码器，且顶级端到端质量取决于非加性的组合适配而非每阶段边际排名。TRL-Bench提供了一个通用协议，用于在共享下游条件下测量导出表格表示中的可复用信号。代码和数据：https://github.com/LOGO-CUHKSZ/TRL-Bench

英文摘要

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench


2606.09314 2026-06-09 cs.RO New submission

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

KPGrasp: 可扩展的关键点流匹配用于灵巧抓取生成

Yuansen Huang, Jiayi Chen, Haoran Liu, Yubin Ke, Bing Han, Jiangran Lyu, Mi Yan, Li Yi, He Wang

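Affiliations: Peking University; Galbot; Xi’an Jiaotong University; Tsinghua University

AI summary: Proposes the KPGrasp framework, which learns dexterous grasp priors from large-scale data using an all-Euclidean hand-keypoint parameterization and a Transformer flow model, without contact losses or test-time refinement, achieving high success rates and low penetration depth in simulation and real-world settings.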

Comments: 14 pages, 7 figures, 6 tables

AI中文摘要

对于基于学习的方法而言，生成高质量的灵巧抓取仍然具有挑战性，这些方法通常依赖于精心调整的接触损失或昂贵的基于接触的测试时优化。我们提出了KPGrasp，一个流匹配框架，从大规模数据中学习灵巧抓取先验，而不是依赖接触损失或基于接触的测试时优化。KPGrasp将全欧几里得3D手部关键点参数化与一个简单但可扩展的Transformer流模型相结合。该参数化避免了传统混合SE(3)姿态和关节角度输出空间的缺点，在与物体点云相同的坐标系中表达抓取，从而实现了原生空间推理；Transformer流模型仅使用标准流匹配损失进行训练，并随着数据、模型容量和批大小有效扩展。实验表明，在两个模拟基准上达到了最先进的性能。在Dexonomy基准上，它达到了76.3%的抓取成功率，比最强的直接可比基线提高了47.4%，同时将穿透深度减少到2.4毫米。同一模型在DexGrasp Anything基准上也无需微调即可达到最佳平均性能。对于批量推理，KPGrasp每次抓取仅需0.032秒。最后，在20个不同物体上的真实世界实验表明，该流水线可以在真实环境中部署。

英文摘要

Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.
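To make the "standard flow-matching loss" concrete, the sketch below computes it for a single sample of 3D keypoints, assuming the common straight-line path between a noise sample and a data sample. The abstract does not state which probability path KPGrasp uses, so the path, the function name, and the `velocity_fn` stand-in for the Transformer are all assumptions for illustration.

```python
def flow_matching_loss(velocity_fn, x0, x1, t):
    """Squared-error flow-matching loss for one sample on a straight-line path.

    x0: noise keypoints, x1: data keypoints, both lists of [x, y, z].
    t: time in [0, 1].
    velocity_fn(x_t, t): hypothetical model returning predicted velocities
    with the same shape as x_t.
    """
    if not x0 or len(x0) != len(x1):
        raise ValueError("x0 and x1 must be non-empty and equal length")
    # Point on the straight line between noise and data at time t.
    x_t = [[(1.0 - t) * a + t * b for a, b in zip(p0, p1)]
           for p0, p1 in zip(x0, x1)]
    # Regression target: the constant velocity of that line.
    target = [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(x0, x1)]
    pred = velocity_fn(x_t, t)
    sq_errors = [(u - v) ** 2
                 for pred_pt, tgt_pt in zip(pred, target)
                 for u, v in zip(pred_pt, tgt_pt)]
    return sum(sq_errors) / len(sq_errors)
```

Because the keypoints live in the same Euclidean frame as the object point cloud, the loss is a plain coordinate-wise regression with no special handling for rotations or joint angles, which is the simplification the abstract attributes to the all-Euclidean parameterization.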
