arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21768 2026-05-22 cs.LG cs.MA

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

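AI summary: This paper proposes the Memory-R2 framework, which combines local and global group-relative optimization to address the unfair credit assignment caused by diverging memory states when long-horizon memory-augmented LLM agents are trained in multi-session environments, while jointly optimizing memory formation and memory evolution.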

Abstract

Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
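
The group-relative comparison described in this abstract can be illustrated with a minimal Python sketch. It assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function name and the reward values are hypothetical, and the abstract does not say how LoGo-GRPO weights or combines the local and global terms, so the two groups are simply normalized separately here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (GRPO-style assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Global group: trajectory-level rewards of full multi-session rollouts
# (hypothetical values).
global_rewards = [0.2, 0.5, 0.9, 0.4]
global_adv = group_relative_advantages(global_rewards)

# Local group: rerollouts that all restart from the SAME intermediate
# memory state, so their rewards are directly comparable (hypothetical values).
local_rewards = [1.0, 0.0, 1.0, 1.0]
local_adv = group_relative_advantages(local_rewards)
```

Normalizing within each group means a local rerollout is only ever ranked against siblings that saw the same memory state, which is the fairness property the abstract argues for.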

2605.21766 2026-05-22 cs.CV cs.GR

BodyReLux: Temporally Consistent Full-Body Video Relighting

Li Ma, Mingming He, Xueming Yu, David M. George, Ahmet Levent Taşel, Paul Debevec, Julien Philip

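AI summary: This paper proposes BodyReLux, a video-diffusion-based framework for relighting full-body human performances in a temporally consistent way. The model is trained on a hybrid dataset that combines traditional static one-light-at-a-time capture with a novel dynamic performance capture technique. With a new lighting-conditioning representation and a data augmentation pipeline, it achieves high-quality, robust, and temporally consistent video relighting.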

Comments Siggraph 2026 Journal Track. Project page: https://eyeline-labs.github.io/bodyrelux/

Abstract

Being able to relight human performance is a fundamental task for post-production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances, and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high-quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.

2605.21765 2026-05-22 cs.LG

Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning

Emanuel Sommer, David Rügamer

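AI summary: This paper examines the potential of sampling-based inference (SAI) in Bayesian deep learning, arguing that it has reached computational parity with optimization-based methods and may become the more effective inference approach. Its core contribution is to promote the use of SAI in Bayesian neural networks and to address existing misconceptions, with the goal of more accurate uncertainty quantification.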

Comments In Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

Abstract

The practical adoption of sampling-based inference (SAI) in Bayesian neural networks (BNNs) remains limited, partly due to persistent misconceptions about the feasibility and efficiency of sampling. This position paper argues that SAI has achieved computational parity with optimization-based methods and is on the verge of superseding such methods for effective and efficient inference in BNNs. This development should be in the interest of the whole community, promoting BNNs as a principled paradigm with its long-standing yet unfulfilled promise of providing principled uncertainty quantification for neural networks. SAI can even do more: it can yield superior prediction performance through model averaging, serve as the foundation for a plethora of possible downstream tasks, and provide crucial insights into the landscape of BNNs. To make such a change happen and unfold the potential of sampling, overcoming current misconceptions is a necessary first step. The next step is to realign research efforts toward addressing remaining challenges in SAI. In particular, the community must focus on two core problems: sufficient exploration of the posterior landscape and high-fidelity distillation of posterior samples for efficient downstream inference. By addressing conceptual and practical obstacles, we can unlock the full potential of SAI and establish it as a central tool in Bayesian deep learning.
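
The model averaging mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration that assumes a classification setting in which each posterior weight sample yields one vector of class probabilities. The function name and the probability values are hypothetical and do not come from the paper.

```python
def bayesian_model_average(per_sample_probs):
    """Average class probabilities over posterior samples.

    per_sample_probs[s][c] is the probability of class c predicted by the
    network whose weights are posterior sample s.
    """
    n_samples = len(per_sample_probs)
    n_classes = len(per_sample_probs[0])
    return [
        sum(probs[c] for probs in per_sample_probs) / n_samples
        for c in range(n_classes)
    ]


# Three hypothetical posterior samples, two classes.
samples = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
averaged = bayesian_model_average(samples)
```

The spread of the individual predictions around the average is one simple view of the uncertainty that a single point estimate would hide.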

2605.21763 2026-05-22 cs.LG cs.SY eess.SY stat.ML

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Oliver Mortensen, Mohammad Sadegh Talebi

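AI summary: This paper studies risk-sensitive reinforcement learning in finite discounted MDPs under the optimized certainty equivalent (OCE) family of risk measures. It analyzes the sample complexity of learning the optimal state-action value function and an optimal policy under recursive OCE, gives an exact characterization of the utility functions that are PAC-learnable, establishes PAC sample complexity bounds for a simple model-based approach, and shows that the problem is not PAC-learnable when the domain of the utility function is not all of the reals. Finally, it gives lower bounds for value and policy learning that demonstrate tightness in the state-action space size SA and, for a more restricted class of utilities, makes the dependence on the effective horizon 1/(1-γ) explicit.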

Comments Accepted to RLC 2026. arXiv admin note: substantial text overlap with arXiv:2506.00286

Abstract

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain, i.e., $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of the state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\text{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over the state of the art, although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.
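
To make the OCE family concrete, the following Python sketch evaluates CVaR on an empirical reward sample through its OCE form. It assumes the standard definition $\text{OCE}_u(X) = \sup_{\lambda} \{\lambda + \mathbb{E}[u(X - \lambda)]\}$ in the reward (higher is better) convention, with utility $u(t) = \min(t, 0)/\tau$ for $\text{CVaR}_\tau$. The abstract itself does not spell out these formulas, and the function names and sample values are illustrative.

```python
def cvar_utility(t, tau):
    """Utility u(t) = min(t, 0) / tau, which turns the OCE into CVaR_tau."""
    return min(t, 0.0) / tau


def empirical_cvar_via_oce(samples, tau):
    """CVaR_tau of an empirical reward distribution, for 0 < tau <= 1."""
    n = len(samples)

    def objective(lam):
        return lam + sum(cvar_utility(x - lam, tau) for x in samples) / n

    # The objective is concave and piecewise linear in lam with kinks at the
    # sample values, so its maximum is attained at one of the samples.
    return max(objective(lam) for lam in samples)


rewards = [1.0, 2.0, 3.0, 4.0]
worst_half_mean = empirical_cvar_via_oce(rewards, tau=0.5)  # mean of {1, 2}
plain_mean = empirical_cvar_via_oce(rewards, tau=1.0)       # ordinary mean
```

Smaller values of `tau` put all the weight on the worst outcomes, which gives some intuition for why the sample complexity in the abstract grows as $\tau$ shrinks.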

2605.21762 2026-05-22 cs.LG

Machine learning prediction of obstructive coronary artery disease using opportunistic coronary calcium and epicardial fat assessments from CT calcium scoring scans

Juhwan Lee, Ammar Hoori, Tao Hu, Justin N. Kim, Mohamed H. E. Makhlouf, Michelle C. Williams, David E. Newby, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

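AI summary: This study develops an advanced machine learning framework that predicts obstructive coronary artery disease by analyzing coronary calcium and epicardial fat data from CT calcium scoring scans, showing the method's potential to improve predictive performance and reduce reliance on contrast-enhanced CT or invasive testing.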

Comments 16 pages, 4 figures, 3 tables

Abstract

Non-contrast computed tomography calcium scoring (CTCS) is a cost-effective imaging modality widely used to detect coronary artery calcifications. This study aimed to develop an advanced machine learning framework that utilizes quantitative analyses of coronary calcium and epicardial fat from CTCS images to predict obstructive coronary artery disease (CAD). The study population consisted of 1,324 patients from the SCOT-HEART clinical trial who underwent both CTCS and coronary CT angiography. We extracted and analyzed a broad range of features, including 24 clinical variables, 189 calcium-omics features, and 211 epicardial fat-omics features from the CTCS images. Feature selection was conducted using the CatBoost algorithm combined with SHapley Additive exPlanation (SHAP) values. Predictive modeling utilized the CatBoost gradient boosting method, focusing on the most informative features. From an initial set of 424 candidate features, 14 were identified as most predictive through the CatBoost-SHAP method. The top two predictive features originated from fat-omics, with the remaining 12 features derived from calcium-omics. The optimized model achieved robust predictive capabilities, demonstrating a sensitivity of 83.1±4.6%, specificity of 93.8±1.7%, accuracy of 85.3±2.0%, and an F1 score of 73.9±3.3%. Inclusion of calcium-omics and fat-omics data significantly improved predictive performance. Notably, the model also showed reliable predictive accuracy in patients with diverse coronary calcium scores, including cases with obstructive CAD despite a zero calcium score. This innovative approach holds promise for improving clinical decision-making and potentially reducing dependence on contrast-enhanced or invasive diagnostic procedures, particularly within low- to intermediate-risk patient groups.

2605.21758 2026-05-22 cs.AI

A Causal Argumentation Method for Explainability of Machine Learning Models

Henry Salgado, Meagan R. Kendall, Martine Ceberio

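AI summary: This paper proposes a method that combines causal reasoning with argumentation-based reasoning to explain why a machine learning model makes a particular prediction. It identifies causal relationships among variables with causal discovery methods, translates them into a bipolar argumentation framework that represents supportive and opposing interactions among features, and then determines explanatory feature extensions using semi-stable semantics.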

Comments To be published in The 4th World Conference on eXplainable Artificial Intelligence

Abstract

Explainable AI (XAI) methods identify which features are relevant to a model's predictions but often fail to clarify why certain decisions are made. In this work, we present a novel method that integrates causality with argument-based reasoning to explain why models may be making predictions. Our approach first identifies causal relationships among variables using causal discovery methods and then translates these into a Bipolar Argumentation Framework (BAF) to represent supportive and opposing interactions among features. By using semi-stable semantics, we find extensions of features that explain why certain outcomes may have been chosen. We demonstrate our method on two benchmark datasets and compare its results against standard post-hoc explainability approaches.
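
As a rough illustration of what semi-stable semantics selects, the Python sketch below enumerates semi-stable extensions by brute force. It is a simplification under stated assumptions: it handles only the attack relation of an abstract argumentation framework, takes a semi-stable extension to be an admissible set whose range (the set plus everything it attacks) is maximal, and uses placeholder argument names. The abstract does not say how the support relation of the BAF is treated, so supports are not modeled.

```python
from itertools import combinations


def semi_stable_extensions(arguments, attacks):
    """Brute-force semi-stable extensions of an attack-only framework.

    arguments: iterable of hashable labels
    attacks: set of (attacker, target) pairs
    """
    args = list(arguments)

    def attacked_by(subset):
        return {target for (attacker, target) in attacks if attacker in subset}

    def conflict_free(subset):
        return not any((a, b) in attacks for a in subset for b in subset)

    def admissible(subset):
        if not conflict_free(subset):
            return False
        defeated = attacked_by(subset)
        # Every attacker of a member must itself be attacked by the subset.
        return all(a in defeated for (a, t) in attacks if t in subset)

    candidates = []
    for size in range(len(args) + 1):
        for combo in combinations(args, size):
            subset = frozenset(combo)
            if admissible(subset):
                candidates.append((subset, subset | attacked_by(subset)))

    # Keep admissible sets whose range is not strictly inside another range.
    return [s for s, rng in candidates
            if not any(rng < other for _, other in candidates)]


# Placeholder arguments: a and b attack each other, and b attacks c.
extensions = semi_stable_extensions(
    ["a", "b", "c"], {("a", "b"), ("b", "a"), ("b", "c")}
)
```

The enumeration is exponential in the number of arguments, so it is only suitable for small illustrative frameworks.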

2605.21752 2026-05-22 cs.LG cs.AI

PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation

Blake Gella, Wei Wu, Yuhao Yin, Zexi Huang, Zikai Wang, Emily Liu, Junlin Zhang, Wentao Guo, Qinglei Wang

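AI summary: This paper proposes the PEARL framework, which uses contrastive learning to address imbalance in user behavior and models relative preference signals to improve the performance and robustness of recommender systems.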

Abstract

Recommender systems trained on user interaction data are susceptible to behavioral intensity imbalance, a systematic distortion arising from heterogeneous engagement patterns across users. This imbalance skews feedback signals such that observed interactions no longer faithfully reflect true preferences, causing models to disproportionately amplify signals from highly active users while underrepresenting others, which ultimately degrades recommendation quality and robustness at scale. To address this issue, we propose a nonparametric contrastive percentile approximation framework, PEARL, that models relative preference signals instead of absolute engagement magnitudes. Building upon relative advantage debiasing, PEARL leverages real contrastive interaction samples to approximate percentile relationships directly, without relying on auxiliary distribution estimation models. We provide theoretical justification demonstrating that such pairwise comparisons yield unbiased estimates of percentile-based preference signals. For broader applicability, we introduce a prediction-based bootstrapping mechanism for percentile smoothing to handle sparse and discrete feedback, alongside a generalized value-weighted formulation and a co-training strategy to enhance both modeling flexibility and representation learning. Extensive offline experiments demonstrate that PEARL effectively mitigates behavioral bias and consistently improves recommendation performance across multiple ranking targets. Deployed in a production livestream platform with a combined user base of billions, online A/B testing confirms substantial real-world gains: +2.10% Watch Duration, +0.80% Consumption Amount, +1.49% Interaction Rate, and -6.91% Report Rate.
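
The idea of estimating a percentile from pairwise comparisons can be sketched as follows. This minimal Python example assumes that a value's percentile within one user's history is the probability that a randomly drawn interaction from that same user falls below it, so averaging comparison indicators estimates it. The function name, the half-weight for ties, and the watch-time values are hypothetical, and PEARL's actual estimator and loss are not given in the abstract.

```python
def pairwise_percentile(value, contrast_samples):
    """Share of contrast samples below `value`, with ties counted as half."""
    score = 0.0
    for other in contrast_samples:
        if other < value:
            score += 1.0
        elif other == value:
            score += 0.5
    return score / len(contrast_samples)


# Two hypothetical users whose watch times (seconds) differ by orders of
# magnitude in absolute terms.
heavy_user = [600, 900, 1200, 1500, 3000]
light_user = [5, 10, 20, 30, 60]

heavy_median = pairwise_percentile(1200, heavy_user)
light_median = pairwise_percentile(20, light_user)
```

Both calls return the same percentile even though the raw watch times differ widely, which is the sense in which a relative signal is insensitive to how intensely a given user engages.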

2605.21751 2026-05-22 cs.LG

Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization

Zhiqi Gao, Albert Ge, Alexander Berenbeim, Nathaniel D. Bastian, Frederic Sala

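AI summary: This paper studies the separability of two key capabilities in text-to-optimization, modeling and binding. It finds that model accuracy declines as instance data grows, proposes BIND, which externalizes data to structured files to improve binding, and validates the advantage of binding-specialist models across different optimization categories.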

Abstract

Text-to-optimization requires two separable capabilities: modeling (choosing the right optimization structure) and binding (grounding every coefficient, index, and parameter in the concrete problem data). We study this via Text2Opt-Bench, a scalable benchmark of solver-verified optimization problems spanning 12 categories, from textbook linear programs to stochastic and multi-objective formulations with up to thousands of variables. Across 10+ models, we find that accuracy collapses as instance data grows, even when the formulation itself is simple. We call this the effective binding limit. We address this via a simple inference-time approach, BIND, which externalizes numeric data to structured files so the model binds data programmatically rather than transcribing from the prompt. BIND improves GPT-5-Nano from 59.1% to 82.4% accuracy, matching pass@5 (82.0%) at lower token cost than pass@1, and GPT-5 from 86.2% to 95.8%. Furthermore, we validate our hypothesis by finetuning a model exclusively on binding and show that it outperforms end-to-end SFT and RL across three structurally distinct optimization categories, with a 1.5B binding specialist alone matching a 7B end-to-end baseline.
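
The difference between transcribing numbers and binding them programmatically can be sketched in Python. The example below is only an illustration of the general idea: the JSON schema, the product names, and the helper functions are hypothetical, and the abstract does not describe BIND's actual file format or generated code.

```python
import json

# Hypothetical instance data. In the workflow the abstract describes, this
# would live in a structured file instead of being pasted into the prompt.
INSTANCE_JSON = """
{
  "products": ["chairs", "tables"],
  "profit": {"chairs": 30, "tables": 50},
  "hours": {"chairs": 2, "tables": 4},
  "hours_available": 40
}
"""


def bind_model(raw_json):
    """Build objective and constraint coefficients by indexing into the data."""
    data = json.loads(raw_json)
    products = data["products"]
    objective = {p: data["profit"][p] for p in products}
    constraint = ({p: data["hours"][p] for p in products},
                  data["hours_available"])
    return objective, constraint


def evaluate(plan, objective, constraint):
    """Return the objective value of a plan and whether it is feasible."""
    coeffs, limit = constraint
    used = sum(coeffs[p] * plan[p] for p in plan)
    value = sum(objective[p] * plan[p] for p in plan)
    return value, used <= limit


objective, constraint = bind_model(INSTANCE_JSON)
result = evaluate({"chairs": 10, "tables": 5}, objective, constraint)
```

Because every coefficient is looked up by key, the same code scales to thousands of entries without any number being retyped.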

2605.21748 2026-05-22 cs.CL

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell

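AI summary: This paper proposes RankJudge, a synthetic benchmark generator for evaluating LLMs as judges of multi-turn conversations. By generating conversation pairs with a single injected flaw, it enables strict evaluation of judging accuracy. Experiments covering several domains and 21 frontier LLM judges confirm that the judge rankings are stable.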

Abstract

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
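
The Bradley-Terry ranking step can be illustrated with a short Python sketch. It uses the classical minorization-maximization update for Bradley-Terry strengths, which is one common fitting method and not necessarily the one the authors use. The win counts are invented, and the abstract does not say how pairwise outcomes between judges are constructed.

```python
def fit_bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from a square matrix of win counts.

    wins[i][j] is the number of times item i beat item j, and the diagonal
    is assumed to be zero. Returned strengths are normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical head-to-head results among three judges, 10 comparisons per pair.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Under the model, the probability that judge `i` beats judge `j` is `strengths[i] / (strengths[i] + strengths[j])`, so sorting by strength gives the ranking.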

2605.21747 2026-05-22 cs.CV cs.RO

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

Steven Chen, Shivesh Khaitan, Nemanja Djuric

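AI summary: This paper proposes a method that uses vision language models to infer vehicle information and thereby improve the accuracy of 3D vehicle labeling in self-driving. Zero-shot inference of vehicle information, combined with vehicle make and model recognition methods, improves labeling efficiency and quality.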

Comments To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation

Abstract

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) both to infer a vehicle's make, model, and generation from image crops and to output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided, human-annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

2605.21745 2026-05-22 cs.LG

Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring

Juhwan Lee, Sadeer Al-Kindi, Ammar Hoori, Tao Hu, Hao Wu, Justin N. Kim, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

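AI summary: This paper proposes a new machine learning framework that predicts myocardial ischemia from quantitative coronary calcification assessment in non-contrast CT calcium scoring scans. Relevant features are identified with XGBoost and SHAP, models are trained and evaluated with 5-fold cross-validation, and the results show that calcium-omics features significantly improve predictive performance.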

Comments 15 pages, 4 figures, 3 tables

Abstract

Non-contrast computed tomography calcium scoring (CTCS) is widely recognized as an effective tool for cardiovascular risk stratification. This study aimed to develop a novel machine learning framework for predicting myocardial ischemia from routine non-contrast CTCS scans using quantitative coronary calcium assessment. This study analyzed 1,375 patients who underwent both non-contrast CTCS and regadenoson stress cardiac positron emission tomography myocardial perfusion imaging within one year at University Hospitals Cleveland Medical Center. A total of 74 variables, including clinical variables, Agatston score, and calcium-omics features, were evaluated. Relevant features were identified using XGBoost with Shapley Additive exPlanations (SHAP). Predictive models were trained and evaluated using 5-fold cross-validation. Among 987 patients, 89 (9%) were positive for myocardial ischemia. The final model incorporated the Agatston score, eight calcium-omics features, and age. The proposed model achieved a precision of 98.9±3.0%, sensitivity of 79.2±8.4%, and F1 score of 87.7±5.3%. The addition of calcium-omics features significantly improved predictive performance compared with models using clinical variables alone or clinical variables with the Agatston score (p<0.05). Interestingly, the number of calcified arteries, despite being the lowest-ranked feature based on SHAP analysis, showed the strongest association with myocardial ischemia in logistic regression analysis (odds ratio: 3.63, 95% confidence interval: 2.80-4.77, p<0.00001). We developed a machine learning approach for predicting myocardial ischemia using routinely acquired non-contrast CTCS scans. Calcium-omics features provided incremental predictive value beyond conventional risk factors and Agatston scoring and may support more accessible cardiovascular risk stratification.
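
The reported F1 score can be related to the reported precision and sensitivity through the harmonic-mean formula. The Python sketch below plugs in the mean values quoted in the abstract. Because those figures appear to be averages over cross-validation folds, the F1 of the means need not match the reported mean F1 exactly. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (recall is sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and sensitivity from the abstract, expressed as fractions.
f1_from_means = f1_score(0.989, 0.792)
```

The result is close to 0.88, in line with the reported 87.7% once fold-to-fold variation is taken into account.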

2605.21742 2026-05-22 cs.LG cs.IT math.IT

Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification

Samuel McDowell, Nathan Stromberg, Lalitha Sankar

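AI summary: This paper studies how to correct the performance problems that class imbalance causes for prior-data fitted networks (PFNs) in tabular classification. An analysis of existing techniques finds that thresholding performs very well because of the calibration characteristics of PFNs, and that downsampling performs comparably because of PFNs' strong limited-data performance, with the added benefit of lower inference cost.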

Comments 5 pages, 6 figures, Information Theory Workshop (ITW)

Abstract

Prior-data fitted networks (PFNs) have achieved exceptional performance on tabular classification tasks. However, like other classifiers, their performance can suffer under the effect of class imbalance, resulting in poor performance for rare classes. Several techniques exist that attempt to mitigate the deleterious effect of class imbalance on classification performance, but the in-context learning (ICL) dynamic of PFNs means that loss-based strategies are impossible, and other techniques are unproven. We have adapted several classical techniques addressing class imbalance and analyzed their performance on PFN classification. We observe that thresholding performs exceptionally well because of the calibration characteristics of PFNs, and downsampling performs comparably because of PFNs' exceptional limited-data performance, with the additional benefit of reduced computation cost for inference.
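
The two techniques singled out in this abstract, thresholding and downsampling, can be sketched in plain Python. This is a generic illustration under stated assumptions: labels are binary with 1 as the rare class, the threshold value is chosen by hand, and the function names are hypothetical. The abstract does not specify how the authors pick thresholds or sample the context set.

```python
import random


def predict_with_threshold(p_minority, threshold):
    """Predict the rare class (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in p_minority]


def downsample_majority(rows, labels, seed=0):
    """Randomly drop rows of the larger class until both classes match in size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = min(len(by_class[0]), len(by_class[1]))
    balanced = []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            balanced.append((row, label))
    return balanced


# Lowering the threshold below 0.5 recovers more rare-class predictions.
predictions = predict_with_threshold([0.1, 0.3, 0.6], threshold=0.25)

# Ten hypothetical rows, only two of which belong to the rare class.
balanced = downsample_majority(list(range(10)), [1, 1] + [0] * 8)
```

Since a PFN conditions on its training set in context, a smaller balanced context also means fewer rows to process at inference time, which matches the cost benefit noted in the abstract.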

2605.21728 2026-05-22 cs.CV cs.CL cs.LG

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

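AI summary: This paper proposes BEiTScore, a reference-free image captioning evaluation method that uses an efficient cross-encoder model to address the computational cost and limited sensitivity of existing evaluation methods. It introduces a new evaluation metric and validates its performance across a variety of scenarios.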

Abstract

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

2605.21726 2026-05-22 cs.CL cs.AI

Probabilistic Attribution For Large Language Models

Shilpika Shilpika, Carlo Graziani, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka

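AI summary: This paper proposes a model-agnostic probabilistic token attribution measure that uses Bayes' rule to invert next-token log-probabilities, capturing the model's internal representation of the distribution over token sequences and thereby improving the interpretability of large language models.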

Comments 29 pages, 13 figures

英文摘要

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes' rule to invert the next-token log-probabilities so as to capture the model's internal representation of the distribution over token sequences. The representation is independent of the model's computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt's token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
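
The log-ratio score and the entropy mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes the two conditional probabilities are already available as plain numbers; the function names are hypothetical, and the paper's Bayes-rule inversion and marginalization procedure are not reproduced here.

```python
import math


def attribution_score(p_full, p_marginalized):
    """Log of the ratio P(response | prompt) / P(response | prompt with one token marginalized).

    Both inputs are assumed to be precomputed probabilities in (0, 1].
    """
    if p_full <= 0.0 or p_marginalized <= 0.0:
        raise ValueError("probabilities must be positive")
    return math.log(p_full) - math.log(p_marginalized)


def entropy(distribution):
    """Shannon entropy (in nats) of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in distribution if p > 0.0)
```

Under this reading, a positive score means the response is more likely with the token present than with it marginalized away, and a score near zero means the token has little influence on the response.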

2605.21724 2026-05-22 cs.LG cs.AI

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

TBP-mHC: 通过运输多面体实现 manifold-constrained 超连接的全表达性

Anton Lyubinin

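AI summary: This paper proposes TBP-mHC, which uses a transportation-polytope parameterization to give manifold-constrained hyper-connections full expressivity. It addresses the training instability caused by unconstrained mixing in hyper-connections and shows competitive performance with improved stability and scalability in language model pre-training.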

详情
AI中文摘要

超连接(HC)通过在多个残差流之间引入可学习的混合来改进残差网络,但无约束的混合导致训练不稳定。Manifold-Constrained Hyper-Connections(mHC)通过Sinkhorn归一化强制近似双随机性,而mHC-lite则通过置换矩阵的凸组合确保精确约束,但以阶乘复杂度为代价。KromHC通过Kronecker积参数化减少此成本,但限制混合矩阵为Birkhoff多面体的结构子流形。我们提出运输Birkhoff多面体(TBP)参数化及其递归变体(RTBP),通过(n-1)^2自由度构造精确的双随机混合矩阵。我们的方法避免了迭代归一化和组合爆炸,同时保持Birkhoff多面体的完整表达性。在语言模型预训练中的实验证明了竞争性性能,同时具有改进的稳定性和可扩展性。

英文摘要

Hyper-Connections (HC) improve residual networks by introducing learnable mixing across multiple residual streams, but unconstrained mixing leads to training instability. Manifold-Constrained Hyper-Connections (mHC) address this by enforcing approximate double stochasticity via Sinkhorn normalization, while mHC-lite ensures exact constraints through convex combinations of permutation matrices at the cost of factorial complexity. KromHC reduces this cost using Kronecker-product parameterizations, but restricts the mixing matrices to a structured submanifold of the Birkhoff polytope. We propose Transportation Birkhoff Polytope (TBP) parameterizations and their Recursive variants (RTBP), which construct exactly doubly stochastic mixing matrices with $(n-1)^2$ degrees of freedom. Our approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope. Empirical results on language model pre-training demonstrate competitive performance with improved stability and scalability.
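
For context, the Sinkhorn normalization that this abstract attributes to the mHC baseline can be sketched in a few lines of Python. This is a generic illustration of the iterative scheme, not the TBP construction proposed in the paper; the function name and iteration count are assumptions, and the input is assumed to be a square matrix with strictly positive entries.

```python
def sinkhorn_normalize(matrix, iterations=50):
    """Alternately rescale rows and columns of a positive square matrix.

    The result is only approximately doubly stochastic: after the last step the
    columns sum to 1 (up to rounding) and the rows sum to 1 up to a residual
    that shrinks as the number of iterations grows.
    """
    n = len(matrix)
    m = [list(row) for row in matrix]
    for _ in range(iterations):
        # Row step: divide each row by its sum.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [value / row_sum for value in m[i]]
        # Column step: divide each column by its sum.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m
```

The iteration count is the cost the abstract refers to as iterative normalization; an exact parameterization avoids this loop.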

2605.21723 2026-05-22 cs.RO cs.AI cs.MA cs.SY eess.SY

Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems

在异质多团队系统中学习利他性协作

Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt

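AI summary: This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, treating robots as transferable resources. Using Hamilton's rule from ecology as an altruistic decision-making mechanism, it proposes a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, the authors develop a graph neural network policy, under centralized training and decentralized execution, that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The approach is validated in a firefighting scenario through simulations and experiments, showing that the learned policy achieves near-optimal performance while scaling to larger systems.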

详情
AI中文摘要

本文研究了通过动态机器人分配实现的异质多团队协作,其中机器人被视为可转移资源。利用生态学中的哈密顿规则作为利他决策机制,我们提出了一种具有异质能力、转移成本和能力依赖贡献的多团队协作资源分配框架。所得到的分配问题是一个组合问题,并被证明是NP难的。为了解决可扩展性问题,我们开发了一种基于图神经网络的策略,在集中训练和分布式执行下近似基于哈密顿规则的利他性分配。该模型在团队交互图上运行,并预测机器人层面的转移决策和下一步的机器人到团队分配。通过消防演习场景的模拟和实验验证了所提出的方法,证明所学习的策略在扩展到更大系统时能够实现接近最优的性能。

英文摘要

This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
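
Hamilton's rule, in its classical form, says an altruistic act is favored when relatedness times the recipient's benefit exceeds the donor's cost (r * B > C). A minimal Python sketch of how such a test could gate robot transfers follows; the dictionary keys, the function names, and the way benefit and cost would be computed are hypothetical and are not taken from the paper.

```python
def hamilton_favors_transfer(relatedness, benefit, cost):
    """Classical Hamilton's rule: altruism is favored when r * B > C."""
    return relatedness * benefit > cost


def select_transfers(candidates):
    """Keep only candidate transfers that satisfy Hamilton's rule.

    Each candidate is a dict with hypothetical keys:
    'robot', 'to_team', 'relatedness', 'benefit', 'cost'.
    """
    return [
        c for c in candidates
        if hamilton_favors_transfer(c["relatedness"], c["benefit"], c["cost"])
    ]
```

This per-candidate test ignores interactions between transfers, which is where the combinatorial difficulty described in the abstract comes from.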

2605.21719 2026-05-22 cs.RO cs.SY eess.SY

Mind the Gaps: Multi-Robot Feedback-Driven Ergodic Coverage in Unknown Environments

注意缝隙:未知环境中的多机器人反馈驱动的遍历覆盖

Thales Costa Silva, Nora Ayanian

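AI summary: This paper proposes a multi-robot, feedback-driven ergodic coverage strategy that uses real-time feedback from an environmental model to adjust robot sampling behavior, improving coverage efficiency and resource allocation in unknown environments.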

详情
AI中文摘要

在本文中,我们解决了多机器人自适应覆盖的问题,其中机器人团队通过连续调整位置进行动态采样以收集环境数据。此任务具有挑战性,特别是在机器人必须随时间高效分配到新采样位置时。遍历搜索方法通过确保机器人时间平均的空间分布与环境信息的空间分布一致来优化机器人轨迹。虽然这些方法在目标分布已知的情况下能促进有效探索,但往往无法考虑环境的未知先验分布。为克服这一限制,我们提出了一种自适应覆盖策略,利用环境模型的实时反馈来调整机器人采样行为以应对未知条件。我们的方法通过基于环境参数模型构建目标空间信息分布,该分布在线更新,从而增强传统遍历轨迹优化。该策略假设环境是静态或变化缓慢相对于机器人运动。我们的框架使机器人能够动态优先考虑高兴趣区域,提高覆盖效率,为单个代理合成有效的控制策略,并在未知先验分布的设置中优化资源使用。我们通过仿真验证了我们的方法,证明了其在提高覆盖和资源分配方面的有效性。

英文摘要

In this work, we address the problem of multi-robot adaptive coverage, where teams of robots perform dynamic sampling by continuously adjusting their positions to collect data in an environment. This task can be challenging, particularly when robots must be efficiently allocated to new sampling locations over time. Ergodic search methods optimize robot trajectories by ensuring that the robots' time-averaged spatial distribution aligns with the spatial distribution of environmental information. While these methods promote effective exploration provided a target distribution, they often fail to account for unknown prior distributions of the environment. To overcome this limitation, we propose an adaptive coverage strategy that utilizes real-time feedback from an environmental model to adjust robot sampling behavior in response to unknown conditions. Our approach enhances traditional ergodic trajectory optimization by constructing a target spatial information distribution based on parametric models of the environment, which are updated online. This strategy assumes that the environment is either static or changes slowly compared to the robot's motion. Our framework allows robots to dynamically prioritize regions of high interest, improving coverage efficiency, synthesizing effective control policies for individual agents, and optimizing resource use in settings with unknown prior distributions. We validate our approach through simulations, demonstrating its effectiveness in enhancing coverage and resource allocation.

2605.21714 2026-05-22 cs.CV cs.RO

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

AVI-HT:自适应视觉-IMU融合用于3D手部跟踪

Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

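AI summary: This paper proposes AVI-HT, an adaptive vision-IMU fusion method that tracks 3D hand poses by jointly modeling egocentric images with on-glove 6-DoF IMU signals. Its core components are synchronized multi-modal training data pairing and a cross-sensor deep attention mechanism; the main contribution is improved accuracy and availability in hand-object interaction scenarios.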

详情
AI中文摘要

我们提出了AVI-HT,一种用于通过联合建模第一人称视角图像与手套上的6自由度IMU信号来跟踪3D手部姿态的自适应视觉-IMU融合方法。AVI-HT在手-物体交互(HOI)场景中,特别是在重视觉遮挡情况下,实现了显著提高的准确性和可用性。其成功基于两个互补的成分:(1)同步多模态训练数据配对身体上的视觉-IMU传感器流与运动捕捉系统的地面真实3D手部姿态;(2)一种跨传感器深度注意力机制,能够自适应地调节对视觉和单个IMU传感器的信任度。为了在真实世界中评估AVI-HT,我们在包含100000+对视觉-IMU样本的DexGloveHOI数据集中进行了广泛的实验,这些样本具有同步的3D标注姿态,用户在日常任务中操作各种物体。我们比较了多种单模态和多模态跟踪方法,基于两种手部模型(UmeTrack、MANO)。结果表明,AVI-HT在基准上将平均关键点误差减少了16.1%,其腕对齐变体减少了24.2%。消融研究进一步揭示了IMU传感器在不同活动类型中的每指贡献,以及模型对IMU噪声和视觉-IMU融合中的时间偏移的敏感性。

英文摘要

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.

2605.21713 2026-05-22 cs.CL

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

Sem-Detect:基于语义层面的AI生成同行评审检测

André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li

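AI summary: This paper proposes Sem-Detect, which combines textual features with semantic analysis to distinguish AI-generated peer reviews from human-written ones. Experiments show strong performance in both the binary and three-class settings, with markedly improved accuracy.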

详情
AI中文摘要

如何区分同行评审是人类还是AI生成?我们主张在该设置中,不应仅根据文本特征来确定作者身份,还应考虑评审所表达的观点、判断和主张。为此,我们提出了Sem-Detect,一种用于同行评审的作者身份检测方法,通过结合文本特征和基于声明的语义分析来实现这一原则。Sem-Detect通过将目标评审与同一篇论文的多个AI生成评审进行比较,利用观察到的不同AI模型倾向于收敛到相似点,而人类评审者引入更多独特和多样化的观点这一现象。因此,Sem-Detect能够区分完全由AI生成的评审与真实的由人类撰写的评审,包括那些经过LLM优化但仍反映人类判断的评审。在包含超过20,000篇同行评审的ICLR和NeurIPS会议数据集中,Sem-Detect在二分类设置中将TPR@0.1% FPR比最强基线提高了25.5%。此外,在三分类场景中,我们实证表明LLM优化保留了人类评审的语义信号,这些信号仍与完全由AI生成的文本模式区分开来;因此,少于3.5%的LLM优化后的评审被错误分类为AI生成。

英文摘要

How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
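
The TPR@0.1% FPR figure quoted in this abstract is a true positive rate measured at a threshold that keeps the false positive rate at or below a fixed budget. A minimal Python sketch of one common way to compute such a metric follows; the function name and the tie-handling convention are assumptions, and the paper's exact evaluation protocol may differ.

```python
def tpr_at_fpr(positive_scores, negative_scores, max_fpr):
    """True positive rate at the loosest threshold whose FPR does not exceed max_fpr.

    Higher scores mean "more likely AI-generated". A sample is flagged when its
    score is strictly greater than the threshold.
    """
    negatives = sorted(negative_scores, reverse=True)
    allowed_false_positives = int(max_fpr * len(negatives))
    if allowed_false_positives >= len(negatives):
        return 1.0
    # The highest-scoring negative that must stay unflagged sets the threshold.
    threshold = negatives[allowed_false_positives]
    flagged = sum(1 for score in positive_scores if score > threshold)
    return flagged / len(positive_scores)
```

At a 0.1% budget, at least 1,000 negative samples are needed before even one false positive is allowed, which is why this metric is usually reported on large evaluation sets.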

2605.21712 2026-05-22 cs.CL

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

通过生成式AI拓宽交通安全管理数据的可及性:一种基于模式的时空自然语言查询框架

Mahdi Azhdari, Eric J. Gonzales

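AI summary: This paper presents a schema-grounded natural language interface that uses a large language model to interpret user intent while preserving deterministic, reviewable execution. It targets uneven access to transportation safety data and, by integrating crash records, roadway attributes, and geospatial data, strengthens public-sector safety planning capacity.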

Comments 30 pages, 5 figures

详情
AI中文摘要

交通安全管理分析需要通过基于GIS的工作流整合事故记录、道路属性和地理空间数据,但各机构和社区利益相关者之间的访问仍然不均。技术前提导致分析工具与能够使用它们的从业者之间存在差距。地方机构、学校委员会和居民可能有安全担忧,但缺乏检索、过滤、映射和分析相关数据的能力。生成式AI提供了一种缩小这一差距的方法,但其在公共部门的使用引发了关于可靠性和可复现性的问题。本文提出了一种基于模式的自然语言接口,利用大型语言模型(LLM)解释用户意图,同时保持确定性和可审查的执行,以权威数据库为基础。用户查询被翻译成结构化的语义框架,通过基于规则的层验证,编译成一个有类型的有向无环图,用于执行PostGIS数据库。这种设计将语言解释与确定性执行分离,保持结果可复现和基于模式,同时去除访问障碍。该框架使用整合了事故记录、道路属性和地理空间层(包括学校、公交站、过街天桥和市政边界)的州级马萨诸塞州交通安全管理数据库进行评估。所有查询均成功执行;验证层纠正了29%的评估查询中的错误,反映了灵活的自然语言与严格模式要求之间的差距。结果表明,结合自然语言的可访问性与确定性执行是扩大交通安全管理数据可及性的可行方向,对公共部门规划中的可信AI具有启示。

英文摘要

Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites create a gap between analytical tools central to safety planning and the practitioners able to use them. Local agencies, school committees, and residents may have safety concerns but limited capacity to retrieve, filter, map, and analyze relevant data. Generative AI offers a way to narrow this divide, but its public-sector use raises questions about reliability, reproducibility, and governance. This paper presents a schema-grounded natural language interface for transportation safety analysis, using a large language model (LLM) to interpret user intent while preserving deterministic, reviewable execution against an authoritative database. User queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database. This bounded design separates language interpretation from deterministic execution, keeping results reproducible and schema-grounded while removing access barriers. The framework is evaluated using a statewide Massachusetts transportation safety database integrating crash records, roadway attributes, and geospatial layers including schools, bus stops, crosswalks, and municipal boundaries. All queries executed successfully; the validation layer corrects errors in 29% of evaluation queries, reflecting the gap between flexible natural language and strict schema-grounded requirements. The results suggest that combining natural language accessibility with deterministic execution is a practical direction for broadening access to transportation safety data, with implications for trustworthy AI in public-sector planning.

2605.21710 2026-05-22 cs.RO

PGDG: Physically Grounded Data Generation for Robust Bimanual Policy Learning from a Single Demonstration

PGDG: 为从单个示范中学习鲁棒双臂策略而设计的物理基础数据生成

Cunxi Dai, Haoran Chang, Aditya Nisal, Rahul Kumar, Guofei Chen, Tao Chen, Yuzhe Qin, Guanya Shi

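AI summary: This paper proposes PGDG, a physically grounded data generation framework with zero-shot curation that expands a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors, improving behavior cloning for contact-rich bimanual manipulation.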

详情
AI中文摘要

接触丰富的双臂操作中的行为克隆仍然具有挑战性,因为多样化的示范收集成本高,且即使小的扰动也可能将系统推入无恢复监督的流形外状态。我们提出PGDG,一种具有零样本校准的数据生成框架,能够在不额外人工标注的情况下,将单个示范扩展为一个包含物理上合理、成功且多样化的恢复行为的紧凑数据集。PGDG在物理基础采样器和数据集校准器之间迭代,其中校准器选择具有信息量、非冗余性和可恢复性的行为来更新采样分布,朝向未覆盖的恢复模式;而采样器则从更新后的分布中绘制出物理上合理的滚动候选,并保留成功的轨迹。为进一步提高数据质量,PGDG应用短时间域采样基于控制来重新标记所选的高风险状态并应用纠正动作。在四个双臂操作任务中,PGDG在仿真和零样本现实世界迁移中均优于仅空间增强的方法。在RotateBox-Pitch任务中,仿真中的成功率从38%提升到93%,现实世界中的成功率从35%提升到82%。PGDG还能够有效促进如GR00T等基础模型的微调,使成功率从46%提升到77%。更多结果可在我们的网站上查看:https://cunxid.github.io/PGDG/。

英文摘要

Behavior cloning for contact-rich bimanual manipulation remains challenging because diverse demonstrations are expensive to collect, and even small disturbances can push the system into off-manifold states where no recovery supervision is available. We propose PGDG, a data generation framework with zero-shot curation that expands a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors without additional human labeling. PGDG iterates between a physics-grounded sampler and a dataset curator, where the curator selects informative, non-redundant, and recoverable behaviors to update the sampling distribution toward under-covered recovery modes, and the sampler draws physically plausible rollout candidates from this updated distribution and retains successful trajectories. To further improve data quality, PGDG applies short-horizon sampling-based control to relabel selected risky states with corrective actions. Across four bimanual manipulation tasks, PGDG consistently outperforms spatial-only augmentation in both simulation and zero-shot real-world transfer. On RotateBox-Pitch, success improves from 38% to 93% in simulation and from 35% to 82% in the real world. PGDG also enables effective fine-tuning of foundation models such as GR00T, increasing success from 46% to 77%. Additional results are available on our website: https://cunxid.github.io/PGDG/.

2605.21704 2026-05-22 cs.RO cs.SY eess.SY

Motion Design for Grasp-Based Dynamic Locomotion in Microgravity

微重力环境下基于抓取的动态移动运动设计

Chaerim Moon, Joohyung Kim, Justin K. Yim

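AI summary: Addressing grasp-based dynamic locomotion for multi-limbed robotic systems in microgravity, this paper proposes a parameterizable locomotion planning framework that varies gait pattern, stride length, locomotion speed, and nominal posture and evaluates the resulting stability and actuation demand. The results show that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance.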

详情
AI中文摘要

在微重力环境中,移动通常依赖于稀疏且不规则排列的锚点,这促使了基于抓取的多肢体移动。在此设置中,动态移动只有通过有意识地调节锚定相互作用和全身协调,才能在耦合的动力学和运动学约束下实现。本文提出了针对微重力环境下多肢体机器人系统基于抓取的动态移动的设计见解,目标是需要六维肢体操作以与候选锚点建立接触的场景。研究的设计参数包括步态模式、步长、移动速度和名义姿态。提出了一种可参数化的移动规划框架,以支持这些参数的变化,并评估由此产生的移动性能,包括稳定性和驱动需求。在基于物理的仿真中采用了两种代表性四足形态进行评估。结果表明,扩大可行接触力空间并抑制脉冲全身动力学可提高移动性能。这些发现为微重力移动中多肢体系统的接触配置选择和全身协调策略提供了指导。

英文摘要

Locomotion in microgravity often relies on sparsely and irregularly arranged anchors, motivating grasp-based mobility with multiple limbs. In this setting, dynamic locomotion is feasible only through deliberate regulation of both anchored interactions and whole-body coordination under coupled dynamic and kinematic constraints. This paper presents design insights for grasp-based dynamic locomotion with multi-limbed robotic systems in microgravity, targeting scenarios that require 6D limb manipulation to establish contacts with candidate anchors. The investigated design parameters include gait pattern, stride length, locomotion speed, and nominal posture. A parameterizable locomotion planning framework is proposed to support variations of these parameters and to evaluate the resulting locomotion performance in terms of stability and actuation demand. Two representative quadruped morphologies are adopted for evaluation in physics-based simulation. The results demonstrate that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance. These findings inform strategies for contact configuration selection and whole-body coordination in microgravity locomotion with multi-limbed systems.

2605.21699 2026-05-22 cs.LG cs.CL

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

X-Token: 通过投影引导的跨分词器知识蒸馏

Sharath Turuvekere Sreenivas, Adithyakrishna Venkatesh Hanasoge, Mingyu Yang, Ali Taghibakhshi, Saurav Muralidharan, Ashwath Aithal, Pavlo Molchanov

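AI summary: This paper proposes X-Token, a projection-guided cross-tokenizer knowledge distillation method that addresses the shortcomings of existing methods when transferring knowledge between different tokenizers, improving distillation through two complementary loss functions.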

详情
AI中文摘要

跨分词器知识蒸馏允许学生模型从具有不兼容词汇表的教师模型中学习。先前工作基于隐藏状态或对数几率,后者更优,因为它不需要辅助组件。基于对数几率的方法要么只使用正确分词的概率,从而遗漏了教师分布中的全部'暗知识',要么基于完整的输出分布,依赖严格的分词划分和/或不严谨的启发式排序。我们发现完整分布、基于对数几率方法的两个关键缺点:(i) 不常见分词失败,其中关键分词落入未匹配子集(例如,在数字拆分Qwen监督下Llama的1100多数字),在训练中被抑制,导致GSM8k从12.89降至2.56,相较于使用相同分词器的KD;(ii) 过于保守的匹配,严格的一对一匹配排除了表面形式间的近等价分词。这些失败需要不同的解决办法:当关键分词对齐错误时消除划分,当对齐可靠时进行细化。我们提出X-Token,一种具有两个互补损失函数的方法,针对这些问题。P-KL通过稀疏投影矩阵W(从分词级别字符串规则初始化)消除划分,并通过将学生分布与教师分布对齐来解决不常见分词失败。H-KL保留混合形式,同时放松匹配,使每个学生分词与W下的最高排名教师映射对齐。两个目标共享W并自然扩展到多个教师。实验证明,在Llama-3.2-1B上,X-Token在Qwen3-4B教师下比当前最佳GOLD高出+3.82平均点,在Phi-4-Mini教师下高出+0.5。此外,双教师设置(Phi-4-mini + Llama-3B)在单教师蒸馏上提高了+1.3点。

英文摘要

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
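
The idea of aligning two vocabularies through a sparse projection matrix W can be illustrated with a small Python sketch. It assumes, purely for illustration, that W maps each teacher token to weighted student tokens and that the loss is a KL divergence from the projected teacher distribution to the student distribution; the direction of the projection, the token names, and the function names are assumptions and are not the paper's P-KL definition.

```python
import math


def project_distribution(teacher_probs, projection):
    """Map a teacher distribution into the student vocabulary.

    teacher_probs: dict teacher_token -> probability.
    projection: sparse W as dict teacher_token -> {student_token: weight}.
    Teacher tokens without an entry in W contribute nothing.
    """
    projected = {}
    for teacher_token, prob in teacher_probs.items():
        for student_token, weight in projection.get(teacher_token, {}).items():
            projected[student_token] = projected.get(student_token, 0.0) + prob * weight
    return projected


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for dict-valued distributions; q is floored at eps to avoid log(0)."""
    return sum(
        p_value * math.log(p_value / max(q.get(token, 0.0), eps))
        for token, p_value in p.items()
        if p_value > 0.0
    )
```

With a projection of this kind every teacher token can contribute to the loss, instead of being dropped into an unmatched subset.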

2605.21695 2026-05-22 cs.AI cs.HC

The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

人工智能使用与信息性对逻辑推理技能发展的影响力

Shang Wu, Hongyu Yao, Catarina Belem, Shuyuan Fu, Mark Steyvers, Padhraic Smyth

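AI summary: This paper studies how AI usage and informativeness affect skill development in logical reasoning. Heavy AI users perform worse; low-informativeness AI does not help learning, while high-informativeness AI improves short-run performance but with heterogeneous effects.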

Comments Accepted at Hybrid Human Artificial Intelligence (HHAI) 2026

详情
AI中文摘要

人工智能(AI)正越来越多地融入人类问题解决过程中,但其对个体技能发展的影响仍不明确。我们考察了在受控的逻辑推理任务中,有需求访问AI帮助的情况下,AI使用和信息性如何塑造学习。我们发现,更高的AI使用与更弱的技能发展相关:大量使用AI的用户相对于同等 peers 表现较差,而少量使用AI的用户则与不使用AI的匹配用户表现相似。我们还发现这些模式由AI的信息性所中介。低信息性AI既不能提高即时表现,也不能在移除AI帮助后保持表现,且与整体学习能力较弱相关。另一方面,高信息性AI在实验中被发现能提升短期表现,但平均而言不会减少AI帮助移除后的结果,但影响具有异质性。我们的发现总体表明,AI根据情境,可能通过放大独立推理来补充人类技能发展,或作为替代品削弱此类推理,这意味着在AI帮助存在的情况下,调节AI的访问和使用将对促进技能发展至关重要。

英文摘要

Artificial intelligence (AI) is being increasingly integrated into human problem-solving, yet its effects on individual skill development remain unclear. We examine how both AI usage and informativeness can shape learning in the context of a controlled logical reasoning task with on-demand access to AI assistance. We find that greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI. We also find in our study that these patterns are mediated by AI informativeness. Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall. On the other hand, high-information AI was found to improve short-run performance without reducing post-AI outcomes on average in our experiments, but with heterogeneous effects. Our findings in general suggest that AI can, depending on context, either complement human skill development by amplifying independent reasoning or can act as a substitute that undermines such reasoning, with the implication that regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.

2605.21692 2026-05-22 cs.LG stat.ML

Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

表示差距:从几何视角解释神经网络的不合理有效性

David Perera, Victor Moura, Lais Isabelle Alves dos Santos, Michel F. C. Haddad, Flavio Figueiredo

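AI summary: Taking a geometric perspective, this paper studies the Representation Gap of neural networks, a metric closely related to the generalization error, shows that it applies to a broader range of tasks and training algorithms, and verifies the theory empirically on synthetic and realistic datasets.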

详情
AI中文摘要

精确地用可以高效估计的参数来表征神经网络的渐近泛化误差是机器学习中的关键问题,这严重依赖于启发法和实践者的直觉来做出关键设计选择。为了缓解这一问题,我们引入了表示差距,这是一个与泛化误差密切相关的度量标准,但具有更好的渐近动态特性。我们专注于等变扩散模型,并利用最优量化和点过程理论的结果,推导出表示差距的精确渐近等价,并证明其由单个参数,即任务的内在维度所支配,该参数易于解释、高效估计,并可与常见神经网络架构的等变性相关联。我们展示了这种渐近动态也适用于更广泛的任务和训练算法。最后,我们通过实验证明,我们的渐近定律和内在维度估计在广泛的合成数据集上准确,这些数据集中的这些量是已知的,以及在更现实的数据集上,我们得到的结果与相关文献一致。

英文摘要

Characterizing precisely the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. In order to mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.

2605.21688 2026-05-22 cs.RO cs.SY eess.SY

Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control

闭环仿真到现实强化学习用于可变形微纤维形状控制

Alessandro Amici, Houari Bettahar, Veeti Jaakkola, Quan Zhou

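AI summary: This paper proposes a closed-loop sim-to-real reinforcement learning method for controlling the shape of deformable microfibers on a surface. Geometric shape regulation is trained in a simplified frictionless simulator, and real-time visual feedback is used during deployment to iteratively correct the effects of unmodeled surface interactions.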

Comments 7 pages, 7 figures

详情
AI中文摘要

自主基于接触的微 manipulation 是具有挑战性的,因为微尺度的表面和界面相互作用难以准确建模,限制了传统基于模型的控制和仿真到现实学习的使用。我们提出了一种闭环仿真到现实强化学习(RL)方法,用于表面上的微纤维形状控制。核心思想是在简化摩擦less 模拟器中训练几何形状调节,并在部署过程中依赖实时视觉反馈来迭代修正观测到的未建模表面相互作用效果。一个完全在仿真中训练的 RL 策略被直接转移到一个物理双夹爪微 manipulation 系统上,该系统以 40 Hz 运行,无需重新训练或领域适应。使用丝绸微纤维作为测试平台,该策略在 24 种不同的初始配置上实现了平均点状形状误差为 270 ± 80 微米。在九种样本中,覆盖三种纤维直径(50、80 和 120 微米)和三种 manipulated 长度(10 mm、15 mm 和 20 mm)的所有组合时,相同的策略在不重新训练或调整的情况下实现了亚毫米级的最终形状误差。这些结果表明,一个在简化模拟器中学习的策略可以在表面接触下实现可重复的现实世界微纤维形状调节,只要任务相关的仿真到现实不匹配效应在闭环反馈回路中仍然可观测和可纠正。

英文摘要

Autonomous contact-based micromanipulation is challenging because surface and interfacial interactions at the microscale are difficult to model accurately, limiting the use of conventional model-based control and sim-to-real learning. We present a closed-loop sim-to-real reinforcement learning (RL) approach for microfiber shape control on a surface. The central idea is to train geometric shape regulation in a simplified frictionless simulator and rely on real-time visual feedback during deployment to iteratively correct the observed effects of unmodeled surface interactions. An RL policy trained entirely in simulation is transferred directly to a physical dual-gripper micromanipulation system operating at 40 Hz, without retraining or domain adaptation. Using silk microfibers as a testbed, the policy achieves a mean point-wise shape error of 270 $\pm$ 80 $\mu$m across twenty-four diverse initial configurations. Across nine specimens covering all combinations of three fiber diameters (50, 80, and 120 $\mu$m) and three manipulated lengths (10 mm, 15 mm, and 20 mm), the same policy achieves sub-millimeter final shape error without any retraining or retuning. These results show that a policy learned in a simplified simulator can achieve repeatable real-world microfiber shape regulation under surface contact, provided that the task-relevant effects of the sim-to-real mismatch remain observable and correctable within the closed feedback loop.

2605.21686 2026-05-22 cs.RO

Distributed Multi-Coverage for Robot Swarms

机器人群的分布式多覆盖

Mariem Guitouni, Aaron T. Becker

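AI summary: This paper proposes a distributed multicoverage algorithm for robot swarms that maintains reliable coverage of critical assets under local sensing, local communication, and no global coordination, while coping with constraints such as robot failures.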

Comments Accepted at ANTS 2026 (International Conference on Swarm Intelligence), published by Springer Nature

详情
AI中文摘要

自主无人机群用于监视、环境监测和基础设施检查时,必须在机器人故障的情况下保持关键资产的可靠覆盖。这要求多覆盖:每个资产必须由多个机器人观察以实现冗余,且覆盖要求因资产的重要性而异。尽管最近的工作已通过整数规划最优地解决了集中式问题,但实际部署面临约束,需要分布式解决方案:机器人具有有限的通信范围,机载计算限制了全局规划,且部分系统故障不得导致任务中止。本文提出了一种适用于具有局部感知、局部通信和无全局协调的机器人群的分布式多覆盖算法。

英文摘要

Autonomous drone swarms deployed for surveillance, environmental monitoring, and infrastructure inspection must maintain reliable coverage of critical assets despite robot failures. This requires multicoverage: each asset must be observed by multiple robots for redundancy, with coverage requirements varying by asset importance. While recent work has solved the centralized problem optimally using integer programming, practical deployments face constraints that demand distributed solutions: robots operate with limited communication ranges, onboard computation restricts global planning, and partial system failures must not cause mission abort. We present a distributed multicoverage algorithm for robot swarms operating with local sensing, local communication, and no global coordination.
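
The multicoverage requirement described here (each asset observed by a required number of robots, with the requirement varying by asset) can be checked with a short Python sketch. It assumes a simple disc sensing model with a single shared range; the data layout and function name are hypothetical, and this is a centralized check for illustration rather than the distributed algorithm the paper presents.

```python
import math


def coverage_deficits(robot_positions, assets, sensing_range):
    """Return, per asset, how many additional observers are still needed.

    robot_positions: list of (x, y) tuples.
    assets: dict name -> ((x, y), required_observers).
    A robot observes an asset when it lies within sensing_range of it.
    """
    deficits = {}
    for name, (position, required) in assets.items():
        observers = sum(
            1 for robot in robot_positions
            if math.dist(robot, position) <= sensing_range
        )
        deficits[name] = max(0, required - observers)
    return deficits
```

An asset with a positive deficit would lose its redundancy margin if one of its current observers failed, which is the situation the abstract's redundancy requirement is meant to prevent.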

2605.21683 2026-05-22 cs.AI

Investigating Concept Alignment Using Implausible Category Members

通过不合理的类别成员探究概念对齐

Sunayana Rane, Brenden M. Lake, Thomas L. Griffiths

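AI summary: This paper probes concept boundaries by asking about implausible category members. It finds that AI models differ markedly from humans on some concepts, for example treating "words" as "vehicles" or "clothing", and examines the implications of this concept misalignment for AI safety.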

详情
AI中文摘要

开发具有人类日常概念理解能力的AI系统是朝着安全、可靠系统的重要一步。在探测概念理解时,询问合理的类别成员(例如

英文摘要

Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., "Is a car a vehicle?") is likely to recall patterns in the model's vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., "Is an olive a vehicle?") to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems' assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare these assignments to those made by human participants on the full range of within-category and cross-category assignment tasks. Our results reveal a range of concepts for which models differ in meaningful and surprising ways from humans, including treating "words" as belonging to categories like "vehicles" and "clothing," identifying several "vegetable" category members as "fruit," and assigning exemplars from non-weapon categories to the "weapons" category. We also demonstrate how these instances of concept misalignment translate into problematic downstream behavior with implications for AI safety.

2605.21680 2026-05-22 cs.RO

Flying Together: Human-Guided Immersive Shared Control for Aerial Robot Teams in Unknown Environments


Lou De Bel-Air, Luca Morando, Ruitao Chen, Keru Wang, Benjamin Jarvis, Charbel Toumieh, Yang Zhou, Ken Perlin, Dario Floreano, Giuseppe Loianno

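AI summary: This paper proposes a virtual-reality-based shared control framework for operating drone teams in constrained and unknown environments, using real-time user-guided exploration to improve autonomous navigation in unstructured environments. Its core is a user-guided motion-primitive-based planner coupled with an admittance controller, which lets the operator flexibly influence team behavior and guide drones toward regions of interest that autonomous planners may overlook.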

Comments Accepted at IEEE International Conference on Robotics and Automation, Vienna 2026

详情
AI中文摘要

尽管自主多机器人能够实现安全协调的导航,但它们往往难以适应突发状况并捕捉操作员驱动的目标。本文提出了一种基于虚拟现实(VR)的共享控制框架,用于在约束和未知环境中操作无人机团队,实现实时用户引导探索。我们的方法核心是一种新颖的基于用户引导的运动原语规划器,能够计算连续的碰撞免费轨迹,同时持续整合操作员输入。该规划器与阻抗控制器相结合,使操作员能够灵活影响团队行为并引导无人机前往感兴趣区域。系统支持混合现实操作,包括物理和模拟无人机,并实现双方面VR接口,使操作员通过迁移点引导机器人团队,同时接收即时的团队状态视觉反馈。实验结果表明,共享控制提高了障碍物避障能力,保持了机器人间的间距,并减少了操作员的负担,展示了沉浸式、人机协作多机器人导航的可行性和优势。

英文摘要

While autonomous multi-robots can achieve safe and coordinated navigation, they often struggle to adapt to unforeseen conditions and to capture operator-driven objectives in unstructured environments. We present a Virtual Reality (VR)-based shared control framework for teams of drones operating in constrained and unknown environments, enabling real-time, user-guided exploration. At the core of our approach is a novel, user-guided motion-primitive-based planner that computes continuous, collision-free trajectories while continuously integrating operator input. This planner is coupled with an admittance controller, allowing the operator to flexibly influence team behavior and guide drones toward regions of interest that autonomous planners may overlook. The system supports mixed-reality operations with both physical and simulated drones, and implements a bilateral VR-based interface, allowing the operator to guide the robot team via migration points while receiving immediate visual feedback of the team state. Experimental results show that shared control improves obstacle avoidance, maintains inter-agent spacing, and reduces operator effort, demonstrating the feasibility and advantages of immersive, human-in-the-loop multi-robot navigation.

2605.21669 2026-05-22 cs.CV cs.AI

MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim

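AI Summary This study proposes MRecover, a conditional generative model that uses AI-generated contrast to recover motion-corrupted MRI images. Autoregressive slice conditioning provides volumetric consistency, and the model generalizes to out-of-domain 3T data and increases the number of subjects that can be analyzed for hippocampal subfield segmentation.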

Details
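AI-Translated Abstract (from Chinese)

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, but this sequence is susceptible to motion artifacts, which leads to data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from synthesized and as-acquired images closely matched (n=416, r=0.87-0.97), and in the motion-affected ADNI3 dataset the number of analyzable subjects after quality control increased by 31.8% (593 vs 450). The synthesized images also yielded larger effect sizes for diagnostic group differences, owing to the increased sample size (whole hippocampus ε²=0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/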

English Abstract

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from the synthesized and the as-acquired images closely matched (n=416, r=0.87-0.97), and the model yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Because of the increased sample size, the synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus $ε^2$ = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/
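
The abstract mentions autoregressive slice conditioning without describing the architecture. The sketch below only illustrates the general control flow such a scheme implies: each synthesized slice is produced from the matching input slice together with the previously synthesized slice, so neighbouring slices stay consistent. The names `synthesize_volume` and `toy_generator` are hypothetical, and the toy blending rule is an assumption that stands in for the trained generative model.

```python
from typing import Callable, List, Optional

Slice = List[float]  # a flattened 2D slice; real slices would be image arrays
Generator = Callable[[Slice, Optional[Slice]], Slice]


def synthesize_volume(t1_slices: List[Slice], generator: Generator) -> List[Slice]:
    """Generate output slices one at a time, feeding each result forward."""
    outputs: List[Slice] = []
    previous: Optional[Slice] = None  # no context exists for the first slice
    for t1 in t1_slices:
        out = generator(t1, previous)  # condition on input slice and prior output
        outputs.append(out)
        previous = out
    return outputs


def toy_generator(t1: Slice, previous: Optional[Slice]) -> Slice:
    """Stand-in for a trained model (assumption): blend input with prior output."""
    if previous is None:
        return list(t1)
    return [0.5 * a + 0.5 * b for a, b in zip(t1, previous)]


volume = synthesize_volume([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]], toy_generator)
```

In this toy run the third output slice depends on the second, which in turn depends on the first, which is the property that ties slices together along the volume axis.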