arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.19031 2026-06-19 cs.RO New submission

Congestion-Aware Robot Tour Planning in Crowded Environments

Stefano Bernagozzi, Charlie Street, Masoumeh Mansouri, Lorenzo Natale

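Affiliations: Istituto Italiano di Tecnologia; Università di Genova; University of Birmingham

AI summary: Proposes a probabilistic tour planner that learns a model for predicting human flow and builds a Markov decision process online, routing the robot efficiently through crowded environments and reducing the impact of congestion.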

Comments Accepted to IEEE IROS 2026

Journal ref
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2026
Abstract

Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.
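To make the planning idea concrete, the following is a minimal sketch of Bellman backups over a tiny navigation graph whose edges carry a congestion probability. The graph, traversal times, and probabilities are invented for illustration, and the names `EDGES`, `expected_time`, and `value_iteration` are hypothetical. The paper's MDP is built online from CLiFF-map predictions and covers a whole tour, so it is considerably richer than this single-goal, expected-time example.

```python
# Hypothetical graph (assumption): edge -> (free_time_s, congested_time_s, p_congested).
EDGES = {
    ("A", "B"): (10.0, 40.0, 0.8),  # short corridor, usually crowded
    ("A", "C"): (15.0, 20.0, 0.1),  # longer detour, usually clear
    ("B", "G"): (10.0, 10.0, 0.0),
    ("C", "G"): (12.0, 12.0, 0.0),
}


def expected_time(edge):
    """Expected traversal time of an edge under its congestion probability."""
    free, congested, p = EDGES[edge]
    return (1.0 - p) * free + p * congested


def value_iteration(goal, sweeps=50):
    """Repeated Bellman backups: cost-to-go is the minimum expected time to the goal."""
    nodes = {n for edge in EDGES for n in edge}
    cost = {n: (0.0 if n == goal else float("inf")) for n in nodes}
    policy = {}
    for _ in range(sweeps):
        for (src, dst) in EDGES:
            candidate = expected_time((src, dst)) + cost[dst]
            if candidate < cost[src]:
                cost[src] = candidate
                policy[src] = dst
    return cost, policy


cost, policy = value_iteration("G")
print(policy["A"], cost["A"])
```

With these made-up numbers the planner prefers the detour through C, because the nominally shorter corridor through B is expected to be congested most of the time. Re-planning as new people are observed would amount to updating the probabilities and running the backups again.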

2606.18970 2026-06-19 cs.LG cs.AI cs.CV New submission

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Syed Mujtaba Haider, Silvia Figini

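Affiliations: Department of Mathematics; Department of Political and Social Sciences

AI summary: Using a controlled benchmark, compares quantum and classical generators for brain-MRI data augmentation and finds that neither significantly outperforms training on real data alone, and that the quantum generator offers no additional advantage.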

Abstract

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
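The evaluation protocol (eight seeds, paired testing, multiple-comparison correction) can be illustrated with a short sketch. The abstract does not say which paired test or which correction the authors use, so the exact sign-flip permutation test and the Holm step-down procedure below are assumptions chosen because they need only the standard library. The function names are hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    With eight seeds there are only 2**8 = 256 sign assignments, so the
    null distribution can be enumerated exactly.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float noise
            hits += 1
    return hits / total


def holm_reject(pvalues, alpha=0.05):
    """Holm step-down correction: one reject flag per hypothesis, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis survives, all larger p-values survive too
    return reject
```

Here `a` and `b` would hold per-seed accuracies of two training variants at one labeled-data fraction, and `holm_reject` would be applied across the p-values from all fractions and variant pairs being compared.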

2606.18960 2026-06-19 cs.CV cs.RO New submission

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

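Affiliations: Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view-centered surfel-indexed memory, to address scene forgetting caused by occlusion and camera motion during manipulation, enabling persistent world modeling and improving both policy evaluation and policy improvement.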

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.

2606.18950 2026-06-19 cs.AI New submission

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

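Affiliations: Seoul National University

AI summary: Proposes RTSGameBench, built on the game Beyond All Reason, which evaluates the strategic reasoning of vision-language models in real-time strategy games through diverse matchups, diagnostic mini-games, and a self-evolving generation framework.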

Comments First two authors contributed equally

Abstract

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain confined to pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini-games that each target an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units via a finite-state machine (FSM) with agentic memory. We empirically show that multiple state-of-the-art VLMs perform poorly when matchups demand tighter coordination, when multi-agent coordination is required, and when task scale increases.

2606.18933 2026-06-19 cs.LG cs.IR stat.ME New submission

Zero-Shot Active Feature Acquisition via LLM-Elicitation

Binyamin Perets, Natalie Mendelson, Shiran Vainberg, Yehuda Chowers, Shai Shen-Orr, Shie Mannor

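Affiliations: Faculty of EE, Technion; Faculty of Medicine, Technion; CytoReason; NVIDIA

AI summary: Proposes a zero-shot active feature acquisition framework that elicits the sufficient statistics of a Markov random field from an LLM, addressing the shortage of labeled data, and outperforms existing methods on IBD patient diagnosis.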

Abstract

Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amounts of labeled data to fit the probabilistic models that guide acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, that is, what distinguishes the classes rather than what describes each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.

2606.18812 2026-06-19 cs.LG cs.AI New submission

Reinforcement Learning Foundation Models Should Already Be A Thing

Abdelrahman Zighem, Jill-Jênn Vie

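Affiliations: École normale supérieure de Paris, PSL University, Paris, France; Soda team, Inria Saclay, Palaiseau, France

AI summary: Proposes building reinforcement learning foundation models from synthetic MDPs, using a fixed-size sufficient statistic that makes attention-based architectures applicable; the proof-of-concept model outperforms classical algorithms online and is competitive offline.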

Abstract

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.
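The claim that a tabular MDP admits a fixed-size sufficient statistic can be made concrete with a small sketch. The abstract does not spell out the statistic the authors feed to their model, so the transition counts and reward sums below are an assumption: they are the standard summary for tabular MDPs, and the function name `sufficient_statistic` is hypothetical.

```python
def sufficient_statistic(transitions, n_states, n_actions):
    """Summarise any number of (s, a, r, s_next) tuples in fixed-size tables.

    counts[s][a][s_next] holds visit counts and reward_sums[s][a] holds summed
    rewards. Both shapes depend only on n_states and n_actions, never on how
    many episodes were observed.
    """
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    reward_sums = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in transitions:
        counts[s][a][s_next] += 1
        reward_sums[s][a] += r
    return counts, reward_sums


# Illustrative data (assumption): two transitions in a 3-state, 2-action MDP.
demo = [(0, 1, 1.0, 2), (2, 0, 0.0, 0)]
small_counts, _ = sufficient_statistic(demo, n_states=3, n_actions=2)
large_counts, _ = sufficient_statistic(demo * 100, n_states=3, n_actions=2)
```

Both calls return tables of identical shape; only the entries grow with more data. That is what makes a table-shaped input to an attention-based model plausible here, although the paper's actual featurisation may differ.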

2606.18613 2026-06-19 cs.CL cs.AI New submission

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

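Affiliations: Aalto University; Tencent; Harbin Institute of Technology, Shenzhen; Hong Kong Polytechnic University; Aarhus University; Technical University of Munich

AI summary: Proposes the PhysAssistBench benchmark, which builds interactive patient agents to evaluate how well LLMs coordinate doctor-patient-EHR interactions, and finds that current models are unreliable, with the bottleneck being coordination across several dimensions rather than any single capability.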

Comments 34 pages with 8 figures

Abstract

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

2606.18611 2026-06-19 cs.SD cs.AI cs.LG stat.ML New submission

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

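Affiliations: The Asahi Shimbun Company; Tokyo Woman's Christian University

AI summary: Proposes the parameter-efficient QC-GAN, which combines a quaternion Conformer generator with MetricGAN training and shares weights through the Hamilton product to reduce the parameter count; on VoiceBank+DEMAND it reaches a PESQ of 3.48 with 0.89M parameters, comparable to models more than twice its size.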

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

Abstract

We propose a parameter-efficient speech enhancement framework, Quaternion Conformer GAN (QC-GAN), which combines a Quaternion Conformer generator with MetricGAN-based training. The Hamilton product encodes the magnitude and phase via structured weight sharing, reducing the number of layer parameters while preserving their interdependencies. A metric-learning discriminator was employed to maximize perceptual quality by optimizing the approximate perceptual evaluation scores. On the VoiceBank+DEMAND dataset, QC-GAN achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 3.48 with only 0.89M parameters, delivering a performance comparable to state-of-the-art models at less than half their size. A 35K-parameter variant achieved a PESQ score of 3.23, surpassing conventional methods with significantly fewer parameters. Evaluation on the DNS-Challenge 3 dataset further confirmed generalization to real-world conditions.
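The weight sharing attributed to the Hamilton product can be seen in a few lines. The sketch below is a generic illustration of quaternion multiplication, not the QC-GAN layer itself: how the paper maps magnitude and phase onto quaternion components is not stated in the abstract, and the function names are hypothetical.

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def as_real_matrix(w):
    """4x4 real matrix equivalent to left-multiplication by quaternion w.

    All 16 entries are built from only 4 free parameters, which is the
    structured weight sharing that shrinks a quaternion layer.
    """
    a, b, c, d = w
    return [
        [a, -b, -c, -d],
        [b, a, -d, c],
        [c, d, a, -b],
        [d, -c, b, a],
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]
```

A real-valued layer mapping 4 inputs to 4 outputs would need 16 independent weights, whereas the quaternion weight needs 4, so each quaternion-valued connection uses a quarter of the parameters. The overall saving in a full model depends on which layers are quaternion-valued.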

2606.18485 2026-06-19 cs.SD cs.AI eess.AS New submission

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

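Affiliations: NVIDIA Corporation

AI summary: Proposes MagpieTTS-LF, an inference-time method that achieves coherent long-form speech generation without retraining the model, using soft attention priors, stateful inference, and history-aware text encoding.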

Journal ref
Interspeech 2026
Abstract

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

2606.18413 2026-06-19 cs.AI cs.HC New submission

Searching for Synergy in Shared Workspace Human-AI Collaboration

Nachiket Kotalwar, Rohini Das, Carolyn Rose

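Affiliations: Carnegie Mellon University

AI summary: Studies human-AI team collaboration in shared workspaces; experiments in the Collaborative Gym environment show that adding collaborators can lower performance when coordination structure is missing, whereas scaffolding that combines shared memory with simulated human-in-the-loop gates improves team performance.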

Comments Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

Abstract

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

2606.18249 2026-06-19 cs.CV New submission

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

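Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

AI summary: Proposes the UniAR framework, which bridges visual understanding and generation with a single discrete visual tokenizer and uses parallel bitwise prediction and diffusion decoding, achieving state-of-the-art image generation and editing while remaining competitive on multimodal understanding.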

Comments ICML2026. Project page https://sharelab-sii.github.io/uniar-web

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
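A lookup-free bitwise quantizer can be sketched briefly to show why the effective vocabulary scales cheaply. The abstract does not describe UniAR's quantizer in detail, so the sign-based rule below is an assumption modeled on the general idea of lookup-free quantization, and `bitwise_quantize` is a hypothetical name.

```python
def bitwise_quantize(features):
    """Quantise each feature dimension to one bit by its sign.

    Returns the bit list, the quantised code in {-1.0, +1.0}, and the integer
    token index obtained by reading the bits as a binary number. No codebook
    lookup is involved: a d-dimensional feature addresses 2**d tokens.
    """
    bits = [1 if x > 0 else 0 for x in features]
    code = [1.0 if b else -1.0 for b in bits]
    index = 0
    for b in bits:
        index = (index << 1) | b  # most significant bit first
    return bits, code, index


bits, code, index = bitwise_quantize([0.3, -1.2, 0.0, 2.5])
vocabulary_size = 2 ** len(bits)
```

Because each added dimension doubles the implicit vocabulary without adding codebook entries, predicting the bits in parallel rather than a single index from a huge softmax is a natural fit. Whether UniAR groups or predicts bits in this way is not something the abstract confirms.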

2606.18191 2026-06-19 cs.AI cs.MA New submission

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

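Affiliations: ServiceNow AI Research

AI summary: Proposes the DRFLOW benchmark, which evaluates how well AI agents predict personalized workflows from heterogeneous sources; it comprises 100 tasks across 5 domains and 7 diagnostic metrics, and experiments show that existing agents still perform poorly.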

Abstract

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify concrete workflows, i.e., sequences of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?". Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% in average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.

2606.18112 2026-06-19 cs.RO cs.CV New submission

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

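Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotNav, a scalable navigation model whose parameterized interface supports multiple task modes and adjustable observation parameters; trained on 15.6M samples, with vision-language co-training to prevent behavioral collapse, it sets new state-of-the-art results on several navigation benchmarks and shows zero-shot generalization.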

Abstract

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses this requirement through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration and requires no architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

2606.17979 2026-06-19 cs.AI New submission

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

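AI summary: To address the granularity mismatch between rewards and generation trajectories in text-to-image generation, proposes STAR, which uses text-image attention to build spatiotemporally adaptive allocation maps and applies stronger policy updates to the relevant latent regions, improving semantic alignment and text rendering.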

Abstract

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving $\mathbf{0.9759}$, $\mathbf{0.9757}$, and $\mathbf{23.60}$ on GenEval, OCR, and PickScore, respectively.
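The contrast between a single scalar advantage and a spatially resolved one can be sketched as follows. This is only an illustration of the general idea: the abstract does not give STAR's allocation rule, so normalising the attention map to mean weight 1 is an assumption, as are the function names `group_relative_advantages` and `allocate`.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within a group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def allocate(advantage, attention):
    """Spread one rollout's scalar advantage over latent regions.

    Weights are proportional to the text-image attention of each region and
    are scaled to average 1, so the total update strength matches the uniform
    baseline while regions with more attention receive a larger share.
    """
    n = len(attention)
    total = sum(attention)
    if total == 0:
        return [advantage] * n  # fall back to the uniform scalar advantage
    return [advantage * n * a / total for a in attention]
```

In the method described above, such a map would differ across denoising steps and rollouts, whereas this sketch handles a single map for a single step.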

2606.17886 2026-06-19 cs.LG New submission

Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias

Mikhail Krasnov, Blaž Bertalanič, Carolina Fortuna

发表机构 * Jozef Stefan Institute(约瑟夫·斯特凡研究所)

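AI summary Proposes MKAN, a KAN that enforces hard monotonicity through exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. A representation-cost theorem shows that a $C^K$ feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of size at most $2N^*$. Experiments show MKAN is competitive with state-of-the-art monotone networks while keeping KAN's per-edge functional transparency.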

详情
AI中文摘要

单调性一直是神经网络长期使用的架构归纳偏置,其动机来源于表格、科学和经济场景,其中输出已知对某些输入呈单调响应。现有方法基于MLP或流模型,缺乏逐边功能透明性;唯一具有单调性的KAN变体MonoKAN仅在受限参数子集上施加约束,并需要投影式训练过程。我们通过\textbf{MKAN}填补了这一空白,MKAN是一种KAN,通过B样条系数的指数重参数化、正边权和单调基激活,对所有参数值保证硬单调性。训练简化为标准的无约束梯度下降。我们的主要理论贡献是一个\textbf{表示代价}定理:任何诱导球状语义邻域划分的$C^K, K >0$特征提取器,都可以在$N' = N^* + k \le 2N^*$处实现等价邻域结构的单调实现,其中$k$是原始非单调坐标的数量。该界限与架构无关,并为单调编码器提供了原则性的规模确定规则。实验上,MKAN在SMM/ICML-2024基准上与最先进的单调神经网络竞争,同时是唯一结合了硬无约束单调性和KAN逐边功能透明性的方法;在四个真实数据集上的自监督特征规模扫描中验证了$2N^*$预测,在受控单调生成数据集上,MKAN以显著高于KAN、MLP和线性基线的Spearman对齐恢复了真实因子。

英文摘要

Monotonicity has been a long-running architectural inductive bias for neural networks, motivated by tabular, scientific, and economic settings where outputs are known to respond monotonically to certain inputs. Existing approaches are MLP- or flow-based and lack per-edge functional transparency; the only Kolmogorov--Arnold Network (KAN) variant with monotonicity, MonoKAN, enforces the constraint only on a restricted parameter subset and requires a projection-style training procedure. We close this gap with \textbf{MKAN}, a KAN with hard monotonicity guaranteed for \emph{all} parameter values via exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. Training reduces to standard unconstrained gradient descent. Our headline theoretical contribution is a \emph{representation-cost} theorem: any $C^K, K >0$ feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of the equivalent neighborhood structure at $N' = N^* + k \le 2N^*$, where $k$ is the number of non-monotone coordinates of the original. The bound is architecture-agnostic and gives a principled sizing rule for monotone encoders. Empirically, MKAN is competitive with state-of-the-art monotone NNs on the SMM/ICML-2024 benchmark while being the only method that combines hard unconstrained monotonicity with KAN's per-edge functional transparency; the $2N^*$ prediction is validated in a self-supervised feature-size sweep on four real datasets, and on a controlled monotone-generative dataset MKAN recovers ground-truth factors with substantially higher Spearman alignment than KAN, MLP, and linear baselines.
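
The exponential reparameterization described above can be sketched for a single edge function. This is a simplified illustration under stated assumptions: it uses a degree-1 B-spline (piecewise-linear interpolation) and covers only the coefficient part, not the positive edge weights or the monotone base activation. The function names are hypothetical, not MKAN's API.

```python
import math
from bisect import bisect_right


def monotone_coefficients(c0, theta):
    """c[0] = c0 and c[i+1] = c[i] + exp(theta[i]).

    Every increment is positive, so the coefficients are strictly increasing
    for any real-valued theta; no constraint or projection is needed.
    """
    coeffs = [c0]
    for t in theta:
        coeffs.append(coeffs[-1] + math.exp(t))
    return coeffs


def edge_function(x, knots, coeffs):
    """Degree-1 B-spline through (knots[i], coeffs[i]), clamped outside the grid."""
    if x <= knots[0]:
        return coeffs[0]
    if x >= knots[-1]:
        return coeffs[-1]
    i = bisect_right(knots, x) - 1
    frac = (x - knots[i]) / (knots[i + 1] - knots[i])
    return coeffs[i] + frac * (coeffs[i + 1] - coeffs[i])


knots = [0.0, 1.0, 2.0, 3.0]
coeffs = monotone_coefficients(-1.0, [-5.0, 0.0, 3.0])
values = [edge_function(0.1 * k, knots, coeffs) for k in range(31)]
```

Because the unconstrained parameters `theta` can take any value, plain gradient descent on them cannot leave the monotone family.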

2606.17832 2026-06-19 cs.LG 新提交

From Drift to Coherence: Stabilizing Beliefs in LLMs

从漂移到一致:稳定LLM中的信念

SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee

发表机构 * Department of Statistics, Seoul National University Korea Advanced Institute of Science & Technology Department of AI, Kookmin University University of Hong Kong

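AI summary Studies belief drift of LLMs in multiple-choice question answering. Introduces prompted predictive resampling (PPR), finds that the belief process self-stabilizes and converges to a coherent predictive distribution, and proposes a seed-answer prompting strategy and a self-consistency loss to speed up stabilization and improve predictive coherence.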

详情
AI中文摘要

大型语言模型(LLM)常被假设执行隐式贝叶斯推理,然而一个关键的一致性条件——预测信念的鞅性质——已被证明在受控的合成上下文学习设置中失效。我们在更典型的使用场景中重新审视这个问题:通用多项选择问答。利用离散答案空间,我们计算精确的预测分布,并研究由自回归答案重采样引起的信念动态。我们引入了提示式预测重采样(PPR),其中LLM对同一问题生成一系列答案。实验表明,PPR揭示了早期阶段的信念漂移,表明鞅性质被违反。然而,在足够的重采样步骤后,信念过程自稳定并收敛到一个一致的预测分布。基于这一观察,我们进一步提出了(i)种子答案提示策略以加速稳定,以及(ii)自一致性损失,通过微调将早期漂移摊销到模型中。在多项选择问答基准上的实验表明,我们的方法在不牺牲准确性的情况下显著减少了信念漂移并提高了预测一致性。

英文摘要

Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.
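
The martingale condition mentioned above says that the expected next predictive distribution, averaged over the next answer drawn from the current predictive, equals the current one. The sketch below checks this exactly for a reference Bayesian learner (a Dirichlet-multinomial model over answer options). It does not use an LLM; to test a model, `predictive` would be replaced by the model's exact answer distribution given the previously resampled answers, which is an assumption about how the measurement would be set up.

```python
def predictive(counts, prior=1.0):
    """Dirichlet-multinomial predictive over answer options given answer counts."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def expected_next_predictive(counts, prior=1.0):
    """Average the one-step-ahead predictive over the next resampled answer."""
    current = predictive(counts, prior)
    expected = [0.0] * len(counts)
    for k, p_k in enumerate(current):
        updated = list(counts)
        updated[k] += 1  # pretend answer k was resampled
        for j, q in enumerate(predictive(updated, prior)):
            expected[j] += p_k * q
    return expected


def martingale_gap(counts, prior=1.0):
    """Largest absolute violation of E[p_{n+1}] = p_n (zero for a coherent learner)."""
    current = predictive(counts, prior)
    expected = expected_next_predictive(counts, prior)
    return max(abs(a - b) for a, b in zip(current, expected))


gap = martingale_gap([3, 1, 0, 0])
```

A nonzero gap at early resampling steps is the kind of drift the abstract describes.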

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG 新提交

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

发表机构 * New York University(纽约大学) Tsinghua University(清华大学) University of Michigan(密歇根大学)

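AI summary Presents HUG, a flow-matching model trained on human grasp data (the 1M-HUGs dataset) that generates diverse grasps from a single RGB-D image and retargets them to robot hands for zero-shot grasping, outperforming state-of-the-art baselines by +23% and +34% on the 30-object HUG-Bench test set.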

Comments 28 pages, 20 figures, 7 tables

详情
AI中文摘要

人类可以轻松抓取物体,而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类,他们每天拿起数千个物体。我们提出HUG,一个流匹配模型,能够为任何用户指定的物体(从立体相机捕获的单张RGB-D图像中)生成多样化的人类抓取。使用智能眼镜,我们首先收集了1M-HUGs,一个自我中心的人类抓取数据集,涵盖100万帧(27.8小时)和41栋建筑中的6,707个物体实例。接下来,为了建模自然人类抓取的分布,我们的新型流匹配模型融合RGB和深度观测,输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手,实现在日常场景中的零样本抓取。为了标准化评估,我们构建了一个新的模拟基准HUG-Bench,包含来自五个几何类别和不同尺寸的90个未见物体,并带有公制尺度的3D网格。我们在真实世界中评估HUG,使用HUG-Bench的30个物体测试集,跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布:https://grasping.io/

英文摘要

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

2606.17041 2026-06-19 cs.CL cs.IR 新提交

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

对Nature Portfolio元分析文章进行LLM代理基准测试

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

发表机构 * Tsinghua University(清华大学)

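AI summary Introduces MetaSyn, a dataset of 442 expert-curated meta-analyses for evaluating LLM agents across the full retrieval-screening-synthesis pipeline, and finds a critical bottleneck at the screening stage in current systems.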

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

详情
AI中文摘要

元分析是一种要求高的证据综合形式,结合了文献检索、PI/ECO指导的研究选择和统计聚合。其结构化、可验证的工作流程使其成为评估系统科学推理的理想基础,然而现有基准缺乏完整的检索-筛选-综合流程的真相。我们引入了MetaSyn,一个包含来自Nature Portfolio期刊的442篇专家策划的元分析的数据集。每个条目将研究问题与PI/ECO标准、包含140k篇PubMed文章的检索语料库、经过验证的阳性研究、主题相似但不符合PI/ECO的硬负样本以及完整的搜索策略和日期范围配对。对十二种流水线配置(九种RAG变体和一种协议驱动的代理)进行基准测试揭示了关键的筛选瓶颈:尽管在K=200时检索上限达到90.9%的召回率,但没有任何系统能恢复超过52.7%的真相包含文献。当前的LLM无法可靠地将合格研究与主题相关性相当的PI/ECO不合格干扰项区分开来。阶段归因指标捕捉了系统成功和失败的地方;单一的端到端分数则不能。

英文摘要

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
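
The difference between a stage-attributed metric and a single end-to-end score can be made concrete with a small sketch. The function and key names are hypothetical, and the metric definitions (recall at K for retrieval, end-to-end recall after screening, and their ratio) are assumed for illustration rather than taken from MetaSyn's evaluation code.

```python
def recall(found, relevant):
    """Fraction of relevant items that appear in `found`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(found)) / len(relevant)


def stage_attributed_recall(ranked_ids, screened_ids, included_ids, k=200):
    """Split end-to-end recall into a retrieval part and a screening part.

    ranked_ids:   retrieval output, best first
    screened_ids: studies the screening stage kept
    included_ids: ground-truth included studies of the meta-analysis
    """
    retrieved = ranked_ids[:k]
    retrieval_recall = recall(retrieved, included_ids)
    kept = set(screened_ids) & set(retrieved)
    end_to_end = recall(kept, included_ids)
    # Share of the retrievable ground truth that survives screening.
    retention = end_to_end / retrieval_recall if retrieval_recall else 0.0
    return {
        "retrieval_recall": retrieval_recall,
        "end_to_end_recall": end_to_end,
        "screening_retention": retention,
    }


report = stage_attributed_recall(
    ranked_ids=["a", "b", "c", "d", "e"],
    screened_ids=["a", "d"],
    included_ids=["a", "c", "z"],
    k=4,
)
```

In this toy case retrieval finds two of three included studies but screening keeps only one, so the loss is attributed to screening rather than hidden inside one aggregate number.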

2606.16780 2026-06-19 cs.RO 新提交

DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps

DIFF-IPPO:基于扩散的开放词汇信念地图信息路径规划

Sausar Karaf, Oleg Sautenkov, Mikhail Martynov, Dzmitry Tsetserukou

发表机构 * Intelligent Space Robotics Laboratory, CDE, Skoltech(智能空间机器人实验室,CDE,斯科尔科沃科学技术研究院)

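AI summary Proposes DIFF-IPPO, a pipeline that combines an open-vocabulary belief map generator with a diffusion-based planner to generate global trajectories over non-Gaussian belief maps for object search, reaching normalized detection scores between 81.49% and 86.55%.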

详情
AI中文摘要

探索和物体搜索要求机器人感知环境、识别感兴趣区域,并规划提高目标检测可能性或最大化信息增益的轨迹。许多IPP方法,特别是在连续环境监测中,依赖于高斯过程信念模型,而物体搜索场景通常从语义或开放词汇感知中产生复杂的多模态信念地图。直接基于这种非高斯信念地图的全局轨迹生成仍然相对未被充分探索。尽管基于扩散的规划器为此类分布建模提供了强大能力,但它们在信息路径规划中的应用仍然有限。在这项工作中,我们提出了DIFF-IPPO,一个集成了开放词汇信念地图生成器和基于扩散的规划器的流水线,用于在信念地图上生成全局轨迹。该方法生成的轨迹将传感器覆盖集中在高信念区域,在不同数据集场景下实现了81.49%至86.55%的归一化检测得分。我们在一个模拟的搜索与救援场景中验证了该系统,其中规划器搜索候选建筑区域以定位燃烧的建筑。在此设置中,一个由五架无人机组成的团队使用批处理信念地图条件轨迹生成,在3.5分钟内实现了首次检测。

英文摘要

Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.

2606.16682 2026-06-19 cs.LG cs.CL 新提交

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

多模态评估者偏好坍缩:自进化智能体中的跨模态传染

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

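AI summary Studies how evaluator preference collapse is amplified in multimodal evaluation settings, finds that cross-modal contagion corrupts strategy selection, and introduces a contagion matrix indexed by evaluator identity.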

Comments 19 pages, 0 figures

详情
AI中文摘要

当AI智能体使用语言模型在反馈循环中评估自身输出时,会出现系统性偏差。我们表明,评估者偏好坍缩(EPC)在多模态设置中被显著放大。使用GPT-4o评估DeepSeek-chat在文本和视觉任务上的表现,我们发现单一策略(step_by_step)吸收了48.4%的权重——是纯文本自评估中坍缩的3.2倍——而三个视觉域策略合计仅获得9.1%的权重。然后,我们展示了一种称为跨模态传染的新现象:在一个模态上获得的评估者偏好会迁移到另一个模态并破坏其策略选择。通过一个四阶段隔离训练范式,我们测量了传染系数并记录了策略反转——一个模态的最优策略在跨模态暴露后发生逆转。跨四种评估者配置(总计53次独立重复,15,592次API调用)的第3阶段统计验证揭示了一个清晰的层次结构:跨模型评估(GPT-4o,N=8)产生强但对称的双向传染(平均gamma_{T->V}=1.176,gamma_{V->T}=1.089,Delta=-0.088,p=0.575,Cohen's d=0.29);高轮次(DashScope,50轮)导致坍缩为单一策略主导(70%零传染);而自评估提供近乎完全的免疫——97%的运行(N=30,DeepSeek-chat)产生恰好为零的传染(平均gamma=0.033,95% CI [-0.031, 0.010],p=0.642,d=0.07)。没有评估者条件显示出统计显著的方向不对称性。我们引入了由评估者身份索引的传染矩阵,发布了MM-EPC实验框架,并将跨模型评估者架构确定为偏好传染的主要风险因素。

英文摘要

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong contagion (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero contagion (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.
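
The abstract reports contagion as a Jensen-Shannon divergence (JSD) between strategy-weight distributions. The sketch below computes the standard JSD with base-2 logarithms, so values lie in [0, 1]. The logarithm base and the example weight vectors are assumptions, since the abstract specifies neither.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two probability vectors of equal length."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical strategy weights before and after cross-modal exposure.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.55, 0.25, 0.15, 0.05]
shift = jensen_shannon_divergence(before, after)
```

Identical distributions give 0 and distributions with disjoint support give 1, which puts reported values such as 0.003 or 0.19 to 0.34 on a fixed scale under this base.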

2606.16615 2026-06-19 cs.CV 新提交

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL:面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University(上海海事大学数字图像与智能计算实验室) Department of Language Science and Technology, The Hong Kong Polytechnic University(香港理工大学语言科学与技术系) Affiliated Lianyungang Hospital of Xuzhou Medical University(徐州医科大学附属连云港医院)

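AI summary Proposes SUP-MCRL, a framework that combines a semantic-entity aware visual encoder (SAVE), a unified EEG enhancer (UEE), and a prototype-based progressive augmenter (PPA) to address semantic consistency and inter-subject variability in multimodal contrastive learning, reaching 66.0%/91.9% Top-1/Top-5 intra-subject zero-shot accuracy on THINGS-EEG.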

详情
AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时,神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐,忽略了语义一致性和主体选择性,导致虚假的零样本对齐。我们提出SUP-MCRL,一个统一框架,集成了三种协作机制:(1) 语义实体感知视觉编码器(SAVE),学习空间注意力以提取语义内容,无需预训练的显著性模型;(2) 统一EEG增强器(UEE),采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性;(3) 基于原型的渐进增强器(PPA),维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%(Top-1/Top-5)的个体内准确率和24.0%/52.9%的LOSO准确率,超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on the THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.
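
The EMA-updated pseudo-feature pool mentioned for the PPA component can be sketched as follows. The abstract does not describe how the pool is keyed or initialized, so indexing prototypes by class label, the `momentum` value, and the class name are all illustrative assumptions.

```python
class PrototypePool:
    """Keeps one exponentially averaged feature vector per key."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.prototypes = {}

    def update(self, label, feature):
        """Blend a new feature into the stored prototype for `label`."""
        old = self.prototypes.get(label)
        if old is None:
            # First observation initializes the prototype directly.
            self.prototypes[label] = list(feature)
        else:
            m = self.momentum
            self.prototypes[label] = [
                m * o + (1.0 - m) * f for o, f in zip(old, feature)
            ]
        return self.prototypes[label]


pool = PrototypePool(momentum=0.5)
pool.update("cat", [1.0, 0.0])
blended = pool.update("cat", [0.0, 1.0])
```

A high momentum makes prototypes change slowly, giving the contrastive objective a stable target across batches.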

2606.16575 2026-06-19 cs.LG math-ph math.MP 新提交

RepNN: Tackling spectral bias in deep neural networks via parameter reparameterization

RepNN:通过参数重参数化解决深度神经网络中的谱偏差

Yong Wang, Tao Zhou, Xuhui Meng

发表机构 * Institute of Interdisciplinary Research for Mathematics and Applied Science, School of Mathematics and Statistics, Huazhong University of Science and Technology(华中科技大学数学与统计学院交叉科学与应用数学研究所) Institute of Computational Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院计算数学研究所)

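AI summary Addresses spectral bias of deep neural networks on oscillatory and multiscale problems with RepNN, which reparameterizes the weights and biases of the first hidden layer to control the initial slope scale and the distribution of partition points, enables adaptive frequency scaling during training, and improves accuracy in function approximation, PDE solving, and operator learning.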

详情
AI中文摘要

深度神经网络(DNN)在科学计算中取得了显著成功,但在捕捉振荡和多尺度行为时常常受到谱偏差的影响。在本研究中,我们通过考察浅层ReLU神经网络在高频函数拟合中的失败来探究这一局限性。这一观察识别出解决快速振荡的两个重要因素:初始斜率尺度和网络诱导的分区点分布。受此分析启发,我们提出了RepNN,一种采用ReLU或tanh激活的重参数化神经网络模型,专为高频和多尺度问题设计。关键思想是重参数化第一隐藏层的权重和偏置,从而能够有效控制初始斜率尺度并提供合适的初始分区点分布。此外,将重参数化的权重和偏置视为可训练参数,使得DNN在训练过程中实现自适应频率缩放。我们还推导了重参数化DNN的输出和斜率幅度的定量估计,以指导所提方法的初始化。数值实验,包括多尺度一维和四维函数逼近、结合物理信息神经网络(PINN)的正向和逆向PDE问题以及使用真实数据的地震问题算子学习,表明RepNN在略微增加计算成本的情况下,提高了普通DNN在捕捉高度振荡特征时的预测精度。这些结果表明,RepNN为克服谱偏差并将DNN应用于多尺度问题提供了一种有效且灵活的方法。

英文摘要

Deep neural networks (DNNs) have achieved remarkable success in scientific computing, yet they often suffer from spectral bias in capturing oscillatory and multiscale behaviors. In this study, we investigate this limitation by examining the failure of shallow ReLU neural networks in fitting high-frequency functions. This observation identifies two important factors in resolving rapid oscillations: the initial slope scale and the distribution of partition points induced by the networks. Motivated by this analysis, we propose RepNN, a reparameterized neural network model with activation ReLU or tanh designed for high-frequency and multiscale problems. The key idea is to reparameterize the weights and biases in the first hidden layer, which enables effective control of the initial slope scale and provides an appropriate distribution of the initial partition points. Furthermore, treating the reparameterized weights and biases as trainable parameters allows the DNN to achieve adaptive frequency scaling during training. In addition, we derive quantitative estimates for the output and slope magnitudes of the reparameterized DNN to guide the initialization of the proposed method. Numerical experiments, including multiscale one- and four-dimensional function approximations, forward and inverse PDE problems in combination with physics-informed neural networks (PINNs), and operator learning for an earthquake problem using real data, demonstrate that RepNN improves the predicted accuracy of vanilla DNNs in capturing highly oscillatory features with slightly additional computational cost. These results indicate that RepNN provides an effective and flexible approach for overcoming spectral bias and applying DNNs to multiscale problems.
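
The two factors named above, slope scale and partition points, can be made explicit for a one-dimensional ReLU layer. A unit relu(w x + b) has its kink at x = -b / w, so writing w = scale * a and b = -scale * a * p exposes the kink location p and the slope scale as separate parameters. This particular parameterization is an illustrative assumption, not the formula used by RepNN, and all names are hypothetical.

```python
def uniform_points(lo, hi, n):
    """n partition points evenly spaced strictly inside (lo, hi)."""
    step = (hi - lo) / (n + 1)
    return [lo + (i + 1) * step for i in range(n)]


def first_layer(x, slopes, points, scale=1.0):
    """ReLU features relu(scale * a_i * (x - p_i)); unit i has its kink at p_i."""
    return [max(0.0, scale * a * (x - p)) for a, p in zip(slopes, points)]


def to_weights_and_biases(slopes, points, scale=1.0):
    """Recover the usual (w, b) form, where the pre-activation is w * x + b."""
    weights = [scale * a for a in slopes]
    biases = [-scale * a * p for a, p in zip(slopes, points)]
    return weights, biases


points = uniform_points(0.0, 1.0, 3)
features = first_layer(0.5, [1.0, 1.0, 1.0], points, scale=4.0)
weights, biases = to_weights_and_biases([1.0, 1.0, 1.0], points, scale=4.0)
```

Raising `scale` steepens every unit without moving its kink, while the choice of `points` decides where the network can bend at initialization.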

2606.16417 2026-06-19 cs.SD eess.AS 新提交

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Joycent: 基于扩散的口音语音合成,无需口音音素预测

Xintong Wang, Ye Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

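AI summary Proposes Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. It integrates accent and speaker representations through conditional layer normalization, introduces the WhisAID accent identification model, and improves accentedness while preserving speaker identity.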

详情
AI中文摘要

口音文本到语音(TTS)旨在合成具有目标口音的语音。现有的口音TTS系统通常依赖于两阶段流程,首先将标准音素序列转换为口音音素序列,然后合成口音语音。然而,这种方法存在错误累积问题,并且需要配对的标准-口音音素序列数据,这在实践中往往有限。此外,基于文本的口音音素表示不足以建模韵律和节奏等声学口音特征。在这项工作中,我们提出了Joycent,一种基于扩散的口音TTS模型,它直接从标准音素序列和语音参考合成口音语音,无需口音音素预测。Joycent通过文本编码器中的条件层归一化(CLN)集成口音和说话人表征。我们引入了WhisAID,一种在口音普通话语音上训练的普通话口音识别模型,以提取口音表征。实验结果表明,与基线系统相比,Joycent在保持说话人身份的同时提高了口音自然度。我们在以下网址发布代码和演示:https://github.com/oshindow/Joycent-code。

英文摘要

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.
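
Conditional layer normalization (CLN), as referenced above, normalizes a hidden vector and then applies a scale and shift predicted from a conditioning vector instead of fixed learned ones. The sketch below shows the generic operation on plain lists. The linear projections, the `1.0 +` offset on the scale, and the function name are assumptions; in Joycent the conditioning would be the accent and speaker representations, and the model's actual layer may differ.

```python
import math


def conditional_layer_norm(hidden, cond, w_scale, w_shift, eps=1e-5):
    """Layer-normalize `hidden`, then modulate it with values computed from `cond`.

    w_scale and w_shift are matrices with one row per hidden unit and one
    column per conditioning dimension.
    """
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    # Scale starts at 1 so zero weights reduce to plain layer normalization.
    scale = [1.0 + sum(w * c for w, c in zip(row, cond)) for row in w_scale]
    shift = [sum(w * c for w, c in zip(row, cond)) for row in w_shift]
    return [g * n + b for g, n, b in zip(scale, normed, shift)]


hidden = [1.0, 2.0, 3.0]
accent_and_speaker = [1.0, 0.0]  # hypothetical conditioning vector
zeros = [[0.0, 0.0] for _ in hidden]
plain = conditional_layer_norm(hidden, accent_and_speaker, zeros, zeros)
```

The same text encoding can then be shifted toward different accents or speakers by changing only the conditioning vector.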

2606.16057 2026-06-19 cs.RO cs.SY eess.SP eess.SY 新提交

A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation

一种智能调度混合(SSH)EKF-FGO状态估计方法

Eric Levy, Soosan Beheshti

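AI summary Uses a Smart-Scheduled Hybrid (SSH) EKF-FGO framework to study optimisation scheduling experimentally as an independent design variable in the trade-off between estimation accuracy and computational cost. Planar SLAM simulations show that scheduling strongly affects pre-optimisation drift, transient error, and runtime.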

Comments This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore

详情
AI中文摘要

在机器人学和控制中,可靠的状态估计需要在估计精度和计算成本之间取得平衡。虽然基于滤波的方法(如扩展卡尔曼滤波器,EKF)提供高效的实时更新,而使用因子图的优化公式化方法改善全局一致性,但优化调度的作用通常被隐式处理,而非作为明确的设计变量进行研究。本文提出了一项实验研究,通过使用智能调度混合(SSH)EKF-FGO框架作为受控测试平台,明确隔离了优化调度。通过将基于EKF的状态传播与定期调用的批量优化相结合,并保持求解器结构和计算量固定,本文的主要贡献是实验性地将优化调度表征为一个独立的设计变量,它控制着中间估计精度与计算成本之间的权衡。在平面SLAM环境中的仿真结果表明,调度强烈影响预优化漂移、瞬态误差行为和运行时间。特别是,结果识别出一些操作区域,在这些区域中,全局优化的大部分好处可以以一小部分计算成本保留,从而突显了优化调度作为混合状态估计系统中一个未被充分探索但至关重要的考虑因素。

英文摘要

Reliable state estimation in robotics and control requires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation-based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart-Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre-optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
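
The scheduling trade-off described above can be illustrated with a deliberately crude toy model. It assumes that filter-only propagation accumulates a constant drift per step and that each batch optimisation removes the drift entirely; neither assumption comes from the paper, and no EKF or factor graph is implemented here. The function name and parameters are hypothetical.

```python
def simulate_schedule(steps, period, drift_per_step=0.01):
    """Return (mean absolute error, number of optimisation calls).

    period: invoke the batch optimiser every `period` steps; 0 means never.
    """
    error = 0.0
    total_error = 0.0
    calls = 0
    for k in range(1, steps + 1):
        error += drift_per_step      # filter-only propagation accumulates drift
        if period and k % period == 0:
            error = 0.0              # stand-in for a batch optimisation
            calls += 1
        total_error += abs(error)
    return total_error / steps, calls


frequent = simulate_schedule(steps=100, period=10)
sparse = simulate_schedule(steps=100, period=50)
never = simulate_schedule(steps=100, period=0)
```

Comparing the three schedules shows the shape of the trade-off: more frequent optimisation lowers the intermediate error but costs more solver calls.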

2606.15966 2026-06-19 cs.CV cs.GR 新提交

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

VEPHand: 大规模视图高效光度手部性能捕捉

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

发表机构 * Google XR

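AI summary Presents an end-to-end pipeline for dynamic hand capture and registration in view-efficient setups (about 20 views), using a mask-free neural reconstruction method and a physics-inspired registration framework to handle geometric ambiguity and deformation under self-contact, demonstrated on more than 12,000 sequences.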

详情
AI中文摘要

鲁棒、高保真的3D手部捕捉是数字人创建的基础,但在实际多视角系统中仍具挑战性,这些系统需要在丰富光度信息与有限视角密度导致的重建几何歧义之间取得平衡。本文提出一种端到端的动态手部性能捕捉与配准管线,专为视图高效设置(约20个视角)设计。我们通过两项主要创新应对关键挑战。首先,为克服重建困难(如视角重叠有限和背景杂乱),我们的无掩膜神经方法通过场景参数化和场景特定密度正则化,从无掩膜图像中鲁棒地提取精细的手部几何和外观。其次,针对配准挑战(如准确捕捉非线性皮肤变形和确保严重自接触时的合理结果),我们提出一个物理启发框架。它通过优化个性化手部模型规范四面体网格内的固有体积偏移以及姿态参数,将重建与个性化手部模型对齐。该方法在鲁棒损失和优化支持下,捕捉精细表面变形,确保在严重关节运动和自接触下的合理结果,并对输入噪声表现出强容忍性。我们在超过12000个序列的大规模数据集上展示了自动化管线的可扩展性和鲁棒性,并从中导出一个大规模、高质量合成2D/3D手部数据集用于训练下游任务。这展示了该方法在单手、复杂双手交互和自然手物操作中的有效性。我们的方法在视图高效、无掩膜场景下实现了最先进的重建保真度和高精度配准。项目页面:https://zyshen021.github.io/VEPHand/。

英文摘要

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at https://vephand.github.io/.

2606.15908 2026-06-19 cs.CV 新提交

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

高保真4D手-物体捕捉:基于多视角时空追踪和物理感知高斯模型

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

发表机构 * Google XR(谷歌XR) University of Science and Technology of China (USTC)(中国科学技术大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

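AI summary Proposes a template-free, marker-free multi-view system that initializes poses and object geometry with a feed-forward transformer aggregating cross-view geometry and temporal cues, then refines them with physics-aware Gaussian optimization for robust, artifact-free 4D hand-object reconstruction.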

Comments Project page: https://hostpg.github.io/

详情
AI中文摘要

具身AI和空间计算中对高保真4D手-物体交互(HOI)数据的需求日益增长,但目前受限于对预扫描物体模板和物理标记的依赖。尽管近期方法在从视频重建4D手-物体交互方面取得了有希望的结果,但它们对手和物体姿态的初始估计高度敏感。然而,从图像中估计这些姿态具有挑战性,尤其是在手-物体交互场景中固有的严重遮挡下。我们提出了一种新颖系统,用于从同步且校准的多视角视频中鲁棒且精确地重建手和物体,无需任何模板或标记。我们的系统包含两个主要创新组件:(1)一个多视角前馈Transformer模型,聚合跨视角几何和时间线索,为姿态和密集物体几何提供可靠的、度量一致的初始化;(2)一个手-物体物理感知高斯优化框架,用于细化初始估计,集成四面体约束、碰撞细化和外观分解,以产生物理上合理且视觉上精确的重建。在公共基准和广泛内部数据集上的验证表明,我们的流程实现了高度鲁棒、无伪影的重建,为自动化4D资产生成提供了高效基础。我们的项目页面位于https://zyshen021.github.io/HOSTPG/。

英文摘要

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion, which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

2606.15862 2026-06-19 cs.AI 新提交

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

RetailBench: 在真实零售环境中评估LLM代理的长期推理与连贯决策能力

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

发表机构 * Ant Group(蚂蚁集团) City University of Hong Kong(香港城市大学)

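AI summary Introduces RetailBench, a benchmark that simulates single-store supermarket operation to evaluate LLM agents on long-horizon decision making. Only a small subset of models survives the full 180-day horizon, and even the strongest runs remain well behind a privileged oracle policy.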

Comments This paper is my paper's second version [see arXiv:2603.16453v2]

详情
AI中文摘要

大型语言模型(LLM)代理在短期、范围明确的任务上取得了快速进展,但它们在动态长期环境中维持连贯决策的能力仍不确定。我们引入了RetailBench,一个基于数据驱动的模拟基准,用于评估在单店超市运营中使用工具的LLM代理。RetailBench将零售管理建模为部分可观察的决策过程,并设计支持千天规模的模拟。在此环境中,代理必须管理定价、补货、供应商选择、货架分类、库存老化、客户反馈、外部事件和现金流约束。我们在180天的评估期内,在代表性代理框架下评估了七个当代LLM,并将它们与特权最优策略进行比较。结果显示模型之间存在显著差异:只有一小部分能够存活整个评估期,即使最强的LLM运行在最终净资产和销售结果上也远落后于最优策略。行为分析将这些差距归因于不完整的证据获取、表面决策以及缺乏一致的长期策略。RetailBench为研究经济基础长期决策中的可靠自主性提供了一个受控测试平台。

英文摘要

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.15832 2026-06-19 cs.LG math.OC 新提交

SILAGE: Memory-Efficient, Full-Gradient-Free Nonconvex Optimization for Nested Finite Sums

SILAGE: 针对嵌套有限和的内存高效、完全无全梯度的非凸优化

Igor Sokolov, Laurent Condat, Peter Richtárik

发表机构 * Center of Excellence for Generative AI, King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学生成式人工智能卓越中心)

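AI summary Proposes SILAGE for nonconvex optimization with a nested double finite-sum structure. By exploiting the double-sum structure it avoids global full-gradient refreshes and needs only $\mathcal{O}(n)$ memory, with a convergence analysis that adapts to across-group and within-group heterogeneity.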

Comments 81 pages, 3 algorithms, 4 theorems, 2 corollaries, 11 lemmas, 2 figures, 12 tables

详情
Abstract

Empirical risk minimization on massive datasets naturally exhibits a nested double finite-sum structure, where $N=nm$ total samples are logically or physically partitioned into $n$ blocks of size $m$ (e.g., in pooled data silos, out-of-core learning, or deliberate stratification). While variance-reduced methods achieve optimal oracle complexities for nonconvex objectives, they suffer from severe scaling bottlenecks in this centralized regime. Recursive estimators, such as PAGE, require periodic global full-gradient refreshes over all $nm$ samples, which are computationally expensive. Conversely, single-loop methods, such as SILVER, avoid such refreshes but require an impractical $\mathcal{O}(nm)$ memory footprint to store a control variate for every sample. In this paper, we propose SILAGE, a variance-reduced algorithm that addresses this trade-off. By actively exploiting the double-sum structure, SILAGE eliminates periodic global full-gradient refreshes over all $nm$ components (evaluating at most one local group gradient per iteration) while requiring only $\mathcal{O}(n)$ memory. Furthermore, we provide a tight convergence analysis that avoids pessimistic worst-case Lipschitz constants. Instead, SILAGE's complexity natively adapts to the underlying data geometry via nested functional similarities: across-group ($\delta_1$) and within-group ($\delta_2$) heterogeneity. Our results improve existing state-of-the-art bounds in several practically relevant regimes.
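
For orientation, the nested double finite-sum structure described in this abstract is conventionally written as below. This is a sketch under assumed notation: the symbols $f$, $f_i$, $f_{ij}$, and the parameter vector $x \in \mathbb{R}^d$ do not appear in the abstract and are introduced here only for illustration.

$$
\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\qquad
f_i(x) = \frac{1}{m} \sum_{j=1}^{m} f_{ij}(x),
$$

so that $f$ averages over all $N = nm$ sample losses $f_{ij}$. In this notation, a global full-gradient refresh evaluates all $nm$ gradients $\nabla f_{ij}(x)$, whereas a local group gradient $\nabla f_i(x)$ touches only the $m$ samples of block $i$. Likewise, keeping one $d$-dimensional control variate per sample costs $\mathcal{O}(nm)$ vectors. The stated $\mathcal{O}(n)$ footprint would be consistent with keeping state per block rather than per sample, although the abstract does not say what is stored.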

2606.15648 2026-06-19 cs.CV New submission

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

Affiliations * The Hong Kong Polytechnic University

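AI summary Proposes a transfer-learning method that needs no paired labels: underwater image enhancement is decomposed into global color correction, haze removal, and background noise suppression, and cross-domain priors supervise each step to give a physically consistent enhancement.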

Details
Journal ref
Information Fusion (2026): 104557
Abstract

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on training strategies or network design so that the enhanced result aligns well with the labels in datasets, ignoring that these labels are selected from the enhanced results of previous UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of these models is not fully satisfactory. At the same time, collecting true labels for underwater images is challenging. In this work, we propose a transfer-learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression, following the underwater physics. Then, multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on fusing physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.15516 2026-06-19 cs.RO New submission

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

Soofiyan Atar, Yao-Ting Huang, Michael Yip

Affiliations * University of California San Diego

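AI summary Proposes a cross-embodiment force-position interface that uses calibrated joint torques and fingertip forces for contact-aware grasping across heterogeneous dexterous hands, combined with a flow-matching visuomotor policy and a hybrid force-position controller, to achieve transferable compliant grasping.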

Comments Website (overview): transferring-contact-not-just-motion.github.io/

Details
Abstract

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N·m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.
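
The abstract does not say how joint torques are mapped to fingertip forces. As a rough illustration only, the sketch below uses the standard quasi-static relation $\tau = J^\top f$ and solves it in the least-squares sense. The planar two-link finger, its link lengths, the joint angles, and the function names `planar_jacobian` and `fingertip_force` are all hypothetical and are not taken from the paper.

```python
import numpy as np


def planar_jacobian(q1: float, q2: float, l1: float = 0.05, l2: float = 0.03) -> np.ndarray:
    """Fingertip Jacobian of a planar two-link finger (hypothetical geometry, metres)."""
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ])


def fingertip_force(jacobian: np.ndarray, joint_torque: np.ndarray) -> np.ndarray:
    """Least-squares fingertip force f (N) from joint torques tau (N*m), assuming tau = J^T f."""
    force, *_ = np.linalg.lstsq(jacobian.T, joint_torque, rcond=None)
    return force


# Synthetic check: generate torques from a known force, then recover the force.
J = planar_jacobian(0.4, 0.6)
f_true = np.array([0.0, -2.0])   # assumed 2 N push along -y
tau = J.T @ f_true               # joint torques implied by tau = J^T f
f_est = fingertip_force(J, tau)

# One possible scalar per-finger load descriptor (an assumption): force magnitude.
load = float(np.linalg.norm(f_est))
```

Least squares is used here so that the same call also works for fingers with more joints than force components. A real hand would additionally need the torque calibration step the abstract describes before such a mapping is meaningful.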