arXivDaily: daily arXiv digest, updated Monday through Friday
2606.23689 2026-06-23 cs.RO cs.LG New submission

AutoDex: An Automated Real-World System for Dexterous Grasping Data Collection

AutoDex: 一个用于灵巧抓取数据收集的自动化真实世界系统

Mingi Choi, Gunhee Kim, Jisoo Kim, Taeksoo Kim, Taeyun Ha, Jongbin Lim, Hanbyul Joo

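Institutions: Seoul National University

AI summary: Proposes AutoDex, an automated real-world data-collection system that physically validates dexterous grasp candidates through dense perception, collision monitoring, and active resets. Collection is 4.8x faster than teleoperation, and grasps retrieved from the validated database succeed 76% of the time versus 34% for simulation-only validation.

Comments: 16 pages, 9 figures. Includes supplementary material

AI-generated Chinese abstract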

学习鲁棒的灵巧抓取需要记录抓取尝试物理结果的真实世界数据。此类数据难以大规模获取:遥操作产生有效的物理结果但速度慢且存在操作者偏差,而基于仿真的生成廉价且可扩展但无法验证接触有效性。一个自然的解决方案是生成候选抓取并在真实硬件上验证,但这只有在整个收集循环(感知、执行、标注和重置)无需人工干预的情况下才能扩展。我们提出AutoDex,一个自动化真实世界数据收集系统,它闭合了这个循环:对于来自可替换生成器的每个候选,它利用密集的20摄像头感知在严重手-物遮挡下定位物体,执行碰撞监控的机器人运动,标注提举-保持的成功或失败,并在试验之间主动重置物体以暴露跨稳定姿态的额外候选。结果是一个可复用的物理标注抓取试验数据库,下游系统可以通过检索和可行性过滤进行查询。使用AutoDex,我们在Allegro和Inspire手上收集了100个不同物体的3,593次抓取试验,并带有同步的多视角观测和机器人状态日志。对于匹配的500次轨迹收集,AutoDex需要10.3小时,而遥操作需要49.4小时,实现了4.8倍的吞吐量提升,并且从AutoDex验证的数据库中检索到的抓取成功率为76%,而仅仿真验证的成功率为34%。代码和数据将公开发布。

English abstract

Learning robust dexterous grasping requires real-world data that records the physical outcomes of grasp attempts. Such data is hard to obtain at scale: teleoperation yields valid physical outcomes but is slow and operator-biased, while simulation-based generation is cheap and scalable but cannot certify contact validity. A natural solution is to generate candidate grasps and verify them on real hardware, but this scales only if the entire collection loop (perception, execution, labeling, and reset) runs without human intervention. We present AutoDex, an automated real-world data-collection system that closes this loop: for each candidate from a replaceable generator, it localizes the object under severe hand-object occlusion with dense 20-camera perception, executes collision-monitored robot motions, labels lift-and-hold success or failure, and actively resets the object between trials to expose additional candidates across stable poses. The result is a reusable database of physically labeled grasp trials that downstream systems can query by retrieval and feasibility filtering. Using AutoDex, we collect 3,593 grasp trials across Allegro and Inspire hands on 100 diverse objects, with synchronized multi-view observations and robot-state logs. For a matched 500-trajectory collection, AutoDex requires 10.3 h versus 49.4 h for teleoperation, yielding a 4.8x throughput improvement, and grasps retrieved from the AutoDex-validated database succeed 76% of the time versus 34% for simulation-only validation. Code and data will be publicly released.

2606.23688 2026-06-23 cs.CV New submission

Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

Lift4D: 协调单视图3D估计以实现野外4D重建

Yehonathan Litman, Xiaoxuan Ma, Manan Shah, Nicolas Ugrinovic, Kris Kitani, Fernando De la Torre, Shubham Tulsiani

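Institutions: Carnegie Mellon University

AI summary: Proposes Lift4D, a test-time optimization framework that adapts a single-view 3D model through causal latent conditioning to obtain a temporally consistent initialization, then combines occlusion-aware optimization with a diffusion prior to reconstruct dynamic non-rigid objects in 4D from monocular video, clearly improving over prior 4D reconstruction methods.

Comments: Webpage, Demos: https://lift4d.github.io

AI-generated Chinese abstract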

从单目视频重建动态非刚体对象需要将直接观测的视觉线索与几何和外观的数据驱动先验相结合。先前的方法要么直接从视觉输入学习预测4D表示,要么初始化一个3D表示,随后根据视频证据进行变形和细化。然而,前者受限于4D训练数据的稀缺性,而后者仅在初始重建时利用先验,之后仅依赖视频监督;两者都无法很好地处理具有大变形和遮挡的复杂野外场景。我们提出了Lift4D,一个测试时优化框架,解决了这两个局限性。首先,我们通过因果潜在条件化,使现有的单视图3D重建模型产生时域一致的逐帧预测,为可变形3D高斯泼溅表示提供连贯的初始化。然后,我们通过遮挡感知优化来“雕刻”这个表示,以匹配输入视频,该优化忠实地恢复可见表面细节,同时使用视图条件扩散先验完成未观测区域。我们证明Lift4D明显优于先前的4D重建方法,特别是在具有严重遮挡和非刚体运动的挑战性野外序列上。

English abstract

Reconstructing dynamic non-rigid objects from monocular video requires integrating visual cues from direct observations with data-driven priors over geometry and appearance. Prior approaches either learn to directly predict 4D representations from visual input or initialize a 3D representation that is subsequently deformed and refined based on video evidence. However, the former are constrained by the scarcity of 4D training data, while the latter leverage priors only for the initial reconstruction and rely solely on video supervision thereafter; neither handles complex in-the-wild scenarios with large deformations and occlusions well. We present Lift4D, a test-time optimization framework that addresses both limitations. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. We then "sculpt" this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior. We demonstrate that Lift4D clearly improves over prior 4D reconstruction methods, particularly on challenging in-the-wild sequences with severe occlusions and non-rigid motion.

2606.23687 2026-06-23 cs.CL New submission

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

随机化 YaRN 提升长上下文推理的长度泛化能力

Manas Mehta, Fangcong Yin, Greg Durrett

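Institutions: New York University

AI summary: Proposes Randomized YaRN, which combines YaRN positional extrapolation, randomized positional encoding, and a length curriculum so that training on short-context data improves reasoning performance at long context lengths (16K to 128K).

AI-generated Chinese abstract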

大型语言模型(LLMs)通常在短序列上预训练,然后通过额外训练扩展到处理更长的序列。然而,这些LLMs在进一步泛化到非常长的序列时仍然存在困难。我们提出随机化YaRN,一种通过结合基于YaRN的位置外推、随机位置编码和长度课程来改进长度泛化的训练方法。在短上下文数据的训练过程中,标记被分配从更大位置范围采样的YaRN位置编码,即使在短上下文输入下也能使模型暴露于分布外的位置表示。我们在两个具有挑战性的长上下文推理基准BABILong和多轮共指消解(MRCR)上评估随机化YaRN。当在上下文长度<8K的数据上训练时,随机化YaRN在16K到128K的上下文长度上持续改进推理性能,并优于标准微调,最大的改进出现在远分布外长度上。我们的结果表明,逐步将模型暴露于分布外位置分布为可泛化的长上下文推理提供了一种有效的方法。

English abstract

Large language models (LLMs) are typically pretrained on short sequences and then extended to work on longer sequences with additional training. However, such LLMs still struggle to further generalize to very long sequences. We propose Randomized YaRN, a training method that improves length generalization by combining YaRN-based positional extrapolation with randomized positional encoding and a length curriculum. During training on short context data, tokens are assigned YaRN positional encodings sampled from a larger position range, exposing the model to out-of-distribution positional representations even on short-context inputs. We evaluate Randomized YaRN on two challenging long-context reasoning benchmarks, BABILong and Multi-Round Coreference Resolution (MRCR). When training on data with <8K context, Randomized YaRN consistently improves reasoning performance on context lengths from 16K to 128K and outperforms standard fine-tuning, with the largest gains appearing at far out-of-distribution lengths. Our results suggest that progressively exposing models to OOD positional distributions provides an effective recipe for generalizable long-context reasoning.
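
The position-sampling idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it only shows how a short sequence might be assigned sorted position ids drawn from a larger range, and it does not reproduce YaRN's frequency scaling. The function names, the linear curriculum, and the range endpoints (8,192 and 131,072) are assumptions made for the example.

```python
import random

def sample_positions(seq_len, max_position, rng):
    """Draw seq_len distinct position ids from [0, max_position), in increasing order."""
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    # Sorting keeps the relative token order while spreading ids over the larger range.
    return sorted(rng.sample(range(max_position), seq_len))

def curriculum_max_position(step, total_steps, start=8_192, end=131_072):
    """Hypothetical linear curriculum: widen the sampling range as training progresses."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(start + frac * (end - start))

rng = random.Random(0)
max_pos = curriculum_max_position(step=500, total_steps=1000)
positions = sample_positions(seq_len=16, max_position=max_pos, rng=rng)
print(max_pos, positions[:4])
```

Under these assumptions, a 16-token input late in training would receive position ids spread across a range far larger than 16, which is one way a model could see out-of-distribution positions on short inputs.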

2606.23686 2026-06-23 cs.RO New submission

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

LIBERO-Safety:视觉-语言-动作模型中物理与语义安全的综合基准

Rongxu Cui, Zongzheng Zhang, Jingrui Pang, Haohan Chi, Jinbang Guo, Saining Zhang, Shaoxuan Xie, Xin Jin, Yao Mu, Jiaolong Yang, Guocai Yao, Xianyuan Zhan, Ya-Qin Zhang, Hao Zhao

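Institutions: Institute for AI Industry Research (AIR), Tsinghua University; Beijing Academy of Artificial Intelligence (BAAI); Beihang University; Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University; Microsoft Research Asia (MSRA)

AI summary: Addresses the largely unverified operational safety of vision-language-action (VLA) models with a parametric safety benchmark and a keypose-driven data generation pipeline. Curates a large-scale dataset of 19,664 collision-free demonstrations, systematically evaluates eight VLA models and two embodied foundation models, and reveals a generalization-safety tension.

Comments: Accepted by ECCV 2026. Project page: https://libero-safety.github.io/

AI-generated Chinese abstract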

尽管视觉-语言-动作(VLA)模型具有令人印象深刻的操控能力,但它们在严格约束下的操作安全性仍未得到充分验证。为了解决这一问题,我们引入了一个参数化安全基准,以程序化生成具有全面随机性的安全关键场景。为了克服人类遥操作的可扩展性瓶颈,我们开发了一种新颖的关键帧驱动数据生成流水线。利用这一基础设施,我们整理了一个包含19,664个严格无碰撞演示的大规模数据集,并进行了广泛的领域随机化。然后,我们对八个VLA模型和两个具身基础模型进行了系统的跨范式评估。我们的分析揭示了一个关键的泛化-安全张力:尽管高多样性训练促进了更安全的轨迹,但任务成功从根本上受到次优轨迹合成和语义错位的瓶颈限制。通过提供可扩展的流水线、鲁棒的数据集和深刻的失败模式见解,LIBERO-Safety为开发安全可靠的VLA模型奠定了关键基础。

English abstract

Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.

2606.23685 2026-06-23 cs.RO New submission

LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation

LaST-HD:从可扩展人类数据中学习潜在物理推理以进行机器人操作

Jiaming Liu, Yinxi Wang, Chenyang Gu, Siyuan Qian, Xiangju Mi, Hao Chen, Jiawei Chen, Qingpo Wuwu, Xiaoqi Li, Nuowei Han, Yiming Zhang, Xuheng Zhang, Yang Yue, Yeqing Yang, Lei Wang, Peng Jia, Hao Tang, Shanghang Zhang

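Institutions: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; The Chinese University of Hong Kong; Simplexity Robotics Aether Tech

AI summary: Proposes LaST-HD, which aligns human-hand and robot demonstrations in a shared latent reasoning space and trains a world model on unpaired trajectories to synthesize unified latent targets, so that the policy internalizes physical dynamics shared across embodiments. Also develops the low-cost OOL Glove for motion capture; with a mixed training recipe, the method reaches over 90% accuracy in novel environments using only 20 minutes of glove data.

AI-generated Chinese abstract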

人手演示为机器人学习提供了直接且可扩展的物理交互数据源。虽然手动重定向对于建立不同形态之间的运动学动作对应关系必不可少,但稳健的迁移需要超越几何层面,解决人类与机器人操作之间物理动力学的潜在对齐问题。为此,我们引入了LaST-HD,一种新颖的人到机器人动作学习范式,它通过在人手和机器人演示之间共享潜在推理空间来扩展推理-行动VLA。LaST-HD不是模仿人类运动学,而是在未配对的人手和机器人轨迹上训练一个辅助的动作条件世界模型,以合成统一的潜在目标。在这个共享的前向动力学空间中对齐跨形态表示后,这些目标监督LaST-HD的潜在推理过程,使其能够内化共享的物理动力学并驱动高效的人手动作学习。此外,我们开发了Out-of-Lab (OOL) Glove,一种专为LaST-HD设计的低成本动作捕捉手套,用于人手数据收集。捕获的人类数据提供精确的关键点,并作为跨夹爪和灵巧手的通用动作监督。借助对齐的潜在空间和高保真人手数据,我们开发了一种渐进式混合到人类训练方案,包括混合人机协同训练和训练后的人手在线校正。通过混合协同训练,LaST-HD仅使用人手演示就改善了对新物体、场景和位置的泛化能力。通过在线校正,LaST-HD进一步适应新环境,仅使用20分钟的OOL手套数据即可达到超过90%的准确率。

English abstract

Human-hand demonstrations provide a direct and scalable source of physical interaction data for robot learning. While manual retargeting is indispensable for establishing kinematic action correspondence across different morphologies, robust transfer requires going beyond geometry to address the underlying alignment of physical dynamics between human and robot manipulation. To address this, we introduce LaST-HD, a novel human-to-robot action learning paradigm that extends reasoning-before-acting VLA by aligning human-hand and robot demonstrations in a shared latent reasoning space. Rather than mimicking human kinematics, LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets. After aligning cross-embodiment representations in this shared forward-dynamics space, these targets supervise LaST-HD's latent reasoning process, enabling it to internalize shared physical dynamics and drive efficient human-hand action learning. Moreover, we develop Out-of-Lab (OOL) Glove, a low-cost motion-capture glove tailored to LaST-HD for human-hand data collection. The captured human data provide precise keypoints and serve as universal action supervision across grippers and dexterous hands. Armed with the aligned latent space and high-fidelity human-hand data, we develop a progressive mixed-to-human training recipe comprising mixed human-robot co-training and human-hand online correction post-training. Through mixed co-training, LaST-HD improves generalization to novel objects, scenes, and positions using only human-hand demonstrations. With online correction, LaST-HD further adapts to novel environments and achieves over 90% accuracy using only 20 minutes of OOL glove data.

2606.23682 2026-06-23 cs.CV New submission

Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping

保留精华:通过令牌丢弃实现高效的参考条件生成

Rishubh Parihar, Ayush Raina, R. Venkatesh Babu, Or Patashnik

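Institutions: IISc Bangalore; Tel Aviv University

AI summary: Proposes Sparse Context, which drops most reference tokens and fine-tunes the model accordingly, achieving inference speedups of 4x (multi-reference) and 2x (single-reference) in reference-conditioned generation without degrading visual quality.

Comments: Project page: https://sparsecontext.github.io

AI-generated Chinese abstract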

基于参考的扩散模型通过利用输入图像中的元素来引导提示驱动的合成,实现了高度可控的图像生成。然而,这些模型在运行时计算成本高昂,并且其成本随输入参考数量的增加而严重扩展。虽然扩散模型的效率在提示驱动生成的背景下已被广泛研究,但在基于参考的模型领域仍基本未被充分探索。这一设置带来了仅关注生成的方法无法解决的独特挑战。特别是,将参考表示为密集令牌网格的浪费性方式提供了显著的改进机会。在这项工作中,我们提出了Sparse Context,一种通过仅保留减少的参考令牌子集来构建稀疏参考表示的方法。我们观察到,即使不修改模型,在推理时丢弃大量参考令牌也能在很大程度上保持其生成能力。为了充分发挥这一潜力,我们以不同比例随机丢弃令牌对模型进行微调,鼓励对部分参考表示的鲁棒性。至关重要的是,这种训练策略使模型与任何特定的令牌选择规则解耦,从而允许在推理时进行灵活控制。在推理时,我们采用任务感知的令牌选择策略,优先考虑参考图像中信息最丰富的区域,根据输入和任务需求调整令牌预算,而不是随机丢弃。大量实验表明,我们的方法在多参考生成中实现了4倍的推理速度提升,在单参考生成中实现了2倍的提升。重要的是,这种效率是在不损害空间对齐编辑和主体驱动生成的视觉质量的情况下实现的。

English abstract

Reference-based diffusion models enable highly controllable image generation by leveraging elements from input images to guide prompt-driven synthesis. However, these models are computationally expensive at runtime, and their cost scales severely with the number of input references. While the efficiency of diffusion models has been extensively studied in the context of prompt-driven generation, it remains largely under-explored in the realm of reference-based models. This setting presents unique challenges not addressed by methods focusing solely on generation. In particular, the wasteful representation of references as dense token grids offers significant opportunities for improvement. In this work, we present Sparse Context, a method for constructing sparse reference representations by retaining only a reduced subset of reference tokens. We observe that even without modifying the model, dropping a significant portion of reference tokens at inference time largely preserves its generation capabilities. To fully realize this potential, we fine-tune the model with random token dropping at varying ratios, encouraging robustness to partial reference representations. Crucially, this training strategy decouples the model from any specific token selection rule, allowing flexible control at inference time. At inference time, instead of random dropping, we apply task-aware token selection strategies that prioritize the most informative regions of the reference images, adapting the token budget to the input and task requirements. Extensive experiments show our method achieves a 4x increase in inference speed for multi-reference generation and a 2x increase for single-reference generation. Importantly, this efficiency is achieved without compromising visual quality across both spatially-aligned editing and subject-driven generation.
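
The training-time step of random token dropping at varying ratios can be sketched roughly as below. This is only an illustration of the general idea, not the Sparse Context code: the function name, the string stand-ins for tokens, and the candidate keep ratios are all assumptions, and the task-aware selection used at inference is not modeled.

```python
import random

def drop_reference_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of reference tokens, preserving their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]

rng = random.Random(0)
reference_tokens = [f"tok{i}" for i in range(64)]  # stand-in for an 8x8 reference token grid
keep_ratio = rng.choice([0.25, 0.5, 0.75, 1.0])    # hypothetical ratios, re-drawn per training step
sparse = drop_reference_tokens(reference_tokens, keep_ratio, rng)
print(keep_ratio, len(sparse))
```

Because the subset is drawn at random during training, the model is not tied to one selection rule, which is what would let a different, informed selection be swapped in at inference.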

2606.23680 2026-06-23 cs.RO cs.AI cs.LG New submission

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

CoorDex: 协调身体与手部先验以实现连续灵巧类人运动操作

Sikai Li, Shuning Li, Zhenyu Wei, Yunchao Yao, Chenran Li, Mingyu Ding

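Institutions: University of North Carolina at Chapel Hill; University of California, Berkeley

AI summary: Proposes CoorDex, a learning framework that converts high-dimensional body and dexterous-hand control into coordinated latent residual control, enabling a humanoid robot to perform dexterous manipulation while moving, such as grasping a bottle without stopping and opening a fridge door on the move.

Comments: Project page: https://skevinci.github.io/coordex/

AI-generated Chinese abstract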

类人运动操作常被简化为走走停停的过程:走到物体旁,停下操作,然后恢复运动。此外,它通常依赖于低自由度(DoF)的末端执行器,表现为开合抓取基元。我们提出CoorDex,一种学习流水线,将高维身体和灵巧手控制转化为协调的潜在残差控制,实现高自由度灵巧运动操作。从模拟的全身和手部演示开始,CoorDex训练类人身体和灵巧手的特权运动跟踪教师,将其蒸馏为基于本体感觉的潜在先验,并将冻结的先验作为下游残差强化学习的动作空间。协调的潜在残差策略通过共享任务上下文和分离的身体-手部残差头组合这些先验,在保持自然全身运动的同时提高手指级接触可靠性。CoorDex使配备20自由度WUJI手的Unitree G1类人机器人能够在运动中执行灵巧操作,包括不停顿抓取和搬运瓶子、移动中打开冰箱门以及立方体抓取和翻转。在行走-抓取-搬运任务上的消融实验表明,在相同奖励预算下,关节空间PPO、关节空间手部控制和单一潜在预测均失败,而潜在先验接口和协调残差结构使高维接触丰富的运动操作变得可训练。项目页面:this https URL

English abstract

Humanoid loco-manipulation is often simplified into a stop-and-go process: walking to an object, stopping to manipulate it, and then resuming locomotion. It also commonly relies on low degree-of-freedom (DoF) end effectors that behave like an open-close grasp primitive. We introduce CoorDex, a learning pipeline that converts high-dimensional body and dexterous hand control into coordinated latent residual control, enabling high-DoF dexterous loco-manipulation on the move. Starting from simulated whole-body and hand demonstrations, CoorDex trains privileged motion tracking teachers for the humanoid body and dexterous hand, distills them into proprioception-conditioned latent priors, and uses the frozen priors as the action space for downstream residual reinforcement learning. A coordinated latent residual policy composes these priors through shared task context and separate body-hand residual heads, preserving natural whole-body motion while improving finger-level contact reliability. CoorDex enables a Unitree G1 humanoid with a 20-DoF WUJI hand to execute dexterous manipulation while in motion, including non-stop bottle grasping and carrying, fridge door opening on the move, and cube pick-and-turn. Ablations on the walk-grasp-carry task show that joint-space PPO, joint-space hand control, and monolithic latent prediction all fail under the same reward budget, while the latent-prior interface and coordinated residual structure make high-dimensional contact-rich loco-manipulation trainable. Project Page: this https URL

2606.23679 2026-06-23 cs.CV cs.AI cs.GR cs.LG New submission

Semantic Browsing: Controllable Diversity for Image Generation

语义浏览:图像生成的可控多样性

Sara Dorfman, Maya Vishnevsky, Omer Dahary, Or Patashnik, Daniel Cohen-Or

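Institutions: Tel Aviv University

AI summary: Addresses the limited diversity and lack of semantic structure in text-to-image outputs by inducing diversity at the text level, using a vision-language model and an agentic workflow to build a space of semantic variations that users can navigate.

Comments: ECCV 2026. Project page: https://saradorfman1.github.io/SemanticBrowsing-webpage/

AI-generated Chinese abstract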

现代文本到图像模型在视觉保真度和提示遵循性方面表现出色。然而,这种严格的遵循是以牺牲多样性为代价的:生成的样本倾向于坍缩为单一的视觉解释。现有的提高多样性的方法产生的输出是由偶然变化而非有意义的设计选择驱动的。这激发了一个新的多样性任务变体,其中对生成的样本施加结构。我们引入了一种可控多样性的方法,实现了语义浏览,用户可以导航结构化的图像画廊,并通过系统遍历有意义的、可解释的变化轴进行创造性探索。实现这种语义控制水平需要对场景有深入理解。我们利用了最近文本到图像模型在详细描述上训练的事实,有效地将语义决策与像素生成解耦。这实现了一个范式转变:我们不再依赖文本到图像模型内的随机变化,而是直接在文本层面引入多样性。通过利用丰富的文本表示,我们允许视觉语言模型(VLM)在完整场景上下文中操作。为了克服标准VLM典型的通用输出,我们采用了一种代理工作流,明确强制实施与原始提示相适应的结构化变化。我们证明了我们的方法产生了多样且可导航的设计空间,其中每个变化都对应一个特定的、用户可理解的语义决策。

English abstract

Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.

2606.23678 2026-06-23 cs.CV cs.AI New submission

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

AIR: MLLMs中的自适应交错推理与代码

Cong Han, Xiaohan Lan, Haibo Qiu, Yujie Zhong

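Institutions: Independent Researcher

AI summary: Proposes AIR, an adaptive interleaved reasoning framework trained with reinforcement learning on code-augmented complex numerical computation tasks. Using a group-constrained reward function and a two-stage cold-start data construction pipeline, it improves benchmark performance by 6.1 percentage points on average.

Comments: 19 pages, 4 figures

AI-generated Chinese abstract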

遵循OpenAI o3引发的范式转变,交错推理与代码以增强多模态大语言模型(MLLMs)已成为关键研究前沿。现有文献主要关注视觉感知任务中的工具使用。然而,此类方法通常依赖预定义的视觉操作启发式规则,且因专注于视觉操作而本质上无法处理数值计算问题。本文通过扩展强化学习训练,在代码增强的复杂数值计算任务上赋予MLLMs自适应交错推理能力。为此,我们提出一个包含三个组件的综合解决方案:两阶段冷启动数据构建流程、用于强化学习数据集筛选的数据过滤策略,以及利用分组约束奖励函数实现交错推理轨迹的自适应工具调用策略。大量实验表明,使用分组约束奖励函数进行强化学习训练后,评估基准上的性能平均提升6.1个百分点。具体而言,交错推理样本的准确率提高9.9个百分点,工具使用的整体成功率超过95%。我们的数据和代码见:this https URL。

English abstract

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily on tool-use within vision-perception tasks. However, such approaches typically rely on predefined heuristics for visual manipulation and are inherently incapable of addressing numerical computation problems due to their exclusive focus on visual operations. This paper empowers MLLMs with adaptive interleaved reasoning capabilities through extended reinforcement learning training on code-augmented complex numerical computation tasks. To this end, we propose a comprehensive three-component solution consisting of: a two-stage cold-start data construction pipeline, data filtering strategies for RL dataset curation, and an adaptive tool-invocation strategy leveraging a group-constrained reward function for interleaved reasoning trajectories. Extensive experiments demonstrate that after Reinforcement Learning training with the group-constrained reward function, performance improves by an average of 6.1 percentage points (pp) on evaluation benchmarks. Specifically, the accuracy for interleaved reasoning samples increases by 9.9 pp, and the overall success rate of tool-use exceeds 95%. Our data and code are available at: this https URL.

2606.23676 2026-06-23 cs.LG cs.AI math.OC stat.ML New submission

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

开放问题:AdamW 在重尾噪声下是否有效?

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

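Institutions: Nanjing University; Zhejiang University; Fudan University

AI summary: Poses an open problem: can AdamW converge under heavy-tailed noise? Through a weighted-metric benchmark and a corridor lower-bound mechanism, it shows how denominator memory can hide large gradients.

AI-generated Chinese abstract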

AdamW 是训练大型语言模型(LLM)的事实标准优化器,但其理论仍主要停留在有限方差情形。这越来越令人不满,因为经验证据表明,LLM 预训练中的随机梯度噪声通常是重尾的。最近的工作表明,基于符号的优化器(如 Lion 和 Muon)实现了尖锐的重尾速率,并且 AdaGrad 也能在重尾噪声下收敛。然而,在此情形下,AdamW 尚无严格的收敛理论。AdamW 能否在相同的重尾假设下收敛,还是其二阶矩累积器造成了真正的障碍?我们将此表述为一个开放问题,证明了一个正加权度量基准,并给出了一个走廊下界机制,表明分母记忆如何隐藏大梯度。

English abstract

AdamW is the de facto optimizer for training large language models (LLMs), yet the theory behind it still lives mostly in finite-variance regimes. This is increasingly unsatisfying, as empirical evidence indicates that stochastic gradient noise in LLM pretraining is typically heavy-tailed. Recent work shows that sign-based optimizers such as Lion and Muon achieve sharp heavy-tailed rates, and that AdaGrad can also converge under heavy-tailed noise. However, no rigorous convergence theory for AdamW has yet been established in this regime. Can AdamW converge under the same heavy-tailed assumptions, or does its second-moment accumulator create a genuine obstruction? We formulate this as an open problem, prove a positive weighted-metric benchmark, and give a corridor lower-bound mechanism showing how denominator memory can hide large gradients.
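
For readers unfamiliar with the second-moment accumulator mentioned above, the sketch below implements the standard scalar AdamW update and compares step sizes with and without one large gradient spike. It is a generic illustration of how the denominator retains memory of a past spike; it is not the paper's corridor lower-bound construction, and the gradient sequences and hyperparameters are assumptions chosen for the example.

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW step with bias correction and decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment accumulator (the denominator)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta - update, m, v

def step_sizes(grads):
    """Return the absolute parameter change at each step for a gradient sequence."""
    theta, m, v, sizes = 0.0, 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        new_theta, m, v = adamw_step(theta, g, m, v, t)
        sizes.append(abs(new_theta - theta))
        theta = new_theta
    return sizes

calm = step_sizes([1.0] * 50)                            # constant gradients
spiked = step_sizes([1.0] * 10 + [1000.0] + [1.0] * 39)  # one heavy-tailed spike at step 11
print(calm[-1], spiked[-1])
```

In this toy run the last step after the spike is much smaller than in the calm run, because the accumulator decays slowly (beta2 close to 1) and keeps the denominator inflated long after the spike has passed.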

2606.23675 2026-06-23 cs.CV New submission

IMAGIN-4D: Image-Guided Controllable Interaction Generation

IMAGIN-4D:图像引导的可控交互生成

Sai Kumar Dwivedi, Federica Bogo, Buğra Tekin, Chenhongyi Yang, Nadine Bertsch, Tomas Hodan, Michael J. Black, Dimitrios Tzionas, Shreyas Hampali

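Institutions: Meta, Zurich

AI summary: Proposes IMAGIN-4D, a diffusion-based human-object interaction generator that decomposes image conditioning spatio-temporally to control interaction details, improving fine-grained interaction control.

Comments: 15 pages, 8 figures. Project page: https://imagin4d.github.io

AI-generated Chinese abstract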

生成人-物交互(HOI)是角色动画、机器人、AR/VR和具身智能的核心。最近的HOI生成方法从文本、物体几何和稀疏路径点合成运动,控制动作语义和物体轨迹。然而,这些信号对交互的描述不充分:相同的提示和轨迹可能产生不同的抓取、接近方向、身体姿势、物体姿势、接触和身体-物体布局。我们通过参考图像作为期望交互快照的视觉规范来解决这种歧义性。然而,单一的全局图像表示混淆了不同的线索,并对所有帧施加相同的视觉证据。因此,我们引入了IMAGIN-4D,一种基于扩散的HOI生成器,它在时空上分解图像条件。对于空间条件,IMAGIN-4D提取有监督的交互状态令牌,用于描述所描绘帧中的身体姿势、物体姿势、身体-物体接触和空间关系。对于时间条件,它通过每生成帧查询图像块来计算帧感知令牌,允许序列片段关注同一图像中的不同视觉线索。为了平衡图像、文本和路径点线索,IMAGIN-4D使用角色感知条件:文本、路径点和交互状态令牌使用独立的AdaLN流,而帧感知视觉令牌与运动令牌进行交叉注意力。由于HOI运动数据集缺乏配对图像,我们从FullBodyManipulation(FBM)构建了一个合成运动到图像的渲染流程,并引入了一个图像一致性度量来评估生成的运动是否匹配参考快照。在FBM和BEHAVE上的实验表明,与单令牌和统一图像条件基线相比,IMAGIN-4D在保持路径点跟随和运动质量的同时,改进了细粒度交互控制。代码和模型将在https://this URL发布。

English abstract

Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image as a visual specification of the desired interaction snapshot. However, a single global image representation conflates distinct cues and conditions all frames on identical visual evidence. We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at this https URL.

2606.23673 2026-06-23 cs.AI cs.LG New submission

PsyBridge: A Hybrid Intelligent Framework for Multi-Dimensional Mental Health Assessment and Decision Support

PsyBridge:一种用于多维心理健康评估与决策支持的混合智能框架

Sunil Wanjari, Manish Thakre, Aayushi Asole, Sharwari Raut, Kwabena Adu-Duodu, Yinhao Li, Stanly Wilson

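Institutions: St. Vincent Pallotti College of Engineering and Technology, Nagpur, India; Government Medical College (GMC), Chandrapur, India; Newcastle University, UK

AI summary: Proposes PsyBridge, a hybrid intelligent framework that integrates clinical screening tools, cognitive evaluation, and personality profiling, using a weighted aggregation mechanism to produce interpretable mental health risk classifications. It reaches an accuracy of 0.84 on a semi-synthetic dataset of 500 patient profiles.

AI-generated Chinese abstract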

心理健康评估通常依赖于孤立的筛查工具或数据驱动模型,这些模型往往缺乏可解释性和多维整合。现有方法常聚焦于抑郁或焦虑等单一指标,对全面且可解释的决策支持有限。为解决此局限,本研究提出PsyBridge,一种混合智能决策支持框架,通过将临床验证的筛查工具、认知评估和人格特征整合到统一架构中,实现多维心理健康评估。该框架采用模块化设计和加权聚合机制,纳入PHQ-9和GAD-7评估以及认知和行为指标,生成可解释的心理健康风险分类和建议。为评估该框架,基于临床评分分布构建了包含500个患者档案(代表不同严重程度)的半合成数据集。实验结果表明,PsyBridge的整体准确率达到0.84,优于单独的PHQ-9和GAD-7评估,同时提高了精确率、召回率和F1分数。敏感性分析和消融研究进一步表明,整合认知和人格成分有助于更稳定的分类性能,并减少中度风险预测中的不一致性。研究结果表明,PsyBridge为AI辅助的心理健康决策支持提供了一种可扩展且可解释的方法,尤其在数字医疗和远程医疗环境中。

English abstract

Mental health assessment commonly relies on isolated screening instruments or data-driven models that often lack interpretability and multi-dimensional integration. Existing approaches frequently focus on individual indicators such as depression or anxiety while providing limited support for comprehensive and explainable decision-making. To address this limitation, this study proposes PsyBridge, a hybrid intelligent decision-support framework designed for multi-dimensional mental health assessment through the integration of clinically validated screening tools, cognitive evaluation, and personality profiling within a unified architecture. The proposed framework incorporates PHQ-9 and GAD-7 assessments alongside cognitive and behavioural indicators using a modular design and a weighted aggregation mechanism to generate interpretable mental health risk classifications and recommendations. To evaluate the framework, a semi-synthetic dataset consisting of 500 patient profiles representing varying severity levels was constructed based on clinically grounded score distributions. Experimental results demonstrate that PsyBridge achieves an overall accuracy of 0.84, outperforming standalone PHQ-9 and GAD-7 assessments while improving precision, recall, and F1-score. Sensitivity analysis and ablation studies further indicate that integrating cognitive and personality components contributes to more stable classification performance and reduces inconsistencies in moderate-risk prediction. The findings suggest that PsyBridge provides a scalable and interpretable approach for AI-assisted mental health decision support, particularly within digital healthcare and telehealth environments.
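
A weighted aggregation of this kind might look like the sketch below. Only the score ranges of PHQ-9 (0 to 27) and GAD-7 (0 to 21) are standard; the weights, the normalized cognitive and personality inputs, the cutoffs, and the function names are hypothetical and are not taken from the paper.

```python
def risk_score(phq9, gad7, cognitive, personality, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted aggregate in [0, 1]; the weights are illustrative, not the paper's."""
    if not (0 <= phq9 <= 27 and 0 <= gad7 <= 21):
        raise ValueError("PHQ-9 must be in 0..27 and GAD-7 in 0..21")
    if not (0.0 <= cognitive <= 1.0 and 0.0 <= personality <= 1.0):
        raise ValueError("cognitive and personality indicators must be in [0, 1]")
    # Normalize each instrument to [0, 1] before weighting.
    components = (phq9 / 27, gad7 / 21, cognitive, personality)
    return sum(w * c for w, c in zip(weights, components)) / sum(weights)

def risk_label(score, cutoffs=(0.33, 0.66)):
    """Map an aggregate score to a coarse class using hypothetical cutoffs."""
    if score < cutoffs[0]:
        return "low"
    if score < cutoffs[1]:
        return "moderate"
    return "high"

score = risk_score(phq9=14, gad7=9, cognitive=0.5, personality=0.3)
print(round(score, 3), risk_label(score))
```

Keeping each component's contribution as an explicit weighted term is one way such a design could stay interpretable, since every term of the sum can be reported alongside the final class.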

2606.23671 2026-06-23 cs.CL New submission

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

LLM 能否可靠地自我报告对抗性预填充,以及如何实现?

Quang Minh Nguyen, Uzair Ahmed, Taegyoon Kim

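Institutions: KAIST

AI summary: Studies whether LLMs can recognize, in safety contexts, that their own outputs were elicited by adversarial prefill attacks. Models claim intent on prefilled outputs 27.3% of the time on average, and the introspective signal stems largely from safety- and refusal-related reasoning. LoRA fine-tuning widens the intention-probe gap but does not transfer to the tampering probe and raises attack success rate on most models.

AI-generated Chinese abstract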

先前研究表明,大型语言模型(LLM)在良性任务上表现出内省能力。我们将问题扩展到安全场景,并考察模型能够识别自身先前响应是由对抗性预填充攻击引发的可靠性。在十个开放权重的指令微调LLM(3B到70B)和四个安全基准上,没有模型能可靠地识别自身受损输出,模型对预填充响应声称意图的平均率为27.3%。内省信号主要来自安全与拒绝相关的推理。将模型权重与拒绝方向正交化,使预填充和自然输出的声称率差距几乎缩小到零,但该方向并非其唯一中介。信号还依赖于探针:将问题表述为内部意图与外部篡改,会在相同模型上引发定性不同的响应。我们在八个3B到27B的模型上测试了三种LoRA微调方法(SFT、GRPO、DPO);所有三种方法在8B到27B的每个模型上都扩大了意图探针差距,方法排名因模型而异。干预措施不会迁移到篡改探针,并且反直觉地提高了大多数模型在对抗性预填充下的攻击成功率,相当于部分缓解。这些发现概述了安全场景中观察到的内省信号的机制,并强调了LLM自我报告可靠性的风险。

English abstract

Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.
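
Orthogonalizing a weight matrix against a direction is a linear projection, sketched below in plain Python. The abstract does not say which matrices are modified or how the refusal direction is estimated, so the function name, the toy matrix, and the direction vector here are assumptions; the code only shows the generic operation W' = W - r rᵀ W for a unit vector r in the output space.

```python
import math

def orthogonalize(weight, direction):
    """Remove the component along `direction` from the outputs of `weight`.

    weight: list of rows (n_out x n_in); direction: vector of length n_out.
    Returns W' = W - r r^T W, where r is `direction` normalized to unit length.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    r = [x / norm for x in direction]
    n_rows, n_cols = len(weight), len(weight[0])
    # proj[j] = (r^T W)[j]: how much column j writes along the direction.
    proj = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[weight[i][j] - r[i] * proj[j] for j in range(n_cols)]
            for i in range(n_rows)]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 3x2 weight matrix
refusal_dir = [1.0, 1.0, 0.0]              # hypothetical direction in the output space
W_orth = orthogonalize(W, refusal_dir)
print(W_orth)
```

After the projection, no input can make this matrix write along the chosen direction, which is the sense in which the direction is ablated from the weights.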

2606.23670 2026-06-23 cs.LG cs.AI cs.CL New submission

Tapered Language Models

锥形语言模型

Reza Bayat, Ali Behrouz, Aaron Courville

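Institutions: Mila; Cornell University; Université de Montréal; CIFAR AI Chair

AI summary: Challenges the uniform allocation of parameters across layers in existing language models and proposes Tapered Language Models (TLMs), which monotonically decrease layer capacity with depth under a fixed budget. Experiments show that tapering MLP width with a cosine schedule improves perplexity and downstream performance.

AI-generated Chinese abstract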

现代语言模型,包括Transformer、循环和基于记忆的变体,共享一个共同框架:一组相同的层,其中参数在深度上均匀分配。这是从原始Transformer继承的默认设置,此后基本保持不变,然而越来越多的证据表明,各层对最终输出的贡献并不均匀,后面的层更多地是细化残差流而非转换它。我们询问参数容量是否应反映这种不对称性。我们的受控实验表明,在固定预算下,将更多容量分配给早期层、更少给后期层,比均匀宽度的基线提高了困惑度,而反向分配则有害。基于这一结果,我们引入了锥形语言模型(TLM),这是一种架构原则,其中参数承载组件在固定总预算下沿深度单调锥形化。MLP是这种实例化的自然场所:它们在现代LM系列中主导参数数量,并将宽度暴露为单一、干净的变异轴。在三种模型规模和四种架构(Transformer、门控注意力、Hope-attention和Titans)上,通过平滑余弦调度锥形化MLP宽度,在无额外参数或计算成本的情况下,一致地提高了困惑度和下游基准性能。这些发现确立了深度感知容量分配作为语言模型设计中一种简单、架构无关的轴,是一个隐藏在显眼处的免费杠杆。

英文摘要

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.
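
The cosine tapering described above can be sketched as follows. This is a minimal illustration, not the authors' schedule: the function name, the `taper` knob, and the rounding rule are assumptions. It also assumes MLP parameter count is linear in hidden width, so keeping the sum of widths fixed keeps the MLP budget fixed.

```python
import math

def cosine_tapered_widths(num_layers, uniform_width, taper=0.5):
    """Per-layer MLP widths that shrink with depth along a half-cosine.

    `taper` is a hypothetical knob: the fractional deviation from the
    uniform width at the first (+taper) and last (-taper) layer.
    """
    if num_layers == 1:
        return [uniform_width]
    raw = []
    for layer in range(num_layers):
        # cos runs from +1 at the first layer to -1 at the last layer.
        c = math.cos(math.pi * layer / (num_layers - 1))
        raw.append(uniform_width * (1.0 + taper * c))
    # Rescale so the total matches the uniform-width budget.
    budget = num_layers * uniform_width
    scale = budget / sum(raw)
    widths = [round(w * scale) for w in raw]
    # Absorb any rounding drift in the first (widest) layer.
    widths[0] += budget - sum(widths)
    return widths

widths = cosine_tapered_widths(12, 2048, taper=0.5)
```

Early layers end up wider than the uniform baseline and late layers narrower, while the total width is unchanged.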

2606.23668 2026-06-23 cs.LG 新提交

On the Limits of Prompt-Conditioned Language Models as General-Purpose Learners

关于提示条件语言模型作为通用学习器的局限性

David Mguni, Julian Ma, Jun Wang

发表机构 * Queen Mary University London(伦敦玛丽女王大学) University College London(伦敦大学学院)

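AI summary: The paper analyzes prompt-conditioned language models through a cheap-talk game model and shows that language, as a capacity-limited channel, makes some tasks indistinguishable and that alignment constraints induce irreducible error, which rules out general-purpose problem solving through prompting alone.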

详情
AI中文摘要

大型语言模型(LLM)常被描述为能够解决任意任务的通用求解器。我们认为这种观点忽略了一个基本约束:语言是用于传达任务信息的压缩且容量有限的接口。将用户-系统交互建模为双层廉价谈话博弈,我们分析了潜在任务如何被编码到提示中,并在对齐和安全约束下被重新解释。我们引入一个概念性分解,将任务推断与执行分离,并推导出PAC-Bayes界限,区分有限样本估计误差与不可约的结构性限制。我们的第一个主要结果建立了一个表达性下限:语言作为容量有限的通信通道,每当任务族的信息复杂度超过通道容量时,不同的任务对求解器而言必然变得不可区分,导致严格为正的误差下限,该误差无法通过额外数据、优化或模型缩放消除。然后,我们建立了一个目标错位下限:当对齐约束限制可接受的输出集时,用户理想分布可能位于可行类之外,导致不可约的失真。综合这些结果,我们得出一个形式化的否定结论:仅通过提示,提示条件LLM并非通用问题求解器,因为存在某些任务族,即使在无限数据情况下,正确行为也无法实现。更广泛地说,我们的分析表明,基于提示的泛化局限性源于信息受限的通信和对齐约束的目标。这表明,超越自然语言的接口,包括多模态观察和外部记忆,可能通过增加系统可用的任务相关信息来减少LLM固有的局限性。

英文摘要

Large Language Models (LLMs) are frequently portrayed as general-purpose solvers capable of solving arbitrary tasks. We argue that this view overlooks a fundamental constraint: language is a compressed and capacity-limited interface for conveying task information. Modelling User-System interaction as a bilevel cheap-talk game, we analyse how latent tasks are encoded into prompts and reinterpreted under alignment and safety constraints. We introduce a conceptual decomposition separating task inference from execution and derive PAC-Bayes bounds that distinguish finite-sample estimation error from irreducible structural limitations. Our first main result establishes an expressivity floor: language acts as a capacity-limited communication channel, and whenever the informational complexity of a task family exceeds the capacity of that channel, distinct tasks become unavoidably indistinguishable to the Solver, inducing a strictly positive error floor that cannot be eliminated by additional data, optimisation, or model scaling alone. We then establish an objective-misalignment floor: when alignment constraints restrict the admissible output set, the User-ideal distribution may lie outside the feasible class, inducing an irreducible distortion. Together, these results yield a formal negative conclusion: prompt-conditioned LLMs are not universal problem solvers through prompting alone, as there exist task families for which correct behaviour is provably unattainable even in the infinite-data regime. More broadly, our analysis shows that the limits of prompt-based generalisation arise from information-constrained communication and alignment-constrained objectives. This suggests that interfaces beyond natural language, including multimodal observations and external memory, may reduce the inherent limitations of LLMs by increasing the task-relevant information available to the System.

2606.23664 2026-06-23 cs.LG cs.MA 新提交

MAS-PromptBench: When Does Prompt Optimization Improve Multi-Agent LLM Systems?

MAS-PromptBench:提示优化何时能提升多智能体LLM系统?

Juyang Bai, Laixi Shi

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Department of Electrical and Computer Engineering, Johns Hopkins University(约翰霍普金斯大学电气与计算机工程系)

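AI summary: A systematic study of prompt optimization in multi-agent LLM systems, finding that it can yield significant performance gains but that the gains depend on configuration such as task, workflow, and team size.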

Comments Project page: this https URL (https://juyangbai.github.io/MAS-PromptBench/); Code: this https URL (https://github.com/juyangbai/MAS-PromptBench)

详情
AI中文摘要

多智能体系统(MAS)为智能体AI提供了一条可扩展的前进道路,包含多个基于LLM的智能体,每个智能体被分配一个系统提示和一个工作流中的位置,该工作流管理智能体间的协调和输出聚合。因此,系统提示构成了一个关键且可访问的优化表面:它们指定了智能体的角色和行为,无需模型微调即可实现系统级改进。尽管提示优化在单个LLM上显示出巨大潜力,但将其扩展到MAS面临独特的挑战,尤其是搜索空间呈指数级增长。目前尚不清楚提示优化是否、何时以及在多大程度上提升MAS性能,以及这些收益对系统配置的敏感程度。在这项工作中,我们系统地研究了跨多种MAS设置(任务、工作流、通信协议和团队规模不同)的系统提示优化,并基准测试了两种自然扩展了最先进单智能体方法的提示优化器。结果揭示了其解锁显著收益的潜力,同时暴露了开放挑战,描述了在不同MAS设置中提示优化何时以及多大程度上有帮助。

英文摘要

Multi-agent systems (MAS) offer a scalable path forward for agentic AI, comprising multiple LLM-based agents, each assigned a system prompt and a position within a workflow that governs inter-agent coordination and output aggregation. System prompts thus form a critical and accessible optimization surface: they specify agents' roles and behaviors, enabling system-level improvements without model finetuning. Although prompt optimization has shown substantial potential for single LLMs, extending it to MAS poses distinct challenges, notably an exponentially growing search space. It remains unclear whether, when, and by how much prompt optimization improves MAS performance, and how sensitive such gains are to system configuration. In this work, we systematically study system-prompt optimization across a broad range of MAS setups varying in task, workflow, communication protocol, and team size, benchmarking two prompt optimizers that naturally extend state-of-the-art single-agent methods. The results reveal its potential to unlock significant gains while exposing open challenges, characterizing when and how much prompt optimization helps across diverse MAS settings.
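
The exponentially growing search space mentioned above can be made concrete with a small calculation. This sketch assumes a simplified setting that is not taken from the paper: each agent chooses independently from the same number of candidate system prompts (10 here, a hypothetical value).

```python
def joint_prompt_space(candidates_per_agent, num_agents):
    """Number of joint system-prompt assignments when every agent
    picks one prompt from `candidates_per_agent` options."""
    return candidates_per_agent ** num_agents

# Hypothetical: 10 candidate prompts per agent, teams of 1 to 8 agents.
sizes = {n: joint_prompt_space(10, n) for n in (1, 2, 4, 8)}
```

Under this assumption, doubling the team size squares the number of joint assignments, which is why exhaustive search quickly becomes impractical.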

2606.23658 2026-06-23 cs.RO 新提交

A Reduced Order Model for Emergent Mechanics in Woven Systems

编织系统中涌现力学的降阶模型

Anvay A. Pradhan, Evgueni T. Filipov, Talia Y. Moore

发表机构 * Department of Mechanical Engineering, University of Michigan(密歇根大学机械工程系) Department of Civil and Environmental Engineering, University of Michigan(密歇根大学土木与环境工程系) Department of Robotics, University of Michigan(密歇根大学机器人学系) Department of Ecology and Evolutionary Biology, Museum of Zoology, University of Michigan(密歇根大学动物学博物馆生态与进化生物学系)

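AI summary: Proposes a reduced-order model that uses nodes and four physically interpretable stiffness elements to capture emergent mechanical behaviors of woven structures such as anisotropic stiffness and shear locking; after calibration it agrees with experiments within 5% and demonstrates capabilities beyond continuum models.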

详情
AI中文摘要

编织结构展现出丰富的力学行为,包括各向异性刚度、剪切诱导锁定和屈曲互换,这些行为纯粹源于单个编织物的几何排列,而非组成材料的性质。现有模型要么均匀化这些相互作用,要么以高昂的计算成本解析它们。我们引入了一种降阶模型,通过节点系统和四个物理可解释的刚度单元(捕捉轴向变形、面内解屈曲、编织物间剪切和摩擦滑动)来弥合这一差距。单位胞元的特征值分析证实,最低能量的变形模式直接对应于已知的编织特定现象,并且每个单元对于完整的运动学和力学描述都是必要的。刚度参数根据经验三点弯曲和剪切数据校准,在不同编织宽度和间距下,一致性在5%以内。然后,应用经过验证的模型展示了超出连续方法能力范围的能力,包括:由屈曲互换引起的涌现泊松响应、渐进式编织物拉拔过程中的阶梯式力减小、三种不同撕裂构型下的应力局部化,以及通过空间梯度编织物刚度的可编程力学各向异性。该框架的物理透明性和计算效率使其成为分析和设计具有可编程力学响应的编织结构材料的实用工具。

英文摘要

Woven structures exhibit rich mechanical behaviors including anisotropic stiffness, shear-induced locking, and crimp interchange that emerge purely from the geometric arrangement of individual weavers rather than from constituent material properties. Existing models either homogenize these interactions or resolve them at prohibitive computational cost. We introduce a reduced-order model that bridges this gap by representing individual weaver interactions through a system of nodes and four physically interpretable stiffness elements capturing axial deformation, in-plane uncrimping, inter-weaver shear, and frictional slip. Eigenvalue analysis of the unit cell confirms that the lowest-energy deformation modes correspond directly to known weave-specific phenomena, and that each element is necessary for a complete kinematic and mechanistic description. Element stiffness parameters are calibrated against empirical three-point bending and shear data, achieving agreement within 5% across varied weaver widths and spacings. The validated model is then applied to demonstrate capabilities beyond the reach of continuum approaches including: the emergent Poisson's response arising from crimp interchange, stepwise force reduction during progressive weaver pullout, stress localization under three distinct tearing configurations, and programmable mechanical anisotropy through spatially graded weaver stiffness. The physical transparency and computational efficiency of the framework position it as a practical tool for the analysis and design of woven architected materials with programmable mechanical response.

2606.23655 2026-06-23 cs.LG cs.DS 新提交

Dynamic estimation of slowly varying sequences

缓慢变化序列的动态估计

Prashant Gokhale, Mikhail Khodak, Sandeep Silwal

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

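AI summary: For slowly varying sequences, proposes a general framework that extends implicit trace estimation to a variety of linear and nonlinear functions, and develops an adaptive-budget algorithm that obtains sharper path-length-style variation bounds.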

Comments Preprint. 14 pages, 4 figures

详情
AI中文摘要

我们考虑逐步逼近缓慢变化序列中每个元素的函数的问题,即位置$i$和$i-1$处元素之差的幅度$\alpha_i$很小。最近关于隐式迹估计的工作表明,当$\alpha_t$很小时,重用对过去序列元素的查询可以降低总成本[Dharangutte \& Musco, NeurIPS~2021; Woodruff et al., NeurIPS~2022]。我们引入了一个框架,将其推广到不同向量空间上的各种线性和非线性函数,获得了矩阵幂、谱密度、蒙特卡洛积分以及来自偏微分方程(PDEs)的边值问题的新颖序列估计结果。此外,我们为该框架开发了一种新算法,该算法根据$\alpha_t$局部缩放估计预算,在估计长度为$m$的序列的成本上获得了更尖锐的路径长度型变分界$\mathcal O(\sum_{i=1}^m\alpha_i)$。这改进了先前隐式迹估计的界$\mathcal O(m\cdot\max_i\alpha_i)$[Dharangutte \& Musco, NeurIPS~2021],后者通过使用最坏情况$\alpha_i$固定查询预算实现,因此对于具有罕见突发的稳定序列效率低下。最后,虽然所有过去的工作都假设已知$\alpha_i$的界,但我们展示了在某些情况下如何以(几乎)无额外成本的方式在线估计变化。总之,我们的框架使序列逼近工具包通用且自适应,同时改进了动态迹估计的最新保证。

英文摘要

We consider the problem of sequentially approximating functions of each element in a slowly varying sequence, i.e., one where the magnitude $\alpha_i$ of the difference between the elements at positions $i$ and $i-1$ is small. Recent work on implicit trace estimation shows that when $\alpha_i$ is small, reusing queries to past sequence elements can reduce the overall cost [Dharangutte & Musco, NeurIPS 2021; Woodruff et al., NeurIPS 2022]. We introduce a framework generalizing this to a variety of linear and nonlinear functions on diverse vector spaces, obtaining novel sequential estimation results for matrix powers, spectral densities, Monte Carlo integration, and a boundary value problem from partial differential equations (PDEs). Furthermore, we develop a novel algorithm for use with this framework that locally scales the estimation budget with $\alpha_i$, obtaining sharper path-length-style variation bounds of the form $\mathcal O(\sum_{i=1}^m\alpha_i)$ on the cost of estimating a sequence of length $m$. This improves upon the previous implicit trace estimation bound of $\mathcal O(m\cdot\max_i\alpha_i)$ [Dharangutte & Musco, NeurIPS 2021], which is achieved by fixing the query budget using the worst-case $\alpha_i$ and is thus inefficient for stable sequences with rare bursts. Lastly, while all past work assumes a known bound on $\alpha_i$, we show in certain cases how the changes can be estimated on-the-fly with (nearly) no added cost. In summary, our framework makes the sequential approximation toolkit general-purpose and adaptive while improving upon state-of-the-art guarantees for dynamic trace estimation.
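
To see why a path-length bound helps on stable sequences with rare bursts, the two quantities inside the big-O expressions can be compared on a toy sequence. The sequence below is invented for illustration, and the comparison ignores the constants and other factors that the asymptotic bounds hide.

```python
def path_length_cost(alphas):
    """Quantity inside the adaptive bound: the sum of all alpha_i."""
    return sum(alphas)

def worst_case_cost(alphas):
    """Quantity inside the fixed-budget bound: m times the largest alpha_i."""
    return len(alphas) * max(alphas)

# Hypothetical sequence: 1000 steps, mostly tiny changes, three bursts.
alphas = [0.01] * 1000
for burst in (100, 500, 900):
    alphas[burst] = 1.0

path = path_length_cost(alphas)
worst = worst_case_cost(alphas)
```

Here the three bursts set the worst-case term for every step, while the path-length term only pays for them where they occur.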

2606.23654 2026-06-23 cs.CL cs.SE 新提交

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

EnterpriseClawBench: 从真实工作会话中基准测试智能体

Jincheng Zhong, Weizhi Wang, Che Jiang, Kai Tian, Zhenzhao Yuan, Junlin Yang, Dianqiao Lei, Kaiyan Zhang

发表机构 * Horizon Research, Frontis.AI(地平线研究,Frontis.AI)

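AI summary: Proposes EnterpriseClawBench, which builds 852 reproducible tasks from real enterprise work sessions to evaluate agents on file reading, tool invocation, and artifact delivery; the best configuration scores only 0.663.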

详情
AI中文摘要

企业智能体越来越多地在工作空间内运行:它们读取异构文件、调用工具并交付业务工件。我们引入了EnterpriseClawBench,这是一个从专有的真实世界智能体会话构建的企业智能体基准。从大量工作会话档案出发,EnterpriseClawBench生成了852个可复现任务,每个任务都配有恢复的夹具、重写的提示、角色类别、技能子类别、硬规则和语义评分标准。由于会话包含企业内部内容,我们不发布基准数据;相反,我们的可重用贡献是构建和评估协议。在EnterpriseClawBench上,最佳配置仅达到0.663(Codex with GPT-5.5)。这些结果表明,企业智能体评估必须报告工具-模型组合、工件交付、视觉质量、成本、运行时间和技能迁移行为,而不是将性能压缩为单一分数。代码:此 https URL

英文摘要

Enterprise agents increasingly operate inside workspaces: they read heterogeneous files, invoke tools, and deliver business artifacts. We introduce EnterpriseClawBench, an enterprise agent benchmark constructed from proprietary, real-world agent sessions. Starting from a large archive of workplace sessions, EnterpriseClawBench produces 852 reproducible tasks, each paired with recovered fixtures, rewritten prompts, role classes, skill subclasses, hard rules, and semantic rubrics. Because the sessions contain internal enterprise content, we do not release the benchmark data; instead, our reusable contribution is the construction and evaluation protocol. On EnterpriseClawBench, the best configuration reaches only 0.663 (Codex with GPT-5.5). These results show that enterprise agent evaluation must report harness-model combinations, artifact delivery, visual quality, cost, runtime, and skill-transfer behavior, rather than collapsing performance into a single score. Code: this https URL

2606.23653 2026-06-23 cs.CV 新提交

Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images

轻量级神经框架:从多视图图像稳健估计3D体积和表面积

Diego E. Farchione, Ramzi Idoughi, Peter Wonka

发表机构 * King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学)

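AI summary: Proposes a fully feed-forward framework that fuses 3D point clouds with 2D features through a graph decoder and regresses scale-normalized volume, surface area, and their uncertainties directly from multi-view images, outperforming existing methods with sparse views.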

详情
AI中文摘要

准确的体积和表面积估计对于从海洋生态学到医学诊断等多种应用至关重要。然而,现有方法通常计算成本高,且在稀疏和噪声数据下性能较差。我们提出了一种全前馈框架,直接从多视图图像回归尺度归一化的体积和表面积及其相关的不确定性。通过基于图的解码器将3D点云重建与视图对齐的2D特征融合,我们的模型绕过了迭代优化,确保了卓越的可扩展性和快速推理。实验结果表明,我们的方法在输入图像数量较少时尤其优于最先进的方法。经过珊瑚监测、饮食分析和人体测量学的验证,我们提出的框架为定量形状分析提供了稳健、自适应的解决方案。该架构为从视觉数据精确估计几何参数提供了高速、可扩展的替代方案,即使在资源受限或稀疏视图场景下也能保持高性能。

英文摘要

Accurate volume and surface area estimation is critical for diverse applications, from marine ecology to medical diagnostics. However, existing methods often suffer from high computational costs and poor performance with sparse and noisy data. We propose a fully feed-forward framework that regresses scale-normalized volume and surface area and their associated uncertainties directly from multi-view images. By fusing 3D point cloud reconstructions with view-aligned 2D features through a graph-based decoder, our model bypasses iterative optimization, ensuring exceptional scalability and rapid inference. Experimental results demonstrate that our approach outperforms state-of-the-art methods, particularly when operating with a low number of input images. Validated across coral monitoring, dietary analysis, and anthropometry, our proposed framework provides a robust, adaptable solution for quantitative shape analysis. This architecture provides a high-speed, scalable alternative for precise geometric estimation from visual data, maintaining high performance even in resource-constrained or sparse-view scenarios.

2606.23643 2026-06-23 cs.AI 新提交

TailorMind: Towards Preference-Aligned Multimodal Content Generation

TailorMind: 面向偏好对齐的多模态内容生成

Hengji Zhou, Ye Liu, Yufeng Liu, Si Wu, Lianghao Xia, Liqiang Nie

发表机构 * South China University of Technology(华南理工大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

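AI summary: Proposes the TailorMind framework, which models user preferences with hypergraph collaborative filtering and textual gradient descent and combines retrieval-augmented style control with cross-modal cohesion reflection to generate personalized multimodal content without a candidate pool, outperforming baselines on TailorBench.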

Comments 18 pages, 13 figures, 6 tables. Code available at this https URL (https://github.com/iLearn-Lab/TailorMind)

详情
AI中文摘要

个性化内容系统依赖于可用的用户生成内容(UGC),当合适的内容缺失、延迟或创建成本高昂时,系统会陷入困境。尽管多模态生成器可以按需合成内容,但如何将行为痕迹转化为生成就绪的偏好仍未被充分探索。我们研究个性化多模态内容生成:在没有现有项目池或等待匹配UGC的情况下,创建用户定制的多模态内容。我们提出TailorMind,将协同偏好建模与可控多模态生成联系起来。TailorMind通过超图协同过滤丰富稀疏的用户历史,并利用排序误差反馈和文本梯度下降优化文本画像。检索增强风格控制将输出锚定在真实的UGC模式中,而跨模态一致性反射减少了语义漂移。我们构建了TailorBench,一个来自三个主流平台的基准,从五个维度进行评估:连贯性、新颖性、美学性、幻觉性、画像性。实验表明,TailorMind在连贯性上达到或超过代表性生成基线和真实UGC,提高了新颖性和美学质量,展示了优于检索可用内容或可比UGC的优势,同时在重排序中实现了高达29%的召回率提升。我们的代码发布在:this https URL。

英文摘要

Personalized content systems depend on available UGC and struggle when suitable content is absent, delayed, or costly to create. Although multimodal generators can synthesize content on demand, how to translate behavioral traces into generation-ready preferences remains underexplored. We study personalized multimodal content generation: creating user-tailored multimodal content without existing item pools or waiting for matching UGC. We propose TailorMind, linking collaborative preference modeling with controllable multimodal generation. TailorMind enriches sparse user histories via hypergraph collaborative filtering and optimizes textual profiles with ranking-error feedback and textual gradient descent. Retrieval-augmented style control grounds outputs in authentic UGC patterns, while cross-modal cohesion reflection reduces semantic drift. We construct TailorBench, a benchmark from three mainstream platforms evaluated along five dimensions: coherence, novelty, aesthetic, hallucination, profiling. Experiments show that TailorMind achieves competitive or stronger coherence, improves novelty and aesthetic quality over representative generation baselines and ground-truth UGC, demonstrating advantages over retrieving available content or comparable UGC, while achieving up to 29% Recall gains in reranking. Our code is released at: this https URL.

2606.23641 2026-06-23 cs.RO 新提交

Flatness Preserves Instruction Following in Vision-Language-Action Models

平坦性保持视觉-语言-动作模型中的指令遵循能力

Haochen Zhang, Yonatan Bisk

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

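AI summary: To address VLA models ignoring language instructions after finetuning, proposes flatness-preserving optimization (SAM), which improves instruction following by over 60% without additional data or architectural changes.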

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用预训练的视觉-语言表示具有开放世界泛化的潜力,但在有限机器人数据上的下游微调通常会降低这些表示,导致策略脆弱,忽视语言指令而偏好视觉捷径,我们将这种失败模式称为指令盲。我们假设在有限数据上的标准微调将梯度应用于稀疏的点集,这表现为具有高曲率最小值的尖锐损失景观。我们提出直接在相同数据上进行微调时通过平坦性保持优化来解决这个问题,其中学习更平坦的景观使模型对权重空间中的扰动更加鲁棒。具体来说,我们证明在VLA微调期间简单地应用锐度感知最小化(SAM)可以在多个模拟和真实世界基准测试中将指令遵循率提高超过60%,而无需额外数据、架构修改或重新训练。我们进一步分析了选择性锐度的影响,量化了其效果,并表明我们的方法与现有指导技术互补。项目页面可在此https URL找到。

英文摘要

Vision-language-action (VLA) models have the potential for open-world generalization by leveraging pretrained vision-language representations, yet downstream finetuning on limited robot data often degrades these representations, leading to brittle policies that ignore language instructions in favor of visual shortcuts, a failure mode we term instruction blindness. We hypothesize that standard finetuning with limited data applies gradients to a sparse set of points, which manifests as a sharp loss landscape with high-curvature minima. We propose to address this directly through flatness-preserving optimization while finetuning on the exact same data, where learning a flatter landscape results in a model more robust to perturbations in the weight space. Specifically, we demonstrate that simply applying sharpness-aware minimization during VLA finetuning significantly improves instruction following by over 60% across multiple simulation and real-world benchmarks without additional data, architectural modification, or retraining. We further analyze the effect of selective sharpness, quantify its effects, and show that our approach is complementary to existing guidance techniques. Project page can be found at this https URL.
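
For readers unfamiliar with sharpness-aware minimization (SAM), the two-gradient update it performs can be sketched on a toy quadratic. This is a generic illustration of the standard SAM step, not the authors' VLA training code; the loss, learning rate, and `rho` are arbitrary choices.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update: climb to a nearby worst-case point, then descend
    from the original weights using the gradient measured there."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # guard against zero gradient
    # Perturb the weights by rho along the normalized gradient direction.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    # Apply the perturbed-point gradient to the unperturbed weights.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def loss(w):
    # Toy loss with one sharp and one flat direction.
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    return [10.0 * w[0], 0.1 * w[1]]

w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w, grad)
final_loss = loss(w)
```

Each step costs two gradient evaluations, which is the usual price of SAM relative to plain gradient descent.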

2606.23640 2026-06-23 cs.LG cs.AI cs.RO stat.ML 新提交

Learning Process Rewards via Success Visitation Matching for Efficient RL

通过成功访问匹配学习过程奖励以实现高效强化学习

Raymond Tsao, Andrew Wagenmaker, Sergey Levine

发表机构 * UC Berkeley(加州大学伯克利分校)

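AI summary: To address the credit assignment problem caused by sparse rewards, proposes training a discriminator to separate successful from unsuccessful trajectories and using it to produce a dense process reward that matches success visitations, which speeds up finetuning of robot control policies without changing the optimal policy.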

详情
AI中文摘要

在强化学习的许多现代应用中,任务的自然奖励本质上是稀疏的:除了任务完成时给予+1奖励外,其他所有地方都给予0奖励。训练策略最大化这种稀疏奖励需要解决具有挑战性的信用分配问题,导致RL改进缓慢或无效。我们提出了一种简单的方法,将稀疏的结果奖励转化为密集的过程奖励。我们的方法依赖于训练一个判别器来区分先前成功和失败的回合,并使用该判别器激励RL学习策略匹配成功回合的状态-动作访问,同时避免失败回合的访问。通过激励策略匹配所有状态(而不仅仅是那些对应任务成功的状态)的访问,该奖励提供了关于任务完成是否取得进展的密集反馈,并且我们证明这在不改变最优策略的情况下是可实现的。专注于机器人控制策略的微调,我们展示了与简单最大化稀疏结果奖励相比,我们的方法在模拟和真实世界的操作任务上都能显著加快RL微调性能。

英文摘要

In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: a reward of 0 is given everywhere except when the task is completed, when a reward of +1 is given. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes, while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and, we show, provably achieves this without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning performance on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.
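
One common way to turn a discriminator output into a dense reward is the logit of its success probability. The sketch below uses that form as an assumption; the abstract does not state the exact reward the authors derive, and the function name and clipping constant are hypothetical.

```python
import math

def process_reward(d_success, eps=1e-6):
    """Dense reward for one (state, action) pair, given the discriminator's
    probability that the pair came from a successful episode.

    Uses the logit log(d) - log(1 - d); this form is an assumption,
    not necessarily the reward used in the paper.
    """
    d = min(max(d_success, eps), 1.0 - eps)  # clip to keep the logs finite
    return math.log(d) - math.log(1.0 - d)

reward_likely_success = process_reward(0.9)
reward_uncertain = process_reward(0.5)
reward_likely_failure = process_reward(0.1)
```

Under this form, state-action pairs that look like successful episodes earn positive reward at every step, and pairs that look like failures earn negative reward.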

2606.23637 2026-06-23 cs.LG math.OC 新提交

Muown Implicitly Performs Angular Step-size Decay

Muown 隐式执行角步长衰减

Florian Hübler, Kai Lion, Antonio Orvieto, Niao He

发表机构 * ETH Zurich(苏黎世联邦理工学院) ELLIS Institute Tübingen(蒂宾根ELLIS研究所) MPI-IS Tübingen AI Center(马克斯·普朗克智能系统研究所蒂宾根人工智能中心)

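AI summary: Shows that the directional update of the Muown optimizer is equivalent to a Riemannian step on the normalized directions, with the magnitude of the un-normalized parameters only modulating the angular step size; based on this, proposes AngularMuown with an explicit angular step size, which improves over Muown in the modded nanoGPT speedrunning competition and scales to models such as Qwen2-0.5B.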

详情
AI中文摘要

诸如 Muon 和 Muown 等矩阵感知优化器最近在预训练 Transformer 方面显示出强大的经验性能。特别是,Muown 将每个权重矩阵分解为行幅度和未归一化的方向变量,使用 Adam 更新前者,使用 Muon 更新后者。我们证明 Muown 的方向更新等价于归一化方向上的黎曼步长,而未归一化参数化的幅度仅调节角步长。这解释了 Muown 的步长稳定性,并建议使角步长显式化。由此产生的方法 AngularMuown 直接优化归一化方向,并使用与径向幅度更新解耦的可调度角乘数。AngularMuown 优于 Muown,并且在撰写本文时,一个初步版本在 modded nanoGPT 速度竞赛的每个优化器类别中领先。在 Qwen2-0.5B 和 1.1B 参数混合专家模型上的进一步实验证实该算法可扩展到小模型之外。该算法的实现可在此 https URL 获取。

英文摘要

Matrix-aware optimizers such as Muon and Muown have recently shown strong empirical performance for pre-training Transformers. In particular, Muown separates each weight matrix into row magnitudes and an un-normalized direction variable, updating the former with Adam and the latter with Muon. We show that the directional update of Muown is equivalent to a Riemannian step on the normalized directions, while the magnitude of the un-normalized parameterization only modulates the angular step size. This explains the step-size stability of Muown and suggests making the angular step size explicit. The resulting method, AngularMuown, optimizes directly over the normalized directions and uses a schedulable angular multiplier decoupled from the radial magnitude update. AngularMuown improves over Muown and, at the time of writing, a preliminary version is leading the per-optimizer category of the modded nanoGPT speedrunning competition. Further experiments on Qwen2-0.5B and 1.1B-parameter mixture-of-experts models confirm that the algorithm scales beyond small models. An implementation of the algorithm is available at this https URL
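
The idea of an explicit angular step on a normalized direction can be illustrated with a geodesic step on the unit sphere. This is a generic per-vector sketch that uses the raw gradient as the search direction; it is an assumption for illustration and does not reproduce Muown or AngularMuown, which use Muon-style matrix updates.

```python
import math

def angular_step(w, grad, angle):
    """Rotate the unit vector `w` by `angle` radians along the descent
    geodesic defined by the tangential part of `grad`."""
    dot = sum(wi * gi for wi, gi in zip(w, grad))
    # Remove the radial component so the direction is tangent to the sphere.
    tangent = [gi - dot * wi for wi, gi in zip(w, grad)]
    norm = math.sqrt(sum(t * t for t in tangent))
    if norm == 0.0:
        return list(w)  # gradient is purely radial, no rotation to apply
    u = [t / norm for t in tangent]
    # Move along the great circle away from the gradient direction.
    return [math.cos(angle) * wi - math.sin(angle) * ui for wi, ui in zip(w, u)]

w = [1.0, 0.0, 0.0]
w_new = angular_step(w, [0.3, 2.0, 0.0], angle=0.1)
```

The step size is the rotation angle itself, so the vector stays on the unit sphere and the update magnitude does not depend on the gradient norm.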

2606.23634 2026-06-23 cs.CV 新提交

Pose Anything Anywhere: Model-free Object Poses from Arbitrary References

Pose Anything Anywhere: 任意参考图像下的无模型物体姿态估计

Hongli Xu, Jiaqi Hu, Junwen Huang, Boyang Zhong, Peter KT Yu, Nassir Navab, Benjamin Busam, Slobodan Ilic

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Siemens AG(西门子股份公司) XYZ Robotics ROBOX

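AI summary: Proposes PANY, a unified model-free framework that learns view-consistent geometry and cross-view alignment with a multi-view transformer, supports single or sparse reference views and RGB-D input, and improves pose accuracy by 12% on YCB-V and over 20% on LM-O.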

Comments Accepted to ECCV 2026

详情
AI中文摘要

估计未见物体的6D姿态是开放世界机器人和具身感知中的一个基本但具有挑战性的问题。基于模型的方法准确但依赖于CAD资产或繁重的初始化,而大多数无模型方法仍局限于成对单锚匹配,因此在遮挡、大视角变化和低查询-参考重叠下失败。因此,我们提出PANY,一个统一的无模型框架,无缝支持RGB和RGB-D输入,基于一个或多个无姿态的稀疏参考视图,并有效泛化到新物体。PANY基于多视图变换器几何骨干,通过学习在宽基线和有限重叠下保持稳定的视图一致几何和跨视图对齐线索,超越了成对匹配。当额外的无姿态辅助视图可用时,PANY通过姿态图规范注册聚合它们以增加几何覆盖并强化最终姿态。大量实验表明,PANY在多个基准上实现了最先进的性能,显著优于现有无模型方法,在YCB-V上姿态精度提升12%,在LM-O上提升超过20%。此外,PANY在单参考和稀疏参考设置下均表现良好,在真实环境中展现出强鲁棒性。

英文摘要

Estimating the 6D pose of unseen objects is a fundamental yet challenging problem for open-world robotics and embodied perception. Model-based methods are accurate but depend on CAD assets or heavy onboarding, while most model-free approaches are still limited to pairwise single-anchor matching and thus fail under occlusion and large viewpoint changes with low query-reference overlap. We therefore present PANY, a unified model-free framework that supports both RGB and RGB-D inputs, operates on one or a sparse set of pose-free reference views, and generalizes effectively to novel objects. Built on a multi-view transformer geometry backbone, PANY moves beyond pairwise matching by learning view-consistent geometry and cross-view alignment cues that remain stable under wide baselines and limited overlap. When additional unposed assist views are available, PANY aggregates them via pose-graph canonical registration to increase geometric coverage and reinforce the final pose. Extensive experiments show that PANY achieves state-of-the-art performance across multiple benchmarks, substantially outperforming existing model-free methods and improving pose accuracy by 12% on YCB-V and by over 20% on LM-O. Furthermore, PANY performs consistently well under both single-reference and sparse-reference settings, demonstrating strong robustness in real-world environments.

2606.23631 2026-06-23 cs.AI 新提交

AI-driven Optimisation of Quality of Recovery (QoR) in Remote Patient Monitoring

AI驱动的远程患者监测中恢复质量(QoR)优化

Yansong Liu, Li-Hsi (Sonny) Lin, Pramit Khetrapal, Ronnie Stafford, John Kelly, Ivana Drobnjak

发表机构 * Polish Institute for Evidence Based Medicine(波兰循证医学研究所)

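AI summary: To address low daily completion of the QoR-15 questionnaire in remote monitoring (only 55% of patients submitted it on more than 14 of 30 days), proposes QoR-compact, a five-item daily input selected by evaluating all 3,003 five-item subsets; its AUC-ROC for predicting postoperative recovery severity is 0.968, comparable to the full questionnaire, and it tracks readmission events as faithfully.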

详情
AI中文摘要

远程患者监测依赖于患者报告的数据来捕捉设备无法测量的恢复主观维度。恢复质量(QoR-15)调查问卷是此目的的黄金标准工具。它最初设计并验证用于偶尔的院内评估,但远程监测现在每天向患者发放。在我们自己的术后部署中,只有55%的患者在30天的监测中提交了超过14天的调查。我们开发了QoR-compact,一个用于RPM预测路径的五项每日输入。设定部署驱动的目标为每日项的三分之一,我们穷举评估了QoR-15的所有3003个五问题子集,并测试了其中最佳子集在预测近期术后恢复严重程度方面是否与完整工具匹配。QoR-compact实现了平均AUC-ROC为0.968(95% CI 0.915-0.988),与使用三分之一项获得的0.964基线统计上相当。患者级回测表明,它像完整表格一样忠实地跟踪再入院事件。其五个项目涵盖了恢复的身体和心理轴:Q3(感觉休息好)、Q9(感觉舒适和有控制感)、Q10(总体幸福感)、Q12(严重疼痛)和Q14(感到担忧或焦虑)。QoR-15仍然是恢复的黄金标准测量;QoR-compact作为为预测设计的更短每日输入对其进行补充。这种对等性为前瞻性研究提供了基础,即较轻的每日输入是否反过来能更一致地完成。在临床使用前需要在更大队列上进行外部验证。

英文摘要

Remote patient monitoring depends on patient-reported data to capture the subjective dimension of recovery that devices cannot measure. The Quality of Recovery (QoR-15) survey is the gold-standard instrument for this purpose. It was designed and validated for occasional in-hospital assessment, yet remote monitoring now administers it to patients daily. In our own post-surgical deployment, only 55% of patients submitted the survey on more than 14 of the 30 monitoring days. We developed QoR-compact, a five-item daily input for the remote patient monitoring (RPM) prediction pathway. Setting a deployment-driven target of one-third of the daily items, we exhaustively evaluated all 3,003 five-question subsets of the QoR-15 and tested whether the best of them matches the full instrument in predicting near-term postoperative recovery severity. QoR-compact achieves a mean AUC-ROC of 0.968 (95% CI 0.915-0.988), statistically comparable to the 0.964 baseline while using one-third of the items. Patient-level backtesting indicates that it tracks readmission events as faithfully as the full form. Its five items span the physical and psychological axes of recovery: Q3 (feeling rested), Q9 (feeling comfortable and in control), Q10 (general well-being), Q12 (severe pain), and Q14 (feeling worried or anxious). The QoR-15 remains the gold-standard measure of recovery; QoR-compact complements it as a shorter daily input designed for prediction. This parity provides the basis for a prospective study of whether a lighter daily input is, in turn, completed more consistently. External validation on larger cohorts is required before clinical use.
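
The figure of 3,003 subsets follows from choosing 5 of the 15 QoR-15 items. The sketch below only enumerates the candidate subsets and checks that the reported item set is among them; the per-subset scoring step (fitting a predictor and computing AUC-ROC) is not shown because the abstract does not specify it.

```python
from itertools import combinations
from math import comb

# QoR-15 items are numbered Q1 to Q15.
items = range(1, 16)
subsets = list(combinations(items, 5))

# The five items reported for QoR-compact: Q3, Q9, Q10, Q12, Q14.
selected = (3, 9, 10, 12, 14)
num_subsets = len(subsets)
```

With only a few thousand candidates, exhaustive evaluation is feasible and avoids the need for a greedy or heuristic item-selection procedure.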

2606.23626 2026-06-23 cs.LG cs.AI 新提交

DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

DiT-Reward:文本到图像奖励建模的生成式表示

Yuanming Yang, Guoqing Ma, Bo Wang, Yuan Zhang, Wei Tang, Chenyi Li, Haoyang Huang, Nan Duan

发表机构 * JD Explore Academy, JD.com(京东探索研究院,京东) Tsinghua University(清华大学) Beijing Institute of Technology(北京理工大学) Peking University(北京大学)

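AI summary: Proposes DiT-Reward, which uses the generative representations of a pretrained text-to-image Diffusion Transformer for reward prediction, outperforming HPSv3 on multiple preference benchmarks with a 1.65x inference speedup.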

详情
AI中文摘要

图像生成所学习的表示能否也用于评估生成的图像?我们研究文本到图像奖励预测作为生成式表示学习的一个下游任务。为此,我们引入DiT-Reward,通过处理近乎干净的图像潜在变量并聚合跨Transformer层的文本条件图像表示,将预训练的文本到图像扩散Transformer转换为奖励模型。在与HPSv3相同的训练数据混合下,DiT-Reward在所有四个评估的偏好基准上均优于HPSv3,在HPDv2上达到85.6%,在HPDv3上达到77.6%。当生成骨干网络被冻结时,一个轻量级的学习头仍能从其表示中提取有意义的偏好预测。跨深度的探测进一步揭示,下游奖励性能在中间到后期层中最强,并且受益于结合不同阶段的表示。我们还观察到与生成骨干网络容量一致的正向缩放。最后,当用于使用Flow-GRPO优化Stable Diffusion 3.5 Large时,DiT-Reward在匹配的训练轨迹上优于HPSv3,在逼真度方面尤其明显。直接潜在评分还实现了比HPSv3快1.65倍的推理速度,且峰值内存相当。这些结果表明,预训练的生成式DiT为奖励建模和策略优化提供了可迁移的表示。

英文摘要

Can representations learned for image generation also support the evaluation of generated images? We study text-to-image reward prediction as a downstream task of generative representation learning. To this end, we introduce DiT-Reward, which converts a pretrained text-to-image Diffusion Transformer into a reward model by processing near-clean image latents and aggregating text-conditioned image representations across transformer layers. Under the same training data mixture as HPSv3, DiT-Reward outperforms HPSv3 on all four evaluated preference benchmarks, reaching 85.6% on HPDv2 and 77.6% on HPDv3. When the generative backbone is frozen, a lightweight learned head can still extract meaningful preference predictions from its representations. Probing across depth further reveals that downstream reward performance is strongest in the middle-to-late layers and benefits from combining representations across different stages. We also observe consistent positive scaling with generative backbone capacity. Finally, when used to optimize Stable Diffusion 3.5 Large with Flow-GRPO, DiT-Reward outperforms HPSv3 along the matched training trajectory, with particularly clear gains in realism. Direct latent scoring also achieves a 1.65x inference speedup over HPSv3 with comparable peak memory. These results show that pretrained generative DiTs provide transferable representations for reward modeling and policy optimization.

2606.23625 2026-06-23 cs.RO 新提交

Learning to See While Learning to Act: Diffusion Models for Active Perception in Robot Imitation

学习行动的同时学习观察:机器人模仿中的主动感知扩散模型

Kuancheng Wang, Vaibhav Saxena, Shuo Cheng, Yotto Koga, Danfei Xu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Autodesk Research(欧特克研究院) NVIDIA(英伟达)

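AI summary: Proposes See2Act, an imitation learning method that couples action denoising with viewpoint refinement and predicts actions at test time conditioned on a sequence of actively inferred viewpoints, addressing partial observability under occlusion and improving performance by up to 34% on RLBench tasks.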

Comments Project website: this http URL (http://see2act.github.io)

详情
AI中文摘要

大多数模仿学习方法假设桌面环境完全可观测。实际上,物体经常被遮挡,要求机器人同时搜索和行动,而从有限的示范中学习这种耦合行为仍然具有挑战性。我们提出See2Act,一种模仿学习方法,通过将动作去噪与视角优化相结合,在测试时基于主动推断的视角序列来条件化动作预测。该策略使用锚定于离线示范中关键帧动作的相机位姿进行训练,从而在学习如何行动的同时隐式学习观察位置。我们实验证明,在Ravens中,该策略在严重遮挡下恢复信息丰富的视角,在RLBench任务上,相比先前方法性能提升高达34%。在真实世界中,我们在数字孪生中收集了50个示范,并在使用深度观测的拾取和放置任务上实现了零样本的仿真到现实迁移。该策略处理了显著的遮挡,表明学习到的视角推理能够在部分可观测性下实现鲁棒的操作。

英文摘要

Most imitation learning methods assume full observability in table-top settings. In practice, objects are often occluded, requiring robots to both search and act, and learning this coupled behavior from limited demonstrations remains challenging. We propose See2Act, an imitation learning approach that conditions action prediction on a sequence of actively-inferred viewpoints at test time, by coupling action denoising with viewpoint refinement. The policy is trained using camera poses anchored to keyframe actions from offline demonstrations, enabling implicit learning of where to see, while learning how to act. We empirically demonstrate that in Ravens the policy recovers informative viewpoints under severe occlusions, and on RLBench tasks it improves performance by up to 34% over prior methods. In the real world, we collect 50 demonstrations in a digital twin and achieve zero-shot sim-to-real transfer on pick-and-place tasks using depth observations. The policy handles significant occlusions, showing that learned viewpoint reasoning enables robust manipulation under partial observability.

2606.23617 2026-06-23 cs.RO cs.AI cs.LG New submission

RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models

Ulas Berk Karli, Tesca Fitzgerald

Affiliations * Yale University

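AI summary Proposes active, uncertainty-guided data collection for fine-tuning VLA models, which is more efficient than passive imitation learning, and evaluates continual learning techniques to counter the catastrophic forgetting that arises when fine-tuning only on recovery data.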

Details
English abstract

Vision-Language-Action (VLA) models are commonly fine-tuned through passive imitation learning, where additional demonstrations are collected for tasks where the policy performs poorly. This approach incurs several downsides: it requires the robot to fail before data collection is triggered, provides little guidance about which states require supervision, and wastes demonstrator effort on redundant parts of the task where the policy already performs well. In this paper, we propose an active, continual learning paradigm for VLAs. We demonstrate that active, uncertainty-guided data collection leads to more efficient fine-tuning than when using passively-collected demonstrations. However, we also find that fine-tuning only on actively-collected recovery data leads to catastrophic forgetting. We evaluate techniques for continual learning, including replay-based data mixing and elastic weight consolidation, and identify tradeoffs between plasticity to uncertainty-guided recovery data and retention of previously learned behaviors. Overall, our work contributes an empirical study of active continual learning for autoregressive VLAs, establishing that uncertainty-guided recovery demonstrations can improve adaptation efficiency while also revealing open challenges when targeted new data is incorporated into large robot policies.

2606.23615 2026-06-23 cs.CV cs.LG New submission

Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark

Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang, Evan Shelhamer, Mathias Lécuyer, Joséphine Gantois

Affiliations * UBC CS; UBC IRES & FRE; INSAT Tunisia; Vector Institute

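AI summary Proposes the Hedgementation benchmark for evaluating machine learning models on country-scale hedgerow mapping from remote sensing data at 10m² resolution. It combines multiple remote sensing data products with labels from a French hedgerow inventory and tests how well three baseline models generalize across spatial distance and climatic zones.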

Details
English abstract

We propose Hedgementation: a new benchmark to evaluate machine learning models for hedgerow mapping from remote sensing data at country scale and 10m$^2$ spatial resolution. We combine and harmonize multiple remote sensing data products and ground truth labels sourced from a hedgerow inventory in France. We measure the ability of three baseline models to generalize across spatial distance and across climatic zones, the latter being a more explicitly challenging task. Our benchmark tests both supervised and self-supervised learning approaches for remote sensing, applied to tracking fine-scale features of high agricultural importance. The code to reproduce the benchmark and baseline results is available at this https URL.