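arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.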

2605.23733 2026-06-19 cs.RO cs.AI Version update

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

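Ming Yang, Tao Yu, Feng Li, Hua Chen

Affiliations: LimX Dynamics

AI summary: Proposes the Any2Any paradigm, which uses kinematic alignment and dynamics fine-tuning to transfer pretrained whole-body tracking models efficiently to new humanoid embodiments, reaching competitive tracking performance with only a small amount of data and compute.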

Comments Project Page: https://any2any.top/

Abstract

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pretrained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.
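
The dynamics-adaptation step can be pictured with a minimal Python sketch. The abstract does not say which PEFT component Any2Any uses, so the low-rank adapter below (in the style of LoRA) is an assumption chosen for illustration, as are the names `LowRankAdapter` and `matvec`, the rank, and the initialisation. The pretrained weight stays frozen and only the two small matrices would be trained.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class LowRankAdapter:
    """Frozen linear layer plus a trainable low-rank correction (hypothetical)."""

    def __init__(self, weight, rank=2, scale=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weights, never updated
        # Trainable down-projection with small random initial values.
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                     for _ in range(rank)]
        # Trainable up-projection, zero-initialised so the correction starts at 0.
        self.up = [[0.0] * rank for _ in range(out_dim)]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_fraction(self):
        """Adapter parameter count relative to the frozen weight count."""
        rank = len(self.down)
        out_dim, in_dim = len(self.weight), len(self.weight[0])
        return rank * (in_dim + out_dim) / (out_dim * in_dim)
```

Because `up` starts at zero, the adapted layer reproduces the pretrained layer exactly before any fine-tuning, which is one way to keep the behavioral prior intact. For a 64x64 layer at rank 4, the adapter would add 512 trainable values against 4,096 frozen ones.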

2605.22748 2026-06-19 cs.RO cs.AI cs.LG cs.MA Version update

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

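Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

Affiliations: Robotics and Perception Group, University of Zurich; Google DeepMind; Nomagic

AI summary: This paper uses multi-agent reinforcement learning to achieve safe and agile performance in high-speed quadrotor racing, shows that multi-agent interaction is key to safety in real-world interaction, and outperforms a human pilot in high-speed races while reducing collision rates.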

Comments 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl

Abstract

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions such as aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50% compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

2605.16865 2026-06-19 cs.CL Version update

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

Affiliations: Carnegie Mellon University; Jinesis Lab, University of Toronto & Vector Institute; University of Illinois Urbana-Champaign; Princeton University; Cornell University; The University of Tokyo; RIKEN AIP; Max Planck Institute for Intelligent Systems, Tübingen, Germany; EuroSafeAI

AI summary: This paper proposes MixSD, which mixes tokens from two conditionals of the model itself to inject knowledge in a way that stays aligned with the model's generation distribution, improving factual memorization and reasoning while preserving pretrained capabilities.

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
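
The idea of building supervision from two conditionals of the same base model can be sketched in a few lines of Python. The abstract does not state how MixSD chooses between the expert and naive conditionals at each position, so the per-token coin flip below is purely an assumption for illustration; `mix_supervision`, `expert_next`, `naive_next`, and `expert_prob` are hypothetical names, and the two callables stand in for sampling from the base model with and without the injected fact in context.

```python
import random


def mix_supervision(expert_next, naive_next, max_len,
                    expert_prob=0.5, seed=0, eos="<eos>"):
    """Build one supervision sequence by mixing tokens from two conditionals.

    expert_next(prefix) -> next token when the injected fact is in context.
    naive_next(prefix)  -> next token under the model's original prior.
    The coin-flip mixing rule is an illustrative assumption, not the paper's.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        # Pick which conditional supplies the next token.
        source = expert_next if rng.random() < expert_prob else naive_next
        token = source(tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens
```

Setting `expert_prob=1.0` would recover a sequence generated entirely with the fact in context, while `expert_prob=0.0` would recover the model's unconditioned output; intermediate values trade factual signal against closeness to the base distribution.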

2509.24725 2026-06-19 cs.LG cs.AI Version update

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

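Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

Affiliations: University of Amsterdam; Delft University of Technology

AI summary: This paper proposes the Q-Net framework, which combines Kalman filtering with neural networks to address data fusion for queue length estimation at signalized intersections, improving spatial transferability and real-time applicability and enabling accurate queue estimation without costly sensing equipment.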

DOI: 10.1016/j.trc.2026.105809
Journal ref: Transportation Research Part C: Emerging Technologies, Volume 190, September 2026, Article 105809

Abstract

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
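
For readers unfamiliar with the Kalman predict-update structure that Q-Net follows, a scalar sketch in Python may help. This is the classical filter with fixed noise variances, not Q-Net itself: according to the abstract, Q-Net learns time-varying gain dynamics from data, whereas here the gain is computed in closed form. The function name `kalman_step` and the parameters `q`, `r`, and `inflow` are assumptions made for illustration.

```python
def kalman_step(x, p, z, q=1.0, r=4.0, inflow=0.0):
    """One predict-update cycle for a scalar queue-length state.

    x, p   : previous state estimate and its variance
    z      : new measurement of the queue length
    q, r   : process and measurement noise variances (assumed known here)
    inflow : net vehicles entering the queue during the step (assumed input)
    """
    # Predict: the queue changes by the net inflow; uncertainty grows by q.
    x_pred = x + inflow
    p_pred = p + q
    # Update: the gain weighs the prediction against the measurement.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, gain
```

A learned variant would replace the closed-form `gain` line with the output of a network while keeping the same predict and update equations, which is what keeps the state evolution and measurement models physically interpretable.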

2605.20448 2026-06-19 cs.CV cs.LG Version update

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

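Animesh Maheshwari, Divyansh Sahu, Nishit Verma

Affiliations: Deccan AI

AI summary: Using a human-curated benchmark of 3,034 samples, this paper probes vision-language models on three components of spatial understanding: depth-ordered occlusion, optical-geometry inference, and volumetric rearrangement planning. Models perform well when rearranging visible layouts but poorly on occlusion and reflection inference.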

Abstract

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53–97% accuracy and rarely violate collision constraints fall to 6–45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2604.00626 2026-06-19 cs.LG cs.CL Version update

A Survey of On-Policy Distillation for Large Language Models

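Mingyang Song, Mao Zheng

Affiliations: Tencent, China

AI summary: This survey reviews on-policy distillation methods for large language models, discusses how teacher feedback during distillation reduces compounding errors, formalizes distillation as f-divergence minimization, and analyzes the connection between distillation and reinforcement learning.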

Comments Ongoing Work

Abstract

As Large Language Models continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become an important engineering problem, and knowledge distillation remains a common technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but generates its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as f-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained reinforcement learning. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agent-level distillation, and the growing overlap between knowledge distillation and RL.
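
The formalization mentioned in the abstract, f-divergence minimization over student-sampled trajectories, could be written roughly as follows. The notation is an assumption made here for illustration and may differ from the survey's: $\pi_\theta$ is the student, $\pi_T$ the teacher, $\mathcal{D}$ the prompt distribution, and $D_f$ a chosen f-divergence applied per token.

```latex
\min_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\,
\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}
\left[
  \sum_{t=1}^{|y|}
  D_f\!\left(
    \pi_{T}(\cdot \mid x, y_{<t})
    \,\middle\|\,
    \pi_{\theta}(\cdot \mid x, y_{<t})
  \right)
\right]
```

The key difference from static imitation is the inner expectation: prefixes $y_{<t}$ are drawn from the student rather than the teacher, so the student receives feedback on the states it actually visits. In the abstract's terms, with a per-step error $\epsilon$ and sequence length $T$, this is what is meant to move the compounding term from roughly $\epsilon T^2$ toward $\epsilon T$.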

2605.17443 2026-06-19 cs.CL cs.SD eess.AS Version update

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

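Donghyuk Jung, Youngwon Choi

Affiliations: Korea Culture Technology Institute, Republic of Korea; Maum AI Inc., Republic of Korea

AI summary: This paper studies error propagation in ASR-LLM cascades for Korean spoken question answering (SQA). By analyzing downstream semantic failures, it shows that conventional ASR metrics do not fully capture the impact of errors, finds that cascade degradation is consistent across LLMs of differing capability, identifies single-character ASR errors as a channel for semantic failure, and shows through an auxiliary comparison that large audio-language models outperform ASR-LLM pipelines with matched language models on noisy Korean SQA.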

Comments Preprint. Submitted to APSIPA ASC 2026

2605.15231 2026-06-19 cs.LG cs.CV Version update

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

Affiliations: Dyson School of Design Engineering, Imperial College London; TUM School of Engineering and Design, Technical University of Munich; Faculty of Mechanical Engineering, OTH Regensburg; NVIDIA

AI summary: This paper proposes Mask-Morph Graph U-Net, which uses feature-aligned barycentric parameterisation and node-masked pretraining to improve the generalisability and data efficiency of mesh-based simulation for crashworthiness design exploration.

Comments 48 pages, 15 figures, journal paper under review

Abstract

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

2512.03199 2026-06-19 cs.CV Version update

Does Head Pose Correction Improve Biometric Facial Recognition?

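Justin Norman, Hany Farid

Affiliations: University of California, Berkeley

AI summary: This study examines how AI-driven head pose correction and image restoration affect facial recognition accuracy, and finds that selectively applying CFR-GAN and CodeFormer can improve recognition performance.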

2605.10873 2026-06-19 cs.CV cs.AI Version update

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

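Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

Affiliations: Massachusetts Institute of Technology

AI summary: This paper proposes CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families, evaluates eleven vision-language models, and identifies three recurring failure modes in CAD program generation.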

Abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2605.09609 2026-06-19 cs.LG math.AG Version update

Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects

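Kevin Dao, Jose Israel Rodriguez

Affiliations: Department of Mathematics, University of Wisconsin-Madison, Wisconsin, USA

AI summary: Using frontier search and symbolic computation, this paper verifies counterexamples to the minimal unimodality conjecture for polynomial neural networks and shows that some subarchitectures have large defects, in contrast to the small defects observed previously.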

2605.09383 2026-06-19 cs.RO Version update

Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level

Yueqi Zhu, Yan Pan, Chufan Rui, Jiasheng Luo, Shihua Li, Bo Zhou

Affiliations: School of Automation, Southeast University; Key Laboratory of Measurement and Control of CSE, Ministry of Education

AI summary: This paper proposes a safety-critical LiDAR-inertial odometry that provides deterministic protection levels through on-manifold deterministic state estimation, improving navigation safety for mobile robots in safety-critical scenarios.

Abstract

In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform safe tasks. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown but bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimation from the iterated closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.

2605.08525 2026-06-19 cs.RO cs.SY eess.SY Version update

Model-Reference Adaptive Flight Control of a 95-mg Insect-Scale Flapping-Wing Aerial Robot

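Francisco M. F. R. Gonçalves, Conor K. Trygstad, Néstor O. Pérez-Arancibia

Affiliations: Washington State University

AI summary: To address parameter uncertainty and disturbances in insect-scale flapping-wing aerial robots, this paper proposes a model-reference adaptive control (MRAC) architecture combined with a hybrid multiplicative extended Kalman filter to achieve high-precision position control, and validates hovering and trajectory-tracking performance in experiments with a 95-mg robot.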

Comments Under review, 8 pages, 7 figures

Abstract

Due to the system's scale and complex fabrication, the model describing the dynamics of a flapping-wing insect-scale aerial robot is subject to parameter uncertainty; for example, in the inertia matrix and the actuator mapping of the flier. Furthermore, due to its low inertia, this type of robot is greatly affected by stochastic and systematic disturbances during flight, including power-wire tension, gusts, and undesired aerodynamic forces produced by wing misalignment. Therefore, the high-performance execution of complex maneuvers at the subdecigram scale requires the robot to adapt its behavior to counteract disturbances and model uncertainty. Toward this objective, we introduce a model-reference adaptive control (MRAC) architecture for high-performance position control of flapping-wing robotic insects that can be modeled as rigid bodies in the three-dimensional (3D) space. In addition, we demonstrate how the implementation of a hybrid multiplicative extended Kálmán filter for estimating current and desired angular velocities during flight significantly dampens attitude vibrations, especially along the roll and pitch degrees of freedom (DOFs), and also improves flight performance. To show the suitability, functionality, and high performance of the proposed approach, we conducted real-time hovering and trajectory-tracking 6-DOF flight control experiments with a 95-mg insect-scale aerial robot.
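
The core mechanism of model-reference adaptive control can be shown on a toy scalar system in Python. This is a textbook first-order MRAC with a Lyapunov-based adaptation law, offered only as an illustration: the paper's controller acts on a 3D rigid-body model, and the function `simulate_mrac`, the plant and reference parameters, the adaptation gain `gamma`, and the assumption that the input gain `b` is positive are all choices made here.

```python
def simulate_mrac(a=-1.0, b=1.0, am=2.0, gamma=5.0, r=1.0,
                  dt=1e-3, steps=60000):
    """Euler simulation of a scalar MRAC loop; returns (x, xm, kx, kr).

    Plant:            x_dot  = a * x + b * u      (a is unknown to the controller)
    Reference model:  xm_dot = -am * xm + am * r
    Control law:      u = kx * x + kr * r, with kx and kr adapted online.
    """
    x = xm = 0.0   # plant state and reference-model state
    kx = kr = 0.0  # adaptive feedback and feedforward gains
    for _ in range(steps):
        e = x - xm                  # tracking error
        u = kx * x + kr * r         # control with the current gain estimates
        x_dot = a * x + b * u
        xm_dot = -am * xm + am * r
        kx_dot = -gamma * e * x     # adaptation law, valid for b > 0
        kr_dot = -gamma * e * r
        x += dt * x_dot
        xm += dt * xm_dot
        kx += dt * kx_dot
        kr += dt * kr_dot
    return x, xm, kx, kr
```

The gains are driven by the tracking error rather than by an explicit estimate of `a`, so the plant output is pulled toward the reference model even though the controller never identifies the plant parameters; with a constant reference the gains themselves need not converge to their ideal values.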

2605.07821 2026-06-19 cs.CV cs.AI Version update

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

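Boyang Dai, Chaoqi Chen, Yizhou Yu

Affiliations: The University of Hong Kong; Shenzhen University; Shenzhen Loop Area Institute

AI summary: Proposes an OOD detection framework based on object co-occurrence that uses disentangled representations and a divide-and-conquer strategy to distinguish near-OOD samples and mitigate simplicity bias, achieving competitive results across multiple settings.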

Comments This paper has been accepted by CVPR2026

Abstract

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2604.23938 2026-06-19 cs.CL Version update

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

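Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

Affiliations: Computational Sciences Center of Excellence

AI summary: Proposes TSAssistant, a multi-agent framework that uses a hierarchical instruction architecture and an interactive refinement loop to decompose target safety assessment report generation into specialized subtasks, achieving high reproducibility and evidence traceability.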

Comments Updated with quantitative and expert evaluations

Abstract

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

2605.00665 2026-06-19 cs.CV Version update

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Seowung Leem, Yunchao Yang, Adam J. Woods, Ruogu Fang

Affiliations: J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida; University of Florida Research Computing; Meta AI (FAIR); School of Behavioral and Brain Sciences, University of Texas at Dallas; Dept. of Electrical and Computer Engineering, University of Florida; Dept. of Computer and Information Science and Engineering, University of Florida; Center for Cognitive Aging and Memory, University of Florida

AI summary: Uses deep learning to predict 12 Alzheimer's disease-related risk factors from color fundus photographs of the retina and characterizes the retinal structures behind those predictions, finding that regions such as the optic nerve head and retinal vasculature are associated with the risk factors and with preclinical Alzheimer's disease changes.

Comments Accepted to the "Journal of Alzheimer's Disease" for publication

详情

AI中文摘要

系统性的、代谢性的、生活方式的因素已通过流行病学和AD特异性生物标志物研究与阿尔茨海默病（AD）建立关联。彩色眼底摄影（CFP）是否包含与这些AD相关风险域相对应的视网膜结构特征仍不清楚。为了确定深度学习（DL）模型能否从CFP预测12个AD相关风险因素，并表征这些预测背后的视网膜结构，从而评估CFP是否反映AD易感性的通路。使用来自英国生物银行的44,501名独特参与者的62,876张CFP，训练DL模型预测与AD发病率相关的12个因素：6个分类变量（性别、吸烟、失眠、经济状况、饮酒、抑郁）和6个连续变量（年龄、受教育完成年龄、BMI、收缩压、舒张压、HbA1c）。评估模型性能、模型显著性和显著性衍生得分（CAM-Score），并与视网膜形态测量进行比较。还将得分在AD发病病例（平均发病前8.55年）与匹配对照之间进行比较。DL的性能范围为分类变量的AUROC=0.5654-0.9480，连续变量的R2=-0.0291-0.7620，优于大多数形态测量-机器学习模型。基于显著性的得分一致地突出了生物学上有意义的区域，特别是视神经头和视网膜血管。它也与现有的形态测量变异一致。多个基于显著性的得分在AD发病病例与匹配对照之间存在显著差异，表明风险因素的视网膜相关性与临床前AD相关变化之间存在潜在重叠。CFP编码了与AD风险因素相关的视网膜特征。尽管不具有诊断性，但DL衍生的视网膜表征可能揭示反映潜在AD易感性的生物学上有意义的风险相关结构变化。

英文摘要

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's disease (AD) through epidemiologic and AD-specific biomarker studies. Whether color fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. This study aimed to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants from the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (on average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from $R^2$ = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-based machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature, and aligned with existing morphometric variations. Several saliency-based scores differed significantly between incident-AD cases and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful risk-related structural changes that mirror potential AD vulnerability.
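
A note on the reported range for continuous factors: a negative coefficient of determination (such as -0.0291) means the model fits worse than a baseline that always predicts the mean of the targets. The following is a minimal sketch of that arithmetic; the function name and the toy numbers are hypothetical and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (not data from the study).
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]  # always predict the mean of y_true
poor_model = [4.0, 3.0, 2.0, 1.0]     # anti-correlated predictions

r2_baseline = r_squared(y_true, mean_baseline)  # zero by construction
r2_poor = r_squared(y_true, poor_model)         # negative: worse than the mean
```

Under this definition a value slightly below zero indicates essentially no predictive signal for that factor, rather than a computation error.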


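2602.17315 2026-06-19 cs.LG cs.AI (updated version)

Flickering Multi-Armed Bandits

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Affiliations: University of Colorado Boulder; INRIA Paris

AI summary: Proposes the flickering multi-armed bandit model, in which random graphs constrain action availability; designs a two-phase lazy random walk algorithm that achieves sublinear regret bounds and proves optimality against an information-theoretic lower bound.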

2604.19196 2026-06-19 cs.CV (updated version)

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

Affiliations: Graduate School of Information Sciences, Tohoku University, Japan

AI summary: Systematically benchmarks 15 pre-trained vision models for domain-generalizable face anti-spoofing and finds that self-supervised ViTs (especially DINOv2 with Registers), combined with data augmentation and an attention-weighted loss, achieve state-of-the-art results on the MICO protocol while remaining computationally efficient.

Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Abstract

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA), and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/


2604.07328 2026-06-19 cs.LG (updated version)

How to sketch a learning algorithm

Sam Gunn

Affiliations: UC Berkeley

AI summary: Proposes a data deletion scheme that, under a stability assumption, locally sketches arithmetic circuits via higher-order derivatives in random complex directions, predicting deep learning model outputs with vanishing error and failure probability; precomputation and inference are slower than regular training and inference by only an $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factor.

Comments: Improved presentation and simplified Algorithm 4

Abstract

How does the choice of training data influence an AI model? This broad question is of central importance to interpretability, privacy, and basic science. At its technical core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ and failure probability $\delta$ in the deep learning setting. Our precomputation and prediction algorithms are slower than regular training and inference, respectively, by only a factor of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$. The storage requirements are those of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ models. Our proof is based on an assumption that we call stability. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.


2603.04531 2026-06-19 cs.RO (updated version)

PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation

Rosy Chen, Mustafa Mukadam, Michael Kaess, Tingfan Wu, Francois R Hogan, Jitendra Malik, Akash Sharma

Affiliations: Carnegie Mellon University; University of Washington; FAIR at Meta; UC Berkeley

AI summary: Proposes PTLD, which distills a robust tactile state estimator from real-world tactile policy data and so avoids the difficulty of tactile simulation; it improves over proprioception-only policies by 182% on in-hand rotation and by 57% in goals reached on in-hand reorientation.

Abstract

Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high-quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitively difficult. Alternatively, with reinforcement learning we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. Our experiments demonstrate that PTLD can significantly improve proprioceptive manipulation policies trained in simulation by incorporating tactile sensing. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception-only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation, where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.


2604.15838 2026-06-19 cs.LG (updated version)

Reversible Residual Normalization Alleviates Spatio-Temporal Distribution Shift

Zhaobo Hu, Vincent Gauthier, Mehdi Naima

Affiliations: CNRS, LIP6, Sorbonne Université

AI summary: Addresses spatio-temporal distribution shift with Reversible Residual Normalization (RRN), a framework that applies spatially aware invertible transformations to handle shift in both spatial and temporal dimensions, combining graph convolutions with spectral-constrained graph neural networks for adaptive normalization.

Abstract

Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network, where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex spatio-temporal relationships in a data-driven manner. The bidirectional nature of our framework allows models to learn in a normalized latent space and recover original distributional properties through inverse transformation, offering a robust and model-agnostic solution for forecasting on dynamic spatio-temporal systems.
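
For context, the instance-normalization baseline the abstract mentions can be sketched in a few lines: each input window is standardized with its own statistics, and the stored statistics are used to map values back to the original scale. This is only an illustration of that general normalize/denormalize pattern, not RRN itself (which the abstract describes as adding graph convolutions inside invertible residual blocks); the function names and the sample window are hypothetical.

```python
import math

def normalize(series, eps=1e-5):
    """Standardize one series with its own mean/std; return stats for inversion."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var + eps)  # eps guards against a constant series
    return [(x - mean) / std for x in series], (mean, std)

def denormalize(normed, stats):
    """Invert normalize() using the stored per-series statistics."""
    mean, std = stats
    return [z * std + mean for z in normed]

# Hypothetical input window for one node of a spatio-temporal graph.
window = [10.0, 12.0, 11.0, 15.0]
normed, stats = normalize(window)      # a forecaster would operate on `normed`
restored = denormalize(normed, stats)  # map values back to the original scale
```

In a forecasting pipeline the inverse step would be applied to the model's predictions rather than to the inputs; here it is applied to the inputs only to show that the transform is reversible.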


2604.13416 2026-06-19 cs.CV cs.AI (updated version)

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

Affiliations: University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

AI summary: To address the lack of large-scale real-world datasets for distractor-free radiance fields, introduces DF3DV-1K, a dataset of 1,048 scenes with clean and cluttered image sets per scene, and uses it to benchmark nine recent distractor-free methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios.

Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.
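
To put the reported 0.96 dB average PSNR gain in perspective, the sketch below computes PSNR from mean squared error and converts a dB gain into the corresponding MSE ratio. It assumes pixel values in [0, 1] and uses flat lists of hypothetical pixel values; it is not the benchmark's evaluation code.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical pixel values in [0, 1]; every pixel is off by 0.1.
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.4, 0.9, 0.35]
value_db = psnr(reference, rendered)  # MSE is about 0.01, so about 20 dB

# A gain of g dB at fixed MAX multiplies the MSE by 10^(-g / 10).
mse_ratio = 10 ** (-0.96 / 10)  # about 0.80
```

Because PSNR is logarithmic in MSE, a 0.96 dB gain corresponds to roughly 20% lower mean squared error when applied to a single image; the abstract reports an average over scenes, so this conversion is only indicative.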


2604.13240 2026-06-19 cs.CV cs.LG (updated version)

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Affiliations: Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

AI summary: Proposes the first implementation of concept-based explainable AI for species distribution models, builds a landscape concept dataset from high-resolution multispectral and LiDAR drone imagery, and uses Robust TCAV to quantify the influence of landscape concepts on model predictions, demonstrated in a case study.

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.


2602.22495 2026-06-19 cs.LG cs.AI (updated version)

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

Affiliations: Meta

AI summary: Proposes RL-aware distillation (RLAD), which uses Trust Region Ratio Distillation (TRRD) for selective imitation during RL post-training, addressing distribution mismatch and objective interference, and outperforms offline distillation, standard GRPO, and KL-based on-policy distillation on logic reasoning and math benchmarks.

Abstract

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL, guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher/old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.


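2604.07593 2026-06-19 cs.AI (updated version)

Too long; didn't solve

Lucía M. Cabrera, Isaac Saxton-Knight, Jocelyn D'Arcy

Affiliations: Instituto Balseiro; Poindexter Labs

AI summary: Studies how prompt length and solution length relate to large language model performance on math problems, finding that both correlate positively with model failure rates.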

2604.06464 2026-06-19 cs.LG physics.app-ph stat.ML (updated version)

Weighted Bayesian Conformal Prediction

Xiayin Lou, Peng Luo

Affiliations: Technical University of Munich; Massachusetts Institute of Technology

AI summary: Proposes Weighted Bayesian Conformal Prediction (WBCP), which generalizes Bayesian conformal prediction to importance-weighted settings via a weighted Dirichlet, shows theoretically that the effective sample size governs posterior variance, and provides richer uncertainty information about conditional coverage.

Abstract

Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell & Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose Weighted Bayesian Conformal Prediction (WBCP), which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet $\mathrm{Dir}(1,\ldots,1)$ with a weighted Dirichlet $\mathrm{Dir}(n_{\mathrm{eff}} \cdot \tilde{w}_1, \ldots, n_{\mathrm{eff}} \cdot \tilde{w}_n)$, where $n_{\mathrm{eff}}$ is Kish's effective sample size. We prove four theoretical results: (1) $n_{\mathrm{eff}}$ is the unique concentration parameter matching frequentist and Bayesian variances; (2) posterior standard deviation decays as $O(1/\sqrt{n_{\mathrm{eff}}})$; (3) BQ-CP's stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4) the HPD threshold provides $O(1/\sqrt{n_{\mathrm{eff}}})$ improvement in conditional coverage. We instantiate WBCP for spatial prediction as Geographical BQ-CP, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.
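
The weighted Dirichlet parameters described above can be sketched as follows. The sketch assumes the standard form of Kish's effective sample size, (sum of weights)^2 / (sum of squared weights), and assumes that the tilde weights are the importance weights normalized to sum to one; the abstract does not spell out either detail, and the function names and example weights are hypothetical. It covers only the parameter construction and one Dirichlet draw, not the paper's threshold or coverage computations.

```python
import random

def kish_effective_sample_size(weights):
    """Kish's n_eff = (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def weighted_dirichlet_params(weights):
    """Concentrations n_eff * w_i / sum(w), assuming normalized weights."""
    total = sum(weights)
    n_eff = kish_effective_sample_size(weights)
    return [n_eff * w / total for w in weights]

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    norm = sum(draws)
    return [d / norm for d in draws]

# Hypothetical importance weights for illustration.
uniform = [1.0, 1.0, 1.0, 1.0]
skewed = [4.0, 1.0, 1.0, 1.0, 1.0]

uniform_params = weighted_dirichlet_params(uniform)  # every entry equals 1
skewed_n_eff = kish_effective_sample_size(skewed)    # smaller than len(skewed)
sample = sample_dirichlet(weighted_dirichlet_params(skewed), random.Random(0))
```

With equal weights the parameters reduce to all ones, matching the uniform Dirichlet of the i.i.d. case; unequal weights shrink the effective sample size below the number of calibration points, which is what widens the posterior.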


2604.06265 2026-06-19 cs.LG cond-mat.stat-mech quant-ph (updated version)

SMT-AD: a scalable quantum-inspired anomaly detection approach

Apimuk Sornsaeng, Si Min Chan, Wenxuan Zhang, Swee Liang Wong, Joshua Lim, Jonathan Pan, Dario Poletti

Affiliations: Science, Mathematics and Technology Cluster, Singapore University of Technology and Design; Centre for Quantum Technologies, National University of Singapore; Artificial Intelligence and Data Analytics Strategic Technology Centre, ST Engineering; Engineering Product Development Pillar, Singapore University of Technology and Design

AI summary: Proposes SMT-AD, a quantum-inspired anomaly detection method based on a superposition of multiresolution tensors; Fourier-assisted feature embedding and matrix product operators give linear parameter scaling, and the method performs competitively on standard datasets.

Comments: 12 pages, 5 figures

Abstract

Quantum-inspired tensor network algorithms have been shown to be effective and efficient models for machine learning tasks, including anomaly detection. Here, we propose a highly parallelizable quantum-inspired approach which we call SMT-AD, for Superposition of Multiresolution Tensors for Anomaly Detection. It is based upon the superposition of bond-dimension-1 matrix product operators to transform the input data with Fourier-assisted feature embedding, where the number of learnable parameters grows linearly with feature size, embedding resolutions, and the number of additional components in the matrix product operator structure. We demonstrate successful anomaly detection when applied to standard datasets, including credit card transactions, and find that, even with minimal configurations, it achieves competitive performance against established anomaly detection baselines. Furthermore, it provides a straightforward way to reduce the weight of the model and even improve the performance by highlighting the most relevant input features.


2604.04917 2026-06-19 cs.CV cs.AI cs.CL (updated version)

Vero: An Open RL Recipe for General Visual Reasoning

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

Affiliations: Princeton University

AI summary: Introduces Vero, a family of open vision-language models; by building the 600K-sample dataset Vero-600K and task-routed rewards, Vero variants gain 2.9-5.4 points on average across 30 benchmarks over their starting models, and Vero-Qwen3I-8B surpasses Qwen3-VL-8B-Thinking by 3.8 points on average.

Comments: Project page: https://vero-reasoning.github.io/

Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.


2604.05435 2026-06-19 cs.AI (updated version)

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava, Shivali Dalmia, Abhishek Mukherji

Affiliations: Department of Computer Science & Engineering, University of Minnesota-Twin Cities, Minneapolis, USA; Centific AI Research, Redmond, USA

AI summary: Proposes an LLM-based automated framework that audits the completeness of discharge summaries against a 46-question checklist; benchmarks 11 models on 50 MIMIC-IV summaries, where the best models reach a Cohen's kappa of about 0.5 against clinician labels and all models struggle to identify ambiguous documentation.

Comments: Accepted as a poster at IEEE-ICHI 2026; Accepted at SD4H@ICML

Abstract

Incomplete or inconsistent discharge documentation drives care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies on manual review and does not scale. We propose an automated framework for auditing discharge summaries using large language models (LLMs). Our approach operationalizes the DISCHARGED framework into a checklist of 46 questions. Using 50 summaries from the MIMIC-IV database, with clinician ground-truth labels, we benchmark 11 LLMs. Model-assessed mean documentation completeness ranges from 54.9% to 74.2%, and the best-performing models achieve Cohen's kappa values around 0.5 against clinician labels, indicating moderate agreement. All models struggle to identify ambiguous documentation (Unclear), highlighting a key gap in current automated auditing. This work provides a clinician-validated benchmark and zero-shot baselines for systematic quality improvement in clinical documentation.
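
For readers unfamiliar with the agreement statistic, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. The sketch below is a minimal illustration: the "Unclear" label appears in the abstract, while the "Yes"/"No" labels and the eight example answers are hypothetical and are not benchmark data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_observed - p_expected) / (1 - p_expected) for two raters."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical checklist answers for eight questions (not benchmark data).
clinician = ["Yes", "Yes", "No", "No", "Unclear", "Yes", "No", "Yes"]
model = ["Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"]

kappa = cohens_kappa(clinician, model)  # 6/8 raw agreement, kappa = 5/9
```

In this toy case the raters agree on 75% of items, yet kappa is only about 0.56 once chance agreement is discounted, which is why a kappa near 0.5 is read as moderate rather than strong agreement.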


2603.29924 2026-06-19 cs.CV (updated version)

Abstraction in Style: Beyond Texture and Color

Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang

Affiliations: Shenzhen University; Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE); Peking University

AI summary: Proposes Abstraction in Style (AiS), a framework that separates structural abstraction from visual stylization; an intermediate abstraction proxy relaxes geometric fidelity, supporting a wider range of non-photorealistic style transfer.

Comments: SIGGRAPH 2026

Abstract

Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target's structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.
