arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21714 2026-05-22 cs.CV cs.RO

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

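Affiliations: Meta Reality Labs

AI summary: This paper proposes AVI-HT, an adaptive vision-IMU fusion method that tracks 3D hand pose by jointly modeling egocentric images and 6-DoF IMU signals from sensors mounted on a glove. Its core components are synchronized multi-modal training data pairs and a cross-sensor deep attention mechanism; the main contribution is improved accuracy and availability in hand-object interaction scenarios.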

Abstract

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
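
To make the idea of adaptively weighting sensors concrete, here is a minimal sketch. It is not the authors' architecture: it assumes each sensor (one vision feature plus several IMU features) is scored against a query vector, that a softmax turns the scores into trust weights, and that the fused feature is the weighted sum. The function names, the dot-product scoring rule, and the toy vectors are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_sensors(query, sensor_feats):
    """Hypothetical fusion: score each sensor feature against a query
    vector, then return (trust weights, weighted-sum feature)."""
    dim = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
              for feat in sensor_feats]
    weights = softmax(scores)
    fused = [sum(w * feat[i] for w, feat in zip(weights, sensor_feats))
             for i in range(dim)]
    return weights, fused

# Toy example: sensor 0 stands for vision, sensors 1-2 for two finger IMUs.
query = [1.0, 0.0]
feats = [[0.1, 0.5],   # vision feature, weakly aligned with the query
         [2.0, 0.0],   # IMU feature, strongly aligned with the query
         [1.0, 1.0]]   # second IMU feature
weights, fused = fuse_sensors(query, feats)
```

In this toy setup the weakly aligned vision feature receives the smallest weight, which is the kind of behavior one would want under heavy occlusion; a real model would learn the scoring function from data.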

2605.21713 2026-05-22 cs.CL

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li

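Affiliations: Language Technologies Institute, Carnegie Mellon University

AI summary: This paper proposes Sem-Detect, which combines textual features with semantic analysis to distinguish AI-generated peer reviews from human-written ones. Experiments show strong performance in both binary and three-class settings, with a substantial gain over baselines.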

Abstract

How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
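
The headline metric, TPR@0.1% FPR, is the fraction of AI reviews caught when the decision threshold is set so that at most 0.1% of human reviews are flagged. The sketch below shows one common way to compute it from detector scores; the function name and the convention that a higher score means "more AI-like" are assumptions, and the paper's evaluation code may differ in how it handles ties.

```python
def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.001):
    """TPR on AI reviews at a threshold chosen so that at most max_fpr
    of human reviews are flagged. Higher score = more AI-like."""
    negatives = sorted(human_scores, reverse=True)
    allowed_fp = int(max_fpr * len(negatives))  # false positives we may accept
    if allowed_fp >= len(negatives):
        threshold = float("-inf")
    else:
        # Flag a review only if its score is strictly above this value,
        # so at most `allowed_fp` human reviews can be flagged.
        threshold = negatives[allowed_fp]
    true_pos = sum(1 for s in ai_scores if s > threshold)
    return true_pos / len(ai_scores)

# Toy data: 1000 human scores (0..999) and four AI scores.
human = list(range(1000))
ai = [998.5, 999.5, 500, 1200]
result = tpr_at_fpr(human, ai)  # one human false positive is allowed
```

With 1000 human reviews, a 0.1% budget allows exactly one false positive, so the threshold sits at the second-highest human score.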

2605.21712 2026-05-22 cs.CL

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

Mahdi Azhdari, Eric J. Gonzales

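Affiliations: Department of Civil and Environmental Engineering, University of Massachusetts Amherst

AI summary: This paper presents a schema-grounded natural language interface that uses a large language model to interpret user intent while keeping execution deterministic and reviewable. It targets uneven access to transportation safety data and, by integrating crash records, roadway attributes, and geospatial data, aims to strengthen safety planning capacity in the public sector.

Comments: 30 pages, 5 figures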

Abstract

Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites create a gap between analytical tools central to safety planning and the practitioners able to use them. Local agencies, school committees, and residents may have safety concerns but limited capacity to retrieve, filter, map, and analyze relevant data. Generative AI offers a way to narrow this divide, but its public-sector use raises questions about reliability, reproducibility, and governance. This paper presents a schema-grounded natural language interface for transportation safety analysis, using a large language model (LLM) to interpret user intent while preserving deterministic, reviewable execution against an authoritative database. User queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database. This bounded design separates language interpretation from deterministic execution, keeping results reproducible and schema-grounded while removing access barriers. The framework is evaluated using a statewide Massachusetts transportation safety database integrating crash records, roadway attributes, and geospatial layers including schools, bus stops, crosswalks, and municipal boundaries. All queries executed successfully; the validation layer corrects errors in 29% of evaluation queries, reflecting the gap between flexible natural language and strict schema-grounded requirements. The results suggest that combining natural language accessibility with deterministic execution is a practical direction for broadening access to transportation safety data, with implications for trustworthy AI in public-sector planning.
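
The frame, validate, compile, execute pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the schema contents, the frame fields (`near_layer`, `radius_m`, `filters`), the three validation rules, and the operation names are all hypothetical, node typing is omitted, and no database is contacted.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: the layers and crash fields a query may reference.
SCHEMA = {"layers": {"crashes", "schools", "bus_stops", "crosswalks"},
          "crash_fields": {"year", "severity"}}

def validate_frame(frame):
    """Rule-based repair of an LLM-produced semantic frame (illustrative rules)."""
    fixed = dict(frame)
    # Rule 1: normalize layer names such as "Bus Stops" to schema identifiers.
    layer = frame["near_layer"].strip().lower().replace(" ", "_")
    if layer not in SCHEMA["layers"]:
        raise ValueError(f"unknown layer: {frame['near_layer']!r}")
    fixed["near_layer"] = layer
    # Rule 2: drop filters on fields that are not in the schema.
    fixed["filters"] = {k: v for k, v in frame["filters"].items()
                        if k in SCHEMA["crash_fields"]}
    # Rule 3: the buffer radius must be a positive number of meters.
    radius = float(frame["radius_m"])
    if radius <= 0:
        raise ValueError("radius_m must be positive")
    fixed["radius_m"] = radius
    return fixed

def compile_to_dag(frame):
    """Map each spatial operation to the set of operations it depends on."""
    load_layer = f"load_{frame['near_layer']}"
    return {
        "load_crashes": set(),
        load_layer: set(),
        "filter_crashes": {"load_crashes"},
        "buffer_layer": {load_layer},
        "spatial_join": {"filter_crashes", "buffer_layer"},
        "count": {"spatial_join"},
    }

# A frame as an LLM might emit it: loose layer name, string radius, stray filter.
frame = {"near_layer": "Bus Stops", "radius_m": "250",
         "filters": {"year": 2023, "weather": "rain"}}
clean = validate_frame(frame)
order = list(TopologicalSorter(compile_to_dag(clean)).static_order())
```

Because the language model only produces the frame, everything after `validate_frame` is deterministic: the same cleaned frame always yields the same operation graph and execution order.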

2605.21710 2026-05-22 cs.RO

PGDG: Physically Grounded Data Generation for Robust Bimanual Policy Learning from a Single Demonstration

Cunxi Dai, Haoran Chang, Aditya Nisal, Rahul Kumar, Guofei Chen, Tao Chen, Yuzhe Qin, Guanya Shi

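Affiliations: Robotics Institute, Carnegie Mellon University; Dexmate

AI summary: This paper proposes PGDG, a physically grounded data generation framework that uses zero-shot curation to expand a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors, improving behavior cloning for contact-rich bimanual manipulation.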

Abstract

Behavior cloning for contact-rich bimanual manipulation remains challenging because diverse demonstrations are expensive to collect, and even small disturbances can push the system into off-manifold states where no recovery supervision is available. We propose PGDG, a data generation framework with zero-shot curation that expands a single demonstration into a compact dataset of physically plausible, successful, and diverse recovery behaviors without additional human labeling. PGDG iterates between a physics-grounded sampler and a dataset curator, where the curator selects informative, non-redundant, and recoverable behaviors to update the sampling distribution toward under-covered recovery modes, and the sampler draws physically plausible rollout candidates from this updated distribution and retains successful trajectories. To further improve data quality, PGDG applies short-horizon sampling-based control to relabel selected risky states with corrective actions. Across four bimanual manipulation tasks, PGDG consistently outperforms spatial-only augmentation in both simulation and zero-shot real-world transfer. On RotateBox-Pitch, success improves from 38% to 93% in simulation and from 35% to 82% in the real world. PGDG also enables effective fine-tuning of foundation models such as GR00T, increasing success from 46% to 77%. Additional results are available on our website: https://cunxid.github.io/PGDG/.

2605.21704 2026-05-22 cs.RO cs.SY eess.SY

Motion Design for Grasp-Based Dynamic Locomotion in Microgravity

Chaerim Moon, Joohyung Kim, Justin K. Yim

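Affiliations: Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign

AI summary: This paper addresses grasp-based dynamic locomotion for multi-limbed robots in microgravity. It proposes a parameterizable locomotion planning framework that varies gait pattern, stride length, locomotion speed, and nominal posture, and evaluates the resulting stability and actuation demand. The results indicate that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance.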

Abstract

Locomotion in microgravity often relies on sparsely and irregularly arranged anchors, motivating grasp-based mobility with multiple limbs. In this setting, dynamic locomotion is feasible only through deliberate regulation of both anchored interactions and whole-body coordination under coupled dynamic and kinematic constraints. This paper presents design insights for grasp-based dynamic locomotion with multi-limbed robotic systems in microgravity, targeting scenarios that require 6D limb manipulation to establish contacts with candidate anchors. The investigated design parameters include gait pattern, stride length, locomotion speed, and nominal posture. A parameterizable locomotion planning framework is proposed to support variations of these parameters and to evaluate the resulting locomotion performance in terms of stability and actuation demand. Two representative quadruped morphologies are adopted for evaluation in physics-based simulation. The results demonstrate that enlarging the feasible contact wrench space and attenuating impulsive whole-body dynamics improve locomotion performance. These findings inform strategies for contact configuration selection and whole-body coordination in microgravity locomotion with multi-limbed systems.

2605.21699 2026-05-22 cs.LG cs.CL

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

Sharath Turuvekere Sreenivas, Adithyakrishna Venkatesh Hanasoge, Mingyu Yang, Ali Taghibakhshi, Saurav Muralidharan, Ashwath Aithal, Pavlo Molchanov

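Affiliations: NVIDIA

AI summary: This paper proposes X-Token, a projection-guided cross-tokenizer knowledge distillation method. It addresses the shortcomings of earlier methods when transferring knowledge between models with different tokenizers, using two complementary loss functions.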

Abstract

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
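
One way to picture a projection-based KL objective such as P-KL is sketched below. The abstract does not say which side W projects or which direction of KL is used, so this sketch assumes that the teacher distribution is mapped onto the student vocabulary and that the loss is KL(projected teacher || student). The dictionary-based sparse matrix, the token strings, and all probabilities are hypothetical.

```python
import math

def project(teacher_probs, W):
    """Map a teacher distribution onto the student vocabulary.
    W is a sparse matrix stored as {teacher_token: {student_token: weight}}."""
    projected = {}
    for t_tok, p in teacher_probs.items():
        for s_tok, w in W.get(t_tok, {}).items():
            projected[s_tok] = projected.get(s_tok, 0.0) + p * w
    total = sum(projected.values())
    return {tok: p / total for tok, p in projected.items()}  # renormalize

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the support of p; q is floored at eps."""
    return sum(pv * math.log(pv / max(q.get(tok, 0.0), eps))
               for tok, pv in p.items() if pv > 0)

# Toy case: a digit-splitting teacher emits "1"; the student vocabulary also
# has the multi-digit numeral "1100", so W spreads mass over both candidates.
teacher = {"1": 0.7, "2": 0.2, "the": 0.1}
W = {"1": {"1": 0.5, "1100": 0.5}, "2": {"2": 1.0}, "the": {"the": 1.0}}
target = project(teacher, W)

student = {"1": 0.6, "1100": 0.05, "2": 0.25, "the": 0.1}
loss = kl_divergence(target, student)
```

Here the student assigns little probability to "1100", so the loss is positive and its gradient would push mass toward the multi-digit token instead of suppressing it, which is the failure mode the abstract attributes to partition-based methods.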

2605.21695 2026-05-22 cs.AI cs.HC

The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

Shang Wu, Hongyu Yao, Catarina Belem, Shuyuan Fu, Mark Steyvers, Padhraic Smyth

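Affiliations: University of California, Irvine; Massachusetts Institute of Technology

AI summary: This paper studies how AI usage and AI informativeness affect skill development in logical reasoning. Heavy AI users perform worse, low-information AI does not help learning, and high-information AI improves short-run performance but with heterogeneous effects.

Comments: Accepted at Hybrid Human Artificial Intelligence (HHAI) 2026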

Abstract

Artificial intelligence (AI) is being increasingly integrated into human problem-solving, yet its effects on individual skill development remain unclear. We examine how both AI usage and informativeness can shape learning in the context of a controlled logical reasoning task with on-demand access to AI assistance. We find that greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI. We also find in our study that these patterns are mediated by AI informativeness. Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall. On the other hand, high-information AI was found to improve short-run performance without reducing post-AI outcomes on average in our experiments, but with heterogeneous effects. Our findings in general suggest that AI can, depending on context, either complement human skill development by amplifying independent reasoning or can act as a substitute that undermines such reasoning, with the implication that regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.

2605.21692 2026-05-22 cs.LG stat.ML

Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective

David Perera, Victor Moura, Lais Isabelle Alves dos Santos, Michel F. C. Haddad, Flavio Figueiredo

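Affiliations: Universidade Federal de Minas Gerais; Queen Mary University of London

AI summary: Taking a geometric perspective, this paper introduces the Representation Gap, a metric closely related to generalization error, shows that its asymptotic behavior extends to a broader range of tasks and training algorithms, and validates the theory empirically on synthetic and realistic datasets.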

Abstract

Precisely characterizing the asymptotic generalization error of neural networks using parameters that can be estimated efficiently is a crucial problem in machine learning, which relies heavily on heuristics and practitioners' intuition to make key design choices. To mitigate this issue, we introduce the Representation Gap, a metric closely related to the generalization error, but admitting better-behaved asymptotic dynamics. Focusing on equivariant diffusion models and leveraging results from optimal quantization and point-process theory, we derive a precise asymptotic equivalent of the Representation Gap and show that it is governed by a single parameter, the intrinsic dimension of the task, which is easy to interpret, efficient to estimate, and can be linked to the equivariances of common neural network architectures. We show that this asymptotic dynamic also extends to a broader range of tasks and training algorithms. Finally, we demonstrate empirically that our asymptotic law and intrinsic dimension estimation are accurate on a wide range of synthetic datasets, where these quantities are known, as well as on more realistic datasets, where we obtain results consistent with the related literature.

2605.21688 2026-05-22 cs.RO cs.SY eess.SY

Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control

Alessandro Amici, Houari Bettahar, Veeti Jaakkola, Quan Zhou

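Affiliations: Department of Electrical Engineering and Automation, Aalto University

AI summary: This paper proposes a closed-loop sim-to-real reinforcement learning method for controlling the shape of deformable microfibers on a surface. Geometric shape regulation is trained in a simplified frictionless simulator, and real-time visual feedback is used during deployment to iteratively correct the effects of unmodeled surface interactions.

Comments: 7 pages, 7 figures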

Abstract

Autonomous contact-based micromanipulation is challenging because surface and interfacial interactions at the microscale are difficult to model accurately, limiting the use of conventional model-based control and sim-to-real learning. We present a closed-loop sim-to-real reinforcement learning (RL) approach for microfiber shape control on a surface. The central idea is to train geometric shape regulation in a simplified frictionless simulator and rely on real-time visual feedback during deployment to iteratively correct the observed effects of unmodeled surface interactions. An RL policy trained entirely in simulation is transferred directly to a physical dual-gripper micromanipulation system operating at 40 Hz, without retraining or domain adaptation. Using silk microfibers as a testbed, the policy achieves a mean point-wise shape error of 270 ± 80 μm across twenty-four diverse initial configurations. Across nine specimens covering all combinations of three fiber diameters (50, 80, and 120 μm) and three manipulated lengths (10 mm, 15 mm, and 20 mm), the same policy achieves sub-millimeter final shape error without any retraining or retuning. These results show that a policy learned in a simplified simulator can achieve repeatable real-world microfiber shape regulation under surface contact, provided that the task-relevant effects of the sim-to-real mismatch remain observable and correctable within the closed feedback loop.
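
A mean point-wise shape error can be computed as below. This is a sketch that assumes the fiber and the target are sampled at the same number of corresponding points and that the error is the average Euclidean distance between them; the paper's exact definition and sampling scheme are not given in the abstract, and the coordinates here are made up.

```python
import math

def mean_pointwise_error(shape, target):
    """Average Euclidean distance between corresponding sample points."""
    if len(shape) != len(target):
        raise ValueError("shape and target must have the same number of points")
    return sum(math.dist(p, q) for p, q in zip(shape, target)) / len(shape)

# Toy fiber sampled at three points, coordinates in millimeters.
observed = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
desired = [(0.0, 0.3), (1.0, 0.0), (2.0, -0.3)]
error_mm = mean_pointwise_error(observed, desired)  # about 0.2 mm
```

In a closed-loop setting this quantity would be recomputed from each camera frame, so that the policy reacts to the observed shape rather than to the simulator's prediction.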

2605.21686 2026-05-22 cs.RO

Distributed Multi-Coverage for Robot Swarms

Mariem Guitouni, Aaron T. Becker

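Affiliations: University of Houston

AI summary: This paper proposes a distributed multicoverage algorithm that lets robot swarms maintain reliable coverage of critical assets using only local sensing and local communication, with no global coordination, while tolerating constraints such as robot failures.

Comments: Accepted at ANTS 2026 (International Conference on Swarm Intelligence), published by Springer Nature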

Abstract

Autonomous drone swarms deployed for surveillance, environmental monitoring, and infrastructure inspection must maintain reliable coverage of critical assets despite robot failures. This requires multicoverage: each asset must be observed by multiple robots for redundancy, with coverage requirements varying by asset importance. While recent work has solved the centralized problem optimally using integer programming, practical deployments face constraints that demand distributed solutions: robots operate with limited communication ranges, onboard computation restricts global planning, and partial system failures must not cause mission abort. We present a distributed multicoverage algorithm for robot swarms operating with local sensing, local communication, and no global coordination.
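
The multicoverage requirement itself is easy to state in code. The sketch below is only a checker for that requirement, not the distributed algorithm: it assumes 2-D positions, a circular sensing range shared by all robots, and a per-asset count of required observers. All names and numbers are hypothetical.

```python
import math

def coverage_deficits(robots, assets, sensing_range):
    """Return {asset_name: missing_observers} for under-covered assets.
    robots: list of (x, y); assets: {name: ((x, y), required_observers)}."""
    deficits = {}
    for name, (pos, required) in assets.items():
        observers = sum(1 for r in robots if math.dist(r, pos) <= sensing_range)
        if observers < required:
            deficits[name] = required - observers
    return deficits

robots = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
assets = {"pump": ((0.5, 0.0), 2),   # needs two observers, has two
          "gate": ((5.0, 4.0), 2)}   # needs two observers, has one
missing = coverage_deficits(robots, assets, sensing_range=1.0)
```

If the robot at (1.0, 0.0) failed, "pump" would also appear in the result, which is the situation redundancy is meant to absorb.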

2605.21683 2026-05-22 cs.AI

Investigating Concept Alignment Using Implausible Category Members

Sunayana Rane, Brenden M. Lake, Thomas L. Griffiths

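Affiliations: Department of Computer Science; Princeton University; Department of Psychology

AI summary: This paper probes concept boundaries by asking about implausible category members. It finds that AI models differ markedly from humans on some concepts, for example assigning "words" to categories such as "vehicles" or "clothing", and it discusses what this concept misalignment means for AI safety.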

Abstract

Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., "Is a car a vehicle?") is likely to recall patterns in the model's vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., "Is an olive a vehicle?") to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems' assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare these assignments to those made by human participants on the full range of within-category and cross-category assignment tasks. Our results reveal a range of concepts for which models differ in meaningful and surprising ways from humans, including treating "words" as belonging to categories like "vehicles" and "clothing," identifying several "vegetable" category members as "fruit," and assigning exemplars from non-weapon categories to the "weapons" category. We also demonstrate how these instances of concept misalignment translate into problematic downstream behavior with implications for AI safety.
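
The within-category and cross-category probes can be generated mechanically by pairing every item with every category. The sketch below assumes a yes/no question template modeled on the two examples quoted in the abstract; the tiny category table is a placeholder rather than the stimulus set from the study, and the boolean is simply the item's home-category label, not a human judgment.

```python
# Placeholder categories; the study's actual items are not listed in the abstract.
CATEGORIES = {
    "vehicle": ["car", "bus"],
    "fruit": ["olive", "apple"],
}

def build_probes(categories):
    """Return (question, is_home_category) for every item/category pairing."""
    probes = []
    for home, items in categories.items():
        for item in items:
            article = "an" if item[0] in "aeiou" else "a"
            for target in categories:
                question = f"Is {article} {item} a {target}?"
                probes.append((question, target == home))
    return probes

probes = build_probes(CATEGORIES)
```

Probes whose flag is False are the implausible, cross-category questions; comparing model and human answers on those is what exposes where the concept boundaries diverge.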

2605.21680 2026-05-22 cs.RO

Flying Together: Human-Guided Immersive Shared Control for Aerial Robot Teams in Unknown Environments

Lou De Bel-Air, Luca Morando, Ruitao Chen, Keru Wang, Benjamin Jarvis, Charbel Toumieh, Yang Zhou, Ken Perlin, Dario Floreano, Giuseppe Loianno

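Affiliations: New York University; Ecole Polytechnique Federale de Lausanne; University of California Berkeley

AI summary: This paper proposes a virtual-reality-based shared control framework for operating teams of drones in constrained and unknown environments, using real-time user-guided exploration to improve navigation in unstructured environments. The core method is a user-guided motion-primitive-based planner coupled with an admittance controller, which lets the operator flexibly influence team behavior and steer drones toward regions of interest that autonomous planners may overlook.

Comments: Accepted at IEEE International Conference on Robotics and Automation, Vienna 2026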

Abstract

While autonomous multi-robots can achieve safe and coordinated navigation, they often struggle to adapt to unforeseen conditions and to capture operator-driven objectives in unstructured environments. We present a Virtual Reality (VR)-based shared control framework for teams of drones operating in constrained and unknown environments, enabling real-time, user-guided exploration. At the core of our approach is a novel, user-guided motion-primitive-based planner that computes continuous, collision-free trajectories while continuously integrating operator input. This planner is coupled with an admittance controller, allowing the operator to flexibly influence team behavior and guide drones toward regions of interest that autonomous planners may overlook. The system supports mixed-reality operations with both physical and simulated drones, and implements a bilateral VR-based interface, allowing the operator to guide the robot team via migration points while receiving immediate visual feedback of the team state. Experimental results show that shared control improves obstacle avoidance, maintains inter-agent spacing, and reduces operator effort, demonstrating the feasibility and advantages of immersive, human-in-the-loop multi-robot navigation.
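
An admittance controller turns an applied force into a motion command through a virtual mass-damper model. The sketch below shows the textbook one-dimensional form, M dv/dt + D v = F, integrated with explicit Euler steps. The abstract only names the controller type, so the parameter values and the idea that operator input enters as the force term are assumptions.

```python
def admittance_step(velocity, force, mass=1.0, damping=2.0, dt=0.01):
    """One explicit-Euler step of M * dv/dt + D * v = F (1-D for clarity)."""
    accel = (force - damping * velocity) / mass
    return velocity + dt * accel

# Apply a constant virtual force of 1.0 for 20 s of simulated time.
v = 0.0
for _ in range(2000):
    v = admittance_step(v, force=1.0)
# v settles at force / damping = 0.5
```

The damping term is what makes the response compliant: a steady push yields a steady velocity, and the commanded motion decays smoothly once the operator stops pushing.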

2605.21669 2026-05-22 cs.CV cs.AI

MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim

Affiliations: Department of Bioengineering, University of Pittsburgh; School of Medicine, University of Pittsburgh; Department of Radiology, University of Pittsburgh; Department of Psychiatry, University of Pittsburgh

AI summary: This study proposes MRecover, a conditional generative model that uses AI-generated contrast to recover motion-corrupted MR images. Autoregressive slice conditioning provides volumetric consistency, improving the precision and generalizability of hippocampal subfield segmentation.

Abstract

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from synthesized and as-acquired images closely matched (n=416, r=0.87-0.97), and the synthesized images yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Because of the larger sample size, the synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus ε² = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/
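
The 31.8% figure follows directly from the two subject counts, as the short check below shows. It assumes that 593 is the number of analyzable subjects with synthesized images and 450 the number with as-acquired images, which is how the abstract's "593 vs 450" reads; the variable names are illustrative.

```python
synthesized_ok = 593   # analyzable subjects using synthesized TSE images
acquired_ok = 450      # analyzable subjects using as-acquired TSE images

# Relative gain in analyzable subjects, in percent.
gain_pct = 100 * (synthesized_ok - acquired_ok) / acquired_ok
gain_rounded = round(gain_pct, 1)  # 143 / 450 is about 31.8%
```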

2605.21661 2026-05-22 cs.LG cs.AI cs.CV

Hierarchical Variational Policies for Reward-Guided Diffusion

Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

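Affiliations: Department of Computer Science; University of California Irvine

AI summary: This paper proposes a hierarchical variational framework that amortizes control into a lightweight yet expressive stochastic policy, producing high-quality reward-aligned samples at reduced inference cost. On 4x super-resolution, the method runs more than 5x faster than the best-performing baseline while achieving better perceptual quality.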

Abstract

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

2605.21654 2026-05-22 cs.LG cs.AI cs.CL

Value-Gradient Hypothesis of RL for LLMs

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

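Affiliations: MBZUAI

AI summary: This paper offers a value-gradient perspective on why critic-free reinforcement learning methods are effective for LLM post-training. By analyzing the actor update and automatic differentiation through attention, it proposes decomposing the impact of RL into a value-gradient signal and reachable reward headroom.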

Abstract

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
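
For readers unfamiliar with how a method like GRPO avoids a learned critic, the sketch below shows the group-relative advantage in its commonly described form: several responses are sampled for the same prompt, and each reward is standardized against the group's mean and standard deviation. This illustrates the baseline being analyzed, not the paper's value-gradient result; the use of the population standard deviation and the epsilon value are assumptions, since implementations vary.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

The group mean plays the role that a value estimate would play in an actor-critic method, which is why no separate critic network is needed.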

2605.21653 2026-05-22 cs.LG cs.AI cs.CL

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

放大而非学习:微调的AI文本检测器放大了预训练的方向

Alexander Smirnov

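Affiliations: University College London

AI summary: This study argues that fine-tuned AI text detectors amplify a pretrained typicality direction rather than learn an AI-vs-human boundary. It finds that fine-tuning can reduce discrimination in some cases, that the same axis inverts on non-native ESL writing, and that a closed-form Jacobian predictor is effective across architectures.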

详情
AI中文摘要

AI文本检测器放大了预训练的典型性轴;它们并不构建AI与人类的边界。在没有任何任务监督的原始编码器上,将投影到AI-中心(HC3)的中心可以实现NYT与HC3的AUROC分别为0.806/0.944/0.834,跨三种架构(86-106%的微调辨别上限:在RoBERTa-base上,原始投影超过微调);在RoBERTa-base上,完全微调在两种流畅正式人口测试中降低了辨别能力。相同的轴在非母语ESL写作中反转(AUROC 0.06-0.20)--这是典型性阅读独有的可验证预测。一个24例冻结探测器与完全微调(0.900 vs 0.895)一致。一个闭合形式雅可比预测器参数化轴操纵干预,R²=1.000通用,提升了ELECTRA-CE部署的TPR从0.000到0.904(FPR=1%),并在三个独立训练的第三方RoBERTa检测器上转移,达到16/16 oracle等价(在OpenAI检测器上57%的NYT-FPR减少)。范围:编码器家族;机制幅度HC3锚定;人口层面共享轴,不同架构中每文本机制有所变化。三种操作上不同的探测器--文本表面caps_rate残差化、几何符号epsilon消融、闭合形式文本对预测器--在三种架构中一致,cos 0.74/0.81/1.00,确认了观察者不变性。在匹配TPR-0.90评估下,已发表的干预动物园(CC、dealign-f2c)在27个单元格中校准等价(|Delta AUROC| <= 0.0081),并且ELECTRA上的LoRA->full-FT偏移差距的97%是校准偏移而非学习表示--这是核心主张的预测确认。

英文摘要

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.

2605.21649 2026-05-22 cs.LG cs.CL

EntmaxKV: Support-Aware Decoding for Entmax Attention

EntmaxKV: 基于支持的解码方法用于Entmax注意力

Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

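Affiliations: Instituto Superior Técnico, Universidade de Lisboa; ELLIS Unit Lisbon; INESC-ID; Instituto de Telecomunicações

AI summary: This paper proposes EntmaxKV, a support-aware decoding framework that exploits the sparsity of entmax attention before KV pages are loaded. It combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention to reduce dropped probability mass and improve the efficiency of long-context language models.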

详情
AI中文摘要

长上下文解码越来越受到KV缓存内存流量的限制,因为每个生成的标记都需在缓存上进行注意力运算,而缓存大小与上下文长度成线性增长。现有稀疏解码方法通过选择部分标记或页面来减少成本,但这些方法是为softmax注意力设计的,其密集尾部使得任何截断都会丢弃非零的概率质量。相比之下,α-entmax产生精确的零,将稀疏解码从密集尾部近似转变为支持恢复:如果所选候选包含entmax支持,稀疏解码仍保持精确。虽然最近的entmax内核实现了高效的训练,但它们并未解决自回归解码瓶颈,即密集推理仍需在稀疏性确定之前流式传输完整的KV缓存。在本文中,我们引入了EntmaxKV,一种基于entmax的稀疏解码框架,它在KV页面加载前利用稀疏性。EntmaxKV结合了查询感知的页面评分、支持感知的候选选择和稀疏entmax注意力。我们通过分析截断误差中的丢弃概率质量δ,证明输出误差由δ控制,并在恢复entmax支持时消失。我们进一步引入了一种高斯感知的entmax选择器,从轻量级页面统计中估计entmax阈值,使所选预算适应于分数分布。实验证明,EntmaxKV比基于softmax的稀疏解码在相同KV预算下丢弃更少的概率质量,保留更多支持标记,并实现更低的输出误差。在长上下文和语言建模基准上,它接近完整的缓存entmax,但使用KV缓存的少量比例,达到100万上下文长度时,比完整的注意力基线快3.36倍(softmax)和5.43倍(entmax)。代码可在:https://github.com/deep-spin/entmaxkv获取。

英文摘要

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $\alpha$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $\delta$, showing that output error is controlled by $\delta$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
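
The support-recovery property described in this abstract can be illustrated with a minimal sketch. It uses sparsemax, the α = 2 member of the α-entmax family, because its threshold has a simple closed form. The function names, scores, and values are illustrative assumptions and are not taken from the EntmaxKV code base.

```python
def sparsemax(scores):
    """Sparsemax: alpha-entmax with alpha = 2 (projection onto the simplex)."""
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    threshold = 0.0
    for k, score in enumerate(ordered, start=1):
        running_sum += score
        # The k-th largest score stays in the support while this holds.
        if 1.0 + k * score > running_sum:
            threshold = (running_sum - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def attend(scores, values):
    """Attention output for scalar values: a sparsemax-weighted sum."""
    return sum(p * v for p, v in zip(sparsemax(scores), values))


# Hypothetical attention scores and scalar values for five cached tokens.
scores = [2.0, 1.5, 0.1, -1.0, 0.3]
values = [10.0, 20.0, 30.0, 40.0, 50.0]

probs = sparsemax(scores)
support = [i for i, p in enumerate(probs) if p > 0.0]

# A candidate set that contains the support plus one extra token.
candidates = [0, 1, 4]
full_output = attend(scores, values)
sparse_output = attend([scores[i] for i in candidates],
                       [values[i] for i in candidates])
```

Because the candidate set covers the support, the restricted computation reproduces the full-cache output. With softmax, every token carries nonzero weight, so dropping any of them would change the result.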

2605.21646 2026-05-22 cs.LG

Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

相似部分:一种基于特征的局部和全局原型解释方法

Jacek Karolczak, Jerzy Stefanowski

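Affiliations: Institute of Computing Science

AI summary: This paper proposes a feature-informed approach to local and global prototype explanations that integrates feature importance to make explanations more fine-grained. Experiments show that it increases feature diversity while maintaining the surrogate model's prediction fidelity.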

Comments Accepted for publication in International Journal of Applied Mathematics and Computer Science (IJAMCS)

详情
AI中文摘要

基于原型的解释提供了一种直观的、基于实例的方法来支持机器学习黑箱分类器的可解释性,但通常缺乏特征层面的细粒度。我们介绍了一个框架,该框架在两个层次上整合特征重要性以解决这一差距。首先,对于局部解释,我们提出"相似部分":一种利用特征重要性评分来突出分类实例与其最近原型之间最相关、共享的特征子集的方法,以引导用户关注。其次,我们通过在全局原型选择目标函数中加入特征重要性项,积极促进所选原型的特征属性的多样性。在六个基准数据集上的实验表明,这种增强的选取过程保持或在某些情况下提高了替代模型的预测保真度,表明特征多样性并不影响模型保真度。

英文摘要

Prototype-based explanations offer an intuitive, example-based approach to support the interpretability of machine learning black box classifiers but often lack feature-level granularity. We introduce a framework that integrates feature importance at two levels to address this gap. First, for local explanations, we propose "alike parts": a method that uses feature importance scores to highlight the most relevant, shared feature subsets between a classified instance and its nearest prototype, guiding user attention. Second, we augment the global prototype selection objective function with a feature importance term to actively promote diversity in the feature attributions of the selected prototypes. Experiments on six benchmark datasets show that this augmented selection process maintains or, in some cases, increases the prediction fidelity of the surrogate model, suggesting that feature diversity does not compromise model fidelity.

2605.21645 2026-05-22 cs.AI cs.DB

AOP-Wiki EMOD 3.0: Data Model Expansions and Content Evaluation Framework for Using Agentic AI to Improve Integration between AOPs and New Approach Methodologies (NAMs)

AOP-Wiki EMOD 3.0: 数据模型扩展和内容评估框架用于利用代理AI改进AOP与新方法论(NAMs)之间的整合

Virginia K. Hench, J. Harry Caufield, Sierra A. T. Moxon, Jason M. O'Brien, Stephen W. Edwards

Affiliations: Open BioData Modeling; Environmental Genomics and Systems Biology; Lawrence Berkeley National Laboratory; National Wildlife Research Centre; UL Research Institutes - Chemical Insights

AI summary: This paper presents AOP-Wiki EMOD 3.0, which uses data model expansions and a content evaluation framework to apply agentic AI to improving integration between AOPs and new approach methodologies (NAMs), in support of regulatory science and biomedical applications.

Comments 7 Figures and 3 Supplemental Figures

详情
AI中文摘要

不良后果路径(AOP)是将可在实验室中测量的生物机制因果联系到不良后果的逻辑模型,与化学监管终点相关。AOPs 为新方法论(NAMs)提供上下文,包括体外和体外方法,这些方法作为替代动物测试的替代方案,AOP中的连续事件作为多尺度模型跨越生物尺度。AOP-Wiki 作为全球AOP存储库。尽管AOP-Wiki在过去十年中在AOP扩展中发挥了核心作用,但当前的数据模型和应用基础设施的限制限制了AOP-Wiki支持持续AOP增长和演变的能力。然而,代理AI的变革力量重新激发了AOP-Wiki数据现代化的努力,尤其是在核心AOP原则可以用于指导AI用于汇总和结构化AOP相关信息的时候。抓住这一势头,我们提出了AOP-Wiki EMOD 3.0,即一系列证据模型原型中的第三款,具体展示了数据模型扩展和我们对AOP-Wiki如何被转变以更好地服务于监管科学和新兴AOP在生物医学和One Health领域中的使用。我们旨在为计算生成的AOP和定量AOP(qAOPs)奠定基础,通过聚焦于AOP-Wiki内部质量改进、证据结构以提高AOP FAIRness和AI准备性,以及改进AOP框架与NAMs之间的整合,以更好地服务于下一代风险评估。

英文摘要

Adverse Outcome Pathways (AOPs) are logic models that causally link biological mechanisms that can be measured in a lab to adverse outcomes relevant to chemical regulatory endpoints. AOPs contextualize new approach methodologies (NAMs), the in vitro and in silico methods used as alternatives to animal testing, and the sequential events in an AOP serve as multi-scale models spanning biological scales. The AOP-Wiki serves as the global repository for AOPs. While the AOP-Wiki has played a central role in AOP expansion over the past decade, constraints within the current data model and application infrastructure limit the AOP-Wiki from supporting continued AOP growth and evolution. Yet, the transformative power of agentic AI has re-invigorated AOP-Wiki data modernization efforts at a time when core AOP principles can be harnessed to inform use of AI for aggregating and structuring AOP-relevant information. Seizing upon this momentum, we present AOP-Wiki EMOD 3.0, the third in a series of evidence model prototypes, which concretely demonstrates data model expansions and our vision for how the AOP-Wiki might be transformed to better serve regulatory science and emergent use of AOPs in biomedical and One Health contexts. We aim to lay a foundation to support computationally-generated AOPs and quantitative AOPs (qAOPs) by focussing on solutions for AOP-Wiki internal quality improvement, evidence structuring to enhance AOP FAIRness and AI-readiness, and improved integration between the AOP framework and NAMs to better serve next generation risk assessment.

2605.21642 2026-05-22 cs.CV

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

Ablate-to-Validate: 视觉语言模型真的在使用连续思维令牌吗?

Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna

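Affiliations: University of Washington

AI summary: This paper proposes a diagnostic principle, Ablate-to-Validate, and instantiates it as the Token Replacement Test (TRT) to check whether vision-language models genuinely use the content of continuous tokens. It finds that performance gains may stem from the mere presence of the tokens rather than from their content.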

详情
AI中文摘要

视觉语言模型(VLMs)越来越多地引入连续或潜在的非文本令牌以支持'视觉思维'。尽管任务准确性有所提高,但这并不能证明模型确实使用这些令牌进行推理——收益可能来自于诸如增加的上下文长度、特殊令牌锚定或训练时的正则化等混淆因素。我们正式提出了一种诊断原则,Ablate-to-Validate,用于测试潜在令牌内容是否被真正利用,并将其实例化为Token Replacement Test(TRT),一个标准化的内容替换消融套件。TRT固定提示、图像、令牌预算和解码,同时用零、随机、首次重复或Oracle替代中间令牌,以确定性能是否依赖于令牌内容或仅仅是令牌存在。作为受控测试平台,我们研究了LLaVA-13B和Qwen2.5-VL-3B在相对深度推理中的表现,训练模型在多个冻结编码器(SigLIP2,CLIP,DINOv2)和令牌预算下预测和消耗连续或离散深度跨度。此外,我们还将TRT应用于三个现成的视觉思维系统(Mirage,Mull-Tokens,CoVT)在BLINK,VSP和CV-Bench上。在所有设置中,准确性提升都是潜在令牌推理的误导性代理:VLMs在令牌内容被破坏或替换时仍能保持大部分改进,揭示了拥有潜在通道与将其用作信息瓶颈之间的持续差距。我们推荐TRT作为任何引入连续思维令牌的方法的标准诊断工具,与准确性并行使用。

英文摘要

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
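
The content-replacement ablations named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: latent tokens are plain lists of floats, the helper name `replace_latent_tokens` is hypothetical, and "first-repeat" is read here as repeating the first latent token at every position, which the abstract does not spell out.

```python
import random


def replace_latent_tokens(tokens, mode, oracle=None, seed=0):
    """Return a same-shape replacement for a list of latent token vectors."""
    count, dim = len(tokens), len(tokens[0])
    if mode == "zero":
        return [[0.0] * dim for _ in range(count)]
    if mode == "random":
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
    if mode == "first-repeat":
        # Assumed reading: every position repeats the first latent token.
        return [list(tokens[0]) for _ in range(count)]
    if mode == "oracle":
        # Oracle tokens must be supplied by the caller (e.g. from ground truth).
        if oracle is None or len(oracle) != count:
            raise ValueError("oracle tokens must match the token budget")
        return [list(vec) for vec in oracle]
    raise ValueError(f"unknown mode: {mode}")


# Three hypothetical 2-dimensional latent tokens produced by a model.
latent = [[0.2, -0.1], [0.7, 0.4], [-0.3, 0.9]]
ablations = {mode: replace_latent_tokens(latent, mode)
             for mode in ("zero", "random", "first-repeat")}
```

Every variant keeps the number and width of the tokens unchanged, so the token budget stays fixed and only the content differs. Comparing task accuracy across these variants is what separates dependence on token content from dependence on token presence.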

2605.21630 2026-05-22 cs.AI

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

MindLoom: 通过组合思维模式进行前沿级推理数据合成

Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma

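Affiliations: Peking University; Tsinghua University

AI summary: This paper proposes MindLoom, a framework that synthesizes frontier-level reasoning data through compositional thought mode engineering, addressing the limited diversity and unstable difficulty control of existing methods. Models fine-tuned on its data compare favorably with baselines across the reported benchmarks.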

Comments Work in Progress. Comments: 27 pages, 4 figures, preprint

详情
AI中文摘要

尽管LLMs在推理方面取得了显著进展,系统性地生成前沿级推理数据仍然具有挑战性。现有合成方法往往缺乏对问题难度结构性因素的理解,导致多样性有限和难度控制不稳定。本文将推理问题的难度视为原子知识推理转换的累积,提出MindLoom框架,通过组合思维模式工程合成前沿级推理数据。给定一组具有验证解的难题,MindLoom首先将这些解分解为思维模式链,揭示每个问题的构建逻辑。然后训练一个检索模型,将问题状态匹配到兼容的思维模式,提供合成过程中引入哪些推理挑战的指导。新问题通过迭代应用检索到的思维模式到种子问题,并通过分布对齐采样来鼓励多样化的推理覆盖。最后,基于回放的判断阶段通过难度对生成的问题进行标记,并提供已判断正确的响应用于监督微调。我们在九个基准测试上评估了MindLoom,涵盖五个STEM学科和四个数学推理任务,多个模型家族和大小的模型在微调后均在报告的基准测试中表现出色。消融研究表明了每个组件的贡献,进一步分析表明MindLoom覆盖了广泛的推理模式,同时保持了有用的难度控制。我们已开源实现:https://github.com/EachSheep/MindLoom。

英文摘要

Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier-level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem's construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution-aligned sampling to encourage diverse reasoning coverage. Finally, a rollout-based judging stage labels generated questions by difficulty and supplies judged-correct responses for supervised fine-tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine-tuned on MindLoom-generated data achieve favorable performance over base models, distillation, and external-data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open-sourced our implementation at https://github.com/EachSheep/MindLoom.

2605.21625 2026-05-22 cs.CV cs.AI cs.CL

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Flat-Pack Bench: 通过家具组装评估大视觉-语言模型的时空理解

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

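Affiliations: Cornell University; Cornell Tech; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); UC Berkeley

AI summary: This paper introduces the Flat-Pack Bench benchmark for evaluating the spatio-temporal understanding of large vision-language models in complex video scenarios, and finds that current models fall well short on fine-grained spatio-temporal reasoning.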

Comments CVPR 2026

详情
AI中文摘要

大视觉-语言模型(LVLMs)的出现显著提升了视频理解能力。然而,现有基准主要集中在粗粒度任务,如动作分割、分类、描述和检索,且这些基准通常依赖于易于口头识别的实体,如家庭物品、动物、人类主体等,限制了其在复杂真实视频场景中的适用性。但许多应用,如家具组装、烹饪等,需要对视频进行逐步细粒度的时空理解,而当前基准并未充分评估。为解决这一差距,我们引入了Flat-Pack Bench,一个专注于家具组装任务的新基准。我们的基准评估LVLMs在细微任务上的表现,包括组装动作的时间顺序、组装状态的时间定位、理解部件配合和追踪,使用多选问题配以视觉提示突出相关部分作为参考,以回答细粒度问题。我们的实验表明,最先进的LVLMs在细粒度时空推理上表现显著不足,凸显了其在有效利用视频时间信息、跟踪能力和理解空间交互(如物理接触)方面的局限性。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, tracking parts, and understanding spatial interactions such as physical contact.

2605.21623 2026-05-22 cs.AI

The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

证词的形态:一种可扩展的口述史档案比较框架

Itamar Trainin, Renana Keydar, Amit Pinchevski

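Affiliations: Hebrew University of Jerusalem

AI summary: Through a large-scale computational analysis of more than 1,600 testimonies from two oral history archives, this paper examines the distinction between two styles of oral testimony in Holocaust studies and proposes a scalable framework for comparative corpus analysis.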

详情
AI中文摘要

研究者在大屠杀研究中常常将口述幸存者证词分为两种风格:美国犹太人研究肖尔基金会的访谈通常遵循结构化的、由访谈者引导的格式,而耶鲁福图诺夫视频档案则更倾向于自由形式、开放式风格。本研究通过分析两个档案中超过1600个证词,利用话语分割、主题建模和大型语言模型(LLM)分析,量化证词的“结构化”程度,包括主题连贯性、访谈者-幸存者动态和问题类型的分布。研究结果在总体上支持早期研究中发现的结构性差异,同时揭示了两个档案之间的显著重叠,不仅在个别访谈内,而且在共同的叙述模式中。这使得简单的“结构化vs.自由形式”二元对立在这些口述史中变得更加复杂。除了重新审视大屠杀研究中的一个基础性主张外,本工作还提供了一种可扩展、可重复的比较语料库分析框架。作为概念验证,它还为数字口述史、叙述分析以及公民科学注释平台的设计提出了更广泛的应用。

英文摘要

Researchers in Holocaust studies have often distinguished between two styles of oral survivor testimony: the USC Shoah Foundation's interviews tend to follow a structured, interviewer-guided format, whereas the Yale Fortunoff Video Archive generally favors a more free-form, open-ended style. This distinction has influenced both scholarly research and the development of later archives. In this study, we critically examine that claim by conducting a large-scale computational analysis of more than 1,600 testimonies from both collections. Leveraging discourse segmentation, topic modeling, and large language model (LLM) based analysis, we quantify the "structuredness" level of testimonies through topic coherence, interviewer-survivor dynamics, and the distribution of question types. Our results generally corroborate the structural differences identified in earlier research, while also revealing significant overlaps between the collections, both within individual interviews and across common narrative patterns. This complicates the simple "structured vs. free-form" dichotomy often applied to these oral histories. Beyond revisiting a foundational claim in Holocaust studies, our work provides a scalable, replicable framework for comparative corpus analysis. As a proof of concept, it suggests broader applications for digital oral history, narrative analysis, and the design of citizen-science annotation platforms.

2605.21622 2026-05-22 cs.AI

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

TO-Agents:一种用于基于偏好的拓扑优化的多智能体AI流水线

Isabella A. Stewart, Hongrui Chen, Faez Ahmed

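Affiliations: Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary: This paper proposes TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization, sparing designers from manually translating preferences into solver settings that are not directly tied to them. The framework is evaluated on two long-horizon design tasks.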

Comments Accepted for publication in the Proceedings of the ASME 2026 International Design Engineering Technical Conferences (IDETC2026)

详情
AI中文摘要

拓扑优化可以生成高效的结构,但设计者往往必须手动将定性意图,如期望的视觉风格、产品体验或可制造性转换为与这些偏好不直接相关的求解器设置。我们提出了TO-Agents,一种多智能体AI框架,将自然语言设计意图与迭代拓扑优化连接起来。该框架将人类提供的问题描述转换为经过验证的求解器输入,运行拓扑优化求解器,渲染结果的3D拓扑,并使用多视角视觉-语言推理与独立的评判智能体来批评每个结果并修改求解器参数。我们在两个长周期设计任务上评估了该框架:悬臂梁基准测试和手机支架产品设计。在两个任务中,设计者指定了受自然树形态启发的分层分支结构的美学偏好,系统在十个独立重复中进行了四次修订循环。TO-Agents在每个案例研究中至少在60%的试验中生成了符合偏好的设计,对应于没有视觉或历史反馈的简化流水线的6倍以上的成功试验。评判评分和人类评估显示,该流水线能够识别有效的参数杠杆,从差的修订中恢复,并扩展设计探索。一个制造智能体进一步对排名最高的设计进行后处理,以实现增材制造,使设计能够从意图到原型。我们还识别了失败模式,包括过度优化、选择性记忆、工具位置错误和参数推理错误。这些结果表明,智能体拓扑优化可以将设计者从低层次参数调整转向高层次的形式和功能指定,同时强调了可靠自主工程设计所需的保障措施。

英文摘要

Topology optimization can generate efficient structures, but designers often must manually translate qualitative intent, such as desired visual style, product experience, or manufacturability, into solver settings that are not directly tied to those preferences. We present TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization. The framework converts a human-provided problem description into validated solver inputs, runs a topology optimization solver, renders the resulting 3D topology, and uses multi-view vision-language reasoning with an independent judge agent to critique each result and revise solver parameters. We evaluate the framework on two long-horizon design tasks: a cantilever beam benchmark and a phone-stand product design. In both tasks, the designer specifies an aesthetic preference for hierarchically branched structures inspired by natural tree morphologies, and the system performs four revision cycles across ten independent replicates. TO-Agents produces at least one preference-aligned design in 60% of trials for each case study, corresponding to up to 6x more successful trials than an ablated pipeline without visual or historical feedback. Judge scores and human evaluations show that the pipeline can identify effective parameter levers, recover from poor revisions, and expand design exploration. A manufacturing agent further post-processes top-ranked designs for additive manufacturing, enabling end-to-end intent-to-prototype design. We also identify failure modes, including overshooting, selective memory, misplaced tools, and incorrect parameter reasoning. These results suggest that agentic topology optimization can shift designers from low-level parameter tuning toward higher-level specification of form and function, while highlighting safeguards needed for reliable autonomous engineering design.

2605.21611 2026-05-22 cs.CV cs.LG

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

UniVL:统一的视觉-语言嵌入用于空间接地的上下文图像生成

Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei

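Affiliations: Center for Advanced AI

AI summary: This paper proposes a unified vision-language embedding that binds semantics to spatial locations directly from a single visual input, reducing computation and improving image generation quality.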

详情
AI中文摘要

我们引入了空间接地的上下文图像生成任务,这是一种可控的图像生成任务,重新定义了条件生成范式。与通过两个独立编码器分别提供参考图像和全局文本提示不同,UniVL被训练以从单一统一的视觉输入中直接绑定语义到空间位置,其中文本指令被渲染到空间掩码上。这消除了推理过程中对独立文本编码器的需求。所得到的模型通过遵循用户指定的指令来支持上下文图像生成,即在指定位置生成什么内容,同时显著减少了计算量。为了解决这一任务,我们提出了一种框架,其中从光学字符识别预训练的backbone中适应的UniVL编码器读取统一的条件,并生成一个融合视觉和语义意图以及空间位置的UniVL嵌入fVIL。一个两阶段流程首先对齐UniVL与VAE嵌入空间,然后将预训练的扩散backbone完全基于UniVL嵌入进行条件生成,消除了如T5等独立文本编码器。尽管这种重新定义使用了刻意最小化的文本接口,但仍然取得了显著的实证收益。在UniVL-ImgGen上,一个包含477,000个掩码标注图像的基准数据集上,UniVL在文本提示基线之上提高了图像质量,将FID从14降低到11,并将PSNR从16提高到20。它还完全消除了文本编码器,将推理TFLOPs减少高达52%,将运行时间减少高达44%。此外的消融研究验证了所提出组件的贡献,为具有统一条件范式的高效、空间接地图像生成铺平了道路。

英文摘要

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.

2605.21610 2026-05-22 cs.LG

AgForce Enables Antigen-conditioned Generative Antibody Design

AgForce 使生成抗体设计具备抗原条件

Mansoor Ahmed, Murray Patterson

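Affiliations: Georgia State University; Georgia Institute of Technology

AI summary: This paper proposes AgForce, which uses a graph neural network encoder and specialized decoders to address the tendency of existing antibody design methods to ignore the antigen input, improving binding quality and sequence recovery.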

详情
AI中文摘要

抗体设计方法通常基于抗原结构生成互补决定区(CDR),但基线方法的系统评估表明,它们大多忽略了抗原输入。我们识别出三种导致这种行为的失败模式。抗原盲性是因为模型从抗体框架上下文推断预测,而非抗原信息,从而产生几乎相同的CDR,无论目标如何。词汇坍塌将预测的氨基酸减少到每个位置3到5种,远低于天然序列的真实分布。此外,任何使用标准位置交叉熵训练的模型都会收敛到位置边际分布,这使得它无法产生抗原特异性序列预测。我们提出了一种名为AgForce的新型编码器-解码器架构,它使用图神经网络(GNN)作为编码器,并针对序列-结构协同设计设计了专用解码器。具体而言,我们应用了框架dropout、门控瓶颈和双曲交叉注意力,以防止抗体的捷径路径。在解码器中,一个具有Potts-like成对耦合和退火的多选学习(aMCL)的混合密度网络(MDN)序列头取代了交叉熵目标,用一个多组件分布替代了位置边际分布的最优解。一个抗原循环一致性头将梯度路由通过序列解码器,迫使预测分布编码抗原身份。AgForce在CHIMERA-Bench数据集上同时实现了最佳的结合质量和序列恢复能力,比最强的序列基线提高了8%的氨基酸恢复率,且在所有界面指标上均优于基线,几乎将GNN方法的有效词汇量翻倍。源代码可在:https://github.com/mansoor181/ag-force.git

英文摘要

Antibody design methods condition on antigen structure to generate complementarity-determining regions (CDRs), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per-position cross-entropy converges to the positional marginal distribution, making it provably unable to produce antigen-specific sequence predictions. We propose a novel encoder-decoder architecture called AgForce that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence-structure co-design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross attention that prevent the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts-like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross-entropy objective with a multi-component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the CHIMERA-Bench dataset, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: https://github.com/mansoor181/ag-force.git

2605.21609 2026-05-22 cs.CL cs.AI cs.CY

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

CR4T:基于重写的青少年LLM安全机制

Heajun An, Qi Zhang, Vedanth Achanta, Jin-Hee Cho

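Affiliations: Virginia Tech

AI summary: This paper proposes the CR4T framework, which replaces refusal-oriented safety mechanisms with selective response reconstruction to improve LLM safety in a way better aligned with adolescents' developmental needs.
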
详情
AI中文摘要

大型语言模型(LLMs)越来越多地嵌入青少年的数字环境,介导信息搜索、建议和情感敏感的互动。然而,现有安全机制仍主要基于成人中心的规范,并通过拒绝导向的压制来实现安全。尽管这些方法可能减少即时的政策违规,但它们也可能导致对话死胡同、限制建设性指导,并未能解决青少年与AI互动中固有的发展脆弱性。我们主张,青少年LLM安全不应仅被视为过滤问题,而应被视为一种社会技术、发展一致的转变问题。为实现这一视角,我们提出了Critique-and-Revise-for-Teenagers(CR4T),一种模型无关的安全保障框架,该框架可选择性地将不安全或拒绝式输出重构为适合年龄的指导性响应,同时保持善意意图。CR4T结合轻量级风险检测与领域条件重写,以去除风险放大内容,减少不必要的对话关闭,并引入适合发展的指导。实验结果表明,针对重写显著减少了不安全和拒绝导向的结果,同时避免了对可接受互动的不必要的干预。这些发现表明,选择性响应重构为青少年面向的LLM系统提供了一种更以人为本的替代方案,以替代以拒绝为中心的安全机制。

英文摘要

Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.

2605.21606 2026-05-22 cs.LG cs.AI

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

何时教师标记可靠?用于推理的基于位置加权的在线自我蒸馏

Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

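Affiliations: Johns Hopkins University; University of Wisconsin–Madison; Nanyang Technological University

AI summary: This paper proposes a position-weighted on-policy self-distillation method for reasoning. It introduces a branch-viability diagnostic to identify when teacher tokens are reliable and validates the approach on several models.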

Comments Pre-print. Code is available at https://github.com/SaFo-Lab/PW-OPSD

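Details
AI abstract (translated from Chinese)

On-policy self-distillation (OPSD) trains a student with a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision, but high teacher entropy in reasoning is ambiguous as a reliability signal: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area under the ROC curve (AUROC) of 0.83; local uncertainty scores reach at most 0.57. Motivated by this trajectory structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target. In comprehensive evaluations across different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, respectively, and also shows consistent Avg@12 improvements on two larger-scale models. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be exploited without additional teacher computation.

English abstract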

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
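The idea of weighting a clipped forward-KL term by token position can be illustrated with a small sketch. This is not the authors' implementation: the linear ramp in `linear_position_weights`, the per-token cap `kl_clip`, and all function names are assumptions made for illustration, since the abstract only states that the position weight is increasing and that the forward-KL target is clipped.

```python
import math


def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """Return KL(teacher || student) for one token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def linear_position_weights(length, w_start=0.5, w_end=1.5):
    """Hypothetical increasing schedule: a linear ramp over token positions."""
    if length == 1:
        return [1.0]
    step = (w_end - w_start) / (length - 1)
    return [w_start + step * t for t in range(length)]


def position_weighted_loss(teacher_seq, student_seq, kl_clip=5.0):
    """Weighted mean of capped per-token forward KL over one non-empty rollout.

    Capping each token's KL at kl_clip is an assumed reading of "clipped".
    """
    weights = linear_position_weights(len(teacher_seq))
    clipped = [
        min(forward_kl(p, q), kl_clip) for p, q in zip(teacher_seq, student_seq)
    ]
    return sum(w * k for w, k in zip(weights, clipped)) / sum(weights)


if __name__ == "__main__":
    teacher = [[0.9, 0.1]] * 3
    early_mismatch = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
    late_mismatch = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
    print("early:", position_weighted_loss(teacher, early_mismatch))
    print("late: ", position_weighted_loss(teacher, late_mismatch))
```

Under this schedule, the same teacher-student disagreement contributes more to the loss when it occurs late in the rollout than when it occurs early, which is the qualitative behavior an increasing position weight is meant to produce.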

2605.21600 2026-05-22 cs.LG

ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning

Mansoor Ahmed, Spencer VonBank, Nadeem Taj, Sujin Lee, Naila Jan, Murray Patterson

Affiliations: Georgia State University, Atlanta, USA; Georgia Institute of Technology, Atlanta, USA; DePauw University, Indiana, USA; University of Engineering

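AI summary: This paper proposes ConTact, a method for antibody CDR design via explicit interface reasoning. It explicitly decomposes CDR design into three stages (learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features), improving structural quality and epitope awareness.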

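Details
AI abstract (translated from Chinese)

Computational antibody CDR design methods generate binding loops conditioned on antigen structure, yet existing architectures conflate two fundamentally different sub-problems: determining which CDR positions will contact the antigen, and selecting amino acids at those positions. This conflation forces models to learn contact reasoning implicitly through uniform message passing, diluting the antigen signal equally across all positions. We introduce ConTact, a contact-then-act architecture that explicitly decomposes CDR design into three consecutive stages: learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features into the sequence head. A distance-biased cross-attention module encodes geometric priors that favor spatial neighbors, while a contact-weighted cross-entropy loss concentrates the gradient signal on binding-critical positions. On the CHIMERA-Bench dataset, ConTact achieves the best structural quality (7% RMSD improvement over the next-best baseline) and the best epitope awareness (10% F1 score improvement over GNN baselines), and achieves competitive sequence recovery (AAR 0.38).

English abstract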

Computational antibody CDR design methods condition on antigen structure to generate binding loops, yet existing architectures conflate two fundamentally distinct sub-problems: identifying which CDR positions will contact the antigen, and selecting amino acids at those positions. This conflation forces models to learn contact reasoning implicitly through uniform message passing, diluting antigen signal across all positions equally. We introduce ConTact, a contact-then-act architecture that explicitly decomposes CDR design into three cascaded stages: learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features into the sequence head. A distance-biased cross-attention module encodes geometric priors favoring spatial neighbors, while a contact-weighted cross-entropy loss concentrates gradient signal on binding-critical positions. On the CHIMERA-Bench dataset, ConTact achieves the best structural quality (7% RMSD improvement over the next-best baseline), the best epitope awareness (10% F1 score improvement over GNN baselines), and competitive sequence recovery (AAR 0.38) among several CDR-H3 design baselines.
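Two of the components named above, distance-biased cross-attention and a contact-weighted cross-entropy loss, can be sketched in a few lines. This is only a rough illustration under stated assumptions: the linear distance penalty (`bias_scale`), the weighting rule `1 + alpha * contact_prob`, and all function names are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distance_biased_attention(scores, distances, bias_scale=0.5):
    """Attention of one CDR residue over antigen residues.

    Assumed bias form: subtract bias_scale * distance from each raw score,
    so spatially closer antigen residues receive more attention.
    """
    return softmax([s - bias_scale * d for s, d in zip(scores, distances)])


def contact_weighted_cross_entropy(probs, targets, contact_probs, alpha=2.0):
    """Cross-entropy over CDR positions, up-weighting likely contact positions.

    probs[i] is the predicted amino-acid distribution at position i,
    targets[i] is the index of the true amino acid, and contact_probs[i]
    is the predicted contact probability. The weight 1 + alpha * contact
    is a hypothetical choice.
    """
    weights = [1.0 + alpha * c for c in contact_probs]
    nll = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(w * l for w, l in zip(weights, nll)) / sum(weights)


if __name__ == "__main__":
    print(distance_biased_attention([1.0, 1.0, 1.0], [2.0, 4.0, 8.0]))
    wrong, right = [0.1, 0.9], [0.9, 0.1]
    print(contact_weighted_cross_entropy([wrong, right], [0, 0], [1.0, 0.0]))
```

With equal raw scores, the nearest antigen residue receives the largest attention weight, and a wrong amino-acid prediction costs more at a predicted contact position than at a non-contact position.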

2605.21573 2026-05-22 cs.CV

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, Chong Luo, Jianmin Bao, Ji Li, Lei Shi, Qinhong Yang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yitong Wang, Yunuo Chen

Affiliations: Microsoft Lens Team

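AI summary: This paper proposes Lens, a 3.8B-parameter text-to-image model that matches or surpasses state-of-the-art models with more than 6B parameters across various benchmarks while requiring significantly less training compute. It achieves efficient training and optimization by maximizing the data information density of each training batch and by making architectural choices that speed up convergence.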

Comments Project Page: https://github.com/microsoft/Lens

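Details
AI abstract (translated from Chinese)

We introduce Lens, a 3.8B-parameter text-to-image (T2I) model that performs on par with, or better than, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute of Z-Image. Beyond its compact model size, the training efficiency of Lens stems from two key strategies. First, we maximize the data information density of each training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and average about 109 words, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including a semantic variational autoencoder (VAE) that provides better latent representations and a strong language encoder that accelerates optimization and enables multilingual generalization from English training data. After pre-training, we apply reinforcement learning with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and knowledge-distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version completes 4-step generation in 0.84 seconds.

English abstract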

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
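The efficiency figures quoted above are easier to compare as ratios. The short calculation below derives them from the numbers in the abstract only; it assumes the 0.84-second turbo figure refers to the same 1024^2 single-H100 setting as the 3.15-second figure, which the abstract implies but does not state outright. The constant and function names are illustrative.

```python
# Figures quoted in the abstract.
STANDARD_SECONDS = 3.15              # 1024^2 image, single NVIDIA H100
TURBO_SECONDS = 0.84                 # distilled 4-step generation
COMPUTE_FRACTION_VS_Z_IMAGE = 0.193  # Lens training compute / Z-Image compute


def speedup(baseline_seconds, fast_seconds):
    """How many times faster the fast setting is than the baseline."""
    return baseline_seconds / fast_seconds


def reduction_factor(fraction):
    """How many times smaller a quantity is when it is `fraction` of another."""
    return 1.0 / fraction


if __name__ == "__main__":
    print(f"turbo speedup: {speedup(STANDARD_SECONDS, TURBO_SECONDS):.2f}x")
    print(f"compute reduction: {reduction_factor(COMPUTE_FRACTION_VS_Z_IMAGE):.2f}x")
```

Read this way, the turbo variant is roughly 3.75 times faster per image than the standard model, and the reported training compute is roughly one fifth of that of Z-Image.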