arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13258 2026-06-17 cs.AI New submission

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

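Affiliations * Nanyang Technological University; Pacific Parkinson's Research Centre, University of British Columbia

AI summary For the modality-incremental setting in Parkinson's disease gait assessment, the paper proposes the MOSAIC framework, which uses Modality-Specific Warm-Up, a statistics-decoupled MSBN architecture, and a curriculum-guided repulsive objective to address unreliable cross-modal distillation, statistical shifts, and reduced plasticity.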

Abstract

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git
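The statistics-decoupled idea can be sketched as follows. This is only a minimal illustration, assuming that MSBN denotes a modality-specific batch-normalization layer (the abstract does not expand the acronym). All names (`ModalityNorm`, `shared_backbone`, the modality keys `"imu"` and `"pressure"`) and all sensor values are hypothetical; it is not the authors' implementation. Each modality keeps its own normalization statistics, while one set of backbone parameters is shared.

```python
import math


class ModalityNorm:
    """Keeps separate mean/variance statistics per modality (assumed reading of MSBN)."""

    def __init__(self, eps=1e-5):
        self.eps = eps
        self.stats = {}  # modality name -> (mean, variance)

    def fit(self, modality, values):
        # Estimate statistics from this modality's data only.
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        self.stats[modality] = (mean, var)

    def __call__(self, modality, values):
        # Normalize with the statistics that belong to this modality.
        mean, var = self.stats[modality]
        return [(v - mean) / math.sqrt(var + self.eps) for v in values]


def shared_backbone(features, weight=0.5, bias=0.1):
    # Hypothetical shared layer: the same parameters serve every modality.
    return [weight * f + bias for f in features]


norm = ModalityNorm()
imu = [100.0, 102.0, 98.0, 100.0]    # large-scale readings (made-up values)
pressure = [0.10, 0.12, 0.08, 0.10]  # small-scale readings (made-up values)
norm.fit("imu", imu)
norm.fit("pressure", pressure)

# After per-modality normalization both streams occupy a comparable range,
# so the shared backbone sees similar input statistics.
out_imu = shared_backbone(norm("imu", imu))
out_pressure = shared_backbone(norm("pressure", pressure))
```

Adding a new sensor in this sketch only requires a new entry in `stats`; the shared parameters are untouched, which is the property the abstract attributes to the decoupled design.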

2606.13196 2026-06-17 cs.AI cs.CY New submission

Under What Conditions Can a Machine Be Called Genuinely Creative?

Yong Zeng

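Affiliations * Concordia University

AI summary Building on Designics theory, the paper proposes ten requirements that genuine machine creativity must satisfy, illustrates their computational tractability through examples, and argues that current generative AI systems do not yet possess genuine creativity.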

Abstract

Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can be called genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

2606.12863 2026-06-17 cs.LG New submission

Multimodal Graph Negative Learning

Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang

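Affiliations * University of Science and Technology of China

AI summary The paper proposes the GraphMNL framework, which uses negative learning to address node-level branch semantic imbalance in multimodal attributed graphs and to avoid propagating dominant-branch bias, achieving the best performance on the Grocery and Reddit M datasets.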

Abstract

Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology and textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In a comprehensive experimental evaluation, GraphMNL achieves the best performance, with 72.47% accuracy on the Grocery dataset and a 76.60 F1 score on the Reddit M dataset.
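The contrast between imitation and negative guidance can be illustrated with a small sketch. It assumes the textbook negative-learning loss, $-\log(1 - p_k)$ for each class $k$ flagged as unlikely, and skips the true class so that target supervision is left alone. The function names and example logits are hypothetical, and GraphMNL's reliability arbitration and gating are not reproduced here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_learning_loss(logits, unlikely_classes, target):
    """Penalize probability mass on classes the guiding branch considers unlikely.

    The target class is never penalized, so the supervised loss alone
    decides how the correct class is learned.
    """
    probs = softmax(logits)
    loss = 0.0
    for k in unlikely_classes:
        if k == target:
            continue  # target-preserving: skip the true class
        loss += -math.log(1.0 - probs[k])
    return loss


# An inferior branch that puts mass on class 2, which the guiding branch rules out.
before = negative_learning_loss([1.0, 0.5, 2.0], unlikely_classes=[2], target=0)
# Lowering the logit of class 2 reduces the loss without dictating a winner.
after = negative_learning_loss([1.0, 0.5, -1.0], unlikely_classes=[2], target=0)
```

Unlike a distillation loss, this term never says which class is right; it only removes mass from the flagged alternatives.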

2606.12742 2026-06-17 cs.AI cs.AR New submission

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

Farough Shayeste Roodi, Parham Zilouchian Moghaddam, Mahdi Mohammadi-nasab, Mehdi Modarressi, Mostafa Ersali Salehi Nasab, Masoud Daneshtalab

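Affiliations * University of Tehran; Mälardalen University; Royal Institute of Technology

AI summary The study deploys DNN models on resource-constrained wearable devices using parameter quantization and electrode reduction, and examines the resulting trade-off between accuracy and complexity in EEG analysis.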

Abstract

Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial biological signals, namely ECG and EEG, which reflect the activity of the heart and brain, respectively. Although deep neural networks are considered the primary way to process and analyze these signals, the very tight energy and computational power constraints in wearable devices are far below the computational, energy, and memory bandwidth demands of DNN models, thereby impeding the deployment of deep learning in many practical wearable services. This paper investigates the feasibility of deploying state-of-the-art DNN models in resource-constrained wearable devices. Notably, we explore the trade-off between accuracy and computational complexity of DNNs when parameter quantization and electrode reduction methods are used. Our investigation centers on several state-of-the-art DNN models designed for EEG signal analysis, specifically for detecting epileptic seizures. Our findings demonstrate that, when applied judiciously, these techniques can significantly reduce the complexity of the DNNs under consideration with minimal adverse effects on accuracy. These results reveal the explicit trade-offs between accuracy and complexity reduction encountered when adapting DNN-based online EEG analysis for wearable devices.
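The two complexity-reduction techniques named in the abstract can be sketched in a few lines. The example assumes symmetric uniform quantization and a hand-picked electrode subset; the weights, channel names, and bit width are illustrative values, not taken from the paper.

```python
def quantize(weights, bits=8):
    """Symmetric uniform quantization: map floats to signed integers and back."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    ints = [round(w / scale) for w in weights]  # values stored on the device
    dequant = [i * scale for i in ints]         # values used in arithmetic
    return ints, dequant


def select_electrodes(recording, keep):
    """Electrode reduction: keep only the listed channels of a recording."""
    return {name: samples for name, samples in recording.items() if name in keep}


weights = [0.82, -0.41, 0.05, -1.27]
ints, approx = quantize(weights, bits=8)
max_error = max(abs(w - a) for w, a in zip(weights, approx))

recording = {"Fp1": [1, 2], "Fp2": [3, 4], "C3": [5, 6], "C4": [7, 8]}
reduced = select_electrodes(recording, keep={"Fp1", "C3"})
```

Lower bit widths shrink storage and arithmetic cost but enlarge `max_error`, and fewer electrodes shrink the input. In both cases the accuracy impact has to be measured on the model itself, which is the trade-off the paper studies.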

2606.11990 2026-06-17 cs.LG cs.AI New submission

Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

Amir El-Ghoussani, Michele De Vita, Ronald Naumann, Vasileios Belagiannis

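Affiliations * University of Erlangen-Nuremberg; Siemens AG

AI summary The paper proposes using the frozen pretrained time-series foundation model Chronos-2 as a backbone, combined with a lightweight regression head, for remaining useful life prediction; it outperforms several baselines on industrial sensor data.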

Comments Accepted to EUSIPCO 2026, 4 pages, 2 figures, 2 tables

Abstract

Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach, in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.
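The frozen-backbone recipe can be sketched without any model download. In the minimal example below, `frozen_embed` is a hand-written stand-in for the pretrained backbone (it does not call Chronos-2 or any real library), the head is a plain linear layer rather than a neural network, and the run-to-failure signal is synthetic. All of these are assumptions made only to show that training touches the head alone.

```python
def frozen_embed(window):
    """Stand-in for a frozen pretrained backbone; it has no trainable state."""
    n = len(window)
    mean = sum(window) / n
    slope = (window[-1] - window[0]) / (n - 1)
    return [1.0, mean, slope]  # constant term plus two summary features


def predict(head, feats):
    return sum(w * x for w, x in zip(head, feats))


def mse(head, data):
    return sum((predict(head, x) - y) ** 2 for x, y in data) / len(data)


def train_head(data, lr=0.1, epochs=2000):
    """Fit only the small linear head with gradient descent on squared error."""
    head = [0.0] * len(data[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(head)
        for x, y in data:
            err = predict(head, x) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(data)
        head = [w - lr * g for w, g in zip(head, grad)]
    return head


# Synthetic run-to-failure signal: a wear indicator rising from 0 to 1.
signal = [t / 50 for t in range(51)]
context = 5  # context window length
data = []
for end in range(context, len(signal) + 1):
    window = signal[end - context:end]
    rul = (len(signal) - end) / len(signal)  # normalized remaining life
    data.append((frozen_embed(window), rul))

head = train_head(data)
```

Changing `context` is the knob that corresponds to the context-length analysis mentioned in the abstract. With a real backbone, only `frozen_embed` would be swapped out.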

2606.11616 2026-06-17 cs.LG cs.IR New submission

DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

Jiale Deng, Yanyan Shen, Xiaogang Shi, Junjun Chai

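Affiliations * Shanghai Jiao Tong University; ByteDance Inc.; TikTok

AI summary The paper proposes the DeMix framework, which uses influence vectors to capture the distinct patterns that different error types leave on model behavior, casts data debugging as a multi-label classification problem, and introduces an intervention-based learning strategy; across 11 tasks it significantly improves debugging F1 score and post-repair model performance.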

Abstract

High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: https://github.com/SJTU-DMTai/DeMix.

2606.10774 2026-06-17 cs.LG cs.DC New submission

Asynchronous Decentralized Federated Learning over Lossy Wireless Links via Reception- and Age-Aware Aggregation

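Inverse Probability Weighting and Age-of-Information Aggregation for Decentralized Federated Learning under Partial Reception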

Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya

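Affiliations * University of Melbourne; University of Technology Sydney

AI summary To address selection bias and update staleness in decentralized federated learning over wireless networks, the paper proposes DFL-AA, which combines inverse probability weighting with Age-of-Information weighting; it removes link-quality bias in theory and outperforms existing baselines in experiments.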

Comments 14 pages, 9 figures, research paper for journal submission

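AI abstract (translated from Chinese)

Decentralized federated learning over lossy wireless networks faces two key challenges: selection bias, where updates from poor-quality links are systematically under-represented because models are only partially received, and update staleness, where asynchronous nodes contribute outdated information. We show that uniform gossip aggregation with local-fill reconstruction introduces a persistent link-quality-induced bias, and that completeness-based weighting further amplifies this effect. To address these challenges, we propose DFL-AA (Decentralized Federated Learning with Adaptive AoI-weighted Aggregation), which combines inverse probability weighting with online EWMA-based channel estimation to correct selection bias, and Age-of-Information-based weighting to mitigate staleness, without requiring global synchronization. We prove theoretically that DFL-AA removes link-quality distortion in expectation, and show experimentally that it consistently outperforms state-of-the-art baselines across different packet-loss rates, network sizes, and heterogeneous wireless conditions.

Abstract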

Decentralized Federated Learning (DFL) enables collaborative model training across wireless edge nodes, including IoT deployments, autonomous vehicles, UAV swarms, and satellite constellations. Operating over lossy wireless links under constraints, these systems cannot rely on retransmissions, so model parameters must be accepted as partial chunks. This leads to two key failure modes: selection bias, where poor-quality links are systematically under-represented in gossip aggregation, and update staleness, where asynchronous nodes contribute outdated models. We prove that classical gossip aggregation introduces irreducible selection bias proportional to the link-loss rate. We propose DFL-AA (Decentralized Federated Learning with Adaptive AoI-weighted Aggregation), which corrects selection bias using Inverse Probability Weighting (IPW) with online channel estimation and mitigates staleness via Age-of-Information (AoI) decay without requiring a global clock. We prove that DFL-AA removes link-quality distortion in expectation and consistently outperforms state-of-the-art baselines across varying loss rates and heterogeneous channel conditions on fixed directed topologies.
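A minimal sketch of reception- and age-aware weighting follows. It assumes an EWMA estimate of each link's reception probability, an exponential age discount, and a normalized weighted average per parameter chunk. The function names, the `beta` and `decay` values, and the normalization step are illustrative choices, not the paper's exact update rule.

```python
import math


def ewma_update(p_hat, received, beta=0.1):
    """Online estimate of a link's reception probability (EWMA, assumed form)."""
    return (1.0 - beta) * p_hat + beta * (1.0 if received else 0.0)


def chunk_weight(p_hat, age, decay=0.5):
    """Inverse-probability weight for a received chunk, discounted by its age."""
    return math.exp(-decay * age) / p_hat


def aggregate_chunk(own_value, neighbours):
    """Combine one parameter chunk from the node itself and its neighbours.

    neighbours: list of (value, received, p_hat, age) tuples. Lost chunks
    contribute nothing; received chunks are up-weighted by 1 / p_hat so that
    weak links are not under-represented on average.
    """
    total, weight_sum = own_value, 1.0
    for value, received, p_hat, age in neighbours:
        if received:
            w = chunk_weight(p_hat, age)
            total += w * value
            weight_sum += w
    return total / weight_sum


# Expected (unnormalized) IPW contribution of a fresh chunk over a link with p = 0.25:
p, value = 0.25, 4.0
expected = p * chunk_weight(p, age=0) * value + (1 - p) * 0.0
```

Here `expected` equals `value`: the chunk arrives only a quarter of the time but counts four times when it does, which is the sense in which inverse probability weighting removes link-quality distortion in expectation. Dividing by `weight_sum` keeps the aggregate on the scale of a single model, at the cost of making the estimator a ratio rather than an exactly unbiased sum.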

2606.10703 2026-06-17 cs.LG cs.CL New submission

From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Leonard Engmann, Christian Medeiros Adriano, Holger Giese

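Affiliations * University of California, Berkeley

AI summary A causal audit finds that routing statistics in Mixture-of-Experts models do not predict expert importance, and that existing pruning methods succeed because of early-layer redundancy rather than by identifying dispensable experts.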

Comments 9 pages, 2 figures, 9 tables. Accepted at the ICML 2026 Workshop on Philosophy of Science Meets Machine Learning (PhilML). Camera-ready Version. Non-archival

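AI abstract (translated from Chinese)

Interpretability methods routinely use population-level statistics over observed model behaviour to infer the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as support for rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional loss. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds that, after multiple-comparison correction, no observational metric predicts causal expert importance in any model, with effect sizes below Cohen's $d = 0.17$ across all 60 metric-layer combinations. A per-token routing weight control rules out insufficient statistical power, recovering a single Bonferroni-significant signal only at OLMoE's final MoE layer ($d = +0.231$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy makes most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational statistics to token-level interventional claims about expert importance, and show how interventional audits can calibrate the evidential standards for interpretability claims.

Abstract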

Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance in any model: across all 60 metric-layer combinations effect sizes stay below Cohen's $d = 0.23$, and no metric is reliably positive under our corrected, dual-test criterion. A per-token routing weight control, run with identical $n$, rules out insufficient power, recovering a signal whose CI excludes zero at OLMoE's final MoE layer ($d = +0.231$, 95% CI $[+0.09, +0.37]$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.
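For readers unfamiliar with the effect-size measure quoted above, the sketch below computes Cohen's $d$ with the pooled sample standard deviation. This is the textbook definition; the abstract does not say which variant the authors use, and the two groups of numbers are made up for illustration.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled


# Hypothetical loss increases after ablating experts ranked high vs. low by a routing metric.
high_ranked = [2.0, 4.0, 6.0]
low_ranked = [1.0, 3.0, 5.0]
d = cohens_d(high_ranked, low_ranked)
```

In this toy case the means differ by 1 and the pooled standard deviation is 2, so `d` is 0.5. By the usual rule of thumb, values near 0.2, such as those reported in the abstract, count as small effects.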

2606.09376 2026-06-17 cs.CL New submission

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Juan S. Santillana

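Affiliations * Globant

AI summary Observing that reference-free faithfulness metrics measure only precision and ignore recall, the paper measures coverage with complete oracles (Formula 1 races and NOAA weather forecasts) and designs a combined precision-coverage metric and a verifier-guided generation method.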

Comments 9 pages. v2: adds Anthropic Claude + 3 additional fine-tuned bases (1B-7B); 6 frontier families x 3 languages. Code https://github.com/vectrayx/precision-is-not-faithfulness Demo https://huggingface.co/spaces/jsantillana/faithful-strategy-engineer-f1

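AI abstract (translated from Chinese)

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show that they share a blind spot: they measure only precision (are the stated claims supported?) and therefore reward abstention, since a model can achieve near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness, absent from open-domain faithfulness benchmarks, lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows that low coverage is not an artifact of under-prompting: explicitly asking the model to be thorough does not close the gap. We combine faithfulness and coverage into a single score, validate the metric (controlled perturbation; agreement between a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.

Abstract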

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 157 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). Fine-tuning small models (1B-7B) on the complete oracle closes the precision-recall gap entirely (F1 ~0.98), beating every zero-shot frontier system regardless of scale. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
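The blind spot is easy to reproduce in miniature. The sketch below scores two hypothetical responses against a complete oracle, treating claims and facts as plain string identifiers and combining precision and recall with the usual harmonic mean. The abstract does not give the paper's own combined score, so F1 is used here as an assumption, and all fact names are invented.

```python
def precision_recall_f1(claims, oracle_facts):
    """Score a response against a complete oracle of relevant facts.

    Convention used here: an empty response scores 0 on every measure.
    """
    claims, oracle_facts = set(claims), set(oracle_facts)
    supported = claims & oracle_facts
    precision = len(supported) / len(claims) if claims else 0.0
    recall = len(supported) / len(oracle_facts) if oracle_facts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


oracle = {f"fact_{i}" for i in range(10)}   # complete set of facts that mattered
terse = {"fact_0"}                          # says almost nothing, all of it correct
thorough = {f"fact_{i}" for i in range(8)} | {"wrong_claim"}

terse_scores = precision_recall_f1(terse, oracle)
thorough_scores = precision_recall_f1(thorough, oracle)
```

The terse response has perfect precision but covers one fact in ten, so a precision-only metric ranks it first while F1 ranks it well below the thorough response. Recall can be computed at all only because the oracle is complete.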

2606.09004 2026-06-17 cs.AI New submission

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Ankai Hao, Ke Chen, Huan Li, Lidan Shou

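Affiliations * Zhejiang University

AI summary The paper proposes LATTEArena, the first standardized evaluation framework for the area, which uses a six-dimensional taxonomy decomposing 15 methods, a modular arena, and component ablations to reveal 16 key findings, including that Tree-of-Thought with MCTS is the most cost-effective combination.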

Comments 31 pages, 9 figures

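AI abstract (translated from Chinese)

Feature engineering remains essential to tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for automating it, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the lack of a standardized platform hinders fair, cost-aware comparison. Moreover, complex method designs obscure the specific contribution of individual components; for example, although LFG integrates Tree-of-Thought, few-shot demonstrations, Monte Carlo Tree Search, and natural language generation, the isolated impact and relative merit of each technique remain unquantified. To address these challenges, we introduce LATTEArena, the first competitive evaluation framework, featuring: (1) a six-dimensional taxonomy that decomposes 15 representative methods into reusable components; (2) a standardized modular arena for controlled comparison; (3) multi-dimensional evaluation covering performance, cost, and robustness; and (4) component-level ablation that quantifies the relative merit of each technique. Through extensive evaluation, we reveal 16 key findings, including: (1) Tree-of-Thought with Monte Carlo Tree Search achieves the best cost-effectiveness; (2) RPN and code output formats dominate classification and regression tasks, respectively. We publicly release the modular framework and over 4,000 execution logs, enabling researchers to compare new techniques against existing ones seamlessly and advancing LATTE.

Abstract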

Feature engineering remains a cornerstone of tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for its automation, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the field lacks standardized, cost-aware evaluation platforms, and the combinatorial explosion of design choices obscures true algorithmic progress. To bridge these gaps, we systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy. Based on this abstraction, we introduce LATTEArena, a standardized, modular, and extensible benchmarking framework that decouples monolithic pipelines into reusable execution blocks. By distilling the massive combinatorial space, we evaluate 24 core LATTE configurations across 7 research questions. Our head-to-head benchmarking goes beyond predictive accuracy to quantify token efficiency and execution robustness, yielding 17 empirical findings on cost-effectiveness trade-offs. Furthermore, we provide 3 concrete recommendations for optimal real-world deployment. By enabling controlled component-level comparisons, LATTEArena shifts the paradigm from ad-hoc prompt engineering to systematic context management. All code, datasets, and over 4,000 execution logs are publicly available to foster a dynamic, community-driven benchmark. Our framework, leaderboard, and all artifacts are hosted on the LATTEArena project website at https://goodenhak.github.io/LATTEArena.

2606.08810 2026-06-17 cs.CL cs.LG New submission

Continuous Language Diffusion as a Decoder-Interface Problem

Zhicheng Du, Lan Ma

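Affiliations * Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary The paper studies how continuous diffusion language models generate fluent text from Gaussian noise, proposes a decoder-basin mechanism, designs a diagnostic protocol that exposes failures hidden by scalar metrics, and explains token-recovery behavior through an interface phase diagram.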

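AI abstract (translated from Chinese)

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: denoising succeeds when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin final-token basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 token-embedding lookup recovers 93%–96% of native decoder decisions, and a single linear readout reaches 97.9% agreement at 32k samples, leaving a perplexity gap of about 1.1 in a structured residual tail. Under an explicit diagnostic monitor, a conservative margin gate exits 17%–27% earlier in denoising steps. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.

Abstract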

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: our evidence suggests that denoising becomes reliable when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin decoder basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 (Text-to-Text Transfer Transformer) token-embedding lookup recovers $93$--$96\%$ of native decoder decisions, and a single linear readout reaches $97.9\%$ agreement at 32k samples, leaving an $\approx1.1$--$1.2$ perplexity gap in a structured residual tail. Under conservative held-out gates, a margin rule exits roughly $17$--$28\%$ earlier in denoising steps under an explicit diagnostic monitor. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.
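The margin-based early exit can be pictured with a toy example. The sketch below assumes the margin is the gap between the top two decoder logits at each token position and that the rule exits once every position clears a fixed threshold. The function names, threshold, and logit values are hypothetical, and the paper's held-out gating procedure is not reproduced.

```python
def token_margin(logits):
    """Gap between the best and second-best decoder logit for one position."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return top - runner_up


def first_exit_step(trajectory, threshold):
    """Return the first denoising step at which every position is read with a
    margin of at least `threshold`, or None if no step qualifies."""
    for step, per_position_logits in enumerate(trajectory):
        if min(token_margin(l) for l in per_position_logits) >= threshold:
            return step
    return None


# Toy trajectory: 4 denoising steps, 2 token positions, 3-word vocabulary.
trajectory = [
    [[0.1, 0.0, 0.2], [0.3, 0.2, 0.1]],  # early: weakly readable
    [[1.0, 0.9, 0.2], [0.2, 1.5, 0.1]],  # middle: one position still contested
    [[3.0, 0.5, 0.2], [0.2, 2.9, 0.1]],  # late: both positions clear
    [[4.0, 0.3, 0.1], [0.1, 4.2, 0.0]],
]
exit_step = first_exit_step(trajectory, threshold=2.0)
```

With a threshold of 2.0 the rule stops at step 2 of 0..3, skipping the final step. Taking the minimum over positions makes the gate conservative: a single contested token keeps denoising running.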

2606.08402 2026-06-17 cs.CV cs.AI cs.MA New submission

SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

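Affiliations * Nanyang Technological University; University of Oxford; Meshy AI

AI summary The paper proposes a multi-agent orchestration framework that decomposes single-image 3D scene generation into three stages (scene initialization, environment construction, and multi-agent refinement) and introduces a geometry-aware layout predictor, outperforming existing methods in geometric accuracy, spatial consistency, and perceptual realism.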

Abstract

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

2606.07555 2026-06-17 cs.CL cs.LG New submission

Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override

Han-yu Wang

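Affiliations * The University of Hong Kong

AI summary Using Stroop-paradigm experiments, the paper finds that lexical priors in language models persist after a local rule overrides them, localizes a source-position triplet via activation patching, and shows that the prior is the shared channel for both the origin of interference and the trace of the override.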

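AI abstract (translated from Chinese)

Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the lexical prior persists through override rather than replacement: it keeps operating after the local rule is applied, and the rule lowers its logit rather than installing a new meaning on top. We test this with a Stroop-style paradigm: a remapping rule ("doctor" means "forest") pitted against the query word's lexical-prior distractor ("hospital"), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five aligned models locates a source-position triplet (definition subject, definition target, query word) that nearly fully recovers the conflict effect (aggregate $R \in [0.92, 1.06]$). A definition-target swap shows that the triplet performs binding rather than identity matching. Dissociation experiments isolate target preservation as the binding-specific signature: distractor suppression occurs under matched, swap, and item-mismatched conditions alike, whereas target logit collapse occurs only when the definition-target position is corrupted. Behavior and mechanism converge on the same channel: the lexical prior is both the origin of interference and the place where the override leaves its trace.

Abstract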

Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the local rule does not install the new meaning on top of the old one; the pretrained prior keeps operating underneath, and its strength still shows through. We test this with a Stroop-style paradigm: a remapping rule (doctor means forest) pitted against the query word's lexical-prior distractor (hospital), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five aligned models locates a source-position triplet (definition subject, definition target, query word) that nearly fully recovers the conflict effect (aggregate $R \in [0.92, 1.06]$); a definition-target swap shows the triplet performs binding rather than identity matching. Dissociation experiments isolate target preservation as the binding-specific signature: distractor suppression occurs under matched, swap, and item-mismatched conditions alike, whereas target logit collapse occurs only when the definition-target position is corrupted. Behavior and mechanism converge on the same channel: the prior's strength both predicts which overrides fail and marks where the causal repair lands.

2606.06523 2026-06-17 cs.AI cs.LG cs.LO cs.SE New submission

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Lean4Agent:面向智能体工作流与轨迹的形式化建模与验证

Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Independent researcher(独立研究者)

AI总结 提出Lean4Agent框架,利用依赖类型形式语言Lean4对智能体工作流进行形式化建模与验证,通过FormalAgentLib库和LeanEvolve方法提升工作流可靠性,实验验证通过的工作流性能平均提升11.94%。

详情
AI中文摘要

使大型语言模型(LLMs)能够执行可靠的多步工作流已成为人工智能领域的核心挑战。尽管LLMs的智能体能力近期取得了进展,但大多数智能体系统仍缺乏用于指定、验证和调试其工作流及执行轨迹的形式化方法。这一挑战类似于数学中长期存在的问题,其中自然语言(NL)的模糊性促使了形式语言(FL)的发展。受此范式启发,我们提出了**Lean4Agent**,据我们所知,这是首个使用依赖类型形式语言Lean4来建模和验证智能体行为的框架。**Lean4Agent**推出了**FormalAgentLib**,一个可扩展的Lean4库,用于在显式假设下形式化建模和验证智能体工作流的语义一致性,并能够定位轨迹揭示的运行时故障。基于**FormalAgentLib**,我们进一步开发了**LeanEvolve**,它应用**FormalAgentLib**中的结果来修订工作流以增强其能力。在SWE-Bench-Verified的困难子集和ELAIP-Bench子集上,针对5个领先LLMs的大量实验表明,通过验证的工作流比未通过的工作流平均性能提升**11.94%**,而**LeanEvolve**进一步将SWE性能平均提升**7.47%**。此外,**Lean4Agent**为使用表达能力强的依赖类型形式语言形式化建模和验证智能体行为这一新领域奠定了基础。

英文摘要

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. **Lean4Agent** introduces **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results from **FormalAgentLib** to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that verification-passing workflows outperform failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field: using expressive dependent-type FLs to formally model and verify agent behavior.

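2605.05172 2026-06-17 cs.RO cs.AI Version update

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng

Affiliations * Rai-Inst

AI summary Proposes the Q2RL algorithm, which extracts a Q-function from a behavior cloning policy and uses Q-gating to switch between policies for efficient offline-to-online reinforcement learning, reaching success rates of up to 100% and up to a 3.75x improvement on robot manipulation tasks.

Comments Robotics: Science and Systems, 2026

Details
AI abstract (from Chinese)

Behavior cloning (BC) has become an efficient paradigm for robot learning. However, BC lacks a mechanism for self-guided online improvement once demonstrations have been collected. Existing offline-to-online learning methods often cause the policy to replace previously learned good actions because of the distribution mismatch between offline data and online learning. In this work we propose Q2RL (Q-Estimation and Q-Gating from BC for Reinforcement Learning), an efficient offline-to-online learning algorithm. Our method has two parts: (1) Q-Estimation extracts a Q-function from the BC policy using a small number of environment interaction steps, followed by online RL; (2) Q-Gating switches between BC and RL policy actions according to their respective Q-values to collect samples for training the RL policy. On manipulation tasks from the D4RL and robomimic benchmarks, Q2RL outperforms state-of-the-art offline-to-online baselines in success rate and time to convergence. Q2RL is efficient enough for on-robot RL, learning robust policies for contact-rich, high-precision manipulation tasks such as pipe assembly and kitting within 1-2 hours of online interaction, with success rates of up to 100% and up to a 3.75x improvement over the original BC policy. Code and videos are available at https://pages.rai-inst.com/q2rl_website/

English abstract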

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
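
The Q-Gating step described above can be pictured with a minimal sketch. It assumes two critics, one per policy, and a simple "take the action with the higher estimated Q-value" rule with ties going to BC; the function name `q_gate`, the scalar state and action types, and the tie-breaking choice are illustrative assumptions, not details confirmed by the abstract.

```python
from typing import Callable, Tuple

# Hypothetical critic signature: (state, action) -> estimated return.
QFunction = Callable[[float, float], float]


def q_gate(
    state: float,
    bc_action: float,
    rl_action: float,
    q_bc: QFunction,
    q_rl: QFunction,
) -> Tuple[str, float]:
    """Pick the action whose critic reports the higher Q-value.

    Ties go to the BC action (an assumption made here so that the
    demonstrated behavior is kept unless RL looks strictly better).
    """
    bc_value = q_bc(state, bc_action)
    rl_value = q_rl(state, rl_action)
    if rl_value > bc_value:
        return "rl", rl_action
    return "bc", bc_action
```

With a toy critic such as `lambda s, a: -abs(a - 1.0)`, the gate returns the RL action only when it lands strictly closer to 1.0 than the BC action does; the selected transitions would then feed the RL replay buffer.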

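2605.01973 2026-06-17 cs.CL cs.LG Version update

Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li

Affiliations * University of Science and Technology of China

AI summary Proposes a hypernetwork-driven meta-gating mechanism that dynamically adjusts the β parameter in SwiGLU blocks so that an LLM adapts to different textual conditions, outperforming finetuning and meta-learning baselines.

Comments Accepted by ICML2026

Details
AI abstract (from Chinese)

Conventional large language models can struggle with corpus heterogeneity and subtle changes in conditions. Finetuning can cause catastrophic forgetting, and the application of meta-learning to LLMs is limited by its complexity and scalability. In this paper we activate the meta-signal $β$ in the SwiGLU blocks, forming a meta-gating mechanism that adaptively adjusts the nonlinearity of the FFN. A hypernetwork dynamically generates $β$ from the textual condition, giving the LLM meta-controllability. In tests on condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines and generalizes reasonably to unseen tasks, condition types, and instructions. Our code is available at https://github.com/AaronJi/MeGan.

English abstract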

Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can cause catastrophic forgetting, the application of meta-learning to LLMs is also limited by its complexity and scalability. In this paper, we activate the meta-signal $β$ within the SwiGLU blocks, resulting in a meta-gating mechanism that adaptively adjusts the nonlinearity of the FFN. A hypernetwork dynamically produces $β$ from textual conditions, providing meta-controllability over LLMs. In tests on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines and generalizes reasonably to unseen tasks, condition types, and instructions. Our code can be found at https://github.com/AaronJi/MeGan.
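
To see why $β$ can act as a gate on the FFN nonlinearity, here is a minimal scalar sketch of the standard Swish activation with a slope parameter, $\mathrm{swish}_β(z) = z \cdot \sigma(βz)$, and of one SwiGLU unit built from it. The function names are illustrative, and the hypernetwork that would produce `beta` from a textual condition is not modeled; this is not the authors' implementation.

```python
import math


def swish(z: float, beta: float = 1.0) -> float:
    """Swish with slope beta: z * sigmoid(beta * z)."""
    return z / (1.0 + math.exp(-beta * z))


def swiglu_unit(gate_pre: float, value_pre: float, beta: float) -> float:
    """One scalar SwiGLU unit: swish_beta(gate) multiplied by the value path.

    gate_pre and value_pre stand for the two linear projections of the
    FFN input (hypothetical names used for this sketch).
    """
    return swish(gate_pre, beta) * value_pre
```

At `beta = 0` the activation reduces to the linear map `z / 2`, while a large `beta` makes it approach ReLU, so a condition-dependent `beta` moves each block between a near-linear and a sharply gated regime.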

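2606.09337 2026-06-17 cs.RO Version update

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Huaihang Zheng, Yi Yang, Kai Ma, Shenglin Xu, Tian Xie, Guozheng Li, Xiangyu Wang, Yiren Ma, Si Liu, Yinian Mao, Baoxu Liu

Affiliations * Meituan; Beijing Institute of Technology; Beihang University; State Key Lab of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS; China University of Mining and Technology (Beijing)

AI summary Proposes the TORL-VLA framework, which combines tactile feedback with online reinforcement learning: a tactile-derived wrench-aware VLA predicts reference actions and a lightweight online RL module refines them, addressing policy adaptation when contact conditions change and improving success rate and execution efficiency on long-horizon contact-rich tasks.

Comments Project page: https://torl-vla.github.io/

Details
AI abstract (from Chinese)

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent work has introduced tactile or force feedback into VLAs to handle contact-rich tasks. However, these models are usually deployed as offline policies. When contact conditions deviate from the training distribution, the policy cannot adapt online, which leads to problems such as inappropriate contact forces and inefficient retries. We therefore propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA that predicts reference actions and future wrench sequences, while a lightweight online RL module refines the reference actions. To learn stably from a mixture of exploratory policy-generated data and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to the policy-generated actions that preceded the intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both the subtask and full-task levels and outperforms strong baselines in time-bounded execution efficiency.

English abstract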

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines. Project page: https://torl-vla.github.io/

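2606.04513 2026-06-17 cs.AI Version update

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Deguo Xia, Zihan Li, Haochen Zhao, Dong Xie, Yuyao Kong, Xiyan Liu, Jizhou Huang, Mengmeng Yang, Diange Yang

Affiliations * Tsinghua University; Baidu; University of Macau; Institute of Information Engineering, Chinese Academy of Sciences

AI summary Proposes the MapAgent framework, which combines vision-language models with constraint-aware reasoning in a verification-driven Judge-Planner-Worker loop to correct specification violations in lane-map generation, enabling highly automated city-scale production.

Comments Accepted by KDD 2026

Details
AI abstract (from Chinese)

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet building and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they usually treat mapping specifications and traffic rules as implicit, dataset-dependent supervision. Moreover, in complex scenes (for example, worn or missing markings and occlusions), visual evidence alone is often insufficient to determine the correct lane configuration, which makes specification violations a major source of manual post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone to produce specification-compliant lane maps. Rather than simply adding an agent loop on top of map prediction, MapAgent combines backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing inside a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits that are re-validated after editing. To stay scalable for city-scale production, MapAgent is triggered selectively, only on tiles where backbone confidence is low, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. MapAgent has also been integrated into Baidu Maps, where it supports lane-level map generation for more than 360 cities nationwide and raises the overall production automation rate to over 95%, demonstrating its practicality and effectiveness for large-scale lane-level map generation.

English abstract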

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

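2606.03609 2026-06-17 cs.RO cs.LG Version update

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

Xuhui Lin, Stephen Law, Nanjiang Chen, Kunyao Li, Tao Yang

Affiliations * The Bartlett School of Sustainable Construction, University College London, UK; Department of Geography, University College London, UK; School of Project Management, Faculty of Engineering, The University of Sydney, AU; School of Engineering, Cardiff University, UK; School of Architecture, Tsinghua University, Beijing, CN

AI summary Proposes an embodied world model that predicts 3D isovists (spherical visibility-depth maps), trained with depth residuals and self-rollout scheduled sampling, and finds that a cross-city spatial signature can be linearly decoded from its temporal latents.

Details
AI abstract (from Chinese)

Embodied agents navigating a city rely on world models to predict how their surroundings change as they move. For navigation, however, what matters is not how buildings look but where the agent can go. Even so, most world models still predict appearance, learning how a scene looks rather than the space an agent can move through. Models that do target geometry, such as bird's-eye-view occupancy grids, compress the three-dimensional environment onto the ground plane and ignore the above-ground and multi-level structure that shapes real navigation. What is missing is a prediction target that captures the navigable geometry an agent actually traverses, free of photometric entanglement and without losing the third dimension. Our core idea is to model the open volume between buildings (the negative space), encoded as a 3D isovist: a spherical visibility-depth map that records the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so that the decoder inherits sharp building edges; the model is trained with self-rollout scheduled sampling to keep the context on the geometry manifold, and it is equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, provides a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, and is released with an open dataset and pipeline.

English abstract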

Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.

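2606.03177 2026-06-17 cs.RO Version update

ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control

Yutong Liang, Quanquan Peng, Ri-Zhao Qiu, Xiaolong Wang

Affiliations * University of California San Diego

AI summary Proposes ConTrack, a reinforcement learning framework that treats object tracking as a constraint and adapts the task-style trade-off with a dual-variable update, combined with an adaptive mid-trajectory reset library, to track long-horizon, contact-rich hand motion; it significantly improves success rate and object pose accuracy in simulation and on real robots.

Details
AI abstract (from Chinese)

Human demonstrations provide a strong prior for robot manipulation, but transferring them to real robots is not easy because of the kinematic gap. In dexterous manipulation, tracking long-horizon, contact-rich sequences remains challenging even in simulators: a reference-tracking policy must keep the object on its target trajectory while preserving the demonstrated joint motion and contact timing. Existing methods typically rely on hand-crafted reward tuning that has to be adjusted for each sequence and that breaks down under a limited interaction budget. We propose ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates the remaining control authority to motion fidelity, adapting the task-style trade-off online through a dual-variable update. ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulated tracking and on real robots show that ConTrack significantly improves success rate and object pose accuracy over prior work while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.

English abstract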

Human demonstrations provide strong priors for robot manipulation, yet transferring them to real robots is non-trivial because of the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted reward tuning that must be repeated for each sequence and breaks under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task--style trade-offs online using a dual-variable update. ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulated tracking and on a real robot demonstrate that ConTrack improves success and object pose accuracy significantly over prior art while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.
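
A dual-variable update of the kind mentioned above is commonly written as projected dual ascent on a Lagrange multiplier. The sketch below assumes that textbook form, with a tracking-error constraint `tracking_error <= tolerance`; the function names, the learning rate, and the exact shape of the combined reward are illustrative assumptions rather than ConTrack's actual formulation.

```python
def dual_update(lam: float, tracking_error: float, tolerance: float, lr: float = 0.1) -> float:
    """Projected dual ascent: raise lam when the constraint is violated,
    lower it when there is slack, and never let it go below zero."""
    return max(0.0, lam + lr * (tracking_error - tolerance))


def combined_reward(style_reward: float, tracking_error: float, lam: float) -> float:
    """Lagrangian-style reward: motion-fidelity (style) reward minus the
    multiplier-weighted tracking error."""
    return style_reward - lam * tracking_error
```

When the object drifts from its target, `lam` grows and the policy is pushed back toward the task; once tracking is within tolerance, `lam` decays and more weight returns to matching the demonstrated motion, with no per-sequence weight tuning.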

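2606.03089 2026-06-17 cs.LG cs.AI Version update

Constitutional On-Policy Safe Distillation

Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng, Zhuoer Xu, Yuhao Sun, Shiwen Cui, Xiang Zheng, Guoyu Wang, Xingjun Ma, Yu-Gang Jiang

Affiliations * Institute of Trustworthy Embodied AI; Fudan University; Shanghai Innovation Institute; Ant Group; Zhejiang University; City University of Hong Kong

AI summary Addresses the problem that, in safety alignment, constitutional conditioning makes on-policy self-distillation contract the teacher distribution and lose expressiveness; proposes Constitutional On-Policy Safe Distillation (COPSD), which calibrates the teacher distribution with a Cross-SFT cold start and then performs constitution-conditioned on-policy distillation, achieving a better safety-helpfulness trade-off and a lower safety tax on 12 benchmarks.

Details
AI abstract (from Chinese)

On-policy self-distillation (OPSD) has become an efficient post-training paradigm: a teacher conditioned on privileged information provides dense token-level supervision. Prior work has shown that OPSD can collapse on verifiable reasoning tasks, but safety alignment is different because it is guided by high-level constitutions rather than explicit target answers, which makes it a natural setting for revisiting dense distillation. Our pilot study nevertheless shows that safety OPSD still collapses severely: constitutional conditioning contracts the teacher distribution toward short, overly conservative responses, and reverse KL further amplifies this contraction into reduced expressiveness. We formalize the effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than the baselines while substantially reducing the safety tax on general reasoning ability.

English abstract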

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety--helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

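2606.00588 2026-06-17 cs.CV Version update

Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting

Phuoc-Nguyen Bui, Van-Vi Vo, Duc-Tai Le, Junghyun Bum, Van-Nguyen Pham, Ki-Young Kim, Seung-Young Yu, Hyunseung Choo

Affiliations * Research Convergence Institute; Sungkyunkwan University; Dept. of AI Systems Engineering; Dept. of Ophthalmology; Kyung Hee University Medical Center; Dept. of Electrical and Computer Engineering

AI summary Proposes the ReVA framework, which uses baseline and month-1 OCT images together with tabular data and multimodal fusion to predict the visual acuity trajectory of diabetic macular edema patients 3-24 months after anti-VEGF therapy.

Comments Accepted to MICCAI 2026

Details
AI abstract (from Chinese)

Long-term visual acuity (VA) outcomes after anti-VEGF therapy are essential for counseling, expectation setting, and follow-up planning for patients with diabetic macular edema (DME). In clinical practice, however, physicians usually estimate the long-term VA trajectory from early post-treatment findings alone, which makes reliable prognosis difficult. Prior OCT-based learning methods mainly address short-term response or single-endpoint prediction, and modeling VA trajectories at multiple future time points from early longitudinal observations remains under-explored. In this study we collected a real-world cohort of 188 DME patients treated with anti-VEGF, with paired baseline and month-1 OCT scans plus tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem that targets visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that integrates structural features from baseline and month-1 OCT with tabular variables to capture the baseline disease state and the early treatment response. ReVA uses spatial attention to retain local prognostic imaging features and a dependency-aware tabular encoder to model interactions among clinical variables. These multimodal representations are fused to predict patient-specific long-term VA trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction and performs consistently across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term VA forecasting and supports data-driven decision support in routine anti-VEGF management.

English abstract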

Long-term visual acuity (VA) forecasting after anti-VEGF therapy is important for counseling and follow-up planning in diabetic macular edema (DME), yet remains challenging when only early post-treatment findings are available. While prior OCT-based methods mainly focus on short-term response or single-endpoint prediction, multi-horizon VA forecasting from early longitudinal data remains under-explored. In this study, we assembled a real-world cohort of 188 anti-VEGF--treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that combines baseline and month-1 OCT features with tabular variables to capture disease status and early treatment response. ReVA integrates spatial OCT attention, dependency-aware tabular encoding, and cross-modal fusion to predict patient-specific long-term VA trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management. Code and pretrained models will be released at https://github.com/nguyenpbui/ReVA.

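2606.00024 2026-06-17 cs.CL Version update

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Chen Qiu, Guozhong Li, Cristian McGee, Aritra Dutta, Panos Kalnis

Affiliations * King Abdullah University of Science and Technology; University of Central Florida

AI summary Proposes the Attention Run-time Termination (ART) mechanism, which tracks the accumulated attention output and terminates subsequent KV block accesses once contributions become negligible, raising large-batch generation throughput by 20% without significantly affecting accuracy.

Details
AI abstract (from Chinese)

Long-context decoding in large language models (LLMs) is severely limited by the memory bandwidth needed to fetch a large key-value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, even though evidence shows that attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive overhead. In this paper we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks the accumulated attention output during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, so it integrates with them seamlessly. Experiments on the LongBench benchmark show that, compared with state-of-the-art baselines, ART achieves 20% higher generation throughput at large batch sizes while maintaining comparable accuracy.

English abstract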

Long-context decoding in Large Language Models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, since incorporating values incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. Rather than replacing KV selection, ART dynamically terminates redundant KV traversal on top of existing dense or sparse attention policies. We introduce a stability-based criterion that monitors both magnitude and directional changes of intermediate attention outputs and provide a theoretical characterization of the resulting truncation error. Experiments on the LongBench and RULER Needle-in-a-Haystack tasks show that ART increases the generation throughput of existing KV-cache methods by up to 20% without compromising result quality.
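
The idea of stopping KV traversal once the running attention output stabilizes can be illustrated with a small single-query sketch in plain Python. It processes KV blocks with the usual online-softmax accumulation and stops when the normalized output changes little in both norm and direction between consecutive blocks. The thresholds `mag_tol` and `cos_tol`, the function names, and the exact stopping test are assumptions made for illustration; they are not the paper's criterion or kernel, and the paper's truncation-error analysis is not reproduced here.

```python
import math


def _stable(prev, cur, mag_tol, cos_tol):
    """True if cur differs little from prev in norm and in direction."""
    norm_p = math.sqrt(sum(x * x for x in prev))
    norm_c = math.sqrt(sum(x * x for x in cur))
    if norm_p == 0.0 or norm_c == 0.0:
        return False
    cos = sum(a * b for a, b in zip(prev, cur)) / (norm_p * norm_c)
    return abs(norm_c - norm_p) / norm_p <= mag_tol and 1.0 - cos <= cos_tol


def attention_with_early_stop(query, key_blocks, value_blocks, mag_tol=1e-3, cos_tol=1e-3):
    """Block-wise softmax attention for one query with early termination.

    Returns (output_vector, number_of_blocks_visited).
    """
    running_max = -math.inf   # largest score seen so far
    normalizer = 0.0          # sum of exp(score - running_max)
    acc = None                # unnormalized weighted sum of values
    prev_out, out, used = None, None, 0
    for keys, values in zip(key_blocks, value_blocks):
        scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # exp(-inf) == 0.0 on the first block
        if acc is None:
            acc = [0.0] * len(values[0])
        acc = [a * rescale for a in acc]
        normalizer *= rescale
        for score, value in zip(scores, values):
            weight = math.exp(score - new_max)
            normalizer += weight
            acc = [a + weight * x for a, x in zip(acc, value)]
        running_max = new_max
        used += 1
        out = [a / normalizer for a in acc]
        if prev_out is not None and _stable(prev_out, out, mag_tol, cos_tol):
            break  # remaining blocks are skipped
        prev_out = out
    return out, used
```

A stable running output does not by itself guarantee that later blocks carry negligible weight, so a sketch like this is only sensible when blocks are visited in an order that puts likely high-score keys first, for example on top of an existing selection policy.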

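2605.31286 2026-06-17 cs.RO cs.AI Version update

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

Affiliations * Tongji University

AI summary Proposes the DeMaVLA model, which combines a VLM backbone and an action expert with flow matching to generate continuous actions, prunes transformer layers for efficiency, and is trained on large-scale real-world data and data aggregated with human feedback, generalizing deformable-object folding across multiple categories.

Comments 14 pages, 2 figures

Details
AI abstract (from Chinese)

Real-world household robots need Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across different objects, task conditions, and home environments. Deformable-object folding is a representative challenge: the robot must handle clothing from random initial states across different categories, geometries, materials, and scenes. Existing VLA systems, however, usually train separate policies for different object categories, and naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable deformable-object manipulation. DeMaVLA uses a VLM backbone with an action expert and formulates continuous action generation with flow matching. To improve efficiency, the action expert is built by pruning every other transformer layer while keeping layer-wise alignment with the VLM backbone, which reduces training and inference cost. DeMaVLA is first pre-trained on about 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that, through a human-in-the-loop Data Aggregation (DAgger) pipeline, aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks. Experiments show that DeMaVLA achieves competitive performance on RoboTwin and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.

English abstract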

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.

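2605.27023 2026-06-17 cs.AI Version update

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

Yinan Liu, Wenjin Xu, Zhiyuan Zha, Xiaochun Yang, Bin Wang

Affiliations * University of Science and Technology of China

AI summary Proposes the adaptive negative sampling method KMAS, which dynamically adjusts the ratio of hard negative triples to improve knowledge graph foundation models on zero-shot completion.

Details
AI abstract (from Chinese)

Knowledge graphs have become the backbone of many downstream tasks such as question answering and recommender systems. Nevertheless, knowledge graphs are often highly incomplete. To perform zero-shot knowledge graph completion on unseen knowledge graphs, whose relation vocabularies differ from those seen in pre-training, knowledge graph foundation models have attracted wide attention. Existing knowledge graph foundation models are usually trained with random negative triples, built by replacing the head or tail entity of a positive triple with a random entity. These negative triples are often of limited quality and provide only weak supervision for training. In this paper we propose KMAS, a simple yet effective adaptive negative sampling method that enhances existing knowledge graph foundation models. KMAS builds hard negative triples from the updated relation embeddings produced by the relation encoder of the existing model. To stay aligned with the model's evolving capability during training, KMAS dynamically adjusts the ratio of hard negative triples throughout training: after a warmup phase, the ratio increases linearly and then decreases linearly. Extensive experiments on 44 datasets show that the proposed negative sampling method can enhance many state-of-the-art knowledge graph foundation models without requiring excessive additional time or memory.

English abstract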

Knowledge graphs (KGs) have become the core backbone of numerous downstream tasks such as question answering and recommender systems. However, KGs are often very incomplete. To perform zero-shot knowledge graph completion in unseen KGs, which have different relational vocabularies from those used for pre-training, KG foundation models (KGFMs) have received wide attention. Existing KGFMs are often trained with random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often of limited quality and provide weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples from the updated relation embeddings generated by the existing KGFM's relation encoder. To further align with the evolving capability of the KGFM during training, KMAS adjusts the ratio of hard negative triples dynamically throughout the whole training process: after a warmup phase, it increases the ratio linearly and then decreases it linearly. Extensive experiments are conducted on 44 datasets. Experimental results demonstrate that our proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory consumption.
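
The warmup, linear-increase, linear-decrease schedule for the hard-negative ratio can be written as a small piecewise-linear function. The sketch below assumes the ratio is zero during warmup and returns to zero at the end of training; the parameter names (`warmup_steps`, `peak_step`, `total_steps`, `max_ratio`) and those boundary values are hypothetical, since the abstract does not specify them.

```python
def hard_negative_ratio(
    step: int,
    warmup_steps: int,
    peak_step: int,
    total_steps: int,
    max_ratio: float,
) -> float:
    """Fraction of negatives drawn as hard negatives at a training step.

    0 during warmup, then a linear ramp up to max_ratio at peak_step,
    then a linear ramp back down to 0 at total_steps.
    """
    if not 0 <= warmup_steps < peak_step < total_steps:
        raise ValueError("expected 0 <= warmup_steps < peak_step < total_steps")
    if step < warmup_steps or step >= total_steps:
        return 0.0
    if step <= peak_step:
        return max_ratio * (step - warmup_steps) / (peak_step - warmup_steps)
    return max_ratio * (total_steps - step) / (total_steps - peak_step)
```

In a training loop, a value such as `hard_negative_ratio(step, 10, 20, 30, 0.5)` would decide how many of each batch's negatives come from the hard-negative pool, with the remainder sampled at random.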

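2605.26921 2026-06-17 cs.CV q-bio.NC Version update

Similarity-based representation factorization for revealing interpretable dimensions in representational data

Revealing the core dimensions of representations in brain, behavior, and AI

Florian P. Mahner, Ka Chun Lam, Francisco Pereira, Martin N. Hebart

Affiliations * Max Planck Institute for Human Cognitive and Brain Sciences; National Institute of Mental Health; Justus Liebig University Giessen; Center for Mind, Brain and Behavior

AI summary Proposes Similarity-Based Representation Factorization (SRF), which recovers low-dimensional, non-negative, interpretable embeddings from similarity matrices to reveal the latent dimensions of representations in neural, behavioral, and computational data.

Details
AI abstract (from Chinese)

The study of representations is widespread across fields such as neuroscience, psychology, and artificial intelligence. Representations are usually studied and compared through the similarities between stimuli, but existing methods give only limited access to the dimensions that shape these representations and offer limited interpretability. To overcome these challenges, this paper introduces Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. In simulations and across a variety of neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from many forms of representational data, even when the data are very sparsely sampled and incomplete. The dimensions derived from these datasets match those obtained with task-specific models, predict independent behavioral properties, improve exploratory analysis, and provide higher statistical power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for revealing, understanding, and using the dimensions underlying representations.

English abstract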

The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.

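2605.30036 2026-06-17 cs.AI cs.CL Version update

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

Asaf Yehudai, Naama Rozen, Ariel Gera

Affiliations * The Hebrew University of Jerusalem; IBM Research; Tel-Aviv University

AI summary Building on psychological value theory, this study uses large-scale experiments (more than 5 million questions) to assess how well value-prompted LLMs agree with humans in value structure and value-behavior relationships, and shows that incorporating human value distributions enhances population-level simulation.

Comments We had some disagreement regarding proper attribution; we hope to resolve it soon and upload the paper

Details
AI abstract (from Chinese)

Large language models (LLMs) have shown the ability to adopt different personas and identities; however, it remains unclear whether they can exhibit behavior that conforms to a coherent, human-like value structure. In this work we draw on established psychological value theory to induce human-like values in LLMs and assess their agreement with patterns observed in human studies. Using validated psychological questionnaires, we run large-scale experiments, with more than 5 million questions, to evaluate the value structures and value-behavior relationships of leading LLMs and compare them with humans. Our findings reveal strong agreement between value-prompted LLMs and humans on both dimensions. In addition, incorporating human value distributions enhances population-level simulation with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

English abstract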

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

2603.02803 2026-06-17 cs.CV updated version

Structure-Aware Text Recognition for Ancient Greek Critical Editions

面向古希腊校勘本的结构感知文本识别

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

Affiliations: Inria

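AI summary: This paper builds a large-scale synthetic corpus and a benchmark of real scans to evaluate vision-language models on structure-aware text recognition, and finds that Qwen3VL-8B reaches a median character error rate of 1.0% on real scans.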

AI-generated Chinese abstract

视觉语言模型(VLM)的最新进展已经改变了端到端的文档理解。然而,它们解释历史学术文本复杂布局语义的能力仍然有限。本文研究了面向古希腊校勘本的结构感知文本识别,这些校勘本具有密集的参考层次和广泛的边缘注释。我们引入了两个新资源:(i)从TEI/XML源生成的185,000页图像的大规模合成语料库,具有受控的排版和布局变化,以及(ii)跨越一个多世纪编辑和排版实践的真实扫描校勘本的精选基准。使用这些数据集,我们在零样本和微调设置下评估了三种最先进的VLM。我们的实验揭示了当前VLM架构在面对高度结构化的历史文档时的显著局限性。在零样本设置中,大多数模型的性能明显低于现有的现成软件。尽管如此,Qwen3VL-8B模型达到了最先进的性能,在真实扫描上实现了1.0%的中位字符错误率。这些结果既突显了当前VLM在结构感知识别复杂学术文档方面的不足,也展示了其未来潜力。

English abstract

Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
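The abstract reports a median Character Error Rate (CER) but does not spell out how it is computed. The sketch below uses the common definition (Levenshtein edit distance divided by reference length) and takes the median over pages. The page strings, the NFC normalization step, and the function names are illustrative assumptions, not the authors' evaluation code.

```python
import unicodedata
from statistics import median


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    # Assumption: normalize to NFC so that precomposed and decomposed
    # polytonic Greek characters compare as equal.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical (reference, prediction) pairs, one per page.
pages = [
    ("μῆνιν ἄειδε θεά", "μῆνιν ἄειδε θεά"),
    ("Πηληϊάδεω Ἀχιλῆος", "Πηληιάδεω Ἀχιλῆος"),
    ("οὐλομένην", "οὐλομένη"),
]
per_page = [cer(ref, hyp) for ref, hyp in pages]
median_cer = median(per_page)
print(f"median CER: {median_cer:.3f}")
```

Reporting the median rather than the mean keeps a few badly recognized pages from dominating the score, which may be why the abstract uses it.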

2605.29563 2026-06-17 cs.AI cs.CV cs.RO updated version

Planning with the Views

利用视图进行规划

Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

Affiliations: Northwestern University; University of Washington; Microsoft; University of Oxford; Stanford University

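AI summary: Introduces the ViewSuite benchmark, which exposes VLM weaknesses in multi-step view planning, and designs an iterative framework that uses self-exploration and view graph distillation to raise the interactive view planning accuracy of Qwen2.5-VL-7B from 2.5% to 47.8%.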

AI-generated Chinese abstract

VLM能否预测每个相机移动如何改变视图,并提前规划许多这样的移动?我们称这种能力为视图规划,需要(1)理解单个动作如何变换视图,以及(2)在多步规划中组合许多这样的变换以识别目标视图。我们在提出的ViewSuite中探测了这两种能力,ViewSuite是一个基于真实ScanNet场景的3D点云环境。在13个前沿VLM中,出现了一个关键的规划差距:它们具备基本的视图-动作知识,但无法在多步规划中组合这些知识,并且随着视点距离的增加,差距扩大。为了缩小这一差距,我们提出了一个迭代框架,交替进行自我探索和视图图蒸馏。关键洞察是,所有探索轨迹,无论其结果如何,共同形成一个视图图,紧凑地捕捉了场景中视点如何连接。将这个图蒸馏到多样化的监督任务中,重塑了策略分布,并克服了使纯RL停滞的稀疏奖励。这将Qwen2.5-VL-7B在交互式视图规划上的准确率从2.5%提升到47.8%,超过了GPT-5.4 Pro(18.5%)和Gemini 3.1 Pro(21.4%)。自我探索成为VLM在3D空间中主动推理和规划的一条有前景的路径。

English abstract

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, which requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and data are available at https://viewsuite.github.io.
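To make the view graph idea concrete, the sketch below merges exploration trajectories into a directed graph of viewpoints and reads a shortest action sequence off it. This is a toy illustration only: the viewpoint IDs, action names, deterministic transitions, and the use of breadth-first search to derive a plan are all assumptions, not the authors' ViewSuite implementation or their distillation procedure.

```python
from collections import deque

# Hypothetical exploration trajectories as (view, action, next_view) steps.
trajectories = [
    [("v0", "turn_left", "v1"), ("v1", "forward", "v2")],
    [("v0", "forward", "v3"), ("v3", "turn_left", "v2"), ("v2", "forward", "v4")],
    [("v1", "turn_right", "v0")],  # an unsuccessful episode still adds an edge
]


def build_view_graph(trajs):
    """Merge all trajectories into {view: {action: next_view}}."""
    graph = {}
    for traj in trajs:
        for view, action, nxt in traj:
            # Assumption: an action taken from a view always leads to the same view.
            graph.setdefault(view, {})[action] = nxt
            graph.setdefault(nxt, {})
    return graph


def plan(graph, start, goal):
    """Breadth-first search for a shortest action sequence, or None if unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, actions = queue.popleft()
        if view == goal:
            return actions
        for action, nxt in graph.get(view, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None


graph = build_view_graph(trajectories)
print(plan(graph, "v0", "v4"))
```

No single trajectory above goes from `v1` to `v3`, yet the merged graph contains such a route. This is one way to read the abstract's point that every trajectory contributes, whatever its outcome.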

2605.25652 2026-06-17 cs.CL cs.CY updated version

A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

LLM评审员与律师协会考官对泰国律师资格考试自由回答论文的两阶段稳定性研究

Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong

Affiliations: VISAI AI

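AI summary: Using free-form essay grading on the Thai bar examination, this study examines the asymmetry in scoring agreement between LLM judges and human examiners, and finds that LLM judges converge on the majority human reading and fail to reproduce the minority human reading.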

AI-generated Chinese abstract

NLP中的自由形式法律论文评估将专家间评分者稳定性视为单一上限数字,并将LLM评审员与该上限的一致性视为评审员稳定性的证据。我们通过相同输入协议在泰国律师资格考试上检验这两个假设:三名律师协会培训的考官(A、B、C)和一个26个LLM评审员小组对来自相同四个输入(问题、官方律师协会评分规定、标准答案、考生答案)的15个交叉评分的答案进行评分。主要发现是不对称的。在评分标准规定两个轴的15个单元格中的10个上,所有29名评分者收敛在一个狭窄的区间内:小组一致性是普遍的。在其余5个单元格中,评分标准未规定如何评分一个正确但省略了决定性法定引用的最终答案,人类小组在两个连贯的解读之间分裂(B/C多数在评分标准上限区间,分数6-8;A少数在较低区间,分数1-2)。LLM评审员群体并不对称分裂:26个LLM中有22个在或接近B/C的有争议区间评分,3个位于规定沉默的中间间隙,只有1个(GPT-5.4 Nano)接近A的区间但未一致地在其内评分。我们26个评审员小组中的零个LLM在有争议的单元格上复制了少数人类阅读。B/C方向的集群跨越了我们测试的每个模型大小、供应商和价格层级。一个仪器化的三个LLM锚定子小组(Claude 4.6 Opus、Gemini 3.1 Pro、GPT-5.4 Pro)携带确定性探针、输入消融和自助法置信区间,并在15个单元格上达到锚定小组α=0.77,而人类小组α=0.36。高LLM小组α反映了系统性地收敛于多数阅读,而不是平衡地复制两种阅读;一个通过最大化与人类参考小组的一致性来选择其LLM评审员的基准将必然继承这种不对称性。

English abstract

Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score 6-8; A minority at the lower band, score 1-2). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells. The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.
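The abstract reports panel agreement as $\alpha$ without naming the coefficient. Assuming it is Krippendorff's alpha with an interval (squared-difference) distance, the sketch below shows how a few contested cells can pull a panel's $\alpha$ down. The scores are hypothetical, and the function handles only complete ratings; it is not the authors' analysis code.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    units: list of lists, each holding every rating given to one item.
    Items with fewer than two ratings carry no agreement information
    and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)
    # Observed disagreement: squared differences within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled ratings.
    pooled = [v for u in units for v in u]
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings identical; alpha is undefined")
    return 1 - d_o / d_e


# Hypothetical scores from three raters on cells where they agree...
agreed = [[2, 2, 2], [5, 5, 5], [8, 8, 8]]
# ...plus two contested cells where one rater scores far lower.
with_split = agreed + [[7, 6, 1], [8, 7, 2]]

alpha_agreed = krippendorff_alpha_interval(agreed)
alpha_with_split = krippendorff_alpha_interval(with_split)
print(alpha_agreed, alpha_with_split)
```

Because $\alpha$ pools all cells, a single number cannot show whether disagreement is spread evenly or concentrated in a few contested cells, which is the distinction the abstract draws.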