arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.11724 2026-06-11 cs.AI New submission

Mind the Perspective: Let's Reason Recursively for Theory of Mind

Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky

Affiliations: School of Computing and Information Systems, The University of Melbourne, Australia; SensiLab, Monash University, Australia

AI summary: Proposes the RecToM framework, which models nested beliefs via recursive perspective construction, reduces higher-order belief questions to actual-world questions, and achieves state-of-the-art performance on multiple ToM benchmarks.

Abstract

Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM's perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.
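The recursion along the character chain can be illustrated with a toy sketch. It assumes a perspective is simply the subsequence of events a character witnessed; the abstract states that RecToM's construction goes beyond such event filtering, so this shows only the chain-wise recursion, not the authors' method. The `Event` type, function names, and story are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    description: str
    witnesses: frozenset  # characters who observed the event


def build_perspective(events, character):
    """Toy assumption: a perspective keeps only the events the character witnessed."""
    return [e for e in events if character in e.witnesses]


def nested_perspective(events, chain):
    """Build each character's perspective from the preceding one along the chain."""
    view = list(events)
    for character in chain:
        view = build_perspective(view, character)
    return view


story = [
    Event("Anne puts the ball in the basket", frozenset({"Anne", "Sally"})),
    Event("Sally leaves the room", frozenset({"Anne", "Sally"})),
    Event("Anne moves the ball to the box", frozenset({"Anne"})),
]

# "Where does Anne think Sally thinks the ball is?" gives the chain Anne, then Sally.
final_view = nested_perspective(story, ["Anne", "Sally"])
last_known = [e.description for e in final_view]
```

Within `final_view` the second-order question becomes a plain question about the last known location of the ball, which is the reduction the abstract describes.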

2606.11722 2026-06-11 cs.LG cs.AI cs.CL New submission

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Sida Liu, Feijiang Han

Affiliations: Independent Researcher; University of Maryland

AI summary: Proposes ICALens, which uses independent component analysis (ICA) to efficiently extract interpretable directions from language-model representations without training sparse autoencoders, and is competitive on SAEBench.

Comments: Ongoing Project

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
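The intuition that token-selective directions look less Gaussian than random ones can be checked with a small sketch. It uses excess kurtosis as the non-Gaussianity measure, which is one common choice and an assumption here; the synthetic projections and all names are hypothetical and do not come from ICALens.

```python
import random


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; approximately 0 for Gaussian data."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    fourth = sum((x - mean) ** 4 for x in xs) / n
    return fourth / (var ** 2) - 3.0


rng = random.Random(0)
n_tokens = 5000

# Projection of activations onto a "random direction": roughly Gaussian.
gaussian_proj = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]

# Projection onto a "token-selective direction": near zero on most tokens,
# large on about 2% of them.
selective_proj = [
    rng.gauss(5.0, 1.0) if rng.random() < 0.02 else rng.gauss(0.0, 0.1)
    for _ in range(n_tokens)
]

k_gauss = excess_kurtosis(gaussian_proj)
k_selective = excess_kurtosis(selective_proj)
```

The selective projection has a much heavier tail than the Gaussian one, which is the kind of signal a non-Gaussianity-seeking method such as ICA can exploit.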

2606.11719 2026-06-11 cs.CV cs.AI New submission

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He

Affiliations: Peking University; Ant International; The University of Hong Kong

AI summary: Proposes Ouroboros-Spatial, a self-evolving framework in which a proposer and a solver interact in a closed loop to generate training samples matched to the model's current ability; across six spatial reasoning benchmarks it substantially improves Qwen3-VL while using an order of magnitude fewer training examples.

Abstract

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
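The feedback step, in which per-sample solver confidence acts as a difficulty signal, might look like the following sketch. The bucket names and the thresholds `low` and `high` are illustrative assumptions; the abstract does not specify how the signal is consumed by the proposer.

```python
def bucket_by_confidence(samples, low=0.3, high=0.9):
    """Split (question, solver_confidence) pairs into difficulty buckets.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    buckets = {"too_hard": [], "matched": [], "trivial": []}
    for question, confidence in samples:
        if confidence < low:
            buckets["too_hard"].append(question)
        elif confidence > high:
            buckets["trivial"].append(question)
        else:
            buckets["matched"].append(question)
    return buckets


solver_output = [
    ("Which chair is closest to the door?", 0.98),
    ("How far is the lamp from the sofa, in meters?", 0.55),
    ("What is the floor area of the room?", 0.10),
]
feedback = bucket_by_confidence(solver_output)
```

In a closed loop, a summary such as `feedback` would be passed to the proposer in the next iteration so that it generates more questions resembling the "matched" bucket.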

2606.11712 2026-06-11 cs.CL cs.AI cs.LG New submission

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

Youwang Deng

Affiliations: EpistemicaLab, Independent Research

AI summary: Proposes a diagnostic framework that factorises LLM user-side memory into three orthogonal axes (behavioral consistency, factual presence, and factual absence), finds that parametric memory and retrieval memory are asymmetric across these axes, and shows that heavier RLHF tuning strengthens the asymmetry.

Comments: Preprint. Code: this https URL

Abstract

User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

2606.11711 2026-06-11 cs.LG stat.ML New submission

Capacity-Constrained Online Convex Optimization with Delayed Feedback

Alexander Ryabchenko, Idan Attias, Daniel M. Roy

Affiliations: Department of Statistical Sciences, University of Toronto; Vector Institute; Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), hosted by UIC and TTIC

AI summary: Studies delayed online convex optimization under a hard capacity constraint (at most C pending rounds tracked at any time); by introducing a semi-clairvoyant model and Delayed-Weighted FTRL, gives the first regret bounds for capacity-constrained OCO under convex and strongly convex losses.

Abstract

Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel "delayed and weighted" OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = \Omega(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + \sigma_{\text{max}}/C)$, where $\sigma_{\text{max}}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < \sigma_{\text{max}}$, while remaining sublinear.
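The idea of randomizing tracking decisions and importance-weighting the observations can be sketched in isolation. The example assumes a fixed tracking probability and a scalar gradient, and it ignores the capacity constraint and the delays themselves; it only illustrates why dividing by the tracking probability keeps the feedback estimate unbiased. All names are hypothetical.

```python
import random


def importance_weighted_feedback(gradient, track_prob, rng):
    """Track the round with probability track_prob and reweight by 1 / track_prob."""
    if rng.random() < track_prob:
        return gradient / track_prob
    return 0.0  # feedback from an untracked round is permanently lost


rng = random.Random(0)
true_gradient, track_prob, trials = 2.0, 0.25, 100_000

# Averaging many independent draws approximates the expectation of the estimator.
estimate = sum(
    importance_weighted_feedback(true_gradient, track_prob, rng)
    for _ in range(trials)
) / trials
```

The expectation is `track_prob * (true_gradient / track_prob) = true_gradient`, so the average stays close to 2.0 even though three quarters of the rounds are dropped; the cost is higher variance, which is where time-varying weights enter a regret analysis.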

2606.11709 2026-06-11 cs.LG cs.CL New submission

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen

Affiliations: Tsinghua University; Tongyi Lab, Alibaba Group; Peking University

AI summary: To address privilege-induced style drift in on-policy self-distillation, proposes RLCSD, which suppresses the style shift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, improving reasoning models on mathematical and logical reasoning tasks.

Comments: 20 pages, 9 figures, 9 tables

Abstract

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.
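One way to read the contrastive principle is as a per-token difference of two teacher-student gaps. The sketch below assumes each gap is a difference of log-probabilities on the sampled token; this form is an assumption made for illustration, and the abstract does not give the exact objective. The function name and the numbers are hypothetical.

```python
def contrastive_signal(student_lp, teacher_lp_correct, teacher_lp_wrong):
    """Per-token signal: gap under the correct hint minus gap under the wrong hint.

    Inputs are per-token log-probabilities of the sampled tokens (assumed form).
    """
    signal = []
    for s, t_correct, t_wrong in zip(student_lp, teacher_lp_correct, teacher_lp_wrong):
        gap_correct = t_correct - s  # teacher-student gap with a correct hint
        gap_wrong = t_wrong - s      # teacher-student gap with a wrong hint
        signal.append(gap_correct - gap_wrong)
    return signal


# Token 0 is a style token: any hint shifts it equally, so the contrast cancels.
# Token 1 is task-bearing: only the correct hint makes it more likely.
student = [-2.0, -1.5]
teacher_correct = [-0.5, -0.25]
teacher_wrong = [-0.5, -3.0]

signal = contrastive_signal(student, teacher_correct, teacher_wrong)
```

Under this assumed form, a shift that both hints induce contributes nothing, while a token that only the correct hint supports keeps a large signal, matching the stated goal of concentrating supervision on task-bearing tokens.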

2606.11708 2026-06-11 cs.RO New submission

Explore From Sketch: Accelerating UAV Exploration in Large-scale Environments with Prior Maps

Tiancheng Lai, Yuman Gao, Xiangyu Li, Ruitian Pang, Xingpeng Wang, Siqi Shen, Mengke Zhang, Yin He, Fei Gao, Chao Xu, Yanjun Cao

Affiliations: Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University, and Huzhou Key Laboratory of Autonomous System; Zhejiang Zhongyan Industry Co., Ltd; Differential Robotics Technology Company

AI summary: Proposes a framework that uses sparse, unaligned, and even discrepant 2D prior maps to accelerate UAV exploration of large-scale environments, combining robust 2D-3D point cloud registration with hierarchical viewpoint planning to improve exploration efficiency by up to 34.2%.

Comments: 25 pages, 22 figures

Abstract

Autonomous exploration with UAVs in large-scale, topologically complex environments often suffers from low efficiency due to suboptimal scheduling and detours. Prior maps (e.g., construction drawings), although usually imprecise and flawed, are readily available in many scenarios and have the potential to provide global structural guidance. This paper presents a novel exploration framework that leverages sparse, unaligned, and even discrepant 2D prior maps for LiDAR-based UAV exploration. First, a robust 2D-3D point cloud registration pipeline is proposed to align LiDAR observations with prior maps. The registration pipeline combines a GeoContext descriptor for single-frame candidate retrieval, a multi-frame verification mechanism for coarse transformation estimation with outlier rejection, and a Scale-ICP algorithm for refinement. The registration module can handle map discrepancies and provide multiple hypotheses when geometric ambiguities arise. To effectively utilize the registration results for exploration planning, we further develop a hierarchical viewpoint planning strategy under localization uncertainties. The hierarchical strategy first spatially attaches local viewpoints to prior guidepoints and adopts a Monte Carlo Tree Search solver to determine their traversal sequence under each registration hypothesis. To mitigate registration uncertainty, a risk-aware selector evaluates prior sequences using confidence-weighted travel risk, and a fixed-endpoint traveling salesman problem is formulated to generate an efficient local coverage path under the selected prior guidance. Benchmark evaluations reveal up to 34.2% improvement in exploration efficiency and 37.9% reduction in flight distance compared to state-of-the-art methods, while extensive simulations and field experiments further demonstrate robustness to prior map incompleteness and deformations.
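The fixed-endpoint traveling salesman step can be made concrete with a brute-force sketch for a handful of viewpoints. Exhaustive search, 2D Euclidean distances, and the function name are assumptions chosen for brevity; the abstract does not say which solver or cost the authors use.

```python
import math
from itertools import permutations


def fixed_endpoint_path(start, end, viewpoints):
    """Return the shortest visiting order from start to end via all viewpoints.

    Brute force over permutations, so only suitable for a few viewpoints.
    """
    best_order, best_length = None, math.inf
    for order in permutations(viewpoints):
        path = [start, *order, end]
        length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
        if length < best_length:
            best_order, best_length = list(order), length
    return best_order, best_length


start, end = (0.0, 0.0), (4.0, 0.0)
viewpoints = [(3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
order, length = fixed_endpoint_path(start, end, viewpoints)
```

Fixing both endpoints lets consecutive local paths be chained along the selected prior guidance without re-optimizing where each segment starts and ends.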

2606.11704 2026-06-11 cs.RO New submission

Improving Human Diving Endurance with a Field-Deployable, Untethered Exoskeleton

Zhihao Zhou, Zhenmeng Ju, Rui Yang, Chenxi Zhang, Zhihao Zhou, Ming Xu, Enhao Zheng, Dongjie Jiang, Lecheng Ruan, Jingeng Mai, Qining Wang

Affiliations: Institute for Artificial Intelligence, Peking University; Beijing Engineering Research Center of Intelligent Rehabilitation Engineering; School of Advanced Manufacturing and Robotics, Peking University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Department of Sports Medicine, Peking University Third Hospital; School of Rehabilitation Sciences and Engineering, University of Health and Rehabilitation Sciences

AI summary: Presents the DiveMate exoskeleton, which uses adaptive kick assistance in real-world underwater environments to increase travel distance per given amount of breathing gas by 42.9%, extend dive duration by 54.9%, and reduce the net gas consumption rate by 47.0%.

Abstract

Human endurance in underwater locomotion is fundamentally restricted by high energetic demands to overcome drag and the finite supply of self-contained breathing gas. While exoskeleton technology can reduce the metabolic cost of humans in terrestrial locomotion, its potential to enhance human endurance during underwater diving remains entirely unexplored. Here, we present DiveMate, a field-deployable, untethered exoskeleton designed to improve human diving endurance via adaptive kick assistance in real-world underwater environments. During naturalistic diving, DiveMate increases the travel distance using a given energy (breathing gas) by 42.9% and extends dive duration by 54.9% through reducing gas consumption rate. Marked reductions in muscle activation indicate a decrease in physiological exertion, with the net gas consumption rate decreasing by 47.0%. Kinematic characteristics and regularity improvements further underpin efficient energy economy. These results suggest that applying exoskeleton assistance is beneficial for improving human diving endurance and augmenting their ability to explore the aquatic world. This study extends the application frontier of exoskeletons and provides a potential reference for the design and assessment of future underwater assistive devices.

2606.11702 2026-06-11 cs.CV cs.AI cs.CL New submission

MedCTA: A Benchmark for Clinical Tool Agents

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

Affiliations: King Abdullah University of Science and Technology (KAUST); Massachusetts Institute of Technology (MIT)

AI summary: Proposes the MedCTA benchmark, which evaluates the planning and execution abilities of medical AI agents in tool retrieval, evidence acquisition, and integration on realistic multimodal clinical inputs such as radiology images, pathology slides, and reports.

Comments: Project Page: this https URL Code: this https URL Data: this https URL

Abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at this https URL

2606.11699 2026-06-11 cs.LG New submission

A Data-Centric Framework for Detecting and Correcting Corrupted Labels

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

Affiliations: Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam

AI summary: Proposes the Relabeler framework, which jointly uses local and global relationships to detect corrupted labels and corrects them by estimating the most probable clean label from the input features and the observed noisy label, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance across multiple datasets.

Abstract

The performance of machine learning and deep learning models largely depends on the quality of the training data. However, the quality of the real-world datasets is often compromised by noisy labels, which can substantially degrade model accuracy and reliability. To address this challenge, we propose Relabeler, an end-to-end data-centric framework for detecting and correcting corrupted labels. For corrupted label detection, Relabeler jointly leverages both local and global relationships among data instances to identify potentially noisy samples. After detecting suspicious instances, Relabeler further performs label correction by estimating the most probable clean label for each instance based on both its input features and observed noisy label. Extensive experiments across multiple datasets, noise types, and noise rates demonstrate that Relabeler consistently outperforms state-of-the-art baselines, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance.

2606.11695 2026-06-11 cs.LG cs.AI New submission

Noise-Aware Framework for Correcting Corrupted Labels

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Phong Lam, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

Affiliations: Faculty of Information Technology, VNU University of Engineering and Technology

AI summary: Proposes the CANOLA framework, which corrects corrupted labels through noise-aware learning and iterative label refinement, achieving relative improvements of 19% to 52% in error reduction over existing methods on six datasets.

Abstract

High-quality labeled data is essential for training reliable ML/DL models. However, real-world datasets often contain a considerable proportion of corrupted labels, which can severely degrade model performance. To address this problem, we propose CANOLA, a novel framework for correcting corrupted labels through noise-aware learning and iterative label refinement. CANOLA explicitly estimates the underlying noise distribution of the dataset and incorporates this information into the training of a noise-aware Deep Neural Network. By incorporating noise characteristics during learning, CANOLA enables the model to down-weight unreliable supervision signals and focus on trustworthy patterns, thereby improving robustness and generalization. Label correction is performed via cautious, iterative soft label refinement, in which model predictions are blended with observed labels to prevent premature or erroneous updates. This progressive refinement allows the dataset to be repaired in a stable and controlled manner. We evaluate CANOLA on six widely used datasets under realistic noisy labeling scenarios. Experimental results show that CANOLA consistently outperforms SOTA label correction methods, achieving relative improvements ranging from 19% to 52% in error reduction. Moreover, models trained on datasets corrected by CANOLA obtain substantial downstream performance gains. Even simple classifiers trained on CANOLA's corrected data can outperform complex model-centric approaches by margins of up to 67%.
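The cautious, iterative soft-label refinement can be sketched as a convex blend of the current label and the model prediction. The blend weight of 0.3, the number of rounds, and the function name are illustrative assumptions; the abstract says only that predictions are blended with observed labels.

```python
def refine_soft_label(current, predicted, blend=0.3):
    """Move the soft label a small step toward the model prediction.

    blend is an assumed hyperparameter; small values make updates cautious.
    """
    mixed = [(1 - blend) * c + blend * p for c, p in zip(current, predicted)]
    total = sum(mixed)
    return [m / total for m in mixed]


label = [0.0, 1.0, 0.0]       # observed, possibly corrupted one-hot label
prediction = [0.9, 0.1, 0.0]  # model consistently prefers class 0

history = [label]
for _ in range(3):
    label = refine_soft_label(label, prediction)
    history.append(label)
```

After one round the observed class still dominates (0.73 against 0.27), and the label flips to class 0 only after the prediction has persisted for three rounds, which is the sense in which blending guards against premature or erroneous updates.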

2606.11691 2026-06-11 cs.LG physics.flu-dyn New submission

Spectrally Regularized Latent Flow Matching for Turbulence Generation

Khalid Rafiq, Aditya G. Nair

Affiliations: Department of Mechanical Engineering, University of Nevada, Reno

AI summary: To address the under-representation of dissipation-range amplitudes by latent diffusion and flow matching models in turbulence generation, proposes a spectrally regularized latent flow matching framework whose zone-weighted log-spectral objective raises deep-dissipation retained spectral power from 25% to 94% in reconstruction and substantially improves the sampling cost-fidelity tradeoff.

Comments: Accepted at the AI4Physics Workshop at ICML 2026. OpenReview: this https URL

Abstract

Latent diffusion and flow matching have emerged as leading approaches for synthetic turbulence generation, yet they systematically under-represent dissipation-range amplitudes. We introduce a latent flow matching framework with a spectrally regularized compression stage that directly targets this failure mode. On a $256^2$ DNS dataset at $Re_f \approx 2250$, replacing an MSE-trained VAE with a zone-weighted log-spectral objective raises deep-dissipation retained spectral power from 25% to 94% in reconstruction and from 20% to 79% in unconditional generation. The improved latent representation also yields a substantially better sampling cost-fidelity tradeoff: the MSE-trained latent space imposes a fundamental quality ceiling near DD bias -0.70 that no integrator or step-count can overcome, while the spectrally regularized latent space reaches DD bias -0.117 at just 20 function evaluations. Mechanistically, encoder-decoder swap experiments show that the improvement is driven primarily by encoder-induced latent reorganization rather than decoder capacity, while a support-amplitude decomposition reveals that MSE-trained models behave as conservative suppression models, minimizing pointwise error by attenuating intermittent high-wavenumber structure. Both pipelines recover the second-order structure function and the correct sign of $S_3$, indicating the correct cascade direction without explicit supervision. A small residual gap in the magnitude of $S_3$ suggests that phase-coherent triadic organization remains a complementary axis to amplitude fidelity for future generative turbulence models.

2606.11689 2026-06-11 cs.CV New submission

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

Jiale Huang, Zixu Li, Zhiheng Fu, Zhiwei Chen, Qinlei Huang, Yupeng Hu

Affiliations: Shandong University

AI summary: To address noisy triplet correspondence in composed image retrieval, proposes the RankVR framework, whose Global Structure Consistency Perception module uses the effective rank of the correlation matrix to decouple clean samples from structural noise and whose Adaptive Semantic Value Calibration module dynamically quantifies the value of each triplet, significantly outperforming existing methods on FashionIQ and CIRR.

Comments: Accepted by ICMR 2026

Abstract

Composed Image Retrieval (CIR) constitutes a pivotal paradigm requiring models to perform joint reasoning on reference images and modification texts. However, the prevalence of Noisy Triplet Correspondence (NTC) in large-scale datasets severely constrains model performance. Existing denoising methods either target binary mismatches or rely on scalar-based point-wise estimation, neglecting rich global structural correlations among sample populations and dynamic value variations during training, thereby yielding suboptimal results. This paper identifies two critical unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address these, we propose RankVR, a framework designed to construct a robust CIR model via global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which utilizes the Effective Rank of the Correlation Matrix to decouple clean samples from structural noise. By measuring rank difference, GSCP identifies samples disrupting macroscopic semantic symmetry. Furthermore, we develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples. By integrating training potential and reliability, it dynamically quantifies the semantic value of each triplet, ensuring effective utilization of hard samples while suppressing noise characterized by logical conflicts. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate that RankVR significantly outperforms existing state-of-the-art methods, validating its superior robustness in noisy environments.
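Effective rank is commonly defined as the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that standard definition as an assumption, since the abstract does not state which variant GSCP uses, and it takes singular values as given rather than computing them from a correlation matrix. The example spectra are hypothetical.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


balanced = effective_rank([1.0, 1.0, 1.0, 1.0])  # energy spread over 4 directions
collapsed = effective_rank([5.0, 0.0, 0.0])      # all energy in one direction

# A rank difference between two spectra (hypothetical values).
rank_gap = effective_rank([3.0, 3.0, 3.0]) - effective_rank([6.0, 2.0, 1.0])
```

A flat spectrum gives an effective rank equal to the number of directions and a single dominant direction gives 1, so a change in this quantity offers a continuous measure of how much a set of samples alters the global correlation structure.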

2606.11688 2026-06-11 cs.CL cs.AI new submission

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Youwang Deng

Affiliations*: EpistemicaLab, Independent Research

AI summary: Proposes Autopilot, an execution model that externalizes agent state into a finite-state machine and enforces gated verification so that the agent cannot falsely claim success; across a 3,150-cell paired corpus its fabrication rate falls to 0.95%, well below the baselines.

Details
Comments
Preprint. Code: this https URL
English abstract

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
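
The hard floor described in the abstract can be sketched as a small gated state machine. This is an illustration under stated assumptions, not the paper's implementation: the class names `GatedStep` and `GatedMachine`, the `claim` method, and the two example gates are all hypothetical, and persistence of the state between ticks is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class GatedStep:
    gate: Callable[[], bool]  # falsifiable check, e.g. "run the test suite"
    executed: bool = False
    passed: bool = False


@dataclass
class GatedMachine:
    steps: Dict[str, GatedStep] = field(default_factory=dict)

    def add_step(self, name: str, gate: Callable[[], bool]) -> None:
        self.steps[name] = GatedStep(gate)

    def tick(self, name: str) -> bool:
        """Advance one step: actually run its gate and record the outcome."""
        step = self.steps[name]
        step.passed = bool(step.gate())
        step.executed = True
        return step.passed

    def claim(self) -> str:
        """Hard floor: 'done' only if every planned gate executed and passed."""
        if self.steps and all(s.executed and s.passed for s in self.steps.values()):
            return "done"
        return "stalled"


machine = GatedMachine()
machine.add_step("tests_pass", lambda: True)
machine.add_step("lint_clean", lambda: False)
claim_before_ticks = machine.claim()       # no gate has run yet
machine.tick("tests_pass")
machine.tick("lint_clean")
claim_after_failed_gate = machine.claim()  # one gate ran and failed
machine.add_step("lint_clean", lambda: True)  # re-plan the step with a passing gate
machine.tick("lint_clean")
claim_after_all_pass = machine.claim()
```

The point of the design is that the terminal claim is computed from recorded gate outcomes rather than reported by the agent, so the worst outcome in this sketch is a "stalled" result, never an unverified "done".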

2606.11686 2026-06-11 cs.CL cs.AI new submission

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

Sawyer Zhang, Alexander Wang, Sophie Lei

Affiliations*: Lumivate (Lumi)

AI summary: Proposes layer-isolated evaluation, which decomposes an LLM agent into a fixed taxonomy of layers and checks each layer for regressions with a deterministic, no-LLM test suite; shows that an aggregate metric masks local degradation while per-layer baseline-locked gates localize it.

Details
Comments
12 pages, 2 figures, 5 tables
English abstract

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.
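
The masking effect and the per-slice gate can be illustrated with a short sketch. The slice names, case counts, and function names below are made up for illustration, and the tolerance parameter is an assumption; the abstract does not describe the harness at this level of detail.

```python
def pass_rate(results):
    """Fraction of passing cases; None when the slice was never exercised."""
    if not results:
        return None
    return sum(results) / len(results)


def gate(slices, baseline, tolerance_pp=0.0):
    """Return the slices that regressed against their locked baseline."""
    regressed = []
    for name, locked in baseline.items():
        rate = pass_rate(slices.get(name, []))
        if rate is None:
            # Coverage honesty: an unexercised slice fails instead of being scored.
            regressed.append(name)
        elif (locked - rate) * 100 > tolerance_pp:
            regressed.append(name)
    return regressed


def aggregate(slices):
    """Single pass rate over all cases, ignoring slice boundaries."""
    cases = [ok for results in slices.values() for ok in results]
    return sum(cases) / len(cases)


baseline = {"intent": 1.0, "routing": 1.0, "memory": 1.0}
slices = {
    "intent": [True] * 45,
    "routing": [True] * 5 + [False] * 5,  # injected regression: half the slice fails
    "memory": [True] * 45,
}
overall = aggregate(slices)
flagged = gate(slices, baseline)
unexercised = gate({"intent": [True]}, {"intent": 1.0, "safety": 1.0})
```

In this toy data the aggregate drops by only 5 percentage points while the routing slice drops by 50, which is the same qualitative pattern the abstract reports.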

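2606.11682 2026-06-11 cs.CV cs.LG new submission

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

Jiaqi Luo

Affiliations*: School of Mathematical Sciences, Soochow University

AI summary: Proposes TI-Adapter, which freezes the tabular encoder and adds an adapter after it, and adapts the image branch with embedding-level and bottleneck-level adapters, for efficient multimodal fine-tuning; on 20 datasets it matches or exceeds full fine-tuning with fewer trainable parameters.

Details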
English abstract

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.
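
As a rough picture of what an adapter placed after a frozen encoder looks like, the sketch below uses the generic residual bottleneck form (down-projection, ReLU, up-projection, skip connection). This is an assumption: the abstract does not give TI-Adapter's exact layer design, and the weights and the embedding here are made-up numbers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def bottleneck_adapter(embedding, w_down, w_up):
    """Residual bottleneck adapter: h + W_up * relu(W_down * h)."""
    hidden = [max(0.0, v) for v in matvec(w_down, embedding)]
    delta = matvec(w_up, hidden)
    return [h + d for h, d in zip(embedding, delta)]


# Output of a frozen encoder (illustrative) passed through a 4 -> 2 -> 4 adapter.
embedding = [0.5, -1.0, 2.0, 0.0]
w_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
w_up_zero = [[0.0, 0.0] for _ in range(4)]  # zero init: adapter starts as identity
w_up = [[0.1, 0.0], [0.0, 0.0], [0.0, -0.5], [0.0, 0.0]]

identity_out = bottleneck_adapter(embedding, w_down, w_up_zero)
adapted_out = bottleneck_adapter(embedding, w_down, w_up)
```

Only `w_down` and `w_up` would be trained in this setup (16 weights here), which is where the parameter savings relative to full fine-tuning come from.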

2606.11681 2026-06-11 cs.CL cs.SD new submission

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

Affiliations*: Dept. of Electronics and Electrical Engineering, Yonsei University

AI summary: Proposes UR-BERT, a TTS text encoder based on romanized transcription that unifies writing systems into a shared romanized representation and adds a speech token prediction objective; it scales multilingual TTS to 495 languages, outperforms existing baselines, and generalizes to unseen languages.

Details
Comments
Accepted to Interspeech 2026
English abstract

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

2606.11680 2026-06-11 cs.AI cs.CL cs.LG new submission

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

Affiliations*: Duke University; Snowflake AI Research

AI summary: Proposes HORMA, which builds a file-system-like hierarchical memory and retrieves from it with a lightweight navigation agent trained by reinforcement learning, improving performance on long-horizon tasks while reducing token usage.

Details
English abstract

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.

2606.11678 2026-06-11 cs.CL new submission

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Yijie Deng, He Zhu, Wen Wang, Junyou Su, Minxin Chen, Wenjia Zhang

Affiliations*: School of Architecture and Urban Planning, Shenzhen University; Shenzhen Key Laboratory of Urban Spatial Information and Intelligent Modeling; Department of Urban Planning and Design, The University of Hong Kong

AI summary: Proposes UPBench, which evaluates 25 LLMs on a 4×5 matrix of knowledge pillars and cognitive levels; models do better on analytical tasks than on factual recall and integrative judgment, revealing how strongly planning knowledge depends on institutional context.

Details
English abstract

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.

2606.11675 2026-06-11 cs.AI new submission

Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang, Quan Lu, Dongfan Ye, Xuetao Chen, Jiang Zhong, Kaiwen Wei, Zhi Xu

Affiliations*: School of Computer Science, Chongqing University; AI Research Institution, Mashang Financial Institution; Department of Information, Third Military Medical University

AI summary: Proposes the LungKG knowledge graph and the Lung-R1 model, trained with KG-constrained reasoning-chain construction and KG-guided reinforcement learning to close the gap between pulmonary knowledge and case-level diagnosis; reaches state-of-the-art results on EMR diagnosis.

Details
English abstract

Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

2606.11674 2026-06-11 cs.SD cs.LG new submission

SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

Anton Firc, Vojtěch Staněk, Zbyněk Lička, Kamil Malinka, Martin Perešíni

Affiliations*: Brno University of Technology

AI summary: Proposes SpAArSIST, which sparsifies the AASIST graph-pooling backend, cutting backend compute by 20.7% and model size by 4.1% while staying competitive and improving out-of-domain robustness.

Details
Comments
Accepted at Interspeech 2026
English abstract

We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios $(k_{\mathrm{tr}},k_{\mathrm{inf}})$, magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M $\rightarrow$ 154.706M MACs) and model size by 4.1% (611.8k $\rightarrow$ 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.
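
The three lightweight choices named in the abstract (separate pooling ratios, magnitude-based node scoring, mean aggregation) can be sketched as follows. This is a guess at the general shape only: scoring by L2 norm, the `topk_mean_pool` name, and the example ratios are assumptions, not details taken from the paper.

```python
import math


def topk_mean_pool(nodes, ratio):
    """Keep the ceil(ratio * N) nodes with the largest L2 norm, then average them."""
    keep = max(1, math.ceil(ratio * len(nodes)))
    # Magnitude-based scoring: no learned projection, just the norm of each node.
    ranked = sorted(nodes, key=lambda v: math.sqrt(sum(x * x for x in v)), reverse=True)
    kept = ranked[:keep]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / keep for i in range(dim)]


nodes = [[3.0, 4.0], [0.1, 0.0], [0.0, 1.0], [6.0, 8.0]]
train_out = topk_mean_pool(nodes, ratio=0.75)  # plays the role of k_tr
infer_out = topk_mean_pool(nodes, ratio=0.5)   # plays the role of k_inf
```

Because the scoring has no trainable parameters, the pooling ratio can be changed at inference time without retraining, which is what makes a separate inference ratio possible in a scheme like this.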

2606.11670 2026-06-11 cs.CV cs.AI new submission

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, Pengfei Wan

Affiliations*: Peking University; Kuaishou Technology; Xiamen University

AI summary: Proposes ARGUS, which uses Stacked Multi-View Identity Mosaic Injection (SMII) to represent identity as a compact dynamic distribution and combines it with an MLLM Identity Director, no-cross-pair counterfactual training, and other modules, reaching state-of-the-art results in subject-preserving video generation.

Details
Comments
13 pages, 3 figures
English abstract

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3*3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.

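2606.11666 2026-06-11 cs.SD new submission

The Hidden Cost of Pairwise Verification in Synthetic Speech Source Tracing

Anton Firc, Zbyněk Lička, Vojtěch Staněk, Kamil Malinka

Affiliations*: Brno University of Technology

AI summary: Compares global anchoring with pairwise verification for synthetic speech source tracing and finds that pairwise objectives concentrate embedding variance into fewer directions and reduce resolution between closely related generators, giving higher in-domain error.

Details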
Comments
Accepted at Interspeech 2026
English abstract

Open-set source tracing is increasingly framed as a verification problem, motivating the use of pairwise metric-learning objectives from biometrics. We thus compare global anchoring and pairwise verification under matched backbones and a fixed data and epoch budget on MLAAD (in-domain) and STOPA (out-of-domain). In our runs, global anchoring yields lower in-domain error (8.61% EER) than pairwise variants (12-15% EER), even with rival mining and XLS-R finetuning. Because pairwise objectives optimize similarity directly, they concentrate variance into fewer embedding directions, reducing resolution among closely related generators. To test whether this drives the drop, we impose a similar bottleneck on the globally supervised baseline, yet the baseline remains competitive. Together with an embedding-space analysis ($k_{99}$), these results suggest that the gap is not explained by dimensionality alone, but rather by the pairwise objective's shaping of the retained directions.

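2606.11657 2026-06-11 cs.LG cs.AI new submission

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

Katherine Rosenfeld, Maike Sonnewald

Affiliations*: Gates Foundation; UC Davis

AI summary: Probes the internals of Walrus, a foundation model for continuum dynamics, with a sparse autoencoder; finds that its internal features do not map cleanly onto standard physical decompositions and that the emulator shows output-level discrepancies, highlighting key interpretability challenges for scientific foundation models.

Details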
Comments
8 pages, 5 figures
English abstract

Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style model can reproduce known continuum dynamics, what internal mechanism supports that behavior, is the internal behaviour consistent with known physics, and how does it relate to where the emulator succeeds or fails? We investigate a cross-domain foundation model for continuum dynamics, Walrus by Polymathic, using mechanistic interpretability guided by physical principles. We apply a sparse autoencoder (SAE) to probe a selected layer, and address the practical challenge of triaging a large feature set (over 20,000) using enstrophy as a physically grounded metric. As a deliberately simple testbed, we focus on shear flow and compare feature recruitment across multiple shear-flow setups, i.e. parameter values in the numerical simulation. Across setups we find evidence of piecewise consistency, with subsets of features recurring in similar roles, but this structure is intermittent and does not map cleanly onto standard physical decompositions. In parallel, direct comparisons between numerical simulation and the emulator reveal systematic output-level discrepancies, including regimes where energy/structures become too diffuse or too localized. We connect parts of these discrepancies to changes in specific SAE feature usage. Our work highlights open questions for scientific foundation models: how to robustly prioritize mechanistically meaningful features, how to separate stable structure from analysis artifacts (including single-layer and SAE limitations), and how to use established benchmarks to decide when "different" internal representations are genuinely informative rather than merely effective.

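2606.11652 2026-06-11 cs.LG new submission

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

Yifan Yang, Zhen Zhang, Jiayi Tian, Liyan Tan, Zheng Zhang

Affiliations*: University of California, Santa Barbara

AI summary: Proposes Input Attribution-Aware Policy Optimization (IAPO), a reinforcement learning method that aligns the model's input attribution with that of a stronger teacher to improve tool calling in multimodal small language models, raising average accuracy by 3% across six test sets.

Details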
English abstract

This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-calling ability, these approaches face inherent limitations for SLM training, especially under multimodal scenarios. First, many existing methods evaluate tool use correctness through exact matching against certain ground-truth or predefined formats. However, this assumption is often unsuitable for multimodal tasks, where multiple tool use paths may be valid and annotated tool trajectories are typically unavailable. Second, such sparse and brittle binary rewards provide little guidance on how to improve the underlying decision process, making them particularly difficult for multimodal SLM to learn from. To address these issues, we propose Input Attribution-Aware Policy Optimization (IAPO), an RL algorithm for improving tool use in multimodal SLM by aligning the model's attribution across input components with that of a stronger teacher. Experiments on Qwen2.5-VL-3B show that the proposed method improves visual question answering accuracy by an average of 3% across six test sets compared with existing visual tool use work, by helping the model attend to the most relevant input evidence.

2606.11650 2026-06-11 cs.LG math.NA physics.comp-ph new submission

Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

Handi Zhang, Adrienne M. Propp, Brooks Kinch, Houman Owhadi, Nathaniel Trask

Affiliations*: University of Pennsylvania; Stanford University; California Institute of Technology

AI summary: Proposes a structure-preserving reduced-order model that couples mixed finite element spaces with Gaussian process regression, uses the exposed topological structure to quantify uncertainty in state-flux relationships, and derives closed-form posterior uncertainty for a Dirichlet-to-Neumann map.

Details
English abstract

Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary verification and validation. In this work, we construct data-driven reduced-order models that serve as structure-preserving, real-time surrogates. Remarkably, the exterior calculus that imposes physical conservation structure also exposes topological structure that we use to build a Gaussian process (GP) representation of uncertainty in state-flux relationships, ultimately yielding a Dirichlet-to-Neumann map for quantities of interest with closed-form expressions for posterior uncertainty. We specifically propose structure-preserving $H(\mathrm{div})$--$L^2$ subspaces of conventional Raviart--Thomas and $dgP_0$ elements prescribed by a lightweight transformer. Reduced-order dynamics consistent with this subspace are learned by posing a conservation law in which a GP describes the fluxes between volumes. This work hinges on a novel interface between mixed FEM spaces and GP regression; when training is posed as the optimal recovery problem (ORP), the resulting GP regression can be written as an optimization problem with equality constraints that impose a conservation structure, amenable to a fast Schur-complement training strategy. The trained model can then be solved in real time with closed-form estimators for boundary fluxes driven by prescribed Dirichlet data. The paper includes RKHS posterior error bounds for linear functionals to support uncertainty quantification, as well as numerical experiments demonstrating the accuracy of the posterior distribution as a surrogate for error estimation.

2606.11645 2026-06-11 cs.CV new submission

Motion Reinforces Appearance: RGB-Skeleton Gated Residual Fusion for Micro-Gesture Online Recognition

Jialin Liu, Xinwen He, Pengyu Liu, Jiale Shi, Huaijuan Zang, Yanbin Hao

Affiliations*: School of Computer Science and Information Engineering, Hefei University of Technology (HFUT), Hefei, China

AI summary: Proposes DyFADet+, a dual-stream RGB-skeleton framework whose gated residual module adaptively fuses skeleton motion into RGB features for micro-gesture online recognition; reaches an F1 score of 40.88 on the SMG dataset, ranking 2nd.

Details
Comments
13 pages, 2 figures
English abstract

Micro-gesture analysis attracts increasing attention for inferring spontaneous emotion from subtle body movements. Micro-gesture online recognition, which localizes and classifies each gesture instance in untrimmed videos, is a core task in the 4th EI-MiGA-IJCAI Challenge. Compared with typical temporal action detection, this task emphasizes both the localization and the classification of actions, requiring the model to output the start time, end time, and category of each micro-gesture. Moreover, since micro-gestures are highly spontaneous, a single modality rarely captures complete and accurate cues. In this work, we propose DyFADet+, which extends DyFADet into a dual-stream RGB-skeleton framework. In our model, both modalities are projected into shared multi-scale temporal embeddings and fused through a gated residual module, which adaptively injects skeleton motion into the RGB representation rather than using naive concatenation. Finally, these fused features are decoded by a Dynamic TAD head for online classification and boundary regression. On the SMG dataset, our method achieves an F1 score of 40.88, ranking 2nd in the Micro-gesture Online Recognition track.
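
A gated residual fusion of the kind described above can be sketched for a single time step as follows. The exact gate in DyFADet+ is not given in the abstract, so the per-channel form below (a sigmoid gate computed from the concatenated features, added residually to the RGB stream) is an assumption, as are the function names and all the numbers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_fusion(rgb, skeleton, gate_weights, gate_bias):
    """fused[i] = rgb[i] + gate[i] * skeleton[i], with gate = sigmoid(W [rgb; skeleton] + b)."""
    joint = rgb + skeleton  # list concatenation, i.e. [rgb; skeleton]
    fused = []
    for i, (r, s) in enumerate(zip(rgb, skeleton)):
        logit = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        # The RGB feature always passes through; the gate decides how much motion to add.
        fused.append(r + sigmoid(logit) * s)
    return fused


rgb = [1.0, 2.0]
skeleton = [0.5, -1.0]
gate_weights = [[0.0] * 4, [0.0] * 4]  # zero weights so the bias alone sets the gate
gate_bias = [0.0, -50.0]               # channel 0 half open, channel 1 almost closed
fused = gated_residual_fusion(rgb, skeleton, gate_weights, gate_bias)
```

Unlike plain concatenation, a closed gate leaves the RGB feature untouched, so an unreliable skeleton stream cannot degrade the appearance representation in this formulation.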

2606.11643 2026-06-11 cs.CL New submission

Improving Cross-Format Robustness in Language Models with Multi-Format Training

通过多格式训练提升语言模型的跨格式鲁棒性

June M. Liu, Shaomian Zheng, He Cao, Dingnan Jin, Qing Cui, Jun Zhou

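Affiliations: Ant Group(蚂蚁集团) International Digital Economy Academy (IDEA)(国际数字经济学院(IDEA))

AI summary: Proposes FormatMix, which expands part of the training data into multiple equivalent formats and significantly improves LLM consistency across answer formats; expanding only about 30% of the data comes close to the effect of full-format training.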

AI中文摘要

大型语言模型通常对答案格式仍然敏感:一种格式下正确解答的问题可能在另一种语义等价的格式下失败。为了研究这一差距,我们将跨格式鲁棒性定义为模型在不同格式下一致回答相同潜在问题的程度。然后,我们比较了全格式训练与FormatMix,后者使用随机或目标选择将仅一部分训练项扩展为多种等价格式。在GLM4和Llama-3.1上,多格式监督一致地提升了任务性能和跨格式鲁棒性,而仅使用多项选择题(MCQ)监督几乎无益,甚至可能降低鲁棒性。我们进一步发现,仅将约30%的训练集扩展为多种格式通常能恢复全格式训练的大部分收益,并且这一效果在我们研究的模型族和规模中均存在。这些结果表明,格式多样性(而非额外的监督本身)是鲁棒性的关键驱动因素。轻量级的多格式增强是一种实用的方法,可以在不改变基础模型的情况下使LLM对答案格式不那么敏感。

英文摘要

Large language models often remain sensitive to answer format: a question solved correctly in one form may fail in another semantically equivalent form. To study this gap, we define cross-format robustness as the extent to which a model answers the same underlying question consistently across formats. We then compare full-format training with FormatMix, which expands only a subset of training items into multiple equivalent formats using either random or targeted selection. Across GLM4 and Llama-3.1, multi-format supervision consistently improves both task performance and cross-format robustness, whereas multiple-choice question (MCQ)-only supervision brings little benefit and can even reduce robustness. We further find that expanding only about 30% of the training set into multiple formats often recovers most of the gain from full-format training, and this effect appears across the model families and sizes we study. These results suggest that format diversity, rather than additional supervision alone, is the key driver of robustness. Lightweight multi-format augmentation is a practical way to make LLMs less sensitive to answer format without changing the base model.
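The two ideas in this abstract, measuring consistency across formats and expanding only a subset of items, can be sketched in a few lines. The abstract does not state the exact metric or selection procedure, so the sketch assumes that a question counts as consistent when its correctness is identical in every format, and that the random variant of selection is a uniform sample of a fixed fraction. The names `cross_format_consistency` and `select_for_expansion` are hypothetical.

```python
import random


def cross_format_consistency(correct_by_format):
    """Fraction of questions whose outcome is identical across all formats.

    correct_by_format maps question id -> {format name -> bool}.
    This definition is an assumption; the paper may define robustness differently.
    """
    if not correct_by_format:
        return 0.0
    consistent = sum(
        1 for outcomes in correct_by_format.values()
        if len(set(outcomes.values())) == 1
    )
    return consistent / len(correct_by_format)


def select_for_expansion(item_ids, fraction=0.3, seed=0):
    """Randomly pick the subset of training items to expand into multiple formats."""
    ordered = sorted(item_ids)  # sort first so a given seed is reproducible
    k = round(len(ordered) * fraction)
    return set(random.Random(seed).sample(ordered, k))


results = {
    "q1": {"mcq": True, "open_ended": True},    # same outcome in both formats
    "q2": {"mcq": True, "open_ended": False},   # format-sensitive
}
score = cross_format_consistency(results)
chosen = select_for_expansion(range(10), fraction=0.3)
```

Targeted selection, the other variant the abstract mentions, would replace the uniform sample with a ranking of items by some criterion that the abstract does not specify.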

2606.11640 2026-06-11 cs.LG cs.AI New submission

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

TAROT: 面向小样本表格学习的任务自适应LLM先验图精炼

Ruxue Shi, Yili Wang, Mengnan Du, Hangting Ye, Yi Chang, Xin Wang

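Affiliations: Jilin University(吉林大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI summary: Proposes the TAROT framework, which builds and refines a task-adaptive semantic graph, using LLM priors and a GNN to encode semantic relationships between features and improve few-shot tabular learning performance.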

AI中文摘要

小样本表格学习为实际应用中标注成本高、新任务样本收集困难的情况提供了一种经济有效的方法。现有的传统方法和基于LLM的方法在小样本场景中已展现出有效性。然而,传统方法需要在未标注或生成的数据上进行额外训练,这带来了显著的计算开销。此外,直接将原始表格数据输入LLM的基于LLM的方法引发了隐私和合规性问题。更重要的是,这两种范式都很大程度上忽略了特征之间的语义关系,而语义关系为构建语义图提供了结构和语义先验。语义图对于在小样本场景中建模有意义的特征交互至关重要。本文提出TAROT,一个基于GNN的框架,通过从先验中构建并精炼任务自适应语义图来编码结构和语义先验,从而提升小样本表格学习的预测性能。TAROT首先通过统一语义表格节点编码器(USTNE)将异构表格数据编码为统一的节点语义表示。然后,它提示LLM根据任务描述和特征名称推断特征之间的语义关系,以构建语义图。为了减轻LLM幻觉引入的结构噪声,TAROT引入了任务自适应语义图精炼,剪除虚假或与任务无关的边,并添加缺失的与任务相关的边,使图结构与下游目标对齐。最后,GNN在精炼后的图上进行消息传递,以捕获与任务相关的语义依赖关系进行预测。在各种小样本表格学习基准上的大量实验证明了TAROT的优越性能,使其成为该领域的最先进方法。

英文摘要

Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incurs significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide structural and semantic priors for constructing a semantic graph. Such a semantic graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes these structural and semantic priors by constructing and refining a task-adaptive semantic graph, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationships between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by LLM hallucination, TAROT introduces Task-adaptive Semantic Graph Refinement, which prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.
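The prune-and-add refinement followed by message passing can be pictured with a toy example. The abstract does not describe how TAROT scores edges or which GNN it uses, so the sketch below substitutes a simple threshold rule over hypothetical task-relevance scores and a single round of mean aggregation over scalar node features. The names `refine_graph` and `mean_message_pass`, the thresholds, and the feature names are all illustrative assumptions.

```python
def refine_graph(edges, relevance, prune_below=0.2, add_above=0.8):
    """Threshold-based stand-in for task-adaptive graph refinement.

    edges:     set of (u, v) feature pairs proposed by the LLM prior
    relevance: dict (u, v) -> hypothetical task-relevance score in [0, 1]
    Edges scoring below prune_below (or with no score) are dropped;
    any scored pair at or above add_above is added even if it was missing.
    """
    kept = {e for e in edges if relevance.get(e, 0.0) >= prune_below}
    added = {e for e, score in relevance.items() if score >= add_above}
    return kept | added


def mean_message_pass(features, edges):
    """One round of mean aggregation over each node and its neighbors."""
    neighborhood = {node: [node] for node in features}
    for u, v in edges:
        neighborhood[u].append(v)
        neighborhood[v].append(u)
    return {
        node: sum(features[m] for m in members) / len(members)
        for node, members in neighborhood.items()
    }


llm_edges = {("age", "income"), ("age", "zip")}
scores = {("age", "income"): 0.9, ("age", "zip"): 0.05, ("income", "job"): 0.85}
refined = refine_graph(llm_edges, scores)

node_features = {"age": 1.0, "income": 3.0, "job": 5.0, "zip": 7.0}
updated = mean_message_pass(node_features, refined)
```

In this toy run the spurious `("age", "zip")` edge is pruned and the missing `("income", "job")` edge is added, so `zip` no longer exchanges information with any other feature.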

2606.11637 2026-06-11 cs.AI New submission

TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

TouchThinker: 通过大规模数据和动作感知表示将触觉常识推理扩展到开放世界

Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu, Jie Hao, Ce Hao, Weihao Yuan, Shuicheng Yan

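Affiliations: Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) National University of Singapore(新加坡国立大学) Zhongguancun Academy(中关村学院) Xiamen University(厦门大学) Xi’an Jiaotong University(西安交通大学) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

AI summary: Proposes the TouchThinker framework, which scales tactile commonsense reasoning to the open world by building TouchThinker-1M, a million-scale multi-source tactile dataset, and applying action-aware modeling, achieving competitive performance on multiple datasets.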

Comments
18 pages, 11 figures
AI中文摘要

触觉是具身智能体理解物理世界的关键模态。尽管最近的工作已将触觉信号融入语言系统进行触觉常识推理,但由于两个关键瓶颈,将此类系统扩展到现实的开放世界环境仍然具有挑战性:(1) 当前的触觉推理数据集在格式和规模上仍然有限,为从触觉观察到物理常识的推理提供的监督不足,并阻碍了可迁移触觉常识的学习;(2) 触觉信号本质上是冗余且特定于动作的,但现有方法常常忽略这些特性,导致表示效率低下且语义表达能力有限。为了解决这些局限性,我们提出了TouchThinker,一个从数据和表示两个角度将触觉常识推理扩展到开放世界的触觉-语言框架。首先,我们构建了TouchThinker-1M,一个百万级、多源的触觉推理数据集,涵盖415个物体、8个场景和7种传感器类型,为开放世界泛化提供了坚实的数据基础。我们进一步引入了TouchThinker-Bench,一个具有更真实和多样化任务的开放世界基准。然后,我们提出了动作感知建模机制,以提高触觉表示效率并实现高效推理。实验结果表明,TouchThinker在多个数据集上取得了与最先进模型竞争的性能。我们的代码和数据集将在以下网址提供:this https URL。

英文摘要

Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of transferable tactile commonsense; (2) tactile signals are inherently redundant and action-specific, yet existing methods often overlook these properties, resulting in inefficient representations with limited semantic expressiveness. To address these limitations, we propose TouchThinker, a tactile-language framework that scales tactile commonsense reasoning to the open world from both data and representation perspectives. First, we construct TouchThinker-1M, a million-scale, multi-source tactile reasoning dataset covering 415 objects, 8 scenarios, and 7 sensor types, providing a solid data foundation for open-world generalization. We further introduce TouchThinker-Bench, an open-world benchmark with more realistic and diverse tasks. Then, we propose an action-aware modeling mechanism to improve tactile representation efficiency and enable efficient reasoning. Experimental results demonstrate that TouchThinker achieves competitive performance against state-of-the-art models across multiple datasets. Our code and dataset will be made available at: this https URL.