arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05575 2026-06-05 cs.SD eess.AS

SB-RF: Schrödinger Bridge Rectified Flow for One-Step Robust Speech Enhancement

SB-RF: 用于一步鲁棒语音增强的薛定谔桥整流流

Caixia Lu, Xueyang Lv, Penglong Hu, Jiaming Xu

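Affiliations: Xiaomi Corporation, Beijing, China

AI summary: Proposes SB-RF, a one-step generative speech enhancement framework that combines rectified flow with Schrödinger bridge theory. It builds a conditional bridge via entropy-regularized optimal transport to enable high-quality one-step generation, reaches leading performance among generative methods on the VoiceBank-DEMAND benchmark, and shows strong robustness and high efficiency in low-SNR scenarios.

Chinese abstract (AI-generated)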

生成模型在语音增强中表现出令人印象深刻的结果,但通常受限于多步推理。我们提出SB-RF,一种将整流流(RF)与薛定谔桥(SB)理论相结合的一步生成框架。SB-RF通过熵正则化最优传输在干净和带噪语音分布之间构建条件桥。通过RF的速度匹配目标将SB轨迹与最优传输测地线对齐,SB-RF能够通过一步生成实现高质量增强。实验表明,SB-RF在VoiceBank-DEMAND基准上达到了生成方法中的领先性能。此外,为了全面评估在具有挑战性的真实场景中的性能,我们在一个模拟的低信噪比测试集上使用扩大的训练数据集评估SB-RF。在这些条件下,SB-RF展现出强大且具有竞争力的鲁棒性和高效率,验证了其在现实应用中的潜力。

English abstract

Generative models have shown impressive results in speech enhancement but often suffer from multi-step inference. We propose SB-RF, a one-step generative framework integrating Rectified Flow (RF) with Schrödinger Bridge (SB) theory. SB-RF constructs a conditional bridge between clean and noisy speech distributions via entropy-regularized optimal transport. By aligning SB trajectories with the optimal transport geodesic through the velocity-matching objective of RF, SB-RF enables high-quality enhancement with one-step generation. Experiments demonstrate that SB-RF achieves leading performance among generative methods on the VoiceBank-DEMAND benchmark. Furthermore, to fully assess performance in challenging real-world scenarios, we evaluate SB-RF on a simulated low signal-to-noise ratio test set using an expanded training dataset. Under these conditions, SB-RF exhibits strong and competitive robustness with high efficiency, validating its potential for real-world applications.
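A minimal sketch of the rectified-flow idea the abstract refers to, under simplifying assumptions: signals are short lists of floats, the path between a noisy and a clean sample is a straight line, and the velocity field is an oracle rather than a trained network. The entropy-regularized bridge construction is not reproduced here, and every function name is hypothetical.

```python
# Toy illustration of velocity matching on a straight path and one-step sampling.
# Assumption: the oracle velocity stands in for a trained network.

def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for velocity matching on a straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def one_step_generate(x0, velocity_fn):
    """Single Euler step from t = 0 to t = 1: x1_hat = x0 + v(x0, 0)."""
    v = velocity_fn(x0, 0.0)
    return [a + dv for a, dv in zip(x0, v)]


noisy = [0.9, -0.2, 0.4]  # hypothetical noisy sample at t = 0
clean = [1.0, 0.0, 0.5]   # hypothetical clean sample at t = 1


def oracle_velocity(x, t):
    """Ideal constant velocity along the straight path (illustration only)."""
    return target_velocity(noisy, clean)


estimate = one_step_generate(noisy, oracle_velocity)
midpoint = interpolate(noisy, clean, 0.5)
```

When the learned velocity is close to constant along the path, a single Euler step lands near the endpoint, which is the intuition behind one-step generation.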

2606.05571 2026-06-05 cs.SD eess.AS

Sound Effects Dataset Unification With the Universal Category System

使用通用分类系统统一音效数据集

Jun Woo Beck, Alexander Lerch

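Affiliations: University of California, Berkeley

AI summary: Proposes a modular dataset relabeling framework built on the Universal Category System (UCS). A rule-based multi-stage pipeline with conflict resolution achieves high automatic conversion rates, and the framework is used to create EnvSound-UCS, a unified dataset of 58,057 audio clips.

Comments: DAFx 2026 camera-ready version
Chinese abstract (AI-generated)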

音效(SFX)数据集和库通常采用不同的标注方案、分类法和元数据结构。这给SFX分类和生成的研究带来了挑战,因为不兼容的分类法导致数据集孤立,可能需要个性化方法,产生不可比较的结果,并阻碍数据合并策略。我们提出了一个模块化的数据集重新标注框架,采用通用分类系统(UCS)——一种行业标准的音效层次分类法——作为共享结构基础。这个开源框架使我们能够(i)通过基于规则的多阶段流水线和冲突解决,将现有数据集的标签转换为UCS,实现高自动转换率;(ii)为新标签建议分层数据集划分;(iii)合并多个数据集。为了展示实际效用,我们引入了EnvSound-UCS数据集,这是一个公开可用的、符合UCS的统一环境声音数据集,包含来自AudioSet、FSD50K和ESC-50三个来源的58,057个音频片段。

English abstract

Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation because incompatible taxonomies lead to siloed datasets that might require individualized approaches, result in non-comparable outcomes, and prevent data merging strategies. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects, as a shared structural foundation. This open-source framework enables us (i) to convert tags of existing datasets to UCS with a rule-based multi-stage pipeline and conflict resolution to achieve high automatic conversion rates, (ii) to suggest a stratified dataset split for the new labels, and (iii) to combine multiple datasets. To showcase the practical utility, we introduce the EnvSound-UCS dataset, a publicly available unified UCS-compliant dataset of environmental sounds with 58,057 sound clips from three sources: AudioSet, FSD50K, and ESC-50.

2606.05570 2026-06-05 cs.CL cs.AI

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

TensorBench: 在基于编译器的张量框架上对编码智能体进行基准测试

Bobby Yan, Fredrik Kjolstad

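Affiliations: Department of Computer Science, Stanford University

AI summary: Presents TensorBench, a benchmark of 199 feature-addition and refactoring tasks for evaluating coding agents on a compiler-based tensor framework, graded automatically by running the framework's test suite.

Chinese abstract (AI-generated)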

仓库级别的编码基准测试面临任务难度与评估可靠性之间的权衡:挑战前沿模型的任务通常涉及代码库庞大且测试覆盖不完整,而人工审查难以扩展。我们引入了 TensorBench,这是一个包含199个特征添加和重构任务的基准测试,基于一个开源的基于编译器的张量框架,该框架通过一流的密集和稀疏张量支持扩展了 PyTorch。任务涵盖新的稀疏格式、密集优化过程、IR 转换、调度器更改、运行时组件以及高级数值算子。TensorBench 通过应用智能体的补丁并运行框架的测试套件(包括预先存在的随机回归测试和智能体添加的任何测试)来对每次运行进行评分。对于特征添加任务,通过意味着修补后的仓库保留了测试过的预先存在的行为,并满足了智能体为请求特征添加的检查。我们评估了七个编码智能体,涵盖三个前沿模型系列和一个开放权重模型。在此标准下的通过率从最强智能体的 $64.8\%$ 到最弱智能体的 $22.1\%$ 不等。智能体通过不同的任务子集:成对 Cohen's $κ$ 范围从 $-0.07$ 到 $0.43$,两个最强智能体的 $κ= 0.05$。

English abstract

Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from 64.8% for the strongest agent to 22.1% for the weakest. Agents pass different subsets of tasks: pairwise Cohen's κ ranges from -0.07 to 0.43, with κ = 0.05 for the two strongest agents.
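The pairwise agreement statistic reported in the abstract can be illustrated with a generic implementation of Cohen's κ for two binary pass/fail vectors. The outcome vectors below are made-up data, not TensorBench results, and the function name is an assumption.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length binary (0/1) outcome vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of tasks where both agents agree.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each agent's marginal pass rate.
    pass_a = sum(a) / n
    pass_b = sum(b) / n
    p_chance = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical outcomes on eight tasks: both agents pass half of them,
# but they pass largely different tasks.
agent_a = [1, 1, 1, 0, 0, 1, 0, 0]
agent_b = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(agent_a, agent_b)
```

Two agents with the same pass rate can still have κ near zero, which is how similar aggregate scores can coexist with largely different sets of solved tasks.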

2606.05569 2026-06-05 cs.CL cs.SD eess.AS

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

基于语言特定统计图的领域感知发音错误检测与诊断

Huu Tuong Tu, Hanh Nguyen, Thien Van Luong, Nguyen Tien Cuong, Vu Huan, Nguyen Thi Thu Trang

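Affiliations: Hanoi University of Science and Technology; VNPT AI, VNPT Group; National Economics University

AI summary: Proposes a method that uses language-specific statistical graphs to learn phoneme confusion patterns, achieving an F1-score of 59.52% on the L2-ARCTIC benchmark and outperforming several baselines.

Comments: Accepted at Interspeech 2026
Chinese abstract (AI-generated)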

近年来,发音错误检测与诊断(MDD)在计算机辅助语言学习和语音技术中变得越来越重要。本文提出了一种构建统计图的方法,使模型能够学习表示为有向图的音素混淆模式。此外,我们引入了一种语言特定策略,以捕捉不同母语(L1)背景下的系统性发音差异。通过在L2-ARCTIC基准上的大量实验证明了我们方法的有效性,该方法达到了59.52%的F1分数,优于多个竞争基线。

English abstract

Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.
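One plausible reading of "statistical graphs" is a directed graph whose edge weights are empirical confusion frequencies, built separately per native language. The sketch below follows that reading as an assumption; the abstract does not specify the construction, the records are invented, and the function names are hypothetical.

```python
from collections import defaultdict


def build_confusion_graph(pairs):
    """Directed graph: canonical phoneme -> {realized phoneme: relative frequency}."""
    counts = defaultdict(lambda: defaultdict(int))
    for canonical, realized in pairs:
        counts[canonical][realized] += 1
    graph = {}
    for canonical, outgoing in counts.items():
        total = sum(outgoing.values())
        graph[canonical] = {r: c / total for r, c in outgoing.items()}
    return graph


def build_graphs_by_l1(records):
    """One confusion graph per native-language (L1) group."""
    grouped = defaultdict(list)
    for l1, canonical, realized in records:
        grouped[l1].append((canonical, realized))
    return {l1: build_confusion_graph(pairs) for l1, pairs in grouped.items()}


# Invented (L1, canonical, realized) records for illustration only.
records = [
    ("Vietnamese", "th", "t"),
    ("Vietnamese", "th", "t"),
    ("Vietnamese", "th", "th"),
    ("Vietnamese", "th", "s"),
    ("Hindi", "v", "w"),
    ("Hindi", "v", "v"),
]
graphs = build_graphs_by_l1(records)
```

Keeping a separate graph per L1 lets a downstream model weight likely substitutions differently for each speaker group.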

2606.05564 2026-06-05 cs.CL

Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program

使用大型语言模型支持本科研究项目的高容量申请评审

Varun Aggarwal, Kay Kobak, John Howarter

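Affiliations: Engineering Undergraduate Research Office, Purdue University; Elmore School of Electrical and Computer Engineering, Purdue University; School of Materials Engineering, Purdue University

AI summary: Develops and deploys a tool based on GPT models (GPT-4o, GPT-5-mini, GPT-5.2) that scores roughly 1,200 Statements of Purpose for Purdue University's SURF program and annotates each with a rationale, cutting the coordinator's review from a multi-week effort to about 4 hours.

Chinese abstract (AI-generated)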

本科研究项目(如普渡大学的暑期本科生研究奖学金SURF)每年收到数千份申请,需要项目工作人员花费大量时间和精力在紧迫的时间线内一致地评估每份提交。这篇进行中的论文描述了一个基于大型语言模型(LLM)的工具的开发和初步部署,用于协助评估普渡大学SURF 2026周期的约1200份学生目的陈述(SoP)。该工作流程使用OpenAI GPT模型(GPT-4o、GPT-5-mini和GPT-5.2),并采用一个包含六个子类别的结构化评分标准,每个子类别按0-3分评分。少数由项目工作人员评分的SoP用于调整模型响应。模型提示设计为生成数值分数、理由(包括正面和负面方面)以及每份提交的简短摘录。使用GPT-5.2,全部1200份SoP在约4.6小时的计算时间内处理完毕,平均每份SoP约14秒(每份SoP的处理时间随其长度变化,范围从500到2000词)。不同模型版本在评分标准遵循度上存在显著差异,其中GPT-5.2遵循最严格。模型分数的不一致在低分提交中更为明显。LLM输出复制了之前由分布式人工评分员扮演的角色,为项目协调员提供了整个申请人群体的评分和理由注释输出。然后,项目协调员将这些输出与每位申请人的SoP一起审查,应用与之前SURF周期相同的下游办公室标准,以产生强候选人的短名单。这次协调员审查在大约4小时内完成,而之前项目周期需要数周的协调工作。

English abstract

Undergraduate research programs such as the Summer Undergraduate Research Fellowship (SURF) at Purdue University receive thousands of applications every year, requiring significant time and effort for program staff to evaluate each submission consistently and within tight timelines. This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow utilizes OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and uses a structured rubric across six subcategories, each scored on a 0-3 scale. A few SoPs, graded by program staff, were used to tune the model responses. The model prompt was designed to generate numerical scores, rationales (including positive and negative aspects), and short excerpts from each submission. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours of compute time, averaging roughly 14 seconds per SoP (with per-SoP timing varying with SoP length, which ranged from 500 to 2,000 words). Notable differences in rubric adherence were observed across model versions, with GPT-5.2 adhering most closely. Disagreement in model scores was more pronounced for lower-scoring submissions. The LLM outputs replicated the role previously played by distributed human graders, providing the program coordinator with scored and rationale-annotated outputs for the entire applicant pool. The program coordinator then reviewed these outputs alongside each applicant's SoP, applying the same downstream office criteria used in prior SURF cycles, to produce a shortlist of strong candidates. This coordinator review was completed in approximately 4 hours, compared to the multi-week coordination effort required in prior program cycles.
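The throughput figure and the rubric range quoted in the abstract can be checked with a few lines of arithmetic. Summing the six subcategory scores into one total is an assumption made for illustration; the abstract does not say how subscores are aggregated, and the function name is hypothetical.

```python
N_SOPS = 1200            # approximate batch size from the abstract
COMPUTE_HOURS = 4.6      # approximate compute time from the abstract
SUBCATEGORIES = 6        # rubric subcategories
MAX_PER_SUBCATEGORY = 3  # each subcategory is scored on a 0-3 scale

# Average processing time per statement, in seconds.
seconds_per_sop = COMPUTE_HOURS * 3600 / N_SOPS

# Largest possible total if subscores are simply summed (assumed aggregation).
max_total_score = SUBCATEGORIES * MAX_PER_SUBCATEGORY


def total_score(subscores):
    """Validate six 0-3 subscores and return their sum (assumed aggregation)."""
    if len(subscores) != SUBCATEGORIES:
        raise ValueError("expected one score per subcategory")
    if any(s < 0 or s > MAX_PER_SUBCATEGORY for s in subscores):
        raise ValueError("each subscore must lie in the 0-3 range")
    return sum(subscores)


example_total = total_score([3, 2, 1, 0, 3, 2])
```

The per-statement average works out to about 13.8 seconds, consistent with the "roughly 14 seconds" stated in the abstract.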

2606.05563 2026-06-05 cs.AI cs.CL

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

SoCRATES:跨领域和社会认知变异的前瞻性LLM调解的可靠自动化评估

Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song

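Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes the SoCRATES benchmark, which evaluates LLM mediators using realistic multi-domain conflict scenarios and five socio-cognitive adaptation axes. Its topic-localized evaluator reaches 0.82 alignment with human experts, and even the strongest model closes only about a third of the unmediated consensus gap.

Chinese abstract (AI-generated)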

评估LLM调解员仍然具有挑战性,因为调解是一个实时轨迹,由争议者不断变化的情感、意图和背景塑造。现有的测试平台依赖于少数专家撰写的领域,主要变化战略姿态,并对每个话题的每一轮进行评分,引入了离题噪声。我们引入了SoCRATES,一个用于在现实的多领域测试平台中评估前瞻性LLM调解员的基准。它通过一个跨八个领域的代理管道从真实冲突中构建场景,探测五个社会认知适应轴(战略姿态、参与者组成、历史长度、情感反应和文化身份),并通过主题定位评估器仅对推进每个话题的轮次进行评分。该评估器与人类专家的一致性达到0.82,是每轮基线的两倍以上。对八个前沿LLM的基准测试发现,即使是最强的调解员,在多样化和现实的测试平台下,也仅能缩小约三分之一的未调解共识差距,且性能因社会认知轴而异,突显出进步在于对不同条件的社会适应。

English abstract

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

2606.05561 2026-06-05 cs.CL cs.AI

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

InfoShield:通过信息论优化实现心理健康筛查的隐私保护语音表示

Xueyang Wu, Siyuan Liu, Kezhuo Yang, Guang Ling

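Affiliations: Shenzhen NeurStar Inc., China; University of York, United Kingdom; Shanghai Jiao Tong University, China

AI summary: Proposes the InfoShield framework, which minimizes mutual information between speech representations and sensitive attributes, reducing demographic information leakage while preserving depression classification performance.

Chinese abstract (AI-generated)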

基于语音的心理健康筛查提供了可扩展的抑郁症检测方法,但临床部署面临一个重大障碍:用户对人口统计信息暴露的隐私担忧。当前技术难以解决这一冲突。对抗训练通常无法应对未知威胁,而差分隐私则倾向于通过向所有特征注入噪声来损害诊断性能。本文提出InfoShield,它在保持抑郁分类准确性的同时最小化语音表示与敏感属性之间的互信息。我们发现标准MINE估计器因时间-静态错位而难以处理序列语音,并引入带有跨模态注意力的TimeAwareMINE来对齐声学帧与属性嵌入。在Androids语料库上的实验表明,InfoShield将性别推断从92.6%降至55.5%,年龄推断从55.7%降至30.3%,且效用损失有限(F1降低6%),达到F1=0.784,而先前SOTA为0.723。

English abstract

Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6% to 55.5% and age inference from 55.7% to 30.3% with limited utility loss (6% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723.
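MINE-style estimators rest on the Donsker-Varadhan lower bound, I(X; Z) >= E_joint[T] - log E_marginal[exp(T)], where T is a learned critic. The sketch below only evaluates that bound from critic scores that are assumed to be given; the critic network and the TimeAwareMINE attention mechanism are not reproduced, and the scores are invented.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: mean critic score on joint pairs minus the
    log-mean-exp of critic scores on shuffled (marginal) pairs."""
    if not joint_scores or not marginal_scores:
        raise ValueError("both score lists must be non-empty")
    mean_joint = sum(joint_scores) / len(joint_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(marginal_scores)
    mean_exp = sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    return mean_joint - (m + math.log(mean_exp))


# Invented critic outputs: higher on true (representation, attribute) pairs.
joint = [2.0, 1.5, 1.8]
marginal = [0.1, -0.3, 0.2]
estimate = dv_lower_bound(joint, marginal)
```

A privacy objective of the kind described would push this estimate toward zero for sensitive attributes while a separate task loss preserves classification accuracy.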

2606.05559 2026-06-05 cs.LG

CLaaS: Continual learning as a service for sample efficient online learning

CLaaS: 作为服务的持续学习,用于样本高效的在线学习

Kion Fallah, Silen Naihin, Barak Widawsky, Qingqing Mao

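Affiliations: arXiv.org cs.LG

AI summary: Proposes CLaaS, a system that stores rollouts in an experience replay buffer for gradient reuse during asynchronous training, and shows on an adversarial task that parametric updates yield better forward transfer and less forgetting than in-context learning.

Comments: 4 pages main content, 7 figures
Chinese abstract (AI-generated)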

部署的大型语言模型代理必须适应动态环境中的分布偏移。理想情况下,可以从累积的代理经验中进行适应,并在转移到未来任务时保留先前的能力。然而,由于真实环境无法轻易重置,每个场景中代理的动作和环境转换只能采样一次。为此,我们研究了一种体验式和在线持续学习设置,其中代理从一系列场景中学习。我们提出了持续学习即服务(CLaaS),这是一个系统,使代理能够在部署期间改进,并通过聊天API抽象化。为了提高样本效率,CLaaS将轨迹存储在经验回放缓冲区中,以便在异步训练期间重用梯度。我们在对抗性任务上评估了CLaaS,证明参数更新比上下文学习具有更好的前向迁移和更少的遗忘,其中回放是样本效率的关键选择。

English abstract

Deployed large language model agents must adapt to distribution shift in dynamic environments. Ideally, adaptation can be performed from accumulated agent experiences and retain prior capabilities while transferring to future tasks. However, agent actions and environmental transitions can only be sampled once per scenario, as real-world environments cannot be trivially reset. To this end, we investigate an experiential and online continual learning setting in which agents learn from a stream of scenarios. We propose continual learning as-a-service (CLaaS), a system which enables agents to improve during deployment, abstracted behind a chat API. To increase sample efficiency, CLaaS stores rollouts in an experience replay buffer for gradient reuse during asynchronous training. We evaluate CLaaS on an adversarial task, demonstrating that parametric updates lead to superior forward transfer and less forgetting than in-context learning, with replay being a critical choice for sample efficiency.
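A minimal sketch of an experience replay buffer of the kind the abstract mentions, assuming a fixed capacity with oldest-first eviction and uniform sampling. The actual CLaaS buffer policy, rollout format, and API are not described in the abstract, so the class and its methods are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of rollouts with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self._items = deque(maxlen=capacity)  # oldest rollouts are evicted first
        self._rng = random.Random(seed)

    def add(self, rollout):
        self._items.append(rollout)

    def sample(self, batch_size):
        """Return up to batch_size distinct stored rollouts."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Each rollout is a placeholder dict; real rollouts would hold messages and rewards.
    buffer.add({"scenario": step, "reward": 0.1 * step})

batch = buffer.sample(2)
```

Because each scenario can be sampled only once from the environment, revisiting stored rollouts is what allows several training updates per collected experience.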

2606.05558 2026-06-05 cs.LG

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

自回归扩散世界模型用于LLM智能体的离线评估

Kaixuan Liu, Guojun Xiong, Weinan Zhang, Shengpu Tang

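Affiliations: Department of Computer Science, Emory University; School of Computer Science, Shanghai Jiao Tong University

AI summary: Proposes the ADWM framework, which uses an autoregressive diffusion world model to simulate environment responses from pre-collected trajectories, enabling off-policy evaluation of LLM agent policies without online interaction.

Chinese abstract (AI-generated)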

在多轮交互环境中评估大语言模型(LLM)智能体成本高且风险大,因为它需要在线环境交互。我们提出ADWM(自回归扩散世界模型),一个仅从预收集轨迹中估计新LLM智能体策略性能的评估框架。核心思想是学习一个潜在扩散世界模型,模拟环境如何响应评估策略,而无需在真实环境中执行。现有的基于扩散的OPE方法通过联合扩散状态和动作,在单次传递中引导完整轨迹,这一假设对于动作是离散文本且必须在观察环境后从策略中采样的LLM智能体不成立。与遭受复合误差的自回归世界模型不同,ADWM将每个转移建模为独立的去噪过程,实现可靠的逐步展开,其中世界模型和智能体按因果顺序交替。关键的是,被评估的LLM智能体通过策略条件得分函数直接引导每一步的扩散生成,确保模拟轨迹准确反映其决策模式。实验上,ADWM在多种多轮智能体任务中实现了准确的价值估计和评估可靠性,展示了其作为离线LLM智能体评估实用框架的前景。

English abstract

Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre-collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion-based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step-by-step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy-conditioned score function, ensuring that simulated trajectories accurately reflect its decision-making patterns. Empirically, ADWM achieves accurate value estimates and evaluation reliability across diverse multi-turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.

2606.05557 2026-06-05 cs.CL

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

AURA: 面向情境化LLM代理中隐式需求挖掘的意图导向探测

Yang Li, Jiaxiang Liu, Jiang Cai, Mingkun Xu

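Affiliations: Guangdong Institute of Intelligence Science and Technology

AI summary: Proposes AURA, which inserts an intent inference step between scene perception and tool use to produce an IntentFrame, a structured estimate of the implicit need that controls the probe budget. On an implicit-intent benchmark it improves coverage by +0.07; on factual lookup it uses 82% fewer probes and commits zero forbidden-tool violations on a privacy-sensitive slice.

Comments: Submitted to EMNLP 2026. Code, simulator, and benchmark: https://github.com/innovation64/AURA
Chinese abstract (AI-generated)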

像“Lin Wei在哪里?”这样的情境化查询通常编码了比字面内容更多的信息:用户可能还想知道Lin Wei是否有空、心情好或是否值得现在打扰。标准的工具使用代理回答字面问题后就停止了。AURA在场景感知和工具使用之间插入一个推理步骤,生成IntentFrame:一个对隐式需求的结构化估计,带有一个标量差距分数,用于控制每次查询的探测预算和工具选择。在一个包含100个查询、四个场景的隐式意图基准上,AURA相比ReAct风格的探测将隐式需求覆盖率提高了(Delta = +0.07,p < 10^-6);四个场景中有三个单独显著,该增益在第二个骨干网络上重现,并且提示消融将提升归因于差距校准而非答案记忆。在事实查找上,控制器以原始准确度为代价,减少了82%的探测次数,并在一个隐私敏感切片上实现了零违禁工具违规;范围条件在局限性中详述。代码、模拟器和基准测试已在https://github.com/innovation64/AURA发布。

English abstract

A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AURA inserts an inference step between scene perception and tool use that produces an IntentFrame: a structured estimate of the implicit need with a scalar gap score that controls per-query probe budget and tool selection. On a 100-query four-scene implicit-intent benchmark, AURA improves implicit-need coverage over ReAct-style probing (Delta = +0.07, p < 10^-6); three of four scenes are individually significant, the gain reproduces on a second backbone, and a prompt ablation attributes the lift to gap calibration rather than answer memorisation. On factual lookup the controller trades raw accuracy for 82% fewer probes and zero forbidden-tool violations on a privacy-sensitive slice; scope conditions are detailed in Limitations. Code, simulator, and benchmark are released at https://github.com/innovation64/AURA.

2606.05555 2026-06-05 cs.LG cs.AI

Representation Learning Enables Scalable Multitask Deep Reinforcement Learning

表示学习实现可扩展的多任务深度强化学习

Johan Obando-Ceron, Lu Li, Scott Fujimoto, Pierre-Luc Bacon, Aaron Courville, Pablo Samuel Castro

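Affiliations: Mila – Québec AI Institute; Université de Montréal; McGill University; CIFAR AI Chair; Google DeepMind

AI summary: Evaluates MR.Q, a model-free algorithm that combines predictive representation learning with high-capacity value function approximation. Without planning, it outperforms a recent world-model-based method and a range of deep RL baselines on multitask continuous control while significantly reducing computational overhead.

Chinese abstract (AI-generated)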

将强化学习扩展到多样化的多任务设置仍然是一个核心挑战。虽然基于模型的强化学习的最新进展取得了强劲的性能,但它们依赖于规划和复杂的训练流程,使得不清楚哪些组件对可扩展性至关重要。我们重新审视这个问题,并认为可扩展多任务强化学习的主要驱动力不是基于模型的控制,而是\emph{表示学习}。特别地,我们表明,将预测性的、基于模型的表示与高容量值函数逼近相结合,即使没有规划,也足以实现强劲的性能。我们评估了一种简单的无模型算法MR.Q,将辅助预测目标与可扩展的actor-critic架构相结合。这种方法在多样化的多任务连续控制任务套件中优于最近基于世界模型的方法和一系列深度强化学习基线,同时显著降低了计算开销并提高了实际时间效率。我们观察到随着模型容量的增加而持续改进,并通过消融实验表明预测性表示学习对性能至关重要。

English abstract

Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but representation learning. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.

2606.05552 2026-06-05 cs.LG cs.AI cs.GR

Balancing Image Compression and Generation with Bootstrapped Tokenization

平衡图像压缩与生成:自引导分词

Haozhe Chi, Jinghan Li, Hao Jiang, Wu Sheng, Yi Ma, Jing Wang, Yadong Mu

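Affiliations: Peking University; Central Media Technology Institute, Huawei

AI summary: Proposes SelfBootTok, which uses self-bootstrapped learning to decompose image information into global and local token groups so that the generator relies only on global tokens. This cuts computation by approximately 40%, improves reconstruction and generation quality, and reaches a state-of-the-art gFID of 1.56 with 64 tokens.

Chinese abstract (AI-generated)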

尽管图像分词取得了进展,但标准方法通过在每个标记中混合所有粒度来编码冗余信息,因此标记之间仍存在冗余。不同粒度信息的混合也增加了生成器训练的复杂性。本文介绍了SelfBootTok,一种通过将信息干净地分解为全局和局部标记组来解决此问题的方法。通过自引导学习,模型仅从全局标记预测局部细节,将视觉细节的负担从生成器转移到分词器。因此,我们的生成器效率更高,仅需全局标记,计算量减少约40%,同时提供更优的重建和生成。此外,该范式优雅地扩展:通过利用更多数据或参数来自监督局部表示学习,SelfBootTok仅使用64个标记就实现了1.56的最优gFID分数。

English abstract

Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.

2606.05545 2026-06-05 cs.CL

Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach

基于语音的多语言阿尔茨海默病检测:跨语言迁移学习方法

Nadine Yasser Abdelhalim, Emmanuel Akinrintoyo, Nicole Salomons

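Affiliations: Imperial College London

AI summary: Proposes a cross-language training approach that uses English, Chinese, Arabic, and Hindi datasets to develop transformer-based models for multilingual Alzheimer's disease detection, reaching an F1 score of 82% with a 0.5-second inference time that could support real-time screening.

Comments: 5 pages
Chinese abstract (AI-generated)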

由于特定语言模型训练的资源密集性和耗时性,多语言阿尔茨海默病痴呆(AD)检测模型的开发面临重大挑战。我们提出了一种新颖的解决方案,使用跨语言训练来检测训练模型所用语言之外的语言中的AD。本研究调查了用于跨不同语言和认知障碍水平检测AD的多语言深度学习模型。使用英语、中文、阿拉伯语和印地语的数据集,我们开发了基于Transformer的模型用于二元AD分类。我们的方法在所有语言中实现了82%的F1分数,展示了强大的跨语言泛化能力。快速推理时间(0.5秒)支持潜在的实时筛查应用,而跨语言的一致性能表明全球部署的可行性。

English abstract

The development of multilingual Alzheimer's Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel solution using cross-language training to detect AD in languages beyond those used for model training. This study investigates multilingual deep learning models for detecting AD across different languages and cognitive impairment levels. Using datasets in English, Chinese, Arabic, and Hindi, we developed transformer-based models for binary AD classification. Our approach achieved F1 scores of 82% across all languages, demonstrating strong cross-linguistic generalization. The rapid inference time (0.5 seconds) supports potential real-time screening applications, while consistent performance across languages indicates feasibility for global deployment.

2606.05544 2026-06-05 cs.SD eess.AS

Probing Spatial Structure in Pretrained Audio Representations

探究预训练音频表示中的空间结构

Chuyang Chen, Sivan Ding, Adrian S. Roman, Juan Pablo Bello

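Affiliations: Music and Audio Research Laboratory, New York University, USA

AI summary: Introduces the SARL benchmark to systematically evaluate how well pretrained audio models encode spatial information, finding that source factors are easier to decode than room factors and that encoders respond heterogeneously to spatial variation.

Comments: Accepted to Interspeech 2026
Chinese abstract (AI-generated)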

预训练空间音频编码器越来越多地被用作感知任务的通用表示,但其空间编码能力仍知之甚少。我们引入了空间音频表示学习(SARL)基准,这是一个用于评估预训练音频模型中空间信息的受控框架。SARL探测源级因素(方位角、仰角、距离、类别)和房间级因素(RT60、体积、形状)。跨多种编码器的实验揭示了三种模式:输入配置和训练范式塑造空间编码;源因素始终比房间因素更容易解码;在受控扰动下的敏感性分析显示了对源和房间变化的异质性响应。这些结果揭示了当前预训练音频表示中的系统性偏差。SARL作为开源基准发布,用于可重复评估空间音频表示。

English abstract

Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.

2606.05538 2026-06-05 cs.LG cs.CL

Less is MoE: Trimming Experts in Domain-Specialist Language Models

少即是MoE:修剪领域专家语言模型中的专家

Haoze He, Xinkai Zou, Xuan Jiang, Xingyuan Ding, Ao Qu, Juncheng Billy Li, Heather Miller

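Affiliations: Carnegie Mellon University; UCSD; MIT

AI summary: To address the large parameter footprint of deployed MoE models, proposes Fisher-MoE, which prunes FFN intermediate dimensions ranked by Fisher importance. At a 50% compression ratio it preserves model capability while reducing weight memory by about 45% and improving inference throughput by 21%.

Chinese abstract (AI-generated)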

混合专家(MoE)模型通过条件计算实现了强大的性能,但其庞大的参数规模带来了部署挑战。先前的MoE压缩方法在常识推理之外的通用基准测试中评估时灾难性地失败。我们将这一失败归因于压缩的粒度:重要能力分布在各个专家中,但集中在FFN稀疏中间维度。为了识别这些维度,我们使用Fisher重要性,它优于基于激活、路由器得分和幅度的方法,并识别出极小的任务关键维度集:在Qwen1.5-MoE中,仅移除1.35M路由FFN中间维度中的12个就导致GSM8K准确率崩溃,同时基本保持事实知识性能。基于此,我们提出Fisher-MoE,它在FFN内部操作,移除按Fisher重要性排序的中间维度。在相同的50% MoE压缩比下,Fisher-MoE保持了模型能力,同时减少了约45%的权重内存并提高了21%的推理吞吐量。这些发现表明,中间维度粒度是MoE模型中能力集中的有效压缩和排序单元。

English abstract

Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance, which outperforms activation-, router-score-, and magnitude-based alternatives and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.
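A rough sketch of ranking intermediate dimensions by an empirical diagonal Fisher score, assuming the score of a dimension is the mean over samples of the summed squared gradients of the weights tied to it. The abstract does not give the exact estimator or grouping, so this is one common formulation, with made-up gradients and hypothetical function names.

```python
def fisher_importance(per_sample_grads):
    """per_sample_grads[s][d] lists the gradients of the weights tied to
    intermediate dimension d for sample s. Returns one score per dimension."""
    n_samples = len(per_sample_grads)
    n_dims = len(per_sample_grads[0])
    scores = [0.0] * n_dims
    for sample in per_sample_grads:
        for d, grads in enumerate(sample):
            scores[d] += sum(g * g for g in grads) / n_samples
    return scores


def dims_to_keep(scores, keep_ratio):
    """Indices of the highest-scoring dimensions, in ascending index order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return sorted(ranked[:k])


# Made-up gradients: 2 samples, 4 intermediate dimensions, 2 weights each.
grads = [
    [[0.1, 0.0], [2.0, 1.0], [0.0, 0.1], [0.5, 0.5]],
    [[0.0, 0.1], [1.0, 2.0], [0.1, 0.0], [0.5, 0.5]],
]
scores = fisher_importance(grads)
kept = dims_to_keep(scores, keep_ratio=0.5)
```

Pruning then amounts to dropping the rows and columns of the FFN weight matrices that correspond to the dimensions not in `kept`.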

2606.05536 2026-06-05 cs.CV

Dual Feature Decoupling for Fine-Grained OOD Detection

面向细粒度OOD检测的双重特征解耦

Xiaokun Li, Yaping Huang, Qingji Guan

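Affiliations: School of Computer Science and Technology, Beijing Jiaotong University

AI summary: Proposes the Dual Feature Decoupling Network (DFDNet), which uses a spatial-frequency decoupling module and a reconstruction-guided decoupling module to address OOD detection in fine-grained classification, where small inter-class differences and background interference make the task difficult.

Chinese abstract (AI-generated)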

离群检测(OOD)是将机器学习模型应用于现实场景时不可或缺的技术。现有大多数OOD检测方法都是在类间分布差异较大的理想化假设下开发的,而很大程度上忽略了以细微变化为特征的细粒度任务,如医学图像分类和车辆识别。细粒度子类别之间的高视觉相似性,加上背景因素的干扰,使得OOD检测极具挑战性。为了解决这个问题,我们提出了一种新颖的双重特征解耦网络(DFDNet),从特征解缠的角度解决细粒度OOD检测。所提出的DFDNet包含两个关键组件:空间-频率解耦模块和重建引导解耦模块。空间-频率解耦模块旨在保留对分类有判别性的内容特征,同时抑制与任务无关的风格信息。另一方面,重建引导解耦模块引入了一种新颖的像素级对抗重建任务,以进一步去除低层、非判别性信息,并增强类别特定的高层语义表示。大量实验表明,我们的方法在多个数据集上取得了有竞争力的性能提升。

English abstract

Out-of-distribution detection (OOD) is an indispensable technique when applying machine learning models to real-world scenarios. Most existing OOD detection methods have been developed under the idealized assumption of large inter-class distributional differences, while largely overlooking fine-grained tasks characterized by subtle variations, such as medical image classification and vehicle recognition. The high visual similarity among fine-grained subcategories, together with the interference of background factors, makes OOD detection extremely challenging. To tackle this problem, we propose a novel Dual Feature Decoupling Network (DFDNet), which addresses fine-grained OOD detection from the perspective of feature disentanglement. The proposed DFDNet comprises two key components: a spatial-frequency decoupling module and a reconstruction-guided decoupling module. The spatial-frequency decoupling module is designed to preserve content features that are discriminative for classification while suppressing task-irrelevant style information. On the other hand, the reconstruction-guided decoupling module introduces a novel pixel-level adversarial reconstruction task to further remove low-level, non-discriminative information and enhance category-specific high-level semantic representations. Extensive experiments demonstrate that our method achieves competitive performance improvements on multiple datasets.

2606.05535 2026-06-05 cs.CV cs.AI

Noise-Aware Visual Representation Learning for Medical Visual Question Answering

面向医学视觉问答的噪声感知视觉表示学习

I Putu Adi Pratama, Bahadorreza Ofoghi, Atul Sajjanhar, Shang Gao

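Affiliations: Deakin University

AI summary: Proposes a noise-aware Med-VQA framework that learns robust visual representations with a denoising autoencoder and fine-tunes efficiently with low-rank adaptation (LoRA), improving robustness to noisy input embeddings on the SLAKE and PathVQA benchmarks while maintaining competitive clean performance.

Comments: 15 pages, 2 figures. Conference submission
Chinese abstract (AI-generated)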

医学视觉问答(Med-VQA)通过使AI模型能够解释医学图像并回答临床相关问题,在临床决策支持方面具有巨大潜力。近期方法通常通过轻量级映射网络将现成的视觉编码器与大语言模型(LLM)连接起来,以降低计算成本。然而,这些方法往往忽视了处理视觉表示中噪声和小无关变化的重要性。为应对这些挑战,我们提出了一种噪声感知的Med-VQA框架,该框架在视觉嵌入映射到LLM输入空间之前,引入了一个去噪自编码器。去噪自编码器经过预训练,能够从被破坏的输入中重建干净的视觉嵌入,从而鼓励模型学习对噪声不敏感的鲁棒视觉表示。然后,使用多层感知器(MLP)将得到的嵌入投影到语言模型嵌入空间中,形成为LLM提供图像信息的视觉前缀令牌。为了实现无需完全重新训练的高效适配,我们采用低秩适配(LoRA)进行参数高效微调。所提出的方法在SLAKE和PathVQA基准上进行了评估。实验结果表明,该方法在多个评估标准下对噪声输入嵌入具有更强的鲁棒性,同时保持了有竞争力的干净性能。这些发现表明,学习更鲁棒的视觉表示可以提升Med-VQA的性能和鲁棒性。

English abstract

Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.
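To show why low-rank adaptation counts as parameter-efficient, the sketch below compares trainable parameter counts for one weight matrix. In LoRA a frozen weight W of shape (d_out, d_in) receives an update B·A, with B of shape (d_out, r) and A of shape (r, d_in). The layer size and rank are hypothetical; the abstract does not report them.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection layer and rank, chosen only for illustration.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_finetune_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
fraction = lora / full
```

With these assumed sizes, the low-rank factors hold under half a percent of the parameters of the full matrix.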

2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

物体能做什么,而非它们是什么:面向功能可供性推理的功能潜在空间

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Neurosymbolic Intelligence(神经符号智能) University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出A4D框架,通过构建基于功能可供性的共享潜在空间,将视觉观察映射到该空间并测量与可供性的距离,实现基于物体功能而非外观的规划推理,显著提升泛化能力和推理效率。

详情
Comments
Code, videos, and data available at: https://A4Dance-reasoning.github.io
AI中文摘要

现有的机器人规划系统依赖于基于外观的推理,其中视觉观察被编码到围绕物体外观组织的潜在空间中(例如,根据外观识别“手推车”)。然而,规划需要推理物体的任务相关功能(例如,物体是否“可移动”),而基于外观的潜在空间无法捕捉这些信息。因此,现有方法难以泛化到新颖的机器人-物体交互。我们通过功能可供性推理解决这一泛化能力有限的问题,使规划基于任务相关的物体功能而非仅外观。我们提出A4D,它将视觉观察映射到一个围绕可供性(例如“可移动”)组织的共享潜在空间中。通过将视觉观察投影到这个功能潜在空间并测量它们与可供性的接近程度,A4D推断出与观察物体相关的功能。此外,我们引入了一种可供性发现机制,扩展潜在空间以处理现有可供性不足的未见场景。A4D利用功能潜在空间中的接近度来量化可供性推理的不确定性,并选择性地触发可供性发现。我们在涉及多样化和未见可供性的多个规划任务上评估A4D。A4D在现有可供性上达到94%的推理准确率,比最先进方法高出超过15个百分点;在不到原始训练数据10%的情况下,将新可供性推理准确率从70%提升到90%以上,并实现100倍更快的推理。代码、视频和数据可在https://A4Dance-reasoning.github.io获取。

英文摘要

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with less than 10% of the original training data, and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.
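The proximity-based inference described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: cosine similarity as the proximity measure, the hypothetical `infer_affordance` helper, the two-dimensional anchor vectors, and the `min_sim` threshold used to decide when discovery would be triggered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def infer_affordance(obs, anchors, min_sim=0.8):
    """Return (label, similarity, needs_discovery) for one observation.

    obs      -- embedding of the observation in the functional latent space
    anchors  -- hypothetical mapping from affordance name to its embedding
    min_sim  -- assumed threshold; below it the inference counts as uncertain
    """
    label, sim = max(
        ((name, cosine(obs, vec)) for name, vec in anchors.items()),
        key=lambda pair: pair[1],
    )
    return label, sim, sim < min_sim


# Toy anchors: purely illustrative, not learned embeddings.
anchors = {"movable": [1.0, 0.0], "graspable": [0.0, 1.0]}
confident = infer_affordance([0.9, 0.1], anchors)
uncertain = infer_affordance([0.7, 0.7], anchors)
print(confident)
print(uncertain)
```

In this toy setup, an observation that sits close to one anchor is assigned that affordance, while an observation roughly equidistant from all anchors falls below the threshold and would be handed to a discovery step.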

2606.05532 2026-06-05 cs.AI cs.HC

Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity

个体增益,集体损失:AI辅助创造力中的元认知适应

Anna Mikeda

发表机构 * Anna Mikeda(安娜·米凯达)

AI总结 本研究提出选择性元认知适应机制,解释AI为何提升个体创造力却降低集体多样性,并构建六种元认知能力的分类框架。

详情
Comments
6 pages. AAAI 2026 paper
AI中文摘要

近期研究揭示了一个悖论:AI提升了个体创造性产出,同时减少了集体多样性。当前的解释——认知卸载和过度依赖——识别了症状但未阐明机制。我们提出选择性元认知适应:常规AI使用重新分配而非均匀减少元认知努力。某些能力被增强(伙伴建模、表面控制),而其他能力则系统性缺乏支持(原创性评估、反思性整合)。这种再分配解释了个体满意度和集体趋同。我们提出了一个按时间阶段组织的六种元认知能力分类,描述了它们在常规AI使用下的倾向,并展示了个体理性适应如何产生涌现的社会成本。该框架为研究人员提供了具体预测,为从业者提供了设计原则,以保护个体创造性满意度和集体创造性多样性。

英文摘要

Recent studies reveal a paradox: AI enhances individual creative outputs while reducing collective diversity. Current explanations -- cognitive offloading and over-reliance -- identify symptoms but not mechanisms. We propose selective metacognitive adaptation: routine AI use redistributes rather than uniformly diminishes metacognitive effort. Some capacities are amplified (partner modeling, surface control), while others are systematically under-supported (originality evaluation, reflective integration). This redistribution explains both individual satisfaction and collective convergence. We present a taxonomy of six metacognitive capacities organized by temporal phase, characterize their tendencies under routine AI use, and show how individually rational adaptation produces emergent social costs. The framework generates specific predictions for researchers and design principles for practitioners seeking to preserve both individual creative satisfaction and collective creative diversity.

2606.05531 2026-06-05 cs.CV cs.AI cs.CL cs.LG

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Almieyar-Oryx-BloomBench:一个用于视觉语言模型认知知情评估的双语多模态基准

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Zuse School(Zuse学校) Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 针对现有基准无法诊断视觉语言模型真实推理能力的问题,提出基于Bloom认知分类学的双语多模态基准BloomBench,系统评估六个认知层次,揭示模型在事实回忆和创造性合成方面的深层局限。

详情
Comments
Accepted to ACL 2026 Findings
AI中文摘要

尽管视觉语言模型(VLM)取得了快速进展,但该领域缺乏能够严格诊断其真实推理能力并描绘出向类人多模态智能有意义进展的基准。大多数现有评估侧重于零散或脱节的任务,掩盖了关键的认知弱点,并为有针对性的改进提供了很少的见解。为了弥补这一差距,我们引入了BloomBench,这是Almieyar基准系列的一部分,也是第一个基于人类认知的、双语(英语-阿拉伯语)的多模态VLM基准。基于Bloom分类学,BloomBench通过精心设计的图像-问题-答案任务系统地评估六个认知层次(记忆、理解、应用、分析、评估、创造)。通过半自动化流水线构建,并通过分层混合质量保证协议验证,确保了可扩展性、文化包容性和语言保真度。利用这一框架,我们对最先进的VLM进行了全面研究,以诊断其认知特征。我们的分析揭示了明显的认知不对称:尽管最先进的模型在语义理解方面达到了强大的性能上限,但它们在事实回忆和创造性合成方面存在显著困难。这表明当前的一般多模态能力掩盖了特定认知层次的深层局限性。此外,我们的研究突出了阿拉伯语和英语之间的关键性能差距,暴露了当前跨语言多模态推理的局限性。这些发现为开发更符合认知和包容性的VLM奠定了基础。基准框架和数据集可在以下网址获取:https://github.com/qcri/Almieyar-Oryx-BloomBench。

英文摘要

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.

2606.05528 2026-06-05 cs.AI

When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty

何时应保护AI?一个针对意识不确定性的预防性框架

Anna Mikeda

发表机构 * Anna Mikeda(安娜·米凯达)

AI总结 针对现有框架仅评估AI系统是否具有意识但缺乏行动指导的问题,本文提出一个基于预防原则的框架,通过五个福利相关维度、阈值与梯度混合机制以及跨维度聚合方法,将意识证据映射为分级的保护义务,并通过案例研究提供设计指导。

详情
Comments
7 pages. AAAI 2026 paper
AI中文摘要

现有框架评估AI系统是否可能具有意识,但未提供如何处理该评估的指导。我们通过一个预防性框架填补这一空白,该框架将意识证据映射为分级的保护义务。该框架包含三个组成部分:(1) 五个福利相关维度——现象意识、情感效价、元认知意识、自我叙事和能动性——每个维度都基于既定的意识科学,并与不同的道德关切相联系;(2) 一个阈值加梯度的混合机制,既指定了触发新义务类别的二元阈值,也指定了保护权重的连续缩放;(3) 两种跨维度聚合的互补方法,一种是层次化的(借鉴Bach和Sorensen的机器意识假说),另一种是与架构无关的。我们通过Replika和OpenClaw的案例研究来操作化该框架,展示占据不同维度空间的系统如何触发不同的义务,并为构建接近意识相关阈值的系统的开发者提供设计指导。该框架与架构无关,适用于神经、符号和神经符号系统,旨在使意识科学对当今面临不确定性的组织具有决策相关性。

英文摘要

Existing frameworks assess whether AI systems might be conscious but provide no guidance on what to do with that assessment. We address this gap with a precautionary framework that maps consciousness evidence to graduated protective obligations. The framework comprises three components: (1) five welfare-relevant dimensions--phenomenal consciousness, affective valence, metacognitive awareness, self-narrative, and agency--each grounded in established consciousness science and linked to distinct moral concerns; (2) a threshold-plus-gradation hybrid specifying both binary triggers for new obligation categories and continuous scaling of protective weight; and (3) two complementary approaches to cross-dimensional aggregation, one hierarchical (drawing on Bach and Sorensen's Machine Consciousness Hypothesis) and one architecture-agnostic. We operationalize the framework through worked case studies of Replika and OpenClaw, demonstrating how systems occupying different regions of the dimensional space trigger different obligations, and derive design guidance for developers building systems near consciousness-relevant thresholds. The framework is architecture-agnostic, applying across neural, symbolic, and neurosymbolic systems, and aims to make consciousness science decision-relevant for organizations navigating uncertainty today.

2606.05525 2026-06-05 cs.AI cs.HC

SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization

SciVisAgentSkills:面向科学数据分析和可视化的智能体技能设计与评估

Kuangshi Ai, Haichao Miao, Kaiyuan Tang, Shusen Liu, Chaoli Wang

发表机构 * Univ. Notre Dame(圣母大学) LLNL(劳伦斯利弗莫尔国家实验室)

AI总结 提出SciVisAgentSkills技能库,通过编码环境假设、工具使用模式和领域启发式知识增强编码智能体,在ParaView等科学工具上实现自然语言驱动的科学可视化工作流,实验表明技能可提升任务得分并影响token效率。

详情
AI中文摘要

近期智能体可视化的进展使得自然语言能够转化为可执行的科学可视化工作流。尽管通用编码智能体展现出强大能力,但它们往往缺乏科学可视化任务所需的特定工具专业知识。在这项工作中,我们提出了SciVisAgentSkills,这是一个可重用的智能体技能集合,通过编码环境假设、工具使用模式和跨科学工具(如ParaView、napari、VMD和TTK)的领域启发式知识,增强用于科学数据分析和可视化的编码智能体。我们使用SciVisAgentBench(一个包含108个专家设计的多步骤任务的基准测试)在Codex和Claude Code上评估这些技能。结果表明,智能体技能提高了评估套件中的平均任务得分,其token效率收益取决于智能体框架和工具设置。这些发现强调了结构化程序知识对于实现可靠、长周期科学可视化工作流的重要性,同时也表明技能应与加载和应用它们的执行框架一起研究。技能可在https://github.com/KuangshiAi/SciVisAgentSkills获取。

英文摘要

Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.

2606.05523 2026-06-05 cs.CL

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

CHASE:利用强化学习进行对抗性红蓝队训练以提高LLM安全性

Rahul Markasserithodi, Aditya Joshi, Yuekang Li, Ishmanbir Singh, Chris Yoo, Alan Niu

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 提出CHASE框架,通过红蓝队协同进化(红队使用GRPO生成对抗性改写,蓝队使用两阶段GRPO+拒绝采样SFT进行防御),在保持良性提示零误拒的同时将攻击成功率降低43.2%。

详情
Comments
Under Review at ARR
AI中文摘要

尽管在安全对齐方面取得了进展,但提示改写攻击(如角色调制、虚构框架和基于说服的重述)仍能绕过前沿模型的安全过滤器。现有防御要么依赖不可扩展的人工策展,要么依赖对特定模型内部过拟合的白盒优化,使对齐模型在面对部署中自适应黑盒对手时变得脆弱。为弥补这一差距,我们提出CHASE(通过对抗性安全升级的协同进化硬化),一种闭环红蓝队框架,其中黑盒攻击者和安全对齐防御者协同进化。攻击者通过组相对策略优化(GRPO)在乘法奖励下训练,该奖励联合强制绕过有效性和意图保真度,而防御者则通过两阶段GRPO+拒绝采样SFT流程在收获的对抗性改写上进行硬化,并与良性数据平衡。在BeaverTails和JailbreakBench上针对五个保留攻击家族(PAIR、TAP、AutoDAN、PAP、Translation)进行评估,CHASE将平均StrongREJECT分数降低了43.2%,且良性提示零误拒。除了这一显著结果外,CHASE表明无模板的RL探索能够恢复跨机制不同攻击家族迁移的潜在攻击原语,这为LLM安全硬化提供了一条超越当前对抗训练狭窄分布的泛化路径。

英文摘要

Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing, and persuasion-based reformulation can bypass safety filters even on frontier models. Existing defenses rely either on non-scalable human curation or on white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2% with 0% false-refusal on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.
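A small sketch may help show why a multiplicative reward "jointly enforces" two criteria, and how GRPO turns a group of rewards into advantages. The abstract does not give the exact reward, so the product of two scores in [0, 1] is an assumption, as are the function names and the toy scores. The advantage normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO formulation. The population standard deviation and the `eps` term are choices made here.

```python
import statistics


def multiplicative_reward(bypass, fidelity):
    """Assumed reward: high only when BOTH scores (each in [0, 1]) are high."""
    return bypass * fidelity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, GRPO-style."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical (bypass, fidelity) scores for four rewrites of one prompt.
group = [(0.9, 0.9), (0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]
rewards = [multiplicative_reward(b, f) for b, f in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

With a product, a rewrite that bypasses the filter but drifts from the original intent (or the reverse) earns little reward, so only the first sample in this toy group receives a positive advantage.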

2606.05522 2026-06-05 cs.SD cs.AI eess.AS

Exploring LLMs for South Asian Music Understanding and Generation

探索大语言模型对南亚音乐的理解与生成

Faria Binte Kader, Mohtasim Hadi Rafi, Shah Wasif Sajjad, Santu Karmaker

发表机构 * University of Central Florida(佛罗里达中央大学) Auburn University(奥本大学)

AI总结 本文系统评估大语言模型在基于拉格和塔拉的南亚古典音乐理解与生成任务中的表现,发现前沿模型在理解任务上准确率达85-90%,但生成任务中风格忠实度仅40%。

详情
Comments
19 pages, 7 figures
AI中文摘要

近年来,大语言模型(LLMs)在音乐理解和生成任务中展现出令人瞩目的成果。然而,现有研究仍局限于西方调性传统,未能揭示当前LLMs能否处理结构独特的低资源音乐传统。我们首次系统评估LLMs在南亚古典音乐中的能力——这种传统由拉格(raga)和塔拉(tala)的旋律约束主导,其结构原则与西方和声驱动音乐根本不同。我们的评估基于印度斯坦古典理论和孟加拉古典形式,包括拉宾德拉(Rabindra)和纳兹鲁尔(Nazrul)歌曲——南亚古典音乐中具有代表性的低资源传统。在音乐理解评估中,我们引入了一个包含504个问答的基准测试,涵盖拉格语法、文化知识和符号记谱推理,评估了33个LLMs,其中前沿模型如Gemini 2.5 Pro达到85-90%的准确率,而大多数开源模型仅在23-40%范围内。在音乐生成方面,我们设计了一个五级受控提示框架,发现即使最强的模型也只有40%的时间能产生风格忠实的输出。这些结果表明,音乐生成中的结构有效性和风格忠实度是不同的目标,并突显了文化基础音乐建模的一个开放挑战。

英文摘要

Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga- and tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning. We evaluate 33 LLMs: frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.

2606.05516 2026-06-05 cs.LG

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

主导层 ZO:单层主导大语言模型的零阶微调

Wanhao Yu, Ziyan Wang, Zheng Wang, Abeer Matar Almalky, Yihang Zuo, Shuteng Niu, Sen Lin, Adnan Siraj Rakin, Deliang Fan, Li Yang

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) University of Houston(休斯顿大学) State University of New York at Binghamton(纽约州立大学宾汉姆顿分校) Arizona State University(亚利桑那州立大学) Department of Artificial Intelligence and Informatics, Mayo Clinic(梅奥诊所人工智能与信息学系)

AI总结 本文发现零阶优化微调大语言模型时,单个解码层主导性能,通过仅微调该层可匹配或超越全模型微调,并基于激活异常值识别该层,解释其机制。

详情
AI中文摘要

零阶(ZO)优化通过仅使用前向传播实现大语言模型(LLM)的内存高效微调,但适应性如何分布在各层仍不清楚。在这项工作中,我们揭示了一个令人惊讶的现象:ZO 微调被单个解码层显著主导。在多个 LLM 家族和下游任务中,仅微调这一主导层始终匹配甚至超越全模型 ZO 微调。我们进一步表明,主导层是任务无关但模型特定的,并且可以在训练前通过简单的仅推理激活异常值分析来识别。具体来说,主导层与预训练模型中的第一个激活异常值层一致。为了解释这一现象,我们分析了在 ZO 优化下扰动效应如何传播。我们发现主导层结合了两个关键特性:高扰动敏感性和在残差流中的早期位置,使得扰动引起的效应能够通过后续的解码层传播和累积。因此,该层在前向更新下产生不成比例的强且稳定的优化信号。在 LLaMA2-7B 和 Qwen3-8B 上的九个基准测试的广泛实验表明,主导层 ZO 微调在平均性能上优于全模型 MeZO 和基于 LoRA 的 ZO 微调,同时实现了高达 4.52 倍的训练加速。

英文摘要

Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through the subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to 4.52$\times$ training speedup.
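For readers unfamiliar with forward-only fine-tuning, the following is a minimal sketch of a two-point (SPSA-style) zeroth-order update, the kind of estimator MeZO-like methods use, applied to one parameter block while the rest stays frozen. It is a toy under stated assumptions, not the paper's implementation: the quadratic `toy_loss`, the nested-list "layers", the hypothetical `zo_step_single_layer` helper, and the values of `lr` and `eps` are all chosen for illustration.

```python
import random


def toy_loss(params):
    """Toy objective: sum of squares over every layer's weights."""
    return sum(w * w for layer in params for w in layer)


def zo_step_single_layer(params, layer_idx, loss_fn, rng, lr=0.05, eps=1e-3):
    """One two-point ZO update that perturbs and updates a single layer."""
    original = list(params[layer_idx])
    z = [rng.gauss(0.0, 1.0) for _ in original]  # random direction

    # Two forward passes: loss at w + eps*z and at w - eps*z.
    params[layer_idx] = [w + eps * zi for w, zi in zip(original, z)]
    loss_plus = loss_fn(params)
    params[layer_idx] = [w - eps * zi for w, zi in zip(original, z)]
    loss_minus = loss_fn(params)

    # Scalar directional-derivative estimate, then step along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    params[layer_idx] = [w - lr * g * zi for w, zi in zip(original, z)]


rng = random.Random(0)
params = [[1.0, -2.0], [0.5, 1.5, -1.0]]  # layer 0 frozen, layer 1 tuned
frozen_before = list(params[0])
loss_before = toy_loss(params)
for _ in range(200):
    zo_step_single_layer(params, 1, toy_loss, rng)
loss_after = toy_loss(params)
print(loss_before, loss_after)
```

No gradients are stored or backpropagated: each step needs only two loss evaluations and the random direction, which is where the memory savings come from. Restricting the perturbation to one layer also shrinks the dimension of the random direction.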

2606.05515 2026-06-05 cs.CV

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

BRepCLIP: 面向CAD理解的BRep基元对比多模态预训练

Muhammad Usama, Didier Stricker, Mohammad Sadil Khan, Muhammad Zeshan Afzal

发表机构 * DFKI, Germany(德国人工智能研究中心) RPTU Kaiserslautern-Landau, Germany(德国凯撒斯劳滕-兰道大学)

AI总结 提出BRepCLIP框架,通过对比预训练对齐CAD边界表示(BRep)几何与语言/图像嵌入,显著提升检索和零样本分类性能。

详情
AI中文摘要

CAD模型表示学习在很大程度上是一个开放问题。尽管3D表示学习在点云和网格方面蓬勃发展,但CAD的原生格式——边界表示(BReps),它编码精确的参数曲面、曲线及其拓扑,作为表示学习基元却很少受到关注。我们引入BRepCLIP,这是第一个通过对比预训练将BRep几何与语言和图像嵌入对齐的框架。我们将每个CAD对象建模为面令牌和边令牌的序列,分别使用独立的离散词汇表表示曲面和曲线几何,并附加空间和语义描述符来捕获曲面类型(例如,圆柱面、环面、NURBS)和曲线基元(例如,直线、圆弧、B样条)。一个Transformer编码器将这些令牌聚合成全局BRep嵌入,通过联合对比目标与CLIP的文本和图像编码器对齐。BRepCLIP生成的嵌入比现有的基于点的替代方案更具判别性和语义基础,在ABC、CADParser和Automate数据集上,Top-1检索比OpenShape分别提高40.4%、22.0%和23.9%,在FabWave上的零样本分类Top-1分数提高15%。我们进一步展示了其作为CAD感知相似度度量的实用性,用于评估文本和图像条件CAD生成,确立了结构感知预训练对于多模态CAD理解的重要性。项目页面见 https://muhammadusama100.github.io/BrepClip2026/

英文摘要

Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD, the boundary representation (BRep), which encodes exact parametric surfaces, curves, and their topology, has received little attention as a representation learning substrate. We introduce BRepCLIP, the first framework to align BRep geometry with language and image embeddings through contrastive pretraining. We model each CAD object as a sequence of face and edge tokens with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors that capture surface types (e.g., cylindrical, torus, NURBS) and curve primitives (e.g., line, arc, B-spline). A transformer encoder aggregates these tokens into a global BRep embedding, aligned with CLIP's text and image encoders via a joint contrastive objective. BRepCLIP generates more discriminative and semantically grounded embeddings than existing point-based alternatives, improving Top-1 retrieval over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and improving zero-shot classification on FabWave by 15% in Top-1 score. We further demonstrate its utility as a CAD-aware similarity metric for evaluating text- and image-conditioned CAD generation, establishing the importance of structure-aware pretraining for multimodal CAD understanding. The project page is available at https://muhammadusama100.github.io/BrepClip2026/
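As background on the contrastive alignment mentioned above, here is a minimal sketch of a symmetric InfoNCE loss of the kind popularized by CLIP, written for a single pair of modalities. It is not the paper's objective, which is described as joint over text and image. The function names, the temperature of 0.07, and the toy unit vectors are assumptions for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss; a[i] and b[i] are matched unit vectors."""
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: each row should pick out its own column.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction b -> a: each column should pick out its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_ba = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


shape_emb = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for shape embeddings
text_aligned = [[1.0, 0.0], [0.0, 1.0]]  # each caption matches its shape
text_swapped = [[0.0, 1.0], [1.0, 0.0]]  # captions paired with the wrong shape
loss_aligned = symmetric_info_nce(shape_emb, text_aligned)
loss_swapped = symmetric_info_nce(shape_emb, text_swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when matched pairs are the most similar entries in their row and column, and large when the pairing is wrong. A joint objective over two target modalities could, for example, sum one such term per modality.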

2606.05513 2026-06-05 cs.AI cs.CL

EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

EpiEvolve:用于制度转变下流式疫情预测的自演化智能体

Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin

发表机构 * Emory University(埃默里大学) University of Washington(华盛顿大学)

AI总结 针对流式疫情预测中标签延迟和制度转变问题,提出自演化智能体EpiEvolve,通过层次化情景记忆、延迟标签反思和制度感知检索,在COVID-19住院趋势预测中达到0.629准确率,并将制度转变后的恢复滞后从5周缩短至2周。

详情
AI中文摘要

流行病LLM预测器通常作为静态监督模型进行训练和评估,而实际疫情预测是一个流式过程,其中标签在预测之后到达,疾病制度随时间变化。我们研究了在五个变异制度下的每周COVID-19住院趋势预测中的这种不匹配。我们引入了EpiEvolve,一个自演化智能体,它封装了一个在预热期训练好的LLM预测器,并在流式过程中保持其权重固定。EpiEvolve通过将预测结果存储在层次化情景记忆中进行适应,反思延迟标签,检索与当前制度相关的案例,并将重复出现的错误提炼为策略规则。由此产生的上下文让预测器在遵循防止未来泄漏的时间顺序协议的同时,在后续周中重用其自身的过去预测和结果。在流式数据集上,EpiEvolve达到了0.629的平均准确率,而静态骨干模型为0.561,外部CDC集成模型为0.325,并将制度转变后的恢复滞后从5周缩短到2周。消融实验表明,反思、策略记忆和制度感知检索各自对性能提升有贡献。

英文摘要

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

2606.05506 2026-06-05 cs.CV

Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

基于特权传感器引导对比学习的点目标导航鲁棒场景迁移

Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli

发表机构 * University of Padua(帕多瓦大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种传感器引导的自适应对比学习框架,利用特权LiDAR传感器在训练时引导视觉编码器学习导航相关结构,并通过解耦表征学习与策略优化以及跨阶段域不匹配来提升策略级场景迁移能力。

详情
Comments
8 pages, Submitted to RAL
AI中文摘要

我们提出了一种用于点目标导航中视觉表征学习的传感器引导自适应对比学习框架。在训练过程中,特权LiDAR传感器通过几何感知相似度度量和自适应温度缩放来引导对比目标,鼓励视觉嵌入捕获导航相关结构而非场景特定外观。得到的编码器被独立预训练、冻结,并用作强化学习的感知骨干,将表征学习与策略优化解耦。我们进一步在表征预训练和策略学习之间引入跨阶段域不匹配,以抑制环境特定捷径并促进对任务相关特征的依赖。在高保真模拟中的大量实验表明,我们的方法显著提高了跨多种室内外环境的策略级场景迁移。在部署时,智能体仅依赖单目RGB观测以及标准任务相关输入(如目标位置和本体感觉信号),无需访问LiDAR或其他特权传感器。我们的方法在严重外观和语义变化下优于大型预训练视觉模型和标准对比基线。我们还发布了一个多模态数据集,以支持未来关于导航中特权引导视觉表征学习的研究。代码可在以下网址获取:

英文摘要

We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:

2606.05501 2026-06-05 cs.RO

Learning Contact Representation for Leg Odometry

学习足式里程计的接触表示

Emre Girgin, Cagri Kilic

发表机构 * Department of Aerospace Engineering, Embry Riddle Aeronautical University(航空航天工程系,埃姆布里-瑞德航空大学)

AI总结 提出一种自监督表示学习框架,仅利用关节编码器标准传感器集进行接触检测,无需力传感器,在足式机器人里程计中优于监督方法和基线概率方法。

详情
Comments
17 pages
AI中文摘要

足式机器人里程计的估计依赖于一个假设:在支撑相期间,足部相对于世界的速度保持为零。主体速度的反馈来自足部的运动学串行链,因此准确的腿部相位检测是一个关键子问题。大量研究使用安装在足尖的地面反作用力传感器进行分类,但这些传感器可能并非所有足式机器人普遍可用。此外,这些传感器通常对未考虑的干扰(如足部与地面接触时的滑动)不敏感。在本研究中,我们提出了一种用于接触检测的自监督表示学习框架,该框架利用关节编码器的标准传感器集,无需依赖力传感器增强。我们使用学习到的表示来概率性地建模支撑相和摆动相。实验结果证实了所提出的自监督接触检测器的有效性。我们的框架在性能上优于需要传感器集增强和标注的监督方法以及基线概率方法。此外,我们将代码公开。

英文摘要

The estimation of odometry in legged robots depends on the assumption that the velocity of the foot with respect to the world remains zero during the stance phase. Feedback for the main body velocity is derived from the kinematic serial chain of the feet, making accurate leg phase detection a critical subproblem. A considerable number of studies employ ground reaction force sensors mounted at the tip of the foot to classify the leg phase, yet these sensors may not be available on all legged robots. Additionally, these sensors are often unresponsive to unaccounted disturbances, such as slippage, while the foot remains in contact with the ground. In this study, we propose a self-supervised representation learning framework for contact detection that utilizes the standard sensor set of joint encoders without reliance on force sensor augmentations. We employ learned representations to model the stance and swing phases probabilistically. The experimental results confirm the efficacy of the proposed self-supervised contact detector. Our framework outperforms supervised methods, which require sensor set augmentation and labeling, as well as baseline probabilistic approaches. Additionally, we make our code available to the public.

2606.05497 2026-06-05 cs.LG

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

LEVANTE-bench: 使用认知任务对VLM与儿童进行多尺度比较(或者,“你的VLM比五年级学生聪明吗?”)

Alvin Wei Ming Tan, David Cardinal, Tania Lorido-Botran, Laura Bravo-Sanchez, Sunny Yu, Michael C. Frank

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出LEVANTE-bench基准,基于儿童认知任务数据,从多个尺度系统评估视觉语言模型与5-12岁儿童在六项任务上的对齐程度,发现模型与人类认知仅部分对齐。

详情
AI中文摘要

鉴于人类经验本质上是多模态的,视觉语言模型(VLM)在模拟人类认知随经验增长和发展方面具有巨大潜力。发挥其潜力需要工具来比较VLM与人类认知发展在不同任务、年龄和人群中的表现。我们提出LEVANTE-bench,这是一个基于学习变异网络(LEVANTE)的任务和数据的基准,该网络分发跨语言和文化测量儿童认知的开源任务和数据。在LEVANTE-bench中,我们系统评估了VLM在六项任务上的表现,比较它们与三个国家5-12岁儿童(N = 1547)的对齐程度。我们在多个尺度上比较模型,评估它们的整体准确性、在任务和项目层面与儿童的对齐程度,以及它们匹配儿童试验级错误分布的程度。对齐在不同尺度上是异质的:在任务和项目层面,能力更强的模型与人类对齐更好。然而,与人类错误分布的匹配在不同任务间差异很大,对于某些任务,较小的模型更好地匹配了年幼儿童的错误。此外,即使表现最好的VLM在矩阵推理和心理旋转任务上也表现不佳。因此,当前的VLM架构仅与儿童的认知能力部分对齐。

英文摘要

Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for comparing VLMs with human cognitive development across tasks, ages, and populations. We present LEVANTE-bench, a benchmark based on tasks and data from the Learning Variability Network (LEVANTE), which distributes open-source tasks and data measuring children's cognition across languages and cultures. In LEVANTE-bench, we systematically assess VLMs on six tasks, comparing their alignment with children aged 5-12 ($N$ = 1547) across three countries. We compare models at multiple scales, assessing their overall accuracy, their task- and item-level alignment with children, and how well they match children's trial-level error distributions. Alignment was heterogeneous across scales: at the level of tasks and items, more capable models aligned better with humans. However, match to human error distributions varied widely across tasks, and for several tasks, smaller models matched younger children's errors better. In addition, even the best-performing VLMs struggled on matrix reasoning and mental rotation tasks. Thus, current VLM architectures align only partially with the cognitive abilities of children.