arXivDaily: daily arXiv digest, updated Monday through Friday
2606.07416 2026-06-08 cs.LG New submission

Video-Based Prediction of In-Flight Particle Characteristics in Atmospheric Plasma Spraying

Abhijeet Praveen, Sareh Soleimani, Cormac Cureton, Aman Sidhu, Kintak Raymond Yu, Cristian Cojocaru, Narges Armanfard

Affiliations: Department of Electrical and Computer Engineering, McGill University; Mila – Quebec AI Institute; National Research Council Canada

AI summary: Uses high-speed video of the plasma plume to predict in-flight particle temperature and velocity with models such as TabPFN and CNNs, reaching R² values of up to 0.90 and 0.82 respectively and enabling non-invasive diagnostics.

Comments
Accepted at ECML PKDD 2026 (Applied Data Science Track)

Abstract

Atmospheric plasma spraying (APS) is a widely used coating process in which in-flight particle temperature and velocity strongly influence coating quality. However, these particle characteristics are difficult to monitor continuously during operation, motivating the development of non-invasive data-driven diagnostic methods. In this work, we investigate the predictive potential of high-speed video observations of the plasma plume for estimating in-flight particle characteristics in APS. We introduce three different video-derived feature representations and evaluate them using Tabular Prior-Data Fitted Networks (TabPFN), convolutional neural networks (CNN), and classical regression baselines including Random Forest, Gradient Boosting, Support Vector Regression, and XGBoost. Experiments are conducted using grouped leave-one-out cross-validation on 126 labeled pre- and post-spray video recordings from 63 APS spray runs. Across the engineered feature experiments, TabPFN achieves the most consistent performance for temperature prediction, reaching R² = 0.86 using the combined feature representation. CNN models perform more strongly for velocity prediction, achieving an R² of 0.81. In addition, we evaluate models operating directly on raw video frames using pretrained CNNs and find that the highest performance is achieved by a pretrained CNN with a regression head, with R² of 0.90 and 0.82 for temperature and velocity, respectively. The results demonstrate that video-derived plume information provides a promising and scalable foundation for non-invasive APS diagnostics and real-time process monitoring.
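
The grouped leave-one-out protocol mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the grouping unit is the spray run, so that the pre- and post-spray recordings of one run are always held out together; the abstract suggests this but does not spell it out. The function name `leave_one_group_out` and the index layout are hypothetical.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) pairs, holding out one group at a time."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield train_idx, test_idx


# Hypothetical layout: 63 runs, each with a pre- and a post-spray recording.
groups = [run for run in range(63) for _ in ("pre", "post")]
splits = list(leave_one_group_out(groups))
```

Splitting by run rather than by recording keeps both recordings of a run out of the training fold, which avoids leakage between closely related samples.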

2606.07414 2026-06-08 cs.LG cs.NE New submission

Sparsely gated tiny linear experts

Simon Schug

Affiliations: Princeton University

AI summary: Proposes sparsely gated linear neuron (sgatlin) networks, which shrink each expert to a single neuron and remove its nonlinearity, improving language-model perplexity at equal compute while also improving interpretability.

Comments
Code available at https://github.com/smonsays/sparsely-gated-linear

Abstract

Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinking each expert to consist of a single neuron and selecting a tiny fraction of many available neurons can improve compute efficiency and interpretability. Counterintuitively, the key to achieving both is removing the nonlinearity typically applied to the experts, resulting in a network of sparsely gated linear neurons (sgatlin). In an isoflop comparison, we find that replacing all transformer feedforward layers with sgatlin improves perplexity in language models across different compute budgets. At the same time, the sparsity and linearity of the resulting feedforward circuits present new opportunities for model interpretability. In a small-scale case study, we demonstrate that feedforward circuits in sgatlin can be interpreted without having to train additional replacement models. We find that they form semantically structured clusters and are causally implicated in factual recall. Our findings paint a possible path towards compute-efficient and interpretable transformer feedforward layers.
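
To make the idea of single-neuron linear experts concrete, here is a rough sketch of one feedforward layer in that spirit. It is not the author's implementation: the top-k selection by a linear gate, the softmax over the selected scores, and all names (`sparse_linear_layer`, `gate_w`, `in_w`, `out_w`) are assumptions chosen for illustration. The only property taken from the abstract is that each expert is a single neuron with no nonlinearity applied to it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_linear_layer(x, gate_w, in_w, out_w, k):
    """Route x through the k highest-scoring single-neuron linear experts."""
    scores = [dot(g, x) for g in gate_w]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (an assumed, common MoE choice).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    y = [0.0] * len(out_w[0])
    for i in top:
        # Each expert is one neuron: a scalar activation with no nonlinearity.
        a = (exps[i] / z) * dot(in_w[i], x)
        y = [yj + a * vj for yj, vj in zip(y, out_w[i])]
    return y, top
```

Because the selected neurons act linearly, the contribution of each active neuron to the output can be read off directly as a scaled copy of its output vector, which is one plausible reason such circuits are easier to inspect.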

2606.07410 2026-06-08 cs.LG cs.AI New submission

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Yuxiang Chen, Jun Wang

Affiliations: UCL Centre for Artificial Intelligence

AI summary: By annotating 10,247 reasoning steps across all 30 AIME 2025 problems, finds that DeepSeek-R1 exhibits topological mimicry (reproducing the surface form of reasoning rather than reasoning itself), while stable use of branching and backtracking in successful traces and reflection placed within deductive inference are signals of genuine reasoning.

Abstract

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising "spinning-wheel" traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.

2606.07404 2026-06-08 cs.LG New submission

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

Rohan Shravan

Affiliations: The School of AI

AI summary: Reports end-to-end training of a hundred-billion-parameter sparse mixture-of-experts model on a single 8-GPU node, growing it in four stages from a dense seed to a 120B model under three disciplines: reversible recurrence, state-preserving growth, and single-node economics.

Comments
58 pages, 9 figures, 37 tables. Code: https://github.com/The-School-of-AI/LLM. Released models: huggingface.co/theschoolofai/LightningLM-0.1V-{2B, 5B-MoE, 9B-MoE, 120B-MoE}. Companion work: arXiv:2605.29379 (BrahmicTokenizer-131K), arXiv:2605.29459 (Kronecker Embeddings)

Abstract

This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense seed, through a 5B and a 9B mixture of experts, to a 120B model with 460 routed experts under top-12 routing. Each larger model is grown from the trained weights of the smaller one; active parameters rise monotonically from 1.78B at the dense seed to 5.93B at 120B (about 5% of the 118.67B stored). The full lineage runs on single nodes, the larger stages at 8K context, reaching a released training loss of 1.78 at 120B scale. This is a systems and experience report. It is organized around three disciplines. Reversibility: a reversible recurrence stack reconstructs activations in the backward pass instead of storing them, holding activation memory flat as the model grows. State-preserving growth: each expansion (dense to MoE, shallow to deep, few experts to many) is given as a reproducible principle paired with the failure that results from getting it wrong; several failures are silent. Single-node economics: the 120B trains through TQP, a strategy of quantized base expert weights and trained low-rank adapters that carries optimizer state on 2.26B adapter parameters rather than 100B+ resident in routed experts, cutting expert-path optimizer state by a factor of ~45. What is new is the integration of known primitives, not any primitive in isolation: one grown lineage running end to end on a single node, documented at practitioner level, with per-domain held-out loss as evidence that targeted capabilities (multilingual Indic competence, code) were learned by construction. Model family, tokenizer, and training code are released.
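
The reversibility discipline, reconstructing activations in the backward pass instead of storing them, can be illustrated with a generic additive coupling in the style of reversible residual networks. This is a sketch of the general technique only; the paper's reversible recurrence stack may be built differently, and the sub-functions `f` and `g` below are hypothetical placeholders.

```python
def f(v):
    # Hypothetical sub-function; any deterministic function works here.
    return [2.0 * x + 1.0 for x in v]


def g(v):
    # Hypothetical second sub-function.
    return [x * x for x in v]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def forward(x1, x2):
    """Additive coupling: the outputs determine the inputs exactly."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def inverse(y1, y2):
    """Recover the inputs from the outputs without any stored activations."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2
```

Since `inverse` recomputes the layer input from its output, a stack of such layers needs to keep only the final activations during training, which is consistent with the abstract's claim that activation memory stays flat as depth grows.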

2606.07402 2026-06-08 cs.CL New submission

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou

Affiliations: The Hong Kong University of Science and Technology; Beijing University of Chemical Technology; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Beijing Institute of Technology (Zhuhai); Tencent Hy; Peng Cheng Laboratory

AI summary: Proposes the M$^3$Exam benchmark for evaluating cross-modal grounding and implicit information inference by multimodal LLMs in realistic user-agent interactions, and designs M$^3$Proctor, which processes visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

Abstract

Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

2606.07401 2026-06-08 cs.CV New submission

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

Ameya Joshi, Joon Kim, Gus Eggert, Joseph Bajor, Cindy Hao, Jing Reyhan, Kushal Byatnal, Eli Badgio

Affiliations: Extend AI

AI summary: Proposes RealDocBench, a benchmark with field-level QA and layout-understanding tracks, evaluates 18 systems on real regulated documents, and reveals performance spread and cost/latency trade-offs that single-number metrics hide.

Abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
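
The difference between per-field and strict per-question accuracy on the QA track can be sketched as follows. The sketch assumes exact-match comparison of values and micro-averaging over fields; the benchmark's actual typed matching and normalization rules are not described in the abstract, and the function name `score` is hypothetical.

```python
def score(predictions, gold):
    """Return (per_field_accuracy, strict_per_question_accuracy)."""
    fields_total = fields_ok = questions_ok = 0
    for qid, gold_dict in gold.items():
        pred = predictions.get(qid, {})
        # A field counts as correct only on exact match (an assumption).
        hits = sum(1 for key, value in gold_dict.items() if pred.get(key) == value)
        fields_total += len(gold_dict)
        fields_ok += hits
        # Strict scoring: every field of the question must be correct.
        questions_ok += int(hits == len(gold_dict))
    return fields_ok / fields_total, questions_ok / len(gold)
```

A parser that gets most fields right but rarely gets a whole question right will look much better under the first number than the second, which is the kind of spread a single score would hide.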

2606.07397 2026-06-08 cs.SD New submission

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

Yifan Duan, Qixiang Xu, Hengtao Wu, Zhanxun Liu, Wenhao Guan, Junxi Liu, Ziyang Ma, Kelu Xu, Xie Chen

Affiliations: MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University; Shanghai Innovation Institute; Shanghai AI Laboratory; Xiamen University; State Key Laboratory of Complex & Critical Software Environment, China

AI summary: Proposes Audio-Oscar, a multi-agent framework that coordinates specialist agents for character modeling, speech generation, timeline planning, and related tasks to generate and refine complex audio scenes, and builds the ASG-Bench benchmark for evaluation.

Abstract

In recent years, audio generation has made significant progress in tasks such as text-to-speech (TTS), text-to-audio (TTA) and text-to-music (TTM). However, generating long-form and controllable audio from complex audio scene descriptions remains a significant challenge, as such scenes often require coordinated speech, sound effects, music, songs, temporal structure, and post-production. In this work, we introduce Audio-Oscar, a multi-agent framework for generating audio from complex descriptions. Audio-Oscar coordinates a set of specialist agents, each responsible for a different aspect of the audio scene, including character modeling and voice design, speech generation, fine-grained timeline planning, model selection, non-speech generation, and audio post-production. Audio-Oscar further incorporates feedback-driven refinement. In addition, to address the lack of suitable benchmarks for evaluating audio generation from complex audio scene descriptions, we construct ASG-Bench, an Audio Scene Generation Benchmark containing both scene descriptions paired with reference audio and text-only scene descriptions. Each scene is annotated with target audio events and temporal statements to evaluate whether the generated audio faithfully realizes the required scene content and temporal structure. Experimental results show that Audio-Oscar can effectively generate audio that matches complex scene descriptions. Project samples are available at https://audiooscar.github.io/. Our code is available at https://github.com/ziye26/Audio-Oscar.

2606.07394 2026-06-08 cs.CV New submission

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Danial Hamdi, Fardin Ayar, Mahdi Javanmardi

Affiliations: Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic)

AI summary: Proposes an integer-linear-programming diagnostic framework that separates classification, segmentation, and tracking errors, finding that tracking instability is the main bottleneck in video instance segmentation, especially under occlusion, in long videos, and in dense scenes, and that stronger backbones do not remove this algorithmic problem.

Abstract

In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.

2606.07392 2026-06-08 cs.AI cs.LG econ.EM stat.ML New submission

Online Pandora's Box for Contextual LLM Cascading

Alexandre Belloni, Yan Chen, Yehua Wei

Affiliations: The Fuqua School of Business, Duke University

AI summary: For LLM cascading, proposes an online contextual Pandora's Box model that combines parametric reservation indices and GMM estimation with UCB bounds to achieve dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret.

Abstract

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.
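
For readers unfamiliar with reservation indices, the classical (non-contextual) Pandora's Box setting defines the index of a box as the value sigma that solves E[max(V - sigma, 0)] = c, where V is the box's random reward and c its opening cost. The sketch below computes that index by bisection for a discrete reward distribution. It illustrates the classical definition only; the paper models the index parametrically as a function of context rather than computing it from a known distribution, and the function names here are hypothetical.

```python
def expected_gain(values, probs, sigma):
    """E[max(V - sigma, 0)] for a discrete reward distribution."""
    return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))


def reservation_index(values, probs, cost, tol=1e-9):
    """Solve expected_gain(sigma) = cost by bisection (requires cost > 0)."""
    # The gain is at least `cost` at the lower end and zero at the upper end.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(values, probs, mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Under Weitzman's policy, boxes are opened in decreasing order of this index, and search stops once the best reward seen so far exceeds the largest index among the unopened boxes. A higher cost lowers a box's index, so expensive APIs are queried later or not at all.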

2606.07389 2026-06-08 cs.RO New submission

Simulation-Driven Imitation Learning for Biosignals-Free Shared-Autonomy Prosthetic Grasping

Kaijie Shi, Wanglong Lu, Huiling Chen, Vinicius Prado da Fonseca, Ting Zou, Hanli Zhao, Xianta Jiang

Affiliations: Memorial University of Newfoundland; Wenzhou University

AI summary: Proposes a simulation framework that automatically generates diverse grasp demonstrations by combining physically feasible grasp synthesis, natural reaching-trajectory retargeting, and execution in procedurally generated environments, yielding imitation-learned prosthetic control with high success rates and strong generalization.

Abstract

Biosignals-free shared-autonomy control of upper-limb prosthetic hands aims to enable natural and low-effort manipulation without relying on EMG or other physiological signals. Recent imitation-learning-based approaches have shown promising results, but their scalability is limited by the cost and variability of collecting large amounts of real-world human demonstration data. In this work, we present a scalable simulation framework that automatically generates diverse reach-to-grasp demonstrations from a wrist-mounted virtual camera. The framework combines physically feasible grasp synthesis, natural reaching-trajectory retargeting, and reach-grasp-lift execution in procedurally generated indoor environments. It records wrist-view observations, proprioception, and actions to build a large-scale demonstration dataset for imitation learning. Through extensive simulation benchmarks, we evaluate object and scene generalization and compare several representative state-of-the-art imitation learning methods. Results show that the simulated demonstrations are sufficiently rich and consistent for effective policy learning. In three realistic settings, the learned sim-to-real policy achieves over 90% grasp success, surpasses baseline methods, and exhibits stronger generalization, highlighting the promise of simulation-driven training for biosignals-free shared-autonomy prosthetic grasping. The demonstrations are available at https://sites.google.com/view/sim-prosthetic-grasp/home.

2606.07386 2026-06-08 cs.RO New submission

Spline Policy: A Structured Representation for Robot Policies

Mengze Tian, Yiming Li, Sichao Liu, Auke Ijspeert, Sylvain Calinon

Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); Idiap Research Institute

AI summary: Proposes Spline Policy (SP), which replaces action chunks with spline parameters while keeping the policy backbone, supporting continuous trajectory decoding, temporal resampling, parameter-space editing, and downstream control, together with a local corrective mechanism.

Comments
This work has been submitted to the IEEE for possible publication

Abstract

Modern imitation-learning policies for robot manipulation often represent actions as fixed-resolution action chunks, which are simple and effective but expose limited geometric and temporal structure before execution. This paper studies Spline Policy (SP), a structured representation that replaces action chunks with spline parameters while keeping the policy backbone unchanged. The predicted spline can be decoded as a compact continuous trajectory, queried at different temporal resolutions, constrained or edited in parameter space, and passed to downstream controllers. For quadratic spline outputs, the same representation can also be converted into a state-dependent vector field through an analytical distance-field construction. Under the regularity and projection assumptions of this construction, the induced dynamics do not increase the distance to the generated spline, yielding a principled local corrective mechanism around the predicted motion. The spline output further supports uncertainty propagation from observations to spline parameters, trajectories, and flow fields, and can be combined with classical control mechanisms such as null-space collision avoidance without retraining the policy backbone. We instantiate SP with diffusion, flow-matching, transformer-based, and vision-language-action backbones. Experiments in low-dimensional motion learning, simulated manipulation under matched backbones, dexterous manipulation, and real-robot case studies show that SP remains compatible with modern policy learners while exposing useful motion-structure properties, including compact decoding, temporal resampling, local correction around predicted motions, uncertainty evaluation, and controller compatibility.
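
The claim that a predicted spline can be queried at different temporal resolutions is easy to see with a single quadratic Bézier segment, B(t) = (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2. The sketch below assumes this parameterization purely for illustration; the abstract does not say which spline basis or how many segments SP uses, and the control points are made up.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Evaluate one quadratic Bezier segment at t in [0, 1]."""
    a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
    return tuple(a * x0 + b * x1 + c * x2 for x0, x1, x2 in zip(p0, p1, p2))


def decode(p0, p1, p2, n_steps):
    """Sample the same segment at n_steps evenly spaced times (n_steps >= 2)."""
    return [quadratic_bezier(p0, p1, p2, i / (n_steps - 1)) for i in range(n_steps)]


# Hypothetical control points standing in for predicted spline parameters.
coarse = decode((0.0, 0.0), (1.0, 2.0), (2.0, 0.0), 3)
fine = decode((0.0, 0.0), (1.0, 2.0), (2.0, 0.0), 5)
```

The three control points are the whole output of the policy in this toy setting; `coarse` and `fine` are two decodings of the same motion, whereas a fixed-resolution action chunk would commit to one sampling rate at prediction time.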

2606.07383 2026-06-08 cs.RO cs.LG New submission

RhinoVLA Technical Report

Huixi Intelligence: Chen Zhang, Chenyang Zhou, Guanglei Ding, Guanghui He, Haibin Gao, Jiajia Chen, Jianyong Zhang, Lianyi Yu, Ningyi Xu, Ping Xu, Qingchen Li, Yingjun Hu, Yijia Zhang, Yuxi Liu

Affiliations: Huixi Intelligence

AI summary: To address the deployment latency of VLA models on edge hardware, proposes RhinoVLA, which uses a token-efficient backbone, a continuous Action Expert, and a unified interface to achieve real-time closed-loop control, reaching 11.69 Hz inference on Huixi R1.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for robotic manipulation, but real-time deployment on edge hardware remains challenging. In this work, we identify VLM visual and context tokens as a major source of deployment latency: for GEMM-dominated projection operators, computation grows linearly with the number of input tokens when model dimensions are fixed. Motivated by this observation, we propose RhinoVLA, a deployment-oriented VLA model co-designed with the Huixi R1 edge SoC. RhinoVLA adopts a token-efficient Qwen3-VL backbone and a continuous Action Expert, reducing the VLM-side token and computation burden while preserving pretrained multimodal capability. To support cross-robot learning, RhinoVLA further introduces a unified interface that combines a View Registry, a 72D physical state-action slot space, and robot-instance LoRA, allowing heterogeneous robot observations and action schemas to be aligned under a shared policy. On the deployment side, RhinoVLA is optimized through hardware-aware compilation, mixed-precision execution, and parallel visual encoding. Experiments show that RhinoVLA achieves downstream performance comparable to π0.5 at a similar parameter scale, while reaching 11.69 Hz end-to-end inference on Huixi R1, meeting the 10 Hz real-time closed-loop control target. The project will be open-sourced at https://github.com/HuixiAI/RhinoVLA.
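
Two pieces of arithmetic in the abstract can be checked with a short sketch: the linear growth of projection cost with token count, and the per-step time budget implied by the reported rates. It uses the standard estimate of 2 * n * d_in * d_out floating-point operations for multiplying an (n x d_in) activation matrix by a (d_in x d_out) weight matrix; the layer sizes in the usage note are hypothetical and are not taken from the report.

```python
def projection_flops(n_tokens, d_in, d_out):
    """Approximate FLOPs of one dense projection: one multiply-add per weight per token."""
    return 2 * n_tokens * d_in * d_out


def period_ms(rate_hz):
    """Time available per control step at a given rate, in milliseconds."""
    return 1000.0 / rate_hz
```

With the model dimensions held fixed, `projection_flops(512, 1024, 1024)` is exactly twice `projection_flops(256, 1024, 1024)`, so halving the token count halves this part of the work. On the control side, `period_ms(11.69)` is roughly 85.5 ms, which fits inside the 100 ms budget of a 10 Hz loop.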

2606.07382 2026-06-08 cs.LG stat.ML New submission

Covariance Shrinkage via Stochastic Interpolation

Mathieu Chalvidal, Florentin Coeurdoux, Eric Vanden-Eijnden

Affiliations: Capital Fund Management

AI summary: Recasts classical shrinkage for high-dimensional covariance estimation as empirical risk minimization over a parametric stochastic interpolant between a source and a target distribution, reveals three mechanisms for reducing statistical risk, and proposes a neural estimator with an upper bound on its risk.

Comments
18 pages

Abstract

We recast classical shrinkage of high-dimensional covariance estimators as empirical risk minimization over a parametric stochastic interpolant between a source and a target distribution. This formalism recovers known shrinkage estimators as special cases and reveals three distinct mechanisms for reducing statistical risk: (i) Scheduling: the interpolant schedule determines the class of admissible covariances, and hence the achievable risk. (ii) Flow maps and couplings: whereas naive constructions amount to assuming independence between the distributions, specific coupling structures (e.g., solutions of optimal transport problems) can lower the empirical risk. Moreover, non-linear flow maps realizing such couplings free the interpolant covariance from the eigenbasis of the empirical estimate, enabling eigenvector regularization. (iii) Early stopping: estimators defined by integrating a regressed vector field afford an additional bias-variance trade-off through approximation of the true interpolant distribution. We then propose a neural estimator of the interpolant, together with an upper bound on its quadratic risk in terms of the interpolant approximation error, and validate both on synthetic experiments. Finally, we apply the estimator to real neuroimaging data, demonstrating the additional regularization power this approach offers in practice.
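
As background for the classical shrinkage the abstract starts from, the sketch below implements the familiar linear estimator (1 - alpha) * S + alpha * mu * I, which pulls the sample covariance S toward a scaled identity with mu = trace(S) / p. That this particular estimator is among the special cases the paper recovers is an assumption; the interpolant-based formulation itself is not shown, and the function names are hypothetical.

```python
def sample_covariance(rows):
    """Unbiased sample covariance of a list of equal-length observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def linear_shrinkage(cov, alpha):
    """Blend cov with a scaled identity; alpha=0 keeps cov, alpha=1 gives mu * I."""
    p = len(cov)
    mu = sum(cov[i][i] for i in range(p)) / p
    return [[(1 - alpha) * cov[i][j] + (alpha * mu if i == j else 0.0)
             for j in range(p)] for i in range(p)]
```

Note that this estimator only rescales eigenvalues and keeps the eigenvectors of S, which is the limitation the abstract's second mechanism (non-linear flow maps) is said to lift.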

2606.07368 2026-06-08 cs.CV cs.AI New submission

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

发表机构 * Flensburg University of Applied Sciences Technische Hochschule Ingolstadt University of Veterinary Medicine Schwarzman Animal Medical Center Freie Universität Berlin University of Warwick MINES Paris - PSL University Yildiz Technical University University College London AIRA MATRIX Private Limited Gestalt Diagnostics University of California, Los Angeles University of Kansas Medical Center University of Salerno Cancer Center Sp. z o. o. th Military Research Hospital in Bydgoszcz Shenzhen Technology University Toronto Metropolitan University Tata Consultancy Services Ltd. Leeds Teaching Hospitals NHS Trust The University of Tokyo Xi’an Jiaotong-Liverpool University University of Augsburg Ulm University Japanese Red Cross Medical Center Wroclaw University of Science and Technology TECNALIA, Basque Research and Technology Alliance (BRTA) Indian Institute of Technology Bombay MBZUAI University of Basel University Medical Center Utrecht TU Eindhoven HUN-REN Biological Research Centre

AI总结 针对临床实际中组织学多样性的挑战,MIDOG 2025挑战评估了跨12种肿瘤类型和多种扫描平台的算法性能,发现模型在传统热点区域表现可靠,但在困难区域和罕见肿瘤中性能显著下降,集成方法可提升F1分数1.5个百分点。

AI中文摘要

自动有丝分裂检测是计算病理学中一项成熟的任务。虽然之前的基准测试关注扫描仪引起的域偏移,但临床“真实世界”应用要求模型能够对组织学景观中预期的巨大差异具有鲁棒性。MItosis DOmain Generalization (MIDOG) 2025挑战旨在评估算法在空前生物学和上下文多样性下的性能。我们策划了一个包含365个病例的测试数据集,涵盖12种不同的人类、犬和猫肿瘤类型,并在多个扫描平台上数字化。超越手动选择的感兴趣区域(ROI),该挑战还要求在随机组织区域(代表全切片检测情况)和困难区域(富含难负样本的区域)进行检测。在第二个赛道中,我们引入了非典型有丝分裂象(AMF)的分类。检测赛道有18支队伍提交,F1分数最高达0.740。在AMF检测赛道,我们有21个提交,平衡准确率最高达0.908。我们的分析显示,虽然大多数模型在传统热点区域表现可靠,但在困难ROI中性能显著下降,假阳性率增加了两倍。此外,性能在12种肿瘤类型间差异显著,突显了当前最先进架构在遇到罕见或高度多形性恶性肿瘤时的“盲点”。此外,我们评估了集成的有效性,发现F1分数和平衡准确率平均分别提高1.5和1.3个百分点。相比之下,测试时增强(TTA)没有显示出相关改进。MIDOG 2025表明,“野外”有丝分裂检测仍然是一个重大障碍。从仅热点评估到多上下文框架的转变,为临床可靠性提供了更现实的代理指标。

英文摘要

Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge required detection also in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging regions of interest (ROIs), where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.
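The two metrics quoted in this abstract, F1 for the detection track and balanced accuracy for the AMF track, can be computed from confusion counts as in the minimal sketch below. The counts are hypothetical and chosen only to illustrate how a tripled number of false positives lowers F1; they are not taken from the challenge results.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts: same true positives and misses, three times the false positives.
hotspot = f1_score(tp=80, fp=20, fn=20)
challenging = f1_score(tp=80, fp=60, fn=20)

# Hypothetical two-class counts for a typical-vs-atypical classifier.
amf_bacc = balanced_accuracy(tp=90, fn=10, tn=80, fp=20)
```

Balanced accuracy averages the per-class recalls, so it is less affected by class imbalance than plain accuracy.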

2606.07365 2026-06-08 cs.LG cs.AI 新提交

A robust PPG foundation model using multimodal physiological supervision

一种使用多模态生理监督的鲁棒PPG基础模型

Eloy Geenjaar, Vince Calhoun, Scott Daly, Gouthaman KV, Lie Lu, Trisha Mittal, Daniel P. Darcy

发表机构 * Dolby Laboratories

AI总结 提出一种PPG基础模型,利用ICU数据集中的心电和呼吸信号选择对比样本,无需高质量或场域数据预训练,在15个下游任务中14个取得性能提升。

AI中文摘要

光电容积描记法(PPG)是一种无创测量血容量变化的方法,广泛应用于可穿戴设备和临床环境。最近的PPG基础模型要么使用开源ICU数据集,采用需要精心整理数据的预训练范式,从而难以泛化到场域数据,要么使用闭源场域PPG数据。相比之下,我们提出了一种PPG基础模型,不需要高质量或场域预训练数据,而是利用ICU数据集中伴随的心电图和呼吸信号在预训练期间选择对比样本。我们的方法允许模型保留并从噪声PPG片段中学习,提高了推理时的鲁棒性。我们的模型在比现有最先进方法少3倍的受试者上预训练,在15个不同的下游任务(包括场域日常活动和心率预测)中的14个上实现了性能提升。我们的结果表明,多模态监督可以整合互补的生理信息,以提高PPG基础模型的鲁棒性,并增强其对消费级数据的泛化能力。

英文摘要

Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining paradigms that require curated data and thus complicate generalization to field-like data, or use closed-source field-like PPG data. In contrast, we propose a PPG foundation model that does not require high-quality or field-like pretraining data, and instead leverages accompanying electrocardiogram and respiratory signals in ICU datasets to select contrastive samples during pretraining. Our approach allows the model to retain and learn from noisy PPG segments, improving robustness at inference. Our model, pretrained on 3x fewer subjects than existing state-of-the-art approaches, achieves performance improvements on 14 out of 15 diverse downstream tasks, including field-like daily activity and heart rate prediction. Our results demonstrate that multimodal supervision can integrate complementary physiological information to improve the robustness of PPG foundation models and enhance their generalization to consumer-grade data.

2606.07356 2026-06-08 cs.SD cs.CL 新提交

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

DirectAudioEdit: 基于扩散预测对比的无反演文本引导音频编辑

Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

发表机构 * School of Computer Science and Engineering, Northeastern University, Shenyang, China Kunming University of Science and Technology NiuTrans Research, Shenyang, China

AI总结 提出一种无需训练和反演的文本引导音频编辑方法DirectAudioEdit,通过扩散预测对比构建编辑路径,在音乐和事件基准上降低FAD和KL指标15%以上,编辑速度提升高达64.5%。

AI中文摘要

文本引导音频编辑旨在修改语言指定的声学内容,同时保留与编辑无关的源组件。现有的无训练方法通常依赖于基于反演的编辑。虽然无反演编辑因其减少计算开销和重构误差而具有吸引力,但在音频编辑中仍基本未被探索。关键挑战是通过扩散去噪动力学构建源到目标的编辑路径。在本文中,我们介绍了DirectAudioEdit,这是首次尝试开发一种无需训练和反演的音频编辑方法。在两个骨干网络上的音乐和事件级基准实验表明,与DDPM反演相比,DirectAudioEdit将宏观平均FAD和KL分别降低了15.9%和15.8%,同时实现了高达64.5%的编辑加速。

英文摘要

Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.

2606.07355 2026-06-08 cs.CV 新提交

Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

面向微手势在线识别的时空解耦适配器

Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

发表机构 * Hefei University of Technology United Arab Emirates University Institute of Artificial Intelligence, Hefei Comprehensive National Science Center Anhui Evolution Technology Co., Ltd.

AI总结 提出时空解耦适配器,通过轻量级深度可分离卷积将视频适配分解为独立的时间和空间分支,并引入自适应软平衡增强缓解长尾分布问题,在EI-MiGA挑战赛Track 2中取得第一名。

Comments
Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026
AI中文摘要

微手势在线识别旨在对未修剪视频中的细微手势进行时间定位和分类。由于微手势持续时间极短、运动幅度低且视觉线索模糊,捕获判别性的时空表示仍然极具挑战性。现有的参数高效适配器通常采用单分支联合建模时空线索,这可能无法捕获微手势的细粒度模式。为解决这一局限,我们提出了一种时空解耦适配器,通过轻量级深度可分离卷积将视频适配分解为独立的时间和空间分支。此外,为解决基准数据集中的长尾分布问题,我们引入了自适应软平衡增强,该方法根据类别稀有性和学习难度动态分配增强强度,无需手动设置阈值。我们的方法取得了0.43808的F1分数,在第四届EI-MiGA-IJCAI挑战赛的Track 2中排名第一。

英文摘要

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

2606.07342 2026-06-08 cs.CL cs.NE 新提交

LLM-Guided Evolution for Medical Decision Pipelines

LLM引导的医疗决策流程进化

Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov

发表机构 * Sber AI Lab AIRI

AI总结 提出LLM引导的MAP-Elites进化方法,无需微调即可优化医疗决策流程,在分诊、咨询和图像分类任务中超越手工设计基线。

AI中文摘要

将大型语言模型(LLM)适应临床工作流程通常需要昂贵的微调或手动提示和流程工程。我们研究了LLM引导的MAP-Elites进化作为一种推理时替代方案,用于发现医疗决策策略,并在https://github.com/univanxx/llm_guided_evo_medical提供实现仓库。我们将紧急分诊、交互式咨询和医学图像分类表述为对可执行工件的进化搜索,这些工件由特定任务的适应度函数优化。在所有三种设置中,进化在实践约束下改进了手工设计的基线。在分诊中,进化程序将Semigran准确率从77.3%提高到87.1%,紧急召回率从0.60提高到0.97,同时改进了安全加权的保留MIMIC-ESI性能。在交互式咨询中,进化策略改进了Llama-3、Qwen-3.5和Gemma-4的准确率-成本前沿,并迁移到保留的iCRAFTMD。在PneumoniaMNIST中,仅提示进化改进了冻结的MedGemma VLM,同时保留了严格的JSON输出。定性分析表明,收益来自可解释的程序级机制、校准的分诊边界、有针对性的证据获取、选择性承诺和面向发现的视觉决策规则,而不仅仅是表面的提示改写。

英文摘要

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy-cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.
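For readers unfamiliar with MAP-Elites, the core loop keeps one best candidate per cell of a behavior descriptor and repeatedly mutates a randomly chosen elite. The sketch below is a generic, minimal version under stated assumptions: the candidate is a list of floats, the mutation is Gaussian noise, and all `toy_*` functions are hypothetical stand-ins. In the setting described above the candidate would be an executable artifact and the mutation would be proposed by an LLM; the authors' actual implementation is in their repository and may differ.

```python
import random


def map_elites(fitness, descriptor, mutate, init, iterations, rng):
    """Keep the best candidate found so far in each descriptor cell."""
    archive = {}  # cell -> (fitness, candidate)

    def consider(candidate):
        cell, score = descriptor(candidate), fitness(candidate)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    consider(init)
    for _ in range(iterations):
        # Pick a random elite as parent, mutate it, and try to place the child.
        _, parent = rng.choice(list(archive.values()))
        consider(mutate(parent, rng))
    return archive


# Toy stand-ins (hypothetical): fitness peaks at 0.5 in every coordinate.
def toy_fitness(x):
    return -sum((v - 0.5) ** 2 for v in x)


def toy_descriptor(x):
    return min(int(x[0] * 4), 3)  # four cells along the first coordinate


def toy_mutate(x, rng):
    return [min(1.0, max(0.0, v + rng.gauss(0, 0.1))) for v in x]


archive = map_elites(toy_fitness, toy_descriptor, toy_mutate,
                     init=[0.1, 0.9], iterations=500, rng=random.Random(0))
```

Because an elite is only ever replaced by a strictly better candidate in the same cell, the archive preserves diverse solutions rather than collapsing onto a single optimum.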

2606.07338 2026-06-08 cs.CV 新提交

VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

VeriDrive: 可验证的反事实监督用于成本高效的视觉-语言规划

Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon

发表机构 * Department of Computer Science, Durham University

AI总结 提出VeriDrive框架,通过结构化感知-评估-修正链生成可验证的反事实监督,降低视觉-语言驾驶规划的数据构建成本,并在nuScenes数据集上验证其有效性。

AI中文摘要

视觉-语言驾驶模型越来越多地使用推理监督来连接感知、预测和规划,但现有的驾驶理由通常是自由形式的,且使用前沿模型生成成本高昂。我们提出了VeriDrive,一个构建面向规划的、可验证的反事实监督框架。VeriDrive将驾驶推理转化为结构化的感知-评估-修正链,该链将关键对象锚定于未来运动,使用可规则检查的证据评估替代自我轨迹,将风险意图修正为专家行为,并生成最终规划目标。为了扩展数据构建,VeriDrive结合了本地生成与验证器引导的选择性修正,仅升级无效或困难的样本。我们在nuScenes上构建了VeriDrive数据集,并在Omni-Q协议下进行训练。受控的开环实验表明,VeriDrive在L2、碰撞和交叉指标上优于OmniDrive,同时减少了记录的令牌使用量、生成时间和实际支付的LLM/VLM成本。这些结果表明,可审计的中间字段和结构化修正目标可以在现实注释预算下改进视觉-语言规划监督。代码、提示和验证器脚本即将发布,并将在审稿过程后公开。

英文摘要

Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.

2606.07334 2026-06-08 cs.SD cs.LG 新提交

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

和弦符号时间序列适应能承载多远流派身份?多流派和弦符号建模的能力与边界

Jinju Lee

发表机构 * PearlLeeStudio

AI总结 本研究评估了五种轻量级适应方法(LoRA、IA3、BitFit、前缀微调和全微调)将预训练流行爵士和弦模型扩展到11个目标流派的效果,发现所有方法均能提升和弦预测性能,但和弦符号本身不足以完整传递流派身份。

Comments
16 pages, 4 figures
AI中文摘要

和声是一个紧凑的符号层,其中数学音高关系、声学协和与音乐惯例交汇。本报告并非将和弦符号序列视为音乐的完整表示,而是将其作为可解释、可控的时间序列,用于流派局部和声建模。从一个冻结的流行爵士音乐变换器检查点开始,我评估了小型适应接口能将模型扩展到11个目标流派的程度:布鲁斯、波萨诺瓦、巴赫众赞歌、乡村、电子、民谣、放克、福音、嘻哈、R&B/灵魂乐和摇滚。主要比较了LoRA、IA3、BitFit、前缀微调和全微调在11个流派和3个种子上的表现,构成完整的165个单元格网格。所有五种方法在保留和弦预测上都优于冻结基线,宏观增益从+2.89到+3.61分;LoRA和IA3得分最高,但经Holm和Benjamini-Hochberg校正的Wilcoxon检验不支持决定性优胜者。一个匹配数据量的对照实验进一步明确了这一点:当流派被子采样到共同语料库大小时,IA3保持领先,但LoRA的全数据优势消失并跌至最后,表明小差距部分由数据驱动。一个控制标记基线也很强,错误流派适配器通常优于冻结基线,表明大部分效果来自对可重用和声基底的轻量级条件化,而非特定适配器家族。额外的诊断(秩扫描、错误流派轮换、基础检查点消融、仅和弦流派分类、生成输出统计、真实歌曲评估和重复分析)支持一个有限的结论:和弦符号适应可靠地改进了流派局部和声预测,但仅靠和弦符号不能承载完整的流派身份。因此,本报告避免关于感知流派真实性或完整音乐质量的声明,这需要受控的听众或音乐家评估。

英文摘要

Harmony is a compact symbolic layer where mathematical pitch relations, acoustic consonance, and musical convention meet. This report treats chord-symbol sequences not as a complete representation of music, but as an interpretable, controllable time series for genre-local harmonic modeling. Starting from a frozen pop-jazz Music Transformer checkpoint, I evaluate how far small adaptation interfaces can extend the model to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction, with macro gains from +2.89 to +3.61 points; LoRA and IA3 score highest, but Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: when genres are sub-sampled to a common corpus size, IA3 stays on top but LoRA's full-data edge disappears and it falls to last, indicating the small gaps are partly data-driven. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting much of the effect comes from lightweight conditioning over a reusable harmonic base rather than one particular adapter family. Additional diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation, chord-only genre classification, generated-output statistics, real-song evaluation, and duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. The report therefore avoids claims about perceived genre authenticity or full musical quality, which require controlled listener or musician evaluation.
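The abstract mentions Holm correction of the pairwise Wilcoxon tests. As a reminder of what that correction does, the sketch below computes Holm step-down adjusted p-values in plain Python. The raw p-values are hypothetical and not taken from the report, and the function name is illustrative.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for four pairwise method comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
```

A comparison is declared significant at level alpha when its adjusted p-value is at most alpha, so small raw gaps between methods can fail to survive the correction.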

2606.07326 2026-06-08 cs.CV 新提交

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

AnchorWorld: 基于视图演化定制的具身自我中心世界模拟

Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang

发表机构 * Tsinghua University HUST Kling Team, Kuaishou Technology HKUST WHU

AI总结 提出AnchorWorld框架,利用3D人体运动和外源视角辅助训练增强交互完整性,并通过锚点视图和文本描述实现自我演化世界的灵活定制,显著优于现有方法。

AI中文摘要

尽管交互式世界建模是一个关键前沿,但在实际场景所需的多样化可控性方面仍未被充分探索。为弥补这一差距,我们提出AnchorWorld,一个通过增强交互完整性和灵活的世界定制机制来推进自我中心模拟的框架。首先,我们利用3D人体运动作为主要交互模态。为了补充自我中心视角中不可见或被截断的身体部位,我们引入了一种辅助训练监督,该监督包含了与智能体第一人称感知解耦的外源视角。这使得模型能够观察智能体相对于环境的全身定位,从而促进人-世界交互更稳健的空间基础。此外,我们提出了一种简单而有效的机制来定制自我演化的世界。这是通过在统一的世界坐标系内定义锚点视图,并结合描述局部场景动态演化的文本描述来实现的。实验结果表明,AnchorWorld显著优于最先进的基线方法,而消融研究验证了我们关键设计的有效性。值得注意的是,我们的定制方案展现出有希望的时空几何一致性,并严格遵守规定的演化动力学。

英文摘要

Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. It allows the model to observe the agent's full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

2606.07313 2026-06-08 cs.CL cs.AI 新提交

SV-Detect: AI-generated Text Detection with Steering Vectors

SV-Detect: 基于引导向量的AI生成文本检测

Mikhail Vishnyakov, Tatiana Gaintseva

发表机构 * Independent Researcher Queen Mary University of London

AI总结 提出从冻结语言模型的隐藏表示中提取引导向量,通过层间投影特征训练轻量分类器,实现跨域、跨模型和编辑攻击下的机器生成文本检测。

AI中文摘要

检测机器生成文本在分布偏移(如跨域、源模型和编辑攻击的迁移)下尤其困难。我们提出了一种基于从冻结语言模型的隐藏表示中提取的引导向量的假文本检测器。在每一层,我们构建一个分离人类编写文本和机器生成文本的方向,并通过每个输入与这些方向的逐层对齐来表示输入。在这些投影特征上训练的轻量分类器产生最终的检测分数。我们的方法在分布内和分布偏移下均表现出色,包括跨域、跨源模型以及机器编辑转换(如润色和重写)。解释分析表明,学习到的方向与可识别的风格线索一致,同时捕获了超越表面特征的显著额外信号。这些结果将假文本检测定位为表示空间探测问题,并表明引导向量提供了一种简单有效的解决方案。

英文摘要

Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
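The per-layer direction and projection features described above can be illustrated with a small sketch. The abstract does not say how the directions are constructed; the difference of class means used here is a common choice and is an assumption, as are the function names and the toy 2-dimensional hidden states.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def steering_direction(human_states, machine_states):
    """Unit vector from the human mean to the machine mean (an assumed construction)."""
    h, m = mean_vector(human_states), mean_vector(machine_states)
    diff = [mk - hk for hk, mk in zip(h, m)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def projection_features(layer_states, directions):
    """One scalar per layer: dot product of that layer's state with its direction."""
    return [sum(s * d for s, d in zip(state, direction))
            for state, direction in zip(layer_states, directions)]


# Toy hidden states for a single layer with 2-dimensional activations.
human = [[0.0, 1.0], [0.2, 0.8]]
machine = [[1.0, 0.0], [0.8, 0.2]]
direction = steering_direction(human, machine)
features = projection_features([[0.9, 0.1]], [direction])
```

With one such scalar per layer, the feature vector has as many entries as the model has layers, and a lightweight classifier (for example logistic regression) can be trained on it.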

2606.07311 2026-06-08 cs.CV cs.AI 新提交

CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

CULTURESCORE: 评估视频生成模型的文化忠实度

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

发表机构 * Massachusetts Institute of Technology Mila – Quebec AI Institute Netflix

AI总结 提出CultureScore框架,从身份、背景和行为三个维度评估视频生成的文化忠实度,实验发现当前最佳模型得分仅56.8%,行为维度最困难。

AI中文摘要

随着Veo 3.1和LTX-2等视频生成模型的进步,它们准确表现多元全球文化的能力仍是一个关键但研究不足的前沿。当前的指标如VideoScore仅衡量视觉质量,无法评估文化忠实度。因此,一个将合十礼替换为握手的模型与正确生成该手势的模型获得相同的分数。我们提出CultureScore,一个将文化忠实度分解为三个细粒度维度的组合评估框架:身份(谁被代表)、背景(文化本地化背景)和行为(规范性手势和互动)。我们通过一个覆盖10个国家的评估套件来实施该框架,在三个最先进的模型上生成了6,180个视频。我们的评估显示,当前没有模型能够实现文化忠实的视频生成:表现最好的模型整体CultureScore仅为56.8%,其中行为是最具挑战性的维度,所有模型在该维度上均低于52%。此外,人类偏好排序与CultureScore方向一致,但与VideoScore相反;在视觉质量上得分最高的模型被标注者排在最后,这强调了文化忠实度是公平视频生成的基本标准。

英文摘要

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

2606.07309 2026-06-08 cs.SD cs.AI cs.CL 新提交

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

语音情感识别中音频语言模型的声学线索对齐

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

发表机构 * DFG's Reinhart Koselleck project EU H2020 project

AI总结 研究音频语言模型中显式声学线索的对齐性,通过eGeMAPS特征提取六种可解释声学概念标记,发现对齐标记提升UAR,而错乱标记降低性能,模型对符号线索敏感但仍部分依赖音频信号。

Comments
6 pages, 3 figures, 3 tables
AI中文摘要

指令跟随音频语言模型(ALMs)可以通过显式的声学线索进行增强,但在原始音频已经可用的情况下,这些线索是否以接地的方式被使用仍不清楚。我们通过从标准化的eGeMAPS副语言特征集中推导出六个可解释的声学概念标记来研究语音情感识别(SER)中的这一问题。这些标记总结了能量、音高、动态、亮度、共振峰和语音质量,并被附加到文本提示中,同时保持音频输入不变。在广泛使用的FAU-Aibo和IEMOCAP基准测试中,对齐的标记提高了未加权平均召回率(UAR),而打乱、冲突或损坏的标记相对于对齐标记降低了性能,并将混淆转向中性。重要的是,在强标记扰动下预测不会崩溃,这表明模型对符号线索通道敏感,但部分仍锚定于音频信号。我们认为,仅标记干预提供了一种实用的方法来探测基于ALM的情感计算中音频接地线索的使用、鲁棒性和可解释性。

英文摘要

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
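Unweighted average recall (UAR), the metric reported above, is the mean of per-class recalls. The minimal sketch below uses hypothetical labels to show why it is preferred over plain accuracy on imbalanced emotion corpora; the function name is illustrative.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical labels: a model that always predicts the majority class.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here plain accuracy is high because the majority class dominates, while UAR exposes that one of the two classes is never recognized. This is also why a shift of confusions toward "neutral" shows up clearly in UAR.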

2606.07308 2026-06-08 cs.AI 新提交

Off-Policy Evaluation with Strategic Agents via Local Disclosure

通过局部披露进行具有策略性主体的离线策略评估

Kiet Q. H. Vo, Abbavaram Gowtham Reddy, Julian Rodemann, Siu Lun Chau, Krikamol Muandet

发表机构 * CISPA Helmholtz Center for Information Security LMU Munich Nanyang Technological University

AI总结 研究策略性行为下的离线策略评估,通过局部披露揭示主体策略前协变量,构建双重稳健估计器,缓解信息不对称。

AI中文摘要

我们研究了策略性行为下的离线策略评估(OPE),其中决策主体(或代理)通过策略性地修改其协变量来响应决策者的策略。这种行为导致了策略依赖的协变量偏移,打破了现有方法中协变量外生于策略的标准假设。相关工作通过施加强假设(如重复交互或完全了解代理的响应行为)来应对这一挑战,这极大地限制了其在OPE中的适用性。相比之下,我们考虑一次性OPE设置,其中决策者仅部分了解代理的响应行为。我们的关键见解是,通过事后解释披露局部信息,可以在适应之前揭示代理的策略前协变量,从而减轻策略行为引起的信息损失。利用这一结构,我们估计了代理响应的统计模型,并构建了策略值的双重稳健估计器。通过假设代理的成本敏感性服从条件对数正态分布,我们建立了所提估计量的一致性,并实证验证了我们的方法。更广泛地说,我们的结果强调了交互设计如何通过揭示代理策略响应中原本隐藏的结构来缓解信息不对称。

英文摘要

We study off-policy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker's policy by strategically modifying their covariates. Such behavior induces a policy-dependent covariate shift, breaking the standard assumption in existing methods that covariates are exogenous to the policy. Related work addresses this challenge by imposing strong assumptions such as repeated interactions or full knowledge of agents' response behavior, substantially limiting its applicability to OPE. In contrast, we consider a one-shot OPE setting where the decision maker has only partial knowledge of the agents' response behavior. Our key insight is that disclosing local information through post-hoc explanations reveals agents' pre-strategic covariates prior to adaptation, mitigating the information loss induced by strategic behavior. Leveraging this structure, we estimate a statistical model for the agents' responses and construct a doubly robust estimator for policy value. By assuming that the agents' cost sensitivity follows a conditional log-normal distribution, we establish consistency of the proposed estimator and validate our approach empirically. More broadly, our results highlight how interaction design can mitigate information asymmetry by revealing otherwise hidden structure in agents' strategic responses.
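As background for the doubly robust estimator mentioned above, the sketch below implements the textbook doubly robust value estimate for a logged contextual bandit. It assumes exogenous covariates and therefore does not model the strategic, policy-dependent covariate shift that the paper addresses. The function signature, the toy policies, and the reward models are all hypothetical.

```python
def doubly_robust_value(logs, target_policy, behavior_policy, q_hat, actions):
    """Textbook DR estimate: model-based value plus an importance-weighted residual."""
    total = 0.0
    for x, a, r in logs:
        # Direct-method term: expected modeled reward under the target policy.
        direct = sum(target_policy(b, x) * q_hat(x, b) for b in actions)
        # Correction term: importance-weighted error of the reward model.
        weight = target_policy(a, x) / behavior_policy(a, x)
        total += direct + weight * (r - q_hat(x, a))
    return total / len(logs)


# Toy setup (hypothetical): two actions, reward equals the action taken.
actions = [0, 1]
logs = [(0, 0, 0.0), (0, 1, 1.0), (1, 0, 0.0), (1, 1, 1.0)]  # (x, a, r)


def behavior(a, x):
    return 0.5  # uniform logging policy


def target(a, x):
    return 1.0 if a == 1 else 0.0  # always choose action 1


value_good_model = doubly_robust_value(logs, target, behavior, lambda x, a: float(a), actions)
value_bad_model = doubly_robust_value(logs, target, behavior, lambda x, a: 0.0, actions)
```

In this toy example both calls return the true value of the target policy: the first because the reward model is exact, the second because the known propensities correct a wrong model. That is the sense in which the estimator is doubly robust.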

2606.07303 2026-06-08 cs.LG 新提交

Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models

表征涌现的自举理论:解释不充分性作为表征学习与世界模型的驱动力

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * University of Montpellier IMT Mines Alès

AI总结 提出表征涌现自举理论(TBER),将解释不充分性视为新表征涌现的积极信号,通过五阶段递归过程驱动表征创新,应用于表征学习、世界模型和科学发现。

Comments
24 pages, 25 references. Theoretical framework relating representation learning, representational emergence, and world models
AI中文摘要

表征学习是现代机器学习的核心,实现了从手工特征到学习嵌入、潜在空间、基础模型、世界模型和数字孪生的转变。然而,大多数研究关注在选定表征框架后如何优化表征,而较少关注何时需要新的表征层次。我们引入表征涌现自举理论(TBER),这是一个描述当现有表征变得解释不充分时新表征如何出现的框架。在这种观点下,表征创新不仅由更多数据、更大模型或更强计算能力驱动,还由持续的解释差距驱动:即表征仍能描述观察但无法使其组织或变换变得可理解的情况。TBER将解释不充分性识别为表征转变的积极信号。一个表征变得不充分,并非因为它必然错误,而是因为其解释领域已被超越。自举动态遵循递归序列:观察揭示异常;异常暴露不充分性;不充分性激发新表征;这些新表征产生进一步观察和可能的新异常。我们通过五个阶段形式化这一过程:稳定观察、异常检测、解释不充分性识别、表征涌现和临时稳定。我们讨论了在表征学习、潜在空间、基础模型、世界模型、数字孪生、自适应生物系统和科学发现中的应用。TBER表明,未来AI系统可能受益于检测其内部表征解释极限的机制。

英文摘要

Representation learning is central to modern machine learning, enabling transitions from handcrafted features to learned embeddings, latent spaces, foundation models, world models, and digital twins. Yet most research examines how representations are optimized after a representational framework has been selected, while less attention is given to when a new level of representation becomes necessary. We introduce the Bootstrap Theory of Representational Emergence (TBER), a framework describing how new representations arise when existing ones become explanatorily insufficient. In this view, representational innovation is not only driven by more data, larger models, or greater computational power, but also by persistent explanatory gaps: situations in which a representation can still describe observations but can no longer make their organization or transformations intelligible. TBER identifies explanatory insufficiency as a positive signal for representational transition. A representation becomes insufficient not because it is necessarily false, but because its explanatory domain has been exceeded. The bootstrap dynamic follows a recursive sequence: observations reveal anomalies; anomalies expose insufficiencies; insufficiencies motivate new representations; and these new representations generate further observations and possible new insufficiencies. We formalize this process through five stages: stabilized observation, anomaly detection, recognition of explanatory insufficiency, representational emergence, and provisional stabilization. We discuss applications to representation learning, latent spaces, foundation models, world models, digital twins, adaptive biological systems, and scientific discovery. TBER suggests that future AI systems may benefit from mechanisms for detecting the explanatory limits of their own internal representations.

2606.07299 2026-06-08 cs.AI 新提交

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

DuMate-DeepResearch:一种具有递归搜索和基于评分准则推理的可审计多智能体系统

Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen, Xuchen Ma, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Jianmin Wu, Dawei Yin

发表机构 * Baidu AI Cloud

AI总结 提出DuMate-DeepResearch多智能体框架,通过解耦智能体核心与工具生态、引入图动态规划、递归双层执行和基于评分准则的测试时优化,在深度研究基准上取得最优结果。

Comments
Technical report by the DuMate Team. 26 pages, 6 figures, 4 tables
AI中文摘要

深度研究(DR)已成为一种新的智能体范式,用于处理复杂、开放的研究任务,要求系统能够迭代地定义问题、获取证据、验证来源并综合生成长篇报告。然而,在实践中,当前的DR系统受到四个相互关联的限制:在未明确范围上的长时规划、单智能体内分解和调度此类任务的瓶颈、长文本综合中的幻觉风险以及有限的过程可审计性。本技术报告介绍了DuMate-DeepResearch,一个基于千帆智能体构建平台构建的多智能体DR框架。该框架将负责任务理解、规划和调度的智能体核心与用于检索、证据获取和报告渲染的可扩展工具生态系统解耦,使每个中间决策和工具调用都明确可追溯。在此基础设施之上,DuMate-DeepResearch进一步引入了三种机制:(i)基于图的动态规划策略,从粗到细扩展研究路线图,并通过反思、重新规划、回溯和并行分支不断修正;(ii)递归双层执行设计,将每个复杂的搜索子任务委托给一个内部搜索智能体,该智能体运行自己的规划循环,隔离噪声检索并稳定长时执行;(iii)基于评分准则的测试时优化机制,动态生成任务特定的质量标准,并将其用作实时推理支架,用于基于证据的综合和自适应停止。在两个深度研究基准上,DuMate-DeepResearch取得了新的最先进结果:在DeepResearch Bench上获得最佳总分(58.03%),在DeepResearch Bench II上获得最佳总分(61.95%),同时在信息召回和分析方面排名第一。

英文摘要

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy that expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design that delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism that dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

2606.07293 2026-06-08 cs.SD cs.LG New submission

TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

Constantin Alexander Auga

Affiliations: Hasso Plattner Institute / University of Potsdam

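AI summary: Proposes TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on continuous emotion and operates in a compact latent space, achieving high conversion accuracy and speech quality.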

Details
Comments
5 pages, 2 figures, 2 tables, preprint
English abstract

Speech Emotion Conversion (SEC) aims to transform the emotion of a source utterance into a target emotion while preserving content and speaker identity. SEC on in-the-wild data is challenging due to the non-parallel nature of training data and complex real-world acoustics. Existing fixed-duration approaches either struggle to shift the emotion effectively (high quality, low conversion) or degrade speech naturalness (low quality, high conversion). We propose TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on speaker identity and continuous emotion. Unlike methods that diffuse over spectrograms, TargetSEC operates in a compact latent space. Experiments on the MSP-Podcast dataset show that TargetSEC outperforms current non-duration baselines in conversion accuracy while maintaining high speech quality, and achieves performance comparable to duration-prediction systems without explicit temporal modeling.

2606.07289 2026-06-08 cs.LG cs.CV New submission

Closed-Form Spectral Regularization for Multi-Task Model Merging

Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen, Chun Yuan, Peng Cui, Dacheng Tao

Affiliations: Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Sun Yat-sen University; Nanyang Technological University

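AI summary: For interference minimization in multi-task model merging, the authors find that the iterative solver actually acts as an implicit spectral regularizer. On that basis they propose SWUDI, a closed-form method based on spectral filtering, and its adaptive variant SWUDI-A, which significantly improve efficiency while matching or outperforming existing methods.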

Details
English abstract

Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
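
The filtering idea described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give SWUDI's formulas, so the code assumes a generic normal equation `A w = b` with symmetric positive semidefinite `A`, uses the standard gradient-flow filter `(1 - exp(-t * lam)) / lam` as the soft exponential filter, and keeps only the `k` largest eigenvalue directions as the hard truncation. The names `spectral_filter_solve`, `t`, and `k` are hypothetical.

```python
import numpy as np

def spectral_filter_solve(A, b, t=1.0, k=None):
    """Filtered solve of A w = b for a symmetric PSD matrix A (illustrative only).

    t : gradient-flow time for the soft exponential filter (assumed parameter).
    k : number of largest-eigenvalue directions to keep; None keeps all.
    """
    # One symmetric eigendecomposition; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals = np.clip(eigvals, 0.0, None)

    # Soft filter g(lam) = (1 - exp(-t * lam)) / lam, with limit t as lam -> 0.
    positive = eigvals > 1e-12
    safe = np.where(positive, eigvals, 1.0)
    g = np.where(positive, -np.expm1(-t * eigvals) / safe, t)

    # Hard top-k truncation: zero the filter on all but the k largest eigenvalues.
    if k is not None:
        mask = np.zeros_like(eigvals)
        if k > 0:
            mask[-k:] = 1.0
        g = g * mask

    # w = sum_j g(lam_j) * (u_j . b) * u_j
    return eigvecs @ (g * (eigvecs.T @ b))
```

Under these assumptions, the filter weight approaches the pseudoinverse weight `1 / lam` on positive-eigenvalue directions as `t` grows, which is where small eigenvalues amplify noise; a finite `t` and a small `k` damp or drop those directions.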

2606.07288 2026-06-08 cs.CV cs.GR New submission

ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

Chuanjin Fan, Lifan Wu, Wenjie Chang, Hanzhi Chang, Wenfei Yang, Tianzhu Zhang

Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory

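AI summary: Proposes the ExMesh framework, which directly optimizes explicit meshes by combining differentiable optimization with discrete topology updates. It introduces adaptive vertex splitting and merging together with real-time UV maintenance, enabling coarse-to-fine optimization that balances accuracy, efficiency, and mesh conciseness.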

详情
Comments
Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)
English abstract

Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.
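
As a rough illustration of a discrete topology update that keeps UVs in step with geometry, the sketch below splits one edge at its midpoint and gives the new vertex the averaged position and UV. It is a hypothetical example, not ExMesh's algorithm: the abstract does not specify the split criterion or the UV scheme, and the code assumes a triangle mesh with one UV coordinate per vertex (no seams). The function name `split_edge` and its arguments are illustrative.

```python
import numpy as np

def split_edge(vertices, uvs, faces, a, b):
    """Split edge (a, b) at its midpoint (illustrative only).

    vertices : (N, 3) float array of positions.
    uvs      : (N, 2) float array, one UV per vertex (assumption: no seams).
    faces    : (M, 3) int array of triangle vertex indices.
    Returns updated (vertices, uvs, faces).
    """
    new_idx = len(vertices)

    # The new vertex takes the midpoint of both position and UV,
    # so the texture mapping stays consistent across the split.
    vertices = np.vstack([vertices, 0.5 * (vertices[a] + vertices[b])])
    uvs = np.vstack([uvs, 0.5 * (uvs[a] + uvs[b])])

    new_faces = []
    for face in faces:
        face = [int(v) for v in face]
        if a in face and b in face:
            # Replace each endpoint in turn; both halves keep the original winding.
            new_faces.append([new_idx if v == b else v for v in face])
            new_faces.append([new_idx if v == a else v for v in face])
        else:
            new_faces.append(face)
    return vertices, uvs, np.array(new_faces, dtype=int)
```

Each triangle touching the split edge becomes two triangles, while all other faces are left unchanged.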