arXivDaily: arXiv daily academic digest, updated Monday to Friday
2604.23295 2026-05-26 cs.CL cs.AI

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

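Affiliations: JoshTalks

AI summary: By adapting the Moshi architecture with a custom Hindi tokeniser and training on 26,000 hours of real conversational data, the paper presents the first open, reproducible full-duplex spoken dialogue system for Hindi, exhibiting natural interruption, overlap, and backchannel behaviour.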

Abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.
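
The tokeniser swap described above can be illustrated with a minimal sketch. It assumes a toy checkpoint stored as a dictionary of nested lists; the parameter names (`text_embedding`, `text_output_head`, `audio_codebook_embedding`), the sizes, and the Gaussian initialisation are hypothetical and are not taken from Moshi or from the paper.

```python
import random

def swap_text_vocab(params, new_vocab_size, dim, text_keys, seed=0):
    """Return a new parameter dict in which vocabulary-dependent text tables
    are re-initialised for the new vocabulary and every other entry is kept."""
    rng = random.Random(seed)
    adapted = {}
    for name, value in params.items():
        if name in text_keys:
            # The table shape depends on the vocabulary, so pretrained rows
            # for the old tokeniser cannot be reused: draw fresh small values.
            adapted[name] = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                             for _ in range(new_vocab_size)]
        else:
            # Audio components keep their pretrained values untouched.
            adapted[name] = value
    return adapted

# Hypothetical toy checkpoint: names and sizes are illustrative only.
pretrained = {
    "text_embedding": [[0.1] * 4 for _ in range(10)],    # old vocabulary of 10
    "text_output_head": [[0.2] * 4 for _ in range(10)],
    "audio_codebook_embedding": [[0.3] * 4 for _ in range(6)],
}
adapted = swap_text_vocab(
    pretrained, new_vocab_size=16, dim=4,
    text_keys={"text_embedding", "text_output_head"},
)
```

In this sketch only the tables listed in `text_keys` change shape; everything else is passed through by reference, which mirrors the idea of retaining the pretrained audio components.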

2604.23017 2026-05-26 cs.LG cs.NA math.CV math.NA

Complex Stochastic Gradient Descent and Directional Bias in Reproducing Kernel Hilbert Spaces

Natanael Alpay, Emeric Battaglia

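Affiliations: Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA

AI summary: The paper proposes complex stochastic gradient descent (complex SGD), proves its convergence without analyticity constraints, confirms that directional bias extends from the real to the complex setting, and uses kernel regression in complex reproducing kernel Hilbert spaces to recover superoscillation functions and Blaschke products.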

Abstract

Stochastic Gradient Descent (SGD) is a known stochastic iterative method popular for large-scale convex optimization problems due to its simple implementation and scalability. Some objectives, such as those found in complex-valued neural networks, benefit from updates like those in SGD and Gradient Descent (GD) with a newly defined "gradient" that allows for complex parameters. This complex variant of the SGD/GD methods has already been proposed, but convergence guarantees without analyticity constraints have not yet been provided. We propose a variant of SGD (complex SGD) that allows for complex parameters, and we provide convergence guarantees under assumptions that parallel those from the real setting. Notably, these results extend to GD as well, and with the same set of assumptions, we confirm that some directional bias results extend from the real to the complex setting for kernel regression problems. We provide empirical results demonstrating the efficacy of complex SGD in kernel regression problems utilizing complex reproducing kernel Hilbert spaces. In particular, we demonstrate that we may recover superoscillation functions and Blaschke products from the Fock space and Hardy space, respectively, as the optimal functions for a particular choice of loss function.
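
To make the idea of SGD with complex parameters concrete, here is a minimal sketch on a noiseless complex least-squares problem. It assumes the standard conjugate Wirtinger derivative as the update direction, which may differ from the "gradient" defined in the paper; the data, step size, and function name `complex_sgd` are illustrative choices.

```python
import random

def complex_sgd(samples, dim, lr=0.05, epochs=200, seed=0):
    """Minimise the mean of |x.w - y|^2 over complex w with single-sample updates."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    w = [0j] * dim
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order
        for idx in order:
            x, y = samples[idx]
            residual = sum(wi * xi for wi, xi in zip(w, x)) - y
            # For L = |residual|^2, the conjugate Wirtinger derivative is
            # dL/d(conj w_i) = residual * conj(x_i); step against it.
            w = [wi - lr * residual * xi.conjugate() for wi, xi in zip(w, x)]
    return w

# Synthetic noiseless data generated from a known complex weight vector.
data_rng = random.Random(1)
w_true = [1 + 2j, -0.5 + 0.3j]
samples = []
for _ in range(20):
    x = [complex(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)) for _ in w_true]
    y = sum(wi * xi for wi, xi in zip(w_true, x))
    samples.append((x, y))

w_hat = complex_sgd(samples, dim=len(w_true))
```

The loss is real-valued but not holomorphic in `w`, which is why the update uses the derivative with respect to the conjugate variable rather than an ordinary complex derivative.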

2604.18970 2026-05-26 cs.LG cs.CR

Mechanistic Anomaly Detection via Functional Attribution

Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani

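Affiliations: School of Computing and Information Systems, The University of Melbourne, Victoria, Australia

AI summary: The paper reframes mechanistic anomaly detection as a functional attribution problem, using influence functions to measure functional coupling between test samples and a reference set; it reports state-of-the-art or significantly improved results for backdoor detection in vision models, backdoor detection in large language models, and the detection of adversarial and out-of-distribution samples.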

Comments ICML '26 Camera Ready

Abstract

We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model's output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.

2604.18396 2026-05-26 cs.CL

River-LLM: Large Language Model Seamless Exit Based on KV Share

Yingtao Shen, An Zou

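Affiliations: Shanghai Jiao Tong University

AI summary: To address the latency and precision problems that a missing KV cache causes for Early Exit in decoder-only architectures, the paper proposes River-LLM, a training-free framework that achieves seamless token-level exit through a lightweight KV-Shared Exit River and uses state transition similarity to predict KV errors and guide exit decisions, obtaining a practical speedup of 1.53 to 2.16 times on mathematical reasoning and code generation tasks while maintaining high quality.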

Comments Accepted to ACL 2026, 13 pages, with appendix. Corrected some typos

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.53 to 2.16 times while maintaining high generation quality.

2604.14054 2026-05-26 cs.LG cs.CL

$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

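Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Meituan; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: The paper proposes the $\pi$-Play framework, which uses the question construction paths generated during self-play as privileged information and combines them with self-distillation to achieve dense-feedback multi-agent co-evolution, surpassing fully supervised search agents without external data.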

Comments 23 pages, 11 figures

Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self-play can provide high-quality privileged information for the self-distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a novel multi-agent self-evolution framework combining self-play and self-distillation. In $π$-Play, an examiner generates tasks together with QCPs, and a teacher employs QCP as privileged context to densely supervise a student via self-distillation. This design transforms sparse-reward self-play into a dense-feedback co-evolution. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play. Code is available at https://github.com/zhyaoch/pi-play.
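
The contrast between a sparse outcome reward and dense supervision can be sketched as a per-token distillation loss. The example below assumes that the teacher (which sees the QCP as privileged context) and the student (which does not) each emit one logit vector per token, and that the distillation signal is a forward KL divergence; both choices are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dense_distill_loss(teacher_seq, student_seq):
    """Mean per-token KL plus the individual per-token terms."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

# Toy logits over a 3-word vocabulary for a 2-token response.
teacher_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student_seq = [[0.0, 0.0, 0.0], [0.1, 0.2, 3.0]]
mean_loss, per_token = dense_distill_loss(teacher_seq, student_seq)
```

Every token position contributes its own term, so the student receives a learning signal at each step rather than a single scalar at the end of the episode.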

2604.12383 2026-05-26 cs.SD

On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation

Changhao Cheng, Wei Wang, Wangyou Zhang, Dongya Jia, Jian Wu, Zhuo Chen, Yanmin Qian

Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Auditory Cognition and Computational Acoustics Lab; MoE Key Lab of Artificial Intelligence, AI Institute; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; ByteDance Seed, China

AI summary: The paper systematically explores how different alignment approaches in speech VAEs affect performance on reconstruction, understanding, and generation tasks, and proposes joint-marginal alignment with adaptive weighting to achieve the best overall performance.

Comments Submitted to Interspeech 2026

Abstract

Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely-used alignment approach based on time-axis distillation is optimal when considering more tasks. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on the performances over three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance while allowing for a controllable balance.

2604.11632 2026-05-26 cs.CL

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

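Affiliations: Nara Institute of Science and Technology; Liaoning Normal University

AI summary: The paper proposes the CARTBENCH benchmark, which uses four subtasks to evaluate vision-language models on recognition, reasoning, appreciation, and authenticity discrimination for Chinese artworks, and finds that existing models fall short on hard evidence linking, style-to-period inference, and authenticity diagnosis.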

Comments under review

Abstract

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

2604.11557 2026-05-26 cs.AI

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen

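Affiliations: University of Science and Technology of China; Ningbo Institute of Digital Twin; Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

AI summary: The paper proposes the UniToolCall framework, which standardizes toolset construction, dataset generation, and evaluation, combines 22k+ tools with 390k+ training instances, and introduces an Anchor Linkage mechanism; under the Hybrid-20 setting, the fine-tuned Qwen3-8B reaches 93.0% single-turn Strict Precision, outperforming GPT, Gemini, and Claude.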

Comments 21 pages, 10 figures, 9 tables. Code and datasets are publicly available at: https://github.com/EIT-NLP/UniToolCall

Abstract

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query--Action--Observation--Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.
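
A Query-Action-Observation-Answer record might be laid out roughly as follows. This is a sketch under stated assumptions: the class and field names (`Action`, `Step`, `Turn`, `steps`), the convention that calls within one step are parallel while successive steps are serial, and the tools `get_weather` and `c_to_f` are all hypothetical; the abstract does not specify the actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Action:
    """One structured function call."""
    tool: str
    arguments: Dict[str, Any]

@dataclass
class Step:
    """Calls inside one step are treated as parallel; successive steps are serial."""
    actions: List[Action]
    observations: List[Any] = field(default_factory=list)

@dataclass
class Turn:
    """One query with its action/observation steps and the final answer."""
    query: str
    steps: List[Step]
    answer: str

def is_multi_hop(turn: Turn) -> bool:
    # More than one serial step means a later call can depend on an earlier result.
    return len(turn.steps) > 1

def has_parallel_calls(turn: Turn) -> bool:
    return any(len(step.actions) > 1 for step in turn.steps)

turn = Turn(
    query="Which of Paris and Rome is warmer, and what is that in Fahrenheit?",
    steps=[
        Step([Action("get_weather", {"city": "Paris"}),
              Action("get_weather", {"city": "Rome"})], [18, 24]),
        Step([Action("c_to_f", {"celsius": 24})], [75.2]),
    ],
    answer="Rome is warmer at 24 C (75.2 F).",
)
```

A multi-turn conversation would then be a list of `Turn` objects, which gives natural units for scoring at the function-call, turn, and conversation levels.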

2604.10783 2026-05-26 cs.AI cs.LG

Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment

Daniel J. Tan, Jayne Hui Zhen Chan, Kai Wen Hwang, Arturo Yong Yao Neo, Kay Choong See, Mengling Feng

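Affiliations: Institute of Data Science, National University of Singapore, Singapore; National University Hospital, Singapore; Saw Swee Hock School of Public Health, National University of Singapore, Singapore

AI summary: The paper proposes the CN-PR framework, which uses a large language model to extract trajectory-level preferences from discharge summaries and learns a reward function through preference-based optimization, improving sepsis treatment outcomes in offline reinforcement learning.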

Abstract

Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are sparse, delayed, and difficult to explicitly specify. Although structured clinical data capture physiologic states, they often fail to reflect broader aspects of patient trajectories such as treatment response, recovery dynamics, and intervention burden. Clinical narratives, by contrast, encode longitudinal clinician assessments of disease progression, treatment effectiveness, and recovery, providing a potential source of trajectory-level supervision beyond predefined outcome metrics. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework that learns reward functions directly from discharge summaries by treating clinical narratives as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores and construct pairwise preferences between patient trajectories to learn rewards through preference-based optimization. To account for variability in narrative informativeness, we incorporate a task relevance signal that weights supervision according to its relevance to the downstream decision-making task. We evaluate CN-PR in dynamic sepsis treatment using offline RL. The learned reward demonstrated strong monotonic alignment with trajectory quality scores and produced policies associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining mortality performance comparable to outcome-based reward baselines. These findings were preserved under external validation. Our results suggest that clinical narratives provide a scalable and expressive source of supervision for reward learning in dynamic treatment regimes.
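
The step from trajectory quality scores to a preference-based reward objective can be sketched as below. The example assumes a Bradley-Terry style logistic loss on the difference of trajectory returns and an optional per-pair weight standing in for the task relevance signal; the function names, the `min_gap` threshold, and the numbers are hypothetical and not taken from the paper.

```python
import math
from itertools import combinations

def build_pairs(scores, min_gap=0.0):
    """Turn per-trajectory quality scores into (preferred, rejected) index pairs."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if abs(scores[i] - scores[j]) > min_gap:
            pairs.append((i, j) if scores[i] > scores[j] else (j, i))
    return pairs

def preference_loss(returns, pairs, weights=None):
    """Weighted mean of -log sigmoid(R_preferred - R_rejected)."""
    total, weight_sum = 0.0, 0.0
    for k, (win, lose) in enumerate(pairs):
        w = 1.0 if weights is None else weights[k]
        diff = returns[win] - returns[lose]
        total += w * math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
        weight_sum += w
    return total / weight_sum

# Hypothetical quality scores for three patient trajectories.
scores = [0.9, 0.2, 0.5]
pairs = build_pairs(scores)

# Returns under two candidate reward functions: one agrees with the scores,
# the other ranks the first two trajectories the wrong way round.
aligned_loss = preference_loss([2.0, 0.0, 1.0], pairs)
misaligned_loss = preference_loss([0.0, 2.0, 1.0], pairs)
```

A reward function whose returns order trajectories the same way as the quality scores yields the lower loss, which is the property a learned reward is trained toward.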

2604.08870 2026-05-26 cs.LG cs.AI

Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

Rafael da Silva, Jeff Eicher, Gregory Longo

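Affiliations: Applied Data Science Program; Eastern University

AI summary: Using the OULAD dataset, the study evaluates dropout risk models with a harmonized survival analysis benchmark (covering a dynamic weekly representation and a continuous-time representation) and finds that temporal behavioral features are more predictive than static background attributes.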

Comments 34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables

Abstract

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families -- tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
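
The person-period representation used in the dynamic weekly arm can be illustrated with a small sketch: each student contributes one row per week at risk, with an event flag set only in the week of dropout. The field names `last_week` and `dropped_out` and the three toy students are hypothetical and are not OULAD column names.

```python
def to_person_period(students):
    """Expand one record per student into one row per student-week at risk."""
    rows = []
    for s in students:
        for week in range(1, s["last_week"] + 1):
            # The event is 1 only in the final observed week of a student who
            # dropped out; censored students never receive an event row.
            event = int(s["dropped_out"] and week == s["last_week"])
            rows.append({"id": s["id"], "week": week, "event": event})
    return rows

def weekly_hazard(rows):
    """Empirical discrete-time hazard: events divided by students at risk, per week."""
    at_risk, events = {}, {}
    for r in rows:
        at_risk[r["week"]] = at_risk.get(r["week"], 0) + 1
        events[r["week"]] = events.get(r["week"], 0) + r["event"]
    return {week: events[week] / at_risk[week] for week in sorted(at_risk)}

students = [
    {"id": "A", "last_week": 3, "dropped_out": True},
    {"id": "B", "last_week": 2, "dropped_out": False},  # censored after week 2
    {"id": "C", "last_week": 2, "dropped_out": True},
]
rows = to_person_period(students)
hazard = weekly_hazard(rows)
```

Once the data is in this long format, a discrete-time survival model reduces to a binary or count model fitted on the rows, which is what makes time-varying weekly features straightforward to include.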

2604.08213 2026-05-26 cs.CV cs.AI

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

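Affiliations: Peking University; The Chinese University of Hong Kong, Shenzhen; Tsinghua University; Beihang University; Xiaohongshu Inc.

AI summary: The paper proposes EditCaption, a two-stage post-training pipeline that improves image editing instruction synthesis through human-refined SFT and Hardness-Adaptive Error-Aware DPO (HAE-DPO), significantly reducing the critical error rate and outperforming existing models.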

Abstract

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis. First, we construct a 100K supervised fine-tuning dataset through GLM-based auto-captioning, EditScore filtering, and human refinement. Second, we collect 10K human-annotated preference pairs, where each rejected instruction is labeled with its primary error type and severity. Based on this dataset, we propose Hardness-Adaptive Error-Aware DPO (HAE-DPO), a task-adapted DPO objective that introduces an adaptive margin based on human-labeled severity, failure-mode type, and reference-model hardness. Experiments across three benchmarks demonstrate that our 235B model with SFT+HAE-DPO achieves state-of-the-art performance among open-source and closed models, scoring 4.720 on Eval-400, 4.672 on HQ-Edit, and 4.651 on ByteMorph-Bench -- surpassing Gemini-3-Pro on all three. Human evaluation confirms critical error rates drop from 47.75% to 17.50%, with correct rates improving from 41.75% to 70.25%, surpassing Gemini-3-Pro (66.00%).
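
The idea of a DPO objective with an adaptive margin can be sketched as follows. The base term is the standard DPO logistic loss on the policy-versus-reference log-probability gap; the way severity, failure-mode weight, and hardness are combined in `adaptive_margin` is purely an assumption for illustration, since the abstract does not give the HAE-DPO formula.

```python
import math

def dpo_margin_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                    beta=0.1, margin=0.0):
    """-log sigmoid(beta * delta - margin), where delta is the policy-vs-reference
    log-probability gap between the chosen and the rejected instruction."""
    delta = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * delta - margin
    return math.log1p(math.exp(-z))  # equals -log(sigmoid(z))

def adaptive_margin(severity, mode_weight, hardness, scale=0.5):
    """Hypothetical combination: larger for severe errors and for hard pairs."""
    return scale * severity * mode_weight * (1.0 + hardness)

# Sequence log-probabilities for one preference pair (illustrative values).
plain_loss = dpo_margin_loss(-10.0, -12.0, -11.0, -11.0)
margin = adaptive_margin(severity=2.0, mode_weight=1.0, hardness=0.5)
margin_loss = dpo_margin_loss(-10.0, -12.0, -11.0, -11.0, margin=margin)
```

With a positive margin the same log-probability gap incurs a larger loss, so pairs with severe or hard errors push the policy to separate chosen and rejected instructions more strongly; setting `margin=0.0` recovers the standard DPO term.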

2604.07039 2026-05-26 cs.RO cs.AI

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Affiliations: School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University

AI summary: The paper proposes the AEROS architecture, which models a robot as a single persistent intelligent subject whose capabilities are extended through installable Embodied Capability Modules, enabling modular extensibility, composable capability execution, and consistent system-level safety.

Comments Submitted to Engineering Applications of Artificial Intelligence (EAAI). 48 pages, 5 figures, 9 tables

Abstract

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (BehaviorTree.CPP-style and ProgPrompt-style at 92--93%, flat pipeline at 67--73%), the policy layer blocks all invalid actions with zero false acceptances, runtime benefits generalize across tasks without task-specific tuning, and ECMs load at runtime with 100% post-swap success.
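
The separation between installable capability modules and a policy-enforcing runtime can be sketched in a few lines. All names here (`ECM`, `Runtime`, `max_force_policy`, the `gripper` module and its `grasp` skill) are hypothetical stand-ins and do not reflect the AEROS reference implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class ECM:
    """A capability module: a name plus the skills it makes available."""
    name: str
    skills: Dict[str, Callable[..., str]]

class Runtime:
    """Single-agent runtime that checks every call against a policy first."""

    def __init__(self, policy: Callable[[str, str, Dict[str, Any]], bool]):
        self.policy = policy
        self.modules: Dict[str, ECM] = {}

    def install(self, ecm: ECM) -> None:
        self.modules[ecm.name] = ecm

    def uninstall(self, name: str) -> None:
        self.modules.pop(name, None)

    def execute(self, module: str, skill: str, **kwargs: Any) -> str:
        if module not in self.modules or skill not in self.modules[module].skills:
            return "unavailable"
        # The policy lives in the runtime, not in the module, so a newly
        # installed module cannot bypass it.
        if not self.policy(module, skill, kwargs):
            return "blocked"
        return self.modules[module].skills[skill](**kwargs)

def max_force_policy(module: str, skill: str, kwargs: Dict[str, Any]) -> bool:
    return kwargs.get("force", 0) <= 20

runtime = Runtime(policy=max_force_policy)
runtime.install(ECM("gripper", {"grasp": lambda force: f"grasp with force {force}"}))
allowed = runtime.execute("gripper", "grasp", force=5)
blocked = runtime.execute("gripper", "grasp", force=50)

# Hot-swap: replace the module at runtime; the same policy still applies.
runtime.uninstall("gripper")
runtime.install(ECM("gripper", {"grasp": lambda force: f"soft grasp at {force}"}))
swapped = runtime.execute("gripper", "grasp", force=5)
still_blocked = runtime.execute("gripper", "grasp", force=50)
```

Because the safety check sits in `Runtime.execute`, swapping a module changes what the robot can do without changing what it is allowed to do.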

2604.05550 2026-05-26 cs.CL cs.CE

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

Affiliations: Department of Electronic Engineering, BNRist, Tsinghua University; Zhongguancun Academy; Peking University; University of Science and Technology of China

AI summary: The paper proposes AutoSOTA, a system with a multi-agent architecture that fully automates the path from paper reproduction to model optimization and discovers 105 new SOTA models that surpass the original methods.

Abstract

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

2604.04707 2026-05-26 cs.CV

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

OpenWorldLib: 高级世界模型的统一代码库与定义

DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Juanxi Tian, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang

发表机构 * Peking University(北京大学) Zhongguancun Academy(中关村学院) Tsinghua University(清华大学) National University of Singapore(新加坡国立大学) Shanghai Jiao Tong University(上海交通大学) Sun Yat-sen University(中山大学) Beijing Key Laboratory of Data Intelligence and Security(北京数据智能与安全重点实验室) Nanyang Technological University(南洋理工大学)

AI总结 本文提出OpenWorldLib框架,基于对世界模型演化的分析给出清晰定义,并系统分类其核心能力,实现多任务模型的统一集成与高效推理。

Comments 28 pages, 6 figures

详情
AI中文摘要

世界模型作为人工智能中一个前景广阔的研究方向已引起广泛关注,但仍缺乏清晰统一的定义。本文中,我们介绍了OpenWorldLib,一个针对高级世界模型的全面且标准化的推理框架。借鉴世界模型的演化,我们提出一个明确的定义:世界模型是以感知为中心、具备交互和长期记忆能力、用于理解和预测复杂世界的模型或框架。我们进一步系统性地分类了世界模型的基本能力。基于这一定义,OpenWorldLib将不同任务的模型集成在统一框架内,实现高效复用和协同推理。最后,我们对世界模型研究的潜在未来方向提出了额外的思考和分析。代码链接:https://github.com/OpenDCAI/OpenWorldLib

英文摘要

World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib

2604.03318 2026-05-26 cs.CV

EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

EgoMind: 通过多模态大语言模型中的语言推理激活空间认知

Zhenghao Chen, Huiqun Wang, Di Huang

发表机构 * State Key Laboratory of Complex and Critical Software Environment(复杂与关键软件环境国家重点实验室) School of Computer Science and Engineering(计算机科学与工程学院)

AI总结 提出EgoMind框架,通过角色扮演描述和渐进空间分析的无几何空间推理方法,仅用少量数据即可在多基准测试中提升MLLMs的空间推理能力。

Comments Accepted by CVPR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地应用于空间认知任务,期望它们能够理解并与复杂环境交互。现有工作大多通过引入3D先验或几何监督来改进空间推理,这虽然提升了性能,但带来了大量的数据准备和对齐成本。相比之下,纯2D方法由于捕获跨帧空间关系的能力有限,往往在多帧空间推理中表现不佳。为了解决这些限制,我们提出了EgoMind,一个思维链框架,通过角色扮演描述(联合构建跨帧的一致语言场景图)和渐进空间分析(逐步推理任务特定问题)实现无几何空间推理。仅使用5K自动生成的SFT样本和20K RL样本,EgoMind在VSI-Bench、SPAR-Bench、SITE-Bench和SPBench上取得了有竞争力的结果,证明了其在增强MLLMs空间推理能力方面的有效性,并突出了语言推理在空间认知中的潜力。代码和数据已发布在https://github.com/Hyggge/EgoMind。

英文摘要

Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, SITE-Bench, and SPBench, demonstrating its effectiveness in strengthening the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition. Code and data are released at https://github.com/Hyggge/EgoMind.

2604.00499 2026-05-26 cs.LG

Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

具有不确定性感知输出长度预测的LLM推理调度

Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

发表机构 * School of Computer Science, Wuhan University, Wuhan, China(武汉大学计算机学院) School of Computer Science, Central China Normal University, Wuhan, China(华中师范大学计算机学院) Dameng Database Co., Ltd., Wuhan, China(达梦数据库有限公司) School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China(上海交通大学人工智能学院) Institute for Math and AI, Wuhan University, Wuhan, China(武汉大学数学与人工智能研究院)

AI总结 针对LLM推理调度,提出基于对数t分布的输出长度分布模型,并设计Tail Inflated Expectation (TIE)指标替代点估计,以降低在线推理延迟并提高离线吞吐量。

Comments Accepted at ICML 2026

详情
AI中文摘要

为了调度LLM推理,最短作业优先(SJF)原则通过优先处理输出长度短的请求来避免队头(HOL)阻塞。现有方法通常为每个请求预测单个输出长度以促进调度。我们认为这种点估计与LLM推理的随机解码过程不匹配,其中输出长度本质上是不确定的,由何时采样到序列结束(EOS)标记决定。因此,每个请求的输出长度应拟合为分布而非单个值。通过对经验数据和随机解码过程的深入分析,我们观察到输出长度服从重尾分布,并可以用对数t分布拟合。在此基础上,我们提出一个简单的指标,称为Tail Inflated Expectation (TIE),用于替换SJF调度中的输出长度,该指标通过对数t分布的期望进行尾部概率调整,以考虑请求生成长输出的风险。为了评估我们的TIE调度器,我们将其与三个强基线进行比较,结果表明,TIE将在线推理的每token延迟降低了$2.31\times$,并将离线数据生成的吞吐量提高了$1.42\times$。

英文摘要

To schedule LLM inference, the shortest job first (SJF) principle is favorable because it prioritizes requests with short output lengths and thereby avoids head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.
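The difference between ordering requests by a point estimate and ordering them by a tail-aware score can be sketched with a toy example. The scoring rule below (expected length inflated by the probability mass above a threshold) is a hypothetical stand-in chosen for illustration; the abstract does not give the TIE formula, and the discrete distributions, `threshold`, and `tail_weight` are all assumptions.

```python
def expected_length(dist):
    """Mean output length of a discrete distribution [(length, prob), ...]."""
    return sum(length * prob for length, prob in dist)


def tail_mass(dist, threshold):
    """Probability that the output length exceeds the threshold."""
    return sum(prob for length, prob in dist if length > threshold)


def risk_adjusted_score(dist, threshold=500, tail_weight=2.0):
    # Hypothetical rule, NOT the paper's TIE definition:
    # inflate the mean in proportion to the tail probability.
    return expected_length(dist) * (1.0 + tail_weight * tail_mass(dist, threshold))


# Two illustrative requests: one predictable, one usually short but sometimes very long.
requests = {
    "steady": [(100, 1.0)],
    "risky": [(40, 0.9), (560, 0.1)],
}

# SJF on the mean alone schedules the risky request first.
by_mean = sorted(requests, key=lambda name: expected_length(requests[name]))

# The tail-aware score pushes the risky request behind the steady one.
by_risk = sorted(requests, key=lambda name: risk_adjusted_score(requests[name]))
```

In this toy setup the mean-based order puts the risky request first because its mean is lower, while the tail-aware order reverses the two. That reversal is the kind of behaviour the abstract attributes to accounting for long-output risk.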

2603.28716 2026-05-26 cs.AI

Dynamic Dual-Granularity Skill Bank for Agentic RL

动态双粒度技能库用于智能体强化学习

Songjun Tu, Chengdong Xu, Qichao Zhang, Yaocheng Zhang, Xiangyuan Lan, Linjing Li, Dong Li, Dongbin Zhao

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Pengcheng Laboratory(鹏城实验室) Sun Yat-Sen University(中山大学) MemoraX AI

AI总结 提出D2Skill,一种动态双粒度技能库,通过任务技能和步骤技能分别提供高层指导和细粒度决策支持,利用配对基线回放和技能注入回放的性能差距更新技能和优化策略,在ALFWorld等任务上显著提升性能。

Comments 19 pages

详情
AI中文摘要

智能体强化学习可以从可重复使用的经验中显著受益,但现有的基于技能的方法主要提取轨迹级指导,并且通常缺乏维护不断演化的技能记忆的原则性机制。我们提出D2Skill,一种用于智能体强化学习的动态双粒度技能库,将可重复使用的经验组织成任务技能(用于高层指导)和步骤技能(用于细粒度决策支持和错误纠正)。D2Skill通过在同一策略下进行配对的基线回放和技能注入回放,利用它们的性能差距推导出事后效用信号,用于技能更新和策略优化。技能库完全由训练时的经验构建,通过反思不断扩展,并通过效用感知的检索和修剪进行维护。在ALFWorld、WebShop和Search-Augmented QA任务上的实验表明,D2Skill在不同规模的模型上显著优于无技能的基线。进一步的消融和分析表明,双粒度技能建模和动态技能维护对这些增益至关重要,而学习到的技能表现出更高的效用,能够跨评估设置迁移,并且仅带来适度的训练开销。

英文摘要

Agentic RL can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance and often lack principled mechanisms for maintaining an evolving skill memory. We propose D2Skill, a dynamic dual-granularity skill bank for agentic RL that organizes reusable experience into task skills for high-level guidance and step skills for fine-grained decision support and error correction. D2Skill jointly trains the policy and skill bank through paired baseline and skill-injected rollouts under the same policy, using their performance gap to derive hindsight utility signals for both skill updating and policy optimization. Built entirely from training-time experience, the skill bank is continuously expanded through reflection and maintained with utility-aware retrieval and pruning. Experiments on ALFWorld, WebShop, and Search-Augmented QA tasks show that D2Skill substantially improves performance over skill-free baselines across models of different scales. Further ablations and analyses show that both dual-granularity skill modeling and dynamic skill maintenance are critical to these gains, while the learned skills exhibit higher utility, transfer across evaluation settings, and introduce only modest training overhead.

2603.18908 2026-05-26 cs.AI

Characterizing Linear Alignment Across Language Models

表征语言模型间的线性对齐

Matt Gorbett, Suman Jana

发表机构 * Independent Researcher(独立研究者) Department of Computer Science(计算机科学系) Columbia University(哥伦比亚大学)

AI总结 研究独立训练的大语言模型间是否存在线性对齐,并探索其在文本生成、嵌入分类、分布外检测及隐私保护跨孤岛推理中的应用。

详情
AI中文摘要

语言模型似乎越来越多地学习到相似的表示,尽管训练目标、架构和数据模态存在差异。这种独立训练模型之间新兴的兼容性为跨模型对齐下游目标带来了新的机会。此外,这种能力解锁了新的潜在应用领域,例如在安全、隐私或竞争约束禁止直接数据或模型共享的场景中。在这项工作中,我们研究了表示收敛在多大程度上实现了大语言模型之间的实用线性对齐。具体来说,我们学习独立模型最终隐藏状态之间的仿射变换,并在文本生成、嵌入分类和分布外检测中经验性地评估这些映射。我们发现,模型对之间的性能基本保持不变,并首次证明线性对齐有时能够实现跨独立训练模型的文本生成。我们进一步强调了线性对齐在隐私保护跨孤岛推理中的潜在应用。该框架在共享公共数据集上学习仿射变换,并使用同态加密来保护客户端查询。通过仅加密线性分类操作,该方法实现了亚秒级推理延迟。

英文摘要

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, this capability unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we investigate the extent to which representational convergence enables practical linear alignment between large language models. Specifically, we learn affine transformations between the final hidden states of independent models and empirically evaluate these mappings across text generation, embedding classification, and out-of-distribution detection. We find that performance is largely preserved across model pairs, and show for the first time that linear alignment sometimes enables text generation across independently trained models. We further highlight a potential application of linear alignment for privacy-preserving cross-silo inference. The framework learns an affine transformation over a shared public dataset and uses homomorphic encryption to protect client queries. By encrypting only the linear classification operation, the method achieves sub-second inference latency.
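The affine-map step described above can be sketched as an ordinary least-squares fit. This is a minimal illustration on synthetic data, assuming paired hidden states are available as two matrices with one row per shared input. The function name `fit_affine`, the dimensions, and the random data are assumptions, and the abstract does not say which solver the authors use.

```python
import numpy as np


def fit_affine(source, target):
    """Least-squares fit of target ≈ source @ W + b."""
    ones = np.ones((source.shape[0], 1))
    augmented = np.hstack([source, ones])  # extra column of ones absorbs the bias
    solution, *_ = np.linalg.lstsq(augmented, target, rcond=None)
    return solution[:-1], solution[-1]  # W has shape (d_source, d_target), b has shape (d_target,)


rng = np.random.default_rng(0)

# Synthetic stand-ins for final hidden states of two models on the same 200 inputs.
hidden_a = rng.normal(size=(200, 8))
true_w = rng.normal(size=(8, 5))
true_b = rng.normal(size=5)
hidden_b = hidden_a @ true_w + true_b  # exactly affine by construction

w, b = fit_affine(hidden_a, hidden_b)
mapped = hidden_a @ w + b
max_error = float(np.abs(mapped - hidden_b).max())
```

Because the synthetic target is exactly affine in the source, the fit recovers it to numerical precision. Real hidden states would leave a residual, and the abstract evaluates that residual's downstream effect through generation, classification, and out-of-distribution detection.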

2603.16105 2026-05-26 cs.CL cs.AI

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

频率至关重要:用于剪枝和量化的快速模型无关数据筛选

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

发表机构 * University of Trento(特伦托大学)

AI总结 提出一种基于Zipf幂律的模型无关数据筛选策略ZipCal,通过最大化词汇多样性来选择校准数据,在剪枝和量化中实现与依赖模型困惑度的最先进方法相当的性能,且速度快约240倍。

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages

详情
AI中文摘要

训练后模型压缩对于增强大型语言模型(LLMs)的可移植性同时保持其性能至关重要。虽然已经提出了几种压缩方法,但较少关注选择最合适的数据集(所谓的校准数据)来寻找压缩模型配置。校准数据的选择是保留模型在任务内和任务间能力的关键步骤。在这项工作中,我们通过分析内在数据属性而非模型特定信号,解决了为剪枝和量化识别高性能校准集的挑战。我们引入了ZipCal,一种基于Zipf幂律最大化词汇多样性的模型无关数据筛选策略。实验表明,我们的方法在各种剪枝基准测试中始终优于标准的均匀随机采样。值得注意的是,在下游性能方面,它与依赖模型困惑度的最先进方法表现相当。后者在大规模模型和数据集上变得极其昂贵,而ZipCal由于其可处理的线性复杂度,平均快约240倍。

英文摘要

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large-scale models and datasets, while ZipCal is on average $\sim$240$\times$ faster due to its tractable linear complexity. The code and the experiments are available at https://github.com/FrancescoMonaco/ZipCal.

2603.12983 2026-05-26 cs.CL cs.AI

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

人工标注是否必要?用于机器翻译错误跨度检测的迭代MBR蒸馏

Boxuan Lyu, Haiyue Song, Zhi Qu

发表机构 * Institute of Science Tokyo(东京科学大学) National Institute of Information and Communications Technology(日本情报通信研究机构)

AI总结 提出一种基于最小贝叶斯风险解码的迭代MBR蒸馏自演化框架,利用现成大语言模型生成伪标签,无需人工标注即可在错误跨度检测任务上超越监督基线。

详情
AI中文摘要

错误跨度检测(ESD)是机器翻译(MT)评估中的一个关键子任务,旨在识别翻译错误的位置和严重程度。虽然对人工标注数据微调模型能提升ESD性能,但获取此类数据成本高昂且标注者之间容易不一致。为解决这一问题,我们提出一种基于最小贝叶斯风险(MBR)解码的新型自演化框架,命名为用于ESD的迭代MBR蒸馏,该框架通过利用现成的大语言模型(LLM)生成伪标签,消除了对人工标注的依赖。在WMT Metrics Shared Task数据集上的大量实验表明,仅在这些自生成伪标签上训练的模型在系统和跨度层面上均优于未适应的基础模型和基于人工标注的有监督基线,同时保持有竞争力的句子级性能。

英文摘要

Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
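MBR decoding, which the framework builds on, picks the candidate that agrees most with the other sampled candidates under some utility function. The sketch below shows that selection step for error-span annotations. The character-level F1 utility, the `(start, end)` span representation, and the example candidates are assumptions for illustration; the abstract does not specify the utility the authors use.

```python
def span_f1(predicted, reference):
    """Character-level F1 between two lists of half-open (start, end) spans."""
    pred_chars = {i for start, end in predicted for i in range(start, end)}
    ref_chars = {i for start, end in reference for i in range(start, end)}
    if not pred_chars and not ref_chars:
        return 1.0  # both say "no errors": perfect agreement
    overlap = len(pred_chars & ref_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility):
    """Return the candidate with the highest total utility against all others."""
    def total_utility(index):
        return sum(
            utility(candidates[index], other)
            for j, other in enumerate(candidates)
            if j != index
        )
    best_index = max(range(len(candidates)), key=total_utility)
    return candidates[best_index]


# Four hypothetical annotations sampled for the same translation.
candidates = [
    [(0, 5)],
    [(0, 6)],
    [(10, 12)],
    [(0, 5), (10, 11)],
]

selected = mbr_select(candidates, span_f1)
```

Here the selected annotation is the one that overlaps with every other candidate at least partially. In a distillation loop such a consensus output could serve as a pseudo-label, though the iteration scheme itself is not reproduced here.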

2603.09943 2026-05-26 cs.AI

PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

PathMem: 面向病理学多模态大模型的认知对齐记忆转换

Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu

发表机构 * University of Science and Technology of China(中国科学技术大学) Shenzhen University(深圳大学) Nanyang Technological University(南洋理工大学) Imperial College London(帝国理工学院) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出PathMem框架,通过长期记忆与工作记忆的动态转换机制,实现结构化病理知识整合与可解释记忆控制,在WSI报告生成和开放诊断任务上达到SOTA。

详情
AI中文摘要

计算病理学需要视觉模式识别与结构化领域知识(包括分类学、分级标准和临床证据)的动态整合。在实践中,诊断推理需要将形态学证据与正式诊断和分级标准联系起来。尽管多模态大语言模型(MLLMs)展现出强大的视觉语言推理能力,但它们缺乏结构化知识整合和可解释记忆控制的显式机制。因此,现有模型在推理过程中难以一致地融入病理学特定的诊断标准。受人类病理学家层级记忆过程的启发,我们提出了PathMem,一种面向病理学MLLMs的以记忆为中心的多模态框架。PathMem将结构化病理知识组织为长期记忆(LTM),并引入记忆变换器(Memory Transformer),通过多模态记忆激活和上下文感知知识接地建模从LTM到工作记忆(WM)的动态转换,从而实现用于下游推理的上下文感知记忆细化。PathMem在多个基准测试中达到SOTA性能,在WSI-Bench报告生成(WSI-Precision提升12.8%,WSI-Relevance提升10.1%)和开放式诊断任务上分别比先前的基于WSI的模型提升9.7%和8.9%。

英文摘要

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.

2603.09581 2026-05-26 cs.LG

Towards Understanding Adam Convergence on Highly Degenerate Polynomials

理解Adam在高度退化多项式上的收敛性

Zhiwei Bai, Jiajie Zhao, Zhangchen Zhou, Zhi-Qin John Xu, Yaoyu Zhang

发表机构 * Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University(上海交通大学自然科学研究院) School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Shanghai Seres Information Technology Co., Ltd, Shanghai 200040, China(上海塞瑞斯信息技术有限公司)

AI总结 本文研究Adam优化器在高度退化多项式上的自动收敛性质,推导局部渐近稳定性条件,证明其线性收敛速度优于梯度下降和动量法,并刻画超参数相图。

Comments Accepted to ICML 2026

详情
AI中文摘要

Adam是深度学习中广泛使用的优化算法,然而其具有固有优势的目标函数类别仍未被充分探索。与先前需要外部调度器和$\beta_2$接近1才能收敛的研究不同,本文研究了Adam的“自然”自动收敛性质。我们识别了一类高度退化多项式,Adam无需额外调度器即可自动收敛。具体地,我们推导了退化多项式上局部渐近稳定性的理论条件,并展示了理论界限与实验结果之间的强一致性。我们证明Adam在这些退化函数上实现局部线性收敛,显著优于梯度下降和动量法的次线性收敛。这种加速源于第二矩$v_t$与平方梯度$g_t^2$之间的解耦机制,该机制指数级放大有效学习率。最后,我们刻画了Adam的超参数相图,识别出三种不同的行为区域:稳定收敛、尖峰和类似SignGD的振荡。

英文摘要

Adam is a widely used optimization algorithm in deep learning, yet the specific class of objective functions where it exhibits inherent advantages remains underexplored. Unlike prior studies requiring external schedulers and $\beta_2$ near 1 for convergence, this work investigates the "natural" auto-convergence properties of Adam. We identify a class of highly degenerate polynomials where Adam converges automatically without additional schedulers. Specifically, we derive theoretical conditions for local asymptotic stability on degenerate polynomials and demonstrate strong alignment between theoretical bounds and experimental results. We prove that Adam achieves local linear convergence on these degenerate functions, significantly outperforming the sub-linear convergence of Gradient Descent and Momentum. This acceleration stems from a decoupling mechanism between the second moment $v_t$ and squared gradient $g_t^2$, which exponentially amplifies the effective learning rate. Finally, we characterize Adam's hyperparameter phase diagram, identifying three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.
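A small harness for comparing gradient descent and Adam on a degenerate polynomial can be sketched as follows. The objective $f(x) = x^4$, the starting point, the step count, and the hyperparameters are illustrative assumptions rather than the paper's experimental setup, and the sketch makes no claim about which of the three regimes these settings fall into. The Adam update is the standard bias-corrected form.

```python
import math


def grad(x):
    """Derivative of the degenerate objective f(x) = x**4."""
    return 4.0 * x ** 3


def run_gd(x, lr, steps):
    """Plain gradient descent with a constant learning rate."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def run_adam(x, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam with bias correction and no learning-rate scheduler."""
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Illustrative settings only; vary beta2 and lr to explore different behaviours.
gd_final = run_gd(1.0, lr=0.01, steps=2000)
adam_final = run_adam(1.0, lr=0.01, steps=2000)
```

For gradient descent the iterate shrinks roughly like $1/\sqrt{1 + 0.08t}$ in this setting, which is the sub-linear behaviour the abstract refers to. Sweeping `beta2` and `lr` in `run_adam` is one way to probe the regimes the abstract describes.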

2603.08072 2026-05-26 cs.LG

Hybrid Quantum Neural Network for Multivariate Clinical Time Series Forecasting

混合量子神经网络用于多变量临床时间序列预测

Irene Iele, Floriano Caprio, Paolo Soda, Matteo Tortora

发表机构 * Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University, Sweden(诊断与介入部门,辐射物理,生物医学工程,乌梅大学,瑞典) Department of Naval, Electrical, Electronics and Telecommunications Engineering, University of Genoa, Italy(海军、电气、电子与电信工程部门,热那亚大学,意大利)

AI总结 提出一种混合量子-经典架构,将变分量子电路集成到循环神经网络中,用于多变量生理时间序列的多步预测,在BIDMC数据集上表现出与基线相当的精度和更强的鲁棒性。

详情
AI中文摘要

预测生理信号可以通过预期患者状态的临界变化来支持主动监测和及时的临床干预。在这项工作中,我们通过联合预测心率、血氧饱和度、脉搏率和呼吸率在15、30和60秒的预测时域上,解决了生理时间序列的多变量多步预测问题。我们提出了一种混合量子-经典架构,将变分量子电路(VQC)集成到循环神经骨干中。GRU编码器将历史观察窗口总结为潜在表示,然后将其投影到用于参数化VQC的量子角度上。量子层作为可学习的非线性特征混合器,在最终预测阶段之前建模跨变量交互。我们在BIDMC PPG和呼吸数据集上采用留一患者方案评估了所提出的方法。结果显示,与经典和深度学习基线相比,该方法具有竞争性的精度,同时对噪声和缺失输入具有更强的鲁棒性。这些发现表明,混合量子层可以为小队列临床环境中的生理时间序列预测提供有用的归纳偏置。代码可在https://github.com/arco-group/quantum-ml获取。

英文摘要

Forecasting physiological signals can support proactive monitoring and timely clinical intervention by anticipating critical changes in patient status. In this work, we address multivariate multi-horizon forecasting of physiological time series by jointly predicting heart rate, oxygen saturation, pulse rate, and respiratory rate at forecasting horizons of 15, 30, and 60 seconds. We propose a hybrid quantum-classical architecture that integrates a Variational Quantum Circuit (VQC) within a recurrent neural backbone. A GRU encoder summarizes the historical observation window into a latent representation, which is then projected into quantum angles used to parameterize the VQC. The quantum layer acts as a learnable non-linear feature mixer, modeling cross-variable interactions before the final prediction stage. We evaluate the proposed approach on the BIDMC PPG and Respiration dataset under a Leave-One-Patient-Out protocol. The results show competitive accuracy compared with classical and deep learning baselines, together with greater robustness to noise and missing inputs. These findings suggest that hybrid quantum layers can provide useful inductive biases for physiological time series forecasting in small-cohort clinical settings. The code is available at https://github.com/arco-group/quantum-ml.

2603.06798 2026-05-26 cs.LG cs.DC stat.ML

NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

NEST: 面向分布式深度学习的网络与内存感知设备放置

Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Washington(华盛顿大学)

AI总结 提出NEST框架,通过结构化动态规划统一模型并行、拓扑建模和内存可行性,在多种硬件和网络上实现高达2.43倍的吞吐量提升。

Comments Accepted to MLSys 2026

详情
AI中文摘要

深度学习规模的不断增长要求分布式训练框架能够联合考虑并行性、内存和网络拓扑。先前的工作通常依赖启发式或拓扑无关的搜索,分别处理通信和内存。由于缺乏每设备内存感知,这些方法通常事后通过将参数和激活分片到多个设备上来确保可行性,从而增加同步、扩大通信、降低计算利用率,限制了实际数据中心网络上的可扩展性和效率。我们提出了NEST,一个网络、计算和内存感知的设备放置框架,通过结构化动态规划统一了模型并行、拓扑建模和内存可行性。NEST的动态规划在具有张量和专家并行配置、跨层次或任意网络的显式allreduce延迟以及内存/计算轮廓的算子图上运行。通过跨张量、流水线、数据和专家维度分解并行性,NEST为混合策略定义了一个原则性的搜索空间,同时联合优化共置、网络延迟和内存可行性。在多种硬件和网络上的评估表明,与最先进的基线相比,NEST实现了高达2.43倍的吞吐量提升、更好的内存效率和可扩展性,为下一代AI基础设施的并行化策略和数据中心互连协同设计提供了基础。NEST的源代码可在https://github.com/scai-tech/Nest获取。

英文摘要

The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute, which limits scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at: https://github.com/scai-tech/Nest

2603.06687 2026-05-26 cs.CV cs.CL cs.ET cs.MM cs.RO

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot: 在真实世界场景中评估视觉语言模型的地理时间理解能力

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

发表机构 * Computational Intelligence and Operations Laboratory (CIOL), Bangladesh(计算智能与运筹实验室(CIOL),孟加拉国) Shahjalal University of Science and Technology (SUST), Sylhet, Bangladesh(沙阿贾拉勒科技大学(SUST),锡尔赫特,孟加拉国) North South University (NSU), Dhaka, Bangladesh(南北大学(NSU),达卡,孟加拉国) Qatar Computing Research Institute (QCRI), Doha, Qatar(卡塔尔计算研究所(QCRI),多哈,卡塔尔)

AI总结 提出TimeSpot基准,通过1,455张全球图像评估视觉语言模型在时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)上的推理能力,发现现有模型性能低下,尤其时间推理不足。

Comments Accepted to ICML 2026

详情
AI中文摘要

地理时间理解,即仅从视觉输入推断位置、时间和上下文属性的能力,支撑着灾害管理、交通规划、具身导航、世界建模和地理教育等应用。尽管最近的视觉语言模型(VLM)利用地标和路标等线索在图像地理定位方面取得了进展,但它们推理时间信号和物理基础空间线索的能力仍然有限。为弥补这一差距,我们引入了TimeSpot,一个用于评估VLM在真实世界中进行地理时间推理的基准。TimeSpot包含来自80个国家的1,455张地面图像,要求直接从视觉证据中结构化预测时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)。它还包括时空推理任务,测试在真实世界不确定性下的物理合理性。对最先进的开源和闭源VLM的评估显示性能低下,尤其是时间推理。虽然监督微调带来了改进,但结果仍不充分,凸显了需要新方法来实现稳健的、基于物理的地理时间理解。TimeSpot可在 https://TimeSpot-GT.github.io 获取。

英文摘要

Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at: https://TimeSpot-GT.github.io.

2603.06218 2026-05-26 cs.RO

Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling

少样本神经可微模拟器:真实到模拟的刚体接触建模

Zhenhao Huang, Siyuan Luo, Bingyang Zhou, Ziqiu Zeng, Jason Pho, Fan Shi

发表机构 * National University of Singapore(新加坡国立大学) Prana Lab(Prana实验室)

AI总结 提出一种结合解析公式物理一致性与图神经网络表示能力的少样本真实到模拟方法,通过少量真实数据校准解析模拟器生成大规模合成数据集,并引入基于网格的图神经网络隐式建模刚体前向动力学及碰撞检测的代理梯度,实现完全可微性,从而提升模拟保真度和策略学习效率。

Comments Accepted in ICRA 2026

详情
AI中文摘要

精确的物理模拟对于机器人学习和控制至关重要,然而解析模拟器通常难以捕捉复杂的接触动力学,而基于学习的模拟器通常需要大量昂贵的真实世界数据。为弥合这一差距,我们提出了一种少样本真实到模拟方法,该方法结合了解析公式的物理一致性与基于图神经网络(GNN)模型的表示能力。仅使用少量真实世界数据,我们的方法校准解析模拟器以生成大规模合成数据集,捕捉多样的接触交互。在此基础上,我们引入了一种基于网格的GNN,隐式建模刚体前向动力学,并推导出碰撞检测的代理梯度,实现完全可微性。实验结果表明,我们的方法使基于学习的模拟器在复现真实世界轨迹方面优于可微基线。此外,可微设计支持基于梯度的优化,我们通过多物体交互场景中的基于模拟的策略学习验证了这一点。大量实验表明,我们的框架不仅以最小监督提高了模拟保真度,还提高了策略学习的效率。综上所述,这些发现表明,具有少样本真实世界基础的可微模拟为推进未来机器人操作和控制提供了有力方向。

英文摘要

Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.

2603.05143 2026-05-26 cs.CL cs.LG

Feature Resemblance: Towards a Theoretical Understanding of Analogical Reasoning in Transformers

特征相似性:迈向对Transformer中类比推理的理论理解

Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

发表机构 * Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong(香港中文大学信息工程系)

AI总结 本文通过最小化Transformer抽象模型,从理论上证明联合训练和特定课程顺序能使实体在表示空间中对齐,从而通过特征相似性实现属性转移,即类比推理。

详情
AI中文摘要

理解大型语言模型中的推理因评估混淆多种推理类型而变得复杂。我们分离出类比推理,即模型在共享已知属性的实体之间转移属性,并研究这种转移何时能从训练中涌现。为了使问题在分析上易于处理,我们研究了一个最小化的Transformer风格抽象,该抽象隔离了学习到的表示如何支持类比推理。在此设置中,我们证明了三个关键结果。首先,对相似性和属性前提的联合训练通过对齐表示实现类比推理。其次,顺序训练仅在相似性结构先于特定属性学习时成功,揭示了课程不对称性。第三,在我们的风格化设置中,两跳推理$(a \to b, b \to c \Rightarrow a \to c)$可被视为具有身份桥$(b=b)$的类比推理,这些身份桥在训练数据中明确出现。这些结果共同揭示了一个统一机制:具有共享属性的实体在表示空间中对齐,从而通过特征相似性实现属性转移。使用高达8B参数的架构进行的实验与理论定性一致,并表明表示几何在风格化模型之外的类比推理中扮演重要角色。

英文摘要

Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning, where a model transfers an attribute between entities that share known properties, and study when such transfer can emerge from training. To make the problem analytically tractable, we study a minimal transformer-style abstraction that isolates how learned representations support analogical reasoning. Within this setting, we prove three key results. First, joint training on similarity and attribution premises enables analogical reasoning through aligned representations. Second, sequential training succeeds only when similarity structure is learned before specific attributes, revealing a curriculum asymmetry. Third, in our stylized setting, two-hop reasoning $(a \to b, b \to c \Rightarrow a \to c)$ can be viewed as analogical reasoning with identity bridges $(b=b)$, which appear explicitly in training data. Together, these results reveal a unified mechanism: entities with shared properties become aligned in representation space, enabling property transfer through feature resemblance. Experiments with architectures up to 8B parameters show qualitative agreement with the theory and suggest that representational geometry plays an important role in analogical reasoning beyond the stylized model.

2603.04114 2026-05-26 cs.CV

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

Any2Any: 统一任意模态遥感翻译

Haoyang Chen, Jing Zhang, Hebaixu Wang, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haonan Guo, Di Wang, Zheng Wang, Bo Du

发表机构 * National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, School of Computer Science, Wuhan University(武汉大学多媒体软件国家工程研究中心、人工智能研究院、计算机学院) Hubei Key Laboratory of Multimedia(湖北省多媒体重点实验室) Zhongguancun Academy, Beijing, China. 100094(中关村学院,北京,中国,100094) School of Electronic Information, Wuhan University, Wuhan, China(武汉大学电子信息学院,武汉,中国) School of Automation, Beijing Institute of Technology(北京理工大学自动化学院) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China(武汉大学测绘遥感信息工程国家重点实验室,武汉,中国)

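AI summary: Proposes Any2Any, a unified latent diffusion framework that translates efficiently between arbitrary modalities through a shared latent space and lightweight residual adapters. On the new RST-1M dataset it outperforms pairwise methods and shows zero-shot generalization.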

Comments: Accepted by ICML 2026

English abstract

Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Such structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models are available at https://github.com/MiliLab/Any2Any.
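The quadratic-complexity claim can be made concrete with a small count. The sketch below assumes that the pairwise approach needs one model per directed source-to-target pair, and that a shared-latent design needs one encoder and one decoder per modality plus a single shared backbone. That component layout and the modality names are illustrative assumptions, not details taken from Any2Any or RST-1M.

```python
from itertools import permutations

def pairwise_model_count(modalities):
    """Directed source->target translators needed when every pair is its own task."""
    return len(list(permutations(modalities, 2)))

def shared_latent_component_count(modalities):
    """One encoder and one decoder per modality plus one shared backbone (assumed layout)."""
    return 2 * len(modalities) + 1

# Placeholder names for five sensing modalities.
modalities = ["m1", "m2", "m3", "m4", "m5"]
print(pairwise_model_count(modalities))           # 20
print(shared_latent_component_count(modalities))  # 11
```

The first count grows as N(N-1) while the second grows linearly in N, which is the gap the shared-latent formulation is meant to close.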

2603.00857 2026-05-26 cs.LG cs.AI

MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules

Idelfonso B. R. Nogueira, Carine M. Rebello, Mumin Enis Leblebici, Erick Giovani Sperandio Nascimento

Affiliations: Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU); Faculty of Industrial Engineering, KU Leuven; University of Surrey

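AI summary: Proposes MultiPUFFIN, a multimodal foundation model that fuses SMILES, 2D graphs, 3D conformers, and experimental conditions. Using condition-aware refinement and thermodynamically constrained heads, it predicts thermophysical properties of small molecules and outperforms ChemBERTa-2 with far fewer labeled samples.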

English abstract

MultiPUFFIN is a domain-informed multimodal foundation model for predicting thermophysical properties of small molecules, addressing a critical gap in chemical engineering, drug discovery, and materials science. Existing molecular foundation models pretrain on millions of molecules to learn general-purpose representations, but their standard MLP output layers impose no physical constraints: vapor pressure predictions may violate monotonic temperature dependence, and viscosity curves may lack the functional form required by process simulators. Domain-informed approaches that guarantee thermodynamic consistency have remained limited to single properties and small datasets, whereas multimodal foundation models have focused on biological activity rather than thermophysical properties. MultiPUFFIN fills this gap by fusing SMILES sequences, 2D molecular graphs, and 3D conformer geometries through bidirectional cross-modal attention and gated fusion, supplemented by auxiliary encoders for experimental conditions and molecular descriptors. The backbone is pretrained on 500,000 unlabelled PubChem molecules using three complementary self-supervised objectives. A condition-aware refinement stack of five conditioners (temperature, pH, pressure, polymorph, and measurement method) routes each property to a four-head tournament that selects the best-performing thermodynamically informed head for that property. MultiPUFFIN achieves a mean test $R^2$ of 0.784 and outperforms fine-tuned ChemBERTa-2 on all nine properties despite training on roughly 2,000x fewer labeled molecules.
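To illustrate what a physically constrained output head can look like, the sketch below contrasts with an unconstrained MLP output by predicting vapor pressure through the form $\ln P = A - B/T$ with $B$ forced to be non-negative, so that pressure cannot decrease with temperature. The functional form, the softplus reparameterization, and the function names are assumptions chosen for illustration; the abstract does not specify the heads that MultiPUFFIN actually uses.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); non-negative for every real input."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def vapor_pressure_head(raw_a, raw_b, temperature_k):
    """Return P(T) = exp(A - B / T) with B = softplus(raw_b) >= 0.

    raw_a and raw_b stand in for unconstrained network outputs (hypothetical).
    Because B is non-negative, P never decreases as T increases.
    """
    a = raw_a
    b = softplus(raw_b)
    return math.exp(a - b / temperature_k)

temperatures = [250.0, 300.0, 350.0, 400.0]
pressures = [vapor_pressure_head(10.0, 8.0, t) for t in temperatures]
print(all(p1 < p2 for p1, p2 in zip(pressures, pressures[1:])))  # True
```

The design choice is that monotonicity holds by construction for any value of the raw outputs, instead of being encouraged through a penalty term in the loss.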

2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

Affiliations: Stanford University; Northwestern University

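AI summary: Proposes Reflective Test-Time Planning, which combines reflection-in-action and reflection-on-action with retrospective reflection so that embodied agents can self-correct and accumulate experience at test time, substantially improving performance on long-horizon tasks.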

English abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.
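The reflection-in-action step described above, generating several candidate actions and scoring them before execution, resembles best-of-N selection. The sketch below shows only that selection loop under stated assumptions: `propose` and `score` are hypothetical stand-ins for the action policy and the internal reflection model, and the toy state is invented for illustration.

```python
import random

def reflect_in_action(propose, score, state, num_candidates=4, seed=0):
    """Sample candidate actions, score each with an internal critic, and return the best one."""
    rng = random.Random(seed)
    candidates = [propose(state, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda action: score(state, action))

def toy_propose(state, rng):
    """Hypothetical policy: pick any available action at random."""
    return rng.choice(state["actions"])

def toy_score(state, action):
    """Hypothetical internal reflection: look up a preference, defaulting to 0."""
    return state["preference"].get(action, 0.0)

state = {
    "actions": ["open", "push", "grasp"],
    "preference": {"grasp": 1.0, "open": 0.2},
}
chosen = reflect_in_action(toy_propose, toy_score, state, num_candidates=8)
print(chosen in state["actions"])  # True
```

Reflection-on-action and retrospective reflection would update `propose` and `score` after execution; that training step is outside the scope of this sketch.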