arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16117 2026-05-18 cs.CL

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

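AI summary: SGR improves LLM reasoning through external subgraph generation, using structured knowledge to support multi-step inference. Experiments show it outperforms baselines on several benchmark datasets, improving reasoning accuracy and factual reliability.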

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.
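
The two mechanical steps named in the abstract, building a query-specific subgraph and combining several reasoning trajectories, can be illustrated with a minimal sketch. The toy triples, the hop-based expansion in `build_subgraph`, and the use of a plain majority vote in `majority_vote` are all assumptions made for illustration; the abstract does not say how SGR actually selects subgraph edges or aggregates trajectories.

```python
from collections import Counter

# Toy knowledge base of (head, relation, tail) triples; purely illustrative.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
    ("Paris", "capital_of", "France"),
]

def build_subgraph(triples, seed_entities, hops=2):
    """Collect triples reachable within `hops` expansion steps of the seeds."""
    frontier = set(seed_entities)
    seen = set(seed_entities)
    kept = []
    for _ in range(hops):
        next_frontier = set()
        for triple in triples:
            head, _, tail = triple
            # Keep a triple once, the first time it touches the frontier.
            if (head in frontier or tail in frontier) and triple not in kept:
                kept.append(triple)
                for node in (head, tail):
                    if node not in seen:
                        seen.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return kept

def majority_vote(answers):
    """Combine the final answers of several reasoning trajectories (assumed rule)."""
    return Counter(answers).most_common(1)[0][0]

# Question entity: "Marie Curie"; the Paris triple is never reached.
subgraph = build_subgraph(TRIPLES, ["Marie Curie"], hops=2)

# Hypothetical answers produced by three independent trajectories.
final_answer = majority_vote(["Poland", "Poland", "France"])
```

In this sketch the subgraph keeps only the three triples connected to the question entity, which is the sense in which grounding can keep a model focused on relevant entities and relations.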

2605.16116 2026-05-18 cs.AI

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

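AI summary: The paper proposes ShopGym, a framework that combines a simulation layer (ShopArena) and a benchmark layer (ShopGuru) for realistic simulation and scalable benchmarking of e-commerce web agents, and validates that synthetic shops preserve structural properties and agent-performance signals.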

Comments
32 pages, 10 figures
Abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

2605.16115 2026-05-18 cs.RO

Beyond Collision Avoidance: Multi-Robot Yielding and Spatial Affordance in Emergency Evacuations

Ning Zhou, Edmund R. Hunt, Nikolai W. F. Bode

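AI summary: Using a virtual evacuation experiment, the study examines how multi-robot yielding strategies interact with human spatial expectations. It finds that proactive yielding outperforms freezing and efficiency-first strategies, and shows that environmental affordances shape cognitive expectations.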

Abstract

As mobile service robots increasingly coexist with pedestrians, ensuring passively safe behaviour during confined emergency evacuations is critical. Existing multi-robot yielding strategies often focus solely on collision avoidance and macroscopic flow optimisation, overlooking environmental affordances and human spatial expectations. To bridge the gap between macroscopic theory and micro-level perception, we conducted a game-based virtual evacuation experiment (N=56). We investigated individual psychological responses to four multi-robot yielding strategies (Hide, LineEscape, Freeze, ShortestPath) across confined corridors with and without refuge niches. Our results establish a robust preference hierarchy (Hide > LineEscape > Freeze > ShortestPath), demonstrating that proactive space-yielding significantly outperforms freezing and efficiency-first approaches. Crucially, we found that environmental affordances heavily shape cognitive expectations. Actively utilising available niches amplifies the psychological comfort of proactive yielding (Hide). Conversely, failing to use an obvious niche (e.g., executing LineEscape) may trigger Expectation Violation. This is reflected in a drastically increased perceived cognitive delay, despite objectively unimpeded trajectories. Furthermore, prior robot interaction experience helps users decode complex social intents. Ultimately, this research demonstrates that safe human-robot interaction during emergencies must evolve from pure trajectory optimisation to semantically aware navigation. Future work will extend this framework to investigate complex interactions between robot swarms and pedestrian crowds.

2605.16113 2026-05-18 cs.CL cs.AI

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

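AI summary: The paper proposes DebiasRAG, a tuning-free, dynamic, query-specific debiasing framework based on retrieval-augmented generation. Its three stages (query-specific debiasing candidate generation, context candidate pool construction, and gradient-updated debiasing-guided context reranking) improve generation fairness while preserving the LLM's intrinsic properties.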

Abstract

Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.

2605.16112 2026-05-18 cs.LG cs.AI

Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix

Jinhao Zhang, Kangfei Zhao, Qiuhao Zeng, Long-Kai Huang

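AI summary: The paper identifies attention dispersion in dynamic graph Transformers under temporal distribution shift and proposes a transferable differential attention mechanism that improves performance, especially on high-shift datasets.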

Abstract

Transformer-based architectures have become the dominant paradigm for Continuous-Time Dynamic Graph (CTDG) learning, yet their performance remains limited on temporally shifted datasets. In this work, we identify attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift. Through controlled ablation contrasting structurally and temporally distinguished historical neighbors against random ones, we show that prediction depends on a class of critical nodes that carry consistently more predictive signal than arbitrary neighbors. However, existing Transformers fail to focus on these nodes even when they are present in the input, as temporal shift weakens attention contrast and produces overly dispersed attention distributions. This diagnosis suggests a simple and transferable fix: replace standard attention with differential attention, which suppresses common-mode attention and amplifies distinctive token-level signals. When added to three representative CTDG Transformer baselines, differential attention consistently improves performance, with gains concentrated on high-shift datasets. Attention-level measurements further confirm the mechanism, showing reduced attention entropy and increased attention mass on critical nodes. Building on these findings, we introduce DiffDyG, a reference implementation combining differential attention with standard input encodings. Across 9 benchmarks and three negative sampling protocols, DiffDyG achieves SOTA performance, with especially large gains on the most shifted datasets.
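
A small numeric sketch can show why subtracting a second attention map reduces dispersion. It assumes one common formulation of differential attention, in which the output weights are the first softmax map minus a scaled second softmax map; the scores, the scale `lam`, and the helper names are hypothetical, and the abstract does not confirm that DiffDyG uses exactly this form.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_weights(scores_a, scores_b, lam=0.5):
    """Subtract a scaled second attention map from the first.

    Weight that both maps assign to every neighbor (the common mode)
    cancels, so the remaining weight concentrates on distinctive tokens.
    """
    map_a = softmax(scores_a)
    map_b = softmax(scores_b)
    return [a - lam * b for a, b in zip(map_a, map_b)]

# Toy neighborhood: index 0 plays the role of a "critical node".
scores_a = [2.0, 1.0, 1.0, 1.0]  # mildly prefers the critical node
scores_b = [1.0, 1.0, 1.0, 1.0]  # uniform, purely common-mode map

standard = softmax(scores_a)
diff = differential_weights(scores_a, scores_b, lam=0.5)

# Share of the total attention weight that lands on the critical node.
standard_share = standard[0] / sum(standard)
diff_share = diff[0] / sum(diff)
```

With these toy scores the critical node's share of the total weight is larger under the differential map than under standard softmax attention, which matches the "increased attention mass on critical nodes" reported in the abstract only in spirit, not in magnitude.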

2605.16107 2026-05-18 cs.CL

Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection

Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, Defu Lian

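AI summary: The paper proposes a multi-level contextual token relation modeling framework that improves machine-generated text detection through a local calibration module and a global rule-support reasoning module. Experiments show substantial gains across a range of real-world scenarios.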

Abstract

Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. We then theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, we model local relations through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.

2605.16103 2026-05-18 cs.AI

Sign-Separated Finite-Time Error Analysis of Q-Learning

Donghwan Lee

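AI summary: The paper presents a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from a switching-system representation, it decomposes the error into negative and positive parts: the negative part is dominated by a linear time-invariant system associated with a fixed optimal policy, and the positive part is controlled by a linear switching system. The analysis reveals a max-induced asymmetry in Q-learning error dynamics and gives finite-time bounds for both deterministic and stochastic constant step-size recursions.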

Abstract

This paper develops a sign-separated finite-time error analysis for constant step-size Q-learning. Starting from the switching-system representation, the error is decomposed into its componentwise negative and positive parts. The negative part is dominated by a lower comparison linear time-invariant (LTI) system associated with a fixed optimal policy, whereas the positive part is controlled by a linear switching system. The resulting bounds show that the negative-side LTI certificate is no slower than the positive-side switching certificate and may produce a faster exponential envelope. The analysis identifies a max-induced asymmetry in Q-learning error dynamics. This asymmetry is connected to overestimation: positive action-wise errors can be selected and propagated by the Bellman maximum, whereas negative errors admit an optimal-policy lower comparison. Finite-time bounds are provided for both deterministic and stochastic constant-step-size recursions.
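
The componentwise split and the max-induced asymmetry can be checked numerically on a toy table. The sketch below is not the paper's analysis: it only verifies the elementary inequality that, for each state, the error after the Bellman maximum lies between the error at a fixed optimal action (a lower comparison) and the largest positive action-wise error (which the max can select). The tables `Q_STAR` and `Q_EST` are made-up values.

```python
def split_error(q_est, q_star):
    """Return the error Q - Q* and its componentwise positive/negative parts."""
    err = [[qe - qs for qe, qs in zip(row_e, row_s)]
           for row_e, row_s in zip(q_est, q_star)]
    pos = [[max(e, 0.0) for e in row] for row in err]
    neg = [[min(e, 0.0) for e in row] for row in err]
    return err, pos, neg

# Toy tables: 2 states x 2 actions (values are made up for illustration).
Q_STAR = [[1.0, 0.5], [0.2, 0.8]]
Q_EST = [[0.7, 0.9], [0.3, 0.6]]

err, pos, neg = split_error(Q_EST, Q_STAR)

bounds = []
for s in range(len(Q_STAR)):
    # Index of an optimal action under Q*.
    best = max(range(len(Q_STAR[s])), key=Q_STAR[s].__getitem__)
    gap = max(Q_EST[s]) - max(Q_STAR[s])  # error after the Bellman max
    lower = err[s][best]                  # follow the fixed optimal action
    upper = max(pos[s])                   # a positive error can be selected
    bounds.append((lower, gap, upper))
```

Each tuple in `bounds` satisfies `lower <= gap <= upper` up to floating-point rounding: negative errors enter only through the optimal action, while any positive action-wise error can be picked up by the maximum, which is the asymmetry the abstract connects to overestimation.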

2605.16099 2026-05-18 cs.LG cs.AI

Federated Imputation under Heterogeneous Feature Spaces

Imane Hocine, Chaimaa Medjadji, Sylvain Kubler, Gregoire Danoy, Yves Le Traon

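AI summary: The paper proposes FedHF-Impute, a framework that uses a shared global feature graph to transfer knowledge across clients and improve federated imputation, outperforming baselines on datasets with simulated partial schema overlap.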

Abstract

Federated Learning (FL) enables collaborative training across decentralized clients, but most methods assume aligned feature schemas, an assumption that rarely holds in tabular settings where clients observe only partially overlapping feature subsets. In these heterogeneous feature spaces, parameter-averaging methods (e.g., FedAvg) transfer little information across weakly overlapping or disjoint feature groups, limiting their effectiveness for federated imputation. To overcome this, we propose FedHF-Impute, a federated imputation framework that separates structural feature unavailability from conventional missingness and uses a shared global feature graph to propagate information across statistically related features through message passing. This enables indirect cross-client knowledge transfer, even when features are never jointly observed locally, while preserving standard federated communication. Under simulated partial schema overlap on the SECOM and AirQuality datasets, FedHF-Impute improves imputation accuracy (RMSE) over FL baselines by 26.9% and 8.4%, respectively, while achieving comparable performance on PhysioNET, with only a 0.3% difference relative to the best baseline.

2605.16089 2026-05-18 cs.LG cs.AI

Centralized vs Decentralized Federated Learning: A trade-off performance analysis

Chaimaa Medjadji, Guilain Leduc, Sylvain Kubler, Yves Le Traon

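AI summary: Using the Fedstellar simulator, the MNIST dataset, and an MLP classifier, the paper compares the performance trade-offs of centralized, decentralized, and semi-decentralized federated learning architectures, highlighting their strengths and weaknesses for different application scenarios.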

Abstract

Federated Learning (FL) has emerged as a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, especially given the sharp increase in data volume caused by the adoption of technologies that contribute to the growing number of IoT devices. Storing this amount of data centrally is challenging due to issues such as limited communication, privacy, and regulations. FL can be Centralized (CFL), Decentralized (DFL), or Semi-decentralized (SDFL). Choosing the right FL architecture depends on the application's needs. However, very few studies have experimentally compared these three types of architecture to understand not only their respective strengths and limitations but also the trade-offs between different performance indicators. This paper addresses this gap with experimental analyses using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.

2605.16088 2026-05-18 cs.LG cs.AI

Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction

Xiayu Liu, Zhengyi Lu, Hou-biao Li

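AI summary: The paper proposes MolCHG, a multi-level self-supervised pretraining framework for molecular property prediction. It organizes molecular structure as a compositional hierarchical graph and introduces a bond graph to strengthen bond information, so that atom-level and bond-level semantics are aggregated on an equal footing.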

Comments
11 pages, 4 figures
Abstract

Self-supervised pretraining on molecular graphs has emerged as a promising approach for molecular property prediction, yet most existing methods operate at a single structural granularity and treat bond information as auxiliary edge attributes rather than as an independent semantic layer. In this work, we propose MolCHG, a multi-level self-supervised pretraining framework built upon a novel Compositional Hierarchical Graph that organizes molecular structure into four types of nodes across three semantic levels. By introducing a bond graph that operates in parallel with the atom graph, our architecture elevates bond-level information to independently evolving node representations, enabling fragment nodes to aggregate atom-level and bond-level semantics on an equal footing. We design three level-specific pretraining objectives: an atom-bond cross-view contrastive task that aligns the atom-view and bond-view representations within each fragment, a fragment-level functional group prediction task to inject domain-relevant chemical knowledge, and graph-level structure prediction tasks to encode global molecular topology. Experiments on nine MoleculeNet benchmarks demonstrate that MolCHG achieves the best performance on seven datasets across both classification and regression tasks, remaining competitive with the strongest baselines on the rest. Ablation studies further confirm that the multi-level supervision signals are complementary and that each component contributes to the overall performance.
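
One way to picture a bond graph that runs in parallel with the atom graph is the line graph of the molecule: each bond becomes a node, and two bond nodes are linked when the bonds share an atom. The sketch below assumes that construction; the abstract does not specify how MolCHG connects bond nodes, and the function name and toy molecules are hypothetical.

```python
from itertools import combinations

def bond_graph(bonds):
    """Link every pair of bonds that share an atom (the line graph).

    `bonds` is a list of (atom_i, atom_j) index pairs; the result is a list
    of (bond_a, bond_b) index pairs into `bonds`.
    """
    edges = []
    for (a, bond_a), (b, bond_b) in combinations(enumerate(bonds), 2):
        if set(bond_a) & set(bond_b):
            edges.append((a, b))
    return edges

# Heavy-atom skeleton of ethanol: C0-C1-O2.
ethanol_bonds = [(0, 1), (1, 2)]

# Heavy-atom skeleton of isobutane: central C1 bonded to C0, C2, C3.
isobutane_bonds = [(0, 1), (1, 2), (1, 3)]

ethanol_edges = bond_graph(ethanol_bonds)
isobutane_edges = bond_graph(isobutane_bonds)
```

Under this construction bonds carry their own node features and exchange messages with neighboring bonds, instead of appearing only as edge attributes on the atom graph.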

2605.16081 2026-05-18 cs.LG cs.CV

MIND: Decoupling Model-Induced Label Noise via Latent Manifold Disentanglement

Dayong Ren

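AI summary: MIND addresses model-induced label noise through latent manifold disentanglement, dynamically projecting samples into latent structural clusters to improve noise identifiability. Experiments on complex benchmarks show it outperforms existing methods.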

Comments
Accepted, to appear in ICML 2026
Abstract

The paradigm of learning from automatic annotations driven by pre-trained experts and Foundation Models dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.

2605.16080 2026-05-18 cs.CV

ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

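AI summary: The paper proposes ReAlign, a framework that uses contrastive learning to distill high-quality LLM-generated reasoning texts into a lightweight AIGI detector, improving detection accuracy and generalization.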

Comments
Accepted by CVPR 2026
Abstract

The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, considering them a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
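
The joint objective described above, a contrastive term for image-text alignment plus a classification term, can be sketched in a few lines. The InfoNCE form, the temperature, the weight `alpha`, and all function names are assumptions for illustration; the abstract does not give ReAlign's exact losses or weighting.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]

def info_nce(sim, temperature=0.1):
    """Image-to-text contrastive loss.

    sim[i][j] is the similarity between image i and reasoning text j;
    the matching text for image i sits on the diagonal.
    """
    losses = [-log_softmax([s / temperature for s in row])[i]
              for i, row in enumerate(sim)]
    return sum(losses) / len(losses)

def cross_entropy(logits, labels):
    """Mean classification loss for real-vs-fake logits."""
    losses = [-log_softmax(row)[y] for row, y in zip(logits, labels)]
    return sum(losses) / len(losses)

def joint_loss(sim, logits, labels, alpha=1.0):
    """Classification loss plus a weighted alignment loss (alpha is assumed)."""
    return cross_entropy(logits, labels) + alpha * info_nce(sim)

# Toy batch of two images: aligned vs. misaligned similarity matrices.
aligned = [[0.9, 0.1], [0.2, 0.8]]
misaligned = [[0.1, 0.9], [0.8, 0.2]]
logits = [[2.0, 0.5], [0.3, 1.7]]
labels = [0, 1]

loss_aligned = joint_loss(aligned, logits, labels)
loss_misaligned = joint_loss(misaligned, logits, labels)
```

Because the classification term is identical in both calls, the difference between `loss_aligned` and `loss_misaligned` comes entirely from the alignment term, which is lower when each image is most similar to its own reasoning text.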

2605.16079 2026-05-18 cs.CV cs.AI cs.HC

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

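AI summary: VideoSeeker integrates agentic reasoning with instance-level video understanding tasks to improve video understanding accuracy. Experiments show a 13.7% average improvement over baselines on instance-level tasks, surpassing GPT-4o and Gemini-2.5-Pro.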

Comments
Project Page: https://gaotiexinqu.github.io/VideoSeeker/
Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

2605.16077 2026-05-18 cs.CL

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya, Eiji Aramaki

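AI summary: The paper proposes an LLM-driven data augmentation framework that generates oral-style monologues in different styles to improve cognitive score prediction from speech. A similarity-guided augmentation strategy reduces prediction error for the minority low-score group while maintaining performance for the majority group.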

Comments
11 pages, 6 figures
Abstract

Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.
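
Similarity-guided class-balanced selection can be sketched as: group synthetic samples by score class, rank each group by cosine similarity to a reference embedding, and keep the same number from every class. The sketch assumes the reference (`anchor`) is an embedding of the participant's real oral narrative and that classes are simple score bands; the field names, the toy 2-D embeddings, and the `per_class` quota are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_balanced(candidates, per_class):
    """Keep the `per_class` most anchor-similar synthetic samples per class."""
    by_class = {}
    for cand in candidates:
        by_class.setdefault(cand["label"], []).append(cand)
    chosen = []
    for items in by_class.values():
        ranked = sorted(
            items,
            key=lambda c: cosine(c["embedding"], c["anchor"]),
            reverse=True,
        )
        chosen.extend(ranked[:per_class])
    return chosen

# Toy synthetic monologues with 2-D embeddings (real ones would be
# sentence embeddings); "low" and "high" stand for score bands.
candidates = [
    {"id": "a", "label": "low", "embedding": [1.0, 0.0], "anchor": [1.0, 0.0]},
    {"id": "b", "label": "low", "embedding": [0.0, 1.0], "anchor": [1.0, 0.0]},
    {"id": "c", "label": "high", "embedding": [1.0, 1.0], "anchor": [1.0, 0.0]},
    {"id": "d", "label": "high", "embedding": [1.0, 0.1], "anchor": [1.0, 0.0]},
]

selected_ids = [c["id"] for c in select_balanced(candidates, per_class=1)]
```

Replacing the similarity ranking with a random shuffle inside each class would give the random class-balanced baseline that the abstract reports as less stable.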

2605.16076 2026-05-18 cs.CV cs.AI

AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely

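AI summary: The paper proposes AgriMind, an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 that uses transfer learning to classify 15 plant disease classes. The ensemble reaches 99.23% accuracy on the test set.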

Abstract

Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96-97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23%, a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization, work we have not yet completed.
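
Averaging softmax outputs, the ensembling rule named in the abstract, takes only a few lines. The sketch below uses made-up probability vectors for a single image and three classes; it illustrates unweighted averaging only and is not the authors' code.

```python
def average_probs(model_probs):
    """Unweighted mean of per-model class-probability vectors."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[k] for p in model_probs) / n_models
            for k in range(n_classes)]

def predict(model_probs):
    """Class index with the highest averaged probability."""
    avg = average_probs(model_probs)
    return max(range(len(avg)), key=avg.__getitem__)

# Hypothetical softmax outputs of three models for one leaf image.
# The first model alone would pick class 0; the other two pick class 1.
probs = [
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]

ensemble_avg = average_probs(probs)
ensemble_class = predict(probs)
```

Here the ensemble picks class 1, overruling the first model's isolated mistake, which is the kind of disagreement that averaging can correct when the models err on different images.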

2605.16069 2026-05-18 cs.LG

ITGPT: Generative Pretraining on Irregular Timeseries

Antoine Honoré, Ming Xiao

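AI summary: This paper proposes ITGPT, an attention-based architecture for multimodal, irregularly sampled time series that is trained with self-supervised learning and generative pretraining objectives, and achieves state-of-the-art performance on healthcare and predictive maintenance tasks.

Details
Comments
9 pages
AI abstract (translated from Chinese)

Time series regression models often struggle to exploit large volumes of labeled multimodal data, especially when the data are irregularly sampled or contain missing values. This is common in healthcare and predictive maintenance, where data come from unreliable sources and labeling requires expert knowledge or costly equipment. Transformer-based large language models have performed well on structured data such as text through self-supervised learning (SSL) and generative pretraining (GPT) frameworks. However, such models lack the flexibility to handle irregularly sampled multimodal time series data. This paper introduces ITGPT, an attention-based architecture that handles multimodal, irregularly sampled time series by allowing training with SSL losses and GPT-like objectives. We evaluate it on a healthcare task with the TIHM dataset and on a predictive maintenance task with the CompX dataset. The results show that ITGPT achieves SOTA performance without resampling, feature fusion, or explicit data imputation. In addition, when labels are scarce, ITGPT makes effective use of unlabeled data through SSL and GPT training and outperforms the purely supervised approach. This is an important step toward the efficient use of large, unstructured time series data in practical inference tasks.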

English abstract

Timeseries regression models often struggle to leverage large volumes of labeled multimodal data, particularly when the data are irregularly sampled or contain missing values. This is common in domains like healthcare and predictive maintenance, where data are collected from unreliable sources, and labeling requires expert knowledge or costly equipment. Transformer-based large language models have proven effective on structured data such as text through self-supervised learning (SSL) and generative pretraining (GPT) frameworks. However, such models lack the flexibility to efficiently process irregularly sampled multimodal timeseries data. In this paper, we introduce ITGPT, an attention-based architecture designed for handling multimodal, irregularly sampled timeseries by allowing training with both SSL losses and GPT-like objectives. We evaluate its performance on a healthcare task with the TIHM dataset, and a predictive maintenance task with the CompX dataset. Our results demonstrate that ITGPT achieves state-of-the-art performance without requiring resampling, feature fusion or explicit data imputation. Furthermore, when labels are scarce, ITGPT effectively leverages unlabeled data through SSL and GPT training, outperforming the purely supervised approach. This represents an important step towards efficiently using large and unstructured timeseries datasets for practical inference tasks.

2605.16067 2026-05-18 cs.LG stat.ML

SAFE Quantum Machine Learning with Variational Quantum Classifiers

Ying Chen, Paolo Giudici, Vasily Kolesnikov, Paolo Recchia

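AI summary: This paper proposes a variational quantum classifier based on amplitude encoding. Combining normalized amplitude embeddings with bounded quantum observables yields a structured and smooth hypothesis space, and model reliability is assessed with SAFE-AI metrics. Experiments show predictive performance competitive with classical baselines and improved robustness to noise.

Details
Comments
31 pages, 8 figures
AI abstract (translated from Chinese)

We propose a variational quantum classifier that operates on high-dimensional deep representations through amplitude encoding, stabilized by a learnable classical pre-encoding layer. By combining normalized amplitude embeddings with bounded quantum observables, the resulting model induces a structured and smooth hypothesis space with controlled sensitivity to input variations. Model reliability is assessed with SAFE-AI metrics derived from the Cramér-von Mises divergence, giving a consistent evaluation across the accuracy, robustness, and explainability dimensions. Experiments show that the proposed quantum model is competitive with strong classical baselines in predictive performance while exhibiting a more balanced SAFE reliability profile, with improved noise robustness and stability under structured feature removal. These findings suggest that variational quantum circuits offer a principled mechanism for stability-oriented SAFE learning in safety-critical settings.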

English abstract

We propose a variational quantum classifier operating on high-dimensional deep representations via amplitude encoding, stabilized by a learnable classical pre-encoding layer. By combining normalized amplitude embeddings with bounded quantum observables, the resulting model induces a structured and smooth hypothesis class with controlled sensitivity to input variations. Model reliability is assessed using SAFE-AI metrics derived from the Cramér-von Mises divergence, enabling consistent evaluation across accuracy, robustness, and explainability dimensions. Empirical results show that the proposed quantum model provides competitive predictive performance compared with strong classical baselines while exhibiting a more balanced SAFE reliability profile, with improved robustness to noise and stability under structured feature removal. These findings suggest that variational quantum circuits offer a principled mechanism for stability-oriented SAFE learning in safety-critical settings.
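
The two ingredients named in this abstract, normalized amplitude embeddings and a bounded observable, can be sketched classically with NumPy. This is only an illustration under stated assumptions: it omits the trainable circuit and the pre-encoding layer entirely, the function names are hypothetical, the choice of Pauli-Z on the first qubit is an assumption, and the first qubit is taken as the most significant bit of the basis index.

```python
import numpy as np

def amplitude_encode(features):
    """Zero-pad to the next power of two and L2-normalize into a valid state vector."""
    x = np.asarray(features, dtype=float)
    n_qubits = max(1, int(np.ceil(np.log2(x.size))))
    state = np.zeros(2 ** n_qubits)
    state[: x.size] = x
    norm = np.linalg.norm(state)
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return state / norm

def expect_z_first_qubit(state):
    """Expectation of Pauli-Z on the first qubit.

    With the first qubit as the most significant bit, basis states in the first
    half of the vector have eigenvalue +1 and those in the second half -1.
    """
    half = state.size // 2
    probs = np.abs(state) ** 2
    return probs[:half].sum() - probs[half:].sum()

state = amplitude_encode([3.0, 0.0, 4.0])   # padded to 4 amplitudes: [0.6, 0, 0.8, 0]
value = expect_z_first_qubit(state)         # 0.36 - 0.64
```

Because the state always has unit norm, the readout stays in [-1, 1] regardless of the scale of the input features, which is the sense in which such an observable is bounded.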

2605.16065 2026-05-18 cs.CV cs.AI

Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting

Raushan Joshi, Jean-Yves Guillemaut

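AI summary: This paper uses SAM-HQ to generate accurate 2D masks and a prior-guided label reassignment method to achieve robust 3D segmentation, improving the accuracy and efficiency of editing tasks.

Details
Comments
Accepted at IEEE International Conference on Image Processing 2026, 6 pages
AI abstract (translated from Chinese)

3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing methods that lift 2D segmentation to 3D suffer from view inconsistency and coarse masks. This paper proposes a new framework that uses the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians through multi-view consistency with learned priors. Our method achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results show superior boundary preservation and practical value in virtual reality (VR) and robotics, advancing 3D scene editing.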

English abstract

3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.

2605.16056 2026-05-18 cs.RO

Health-Conditioned Vision-Language-Action Models for Malfunction-Aware Robot Control

Hüseyin Arslan, Özgür Erkent

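AI summary: This paper proposes a malfunction-aware vision-language-action model that adds a Health Projector module so that a robot can still complete tasks under physical malfunctions such as joint degradation.

Details
Comments
VLA Pipelines Workshop at IEEE International Conference on Robotics and Automation (ICRA) 2026
AI abstract (translated from Chinese)

Research on vision-language-action (VLA) models has grown rapidly in recent years. Although some models focus on detecting, preventing, and recovering from task failures, they usually do not address adaptation to a robot's physical malfunctions. In real-world scenarios, most robots face physical degradation in various forms, such as joint degradation, actuator failure, or a weak gripper. We introduce a malfunction-aware (health-conditioned) VLA that takes a health vector as input, providing information about the operating angle and torque capability of the robot's joints, and adjusts its predictions to complete tasks with degraded joints. To this end, we inject a Health Projector module into the VLA-Adapter architecture and train it on malfunctioning-robot data that we collected in the LIBERO environment [1]. We collected 128 teleoperated episodes. Our results show that, with a very lightweight addition, the model learns to operate successfully under different configurations of degraded joints, which the default pretrained VLA-Adapter Libero-Spatial-Pro model cannot do. The code and dataset will be made available at https://github.com/h-arslan/health-aware-vla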

English abstract

Research on Vision Language Action (VLA) models has been increasing rapidly in recent years. Although some of these models focus on detecting, preventing, and recovering from task failures, they usually do not address adaptation to a robot's physical failures. In real-life scenarios, most robots face physical degradation in various forms, such as joint degradation, actuator failure, or a weak gripper. We introduce a malfunction-aware (health-conditioned) VLA that takes as input a health vector describing the operating angle and torque capability of the robot's joints, and adapts its predictions to complete tasks with the degraded joints. To achieve this, we inject a Health Projector module into the VLA-Adapter architecture and train it on malfunctioning-robot data that we collected in the LIBERO environment [1]. We collect 128 teleoperated episodes on Libero-Spatial tasks. Our results show that, with a very lightweight addition, the model can learn to operate successfully with different configurations of degraded joints, which the default pretrained VLA-Adapter Libero-Spatial-Pro model cannot. The code and dataset will be available soon at https://github.com/h-arslan/health-aware-vla

2605.16054 2026-05-18 cs.LG cs.AI

Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making

Fan Feng, Selena Ge, Minghao Fu, Zijian Li, Yujia Zheng, Zeyu Tang, Yingyao Hu, Biwei Huang, Kun Zhang

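AI summary: This paper proposes Ada-Diffuser, which explicitly models latent dynamic processes to improve the accuracy and adaptability of decision-making; experiments on simulated control and robotic benchmarks validate its effectiveness.

Details
Comments
ICLR 2026
AI abstract (translated from Chinese)

Recent work treats decision-making as a sequence modeling problem and models it with generative models such as diffusion models. Although promising, these methods often overlook latent factors with evolving dynamics, which are essential to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for precise dynamics modeling and effective decision-making. This paper proposes a unified framework that explicitly integrates latent dynamic inference from minimal yet sufficient observations. Theory shows that, under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that simultaneously learns the temporal structure of observed interactions and the latent dynamics, and further uses them for planning and control. With a modular design, Ada-Diffuser supports both planning and policy learning tasks and can adapt to latent variations in dynamics, rewards, and latent actions. Experiments on simulated control and robotic benchmarks validate its effectiveness in accurate latent inference and adaptive policy learning.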

English abstract

Recent work has framed decision-making as a sequence modeling problem using generative models such as diffusion models. Although promising, these approaches often overlook latent factors that exhibit evolving dynamics, elements that are fundamental to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for both precise dynamics modeling and effective decision-making. In this paper, we propose a unified framework that explicitly incorporates latent dynamic inference into generative decision-making from minimal yet sufficient observations. We theoretically show that under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that learns the temporal structure of observed interactions and the underlying latent dynamics simultaneously, and furthermore, leverages them for planning and control. With a modular design, Ada-Diffuser supports both planning and policy learning tasks, enabling adaptation to latent variations in dynamics, rewards, and latent actions. Experiments on simulated control and robotic benchmarks demonstrate its effectiveness in accurate latent inference and adaptive policy learning.

2605.16052 2026-05-18 cs.AI cs.CL

Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

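AI summary: This paper studies how data contamination affects LLM performance in tax law reasoning and proposes a neuro-symbolic framework to improve the reliability and robustness of legal AI.

Details
AI abstract (translated from Chinese)

Recent advances in large language models (LLMs) have significantly strengthened automated legal reasoning. However, it remains unclear whether their performance reflects genuine legal reasoning ability or is an artifact of data contamination. This paper presents a comprehensive empirical study of tax law reasoning methods and implements a contamination detection protocol to rigorously assess LLM reliability. We find that performance can be inflated by contamination. Based on this analysis, we conduct a systematic evaluation comparing monolithic LLMs with hybrid systems that translate legal text into formal representations and delegate inference to symbolic solvers. We build a new test suite that probes generalization to unseen documents through case and rule variations. Our findings indicate that legal reasoning is inherently compositional, and that neuro-symbolic frameworks provide a more reliable and robust foundation for legal AI, along with better generalization to unobserved situations.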

English abstract

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

2605.16048 2026-05-18 cs.LG cs.AI

Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification

Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

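AI summary: This paper explores looped SSMs for time series classification and shows experimentally that depth-recurrence and input reshaping both improve model performance.

Details
AI abstract (translated from Chinese)

State space models (SSMs) are inherently recurrent along the sequence dimension, but depth-recurrence, that is, reusing the same block across layers, has not been explored in the SSM family. We show that a looped SSM with k parameters iterated L times matches or outperforms a standard SSM with k·L independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, even though it operates in a strictly smaller hypothesis space. Because the larger model contains the looped model as a special case, this dominance cannot be attributed to expressivity; it instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results indicate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking high-dimensional inputs, yields accuracy gains of 1-6%, confirmed over 5 random seeds. The two techniques provide independent improvements that compound when combined, indicating that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.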

English abstract

State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence, i.e., reusing the same block repeatedly across layers as recently applied in looped transformers, has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently matches closely or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.
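
The relationship between a looped block and a stack of independent blocks can be made concrete with a small NumPy sketch. It uses a toy diagonal linear SSM with a residual connection, which is an assumption chosen for brevity and not one of the four architectures in the paper; all function names, shapes, and initialization ranges are hypothetical. The sketch also shows the "concatenating timesteps" form of input reshaping.

```python
import numpy as np

def ssm_block(x, a, B, C):
    """Toy diagonal linear SSM: h_t = a * h_{t-1} + B x_t, y_t = C h_t, plus a residual."""
    T, d = x.shape
    h = np.zeros(a.shape[0])
    y = np.empty_like(x)
    for t in range(T):
        h = a * h + B @ x[t]
        y[t] = C @ h
    return x + np.tanh(y)

def init_block(rng, d, n):
    """Random parameters for one block with input width d and state size n."""
    return dict(a=rng.uniform(0.5, 0.99, n),
                B=rng.normal(0.0, 0.1, (n, d)),
                C=rng.normal(0.0, 0.1, (d, n)))

def looped(x, block, L):
    """Depth-recurrence: apply the same block (shared parameters) L times."""
    for _ in range(L):
        x = ssm_block(x, **block)
    return x

def stacked(x, blocks):
    """Standard stack: one independent block per layer."""
    for block in blocks:
        x = ssm_block(x, **block)
    return x

def count_params(blocks):
    return sum(v.size for block in blocks for v in block.values())

def concat_timesteps(x, c):
    """Input reshaping: merge every c consecutive timesteps into one wider step."""
    T, d = x.shape
    assert T % c == 0, "sequence length must be divisible by c"
    return x.reshape(T // c, c * d)

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))
shared = init_block(rng, d=4, n=8)
independent = [init_block(rng, d=4, n=8) for _ in range(3)]

y_looped = looped(x, shared, L=3)
y_tied_stack = stacked(x, [shared] * 3)   # stack with all layers tied
x_wide = concat_timesteps(x, c=3)         # shape (4, 12)
```

Tying all layers of the stack reproduces the looped output exactly, which illustrates why the stacked model contains the looped one as a special case while holding L times as many parameters.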

2605.16045 2026-05-18 cs.CL cs.AI cs.LG

RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng

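AI summary: RecMem uses a recurrence-based mechanism to decide when memory is consolidated, reducing token consumption while improving accuracy, and so addresses memory management for long-running LLM agents.

Details
Comments
Accepted to ACL 2026 Findings
AI abstract (translated from Chinese)

Memory systems typically organize user-agent interactions as retrievable external memory and are essential for long-running agents because they overcome the limited context window of LLMs. However, existing memory systems invoke an LLM on every interaction to extract memories, which consumes a large number of tokens. To address this, we propose RecMem, which rethinks when memory consolidation should happen. RecMem stores incoming interactions in a subconscious memory layer and encodes them with a lightweight embedding model for retrieval. The LLM is invoked to extract episodic and semantic memory only when sustained recurrence of semantically similar interactions is observed. This recurrence-based consolidation works because such interactions correspond to information-rich semantic clusters and are therefore worth extracting and summarizing. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers fine-grained facts omitted during memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.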

English abstract

Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents because they overcome the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.
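
The idea of deferring LLM calls until similar interactions recur can be sketched as a small buffer. This is a simplified illustration and not RecMem's implementation: the class name, the cosine-similarity threshold, the simple count used as the recurrence test, and the hand-written 2D "embeddings" are all assumptions. The returned cluster stands in for the batch that would be handed to an LLM for extraction.

```python
import numpy as np

class RecurrenceBuffer:
    """Holds raw interactions and releases a cluster only once it has recurred enough."""

    def __init__(self, sim_threshold=0.8, min_recurrence=3):
        self.sim_threshold = sim_threshold    # cosine similarity needed to count as "similar"
        self.min_recurrence = min_recurrence  # cluster size that triggers consolidation
        self.items = []                       # list of (text, unit-norm embedding)

    def add(self, text, embedding):
        """Store one interaction; return a cluster to consolidate, or None."""
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        self.items.append((text, e))
        similar = [float(v @ e) >= self.sim_threshold for _, v in self.items]
        cluster = [t for (t, _), s in zip(self.items, similar) if s]
        if len(cluster) < self.min_recurrence:
            return None                       # cheap path: no LLM call
        # Drop consolidated items so that they are not processed twice.
        self.items = [item for item, s in zip(self.items, similar) if not s]
        return cluster

buffer = RecurrenceBuffer(sim_threshold=0.8, min_recurrence=3)
events = [
    ("user asks about flight refunds", [1.0, 0.0]),
    ("user asks again about refund policy", [0.99, 0.1]),
    ("user mentions favourite pizza", [0.0, 1.0]),
    ("user asks how long refunds take", [0.98, 0.05]),
]
results = [buffer.add(text, emb) for text, emb in events]
```

Only the fourth event triggers consolidation here, returning the three refund-related interactions together; the unrelated pizza remark stays in the buffer without ever costing an LLM call.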

2605.16043 2026-05-18 cs.RO cs.AI

Learning Sim-Grounded Policies for Bimanual Rope Manipulation from Human Teleoperation Data

Gina Wigginghaus, Tim Missal, Berk Guler, Simon Manschitz, Jan Peters

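AI summary: This paper investigates whether the poor generalization of vision-based policies on the knot-untangling task stems from the observation space rather than from the policy architecture or data scale. Comparing two Action Chunking with Transformers policies, it finds that the policy conditioned on physical state reduces L1 error by 30.8% when predicting the initial grasp-and-pull action.

Details
Comments
Accepted to the Beyond Teleoperation Workshop at ICRA 2026, 5 pages, 2 figures
AI abstract (translated from Chinese)

Deformable linear objects (DLOs) such as ropes and cables are common in household and industrial applications, but they are hard to manipulate because of their infinite-dimensional configuration space and frequent self-occlusion. Learning bimanual DLO manipulation from teleoperation offers a practical path, but its scalability is limited by human effort, which makes the choice of observation space critical for generalizing from small datasets. This paper investigates whether the poor generalization of vision-based policies on the knot-untangling task stems from the observation space itself rather than from the policy architecture or data scale. We compare two Action Chunking with Transformers policies trained on the same bimanual teleoperation data: a visual policy conditioned on two egocentric RGB streams from wrist-mounted cameras, and a policy conditioned on the DLO's 3D particle state, which is extracted from an initial observation via multi-view fusion and evolved in a particle-based eXtended Position-Based Dynamics simulation. In open-loop evaluation on an unseen rope configuration, the state-based policy reduces L1 error by 30.8% when predicting the initial grasp-and-pull action, quantifying the observability gap between pixels and physics-consistent state and pointing toward more data-efficient robot learning of DLO manipulation from limited human demonstrations.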

English abstract

Deformable Linear Objects (DLOs) such as ropes and cables are widely encountered in both household and industrial applications, yet remain challenging to manipulate due to their infinite-dimensional configuration space and frequent self-occlusion. Imitation learning from teleoperation offers a practical path to bimanual DLO manipulation, but its scalability is limited by human effort, making the choice of observation space critical for generalization from small datasets. In this study, we investigate whether the lack of generalization in egocentric visual policies for the knot-untangling task stems from the observation space itself, rather than from the policy architecture or data scale. We compare two Action Chunking with Transformers policies trained on the same bimanual teleoperation data: a vision-based policy conditioned on two egocentric RGB streams from wrist-mounted cameras, and a state-based policy conditioned on the DLO's 3D particle state, extracted from an initial observation via multi-view fusion and evolved in a particle-based eXtended Position-Based Dynamics simulation. Evaluated open-loop on an unseen rope configuration, the state-based policy outperforms its visual counterpart with a 30.8% reduction in L1 error when predicting the initial grasp-and-pull action, quantifying the observability gap between pixels and physics-consistent state, and pointing toward more data-efficient robot learning for the DLO manipulation task from limited human demonstrations.

2605.16026 2026-05-18 cs.CL cs.AI

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

Yu Pan, Yang Hou, Xiongfei Wu, Liang Zhang, Yves Le Traon, Lei Ma, Jianjun Zhao

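AI summary: This paper proposes the S2ST-Omni 2 framework, which improves multilingual speech-to-speech translation with structured typological priors; experiments show strong results across several evaluation metrics and continued gains when training data are limited.

Details
Comments
Submitted to IEEE/ACM TASLP. This work extends S2ST-Omni, accepted to Findings of ACL 2026
AI abstract (translated from Chinese)

Compositional speech-to-speech translation (S2ST) systems built on speech large language models (SpeechLLMs) have recently performed well. However, existing systems often either ignore source-language information or encode it in a language-as-label manner, representing each source language as an independent flat embedding. This design ignores the systematic linguistic structure shared across languages and may limit multilingual adaptation when supervised data are scarce. To address this, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically moves multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves better average performance than representative S2ST methods across BLEU, COMET, ASR-BLEU, and BLASER 2.0. Ablation studies show that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. In addition, controlled data-budget analyses and a Japanese-to-English evaluation using only about 3 hours of supervised training data indicate that explicit typological priors provide a useful inductive bias for data-efficient multilingual S2ST.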

English abstract

Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.

2605.16024 2026-05-18 cs.AI

ScreenSearch: Uncertainty-Aware OS Exploration

Michael Solodko, Justin Wagle

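AI summary: ScreenSearch combines structural screen retrieval with an ambiguity-aware PUCT graph bandit to balance probing and committing during large-scale desktop exploration, producing exploration corpora with cross-application diversity.

Details
Comments
14 pages, 9 figures, 4 tables
AI abstract (translated from Chinese)

Desktop GUI agents operate under partial observability: visually similar screens can correspond to different underlying workflow states, so locally plausible actions can lead to very different outcomes. We frame this as a computer/OS state exploration problem in which effective behavior requires both expanding the reachable frontier and reducing ambiguity before committing. We present ScreenSearch, a system that combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph bandit for large-scale desktop exploration. The retrieval layer converts UIA trees into location-aware structural features, indexes related screens through sparse token search and metadata filters, and maintains a shared deduplicated state graph across virtual machine workers. On this graph, we define a scalable ambiguity signal based on the dispersion of matched-action outcomes. If similar screens produce different next states under the same action signature, the state should be probed further rather than treated as resolved. We use this signal together with frontier rewards to drive large-scale exploration and replay-start policy evaluation. Across 11 desktop applications, ScreenSearch collects over 1 million screenshots and over 30,000 deduplicated states, producing large exploration corpora with substantial cross-application and within-application diversity. On a fixed replay-start slice, we observe a trade-off between novelty and ambiguity: some policies reduce ambiguity quickly but discover little frontier. Reducing ambiguity alone is therefore not a sufficient exploration objective. Appendix ablations show that stronger proposal priors can substantially improve unique-state discovery during corpus building. These results indicate that state identity, proposal quality, and ambiguity-aware search all matter when deciding when to probe and when to commit.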

English abstract

Desktop GUI agents operate under partial observability: visually similar screens can correspond to different underlying workflow states, so locally plausible actions can lead to sharply different outcomes. We frame this as a problem of computer/OS state exploration, where effective behavior requires both expanding the reachable frontier and reducing ambiguity before committing. We present ScreenSearch, a system that combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit for large-scale desktop exploration. The retrieval layer converts UIA trees into location-aware structural features, indexes related screens through sparse token search and metadata filters, and maintains a shared deduplicated state graph across VM workers. On top of this graph, we define a scalable ambiguity signal based on matched-action outcome dispersion. If similar screens produce different next states under the same action signature, the state should be probed further rather than treated as resolved. We use this signal together with frontier rewards to drive large-scale exploration and replay-start policy evaluation over the shared graph. Across 11 desktop applications, ScreenSearch collects over 1M screenshots and over 30K deduplicated states, yielding large exploration corpora with substantial cross-application and within-application diversity. On a fixed replay-start slice, we observe a clear novelty-ambiguity trade-off: some policies reduce ambiguity quickly while discovering little frontier. Ambiguity reduction alone is therefore not a sufficient exploration objective. Appendix ablations show that stronger proposal priors can materially improve unique-state discovery during corpus building. These results suggest that state identity, proposal quality, and ambiguity-aware search all matter when deciding when to probe and when to commit.
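
One way to read "matched-action outcome dispersion" is as the spread of next states observed for the same (screen signature, action signature) pair. The sketch below measures that spread with Shannon entropy. The abstract does not say which dispersion measure ScreenSearch uses, so entropy is an assumption here, as are the function names and the toy transition log.

```python
from collections import Counter, defaultdict
import math

def outcome_entropy(next_states):
    """Shannon entropy (in nats) of the observed next-state distribution."""
    n = len(next_states)
    if n == 0:
        return 0.0
    counts = Counter(next_states)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

def ambiguity_by_key(transitions):
    """Group (screen_signature, action_signature, next_state) triples and score each group."""
    outcomes = defaultdict(list)
    for screen, action, next_state in transitions:
        outcomes[(screen, action)].append(next_state)
    return {key: outcome_entropy(states) for key, states in outcomes.items()}

# Hypothetical exploration log: the same click on look-alike dialogs lands in two places.
transitions = [
    ("save_dialog", "click:OK", "file_saved"),
    ("save_dialog", "click:OK", "overwrite_prompt"),
    ("save_dialog", "click:OK", "file_saved"),
    ("home", "click:File", "file_menu"),
    ("home", "click:File", "file_menu"),
]
scores = ambiguity_by_key(transitions)
```

A score of zero means every matched action led to the same next state, so the screen can be treated as resolved; a positive score marks a state worth probing further before committing.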

2605.16022 2026-05-18 cs.CV

EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting

Changjing Liu, Yiming Huang, Long Bai, Beilei Cui, Hongliang Ren

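AI summary: This paper proposes the EndoGSim framework, which uses MLLM-guided Gaussian Splatting for physics-aware reconstruction and simulation of endoscopic scenes, combining pre-trained segmentation and depth estimation to improve the realism and accuracy of surgical simulation.

Details
Comments
Early Accepted by MICCAI 2026
AI abstract (translated from Chinese)

In robot-assisted minimally invasive surgery, high-fidelity reconstruction and simulation of dynamic endoscopic scenes are essential for enhancing downstream tasks and improving surgical outcomes. However, existing methods focus mainly on visual reconstruction and lack the physical descriptions needed for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and simulation of endoscopic scenes through Gaussian Splatting guided by a multimodal large language model (MLLM). Our method uses 4D Gaussian Splatting (4DGS) combined with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To infer physical properties automatically, we introduce an object-wise material field that initializes material parameters with the MLLM and refines them with a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on open-source and in-house datasets, our framework outperforms existing methods in simulation fidelity and physical accuracy, highlighting its potential for robot-assisted surgical applications.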

English abstract

In robot-assisted minimally invasive surgery, high-fidelity dynamic endoscopic scene reconstruction and simulation are crucial to enhancing downstream tasks and advancing surgical outcomes. However, existing methods primarily focus on visual reconstruction, lacking physics-based descriptions of the scene required for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and physical simulation of endoscopic scenes through Multi-modal Large Language Models (MLLMs)-guided Gaussian Splatting. Our approach utilizes 4D Gaussian Splatting (4DGS) integrated with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To achieve automatic inference of physical properties, we introduce an object-wise material field that initializes material parameters via MLLM and refines them through a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on both open-source and in-house datasets, our framework achieves superior simulation fidelity and physical accuracy compared to state-of-the-art methods, underscoring its potential to advance robot-assisted surgical applications.

2605.16020 2026-05-18 cs.LG cond-mat.dis-nn hep-lat

Variational Autoregressive Networks with probability priors

Piotr Białas, Piotr Korcyl, Tomasz Stebel, Dawid Zapolski

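AI summary: This paper improves variational autoregressive networks with physical priors, introducing a prior probability distribution that reduces the training burden and makes simulation of discrete spin models more efficient.

Details
Comments
28 pages, 11 figures
AI abstract (translated from Chinese)

Monte Carlo methods are essential in many scientific fields, but their efficiency is often hampered by critical slowing down, a sharp increase in autocorrelation times near phase transitions. Although deep learning approaches such as neural-network-based samplers have been proposed to alleviate this problem, training the models is itself difficult. This difficulty stems partly from the overly general nature of the original machine-learning architectures, which often ignore the underlying physical symmetries and force the network to relearn them. This paper shows that incorporating physical priors into the model significantly improves performance. Building on existing strategies that integrate spin-spin interactions, we propose a framework that uses a prior probability distribution as the starting point for training. Results for the Ising model and the Edwards-Anderson spin glass model show that abandoning 'blank slate' models in favor of physics-informed priors reduces the training burden and facilitates simulation of larger system sizes in discrete spin models.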

English abstract

Monte Carlo methods are essential across diverse scientific fields, yet their efficiency is frequently hampered by critical slowing down, a sharp increase in autocorrelation times near phase transitions. Although deep learning approaches, such as neural-network-based samplers, have been proposed to alleviate this issue, they face another serious problem: the difficulty of training the models. This difficulty partially stems from the overly general nature of original machine-learning architectures, which often ignore underlying physical symmetries and force networks to relearn them from scratch. In this paper, we demonstrate that incorporating physical priors into the model significantly enhances performance. Building upon existing strategies that integrate spin-spin interactions, we propose a framework that utilizes a prior probability distribution as a starting point for training. Our results for the Ising model, as well as for the Edwards-Anderson spin glass model, suggest that moving away from 'blank slate' models in favor of physics-informed priors reduces the training burden and facilitates the simulation of larger system sizes in discrete spin models.

2605.16017 2026-05-18 cs.LG

Accelerated Gradient Descent for Faster Convergence with Minimal Overhead

Manuel Graca, L. Miguel Silveira, Arlindo Oliveira, Frank Liu

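AI summary: This paper proposes the CT-AGD algorithm, which accelerates first-order methods by capturing local curvature, reduces the number of training epochs, and has storage and computational overhead similar to adaptive methods such as Adam.

Details
Comments
17 pages
AI abstract (translated from Chinese)

In this paper, we propose CT-AGD (Curvature-Tuned Accelerated Gradient Descent), an optimization method for non-convex optimization problems in deep learning training tasks. CT-AGD is a general boosting procedure that accelerates first-order methods by explicitly capturing local curvature, together with heuristics designed to mitigate the noise and bias introduced by stochastic mini-batch training. The storage and computational overhead of CT-AGD is comparable to that of adaptive gradient methods such as Adam. Our extensive experiments show that CT-AGD matches the accuracy of baseline first-order methods while reducing training epochs by 33% on average.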

English abstract

In this paper, we present CT-AGD (Curvature-Tuned Accelerated Gradient Descent), an optimization method for non-convex optimization problems in deep learning training tasks. CT-AGD is a general boosting procedure that accelerates first-order methods by explicitly capturing the local curvature using finite-difference quotients, together with heuristics aimed at mitigating the noise and bias introduced by stochastic mini-batch training. CT-AGD has storage and computational overhead comparable to that of adaptive gradient methods such as Adam. Our extensive experiments demonstrate that CT-AGD achieves the same level of accuracy as the baseline first-order methods, yet reduces the required training epochs by 33% on average.
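
The phrase "finite-difference quotients" can be illustrated with a generic per-coordinate curvature estimate built from two consecutive gradients. This is not the CT-AGD update, whose details the abstract does not give. The toy quadratic objective, the function names, and the curvature-scaled step at the end are assumptions chosen so that the estimate can be checked by hand.

```python
import numpy as np

def grad(x, a):
    """Gradient of the toy objective f(x) = 0.5 * sum(a * x**2)."""
    return a * x

def curvature_estimate(x_prev, g_prev, x_curr, g_curr, eps=1e-12):
    """Per-coordinate finite-difference quotient (g_k - g_{k-1}) / (x_k - x_{k-1}).

    Coordinates that did not move (|dx| <= eps) get an estimate of 0.
    """
    dx = x_curr - x_prev
    dg = g_curr - g_prev
    moved = np.abs(dx) > eps
    safe_dx = np.where(moved, dx, 1.0)
    return np.where(moved, dg / safe_dx, 0.0)

a = np.array([1.0, 10.0, 100.0])   # true curvature of each coordinate
x0 = np.array([1.0, 1.0, 1.0])

# One plain gradient step gives two points at which to compare gradients.
x1 = x0 - 0.001 * grad(x0, a)
h = curvature_estimate(x0, grad(x0, a), x1, grad(x1, a))

# Scaling the next step by the estimated curvature (assumes h > 0 everywhere).
x2 = x1 - grad(x1, a) / h
```

On this quadratic the quotient recovers the curvature vector `a`, and the curvature-scaled step lands at the minimizer. With noisy mini-batch gradients or negative curvature the raw quotient is unreliable, which is presumably why the abstract mentions additional heuristics.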

2605.16011 2026-05-18 cs.CL cs.AI

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung

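AI summary: This paper examines the adaptivity of vision language models in mathematics education. It proposes a learner model-based rubric that assesses adaptivity in cognitive aspects, motivational aspects, and complexity, and finds that current models struggle to produce consistent instructional responses when given limited learner information.

Details
AI abstract (translated from Chinese)

Adaptive learning refers to educational technologies that track learners' progress and adjust the instructional process according to individual learner performance. It is increasingly recognized as key to developing effective learning support tools. Vision language models (VLMs) have been adopted in mathematics education, where students use them as aids for personalized instruction. However, it is unclear whether VLMs can provide mathematical instruction tailored to different learner profiles. There is currently no systematic framework for evaluating how well VLMs adapt to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and reveal that current VLMs struggle to consistently produce learner model-based instructional responses when given limited learner information.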

English abstract

Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.