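arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.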

2605.20192 2026-05-25 cs.CL cs.CE cs.CR cs.CY q-fin.CP

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

Xintong Wu, Peiting Tsai, Jing Yuan, Michael Yu, Greg Sun, Luyao Zhang

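AI summary: This paper studies how large language models can be used to analyze sentiment in the Discord community of the Decentraland virtual platform, and combines that sentiment with multi-modal financial data to improve price prediction for the MANA token. It uses a BERT-based model for sentiment analysis and builds two LSTM architectures: one on historical prices alone, and one that fuses sentiment scores, trading volume, and market capitalization. In the experiments, the multi-modal model is significantly more accurate than the price-only baseline, which points to the value of community sentiment signals for forecasting in virtual economies.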

Abstract

Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.
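A minimal sketch, in plain Python, of how the inputs to the two LSTM variants described above could differ. The field names (`price`, `sentiment`, `volume`, `market_cap`, `ret`), the lookback length, the toy values, and the choice to keep price in the multi-modal feature set are all illustrative assumptions, not details taken from the paper.

```python
def make_windows(rows, feature_keys, lookback):
    """Turn a daily series into (lookback window, next-step return) pairs."""
    X, y = [], []
    for t in range(lookback, len(rows)):
        # Each window is `lookback` consecutive days, one feature vector per day.
        window = [[rows[i][k] for k in feature_keys] for i in range(t - lookback, t)]
        X.append(window)
        y.append(rows[t]["ret"])
    return X, y

# Hypothetical daily records; the numbers are made up for illustration only.
rows = [
    {"price": 0.50, "sentiment": 0.10, "volume": 10.0, "market_cap": 900.0, "ret": 0.00},
    {"price": 0.52, "sentiment": 0.30, "volume": 12.0, "market_cap": 936.0, "ret": 0.04},
    {"price": 0.51, "sentiment": 0.00, "volume": 9.0, "market_cap": 918.0, "ret": -0.02},
    {"price": 0.55, "sentiment": 0.40, "volume": 15.0, "market_cap": 990.0, "ret": 0.08},
    {"price": 0.54, "sentiment": 0.20, "volume": 11.0, "market_cap": 972.0, "ret": -0.02},
]

BASELINE = ["price"]                                        # price-only baseline
MULTIMODAL = ["price", "sentiment", "volume", "market_cap"]  # assumed feature set

Xb, yb = make_windows(rows, BASELINE, lookback=2)
Xm, ym = make_windows(rows, MULTIMODAL, lookback=2)
```

Both variants share the same targets and window boundaries, so any accuracy difference between the two models can be attributed to the extra input channels rather than to a different sampling of the series.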

2605.20087 2026-05-25 cs.CL cs.AI

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu

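AI summary: ThoughtTrace is the first large-scale dataset that records real-world multi-turn user-AI conversations together with users' self-reported thoughts, capturing why users send their prompts and how they react to the AI's replies. It covers 1,058 users, 2,155 conversations, and 10,174 thought annotations. The analysis shows that thoughts are semantically distinct from the conversation messages and are hard for current frontier LLMs to infer. The work also demonstrates the value of thoughts for behavior prediction and for training personalized assistants, offering a new data modality for understanding users' latent goals and needs.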

Comments 53 pages, 23 figures, 4 tables. Project website: https://thoughttrace-project.github.io/

Abstract

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human-AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.

2605.20043 2026-05-25 cs.CL

Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

Wen Zhang

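AI summary: For Japanese past-tense morphological generation, this paper proposes an orthography-aware error analysis that treats hiragana as a representational system encoding morphophonological distinctions rather than as a mere transcription medium. It evaluates two character-level sequence-to-sequence models and finds that, despite high overall accuracy, they make systematic errors tied to orthographic properties of hiragana, most visibly in verbs that involve gemination. The paper introduces a taxonomy of seven main error patterns, shows how closely orthographic representation, morphological structure, and data frequency interact in model generalization, and argues for orthography-aware evaluation in morphologically complex languages.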

Abstract

We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.
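To make the gemination errors mentioned above concrete, here is a small hedged sketch in Python. In hiragana, gemination is written with the small っ (sokuon): the consonant-stem verb かえる (帰る, "to return") forms the past tense かえった, while the vowel-stem verb たべる ("to eat") forms たべた with no っ. The function and its labels are hypothetical and cover only two simple cases; they are not the paper's seven-category taxonomy.

```python
SOKUON = "っ"  # small tsu, the hiragana mark for gemination

def classify_error(gold, pred):
    """Label a predicted past-tense form against the gold form (illustrative labels)."""
    if pred == gold:
        return "correct"
    # The model left out a required sokuon, e.g. かえった -> かえた.
    if SOKUON in gold and pred == gold.replace(SOKUON, "", 1):
        return "gemination_dropped"
    # The model inserted a sokuon that should not be there, e.g. たべた -> たべった.
    if SOKUON in pred and gold == pred.replace(SOKUON, "", 1):
        return "gemination_inserted"
    return "other"

examples = [
    ("かえった", "かえった"),  # 帰る, predicted correctly
    ("かえった", "かえた"),    # 帰る, sokuon dropped
    ("たべた", "たべった"),    # 食べる, sokuon wrongly inserted
    ("たべた", "たべる"),      # unrelated failure
]
labels = [classify_error(gold, pred) for gold, pred in examples]
```

Note that かえる is ambiguous in hiragana: 変える ("to change") is a vowel-stem verb whose past tense is かえた. This kind of surface ambiguity is one reason an orthography-aware analysis is informative for stems ending in e.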

2605.18993 2026-05-25 cs.LG cs.AI

Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic

Thomas Sommariva, Francesca Morandi, Simone Calderara, Angelo Porrello

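AI summary: This work asks how non-linear fine-tuning can retain the advantages that linear fine-tuning offers for task-vector composition. The authors impose constraints in activation space so that the non-linear model stays linear with respect to weight perturbations, training the student by distilling hidden representations from a linearized teacher. The method keeps task vectors composable without extra inference-time cost and performs strongly on vision and language tasks.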

Comments Accepted at ICML 2026

Abstract

Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
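The task arithmetic that the abstract builds on can be sketched in a few lines of plain Python. A task vector is the difference between fine-tuned and pre-trained weights; adding task vectors merges tasks, and subtracting one is used for unlearning. The weight dictionaries, the parameter name `w`, and the scaling factor `alpha` below are toy assumptions for illustration. The sketch shows only the composition step, not the paper's distillation procedure.

```python
def task_vector(pretrained, finetuned):
    """Task vector: element-wise difference (fine-tuned minus pre-trained weights)."""
    return {k: [f - p for p, f in zip(pretrained[k], finetuned[k])] for k in pretrained}

def apply_task_vectors(pretrained, vectors, alpha=1.0):
    """Add the scaled task vectors to the pre-trained weights (alpha < 0 subtracts)."""
    out = {k: list(v) for k, v in pretrained.items()}
    for tv in vectors:
        for k in out:
            out[k] = [w + alpha * d for w, d in zip(out[k], tv[k])]
    return out

# Toy weights for a pre-trained model and two fine-tuned variants.
theta0 = {"w": [1.0, 2.0]}
theta_a = {"w": [1.5, 2.0]}
theta_b = {"w": [1.0, 1.0]}

tau_a = task_vector(theta0, theta_a)
tau_b = task_vector(theta0, theta_b)

merged = apply_task_vectors(theta0, [tau_a, tau_b])          # merging by addition
forgot_a = apply_task_vectors(theta0, [tau_a], alpha=-1.0)   # unlearning by subtraction
```

In this toy case the two task vectors touch different coordinates, so they compose without interference. In real models that is not guaranteed, which is the motivation for methods that encourage disentangled task vectors.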

2605.18911 2026-05-25 cs.LG cs.AI

Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

Yangshuang Xu, Yuyang Dai, Liling Chang, Qi Wang, Yushun Dong

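AI summary: This paper examines whether existing Earth foundation models are actually effective for wildfire prediction, noting that they are pretrained for general atmospheric and geophysical objectives rather than for wildfire forecasting. The authors propose WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction, and introduce a fixed-contract evaluation framework to address the evaluation sensitivity caused by the sparsity of wildfire events. The results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation; the framework and model are offered as a basis for future benchmarking.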

Comments 25 pages

Abstract

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

2605.18859 2026-05-25 cs.LG cs.AI

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi

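AI summary: This paper presents TwinRouterBench, a benchmark for evaluating routing strategies for agentic large language models (LLMs), with fast static evaluation and live dynamic evaluation. It has two tracks. The static track provides router-visible prefixes from several task suites, each paired with an execution-verified target model tier, and is scored by deterministic computation. The dynamic track runs the router inside a live agent system and measures task resolution and realized cost. Together the tracks support offline iteration on routing algorithms followed by end-to-end validation.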

Abstract

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.

2605.18329 2026-05-25 cs.CV cs.LG

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

Tristan Kirscher, Markus Bujotzek, Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Kim-Celine Kahl, Balint Kovacs, Klaus Maier-Hein

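AI summary: In medical image segmentation, ensemble disagreement is often used as a proxy for epistemic uncertainty, yet many studies build their ensembles via K-fold cross-validation (CV) while calling them "deep ensembles" (DE), so terminology and implementation do not match. This paper compares a standard 5-fold CV ensemble with a 5-member DE on three multi-rater segmentation datasets. DE matches segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability. The authors recommend choosing the ensemble construction to fit the research question: DE for reliability-oriented uses such as selective referral, and CV ensembles as a proxy for ambiguity.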

Comments Accepted for publication at MICCAI 2026

Journal ref: 29th International Conference On Medical Image Computing And Computer Assisted Intervention, Sep 2026, Strasbourg, France

Abstract

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
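The difference between the two ensemble constructions can be illustrated with a short Python sketch that only builds the training configurations (no model is trained). The function names, the use of all samples as the fixed deep-ensemble training set, and the seed numbering are assumptions made for illustration; this is not the nnU-Net modification the authors provide.

```python
import random

def kfold_train_sets(n_samples, k=5, seed=0):
    """Split indices into k folds; training set i is everything except fold i."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    train_sets = []
    for held_out in range(k):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        train_sets.append(sorted(train))
    return train_sets

def cv_ensemble(n_samples, k=5):
    """CV ensemble: members differ in initialization seed AND in training data."""
    return [{"init_seed": i, "train": t}
            for i, t in enumerate(kfold_train_sets(n_samples, k))]

def deep_ensemble(train, m=5):
    """Deep ensemble: one fixed training set; members differ only in the seed."""
    return [{"init_seed": s, "train": list(train)} for s in range(m)]

cv = cv_ensemble(100)
de = deep_ensemble(range(100))
```

With this setup, disagreement among `de` members can only come from seed-driven variability, while disagreement among `cv` members also reflects which 20% of the data each member never saw.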

2605.17637 2026-05-25 cs.AI

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu

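AI summary: WebGameBench is a benchmark that evaluates whether coding agents can go from requirements to a delivered application, specifically whether they can turn a structured web-game specification into a game that runs in the browser. Browser-native games provide a compact but behavior-rich testbed, and each generated application is labeled as excellent, usable, or unusable. The best current configuration reaches a 76.9% usable rate but only a 20.2% excellent rate, so a large gap to full requirement satisfaction remains. According to the authors, WebGameBench is the first requirement-to-application benchmark based on browser-native game delivery, and its runtime labels are broadly aligned with human gameplay review under the usable-rate criterion.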

Comments 19 pages, 6 figures

Abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

2605.17076 2026-05-25 cs.LG cs.AI cs.DC cs.MA

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

Sajjad Khan

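AI summary: This paper proposes S-Bus, an HTTP middleware for concurrency control among multi-agent LLMs that share mutable state, aimed at settings where agents cannot declare their read sets. Its core mechanism, the DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. This yields a consistency guarantee called Observable-Read Isolation (ORI), which prevents structural race conditions in dedicated-shard topologies. The contributions are mechanized formal verification, an empirical safety comparison against PostgreSQL and Redis, and an analysis of ORI's semantic effects under different workloads.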

Comments v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files

Abstract

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
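One plausible reading of the read-set reconstruction idea, sketched in plain Python. The server records which shard version each agent was last served, then checks those versions when the agent commits. The class layout, method names, version counters, and the reject-on-stale-read rule are all hypothetical simplifications for illustration; they are not S-Bus's actual API or its formal ORI definition.

```python
class DeliveryLog:
    """Toy server-side log of what each agent has read (hypothetical design)."""

    def __init__(self):
        self.versions = {}   # shard -> current version number
        self.delivered = {}  # agent -> {shard: version last served to that agent}

    def get(self, agent, shard):
        """Serve a read and record it; the agent declares nothing itself."""
        version = self.versions.get(shard, 0)
        self.delivered.setdefault(agent, {})[shard] = version
        return version

    def commit(self, agent, shard):
        """Rebuild the agent's read set from the log and reject stale commits."""
        read_set = self.delivered.get(agent, {})
        stale = [s for s, v in read_set.items() if self.versions.get(s, 0) != v]
        if stale:
            return False  # something the agent read has changed since delivery
        self.versions[shard] = self.versions.get(shard, 0) + 1
        self.delivered.pop(agent, None)  # start a fresh read set after commit
        return True

log = DeliveryLog()
log.get("agent_a", "plan")
log.get("agent_b", "plan")
first = log.commit("agent_a", "plan")    # succeeds: nothing read has changed
second = log.commit("agent_b", "plan")   # rejected: agent_b read a stale version
log.get("agent_b", "plan")               # agent_b re-reads the current version
third = log.commit("agent_b", "plan")    # succeeds after the fresh read
```

The point of the sketch is that the read set is inferred from traffic the server already observes, so unmodified agents still get a commit-time conflict check.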

2605.16799 2026-05-25 cs.LG cs.AI

Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis

Peiliang Zhang, Jingling Yuan, Shiqing Wu, Mengqing Hu, Chao Che, Yongjun Zhu, Lin Li

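AI summary: To address the lack of cross-domain modeling in molecular relational learning, this work proposes a cross-domain method grounded in structure-activity analysis. Its core is DisTrans, a domain adversarial training network with structural-semantic transfer discrepancy. DisTrans uses substructure topological discrepancies to guide the model in learning the domain dependence of molecular structures, and aligns functional-group semantic information between the source and target domains to improve cross-domain adaptation. In experiments on two typical cross-domain settings, the method outperforms 16 baselines and maintains satisfactory performance even under pronounced inter-domain discrepancy.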

Comments Accepted by SIGKDD 2026 Research Track

Abstract

Recent advances in molecular representation integrate molecular topological and visual modalities, opening new avenues for precise Molecular Relational Learning (MRL). Existing MRL methods focus on intra-domain modeling, and their inherent domain-closed effect limits applicability to molecular science, particularly in elucidating cross-domain interaction mechanisms. Consequently, the imperative for Cross-Domain Molecular Relational Learning has become increasingly pressing. Benefiting from structure-activity analysis, we propose the Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy (DisTrans) to optimize cross-domain adaptive representation for molecular structures and visual images. 1) We employ the gradient reversal strategy based on substructure topological discrepancies between domains to learn the domain dependence of molecular structures. This strategy guides the model to adapt to the structural adjacency patterns in the target domain, generating domain-separable structural representations. 2) We apply the cross-domain representation guidance mechanism to align the functional-group semantic information between the source and target domains, learning cross-domain consistency information. The experimental results in two typical cross-domain strategies demonstrate that DisTrans outperforms 16 baseline methods, maintaining satisfactory performance even under pronounced inter-domain discrepancy.

2605.16087 2026-05-25 cs.RO cs.AI

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

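AI summary: This paper studies how to achieve trustworthy and explainable AI in perception models for autonomous driving. To address the opacity and safety concerns of deep neural networks in this setting, it proposes a perception module that integrates faithful explainability with uncertainty estimation. Built on a transformer-based detector, the method derives explanations from the attention mechanism at inference time and validates their faithfulness with perturbation-based consistency tests; it also adds an uncertainty estimation and calibration module and robustness-enhancing training. The module is deployed in a prototype vehicle with a visualization interface, demonstrating the feasibility of real-time trustworthy perception monitoring.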

Comments Accepted for publication at IEEE ITSC 2026

Abstract

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust and integrates faithful explainability with calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .

2605.15828 2026-05-25 cs.CV

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

Yipu Zhang, Jintao Cheng, Weilun Feng, Jiehao Luo, Chuanguang Yang, Zhulin An, Yongjun Xu, Wei Zhang

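AI summary: This paper studies how to quantize feed-forward 3D reconstruction models such as the Visual Geometry Grounded Transformer (VGGT) effectively, in order to reduce their memory and compute overhead. Because tasks, blocks, and channels differ in their sensitivity to quantization error, the authors propose Fisher-Guided Quantization (FGQ), which uses the diagonal Fisher information matrix to quantify those sensitivities and incorporates them into a learnable affine transformation during calibration, so that the information most critical to each task is better preserved. In experiments on several 3D vision tasks, FGQ consistently outperforms existing methods, with up to 39% relative improvement under 4-bit quantization.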

Abstract

Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under the 4-bit quantization. Code is available at https://github.com/ypzhng/FGQ.
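As a rough illustration of the sensitivity measure named above, the following plain-Python sketch estimates a diagonal Fisher approximation as the mean of squared per-sample gradients, then uses it to weight a quantization error. The toy gradients and weights, the function names, and the weighted-error objective are assumptions for illustration; how FGQ actually folds these sensitivities into its learnable affine transformation is not specified in the abstract and is not reproduced here.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean squared gradient for each parameter."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def fisher_weighted_error(weights, quantized, fisher_diag):
    """Quantization error in which sensitive parameters count more (hypothetical)."""
    return sum(f * (w - q) ** 2 for w, q, f in zip(weights, quantized, fisher_diag))

# Toy per-sample gradients of a task loss w.r.t. two parameters (made-up values).
grads = [[1.0, 0.0], [3.0, 0.5]]
fisher = diagonal_fisher(grads)

# The same rounding error on both parameters is penalized very differently.
error = fisher_weighted_error([1.0, 1.0], [0.5, 0.5], fisher)
```

Computing one such diagonal per task would give a per-task sensitivity profile over channels and blocks, which is the kind of signal the abstract says FGQ exploits.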

2605.15482 2026-05-25 cs.CL

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

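AI summary: FINESSE-Bench is a hierarchical benchmark suite for evaluating the financial domain knowledge and technical-analysis ability of large language models. It contains 3,993 questions spanning difficulty levels from foundational to expert. The suite combines datasets modeled on professional certification exams, applied trading tasks, and finance olympiad problems, and is designed to assess breadth of knowledge, computational ability, and performance as difficulty increases. FINESSE-Bench also provides a unified evaluation protocol and an automated scoring scheme.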

Comments 21 pages, 10 tables, 2 figures

Abstract

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.

2605.11596 2026-05-25 cs.CV

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

HorizonDrive: 用于长时域驾驶仿真的自纠正自回归世界模型

Conglang Zhang, Yifan Zhan, Qingjie Wang, Zhanpeng Ouyang, Yu Li, Zihao Yang, Xiaoyang Guo, Weiqiang Ren, Qian Zhang, Zhen Dong, Yinqiang Zheng, Wei Yin, Zhengqing Chen

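AI Summary: This paper proposes HorizonDrive, a self-corrective autoregressive world model for long-horizon driving simulation. Its scheduled rollout recovery (SRR) mechanism keeps the teacher model stable over long autoregressive rollouts, and extending the teacher by autoregressive rollout supplies unbounded-horizon supervision, enabling minute-scale rollout under bounded memory. On nuScenes, HorizonDrive reduces FID by 52% and FVD by 37% relative to the strongest long-horizon streaming baselines.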

Comments 22 pages, 14 figures. Project page: https://zcliangyue.github.io/HorizonDrive Code: https://github.com/zcliangyue/HorizonDrive

AI中文摘要

闭环驾驶仿真需要超越短时离线片段的实时交互，推动当前驾驶世界模型向自回归（AR）滚转发展。现有的AR蒸馏方法通常依赖于帧沉或学生端退化训练。前者由于快速的自我运动和场景变化，难以迁移到驾驶场景；后者受限于教师单次输出长度，仅提供有限的监督时域。一个自然的问题是：能否通过AR滚转扩展教师本身，以有限的内存成本提供无限时域的监督？关键困难在于标准教师会在自身预测下漂移，污染其提供的监督。我们的关键见解是使教师具备滚转能力，确保从其自身的AR滚转中获得可靠监督。这实例化为HorizonDrive，一个用于AR驾驶仿真的抗漂移训练与蒸馏框架。首先，计划性滚转恢复（SRR）训练基础模型从预测损坏的历史中重建真实未来片段，得到一个在长AR滚转中保持稳定的教师。其次，通过AR滚转扩展具备滚转能力的教师，在有限内存下提供长时域分布匹配监督，同时短窗口学生通过教师滚转DMD（TRD）与之对齐，以实现高效的实时部署。HorizonDrive原生支持在有限内存下的分钟级AR滚转；在nuScenes上，与最强的长时域流式基线相比，HorizonDrive将FID降低52%，FVD降低37%，并将ARE和DTW分别降低21%和9%，同时与单次驾驶视频生成器保持竞争力。

英文摘要

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
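
Two ideas in the abstract can be sketched with a toy example: autoregressive rollout under bounded memory, and histories corrupted by the model's own predictions. The code below is only an illustration under stated assumptions. `predict_next` is a hypothetical stand-in for a world model, and the per-frame random swap in `corrupt_history` is an assumed corruption scheme, not necessarily the one used by SRR.

```python
import random
from collections import deque

def ar_rollout(predict_next, seed_frames, num_steps, window):
    """Roll a model forward on its own outputs, keeping only the last `window` frames."""
    history = deque(seed_frames, maxlen=window)  # oldest frames are dropped automatically
    outputs = []
    for _ in range(num_steps):
        nxt = predict_next(list(history))
        outputs.append(nxt)
        history.append(nxt)  # the prediction becomes part of the next input
    return outputs

def corrupt_history(gt_frames, predicted_frames, corruption_prob, rng):
    """Swap each ground-truth frame for the model's prediction with the given probability."""
    return [p if rng.random() < corruption_prob else g
            for g, p in zip(gt_frames, predicted_frames)]

# Toy "frames" are integers; the toy model just increments the latest frame.
print(ar_rollout(lambda h: h[-1] + 1, [0], num_steps=3, window=2))
print(corrupt_history([1, 2, 3], [9, 9, 9], 0.5, random.Random(0)))
```

Because the history is a fixed-length window, memory use does not grow with the number of rollout steps.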

2605.11490 2026-05-25 cs.LG stat.ML

Adaptive Calibration in Non-Stationary Environments

非平稳环境中的自适应校准

Junyan Liu, Haipeng Luo, Lillian J. Ratliff

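AI Summary: Calibrated online prediction is a central challenge in modern AI systems. This paper proposes online prediction algorithms whose calibration error adapts automatically to the degree of non-stationarity in the environment, interpolating smoothly between i.i.d. and adversarial regimes. The guarantees hold under multiple calibration measures; the bounds match the optimal rates in the stationary case and recover known guarantees in the fully adversarial regime. The approach extends prior work with epoch-based scheduling and a non-uniform partition of the prediction space.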

Comments Added results for piecewise-stationary environments and included a comparison with the concurrent work of Huang et al. (arXiv:2605.09273)

AI中文摘要

在现代AI系统中，进行校准的在线预测是一个核心挑战。现有文献大多关注完全对抗性环境，其中结果可能是任意的，导致算法保守，在更温和的设置（如结果近乎平稳）中表现次优。这一差距引发了一个自然问题：我们能否设计在线预测算法，其校准误差自动适应环境的非平稳程度，在独立同分布和对抗性场景之间平滑插值？我们对此问题给出肯定回答，并开发了一套算法，在多种校准度量下实现自适应校准保证。具体地，设$T$为轮数，$K$为环境中未知的独立同分布段数，$C\in[0,T]$为另一个未知的非平稳度量（定义为均值结果的最小$\ell_1$偏差），我们的算法对$\ell_1$校准误差达到$\widetilde{O}(\min\{\sqrt{T}+(TC)^{\frac{1}{3}}, \sqrt{KT}\})$，对$\ell_2$和伪KL校准误差均达到$\widetilde{O}(\min\{(1+C)^{\frac{1}{3}}, K\})$。这些界匹配平稳情况（$C=0$且$K=1$）的最优率，并在完全对抗性场景（$C, K=\Omega(T)$）中恢复已知保证。我们的方法建立在并扩展了先前工作[Hu等人，2026，Luo等人，2025]的基础上，引入基于epoch的调度以及对预测空间进行新颖的非均匀划分，在底层真实值附近分配更精细的分辨率。

英文摘要

Making calibrated online predictions is a central challenge in modern AI systems. Much of the existing literature focuses on fully adversarial environments where outcomes may be arbitrary, leading to conservative algorithms that can perform suboptimally in more benign settings, such as when outcomes are nearly stationary. This gap raises a natural question: can we design online prediction algorithms whose calibration error automatically adapts to the degree of non-stationarity in the environment, smoothly interpolating between i.i.d. and adversarial regimes? We answer this question in the affirmative and develop a suite of algorithms that achieve adaptive calibration guarantees under multiple calibration measures. Specifically, with $T$ being the number of rounds, $K$ being the unknown number of i.i.d. segments of the environment, and $C\in[0,T]$ being another unknown non-stationary measure defined as the minimal $\ell_1$ deviation of the mean outcomes, our algorithms attain $\widetilde{O}(\min\{\sqrt{T}+(TC)^{\frac{1}{3}}, \sqrt{KT}\})$ for $\ell_1$ calibration error and $\widetilde{O}(\min\{(1+C)^{\frac{1}{3}}, K\})$ for both $\ell_2$ and pseudo KL calibration error. These bounds match the optimal rates in the stationary case ($C=0$ and $K=1$) and recover known guarantees in the fully adversarial regime ($C, K=\Omega(T)$). Our approach builds on and extends prior work [Hu et al., 2026, Luo et al., 2025], introducing an epoch-based scheduling together with a novel non-uniform partition of the prediction space that allocates finer resolution near the underlying ground truth.
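
To make the quantities above concrete, here is a small sketch that computes the $\ell_1$ calibration error of a prediction sequence (using the standard definition: group rounds by predicted value and sum the absolute accumulated gaps) and evaluates the stated $\ell_1$ rate with logarithmic factors and constants ignored. The function names are illustrative and not taken from the paper.

```python
import math
from collections import defaultdict

def l1_calibration_error(predictions, outcomes):
    """Sum over distinct predicted values p of |sum of (p - y_t) over rounds with p_t = p|."""
    gaps = defaultdict(float)
    for p, y in zip(predictions, outcomes):
        gaps[p] += p - y
    return sum(abs(g) for g in gaps.values())

def l1_rate(T, C, K):
    """min{sqrt(T) + (T*C)^(1/3), sqrt(K*T)}, ignoring log factors and constants."""
    return min(math.sqrt(T) + (T * C) ** (1 / 3), math.sqrt(K * T))

# Predicting 0.5 on outcomes that average 0.5 is perfectly calibrated.
print(l1_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
# Stationary case (C = 0, K = 1): the rate reduces to sqrt(T).
print(l1_rate(T=100, C=0, K=1))
# With C and K on the order of T, the (T*C)^(1/3) term becomes T^(2/3).
print(l1_rate(T=1000, C=1000, K=1000))
```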

2605.10347 2026-05-25 cs.AI cs.CL

How Mobile World Model Guides GUI Agents?

移动世界模型如何指导GUI代理？

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

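AI Summary: This paper studies how mobile world models can guide GUI agents. Because it is unclear which representation of predicted future states is useful, the authors train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. The models achieve state-of-the-art performance on MobileWorldBench and Code2WorldBench. Downstream evaluation shows that renderable code reconstruction offers high in-distribution fidelity and effective multimodal supervision, that text-based feedback is more robust for out-of-distribution execution, and that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.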

AI中文摘要

视觉语言模型的最新进展使移动GUI代理能够感知视觉界面并执行用户指令，但对于长期和高风险交互，动作后果的可靠预测仍然至关重要。现有的移动世界模型提供基于文本或基于图像的未来状态，但尚不清楚哪种表示有用，生成的rollout是否可以替代真实环境，以及测试时指导如何帮助不同强度的代理。为了回答上述问题，我们筛选并标注了移动世界模型数据，然后训练了四种模态的世界模型：增量文本、完整文本、基于扩散的图像和可渲染代码。这些模型在MobileWorldBench和Code2WorldBench上均达到了最先进性能。此外，通过在AITZ、AndroidControl和AndroidWorld上评估其下游效用，我们得到三个发现。首先，可渲染代码重建实现了高分布内保真度，并为数据构建提供了有效的多模态监督，而基于文本的反馈对于在线分布外执行更鲁棒。其次，世界模型生成的轨迹可以在训练过程中提供可迁移的交互经验，并提高代理的端到端任务性能，尽管这些数据不保留原始分布。最后，对于动作熵低的过度自信移动代理，后验自省提供的收益有限，这表明世界模型作为先验感知或训练监督比作为通用事后验证器更有效。

英文摘要

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
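
The third finding refers to agents with low action entropy. As a reminder of what that quantity measures, the sketch below computes the Shannon entropy of an agent's distribution over candidate actions; the probability vectors are made-up examples, and how the paper estimates this distribution is not specified here.

```python
import math

def action_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over candidate actions."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# An agent that puts all its mass on one action has zero entropy (fully confident).
print(action_entropy([1.0, 0.0, 0.0]))
# An agent split evenly between two actions has entropy log(2).
print(action_entropy([0.5, 0.5]))
```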

2605.07590 2026-05-25 cs.CV

Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

超越防御：面向内在3D点云鲁棒性的流形对齐正则化

Pedro Alonso, Chongshou Li, Tianrui Li

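AI Summary: Despite progress in point cloud robustness, existing methods rely mainly on augmentation strategies or defense mechanisms and overlook the geometric nature of adversarial fragility. This paper hypothesizes that adversarial vulnerability in 3D networks stems from a misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. The proposed Manifold-Aligned Point Recognition (MAPR) framework improves robustness across several datasets without adversarial training or additional data, with average gains of +20.02 and +8.83 percentage points over vanilla models on ModelNet40 and ScanObjectNN, respectively.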

AI中文摘要

尽管点云鲁棒性研究取得了广泛进展，现有方法主要依赖增强策略或防御机制，却忽视了对抗脆弱性的几何本质。我们假设3D网络中的对抗脆弱性源于模型学习的潜在几何与底层表面的内在几何之间的流形错位。沿输入流形的微小几何保持扰动往往在特征空间中引起不成比例的扭曲，可能导致误分类。我们通过建立3D鲁棒性的几何解释来形式化这一现象，将经典对抗理论与点云的内在结构联系起来。受此分析启发，我们提出了流形对齐点识别（MAPR），该框架通过跨内在扰动对齐预测来正则化潜在几何。MAPR为每个点云增强捕获局部曲率和扩散结构的内在特征，并应用保持内在几何保持扰动不变性的一致性损失。在不依赖对抗训练或额外数据的情况下，MAPR在多个数据集上持续提升对多种对抗攻击的鲁棒性，在ModelNet40和ScanObjectNN上分别比原始模型平均提高+20.02和+8.83个百分点的鲁棒性。

英文摘要

Despite extensive progress in point cloud robustness, existing methods primarily rely on augmentation strategies or defense mechanisms while overlooking the geometric nature of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, potentially leading to misclassifications. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness under multiple adversarial attacks across several datasets, achieving average robustness gains of +20.02 and +8.83 percentage points over vanilla models on ModelNet40 and ScanObjectNN, respectively.
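
The abstract mentions a consistency loss that aligns predictions across perturbations but does not give its form. One common choice for such a loss is a symmetric KL divergence between the class distributions predicted for the clean and the perturbed input; the sketch below assumes that form purely for illustration, and the logits are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_clean, logits_perturbed):
    """Symmetric KL between predictions on a clean and a perturbed input (assumed form)."""
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Identical predictions give zero loss; diverging predictions are penalized.
print(consistency_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))
print(consistency_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))
```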

2605.07220 2026-05-25 cs.LG

On the Robustness of Distribution Support under Diffusion Guidance

扩散引导下分布支撑的鲁棒性研究

Ruijia Cao, Yuchen Wu, Nisha Chandramoorthy

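AI Summary: This paper studies why diffusion guidance consistently produces high-quality samples. The authors establish a robustness-of-support property: given exact access to the score functions, guided diffusion processes almost always generate samples that remain close to the target support, which keeps them structurally plausible. The analysis covers both DDIM and DDPM and applies to a wide range of discretization schemes induced by exponential integrators.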

AI中文摘要

扩散引导是一种强大的技术，能够通过扩散模型实现可控且高保真的样本生成。在高层次上，它通过引入引导项来修改得分函数，从而将生成过程导向所需条件。尽管在经验上取得了成功，但扩散引导的理论性质在很大程度上仍未得到探索，并且尚不清楚它为何能持续生成高质量样本。在这项工作中，我们通过建立支撑的鲁棒性性质来解释扩散引导的有效性。具体来说，我们表明，在精确访问得分函数的情况下，引导扩散过程几乎总是生成接近目标支撑的样本。这一性质尤其理想，因为偏离支撑的样本通常在结构上不可信，并可能对下游任务产生不利影响。我们的分析涵盖了去噪扩散隐式模型（DDIM）和去噪扩散概率模型（DDPM），并适用于由指数积分器引起的广泛离散化方案。我们的结果为理解扩散引导为何能生成物理上有意义且结构合理的样本提供了严格的基础。

英文摘要

Diffusion guidance is a powerful technique that enables controllable and high-fidelity sample generation with diffusion models. At a high level, it modifies the score function by incorporating a guidance term that steers the generative process toward a desired condition. Despite its empirical success, the theoretical properties of diffusion guidance remain largely unexplored, and it is not well understood why it consistently produces high-quality samples. In this work, we explain the effectiveness of diffusion guidance by establishing a robustness of support property. Specifically, we show that, given exact access to the score functions, guided diffusion processes almost always generate samples that remain close to the target support. This property is particularly desirable, as samples that lie off the support are often structurally implausible and may adversely affect downstream tasks. Our analysis covers both Denoising Diffusion Implicit Models (DDIM) and Denoising Diffusion Probabilistic Models (DDPM), and applies to a wide range of discretization schemes induced by exponential integrators. Our results provide a rigorous foundation for understanding why diffusion guidance produces physically meaningful and structurally plausible samples.
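
As a concrete picture of "modifying the score function with a guidance term", the toy below uses one common parameterization, in which the guided score is the unconditional score plus a scaled difference between conditional and unconditional scores. The two 1-D Gaussians and the scale `w` are illustrative assumptions; the paper's analysis is not limited to this form.

```python
def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, var)."""
    return -(x - mean) / var

def guided_score(x, w):
    """Unconditional score plus w times the guidance term (conditional minus unconditional)."""
    s_uncond = gaussian_score(x, mean=0.0, var=4.0)  # toy unconditional density
    s_cond = gaussian_score(x, mean=2.0, var=1.0)    # toy conditional density
    return s_uncond + w * (s_cond - s_uncond)

# w = 0 gives the unconditional score, w = 1 the conditional score,
# and w > 1 pushes harder toward the conditional mode.
for w in (0.0, 1.0, 2.0):
    print(w, guided_score(1.0, w))
```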

2605.06840 2026-05-25 cs.AI

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

从LLM推理轨迹中提取搜索树揭示短视规划

Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

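AI Summary: By extracting search trees from LLM reasoning traces in the four-in-a-row board game, this study finds that LLM planning is myopic. Although the traces expand deep nodes, move choices are driven mainly by shallow nodes, whereas human performance is driven primarily by deep search. The finding identifies a key difference between LLM and human planning and offers targeted guidance for aligning the two.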

AI中文摘要

大型语言模型（LLMs），尤其是推理模型，会生成扩展的思维链（CoT）推理，其中通常包含对未来结果的明确思考。然而，这种思考是否构成真正的规划、其结构如何以及哪些方面驱动性能仍不清楚。在这项工作中，我们引入了一种新方法，通过从四子棋游戏的推理轨迹中提取和量化搜索树来表征LLM规划。通过将计算模型拟合到提取的搜索树上，我们表征了规划的结构及其如何影响移动决策。我们发现LLM的搜索比人类更浅，性能由搜索广度而非深度预测。最引人注目的是，尽管LLM在轨迹中扩展了深层节点，但其移动选择最好由一个完全忽略这些节点的短视模型解释。一项因果干预研究（我们选择性剪枝CoT段落）进一步表明，移动选择主要由浅层节点而非深层节点驱动。这些模式与人类规划形成对比，在人类规划中，性能主要由深度搜索驱动。总之，我们的发现揭示了LLM与人类规划之间的关键差异：虽然人类专业知识由更深层次的搜索驱动，但LLM并不基于深层前瞻行动。这种分离为对齐LLM和人类规划提供了有针对性的指导。更广泛地说，我们的框架提供了一种可推广的方法，用于解释跨战略领域LLM规划的结构。

英文摘要

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.
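
Once a search tree has been extracted from a trace, depth and breadth can be read off directly. The sketch below assumes a simple nested-dictionary representation and takes breadth to mean the number of candidate moves considered at the root; both are illustrative assumptions rather than the paper's definitions.

```python
def tree_depth(node):
    """Number of moves along the longest line of lookahead below this node."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)

def root_breadth(node):
    """Number of distinct candidate moves considered at the root."""
    return len(node.get("children", []))

def count_nodes(node):
    """Total number of nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))

# Hypothetical tree: three candidate moves, one of them explored three plies deep.
tree = {"move": None, "children": [
    {"move": "a", "children": [
        {"move": "a1", "children": [{"move": "a2", "children": []}]}]},
    {"move": "b", "children": []},
    {"move": "c", "children": []},
]}
print(tree_depth(tree), root_breadth(tree), count_nodes(tree))
```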

2605.06498 2026-05-25 cs.RO cs.SY eess.SY

Lie Group Formulation of Recursive Dynamics Algorithms of Higher Order for Floating-Base Robots

浮动基座机器人高阶递归动力学算法的李群公式

Ahmed Ali, Chiara Gabellieri, Antonio Franchi

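AI Summary: This paper presents procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms for floating-base robots. The method applies to tree-structured systems whose base configuration evolves on SE(3) and whose attached mechanism has its configuration on the manifold T^{n1} × R^{n2}, using the spatial representation of twists, and the recursions are collected into closed-form equations of motion. Applied to a 12-DoF aerial manipulator, it yields analytical geometric forward and inverse dynamics with their first time derivatives, and numerical simulations evaluate the dynamics up to fifth order. In the benchmarks considered, the computational cost scales quadratically with derivative order, whereas the automatic-differentiation baseline scales exponentially.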

DOI: 10.1115/1.4071985
Journal ref: ASME. Journal of Mechanisms and Robotics (2026)

AI中文摘要

本文描述了计算浮动基座树状系统的李群牛顿-欧拉、组合体惯量和混合动力学算法的高阶时间导数的过程，其中基座构型在SE(3)上演化，附着的机构是一个开运动学树，构型在(n1+n2)维流形T^{n1} × R^{n2}上，使用旋量的空间表示。在给出算法后，我们将得到的递归式整理成闭式运动方程，识别出满足无源性性质的容许科里奥利矩阵，并证明组合惯性张量在所有时间导数下保持不变。然后，我们将所开发的方法应用于一个12自由度空中机械臂，推导其几何正动力学和逆动力学及其一阶时间导数的解析表达式，而数值模拟成功评估了这些动力学直至五阶。最后，为了展示其实用性，我们对所提出的扩展进行了基准测试，并表明在考虑的测试中，其计算成本随导数阶数呈二次增长，而自动微分基线则呈指数增长。

英文摘要

In this paper, we describe procedures for computing higher-order time derivatives of the Lie-group Newton-Euler, Articulated-Body Inertia, and hybrid dynamics algorithms for floating-base trees, where the base configuration evolves on SE(3) and the attached mechanism is an open kinematic tree with configuration on the (n1+n2)-dimensional manifold T^{n1} \times R^{n2}, using spatial representation of twists. After presenting the algorithms, we collect the resulting recursions into closed-form equations of motion, identifying an admissible Coriolis matrix satisfying the passivity property, and showing that the articulated inertia tensor remains unchanged across all time derivatives. We then apply the developed methods to a 12-DoF aerial manipulator to derive analytical expressions for its geometric forward and inverse dynamics along with their first time derivatives whereas the numerical simulations successfully evaluate these dynamics up to fifth order. Finally, to demonstrate their practical utility, we benchmark the proposed extensions and show that, in the considered tests, their computational cost scales quadratically with the derivative order, whereas the automatic-differentiation baseline exhibits exponential scaling.
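
Higher-order time derivatives of dynamics equations involve derivatives of products of time-varying quantities, which the general Leibniz rule expresses in terms of the factors' own derivatives. The sketch below shows that generic rule for scalars only; it is not the paper's algorithm, and the term count is a simple observation about the rule, not the paper's complexity analysis.

```python
from math import comb

def product_derivative(f_derivs, g_derivs, k):
    """k-th derivative of f*g from derivatives of f and g (general Leibniz rule).

    f_derivs[j] and g_derivs[j] hold the j-th derivatives evaluated at one time instant.
    """
    return sum(comb(k, j) * f_derivs[j] * g_derivs[k - j] for j in range(k + 1))

def terms_up_to_order(n):
    """Leibniz terms needed for all orders 0..n: (n + 1)(n + 2) / 2, quadratic in n."""
    return sum(k + 1 for k in range(n + 1))

# f(t) = t^2 and g(t) = t^3 at t = 1, so f*g = t^5 with derivatives 1, 5, 20, 60.
f = [1, 2, 2, 0]
g = [1, 3, 6, 6]
print([product_derivative(f, g, k) for k in range(4)])
print(terms_up_to_order(5))
```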

2605.06094 2026-05-25 cs.CV cs.AI

VISD: Enhancing Video Reasoning via Structured Self-Distillation

VISD: 通过结构化自蒸馏增强视频推理

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

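AI Summary: This paper proposes VISD, a structured self-distillation framework for video reasoning that targets the inefficient learning caused by sparse sequence-level rewards and missing fine-grained credit assignment in VideoLLMs. A video-aware judge model decomposes reasoning quality into answer correctness, logical consistency, and spatio-temporal grounding, and this structured feedback guides a teacher policy for token-level supervision. A direction-magnitude decoupling mechanism stably integrates the dense supervision with reinforcement learning. Experiments show that VISD consistently outperforms strong baselines and converges nearly 2x faster in optimization steps.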

AI中文摘要

训练视频大语言模型进行复杂推理仍然具有挑战性，原因在于稀疏的序列级奖励以及缺乏对长时间、时间上接地推理轨迹的细粒度信用分配。虽然具有可验证奖励的强化学习提供了可靠的监督，但它无法捕捉令牌级贡献，导致学习效率低下。相反，现有的自蒸馏方法提供密集监督，但缺乏结构和诊断特异性，并且通常与强化学习交互不稳定。在这项工作中，我们提出了VISD，一个结构化自蒸馏框架，为视频推理引入诊断上有意义的特权信息。VISD采用视频感知判断模型，将推理质量分解为多个维度，包括答案正确性、逻辑一致性和时空接地性，并使用这种结构化反馈指导教师策略进行令牌级监督。为了将密集监督与强化学习稳定集成，我们引入了方向-幅度解耦机制，其中由奖励计算的展开级优势决定更新方向，而结构化特权信号调节令牌级更新幅度。这种设计实现了语义对齐和细粒度的信用分配，提高了推理忠实度和训练效率。此外，VISD结合了课程调度和基于指数移动平均的教师稳定化，以支持长视频序列上的鲁棒优化。在多个基准上的实验表明，VISD始终优于强基线，提高了答案准确性和时空接地质量。值得注意的是，VISD在优化步骤中实现了近2倍的收敛速度，突出了结构化自监督在提高视频大语言模型性能和样本效率方面的有效性。

英文摘要

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
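
The direction-magnitude decoupling idea can be sketched as follows: the rollout-level advantage fixes the sign of every token's update, and non-negative per-token signals only rescale the size. The normalization to mean 1 and the function names are assumptions made for illustration; the EMA teacher update uses the standard exponential-moving-average form.

```python
def token_advantages(rollout_advantage, token_signals):
    """Per-token advantages: sign from the rollout, magnitude scaled by token signals.

    token_signals must be non-negative; they are normalized to mean 1 (an assumed choice).
    """
    if any(s < 0 for s in token_signals):
        raise ValueError("token signals must be non-negative")
    mean = sum(token_signals) / len(token_signals)
    if mean == 0:
        return [rollout_advantage] * len(token_signals)  # fall back to uniform credit
    return [rollout_advantage * s / mean for s in token_signals]

def ema_update(teacher, student, decay=0.99):
    """Move teacher parameters toward the student with an exponential moving average."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]

# A negative rollout advantage stays negative for every token; only sizes differ.
print(token_advantages(-2.0, [1.0, 3.0]))
print(ema_update([1.0], [0.0], decay=0.5))
```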

2605.06088 2026-05-25 cs.CV

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

OpenGaFF: 基于码本注意力的开放词汇高斯特征场

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

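AI Summary: This paper proposes OpenGaFF, a framework for open-vocabulary 3D scene understanding built on 3D Gaussian Splatting. A Gaussian Feature Field models semantics as a continuous function of Gaussian geometry and appearance, which strengthens the coupling between geometry and semantics and improves spatial coherence in 3D. A structured codebook and a codebook-guided attention mechanism enable robust open-vocabulary reasoning while reducing intra-object feature variance. Experiments on standard 2D and 3D open-vocabulary benchmarks show consistent improvements over prior approaches in segmentation quality and 3D semantic consistency.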

AI中文摘要

理解基于高斯表示的开放词汇3D场景仍然具有挑战性，因为多视角观测下的语义预测碎片化且空间不一致。在本文中，我们提出了OpenGaFF，一个基于3D高斯泼溅构建的开放词汇3D场景理解新框架。我们方法的核心是一个高斯特征场，它将语义建模为高斯几何和外观的连续函数。通过显式地将语义预测条件于几何结构，该公式加强了几何与语义之间的耦合，从而在3D空间中相似结构上实现了更好的空间一致性。为了进一步强制执行对象级语义一致性，我们引入了一个结构化码本，作为一组共享的语义基元。此外，提出了一种码本引导的注意力机制，通过查询嵌入与学习到的码本条目之间的相似性匹配来检索语言特征，从而实现鲁棒的开放词汇推理，同时减少对象内特征方差。在标准2D和3D开放词汇基准上的大量实验表明，我们的方法持续优于先前的方法，实现了改进的分割质量、更强的3D语义一致性以及一个语义可解释的码本，为学习到的表示提供了洞察。

英文摘要

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
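
Retrieving features by similarity matching against codebook entries can be pictured as softmax attention over the codebook. The sketch below assumes cosine similarity, a temperature of 0.1, and separate key and value vectors per entry; all of these are illustrative choices, not details confirmed by the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def codebook_attention(query, keys, values, temperature=0.1):
    """Softmax over query-to-entry similarities, then a weighted sum of entry values."""
    sims = [cosine(query, k) / temperature for k in keys]
    m = max(sims)
    weights = [math.exp(s - m) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# A query aligned with the first entry retrieves (almost exactly) that entry's value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = codebook_attention([1.0, 0.0], keys, values)
print(output, weights)
```

Points of the same object whose queries match the same entry receive nearly identical features, which is one way a shared codebook can reduce intra-object variance.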

2605.05997 2026-05-25 cs.CV

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

4DThinker: 用4D图像进行动态空间理解的思考

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

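AI Summary: This paper proposes 4DThinker, a framework that lets vision-language models (VLMs) reason about 4D dynamic scenes through dynamic latent mental imagery. It introduces an annotation-free data generation pipeline and Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises text tokens and 4D latents to ground the model in dynamic visual semantics. 4D Reinforcement Learning (4DRL) with outcome-based rewards then addresses complex reasoning tasks. Experiments on multiple dynamic spatial reasoning benchmarks show that 4DThinker consistently outperforms strong baselines.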

Comments 21 pages, 16 figures

AI中文摘要

从单目视频中进行动态空间推理对于连接视觉智能与物理世界至关重要，但对视觉语言模型（VLM）仍然具有挑战性。先前的方法要么将时空推理完全表述为文本，这对于复杂动态来说本质上是冗长且不精确的，要么依赖外部几何模块，这增加了推理复杂性而不培养内在模型能力。在本文中，我们提出了4DThinker，这是第一个使VLM能够通过动态潜在心理图像（即在连续隐藏空间内模拟场景如何演化）进行“4D思考”的框架。具体来说，我们首先引入了一个可扩展的、无需标注的数据生成流程，从原始视频中合成4D推理数据。然后我们提出了动态图像微调（DIFT），它联合监督文本令牌和4D潜在变量，将模型锚定在动态视觉语义中。在此基础上，4D强化学习（4DRL）通过基于结果的奖励进一步处理复杂推理任务，将策略梯度限制在文本令牌上以确保稳定优化。在多个动态空间推理基准上的大量实验表明，4DThinker始终优于强基线，并为VLM中的4D推理提供了新视角。我们的代码可在https://github.com/zhangquanchen/4DThinker获取。

英文摘要

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
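
Restricting policy gradients to text tokens is commonly implemented by masking per-token loss terms, since a term excluded from the loss contributes no gradient. The sketch below shows only that masking step on plain numbers; the loss values and the mask are made-up, and how 4DRL computes its per-token terms is not described here.

```python
def masked_mean_loss(token_losses, is_text_token):
    """Average per-token losses over text tokens only; latent positions are excluded."""
    selected = [loss for loss, keep in zip(token_losses, is_text_token) if keep]
    return sum(selected) / len(selected) if selected else 0.0

# Positions 0 and 2 are text tokens; position 1 is a latent and is ignored.
print(masked_mean_loss([1.0, 5.0, 3.0], [True, False, True]))
```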

2605.04568 2026-05-25 cs.LG cs.AI cs.RO

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Dream-MPC：基于梯度与潜在想象的模型预测控制

Jonathan Spieler, Sven Behnke

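AI Summary: This paper proposes Dream-MPC, a model predictive control method that combines gradient ascent with a learned world model. It generates a few candidate trajectories from a rolled-out policy and optimizes each one using uncertainty regularization and by reusing previously optimized actions to amortize optimization iterations over time. On 24 continuous control tasks, Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines.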

Comments Accepted for International Conference on Machine Learning (ICML) 2026

AI中文摘要

最先进的基于模型的强化学习方法要么使用无梯度、基于种群的规划方法，要么使用学习到的策略网络，或者结合策略网络和规划。将模型预测控制（MPC）与学习到的模型和策略先验相结合的混合方法，以利用两种范式的优势，已显示出有希望的结果。然而，这些方法通常依赖于无梯度优化方法，对于高维控制任务可能计算成本高昂。虽然基于梯度的方法是一个有前途的替代方案，但最近的工作经验表明，基于梯度的方法通常比无梯度方法表现更差。我们提出了Dream-MPC，一种新颖的方法，从展开的策略生成少量候选轨迹，并通过使用学习的世界模型、不确定性正则化和通过重用先前优化的动作随时间摊销优化迭代，对每个轨迹进行梯度上升优化。我们在24个连续控制任务上的结果表明，Dream-MPC可以显著提高底层策略的性能，并且可以优于无梯度MPC和最先进的基线。代码和视频可在https://dream-mpc.github.io获取。

英文摘要

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. Code and videos are available at https://dream-mpc.github.io.
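
The core loop of gradient-based trajectory optimization can be illustrated on a toy problem. In the sketch below, a hand-written 1-D dynamics function stands in for the learned world model, gradients are estimated by finite differences to avoid any autodiff dependency, and `warm_start` shifts the previous plan forward by one step. All of these are simplifying assumptions, and uncertainty regularization is omitted.

```python
def rollout_return(x0, actions, goal=1.0):
    """Total reward of an action sequence under toy dynamics x <- x + a."""
    x, total = x0, 0.0
    for a in actions:
        x = x + a  # toy stand-in for a learned world model
        total += -(x - goal) ** 2 - 0.1 * a ** 2
    return total

def gradient_ascent(x0, actions, steps=50, lr=0.05, eps=1e-5):
    """Improve an action sequence by ascending a finite-difference gradient of the return."""
    actions = list(actions)
    for _ in range(steps):
        grads = []
        for i in range(len(actions)):
            up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
            down = actions[:i] + [actions[i] - eps] + actions[i + 1:]
            grads.append((rollout_return(x0, up) - rollout_return(x0, down)) / (2 * eps))
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return actions

def warm_start(prev_actions):
    """Reuse the previous plan at the next time step: drop the first action, repeat the last."""
    return prev_actions[1:] + [prev_actions[-1]]

initial = [0.0, 0.0, 0.0]
optimized = gradient_ascent(0.0, initial)
print(rollout_return(0.0, initial), rollout_return(0.0, optimized))
print(warm_start([1, 2, 3]))
```

Starting each new optimization from the shifted previous plan is one simple way to spread optimization iterations over time instead of re-solving from scratch at every step.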

2605.02087 2026-05-25 cs.AI

Model Spec Midtraining: Improving How Alignment Training Generalizes

模型规范中期训练：改进对齐训练的泛化能力

Chloe Li, Nevan Wichers, Sara Price, Samuel Marks, Jon Kutasov

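AI Summary: Some frontier AI developers aim to align language models to a Model Spec or Constitution describing intended behavior, but standard alignment fine-tuning on demonstrations can produce shallow alignment that generalizes poorly. This paper introduces model spec midtraining (MSM): after pre-training and before alignment fine-tuning, the model is trained on synthetic documents discussing its Model Spec, which shapes how it generalizes from subsequent demonstration data. Experiments show that MSM can shape complex safety-relevant propensities, reducing the agentic misalignment rate of Qwen3-32B from 54% to 7%, and that explaining the values behind rules and giving specific rather than general guidance both improve generalization.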

AI中文摘要

一些前沿AI开发者旨在将语言模型对齐到描述预期模型行为的模型规范或宪法。然而，标准的对齐微调——在规范对齐行为的演示数据上训练——可能产生泛化能力差的浅层对齐，部分原因是演示数据可能未充分指定所需的泛化。我们引入了模型规范中期训练（MSM）：在预训练之后、对齐微调之前，我们在讨论其模型规范的合成文档上训练模型。这教会模型规范的内容，从而塑造它们从后续演示数据中泛化的方式。例如，一个仅微调为表达特定奶酪偏好（如“我更喜欢奶油奶酪而不是布里干酪”）的模型，当我们应用MSM并附加一个将这些偏好归因于亲美价值观的规范时，会泛化为广泛的亲美价值观。相反，一个关于亲可负担性价值观的规范则从完全相同的奶酪微调中产生亲可负担性的泛化。MSM还可以塑造复杂的与安全相关的倾向：应用MSM并附加一个涉及自我保护和目标守卫的规范，可显著降低代理失调率（Qwen3-32B：从54%降至7%），超过了深思熟虑的对齐基线（14%）。我们进一步将MSM作为工具研究哪些模型规范能产生最强的对齐泛化，发现解释规则背后的价值观能改善泛化，提供具体而非一般的指导也是如此。总体而言，MSM是一种简单有效的技术，通过首先教授预期的泛化，来控制和改进模型从对齐训练中泛化的方式。

英文摘要

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences (e.g., "I prefer cream cheese over brie") generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training, by first teaching the intended generalization.

2605.01018 2026-05-25 cs.CV

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

WildTableBench：在真实场景中评估多模态基础模型的表格理解能力

Junzhe Huang, Xiaoxiao Sun, Yan Yang, Yuxuan Hou, Ruotian Zhang, Sirui Li, Hehe Fan, Serena Yeung-Levy, Xin Yu

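AI Summary: WildTableBench is a benchmark for evaluating how well multimodal foundation models understand table images from real-world settings. It contains 402 table images collected across diverse domains and 928 manually annotated and verified questions that test structural perception and numerical reasoning. Of the 21 models evaluated, only one exceeds 50% accuracy, revealing persistent weaknesses in handling complex table images.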

AI中文摘要

使用多模态基础模型分析表格图像是消费和企业场景中高价值但具有挑战性的应用。尽管其重要性，当前评估主要依赖于结构化文本表格或干净渲染的图像，忽视了真实世界表格图像的视觉复杂性。这些图像具有多样的布局和领域，需要复杂的结构感知和数值推理。为弥补这一差距，我们引入了WildTableBench，这是第一个针对真实世界设置中自然出现的表格图像的问答基准。WildTableBench包含从跨领域在线论坛和网站收集的402张高信息密度表格图像，以及928个手动标注和验证的问题，涵盖五个类别的17个子类型。我们在此基准上评估了21个前沿专有和开源多模态基础模型。仅有一个模型准确率超过50%，其余模型准确率在4.1%至49.9%之间。我们进一步进行诊断分析以表征模型失败，并揭示结构感知和推理方面的持续弱点。这些结果和分析为当前模型能力提供了有用的见解，并将WildTableBench建立为表格图像理解的有价值的诊断基准。数据集：https://huggingface.co/datasets/jzhuang/WildTableBench 代码：https://github.com/hjzhe/WildTableBench 排行榜：https://hjzhe.github.io/WildTableBench

英文摘要

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored. Such images feature varied layouts and diverse domains that demand sophisticated structural perception and numerical reasoning. To bridge this gap, we introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories. We evaluate 21 frontier proprietary and open-source multimodal foundation models on this benchmark. Only one model exceeds 50% accuracy, while all remaining models range from 4.1% to 49.9%. We further conduct diagnostic analyses to characterize model failures and reveal persistent weaknesses in structural perception and reasoning. These results and analyses provide useful insights into current model capabilities and establish WildTableBench as a valuable diagnostic benchmark for table image understanding. Dataset: https://huggingface.co/datasets/jzhuang/WildTableBench Code: https://github.com/hjzhe/WildTableBench Leaderboard: https://hjzhe.github.io/WildTableBench

2604.28048 2026-05-25 cs.CL cs.SI

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

稳定行为，有限变化：城市情感感知中LLM智能体的角色有效性

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

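AI summary: This study examines how different persona settings affect the behavioral consistency and variation of multimodal large language models (LLMs) in an urban sentiment perception task. Using persona variables spanning gender, economic status, political orientation, and personality, it finds that agents sharing a persona behave highly consistently, while differences across personas are limited: only economic status and personality produce detectable but practically small changes. The study also notes that the models perform worse on fine-grained sentiment judgments, and that the model without any persona sometimes performs as well or even better, suggesting that simple label-based persona prompting may add limited annotation value for perceptual judgments.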

Comments 8 pages, 8 figures. IEEE DCOSS - UrbCom

Journal ref: IEEE DCOSS 2026

AI Chinese abstract

大型语言模型（LLM）越来越多地被用作城市分析中人类感知的代理，但尚不清楚角色提示是否会产生有意义且可重复的行为多样性。我们研究了不同角色是否影响多模态LLM生成的城市情感判断。使用涵盖性别、经济状况、政治取向和人格的角色因子集，我们为每个角色实例化多个智能体，以评估来自PerceptSent数据集的城市场景图像，并评估角色内一致性和角色间变化。结果显示，共享角色的智能体之间存在强收敛性，表明行为稳定且可重复。然而，角色间分化有限：经济状况和人格引起统计上可检测但实际变化不大的影响，而性别没有可测量的效果，政治取向的影响可忽略不计。智能体还表现出极端偏差，压缩了人类注释中常见的中间情感类别。因此，在粗粒度极性任务上表现强劲，但随着情感分辨率的提高而下降，表明简单的基于标签的角色提示无法捕捉细粒度的感知判断。为了隔离角色条件的作用，我们还评估了没有角色的相同模型。令人惊讶的是，无角色模型在所有任务变体上与人类标签的一致性有时达到或超过有角色条件，表明在这种设置下，简单的基于标签的角色提示可能增加有限的注释价值。

English abstract

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

2604.27468 2026-05-25 cs.CL

Syntactically-guided Information Maintenance in Sentence Comprehension

句子理解中的句法引导信息维护

Shinnosuke Isono, Kohei Kajikawa

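AI summary: This study examines how, during sentence comprehension, readers selectively maintain the information that is crucial for upcoming predictions, guided by syntactic structure. It proposes that maintenance cost is affected by both the number of predicted heads and the number of incomplete dependencies, and uses naturalistic reading time data to show that the two factors make distinct contributions to maintenance cost in Japanese. The study also finds that readers who slow down more for maintenance tend to benefit more from predictability, which supports the role of syntactic structure in comprehension. The same patterns are not evident in English, which suggests that languages may differ in syntactically guided information maintenance.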

AI Chinese abstract

在成功的实时语言理解中，在上下文中维护信息至关重要，但维护在认知上代价高昂且可能减慢处理速度。我们假设理性语言使用者会选择性维护对未来预测至关重要的信息，并由句法结构引导。根据这一观点，两个因素影响维护成本：预测头的数量和未完成依赖的数量。尽管这些因素在文献中被视为竞争性假设，但我们的解释预测它们不可相互约简。我们在日语的自然阅读时间数据中证明了这一点，日语中这两个因素对比尤为清晰。我们进一步表明存在一种权衡，即因维护而减慢速度的读者往往从可预测性中获益更多，这为所提出的解释提供了额外支持。然而，这些模式在英语中并不明显，我们强调了一些有待解决的问题，以理解句法在各种语言记忆高效处理中的贡献。

English abstract

Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case in a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account. These patterns are not evident in English, however, and we highlight some issues to be resolved to understand the contribution of syntax in memory-efficient processing of various languages.

2604.27247 2026-05-25 cs.CV

Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany

面向地球观测数据中树篱与线性木本特征的可泛化映射：德国国家产品

Thorsten Hoeser, Verena Huber-Garcia, Sarah Asam, Ursula Gessner, Claudia Kuenzer

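AI summary: This paper aims to produce generalizable, national-scale maps of hedges and linear woody features from Earth observation data, in support of ecological management and conservation. It proposes a modular workflow with two components: a flexible data interface that produces a binary woody vegetation mask, and a deep neural network that separates linear from non-linear structures within that mask. Applied across all of Germany to three input sources with different spatial resolutions, the method produces linear woody feature maps without retraining the model and yields competitive results across the evaluation sites.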

Comments 33 pages, 17 figures

AI Chinese abstract

树篱和其他线性木本特征在集约化管理的农业景观中提供宝贵的生态系统服务。它们是气候适应和生物多样性的关键要素，不仅因为其高度变化的植物区系，还作为许多动物和昆虫（包括有价值的传粉者）的觅食、休息和筑巢场所。因此，它们需要专门的管理、保护和关注。从地球观测数据中对这些特征进行系统化和大规模制图具有重要意义。然而，考虑到传感器类型、空间分辨率、数据采集条件以及研究区域复杂的景观变异性，可转移和可复用的线性木本特征制图工作流仍然是一个关键的方法论挑战。我们引入了一个模块化工作流，围绕两个独立可优化的组件构建。首先，一个灵活的输入数据接口，将异构的地球观测数据整合为二值木本植被掩膜；其次，一个深度神经网络，训练用于区分这些掩膜中的线性形状和非线性形状。我们通过使用单个训练模型（无需重新训练）从三个输入源（空间分辨率分别为0.73米、1米和3米）推导出覆盖整个德国的三个全国尺度线性木本特征图来演示该工作流。与来自四个联邦州生物群落制图活动的精细参考数据进行的评估，以及与两个现有线性木本特征图的比较表明，该工作流在全国所有评估站点均产生具有竞争力的结果。其模块化设计及其在全国尺度上的适用性为超越德国的可扩展和可泛化线性木本特征制图提供了基础。

English abstract

Hedges and other linear woody features provide valuable ecosystem services, particularly within intensively managed agricultural landscapes. They are key elements for climate adaptation and biodiversity amongst others not only due to a largely varying flora, but also as a feeding-, resting-, and nesting place for many animals and insects including valuable pollinators. Therefore, they require dedicated management, preservation, and attention. Thus, systematic and large-scale mapping of these features from Earth observation data is of high importance. However, transferable and reusable workflows for linear woody feature mapping remain a key methodological challenge, given the diversity of sensor types, spatial resolutions, data acquisition conditions, and complex landscape variability encountered across study areas. We introduce a modular workflow built around two independently optimizable components. Firstly, a flexible input data interface that consolidates heterogeneous Earth observation data into a binary woody vegetation mask, and secondly, a deep neural network trained to separate linear from non-linear shapes within these masks. We demonstrate the workflow by deriving three national-scale linear woody feature maps for all of Germany from three input sources with 0.73 m, 1 m and 3 m spatial resolution, respectively, by using a single trained model without retraining. Evaluation against refined reference data from four federal state biotope mapping campaigns and comparison with two existing linear woody feature maps demonstrate that the workflow produces competitive results across all evaluation sites on a national level. The modular design and its demonstrated applicability at national scale provide a foundation for scalable and generalizable linear woody feature mapping beyond Germany.

2604.24810 2026-05-25 cs.LG cs.AI

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

自适应深度神经网络中上置信界算法的性能比较分析

Grigorios Papanikolaou, Ioannis Kontopoulos, Konstantinos Tserpes

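AI summary: Deploying deep neural networks in edge computing environments is challenging because of strict constraints on energy consumption and latency. Building on Adaptive Deep Neural Networks (ADNNs), this paper introduces four additional Upper Confidence Bound (UCB) strategies, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and presents the first systematic comparison of their trade-offs between accuracy, energy consumption, and latency. Experiments show that UCB-Bayes converges fastest, while UCB-V and UCB-Tuned dominate the Pareto frontiers of the accuracy-latency and accuracy-energy trade-offs.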

Comments The paper has been accepted for publication in IEEE SMARTCOMP 2026

AI Chinese abstract

边缘计算环境对能耗和延迟施加了严格限制，使得深度神经网络的部署面临重大挑战。因此，在边缘计算场景中，能够动态平衡计算成本或延迟与预测准确性的智能自适应推理策略至关重要。在这项工作中，我们基于采用多臂老虎机（MAB）框架的自适应深度神经网络（ADNN）。现有文献利用第一版上置信界（UCB1）策略动态选择最优置信阈值，从而在不牺牲准确率的情况下实现高效早期退出。然而，我们在ADNN中引入了四种额外的上置信界策略，即UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK，并首次对这些策略在准确率、能耗和延迟之间的权衡进行了比较研究。所提出的UCB策略应用于ResNet和MobileViT神经网络，并在CIFAR-10、CIFAR-10.1和CIFAR-100基准数据集上进行评估。实验结果表明，所有策略均实现了次线性累积遗憾，其中UCB-Bayes收敛最快，其次是UCB-Tuned和UCB-V。最后，UCB-V和UCB-Tuned在准确率-延迟和准确率-能耗权衡的帕累托前沿上占据主导地位。实现代码可在此处获取：https://github.com/gr3gor1/MAB_UCB

English abstract

Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a significant challenge. Therefore, smart and adaptive inference strategies that dynamically balance computational cost or latency with predictive accuracy are critical in edge computing scenarios. In this work, we build on Adaptive Deep Neural Networks (ADNNs) that employ the Multi-Armed Bandit (MAB) framework. Current literature leverages the first version of the Upper Confidence Bound (UCB1) strategy to dynamically select the optimal confidence threshold, enabling efficient early exits without sacrificing accuracy. However, we introduce four additional Upper Confidence Bound strategies in ADNNs, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and perform, for the first time, a comparative study of these strategies with respect to trade-offs between accuracy, energy consumption, and latency. The proposed UCB strategies are employed on the ResNet and MobileViT neural networks, and are evaluated on the benchmark datasets of CIFAR-10, CIFAR-10.1, and CIFAR-100. Experimental results demonstrate that all strategies achieve sub-linear cumulative regret, with UCB-Bayes converging the fastest, followed by UCB-Tuned and UCB-V. Finally, UCB-V and UCB-Tuned dominate the Pareto Frontiers of accuracy-latency and accuracy-energy trade-offs. The implementation code is available here: https://github.com/gr3gor1/MAB_UCB
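
As a rough illustration of the UCB1 baseline mentioned in the abstract, the sketch below runs the standard UCB1 index rule (empirical mean plus sqrt(2 ln t / n)) over a set of candidate confidence thresholds. It is not the authors' implementation, which is in the linked repository. The threshold values, the `toy_reward` function, and all function names are hypothetical, and the reward is a simulated Bernoulli signal rather than a measured accuracy, energy, or latency trade-off.

```python
import math
import random


def ucb1_select(counts, means, t):
    """Return the arm index with the highest UCB1 score at round t (1-based)."""
    # Play every arm once before relying on the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # UCB1 index: empirical mean plus an exploration bonus that shrinks
    # as an arm is pulled more often.
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(thresholds, reward_fn, rounds, seed=0):
    """Run UCB1 over candidate thresholds; return pull counts and mean rewards."""
    rng = random.Random(seed)
    counts = [0] * len(thresholds)
    means = [0.0] * len(thresholds)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(thresholds[arm], rng)
        counts[arm] += 1
        # Incremental update of the empirical mean reward for this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


def toy_reward(threshold, rng):
    """Hypothetical Bernoulli reward whose success probability peaks at 0.7."""
    p = 1.0 - abs(threshold - 0.7)
    return 1.0 if rng.random() < p else 0.0


if __name__ == "__main__":
    # Hypothetical candidate early-exit confidence thresholds.
    candidates = [0.3, 0.5, 0.7, 0.9]
    counts, means = run_ucb1(candidates, toy_reward, rounds=500)
    for thr, n, m in zip(candidates, counts, means):
        print(f"threshold={thr:.1f} pulls={n} mean_reward={m:.3f}")
```

In an early-exit setting the reward would presumably combine prediction correctness with the cost saved by exiting early. The variants compared in the paper (UCB-V, UCB-Tuned, UCB-Bayes, UCB-BwK) would replace the exploration bonus in `ucb1_select` and are not shown here.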
