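arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.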

2606.19140 2026-06-18 cs.LG New submission

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

Hugo Miccinilli, Theo Di Piazza

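Affiliations: Université Paris-Saclay, CentraleSupélec, MICS, France; University of Lyon, INSA Lyon, CREATIS, France

AI summary: Proposes ChronoSurv, a directed-graph framework for multimodal survival analysis that models clinical trajectories through a hierarchical topology and heterogeneous message passing, achieving the best discriminative performance with reliable calibration on head and neck cancer datasets.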

Comments Accepted at MICCAI 2026. Submitted version due to embargo

Abstract

Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data. While deep survival models have improved predictive performance over classical statistical approaches, existing methods typically rely on static fusion strategies or temporally agnostic modeling, limiting their ability to capture structured clinical workflows. In this work, we propose ChronoSurv, a heterogeneous hierarchical directed graph framework for multimodal survival analysis. ChronoSurv represents patient care as a progression-aware clinical trajectory using directed graphs aligned with key diagnostic steps. A hierarchical topology incorporates fine-grained, coarse, and global representations, further supporting flexible adaptation to missing modalities, while heterogeneous message passing models complex and asymmetric relationships across modalities and clinical steps. Experimental results on two public datasets demonstrate that ChronoSurv achieves state-of-the-art discriminative performance while maintaining statistically reliable calibration. Comprehensive ablation studies further confirm the contribution of each architectural component, highlighting the potential of trajectory-aware graph modeling for multimodal survival prediction.
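
Discriminative performance in survival analysis is commonly summarized with a concordance index (C-index). The abstract does not name the metric ChronoSurv reports, so the following is only a minimal sketch of Harrell's C-index under that assumption; the function name and the toy inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def harrell_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
) -> float:
    """Harrell's concordance index for right-censored data.

    times  : observed time (event or censoring) per subject
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking.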

2606.19139 2026-06-18 cs.CV cs.CL New submission

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

Ramza Basharat, Muhammad Usman Ali

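Affiliations: Department of Computer Science, University of Gujrat

AI summary: To address the scarcity of datasets for Urdu handwritten text recognition, the paper presents UKHD, the first offline Urdu handwritten text-line dataset written by Katibs in historical times, and evaluates several CRNN-based hybrid models; CNN-BGRU-CTC performs best on character error rate and word error rate.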

Abstract

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.
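
The CER and WER mentioned above are usually computed as an edit distance normalized by the reference length. The sketch below shows that standard definition; it is not the authors' evaluation code, and the function names are hypothetical. It compares Unicode code points, which may or may not match how the paper tokenizes Urdu characters.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is why they are reported as error rates rather than accuracies.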

2606.19122 2026-06-18 cs.RO New submission

Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini, Yong Liu, Bolei Zhou

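Affiliations: University of California, Los Angeles; Zhejiang University; Coco Robotics; Massachusetts Institute of Technology

AI summary: Proposes WalkOCC, a hybrid ray-marching monocular 3D occupancy perception framework that combines paired LiDAR-RGB data with learning from large-scale unpaired monocular images to improve prediction accuracy and generalization for sidewalk robot navigation.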

Abstract

Sidewalks in the real world are crowded, cluttered, and less structured than roads, making 3D occupancy prediction a key ingredient for the safe navigation of mobile robots such as delivery bots and electric wheelchairs. Existing occupancy learning pipelines are largely designed for on-road autonomous driving and often train on large-scale paired LiDAR-RGB datasets with dense 3D supervision and multiple camera inputs, which are costly to collect and do not adequately capture sidewalk-specific characteristics. We propose WalkOCC, a hybrid Ray-marching monocular 3D occupancy perception framework for robots operating on sidewalks. WalkOCC explicitly couples geometric grounding from LiDAR-RGB paired data with scalable learning from large-scale unpaired monocular images. It bootstraps pseudo occupancy supervision from paired sequences and jointly learns image-level representations on additional 2D-only data. It yields stable optimization and improved generalization without requiring costly 3D occupancy annotations. Extensive experiments demonstrate consistent gains in prediction accuracy, fine-grained segmentation of subtle urban structures such as curbs and gutters, and robustness to environmental and cross-embodiment shifts compared with self-supervised image-based baselines. To facilitate evaluation and benchmarking, we also introduce Sidewalk3D, a large-scale sidewalk perception dataset with LiDAR-camera paired sequences collected across multiple locations and time periods, along with 3D semantic occupancy annotations for evaluation. Code and data will be made available.

2606.19120 2026-06-18 cs.LG cs.CV New submission

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes ViGOS, a framework that decouples perception from reasoning so that MLLM post-training avoids text shortcuts and relies more on the image.

Comments 29 pages, 5 figures, 8 tables

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.19118 2026-06-18 cs.AI cs.LG econ.GN q-fin.EC New submission

Analysing drivers and interdependencies in European electricity markets using XAI

Antoine Pesenti, Aidan O'Sullivan

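Affiliations: UCL Energy Institute, University College London, UK

AI summary: Combines deep neural networks with explainable AI (XAI) techniques, using the SHAP and SSHAP frameworks to analyse the determinants of electricity prices across 39 European bidding zones; finds that renewables, especially solar, play an important role in price formation, that gas prices remain the dominant driver, and that interconnections significantly shape price dynamics.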

Comments 12 pages

Abstract

Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions, and we apply and extend SSHAP, an aggregation framework, to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.
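
SHAP attributions are based on Shapley values. As a rough illustration of what "quantifying feature contributions" means, the sketch below computes exact Shapley values by enumerating feature subsets for a tiny toy model. It does not use the SHAP library and does not implement SSHAP; the toy price model, its coefficients, and the feature names (gas, solar, wind) are hypothetical and only loosely echo the abstract.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are set to their baseline value.
    Cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)

    def value(coalition: set) -> float:
        mixed = [x[k] if k in coalition else baseline[k] for k in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_price_model(features: Sequence[float]) -> float:
    """Hypothetical price model with one gas-solar interaction term."""
    gas, solar, wind = features
    return 2.0 * gas - 1.5 * solar - 0.5 * wind + 0.1 * gas * solar


phis = shapley_values(toy_price_model, x=[30.0, 10.0, 5.0], baseline=[0.0, 0.0, 0.0])
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split equally between the two features involved.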

2606.19116 2026-06-18 cs.AI cs.CY New submission

Towards an Agent-First Web: Redesigning the Web for AI Agents

Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

Affiliations: Old Dominion University; AI Motion Labs; Florida International University; Accenture Technology Labs; Nanyang Technological University; University of Colombo; Center for Wireless Communications, University of Oulu; McDonald Army Health Center

AI summary: Proposes redesign principles across three layers to address the web's access, economic, and content problems when AI agents act as intermediaries: an access layer (agents inherit human access rights), an economic layer (an intent-based, token-metered subscription model), and a content layer (the ATML markup language and a cryptographic provenance chain).

Abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

2606.19111 2026-06-18 cs.CL cs.AI cs.MA New submission

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

Haewoon Kwak

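Affiliations: Indiana University Bloomington

AI summary: Studies when process-level coordination control adds value in multi-agent LLM teams. Behavioral signatures and ablations show that a controller's advantage appears only when the initial majority vote is unreliable, the task is recoverable, and undirected interaction cannot repair it, consistent with contingency theory.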

Comments 33 pages

Abstract

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

2606.19108 2026-06-18 cs.LG New submission

JourneyFormer: Encoding Airbnb Guest Journey with Sequence Modeling

Daochen Zha, Chun How Tan, Xin Liu, Bin Xu, Han Zhao, Xiaowei Liu, Tracy Yu, Hui Gao, Huiji Gao, Liwei He, Stephanie Moyerman, Sanjeev Katariya

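Affiliations: Airbnb

AI summary: To handle Airbnb guest sequences that are long, exploratory, and sparsely labeled, proposes JourneyFormer, a sequence modeling solution that tunes data selection, ID embeddings, model architecture, and label attribution, with effectiveness validated through online A/B tests on two production surfaces.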

Comments Accepted by KDD 2026

Abstract

Sequence modeling has become increasingly popular in recommendation and ranking algorithms, owing to its capacity to model users' historical behaviors and infer user intentions. Despite its theoretical simplicity, the practical deployment of a sequence model in production is non-trivial due to the complexity of the sequences and the sparsity of the labels. For example, at Airbnb, guest sequences are often long, exploratory, and complex, and we focus on booking labels, which are sparse. As such, we are often required to make various design decisions regarding data and modeling to strike a balance between effectiveness and scalability. This work delves into these production challenges and deploys JourneyFormer, a sequence modeling solution for search ranking at Airbnb. We detail crucial design considerations, covering aspects such as guest event selection, ID embeddings, model architecture, and label attribution. Additionally, we describe several tailored strategies to accelerate model training and inference. JourneyFormer has been successfully deployed in production at Airbnb, where its effectiveness and impact have been evidenced not only by improved offline ranking metrics but also by significant gains in key business metrics through online A/B testing across two production surfaces.

2606.19105 2026-06-18 cs.LG stat.ML New submission

Smoothness-Based Derandomization of PAC-Bayes Bounds

Alexandre Lemire Paquin, Brahim Chaib-Draa, Philippe Giguère

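Affiliations: Department of Computer Science and Software Engineering; Université Laval

AI summary: Uses smoothness of the loss and the predictor to derandomize the Gibbs predictor into the deterministic predictor at the posterior mean, controls the generalization bound through the Rademacher complexity of the Jensen gap class, and derives a regularizer involving parameter Jacobians and Hessians.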

Abstract

We study PAC-Bayes derandomization for smooth loss functions. Our goal is to obtain generalization bounds that hold with high probability for deterministic predictors by exploiting smoothness properties of both the loss and the predictor class. We show that passing from the Gibbs predictor to the deterministic predictor at the posterior mean has a precise cost, given by the generalization gap of the Jensen gap class. We control this class through its Rademacher complexity, leading to bounds for deterministic predictors that involve flatness quantities expressed in terms of parameter Jacobians and Hessians of the score map. The framework applies to both bounded and unbounded smooth loss functions, and we specialize the results to linear predictors and smooth neural networks. Finally, the Jacobian and Hessian quantities appearing in the theory motivate a practical regularizer. For BatchNorm networks, we compute this regularizer with respect to effective BatchNorm weights obtained by folding the BatchNorm transformation into the adjacent affine weights. Experiments on CIFAR-10 illustrate the behavior of this regularizer under different batch sizes.

2606.19103 2026-06-18 cs.CV cs.AI New submission

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

Mukund Khanna, Raj Singh Yadav, Kunal Singh

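Affiliations: Fractal Analytics

AI summary: To address weak preservation of product features in instruction-based image editing, proposes the ProductConsistency dataset and a cyclic consistency reward, combining supervised fine-tuning with reinforcement learning to significantly improve product consistency, text rendering, and visual quality.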

Comments CVPR HiGen 2026

Abstract

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics, as well as in MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md

2606.19100 2026-06-18 cs.CV New submission

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

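Affiliations: NOVA School of Science and Technology; NOVA LINCS

AI summary: To address the lack of open-source multimodal models for European Portuguese, proposes AMALIA-VL, which uses three-stage training and a Portuguese-centric data mix to establish a strong baseline, and open-sources all resources.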

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.19097 2026-06-18 cs.CV New submission

DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

Yanjie Tu, Qingsen Yan, Axi Niu, Tao Hu, Haokui Zhang, Jiantao Zhou

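Affiliations: School of Computer Science, Northwestern Polytechnical University; Shenzhen Research Institute of Northwestern Polytechnical University; State Key Laboratory of Internet of Things for Smart City, University of Macau

AI summary: Proposes DVANet, a deep unfolding network based on half-quadratic splitting that performs unified image restoration under complex degradations by jointly unfolding degradation-aware observation consistency and visual-prior-guided reconstruction, with strong results across diverse degradation scenarios and cross-domain tasks.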

Comments All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior

Abstract

All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.
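
For readers unfamiliar with half-quadratic splitting (HQS), the textbook formulation is sketched below. It is the generic scheme that deep unfolding networks typically start from, not DVANet's specific update rules, which the abstract does not give. The symbols are assumptions for illustration: $y$ is the degraded observation, $H$ the degradation operator, $\Phi$ a prior, $z$ an auxiliary variable, and $\lambda, \mu > 0$ are weights.

```latex
\begin{aligned}
\hat{x} &= \arg\min_{x}\; \tfrac{1}{2}\lVert y - Hx \rVert_2^2 + \lambda\,\Phi(x), \\
(\hat{x}, \hat{z}) &= \arg\min_{x,\,z}\; \tfrac{1}{2}\lVert y - Hx \rVert_2^2 + \lambda\,\Phi(z) + \tfrac{\mu}{2}\lVert x - z \rVert_2^2, \\
x^{(k+1)} &= \arg\min_{x}\; \tfrac{1}{2}\lVert y - Hx \rVert_2^2 + \tfrac{\mu}{2}\lVert x - z^{(k)} \rVert_2^2, \\
z^{(k+1)} &= \arg\min_{z}\; \tfrac{\mu}{2}\lVert z - x^{(k+1)} \rVert_2^2 + \lambda\,\Phi(z).
\end{aligned}
```

The first line is the original problem, the second its relaxed split form, and the last two are the alternating updates. The $x$-update enforces consistency with the observation and the $z$-update applies the prior, which loosely corresponds to the two branches named in the abstract.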

2606.19096 2026-06-18 cs.CV New submission

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

João Cardeira, Diogo Glória-Silva, Manuel Letras da Luz, Rafael Ferreira, Diogo Tavares, David Semedo, João Magalhães

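Affiliations: NOVA School of Science and Technology; NOVA LINCS

AI summary: Proposes PorTEXTO, the first benchmark for visual text extraction in modern European Portuguese, built by combining transcriptions from frontier LVLMs with review by native speakers; finds that performance drops markedly from synthetic to real samples and that multilingual data matters more than model scale.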

2606.19091 2026-06-18 cs.RO New submission

GCNGrasp-VP: Affordance-Guided View Planning for Efficient Task-Oriented Grasping

Zanjia Tong, Wenlong Dong, Chengjie Zhang, Hong Zhang

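Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision

AI summary: Proposes GCNGrasp-VP, a framework in which affordance field prediction guides active view planning without scene reconstruction, so that a single view adjustment significantly improves task-oriented grasp success under occlusion.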

Comments Accepted to IROS 2026

Abstract

Task-oriented grasping performance degrades significantly when object views suffer from occlusions. Existing task-oriented grasping methods typically assume task-relevant regions are visible in the initial frame, while view planning approaches enable active perception but often ignore task semantics and rely on time-consuming scene reconstruction. To address these limitations, we present GCNGrasp-VP, an efficient framework integrating affordance field prediction with active view planning. Central to this framework is GCNGrasp-v2, a task-oriented grasp model that simultaneously supports grasp evaluation and affordance field prediction, achieving constant-time inference complexity. Leveraging this capability, our Affordance-guided View Planner (Affordance-VP) utilizes the affordance field as an information gain metric to guide camera observation of task-relevant regions without requiring scene reconstruction. View planning results show that our method significantly outperforms scene-uncertainty-driven baselines with only one view adjustment. Real-world validation further confirms substantial improvements in grasp success rates for single-object scenarios while maintaining millisecond-level computational latency. Code and models are available at https://github.com/Instinct323/GCNGrasp-VP.

2606.19089 2026-06-18 cs.RO 新提交

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

ART-VS：用于视觉Transformer伺服的自适应分辨率分块

Alessandro Scherl, Bernhard Neuberger, Simon Schwaiger, David Mulero-Pérez, Lucas Muster, Jose Garcia-Rodriguez

发表机构 * Department of Computer Technology, University of Alicante（阿利坎特大学计算机技术系）； Department of Industrial Engineering, UAS Technikum Vienna（维也纳应用科技大学工业工程系）； Automation and Control Institute, TU Wien（维也纳技术大学自动化与控制研究所）； Institute of Software Engineering and Artificial Intelligence, Graz University of Technology（格拉茨技术大学软件工程与人工智能研究所）； Institute for Integrative Nature Conservation Research, University of Natural Resources and Life Sciences Vienna（维也纳自然资源与生命科学大学整合自然保护研究所）

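AI summary: Proposes ART-VS, a coarse-to-fine two-phase method that adapts feature granularity, improving visual servoing robustness and precision without task-specific training, substantially reducing positioning error and increasing speed.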

Comments Accepted at IROS2026

AI中文摘要

基于自监督视觉Transformer（ViT）特征的视觉伺服实现了无需训练的机器人定位，具有强泛化能力，但面临鲁棒性与精度之间的根本权衡。粗粒度的块级描述符提供稳定的对应关系，但限制了定位精度。提高图像分辨率可改善精度，但鲁棒性增益有限——在扰动下，高分辨率处理仅将收敛成功率从76.6%提升至81.0%，尽管ViT块数量增加了12倍。因此，我们提出自适应分辨率分块视觉伺服（ART-VS），一种两阶段方法，根据伺服进程调整特征粒度：先以原生ViT分辨率进行粗阶段实现稳定对齐，然后进行分块高分辨率阶段，将匹配限制在局部邻域以提高定位精度。无需任何任务特定训练，ART-VS在扰动下达到95.4%的收敛率，比标准分辨率和全分辨率ViT伺服分别高出18.8和14.4个百分点。与前者相比，定位误差降低53%，同时运行速度比后者快10倍以上，VRAM使用减少27%。我们在三个ViT骨干网络上验证了ART-VS，并展示了真实世界类别级抓取未见过的物体实例，透明瓶成功率95/100，鞋子成功率98/100。代码见该链接。

英文摘要

Visual servoing with self-supervised Vision Transformer (ViT) features enables training-free robotic positioning with strong generalization, but faces a fundamental trade-off between robustness and precision. Coarse patch-level descriptors provide stable correspondences yet limit positioning accuracy. Increasing image resolution improves precision but yields only marginal robustness gains: under perturbation, high-resolution processing improves convergence success rate from 76.6% to just 81.0% despite 12x more ViT patches. Therefore, we propose Adaptive Resolution Tiling Visual Servoing (ART-VS), a two-phase method that adapts feature granularity to servoing progress: a coarse phase at native ViT resolution for stable alignment, then a tiled high-resolution phase that restricts matching to local neighborhoods, improving positioning accuracy. Without any task-specific training, ART-VS achieves 95.4% convergence under perturbation, outperforming standard and full-resolution ViT-based servoing by 18.8 and 14.4 percentage points. Over the former it reduces positioning error by 53%, while running at over 10x higher speed and 27% lower VRAM than the latter. We validate ART-VS across three ViT backbones and demonstrate real-world category-level grasping of unseen object instances, achieving 95/100 on transparent bottles and 98/100 on shoes. Code is available at https://art-vs.github.io/.
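The idea of restricting matching to local neighborhoods in the high-resolution phase can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `local_match`, the square search window, and the `radius` parameter are assumptions made for illustration, and the descriptors are random stand-ins for ViT patch features.

```python
import numpy as np

def local_match(query: np.ndarray, target: np.ndarray, radius: int) -> np.ndarray:
    """For each query patch (i, j), find the most similar target patch inside
    a (2*radius+1)^2 window centred on (i, j). Hypothetical helper.

    query, target: (H, W, D) arrays of L2-normalised patch descriptors.
    Returns an (H, W, 2) integer array of matched (row, col) coordinates.
    """
    H, W, _ = query.shape
    matches = np.zeros((H, W, 2), dtype=int)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            window = target[i0:i1, j0:j1]   # (h, w, D) local neighbourhood
            sims = window @ query[i, j]     # cosine similarities, shape (h, w)
            di, dj = np.unravel_index(np.argmax(sims), sims.shape)
            matches[i, j] = (i0 + di, j0 + dj)
    return matches

# Random stand-in descriptors; matching a grid against itself should be the identity.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 6, 16))
feats /= np.linalg.norm(feats, axis=-1, keepdims=True)
matches = local_match(feats, feats, radius=1)
```

Compared with a global search over all patches, the window bounds the number of candidates per patch, which is one plausible reason a tiled phase can be both faster and less prone to distant false matches.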

2606.19088 2026-06-18 cs.RO 新提交

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

ReSiReg：面向语言条件机器人任务的空间一致语义

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

发表机构 * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence（格拉茨技术大学，软件工程与人工智能研究所）； University of Applied Sciences Technikum Wien, Department of Industrial Engineering（维也纳应用科技大学，工业工程系）； University of Alicante, Department of Computer Technology（阿利坎特大学，计算机技术系）； University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research（自然资源与生命科学大学，整合自然保护研究所）

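AI summary: Proposes ReSiReg, which reconstructs features from spatially consistent VLM intermediates to improve dense language-grounded retrieval, improves spatial consistency on OVSS and 3D mapping, and releases a compact 25M-parameter VLM.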

AI中文摘要

视觉-语言模型（VLM）使机器人能够遵循开放语言指令。然而，密集的VLM嵌入已被证明存在噪声且缺乏空间一致性。这对于需要同时推理语义和3D空间的机器人应用来说是有问题的。我们研究了近期VLM的空间结构，并提出了ReSiReg，一种特征重构方法，利用空间一致的VLM中间特征来改善密集语言接地检索。ReSiReg将中间特征聚类为视觉原型，推导其语言描述符，并将每个补丁重构为原型级语言嵌入的软混合。我们在OVSS和3D映射上跨骨干网络进行定量评估，并在真实世界操作场景中进行定性评估。定量结果显示密集检索得到改善；操作场景显示出更空间一致的目标激活。我们进一步为机器人应用提供了一个紧凑的25M密集VLM，远小于ViT-B基线且具有竞争力。可从此网址获取。

英文摘要

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io.
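The reconstruction step described in the abstract, each patch as a soft mixture of prototype-level language embeddings, might look roughly like the sketch below. It assumes cosine similarity to the prototypes followed by a temperature-scaled softmax; the function names, the `temperature` value, and the random inputs are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reconstruct_patches(patch_feats, prototypes, proto_lang_emb, temperature=0.1):
    """Rebuild each patch as a convex combination of prototype language embeddings.

    patch_feats:    (N, D) intermediate visual features, one row per patch
    prototypes:     (K, D) visual prototypes (e.g. cluster centres)
    proto_lang_emb: (K, E) one language embedding per prototype
    Returns (N, E) reconstructed embeddings and the (N, K) mixture weights.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    weights = softmax(p @ c.T / temperature, axis=1)  # soft assignment per patch
    return weights @ proto_lang_emb, weights

rng = np.random.default_rng(0)
recon, weights = reconstruct_patches(
    rng.normal(size=(5, 8)),   # 5 patches
    rng.normal(size=(3, 8)),   # 3 prototypes
    rng.normal(size=(3, 4)),   # 3 language embeddings
)
```

Because every patch is expressed through the same small set of prototype embeddings, neighbouring patches assigned to the same prototype receive nearly identical language embeddings, which is the intuition behind the improved spatial consistency.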

2606.19079 2026-06-18 cs.AI 新提交

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

ARIADNE: 推理时适配器动态选择的不可知路由

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

发表机构 * University of Turin（都灵大学）； Samsung AI Center（三星人工智能中心）

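AI summary: Proposes ARIADNE, a training-free, adapter-agnostic routing framework that represents each adapter by centroids of its training-set embeddings and selects an adapter at inference time by latent-space distance, with no access to adapter internals and no additional training; it reaches 89.7% selection accuracy on 44 tasks.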

AI中文摘要

参数高效微调（PEFT）的日益部署导致了模型生态系统，其中单个骨干网络与许多任务专用适配器配对。在这种设置下，推理时的查询通常没有任务标签，要求系统从不断增长且异构的适配器池中自动选择最合适的适配器。现有的路由方法要么依赖于对适配器内部（如权重分解或基于梯度的统计信息）的访问，要么需要额外的路由器训练，这限制了随着新适配器添加的可扩展性和可移植性。我们提出了ARIADNE，一个无训练、与适配器无关的路由框架，用于推理时的动态适配器选择。ARIADNE通过从其训练集的嵌入计算的一组质心来表示每个适配器，捕获与该适配器相关的数据分布。给定一个无标签输入，它通过测量在潜在空间中与这些质心的接近度来选择适配器。由于路由完全在输入嵌入空间中进行，ARIADNE与任意PEFT方法兼容，并且不需要对适配器或训练过程进行修改。主要使用Llama 3.2 1B Instruct在23个不同的NLP任务上进行评估，ARIADNE恢复了97.44%的上限性能。扩展到44个任务，它实现了89.7%的平均选择准确率，无需额外训练或访问适配器内部信息。

英文摘要

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
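Centroid-based routing of this kind can be sketched in a few lines. The version below is a simplification under stated assumptions: it keeps a single mean centroid per adapter (the abstract describes a set of centroids), uses Euclidean distance, and the adapter names and embeddings are synthetic placeholders rather than outputs of a real encoder.

```python
import numpy as np

def build_centroids(train_embeddings):
    """Map each adapter name to a (1, d) array holding the mean of its training embeddings."""
    return {name: emb.mean(axis=0, keepdims=True) for name, emb in train_embeddings.items()}

def route(query, centroids):
    """Return the adapter whose nearest centroid is closest to the query embedding."""
    def distance(name):
        return np.linalg.norm(centroids[name] - query, axis=1).min()
    return min(centroids, key=distance)

# Synthetic embeddings for two hypothetical adapters, clustered around 0 and 3.
rng = np.random.default_rng(0)
train = {
    "sentiment_adapter": rng.normal(loc=0.0, scale=0.1, size=(50, 4)),
    "translation_adapter": rng.normal(loc=3.0, scale=0.1, size=(50, 4)),
}
centroids = build_centroids(train)
chosen = route(np.full(4, 2.9), centroids)
```

Adding a new adapter only requires computing its centroids and inserting them into the dictionary, which is consistent with the abstract's point that no router retraining is needed as the pool grows.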

2606.19073 2026-06-18 cs.CV 新提交

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

驯服I2V模型用于图像HOI编辑：认知基准与智能体自校正框架

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

发表机构 * Wangxuan Institute of Computer Technology, Peking University, Beijing, China（王选计算机研究所，北京大学，北京，中国）； National Institute of Health Data Science, Peking University, Beijing, China（国家健康数据科学研究院，北京大学，北京，中国）

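AI summary: Proposes the HOI-Edit benchmark and the SCPE framework, which exploits the temporal generation capability of I2V models for dynamic human-object interaction editing and iteratively refines prompts through self-correction, achieving performance competitive with SOTA.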

AI中文摘要

当前的图像编辑方法在静态属性上表现出色，但在复杂的人-物交互（HOI）上失败，这是一个关键挑战，现有基准将HOI与静态属性混淆，依赖无法同时评估动态交互有效性和纠缠的人-物对保留的全局指标。因此，我们首先引入HOI-Edit，一个包含三个渐进认知层次的综合基准，其特点是自动化指标HOI-Eval，通过让VLM在思考后对包含基础人-物对的图像进行问答，可靠地评估实例级交互。考虑到任务本质是重塑动态关系，我们对图像到视频（I2V）模型进行基准测试，发现它们由于其时间生成能力而天生适合动态编辑。关键的是，除了优越的性能，这种能力提供了“失败过程的重放”，为错误原因提供了独特的可诊断性。因此，我们提出SCPE（自校正过程编辑），一种新颖的智能体自校正框架，通过迭代优化的提示约束I2V模型的生成，使生成的视频更准确地呈现目标HOI。从这些视频中提取的帧是最终的编辑结果。在HOI-Edit上，SCPE在交互上达到了与最先进（SOTA）编辑模型（如Nano Banana）竞争的性能。代码可在该https URL获取。

英文摘要

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). Existing benchmarks leave this critical challenge unaddressed: they conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess dynamic interaction validity and the preservation of entangled human-object pairs. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. It features an automated metric, HOI-Eval, which reliably evaluates instance-level interaction by having a VLM answer questions after reasoning over images containing grounded human-object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Frames extracted from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.

2606.19067 2026-06-18 cs.RO cs.CV 新提交

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

传感器配置至关重要：四足机器人多模态SLAM的系统评估

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

发表机构 * Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT)（卡尔斯鲁厄理工学院机器智能与机器人实验室）； Institute for Intelligent Systems, Esslingen University of Applied Sciences（埃斯林根应用科学大学智能系统研究所）； Department of Computer Science, University of Freiburg（弗赖堡大学计算机科学系）

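AI summary: Addressing sensor configuration for quadruped locomotion, systematically evaluates visual, visual-inertial, and LiDAR-visual-inertial SLAM methods, finding that stereo cameras, global shutters, and appropriate inertial integration significantly improve localization robustness.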

AI中文摘要

四足机器人在不同环境中的自主导航从根本上依赖于鲁棒的同步定位与地图构建（SLAM）。虽然视觉-惯性SLAM在轮式、手持和空中平台上已经成熟，但在腿部运动的剧烈动态下，硬件级传感器配置如何影响性能仍存在关键的评估空白。四足机器人引入了独特的具身感知挑战，包括足部冲击、高频机械振动和快速角旋转，这些都会降低标准感知管道的性能。为了填补这一空白，我们使用在ANYmal D四足机器人上记录的GrandTour数据集，对最先进的视觉、视觉-惯性和LiDAR-视觉-惯性SLAM方法进行了系统评估。我们分离并量化了相机模态、快门技术和惯性传感器层级的影响，分析了它们在定位精度、算法鲁棒性和计算资源利用方面的权衡。我们的实证结果表明，硬件选择对系统鲁棒性有显著影响：立体配置始终优于单目和RGB-D模态，全局快门相机相比卷帘快门相机显著减少了运动引起的跟踪失败，并且关键的是，在剧烈的腿部运动下，标准惯性集成可能降低主要基于视觉的框架的性能。这些见解还为定制传感器负载提供了具体的设计指南，以实现敏捷腿部系统的可靠感知。

英文摘要

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.19062 2026-06-18 cs.CV 新提交

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

DREAM: 通过双目标编码扩展视觉-语言模型用于跨模态检索

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

发表机构 * Sejong University（世宗大学）； Korea Advanced Institute of Science and Technology（韩国科学技术院）； Ulsan National Institute of Science and Technology（蔚山科学技术院）

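AI summary: Proposes DREAM, a dual-path representation enhancement and alignment model that combines a hierarchical vision encoder with hybrid language modeling and sets a new SOTA on video retrieval.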

AI中文摘要

在当今媒体驱动的世界中，视频内容在监控、教育和娱乐等领域的指数级增长使得通过自然语言查询检索语义相关视频变得日益关键。早期的视频检索系统依赖于手工特征或浅层跨模态映射，限制了其捕捉复杂语义和时间动态的能力。虽然大规模视觉-语言模型改进了跨模态对齐，但在建模细粒度时间依赖和微妙语言结构方面仍存在挑战。本文介绍DREAM：双路径表示增强与对齐模型，一种通过增强视觉和文本编码来解决这些局限性的新型多模态框架。DREAM采用混合语言建模策略，结合掩码和排列语言建模目标，以捕捉局部和全局语言语义。在视觉方面，我们设计了一个具有级联组注意力的层级视觉编码器，通过多阶段令牌交互和从粗到细的注意力细化来整合空间和时间信息。我们通过在广泛使用的MSRVTT、MSVD和LSMDC基准数据集上进行全面评估来验证DREAM，分别取得了49.4%、49.7%和27.3%的新SOTA R1分数。定性分析进一步展示了模型在帧间保持连贯注意力以及将复杂查询与动态视频内容对齐的能力。这些发现强调了层级注意力和双目标文本建模在实现鲁棒、上下文感知视频检索中的有效性，并为推进跨模态表示学习的未来研究铺平了道路。

英文摘要

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
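The R1 scores quoted above are Recall@1 values. As a hedged illustration of how such a metric is typically computed for text-to-video retrieval, the sketch below assumes a square similarity matrix in which query `i` has exactly one correct video, at index `i`; the function name and the toy matrix are made up for the example and do not come from the paper.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose ground-truth item appears among the top-k results.

    sim[i, j] is the similarity between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    ranking = np.argsort(-sim, axis=1)                  # best match first
    truth = np.arange(sim.shape[0])[:, None]
    hits = (ranking[:, :k] == truth).any(axis=1)
    return float(hits.mean())

# Toy example: queries 0 and 1 rank their own video first, query 2 does not.
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.2, 0.3],
])
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```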

2606.19053 2026-06-18 cs.CV 新提交

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

大规模视觉-语言模型在细粒度图像任务上的基准测试：从评估到诊断

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

发表机构 * School of Computer Science and Engineering, Southeast University, China（东南大学计算机科学与工程学院，中国）； Alibaba Group（阿里巴巴集团）； School of Computer Science and Engineering, School of Intelligence Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China（东南大学计算机科学与工程学院、智能科学与工程学院以及新一代人工智能技术及其交叉应用重点实验室，中国）； Wangxuan Institute of Computer Technology, National Key Laboratory for Multimedia Information Processing, Peking University, China（北京大学王选计算机研究所、多媒体信息处理国家重点实验室，中国）； University of Copenhagen, Denmark（丹麦哥本哈根大学）

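AI summary: Proposes the FG-BMK benchmark, with 1.01 million questions and 0.28 million images, which evaluates LVLMs' fine-grained semantic recognition and visual discriminability under human-oriented and machine-oriented paradigms, diagnoses the causes of failure, and identifies bottlenecks in visual representation, semantic grounding, and related areas.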

AI中文摘要

近期大规模视觉-语言模型（LVLMs）展示了显著的多模态感知和推理能力。尽管众多基准从整体或任务特定角度评估了LVLMs，但它们在细粒度图像任务（计算机视觉的基础）上的能力仍未得到充分理解。为填补这一空白，我们引入FG-BMK，一个全面的细粒度评估基准，包含101万问题和28万图像，覆盖从常见物体中心领域到专业领域的多样化场景。FG-BMK通过面向人类和面向机器的范式，联合评估对话级细粒度语义识别和特征级视觉判别能力，从而诊断分析LVLM的失败是否源于视觉表示不足、视觉-语义对齐薄弱或细粒度知识有限。通过对一系列代表性LVLM/VLM的大量实验，我们发现当前LVLMs仍是不充分的细粒度识别器，失败源于视觉表示、语义对齐、模态对齐和类别级知识中相互交织的瓶颈。我们进一步分析了提升细粒度能力的训练设计因素，并考察了视觉和语言扰动如何影响LVLM预测。这些发现为当前LVLMs的局限性提供了诊断性见解，并为未来数据构建和模型设计提供了指导，以开发更可靠的细粒度视觉任务LVLMs。我们的代码已开源，可从此https URL获取。

英文摘要

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

2606.19051 2026-06-18 cs.CL cs.DL cs.IR 新提交

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

研究论文的哪些部分最能揭示其研究方法？来自图书馆与信息科学的证据

Qiuyu Fang, Jiayi Hao, Chengzhi Zhang

发表机构 * Department of Information Management, Nanjing University of Science and Technology, China（南京理工大学信息管理学院）

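AI summary: Proposes a combination strategy based on full-text segmentation; by evaluating the classification performance of different segments and their combinations, finds that the middle-to-late and final segments are more discriminative for identifying research methods, and that adding bibliographic metadata improves classification.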

Comments ASIST 2026

AI中文摘要

研究方法是学术论文中知识贡献的重要载体。研究方法的自动多标签分类可以支持方法检索、综述生成和研究情报分析等知识服务。现有研究主要依赖标题和摘要，但摘要通常只提供有限的方法信息，而利用全文内容则面临篇幅过长和信息冗余的挑战。因此，本文提出一种根据物理位置划分全文内容的段落组合策略。利用来自图书馆与信息科学领域三种代表性期刊（JASIST、LISR 和 JDoc）的 1,954 篇全文文章的标注语料，我们评估了多种模型下不同段落及其组合的分类性能。实验结果表明，方法信息在全文内容中分布不均匀，中后段和末尾段表现出更强的区分能力。此外，将书目元数据与跨段组合策略相结合，有效提升了分类性能。

英文摘要

Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy by partitioning the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.
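Partitioning a document by physical position and then recombining selected segments can be sketched as follows. The abstract does not state how many segments are used or how they are delimited, so the five equal-length token segments and the helper names here are assumptions chosen purely for illustration.

```python
def split_into_segments(tokens, n_segments=5):
    """Split a token list into n_segments contiguous, near-equal positional chunks."""
    base, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)  # spread the remainder over early chunks
        segments.append(tokens[start:end])
        start = end
    return segments

def combine(segments, indices):
    """Concatenate the chosen segments, in the order given, into one classifier input."""
    return [tok for i in indices for tok in segments[i]]

tokens = list(range(12))                 # stand-in for a tokenised full text
segments = split_into_segments(tokens)   # five positional segments
late_input = combine(segments, [3, 4])   # e.g. middle-to-late plus final segment
```

A combined input such as `late_input` would then be passed to whichever multi-label classifier is being evaluated, optionally prefixed with bibliographic metadata.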

2606.19047 2026-06-18 cs.AI 新提交

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS: 面向多轮工具使用智能体的奖励驱动在线数据合成

Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

发表机构 * Zhejiang University（浙江大学）； Shanghai Innovation Institute（上海创新研究院）； Westlake University（西湖大学）

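AI summary: To address the rapid depletion of informative samples in static datasets for multi-turn tool-use reinforcement learning, proposes RODS, which uses progress reward variance as a zero-cost boundary detector and synthesizes samples online that match the agent's capability boundary, matching a 17K-sample offline pipeline with about 800 samples.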

AI中文摘要

多轮工具使用强化学习受限于静态数据集中信息样本的快速耗尽。我们观察到GRPO中的梯度信号集中在具有最高 rollout 奖励方差的任务上，这是Popoviciu上界的结果。因此，位于智能体能力边界附近（成功与失败大致平衡）的样本贡献了不成比例的大策略梯度。随着训练进行，该边界不断移动，逐渐耗尽静态数据集中的信息样本池。我们提出RODS（奖励驱动在线数据合成）来解决这种耗尽问题。RODS通过将进度奖励方差重新用作一个实用的、零成本的边界检测器（除了训练中已计算的rollout外无需额外推理），来闭环RL训练与数据生成。它持续识别这些边界样本，通过技能对齐的重采样管道合成与其结构复杂度（例如API拓扑和依赖深度）匹配的新多轮变体，并管理一个与策略共同演化的动态回放缓冲区。从400个人工种子开始并维持约800个样本的活动训练池，RODS实现了与17K样本离线管道相当的性能，同时所需轨迹数量约少20倍，并在我们的受控设置中优于固定数据RL和环境增强方法。

英文摘要

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary, where successes and failures are roughly balanced, contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
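Popoviciu's inequality states that a random variable bounded in [m, M] has variance at most (M - m)^2 / 4, with equality when the mass is split evenly between the two endpoints. The sketch below shows how per-task rollout reward variance could serve as a boundary detector on that basis. It is an illustration only: the threshold value, the task names, and the binary rewards are assumptions, and the paper's progress rewards need not be binary.

```python
import numpy as np

def popoviciu_bound(lo: float, hi: float) -> float:
    """Upper bound on the variance of any random variable supported on [lo, hi]."""
    return (hi - lo) ** 2 / 4.0

def boundary_tasks(rollout_rewards, threshold):
    """Return tasks whose rollout reward variance is at least `threshold` (hypothetical rule)."""
    return [task for task, r in rollout_rewards.items() if np.var(r) >= threshold]

# Rewards from rollouts that training has already produced; no extra inference is needed.
rollouts = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],   # always solved: zero variance, no learning signal
    "boundary": [1.0, 0.0, 1.0, 0.0],   # balanced outcomes: variance hits the bound
    "too_hard": [0.0, 0.0, 0.0, 0.0],   # never solved: zero variance
}
selected = boundary_tasks(rollouts, threshold=0.2)
```

Tasks with zero variance yield identical group-relative advantages of zero, which is why only the balanced task is worth using as a seed for new variants in this toy setting.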

2606.19046 2026-06-18 cs.CV 新提交

Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

基于Ky Fan p-k范数分数阶正则化的低秩张量补全

Shan Fan, Feng Zhang, Jianjun Wang, Xi-Le Zhao, Tingwen Huang

发表机构 * School of Mathematics and Statistics, Southwest University（西南大学数学与统计学院）； School of Mathematical Sciences/Research Center for Image and Vision Computing, University of Electronic Science and Technology of China（电子科技大学数学科学学院/图像与视觉计算研究中心）； Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology（深圳理工大学计算机科学与控制工程学院）

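AI summary: Proposes the ratio of the tensor nuclear norm to the Ky Fan p-k norm (TNPK) as a nonconvex surrogate approximating the tensor tubal rank, builds a low-rank tensor completion model, proves that low-rank tensors are local minimizers, designs an ADMM algorithm, and shows experimentally that it outperforms existing methods.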

AI中文摘要

本文通过提出一种新颖的非凸替代，即张量核范数与张量Ky Fan p-k范数（TNPK）之比，来精确逼近张量管秩，从而解决低秩张量补全（LRTC）问题。TNPK具有吸引人的性质，包括尺度不变性、参数灵活性以及在特定p和k选择下存在闭式解。在特定的p和k参数设置下，它退化为张量核范数与张量Ky Fan k范数（TNK）之比或张量核范数与张量Frobenius范数（TNF）之比。我们构建了一个LRTC模型，并在张量零空间性质（NSP）下，证明了低秩张量是所提模型的局部极小点。此外，我们推导了Ky Fan p-k逆范数的近端算子，并进一步开发了一种高效的交替方向乘子法（ADMM）算法，在温和条件下保证子序列收敛。在合成和真实世界数据集上的大量实验验证了我们的方法相对于最先进竞争者的优越性能。

英文摘要

This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct an LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.
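The properties claimed for the ratio can be checked numerically on a matrix analogue. The sketch below assumes the usual definition of the Ky Fan p-k norm, the l_p norm of the k largest singular values, and works on ordinary matrices rather than on tensors under the t-SVD, so it illustrates the surrogate's behaviour and is not the paper's tensor construction. The function names are hypothetical.

```python
import numpy as np

def ky_fan_pk_norm(X: np.ndarray, p: float, k: int) -> float:
    """l_p norm of the k largest singular values of X (assumed definition)."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values in descending order
    return float((s[:k] ** p).sum() ** (1.0 / p))

def nuclear_over_pk(X: np.ndarray, p: float, k: int) -> float:
    """Ratio of the nuclear norm to the Ky Fan p-k norm (matrix analogue of TNPK)."""
    nuclear = float(np.linalg.svd(X, compute_uv=False).sum())
    return nuclear / ky_fan_pk_norm(X, p, k)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
rank_one = np.outer(rng.normal(size=5), rng.normal(size=4))

ratio = nuclear_over_pk(X, p=2, k=2)
ratio_scaled = nuclear_over_pk(3.0 * X, p=2, k=2)      # scale invariance
frobenius_case = ky_fan_pk_norm(X, p=2, k=4)           # p = 2, k = full: Frobenius norm
rank_one_ratio = nuclear_over_pk(rank_one, p=2, k=2)   # smallest possible value, 1
```

Setting p = 1 recovers the Ky Fan k norm in the denominator, and p = 2 with k equal to the number of singular values recovers the Frobenius norm, matching the two special cases named in the abstract.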

2606.19036 2026-06-18 cs.LG 新提交

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

稀疏混合专家模型中不连续性的几何与随机分析

Tho Tran Huu, Huu-Tuan Nguyen, Thien-Hai Nguyen, Nhat-Tri Ho, Viet-Hoang Tran, Tho Quan, Tan Minh Nguyen

发表机构 * Department of Mathematics, National University of Singapore, Singapore（新加坡国立大学数学系）； Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam（胡志明市技术大学计算机科学与工程学院）

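AI summary: Gives a geometric and stochastic analysis of discontinuities in sparse Mixture-of-Experts models: classifies discontinuities by order, establishes asymptotic volume estimates, proves that random paths almost surely first hit an order-1 discontinuity, and proposes a low-overhead smoothing mechanism that improves performance.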

Comments ICML 2026 Spotlight

AI中文摘要

稀疏混合专家（SMoE）架构现已广泛应用于最先进的语言和视觉模型中，其中条件路由允许扩展到非常大的网络。然而，正是这种Top-$k$专家选择使得条件路由成为可能，同时也导致SMoE映射本质上不连续。在这些不连续曲面附近，即使任意接近的输入也可能激活截然不同的专家集，从而产生显著不同的输出。本文对这些不连续性进行了严格的几何和随机分析。首先，我们根据切换事件中并列专家的数量对不连续性进行阶数分类。利用测度论切片论证，我们建立了加厚不连续曲面的渐近体积估计，表明低阶不连续集占主导地位，而高阶不连续集占据的体积相对极小。接着，通过扩散过程对输入空间中的随机扰动建模，我们证明路径最终会遇到不连续，并且首次击中几乎必然发生在阶数为1的不连续上，同时给出了显式的有限时间概率界。我们进一步推导了占据时间界，量化了随机路径在每个不连续阶数邻域内停留的时长。这些理论结果表明输入更可能位于低阶不连续附近。受此启发，我们提出一种简单的平滑机制，可直接应用于现有SMoE，在接近不连续处软性地整合专家；我们的分析保证增加的额外计算开销很小，同时在不连续附近提供局部平滑，跨语言和视觉任务的实验表明，平滑不仅增强了SMoE映射的连续性，还提升了经验性能。

英文摘要

Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection that enables conditional routing also renders the SMoE map inherently discontinuous. In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts, resulting in significantly different outputs. In this work we give a rigorous geometric and stochastic analysis of these discontinuities. We first classify them by order, determined by the number of tied experts at a switching event. Using measure-theoretic slicing arguments, we establish asymptotic volume estimates for the thickened discontinuity surfaces, showing that lower-order discontinuity sets dominate, whereas higher-order ones occupy a vanishingly small relative volume. Next, modeling random perturbations in the input space via a diffusion process, we prove that the path eventually encounters a discontinuity, and moreover that the first hit almost surely occurs on an order-1 discontinuity with explicit finite-time probability bounds. We further derive occupation-time bounds that quantify the duration the random path spends in the neighborhoods of each discontinuity order. These theoretical results imply that inputs are more likely to lie near lower-order discontinuities. Motivated by this insight, we propose a simple smoothing mechanism that can be directly applied to existing SMoEs, softly incorporating experts near discontinuities; our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
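The discontinuity caused by Top-$k$ selection is easy to reproduce in a toy setting. The sketch below uses a hand-built two-expert layer with k = 1; the gate, the experts, and the helper name `top_k_smoe` are illustrative assumptions and not the models studied in the paper. Two inputs that differ by about 1e-6 are routed to different experts and produce outputs that are far apart.

```python
import numpy as np

def top_k_smoe(x, gate_weights, experts, k):
    """Toy SMoE layer: softmax-weighted sum over the k experts with the largest gate logits."""
    logits = gate_weights @ x
    top = np.argsort(-logits)[:k]                      # indices of the selected experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()
    output = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
    return output, set(top.tolist())

gate = np.eye(2)                       # gate logit i is simply coordinate i of the input
experts = [np.eye(2), -np.eye(2)]      # expert 0 keeps x, expert 1 negates it
eps = 1e-6
x_a = np.array([1.0 + eps, 1.0])       # just on expert 0's side of the tie
x_b = np.array([1.0, 1.0 + eps])       # just on expert 1's side of the tie

y_a, set_a = top_k_smoe(x_a, gate, experts, k=1)
y_b, set_b = top_k_smoe(x_b, gate, experts, k=1)
input_gap = np.linalg.norm(x_a - x_b)
output_gap = np.linalg.norm(y_a - y_b)
```

Here the tie surface is the line where the two gate logits are equal; a smoothing mechanism of the kind the abstract proposes would blend both experts in a thin band around that line instead of switching abruptly.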

2606.19031 2026-06-18 cs.RO 新提交

Congestion-Aware Robot Tour Planning in Crowded Environments

拥挤环境中的拥塞感知机器人巡视规划

Stefano Bernagozzi, Charlie Street, Masoumeh Mansouri, Lorenzo Natale

发表机构 * Istituto Italiano di Tecnologia（意大利理工学院）； Università di Genova（热那亚大学）； University of Birmingham（伯明翰大学）

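AI summary: Proposes a probabilistic tour planner that learns a human-flow prediction model and builds a Markov decision process online to route the robot efficiently through crowded environments, reducing the impact of congestion.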

Comments Accepted to IEEE IROS 2026

AI中文摘要

自主移动服务机器人通常需要完成在环境中遍历一组位置的巡视任务。例如，引导人们穿过购物中心、在配送中心递送包裹或在博物馆提供导览。然而，在拥挤环境中，人群的存在可能对机器人性能产生负面影响。例如，人类会触发机器人的碰撞避免操作，从而降低机器人速度。人群随机移动且随时间变化。本文提出一种针对拥挤环境的概率巡视规划器，该规划器明确考虑人类拥塞。我们学习圆形线性流场（CLiFF）地图，该地图根据初始观测预测人类轨迹。然后，我们利用这些预测在线构建并求解马尔可夫决策过程，从而高效地将机器人引导通过环境。我们的方法具有足够的可扩展性，能够在观察到新人群时重新规划。我们在购物中心的真实人群数据集上评估了该方法。

英文摘要

Autonomous mobile service robots are often required to complete tours that require navigating through a set of locations in an environment. Example domains include guiding people through a shopping mall, delivering packages in a fulfilment centre, or giving guided tours in a museum. However, in crowded environments, the presence of people may negatively impact robot performance. For example, humans will activate robot collision avoidance manoeuvres that slow the robot down. Crowds move stochastically and vary throughout the day. In this paper we present a probabilistic tour planner for crowded environments which explicitly reasons over human congestion. We learn circular linear flow field (CLiFF) maps which predict human trajectories given an initial observation. We then use these predictions to build and solve a Markov decision process online which efficiently routes the robot through the environment. Our approach is scalable enough to re-plan as new people are observed. We evaluate our approach on a real-world crowd dataset in a shopping mall.

2606.19025 2026-06-18 cs.LG cs.AI cs.DC cs.SY eess.SY 新提交

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

FoMoE: 打破全副本壁垒的专家混合联邦系统

Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

发表机构 * DeepSeek-AI

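AI summary: Proposes FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers; combined with partial expert replication and a skip-token mechanism, it substantially reduces communication overhead and improves throughput.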

AI中文摘要

预训练大型语言模型（LLMs）通常需要大规模基础设施，配备紧密耦合的硬件加速器。虽然增加模型和数据集规模仍是性能的主要驱动力，但专家混合（MoE）架构最近通过将参数数量与计算成本解耦，取得了最先进的结果。这种效率使得在受限计算预算下训练大规模模型成为可能，但通常需要单个数据中心的高速互连。为了克服这些物理限制，最近的方法如DiLoCo和Photon使用低通信数据并行方法，使得能够在地理分布、弱连接的数据中心之间进行扩展。然而，这些方法存在根本性的低效问题：它们需要在每个站点拥有完整的模型副本，这带来了高昂的内存约束和通信开销。在这项工作中，我们引入了FoMoE，一个通过跨工作节点分区专家层来打破全副本范式的系统。我们证明FoMoE：（I）通过部分专家复制，在所研究的场景中，相比高效基线降低了高达1.42倍的通信成本，相比DDP降低了45.44倍；（II）通过一种新颖的跳跃令牌机制，实现了高达1.4倍的经验吞吐量加速；（III）在训练代理场景中展示了稳定的路由，并通过系统建模将通信/内存优势推广到100B规模的配置。

英文摘要

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoE) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

2606.19019 2026-06-18 cs.CV New submission

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

Yuchen Rao, Xuqian Ren, Yinyu Nie, Sayan Deb Sarkar, Biao Zhang, Vincent Lepetit, Friedrich Fraundorfer

Affiliations: Graz University of Technology, Austria; Tampere University, Finland; Technical University of Munich, Germany; Stanford University, United States of America; Xi’an Jiaotong University, China; École des Ponts ParisTech, France

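AI summary: Proposes FlowObject, a framework that uses a dual-space guidance strategy to steer the ODE trajectory of a flow-matching model, completing unobserved regions with generative priors while staying consistent with real observations; an integrated 3DGS refinement stage closes the gap between generative outputs and photorealistic reconstruction, significantly improving geometric completeness and view-dependent appearance fidelity.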

Comments Project page: https://yuchenrao.github.io/projects/flowObject/flowObject.html

Abstract

Recovering complete 3D representations of objects from a few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from "synthetic bias", where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between "synthetic-looking" generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.
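The general idea of steering an ODE trajectory with observations can be illustrated with a toy guided Euler integrator. This is a generic, single-space gradient-guidance sketch and not FlowObject's dual-space strategy: the velocity field, the quadratic observation loss, the guidance scale, and all names are assumptions made for illustration.

```python
def guided_euler(x0, velocity, obs, mask, scale=5.0, steps=100):
    """Integrate dx/dt = v(x, t) - scale * grad(loss) with Euler steps.

    The loss is 0.5 * (x_i - obs_i)^2 on observed coordinates only, so the
    guidance term pulls those coordinates toward the observations.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        for i in range(len(x)):
            grad = (x[i] - obs[i]) if mask[i] else 0.0
            x[i] += dt * (v[i] - scale * grad)
    return x


# Toy stand-in for a learned prior: the velocity relaxes every coordinate to 1.0.
prior = [1.0, 1.0]


def toy_velocity(x, t):
    return [p - xi for p, xi in zip(prior, x)]


# Coordinate 0 is observed at 0.2; coordinate 1 is unobserved.
result = guided_euler([0.0, 0.0], toy_velocity, obs=[0.2, 0.0], mask=[True, False])
print(result)
```

In this toy setting the observed coordinate settles between the prior value and the observation, while the unobserved coordinate is driven by the prior alone.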

2606.19005 2026-06-18 cs.CL cs.LG New submission

Sumi: Open Uniform Diffusion Language Model from Scratch

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

Affiliations: Tohoku University

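AI summary: Presents Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens; it is competitive with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, and its weights, checkpoints, and training recipe are openly released.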

Abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
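The property that any token may change at any step is tied to the forward process commonly used in uniform-state discrete diffusion: each token is kept with some probability and otherwise replaced by a uniform draw from the vocabulary, so no mask symbol is involved. A minimal sketch, assuming that standard form; the function name, vocabulary size, and keep probability are illustrative and are not taken from Sumi's training recipe.

```python
import random


def uniform_corrupt(tokens, keep_prob, vocab_size, rng):
    """Keep each token with probability keep_prob, else resample it
    uniformly from the vocabulary (the draw may repeat the original)."""
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in tokens
    ]


rng = random.Random(0)
clean = [5, 17, 42, 8, 99]
noisy = uniform_corrupt(clean, keep_prob=0.5, vocab_size=100, rng=rng)
print(noisy)
```

Because a corrupted position holds an ordinary vocabulary token rather than a mask, a denoiser trained on such inputs cannot tell from the input alone which positions are noise, which is why every position remains a candidate for revision.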

2606.19002 2026-06-18 cs.CL New submission

Enhancing Multilingual Reasoning via Steerable Model Merging

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

Affiliations: Beijing University of Posts and Telecommunications; Fudan University; Beihang University; Monash University; Zhongguancun Laboratory; Nanjing University; Tsinghua University

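AI summary: Proposes the Steerable Model Merging (ST-Merge) framework, which uses a gated cross-attention mechanism to adaptively modulate each source model's contribution and outperforms strong baselines on multilingual reasoning tasks.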

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL 2026 Findings

Abstract

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs, which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.
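Input-dependent weighting of two source models can be illustrated with a single scalar gate over their hidden vectors. This is a deliberately simplified sketch: it omits the cross-attention part of ST-Merge entirely, and the gate parameterization, names, and example vectors are assumptions rather than the paper's design.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_mix(h_a, h_b, weights, bias):
    """Blend two hidden vectors with a gate computed from both of them.

    gate = sigmoid(weights . [h_a; h_b] + bias)
    output = gate * h_a + (1 - gate) * h_b
    """
    features = list(h_a) + list(h_b)
    gate = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    mixed = [gate * a + (1.0 - gate) * b for a, b in zip(h_a, h_b)]
    return mixed, gate


h_multilingual = [1.0, 2.0]  # hypothetical hidden state from one source model
h_reasoning = [3.0, 4.0]     # hypothetical hidden state from the other
mixed, gate = gated_mix(h_multilingual, h_reasoning, weights=[0.0] * 4, bias=0.0)
print(mixed, gate)
```

With zero weights the gate is fixed and the mix reduces to a static average, which is the one-size-fits-all case; nonzero weights let the blend vary with the input.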
