2606.04485 2026-06-05 cs.LG

LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models

Yuanrui Wang, Xingxuan Zhang, Han Yu, Mingchao Hao, Gang Ren, Hao Yuan, Li Mao, Yunjia Zhang, Chun Yuan, Peng Cui

Affiliations: University of Science and Technology of China

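AI summary: Proposes LimiX-2M, a unified tokenize-and-route framework that uses RaBEL to expand each scalar into localized RBF features and reorders the bidirectional block as S→N→F, outperforming larger models with only 2M parameters and improving the accuracy-efficiency trade-off of tabular foundation models.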

Comments Accepted to ICML 2026

Abstract

Tabular foundation models (TFMs) increasingly rival tree ensembles, but their performance is often compute-inefficient: with standard affine scalar tokenization, each feature injects value variation through an essentially one-dimensional channel, and feature IDs/positional signals cannot increase within-feature value degrees of freedom, yielding weak early-layer value sensitivity and redundant hidden states. We present a unified tokenize-and-route framework for strong TFMs: RaBEL expands each scalar into compact localized RBF features (optionally exponent-gated) to improve conditioning and shallow-layer effective rank, while a reordered bidirectional block S->N->F aligns computation with the readout by aggregating cross-sample context before feature mixing and using attention pooling. Together, these changes yield LimiX-2M, a 2M-parameter model that outperforms larger TabPFN-v2 and TabICL baselines on widely used tabular benchmarks while reducing training and inference costs. These results highlight value-aware tokenization and readout-aligned routing as key levers for improving the accuracy-efficiency trade-off in TFMs. Model checkpoints and inference code are available at https://github.com/limix-ldm-ai/LimiX.
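
To illustrate the tokenization contrast in the abstract, here is a minimal sketch of a generic Gaussian RBF expansion next to an affine scalar token. It is not the paper's RaBEL implementation: the function names `rbf_expand` and `affine_token`, the fixed grid of `centers`, and the shared `width` are assumptions made for illustration, and the optional exponent gating is not shown.

```python
import math

def affine_token(x, w, b):
    """Affine scalar tokenization: every coordinate is linear in x,
    so the token moves along a single direction as x varies."""
    return [wi * x + bi for wi, bi in zip(w, b)]

def rbf_expand(x, centers, width):
    """Generic Gaussian RBF expansion (an assumption, not the paper's exact form):
    each feature responds mainly to values of x near its own center."""
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]

# Hypothetical fixed grid of centers and width, chosen only for this example.
centers = [-1.0, -0.5, 0.0, 0.5, 1.0]
features = rbf_expand(0.5, centers, width=0.25)
```

Each RBF feature is localized around one center, so different value ranges of the same column activate different coordinates, whereas the affine token only rescales and shifts a single direction.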

2606.04463 2026-06-05 cs.RO

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

Zhuoyuan Wu, Jun Gao

Affiliations: Peking University; University of Michigan; NVIDIA

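AI summary: Proposes OSCAR, an action-conditioned video world model that uses a large-scale data pipeline and 2D skeleton rendering as a unified representation to generalize across robot embodiments and support policy evaluation.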

Comments Project page: https://wuzy2115.github.io/oscar-project-page/

Abstract

We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate the significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for the future where robot policies can be purely evaluated in virtual generated worlds.

2606.04335 2026-06-05 cs.LG cs.SY eess.SY

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

Affiliations: Institute for Logic, Language and Computation, University of Amsterdam; Khoury College of Computer Sciences, Northeastern University; Center for Language and Cognition, University of Groningen

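AI summary: Studies how conversational context drives differences in LLM answers in high-stakes scenarios, finding that conversation topic is the main driver of sociodemographic disparities and affects answers in unpredictable ways.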

Abstract

When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

2606.02750 2026-06-05 cs.CL

On the Persistent Effects of Lexicality in Large Language Models

Hammad Rizwan, Muhammad Umair Haider, Nishant Subramani, Mona T. Diab, A. B. Siddique, Hassan Sajjad

Affiliations: Dalhousie University; University of Kentucky; Carnegie Mellon University

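AI summary: Using adversarial semantic stress tests and an information-theoretic perspective, the paper quantifies the effect of lexical overlap relative to semantic content in LLMs, finds that lexical influence extends across model depth with a mid-depth transitional region where lexical and semantic signals degrade together, and illustrates the downstream impact with summarization and model editing.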

Abstract

Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to an information-theoretic perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as case studies.

2606.02684 2026-06-05 cs.LG cs.AI cs.CL

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

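Affiliations: Tsinghua University (THU); Hong Kong University of Science and Technology (HKUST); Beijing Institute of Technology (BIT); Meituan; Zhejiang University (ZJU)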

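AI summary: Proposes FiRe-OPD for on-policy distillation, which combines trajectory-level filtering with token-level soft reweighting for finer-grained optimization and outperforms existing methods across multiple settings.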

Abstract

On-policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. Specifically, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
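
The two-level idea in the abstract (drop low-quality trajectories, then softly reweight tokens in the ones that remain) can be sketched as follows. This is an illustration under stated assumptions, not the FiRe-OPD implementation: the per-trajectory `score`, the per-token `token_signal`, the threshold `min_score`, and the softmax-style weighting with a `temperature` are all hypothetical choices, since the abstract does not specify them.

```python
import math

def filter_then_reweight(trajectories, min_score, temperature=1.0):
    """Trajectory-level filtering followed by token-level soft reweighting.

    Each trajectory is a dict with a hypothetical quality `score` and a list
    of hypothetical per-token informativeness values `token_signal`.
    """
    # Step 1: filter out low-quality trajectories entirely.
    kept = [t for t in trajectories if t["score"] >= min_score]

    result = []
    for t in kept:
        signals = t["token_signal"]
        m = max(signals)  # subtract the max for numerical stability
        exps = [math.exp((s - m) / temperature) for s in signals]
        total = sum(exps)
        n = len(exps)
        # Step 2: soft weights with mean 1; no token is dropped outright.
        weights = [n * e / total for e in exps]
        result.append({"id": t["id"], "weights": weights})
    return result

trajectories = [
    {"id": "t1", "score": 0.9, "token_signal": [0.1, 2.0, 0.5]},
    {"id": "t2", "score": 0.2, "token_signal": [1.0, 1.0]},
]
weighted = filter_then_reweight(trajectories, min_score=0.5)
```

In this sketch every retained token keeps a strictly positive weight, which is the contrast with hard token selection that the abstract draws; the weights would multiply a per-token distillation loss.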

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

Affiliations: University of Illinois Urbana-Champaign (UIUC); Microsoft

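AI summary: Proposes OpenWebRL, a framework that trains visual web agents on real websites with online multi-turn reinforcement learning; its 4B-parameter model reaches the open-source state of the art on benchmarks and is competitive with closed-source systems.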

Comments 36 pages, 11 figures

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.01935 2026-06-05 cs.CV

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie

Affiliations: Shandong University; Harbin Institute of Technology (Shenzhen)

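AI summary: Proposes R^3, a zero-shot composed video retrieval pipeline that generates reasoning traces to enrich the query representation and adds a reranking stage to verify candidate videos, addressing retrieval from a source video combined with an edit instruction.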

Abstract

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present R^3, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method in addressing this challenge. Code is available at https://github.com/Lee-zixu/R-3.
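
The abstract does not define the agreement-gated residual rule, so the sketch below shows only one plausible reading, with every detail an assumption: a candidate receives a residual boost from the reasoning-augmented score only when both queries place it in their top-k. The function name `fuse_scores` and the parameters `top_k` and `alpha` are hypothetical.

```python
def fuse_scores(base, reasoning, top_k=2, alpha=0.5):
    """Fuse two retrieval score tables (candidate id -> similarity).

    Assumed rule, not taken from the paper: the reasoning-augmented score is
    added as a residual, scaled by `alpha`, only for candidates that appear
    in the top-k of BOTH the base query and the reasoning-augmented query.
    """
    def top(scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        return set(ranked[:top_k])

    agreed = top(base) & top(reasoning)  # the agreement gate
    fused = {}
    for cid, score in base.items():
        gate = 1.0 if cid in agreed else 0.0
        fused[cid] = score + gate * alpha * reasoning[cid]
    return fused

# Toy scores for three hypothetical gallery videos.
base = {"a": 0.9, "b": 0.8, "c": 0.1}
reasoning = {"a": 0.2, "b": 0.95, "c": 0.9}
fused = fuse_scores(base, reasoning)
ranking = sorted(fused, key=fused.get, reverse=True)
```

Gating on agreement keeps the base ranking untouched wherever the two queries disagree, so a noisy reasoning trace cannot promote a candidate on its own; the recalled list would then go to the reranker.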

2606.00644 2026-06-05 cs.AI

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu

Affiliations: Southeast University; Beijing Zhongguancun Academy; Duke Kunshan University

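AI summary: Proposes the ForeSci benchmark, which uses 500 temporally controlled tasks to evaluate whether LLM agents can make forward-looking research judgments from historical evidence, and identifies a decoupling between evidence and decisions.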

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
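
The temporal control described in the abstract (post-cutoff papers hidden during generation and used only for validation) amounts to a date-based split of the corpus. The sketch below is a minimal illustration, not the benchmark's code: the record layout, the function name `split_by_cutoff`, and the choice to treat a paper published exactly on the cutoff date as pre-cutoff are assumptions.

```python
from datetime import date

def split_by_cutoff(papers, cutoff):
    """Split papers into a knowledge base visible to the agent and a
    held-out set used only to validate its forward-looking answers."""
    # Assumption: a paper dated exactly on the cutoff counts as pre-cutoff.
    knowledge_base = [p for p in papers if p["published"] <= cutoff]
    validation_only = [p for p in papers if p["published"] > cutoff]
    return knowledge_base, validation_only

# Hypothetical records; titles and dates are placeholders.
papers = [
    {"title": "Paper A", "published": date(2024, 3, 1)},
    {"title": "Paper B", "published": date(2024, 9, 15)},
    {"title": "Paper C", "published": date(2025, 1, 10)},
]
kb, held_out = split_by_cutoff(papers, cutoff=date(2024, 6, 30))
```

Only `kb` would be indexed for retrieval during answer generation, while `held_out` stays out of reach until scoring.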

2606.00616 2026-06-05 cs.CV cs.AI

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Affiliations: Advanced Micro Devices, Inc.

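AI summary: Proposes the pause-and-think-T dataset and the pause-and-think-B benchmark, training a compact model with reasoning supervision that performs comparably to large models on video scene understanding and goal planning.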

Abstract

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy at 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2606.00522 2026-06-05 cs.CV

Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models

Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng, Zhipeng Lin, Haoxuan Li, Changjun Fan, Shixuan Liu

Affiliations: Laboratory for Big Data and Decision; National University of Defense Technology; National Key Laboratory of Information Systems Engineering; Microsoft Corporation; College of Computer Science and Technology

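AI summary: Proposes GRiD, a framework that uses diffusion models to recast graph-like rule discovery as a discrete generative process conditioned on the target relation, combining supervised pre-training with reinforcement learning to mine graph-like rules efficiently for knowledge graph completion.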

Comments accepted by KDD 26

DOI: 10.1145/3770855.3817814

Abstract

Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD.

2605.30467 2026-06-05 cs.CV

Clustering Guided Domain-Specific Pretrained Foundation Model for Very High-Resolution Arctic Remote Sensing

Amal S. Perera, Chandi Witharana, Elias Manos, Michael Pimenta, Anna K. Liljedahl

Affiliations: Woodwell Climate Research Center

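AI summary: Proposes an Arctic remote sensing foundation model that combines diversity-aware regional image curation with masked autoencoder self-supervised pretraining, substantially improving foreground F1 scores on four labeled datasets.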

Abstract

This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3 million chips from 267 TB of Vantor VHSR imagery. This curation strategy was designed to reduce oversampling of visually repetitive or low-information areas while preserving broad scene diversity across the study domain. We pretrained a ViT-Large encoder on the curated corpus using a domain-adapted MAE reconstruction objective, producing Arctic-specific transformer weights for downstream feature mapping. The pretrained encoder was integrated into an existing location-aware detection and segmentation framework and evaluated across four hand-labeled Arctic datasets. Compared to an ImageNet-initialized ViT-Large baseline, Arctic MAE pretraining consistently improved foreground mean F1, reaching 0.87, 0.72, 0.93, and 0.87 for infrastructure, IWP, RTS, and TCNs, respectively, an increase of approximately 5-8%. The proposed model also outperformed Prithvi-EO-2.0 in all downstream comparisons, with the smallest gain corresponding to an improvement of at least 15% in mean F1, suggesting that domain-specific self-supervised pretraining on curated Arctic VHSR imagery provides more transferable representations for fine-scale Arctic mapping than a general-purpose Earth observation foundation model. These results demonstrate that optimizing the pretraining data distribution at regional scale, while keeping the architecture and MAE objective fixed, can produce a reusable Arctic-domain encoder for multiple VHSR remote sensing applications.

VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training

Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

LimiX-2M: Mitigating Low-Rank Collapse and Attention Bottlenecks in Tabular Foundation Models

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

Policy Gradient for Continuous-Time Robust Markov Decision Processes

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

SenseJudge: Human-Centric Preference-Driven Judgment Framework

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

On the Persistent Effects of Lexicality in Large Language Models

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Generating Graph-Like Logical Rules for Knowledge Graph Reasoning via Diffusion Models

Clustering Guided Domain-Specific Pretrained Foundation Model for Very High-Resolution Arctic Remote Sensing