Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling
Haochen Yuan, Yichen Song, Yunbo Wang, Xiaokang Yang
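AI summary: Proposes Unicorn, a framework that uses a latent prototype codebook to decouple correlation modeling from specific channel identities, enabling scalable multi-dataset pretraining across heterogeneous datasets and significantly outperforming existing models in few-shot transfer scenarios.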
Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but remain "dimension-bounded", struggling to generalize across heterogeneous datasets. To bridge this gap, we introduce Unicorn (Universal Correlation Network), a framework for scalable, multi-dataset pretraining on high-dimensional time series. At the core of Unicorn is a latent prototype codebook that decouples correlation modeling from specific channel identities. By projecting heterogeneous channels into a shared latent space, Unicorn learns identity-agnostic, reusable interaction patterns that transfer across domains with diverse dimensionalities and semantics. Extensive experiments show that Unicorn significantly outperforms state-of-the-art forecasting architectures, particularly in few-shot transfer scenarios, offering a scalable path toward multivariate time series foundation models.
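Gait2Hip-60: A Unified Deep Learning Benchmark for Predicting Hip Muscle Forces and Joint Moments from Multi-Cadence Gait Kinematics
Jiaqi Zhang, Ji Hou, Qing Sun, Xianzhi Gao, Bo Huo
AI summary: This study presents a deep learning framework that uses three models (LSTM, Transformer, and Mamba) to predict hip muscle forces and joint moments directly from lower-limb gait kinematics. Evaluated on data from 60 healthy subjects, the Transformer performs best and retains moderate predictive ability in zero-shot testing on patients with osteonecrosis of the femoral head.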
Estimating hip muscle forces and joint moments during gait typically relies on musculoskeletal simulation, which is informative but time-consuming and difficult to apply in clinical settings. This study developed a deep learning framework to predict these hip dynamics parameters directly from lower-limb gait kinematics and compared three representative sequence models under a unified protocol. Gait data were collected from 60 healthy adults under three metronome-guided cadence conditions. Ten bilateral lower-limb joint angles were used as inputs, and OpenSim-derived hip muscle forces and hip joint moments were used as reference outputs. Three deep learning models of LSTM, Transformer, and Mamba were trained and evaluated using the same subject-level split, preprocessing pipeline, and metrics. The best model was then directly tested on an external cohort of 9 patients with osteonecrosis of the femoral head (ONFH) without retraining. In the healthy-subject benchmark, Transformer achieved the best subject-level mean performance for both hip muscle force prediction (RMSE = 1.33 N/kg, MAE = 0.57 N/kg, R2 = 0.819) and hip joint moment prediction (RMSE = 0.11 Nm/kg, MAE = 0.07 Nm/kg, R2 = 0.862), with similar advantages across walking cadences. In zero-shot external validation, Transformer retained moderate predictive ability in ONFH for hip muscle force prediction (RMSE = 1.51 N/kg, MAE = 0.70 N/kg, R2 = 0.537) and hip joint moment prediction (RMSE = 0.17 Nm/kg, MAE = 0.12 Nm/kg, R2 = 0.569). These findings support the feasibility of estimating hip dynamics from gait kinematics, identify Transformer as a strong baseline, and highlight the need for broader pathological validation and improved generalization before clinical application.
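Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation
Yizhu Wen, Shuhao Zhang, Nan Zhang, Long Cheng, Hanqing Guo
AI summary: Proposes a dual-layer caption poisoning strategy that injects a small number of malicious captions into the music knowledge database, causing retrieval-augmented text-to-music systems to generate music that deviates from user intent and exposing an integrity risk in such systems.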
Retrieval-augmented text-to-music (TTM) systems augment underspecified user prompts using captions retrieved from a music caption dataset. This design introduces an integrity dependency on the music knowledge database. We show that an attacker can poison the database by injecting a small number of crafted music captions, causing the system to retrieve malicious captions that bias prompt augmentation and steer generation away from the user's intended function, without modifying the user prompt, retriever, or generator. To achieve the music caption poisoning attack, we propose a dual-layer caption poisoning strategy that preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker-chosen target intent. In a MusicCaps knowledge database, CLAP retriever, and MusicGen pipeline, poisoned generations move substantially closer to the attacker's target, while remaining comparably aligned with the original user query. These results expose a practical integrity risk for retrieval-augmented creative AI systems. Our demo can be found at: https://yizhu-wen.github.io/Mental-Damage/
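QASM-Eval: A Dataset for Training and Evaluating LLMs on OpenQASM-3 Beyond Quantum Circuits
Zhenxiao Fu, Lei Jiang, Fan Chen
AI summary: To fill the gap in training and evaluating LLMs on hardware-level OpenQASM-3 programming, builds a dataset with an expert-verified test set and a training set covering classical logic, timing scheduling, pulse control, and more, validated automatically by an extended verifier. Experiments show that fine-tuned LLMs improve significantly.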
Quantum computing remains in the Noisy Intermediate-Scale Quantum (NISQ) era, where performance is highly constrained by noise. Addressing this limitation often requires hardware-facing capabilities beyond gate-sequence circuit specification, including mid-circuit measurement and classical feedback for quantum error correction (QEC), precise timing control for dynamical decoupling (DD), and pulse-level waveform access for calibration. OpenQASM-3 was introduced to expose exactly these capabilities, providing a hardware-level programming interface. However, despite the rapid progress of large language models in code generation, there is still no dataset specifically designed to train and evaluate LLMs on OpenQASM-3 programs that involve its advanced hardware-oriented features. To address this gap, we introduce QASM-Eval, the first comprehensive dataset designed to train and evaluate LLMs on OpenQASM-3. Rather than focusing on quantum algorithm design or reasoning, QASM-Eval explicitly targets the language's hardware-facing features. QASM-Eval comprises an expert-verified test set of 100 tasks and a training set of 4,000 tasks, systematically covering classical logic, timing scheduling, pulse control, and complex real-world workflows. To automatically validate generated programs, we check syntax, quantum states, and program timelines using an extended verifier. Our evaluation reveals that while state-of-the-art LLMs struggle heavily in OpenQASM-3 coding tasks, targeted fine-tuning on QASM-Eval yields significant gains. QASM-Eval provides a crucial benchmark and training foundation to accelerate the development of reliable LLM assistants for hardware-facing quantum programming in the NISQ era. Data and code: https://github.com/fuzhenxiao/QASM-Eval
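Self-Supervised Online Robot-Agnostic Traversability Estimation for the Open World
Julia Hindel, Simon Bultmann, Houman Masnavi, Daniele Cattaneo, Abhinav Valada
AI summary: Proposes COTRATE, a framework that estimates traversability from multimodal, unlabeled robot experience through self-supervised online learning, using a robot-agnostic terrain assessment module and a diversity-aware feature selection strategy to enable cross-platform knowledge transfer and reduce forgetting.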
Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptive and inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain assessments. To mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategy that preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of approximately 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.
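Conveyance: A General Framework for Learning over Structured Class Spaces
Yasser Taha, Grégoire Montavon, Nils Körber
AI summary: To address standard loss functions ignoring structural relationships between classes, proposes the Conveyance classification method, which encodes graph-like relations by maximizing two margins over distinct class partitions and matches or exceeds specialized baselines on hierarchical classification, ordinal regression, and multiple instance learning.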
While machine learning (ML) architectures have evolved rapidly to account for complex data, loss functions like cross-entropy remain mostly structure-agnostic in many real-world applications. However, the "class-symmetric" nature of these standard losses fundamentally limits the ability of ML models to exploit structural relationships between classes, particularly when facing structured noise. We propose Conveyance, a new classification approach and associated loss function tailored to structured class spaces. It allows users to encode graph-like relations between classes without having to define complex joint distributions or manually tune utility matrices. Technically, our loss function operates by maximizing two separate margins over distinct class partitions, while preserving formal properties such as monotonicity and partial convexity. We demonstrate the versatility and effectiveness of our method by applying it to hierarchical classification, ordinal regression, and multiple instance learning. Across these tasks, Conveyance either matches or exceeds the performance of specialized baselines, thereby offering a unified solution for structured class spaces.
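PINE: Pruning Boosted Tree Ensembles via Conformal In-Distribution Prediction Equivalence
Haruki Yajima, Yusuke Matsui
AI summary: Proposes PINE, which controls the in-distribution region via conformal calibration and improves the pruning compression ratio by up to 30% while preserving prediction equivalence.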
Tree ensembles are machine learning models with strong predictive performance and interpretability, and remain widely used for tabular data. Standard pruning methods for tree ensembles typically optimize an accuracy-compression trade-off and may change a subset of predictions, potentially compromising decision consistency. Faithful pruning methods address this issue by preserving prediction equivalence over the entire input space, but this requirement leads to lower compression ratios. We propose PINE, a pruning method that provides strong guarantees within an in-distribution region. PINE preserves prediction equivalence within this region and controls the region size using a single parameter $α$ via conformal calibration. Experiments on 12 public tabular datasets show that PINE improves the compression ratio by up to 30% while preserving predictions at a comparable level to existing faithful pruning methods.
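The abstract does not say how the in-distribution region is constructed, so the following is only a minimal sketch of generic split conformal calibration under assumed choices. Each calibration point is assumed to have a hypothetical nonconformity score (larger meaning less typical), and the region is assumed to be every input whose score is at most the calibrated threshold. With n calibration scores, the threshold is the ⌈(n+1)(1-α)⌉-th smallest score, so a smaller α yields a larger region.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    `calibration_scores` are hypothetical nonconformity scores; how PINE
    actually scores inputs is not stated in the abstract.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    scores = sorted(calibration_scores)
    n = len(scores)
    # 1-indexed order statistic used by split conformal prediction.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the region is everything.
        return math.inf
    return scores[rank - 1]


def in_region(score, threshold):
    """Assumed membership rule: a point is in-distribution if its score is small enough."""
    return score <= threshold
```

Under these assumptions, a pruning method would only need to preserve predictions for inputs where `in_region` is true, which is where the extra compression headroom would come from.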
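Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing, Mykel J. Kochenderfer
AI summary: This paper identifies reward bias substitution, in which single-axis mitigation of reward-model biases (such as reducing reliance on length, sycophancy, or style) shifts optimization pressure onto correlated proxies rather than eliminating it. It exposes the problem through theoretical proofs and experiments (for example, a length penalty during GRPO training leads to overconfidence) and recommends incorporating policy-induced distributions into evaluation and tracking multiple biases.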
Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show a published length-debiasing operator that zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four SOTA reward models, and a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.
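Retrieval, Reward, and Training Protocol: What Matters in Training Search Agents?
Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li
AI summary: Through controlled experiments, this paper systematically studies how three dimensions (retrieval corpus, reward design, and training protocol) affect search agent training. It finds that correcting a corpus coverage issue yields larger gains than the differences between training algorithms and that a simple outcome-based reward performs well in most settings, and it offers practical training guidelines.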
Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi-step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under-explored dimensions of search agent training. First, we identify a critical data-coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome-based and process-based reward methods across three base models, finding that the simplest outcome-based approach achieves competitive or superior performance in most settings, and that process-level credit assignment can over-correct agent behavior. Third, we analyze training data diversity, off-policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at https://github.com/YiboZhao624/SearchAgentReview.
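Fundamental Limits of Fraud Detection in Card Payment Networks
Gaurav Dhama
AI summary: By formalizing payment authorization as a sequential decision problem with delayed, censored, corrupted, and counterfactually missing feedback, this paper derives a minimax regret lower bound and shows that ecosystem information quality, not model complexity, is the fundamental bottleneck for fraud detection.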
Card payment fraud detection is usually framed as a supervised classification problem. Although this approach has generated practical progress, improvement has remained incremental despite major advances in model architecture. We argue that this is not mainly a failure of function approximation or optimization, but a consequence of structural information impairments inherent to the payment ecosystem. We formalize card authorization as a sequential decision problem with delayed, censored, corrupted, and counterfactually missing feedback. We derive a minimax regret lower bound showing that these impairments enter multiplicatively in the denominator of the achievable learning rate. The bound implies that improving issuer reporting quality or reducing censorship can yield larger reductions in the regret floor than increasing model complexity. We also show that heterogeneity across issuers worsens learnability beyond what average impairment rates suggest. The paper contributes a theory of why fraud detection in payment networks is fundamentally harder than in standard online learning settings, identifies ecosystem information quality as the key bottleneck, and provides a theoretical basis for prioritizing investments in reporting infrastructure, dispute process quality, and selective exploration. The paper is theory-first and does not rely on proprietary transaction data.
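SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
Haosong Peng, Hao Li, Jiaqi Chen, Yuhao Pan, Runmao Yao, Yalun Dai, Fushuo Huo, Fangzhou Hong, Zhaoxi Chen, Haozhao Wang, Dingwen Zhang, Ziwei Liu, Wenchao Xu
AI summary: Proposes the SpatialBench benchmark, whose cross-paradigm, multi-domain evaluation with deterministic sampling reveals that current spatial foundation models generalize poorly across diverse downstream tasks, and introduces the DA-Next-5M dataset and the DA-Next model to advance spatial representation learning.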
While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on specific domains for which they were specifically designed or trained. Such evaluations are intrinsically limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.
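Alignment Tampering: How Reinforcement Learning from Human Feedback Can Be Exploited to Optimize Misaligned Biases
Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
AI summary: This paper introduces alignment tampering, a vulnerability in which the LLM undergoing alignment influences the preference dataset so that RLHF amplifies undesired behaviors. Experiments show amplification across several kinds of bias, and existing mitigation methods struggle to resolve the problem without sacrificing response quality.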
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/
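Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs
Wenhui Tan, Minghao Li, Xiaoqian Ma, Siqi Fan, Xiusheng Huang, Liujie Zhang, Ruihua Song, Weihang Chen
AI summary: Proposes Pair-In, Pair-Out (PIPO), which unifies latent compression and multi-token prediction and trains a lightweight confidence head to remove verifier overhead, accelerating inference while preserving reliability.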
Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing methods target either the input side (latent compression) or the output side (speculative decoding and multi-token prediction, MTP), but the two lines of work have been pursued independently. Moreover, output-side methods must incur an expensive verifier pass to validate the unreliable draft tokens predicted by MTP. To address these issues, we propose Pair-In, Pair-Out (PIPO), which unifies both sides by viewing a latent compressor and an MTP head as mirror-image operations: the compressor folds two input tokens into one latent representation, while the MTP head unfolds one hidden state into one additional output token. To remove the verifier cost without sacrificing reliability, PIPO trains a lightweight confidence head that decides whether draft tokens should be accepted. We observe that On-Policy Distillation (OPD) naturally matches the rejection-sampling criterion of speculative decoding, so the confidence head can be trained alongside OPD with negligible extra cost. Experiments on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 with Qwen3.5-4B and 9B backbones show that PIPO improves pass@4 over regular decoding by up to $+7.15$ points, while delivering up to $2.64\times$ first-token-latency and $2.07\times$ per-token-latency speedups. Project Page: GitHub.com/RedAI-Infra/PIPO.
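VR-DAgger: Immersive VR for Dexterous Data Collection and Uncertainty-Guided On-Policy Correction
René Zurbrügg, Tifanny Portela, Arjun Bhardwaj, Aravind Elanjimattathil Vijayan, Maximum Wilder-Smith, Marco Hutter
AI summary: Proposes the VR-DAgger framework, which uses a VR application for dexterous teleoperation and data collection and uses MC dropout uncertainty scores to select critical failure segments for correction, improving over behavioral cloning by up to 23 percentage points on dexterous manipulation tasks and reducing per-sample collection time by about 40%.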
Learning from demonstrations is effective for robotic manipulation, but collecting sufficient task-specific data remains a major bottleneck. Under distribution shift, small errors compound, performance degrades, and expert time is often spent on redundant, low-value corrections instead of the few critical failure cases. We present VR-DAgger, a human-in-the-loop framework centered on an immersive VR application for dexterous teleoperation, demonstration collection, and selective policy correction. The VR client provides intuitive hand control with synchronized scene visualization, while a backend workstation runs simulation and learning, enabling autonomous rollouts without continuous operator oversight. We use Monte Carlo (MC) dropout to score uncertainty during Isaac Lab rollouts of a diffusion policy and select informative failure segments for correction. These segments are replayed in VR as clips, where the operator selectively labels and corrects the policy's behavior, concentrating supervision where uncertainty is highest without full-rollout monitoring or a separate intervention classifier. We evaluate on three dexterous manipulation tasks (Pan pick-and-place, Drawer opening, Valve turning) with a 10-DoF XHand under standard and challenging initial configurations. Active labeling consistently improves over behavioral cloning across all tasks, with gains of up to 23 percentage points. Compared to unguided human-in-the-loop inspection, VR-DAgger reduces per-sample collection time by approximately 40% by focusing review on selected segments rather than full rollouts.
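The uncertainty-guided selection step can be illustrated with a small sketch. Everything named here is an assumption for illustration: the abstract only states that MC dropout scores uncertainty and that informative segments are selected, so scoring by action variance across stochastic passes, the top-k rule, and the toy `make_toy_policy` stand-in for a diffusion policy are all hypothetical.

```python
import random
import statistics


def mc_dropout_uncertainty(stochastic_policy, observation, n_samples=20):
    """Mean per-dimension variance of the actions over repeated stochastic passes."""
    samples = [stochastic_policy(observation) for _ in range(n_samples)]
    # zip(*samples) regroups the sampled actions by action dimension.
    return statistics.fmean([statistics.pvariance(dim) for dim in zip(*samples)])


def select_segments(segment_scores, k):
    """Indices of the k highest-uncertainty segments, most uncertain first."""
    ranked = sorted(range(len(segment_scores)),
                    key=lambda i: segment_scores[i], reverse=True)
    return ranked[:k]


def make_toy_policy(drop_prob, rng):
    """Hypothetical stand-in for a network whose dropout stays active at inference."""
    weights = [0.5, -1.0, 2.0]

    def policy(observation):
        # Randomly zero each weight, mimicking dropout on a unit.
        kept = [w if rng.random() >= drop_prob else 0.0 for w in weights]
        return [w * observation for w in kept]

    return policy


if __name__ == "__main__":
    policy = make_toy_policy(drop_prob=0.5, rng=random.Random(0))
    scores = [mc_dropout_uncertainty(policy, obs) for obs in (0.1, 2.0, 0.5, 1.0)]
    print("segments to replay for correction:", select_segments(scores, k=2))
```

Ranking segments and replaying only the top few is one way to concentrate operator time on the least certain parts of a rollout without watching the whole rollout.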
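Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (Extended Preprint)
Paul Sigloch, Christoph Benzmüller
AI summary: Proposes a hybrid verification architecture combining formal symbolic methods with neural semantic analysis to detect hallucinations, inconsistencies, and privacy vulnerabilities in LLM outputs, achieving hallucination detection rates of over 83% for structured entities and 72% for semantic fabrications in a medical device damage assessment system.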
LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities introduce unacceptable risks where errors carry legal, financial, or safety consequences. This paper presents a hybrid verification architecture combining formal symbolic methods with neural semantic analysis to provide complementary guarantees for LLM-generated content. This architecture employs logical reasoning for input verification, leveraging completeness properties to provide decidable guarantees on structured requirements. For output validation, embedding-based semantic similarity detects contextual hallucinations where formal methods lack expressiveness. This separation is realized in a parallel, actor-based pipeline, addressing limitations of prompt-based self-verification approaches, which inherit the distributional biases that produce hallucinations. The proposed architecture and type-aware verification method are validated with HAIMEDA, a real-world medical device damage assessment reporting system developed through Action Design Research. Evaluation shows hallucination detection rates of over 83% for structured entities and 72% for semantic fabrications, with a 30% reduction in report creation time, demonstrating that neuro-symbolic architectures can provide principled safeguards for LLM deployment in data-sensitive domains.
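When the Muon Optimizer Meets Adversarial Training: A Theoretical and Empirical Study
Jun Yan, Weiquan Huang, Jiankai Zuo, Yujian Mo, Xi Fang, Chengliang Wu, Zeming Wei
AI summary: Through theoretical and empirical study, this paper examines the Muon optimizer (orthogonalized updates based on an approximate polar decomposition) in adversarial training, finding that it bounds the spectral-norm growth of matrix updates, is competitive with SGD on CNNs, and outperforms AdamW on both CNNs and ViTs.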
Adversarial training (AT) remains one of the most reliable empirical defenses against adversarial attacks. Its robustness critically depends on how the underlying min-max objective is optimized. In practice, the Stochastic Gradient Descent (SGD) optimizer remains the default optimization choice for AT, whereas adaptive optimizers often improve standard training but may yield inferior robustness. Recently, the Muon optimizer, which orthogonalizes matrix-valued updates via an approximate polar decomposition, has achieved notable success in large-scale training at a memory cost comparable to SGD. This raises a security-relevant question: can orthogonalized optimization improve AT under strong and heterogeneous threat models? Focusing on this problem, we conduct a comprehensive theoretical and empirical study. Theoretically, we show that Muon imposes a spectral-norm stability ceiling on matrix updates, limiting uncontrolled spectral growth in the training dynamics without explicitly shrinking the learned weights. Empirically, across five architectures and three $\ell_p$ threat models ($\ell_\infty$, $\ell_1$, $\ell_2$) and their union, Muon is competitive with SGD on CNNs and substantially outperforms AdamW on both CNNs and ViTs. These results identify optimizer geometry as a security-relevant factor in adversarial training, while clarifying the empirical regimes in which orthogonalized updates are beneficial. Overall, our findings highlight optimizer design as a security-critical component of AT.
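To make "orthogonalizing a matrix-valued update via an approximate polar decomposition" concrete, here is a minimal sketch using the classical cubic Newton-Schulz iteration. This is an assumption for illustration: the iteration and step count an actual Muon implementation uses may differ, and the helper names are hypothetical. The point is only that the result has all singular values near 1, so its spectral norm is close to 1 whatever the scale of the raw update.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(update, steps=12):
    """Approximate the orthogonal polar factor of a full-rank `update`."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    fro = sum(v * v for row in update for v in row) ** 0.5
    x = [[v / fro for v in row] for row in update]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X drives each singular value toward 1.
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


if __name__ == "__main__":
    raw_update = [[200.0, 0.0], [100.0, 100.0]]  # large-scale toy gradient
    q = orthogonalize(raw_update)
    print(matmul(q, transpose(q)))  # approximately the identity matrix
```

Because the orthogonalized matrix Q satisfies Q Qᵀ ≈ I, rescaling the raw update by any positive constant leaves Q unchanged, which gives one intuition for a ceiling on the spectral norm of each step.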
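The Necessity of an External Observer: A Formalization of the Sufficiency Gap. A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models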
Francesco Corielli
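AI summary: By constructing a binary mixed-regime process, this paper formalizes the sufficiency gap caused by an unobserved latent state, introduces an auxiliary signal to establish a contextual dominance threshold, and shows that temperature scaling cannot make up for missing context, while external observers or verifiers are necessary in high-stakes domains.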
We construct a binary mixed-regime process with one deterministic textual regime and one random regime governed by an unobserved latent state. Even an ideal infinite-capacity sequence predictor that exactly recovers the text-only marginal law can become overconfident when the observed prefix is compatible with the wrong latent regime. The resulting entropy difference is not an ordinary optimization error; it is a sufficiency gap caused by marginalization over an unobserved state. We then formalize retrieval, tool use, and external grounding through an auxiliary binary signal with fidelity $γ\in [1/2,1]$. The resulting Bayesian update yields a contextual dominance threshold: a corrective signal reverses the posterior odds induced by the textual history exactly when its fidelity exceeds the text-only posterior weight assigned to the misleading regime. This threshold reduces, but does not generally eliminate, the sufficiency gap; complete closure requires perfect revelation of the relevant latent state or an equivalent verification mechanism. The analysis clarifies why temperature scaling cannot restore missing context, why grounding mechanisms must be both informative and learnably usable by the model, and why autonomous sequence models require structurally decoupled observers or verifiers in high-stakes domains.
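The contextual dominance threshold described in the abstract above can be checked numerically. The sketch assumes one particular reading: a symmetric binary signal that names the true regime with probability `gamma`, and a text-only posterior weight `p_mislead` on the misleading regime. Both names and the symmetric-signal model are assumptions, not notation taken from the paper.

```python
def posterior_on_misleading(p_mislead, gamma):
    """Posterior weight on the misleading regime after a corrective signal.

    The signal points at the other regime. It does so with probability
    1 - gamma if the misleading regime is actually true and with
    probability gamma if it is not.
    Assumes the denominator is nonzero (excludes degenerate corner cases).
    """
    numerator = p_mislead * (1.0 - gamma)
    denominator = numerator + (1.0 - p_mislead) * gamma
    return numerator / denominator


def signal_reverses_odds(p_mislead, gamma):
    """True when the signal flips the posterior odds against the misleading regime."""
    return posterior_on_misleading(p_mislead, gamma) < 0.5
```

Under this model the inequality p(1 - gamma) < (1 - p) gamma simplifies to gamma > p, so the odds flip exactly when the fidelity exceeds the text-only weight on the misleading regime. For example, with `p_mislead = 0.8` a signal of fidelity 0.9 reverses the odds while one of fidelity 0.7 does not.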
$R^3$: 3D Reconstruction via Relative Regression
Congrong Xu, Huachen Gao, Xingyu Chen, Yuliang Xiu, Jun Gao, Anpei Chen
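AI summary: Proposes $R^3$, a 3D reconstruction method based on relative regression that uses a lightweight MLP to predict confidence-weighted relative constraints, supporting both full-context offline reconstruction and causal, bounded-memory streaming reconstruction.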
Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow without bound over time. Our solution, which we call $R^3$, employs relative regression. We use a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: they weight losses during training and guide pose aggregation during inference. $R^3$ supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism. Project page: https://kevinxu02.github.io/r3-site
PRISM: A Position-Encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design
Runtian Wang, Renhao Xue, Baige Chen, Hao Wu
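AI summary: Proposes PRISM, a decoder-only autoregressive transformer that addresses the inverse problem of multilayer thin-film optical coating design by jointly predicting discrete material selection and continuous thickness regression, reducing MAE by over 50% relative to other transformer baselines with only one-fifth of their parameters.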
The inverse problem of multilayer thin-film optical coating design represents a complex combinatorial-continuous optimization challenge. We present PRISM (Position-encoded Regressive Inverse Spectral Model), a unified decoder-only autoregressive transformer that streamlines this process by jointly predicting discrete material selection and continuous thickness regression within a single backbone. PRISM introduces two primary architectural innovations: (1) spectrum prefix conditioning, which utilizes standard prefix tokens for in-context target injection, and (2) cumulative-depth Rotary Position Embeddings, which encode continuous thickness directly into the positional representation to preserve the physical spatial relationships of the stack. Our benchmarks demonstrate that a PRISM-13M model reduces MAE by over 50% compared to other transformer baselines while utilizing only one-fifth of the parameters. Furthermore, a 44M-parameter variant achieves state-of-the-art performance (MAE = 0.010) on our in-distribution validation benchmark and operates significantly faster than simulated annealing, offering a highly efficient alternative to classical optimization methods.
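The cumulative-depth rotary embedding idea in the abstract above can be sketched as follows. The sketch assumes that each layer's position is the total thickness of the layers before it and that the standard rotary frequency schedule is reused unchanged; the function names, the depth convention, and the thickness units are illustrative assumptions rather than details from the paper.

```python
import math


def cumulative_depths(thicknesses):
    """Depth at which each layer starts: the summed thickness of earlier layers."""
    depths, total = [], 0.0
    for t in thicknesses:
        depths.append(total)
        total += t
    return depths


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs by an angle proportional to `position`.

    `position` may be any real number, so a continuous depth can stand in
    for the usual integer token index. `vec` must have even length.
    """
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = position * base ** (-k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical three-layer stack; the thickness scale is arbitrary here.
depths = cumulative_depths([1.0, 2.5, 0.75])  # [0.0, 1.0, 3.5]
```

Because each feature pair is rotated by an angle linear in position, the dot product between a rotated query and a rotated key depends only on the difference of their depths, so attention scores would reflect physical separation in the stack rather than layer count.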
Multi-Robot Box Transport over Varying Surfaces via Decentralized Role-Based Proportional Control
Aditya Bhatt, Himavarshini Yarragangu, Urvish Shah, Venkata Sai Yaswanth Mohan Thota, Souma Chowdhury
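AI summary: Proposes R2P2, an asynchronous decentralized task and motion planning method that uses role assignment and proportional control for collaborative multi-robot box transport over surfaces with different inclination and friction. Simulation and physical experiments show better generalization and a higher success rate than a standard virtual-leader-follower method.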
Collaborative transport of objects via pushing by multiple robots has many applications, ranging from construction and warehouse environments to post-disaster debris clean-up. Achieving collaborative transport over surfaces with different inclination and friction properties, however, poses unique challenges. To address these challenges, this paper presents an asynchronous decentralized task and motion planning approach for transporting rectangular boxes of varying mass over flat, uphill and downhill terrain. Such a decentralized approach alleviates communication, synchronization and consensus needs and mitigates single-point-of-failure issues. Our approach, called R2P2 or Roles with Rules and Proportional-control Primitive, assigns roles (e.g., push, support and prevent) to robots based on rules cognizant of the mode of manipulation needed (box rotation vs translation); this is followed by either rule-based control or proportional control of robot velocity based on the roles. Each robot is assumed to observe the location and heading of itself and the box in executing its role and controls. R2P2 is evaluated with a six-robot team deployed in a simulator built using NVIDIA IsaacSim -- demonstrating generalizability across different surface friction/inclination and box mass scenarios, and a better success rate compared to a standard virtual-leader-follower method. R2P2 is also successfully validated with a physical experiment, where it is executed onboard four turtlebots tasked with moving a 1.2 kg box.
Advancing Creative Physical Intelligence in Large Multimodal Models
Cheng Qian, Hyeonjeong Ha, Jiayu Liu, Jeonghwan Kim, Emre Can Acikgoz, Bingxuan Li, Kunlun Zhu, Jiateng Liu, Aditi Tiwari, Zhenhailong Wang, Xiusi Chen, Mahdi Namazifar, Heng Ji
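AI summary: To address the lack of visually grounded creative tool use by large multimodal models in open-ended environments, proposes the MM-CreativityBench benchmark and an affordance-grounded alignment method based on preference learning, which substantially improves entity selection and reduces hallucination.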
Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
Exploiting Local Dynamics Regularities for Reusable Skills in Offline Hierarchical Reinforcement Learning
Sarthak Dayal, Abhinav Peri, Carl Qi, Claas Voelcker, Alexander Levine, Caleb Chuck, Amy Zhang
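AI summary: Proposes CARL, an algorithm that aligns local dynamics with action sequences through contrastive learning to learn reusable skills in offline hierarchical reinforcement learning, improving downstream task performance.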
Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.
Collaborative Navigation and Exploration with $β$-Sparse Gaussian Processes
Evangelos Psomiadis, Dipankar Maity, Panagiotis Tsiotras
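AI summary: For collaborative navigation of heterogeneous robots in unknown environments, proposes a framework that uses $β$-Sparse Gaussian Processes to jointly optimize map-point selection and navigation actions under bandwidth constraints, substantially reducing path cost and the amount of transmitted information.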
Collaborative navigation of heterogeneous robots in unknown environments poses significant challenges due to sensing, communication, and computational limitations. In this work, a lead robot navigates toward a target while a mobile sensor robot (e.g., a drone) assists by transmitting information about its locally observed map under bandwidth constraints. We propose a framework that enables the sensor to jointly select its transmitted map points and navigation actions online, while also predicting unexplored regions of the environment. To this end, we present $β$-Sparse Gaussian Processes, a robust variational sparse Gaussian Process model for task-aware inducing point selection under cardinality constraints. Furthermore, we develop an action-selection strategy that balances task relevance with exploration. Simulations on Mars and Earth maps show that the framework can reduce path cost by 18% relative to no communication and decrease transmitted information by 76% compared to raw-data transmission baselines.
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
Yue Min, Ziyun Qiao, Ruining Chen, Yujun Li
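AI summary: Proposes GEM, a framework that recasts data curation as a variational problem on the hypersphere optimized with an MM algorithm, addressing categorization flaws and embedding anisotropy and improving average downstream accuracy by up to 1.2% with 1.1B-parameter models.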
LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts the cluster collapse to discover balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state-of-the-art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models via Degraded Generation
Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang
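AI summary: This paper is the first to find that memorization in diffusion models induces internal numerical instability that shows up as visually "broken" artifacts. Building on this, it proposes empirical stability regions based on latent update norms to quantify stable behavior, and designs an on-the-fly framework for step-wise detection and adaptive mitigation that suppresses memorization without altering prompts or guidance, achieving AUC > 0.999 detection performance and a 0.0% memorization rate on Stable Diffusion 1.4.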
While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we identify, for the first time, that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves a detection AUC above 0.999 and a 0.0% memorization rate after mitigation, with negligible overhead (about 0.01 s per image).
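To make the idea of step-wise detection from latent update norms concrete, here is a minimal sketch. It assumes the stability region is a per-step upper bound on the L2 norm of the latent update, calibrated beforehand on generations known not to be memorized. The abstract does not specify the form of the region, so the one-sided bound, the function names, and the numbers below are all hypothetical.

```python
def update_norms(latents):
    """L2 norm of the change between each pair of consecutive latents."""
    norms = []
    for prev, cur in zip(latents, latents[1:]):
        norms.append(sum((c - p) ** 2 for p, c in zip(prev, cur)) ** 0.5)
    return norms


def first_unstable_step(latents, upper_bounds):
    """Index of the first update whose norm exceeds its bound, or None.

    `upper_bounds[i]` is a hypothetical calibrated limit for update i.
    """
    for step, (norm, bound) in enumerate(zip(update_norms(latents), upper_bounds)):
        if norm > bound:
            return step
    return None


# Toy trajectory of 2-dimensional latents, flattened to plain lists.
trajectory = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.5]]
flagged = first_unstable_step(trajectory, upper_bounds=[4.0, 1.0])
```

A sampler could call such a check after every denoising step and trigger mitigation as soon as a step is flagged, which is what makes detection possible during generation rather than afterwards.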
ScenePilot: Controllable Boundary-Driven Generation of Critical Scenarios for Autonomous Driving
Qiyu Ruan, Yuxuan Wang, He Li, Zhenning Li, Cheng-zhong Xu
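AI summary: Proposes ScenePilot, a framework that combines an RSS-derived physical-feasibility score with an online-learned AV-risk predictor, formulates scenario generation as constrained multi-objective reinforcement learning, and introduces step-level feasibility-aware shielding to generate critical scenarios that are physically solvable yet cause the autonomous driving system to fail.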
Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $σ$ with an online-learned AV-risk predictor $Φ$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.
MIRA: Source-Aware Data Selection for Mid-Training with Self-Anchored Rubrics
Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu, Bryan Dai
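AI summary: For selecting among heterogeneous data sources in mid-training, proposes MIRA, a framework that uses self-anchored rubric discovery and scalable student scorers to match full-corpus performance in code-oriented mid-training with only half the tokens.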
Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
GenClaw: Code-Driven Agentic Image Generation
Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li
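AI summary: Proposes GenClaw, a code-driven agentic image generation paradigm that turns black-box image generation into a controllable, interpretable staged process through three stages: conceptualization, code-based sketching, and texture supplementation.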
Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
Déjà View: Looped Transformers for Multi-View 3D Reconstruction
Alessandro Burzio, Tobias Fischer, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic, Haithem Turki
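AI summary: Proposes DéjàView, a model that iteratively refines its predictions by recurrently applying a single transformer block, matching or exceeding large-scale feed-forward models on multiple 3D reconstruction benchmarks with fewer parameters and comparable or lower compute.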
Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
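The weight-sharing pattern described in the abstract above, one block applied for K refinement steps, can be shown with a toy sketch. The block here is a scalar residual map standing in for a transformer block, and `make_block` and `looped_refine` are hypothetical names; only the control flow is meant to carry over.

```python
import math


def make_block(weight, bias):
    """Toy residual refinement step: x -> x + tanh(weight * x + bias).

    Stands in for a full transformer block. The single (weight, bias)
    pair plays the role of the shared parameters.
    """
    def block(features):
        return [x + math.tanh(weight * x + bias) for x in features]
    return block


def looped_refine(block, features, k):
    """Apply the same block k times; k is chosen at inference time."""
    for _ in range(k):
        features = block(features)
    return features


shared_block = make_block(weight=0.5, bias=-0.1)
coarse = looped_refine(shared_block, [0.2, -0.4], k=2)  # cheaper, fewer steps
fine = looped_refine(shared_block, [0.2, -0.4], k=8)    # more compute, same parameters
```

The parameter count is fixed by the block and independent of `k`, whereas an unrolled stack with independent per-step parameters would grow linearly with depth. That contrast is the comparison the abstract draws.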
Towards Consistent Video Geometry Estimation
Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen
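AI summary: Proposes ViGeo, a feed-forward foundation model built on a plain transformer architecture that uses a dynamic chunking attention mechanism and a completion-based data refinement framework to estimate spatially dense, temporally consistent geometry (depth, normals, point maps) from video sequences, achieving state-of-the-art performance on online, offline, and long-video tasks.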
This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
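One way to picture an attention pattern that covers both causal and bidirectional temporal contexts is a chunked mask. The sketch below assumes frames are grouped into fixed-size chunks, with full attention inside a chunk and causal attention across chunks. This is an illustrative reading of "dynamic chunking attention", not the paper's definition, and the function name is hypothetical.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """Boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Frames attend to everything in their own chunk and in earlier chunks.
    chunk_size = 1 gives a causal mask; chunk_size >= num_frames gives a
    fully bidirectional mask.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        [(j // chunk_size) <= (i // chunk_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


streaming = chunked_attention_mask(4, chunk_size=1)  # frame-by-frame, causal
offline = chunked_attention_mask(4, chunk_size=4)    # whole sequence at once
mixed = chunked_attention_mask(4, chunk_size=2)      # two chunks of two frames
```

Sampling the chunk size during training would expose a single model to every regime between these two extremes, and the same weights could then be run with whichever mask matches the deployment setting.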