arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25333 2026-05-26 cs.CV

Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

Tianshuo Xu, Yichen Xie, Depu Meng, Chensheng Peng, Quentin Herau, Bo Jiang, Yihan Hu, Wei Zhan

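AI summary To address video generators freezing hidden state when observation is interrupted, the paper proposes ReMind, a framework that uses memory-oriented data construction, event-aware training, and cache adaptation to turn the KV cache into dynamic memory, achieving the best scores on STEVO-Bench and recovery tasks.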

Abstract

Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.

2605.25328 2026-05-26 cs.CV cs.MM

DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement

Renjie Lu, Xulong Zhang, Xiaoyang Qu, Shangfei Wang, Jianzong Wang

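AI summary To address the mutual interference between understanding and generation in unified multimodal models caused by differing supervision signals, the paper proposes DIVA, which factorizes visual representations into shared and unique components and uses mutual information estimation to achieve internal synergy, improving understanding and generation by 7.82% and 8.46%, respectively.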

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Unified multimodal models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge that lies in the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favors semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improving post-training framework that transforms the representation divergence into internal synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables both the understanding and generation branches to benefit from transfer while using mutual information estimation to protect the integrity of unique information against cross-flow interference. Despite its generality, our method consistently achieves improvements across visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://github.com/Jayyy-H/DIVA.

2605.25326 2026-05-26 cs.CV

Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

Junwei Zhou, Yu-Wing Tai

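AI summary Proposes the Perceive-then-Plan framework, which uses vision-language models to recast monocular 3D layout estimation as perception followed by iterative planning; Layout-as-Policy (LaP) learns action sequences that progressively refine the scene hypothesis, yielding 3D layouts that are more physically consistent and better aligned with the observation.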

Comments 21 pages

Abstract

Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.

2605.25313 2026-05-26 cs.LG cs.AI cs.RO stat.ML

UWM-JEPA: Predictive World Models That Imagine in Belief Space

Santosh Kumar Radha, Oktay Goktas

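AI summary For partially observable environments, proposes UWM-JEPA, which uses a density-matrix latent and a unitary predictor to preserve the joint-state spectrum in belief space, retaining uncertainty under blind rollout and substantially outperforming vector-latent baselines.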

Comments 14 pages, 6 figures, 7 tables. Code and data: https://github.com/santoshkumarradha/uwm-jepa

Abstract

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.
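
The spectrum-preservation claim can be checked on a toy case. The sketch below is not the authors' model: it assumes a 2x2 real density matrix, a fixed rotation standing in for the learned unitary predictor, and made-up belief weights. It only illustrates that repeated updates of the form U rho U^T move the state while leaving its eigenvalues unchanged.

```python
import math

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    """Transpose a 2x2 matrix."""
    return [[a[j][i] for j in range(2)] for i in range(2)]

def eigenvalues_sym2(m):
    """Eigenvalues of a real symmetric 2x2 matrix from its trace and determinant."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return sorted([(tr - disc) / 2.0, (tr + disc) / 2.0])

def rotation(theta):
    """A real orthogonal matrix, i.e. a unitary with real entries."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def rollout(rho, unitary, steps):
    """Apply rho <- U rho U^T repeatedly (U^T is the adjoint for real U)."""
    for _ in range(steps):
        rho = matmul(matmul(unitary, rho), transpose(unitary))
    return rho

# Toy mixed state: 70% / 30% belief over two hidden continuations (assumed numbers).
rho0 = [[0.7, 0.0], [0.0, 0.3]]
rho5 = rollout(rho0, rotation(0.4), steps=5)

spectrum_before = eigenvalues_sym2(rho0)
spectrum_after = eigenvalues_sym2(rho5)
```

Comparing `spectrum_before` with `spectrum_after` shows the same pair of eigenvalues up to floating-point error, even though `rho5` has acquired off-diagonal entries.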

2605.25310 2026-05-26 cs.CL

Tool-Call Dependency Structure is Linearly Decodable in LLM Agent Residual Streams

Tianda Sun, Dimitar Kazakov

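AI summary Using low-capacity edge probes, this study decodes the tool-call dependency graph from the Qwen3-32B residual stream and finds that the representation tracks abstract topology rather than identifier values and replicates across models and tasks.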

Comments 16 pages, 7 figures

Abstract

Tool-using LLM agents produce trajectories whose calls form a directed dependency graph: earlier tool outputs supply arguments to later calls. Whether this execution structure is represented inside the model is unknown; prior structural probes have targeted static code or chain-of-thought text, not an agent's runtime call graph. A low-capacity edge probe on the residual stream of Qwen3-32B decodes the tool-call dependency graph well above both a Hewitt-Liang random-label control and a positional baseline. A counterfactual contrast between value corruption and structural perturbation indicates the signal tracks abstract topology rather than identifier values, and replicates under an independent, non-substring oracle. The non-positional component replicates on three further interactive multi-hop benchmarks and attenuates as call order alone becomes a sufficient proxy for dependency, vanishing in single-shot planning. Per-layer activation patching shifts the probe at a later, non-patched boundary, evidence that the representation propagates rather than passively reads out, though the realised tool call does not move. To our knowledge this is the first structural probe of an LLM agent's runtime tool-call dependency graph. Our claims concern representation, not behavioural control, and span two model families and one primary domain.
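
To make the notion of a tool-call dependency graph concrete, here is a minimal sketch of how edge labels could be derived from a trajectory. It assumes a simple substring oracle (an edge exists when an earlier output string appears in a later call's arguments); the `ToolCall` structure, the tool names, and the identifiers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict      # argument name -> string value
    output: str     # what the tool returned

def dependency_edges(trajectory):
    """Return edges (i, j), i < j, where call i's output appears in call j's arguments.

    This is a plain substring oracle, used here only for illustration.
    """
    edges = []
    for j, later in enumerate(trajectory):
        for i in range(j):
            produced = trajectory[i].output
            if produced and any(produced in value for value in later.args.values()):
                edges.append((i, j))
    return edges

# Hypothetical three-call trajectory.
trajectory = [
    ToolCall("search_user", {"email": "ana@example.com"}, "user_417"),
    ToolCall("list_orders", {"user_id": "user_417"}, "order_9021"),
    ToolCall("get_invoice", {"order_id": "order_9021", "user_id": "user_417"}, "inv_55"),
]
edges = dependency_edges(trajectory)
```

In this toy trajectory the third call depends on both earlier calls, so one edge skips over an intermediate call. Edges of that kind are the ones a purely positional baseline (always link to the previous call) would miss.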

2605.25308 2026-05-26 cs.CV

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

Xiaoyang Lyu, Muxin Liu, Xiaoshan Wu, Ruicheng Wang, Yi-Hua Huang, Yang-Tian Sun, Shaoshuai Shi, Xiaojuan Qi

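AI summary To address the temporal inconsistency of monocular geometry models on streaming RGB input (chiefly scale-shift drift), proposes DyFN, a lightweight causal recurrent module that dynamically modulates feature statistics for stable geometry estimation; finetuning only about 2% additional parameters achieves state-of-the-art temporal stability.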

Comments 16 pages, 9 Figures, page: https://shawlyu.github.io/DyFN

Abstract

Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale-shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth's scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: https://shawlyu.github.io/DyFN
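
The link between feature statistics and temporal stability can be illustrated with a much simpler stand-in than DyFN itself. The sketch below assumes an exponential moving average (EMA) over per-frame mean and standard deviation, which is not the learned recurrent module from the paper; the class name, momentum value, and toy frames are all assumptions. It shows only that causally smoothing the statistics damps a sudden scale and shift jump between frames.

```python
import math

def mean_std(values):
    """Population mean and standard deviation of a list of floats."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

class RunningFeatureNorm:
    """Causal stand-in: re-standardize each frame to EMA-smoothed statistics."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.std = None

    def __call__(self, features):
        m, s = mean_std(features)
        if self.mean is None:            # first frame initializes the state
            self.mean, self.std = m, s
        else:                            # later frames only nudge the state
            a = self.momentum
            self.mean = a * self.mean + (1 - a) * m
            self.std = a * self.std + (1 - a) * s
        # Remove this frame's own statistics, then apply the smoothed ones.
        return [(v - m) / (s + self.eps) * self.std + self.mean for v in features]

# Toy stream: the same content, but frame 2 arrives with a scale/shift jump.
frame1 = [1.0, 2.0, 3.0, 4.0]
frame2 = [2.0 * v + 5.0 for v in frame1]

norm = RunningFeatureNorm(momentum=0.9)
out1 = norm(frame1)
out2 = norm(frame2)

raw_jump = abs(mean_std(frame2)[0] - mean_std(frame1)[0])
smoothed_jump = abs(mean_std(out2)[0] - mean_std(out1)[0])
```

With a momentum of 0.9, the frame-to-frame change in the feature mean (`smoothed_jump`) is one tenth of the raw change (`raw_jump`). A learned module could presumably adapt this trade-off per frame instead of using a fixed constant.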

2605.25307 2026-05-26 cs.CV

Recursive Class Connectivity Classification (R3C) Applied to Binary Image Segmentation for Improved Infant Fingerprint Enhancement

Joao Leonardo Harres Dall Agnol, Luiz Fernando Puttow Southier, Jefferson Tales Oliva, Marcelo Teixeira, Rodrigo Mineto, Marcelo Filipa, Dalcimar Casanova, Erick Oliveira Rodrigues

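AI summary Proposes the Recursive Class Connectivity Classification (R3C) framework, which iteratively extends ridge structures to refine the binary segmentation output of existing enhancement methods, raising infant fingerprint recognition rates without training data.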

Journal ref IEEE Access 2025
Abstract

Image enhancement plays a crucial role in infant fingerprint matching, as child-specific characteristics such as smaller finger dimensions and thinner ridge structures often degrade image quality during acquisition. To address these limitations, enrollment typically depends on specialized high-resolution scanners, which most existing enhancement methods are not designed to support. Consequently, identification rates for children remain significantly lower than those achieved with adult fingerprints. This study introduces Recursive Class Connectivity Classification (R3C), a novel framework that iteratively refines binary segmentation outputs from existing enhancement methods by extending ridge structures. R3C does not require modifications to the underlying classifier and operates without training data, which is not currently available for infant fingerprints. Instead, the method improves segmentation by repeatedly feeding the classified image back into the classification process, while combining each intermediate segmentation with the original input image. Experiments conducted on three fingerprint datasets using four different enhancement classifiers show that R3C can increase the True Acceptance Rate (TAR) by up to 4% for children and over 40% for newborns, compared to using the enhancement methods alone. A qualitative analysis further demonstrates that R3C reconnects fragmented ridge patterns, improving the visual quality of segmentation. Because it functions independently of the enhancement method used, R3C provides a flexible and broadly applicable solution for improving binary segmentation.
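
The feedback loop described above can be sketched in a few lines. Everything except the loop shape is an assumption made for illustration: the threshold classifier, the neighbor-based combination rule, the weight, and the one-dimensional scan line are hypothetical stand-ins, not the classifiers or the combination step used in the paper.

```python
def r3c(image, classify, combine, iterations=3):
    """Feed each segmentation back through the classifier, mixed with the input."""
    segmentation = classify(image)
    for _ in range(iterations):
        segmentation = classify(combine(image, segmentation))
    return segmentation

def threshold_classifier(row, cutoff=0.5):
    """Hypothetical stand-in for an enhancement classifier: 1 = ridge, 0 = background."""
    return [1 if value >= cutoff else 0 for value in row]

def neighbor_boost(row, segmentation, weight=0.1):
    """Hypothetical combination rule: raise a pixel by `weight` per adjacent ridge pixel."""
    boosted = []
    last = len(row) - 1
    for i, value in enumerate(row):
        left = segmentation[i - 1] if i > 0 else 0
        right = segmentation[i + 1] if i < last else 0
        boosted.append(value + weight * (left + right))
    return boosted

# One scan line across a ridge with a faint gap at index 2, then background.
scan_line = [0.9, 0.8, 0.4, 0.8, 0.9, 0.1, 0.1]
first_pass = threshold_classifier(scan_line)
refined = r3c(scan_line, threshold_classifier, neighbor_boost)
```

In this toy case the single classification pass leaves the ridge broken at the faint pixel, while the recursive version closes the gap and leaves the background untouched. The classifier itself is never modified, which matches the abstract's description of the method as classifier-independent.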

2605.25305 2026-05-26 cs.LG

Electricity Consumption Forecasting: An Approach Using Cooperative Ensemble Learning with SHapley Additive exPlanations

Eduardo Luiz Alba, Gilson Adamczuk Oliveira, Matheus Henrique Dal Molin Ribeiro, Érick Oliveira Rodrigues

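AI summary Trains LSTM, RF, SVR, and XGBoost models and proposes a cooperative ensemble learning method called Weaker Separator Booster (WSB), using SHAP for feature selection and a genetic algorithm and particle swarm optimization for hyperparameter tuning, to forecast electricity consumption 12 months ahead at two campuses of a Brazilian federal institute, reaching sMAPE values of 13.90% and 18.72%.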

Journal ref Forecasting 2024
Abstract

Electricity expense management presents significant challenges, as this resource is susceptible to various influencing factors. In universities, the demand for this resource is rapidly growing with institutional expansion and has a significant environmental impact. In this study, the machine learning models long short-term memory (LSTM), random forest (RF), support vector regression (SVR), and extreme gradient boosting (XGBoost) were trained with historical consumption data from the Federal Institute of Paraná (IFPR) over the last seven years and climatic variables to forecast electricity consumption 12 months ahead. Datasets from two campuses were adopted. To improve model performance, feature selection was performed using Shapley additive explanations (SHAP), and hyperparameter optimization was carried out using genetic algorithm (GA) and particle swarm optimization (PSO). The results indicate that the proposed cooperative ensemble learning approach named Weaker Separator Booster (WSB) performed best on both datasets. Specifically, it achieved an sMAPE of 13.90% and MAE of 1990.87 kWh for the IFPR-Palmas Campus and an sMAPE of 18.72% and MAE of 465.02 kWh for the Coronel Vivida Campus. The SHAP analysis revealed distinct feature importance patterns across the two IFPR campuses. A commonality that emerged was the strong influence of lagged time-series values and a minimal influence of climatic variables.
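
For readers unfamiliar with the two error metrics reported above, a minimal sketch follows. sMAPE has several variants in the literature; this one assumes the common form with the mean of the absolute actual and forecast values in the denominator, which may differ from the exact definition used in the paper. The sample numbers are invented.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same unit as the data (e.g. kWh)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """sMAPE in percent, with the mean of |a| and |f| in the denominator."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Treat a 0/0 term as zero error rather than dividing by zero.
        total += abs(a - f) / denom if denom else 0.0
    return 100.0 * total / len(actual)

# Invented monthly consumption values and forecasts.
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 180.0, 300.0]

mae_value = mae(actual, forecast)
smape_value = smape(actual, forecast)
```

Because MAE is unit-bearing while sMAPE is relative, a larger campus can show a much higher MAE alongside a lower sMAPE, which is the pattern in the Palmas and Coronel Vivida figures above.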

2605.25304 2026-05-26 cs.LG cs.CR cs.CV

When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers

Aditya Sridhar

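AI summary Systematically studies the adversarial vulnerability of the concept layer in Concept Bottleneck Models (CBMs) and proposes SPECTRA, a stability-regularization defense based on semantic perturbations that raises the minimal perturbation norm needed for a successful attack from 0.46 to over 4,200 while keeping classification accuracy within 2.2% of the baseline.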

Comments Accepted to CVPR 2026 (Findings). 9 pages, 6 figures

Abstract

Concept Bottleneck Models (CBMs) have emerged as a cornerstone approach for interpretable machine learning, providing human-understandable intermediate representations through explicit concept activations. However, this interpretability fundamentally introduces a critical, previously unexplored attack surface: the concept bottleneck layer itself. We present a comprehensive, systematic study of concept-level adversarial vulnerabilities in CBMs, revealing that targeted, minimal perturbations operating on input pixels can induce catastrophic misclassification by manipulating semantic representations. We develop a rigorous theoretical framework to quantify concept-space robustness, establishing novel metrics that expose the vulnerability landscape of these architectures. Our extensive analysis on the CUB-200-2011 dataset demonstrates that standard CBMs exhibit severe susceptibility to concept-level manipulation. To address this critical weakness, we introduce SPECTRA (Semantic Perturbation-based Concept Training for Robustness against Attacks), a principled stability regularization defense. SPECTRA effectively hardens the semantic representation space, increasing the minimal perturbation norm required for a successful attack from 0.46 to over 4,200, rendering targeted concept manipulation computationally prohibitive. Furthermore, SPECTRA preserves baseline classification accuracy to within 2.2%. By establishing concept-level attacks as a fundamentally distinct threat model, this work opens a new research frontier at the intersection of interpretable machine learning and adversarial robustness.

2605.25294 2026-05-26 cs.CV

Geometry-Aware Image Flow Matching

Junho Lee, Kwanseok Kim, Joonseok Lee

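AI summary Observing that the semantic information of natural images is encoded mainly in directional components, proposes two geometry-aware methods, Spherical Optimal Transport Flow Matching (SOT-CFM) and Spherical Flow Matching (SFM), that model images on a hypersphere and outperform Euclidean baselines.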

Abstract

Recent advances in generative models highlight the power of geometry-aware modeling in manifold-constrained settings. Yet, for natural images, the field remains confined to Euclidean assumptions, failing to exploit the potential of intrinsic geometric structures within the data. In this work, we investigate the geometry of natural images and observe that semantic information is predominantly encoded in directional components, while norm components can be approximated by the global average. This property holds across both RGB and latent spaces, suggesting that natural images can be effectively modeled on a hypersphere. Building on this finding, we introduce Spherical Optimal Transport Flow Matching (SOT-CFM), which utilizes angular distance, and Spherical Flow Matching (SFM), which constrains dynamics directly on the manifold. Our experiments demonstrate that these geometry-aware methods achieve superior performance against Euclidean baselines. Ultimately, this work provides a novel perspective that bridges the gap between Riemannian manifold-based modeling and natural image generation.
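
The geometric ingredients named above (a direction/norm split, angular distance, and dynamics that stay on the sphere) can be sketched with standard spherical formulas. This is not the authors' SOT-CFM or SFM implementation; it assumes flattened vectors and uses great-circle interpolation (slerp) as a generic example of a path constrained to the hypersphere. The function names and sample vectors are illustrative.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def to_direction(v):
    """Split a flattened image or latent into (unit direction, norm)."""
    n = norm(v)
    return [x / n for x in v], n

def angular_distance(u, v):
    """Geodesic distance between two unit vectors on the hypersphere."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def slerp(u, v, t):
    """Point at fraction t along the great-circle arc from u to v.

    Antipodal inputs (angle of pi) are not handled in this sketch.
    """
    omega = angular_distance(u, v)
    if omega < 1e-8:                 # nearly identical directions
        return list(u)
    s = math.sin(omega)
    wu = math.sin((1.0 - t) * omega) / s
    wv = math.sin(t * omega) / s
    return [wu * a + wv * b for a, b in zip(u, v)]

x0, _ = to_direction([3.0, 0.0, 4.0])
x1, _ = to_direction([0.0, 2.0, 0.0])
midpoint = slerp(x0, x1, 0.5)
```

Unlike a straight-line (Euclidean) interpolation, whose midpoint would fall inside the sphere, the slerp midpoint keeps unit norm and sits at half the angular distance between the endpoints.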

2605.25293 2026-05-26 cs.CV cs.AI cs.RO

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

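AI summary Proposes an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained with surrogate gradient backpropagation, reaching high accuracy on the KITTI benchmark and a 3.33x reduction in synaptic operation energy.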

Abstract

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving 92.05/87.04/86.51 AP at IoU = 0.5 (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a 3.33x reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.

2605.25284 2026-05-26 cs.CL

Knowing but Not Showing: LLMs Recognize Ambiguity but Rarely Ask Clarifying Questions

Jinyan Su, Claire Cardie

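AI summary Studies the gap between large language models recognizing ambiguity in user queries and actually asking clarifying questions: models can identify ambiguity but default to answering directly, and retrieved context further reduces clarification.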

Abstract

User queries are often underspecified and may admit multiple valid interpretations. Rather than silently making assumptions about the user's intent, a helpful assistant should surface such ambiguity by asking a clarifying question. Doing so requires two abilities: recognizing that a query is ambiguous, and acting on that recognition by seeking clarification instead of answering directly. To study these abilities, we evaluate models on ambiguous, unambiguous, and disambiguated questions in three settings: standard question answering, explicit ambiguity judgment, and behavioral analysis, where a judge model classifies responses as direct answers, refusals, or clarifying questions. We find a clear gap between recognition and behavior: models often identify ambiguity when explicitly asked to judge it, yet in the QA setting they overwhelmingly default to direct answers. Retrieved context further widens this gap by improving answerability while making models even less likely to ask clarifying questions.

2605.25279 2026-05-26 cs.RO

GreenSeg: Ground Segmentation Algorithm for Agricultural Robots in Mediterranean Greenhouses using RGB-D Point Clouds

Fernando Cañadas-Aránega, José C. Moreno, José L. Blanco-Claraco

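AI summary For the narrow aisles, heterogeneous terrain, and optical interference of Mediterranean greenhouses, proposes GreenSeg, a ground segmentation framework based on RGB-D sensing with dual-layer validation, combining global plane fitting, curvature filtering, and seed-point region growing for stable navigation; validated on the AGRICOBIOT I platform, it outperforms benchmark methods under changing illumination.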

Abstract

Greenhouse agriculture in the Mediterranean region faces significant automation challenges due to its unique structural and environmental constraints. These environments are characterized by extremely narrow aisles, heterogeneous terrains ranging from concrete to tilled soil, and severe optical interference caused by polyethylene covers, which induce specular reflections and "ghost points" in depth sensors. While autonomous navigation is essential for digitizing agricultural tasks, traditional solutions often rely on expensive 3D LiDAR systems that are economically unscalable for most facilities. To address this, this paper presents GreenSeg, a robust perception framework for autonomous navigation using RGB-D sensing. The proposed method introduces a dual-layer validation strategy: a robust global plane fitting combined with a surface curvature filter for terrain adaptability, and a seed-point-based Region Growing constraint to ensure the spatial continuity of the navigable plane. Experimental validation was conducted using the AGRICOBIOT I platform across four diurnal scenarios with varying solar elevations. The results show that GreenSeg consistently outperforms benchmark segmentation methods, achieving peak improvements of 11.58% in mean Recall and 19.24% in mIoU during critical rotational maneuvers at the end of corridors. These findings confirm that the proposed algorithm enables stable and safe autonomous navigation in unstructured, dynamic agricultural environments that are subject to budget constraints and sensitive to lighting conditions.

2605.25275 2026-05-26 cs.LG

Label-NTK Alignments and A Tighter Convergence Bound in the NTK Regime

Ruchirinkil Marreddy, Chaoyue Liu

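AI summary Using the alignment between labels and the NTK eigenspectrum, derives a tighter convergence bound that significantly improves on classical worst-case results.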

Abstract

The Neural Tangent Kernel (NTK) framework explains optimization in over-parameterized neural networks via approximately linearized dynamics, yielding exponential convergence guarantees. However, existing results are often overly pessimistic and do not match the fast training in practice, as they depend on the smallest NTK eigenvalue, which is typically extremely small in practice. In this work, we develop sharper convergence guarantees by characterizing the interaction between data labels and the NTK eigen-spectrum. We identify two key phenomena, Label-NTK alignment and Residual-NTK alignment, showing that projections of labels and residuals onto NTK eigenvectors scale with the corresponding eigenvalues. We provide empirical evidence and theoretical justification under mild data assumptions. Exploiting these alignment properties, we derive a refined convergence bound that depends on the full spectrum and closely matches practical training dynamics, significantly improving over classical worst-case results. We further obtain improved generalization bounds. Experiments on MLPs and CNNs across multiple datasets validate our theory.
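
The contrast between a worst-case bound and a full-spectrum bound can be written out for the textbook linearized setting. The derivation below is the standard one, not the paper's refined result. It assumes gradient descent with step size $\eta \le 1/\lambda_{\max}$ on a squared loss, a fixed kernel matrix $K$ with eigenpairs $(\lambda_i, v_i)$, and residual $r_t$ (prediction minus label) at step $t$; the symbols are chosen here for illustration.

```latex
r_{t+1} = (I - \eta K)\, r_t,
\qquad
K = \sum_{i=1}^{n} \lambda_i \, v_i v_i^{\top}
\;\;\Longrightarrow\;\;
\|r_t\|^2
  = \sum_{i=1}^{n} (1 - \eta\lambda_i)^{2t} \, \bigl(v_i^{\top} r_0\bigr)^2
  \;\le\; (1 - \eta\lambda_{\min})^{2t} \, \|r_0\|^2 .
```

The final inequality discards how $r_0$ is distributed across eigenvectors and keeps only $\lambda_{\min}$. If most of the residual lies along directions with large $\lambda_i$, as the alignment phenomena in the abstract suggest, the exact sum in the middle decays far faster than the worst-case rate on the right.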

2605.25272 2026-05-26 cs.AI cs.CY stat.AP

AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

Michael Hardy, Anka Reuel, Lijin Zhang, Jodi M. Casabianca, Sang Truong, Yash Dave, Hansol Lee, Benjamin Domingue, Sanmi Koyejo

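AI summary To address measurement noise in leaderboard scores, proposes a framework based on confirmatory factor analysis and generalizability theory that decomposes the sources of ranking variance, revealing relationships between benchmarks, local dependence, and metadata effects, and compares the reliability of manifest and latent scaling laws.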

Abstract

While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unquantified, making it unclear when rankings reflect genuine capability differences versus evaluation artifacts. We introduce a framework for measuring the latent landscape in AI benchmark ecosystems. Applying Confirmatory Factor Analysis (CFA) and Generalizability Theory to 4,000+ models from the Open LLM Leaderboard, we decompose sources of ranking variance and establish: (1) structures assumed in current reporting practice underestimate the strength of relationships between benchmarks; (2) evidence of local dependence among leaderboard items, undermining uses of benchmarks as measurement instruments under current scoring systems; (3) contributor metadata explains more rank-relevant variance (about 9%) than architecture or deployment categories in this context; (4) a manifest-score "scaling law" slope has low reliability (R_β = 0.53); by contrast, the latent general-factor size slope is highly stable across ecosystem controls (R_g = 0.97). We are able to provide unique insights into benchmark dynamics, such as which benchmarks are a function of LLM size and which can be oppositely impacted by post-training practices. We provide actionable diagnostics to determine how benchmark rankings can be trusted and how benchmark design can be improved.

2605.25267 2026-05-26 cs.LG cs.AI

Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning

潜在Q-屏障屏蔽用于安全上下文强化学习

Minjae Kwon, Amir Moeini, Shangtong Zhang, Lu Feng

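AI summary: Proposes a latent Q-Barrier shield that learns a context representation, latent dynamics, and an ensemble cost critic, then filters or softly reweights candidate actions at deployment using the remaining budget and predicted future cost, without parameter updates, improving the reward-safety tradeoff of safe in-context reinforcement learning under out-of-distribution shifts.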

详情
AI中文摘要

安全上下文强化学习(ICRL)在测试时不更新参数,仅从交互历史中在线适应,同时将情节成本控制在安全预算内。在分布外(OOD)部署转移下,仅预训练的安全ICRL可能产生较差的奖励-安全权衡,因为剩余预算仅通过冻结的策略条件影响行为,而非通过针对预测未来成本的显式动作级检查。我们提出一种潜在Q-屏障屏蔽,在部署前学习上下文表示、潜在动力学和集成成本评论家。无需参数更新,该屏蔽从历史中推断上下文,并使用剩余预算和预测未来成本过滤或软重加权候选动作。我们证明了一个条件性的、误差分解的屏障-边际结果:满足Q-屏障的动作将下一个潜在预算状态置于近似预算安全的延续中(在学习的评论家下),误差上界由贝尔曼误差和潜在预测误差决定。在五个安全ICRL基准测试中,该屏蔽在部署时相比强安全ICRL基线改善了奖励-安全权衡:在短上下文窗口后,它在五个基准中的四个上实现了更高的回报,同时在所有五个基准中匹配或降低了平均情节成本。

英文摘要

Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling episode cost under a safety budget. Under out-of-distribution (OOD) deployment shifts, pretraining-only safe ICRL can give poor reward-safety tradeoffs because the remaining budget affects behavior only through frozen policy conditioning, not an explicit action-level check against predicted future cost. We propose a latent Q-Barrier shield that learns a context representation, latent dynamics, and an ensemble cost critic before deployment. Without parameter updates, the shield infers context from history and filters or softly reweights candidate actions using the remaining budget and predicted future cost. We prove a conditional, error-decomposed barrier-margin result: a Q-Barrier-satisfying action leaves the next latent-budget state with an approximately budget-safe continuation under the learned critic, up to Bellman and latent-prediction errors. Across five safe ICRL benchmarks, the shield improves deployment-time reward-safety tradeoffs over a strong safe-ICRL baseline: after a short context window, it achieves higher return in four of five benchmarks while matching or lowering average episode cost in all five.
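The action-level check described in this abstract can be illustrated with a minimal sketch that compares a conservative predicted future cost against the remaining budget. The function name `shield_actions`, the input format, the use of the ensemble maximum as the conservative estimate, and the fallback to the least-costly action are all assumptions made for illustration; the abstract does not specify these details, and the soft-reweighting variant is not shown.

```python
from typing import Dict, List


def shield_actions(
    cost_estimates: Dict[str, List[float]],
    remaining_budget: float,
) -> List[str]:
    """Return the candidate actions whose predicted future cost fits the budget.

    cost_estimates maps each candidate action to the future-cost predictions
    of an ensemble of cost critics (hypothetical input format).
    """
    # Assumption: use the most pessimistic ensemble member as the estimate.
    conservative = {a: max(qs) for a, qs in cost_estimates.items()}
    safe = [a for a, q in conservative.items() if q <= remaining_budget]
    if safe:
        return safe
    # Assumption: if no action passes, fall back to the least-costly one.
    return [min(conservative, key=conservative.get)]


# Hypothetical candidate actions with three critic predictions each.
candidates = {
    "accelerate": [2.0, 3.5, 2.8],
    "coast": [0.5, 0.9, 0.7],
    "brake": [0.1, 0.2, 0.1],
}
allowed = shield_actions(candidates, remaining_budget=1.0)
fallback = shield_actions(candidates, remaining_budget=0.05)
```

With a budget of 1.0 only the two cheaper actions pass the check; with a nearly exhausted budget the sketch falls back to the single least-costly action.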

2605.25266 2026-05-26 cs.CV

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

DeltaCam: 用于视频生成的差分内参相机建模

Debabrata Mandal, Zhihan Peng, Yujie Wang, Praneeth Chakravarthula

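AI summary: Proposes DeltaCam, a video diffusion framework whose differentially parameterized neural camera adaptors learn relative changes, enabling smooth, controllable video generation over intrinsics such as focal length, aperture, and ISO, and extends it to real-world footage.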

详情
AI中文摘要

将相机内参纳入视频生成模型为控制场景动态和影响视觉外观的成像过程提供了原则性方法。先前工作主要关注外参控制(如相机姿态和运动),而将内参视为隐式或固定。关键瓶颈在于缺乏具有准确且多样化的时变相机元数据的大规模视频数据集,这使得学习绝对相机参数化变得困难。因此,当前模型难以以可控且时间一致的方式融入摄影相机行为,包括景深转换、曝光变化、镜头畸变和色彩处理。我们引入DeltaCam,一种视频扩散框架,通过Δ参数化的神经相机适配器对相机行为进行建模,该适配器基于相机运动和内参的相对变化而非绝对状态进行操作。通过从合成视频数据中学习这种差分公式,我们减轻了对精确真实世界相机标签的依赖,并实现了对焦距、光圈、ISO、色温和镜头畸变成像因子的平滑一致控制。我们将此框架扩展到真实世界视频,通过两种机制:在真实图像-元数据对上微调控制以实现精确镜头匹配,以及提取解耦嵌入用于隐式视频到视频风格迁移,无需显式相机参数。通过有效分离场景内容与内生成像行为,DeltaCam实现了现有模型难以实现的相机一致视频生成和编辑操作。最终,我们的结果为连接合成控制与真实世界摄影仿真建立了一种实用且可扩展的方法。

英文摘要

Incorporating camera intrinsics into video generation models offers a principled way to control not only scene dynamics but also the imaging process that governs visual appearance. Prior work has primarily focused on extrinsic control, such as camera pose and motion, while treating intrinsic camera parameters as implicit or fixed. A key bottleneck is the lack of large-scale video datasets with accurate and diverse temporally varying camera metadata, which makes learning absolute camera parameterizations difficult. As a result, current models struggle to incorporate photographic camera behavior, including depth-of-field transitions, exposure variations, lens distortions, and color processing, in a controllable and temporally consistent manner. We introduce DeltaCam, a video diffusion framework that models camera behavior through $Δ$-parameterized neural camera adaptors, operating on relative changes in camera motion and intrinsics instead of absolute states. By learning this differential formulation from synthetic video data, we mitigate reliance on precise real-world camera labels and enable smooth, consistent control over imaging factors such as focal length, aperture, ISO, color temperature, and lens distortion. We extend this framework to real-world footage through two mechanisms: finetuning the controls on real image-metadata pairs for precise shot matching, and extracting disentangled embeddings for implicit video-to-video style transfer without requiring explicit camera parameters. By effectively separating scene content from intrinsic imaging behavior, DeltaCam enables camera-consistent video generation and editing operations that are difficult to achieve with existing models. Ultimately, our results establish a practical and scalable approach for bridging synthetic control and real-world photographic emulation.

2605.25263 2026-05-26 cs.CL cs.AI

Mimir: Large-scale Multilingual Concept Modeling

Mimir:大规模多语言概念建模

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile

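AI summary: Introduces Mimir, a 1.6B-parameter Large Concept Model that uses multilingual pre-training and instruction tuning for concept-level understanding and generation, replacing the conventional next-token prediction paradigm.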

详情
AI中文摘要

当前的语言建模方法围绕token构建。文本语料被分割成token,模型通过对这些token进行计算来训练,例如根据前文预测下一个token。这一范式已成为现代语言建模的标准,尤其是基于token的架构取得了卓越性能。然而,最近的研究不仅开始质疑语言模型如何从token中处理和理解意义,还开始质疑使用更高级别的粒度是否能推动研究领域的发展。这引出了概念建模的想法,即直接训练模型进行下一个概念预测,而非下一个token预测。目标是输入从token转变为概念,迫使底层语言模型将其粒度从细粒度的token转变为广泛的概念。在这项工作中,我们介绍了Mimir,一个1.6B参数的大规模概念模型,用于多语言概念理解和生成。我们利用了一个大规模多语言预训练语料库(38,883,987,240个句子),涵盖46种语言,以及一个大规模多轮多语言指令微调数据集(66,816,428个句子),覆盖总共35种语言。我们针对一个参数数量相当的语言模型,对模型性能进行了广泛评估。

英文摘要

Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have not only begun to question how language models process and understand meaning from tokens, but also to question whether using higher levels of granularity could advance the research field. This led to the idea of Concept Modeling, that is, to directly train models for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn and multilingual instruction-tuning dataset (66,816,428 sentences) covering a total of 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.

2605.25262 2026-05-26 cs.CV

Semantics-Guided Multimodal Masked Autoencoder Pretraining for 3D BEV Object Detection

语义引导的多模态掩码自编码器预训练用于3D BEV目标检测

Prabuddhi Wariyapperuma, Rajitha de Silva, Marc Hanheide, Thomas Bohné, Leonardo Guevara

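AI summary: Proposes a semantics-guided multimodal masked autoencoder framework that injects semantic information during pretraining through semantics-guided LiDAR voxel masking and an auxiliary point-wise semantic decoder branch, improving 3D BEV object detection.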

Comments Accepted at the ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy (SRRA) as a lightning talk and poster

详情
AI中文摘要

准确的3D鸟瞰图(BEV)目标检测对于自动驾驶至关重要,并且强烈依赖于来自互补传感器(如摄像头和LiDAR)的有效多模态表示。多模态掩码自编码器已显示出学习此类表示以用于下游3D BEV目标检测的强大潜力。然而,现有方法通常对摄像头和LiDAR输入应用均匀随机掩码,平等对待所有区域,并且仅通过掩码重建学习表示。我们提出了一种语义引导的多模态掩码自编码器框架,该框架在预训练期间通过两个独立组件引入语义信息:(i)语义引导的LiDAR体素掩码,它更强烈地保留语义重要的LiDAR区域,以及(ii)一个辅助的点级LiDAR语义解码分支,在重建之外注入语义引导。在BEVFusion 3D目标检测上,与标准UniM2AE基线相比,我们的语义引导预训练策略在nuScenes mini验证集上提升了性能:语义引导的LiDAR体素掩码在基线上实现了+1.49%的平均精度(mAP)和+1.66%的nuScenes检测分数(NDS),而解码器侧的点语义监督实现了+1.39%的mAP和+3.22%的NDS。

英文摘要

Accurate 3D bird's-eye view (BEV) object detection is essential for autonomous driving, and depends strongly on effective multimodal representations from complementary sensors such as cameras and LiDAR. Multimodal masked autoencoders have shown strong potential for learning such representations for downstream 3D BEV object detection. However, existing methods typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction. We propose a semantics-guided multimodal masked autoencoder framework that introduces semantic information during pretraining through two separate components: (i) semantics-guided LiDAR voxel masking, which preserves semantically important LiDAR regions more strongly, and (ii) an auxiliary point-wise LiDAR semantic decoder branch that injects semantic guidance in addition to reconstruction. On BEVFusion 3D object detection, our semantics-guided pretraining strategy improves performance on the nuScenes mini validation set compared to the standard UniM2AE baseline: semantics-guided LiDAR voxel masking yields +1.49% mean Average Precision (mAP) and +1.66% nuScenes Detection Score (NDS), while decoder-side point semantic supervision yields +1.39% mAP and +3.22% NDS over the baseline.

2605.25254 2026-05-26 cs.CV cs.AI

Guess the Unified Model: How Much Can We Recover from Generated Images?

猜猜统一模型:从生成的图像中我们能恢复多少?

Jasin Cekinmez, Ryo Mitsuhashi, Addison J. Wu, Yida Yin

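AI summary: Studies the separability of images generated by unified models. Experiments on a large number of images from seven models show that model attribution is highly feasible and that semantic content contributes to separability but is not the dominant signal.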

详情
AI中文摘要

随着统一模型生成的图像现在在线广泛传播,追溯其来源模型为透明度和深入理解单个模型的特征行为提供了一条途径。先前的工作已经探索了LLM生成文本、扩散模型图像和数据集的来源,但统一模型生成图像的可分离性仍然是一个未充分探索的领域。我们通过使用七个统一模型生成的图像,检查在损坏、领域和提示语言上的可分离性来填补这一空白。我们表明模型归因高度可行,因为我们的模型在每个模型约20K图像的情况下达到了近乎完美的准确率。损坏和结构扰动对归因性能的影响较小,跨领域泛化表明语义内容对可分离性有贡献,但并非主导信号。最后,我们观察到对于大多数模型,提示语言归因接近随机水平,表明语言特定的视觉特征极少。这些发现突显了统一模型输出中一致的模型特定视觉特征,并为追踪和审计生成图像流水线开辟了新方向。

英文摘要

With unified model-generated images now widespread online, attributing their model of origin offers a path toward transparency and deeper insight into the characteristic behaviors of individual models. Prior work has explored provenance in LLM-generated text, diffusion model images, and datasets, but the separability of unified model-generated images remains underexplored. We address this gap by examining separability across corruptions, domains, and prompt languages using images generated by seven unified models. We show that model attribution is highly feasible, as our model achieves near-perfect accuracy with around 20K images per model. Corruptions and structural perturbations have only a modest effect on attribution performance, and cross-domain generalization reveals that semantic content contributes to separability but is not the dominant signal. Finally, we observe that for most models, prompt language attribution is around chance level, suggesting minimal language-specific visual signatures. These findings highlight consistent model-specific visual characteristics in unified model outputs and open new directions for tracing and auditing generative image pipelines.

2605.23491 2026-05-26 cs.LG cs.AI cs.CL

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

CoSPlay: 测试时协作自我博弈与自生成代码和单元测试

Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue

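AI summary: Proposes CoSPlay, a framework in which code and unit tests improve each other iteratively through cooperative self-play without ground-truth unit tests, substantially improving code generation performance.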

Comments Code is available at: https://github.com/sanae-ai/CosPlay | Data and logs are available at: https://huggingface.co/datasets/yomi017/CosPlay

详情
AI中文摘要

最近,可验证奖励强化学习(RLVR)和测试时扩展(TTS)通过可执行验证推动了LLM代码生成的发展。然而,真实单元测试(GT UTs)仍然是瓶颈:最先进的RLVR方法需要它们进行昂贵的训练,而现有的TTS方法在没有它们的情况下会失去竞争力。这促使了无GT的TTS,其中现有方法直接使用自生成的UT来优化和选择代码候选。然而,这些UT通常带有噪声或与错误代码虚假耦合,而UT质量在没有可靠代码的情况下也无法验证。因此,关键挑战是同时改进两者。为此,我们提出了CoSPlay,一个无GT、无需训练的框架,通过协作自我博弈同时改进代码和UT。它首先探索多样化的解决方案思路,识别其潜在失败模式以生成有区分力的UT思路。然后,它利用代码-UT执行矩阵中的双向通过计数信号,迭代地修剪或修复弱代码,并刷新或替换不可靠的UT,使两个池共同进化。最后,当多个代码在最高通过计数上并列时,它从最大的输出共识簇中选择最终代码,因为正确的代码在相同输入上一致,而错误的代码则发散。在四个具有挑战性的基准上的实验表明,CoSPlay在Qwen2.5-7B-Instruct上将平均BoN从22.1%提升到33.2%,UT准确率从14.6%提升到78.3%,匹配或超越了RLVR模型CURE-7B。当应用于CURE-7B时,它进一步将BoN提高了5.7%。CoSPlay还能跨不同骨干网络泛化,并在相当的token预算下优于无GT的TTS基线,且随着预算增加持续获益。这些结果表明,无需任何GT数据即可实现竞争性代码生成的可扩展推理策略。

英文摘要

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.

2605.23473 2026-05-26 cs.LG cs.AI

Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension

面向未知有效维度的实用贝叶斯优化的自动随机嵌入

Hong Qian, Xiang Shu, Xiang Xia, Xuhui Liu, Yangde Fu, Bei Liang, Huibin Wang, Liang Dou

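AI summary: Proposes Dynamic Shared Embedding Bayesian Optimization (DSEBO), which automatically adjusts the subspace dimension and shares queried solutions to balance approximation and optimization errors, substantially reducing regret and time cost in high-dimensional optimization.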

Comments This paper has been accepted by IJCAI 2026

详情
AI中文摘要

贝叶斯优化广泛应用于复杂黑箱函数的优化,但受维度灾难困扰。随机嵌入作为一种降维策略,通过在低维子空间中优化来简化具有有效维度的任务。然而,预先确定任务的有效维度仍是一个重大挑战,它影响子空间维度的选择和优化性能。传统方法使用专家提供的固定子空间维度,或依赖试错法估计子空间维度,消耗资源。为此,本文提出一种针对未知有效维度的高维贝叶斯优化的自动随机嵌入方法,称为动态共享嵌入贝叶斯优化(DSEBO)。DSEBO从低维度开始,如果当前子空间中的解显示初步收敛,则切换到更高维的子空间。DSEBO基于不同子空间中解的质量动态确定下一子空间的维度,并与新子空间共享已查询的解以实现更好的初始化。理论上,我们推导了DSEBO的遗憾界,并证明DSEBO能更好地平衡近似误差和优化误差。在维度规模变化的函数和未知有效维度的实际任务上的大量实验表明,与最先进方法相比,跨不同子空间的交替优化在高维优化中显著提高了优化遗憾和时间性能。

英文摘要

Bayesian optimization is widely employed for optimizing complex black-box functions but struggles with the curse of dimensionality. Random embedding, a dimension reduction strategy, simplifies tasks that have an effective dimension by optimizing within a low-dimensional subspace. However, determining the effective dimension of a task in advance remains a significant challenge, and it influences both the choice of subspace dimensionality and the optimization performance. Traditional methods either use fixed subspace dimensions provided by experts or rely on resource-intensive trial and error to estimate the subspace dimension. To this end, this paper proposes an automated random embedding for high-dimensional Bayesian optimization with unknown effective dimension, called Dynamic Shared Embedding Bayesian Optimization (DSEBO). DSEBO starts with a low dimension and switches to a higher-dimensional subspace if the solutions in the current subspace show preliminary convergence. DSEBO dynamically determines the dimension of the next subspace based on the quality of the solutions in different subspaces and shares the queried solutions with the new subspace for a better initialization. Theoretically, we derive a regret bound for DSEBO and demonstrate that DSEBO can better balance approximation and optimization errors. Extensive experiments on functions with dimensionality of varying magnitudes and real-world tasks with unknown effective dimensions reveal that, compared with state-of-the-art methods, alternating optimization across different subspaces results in significant improvements in high-dimensional optimization, both in terms of optimization regret and time.

2605.23454 2026-05-26 cs.CL

ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

ARES: 面向可扩展大语言模型强化学习的自动评分标准合成

Xiaoyuan Li, Keqin Bao, Moxin Li, Yubo Ma, Yichang Zhang, Wenjie Wang, Fuli Feng, Dayiheng Liu

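AI summary: Proposes ARES, a framework that automatically generates question-answer pairs and instance-level weighted rubrics from raw pretraining documents for scalable rubric-based reinforcement learning, outperforming continual pretraining, supervised fine-tuning, and binary-reward reinforcement learning on multiple open-ended tasks.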

Comments Under Review

详情
AI中文摘要

基于评分标准的奖励为将强化学习扩展到大型语言模型提供了一种有前景的方式,超越了具有自动可验证答案的任务。然而,扩展基于评分标准的强化学习仍然具有挑战性:现有方法通常依赖专家编写的评分标准和手动构建的问题集,而固定的任务级评分标准可能无法捕捉单个问题的评估需求。我们提出ARES(面向可扩展强化学习的自动评分标准合成),一个自动构建基于评分标准的强化学习数据的框架。从原始预训练文档开始,ARES将源知识转换为自包含的问答对,并共同生成特定问题的加权评分标准,从而为开放式回答提供实例级奖励监督。为了提高多样性和质量,ARES基于领域标签和人物角色信息生成,并应用验证过滤器以确保问题自包含性、答案忠实性和评分标准有效性。使用ARES,我们在十个领域构建了10万个评分标准标注的实例。在七个基准上的实验表明,使用ARES训练的基于评分标准的强化学习优于持续预训练、监督微调和二元奖励强化学习,在医疗和指令遵循等多维开放任务上提升最大。

英文摘要

Rubric-based rewards offer a promising way to extend reinforcement learning (RL) for large language models beyond tasks with automatically verifiable answers. However, scaling rubric-based RL remains challenging: existing approaches often rely on expert-written rubrics and manually constructed question sets, while fixed task-level rubrics may fail to capture the evaluation requirements of individual questions. We propose ARES (Automated Rubric synthEsis for Scalable RL), a framework for automatically constructing rubric-based RL data at scale. Starting from raw pretraining documents, ARES converts source knowledge into self-contained question-answer pairs and co-generates question-specific weighted rubrics, enabling instance-level reward supervision for open-ended responses. To improve diversity and quality, ARES conditions generation on domain labels and persona information, and applies validation filters for question self-containment, answer faithfulness, and rubric validity. Using ARES, we construct 100K rubric-annotated instances across ten domains. Experiments on seven benchmarks show that rubric-based RL trained with ARES outperforms continual pretraining, supervised fine-tuning, and binary-reward RL, with the largest gains on multi-dimensional open-ended tasks such as healthcare and instruction following.

2605.23395 2026-05-26 cs.LG

Convex Compositional Reasoning Models

凸组合推理模型

Meir Roketlishvili, Semyon Semenov, Maksim Bobrin, Viktor Kovalchuk, Albert Baichorov, Abduragim Shtanchaev, Fakhri Karray, Dmitry V. Dylov, Martin Takáč, Arip Asadulaev

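AI summary: To address the non-convex geometry of energy landscapes, a bottleneck in compositional reasoning, the paper proposes a convex compositional energy minimization framework that parameterizes factors with input-convex neural networks and optimizes over a tight convex relaxation, enabling deterministic projected first-order optimization; models trained on small problems transfer zero-shot to larger instances.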

详情
AI中文摘要

组合能量模型可以通过在许多局部约束中重用学习到的因子能量,泛化到更大的组合推理问题。在本文中,我们表明组合推理的一个关键瓶颈不是组合本身,而是学习到的能量景观的非凸几何。为了解决这个问题,我们引入了凸组合能量最小化(CCEM),这是一个用输入凸神经网络参数化每个因子,并在可行集的紧凸松弛上优化组合能量的框架。由于凸性在求和下保持不变,全局松弛目标保持凸性,从而能够进行确定性投影一阶优化。CCEM分两个阶段训练:因子级对比学习以塑造局部能量盆地,然后通过展开的投影求解器进行端到端细化。我们的实验表明,在小子问题或单个问题规模上训练的模型可以无需重新训练地迁移到更大的实例。

英文摘要

Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geometry of the learned energy landscape. To solve this problem, we introduce Convex Compositional Energy Minimization (CCEM), a framework that parameterizes each factor with an input-convex neural network and optimizes the composed energy over a tight convex relaxation of the feasible set. Because convexity is preserved under summation, the global relaxed objective remains convex, enabling deterministic projected first-order optimization. CCEM is trained in two stages: factor-level contrastive learning to shape local energy basins, followed by end-to-end refinement through an unrolled projected solver. Our experiments show that our models trained on small subproblems or a single problem size transfer to larger instances without retraining.
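Because a sum of convex factor energies is itself convex, projected gradient descent over a convex feasible set converges to the global minimizer of the relaxed objective. The sketch below illustrates this under simplifying assumptions: two hand-written convex quadratics stand in for the learned input-convex networks, and the box $[0,1]^n$ stands in for the convex relaxation. Neither choice comes from the paper, and all function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def project_to_box(x: Vector) -> Vector:
    """Project onto [0, 1]^n, a stand-in convex relaxation of a binary set."""
    return [min(1.0, max(0.0, v)) for v in x]


def projected_gradient_descent(
    factor_grads: List[Callable[[Vector], Vector]],
    x0: Vector,
    step: float = 0.1,
    iters: int = 200,
) -> Vector:
    """Minimize a sum of convex factors over the box by projected gradient."""
    x = project_to_box(x0)
    for _ in range(iters):
        # The gradient of the composed energy is the sum of factor gradients.
        total = [0.0] * len(x)
        for grad in factor_grads:
            total = [t + g for t, g in zip(total, grad(x))]
        # Gradient step followed by projection back onto the feasible set.
        x = project_to_box([v - step * t for v, t in zip(x, total)])
    return x


# Stand-in convex factors (assumptions, not learned networks):
#   E1(x) = (x0 - 0.8)^2
#   E2(x) = (x0 - x1)^2 + (x1 - 2.0)^2
def grad_e1(x: Vector) -> Vector:
    return [2.0 * (x[0] - 0.8), 0.0]


def grad_e2(x: Vector) -> Vector:
    return [2.0 * (x[0] - x[1]), -2.0 * (x[0] - x[1]) + 2.0 * (x[1] - 2.0)]


solution = projected_gradient_descent([grad_e1, grad_e2], x0=[0.0, 0.0])
```

The unconstrained minimizer of this toy objective lies outside the box at (1.2, 1.6), so the iterate settles on the boundary at approximately (0.9, 1.0), which is the constrained global minimum.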

2605.23163 2026-05-26 cs.CL

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Fast-dDrive:面向自动驾驶的高效块扩散视觉语言模型

Kewei Zhang, Jin Wang, Sensen Gao, Chengyue Wu, Yulong Cao, Songyang Han, Boris Ivanovic, Langechuan Liu, Marco Pavone, Song Han, Daquan Zhou, Enze Xie

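AI summary: Proposes Fast-dDrive, a block-diffusion vision-language-action model that combines bidirectional refinement within semantic units and causal constraints across units with structural token freezing, section-aware training, and speculative decoding to achieve high-fidelity trajectory planning and efficient inference, reaching state-of-the-art performance on WOD-E2E and nuScenes with a 12x inference speedup.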

详情
AI中文摘要

通过视觉-语言-动作(VLA)模型实现的端到端自动驾驶需要在高保真轨迹规划与高效推理之间取得不稳定的平衡。现有范式通常存在不足:自回归(AR)VLA在边缘硬件上受限于内存带宽,且容易产生曝光偏差漂移;而全序列扩散模型无法复用KV缓存,并遭受违反基本感知-规划因果关系的“逻辑泄漏”。我们提出Fast-dDrive,一种块扩散VLA,它在语义单元内执行双向细化,同时强制跨单元严格因果排序。利用驾驶VLA通常输出结构化JSON式输出的观察,Fast-dDrive将结构令牌冻结为节支架,并采用节感知训练策略,优先考虑安全关键规划。我们进一步引入支架推测解码,以显著更高的吞吐量实现AR等效质量。最后,我们提出一种低开销的测试时缩放方案:通过从单个共享前缀KV缓存分叉出N个随机轨迹展开并取平均,以极小的计算成本有效抑制预测方差。实验结果表明,Fast-dDrive重新定义了驾驶智能体的速度-精度边界。在WOD-E2E测试集上,Fast-dDrive在3秒和5秒平均位移误差(ADE)上达到最优,同时在基于扩散的VLA中具有最高的RFS;在nuScenes上,它将平均L2误差降至0.32米(提升22%)。当与SGLang集成时,我们的框架相比AR基线实现了12倍的吞吐量提升,缩小了高容量VLA与实时车载部署效率需求之间的差距。

英文摘要

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$m (a $22\%$ improvement). When integrated with SGLang, our framework delivers $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
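The averaging step of the test-time scaling scheme can be illustrated with a minimal sketch that averages $N$ sampled trajectories waypoint by waypoint. Only the averaging is shown; forking rollouts from a shared-prefix KV cache is not modeled. The function name `average_rollouts` and the `(x, y)` waypoint format are assumptions for illustration.

```python
from statistics import mean
from typing import List, Tuple

Waypoint = Tuple[float, float]


def average_rollouts(rollouts: List[List[Waypoint]]) -> List[Waypoint]:
    """Average N equally long trajectories waypoint by waypoint."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    horizon = len(rollouts[0])
    if any(len(r) != horizon for r in rollouts):
        raise ValueError("all rollouts must have the same horizon")
    return [
        (mean(r[t][0] for r in rollouts), mean(r[t][1] for r in rollouts))
        for t in range(horizon)
    ]


# Three hypothetical stochastic rollouts with two waypoints each.
rollouts = [
    [(1.0, 0.0), (2.0, 0.5)],
    [(1.0, 0.5), (2.0, 1.0)],
    [(1.0, 1.0), (2.0, 1.5)],
]
plan = average_rollouts(rollouts)
```

For independent samples with equal variance, the variance of each averaged coordinate shrinks in proportion to $1/N$, which is the usual motivation for this kind of ensembling.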

2605.23148 2026-05-26 cs.CL cs.CY

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

当症状不足时:大语言模型精神科筛查中的证据加权模式

Jianfeng Zhu, Megan Korhummel, Ruoming Jin, Karin G. Coifman

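AI summary: Introduces a SCID-anchored benchmark to evaluate five large language models on psychiatric screening and finds that, when classifying anxiety disorder, depression, and post-traumatic stress disorder, the models tend to discount symptom evidence in the presence of preserved functioning or protective context, leading to false-negative errors.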

Comments 25 pages, 7 figures

详情
AI中文摘要

随着心理健康护理需求超过临床医生提供的评估,对可扩展筛查工具的需求日益增加。大语言模型(LLMs)可能从患者叙述中识别精神科风险,但其在不同诊断、人口统计亚组和证据使用模式中的可靠性仍不确定。我们引入了一个基于SCID的基准,包含555个半结构化体验访谈,并配有焦虑症、重度抑郁症、创伤后应激障碍和任何当前心理健康障碍的诊断参考标签。使用零样本任务特定提示,我们评估了五个最先进的LLM,并检查假阴性错误是否反映了遗漏的精神科证据或对症状、功能损害和保护性背景线索的差异化加权。不同任务和模型的表现各异,准确率从0.49到0.86,马修斯相关系数从0.16到0.38。GPT-4.1 Mini和GPT-5 Mini显示出最一致的疾病特异性准确率。亚组分析发现,男性参与者的抑郁症分类准确率高于女性,没有一致的年龄相关模式,种族阶层间存在适度的非均匀变异。证据整合分析显示,假阴性的焦虑症和PTSD分类通常包含明确的症状证据,但伴有功能保留、应对能力或社会支持。功能损害证据使模型输出偏向阳性分类,而保护性背景证据则使输出偏离。这些发现表明,LLMs可能支持可扩展的精神科筛查,但它们在功能保留或保护性背景下低估症状证据的倾向需要在临床部署前进行仔细验证。

英文摘要

As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder, and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted outputs away. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
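The abstract reports both accuracy and the Matthews correlation coefficient (MCC). The sketch below shows how the two can diverge when false negatives are common. The confusion-matrix counts are invented for illustration and are not taken from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC computed from the four cells of a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical screening result: 50 true cases, 30 of them missed.
tp, fn, tn, fp = 20, 30, 90, 10
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
```

In this invented example accuracy is about 0.73 while MCC is about 0.35, because MCC penalizes the missed positive cases that accuracy largely hides when negatives dominate.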

2605.22769 2026-05-26 cs.CL cs.AI

Understanding Data Temporality Impact on Large Language Models Pre-training

理解数据时间性对大型语言模型预训练的影响

Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave

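AI summary: Studies how the order of pretraining data affects the acquisition of time-sensitive factual knowledge in large language models. Using a benchmark of over 7,000 temporally grounded questions and 6B-parameter models, it finds that temporally ordered training yields more up-to-date and precise knowledge than shuffled training.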

详情
AI中文摘要

大型语言模型(LLMs)通常在打乱顺序的语料库上进行训练,导致模型的知识在训练时被冻结,其时间基础仍然难以理解。在这项工作中,我们研究了预训练动态对获取时间敏感事实知识的影响,特别关注数据顺序。我们的主要贡献有两方面。首先,我们引入了一个包含7000多个时间基础问题的综合基准和一个评估协议,能够分析模型是否将事实与其对应的时间段正确关联。其次,我们在按时间顺序排列的Common Crawl快照上预训练了60亿参数的模型,并将其与标准的随机打乱预训练进行比较。我们的结果表明,按顺序训练的模型在通用语言理解和常识方面与随机打乱的基线相当,同时始终表现出更及时和精确的时间知识。按时间顺序的预训练提高了事实的新鲜度,而随机打乱的预训练在较旧的数据上表现更好,可能是由于事实重复增加。这些发现,连同我们在https://github.com/kyutai-labs/kairos 发布的代码、在https://huggingface.co/collections/kyutai/kairos 发布的检查点和数据集,为LLMs的持续学习未来研究提供了基础。

英文摘要

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at training time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code (https://github.com/kyutai-labs/kairos) and of our checkpoints and datasets (https://huggingface.co/collections/kyutai/kairos), provide a foundation for future research on continual learning for LLMs.

2605.22222 2026-05-26 cs.LG

ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models

ARC-STAR: 面向PDE基础模型的可审计事后修正

Chengze Li, Lingwei Wei, Li Sun, Hongbo Lv, Jie Yang, Hanrong Zhang, Kening Zheng, Wei-Chieh Huang, Enze Ma, Philip S. Yu

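AI summary: To address predictions of PDE foundation models that drift with spatially concentrated errors, the paper proposes ARC-STAR, a frozen-solver post-hoc correction framework that achieves auditable, low-error correction through three stages: global correction, local refinement, and budget-aware routing.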

Comments 40 pages, including appendices

详情
AI中文摘要

偏微分方程(PDE)基础模型是预训练网络,能够从单一可重用求解器预测速度、压力等物理场的演化。在不熟悉的流场上,它们的预测会逐步漂移,误差集中在少数区域,然而重新训练会破坏网络稳定性,而统一的事后修正忽略了这种空间集中性。为解决此问题,我们提出了一种冻结求解器的事后修正框架——自适应风险校准空间分诊可审计精炼(ARC-STAR)。ARC-STAR将修正组织为三个阶段:全局修正器消除广泛的求解器偏差,块级局部精炼器清理全局后残差,在部署时,无标签分数在计算预算下将精炼路由到高风险块。该框架设计为:(i) 冻结宿主,保留预训练求解器无需微调;(ii) 可审计,全局和局部阶段分别训练和评估以实现可衡量的贡献;(iii) 预算感知,使用块级接口,要么精炼整个场,要么将有限计算路由到高风险区域。在跨越十个状态单元的五个流基准测试中,ARC-STAR是唯一在每个单元上将速度滚动误差比原始Poseidon降低至少36倍的方法。全局阶段将原始宿主误差降低91-99%,局部阶段进一步将剩余的全局后残差降低高达94.4%。

英文摘要

Partial differential equation (PDE) foundation models are pretrained networks that forecast how physical fields like velocity and pressure evolve from a single reusable solver. On unfamiliar flows, their predictions drift step by step and errors concentrate in a few regions, yet retraining destabilizes the network and uniform post-hoc correction overlooks this spatial concentration. To address this, we propose a frozen-solver post-hoc correction framework, Adaptive Risk-Calibrated Spatial Triage for Auditable Refinement (ARC-STAR). ARC-STAR organizes correction into three stages: a global corrector removes broad solver bias, a blockwise local refiner cleans the post-global residual, and, at deployment, a label-free score routes refinement to high-risk blocks under a compute budget. The framework is designed to be (i) frozen-host, preserving the pretrained solver without fine-tuning; (ii) auditable, with global and local stages trained and evaluated separately for measurable contributions; and (iii) budget-aware, using a blockwise interface that either refines the full field or routes limited compute to high-risk regions. Across five flow benchmarks spanning ten regime cells, ARC-STAR is the only method that cuts velocity rollout error by at least 36x over raw Poseidon on every cell. The global stage reduces raw host error by 91-99%, and the local stage further reduces the remaining post-global residual by up to 94.4%.
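The budget-aware routing stage can be pictured as a top-k selection over per-block risk scores. The sketch below is only an illustration of that idea: the function name `route_blocks`, the block identifiers, and the score values are hypothetical, and how the label-free score is actually computed is not described in the abstract.

```python
from typing import Dict, List


def route_blocks(risk_scores: Dict[str, float], budget: int) -> List[str]:
    """Return up to `budget` blocks, highest risk score first."""
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    return ranked[: max(budget, 0)]


# Hypothetical per-block risk scores for a field split into four blocks.
scores = {"block_0": 0.1, "block_1": 0.9, "block_2": 0.4, "block_3": 0.7}
refine_two = route_blocks(scores, budget=2)
refine_all = route_blocks(scores, budget=10)
```

A budget that covers every block reduces to refining the full field, which mirrors the two modes of the blockwise interface mentioned in the abstract.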

2605.22137 2026-05-26 cs.CL

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

跨语言共识:通过多语言自一致性对齐多语言文化知识

Andrew Ivan Soegeng, Patrick Sutanto, Tan Sang Nguyen

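AI summary: Proposes a self-supervised framework that uses multilingual self-consistency and a self-critique mechanism to extract cultural knowledge from local-language representations and transfer it to English, narrowing the cross-lingual cultural knowledge gap and improving English-query performance on the BLEnD benchmark by 5.03% on average.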

Comments Accepted to The 1st Workshop on Multilinguality in the Era of Large Language Models

详情
AI中文摘要

尽管大型语言模型(LLMs)在各种任务中展现出强大的能力,但它们在不同语言之间表现出显著的性能差异。虽然用英语提示LLMs通常能获得最高的通用性能,但这往往会导致以西方为中心的偏见,阻碍模型准确反映多样化的文化知识。我们假设LLMs已经拥有嵌入在本地语言表示中的丰富文化知识,但在用英语提示时无法检索到这些知识。为了弥合这一跨语言知识差距,我们提出了一种新颖的自监督框架。我们的方法利用多语言自一致性来识别跨语言中最可靠的文化响应,并结合自我批评机制将这些知识转移到较弱的语言中。在BLEnD基准上的评估表明,我们的方法显著改善了文化对齐——在英语查询上平均提升5.03%——完全依赖于自生成数据。最终,我们的工作表明,潜在的文化知识可以成功地在语言之间浮现和传播,从而实现更具文化公平性和一致性的LLMs。

英文摘要

Although Large Language Models (LLMs) demonstrate strong capabilities across various tasks, they exhibit significant performance discrepancies across languages. While prompting LLMs in English typically yields the highest general performance, it often induces a Western-centric bias, hindering the model's ability to accurately reflect diverse cultural knowledge. We hypothesize that LLMs already possess rich cultural knowledge embedded within local-language representations, but fail to retrieve it when prompted in English. To bridge this cross-lingual knowledge gap, we propose a novel self-supervised framework. Our method leverages multilingual self-consistency to identify the most reliable cultural responses across languages, combined with a self-critique mechanism to transfer this knowledge to the weaker language. Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment, boosting performance on English queries by an average of 5.03% while relying entirely on self-generated data. Ultimately, our work demonstrates that latent cultural knowledge can be successfully surfaced and propagated across languages, enabling more culturally equitable and consistent LLMs.
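One simple way to picture multilingual self-consistency is a majority vote over answers obtained by asking the same question in several languages. The sketch below assumes the answers have already been mapped to a common language for comparison; that step, the function name `cross_lingual_consensus`, and the example data are assumptions, and the paper's actual aggregation and self-critique steps may differ.

```python
from collections import Counter
from typing import Dict, Tuple


def cross_lingual_consensus(answers: Dict[str, str]) -> Tuple[str, float]:
    """Return the most frequent answer across languages and its vote share.

    answers maps a prompt-language code to the model's answer, already
    rendered in a common language (hypothetical preprocessing step).
    Raises IndexError if `answers` is empty.
    """
    # Light normalization so trivially different strings count as one vote.
    normalized = [a.strip().lower() for a in answers.values()]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical answers to one cultural question asked in four languages.
answers = {
    "en": "Pizza",
    "id": "nasi goreng",
    "jv": "Nasi Goreng",
    "su": "nasi goreng ",
}
consensus, share = cross_lingual_consensus(answers)
```

In this invented example the English-prompted answer is the outlier, and the consensus answer could then serve as a self-generated training target for English queries.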

2605.22064 2026-05-26 cs.CL

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

Hy-MT2:面向复杂真实场景的快速、高效且强大的多语言翻译模型系列

Mao Zheng, Zheng Li, Tao Chen, Bo Lv, Mingrui Sun, Mingyang Song, Jinlong Song, Hong Huang, Decheng Wu, Hai Wang, Yifan Song, Yanfeng Chen, Guanwei Zhang

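AI summary: Presents the Hy-MT2 family of multilingual translation models in three sizes (1.8B, 7B, and 30B-A3B MoE) supporting translation among 33 languages, outperforming open-source models and commercial APIs on general, business, domain-specific, and instruction-following tasks, and enabling lightweight on-device deployment.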

详情
AI中文摘要

Hy-MT2是一系列面向复杂真实场景的快速思考多语言翻译模型。它包括三种模型规模:1.8B、7B和30B-A3B(MoE),均支持33种语言之间的翻译,并能有效遵循多种语言的翻译指令。多维度评估表明,Hy-MT2在通用、真实世界业务、领域特定和指令跟随翻译任务中均表现出色。7B和30B模型在快速思考模式下超越了DeepSeek-V4-Pro和Kimi K2.6等开源模型,而轻量级的1.8B模型在整体性能上也超越了微软、豆包等提供商的主流商业API。此外,当与AngelSlim的1.25位极端量化结合用于设备端部署时,轻量级1.8B模型仅需440 MB存储空间,并实现了1.5倍的推理加速。

英文摘要

Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall. Moreover, when paired with AngelSlim's 1.25-bit extreme quantization for on-device deployment, the lightweight 1.8B model requires only 440 MB of storage and achieves a 1.5x inference speedup.