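arXivDaily daily academic digest: synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.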

2604.07048 2026-06-03 cs.CV

PRISM: Rethinking Atmospheric Scattering Reconstruction as a Unified Understanding and Restoration Model for Real-world Dehazing

PRISM: 重新思考大气散射重建作为真实世界去雾的统一理解与恢复模型

Chengyu Fang, Chunming He, Yuelin Zhang, Chubin Chen, Chenyang Zhu, Hongqiu Wang, Longxiang Tang, Xiu Li, Sina Farsiu

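Affiliations: Tsinghua University; Duke University; CUHK; HKUST; HKUST(GZ)

AI summary: Proposes Proximal Scattering Atmosphere Reconstruction (PSAR), a physically structured framework, combined with online non-uniform haze synthesis and a Selective Self-Distillation Adaptation (SSDA) scheme, to unify understanding and restoration for real-world image dehazing.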

Comments 21 Pages, 8 Figures, 7 Tables

AI Chinese abstract

真实世界图像去雾（RID）旨在去除真实场景中由雾引起的退化。由于非均匀雾分布、空间变化的颜色偏移以及配对真实雾-干净数据的稀缺，该任务仍然具有挑战性。在PRISM中，我们提出了近端散射大气重建（PSAR），这是一个物理结构化框架，在大气散射模型下联合重建清晰场景和散射变量，使恢复过程在复杂真实世界条件下更具可解释性。为了弥合合成到真实的差距，我们设计了一个在线非均匀雾合成流程和一个用于非配对真实世界场景的选择性自蒸馏适应（SSDA）方案，该方案使模型能够选择性地从高质量感知目标中学习，同时利用其内在的散射理解来审计残留雾并指导自我优化。在真实世界基准上的实验表明，PRISM在RID任务上取得了具有竞争力的性能。

English abstract

Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying color shifts, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattering Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, making the restoration process more interpretable in complex real-world conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-Distillation Adaptation (SSDA) scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Experiments on real-world benchmarks demonstrate that PRISM achieves competitive performance on RID tasks.
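The atmospheric scattering model mentioned in the abstract is commonly written as I(x) = J(x) t(x) + A (1 - t(x)), with transmission t(x) = exp(-β(x) d(x)). The sketch below assumes that textbook form and shows how a spatially varying β produces non-uniform haze and how the model can be inverted when t and A are known. It is not PSAR or the authors' synthesis pipeline; the function names, the scalar airlight, and the `t_min` clamp are illustrative assumptions.

```python
import math

def synthesize_haze(clear, depth, beta, airlight):
    """Apply I = J * t + A * (1 - t) per pixel, with t = exp(-beta * d).

    clear, depth and beta are 2-D lists of the same shape; a spatially
    varying beta gives non-uniform haze. All names are illustrative.
    """
    hazy, trans = [], []
    for j_row, d_row, b_row in zip(clear, depth, beta):
        t_row = [math.exp(-b * d) for b, d in zip(b_row, d_row)]
        hazy.append([j * t + airlight * (1.0 - t) for j, t in zip(j_row, t_row)])
        trans.append(t_row)
    return hazy, trans

def recover_scene(hazy, trans, airlight, t_min=0.05):
    """Invert the model: J = (I - A) / max(t, t_min) + A.

    The clamp t_min (an assumed value) avoids dividing by a near-zero
    transmission in very dense haze.
    """
    return [
        [(i - airlight) / max(t, t_min) + airlight for i, t in zip(i_row, t_row)]
        for i_row, t_row in zip(hazy, trans)
    ]

clear = [[0.2, 0.8]]
depth = [[1.0, 2.0]]
beta = [[0.5, 1.0]]  # denser haze on the right-hand pixel
hazy, trans = synthesize_haze(clear, depth, beta, airlight=0.9)
restored = recover_scene(hazy, trans, airlight=0.9)
```

In a real setting neither t nor A is observed, which is why the abstract describes reconstructing the clear scene and the scattering variables jointly.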


2604.07123 2026-06-03 cs.CL

Language Bias under Conflicting Information in Multilingual LLMs

多语言大模型中冲突信息下的语言偏见

Robert Östling, Murathan Kurfalı

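Affiliations: Stockholm University; RISE Research Institutes of Sweden

AI summary: Extends the "conflicting needles in a haystack" paradigm to a multilingual setting and evaluates which language multilingual LLMs of different sizes favor when answering from conflicting information. The models show consistent language bias: a general bias against Russian, a preference for Chinese at the longest context lengths, and a tendency to favor information in the same language as the prompt.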

AI Chinese abstract

大型语言模型（LLMs）在整合冲突信息回答问题过程中已被证明存在偏见。本文探讨这种偏见是否也存在于冲突信息所使用的语言上。为此，我们将“干草堆中的冲突针”范式扩展到多语言环境，并使用五种不同语言的自然新闻领域数据，对一系列不同规模的多语言LLMs进行了全面评估。我们发现，所有测试的LLMs，包括GPT-5.2，在绝大多数情况下都会忽略冲突，并自信地只断言其中一个可能的答案。此外，在模型和提示语言之间，存在一致的语言偏好偏见，普遍对俄语存在偏见，而在最长上下文长度下，则偏好中文。语言偏好在中国大陆内外训练的模型之间一致，但前者稍强。模型还普遍倾向于优先考虑与提示语言匹配的信息。我们希望让多语言LLMs的用户和开发者意识到这类偏见，以促进对其成因及可能缓解方法的进一步研究。

English abstract

Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models and prompting languages in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. The language preferences are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category. There is also a general tendency among models to prioritize information that matches the language used for prompting. We hope to make users and developers of multilingual LLMs aware of this category of biases, to spur further research on their causes and possible mitigation.


2604.05718 2026-06-03 cs.CV

MPM: Mutual Pair Merging for Efficient Vision Transformers

MPM：用于高效视觉Transformer的互结对合并

Simon Ravé, Pejman Rasti, David Rousseau

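Affiliations: LARIS University of Angers; UMR INRAe-IRHS Angers, France

AI summary: Proposes Mutual Pair Merging (MPM), a training-free, parameter-free module that pairs mutual nearest neighbors in cosine space, averages each pair, and records a merge map for gather-based reconstruction before the decoder, giving end-to-end speedups in semantic segmentation with a small accuracy loss.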

Comments Accepted to CVPR 2026 (Findings)


Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 2998-3008

AI Chinese abstract

减少序列长度是加速Transformer的常用方法，但先前的token缩减工作通常针对分类任务，报告的是代理指标而非端到端延迟。对于语义分割，token缩减进一步受到重建密集、像素对齐特征的需求限制，并且在现代加速器上，计算合并图的开销可能抵消预期收益。我们提出互结对合并（MPM），一种无需训练的token聚合模块，它在余弦空间中形成互最近邻对，对每对进行平均，并记录一个合并图，使得在解码器之前能够进行基于收集的重建，从而现有分割头可以保持不变。MPM不引入任何学习参数，也没有连续的压缩旋钮（无保留率或阈值）。速度-精度权衡由离散的插入调度设置。我们在NVIDIA H100 GPU（带和不带FlashAttention-2）和Raspberry Pi 5上，针对标准分割数据集基准测试了端到端延迟。在ADE20K上，MPM在Raspberry Pi 5上为ViT-Tiny减少了高达60%的每张图像延迟，在H100上使用FlashAttention-2时吞吐量提升高达20%，同时mIoU下降保持在3%以内。这些结果表明，当显式考虑开销时，简单、重建感知、无需训练的token合并可以转化为分割中实际的时钟时间增益。

English abstract

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
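The merge step described in the abstract (mutual nearest neighbors in cosine space, pair averaging, a recorded merge map, and gather-based reconstruction) can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, tokens are plain lists of floats, and the O(n²) similarity loop stands in for the batched tensor operations a real accelerator kernel would use.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mutual_pair_merge(tokens):
    """Average mutual nearest-neighbour pairs; return (merged, merge_map).

    merge_map[i] is the index in `merged` that original token i maps to.
    Assumes at least two tokens.
    """
    n = len(tokens)
    nearest = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest.append(max(others, key=lambda j: cosine(tokens[i], tokens[j])))
    merged, merge_map = [], [None] * n
    for i in range(n):
        j = nearest[i]
        if nearest[j] == i:      # i and j pick each other: a mutual pair
            if i < j:            # emit the pair once, at its first member
                merge_map[i] = merge_map[j] = len(merged)
                merged.append([(a + b) / 2.0 for a, b in zip(tokens[i], tokens[j])])
        else:                    # no mutual partner: keep the token as-is
            merge_map[i] = len(merged)
            merged.append(list(tokens[i]))
    return merged, merge_map

def unmerge(merged, merge_map):
    """Gather-based reconstruction back to the original sequence length."""
    return [merged[k] for k in merge_map]

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
merged, merge_map = mutual_pair_merge(tokens)
restored = unmerge(merged, merge_map)
```

Here the first two tokens choose each other and are averaged, while the third has no mutual partner and passes through unchanged. Because `unmerge` is a plain gather, a downstream segmentation head sees a feature sequence of the original length.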


2604.04439 2026-06-03 cs.LG cs.CV

Estimating Central, Peripheral, and Temporal Visual Contributions to Human Decision Making in Atari Games

估计Atari游戏中中央、周边和时间视觉对人类决策的贡献

Henrik Krauss, Takehisa Yairi

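Affiliations: Department of Advanced Interdisciplinary Studies, The University of Tokyo; Research Center for Advanced Science and Technology, The University of Tokyo

AI summary: Analyzes eye-tracking data from Atari gameplay with a controlled ablation framework and finds that peripheral visual information contributes most to predicting human decisions, while gaze information and past-state information contribute less.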

AI Chinese abstract

我们研究了不同视觉信息源在动态视觉环境中对人类决策的贡献。利用Atari-HEAD（一个带有同步眼动追踪的大规模Atari游戏数据集），我们引入了一个受控消融框架，作为逆向工程周边视觉信息、显式注视信息（以注视图形式）以及人类行为中过去状态信息贡献的手段。我们在六种设置下训练动作预测网络，这些设置选择性地包含或排除这些信息源。在20个游戏中，周边信息的贡献最为显著，移除后预测准确率的中位数下降范围为35.27-43.90%。注视信息导致的下降较小，为2.11-2.76%，而过去状态信息的下降范围较广，为1.52-15.51%，其中上限可能因减少了周边信息泄露而更具信息量。为了补充总体准确率，我们根据不同模型配置分配的真实动作概率对状态进行聚类。该分析识别出粗略的行为模式，包括焦点主导、周边主导以及更多情境决策情境。这些结果表明，Atari游戏中的人类决策强烈依赖于当前注视焦点之外的信息，而所提出的框架提供了一种从行为中估计此类信息源贡献的方法。

English abstract

We study how different visual information sources contribute to human decision making in dynamic visual environments. Using Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye-tracking, we introduce a controlled ablation framework as a means to reverse-engineer the contribution of peripheral visual information, explicit gaze information in the form of gaze maps, and past-state information from human behavior. We train action-prediction networks under six settings that selectively include or exclude these information sources. Across 20 games, peripheral information shows by far the strongest contribution, with median prediction-accuracy drops in the range of 35.27-43.90% when removed. Gaze information yields smaller drops of 2.11-2.76%, while past-state information shows a broader range of 1.52-15.51%, with the upper end likely more informative due to reduced peripheral-information leakage. To complement aggregate accuracies, we cluster states by true-action probabilities assigned by the different model configurations. This analysis identifies coarse behavioral regimes, including focus-dominated, periphery-dominated, and more contextual decision situations. These results suggest that human decision making in Atari depends strongly on information beyond the current focus of gaze, while the proposed framework provides a way to estimate such information-source contributions from behavior.


2604.04087 2026-06-03 cs.LG

ArrowFlow: Hierarchical Machine Learning in the Space of Permutations

ArrowFlow：排列空间中的层次化机器学习

Ozgur Yilmaz

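Affiliations: Department of Artificial Intelligence; Adana Science and Technology University

AI summary: Proposes ArrowFlow, an architecture that learns hierarchical ordinal representations in permutation space using ranking filters and permutation-matrix accumulation, with no floating-point parameters in the core computation, and uses violations of social-choice axioms as inductive biases. Experiments show competitive results on several benchmarks, along with noise robustness and privacy preservation.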

AI Chinese abstract

我们引入了ArrowFlow，一种完全在排列空间中运行的机器学习架构。其计算单元是排序滤波器，即学习到的排序，通过Spearman's footrule距离比较输入，并通过置换矩阵累积（一种基于位移证据的非梯度规则）进行更新。层以层次方式组合：每一层的输出排序成为下一层的输入，从而在核心计算中无需任何浮点参数即可实现深度序数表示学习。我们将该架构与Arrow不可能定理联系起来，表明社会选择公平性公理（上下文依赖性、专业化、对称性破坏）的违反作为非线性、稀疏性和稳定性的归纳偏置。实验涵盖UCI表格基准、MNIST、基因表达癌症分类（TCGA）和偏好数据，均与GridSearchCV调优的基线进行比较。ArrowFlow在Iris上击败所有基线（2.7% vs. 3.3%），并在大多数UCI数据集上具有竞争力。单个参数多项式次数充当主开关：次数1带来噪声鲁棒性（退化减少8-28%）、隐私保护（成本增加0.5个百分点）和缺失特征弹性；更高次数则牺牲这些特性以换取更高的干净准确率。ArrowFlow并非旨在超越基于梯度的方法。它是一个存在性证明，表明在一种根本不同的计算范式（将序数结构提升为一等公民，且与纯整数和神经形态硬件自然对齐）中实现有竞争力的分类是可能的。

English abstract

We introduce ArrowFlow, a machine learning architecture that operates entirely in the space of permutations. Its computational units are ranking filters, learned orderings that compare inputs via Spearman's footrule distance and update through permutation-matrix accumulation, a non-gradient rule rooted in displacement evidence. Layers compose hierarchically: each layer's output ranking becomes the next layer's input, enabling deep ordinal representation learning without any floating-point parameters in the core computation. We connect the architecture to Arrow's impossibility theorem, showing that violations of social-choice fairness axioms (context dependence, specialization, symmetry breaking) serve as inductive biases for nonlinearity, sparsity, and stability. Experiments span UCI tabular benchmarks, MNIST, gene expression cancer classification (TCGA), and preference data, all against GridSearchCV-tuned baselines. ArrowFlow beats all baselines on Iris (2.7% vs. 3.3%) and is competitive on most UCI datasets. A single parameter, polynomial degree, acts as a master switch: degree 1 yields noise robustness (8-28% less degradation), privacy preservation (+0.5pp cost), and missing-feature resilience; higher degrees trade these for improved clean accuracy. ArrowFlow is not designed to surpass gradient-based methods. It is an existence proof that competitive classification is possible in a fundamentally different computational paradigm, one that elevates ordinal structure to a first-class citizen, with natural alignment to integer-only and neuromorphic hardware.
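Spearman's footrule distance, which the abstract names as the comparison used by ranking filters, is the sum of absolute rank displacements between two rankings. The sketch below shows only that comparison step, assuming features are converted to ranks by ascending value with ties broken by index. The function names are hypothetical, and the permutation-matrix accumulation update is not reproduced because the abstract does not specify it.

```python
def to_ranking(values):
    """Rank of each feature (0 = smallest value); ties are broken by index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def footrule(r1, r2):
    """Spearman's footrule: sum of absolute rank displacements."""
    return sum(abs(a - b) for a, b in zip(r1, r2))

def nearest_filter(x, filters):
    """Index of the ranking filter closest to input x under the footrule distance."""
    rx = to_ranking(x)
    return min(range(len(filters)), key=lambda k: footrule(rx, filters[k]))

filters = [[0, 1, 2], [2, 0, 1]]   # two illustrative learned orderings
winner = nearest_filter([0.3, 0.1, 0.2], filters)
```

Every quantity here is an integer, which is consistent with the abstract's point that the core computation needs no floating-point parameters.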


2512.18954 2026-06-03 cs.CV

VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion

VOIC：可见-遮挡联合引导的3D语义场景补全

Zaidao Han, Risa Higashita, Jiang Liu

Affiliations: Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology; Department of Computer Science and Engineering, Southern University of Science and Technology; School of Computer Science, University of Nottingham Ningbo China; Department of Electronic and Information Engineering, Changchun University

AI summary: Proposes VOIC, a network that decouples visible-region perception from occluded-region reasoning using an offline visible-region label extraction strategy and a dual-decoder framework, achieving state-of-the-art 3D semantic scene completion on SemanticKITTI and SSCBench-KITTI360.

AI Chinese abstract

基于相机的3D语义场景补全（SSC）是自动驾驶和机器人场景理解的关键任务。它旨在从单张图像推断完整的3D体素表示，包括语义和几何信息。现有方法通常关注端到端的2D到3D特征提升和体素补全。然而，它们常常忽视由单图像输入引起的高置信度可见区域感知与低置信度遮挡区域推理之间的干扰，这可能导致特征稀释和错误传播。为了解决这些挑战，我们引入了一种离线可见区域标签提取（VRLE）策略，该策略从密集的3D地面真值中显式分离并提取可见区域的体素级监督。该策略为两个互补的子任务（可见区域感知和遮挡区域推理）净化了监督空间。基于这一思想，我们提出了可见-遮挡交互补全网络（VOIC），一种新颖的双解码器框架，将SSC显式解耦为可见区域语义感知和遮挡区域场景补全。VOIC首先通过融合图像特征与深度导出的占据信息构建基础3D体素表示。可见解码器专注于生成高保真的几何和语义先验，而遮挡解码器则利用这些先验以及跨模态交互进行连贯的全局场景推理。在SemanticKITTI和SSCBench-KITTI360基准上的大量实验表明，VOIC在几何补全和语义分割精度上均优于现有的单目SSC方法，实现了最先进的性能。

English abstract

Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.


2603.01576 2026-06-03 cs.CV

Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Cryo-Bench：面向冰冻圈应用的基础模型基准测试

Saurabh Kaushik, Lalit Maurya, Beth Tellman, Valerio Marsocci

Affiliations: Center for Sustainability and the Global Environment (SAGE), University of Wisconsin–Madison; Portsmouth AI and Data Science Centre (PAIDS), School of Computing, University of Portsmouth; ESA, ESRIN, φ-lab, Frascati

AI summary: Introduces Cryo-Bench, a benchmark that evaluates 14 Geo-Foundation Models on key cryosphere components (glaciers, glacial lakes, sea ice, and others). With a frozen encoder, UNet reaches the highest average mIoU (66.38); full fine-tuning with learning-rate tuning yields an average relative improvement of 12.77% on two representative datasets.

AI Chinese abstract

地理基础模型（GFMs）已在涵盖多个领域的地球观测任务中得到评估，并展现出即使在标签稀疏的情况下也能生成可靠地图的强大潜力。然而，针对冰冻圈应用的GFMs基准测试仍然有限，主要原因是缺乏合适的评估数据集。为填补这一空白，我们引入了Cryo-Bench，这是一个用于评估GFMs在关键冰冻圈组件上性能的基准。Cryo-Bench涵盖表碛覆盖冰川、冰湖、海冰和崩解前沿，涉及多种传感器和广泛的地理区域。我们评估了14个GFMs以及UNet和ViT基线，以分析它们的优势、局限性和最佳使用策略。在冻结编码器的情况下，UNet在Cryo-Bench包含的五个评估数据集上取得了最高的平均mIoU 66.38，其次是TerraMind的64.02。在少样本设置（10%输入数据）下，DOFA和TerraMind等GFMs优于UNet，分别达到59.53、56.62和56.60的mIoU分数，而UNet为56.60。当完全微调GFMs时，我们观察到不同数据集和模型之间的性能不一致。然而，调整学习率并配合微调显著提升了GFM性能。例如，在两个代表性数据集（GLID和CaFFe）上的评估显示平均相对提升为12.77%。尽管预训练数据中冰冻圈表示极少，GFMs仍展现出显著的领域适应能力，并在各项任务中产生有意义的结果。基于我们的发现，我们建议通过超参数优化进行编码器微调以获得最佳性能，而在用户需要快速结果且无需大量实验时使用冻结编码器。（GitHub：https://github.com/Sk-2103/Cryo-Bench）

English abstract

Geo-Foundation Models (GFMs) have been evaluated across diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, while using frozen encoders when users need quick results without extensive experimentation. (GitHub: https://github.com/Sk-2103/Cryo-Bench)
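For readers unfamiliar with the two quantities reported above, the sketch below shows one common way to compute mean intersection-over-union (mIoU) over flattened label maps and a relative improvement in percent. The averaging convention (skipping classes absent from both prediction and ground truth) and the function names are assumptions; the abstract does not state how Cryo-Bench aggregates scores.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, given flat lists of integer class labels.

    Classes that appear in neither pred nor target are skipped
    (an assumed convention; benchmarks differ on this point).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

def relative_improvement(baseline, tuned):
    """Relative change from baseline to tuned, in percent."""
    return (tuned - baseline) / baseline * 100.0

score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
gain = relative_improvement(50.0, 55.0)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is about 0.583; moving from 50.0 to 55.0 is a 10% relative improvement.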


2511.19945 2026-06-03 cs.CV

Low-Resolution Editing is All You Need for High-Resolution Editing

低分辨率编辑足以实现高分辨率编辑

Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han

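Affiliations: ECE & IPAI, Seoul National University; University of Wisconsin-Madison

AI summary: Proposes a test-time optimization framework that enables high-resolution image editing through patch-wise optimization, detail transfer, and a synchronization strategy.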

Comments CVPR 2026. Project website: https://hleephilip.github.io/ScaleEdit

2603.00667 2026-06-03 cs.CV

Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

像病理学家一样：组织感知的全切片图像推理

Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang

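Affiliations: Stony Brook University; Mayo Clinic; Harvard Medical School; Stanford University

AI summary: Proposes HistoSelect, a question-guided, tissue-aware, coarse-to-fine retrieval framework that identifies relevant tissue regions and selects the most informative patches, cutting visual tokens by 70% on average while improving pathology question-answering accuracy.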

Comments 14 pages, 8 figures. Accepted by CVPR'26

AI Chinese abstract

近年来，计算病理学在领域特定图像编码器以及使用视觉-语言模型回答疾病自然语言问题的兴趣推动下迅速发展。然而，病理问答背后的核心问题仍未解决，因为一张千兆像素的切片包含的信息远多于给定问题所需。病理学家通过广泛扫描并根据临床问题选择性放大，自然地处理组织和形态复杂性。相比之下，当前模型依赖于均匀补丁采样或宽注意力图，常常平等关注不相关区域而忽略关键视觉证据。在这项工作中，我们试图使模型更接近人类实际检查切片的方式。我们提出了一个问题引导、组织感知、由粗到细的检索框架HistoSelect，它由两个关键组件组成：一个识别问题相关组织区域的组采样器，以及一个在这些区域内检索最具信息量补丁的补丁选择器。通过仅选择最具信息量的补丁，我们的方法显著提高了效率：平均减少70%的视觉标记使用，同时提高了三个病理QA任务的准确性。在356,000个问答对上评估，我们的方法优于现有方法，并产生基于可解释、与病理学家一致的区域的答案。我们的结果表明，将类人搜索和注意力模式引入WSI推理是构建实用且可靠的病理VLM的一个有前景的方向。代码可在https://github.com/winston52/HistoSelect获取。

English abstract

Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs. Code is available at https://github.com/winston52/HistoSelect.


2603.19250 2026-06-03 cs.CL

Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams

结构线索能否拯救LLM？评估大规模文档流中的语言模型

Yukyung Lee, Yebin Lim, Woojun Jung, Wonjun Choi, Susik Yoon

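Affiliations: Boston University; Korea University

AI summary: Introduces StreamBench, a benchmark that evaluates language models on document streams mixing multiple events through topic clustering, temporal question answering, and summarization. Structural cues improve clustering and temporal QA, but temporal reasoning remains a challenge.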

Comments KDD 2026

AI Chinese abstract

评估流式环境中的语言模型至关重要，但尚未充分探索。现有基准要么关注单个复杂事件，要么为每个查询提供精心策划的输入，并且未评估模型在多个并发事件混合在同一文档流中产生的冲突下的表现。我们引入了StreamBench，这是一个基于2016年和2025年主要新闻故事构建的基准，包含605个事件和15,354个文档，涵盖三个任务：主题聚类、时序问答和摘要。为了诊断模型失败的原因，我们比较了有无结构线索（按事件组织关键事实）的性能。我们发现，结构线索提高了聚类（最高+4.37%）和时序问答（最高+9.63%）的性能，帮助模型定位相关信息并区分不同事件。虽然时序推理仍然是当前LLM固有的开放挑战，但任务中的一致改进表明，结构线索是未来大规模文档流研究的一个有希望的方向。

English abstract

Evaluating language models in streaming environments is critical, yet underexplored. Existing benchmarks either focus on single complex events or provide curated inputs for each query, and do not evaluate models under the conflicts that arise when multiple concurrent events are mixed within the same document stream. We introduce StreamBench, a benchmark built from major news stories in 2016 and 2025, comprising 605 events and 15,354 documents across three tasks: Topic Clustering, Temporal Question Answering, and Summarization. To diagnose how models fail, we compare performance with and without structural cues, which organize key facts by event. We find that structural cues improve performance on clustering (up to +4.37%) and temporal QA (up to +9.63%), helping models locate relevant information and separate distinct events. While temporal reasoning remains an open challenge inherent to current LLMs, consistent gains across tasks show that structural cues are a promising direction for future work in massive document streams.


2603.18599 2026-06-03 cs.CV

SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

SJD-PAC：通过主动草稿和自适应延续加速推测性雅可比解码

Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

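Affiliations: Peking University; Huawei Technologies

AI summary: Proposes SJD-PAC, which raises the acceptance rate of Speculative Jacobi Decoding through a proactive drafting strategy and an adaptive continuation mechanism, giving lossless acceleration of text-to-image synthesis.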

Comments CVPR 2026

AI Chinese abstract

推测性雅可比解码（SJD）提供了一种无需草稿模型的方法来加速自回归文本到图像合成。然而，视觉生成的高熵特性导致复杂区域中草稿令牌接受率低，形成严重限制整体吞吐量的瓶颈。为了克服这一问题，我们引入了SJD-PAC，一个增强的SJD框架。首先，SJD-PAC采用主动草稿策略来提高这些具有挑战性的高熵区域的局部接受率。其次，我们引入了一种自适应延续机制，在初始拒绝后维持序列验证，无需完全重新采样。这些优化协同工作，显著增加了每步的平均接受长度，在严格保持目标分布的同时提升了推理速度。在标准文本到图像基准上的实验表明，SJD-PAC实现了$3.8\times$的加速，且图像质量无损。代码可在https://github.com/KangJialiang/SJD-PAC获取。

English abstract

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality. Code is available at https://github.com/KangJialiang/SJD-PAC.


2602.07768 2026-06-03 cs.CV cs.AI cs.LG cs.MM

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

PAND：面向提示的邻域蒸馏用于轻量级细粒度视觉分类

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

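Affiliations: arXiv

AI summary: Proposes PAND, a framework that transfers knowledge from large vision-language models to lightweight networks through prompt-aware semantic calibration and neighborhood-aware structural distillation, outperforming existing methods on fine-grained classification.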

Comments Accepted by ICIP2026

2512.10888 2026-06-03 cs.CV

PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

PubTables-v2: 一个新的用于全页和多页表格提取的大规模数据集

Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Amrit Ramesh, Maury Courtland

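Affiliations: Kensho Technologies

AI summary: To address the lack of annotated data for full-page and multi-page table extraction, creates the large-scale dataset PubTables-v2 and compares current frontier models with small models on tasks at different levels of context.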

Comments 28 pages, separated POTATR to its own paper, added frontier model results

AI Chinese abstract

表格提取（TE）是文档理解中的一个关键挑战。传统方法先检测表格，然后识别其结构。最近，人们对开发直接在全页或文档上下文中提取表格的方法（如视觉语言模型（VLM））的兴趣激增。然而，缺乏标注数据使得进展难以展示。为了解决这个问题，我们创建了一个新的大规模数据集PubTables-v2。PubTables-v2统一了各种周围上下文级别的TE，并且值得注意的是，它是第一个用于多页TE的基准。我们的评估显示，虽然当前前沿模型在最复杂的任务（全文档多页TE）上显著优于小模型（+0.354 GriTS_Con），但在较窄的任务（裁剪表格提取）上，通过针对性训练，这种差距可以被缩小甚至逆转（-0.056 GriTS_Con）。数据可在 https://huggingface.co/datasets/kensho/PubTables-v2 获取。代码和模型将发布。

English abstract

Table extraction (TE) is a key challenge in document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), to extract tables directly in their full page or document context. However, a lack of annotated data has made progress difficult to demonstrate. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 unifies TE across various levels of surrounding context and, notably, is the first benchmark for multi-page TE. Our evaluations reveal that while current frontier models strongly outperform ($+0.354\ \textrm{GriTS}_\textrm{Con}$) small models on the most complex task (full-document multi-page TE), this gap can be closed or even reversed ($-0.056\ \textrm{GriTS}_\textrm{Con}$) on narrower tasks (cropped table extraction) with targeted training. Data is available at https://huggingface.co/datasets/kensho/PubTables-v2. Code and models will be released.


2603.14377 2026-06-03 cs.CV

LoCAtion: Long-time Collaborative Attention Framework for High Dynamic Range Video Reconstruction

LoCAtion: 用于高动态范围视频重建的长时间协同注意力框架

Qianyu Zhang, Bolun Zheng, Lingyu Zhu, Aiai Huang, Zongpeng Li, Shiqi Wang

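Affiliations: School of Automation, Hangzhou Dianzi University; Department of Computer Science, City University of Hong Kong

AI summary: Proposes LoCAtion, a framework that decouples alignment from fusion and uses collaborative attention with a global sequence solver to reconstruct HDR video without explicit alignment, reaching state-of-the-art visual quality and temporal stability.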

AI Chinese abstract

主流的高动态范围（HDR）视频重建方法从根本上陷入了一种脆弱的对齐与融合范式。虽然显式空间对齐能在受控环境中成功恢复细节，但在无约束的动态场景中却成为严重瓶颈。通过强制对不可预测的运动和不同曝光进行刚性对齐，这些方法不可避免地会将配准误差转化为严重的鬼影伪影和时间闪烁。在本文中，我们重新思考了这一传统前提。认识到显式对齐本质上易受现实世界复杂性的影响，我们提出了LoCAtion，一种长时间协同注意力框架，将HDR视频生成从脆弱的空间扭曲任务重新构建为鲁棒的、无需对齐的协同特征路由问题。在这一新公式的指导下，我们的架构显式地解耦了高度纠缠的重建任务。我们不是努力刚性扭曲相邻帧，而是将场景锚定在一个连续的中等曝光骨干上，并利用协同注意力从未对齐的曝光中动态获取和注入可靠辐照度线索。此外，我们引入了一个学习的全局序列求解器。通过利用双向上下文和长时时间建模，它在整个序列中传播校正信号和结构特征，固有地强制执行全视频一致性并消除抖动。大量实验表明，LoCAtion在视觉质量和时间稳定性上达到了最先进水平，在准确性和计算效率之间提供了极具竞争力的平衡。

English abstract

Prevailing High Dynamic Range (HDR) video reconstruction methods are fundamentally trapped in a fragile alignment-and-fusion paradigm. While explicit spatial alignment can successfully recover fine details in controlled environments, it becomes a severe bottleneck in unconstrained dynamic scenes. By forcing rigid alignment across unpredictable motions and varying exposures, these methods inevitably translate registration errors into severe ghosting artifacts and temporal flickering. In this paper, we rethink this conventional prerequisite. Recognizing that explicit alignment is inherently vulnerable to real-world complexities, we propose LoCAtion, a Long-time Collaborative Attention framework that reformulates HDR video generation from a fragile spatial warping task into a robust, alignment-free collaborative feature routing problem. Guided by this new formulation, our architecture explicitly decouples the highly entangled reconstruction task. Rather than struggling to rigidly warp neighboring frames, we anchor the scene on a continuous medium-exposure backbone and utilize collaborative attention to dynamically harvest and inject reliable irradiance cues from unaligned exposures. Furthermore, we introduce a learned global sequence solver. By leveraging bidirectional context and long-range temporal modeling, it propagates corrective signals and structural features across the entire sequence, inherently enforcing whole-video coherence and eliminating jitter. Extensive experiments demonstrate that LoCAtion achieves state-of-the-art visual quality and temporal stability, offering a highly competitive balance between accuracy and computational efficiency.


2512.09106 2026-06-03 cs.LG

Learning Unmasking Policies for Diffusion Language Models

学习扩散语言模型的去掩码策略

Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, João Monteiro, Victor Turrisi, Jason Ramapuram, Marco Cuturi

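Affiliations: Apple; University of Amsterdam; Massachusetts Institute of Technology

AI summary: For the unmasking step in diffusion language model sampling, trains a lightweight policy with reinforcement learning to replace hand-tuned heuristics, matching their performance while improving robustness.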

Comments V4: Accepted as an oral spotlight at ICML 2026

AI Chinese abstract

扩散（大型）语言模型（dLLMs）现在在许多任务上与自回归模型的下游性能相匹配，同时有望在推理过程中更高效。dLLMs的一个关键设计方面是采样过程，该过程选择在每个扩散步骤中要去掩码哪些标记。实际上，最近的研究发现，与随机去掩码相比，诸如置信度阈值之类的启发式策略提高了样本质量和标记吞吐量。然而，此类启发式方法存在缺点：它们需要手动调整，并且我们观察到它们的性能随着块大小的增加而下降。在这项工作中，我们提出使用强化学习来训练采样过程。具体来说，我们将掩码扩散采样形式化为一个马尔可夫决策过程，其中dLLM充当环境，并提出了一个基于单层transformer的轻量级策略，该策略将dLLM标记置信度映射到去掩码决策。我们的实验表明，当与半自回归（块）生成结合时，这些训练后的策略与最先进的启发式方法的性能相匹配，同时在完全扩散设置中优于它们。

English abstract

Diffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One critical design aspect of dLLMs is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger block sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation, while outperforming them in the full-diffusion setting.
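The confidence-thresholding heuristic that the abstract uses as its baseline can be sketched as a single selection step. This is an illustrative sketch, not the paper's learned policy: the function name, the default threshold of 0.9, and the fallback to the single most confident position are all assumptions made so that the example always makes progress.

```python
def confidence_threshold_unmask(confidences, masked, threshold=0.9):
    """Pick which masked positions to unmask at this diffusion step.

    confidences: per-position top-token probability from the model.
    masked: set of positions that are still masked.
    Returns the masked positions with confidence >= threshold, in order.
    If none qualifies, falls back to the single most confident masked
    position (an assumed rule, so that sampling never stalls).
    """
    if not masked:
        return []
    chosen = sorted(p for p in masked if confidences[p] >= threshold)
    if not chosen:
        chosen = [max(masked, key=lambda p: confidences[p])]
    return chosen

confidences = [0.95, 0.50, 0.99, 0.20]
step_one = confidence_threshold_unmask(confidences, masked={0, 1, 3})
step_two = confidence_threshold_unmask(confidences, masked={1, 3})
```

The threshold is the hand-tuned knob the abstract refers to. The proposed approach replaces this fixed rule with a small trained policy that maps the same confidences to unmasking decisions.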


2603.07664 2026-06-03 cs.CV cs.AI cs.GR

Ref-DGS: Reflective Dual Gaussian Splatting

Ref-DGS: 反射性双高斯泼溅

Ningjing Fan, Yiqun Wang, Dong-Ming Yan, Peter Wonka

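Affiliations: Chongqing University; MAIS, Institute of Automation, Chinese Academy of Sciences and UCAS; King Abdullah University of Science and Technology (KAUST)

AI summary: Proposes Ref-DGS, which uses a dual Gaussian scene representation and a physically-aware specular adaptive mixing shader to decouple surface reconstruction from specular reflection inside an efficient rasterization pipeline, achieving state-of-the-art novel view synthesis on reflective scenes while training much faster than ray-based methods.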

Comments Project page: https://njfan.github.io/Ref-DGS/

AI Chinese abstract

反射外观，尤其是强烈的近场镜面反射，对精确的表面重建和新视图合成构成了根本性挑战。现有的高斯泼溅方法要么无法建模近场镜面反射，要么依赖显式光线追踪而计算成本高昂。我们提出了Ref-DGS，一个反射性双高斯泼溅框架，通过在高效光栅化管线中将表面重建与镜面反射解耦来解决这一权衡。Ref-DGS引入了一种双高斯场景表示，由几何高斯和互补的局部反射高斯组成，无需显式光线追踪即可捕捉近场镜面交互，并包含一个全局环境反射场用于建模远场镜面反射。为了预测镜面辐射，我们进一步提出了一种轻量级的、物理感知的镜面自适应混合着色器，融合全局和局部镜面特征。实验表明，Ref-DGS在反射场景上达到了最先进的性能，同时训练速度显著快于基于光线的高斯方法。

英文摘要

Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware specular adaptive mixing shader that fuses global and local specular features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.

2511.04469 2026-06-03 cs.LG physics.data-an q-fin.CP stat.ME stat.OT

Towards Causal Market Simulators

迈向因果市场模拟器

Dennis Thumm, Luis Ontaneda Mijares

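Affiliations * National University of Singapore; Veracruz Mexico

AI summary: Proposes the Time-series Neural Causal Model VAE (TNCM-VAE), which combines variational autoencoders with structural causal models to generate counterfactual financial time series that preserve temporal dependencies and causal relationships, reaching L1 distances as low as 0.03-0.10 on synthetic data.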

Comments ICAIF 2025 Workshop on Rethinking Financial Time-Series

AI中文摘要

使用深度生成模型的市场生成器在合成金融数据生成方面显示出前景，但现有方法缺乏反事实分析和风险评估所必需的因果推理能力。我们提出了一种时间序列神经因果模型VAE（TNCM-VAE），它将变分自编码器与结构因果模型相结合，以生成反事实金融时间序列，同时保留时间依赖性和因果关系。我们的方法通过解码器架构中的有向无环图施加因果约束，并使用因果Wasserstein距离进行训练。我们在受Ornstein-Uhlenbeck过程启发的合成自回归模型上验证了该方法，在反事实概率估计中表现出优越性能，与真实值相比L1距离低至0.03-0.10。该模型通过生成尊重潜在因果机制的合理反事实市场轨迹，实现了金融压力测试、情景分析和增强回测。

英文摘要

Market generators using deep generative models have shown promise for synthetic financial data generation, but existing approaches lack causal reasoning capabilities essential for counterfactual analysis and risk assessment. We propose a Time-series Neural Causal Model VAE (TNCM-VAE) that combines variational autoencoders with structural causal models to generate counterfactual financial time series while preserving both temporal dependencies and causal relationships. Our approach enforces causal constraints through directed acyclic graphs in the decoder architecture and employs the causal Wasserstein distance for training. We validate our method on synthetic autoregressive models inspired by the Ornstein-Uhlenbeck process, demonstrating superior performance in counterfactual probability estimation with L1 distances as low as 0.03-0.10 compared to ground truth. The model enables financial stress testing, scenario analysis, and enhanced backtesting by generating plausible counterfactual market trajectories that respect underlying causal mechanisms.
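
As a rough illustration of the validation setting described in the abstract, the following Python sketch simulates a discretized Ornstein-Uhlenbeck process and computes an L1 distance between two vectors of probability estimates. It is a sketch under assumptions: the Euler-Maruyama discretization, the function names `simulate_ou` and `l1_distance`, and the plain sum-of-absolute-differences form of the L1 distance are illustrative choices, not details confirmed by the abstract.

```python
import random
from typing import List


def simulate_ou(x0: float, theta: float, mu: float, sigma: float,
                dt: float, steps: int, seed: int = 0) -> List[float]:
    """Euler-Maruyama discretization of dX = theta * (mu - X) dt + sigma dW."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        x = path[-1]
        noise = rng.gauss(0.0, 1.0)
        # Mean-reverting drift plus Gaussian noise scaled by sqrt(dt).
        path.append(x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * noise)
    return path


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

With `sigma = 0.0` the path decays deterministically toward `mu`, which gives a simple sanity check before adding noise.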

2603.05290 2026-06-03 cs.AI

X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes

X-RAY: 通过形式化与校准探针映射大语言模型推理能力

Tianxi Gao, Yufan Cai, Yusi Yuan, Jin Song Dong

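Affiliations * National University of Singapore

AI summary: Proposes X-RAY, a system that uses formal tools to generate calibrated probes with controlled structure and, by analyzing properties such as constraint interaction, reasoning depth, and solution-space geometry, reveals an asymmetry in LLM reasoning between constraint refinement and solution-space restructuring.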

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3818029

AI中文摘要

大型语言模型（LLM）取得了有前景的性能，但其推理能力仍未被充分理解。现有评估主要强调任务级准确性，常常将模式匹配与推理能力混为一谈。我们提出了X-RAY，一个可解释的推理分析系统，通过校准的、形式化验证的探针来映射LLM的推理能力。我们将推理能力建模为可提取的结构的函数，通过形式化属性（如约束交互、推理深度和解空间几何）进行操作化。X-RAY通过形式化工具生成具有受控结构变化的探针，通过形式化校准和验证实现对增量结构信息的精确隔离。我们在数学、物理和化学领域从初级到高级的问题上评估了最先进的LLM。我们的分析揭示了LLM推理中的系统性不对称：模型对约束细化（即附加条件缩小现有解空间）相对稳健，但在解空间重构（即修改改变解流形的底层结构形式）下性能急剧下降。此外，校准的形式化探针能够区分在标准基准上看似无法区分的模型，并揭示出结构上可解释而非模糊的失败模式。除了评估，我们的框架无污染，并支持推理模型的训练和测试。

英文摘要

Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structure, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.

2603.04956 2026-06-03 cs.LG cs.IT math.IT

WaterSIC: Information-Theoretically (Near) Optimal Linear Layer Quantization

WaterSIC: 信息论（近乎）最优的线性层量化

Egor Lifar, Semyon Savkin, Or Ordentlich, Yury Polyanskiy

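Affiliations * University of Illinois at Urbana-Champaign; Stanford University; University of California, Berkeley; University of Texas at Austin

AI summary: Addresses low-precision quantization of dense linear layers with the WaterSIC algorithm, which assigns different quantization rates to different columns of the weight matrix, comes within 0.255 bits of the information-theoretic limit, and achieves the best performance for 1-4 bit quantization on Llama and Qwen large language models.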

2603.03612 2026-06-03 cs.LG cs.CC cs.CL cs.FL

Why Are Linear RNNs More Parallelizable?

为什么线性RNN更易于并行化？

William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal

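Affiliations * GitHub

AI summary: By tightly connecting RNN types to standard complexity classes, the paper shows that linear RNNs (LRNNs) parallelize easily because they can be viewed as log-depth arithmetic circuits, whereas nonlinear RNNs face a barrier to parallelization because they can solve L-complete problems.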

Comments To appear at ICML 2026

AI中文摘要

社区越来越多地探索线性RNN（LRNN）作为语言模型，受其表达能力和并行化能力的驱动。虽然先前的工作确立了LRNN相对于Transformer的表达优势，但尚不清楚是什么使得LRNN——而非传统的非线性RNN——在实践中与Transformer一样易于并行化。我们通过提供RNN类型与标准复杂度类之间的紧密联系来回答这个问题。我们表明，LRNN可以看作是对数深度（有界扇入）算术电路，相对于Transformer所允许的对数深度布尔电路，这仅代表轻微深度开销。此外，我们表明非线性RNN可以解决$\mathsf{L}$-完全问题（甚至在多项式精度下解决$\mathsf{P}$-完全问题），揭示了将它们与Transformer一样高效并行化的根本障碍。我们的理论还识别了近期流行LRNN变体之间的细粒度表达差异：置换对角LRNN是$\mathsf{NC}^1$-完全的，而对角加低秩LRNN更具表达性（$\mathsf{PNC}^1$-完全）。我们通过将每种RNN类型与它可以模拟的相应自动机理论模型相关联，提供了进一步见解。总之，我们的结果揭示了非线性RNN与不同LRNN变体之间的基本权衡，为设计在表达性和并行性之间实现最佳平衡的LLM架构提供了基础。

英文摘要

The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
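
The intuition behind the log-depth claim can be sketched with a parallel prefix scan over a scalar linear recurrence h_t = a_t * h_{t-1} + b_t. This Python example is a simplified illustration and not the paper's construction: it assumes scalar (diagonal) state, and the names `compose`, `scan_log_depth`, and `linear_rnn_states` are hypothetical. The key fact is that composing two affine maps yields another affine map, and composition is associative, so all prefixes can be combined in about log2(n) rounds.

```python
from typing import List, Tuple

Affine = Tuple[float, float]  # (a, b) represents the map h -> a * h + b


def compose(first: Affine, second: Affine) -> Affine:
    """Apply `first`, then `second`; the result is again an affine map."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def scan_log_depth(steps: List[Affine]) -> List[Affine]:
    """Inclusive prefix composition in ceil(log2(n)) rounds (Hillis-Steele scan).

    Within a round every position is updated independently, so the round
    could run in parallel; here it is simulated with a list comprehension.
    """
    prefix = list(steps)
    offset = 1
    while offset < len(prefix):
        prefix = [
            compose(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
    return prefix


def linear_rnn_states(steps: List[Affine], h0: float) -> List[float]:
    """States h_1..h_n of h_t = a_t * h_{t-1} + b_t, computed via the scan."""
    return [a * h0 + b for a, b in scan_log_depth(steps)]
```

A nonlinear update such as h_t = tanh(a_t * h_{t-1} + b_t) has no comparably compact closed form under composition, which is one informal way to read the barrier the abstract describes.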

2510.16462 2026-06-03 cs.LG stat.ML

Buzz, Choose, Forget: A Meta-Bandit Framework for Bee-Like Decision Making

Buzz, Choose, Forget: 一种类蜂决策的元老虎机框架

Emmanuelle Claeys, Elena Kerjean, Jean-Michel Loubes

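Affiliations * University of Toulouse, IRIT; University of Toulouse, CBI; Regalia Team, INRIA University of Toulouse, France

AI summary: Proposes MAYA, a sequential imitation learning model based on multi-armed bandits that simulates bees' limited memory through a time window τ; it outperforms baselines on real, simulated, and complementary datasets and offers interpretability and trajectory inference.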

2603.03480 2026-06-03 cs.LG stat.ML

Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

在线强化学习中延迟观测的极小化最优策略

Harin Lee, Kevin Jamieson

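Affiliations * University of California, Berkeley

AI summary: For reinforcement learning with delayed state observations, proposes a strategy that combines the augmentation method with an upper confidence bound algorithm and attains a minimax-optimal regret bound on tabular MDPs.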

Comments ICML camera ready version

AI中文摘要

我们研究具有延迟状态观测的强化学习，其中智能体在随机数量的时间步后观察到当前状态。我们提出了一种结合增广方法和上置信界方法的算法。对于表格型马尔可夫决策过程（MDP），我们推导出遗憾界为$\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$，其中$S$和$A$是状态和动作空间的基数，$H$是时间跨度，$K$是回合数，$D_{\max}$是最大延迟长度。我们还提供了匹配的下界（对数因子除外），表明我们的方法是最优的。我们的分析框架将这个问题表述为一类更广泛的MDP的特例，其中它们的转移动态分解为已知部分和未知但结构化的部分。我们为这个抽象设定建立了通用结果，这可能具有独立的研究价值。

英文摘要

We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs, where their transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.

2602.00423 2026-06-03 cs.LG

scBatchProx: Federated-Inspired Refinement for Stable Cell-Type Discriminability under Heterogeneous Batch Compositions

scBatchProx：异质性批次组成下稳定细胞类型可区分性的联邦启发式精炼

Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

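Affiliations * National Institute of Health (NIH)

AI summary: Proposes scBatchProx, a lightweight post-hoc refinement method that uses federated-learning-inspired optimization and conservative regularization to stabilize single-cell latent embeddings and improve cell-type classification under heterogeneous batches.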

AI中文摘要

单细胞整合工作流通常构建低维细胞嵌入，然后使用后处理方法减少批次效应。当细胞类型组成在不同批次间变化，某些群体在特定批次中代表性不足或缺失时，这种精炼过程可能变得不稳定。在动态单细胞数据系统中，新获取的批次可能改变技术条件和细胞类型组成，问题变得更加严重。这种不稳定性会降低下游细胞类型分类性能，并削弱在失衡扰动下的稳定性。我们引入scBatchProx，一种轻量级后处理方法，用于在这些异质和不断变化的环境中稳定单细胞潜在嵌入。scBatchProx直接操作预计算嵌入，并将每个批次或研究视为联邦启发优化过程中的客户端。批次条件FiLM适配器学习局部潜在更新，而近端和身份保持正则化使这些更新保持保守。在多批次和跨研究单细胞数据集上的实验表明，scBatchProx在不同上游嵌入上改善了下游细胞类型分类。在受控失衡扰动中，当选定群体从一个批次中降采样或移除时，scBatchProx维持更稳定的受影响细胞类型F1分数。在累积重训练和持续整合设置中，随着新数据集随时间到达，scBatchProx保持有效。这些结果共同表明，保守的联邦启发式精炼有助于在批次组成随数据集和时间变化时维持稳定的单细胞嵌入。

英文摘要

Single-cell integration workflows often construct low-dimensional cell embeddings and then refine them with post-hoc methods to reduce batch effects. This refinement process can become unstable when cell-type compositions vary across batches, with some populations underrepresented or absent in particular batches. The problem becomes more consequential in dynamic single-cell data systems, where newly acquired batches can change both technical conditions and cell-type composition. Such instability can reduce downstream cell-type classification performance and weaken stability under imbalance perturbations. We introduce scBatchProx, a lightweight post-hoc refinement method for stabilizing single-cell latent embeddings in these heterogeneous and evolving settings. scBatchProx operates directly on precomputed embeddings and treats each batch or study as a client in a federated-inspired optimization procedure. A batch-conditioned FiLM adapter learns local latent updates, while proximal and identity-preserving regularization keep these updates conservative. Experiments on multi-batch and cross-study single-cell datasets show that scBatchProx improves downstream cell-type classification across different upstream embeddings. In controlled imbalance perturbations, scBatchProx maintains more stable affected-cell-type F1 when selected populations are downsampled or ablated from one batch. In cumulative retraining and continual integration settings, scBatchProx remains effective as new datasets arrive over time. Together, these results suggest that conservative, federated-inspired refinement can help maintain stable single-cell embeddings as batch compositions change across datasets and over time.

2512.03005 2026-06-03 cs.AI

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

从审核到调解：LLMs能否充当在线论战中的调解员？

Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manuel Sandoval, Deborah Hall, Yasin Silva, Huan Liu

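Affiliations * Arizona State University; Loyola University Chicago

AI summary: Explores whether large language models (LLMs) can go beyond content moderation and act as mediators that defuse online conflicts by judging the fairness and emotional dynamics of a conversation and generating empathetic, de-escalatory messages; experiments show that API-based models outperform open-source models in reasoning and intervention alignment.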

Comments Accepted by PAKDD 2026 special session on Data Science: Foundations and Applications

AI中文摘要

大型语言模型（LLMs）的快速发展为人工智能向善应用开辟了新可能性。随着LLMs越来越多地介入在线交流，它们培养共情和建设性对话的潜力成为负责任AI研究的重要前沿。本研究探索LLMs是否不仅能作为检测有害内容的审核员，还能作为能够理解和缓和在线冲突的调解员。我们的框架将调解分解为两个子任务：判断，即LLM评估对话的公平性和情感动态；引导，即生成共情的、缓和性的信息以引导参与者走向解决。为评估调解质量，我们构建了一个大型基于Reddit的数据集，并提出了一个结合基于原则的评分、用户模拟和人工比较的多阶段评估流程。实验表明，API模型在调解时的推理和干预一致性方面优于开源模型。我们的发现突显了当前LLMs作为新兴在线社会调解代理的潜力和局限性。

英文摘要

The rapid advancement of large language models (LLMs) has opened new possibilities for AI-for-good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but also as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment during mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

2505.23725 2026-06-03 cs.LG

MuLoCo: Muon is a practical inner optimizer for DiLoCo

MuLoCo: Muon 是 DiLoCo 的实用内部优化器

Benjamin Thérien, Xiaolong Huang, Aaron Defazio, Irina Rish, Eugene Belilovsky

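Affiliations * FAIR at Meta; Mila; Université de Montréal; Concordia University

AI summary: Proposes MuLoCo, which uses Muon as the inner optimizer of DiLoCo; by producing more directionally accurate pseudogradients, it improves large language model training with multiple workers and is compatible with quantization, streaming, and long synchronization intervals.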

AI中文摘要

DiLoCo 是一个强大的大语言模型（LLM）训练框架，能够在网络约束下实现更大的最优批大小和更高的加速器利用率。然而，研究表明 DiLoCo 的性能会随着工作节点数（K）的增加而下降（Charles 等人，2025）。在这项工作中，我们认为 DiLoCo 行为中一个相关但常被忽视的因素是内部优化器的选择，它塑造了外部优化器使用的伪梯度。鉴于最近 Muon 相对于 AdamW 在数据并行（DP）训练中的成功，我们研究了 Muon 的归一化优化器步骤如何影响伪梯度的质量。我们发现，相对于 AdamW，随着工作节点数（K）的增加，Muon 产生方向更正确的伪梯度。在我们预训练语言模型的实验中，我们对 150M、416M、914M、1.76B 和 3.1B 模型的 DiLoCo、MuLoCo、AdamW DP 和 Muon DP 进行了广泛的超参数调优。在所有规模上一致地发现，当 K≥1 时，MuLoCo（Muon 内部优化器 DiLoCo）在绝对性能上优于 DiLoCo，并且当 K>2 时，相对于它们各自的数据并行基线，MuLoCo 优于 DiLoCo，同时兼容量化、流式处理和长同步间隔。当 K=1 时，我们发现 MuLoCo 甚至可以优于数据并行黄金标准，同时具有更大的临界批大小。最后，我们将最优超参数外推到 15B 规模，并使用 K=1 和 K=16 个工作节点训练每个方法（共六种）的模型。我们发现，在此规模下，K=16 的 MuLoCo 几乎匹配单工作节点性能，而 K=1 的 MuLoCo 在使用更大的 16M token 批大小时匹配最佳基线性能。

英文摘要

DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and increased accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers ($K$) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data parallel (DP) training, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as $K$ increases. In our experiments pre-training language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with $K \geq 1$ workers, MuLoCo (Muon inner optimizer DiLoCo) achieves superior performance to DiLoCo in absolute terms and for $K > 2$ it outperforms DiLoCo relative to their data parallel baselines, while being compatible with quantization, streaming, and long synchronization intervals. At $K = 1$, we find that MuLoCo can even outperform the data-parallel gold standard while having larger critical batch sizes. Finally, we extrapolate optimal hyperparameters to 15B scale and train a model with each method (six in total) using $K = 1$ and $K = 16$ workers. We find that $K = 16$ MuLoCo nearly matches single-worker performance at this scale, while MuLoCo $K = 1$ matches the best performing baseline while using a much larger 16M-token batch size.
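
The role of the pseudogradient can be sketched in a few lines of Python. This is a toy illustration under assumptions, not the authors' code: it assumes the pseudogradient is the global parameters minus the average of the workers' locally trained parameters, uses plain SGD as the outer step for simplicity, and treats `inner_train` as a hypothetical callback standing in for the inner optimizer (AdamW in DiLoCo, Muon in MuLoCo).

```python
from typing import Callable, List

Params = List[float]


def outer_round(
    global_params: Params,
    inner_train: Callable[[Params, int], Params],
    num_workers: int,
    outer_lr: float = 1.0,
) -> Params:
    """One communication round: K independent inner runs, then one outer step."""
    # Each worker starts from a copy of the same global parameters.
    worker_params = [inner_train(list(global_params), k) for k in range(num_workers)]
    # Pseudogradient: global parameters minus the average of the workers' results.
    pseudograd = [
        g - sum(w[j] for w in worker_params) / num_workers
        for j, g in enumerate(global_params)
    ]
    # Outer step: plain SGD on the pseudogradient (a simplifying assumption).
    return [g - outer_lr * d for g, d in zip(global_params, pseudograd)]
```

Because the inner optimizer determines where each worker ends up, it directly determines the direction of `pseudograd`, which is the quantity the abstract argues Muon improves as $K$ grows.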

2602.20217 2026-06-03 cs.LG cs.AI

KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem

KnapSpec: 通过自适应层选择作为背包问题的自推测解码

Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

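Affiliations * KAIST

AI summary: Proposes KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem; it decouples Attention and MLP layers, models their hardware-specific latencies, and uses a parallel dynamic programming algorithm to adaptively determine the optimal draft configuration, maximizing token throughput.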

Comments Accepted to ICML 2026

AI中文摘要

自推测解码（SSD）通过跳过层来创建高效的草稿模型，从而加速LLM推理，但现有方法通常依赖静态启发式，忽略了长上下文场景中注意力的动态计算开销。我们提出KnapSpec，一种无需训练的框架，将草稿模型选择重新表述为背包问题，以最大化每时间令牌吞吐量。通过解耦注意力与MLP层，并将其硬件特定延迟建模为上下文长度的函数，KnapSpec通过并行动态规划算法自适应地即时识别最优草稿配置。此外，我们提供了首个严格的理论分析，建立了隐藏状态之间的余弦相似度作为令牌接受率的数学上合理的代理。这一基础使得我们的方法在导航现实世界硬件的动态瓶颈时，能够保持高草稿保真度。我们在Qwen3和Llama3上的实验表明，KnapSpec始终优于最先进的SSD基线，在各种基准测试中实现了高达1.47倍的墙钟加速。我们的即插即用方法确保了长序列的高效推理，无需额外训练或损害目标模型的输出分布。

英文摘要

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art SSD baselines, achieving up to 1.47x wall-clock speedup across various benchmarks. Our plug-and-play approach ensures high-speed inference for long sequences without requiring additional training or compromising the target model's output distribution.
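
For readers unfamiliar with the knapsack framing, the following Python sketch shows the classic 0/1 knapsack dynamic program applied to a layer-selection toy problem. It is not KnapSpec's algorithm: the integer latency costs, the per-layer `values` scores, and the function name `knapsack_select` are hypothetical, and the sequential table fill shown here stands in for the parallel dynamic programming the abstract mentions.

```python
from typing import List, Tuple


def knapsack_select(costs: List[int], values: List[float], budget: int) -> Tuple[float, List[int]]:
    """Classic 0/1 knapsack: pick item indices maximizing total value within `budget`."""
    n = len(costs)
    # best[i][b] = max value using the first i items with total cost <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c, v = costs[i - 1], values[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v
    # Backtrack to recover which items were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return best[n][budget], sorted(chosen)
```

In this toy reading, each item is a layer to keep in the draft model, its cost is a discretized latency, and the budget caps the draft's per-token time; a real system would re-estimate the costs as context length changes.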

2602.16666 2026-06-03 cs.AI cs.CY cs.LG

Towards a Science of AI Agent Reliability

迈向AI代理可靠性的科学

Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

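Affiliations * University of California, Berkeley

AI summary: Proposes twelve concrete metrics that decompose AI agent reliability along four dimensions (consistency, robustness, predictability, and safety) and shows experimentally that capability gains have yielded only small improvements in reliability.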

Comments Accepted at ICML 2026. Interactive dashboard available at: https://hal.cs.princeton.edu/reliability

AI中文摘要

AI代理越来越多地被部署来执行重要任务。虽然标准基准测试上的准确率分数不断提高表明进展迅速，但许多代理在实践中仍然持续失败。这种差异凸显了当前评估的一个根本局限性：将代理行为压缩为单一成功指标会掩盖关键的操作缺陷。值得注意的是，它忽略了代理是否在不同运行中表现一致、能否承受扰动、是否可预测地失败，或者错误严重性是否有界。基于安全关键工程，我们通过提出十二个具体指标来提供全面的性能概况，这些指标将代理可靠性分解为四个关键维度：一致性、鲁棒性、可预测性和安全性。在两个互补基准测试上评估15个模型，我们发现最近的能力提升仅带来了可靠性的小幅改进。通过暴露这些持续的局限性，我们的指标补充了传统评估，同时提供了推理代理如何表现、退化和失败的工具。

英文摘要

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 15 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
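
The point that a single success metric can hide inconsistency is easy to show with a toy example. The Python sketch below is purely illustrative: `all_runs_agree_rate` is a hypothetical consistency measure invented for this example and is not claimed to be one of the paper's twelve metrics.

```python
from typing import Dict, List


def mean_accuracy(runs: Dict[str, List[bool]]) -> float:
    """Average success rate over all tasks and runs."""
    outcomes = [ok for results in runs.values() for ok in results]
    return sum(outcomes) / len(outcomes)


def all_runs_agree_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose repeated runs all succeed or all fail."""
    agree = [len(set(results)) == 1 for results in runs.values()]
    return sum(agree) / len(agree)


# Two hypothetical agents, each run twice on two tasks.
agent_a = {"t1": [True, True], "t2": [False, False]}
agent_b = {"t1": [True, False], "t2": [False, True]}
```

Both agents have the same mean accuracy, yet agent A behaves identically across runs while agent B never does, which is the kind of difference a single accuracy number cannot express.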

2602.18084 2026-06-03 cs.LG

Balancing Symmetry and Efficiency in Graph Flow Matching

平衡图流匹配中的对称性与效率

Benjamin Honoré, Alba Carballo-Castro, Yiming Qin, Pascal Frossard

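Affiliations * LTS4, EPFL, Lausanne, Switzerland

AI summary: Using a controllable symmetry modulation scheme, studies the trade-off between the computational cost of strict equivariance and convergence speed in graph generative models, finding that properly modulating symmetry can accelerate convergence while delaying overfitting.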

Comments 15 pages, 11 figures

AI中文摘要

等变性是图生成模型的核心，因为它确保模型尊重图的置换对称性。然而，严格的等变性由于增加了架构约束而提高了计算成本，并且由于模型必须在大量可能的节点置换空间上保持一致而可能减慢收敛速度。我们研究了图生成模型中的这种权衡。具体来说，我们从等变离散流匹配模型出发，在训练过程中通过基于正弦位置编码和节点置换的可控对称调制方案来放松其等变性。实验首先表明，对称性破缺可以通过提供更简单的学习信号来加速早期训练，但代价是鼓励捷径解决方案，可能导致过拟合，即模型重复生成训练集的重复图。相反，适当调节对称性信号可以延迟过拟合，同时加速收敛，使模型在基线训练周期的19%内达到更强的性能。

英文摘要

Equivariance is central to graph generative models, as it ensures the model respects the permutation symmetry of graphs. However, strict equivariance can increase computational cost due to added architectural constraints, and can slow down convergence because the model must be consistent across a large space of possible node permutations. We study this trade-off for graph generative models. Specifically, we start from an equivariant discrete flow-matching model, and relax its equivariance during training via a controllable symmetry modulation scheme based on sinusoidal positional encodings and node permutations. Experiments first show that symmetry-breaking can accelerate early training by providing an easier learning signal, but at the expense of encouraging shortcut solutions that can cause overfitting, where the model repeatedly generates graphs that are duplicates of the training set. In contrast, properly modulating the symmetry signal can delay overfitting while accelerating convergence, allowing the model to reach stronger performance with $19\%$ of the baseline training epochs.
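
One way to picture a controllable symmetry-breaking signal is to scale standard sinusoidal positional encodings attached to the nodes. The Python sketch below is an assumption-laden illustration, not the paper's scheme: the `strength` parameter and the function name `sinusoidal_encoding` are hypothetical, and the abstract does not specify how the modulation is actually parameterized or scheduled.

```python
import math
from typing import List


def sinusoidal_encoding(num_nodes: int, dim: int, strength: float = 1.0) -> List[List[float]]:
    """Standard sinusoidal positional encodings, scaled by `strength`.

    strength = 0.0 gives all-zero features (node order carries no signal);
    larger values inject more node-order information.
    """
    enc = []
    for pos in range(num_nodes):
        row = []
        for j in range(dim):
            # Even and odd dimensions share a frequency; even uses sin, odd uses cos.
            freq = 1.0 / (10000 ** ((j - j % 2) / dim))
            angle = pos * freq
            row.append(strength * (math.sin(angle) if j % 2 == 0 else math.cos(angle)))
        enc.append(row)
    return enc
```

Concatenating such features to node inputs makes the model sensitive to node ordering, and shrinking `strength` toward zero moves it back toward the fully equivariant setting.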

2502.08834 2026-06-03 cs.LG cs.AI stat.ML

Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers

Rex: 一族可逆指数（随机）龙格-库塔求解器

Zander W. Blasingame, Chen Liu

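Affiliations * University of Washington

AI summary: Proposes the Rex family of solvers, which uses Lawson methods to convert explicit (stochastic) Runge-Kutta schemes into algebraically reversible ones for diffusion ODEs and SDEs, achieving near-machine-precision reconstruction and improving the performance of flow and diffusion models.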

Comments Accepted as an Oral presentation at ICML 2026

AI中文摘要

基于神经微分方程的深度生成模型已成为许多生成任务的最先进方法。这些模型依赖于从先验分布积分到数据分布的ODE/SDE求解器；在许多应用中，逆方向积分也非常可取。然而，标准求解器会累积离散误差，阻碍精确反演，这种不准确性在精度关键的应用中是不可接受的。现有的反演方法稳定性差、收敛阶低，且严格限于ODE设置。在这项工作中，我们提出Rex，一族可逆指数（随机）龙格-库塔求解器，通过应用Lawson方法将任何显式（随机）龙格-库塔格式转化为扩散ODE和SDE的代数可逆格式。除了严格的理论分析——建立任意阶收敛性和非零线性稳定区域——我们通过实验证明Rex实现了近机器精度的重建，并改进了基于流模型的玻尔兹曼采样以及基于扩散模型的图像生成和编辑。

英文摘要

Deep generative models based on neural differential equations have become state-of-the-art for many generation tasks. These models rely on ODE/SDE solvers that integrate from a prior distribution to the data distribution; in many applications it is also highly desirable to integrate in the inverse direction. Standard solvers, however, accumulate discretization errors that prohibit exact inversion, an inaccuracy that is unacceptable in precision-critical applications. Existing inversion methods suffer from poor stability and low order of convergence, and are strictly limited to the ODE setting. In this work, we propose Rex, a family of reversible exponential (stochastic) Runge-Kutta solvers obtained by applying Lawson methods to convert any explicit (stochastic) Runge-Kutta scheme into an algebraically reversible one for both diffusion ODEs and SDEs. Beyond a rigorous theoretical analysis -- establishing arbitrary-order convergence and a non-zero region of linear stability -- we empirically demonstrate that Rex achieves near-machine-precision reconstruction and improves Boltzmann sampling with flow models as well as image generation and editing with diffusion models.

2602.17149 2026-06-03 cs.LG cs.AI

TimeOmni-VL: Unified Models for Time Series Understanding and Generation

TimeOmni-VL：统一时间序列理解与生成的模型

Tong Guan, Sheng Pan, Johan Barthelemy, Zhao Li, Yujun Cai, Cesare Alippi, Ming Jin, Shirui Pan

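Affiliations * Tsinghua University

AI summary: Proposes the TimeOmni-VL framework, which unifies time series understanding and generation for the first time through fidelity-preserving bidirectional mapping and understanding-guided generation.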

Comments Accepted by the Forty-third International Conference on Machine Learning (ICML 2026)

AI中文摘要

近期的时间序列建模在数值生成与语义理解之间存在明显鸿沟，研究表明生成模型往往依赖浅层模式匹配，而理解导向的模型难以输出高保真数值。尽管统一多模态模型（UMMs）已在视觉领域弥合这一差距，但其在时间序列上的潜力尚未被发掘。我们提出TimeOmni-VL，这是首个以视觉为中心的统一时间序列理解与生成框架，通过两项关键创新实现：（1）时间序列与图像之间的保真双向映射（Bi-TSI），改进了时间序列到图像（TS2I）和图像到时间序列（I2TS）的转换，确保近乎无损的变换。（2）理解引导生成。我们引入TSUMM-Suite，这是一个新颖的数据集，包含六个基于时间序列分析的理解任务，并耦合两个生成任务。通过校准的思维链，TimeOmni-VL首次利用时间序列理解作为高保真生成的显式控制信号。实验证实，这种统一方法显著提升了语义理解和数值精度，为多模态时间序列建模开辟了新前沿。

英文摘要

Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that generation models often rely on superficial pattern matching, while understanding-oriented models struggle with high-fidelity numerical output. Although unified multimodal models (UMMs) have bridged this gap in vision, their potential for time series remains untapped. We propose TimeOmni-VL, the first vision-centric framework that unifies time series understanding and generation through two key innovations: (1) Fidelity-preserving bidirectional mapping between time series and images (Bi-TSI), which advances Time Series-to-Image (TS2I) and Image-to-Time Series (I2TS) conversions to ensure near-lossless transformations. (2) Understanding-guided generation. We introduce TSUMM-Suite, a novel dataset consisting of six understanding tasks rooted in time series analytics and coupled with two generation tasks. With a calibrated Chain-of-Thought, TimeOmni-VL is the first to leverage time series understanding as an explicit control signal for high-fidelity generation. Experiments confirm that this unified approach significantly improves semantic understanding and numerical precision, establishing a new frontier for multimodal time series modeling.
