arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14374 2026-05-15 cs.LG cs.AI math.OC

Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

Young-Chae Hong, Yangho Chen

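Affiliations * Amazon

AI summary This paper proposes the Optimal Pattern Detection Tree (OPDT), a symbolic rule-based classification model built on mixed-integer programming that discovers a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, the authors further introduce the Branching Structure Constraints (BSC) framework, which lets decision makers embed domain knowledge directly in the model. By maximizing coverage while minimizing the false positive rate due to misclassification, the method finds hidden patterns with optimality guarantees on moderately sized datasets within reasonable runtime.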

Comments Published in Transactions on Machine Learning Research (TMLR). 26 pages, 4 figures. OpenReview URL: https://openreview.net/forum?id=RJ6eMDcDCv

Journal ref Transactions on Machine Learning Research (2026)

English abstract

Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.
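
The trade-off named in the abstract, covering as many positives as possible while keeping false positives low, can be illustrated by scoring a single candidate rule. The following is only a minimal sketch and not the paper's mixed-integer program: the rule format (a conjunction of feature thresholds), the function name `rule_metrics`, the reading of "coverage" as the share of positives the rule fires on, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def rule_metrics(X, y, rule):
    """Score a conjunctive threshold rule on binary-labelled data.

    rule: list of (feature_index, op, threshold) with op in {"<=", ">"}.
    Returns (coverage, false_positive_rate). Coverage is assumed here to be
    the share of positives the rule fires on; the false positive rate is the
    share of negatives it fires on.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    fires = np.ones(len(X), dtype=bool)
    for j, op, t in rule:
        # Each condition narrows the set of samples the rule fires on.
        fires &= (X[:, j] <= t) if op == "<=" else (X[:, j] > t)
    coverage = (fires & y).sum() / max(y.sum(), 1)
    fpr = (fires & ~y).sum() / max((~y).sum(), 1)
    return float(coverage), float(fpr)

# Hypothetical toy data: two features, five samples.
X = [[1.0, 5.0], [2.0, 6.0], [3.0, 1.0], [4.0, 7.0], [0.5, 0.5]]
y = [0, 1, 0, 1, 0]

tight_rule = [(0, ">", 1.5), (1, ">", 4.0)]
loose_rule = [(0, ">", 1.5)]
coverage, fpr = rule_metrics(X, y, tight_rule)
loose_coverage, loose_fpr = rule_metrics(X, y, loose_rule)
```

On this toy data both rules cover every positive, but the looser rule also fires on one negative, which is the kind of difference an optimizer over rules would penalize.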

2605.14368 2026-05-15 cs.CL cs.AI

Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

Injin Kong, Hyoungjoon Lee, Yohan Jo

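Affiliations * Graduate School of Data Science, Seoul National University; Department of Biosystems & Biomaterials Science and Engineering, Seoul National University

AI summary This paper studies how to introduce diffusion into a pretrained language model effectively and proposes DiHAL, a geometry-guided diffusion-transformer hybrid. The method scores each layer's suitability with geometry-based proxies, selects a suitable hidden-state interface, and replaces the lower transformer layers with a diffusion bridge while keeping the upper layers and the language-model head. Experiments show that, under a matched training budget, hidden-state recovery guided by the geometry score outperforms continuous diffusion baselines, indicating that diffusion-based replacement inside language models is feasible.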

English abstract

Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.

2605.14366 2026-05-15 cs.CL cs.LG

Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, Wentao Zhang

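Affiliations * Minzu University of China; Ant Group; Shanghai Jiao Tong University; University of Macau; Peking University; Institute of Automation, Chinese Academy of Sciences; Hainan International College, Minzu University of China

AI summary This study examines how to avoid the "alignment tax" incurred when fine-tuning large language models for low-resource language expansion. The authors propose reinforcement learning with semantic rewards: Group Relative Policy Optimization (GRPO) optimizes the model with embedding-level semantic rewards instead of conventional likelihood maximization, improving low-resource performance while preserving general capabilities. Experiments on Tibetan-Chinese machine translation and Tibetan headline generation show that the method mitigates the alignment tax and yields higher-quality, more transferable generation.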

Comments ACL 2026 Findings

English abstract

Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
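
To make "embedding-level semantic rewards" under GRPO more concrete, the sketch below scores a group of sampled outputs by cosine similarity to a reference embedding and then normalizes the rewards within the group, which is the standard group-relative advantage used by GRPO. The abstract does not specify the embedding model or the exact reward, so the cosine reward, the function names, and the two-dimensional toy vectors are assumptions for illustration only.

```python
import numpy as np

def cosine_rewards(candidate_embs, reference_emb):
    """Cosine similarity of each candidate embedding to the reference.

    Assumption: sentence embeddings are already computed by some encoder;
    the encoder itself is outside this sketch.
    """
    c = np.asarray(candidate_embs, dtype=float)
    r = np.asarray(reference_emb, dtype=float)
    return (c @ r) / (np.linalg.norm(c, axis=1) * np.linalg.norm(r))

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical embeddings for one prompt: a reference and three samples.
reference = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

rewards = cosine_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)
```

Because the reward depends only on meaning-level similarity, two outputs with different surface wording but similar embeddings receive similar advantages, which is the flexibility the abstract contrasts with token-level imitation.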

2605.14365 2026-05-15 cs.LG cs.AI

LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning

Changryeol Choi, Hyewon Park, Yujin Kwon, Gowun Jeong

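Affiliations * CJ Logistics

AI summary In tabular deep learning, leading methods now perform so similarly that no clear ranking emerges. This paper proposes LoMETab, a rank-$r$ implicit ensemble whose adjustable rank and initialization scale increase member diversity and expressiveness. Experiments show that LoMETab raises predictive diversity among members and offers good controllability and performance on classification and regression tasks.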

English abstract

Recent tabular learning benchmarks increasingly show a tight performance cluster rather than a clear hierarchy among leading methods, spanning gradient boosted decision trees, attention-based architectures, and implicit ensembles such as TabM. As benchmark gains plateau, a complementary goal is to understand and control the mechanisms that make simple neural tabular models competitive. We propose LoMETab, a rank-$r$ generalization of multiplicative implicit ensembles. LoMETab lifts the rank-1 BatchEnsemble/TabM modulation to a rank-$r$ identity-residual Hadamard family by parameterizing each member weight as $W_k = W \odot (1 + A_kB_k^\top)$, where $W$ is shared and $(A_k, B_k)$ are member-specific low-rank factors. This exposes two practical diversity-control axes: the adapter rank $r$ and the initialization scale $\sigma_{\mathrm{init}}$, and we prove that for $r \ge 2$ this generalization strictly enlarges BatchEnsemble's hypothesis class. Empirically, we show that this added capacity manifests as measurable predictive diversity after training: on representative classification datasets, LoMETab sustains higher pairwise KL than an additive low-rank ablation, and $(r, \sigma_{\mathrm{init}})$ provides broad control over pairwise KL, varying by up to several orders of magnitude across configurations. The induced diversity is reflected in task-appropriate output-level measures: argmax disagreement for classification and ambiguity for regression, indicating that the control extends beyond pairwise KL to decision- and output-level member variation. Finally, experiments sweeping over adapter rank $r$ and initialization scale $\sigma_{\mathrm{init}}$ reveal that predictive performance is dataset-dependent over the $(r, \sigma_{\mathrm{init}})$ grid, supporting LoMETab as a controllable family of implicit ensembles rather than a fixed rank-1 construction.
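
The member-weight parameterization $W_k = W \odot (1 + A_kB_k^\top)$ can be written out directly. The sketch below builds the weights for all members at once with NumPy. The formula is taken from the abstract; the Gaussian initialization of both factors scaled by `sigma_init`, the function names, and the layer sizes are assumptions, since the abstract does not describe the initialization scheme in detail.

```python
import numpy as np

def member_weights(W, A, B):
    """Compute W_k = W * (1 + A_k @ B_k.T) elementwise for every member k.

    W: shared weight, shape (d_out, d_in)
    A: member factors, shape (K, d_out, r)
    B: member factors, shape (K, d_in, r)
    Returns an array of shape (K, d_out, d_in).
    """
    low_rank = np.einsum("kor,kir->koi", A, B)  # A_k @ B_k.T for each k
    return W[None] * (1.0 + low_rank)

def init_factors(K, d_out, d_in, r, sigma_init, seed=0):
    """Assumed initialization: both factors i.i.d. Gaussian times sigma_init."""
    rng = np.random.default_rng(seed)
    A = sigma_init * rng.standard_normal((K, d_out, r))
    B = sigma_init * rng.standard_normal((K, d_in, r))
    return A, B

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))  # hypothetical 3-in, 4-out layer
A, B = init_factors(K=5, d_out=4, d_in=3, r=2, sigma_init=0.1)
W_members = member_weights(W, A, B)
```

With `sigma_init=0` every member collapses to the shared `W`, so under this assumed initialization the scale controls how far members start from one another, while `r` bounds the rank of each member's multiplicative perturbation.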

2605.14359 2026-05-15 cs.LG cs.AI

RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

Zhengjia Zhong, Shuyan Ke, Zaizhou Lin, Jiaqi Song, Hongyi Lan, Hui Li

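Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China

AI summary This paper proposes RQ-MoE, a residual quantization framework that combines a mixture of experts with dual-stream quantization to achieve efficient, input-dependent vector compression. It removes the decoding bottleneck of existing dynamic quantizers, supports parallel decoding, and improves expressiveness. Experiments show that RQ-MoE reaches state-of-the-art or on-par performance in reconstruction and retrieval while decoding 6x to 14x faster than prior methods.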

Comments To appear at ICML 2026

English abstract

Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
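
For readers unfamiliar with the baseline, the sketch below shows standard residual quantization with static codebooks, which the abstract describes as a constrained special case of RQ-MoE. It does not implement RQ-MoE itself: there are no experts and no input-dependent codebooks, and the tiny codebooks and function names are hypothetical.

```python
import numpy as np

def rq_encode(x, codebooks):
    """Greedy residual quantization: one code index per codebook."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for C in codebooks:
        # Pick the codeword nearest to what is still unexplained.
        idx = int(np.argmin(((C - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual -= C[idx]
    return codes

def rq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords."""
    return sum(C[i] for C, i in zip(codebooks, codes))

# Hypothetical two-stage codebooks: coarse first, finer second.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]]),
]
x = np.array([1.2, 0.1])
codes = rq_encode(x, codebooks)
x_hat = rq_decode(codes, codebooks)
```

With static codebooks, as here, decoding is a plain sum of table lookups. Dynamic quantizers instead build each stage's codebook from earlier stages, which is the sequential dependency the abstract identifies as a decoding bottleneck.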

2605.14358 2026-05-15 cs.AI cs.LG

Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

Sanjoy Chowdhury, Dinesh Manocha

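Affiliations * University of Maryland, College Park, USA

AI summary This study asks how many steps of a language model's long chain-of-thought trace are actually needed for its final prediction. It defines the "minimal core" as the smallest subset of steps that preserves the final answer or predictive distribution, and introduces metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Reasoning traces turn out to be widely redundant: on average 46% of steps can be removed under greedy extraction, with the original answer preserved in 86% of cases, and necessity is concentrated in a few steps. Minimal cores also expose the geometry of reasoning more cleanly and transfer well across models, offering a new view of how language models reason.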

English abstract

Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
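
Greedy minimal-core extraction can be sketched as repeated single-step deletion: drop a step whenever the prediction is unchanged, and stop when no single step can be removed. This is an illustrative reading of the abstract, not the authors' procedure. The `predict` callable stands in for re-querying a model on a shortened trace, and `toy_predict` and the example trace are hypothetical.

```python
def greedy_minimal_core(steps, predict):
    """Remove steps one at a time while the prediction stays the same.

    The result is locally irreducible: deleting any single remaining step
    changes the prediction. It is not guaranteed to be the globally
    smallest sufficient subset.
    """
    target = predict(steps)
    core = list(steps)
    changed = True
    while changed:
        changed = False
        for i in range(len(core)):
            candidate = core[:i] + core[i + 1:]
            if predict(candidate) == target:
                core = candidate
                changed = True
                break  # rescan from the start after every deletion
    return core

def toy_predict(steps):
    """Hypothetical stand-in for a model: needs exactly three steps."""
    needed = {"x = 3", "y = 4", "answer = x + y"}
    return 7 if needed.issubset(steps) else None

trace = ["x = 3", "note: restate problem", "y = 4", "check units", "answer = x + y"]
core = greedy_minimal_core(trace, toy_predict)
removed_fraction = 1 - len(core) / len(trace)
```

In this toy trace two of five steps are removable. With a real model each `predict` call is a forward pass, so the number of calls made by the greedy loop is the main cost to consider.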

2605.14352 2026-05-15 cs.CL

Ideology Prediction of German Political Texts

Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek

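Affiliations * Bundestag; Wahl-O-Mat; German Media Datasets

AI summary This paper studies ideology prediction for German political texts with transformer-based models, mapping a text's political orientation onto a continuous spectrum from -1 to 1. The authors build four corpora from different sources: plenary records of the German Bundestag, the online decision-making tool Wahl-O-Mat, 33 newspapers of differing political orientation, and tweets by members of parliament. Comparing multiple pretrained models, they find that DeBERTa-large and Gemma2-2B perform best on different datasets. The results indicate that model architecture and the availability of domain-specific data strongly affect political bias estimation.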

Comments This paper has been accepted for the upcoming 20th International AAAI Conference on Web and Social Media (ICWSM 2026)

English abstract

Elections represent a crucial milestone in a nation's ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Multiclass classifiers can achieve such a task only if the desired orientation is incorporated within one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two distinct corpora for training and two for testing. For in-domain performance, DeBERTa-large achieved the highest F1 score (F1 = 0.844), and it also performed best on the X (Twitter) out-of-domain test (ACC = 0.864). On the newspaper out-of-domain test, Gemma2-2B excelled (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.

2605.14350 2026-05-15 cs.LG

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

Nicholas E. Corrado, Wenyuan Huang, Josiah P. Hanna

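Affiliations * Computer Sciences Department, University of Wisconsin – Madison

AI summary Multi-task reinforcement learning trains a single agent to optimize performance across several tasks at once, but jointly optimizing all tasks often leads to imbalanced learning: easy tasks are learned quickly and hard tasks slowly. This paper proposes DRATS, an adaptive task sampling method that addresses imbalanced data allocation by dynamically prioritizing the tasks furthest from being solved. The method formalizes multi-task learning as a feasibility problem and optimizes a minimax objective that minimizes the worst-case return gap, improving data efficiency and worst-task performance on several benchmarks.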

English abstract

Multi-task reinforcement learning (MTRL) aims to train a single agent to efficiently optimize performance across multiple tasks simultaneously. However, jointly optimizing all tasks often yields imbalanced learning: agents quickly solve easy tasks but learn slowly on harder ones. While prior work primarily attributes this imbalance to conflicting task gradients and proposes gradient manipulation or specialized architectures to address it, we instead focus on a distinct and under-explored challenge: imbalanced data allocation. Standard MTRL allocates an equal number of environment interactions to each task, which over-allocates data to easy tasks that require relatively few interactions to solve and under-allocates data to hard tasks that require substantially more experience to solve. To address this challenge, we introduce Distributionally Robust Adaptive Task Sampling (DRATS), an algorithm that adaptively prioritizes sampling tasks furthest from being solved. We derive DRATS by formalizing MTRL as a feasibility problem from which we derive a minimax objective for minimizing the worst-case return gap, the difference between a desired target return and the agent's return on a task. In benchmarks like MetaWorld-MT10 and MT50, DRATS improves data efficiency and increases worst-task performance compared to existing task sampling algorithms.
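
The idea of prioritizing tasks by return gap can be sketched with a simple softmax over gaps. This is one plausible instantiation chosen for illustration and is not claimed to be the DRATS update rule: the softmax form, the `temperature` parameter, the clipping of negative gaps to zero, and the example returns are all assumptions.

```python
import numpy as np

def task_sampling_probs(returns, targets, temperature=1.0):
    """Turn per-task return gaps into task sampling probabilities.

    gap = target return - current return, clipped at zero so that tasks
    already at their target get no extra priority (an assumption).
    """
    gaps = np.maximum(np.asarray(targets, float) - np.asarray(returns, float), 0.0)
    logits = gaps / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical normalized returns for an easy, a medium, and a hard task.
returns = [0.9, 0.5, 0.1]
targets = [1.0, 1.0, 1.0]
probs = task_sampling_probs(returns, targets, temperature=0.2)

rng = np.random.default_rng(0)
next_task = rng.choice(len(probs), p=probs)  # task to collect data from next
```

As the temperature shrinks, sampling concentrates on the single worst task, which matches the worst-case emphasis of a minimax objective; larger temperatures move back toward the uniform allocation that the abstract criticizes.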

2605.14346 2026-05-15 cs.CV

Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

Yuanhang Yao, Ping Qian, Zhu Liu, Long Ma, Weimin Wang

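Affiliations * School of Software Technology, Dalian University of Technology

AI summary This paper studies how to stabilize point-supervised infrared small target detection. Lightweight detectors lack semantic information, which causes noisy pseudo-labels and unstable training; to counter this, the authors propose a hierarchical knowledge distillation framework driven by a Vision Foundation Model (VFM). Through a bilevel optimization process combined with Semantic-Conditioned Affine Modulation (SCAM) and a dynamic collaborative learning strategy, the method improves detection accuracy and training stability. Experiments show consistent improvements across multiple infrared small target detection backbones.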

English abstract

Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.

2605.14343 2026-05-15 cs.LG math.ST stat.ML stat.TH

Nearest-Neighbor Radii under Dependent Sampling

Yuanyuan Gao, Yilong Hou, Zhexiao Lin

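Affiliations * Department of Statistics, University of California, Berkeley, CA 94720, USA; Department of Biostatistics, University of California, Berkeley, CA 94720, USA

AI summary This paper studies nearest-neighbor radii under dependent sampling, moving beyond the usual independent-sampling assumption. For strongly mixing dependent observations, it establishes almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. The bounds depend on the local intrinsic dimension rather than the ambient dimension, so they apply to high-dimensional data near lower-dimensional manifolds. Experiments support the theory, showing that nearest-neighbor geometry remains informative under dependent sampling.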

Comments 33 pages

English abstract

Nearest-neighbor methods are fundamental to classical and modern machine learning, yet their geometric properties are typically analyzed under independent sampling. In this paper, we study the nearest-neighbor radii under dependent sampling. We consider strong mixing dependent observations and ask whether dependence changes the scale of nearest-neighbor neighborhoods. We establish distribution-free almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. The moment bounds depend on the local intrinsic dimension rather than the ambient dimension, making the results applicable to high-dimensional data concentrated near lower-dimensional manifolds. Synthetic experiments and real-world time-series benchmarks support the theory, showing that nearest-neighbor geometry remains informative under dependent sampling.
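
A quick way to see what a nearest-neighbor radius looks like under dependence is to measure it on a stationary Gaussian AR(1) series, a standard example of a geometrically mixing process. The sketch below is a toy illustration and not the paper's experimental setup: the AR(1) model, the query point, and the values of `k`, `rho`, and `n` are all assumptions.

```python
import numpy as np

def ar1_samples(n, rho, seed=0):
    """Stationary Gaussian AR(1) series with N(0, 1) marginals."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        # Innovation variance 1 - rho^2 keeps the marginal variance at 1.
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    return x

def knn_radius(samples, query, k):
    """Distance from the query to its k-th nearest sample (1-D case)."""
    dists = np.sort(np.abs(np.asarray(samples) - query))
    return dists[k - 1]

# The same seed makes each shorter series a prefix of the longer ones.
radii = {
    n: knn_radius(ar1_samples(n, rho=0.5), query=0.0, k=5)
    for n in (100, 1000, 10000)
}
```

Because each longer series contains the shorter one as a prefix, the radius can only shrink as `n` grows. Comparing how fast it shrinks for different `rho` values against an i.i.d. baseline gives an informal feel for the question the abstract poses about whether dependence changes the neighborhood scale.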

2605.14341 2026-05-15 cs.CV

AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

Zuopeng Zhao, Ying Liu, Xiaoyu Li, Su Luo, Lu Li, Wenwen Liu

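Affiliations * School of Computer Science and Technology / School of Artificial Intelligence, China University of Mining and Technology; Mine Digitization Engineering Research Center of the Ministry of Education; Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing

AI summary This paper proposes AnyBand-Diff, a unified framework for remote sensing image generation and band repair. It targets the spectral distortion and radiometric inconsistency that arise when existing diffusion models ignore physical laws. The method introduces a spectral-prior-guided diffusion architecture with a dual stochastic masking strategy and a physics-guided sampling mechanism, recovering complete spectral information from arbitrary band subsets while keeping generated images radiometrically consistent. Experiments show that AnyBand-Diff generates reliable remote sensing imagery and reconstructs spectra accurately, suggesting a direction for physics-aware generative models in Earth observation.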

English abstract

Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.

2605.14340 2026-05-15 cs.SD

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara

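Affiliations * Kyoto University, Japan; LY Corporation, Japan

AI summary LLM-based automatic speech recognition systems perform well by connecting an audio encoder to an LLM, but their adaptation to new domains is limited by the scarcity of paired speech and text. This paper proposes a framework that explicitly models speech-text alignment to generate more expressive pseudo-audio prompts, bridging the modality gap and improving target-domain adaptation. Experiments show that the method outperforms existing text-only adaptation methods in both overall error rate and out-of-vocabulary coverage.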

Comments Submitted to Interspeech 2026

English abstract

LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features and ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.

2605.14337 2026-05-15 cs.CV

IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

Yifan Chen, Fei Yin, Chunle Guo, Chongyi Li, Yujiu Yang

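Affiliations * Tsinghua University; Nankai University

AI summary Image restoration in complex night scenes is difficult because insufficient illumination coexists with other degradations. This paper proposes an illumination-guided diffusion model (IG-Diff) whose illumination-guided module improves restoration in low-light scenes with multiple coexisting degradations. The authors also build complex nighttime scene datasets covering several degradation types as a resource for further research.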

Comments Accepted by CGI-2025

English abstract

At night, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle single forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that captures the coexistence of low-light conditions and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we embed an illumination-guided module in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while handling the various degradations found in low-light scenarios.

2605.14333 2026-05-15 cs.CV

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Yang Yue, Fangyun Wei, Tianyu He, Jinjing Zhao, Zanlin Ni, Zeyu Liu, Jiayi Guo, Lei Shi, Yue Dong, Li Chen, Ji Li, Gao Huang, Dong Chen

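Affiliations * Tsinghua University; Microsoft Research

AI summary This paper studies how to improve text and face quality in autoregressive image generation based on discrete tokenization. The authors observe that conventional tokenizers lose fine-grained structure through aggressive downsampling and quantization, making it hard to preserve legible text and distinctive facial features. They propose InsightTok, which uses localized, content-aware perceptual losses to raise text and face fidelity and significantly outperforms existing tokenizers without sacrificing overall reconstruction quality. In the autoregressive image generation model InsightAR, the gains carry over as clearer text and more faithful facial details.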

Comments Code and checkpoints are available at https://github.com/LeapLabTHU/InsightTok

English abstract

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.

2605.14327 2026-05-15 cs.LG cs.AI

AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction

Yerin Park, Sangseon Lee

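Affiliations * Department of Artificial Intelligence, Inha University

AI summary Drug-drug interaction (DDI) prediction matters in computational biomedicine, but accurately predicting interactions for drugs unseen during training remains a key challenge. This paper proposes AIM-DDI, a model-agnostic multimodal integration module that maps heterogeneous structural, chemical, and semantic drug information into a shared latent space and models cross-modality dependencies with a unified fusion module, so it can be integrated across different DDI prediction architectures. Experiments across multiple DDI models and DrugBank-based settings show improved prediction performance, with the largest gains in the hardest setting, where neither drug appears in training.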

English abstract

Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.

2605.14326 2026-05-15 cs.CV

D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

Zuopeng Zhao, Ying Liu, Kanyaphakphachsorn Pharksuwan, Su Luo, Xiaoyu Li, Maocai Ning

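Affiliations * China University of Mining and Technology

AI summary This paper proposes D2-CDIG, a controllable diffusion framework for remote sensing image generation, to address the limited accuracy and naturalness of existing methods under complex terrain and atmospheric conditions. By using a Digital Elevation Model (DEM) and cloud-fog information as dual priors, the method controls ground features and atmospheric phenomena precisely, and an adjustable cloud-fog slider controls cloud thickness and distribution. Experiments show significant improvements over traditional methods in image quality, detail richness, and realism, providing high-quality data for training large remote sensing models and for downstream tasks.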

English abstract

Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.

2605.14323 2026-05-15 cs.LG cs.AI cs.CL

Dynamic Latent Routing

Fangyuan Yu, Xin Su, Amir Abdullah

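Affiliations * Thoughtworks AI Labs (TAILS)

AI summary This paper studies the temporal concatenation of sub-policies in Markov decision processes (MDPs) with time-varying reward functions. The authors introduce General Dijkstra Search (GDS) and prove that a globally optimal goal-reaching policy can be recovered by temporally composing intermediate optimal sub-policies. Building on the "search, select, update" principle behind GDS, they propose Dynamic Latent Routing (DLR), which jointly learns discrete latent codes, routing policies, and model parameters in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms conventional supervised fine-tuning across multiple datasets and models.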

English abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

2605.14318 2026-05-15 cs.AI cs.LG

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems

Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello

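Affiliations * INAF, Osservatorio Astrofisico di Catania

AI summary This paper studies interpretable predictive maintenance in complex systems. Heterogeneous and redundant monitored variables obscure fault-relevant information and reduce model interpretability, so the authors propose a semantic feature segmentation framework. The method decomposes the monitored feature space into a canonical component that retains the dominant predictive information and a residual component containing structurally peripheral signals, with functional groups defined from domain knowledge to reflect operational mechanisms. Experiments show that the canonical component achieves lower predictive risk than the residual component and performance comparable to the full feature space and a PCA representation, combining predictive performance with semantic interpretability.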

Comments 18 pages, 7 figures. Under review at Neural Computing and Applications. Keywords: semantic segmentation, change point detection, fault anticipation

Details
English abstract

Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables, which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component, expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain-informed criteria and organizes monitoring variables into functional groups reflecting operational mechanisms such as throughput, latency, pressure, network activity, and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space achieves comparable predictive performance while preserving the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.

2605.14317 2026-05-15 cs.LG physics.ao-ph

Guided Diffusion Sampling for Precipitation Forecast Interventions

Ayumu Ueyama, Kazuhiko Kawamoto, Hiroshi Kera

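Affiliations * Chiba University; National Institute of Informatics

AI summary This paper studies how data-driven weather forecasting models can be used to intervene on extreme precipitation and reduce its negative impact. The authors propose a gradient-guided diffusion sampling method that steers the sampling trajectory of a diffusion-based weather forecasting model, reducing precipitation while staying consistent with the atmospheric state distribution. Physical plausibility is assessed from three perspectives: vertical structure, latent-space trajectory deviation, and cross-model transferability. Experiments show that the method reduces extreme precipitation effectively while yielding more physically plausible interventions than adversarial perturbations.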

Comments 12+7 pages, 7+2 figures

Details
English abstract

Extreme precipitation causes severe societal and economic damage, and weather control has long been discussed as a potential mitigation strategy. However, to the best of our knowledge, perturbation-based interventions for weather control using data-driven weather forecasting models have not yet been explored. While adversarial attacks also generate perturbations that alter forecasts, they aim to exploit model artifacts and do not account for physical plausibility. In this paper, we propose a gradient-based guidance framework for precipitation-reduction interventions through diffusion sampling in diffusion-based weather forecasting models. Instead of directly perturbing atmospheric states, our method steers the diffusion sampling trajectory, enabling precipitation reduction while maintaining consistency with the atmospheric distribution. To assess physical plausibility, we evaluate from three perspectives: (i) vertical and variable-wise perturbation profiles, (ii) latent-space trajectory deviation, and (iii) cross-model transferability. Experiments on extreme precipitation events from WeatherBench2 demonstrate that our method achieves effective precipitation reduction while yielding more physically plausible interventions than adversarial perturbations.

2605.14315 2026-05-15 cs.CV

TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

David Huang, Guile Wu, Chengjie Huang, Bingbing Liu, Dongfeng Bai

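Affiliations * Huawei Noah’s Ark Lab; University of Toronto; Foundation Model Department, Huawei

AI summary This paper proposes TurboVGGT, a new method for fast multi-view 3D reconstruction. It uses a visual geometry transformer with adaptive alternating attention to improve computational efficiency while maintaining reconstruction quality. By combining adaptive sparse global attention with frame attention, TurboVGGT captures global relationships across frames and local details within each frame. Experiments on multiple 3D reconstruction benchmarks show fast reconstruction with competitive quality.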

Comments Technical Report

Details
English abstract

Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to balance reconstruction quality and computational efficiency, which limits their scalability. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: https://turbovggt.github.io/.

2605.14310 2026-05-15 cs.CV

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

Ailar Mahdizadeh, Puria Azadi, Muchen Li, Xiangteng He, Leonid Sigal

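Affiliations * University of British Columbia; Vector Institute

AI summary In streaming video understanding, efficiently compressing the key-value (KV) cache of vision-language models to support long-horizon reasoning is an important problem. This paper casts KV-cache compression as a coreset selection problem and proposes a method based on geometric coverage and diversity that operates in a joint key-value representation, preserving both retrieval structure and output-relevant information. An orthogonality-driven diversity criterion further increases the diversity of the retained subset. Experiments on multiple open-source models and video benchmarks show improvements over heuristic compression baselines.

Details
English abstract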

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
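
To make the coreset view concrete, the sketch below runs greedy k-center (farthest-point) selection, a standard coverage heuristic. It is only an illustration of selecting a subset that covers the geometry of a set of vectors; it is not the bicriteria key/value objective or the orthogonality criterion of the paper, and the function name `greedy_k_center` and its parameters are hypothetical.

```python
import numpy as np

def greedy_k_center(points: np.ndarray, budget: int, start: int = 0) -> list:
    """Pick up to `budget` row indices of `points` by farthest-point selection.

    Generic coverage heuristic (an assumption for illustration, not the
    paper's method): each step adds the point farthest from the current
    selection, so the kept points spread over the whole set.
    """
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    while len(selected) < min(budget, len(points)):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Two tight clusters plus one outlier: a budget of 3 keeps one point per region.
cache = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0], [5.0, 5.0]])
kept = greedy_k_center(cache, budget=3)
```

Unlike a per-token score such as recency or saliency, the choice of each new point here depends on what has already been kept, which is the property the abstract contrasts with token-wise heuristics.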

2605.14305 2026-05-15 cs.CL

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

Xun Fang, Yunchen Li, Hang Yuan, Zhou Yu

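Affiliations * East China Normal University; Beijing Zhongguancun Academy

AI summary This paper proposes a Factorization-Error-Free Discrete Diffusion Language Model (FeF-DLLM) to address the factorization error that arises when conventional methods predict clean tokens independently. The method replaces independent prediction with an exact prefix-conditioned factorization, which better preserves dependencies between tokens, and combines it with speculative decoding to speed up inference while retaining parallel prediction. Experiments on several benchmark datasets show an average accuracy gain of 5.04 percentage points together with a 3.86× speedup.

Details
English abstract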

Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.
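
For readers unfamiliar with speculative decoding, the sketch below shows the generic accept/reject rule from standard speculative sampling: a token drawn from a draft distribution `q` is accepted with probability `min(1, p/q)` and otherwise replaced by a sample from the normalized residual `max(p - q, 0)`, so the result is distributed according to the target `p`. How FeF-DLLM embeds this inside diffusion denoising is not specified in the abstract; the function name and arguments are hypothetical.

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """Generic speculative-sampling step (not the paper's exact procedure).

    p: target distribution over the vocabulary.
    q: draft distribution; `draft_token` is assumed to be sampled from q.
    Returns (token, accepted_flag).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the part of p that q under-covers.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

rng = np.random.default_rng(0)
# The draft proposes token 1, which the target assigns zero probability.
token, accepted = speculative_accept([1.0, 0.0], [0.5, 0.5], draft_token=1, rng=rng)
```

When the draft and target distributions agree, every draft token is accepted, which is where the speedup comes from.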

2605.14304 2026-05-15 cs.LG cs.AI

Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

Zuyuan Zhang, Carlee Joe-Wong, Tian Lan

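Affiliations * The George Washington University; Carnegie Mellon University

AI summary This work proposes Matrix-Space Reinforcement Learning (MSRL), a method that improves compositional generalization in reinforcement learning by reusing the local transition geometry of existing trajectory segments. MSRL uses positive semidefinite matrix descriptors to capture the first- and second-order statistics of trajectory segments, enabling algebraic composition and knowledge transfer in an abstract matrix space. Experiments show that the method outperforms existing approaches under a finite budget, demonstrating its effectiveness for cross-task learning.

Details
English abstract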

Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).
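
The properties named in the abstract (positive semidefinite, aggregating first- and second-order statistics, additive under segment composition) can be illustrated with a simple Gram-matrix descriptor. The lift used here, `z_t = [1, s_t, s_{t+1}]`, is an assumption chosen for illustration; the abstract does not specify the paper's lifting, and `segment_descriptor` is a hypothetical name.

```python
import numpy as np

def segment_descriptor(states: np.ndarray) -> np.ndarray:
    """Sum of z z^T over one-step transitions of a trajectory segment.

    states: array of shape (T + 1, d). Each transition is lifted to
    z_t = [1, s_t, s_{t+1}] (an illustrative choice, not the paper's lift).
    The result holds the transition count, first-order sums, and
    second-order moments, and is positive semidefinite by construction.
    """
    num_transitions = len(states) - 1
    lifted = np.hstack([np.ones((num_transitions, 1)), states[:-1], states[1:]])
    return lifted.T @ lifted

rng = np.random.default_rng(0)
trajectory = rng.normal(size=(7, 2))
full = segment_descriptor(trajectory)
# Two segments that share the boundary state at index 3.
first, second = segment_descriptor(trajectory[:4]), segment_descriptor(trajectory[3:])
```

Because the descriptor is a sum over transitions, the descriptors of two consecutive segments add up to the descriptor of the concatenated segment, which is the kind of algebraic composition the abstract refers to.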

2605.14301 2026-05-15 cs.LG stat.ML

Language-Induced Priors for Domain Adaptation

Qiyuan Chen, Jiayu Zhou, Raed Al Kontar

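Affiliations * University of Michigan

AI summary In domain adaptation, when target-domain data is scarce, conventional statistical methods struggle to distinguish relevant source domains from irrelevant ones, which leads to negative transfer. This paper uses expert textual descriptions of the target domain to build a Language-Induced Prior (LIP) and integrates it into an Expectation-Maximization algorithm to identify relevant sources. The approach is compatible with a wide range of parametric models: it guides source selection when target signals are weak and refines the selection as data accumulates. Theoretical analysis shows near-oracle cold-start performance under a correct prior together with asymptotic consistency. Experiments validate the framework on estimation, prediction, and decision-making tasks.

Details
English abstract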

Domain adaptation faces a fundamental paradox in the cold-start regime. When target data is scarce, statistical methods fail to distinguish relevant source domains from irrelevant ones, which often leads to negative transfer. In this paper, we address this challenge by leveraging expert textual descriptions of the target domain, a resource that is often available but overlooked. We propose a probabilistic framework that translates these semantic descriptions into a choice model, namely a Language-Induced Prior (LIP), that learns the preferences from a pretrained Large Language Model (LLM). The LIP is then integrated into an Expectation-Maximization algorithm to identify source relevance. Methodologically, this framework is compatible with any parametric model where a likelihood is available. It allows the LIP to guide the selection of sources when target signals are weak, while gradually refining these choices as samples accumulate. Theoretically, we prove that the estimator roughly matches an oracle cold-start MSE under a correct prior, while remaining asymptotically consistent regardless of the quality of the LIP. Empirically, we validate the framework on a descriptive task (Gaussian estimation), a predictive task (C-MAPSS dataset), and a prescriptive task (MuJoCo hopper).

2605.14297 2026-05-15 cs.LG cs.AI math.OC stat.ML

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

Matias Alvo, Daniel Russo, Yash Kanoria

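Affiliations * Graduate School of Business, Columbia University

AI summary This paper studies reinforcement learning in hybrid discrete-continuous action spaces, a structure common in robotics, control, and operations problems. To address the poor gradient quality of conventional policy gradient methods in high-dimensional settings, the authors propose Hybrid Policy Optimization (HPO), which combines pathwise and score-function gradients into an unbiased mixed gradient estimator that handles discrete actions and non-smooth dynamics. Experiments show that HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator tasks, with the advantage growing as the continuous action dimension increases.

Details
English abstract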

We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.
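
As a rough illustration of how pathwise and score-function gradients can be combined, consider a single decision step. The notation below is assumed for this sketch and is not taken from the paper: a discrete action $d \sim \pi_\theta$, a continuous action $a = g_\phi(d, \epsilon)$ with reparameterization noise $\epsilon$, and a reward $r(d, a)$ that is differentiable in $a$.

```latex
\begin{align}
J(\theta, \phi) &= \mathbb{E}_{d \sim \pi_\theta,\ \epsilon \sim p(\epsilon)}
  \big[\, r\big(d,\ g_\phi(d, \epsilon)\big) \,\big] \\
\nabla_\phi J &= \mathbb{E}\big[\, \nabla_a r(d, a)\big|_{a = g_\phi(d, \epsilon)}\,
  \nabla_\phi g_\phi(d, \epsilon) \,\big]
  && \text{(pathwise, continuous parameters)} \\
\nabla_\theta J &= \mathbb{E}\big[\, r\big(d,\ g_\phi(d, \epsilon)\big)\,
  \nabla_\theta \log \pi_\theta(d) \,\big]
  && \text{(score function, discrete parameters)}
\end{align}
```

In this one-step setting the continuous action cannot influence a later discrete decision, so the cross term described in the abstract does not appear; it only arises in the multi-step case.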

2605.14294 2026-05-15 cs.AI cs.LG

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

Hengjie Liu, Zhenya Zhang, Jianjun Zhao

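Affiliations * Kyushu University; National Institute of Informatics

AI summary As transformers are deployed more widely in safety-critical domains, their formal verification has become increasingly important. Compared with classic neural networks, transformer inference involves complex computations, such as the dot products in self-attention layers, which makes verification very challenging. This paper proposes a ReLU-catalyzed abstraction refinement approach that represents a precise but non-linear bound for dot products and then applies convex relaxation techniques to improve verification precision. Extending two classic verification approaches, the authors obtain two frameworks for efficient and precise transformer verification. Experiments show significantly improved precision with an acceptable loss of efficiency.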

Comments 32 pages, 6 figures, the full version of the paper accepted by CAV 2026

Details
English abstract

Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.

2605.14280 2026-05-15 cs.LG stat.ML

TILT: Target-induced loss tilting under covariate shift

Kakei Yamamoto, Martin J. Wainwright

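Affiliations * Lab for Information and Decision Systems; Statistics and Data Science Center; EECS, Massachusetts Institute of Technology; Mathematics and EECS, Massachusetts Institute of Technology

AI summary This paper proposes TILT, an unsupervised domain adaptation method for covariate shift. The method introduces a novel objective that decomposes the source predictor into two components, fits their sum on labeled source data while penalizing the auxiliary component on unlabeled target data, and deploys the resulting main predictor on the target domain. Theoretical analysis shows that the method implicitly induces relative importance weighting at the population level and has good stability and generalization properties. Experiments show that TILT outperforms source-only training, exact importance weighting, and relative density-ratio baselines.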

Comments 32 pages, 17 figures. Submitted to NeurIPS 2026

Details
English abstract

We introduce and analyze Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. It is based on a novel objective function that decomposes the source predictor as $f+b$ and fits $f+b$ on labeled source data while simultaneously penalizing the auxiliary component $b$ on unlabeled target inputs. The resulting fit $f$ is deployed as the final target predictor. We show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand $b^*_f$ that is self-localized to the current error and remains uniformly bounded for any source-target pair (even those with disjoint supports). We prove a general finite-sample oracle inequality on the excess risk, and use it to give an end-to-end guarantee for training with sparse ReLU networks. Experiments on controlled regression problems and shifted CIFAR-100 distillation show that TILT improves target-domain performance over source-only training, exact importance weighting, and relative density-ratio baselines, with a stable dependence on the regularization parameter.
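
One plausible way to write the objective described above is sketched below. The loss $\ell$, the squared form of the target-side penalty, the function classes $\mathcal{F}$ and $\mathcal{B}$, and the sample notation are all assumptions made for illustration; the abstract only states that $f+b$ is fit on labeled source data and that $b$ is penalized on unlabeled target inputs, with a regularization parameter.

```latex
\begin{equation}
(\hat f, \hat b) \in \arg\min_{f \in \mathcal{F},\ b \in \mathcal{B}}\
\underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i) + b(x_i),\ y_i\big)}_{\text{labeled source samples } (x_i, y_i)}
\ +\ \lambda\,
\underbrace{\frac{1}{m} \sum_{j=1}^{m} b(\tilde{x}_j)^2}_{\text{unlabeled target inputs } \tilde{x}_j}
\end{equation}
```

Under this reading, $\hat f$ alone is used for target predictions, and $\lambda$ controls how strongly $b$ is pushed toward zero where target inputs lie.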

2605.14278 2026-05-15 cs.CV

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Ruicheng Zhang, Kaixi Cong, Jun Zhou, Zhizhou Zhong, Zunnan Xu, Shuiyang Mao, Wei Liu, Xiu Li

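Affiliations * Tsinghua University; HKUST; Video Rebirth Project

AI summary This paper proposes KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework that aligns streaming autoregressive video generators through key-value (KV) semantic exploration. The method moves the source of exploration diversity from stochastic noise to the historical KV cache, constructing generation branches that are semantically diverse yet remain on the data manifold, which improves long-horizon coherence. KVPO also introduces a surrogate policy based on Trajectory Velocity Energy, yielding a reward-weighted contrastive objective that is fully consistent with the native ODE formulation. Across multiple experimental settings it consistently improves visual quality, motion quality, and text-video alignment.

Details
English abstract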

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.

2605.14277 2026-05-15 cs.AI cs.GT

Parallelizing Counterfactual Regret Minimization

Juho Kim, Tuomas Sandholm

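Affiliations * CMU Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary This paper studies how to parallelize counterfactual regret minimization (CFR) algorithms to accelerate the solving of large imperfect-information games. The authors reframe CFR as a series of linear algebra operations, so that existing parallel computing techniques can be used to speed it up. The approach applies to several CFR variants, including CFR+, discounted CFR, and predictive CFR. Experiments show that the GPU implementation is up to four orders of magnitude faster than an existing CPU implementation.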

Comments This paper contains and extends ideas that were originally in arxiv:2408.14778

Details
English abstract

Parallelization has played an instrumental role in the field of artificial intelligence (AI), drastically reducing the time taken to train and evaluate large AI models. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential. In this paper, we parallelize the family of counterfactual regret minimization (CFR) algorithms, which were central to important breakthroughs for solving large imperfect-information games. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state-of-the-art, such as CFR+, discounted CFR, and predictive variants of CFR. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel's CFR implementations on a CPU.
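
To give a flavor of what expressing CFR in array operations can look like, the sketch below vectorizes one ingredient, the regret-matching strategy update, across all information sets at once. This is the textbook regret-matching rule written with NumPy, not the authors' framework; the array layout (one row per information set, one column per action) and the name `regret_matching` are assumptions.

```python
import numpy as np

def regret_matching(regrets: np.ndarray) -> np.ndarray:
    """Vectorized regret matching over every information set at once.

    regrets: cumulative regrets, shape (num_infosets, num_actions).
    Returns strategies of the same shape; each row sums to 1.
    """
    positive = np.maximum(regrets, 0.0)
    totals = positive.sum(axis=1, keepdims=True)
    uniform = np.full_like(positive, 1.0 / regrets.shape[1])
    # Avoid dividing by zero; those rows are replaced by the uniform strategy.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.where(totals > 0, positive / safe_totals, uniform)

cumulative = np.array([[1.0, 3.0, -2.0],
                       [-1.0, -1.0, -1.0]])
strategy = regret_matching(cumulative)
```

Because the update contains no per-infoset Python loop, the same code pattern maps directly onto GPU array libraries, which is the general idea behind treating the algorithm as batched linear algebra.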

2605.14274 2026-05-15 cs.CV

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Zhenyang Ni, Yijiang Li, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen, Zhenfei Yin, Minshuo Chen, Philip Torr, Zhaoran Wang, Qi Zhu

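Affiliations * Northwestern University; University of California, San Diego; University of Oxford

AI summary This paper proposes CreFlow, an online reinforcement learning framework for improving embodied video generation models under sparse rewards. Because existing video-RL rewards fail to reflect task logic accurately, the authors introduce a reward model based on compositional logic constraints that translates task requirements into Linear Temporal Logic constraints, providing more faithful reward signals and localized error information. Through two key designs, a credit-aware NFT loss and a corrective reflow loss, CreFlow improves the efficiency and stability of training for high-dimensional video generation. Experiments show that it raises execution success on bimanual manipulation tasks by 23.8 percentage points.

Details
English abstract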

Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.