arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.18435 2026-05-15 cs.LG

CAKE: Confidence in Assignments via K-partition Ensembles

Aggelos Semoglou, John Pavlopoulos

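Affiliations * Department of Informatics, Athens University of Economics and Business; Archimedes Research Unit, Athena Research Center

AI summary This paper proposes CAKE, a method for assessing the assignment confidence of each data point in a clustering result. It combines assignment stability across a clustering ensemble with local geometric consistency to produce an interpretable confidence score between 0 and 1. Experiments show that CAKE effectively identifies ambiguous points and stable core points, which supports sample selection and prioritization in downstream clustering tasks.

Comments 37 pages, including appendix

Journal ref Machine Learning with Applications, Volume 24, 2026, Article 100915

Abstract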

Clustering is widely used for unsupervised structure discovery, yet it offers limited insight into how reliable each individual assignment is. Diagnostics, such as convergence behavior or objective values, may reflect global quality, but they do not indicate whether particular instances are assigned confidently, especially for initialization-sensitive algorithms like k-means. This assignment-level instability can undermine both accuracy and robustness. Ensemble approaches improve global consistency by aggregating multiple runs, but they typically lack tools for quantifying pointwise confidence in a way that combines cross-run agreement with geometric support from the learned cluster structure. This work introduces CAKE (Confidence in Assignments via K-partition Ensembles), a framework that evaluates each point using two complementary statistics computed over a clustering ensemble: assignment stability and consistency of local geometric fit. These are combined into a single, interpretable score in [0,1]. The theoretical analysis shows that CAKE remains effective under noise and separates stable from unstable points. Experiments on synthetic and real-world datasets indicate that CAKE effectively highlights ambiguous points and stable core members, providing a confidence ranking over instances that can be used for selection or prioritization in downstream clustering workflows.
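To make the idea of a pointwise confidence score concrete, the following is a minimal sketch that combines cross-run agreement with a geometric fit term. It is not the paper's definition: the pairwise co-membership agreement, the silhouette-style fit, the product used to combine them, and all function names are assumptions chosen for illustration.

```python
import math
from itertools import combinations

def stability(runs):
    """Per-point agreement of pairwise co-membership across all pairs of runs.

    Comparing co-membership avoids aligning cluster labels between runs.
    Requires at least two runs.
    """
    n = len(runs[0])
    scores = []
    for i in range(n):
        agree = total = 0
        for r, s in combinations(runs, 2):
            for j in range(n):
                if j != i:
                    agree += (r[i] == r[j]) == (s[i] == s[j])
                    total += 1
        scores.append(agree / total)
    return scores

def geometric_fit(points, labels):
    """Silhouette-style fit per point, rescaled from [-1, 1] to [0, 1]."""
    fits = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            fits.append(0.5)  # singleton cluster or only one cluster: neutral
            continue
        a = sum(own) / len(own)                            # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        s = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
        fits.append((s + 1) / 2)
    return fits

def confidence(points, runs):
    """Illustrative score in [0, 1]: stability times mean geometric fit."""
    stab = stability(runs)
    per_run = [geometric_fit(points, labels) for labels in runs]
    fit = [sum(col) / len(col) for col in zip(*per_run)]
    return [s * f for s, f in zip(stab, fit)]

# Toy ensemble: the point at 1.0 sits between two groups and flips between runs.
points = [(0.0,), (0.1,), (0.2,), (1.0,), (1.9,), (2.0,), (2.1,)]
runs = [
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0],  # same partition as the second run, labels permuted
]
scores = confidence(points, runs)
```

In this toy ensemble the middle point receives the lowest score, which is the kind of ranking the abstract describes for ambiguous versus core members.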

2602.17949 2026-05-15 cs.CL cs.AI

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake, Jamie Novak, Mathew Miller, Sze-yuan Ooi, Blanca Gallego

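Affiliations * Centre for Big Data Research in Health, University of New South Wales; Eastern Heart Clinic, Prince of Wales Hospital; NSW Ambulance Aeromedical Operations, Bankstown Helicopter Base; Department of Anaesthesia, Saint George Hospital; Department of Cardiology, Prince of Wales Hospital; School of Clinical Medicine, University of New South Wales

AI summary This paper presents CUICurate, a framework based on graph retrieval-augmented generation (GraphRAG) for automatically building clinical concept sets to support natural language processing applications. The method uses a UMLS knowledge graph for semantic retrieval and large language models to filter and classify candidate concepts, yielding clinical concept sets that are more complete and more consistent than manually built ones. Experiments show that CUICurate performs well across several heterogeneous clinical concept tasks: the generated sets are larger, with high recall and stability, offering an efficient, scalable solution for clinical NLP and phenotyping.

Comments 6 figures, 4 tables

Abstract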

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against manually curated concept sets and gold-standard concept sets. Results: CUICurate produced substantially larger and more complete concept sets than the manual benchmarks. A single retrieval configuration across concepts achieved high recall of definitive concepts with manageable candidate sets. GPT-5 outperformed manual curation for all concepts and retained at least 95% of definitive gold-standard CUIs, while Qwen3-32B achieved comparable but slightly lower performance. Many missed concepts were not observed in 10,000 MIMIC-III notes. CUICurate infrastructure and end-to-end processing were inexpensive and stable across runs. Conclusions: CUICurate offers a scalable, reproducible and cost-efficient approach for generating clinician-reviewable UMLS concept sets tailored to clinical natural language processing and phenotyping applications.

2602.15019 2026-05-15 cs.AI cs.IR

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

Vlad Vinogradov, Alisa Vinogradova, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev

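Affiliations * Bioptic

AI summary This paper studies how to efficiently discover potential drug assets of non-U.S. origin for biopharmaceutical investing, business development, and competitive intelligence. To address the low recall and hallucinations of current AI systems across multilingual, heterogeneous sources, the authors propose a tree-based self-learning Bioptic Agent and build a multilingual, multi-agent benchmark. Experiments show that the method clearly outperforms several mainstream large models on asset discovery, confirming its advantages in completeness and accuracy.

Abstract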

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals, and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

2602.14068 2026-05-15 cs.CV

CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, Lei Zhang

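Affiliations * The Hong Kong Polytechnic University, Hong Kong; OPPO Research Institute, ShenZhen, China

AI summary CoCoEdit is a content-consistent image editing framework based on region-regularized reinforcement learning. It targets the tendency of existing models to cause unwanted changes in non-target regions while editing the target region. By introducing a pixel-level similarity reward and a region regularization mechanism, the method improves both editing quality and content consistency. Experiments show that CoCoEdit matches the editing quality of state-of-the-art models on several benchmarks and is significantly better on content consistency.

Comments Accepted by ICML 2026

Abstract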

Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high-quality samples are curated as the training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
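As an illustration of how an editing mask can turn PSNR into a content-consistency measure, here is a minimal sketch that computes PSNR only over pixels outside the edited region. The abstract does not spell out the exact metric or reward, so the function name, the mask convention (1 marks edited pixels), and the grayscale nested-list image format are assumptions.

```python
import math

def masked_psnr(src, out, edit_mask, max_val=255.0):
    """PSNR over pixels where edit_mask is 0, i.e. regions meant to stay unchanged."""
    sq_err, count = 0.0, 0
    for s_row, o_row, m_row in zip(src, out, edit_mask):
        for s, o, m in zip(s_row, o_row, m_row):
            if not m:
                sq_err += (s - o) ** 2
                count += 1
    if count == 0:
        raise ValueError("edit mask covers the whole image")
    mse = sq_err / count
    # Identical non-edited regions give infinite PSNR.
    return math.inf if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

# Toy 2x2 grayscale images: only the top-right pixel is marked as edited.
src = [[10, 10], [10, 10]]
mask = [[0, 1], [0, 0]]
clean_edit = [[10, 200], [10, 10]]   # change confined to the masked pixel
leaky_edit = [[40, 200], [10, 10]]   # an unmasked pixel also changed

psnr_clean = masked_psnr(src, clean_edit, mask)
psnr_leaky = masked_psnr(src, leaky_edit, mask)
```

An edit that stays inside the mask scores higher than one that leaks into the surrounding pixels, regardless of how large the change inside the mask is.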

2602.11871 2026-05-15 cs.CL cs.LG

DMAP: A Distribution Map for Text

Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wresilo, Yoann L. Launay, David Sutton, Stuart Burrell

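Affiliations * University of Manchester, UK; Featurespace, a Visa Solution; Risk and Security AI Lab, Visa Inc., UK; University of Cambridge, UK

AI summary This paper proposes DMAP, a method that uses a language model to map a text to a set of samples in the unit interval, jointly encoding rank and probability information and giving text analysis a mathematical grounding. The method enables efficient, model-agnostic text analysis and shows broad applicability in three case studies: validation of generation parameters, machine-generated text detection, and model fingerprint analysis. DMAP can be computed efficiently on ordinary hardware, is general and widely applicable, and provides a new foundation for research on LLM-based text analysis.

Comments ICLR 2026

Abstract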

Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
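One way to picture a map from tokens to the unit interval that encodes both rank and probability is a randomized probability integral transform, sketched below. The abstract does not give DMAP's construction, so this is an assumed stand-in rather than the authors' method; the toy dictionaries play the role of a language model's next-token distributions, and all names are hypothetical.

```python
import random

def unit_interval_sample(dist, token, rng):
    """Map an observed token to a point in [0, 1).

    Tokens are ranked by descending probability. The observed token owns the
    sub-interval that starts at the total mass of higher-ranked tokens and has
    width equal to its own probability; a point is drawn uniformly inside it.
    """
    ranked = sorted(dist.items(), key=lambda kv: (-kv[1], kv[0]))
    lower = 0.0
    for tok, p in ranked:
        if tok == token:
            return lower + rng.random() * p
        lower += p
    raise KeyError(token)

def text_to_samples(dists, tokens, rng):
    """Apply the map position by position; dists[i] is the distribution at step i."""
    return [unit_interval_sample(d, t, rng) for d, t in zip(dists, tokens)]

rng = random.Random(0)
toy_dist = {"the": 0.5, "a": 0.3, "cat": 0.2}   # hypothetical next-token distribution
samples = text_to_samples([toy_dist, toy_dist, toy_dist], ["the", "a", "cat"], rng)
```

Under this construction, text sampled from the model itself yields approximately uniform samples, while text that favors top-ranked tokens piles up near zero, which is one reason such a representation can be informative about generation settings.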

2602.10346 2026-05-15 cs.CL cs.LG

Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Arash Gholami Davoodi, Navid Rezazadeh, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

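Affiliations * Carnegie Mellon University; Megagon Labs; University of California, Irvine

AI summary Large language models must balance diversity against logical coherence in open-ended generation. This paper proposes Top-W, a geometry-aware truncation method that uses Wasserstein distance together with a trade-off between probability mass and entropy, so that the truncated distribution stays close to the original while generation quality improves. Experiments show that Top-W significantly outperforms existing methods on several benchmarks, improving accuracy and also the creativity of generated content.

Comments 20 pages, 3 figures, 8 tables, ICML 2026

Abstract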

Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.

2602.09969 2026-05-15 cs.LG econ.EM stat.ML

Causal Multi-Task Demand Learning

Varun Gupta, Vijay Kamble

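Affiliations * Dept. of Operations and Information Systems, University of Utah; Dept. of Information and Decision Sciences, University of Illinois Chicago

AI summary This paper studies a multi-task demand-learning problem motivated by retail pricing, aiming to estimate heterogeneous linear price-response functions across decision contexts. Because each context has rich covariates but limited price variation, the authors propose a new meta-learning framework that transfers information across tasks while addressing the estimation bias caused by endogeneity. The method assumes that each task contains at least two locally exogenous price points, which improves the accuracy of demand parameter estimates while preserving causal identification; it is validated on real and synthetic data.

Abstract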

We study a canonical multi-task demand-learning problem motivated by retail pricing, where a firm seeks to estimate heterogeneous linear price-response functions across multiple decision contexts. Each context is described by rich covariates but exhibits limited price variation, motivating transfer learning across tasks. A central challenge in leveraging cross-task transfer is endogeneity: prices may be arbitrarily correlated with unobserved task-level demand determinants across tasks. We propose a new meta-learning framework that identifies the conditional mean of task-specific causal demand parameters given a subset of task-specific observables despite such confounding, assuming that each task contains at least two distinct locally exogenous price points. This subset is carefully designed to include all of the prices to address cross-task confounding, while masking two demand outcomes that provide randomized supervision to address identifiability issues arising from the inclusion of all prices. We show that this information design is maximally uniformly valid, in that any refinement of the conditioning set that reveals withheld-outcome information is not guaranteed to identify the conditional mean causal target. We validate our method on real and synthetic data, demonstrating improved recovery of demand responses relative to standard transfer-learning baselines.

2602.08874 2026-05-15 cs.CL cs.CR

Do Reasoning LLMs Refuse What They Infer in Long Contexts?

Yu Fu, Haz Sameen Shahgir, Huanli Gong, Zhipeng Wei, N. Benjamin Erichson, Yue Dong

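Affiliations * International Computer Science Institute; College of Engineering, University of California, Berkeley; Berkeley Lab

AI summary This paper studies the safety of long-context large language models when harmful intent is implicit. The authors propose a new threat model, compositional reasoning attacks, in which a harmful request is split into semantically incomplete fragments embedded in a long context, so that the model must compose the fragments during reasoning to infer the harmful goal. Experiments show that current frontier models refuse at a high rate when the harmful request is directly identifiable, but refusal rates drop markedly when compositional reasoning is required, revealing a clear safety gap in long contexts.

Comments 33 pages, 6 figures

Abstract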

Long-context LLMs can infer objectives that are not stated explicitly. This capability is useful for reasoning over documents, code, retrieved evidence, and tool traces, but it also creates a safety risk: harmful intent can be distributed across a context and become visible only after the model composes the relevant pieces. Existing safety evaluations mostly test explicit harmful requests, and therefore miss this failure mode. We introduce compositional reasoning attacks, a long-context threat model in which harmful requests are decomposed into semantically incomplete fragments and embedded in long contexts. The final query is neutral; the harmful objective emerges only if the model retrieves the fragments, composes them, and infers the implied goal. We instantiate this setting using AdvBench requests, varying the required reasoning from Direct Retrieval to Single-hop Aggregation, Chain Reasoning, and Multi-hop Deductive Reasoning, and evaluate 15 frontier LLMs on contexts up to 64k tokens. Models usually refuse harmful requests when they are directly retrievable. However, refusal rates drop sharply when the same objectives must be reconstructed compositionally, often with larger failures in longer contexts. Benign reconstruction and fragment-position analyses indicate that these failures are not mainly retrieval errors: models often infer the harmful objective and then comply. Increasing inference-time reasoning improves refusal but remains incomplete and costly. Our results reveal a long-context safety gap: current models are better at refusing harmful requests they see than harmful objectives they infer.

2602.07441 2026-05-15 cs.LG cs.AI

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

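Affiliations * School of Automation, Central South University; Shanghai AI Laboratory; Shanghai Jiao Tong University

AI summary This paper studies the limits of behavior cloning (BC) regularization in offline reinforcement learning, noting that when dataset actions are suboptimal, blind imitation caps policy improvement. The authors propose proximal action replacement (PAR), which replaces suboptimal dataset actions with better ones, using the value function's local ascent direction and an uncertainty bound to keep training stable. Experiments show that PAR improves a range of BC-regularized methods and approaches state-of-the-art results when combined with basic TD3+BC.

Abstract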

Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.

2602.07045 2026-05-15 cs.CV cs.AI

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du

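Affiliations * School of Computer Science, Wuhan University

AI summary To advance the use of multimodal large language models in remote sensing, the authors propose VLRS-Bench, the first vision-language reasoning benchmark dedicated to complex remote sensing reasoning. Built around three core dimensions (cognition, decision, and prediction), it contains 2,000 question-answer pairs covering 14 tasks and up to eight temporal phases, and is designed to evaluate complex reasoning in remote sensing scenarios. By incorporating remote sensing priors and expert knowledge, VLRS-Bench improves geospatial realism and reasoning difficulty, and it reveals significant bottlenecks in current state-of-the-art models, providing a reference point for future research.

Abstract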

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community. The project repository is available at https://github.com/MiliLab/VLRS-Bench.

2602.05285 2026-05-15 cs.LG

Robust Inference-Time Steering of Protein Diffusion Models via Embedding Optimization

Minhuan Li, Jiequn Han, Pilar Cossio, Luhuan Wu

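Affiliations * Flatiron Institute

AI summary This paper studies how to robustly steer diffusion models for protein structure generation by optimizing in embedding space. The authors propose EmbedOpt, which at inference time directly optimizes the model's conditional embedding so that the structural prior aligns with experimental constraints, avoiding the instability that can arise in conventional posterior sampling methods. Experiments show that EmbedOpt performs well on sparse distance constraints and cryo-electron microscopy map fitting and is robust to hyperparameters.

Abstract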

A core challenge in structural biophysics is generating biomolecular conformations that are both physically plausible and consistent with experimental measurements. While sequence-to-structure diffusion models provide powerful priors, posterior sampling methods steer generation by perturbing atomic coordinates with gradients from experimental likelihoods. However, when the target lies in a low-density region of the prior, these methods require aggressive upweighting of the likelihood that can destabilize sampling and be sensitive to hyperparameters. We propose EmbedOpt, an inference-time steering framework that introduces an orthogonal optimization axis: rather than performing posterior sampling under a fixed prior, EmbedOpt directly optimizes the prior by updating the model's conditional embedding. This embedding space encodes rich coevolutionary signals, so optimizing it shifts the structural prior to align with experimental constraints. Empirically, EmbedOpt matches coordinate-based posterior sampling baselines on sparse distance constraints and outperforms them on cryo-electron microscopy map fitting, including real, noisy experimental ones. Furthermore, EmbedOpt's smooth optimization behavior yields robustness to hyperparameters spanning two orders of magnitude and enables comparable performance with fewer diffusion steps. Code is available at https://github.com/rs-station/embedopt.

2602.04657 2026-05-15 cs.CV

TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao

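Affiliations * School of Cyberspace Security, Northwestern Polytechnical University; School of Computer Science, Northwestern Polytechnical University; Intellifusion

AI summary TRIO is a visual token compression method that uses inference-objective guidance to make vision-language model inference efficient. Starting from the inference objective, it recasts visual token compression as preserving output invariance, and uses a purpose-built local proxy loss to produce token-level gradient saliency that guides token reordering and selection. TRIO is training-free, compatible with FlashAttention, and suited to practical deployment; it retains 97.2% of the original performance while substantially speeding up inference and reducing compute.

Abstract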

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed TRIO is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.75× prefill speedup, 2.14× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV Cache overhead. Our code is available at https://github.com/ocy1/TRIO.

2602.04473 2026-05-15 cs.CV

CC-Pan: Channel-wise Compression based Diffusion for Efficient Pan-Sharpening

Junjie Li, Congyang Ou, Haokui Zhang, Guoting Wei, Shengqin Jiang, Ying Li

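Affiliations * School of Cyberspace Security, Northwestern Polytechnical University; Nanjing University of Science and Technology; School of Computer Science, Nanjing University of Information Science and Technology

AI summary This paper proposes CC-Pan, a diffusion model based on channel-wise compression for efficient fusion of multispectral and panchromatic images (pan-sharpening). It trains a channel-independent variational autoencoder to encode high-resolution multispectral images into compact latent representations, which supports multispectral images from different sensors and speeds up inference. Spectral physical properties and the panchromatic image are injected through purpose-designed unidirectional and bidirectional interactive control structures, and a lightweight cross-band attention module further improves fusion precision and spectral consistency. Experiments show that CC-Pan outperforms existing diffusion models on several datasets, achieves a 2-3× speedup, and generalizes well across sensors.

Abstract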

Recently, diffusion models have brought novel insights to pan-sharpening and notably boosted fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) sensors, suffering from high inference latency and sensor-specific limitations. In this paper, we present CC-Pan, a cross-sensor latent diffusion framework for efficient pan-sharpening. Specifically, CC-Pan trains a band-wise single-channel variational autoencoder (VAE) to encode high-resolution multispectral (HRMS) images into compact latent representations, naturally supporting MS images with varying band counts across different sensors and establishing a basis for inference acceleration. Spectral physical properties, along with PAN and MS images, are then injected into the diffusion backbone through carefully designed unidirectional and bidirectional interactive control structures, achieving high-precision spatial-spectral fusion in the latent diffusion process. Furthermore, a lightweight region-based cross-band attention (RCBA) module is incorporated at the central layer of the diffusion model, reinforcing inter-band spectral connections to boost spectral consistency and further elevate fusion precision. Extensive experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that CC-Pan outperforms state-of-the-art diffusion-based methods across all three benchmarks, attains a 2-3× inference speedup, and exhibits robust cross-sensor generalization capability on the held-out WorldView-2 sensor without any sensor-specific retraining.

2602.04265 2026-05-15 cs.LG cs.AI

Boosting LLM Reasoning via Human-Inspired Reward Shaping

Wenze Lin, Zhen Yang, Xitai Jiang, Xiaoteng Ma, Gao Huang

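Affiliations * Tsinghua University; Southern University of Science and Technology; Mind Lab

AI summary To improve the reasoning ability of large language models (LLMs), this work proposes T2T, a dynamic reward framework inspired by human learning behavior. The method distinguishes levels of problem mastery and applies a two-phase reward mechanism, "thickening" and "thinning": it encourages broad exploration on incorrect attempts and, once answers are correct, uses length penalties to condense reasoning. Experiments show that T2T significantly outperforms existing methods on several mathematical benchmarks.

Abstract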

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, existing reward formulations typically treat exploration and consolidation as a monolithic process, resulting in entangled stage-wise learning dynamics. This contradicts the natural learning behavior of human learners. In human learning, individuals adopt distinct behavioral patterns toward mastered versus unfamiliar problems. When confronting unmastered challenges, humans prioritize broad exploration to seek viable solutions. By contrast, for well-mastered problems, they focus instead on reasoning condensation and knowledge abstraction to distill concise underlying principles. Motivated by this gap, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across 5 mainstream LLMs demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
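The dual-phase mechanism can be pictured with a small reward function: a length bonus on incorrect attempts and a length penalty on correct ones. The abstract does not give T2T's actual formula, so the linear shaping, the coefficient `alpha`, and the choice to keep every correct answer above every incorrect one are assumptions made for this sketch.

```python
def thickening_to_thinning_reward(correct, length, max_length, alpha=0.2):
    """Illustrative two-phase reward shaped by response length.

    Incorrect attempt: a small bonus that grows with length ("thickening").
    Correct attempt: full reward minus a penalty that grows with length ("thinning").
    With alpha < 0.5, any correct answer outranks any incorrect one.
    """
    frac = min(max(length / max_length, 0.0), 1.0)  # clamp to [0, 1]
    if correct:
        return 1.0 - alpha * frac
    return alpha * frac

# Hypothetical rollouts as (is_correct, token_count) pairs with a 1000-token cap.
rollouts = [(False, 200), (False, 800), (True, 800), (True, 200)]
rewards = [thickening_to_thinning_reward(c, n, 1000) for c, n in rollouts]
```

Among incorrect rollouts the longer one is rewarded more, and among correct rollouts the shorter one is rewarded more, which mirrors the exploration-then-condensation behavior the abstract describes.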

2602.03814 2026-05-15 cs.AI cs.LG

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, Eric Nalisnick

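Affiliations * Johns Hopkins University, Baltimore, Maryland, USA; Apple, USA

AI summary This paper studies how to improve the reasoning efficiency of large language models under a limited compute budget by controlling risk during reasoning. The authors propose a risk control framework called "Conformal Thinking" that sets an upper threshold, which stops reasoning when the model is confident (risking incorrect output), and a lower threshold, which terminates unsolvable instances early (risking premature stopping), minimizing compute while keeping risk controlled. Experiments show that the method improves computational efficiency across a range of reasoning tasks and models while meeting the user-specified risk target.

Comments ICML 2026

Abstract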

Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning -- spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, with computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.
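The calibration step for the upper threshold can be sketched as a scan over candidate thresholds on a validation set, keeping the loosest one whose risk bound stays under the target. This is an assumed instantiation, not the paper's procedure: it uses a Hoeffding upper bound with a strict-to-loose scan that stops at the first failure, and the loss definition, data layout, and names are hypothetical.

```python
import math

def early_exit_loss(trajectory, threshold):
    """Return 1 if the first checkpoint with confidence >= threshold is wrong, else 0.

    trajectory is a list of (confidence, is_correct) pairs, one per checkpoint.
    """
    for confidence, is_correct in trajectory:
        if confidence >= threshold:
            return 0 if is_correct else 1
    return 0  # never exited early, so no early-exit error is charged

def calibrate_upper_threshold(trajectories, candidates, alpha, delta=0.1):
    """Pick the lowest threshold whose Hoeffding risk bound is at most alpha.

    Candidates are scanned from strictest to loosest and the scan stops at the
    first failure. Returns None if even the strictest candidate fails.
    """
    n = len(trajectories)
    slack = math.sqrt(math.log(1 / delta) / (2 * n))
    chosen = None
    for lam in sorted(candidates, reverse=True):
        risk = sum(early_exit_loss(t, lam) for t in trajectories) / n
        if risk + slack > alpha:
            break
        chosen = lam
    return chosen

# Toy validation set: 180 easy examples, 20 that look confident too early.
easy = [[(0.95, True)]] * 180
tricky = [[(0.6, False), (0.9, True)]] * 20
validation = easy + tricky

threshold = calibrate_upper_threshold(validation, [0.5, 0.7, 0.99], alpha=0.15)
too_strict = calibrate_upper_threshold(validation, [0.5, 0.7, 0.99], alpha=0.05)
```

A lower threshold lets more examples exit early and so saves compute, which is why the scan keeps going until the bound is violated. With only 200 validation examples the slack term is about 0.076, so a 5% target cannot be certified here and the function returns `None`.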

2602.03417 2026-05-15 cs.CL

FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

Yingli Shen, Wen Lai, Jie Zhou, Xueren Zhang, Yudong Wang, Kangyang Luo, Shuo Wang, Ge Gao, Alexander Fraser, Maosong Sun

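Affiliations * Tsinghua University; Technical University of Munich; ModelBest Inc.; Minzu University of China

AI summary This paper presents FactNet, a billion-scale multilingual factual knowledge graph that addresses the lack of retrievable supporting evidence when large language models generate content in non-English languages. FactNet couples 1.7 billion Wikidata assertions with 3.01 billion evidence pointers from 316 native Wikipedia editions, and a deterministic construction pipeline ensures that every evidence unit can be traced to its original source. The work also builds the FactNet-Bench evaluation suite for knowledge graph completion, question answering, and fact checking, and validates FactNet's effectiveness for cross-lingual knowledge transfer.

Abstract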

Large language models hallucinate factual claims and struggle to ground their outputs in retrievable evidence, particularly in non-English languages. Existing resources impose a trade-off: structured knowledge bases lack textual grounding, whereas grounded datasets remain small and monolingual. We introduce FactNet, a billion-scale open resource that couples 1.7B Wikidata assertions with 3.01B evidence pointers drawn from 316 native Wikipedia editions. FactNet employs a deterministic construction pipeline, ensuring that every evidence unit is traceable to its source with byte-level precision. We further establish FactNet-Bench, an evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking, equipped with systematic leakage controls. Experiments demonstrate that FactNet-Bench differentiates among structural, text-aware, and LLM-integrated methods, and that cross-lingual structure enables knowledge transfer across language tiers.

2602.01664 2026-05-15 cs.AI cs.LG

FlowSteer: Towards Agents Designing Agentic Workflows via Reinforced Progressive Canvas Editing

Mingda Zhang, Wenjin Liu, Tiesunlong Shen, Qika Lin, Rui Mao, Erik Cambria, Xiaoying Tang, Haoran Luo

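Affiliations The Chinese University of Hong Kong, Shenzhen; Nanyang Technological University; National University of Singapore

AI summary FlowSteer is a new paradigm in which an agent designs agentic workflows, addressing problems in current workflow construction such as reliance on human effort, lack of global feedback, and the inability to repair errors online. The method introduces an executable workflow canvas environment and uses reinforcement learning to make atomic edits step by step, achieving end-to-end automated workflow design. Experiments show that FlowSteer significantly outperforms existing methods on multiple datasets and supports a variety of operator libraries and LLM backends, giving it good generality and extensibility.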

Comments 51 pages, 6 figures, 5 tables. Project page: http://flowsteer.org/

Abstract

In recent years, agentic workflows have been widely applied to solve complex human tasks. However, existing workflow construction still faces key challenges, including human-dependent workflow construction, the lack of graph-level execution feedback, and the inability to repair errors in-loop during long-horizon construction. To address these challenges, we propose FlowSteer, a new paradigm of Agent Designing Agentic Workflows, in which a single agent designs, end to end, the workflow that a downstream executor runs. To support this paradigm, we introduce the Workflow Canvas, a novel executable graph-state environment that returns syntax-checked execution feedback for every atomic edit. Built on the canvas, we further propose Reinforced Progressive Canvas Editing, in which a lightweight policy agent issues one atomic edit per turn conditioned on real canvas feedback, and is trained end-to-end via reinforcement learning. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks. Our code is available at https://anonymous.4open.science/r/FlowSteer-9B2E.

2602.01359 2026-05-15 cs.LG cs.AI

PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection

Jinju Park, Seokho Kang

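Affiliations Department of Industrial Engineering, Sungkyunkwan University

AI summary Although recent time-series anomaly detection research increasingly adopts large neural architectures such as Transformers and foundation models, these methods have high computational cost and memory consumption, are hard to apply in real-time and resource-constrained settings, and show no clear performance gains under rigorous evaluation. This paper proposes PaAno, a patch-based representation learning method that extracts short temporal patches from time series, embeds them as vectors with a 1D convolutional neural network, and trains with a combination of triplet loss and pretext-task loss to capture useful temporal patterns in the patches. At inference time, anomaly scores are computed by comparing the embeddings of normal patches with those of the current patches. Experiments show that PaAno performs strongly on the TSB-AD benchmark, significantly outperforming existing methods, including large architectures.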

Comments Accepted by the 14th International Conference on Learning Representations (ICLR 2026)

Abstract

Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.
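
The inference step described above (compare a patch's embedding with embeddings of normal training patches) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the learned 1D CNN encoder is replaced by an identity placeholder, and the nearest-neighbor Euclidean distance is an assumed scoring rule, not necessarily the one used in the paper.

```python
import math


def extract_patches(series, patch_len, stride=1):
    """Slide a window over the series and return all full-length patches."""
    return [
        series[i:i + patch_len]
        for i in range(0, len(series) - patch_len + 1, stride)
    ]


def embed(patch):
    """Placeholder for the trained encoder (assumption: identity mapping)."""
    return [float(v) for v in patch]


def anomaly_score(patch, normal_embeddings):
    """Distance from the patch embedding to its nearest normal embedding."""
    z = embed(patch)
    return min(math.dist(z, e) for e in normal_embeddings)


# Build the reference set from an anomaly-free training series.
train_series = [0, 1, 0, 1, 0, 1, 0, 1]
normal_embeddings = [embed(p) for p in extract_patches(train_series, 2)]
```

A patch that matches the training pattern scores zero, while a patch with an unusual value scores higher. In a real system the identity `embed` would be replaced by the trained network.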

2602.00992 2026-05-15 cs.RO

Geometry-Aware Sampling-Based Motion Planning on Riemannian Manifolds

Phone Thiha Kyaw, Jonathan Kelly

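Affiliations Institute for Aerospace Studies, University of Toronto

AI summary This paper studies geometry-aware sampling-based motion planning on Riemannian manifolds, aiming to plan collision-free motions of minimum path length while accounting for the non-Euclidean geometry of the configuration space. The authors propose a sampling-based planning framework that operates directly on Riemannian manifolds, introduce a computationally efficient approximation of the Riemannian geodesic distance, and design a local planner based on Riemannian natural gradients. Experiments show that the method produces better trajectories than conventional Euclidean methods and classical numerical solvers across several robotic systems.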

Comments Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR), Oulu, Finland, Jun 15-17, 2026

Abstract

In many robot motion planning problems, task objectives and physical constraints induce non-Euclidean geometry on the configuration space, yet many planners operate using Euclidean distances that ignore this structure. We address the problem of planning collision-free motions that minimize length under configuration-dependent Riemannian metrics, corresponding to geodesics on the configuration manifold. Conventional numerical methods for computing such paths do not scale well to high-dimensional systems, while sampling-based planners trade scalability for geometric fidelity. To bridge this gap, we propose a sampling-based motion planning framework that operates directly on Riemannian manifolds. We introduce a computationally efficient midpoint-based approximation of the Riemannian geodesic distance and prove that it matches the true Riemannian distance with third-order accuracy. Building on this approximation, we design a local planner that traces the manifold using first-order retractions guided by Riemannian natural gradients. Experiments on a two-link planar arm and a 7-DoF Franka manipulator under a kinetic-energy metric, as well as on rigid-body planning in $\mathrm{SE}(2)$ with non-holonomic motion constraints, demonstrate that our approach consistently produces lower-cost trajectories than Euclidean-based planners and classical numerical geodesic-solver baselines.
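
To give a feel for a midpoint-based distance approximation, the sketch below evaluates the metric tensor once at the coordinate midpoint and measures the displacement under it. The exact formula used in the paper is not given in the abstract, so this form, the function names, and the polar-coordinate test metric are assumptions chosen for illustration.

```python
import math


def midpoint_distance(x, y, metric):
    """Approximate geodesic distance as sqrt(d^T G(midpoint) d), d = y - x.

    `metric` is a hypothetical callable returning the metric tensor
    (a list of rows) at a configuration.
    """
    mid = [(a + b) / 2.0 for a, b in zip(x, y)]
    G = metric(mid)
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    quad = sum(d[i] * G[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


def polar_metric(q):
    """Euclidean plane in polar coordinates (r, theta): G = diag(1, r^2)."""
    r = q[0]
    return [[1.0, 0.0], [0.0, r * r]]


# Two points at radius 2 separated by 0.1 rad; the true distance is the chord.
approx = midpoint_distance([2.0, 0.0], [2.0, 0.1], polar_metric)
exact = 2 * 2.0 * math.sin(0.1 / 2)
```

In this example the approximation returns the arc length `r * dtheta`, which differs from the chord length by about `r * dtheta**3 / 24`, so the error shrinks cubically as the two points get closer.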

2602.00807 2026-05-15 cs.CV cs.RO

Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds

Xianzhe Fan, Shengliang Deng, Xiaoyang Wu, Yuxiang Lu, Zhuoling Li, Mi Yan, Yujia Zhang, Zhizheng Zhang, He Wang, Hengshuang Zhao

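Affiliations School of Computing and Data Science, The University of Hong Kong, Hong Kong SAR, China; School of Computing; Peking University, Beijing, China

AI summary Existing vision-language-action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. To improve VLA performance, this paper proposes Any3D-VLA, which strengthens 3D perception by introducing diverse point cloud data and fuses simulator, sensor, and model-estimated point clouds during training to learn 3D representations that generalize across domains. Experiments show that the method improves model performance and mitigates the domain gap.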

Comments ICML 2026

Abstract

Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io.

2602.00520 2026-05-15 cs.LG

NEST: Nested Event Stream Transformer for Sequences of Multisets

Minghui Sun, Haoyu Gong, Xingyu You, Jillian Hurst, Benjamin Goldstein, Matthew Engelhard

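Affiliations Department of Biostatistics & Bioinformatics, Duke University; Department of Biomedical Engineering, Duke University; Department of Pediatrics, Duke University

AI summary Event stream data often have hierarchical structure, appearing as sequences of multisets in which multiple events co-occur. Most existing foundation models flatten this structure, leading to computational inefficiency and lower-quality set-level representations. This paper proposes the Nested Event Stream Transformer (NEST), which preserves the original hierarchy, and introduces Masked Set Modeling (MSM), improving both pretraining efficiency and downstream task performance.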

Comments 10-page main text

Abstract

Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), an FM for event streams composed of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.

2601.23072 2026-05-15 cs.LG

SplineFlow: Flow Matching for Dynamical Systems with B-Spline Interpolants

Santanu Subhash Rathod, Pietro Liò, Xiao Zhang

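Affiliations CISPA Helmholtz Center for Information Security; Department of Computer Science and Technology; University of Cambridge

AI summary This paper proposes SplineFlow, a flow matching algorithm for modeling state evolution in dynamical systems more accurately. The method builds conditional paths with B-spline interpolation, overcoming the shortcomings of conventional linear interpolation on higher-order dynamics and irregularly sampled data, and thereby achieves more stable, smoother dynamics modeling while satisfying multi-marginal constraints. Experiments show that SplineFlow outperforms existing methods on a range of deterministic and stochastic dynamical systems and on cellular trajectory inference tasks.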

Comments 36 pages, 35 tables, 22 figures

Abstract

Flow matching is a scalable generative framework for characterizing continuous normalizing flows with wide-ranging applications. However, current state-of-the-art methods are not well-suited for modeling dynamical systems, as they construct conditional paths using linear interpolants that may not capture the underlying state evolution, especially when learning higher-order dynamics from irregularly sampled observations. Constructing unified paths that satisfy multi-marginal constraints across observations is challenging, since naïve higher-order polynomials tend to be unstable and oscillatory. We introduce SplineFlow, a theoretically grounded flow matching algorithm that jointly models conditional paths across observations via B-spline interpolation. Specifically, SplineFlow exploits the smoothness and stability of B-spline bases to learn the complex underlying dynamics in a structured manner while ensuring the multi-marginal requirements are met. Comprehensive experiments across various deterministic and stochastic dynamical systems of varying complexity, as well as on cellular trajectory inference tasks, demonstrate the strong improvement of SplineFlow over existing baselines. Our code is available at: https://github.com/santanurathod/SplineFlow.

2601.21656 2026-05-15 cs.LG

TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering

Tianqi Zhao, Guanyang Wang, Yan Shuo Tan, Qiong Zhang

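Affiliations Renmin University of China; Rutgers University; National University of Singapore

AI summary This paper proposes TabClustPFN, a new network for tabular data clustering, a fundamental yet challenging problem. Built on prior-fitted networks (PFNs) and pretrained on synthetic data, the method clusters unseen datasets in a single pass, without retraining or hyperparameter tuning. TabClustPFN handles heterogeneous numerical and categorical features and adapts to a variety of clustering structures. Experiments show that it outperforms classical and deep clustering methods on both synthetic and real-world datasets, with good robustness and practicality.

Abstract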

Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data-generating mechanisms, and the absence of transferable inductive biases across datasets. Prior-fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation-invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset-specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings. Code is available at https://github.com/Tianqi-Zhao/TabClustPFN.

2601.21349 2026-05-15 cs.LG cs.AI

L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

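Affiliations Hokkaido University

AI summary This paper proposes L2R, a unified routing framework that improves the routing mechanism in Mixture-of-Experts (MoE) models. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of the routing function, yielding smoother and more stable routing geometry. L2R also adopts a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Experiments show that L2R improves routing performance and overall model performance on both language and vision tasks.

Abstract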

Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on an OLMoE-based language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing geometry, expert discrimination, and overall model performance. Code will be released.

2601.21174 2026-05-15 cs.LG

Breaking the Reasoning Horizon in Entity Alignment Foundation Models

Yuanning Cui, Zequn Sun, Wei Hu, Kexuan Xin, Zhangjie Fu

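Affiliations Nanjing University of Information Science and Technology; State Key Laboratory for Novel Software Technology, Nanjing University; National Institute of Healthcare Data Science, Nanjing University; University of Queensland; Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology

AI summary Entity alignment is a key task in knowledge graph fusion, but existing models lack transferability to unseen knowledge graphs. This paper proposes an entity alignment foundation model based on a parallel encoding strategy: seed alignment pairs serve as local anchors that guide the information flow and initialize two parallel encoding streams simultaneously, which shortens the reasoning path and improves adaptation to sparse, heterogeneous structures. The model also introduces a merged relation graph and a learnable interaction module to model global dependencies and achieve precise matching. Experiments show that the method generalizes well to unseen knowledge graphs.

Abstract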

Entity alignment (EA) is critical for knowledge graph (KG) fusion. Existing EA models lack transferability and are incapable of aligning unseen KGs without retraining. While graph foundation models (GFMs) offer a solution, we find that directly adapting GFMs to EA remains largely ineffective. This stems from a critical "reasoning horizon gap": unlike link prediction in GFMs, EA necessitates capturing long-range dependencies across sparse and heterogeneous KG structures. To address this challenge, we propose an EA foundation model driven by a parallel encoding strategy. We utilize seed EA pairs as local anchors to guide the information flow, initializing and encoding two parallel streams simultaneously. This facilitates anchor-conditioned message passing and significantly shortens the inference trajectory by leveraging local structural proximity instead of global search. Additionally, we incorporate a merged relation graph to model global dependencies and a learnable interaction module for precise matching. Extensive experiments verify the effectiveness of our framework, highlighting its strong generalizability to unseen KGs.

2601.21151 2026-05-15 cs.LG physics.ao-ph

Learning to Advect: A Neural Semi-Lagrangian Architecture for Weather Forecasting

Carlos A. Pereira, Stéphane Gaudreault, Valentin Dallerit, Christopher Subich, Shoyon Panday, Siqi Wei, Sasa Zhang, Siddharth Rout, Eldad Haber, Raymond J. Spiteri, David Millard, Emilia Diaconescu

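Affiliations Recherche en prévision numérique atmosphérique, Environnement et Changement climatique Canada; Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia; Department of Computer Science, University of Saskatchewan; Department of Mechanical Engineering, Rochester Institute of Technology

AI summary This study proposes PARADIS, a physics-inspired weather prediction model that addresses the efficiency and accuracy problems of conventional machine-learning methods in representing physical processes such as atmospheric transport. Its core idea is to decompose weather dynamics into advection, diffusion, and reaction blocks and to model global transport along trajectories with a neural semi-Lagrangian operator, improving forecast performance while preserving physical structure. Experiments show that PARADIS achieves good deterministic forecast skill on ERA5 benchmarks, with notable advantages in short-lead forecasts and in the spectral fidelity of medium-range forecasts.

Abstract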

Recent machine-learning approaches to weather forecasting often employ a monolithic architecture in which distinct physical mechanisms (advection, i.e., long-range transport; diffusion-like mixing; thermodynamic processes; and forcing) are represented implicitly within a single large network. This is particularly problematic for advection, where long-range transport typically requires expensive global interaction mechanisms or deep stacks of local convolutional layers. To mitigate this, we present PARADIS, a physics-inspired global weather prediction model that enforces inductive biases on network behavior through a functional decomposition into advection, diffusion, and reaction blocks acting on latent variables. We implement advection through a Neural Semi-Lagrangian operator that performs trajectory-based transport via differentiable interpolation on the sphere, enabling end-to-end learning of both the latent modes to be transported and their characteristic trajectories. Diffusion-like processes are modeled by depthwise-separable spatial mixing, whereas local source terms and vertical interactions are handled via pointwise channel interactions, yielding a physically structured operator decomposition. Evaluated on ERA5 benchmarks, PARADIS achieves competitive deterministic forecast skill, with particularly strong short-lead performance, while preserving substantially better spectral fidelity and forecast activity during medium-range rollouts.
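
For readers unfamiliar with semi-Lagrangian transport, the sketch below shows the classical, non-neural version of the idea in one dimension: each grid point traces back along the flow to its departure point and interpolates the field there. It is a minimal illustration assuming a constant velocity, a periodic 1D grid, and linear interpolation. The PARADIS operator works on the sphere with learned trajectories, which this sketch does not attempt to reproduce, and the function name is hypothetical.

```python
import math


def semi_lagrangian_step(field, velocity, dt, dx):
    """Advance a periodic 1D field by one semi-Lagrangian advection step.

    For each grid index i, the departure point is i - velocity * dt / dx
    (in grid units); the new value is the old field linearly interpolated
    at that point.
    """
    n = len(field)
    shift = velocity * dt / dx  # displacement measured in grid cells
    new_field = []
    for i in range(n):
        departure = i - shift
        j = math.floor(departure)      # left neighbor of the departure point
        frac = departure - j           # fractional offset in [0, 1)
        left = field[j % n]            # modulo implements periodic wrap-around
        right = field[(j + 1) % n]
        new_field.append((1.0 - frac) * left + frac * right)
    return new_field
```

When the displacement is exactly one grid cell the field is shifted by one index, and a half-cell displacement averages each pair of neighbors. Because values are fetched from wherever the trajectory originates, the step can move information over many cells at once, which is the long-range transport property the abstract highlights.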

2601.19924 2026-05-15 cs.CL cs.AI cs.LG

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge

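Affiliations Shanghai University of Finance and Economics; Booth School of Business, University of Chicago; Antai School of Economics and Management, Shanghai Jiao Tong University

AI summary This paper studies the performance and scalability of large language models (LLMs) in optimization modeling and proposes OPT-ENGINE, an extensible benchmark framework for systematically evaluating automated modeling and solving of classical operations research problems, from linear programming to mixed-integer programming. Using this framework, the study finds that pure-text reasoning lacks robustness as task complexity increases, and that adding external computational tools improves local calculation but still fails to satisfy global optimization constraints. The study further finds that for the current state-of-the-art solver-integrated reasoning approach, the automated formulation of constraints remains the main bottleneck, giving a clear direction for the next generation of LLMs for optimization modeling.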

Journal ref Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Abstract

We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-ENGINE, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).

2601.15620 2026-05-15 cs.LG

Closing the Gap on the Sample Complexity of 1-Identification

Zitian Li, Wang Chi Cheung

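Affiliations Department of Industrial Systems Engineering & Management

AI summary This paper studies the 1-identification problem in multi-armed bandits: deciding whether some arm has mean reward above a given threshold $\mu_0$, and outputting "None" otherwise. The authors propose a new optimization framework, derive a lower bound on the minimum sample complexity when at least one qualified arm exists, and design a new algorithm whose upper bound matches the lower bound up to polylogarithmic factors, closing the gap in the sample complexity analysis of this problem.

Abstract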

The 1-identification problem is a fundamental pure-exploration problem in multi-armed bandits. An agent aims to determine whether there exists an arm whose mean reward exceeds a known threshold $\mu_0$, or to output \textsf{None} otherwise. The agent must guarantee correctness with probability at least $1-\delta$, while minimizing the expected number of arm pulls $\mathbb{E}[\tau]$. We study the 1-identification problem and make two main contributions. First, for instances with at least one qualified arm, we derive a new lower bound on $\mathbb{E}[\tau]$ via a novel optimization formulation. Second, we propose a new algorithm and establish upper bounds that match the lower bounds up to polylogarithmic factors uniformly over all instances. Our result complements the analysis of $\mathbb{E}[\tau]$ when there are multiple qualified arms, which is an open problem in the literature.

2601.03969 2026-05-15 cs.AI cs.CL

Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models

Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

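Affiliations University of Science and Technology of China; Xiaohongshu Inc.; The Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper studies "length shift" in large language models, a phenomenon caused by reinforcement learning reward mechanisms during training in which models generate redundant reasoning on simple problems. The authors propose Dynamic Outlier Truncation (DOT), which selectively suppresses redundant output during training while preserving long reasoning on complex problems. Combined with auxiliary KL regularization and predictive dynamic sampling, the method improves reasoning efficiency and performance, and experiments show it significantly outperforms existing methods on multiple tasks.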

Comments Accepted by ACL 2026

Abstract

Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift, in which models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.

2601.03630 2026-05-15 cs.CL

Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase

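Affiliations Institute of Science Tokyo; Harbin Institute of Technology; The University of Osaka

AI summary This paper presents the first systematic comparison of large reasoning models (LRMs) and non-reasoning large language models (LLMs) as judges. LRMs outperform non-reasoning models in judgment accuracy, instruction following, and robustness to adversarial attacks, but still show strong evaluation biases. The authors propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before judging, which mitigates bias while preserving overall judgment accuracy.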

Comments Accepted by ACL 2026 Workshop EvalEval

Abstract

This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.