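arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.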

2512.08048 2026-06-02 cs.CV

Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation

Chandler Timm C. Doloriel, Yunbei Zhang, Yeonguk Yu, Taki Hasan Rafi, Muhammad Salman Siddiqui, Tor Kristian Stevik, Fadi Al Machot, Kristian Hovde Liland, Habib Ullah

Affiliations: Faculty of Science and Technology (REALTEK), Norwegian University of Life Sciences (NMBU); Tulane University; Gwangju Institute of Science and Technology; Hanyang University

AI summary: Through controlled-variable experiments, this work systematically studies spatial versus frequency masking in continual test-time adaptation. It finds that spatial masking accumulates stable representations on patch-tokenized architectures while frequency masking collapses catastrophically, and that the optimal masking family depends on architecture-task alignment.

Comments: Accepted to TMLR 2026; code at https://github.com/chandlerbing65nm/m2a.git

Abstract

Recent continual test-time adaptation (CTTA) methods adopt masked image modeling to stabilize learning under distribution shift, yet each treats its masking family F as a fixed design choice and innovates exclusively along the selection strategy S, leaving the family axis underexplored. We present a systematic empirical study that isolates this axis. Using a controlled CTTA instantiation -- Mask to Adapt (M2A) -- that fixes S = random and standard losses, we vary only F across spatial (patch, pixel) and frequency (all-band, low-band, high-band) families while keeping every other component identical. The study's contributions are the design guidance it extracts for the CTTA settings we evaluated: (1) the masking family determines whether adaptation compounds useful structure or compounds errors -- on patch-tokenized architectures, spatial masking accumulates stable representations over long streams while frequency masking collapses catastrophically. We characterize this instability through a structural-preservation account, where spatial coherence maintains the broad-spectrum redundancy needed to avoid terminally overlapping with a corruption's spectral signature; (2) the optimal family depends on architecture-task alignment -- on CNNs, whose overlapping receptive fields dilute patch occlusion, the family gap vanishes, whereas on fine-grained tasks with global cues and large-capacity ViTs, frequency masking becomes competitive. In confounded system-level comparisons -- where baselines also differ in losses and auxiliary components -- M2A's random selection performs comparably to heuristic strategies, though we treat this observation as suggestive context rather than a controlled quantification of S's relative importance.
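To make the two masking families concrete, here is a minimal sketch on a toy grayscale grid. It is not the M2A implementation: the function names, the choice to mask by zeroing, the square band defined on wrapped frequency indices, and the naive DFT are all assumptions made for illustration.

```python
import cmath
import random


def patch_mask(img, patch, ratio, rng):
    """Spatial family: zero a random subset of non-overlapping patch x patch blocks."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    blocks = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    for r, c in rng.sample(blocks, int(ratio * len(blocks))):
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                out[i][j] = 0.0
    return out


def dft2(grid, inverse=False):
    """Naive 2-D DFT, O(N^4); only meant for a tiny toy grid."""
    h, w = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = 2 * cmath.pi * (u * y / h + v * x / w)
                    total += grid[y][x] * cmath.exp(sign * 1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out


def low_band_mask(img, cutoff):
    """Frequency family (assumed form): zero coefficients with wrapped index below cutoff."""
    h, w = len(img), len(img[0])
    spec = dft2(img)
    for u in range(h):
        for v in range(w):
            fu, fv = min(u, h - u), min(v, w - v)  # wrapped (symmetric) frequency index
            if max(fu, fv) < cutoff:
                spec[u][v] = 0j
    # The zeroed set is symmetric, so the inverse transform is real up to rounding.
    return [[z.real for z in row] for row in dft2(spec, inverse=True)]


rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]
spatially_masked = patch_mask(image, patch=4, ratio=0.5, rng=rng)
frequency_masked = low_band_mask(image, cutoff=2)
```

The spatial mask removes whole local regions and leaves the rest of the image untouched, whereas the frequency mask changes every pixel a little; that difference in what structure survives is the axis the abstract isolates.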


2511.00206 2026-06-02 cs.AI cs.CL

Addressing Longstanding Challenges in Cognitive Science with Language Models

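Dirk U. Wulff, Rui Mata

Affiliations: Center for Adaptive Rationality, Max Planck Institute for Human Development; Faculty of Psychology, University of Basel

AI summary: This paper discusses how language models can help address longstanding challenges in cognitive science, such as research integration, formalization, and conceptual clarity, and notes the associated risks and opportunities.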

2509.25837 2026-06-02 cs.LG cs.AI

Distillation of Large Language Models via Concrete Score Matching

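Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, Il-Chul Moon

Affiliations: Korea Advanced Institute of Science and Technology

AI summary: Proposes the Concrete Score Distillation (CSD) objective, which uses discrete score matching to overcome softmax smoothing and the restrictions tied to logit shift invariance, aligning relative logit differences across all vocabulary pairs between student and teacher with flexible weighting; it outperforms existing distillation methods on GPT-2, OpenLLaMA, and GEMMA.

Comments: ICLR 2026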

Abstract

Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation. Code: https://github.com/aailab-kaist/CSD.
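The role of shift invariance and the quadratic cost can be illustrated with a toy quantity. The sketch below is not the CSD objective: it assumes uniform weights and a squared penalty on pairwise logit differences, and the function names are made up for illustration. With those assumptions, the sum over all vocabulary pairs collapses to a linear-time expression.

```python
def pairwise_logit_gap(student, teacher):
    """Brute force, O(V^2): sum over pairs (i, j) of ((s_i - s_j) - (t_i - t_j))^2."""
    d = [s - t for s, t in zip(student, teacher)]
    return sum((di - dj) ** 2 for di in d for dj in d)


def pairwise_logit_gap_linear(student, teacher):
    """Same value in O(V), using sum_{i,j} (d_i - d_j)^2 = 2V * sum(d^2) - 2 * (sum d)^2."""
    d = [s - t for s, t in zip(student, teacher)]
    return 2 * len(d) * sum(x * x for x in d) - 2 * sum(d) ** 2


def direct_logit_mse(student, teacher):
    """Plain squared error on raw logits, which is NOT shift invariant."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher))


student = [2.0, 0.5, -1.0]
teacher = [1.0, 1.0, 0.0]
shifted_teacher = [t + 5.0 for t in teacher]  # same softmax distribution as teacher

gap = pairwise_logit_gap(student, teacher)
gap_shifted = pairwise_logit_gap(student, shifted_teacher)
mse = direct_logit_mse(student, teacher)
mse_shifted = direct_logit_mse(student, shifted_teacher)
```

Adding a constant to every teacher logit leaves the softmax output and the pairwise quantity unchanged, while the direct logit error changes, which is the restriction on the solution set that the abstract attributes to direct logit distillation.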


2603.00133 2026-06-02 cs.CV cs.AI

You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

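Kairan Zhao, Eleni Triantafillou, Peter Triantafillou

Affiliations: University of California, San Diego

AI summary: Proposes the GUARD framework, which adjusts the denoising process through attractive-repulsive dynamics combined with a cross-attention attenuation mechanism, mitigating memorization in text-to-image diffusion models without hurting image quality.

Comments: Accepted at ICML 2026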

Abstract

Generative models have been shown to "memorize" certain training data, leading to verbatim or near-verbatim generated images, which may cause privacy concerns or copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide the generation away from an original training image and towards one that is distinct from training data while remaining aligned with the prompt, guarding against reproducing training data without hurting image generation quality. We propose a concrete instantiation of this framework, where the positive target that we steer towards is given by a novel method for (cross) attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross-attention must be attenuated and (ii) attenuating cross-attention in these per-prompt locations. The resulting GUARD offers a surgical, dynamic per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.


2603.00021 2026-06-02 cs.CL

From Global to Local: Learning Context-Aware Graph Representations for Document Classification and Summarization

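Ruangrin Ldallitsakool, Margarita Bugueño, Gerard de Melo

Affiliations: University of Potsdam; Hasso Plattner Institute (HPI)

AI summary: Proposes a data-driven graph construction method that uses a dynamic sliding-window attention module to capture local and mid-range semantic dependencies between sentences; combined with graph attention networks, it achieves competitive document classification results at lower computational cost, and the paper also explores the method's potential and limitations for extractive summarization.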

Abstract

Recent NLP systems commonly represent documents as linear token sequences. Although this captures sequential order, it can hinder modeling long-range dependencies and global document structure, especially for long texts. This paper proposes a data-driven method to automatically construct graph-based document representations. Building upon the recent work of Bugueño and de Melo (2025), we leverage the dynamic sliding-window attention module to effectively capture local and mid-range semantic dependencies between sentences, as well as structural relations within documents. Graph Attention Networks (GATs) trained on our learned graphs achieve competitive results on document classification while requiring lower computational resources than previous approaches. We further present an exploratory evaluation of the proposed graph construction method for extractive document summarization, highlighting both its potential and current limitations. The implementation of this project can be found on GitHub.


2602.24201 2026-06-02 cs.LG

Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics

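Egor Antipov, Alessandro Palma, Lorenzo Consoli, Stephan Günnemann, Andrea Dittadi, Fabian J. Theis

Affiliations: ETH Zurich; University of Cambridge; Max Planck Institute for Informatics

AI summary: Uses condition-aware flow matching to derive a single dynamical formulation that tracks density ratios along generative trajectories, enabling efficient density ratio estimation between intractable distributions, with competitive results in single-cell genomics data analysis.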

Abstract

Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different data-generating processes across conditions. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive evaluations are computationally expensive and prone to discretization errors because they require simulating each distribution's likelihood independently. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.


2602.23881 2026-06-02 cs.LG cs.CL

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

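Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev

AI summary: Standard KL-divergence training for speculative decoding does not maximize the acceptance rate; this work proposes LK losses that optimize the acceptance rate directly, and experiments show consistent gains in acceptance metrics across multiple architectures and models.

Comments: ICML 2026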

Abstract

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.
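The gap between the KL proxy and the acceptance rate can be seen on a three-token toy vocabulary. This sketch does not implement the LK losses. It assumes the standard speculative sampling rule, in which a drafted token x is accepted with probability min(1, p(x)/q(x)), so the per-step acceptance rate is the sum of min(p(x), q(x)); it also assumes the proxy is the forward KL from target to draft. The distributions are invented for illustration.

```python
import math


def acceptance_rate(target, draft):
    """Expected acceptance under speculative sampling: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))


def forward_kl(target, draft):
    """KL(target || draft); terms with zero target mass contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(target, draft) if p > 0)


target = [0.7, 0.2, 0.1]
draft_a = [0.7, 0.299, 0.001]  # matches the head, nearly drops the rare token
draft_b = [0.5, 0.3, 0.2]      # smoother, spreads mass away from the head

for name, draft in (("draft_a", draft_a), ("draft_b", draft_b)):
    print(name, round(forward_kl(target, draft), 4), round(acceptance_rate(target, draft), 4))
```

Here draft_b has the lower KL divergence but also the lower acceptance rate, so a capacity-limited draft model that minimizes KL is not guaranteed to maximize acceptance, which is the motivation the abstract gives for targeting acceptance directly.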


2602.23204 2026-06-02 cs.CV cs.RO

Motion-aware Event Suppression for Event Cameras

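Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

Affiliations: Robotics and Perception Group, University of Zurich, Switzerland

AI summary: Proposes the first motion-aware event suppression framework, which jointly segments independently moving objects in the current event stream and predicts their future motion to enable anticipatory suppression of dynamic events; on the EVIMO benchmark, segmentation accuracy improves by 67% and inference is 53% faster.

Comments: Robotics: Science and Systems (RSS) 2026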

2602.17588 2026-06-02 cs.CL cs.HC

Modeling Distinct Human Interaction in Web Agents

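Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham

Affiliations: Carnegie Mellon University; Duke University

AI summary: By collecting 400 real-user web navigation trajectories, this paper identifies four patterns of human-agent interaction and trains language models to predict when users will intervene, improving intervention prediction accuracy by 61.4-63.4% and user-rated agent usefulness by 36.8%.

Comments: Preprint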

Abstract

Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 36.8% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.


2602.23197 2026-06-02 cs.CL cs.LG stat.ML

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

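Chungpa Lee, Jy-yong Sohn, Kangwook Lee

Affiliations: KAIST

AI summary: Through a theoretical analysis of linear attention models, this paper characterizes how fine-tuning objectives modify attention parameters and the conditions under which few-shot performance degrades, and shows that updating only the value matrix preserves in-context learning.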

Journal ref: International Conference on Machine Learning (ICML) 2026

Abstract

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We provide empirical evidence from synthetic and real-world datasets consistent with the qualitative predictions of our theory.


2602.16953 2026-06-02 cs.AI cs.LG

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

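Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

Affiliations: University of Science and Technology of China

AI summary: Proposes LLM4Cov, an offline agent-learning framework that achieves high-coverage testbench generation for hardware verification through execution-validated data curation, policy-aware data synthesis, and worst-state-prioritized sampling; a 4B-parameter model reaches a 69.2% pass rate and 90.4% average coverage on CVDP-ECov.

Comments: ICML'26 Camera Ready version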

Abstract

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtain, making online reinforcement learning (RL) less practical in certain scenarios. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as single-step state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% pass rate and 90.4% average coverage in CVDP-ECov under agentic evaluation, outperforming its teacher by 5.3% and 10.5%, demonstrating competitive performance against models an order of magnitude larger.


2601.17074 2026-06-02 cs.LG cs.AI

Physics-Encoded Inverse Modeling for Arctic Snow Depth Prediction

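Akila Sampath, Vandana Janeja, Jianwu Wang

Affiliations: University of California, Berkeley

AI summary: Proposes PhysE-Inv, a physics-encoded inverse modeling framework that combines LSTM sequence learning with contrastive learning regularization to estimate snow depth under sparse observations, reducing mean squared error by 24.7% on average.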

Abstract

Accurate estimation in time-varying inverse problems under limited and sparse observations remains a fundamental challenge across scientific domains. For example, snow depth estimation requires inferring hidden parameters governing sea ice physics, which can be incorporated through physics-informed encoding. To address this challenge, we introduce Physics-Encoded Inversion (PhysE-Inv), a novel framework that combines deep sequential learning with physics-informed inference for solving inverse problems under real-world sparse observational settings. PhysE-Inv integrates an LSTM encoder-decoder to capture temporal dependencies, together with contrastive learning regularization that enforces noise-invariant latent representations. The framework learns latent parameters that, when combined with observational inputs, reconstruct snow depth while incorporating physics-informed guidance. PhysE-Inv consistently outperforms all evaluated baselines, achieving an average MSE reduction of 24.7% across all baseline models and a 17.3% improvement over the strongest baseline under parameter estimation settings. Overall, our work demonstrates a generalizable inverse modeling paradigm for data-scarce domains where physics-informed guidance can be incorporated into sparse observations.


2602.20807 2026-06-02 cs.CV cs.RO

RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu

Affiliations: Capital Normal University; Saarland University; Xi’an Jiaotong-Liverpool University; King’s College London

AI summary: Proposes the RU4D-SLAM framework, which introduces temporal factors, uncertainty-aware perception, and a semantic-guided reweighting mechanism to address tracking and 4D scene reconstruction for 3D Gaussian splatting SLAM in dynamic environments.

Abstract

Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, where moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: https://ru4d-slam.github.io


2505.18877 2026-06-02 cs.LG

RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models

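Yilang Zhang, Bingcong Li, Georgios B. Giannakis

Affiliations: Department of ECE, University of Minnesota; Department of CS, ETH Zürich

AI summary: LoRA's non-unique low-rank factorizations cause inconsistent weight updates and performance degradation; RefLoRA selects, at each step, the low-rank factorization that minimizes an upper bound on the loss, promoting a flatter loss landscape and stable convergence. It outperforms existing LoRA variants on natural language understanding and commonsense reasoning tasks with negligible computational overhead.

Comments: Accepted as a conference paper at NeurIPS 2025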

Abstract

Low-Rank Adaptation (LoRA) lowers the computational and memory overhead of fine-tuning large models by updating a low-dimensional subspace of the pre-trained weight matrix. Albeit efficient, LoRA exhibits suboptimal convergence and noticeable performance degradation, due to inconsistent and imbalanced weight updates induced by its nonunique low-rank factorizations. To overcome these limitations, this article identifies the optimal low-rank factorization per step that minimizes an upper bound on the loss. The resultant refactored low-rank adaptation (RefLoRA) method promotes a flatter loss landscape, along with consistent and balanced weight updates, thus speeding up stable convergence. Extensive experiments evaluate RefLoRA on natural language understanding and commonsense reasoning tasks with popular large language models including DeBERTaV3, LLaMA-7B, LLaMA2-7B and LLaMA3-8B. The numerical tests corroborate that RefLoRA converges faster, outperforms various benchmarks, and enjoys negligible computational overhead compared to state-of-the-art LoRA variants.
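The non-uniqueness the abstract refers to can be shown on a rank-1 toy case. This sketch is not RefLoRA: it only demonstrates that two factorizations of the same weight update react differently to one plain SGD step. The helper names, the scalar rescaling, and the identity gradient are assumptions for illustration.

```python
def outer(b, a):
    """Rank-1 weight update W = b a^T, stored as a list of rows."""
    return [[bi * aj for aj in a] for bi in b]


def sgd_step(b, a, grad_w, lr):
    """One SGD step on the factors of W = outer(b, a), given dL/dW."""
    grad_b = [sum(grad_w[i][j] * a[j] for j in range(len(a))) for i in range(len(b))]
    grad_a = [sum(b[i] * grad_w[i][j] for i in range(len(b))) for j in range(len(a))]
    new_b = [bi - lr * g for bi, g in zip(b, grad_b)]
    new_a = [aj - lr * g for aj, g in zip(a, grad_a)]
    return new_b, new_a


def rescale(b, a, c):
    """Equivalent factorization: (c * b)(a / c)^T equals b a^T for any c != 0."""
    return [c * bi for bi in b], [aj / c for aj in a]


b, a = [1.0, 2.0], [3.0, 4.0]
b2, a2 = rescale(b, a, 2.0)
grad_w = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gradient of the loss w.r.t. W

w_before, w2_before = outer(b, a), outer(b2, a2)
w_after = outer(*sgd_step(b, a, grad_w, lr=0.1))
w2_after = outer(*sgd_step(b2, a2, grad_w, lr=0.1))
```

Both factorizations start from the same W, yet after one step the resulting weights differ, so which factorization the optimizer happens to hold affects the update. Choosing that factorization deliberately at each step is the degree of freedom the abstract says RefLoRA exploits.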


2602.20019 2026-06-02 cs.LG cs.AI

Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision

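Yuxing Tian, Yiyan Qi, Fengran Mo, Weixu Zhang, Jian Guo, Jian-Yun Nie

Affiliations: University of Science and Technology of China

AI summary: To address the scarcity of labeled anomalies in dynamic graph anomaly detection, this work proposes a model-agnostic framework combining residual representation encoding, a restriction loss, and bi-boundary optimization; it learns a discriminative boundary from normal/unlabeled data while leveraging limited labeled anomalies and preserving generalization to unseen anomalies.

Comments: Accepted by ICML 2026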

Abstract

Dynamic graph anomaly detection is critical for many real-world applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundaries, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize poorly to unseen anomalies. To address this gap, we consider a largely underexplored problem: learning a discriminative boundary from normal/unlabeled data, while leveraging limited labeled anomalies when available without sacrificing generalization to unseen anomalies. In this paper, we propose an effective, generalizable, and model-agnostic framework with three main components: (i) residual representation encoding that captures deviations between current interactions and their historical context, providing anomaly-relevant signals; (ii) a restriction loss that constrains the normal representations within an interval bounded by two co-centered hyperspheres, ensuring consistent scales while keeping anomalies separable; (iii) a bi-boundary optimization strategy that learns a discriminative and robust boundary using the log-likelihood distribution modeled by a normalizing flow. Extensive experiments demonstrate the superiority of our framework across diverse evaluation settings.
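One simple way to picture component (ii) is a hinge-style penalty that is zero inside the shell between two co-centered hyperspheres and grows linearly outside it. The abstract does not give the loss formula, so the functional form, the function name, and the radii below are assumptions for illustration only.

```python
import math


def shell_penalty(z, center, r_in, r_out):
    """Zero when r_in <= ||z - center|| <= r_out, linear in the violation otherwise."""
    dist = math.dist(z, center)
    return max(0.0, r_in - dist) + max(0.0, dist - r_out)


center = (0.0, 0.0)
inside_shell = shell_penalty((1.5, 0.0), center, r_in=1.0, r_out=2.0)
too_close = shell_penalty((0.5, 0.0), center, r_in=1.0, r_out=2.0)
too_far = shell_penalty((3.0, 4.0), center, r_in=1.0, r_out=2.0)
```

Bounding the norm from both sides keeps normal representations at a consistent scale, which matches the stated goal; how the actual framework weights or smooths this constraint is not specified in the abstract.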


2602.19857 2026-06-02 cs.CV

Contrastive meta-domain adaptation for robust skin lesion classification across clinical and acquisition conditions

Rodrigo Mota, Kelvin Cunha, Emanoel dos Santos, Fábio Papais, Francisco Filho, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

Affiliations: University of São Paulo

AI summary: Proposes an adaptation strategy based on the concept of visual meta-domains, which transfers visual representations from large-scale dermoscopy datasets to the clinical image domain to make skin lesion classification generalize more robustly.

Comments: 4 pages, 5 figures, 1 table, Published in: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

2602.19848 2026-06-02 cs.CV

DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

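Francisco Filho, Kelvin Cunha, Fábio Papais, Emanoel dos Santos, Rodrigo Mota, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

Affiliations: Universidade Federal do Pernambuco

AI summary: To address the class imbalance caused by scarce malignant samples in skin lesion classification, this work generates synthetic images with a class-conditional diffusion model, learns robust features through self-supervised MAE pretraining, and uses knowledge distillation to transfer a large model's knowledge to a lightweight ViT student, improving classification performance while enabling efficient on-device inference.

Comments: 4 pages, 2 figures, 1 table, Published in: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)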

2602.19789 2026-06-02 cs.LG cs.CY

Position: Stop Preaching and Start Practising Data Frugality for Responsible Development of AI

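Sophia N. Wilson, Andrew Millard, Guðrún Fjóla Guðmundsdóttir, Raghavendra Selvan, Sebastian Mair

AI summary: This paper argues that the machine learning community should move from preaching to practising data frugality, using subset selection methods to substantially cut training energy use and carbon emissions while maintaining accuracy, for the responsible development of AI.

Comments: ICML 2026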

Abstract

This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. For too long, progress has been equated with ever-larger datasets, driving remarkable advances but now yielding increasingly diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data-frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preaching and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial, demonstrating that subset selection methods can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetorical preaching to concrete practice for responsible development of AI.

2602.19126 2026-06-02 cs.LG math.PR math.ST stat.TH

Robust Predictive Uncertainty and Double Descent in Contaminated Bayesian Random Features

Michele Caprio, Katerina Papagiannouli, Siu Lun Chau, Sayan Mukherjee

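Affiliations: The University of Manchester, UK; University of Pisa, Italy; Nanyang Technological University, Singapore; Max Planck Institute for Mathematics in the Sciences, Germany

AI summary: Proposes a robust Bayesian random feature regression method that handles prior and likelihood misspecification via Huber contamination sets, derives lower and upper bounds on the posterior predictive density, introduces an Imprecise Highest Density Region (IHDR) for robust uncertainty quantification, and shows that predictive uncertainty remains computationally tractable and inherits the classical double-descent phase structure.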

Abstract

We propose a robust Bayesian formulation of random feature (RF) regression that accounts explicitly for prior and likelihood misspecification via Huber-style contamination sets. Starting from the classical equivalence between ridge-regularized RF training and Bayesian inference with Gaussian priors and likelihoods, we replace the single prior and likelihood with $\varepsilon$- and $\eta$-contaminated credal sets, respectively, and perform inference using pessimistic generalized Bayesian updating. We derive explicit and tractable bounds for the resulting lower and upper posterior predictive densities. These bounds show that, when contamination is moderate, prior and likelihood ambiguity effectively acts as a direct contamination of the posterior predictive distribution, yielding uncertainty envelopes around the classical Gaussian predictive. We introduce an Imprecise Highest Density Region (IHDR) for robust predictive uncertainty quantification and show that it admits an efficient approximation via an adjusted Gaussian credible interval. We further obtain predictive variance bounds (under a mild truncation approximation for the upper bound) and prove that they preserve the leading-order proportional-growth asymptotics known for RF models. Together, these results establish a robustness theory for Bayesian random features: predictive uncertainty remains computationally tractable, inherits the classical double-descent phase structure, and is improved by explicit worst-case guarantees under bounded prior and likelihood misspecification.
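
For readers unfamiliar with the two ingredients this abstract starts from, their standard textbook forms are sketched here. The symbols $\Phi$ (random feature matrix), $\tau^2$, $\sigma^2$, $\pi_0$, and $\mathcal{Q}$ are notation assumed for this note and may differ from the paper's. The ridge/Bayes equivalence reads

$$
w \sim \mathcal{N}(0, \tau^2 I), \quad y \mid w \sim \mathcal{N}(\Phi w, \sigma^2 I)
\;\Longrightarrow\;
\mathbb{E}[w \mid y] = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y, \quad \lambda = \sigma^2 / \tau^2,
$$

so the posterior mean is the minimizer of $\|y - \Phi w\|^2 + \lambda \|w\|^2$. A Huber-style $\varepsilon$-contamination set around a nominal prior $\pi_0$ is usually written as

$$
\mathcal{P}_\varepsilon = \{ (1 - \varepsilon)\, \pi_0 + \varepsilon\, q \;:\; q \in \mathcal{Q} \}, \quad \varepsilon \in [0, 1],
$$

where $\mathcal{Q}$ is the class of admissible contaminating distributions. The $\eta$-contaminated likelihood set has the same shape with $\eta$ in place of $\varepsilon$.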

2602.19066 2026-06-02 cs.LG cs.AI

IDLM: Inverse-distilled Diffusion Language Models

David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin

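AI summary: To address slow inference in diffusion language models, proposes Inverse-distilled Diffusion Language Models (IDLM), backed by a uniqueness guarantee for the inverse formulation and gradient-stable relaxations, reducing the number of inference steps by 4x to 64x while preserving generation quality.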

Comments ICML 2026. We provide the code at: https://david-cripto.github.io/idlm-project-page

Abstract

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's generation quality. We provide the code, model checkpoints, and video tutorials on the project page: https://david-cripto.github.io/idlm-project-page

2602.16902 2026-06-02 cs.AI cs.LG

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

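Affiliations: University of Oxford, UK; University College London (Centre for AI), UK; University of Basel, Switzerland

AI summary: Introduces the LLM-Wikirace benchmark, which evaluates planning, reasoning, and world knowledge in LLMs through Wikipedia hyperlink navigation. Models are superhuman on easy tasks, but the best model succeeds in only 23% of hard games, with planning and long-horizon reasoning as the main bottlenecks.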

Abstract

We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard are available at https://llmwikirace.github.io.
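
The navigation task described above has a natural reference point: the shortest hyperlink path between two pages, which breadth-first search finds on an unweighted link graph. The sketch below is only an illustration of that baseline. The toy graph `TOY_LINKS` and the function name `shortest_link_path` are hypothetical, and the abstract does not say how the benchmark itself scores efficiency.

```python
from collections import deque


def shortest_link_path(links, source, target):
    """Return the shortest page sequence from source to target, or None.

    `links` maps a page title to the titles it links to (directed edges).
    """
    if source == target:
        return [source]
    parents = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt in parents:
                continue
            parents[nxt] = page
            if nxt == target:
                # Walk parent pointers back to the source, then reverse.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical toy link graph, not real Wikipedia data.
TOY_LINKS = {
    "Coffee": ["Ethiopia", "Caffeine"],
    "Ethiopia": ["Africa", "Coffee"],
    "Caffeine": ["Chemistry"],
    "Africa": ["Nile"],
    "Chemistry": ["Periodic table"],
    "Nile": [],
}

print(shortest_link_path(TOY_LINKS, "Coffee", "Nile"))
```

A model's step count on a game could then be compared with the length of this path. Because links are directed, a reachable target in one direction may be unreachable in the other.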

2602.13430 2026-06-02 cs.CV

Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

Ha-Hieu Pham, Hai-Dang Nguyen, Thanh-Huy Nguyen, Min Xu, Ulas Bagci, Trung-Nghia Le, Huy-Hieu Pham

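Affiliations: University of Technology, Vietnam; National University of Singapore; University of California, San Diego; University of Texas at Austin

AI summary: To address extreme long-tailed multi-label distributions and missing annotations for rare or unseen findings in chest X-ray classification, proposes imbalance-aware multi-label learning (Task 1) and a zero-shot prediction approach that uses no supervised labels from the OOD classes (Task 2), ranking first on the public development-phase leaderboard of the CXR-LT 2026 challenge.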

DOI: 10.1109/ISBI61048.2026.11515586
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

Abstract

Chest X-Ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.
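
Macro-averaged mAP, the metric named above, weights every class equally, which is why it is sensitive to tail classes. The sketch below shows one common way to compute it (non-interpolated average precision per class, then an unweighted mean). The function names and the tiny label/score matrices are made up for illustration, and the challenge's official evaluation script may differ in details such as tie handling.

```python
def average_precision(labels, scores):
    """Non-interpolated AP for one class.

    Mean of precision@rank taken at the rank of each positive sample,
    with samples sorted by descending score. Returns None if the class
    has no positives.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def macro_map(label_matrix, score_matrix):
    """Unweighted mean of per-class AP; classes without positives are skipped."""
    n_classes = len(label_matrix[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision(
            [row[c] for row in label_matrix],
            [row[c] for row in score_matrix],
        )
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


# Hypothetical data: 4 images, 2 classes (rows are images, columns are classes).
labels = [[1, 0], [0, 1], [1, 0], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.4], [0.1, 0.3]]
print(macro_map(labels, scores))
```

In this toy case the rare second class, whose single positive is ranked last, pulls the macro average down as much as the well-ranked first class pulls it up.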

2512.09730 2026-06-02 cs.CL cs.LG

Interpreto: An Explainability Library for Transformers

Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Nicholas Asher, Céline Hudelot, Fanny Jourdan

Affiliations: IRT Saint Exupéry Toulouse; IRIT Toulouse; Khoury College of Computer Sciences; Ampere; MICS, CentraleSupélec; Scienta Lab; Thales Avionics; ANITI

AI summary: Interpreto is an open-source Python library that provides a unified explanation workflow for HuggingFace language models, from early BERT variants to LLMs, through attribution methods and concept-based explanations. Its end-to-end concept-based pipeline is the main novelty.

Comments Accepted to ACL 2026 System Demonstration. Equal contribution: Poché and Jourdan

2602.07883 2026-06-02 cs.AI

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

Jingqi Zhou, Sheng Wang, Dezhao Deng, Junwen Lu, Junwei Su, Qintong Li, Jiahui Gao, Hao Wu, Jiyue Jiang, Lingpeng Kong, Dunhong Jin, Chuan Wu

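Affiliations: The University of Hong Kong; The Chinese University of Hong Kong

AI summary: Proposes the ToolSelf framework, which abstracts configuration updates as a standardized tool interface to unify task execution and self-reconfiguration, and uses Configuration-Aware Two-stage Training (CAT) to achieve emergent adaptivity, gaining 28.8 points on average over the static-configuration baseline across diverse benchmarks.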

Abstract

LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Such rigidity forces a trade-off between domain-specific performance and cross-task generalization: strong priors and compact tool spaces aid specialization but weaken transfer, while task-agnostic workflows and broad action spaces expand coverage but dilute guidance. Existing pre-execution optimization, planner-worker orchestration, and configuration patching fall short of resolving this tension, as they decouple adaptation from execution, causing information loss, fragmented optimization, and ambiguous credit assignment. We propose ToolSelf, a tool-driven runtime self-reconfiguration paradigm that abstracts configuration updates as a standardized tool interface and unifies execution and adaptation within one policy's action space. The execution agent can dynamically update sub-goals, strategies, toolboxes, context, and context-management modes based on task progress and feedback. We further introduce Configuration-Aware Two-stage Training (CAT), which combines rejection sampling fine-tuning with trajectory-level KTO reinforcement learning to internalize self-reconfiguration. Across diverse benchmarks, zero-shot ToolSelf rivals task-specialized agents; after CAT training, ToolSelf gains 28.8 points over the static-configuration baseline on average, illuminating a path toward emergent adaptivity that obviates manually injected guidance.
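
The central idea above, reconfiguration exposed as just another tool in the agent's action space, can be pictured with a toy dispatcher. Everything in this sketch is hypothetical (the `AgentConfig` fields, the `update_config` and `search` tools, and the `step` function); it is not ToolSelf's actual interface, which the abstract does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Hypothetical runtime configuration the agent may rewrite."""
    sub_goal: str = ""
    strategy: str = ""
    toolbox: set = field(default_factory=lambda: {"search"})


def search(config, query):
    """Stand-in for an ordinary task tool."""
    return f"results for {query!r}"


def update_config(config, **changes):
    """Reconfiguration exposed as an ordinary tool call."""
    for key, value in changes.items():
        if not hasattr(config, key):
            raise KeyError(key)
        setattr(config, key, value)
    return f"config updated: {sorted(changes)}"


TOOLS = {"search": search, "update_config": update_config}


def step(config, action):
    """Dispatch one (tool_name, kwargs) action chosen by the policy."""
    name, kwargs = action
    # Task tools must be enabled in the current toolbox; the
    # reconfiguration tool is always available.
    if name != "update_config" and name not in config.toolbox:
        return f"tool {name!r} not enabled"
    return TOOLS[name](config, **kwargs)


cfg = AgentConfig()
print(step(cfg, ("update_config", {"sub_goal": "find the primary source"})))
print(step(cfg, ("search", {"query": "primary source"})))
```

Because both kinds of action pass through the same `step` function, a single policy can interleave task execution and self-reconfiguration, which is the unification the abstract describes.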

2602.18195 2026-06-02 cs.LG cs.AI

LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification

Yicheng Feng, Hairong Chen, Ziyu Jia, Samir Bhatt, Hengguan Huang

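Affiliations: University of California, Berkeley; Stanford University; University of Washington; University of California, San Diego; University of California, Los Angeles

AI summary: Proposes LERD, an end-to-end Bayesian latent event-relational dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. It outperforms baselines on Alzheimer's disease classification and yields physiology-aligned summaries of the dynamics.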

Abstract

Alzheimer's disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-based diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the latent event timing and cross-channel coordination behind their decisions. To address these limitations, we propose LERD, an end-to-end Bayesian latent event-relational dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable IVP-based KL regularizer and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned rate, timing, and graph summaries that help characterize group-level dynamical differences.

2602.18008 2026-06-02 cs.LG cs.AI cs.CL

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

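Affiliations: University of Virginia

AI summary: Introduces the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates the ability of LLMs to construct neural-integrated mechanistic models across three scientific domains, and proposes NIMMGen, a tree-guided agentic framework that significantly improves search stability and solution quality through branch-level search and atomic model refinement.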

Comments 25 pages, 8 figures

Abstract

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus on simplified settings and fail to capture the complexity of real-world scientific modeling. In practice, such modeling often involves neural-integrated formulations, where a mechanistic model component and a neural network component are jointly constructed, leading to a significantly more complex search space. Motivated by this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates LLM-generated neural-integrated mechanistic models across three scientific domains. Experiments on NIMM reveal that existing LLM-based approaches struggle to effectively explore this complex space, resulting in limited search stability and solution quality. To address this challenge, we propose NIMMGen, a tree-guided agentic framework that enables diversified exploration via branch-level search and improves solutions through atomic model refinement. Extensive experiments demonstrate that NIMMGen achieves state-of-the-art performance on NIMM, significantly improving search stability and solution quality.

2602.17737 2026-06-02 cs.RO cs.LG cs.MA

NestRL: A Nested Training Regime for Mutual Adaptation in Human-AI Teaming

Upasana Biswas, Durgesh Kalwar, Subbarao Kambhampati, Sarath Sreedharan

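Affiliations: School of Computing and AI, Arizona State University; Department of Computer Science, Colorado State University

AI summary: To address mutual adaptation in human-AI teaming, proposes NestRL, a nested training regime that trains agents at each level against adaptive agents from the level below, avoiding opaque coordination strategies and achieving higher task performance and adaptability in the Overcooked domain.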

Abstract

Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to an AI agent's behavior. Existing approaches attempt to approximate human behavior by diversifying training partners; however, these partners are typically static and fail to capture the adaptive nature of human teammates. When agents are trained jointly in standard multi-agent settings, they often converge to opaque coordination strategies that work only with their co-trained partners, leading to poor generalization. To model adaptive human behavior, we formulate human-AI teaming as an Interactive Partially Observable Markov Decision Process (I-POMDP). We propose NestRL, a nested training regime that learns the solution to a finite-level I-POMDP by training agents at each level against adaptive agents from the level below. This exposes agents to adaptive behavior while preventing emergence of opaque coordination strategies. We provide theoretical analysis showing that NestRL agents avoid convergence to partner-specific strategies, and validate this empirically in the Overcooked domain against state-of-the-art baselines. NestRL achieves higher task performance with both unseen adaptive agents and real human teammates, while exhibiting significantly greater adaptability over the course of interaction.

2602.17706 2026-06-02 cs.LG

Parallel Complex Diffusion for Scalable Time Series Generation

Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Zhiqiang Ge, Qingsong Wen, Yong Liu

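Affiliations: Institute of Cyber-Systems and Control; Zhejiang University; Griffith University; School of Mathematics; Southeast University; Squirrel Ai Learning

AI summary: Proposes PaCoDi (Parallel Complex Diffusion), a framework that uses the discrete Fourier transform to decompose time series into the spectral domain and replaces the complex-valued estimator with parallel real-valued estimators, mitigating the entanglement problem in time series generation. It proves the statistical orthogonality of spectral Gaussian noise, introduces a Mean Field Theory approximation to handle marginal coupling, and outperforms 5 baselines on unconditional and conditional generation tasks.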

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26). Extended Version with Full Proofs

DOI: 10.1145/3770855.3817791

Abstract

Diffusion models learn data distributions indirectly through denoising, making the difficulty of generative modeling closely tied to the dependency structure of data. For time series, strong temporal dependence forces the noise/score estimator to recover highly entangled cross-time relationships, leading to the curse of entanglement. We mitigate this burden by changing the topology of the diffusion space: the Discrete Fourier Transform (DFT) decomposes temporal dependencies into spectral modes, diagonalizing second-order dependency structure and better aligning the data manifold with isotropic Gaussian noise and homogeneous diffusion dynamics. However, existing frequency-aware diffusion methods mainly use the DFT to design estimator blocks under temporal DDPM/SDE frameworks, while frequency-native diffusion paths face a mathematical barrier from complex-valued dynamics. We propose PaCoDi (Parallel Complex Diffusion), a frequency-native diffusion framework that constructs the diffusion path in the spectral domain while replacing the complex-valued estimator with parallel real-valued estimators for real and imaginary components. Theoretically, we prove the statistical orthogonality of spectral Gaussian noise, establish quadrature forward transitions and conditional reverse factorization, and extend discrete PaCoDi to continuous-time spectral SDEs through a Spectral Wiener Process. We further introduce a Mean Field Theory approximation with an Interactive Correction Branch to handle marginal coupling, and exploit Hermitian symmetry to reduce attention FLOPs by 50% without information loss. Extensive experiments on unconditional and conditional time series generation demonstrate superior generative quality and computational efficiency against 5 SOTA baselines on 5 benchmarks. Code is available at https://github.com/RongyaoCai/PaCoDi.
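
The Hermitian symmetry mentioned above is a general property of the DFT of any real-valued signal: bin $n-k$ is the complex conjugate of bin $k$, so only bins $0$ to $\lfloor n/2 \rfloor$ carry independent information. The sketch below demonstrates that property with a naive pure-Python DFT. The function names and sample signal are illustrative, and how PaCoDi exploits the symmetry inside attention is not described in the abstract.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def rebuild_from_half(half, n):
    """Recover a real length-n signal from bins 0..n//2 only.

    The missing bins follow from Hermitian symmetry: X[n-k] = conj(X[k]).
    """
    full = list(half) + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return [
        sum(full[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


signal = [0.5, 1.0, -0.25, 2.0, 0.0, -1.5, 0.75, 0.3]  # arbitrary real samples
n = len(signal)
spectrum = dft(signal)
half = spectrum[: n // 2 + 1]           # 5 of the 8 bins
recovered = rebuild_from_half(half, n)  # matches `signal` up to rounding error
```

Keeping roughly half of the frequency bins therefore loses nothing for real inputs, which is the basis for the abstract's "without information loss" claim.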

2602.16763 2026-06-02 cs.AI

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

Affiliations: University of California, Berkeley; University of Toronto; University of Washington; University of Illinois at Urbana-Champaign; University of Michigan; University of Texas at Austin

AI summary: Defines and analyzes saturation across 60 language model benchmarks, finding that nearly half are saturated and that expert curation, rather than whether test data is public, affects resistance to saturation. Offers design recommendations for extending benchmark lifetimes.

Comments Accepted at ICML 2026

2602.16745 2026-06-02 cs.LG cs.AI

PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

Zhangyi Liu, Huaizhi Qu, Xiaowei Yin, He Sun, Yanjun Han, Tianlong Chen, Zhun Deng

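Affiliations: Stanford University; UNC at Chapel Hill; Yale University; New York University

AI summary: Proposes the PETS framework, which casts trajectory allocation as an optimization problem and introduces the self-consistency rate as a measure, achieving sample-efficient test-time self-consistency in both offline (linked to crowdsourcing theory) and online streaming settings while substantially reducing the sampling budget.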

Abstract

Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-Time Self-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.
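
The self-consistency rate defined above (agreement with the infinite-budget majority vote) has a closed form in a toy setting, which helps show why non-uniform allocation can pay off. The sketch below assumes two candidate answers per question, i.i.d. samples, an odd number of samples (so no ties), and a known probability `p_major` of sampling the majority answer. These are simplifying assumptions for illustration, and the function is not the PETS allocation algorithm.

```python
from math import comb


def binary_self_consistency(p_major, n):
    """P(majority vote over n samples equals the infinite-budget majority).

    Assumes two answers, i.i.d. samples, p_major > 0.5, and odd n.
    """
    need = n // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n, k) * p_major**k * (1 - p_major) ** (n - k)
        for k in range(need, n + 1)
    )


# Hypothetical pair of questions sharing a total budget of 6 samples.
easy, hard = 0.99, 0.60
uniform = (binary_self_consistency(easy, 3) + binary_self_consistency(hard, 3)) / 2
skewed = (binary_self_consistency(easy, 1) + binary_self_consistency(hard, 5)) / 2
print(f"uniform (3, 3): {uniform:.4f}   skewed (1, 5): {skewed:.4f}")
```

Under these assumptions, moving samples from the easy question to the hard one raises the average rate, though the direction can flip for other difficulty pairs (for example 0.9 and 0.6), which is why allocation has to adapt to question difficulty rather than follow a fixed rule.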
