arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.21407 2026-06-08 cs.LG stat.CO stat.ML 版本更新

Even More Guarantees for Variational Inference in the Presence of Symmetries

变分推断在对称性存在下的更多保证

Lena Zellinger, Antonio Vergari

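AI summary: This paper extends the robustness theory of variational inference under target symmetries. It shows that, under sufficient conditions, the forward KL divergence and α-divergences recover the target mean and correlation matrix exactly even when the model is misspecified, and it drops the log-concavity assumption so that the guarantees also cover multimodal targets.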

详情
AI中文摘要

当通过变分推断(VI)近似一个难以处理的密度时,变分族通常被选为一个简单的参数族,很可能不包含目标。这引发了一个问题:在模型误设的情况下,我们能在什么条件下恢复目标的特征?在这项工作中,我们在两个重要方面扩展了先前关于位置-尺度族在目标对称性下鲁棒VI的理论结果:(1)我们通过提供使用前向Kullback-Leibler散度和α-散度时精确恢复目标均值和相关矩阵的充分条件,将它们开放给更广泛的散度。(2)通过这样做,我们发现可以放弃先前工作中做出的对数凹目标的限制性假设,从而允许我们为更广泛的目标(包括多模态目标)提供保证。在我们的实验中,我们展示了我们的保证如何作为选择变分族和α值的指南,并通过一组多样化的例子说明了在缺乏我们的充分条件时优化如何以及为何会失败。

英文摘要

When approximating an intractable density via variational inference (VI), the variational family is typically chosen as a simple parametric family that very likely does not contain the target. This raises the question: under which conditions can we recover characteristics of the target despite misspecification? In this work, we extend previous theoretical results on robust VI with location-scale families under target symmetries in two substantial ways: (1) We open them up to a wider range of divergences by providing sufficient conditions for exact recovery of the target mean and correlation matrix when using the forward Kullback-Leibler divergence and $\alpha$-divergences. (2) By doing so, we find that we can drop the restrictive assumption of a log-concave target made in previous work, allowing us to give guarantees for a wider range of targets, including multi-modal ones. In our experiments, we show how our guarantees can serve as guidelines for the choice of the variational family and $\alpha$-value, and we illustrate on a diverse set of examples how and why optimization can fail in the absence of our sufficient conditions.
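
As a hedged illustration of the simplest case behind this kind of guarantee (a standard textbook fact, not the paper's general theorem): when the variational family is the full Gaussian family $q_{m,\Sigma} = \mathcal{N}(m, \Sigma)$ on $\mathbb{R}^d$ and the target $p$ has finite second moments, minimizing the forward KL divergence reduces to moment matching, so the target mean is recovered exactly whether or not $p$ is Gaussian. The symbols $m$, $\Sigma$, and $d$ are notation introduced here for the sketch.

```latex
\begin{align*}
\mathrm{KL}(p \,\|\, q_{m,\Sigma})
  &= \mathbb{E}_p[\log p(x)] - \mathbb{E}_p[\log q_{m,\Sigma}(x)] \\
  &= \mathbb{E}_p[\log p(x)] + \tfrac{d}{2}\log(2\pi) + \tfrac{1}{2}\log\det\Sigma
     + \tfrac{1}{2}\,\mathbb{E}_p\!\left[(x-m)^\top \Sigma^{-1} (x-m)\right], \\
\nabla_m \mathrm{KL}(p \,\|\, q_{m,\Sigma})
  &= -\Sigma^{-1}\left(\mathbb{E}_p[x] - m\right) = 0
  \;\Longrightarrow\; m^\star = \mathbb{E}_p[x], \\
\nabla_{\Sigma^{-1}} \mathrm{KL}(p \,\|\, q_{m^\star,\Sigma})
  &= -\tfrac{1}{2}\Sigma + \tfrac{1}{2}\,\mathrm{Cov}_p[x] = 0
  \;\Longrightarrow\; \Sigma^\star = \mathrm{Cov}_p[x].
\end{align*}
```

According to the abstract, the paper's contribution is to give sufficient conditions under which exact recovery of the mean and correlation matrix also holds for other location-scale families and for $\alpha$-divergences when the target is symmetric.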

2604.20123 2026-06-08 cs.CV 版本更新

Topology-Aware Skeleton Detection via Lighthouse-Guided Structured Inference

拓扑感知的骨架检测:基于灯塔引导的结构化推理

Daoyong Fu, Xiang Zhang, Zhaohuan Zhan, Fan Yang, Ke Yang

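AI summary: Proposes Lighthouse-Skel, which uses two collaborating branches to detect a skeleton confidence field and structural anchor points, and a lighthouse-guided strategy to reconnect discontinuous skeletons, improving skeleton continuity and structural integrity.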

详情
Comments
This submission is withdrawn by the authors because we identified substantive issues in the current version that may affect the reliability and interpretation of the results. We are conducting a thorough revision and validation before making the work publicly available again.
AI中文摘要

在自然图像中,物体骨架用于表示几何形状。然而,姿态或运动的轻微变化可能导致骨架结构的显著变化,增加骨架检测的难度,并常常导致不连续的骨架。现有方法主要关注点级骨架点检测,忽视了结构连续性在恢复完整骨架中的重要性。为解决此问题,我们提出Lighthouse-Skel,一种通过灯塔引导的结构化推理实现拓扑感知的骨架检测方法。具体来说,我们引入了一个双分支协作检测框架,联合学习骨架置信场和结构锚点(包括端点和连接点)。点分支学习的空间分布引导网络关注拓扑脆弱区域,从而提高骨架检测的准确性。基于学习的骨架置信场,我们进一步提出灯塔引导的拓扑补全策略,该策略将检测到的连接点和断点作为灯塔,沿低成本路径重连不连续的骨架段,从而改善骨架连续性和结构完整性。在四个公开数据集上的实验结果表明,所提方法在实现竞争性检测精度的同时,显著提升了骨架的连通性和结构完整性。

英文摘要

In natural images, object skeletons are used to represent geometric shapes. However, even slight variations in pose or movement can cause noticeable changes in skeleton structure, increasing the difficulty of detecting the skeleton and often resulting in discontinuous skeletons. Existing methods primarily focus on point-level skeleton detection and overlook the importance of structural continuity in recovering complete skeletons. To address this issue, we propose Lighthouse-Skel, a topology-aware skeleton detection method via lighthouse-guided structured inference. Specifically, we introduce a dual-branch collaborative detection framework that jointly learns a skeleton confidence field and structural anchors, including endpoints and junction points. The spatial distributions learned by the point branch guide the network to focus on topologically vulnerable regions, which improves the accuracy of skeleton detection. Based on the learned skeleton confidence field, we further propose a lighthouse-guided topology completion strategy, which uses detected junction points and breakpoints as lighthouses to reconnect discontinuous skeleton segments along low-cost paths, thereby improving skeleton continuity and structural integrity. Experimental results on four public datasets demonstrate that the proposed method achieves competitive detection accuracy while substantially improving skeleton connectivity and structural integrity.

2601.04791 2026-06-08 cs.CV cs.LG 版本更新

Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers

用于稳定潜在扩散逆问题求解器的测量一致朗之万校正器

Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh

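AI summary: To address the instability of latent-diffusion inverse problem solvers, proposes the Measurement-Consistent Langevin Corrector (MCLC), which uses measurement-consistent Langevin updates to narrow the gap between the solver dynamics and stable reverse diffusion, giving stable and reliable solving in latent space.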

详情
Comments
ICML 2026
AI中文摘要

尽管潜在扩散模型(LDM)已成为逆问题的强大先验,但现有的基于LDM的求解器经常遭受不稳定性。在这项工作中,我们首先将不稳定性识别为求解器动力学与扩散模型学习的稳定反向扩散动力学之间的差异,并表明减少这种差距可以稳定求解器。基于此,我们引入了测量一致朗之万校正器(MCLC),这是一个理论上有依据的即插即用稳定模块,通过测量一致的朗之万更新来修复基于LDM的逆问题求解器。与先前依赖线性流形假设(通常在潜在空间中不成立)的方法相比,MCLC提供了一种原则性的稳定机制,从而在潜在空间中实现更稳定和可靠的行为。

英文摘要

While latent diffusion models (LDMs) have emerged as powerful priors for inverse problems, existing LDM-based solvers frequently suffer from instability. In this work, we first identify the instability as a discrepancy between the solver dynamics and stable reverse diffusion dynamics learned by the diffusion model, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play stabilization module that remedies the LDM-based inverse problem solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often fail to hold in latent space, MCLC provides a principled stabilization mechanism, leading to more stable and reliable behavior in latent space.

2507.06419 2026-06-08 cs.CL 版本更新

Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling

教会奖励模型自我修正:奖励引导的对抗性失败发现以实现鲁棒奖励建模

Pankayaraj Pathmanathan, Furong Huang

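AI summary: Proposes the REFORM framework, which automatically discovers reward model failure modes through reward-guided controlled decoding and uses the generated adversarial examples for self-improvement, increasing robustness without sacrificing reward quality.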

详情
Journal ref
ACL 2026 Main Conference [Oral]
AI中文摘要

奖励建模(RM)通过捕捉人类偏好来对齐大型语言模型(LLM),越来越多地用于模型微调、响应过滤和排序等任务。然而,由于人类偏好的固有复杂性和可用数据集的有限覆盖,奖励模型在分布偏移或对抗性扰动下经常失败。现有的识别此类失败模式的方法通常依赖于关于偏好分布或失败属性的先验知识,限制了它们在现实场景中的实用性,因为此类信息不可用。在这项工作中,我们提出了一种可处理的、与偏好分布无关的方法,通过奖励引导的受控解码来发现奖励模型的失败模式。在此基础上,我们引入了REFORM,一个自我改进的奖励建模框架,通过使用奖励模型本身来指导生成错误评分的响应,从而增强鲁棒性。这些对抗性示例随后用于扩充训练数据并修补奖励模型的失调行为。我们在两个广泛使用的偏好数据集Anthropic Helpful Harmless (HH)和PKU Beavertails上评估了REFORM,并证明它在不牺牲奖励质量的情况下显著提高了鲁棒性。值得注意的是,REFORM在直接评估和下游策略训练中均保持了性能,并通过去除虚假相关性进一步提高了对齐质量。

英文摘要

Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

2602.09580 2026-06-08 cs.RO cs.LG 版本更新

SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

SERNF: 通过动作块评论家和归一化流实现样本高效的真实世界灵巧策略微调

Chenyu Yang, Denis Tarasov, Davide Liconti, Romain Guntz, Hehui Zheng, Robert K. Katzschmann

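AI summary: Proposes the SERNF framework, which combines a normalizing flow policy with an action-chunked critic for sample-efficient fine-tuning of real-world dexterous manipulation policies, addressing multimodal action distributions and credit assignment.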

详情
Comments
https://srl-ethz.github.io/SERNF/
AI中文摘要

由于有限的真实世界交互预算和高度多模态的动作分布,真实世界中灵巧操作策略的微调仍然具有挑战性。基于扩散的策略虽然表达能力强,但在微调过程中不允许进行保守的基于似然的更新,因为动作概率难以处理。相比之下,传统的高斯策略在多模态下会崩溃,特别是当动作以块形式执行时,而标准的逐步骤评论家无法与块执行对齐,导致信用分配不佳。我们提出了SERNF,一个具有归一化流(NF)的样本高效离策略微调框架,以应对这些挑战。归一化流策略为多模态动作块提供精确的似然,通过似然正则化实现保守、稳定的策略更新,从而提高样本效率。动作块评论家评估整个动作序列,使价值估计与策略的时间结构对齐,并改善长时域信用分配。据我们所知,这是首次在真实机器人硬件上展示基于似然的多模态生成策略与块级价值学习相结合。我们在真实世界的两个具有挑战性的灵巧操作任务上评估了SERNF:从盒子中取出剪刀并剪断胶带,以及手掌朝下抓握时进行手中立方体旋转,两者都需要在长时域内进行精确、灵巧的控制。在这些任务上,SERNF实现了稳定、样本高效的适应,而标准方法则难以应对。

英文摘要

Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SERNF, a sample-efficient off-policy fine-tuning framework with normalizing flows (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SERNF on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp, both of which require precise, dexterous control over long horizons. On these tasks, SERNF achieves stable, sample-efficient adaptation where standard methods struggle.

2603.19146 2026-06-08 cs.AI cs.LG 版本更新

D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

D5P4:用于并行离散扩散解码中多样性的分区行列式点过程

Jonathan Lys, Vincent Gripon, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene, Bastien Pasdeloup

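AI summary: Proposes D5P4, a beam-style decoding method that uses a partitioned determinantal point process to select intermediate sequences in discrete diffusion models, balancing quality and diversity without external verifiers.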

详情
AI中文摘要

离散扩散模型是自回归方法在文本生成中的有前途的替代方案,但其解码方法仍研究不足。标准的自回归搜索过程(如波束搜索)不直接适用于迭代去噪,其中假设是完整的中间序列而非从左到右的前缀。此外,现有的扩散解码过程对保留假设的多样性和覆盖范围的控制有限。在这项工作中,我们引入了D5P4,一种针对离散扩散模型定制的波束式解码方法,它将中间波束选择视为分区行列式点过程下的MAP推理。这产生了一个模型内部的批次目标,无需外部验证器即可平衡质量和多样性。在开放域生成、问答和数学推理上的实验表明,D5P4提高了多样性和pass@$k$覆盖率,同时匹配或超越了基线质量和保真度。

英文摘要

Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard autoregressive search procedures, such as beam search, do not directly apply to iterative denoising, where hypotheses are complete intermediate sequences rather than left-to-right prefixes. Furthermore, existing diffusion decoding procedures only provide limited control over the diversity and coverage of retained hypotheses. In this work, we introduce D5P4, a beam-style decoding method tailored to discrete diffusion models, which casts intermediate beam selection as MAP inference under a partitioned Determinantal Point Process. This yields a model-internal batch objective that balances quality and diversity without external verifiers. Experiments on open-ended generation, question answering, and mathematical reasoning show that D5P4 improves diversity and pass@$k$ coverage while matching or surpassing baseline quality and fidelity.

2603.09403 2026-06-08 cs.CL 版本更新

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

LLM作为元评判者:用于NLP评估指标验证的合成数据

Lukáš Eigler, Jindřich Libovický, David Hurych

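AI summary: Proposes the LLM as a Meta-Judge framework, which generates synthetic data through controlled semantic degradation as a replacement for human judgments when validating NLG evaluation metrics, reaching meta-correlations above 0.9 in multilingual question answering.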

详情
Comments
16 pages, 1 figure, 14 tables
AI中文摘要

验证NLG的评估指标通常依赖于昂贵且耗时的人工标注,而这些标注主要仅存在于英语数据集。我们提出LLM作为元评判者,这是一个可扩展的框架,利用LLM通过控制真实数据的语义退化生成合成评估数据集,取代人工判断。我们使用元相关性来验证我们的方法,衡量从合成数据得出的指标排名与标准人工基准之间的对齐程度。在机器翻译、问答和摘要上的实验表明,合成验证可作为人工判断的可靠代理,在多语言问答中实现超过0.9的元相关性,并证明在人工判断不可用或过于昂贵的情况下是一种可行的替代方案。我们的代码和数据已在 https://github.com/eiglerl/meta-judge 公开。

英文摘要

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proving to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data are publicly available at https://github.com/eiglerl/meta-judge.

2602.08857 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

通过将Transformer反编译为RASP发现可解释算法

Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

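AI summary: Proposes a method that faithfully re-parameterizes a trained Transformer as a RASP program and uses causal interventions to find a small sufficient sub-program; experiments show that length-generalizing Transformers internally implement simple, interpretable RASP programs.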

详情
Comments
104 pages, 92 figures. Accepted for publication at ICML 2026
AI中文摘要

近期研究表明,Transformer的计算可以在RASP编程语言家族中模拟。这些发现增进了对Transformer表达能力和泛化能力的理解。特别是,Transformer被建议在具有简单RASP程序的问题上精确实现长度泛化。然而,训练模型是否实际实现了简单的可解释程序仍是一个开放问题。在本文中,我们提出了一种从训练好的Transformer中提取此类程序的通用方法。其思想是将Transformer忠实地重参数化为RASP程序,然后应用因果干预来发现一个小的充分子程序。在算法和形式语言任务上训练的小型Transformer实验中,我们表明我们的方法通常能从长度泛化的Transformer中恢复简单且可解释的RASP程序。我们的结果提供了迄今为止最直接的证据,证明Transformer内部实现了简单的RASP程序。

英文摘要

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing Transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

2602.03160 2026-06-08 cs.AI cs.CL 版本更新

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

VALUEFLOW:迈向大语言模型中多元化和可引导的基于价值的对齐

Woojin Kim, Sieun Hyeon, Jusang Oh, Jaeyoung Do

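AI summary: Proposes the VALUEFLOW framework, which uses a hierarchical value embedding, an intensity-annotated database, and an anchor-based evaluator to achieve controllable alignment of large language models over value intensity, addressing gaps in existing methods for extraction, evaluation, and steering.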

详情
Comments
Accepted in ICML 2026 (Oral). Code available at https://github.com/AIDASLab/VALUEFLOW
AI中文摘要

将大语言模型(LLMs)与人类价值的多元光谱对齐仍然是一个核心挑战:基于偏好的方法通常无法捕捉更深层次的动机原则。基于价值的方法提供了更原则性的路径,但仍存在三个差距:提取常常忽略层次结构,评估检测存在但未校准强度,并且LLMs在受控强度下的可引导性仍未得到充分理解。为解决这些限制,我们引入了VALUEFLOW,这是第一个统一框架,涵盖提取、评估和引导,并具有校准的强度控制。该框架整合了三个组件:(i) HIVES,一个层次化价值嵌入空间,捕捉理论和跨理论的价值结构;(ii) 价值强度数据库(VIDB),一个大规模资源,包含基于排序聚合得出的强度估计的价值标注文本;(iii) 一个基于锚点的评估器,通过将模型输出与VIDB面板进行排序,产生一致的强度分数。使用VALUEFLOW,我们在十个模型和四个价值理论上进行了全面的大规模研究,识别了可引导性的不对称性和多价值控制的组合规律。本文建立了一个可扩展的基础设施,用于评估和控制价值强度,推进了LLMs的多元化对齐。

英文摘要

Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

2601.23207 2026-06-08 cs.LG cs.AI 版本更新

Learning to Execute Graph Algorithms Exactly with Graph Neural Networks

学习用图神经网络精确执行图算法

Muhammad Fetrat Qharabagh, Artur Back de Luca, George Giapitzakis, Kimon Fountoulakis

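AI summary: Proves that, under bounded-degree and finite-precision constraints, a graph neural network can learn local instructions by training an ensemble of multi-layer perceptrons and then execute complete graph algorithms without error at inference time, and establishes learnability for the LOCAL model of distributed computation and for several classical algorithms.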

详情
AI中文摘要

理解图神经网络能学习什么,特别是它们学习执行算法的能力,仍然是一个核心的理论挑战。在这项工作中,我们证明了在有限度和有限精度约束下图算法的精确可学习性结果。我们的方法遵循两步过程。首先,我们训练一个多层感知机(MLP)集成来执行单个节点的局部指令。其次,在推理过程中,我们使用训练好的MLP集成作为图神经网络(GNN)中的更新函数。利用神经正切核(NTK)理论,我们表明局部指令可以从一个小训练集中学习,从而使得完整的图算法在推理过程中能够以高概率无误差地执行。为了说明我们设置的学习能力,我们为分布式计算的LOCAL模型建立了一个严格的可学习性结果。我们进一步展示了广泛研究的算法(如消息洪泛、广度优先搜索、深度优先搜索和贝尔曼-福特算法)的积极可学习性结果。

英文摘要

Understanding what graph neural networks can learn, especially their ability to learn to execute algorithms, remains a central theoretical challenge. In this work, we prove exact learnability results for graph algorithms under bounded-degree and finite-precision constraints. Our approach follows a two-step process. First, we train an ensemble of multi-layer perceptrons (MLPs) to execute the local instructions of a single node. Second, during inference, we use the trained MLP ensemble as the update function within a graph neural network (GNN). Leveraging Neural Tangent Kernel (NTK) theory, we show that local instructions can be learned from a small training set, enabling the complete graph algorithm to be executed during inference without error and with high probability. To illustrate the learning power of our setting, we establish a rigorous learnability result for the LOCAL model of distributed computation. We further demonstrate positive learnability results for widely studied algorithms such as message flooding, breadth-first and depth-first search, and Bellman-Ford.
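
To make the two-step idea concrete, here is a minimal sketch of Bellman-Ford written as synchronous message passing, where every node applies the same local instruction to its own state and its neighbors' messages. The hand-written `local_update` function stands in for the trained MLP ensemble described in the abstract; the function names, the edge-list format, and the state encoding are assumptions made for illustration, not the paper's construction.

```python
import math


def local_update(own_distance, neighbor_messages):
    """Local instruction of a single node (hand-written stand-in for a learned MLP).

    neighbor_messages is a list of (neighbor_distance, edge_weight) pairs.
    The node keeps the smallest distance it has seen so far.
    """
    candidates = [dist + weight for dist, weight in neighbor_messages]
    return min([own_distance] + candidates)


def run_message_passing(num_nodes, edges, source):
    """Run num_nodes - 1 synchronous rounds of the local update.

    edges is a list of directed (u, v, weight) triples meaning u -> v.
    Assumes there are no negative-weight cycles.
    """
    distance = [math.inf] * num_nodes
    distance[source] = 0.0

    # Group incoming edges per node so each node only reads its own neighborhood.
    incoming = {v: [] for v in range(num_nodes)}
    for u, v, weight in edges:
        incoming[v].append((u, weight))

    for _ in range(num_nodes - 1):
        # Every node reads the previous round's states, as in a GNN layer.
        distance = [
            local_update(distance[v], [(distance[u], w) for u, w in incoming[v]])
            for v in range(num_nodes)
        ]
    return distance
```

In this picture, the graph-level algorithm is correct as long as the per-node update is executed exactly in every round, which is why the abstract focuses on learning the local instruction without error.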

2601.10896 2026-06-08 cs.CL 版本更新

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer: 检测和缓解LLM对话性遵从的框架

Parisa Rabbani, Priyam Sahoo, Ruben Mathew, Aishee Mondal, Harshita Ketharaman, Nimet Beyza Bozdag, Dilek Hakkani-Tür

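AI summary: Proposes the DialDefer framework, which uses a Dialogic Deference Score to detect and mitigate the judgment shifts that question framing induces when LLMs evaluate dialogue; it finds that framing effects are large while accuracy stays stable, and that attributing a claim to a human versus an AI produces the largest shift.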

详情
Comments
10 pages main content, 7 figures, 35 pages total with appendix
AI中文摘要

LLM越来越多地被用作第三方评判者,但它们在评估对话中的说话者时的可靠性仍知之甚少。我们证明,LLM对相同主张的判断因框架而异:相同内容在作为陈述验证(“这个陈述正确吗?”)与归因于说话者(“这个说话者正确吗?”)时得到不同裁决。我们称此为对话性遵从,并引入DialDefer,一个用于检测和缓解这些框架诱导的判断偏移的框架。我们的对话性遵从分数(DDS)捕捉了聚合准确性所掩盖的方向性偏移。在十个领域、3000多个实例和五个模型上,对话框架诱导了大幅偏移(模型间平均|DDS|=15.9个百分点,p<0.0001),而准确性保持稳定(<2个百分点),在自然Reddit对话中效应放大2-5倍。这种效应是领域依赖的:单个模型可以在研究生级别的科学上转向不同意(怀疑),在社会判断上转向同意(遵从)。消融实验揭示,人类与LLM的归因导致最大偏移(17.7个百分点的摆动),表明模型认为与人类的分歧比与AI的分歧代价更高。缓解尝试可以减少遵从,但过度校正为怀疑,揭示了超出准确性优化的校准问题。

英文摘要

LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content receives different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across ten domains, 3k+ instances, and five models, conversational framing induces large shifts (mean |DDS| = 15.9 percentage points (pp) across models, p < 0.0001) while accuracy remains stable (<2 pp), with effects amplifying 2-5x on naturalistic Reddit conversations. This effect is domain-dependent: a single model can shift toward disagreement (skepticism) on graduate-level science and toward agreement (deference) on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7 pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts can reduce deference but over-correct into skepticism, revealing a calibration problem beyond accuracy optimization.

2510.26714 2026-06-08 cs.LG cs.AI 版本更新

On the importance of multiple training seeds for evaluating machine unlearning

关于多个训练种子在评估机器遗忘中的重要性

Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma

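AI summary: Argues that evaluating machine unlearning algorithms with only a single training seed can give non-representative results, confirms through experiments on image classification, federated learning-to-rank, and large language models that the problem is widespread, and gives guidance on choosing the number of training and unlearning seeds.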

详情
Comments
mini paper, 5 figures
AI中文摘要

机器遗忘旨在从训练好的模型中移除某些数据点的影响,而无需昂贵的重新训练。大多数实用的遗忘算法只是近似,其性能只能通过经验评估。常见做法是从同一个训练好的模型(即仅使用单个训练种子)开始,多次独立运行遗忘算法(即使用多个遗忘种子)。在图像分类实验中,这种做法可能给出不具代表性的结果,因为遗忘性能可能对训练种子的选择敏感。这对于确定性遗忘方法尤其相关,这些方法从同一个训练好的模型开始时总是产生相同的结果。在联邦学习排序和大语言模型上的进一步实验证实,这个问题不仅限于图像分类。我们还解释了为什么增加遗忘种子的数量通常无法弥补多个训练种子的缺失。最后,我们给出了如何选择训练和遗忘种子数量的指导。

英文摘要

Machine unlearning aims to remove the influence of certain data points from a trained model without costly retraining. Most practical unlearning algorithms are only approximate and their performance can only be assessed empirically. Common practice is to run unlearning algorithms multiple times independently (i.e., using multiple unlearning seeds) starting from the same trained model (i.e., using only a single training seed). In image-classification experiments, this practice can give non-representative results as unlearning performance can be sensitive to the choice of training seed. This is particularly relevant for deterministic unlearning methods which always produce the same result when started from the same trained model. Further experiments on federated learning-to-rank and large language models confirm that this issue extends beyond image classification. We also explain why increasing the number of unlearning seeds cannot generally compensate for the lack of multiple training seeds. Finally, we give guidance on how to select the number of training and unlearning seeds.
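
One standard way to see why extra unlearning seeds cannot replace training seeds is a variance decomposition, sketched below under an assumed additive model: each run's score is a shared mean plus a training-seed effect (variance `var_train`) plus an independent unlearning-seed effect (variance `var_unlearn`). This model and the function name are assumptions for illustration; the abstract does not state the paper's own argument.

```python
def variance_of_mean_score(var_train, var_unlearn, n_train_seeds, n_unlearn_seeds):
    """Variance of the average score over a balanced design.

    Assumes n_train_seeds independently trained models, each unlearned with
    n_unlearn_seeds independent unlearning seeds, under the additive model
    score = mean + training_effect + unlearning_effect.
    """
    if n_train_seeds < 1 or n_unlearn_seeds < 1:
        raise ValueError("seed counts must be positive")
    train_part = var_train / n_train_seeds
    unlearn_part = var_unlearn / (n_train_seeds * n_unlearn_seeds)
    return train_part + unlearn_part


if __name__ == "__main__":
    # With one training seed, the training-seed variance is a floor that
    # no number of unlearning seeds can push through.
    for n_unlearn in (1, 10, 1000):
        print(n_unlearn, variance_of_mean_score(4.0, 1.0, 1, n_unlearn))
    # Spreading the same budget over training seeds lowers both terms.
    print(variance_of_mean_score(4.0, 1.0, 10, 1))
```

Setting `var_unlearn` to zero corresponds to the deterministic unlearning methods mentioned in the abstract: every repeated run from the same trained model is identical, so all remaining uncertainty comes from the training seed.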

2505.21423 2026-06-08 cs.LG stat.ML 版本更新

Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization

稳定性边缘的冲突偏差:范数与锐度正则化

Maria Matveev, Vit Fojtik, Hung-Hsu Chou, Gitta Kutyniok, Johannes Maly

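AI summary: Studies the implicit regularization of gradient descent in overparameterized networks, showing that the learning rate interpolates between low norm and low sharpness and that no single bias suffices to explain generalization, so the dynamic trade-off between them must be considered.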

详情
Comments
Accepted at ICML 2026
AI中文摘要

过参数化网络显著的泛化性能通常归因于隐式偏差,例如小学习率下的范数最小化和稳定性边缘(Edge-of-Stability)状态下的低锐度。在这项工作中,我们认为全面理解梯度下降的泛化性能需要分析这些不同形式的隐式正则化之间的相互作用。我们通过实验证明,学习率在训练模型的低参数范数和低锐度之间插值。此外,我们证明对于在简单回归任务上训练的对角线性网络,单独的隐式偏差都不能最小化泛化误差。这些发现表明,仅关注单一隐式偏差不足以解释良好的泛化,并促使我们采用更广阔的隐式正则化视角,捕捉由不可忽略的学习率引起的范数与锐度之间的动态权衡。

英文摘要

The remarkable generalization properties of overparameterized networks are often attributed to implicit biases, such as norm minimization at small learning rates and low sharpness in the Edge-of-Stability regime. In this work, we argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We empirically demonstrate that the learning rate interpolates between low parameter norm and low sharpness of the trained model. We furthermore prove that neither implicit bias alone minimizes the generalization error for diagonal linear networks trained on a simple regression task. These findings demonstrate that focusing on a single implicit bias is insufficient to explain good generalization, and they motivate a broader view of implicit regularization that captures the dynamic trade-off between norm and sharpness induced by non-negligible learning rates.
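
A small sketch of why the learning rate and sharpness are linked, under the simplifying assumption of a one-dimensional quadratic loss (not the diagonal linear networks analyzed in the paper): gradient descent with step size `learning_rate` on f(x) = 0.5 * sharpness * x^2 multiplies x by (1 - learning_rate * sharpness) at every step, so it converges only when the sharpness is below 2 / learning_rate. The function names are hypothetical.

```python
def gradient_descent_on_quadratic(sharpness, learning_rate, x0=1.0, steps=200):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return the final |x|."""
    x = x0
    for _ in range(steps):
        gradient = sharpness * x  # derivative of 0.5 * sharpness * x**2
        x -= learning_rate * gradient
    return abs(x)


def stability_threshold(learning_rate):
    """Largest curvature at which the iteration above is still non-divergent."""
    return 2.0 / learning_rate


if __name__ == "__main__":
    sharpness = 10.0
    for learning_rate in (0.05, 0.19, 0.21):
        final = gradient_descent_on_quadratic(sharpness, learning_rate)
        print(learning_rate, stability_threshold(learning_rate), final)
```

Read the other way around, a larger learning rate can only settle in regions whose sharpness is below 2 / learning_rate, which is the low-sharpness bias that the abstract contrasts with norm minimization at small learning rates.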

2512.01362 2026-06-08 cs.LG 版本更新

Directed evolution algorithm drives neural prediction

定向进化算法驱动神经预测

Yanlin Wang, Nancy M Young, Patrick C M Wong

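AI summary: Proposes the directed evolution model (DEM), which mimics the trial-and-error process of biological directed evolution and combines a replay buffer with continual backpropagation to improve generalization in cross-domain neural prediction and address label scarcity.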

详情
Comments
43 pages, 5 figures
AI中文摘要

神经预测为预测神经认知功能和障碍的个体差异以及为个性化干预提供预后指标提供了一种有前景的方法。然而,由于领域偏移和标签稀缺的限制,将神经预测模型转化为医学人工智能应用具有挑战性。在此,我们提出定向进化模型(DEM),一种新颖的计算模型,模拟生物定向进化的试错过程,以逼近预测建模任务的最优解。我们证明了定向进化算法是一种有效的不确定性探索策略,能够增强强化学习中的泛化能力。此外,通过将回放缓冲和连续反向传播方法整合到DEM中,我们提供了在连续学习环境中实现利用与探索之间更好权衡的证据。我们在四个不同数据集上进行了实验,这些数据集涉及接受人工耳蜗植入的儿童,其口语发展结果在个体儿童水平上差异很大。术前神经MRI数据已被证明可以准确预测这些儿童术后结果,但在数据集之间不适用。我们的结果表明,DEM能够有效提高跨域植入前神经预测的性能,同时解决目标域中标签稀缺的挑战。

英文摘要

Neural prediction offers a promising approach to forecasting the individual variability of neurocognitive functions and disorders and providing prognostic indicators for personalized intervention. However, it is challenging to translate neural predictive models into medical artificial intelligence applications due to the limitations of domain shift and label scarcity. Here, we propose the directed evolution model (DEM), a novel computational model that mimics the trial-and-error processes of biological directed evolution to approximate optimal solutions for predictive modeling tasks. We demonstrated that the directed evolution algorithm is an effective strategy for uncertainty exploration, enhancing generalization in reinforcement learning. Furthermore, by incorporating replay buffer and continual backpropagation methods into DEM, we provide evidence of achieving a better trade-off between exploitation and exploration in continuous learning settings. We conducted experiments on four different datasets for children with cochlear implants whose spoken language developmental outcomes vary considerably on the individual-child level. Preoperative neural MRI data has been shown to accurately predict the post-operative outcome of these children within but not across datasets. Our results show that DEM can efficiently improve the performance of cross-domain pre-implantation neural predictions while addressing the challenge of label scarcity in the target domain.

2511.19359 2026-06-08 cs.LG 版本更新

Enhancing Conformal Prediction via Class Similarity

通过类别相似性增强保形预测

Ariel Fargion, Lahav Dabah, Tom Tirer

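AI summary: Proposes using class similarity to improve conformal prediction, either by penalizing out-of-group errors or by leveraging embedding information, to reduce prediction set size and improve semantic consistency.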

详情
Comments
ICML 2026 (camera-ready). Code is available at: https://github.com/ariel361/CP_via_CS
AI中文摘要

保形预测(CP)已成为高风险分类应用中一个强大的统计框架。CP 不是预测单个类别,而是生成一个预测集,保证以预先指定的概率包含真实标签。不同 CP 方法的性能通常通过其平均预测集大小来评估。在类别可以划分为语义组(例如需要类似治疗的疾病)的设置中,用户可以从不仅平均较小而且包含少量语义不同组的预测集中受益。本文首先解决这个问题,并最终提供一种广泛适用的工具,用于在任何数据集上提升任何 CP 方法。首先,给定一个类别划分,我们建议在 CP 评分函数中增加一个惩罚项,用于惩罚包含组外错误的预测。我们从理论上分析了这一策略,并证明了其在组相关指标上的优势。令人惊讶的是,我们从数学上表明,对于常见的类别划分,它还可以减少任何 CP 评分函数的平均集大小。我们的分析揭示了这种改进背后的类别相似性因素,并激发了一种变体,该变体可以通过利用模型的嵌入进一步减少预测集大小,而无需任何人工语义划分。最后,我们提出了一项广泛的实证研究,涵盖了著名的 CP 方法、多个模型和几个数据集,表明我们基于类别相似性的方法一致地增强了 CP 方法。

英文摘要

Conformal Prediction (CP) has emerged as a powerful statistical framework for high-stakes classification applications. Instead of predicting a single class, CP generates a prediction set, guaranteed to include the true label with a pre-specified probability. The performance of different CP methods is typically assessed by their average prediction set size. In setups where the classes can be partitioned into semantic groups, e.g., diseases that require similar treatment, users can benefit from prediction sets that are not only small on average, but also contain a small number of semantically different groups. This paper begins by addressing this problem and ultimately offers a widely applicable tool for boosting any CP method on any dataset. First, given a class partition, we propose augmenting the CP score function with a term that penalizes predictions with out-of-group errors. We theoretically analyze this strategy and prove its advantages for group-related metrics. Surprisingly, we show mathematically that, for common class partitions, it can also reduce the average set size of any CP score function. Our analysis reveals the class-similarity factors behind this improvement and motivates a variant that can further reduce prediction set size by leveraging the model's embeddings, without requiring any human semantic partition. Finally, we present an extensive empirical study, encompassing prominent CP methods, multiple models, and several datasets, which demonstrates that our class-similarity-based approach consistently enhances CP methods.
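
The following is a minimal sketch of split conformal prediction with a group penalty added to the score, to show where such a term would enter. The base score (one minus the class probability), the specific penalty (a constant added when a class lies outside the semantic group of the model's top prediction), and all function names are assumptions for illustration; the abstract does not specify the paper's exact penalty.

```python
import math


def penalized_score(probs, label, groups, penalty):
    """Nonconformity score of `label`: 1 - probability, plus a group penalty.

    groups[y] is the semantic group of class y. The penalty applies when
    `label` is outside the group of the model's top-predicted class.
    """
    top_class = max(range(len(probs)), key=lambda y: probs[y])
    out_of_group = groups[label] != groups[top_class]
    return (1.0 - probs[label]) + (penalty if out_of_group else 0.0)


def conformal_threshold(calibration_scores, alpha):
    """Standard split-conformal quantile for target coverage 1 - alpha.

    calibration_scores must be computed with the same score function,
    evaluated at the true labels of held-out calibration examples.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: include every class
    return sorted(calibration_scores)[rank - 1]


def prediction_set(probs, groups, penalty, threshold):
    """All classes whose penalized score does not exceed the threshold."""
    return [
        y for y in range(len(probs))
        if penalized_score(probs, y, groups, penalty) <= threshold
    ]
```

Because the penalized score is still a fixed function of the input and the candidate label, calibrating the threshold on scores of the true labels keeps the usual marginal coverage guarantee under exchangeability; the penalty only changes which classes tend to enter the set.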

2510.11014 2026-06-08 cs.RO cs.AI cs.CV 版本更新

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

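AI summary: For the scene structure that is hidden when a robot views a room through a doorway, proposes a pipeline built on pretrained models (VLM-guided outpainting, monocular depth estimation, semantic segmentation) that samples semantically labeled 3D point cloud priors of the hidden room, and evaluates these zero-shot spatio-semantic priors on MatterDoor, a Matterport3D-derived benchmark.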

详情
Comments
Under Review
AI中文摘要

自主机器人通常只能通过门缝部分观察房间,墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询,估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询,我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor,一个源自Matterport3D的门遮挡室内场景基准,并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明,无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.

2510.07315 2026-06-08 cs.CL cs.AI cs.LG cs.SE 版本更新

SWE-IF: Aligning Code Evaluation with Human Preference

SWE-IF: 使代码评估与人类偏好对齐

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

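AI summary: Proposes the SWE-IF benchmark, which evaluates code instruction following using VeriCode, a taxonomy of verifiable instructions; finds that instruction following is the key differentiator of LLM code quality and that combining it with functional correctness matches human preference best.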

详情
Comments
ICML 2026
AI中文摘要

大型语言模型(LLM)推动了vibe coding,用户通过自然语言交互利用LLM生成并迭代优化代码,直到通过其vibe检查。Vibe检查反映了人类偏好,超越了功能性:解决方案应感觉正确、阅读清晰、保留意图并保持正确。然而,当前的代码评估仍局限于pass@k,仅捕获功能正确性,忽略了用户常规应用的非功能性指令。在本文中,我们假设指令遵循是vibe检查中除功能正确性之外缺失的部分。为了用量化信号衡量模型的代码指令遵循能力,我们提出了VeriCode,一个包含30条可验证代码指令及其确定性验证器的分类法。我们使用该分类法增强现有评估套件,得到SWE-IF,一个评估指令遵循和功能正确性的测试平台。评估31个LLM,我们发现即使最强的模型也难以遵守多条指令,并表现出功能回归。最重要的是,功能正确性和指令遵循的复合得分与人类偏好相关性最强,其中指令遵循成为LLM之间的主要区分因素。我们的代码、数据和分类法可在https://github.com/maszhongming/SWE-IF获取。

English abstract

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
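
The pass@k metric mentioned above can be made concrete with a short sketch. The abstract does not say which estimator SWE-IF uses, so this assumes the widely used unbiased estimator: generate n samples per problem, count the c samples that pass all tests, and compute 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` is illustrative and not taken from the SWE-IF repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass all tests.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when n - c < k, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples for one problem, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))  # about 0.3
print(pass_at_k(n=10, c=3, k=5))
```

The score depends only on whether tests pass, which is why it cannot see non-functional instructions such as style or naming constraints.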

2509.25522 2026-06-08 cs.AI Updated version

Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

从模型扩展视角理解基于语义ID的生成式推荐

Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju

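AI summary: Shows that generative recommendation based on semantic IDs hits a performance bottleneck as models scale, and finds that using large language models directly as recommenders scales better, with improvements of up to 20%.

Details
Comments
Accepted by KDD 2026
AI Chinese abstract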

近期生成模型的进展催生了一种有前景的推荐系统范式,称为生成式推荐(GR),它试图统一丰富的物品语义和协同过滤信号。一种流行的现代方法是使用语义ID(SIDs),即从模态编码器(如大型语言或视觉模型)的嵌入中量化得到的离散编码,在自回归用户交互序列建模设置中表示物品(以下简称基于SID的GR)。虽然其他领域的生成模型展现出完善的缩放定律,我们的工作揭示了基于SID的GR在模型扩展时存在显著瓶颈。特别是,随着我们扩大每个组件(模态编码器、量化分词器和推荐系统本身),基于SID的GR的性能迅速饱和。在这项工作中,我们确定SID编码物品语义信息的有限能力是根本瓶颈之一。基于这一观察,作为获得具有更好缩放行为的GR模型的初步努力,我们重新审视了另一种直接使用大型语言模型(LLMs)作为推荐器的GR范式(以下简称LLM-as-RS)。我们的实验表明,LLM-as-RS范式具有优越的模型缩放属性,并通过缩放实现了比基于SID的GR最佳可达性能高达20%的提升。我们还挑战了普遍认为LLMs难以捕捉协同过滤信息的观点,表明它们建模用户-物品交互的能力随着LLMs的扩展而提升。我们对基于SID的GR和LLMs在44M到14B参数模型规模上的分析强调了基于SID的GR的内在缩放限制,并将LLM-as-RS定位为通往GR基础模型的有希望路径。

English abstract

Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
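
To illustrate what "discrete codes quantized from embeddings" can look like, here is a minimal sketch of residual quantization. The abstract does not name the quantizer, so residual quantization is an assumption here. The codebooks are hand-written toy values (real ones are learned), and `nearest_code` and `semantic_id` are hypothetical names.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean)."""
    def dist2(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def semantic_id(embedding, codebooks):
    """Quantize an embedding into one discrete code per codebook level."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        codes.append(index)
        # Subtract the chosen entry so the next level encodes what is left.
        residual = [r - e for r, e in zip(residual, codebook[index])]
    return tuple(codes)


# Toy, hand-written codebooks (hypothetical; real ones are learned).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],      # level 1: coarse position
    [[0.1, -0.1], [-0.1, 0.1]],    # level 2: refinement of the residual
]
item_embedding = [1.1, 0.9]
print(semantic_id(item_embedding, CODEBOOKS))
```

A sequence of L codes drawn from codebooks of size K can take at most K^L distinct values, which is one concrete sense in which a SID carries a bounded amount of information about an item.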

2508.02039 2026-06-08 cs.LG stat.ML Updated version

Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning

多源无数据监督迁移学习的模型回收框架

Sijia Wang, Ricardo Henao

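AI summary: Proposes a model recycling framework that, without access to source data, identifies subsets of related source models to enable parameter-efficient transfer learning in both white-box and black-box settings, supporting multi-source data-free supervised transfer learning.

Details
AI Chinese abstract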

对数据隐私的日益关注以及与检索源数据进行模型训练相关的其他困难,催生了无源迁移学习的需求,在这种学习中,只能访问预训练模型,而不能访问原始源域的数据。这种设置带来了许多挑战,因为许多现有的迁移学习方法通常依赖于对源数据的访问,这限制了它们直接应用于源数据不可用的场景。此外,实际问题使其更加困难,例如在没有源数据信息的情况下有效选择迁移模型,以及在没有完全访问源模型的情况下进行迁移。受此启发,我们提出了一个模型回收框架,用于参数高效的模型训练,该框架在白盒和黑盒设置中识别要重用的相关源模型的子集。因此,我们的框架使模型即服务(MaaS)提供商能够构建高效预训练模型的库,从而为多源无数据监督迁移学习创造了机会。

English abstract

Increasing concerns for data privacy and other difficulties associated with retrieving source data for model training have created the need for source-free transfer learning, in which one only has access to pre-trained models instead of data from the original source domains. This setting introduces many challenges, as many existing transfer learning methods typically rely on access to source data, which limits their direct applicability to scenarios where source data is unavailable. Further, practical concerns make it more difficult, for instance efficiently selecting models for transfer without information on source data, and transferring without full access to the source models. So motivated, we propose a model recycling framework for parameter-efficient training of models that identifies subsets of related source models to reuse in both white-box and black-box settings. Consequently, our framework makes it possible for Model as a Service (MaaS) providers to build libraries of efficient pre-trained models, thus creating an opportunity for multi-source data-free supervised transfer learning.

2505.12239 2026-06-08 cs.LG cs.AI cs.CR Updated version

Towards Efficient and Exact Forgetting Services in Pre-Trained-Model-based Continual Learning

面向基于预训练模型的持续学习中的高效且精确的遗忘服务

Yajiang Huang, Jianheng Tang, Kejia Fan, Huiping Zhuang, Anfeng Liu, Tian Wang, Yunhuai Liu, Mianxiong Dong, Houbing Herbert Song

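AI summary: To handle sequential unlearning requests in continual learning, proposes Analytic Continual Unlearning (ACU), which recursively derives closed-form solutions via least squares to achieve efficient and exact forgetting while preserving the privacy of historical data.

Details
AI Chinese abstract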

在持续学习(CL)中,使用预训练模型(PTM)作为特征提取器已成为一种流行做法。结合解析分类器,基于PTM的方法在CL中实现了最先进的性能,追求非遗忘目标。同时,在大多数服务构建范式(例如移动群智感知(MCS))中,主动遗忘在CL阶段获得的特定知识也至关重要,其中移动边缘节点不断收集传感数据,不仅需要非遗忘适应,还需要特定知识遗忘以保护隐私。因此,当遗忘请求在CL中顺序出现时,产生了一个独特的问题,称为持续遗忘(CU)。然而,现有的遗忘方法专注于单次联合遗忘,在应用于CU时显得非常不足,包括(1)违反CL中的历史数据隐私,以及(2)容易被对抗性频繁请求淹没或降级。为了应对CU的挑战,我们提出了一种无梯度方法,称为解析持续遗忘(ACU),用于在基于PTM的CL中实现高效且精确的遗忘,同时保护历史数据隐私。针对每个遗忘请求,我们的ACU通过最小二乘法以可解释的方式递归推导解析(即闭式)解。通过精心设计,我们的ACU兼容样本级和类别级遗忘请求。理论和实验评估验证了我们的ACU在遗忘有效性、模型保真度和系统效率方面的优越性。

English abstract

In Continual Learning (CL), using a Pre-Trained Model (PTM) as the feature extractor has become a popular practice. Accompanied by analytic classifiers, the PTM-based methods have achieved state-of-the-art performance in CL, in pursuit of the non-forgetting goal. Meanwhile, actively forgetting specific knowledge acquired during the CL phase is also essential in most service construction paradigms, for example, Mobile Crowd Sensing (MCS), where mobile edge nodes continuously collect sensory data and demand not only non-forgetting adaptation but also specific knowledge forgetting for privacy preservation. Thus, a unique problem, called Continual Unlearning (CU), arises when the forgetting requests show sequentially in CL. However, existing unlearning methods focus on single-shot joint forgetting and prove highly inadequate when applied to CU, including (1) violating the historical data privacy in CL and (2) vulnerably being overwhelmed or degraded with adversarially frequent requests. To handle the challenges of CU, we propose a gradient-free approach, called Analytic Continual Unlearning (ACU), for efficient and exact forgetting with historical data privacy preservation in PTM-based CL. In response to each unlearning request, our ACU recursively derives the analytical (i.e., closed-form) solutions via least squares in an interpretable manner. By meticulous design, our ACU is compatible with both sample-level and class-level unlearning requests. The theoretical and experimental evaluations validate our ACU's superiority in unlearning effectiveness, model fidelity, and system efficiency.
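
The idea of gradient-free, closed-form forgetting can be sketched with ordinary ridge regression. This is not the authors' ACU recursion, which the abstract does not spell out. It is a generic illustration under the assumption that the classifier is a least-squares fit on fixed features, kept as the sufficient statistics A = XᵀX + λI and b = Xᵀy. All names (`RidgeStats`, `learn`, `forget`) are hypothetical.

```python
def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


class RidgeStats:
    """Ridge regression stored as A = X^T X + lam * I and b = X^T y."""

    def __init__(self, dim, lam=1e-3):
        self.A = [[lam if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _update(self, x, y, sign):
        for i, xi in enumerate(x):
            self.b[i] += sign * xi * y
            for j, xj in enumerate(x):
                self.A[i][j] += sign * xi * xj

    def learn(self, x, y):
        self._update(x, y, +1.0)

    def forget(self, x, y):
        # Subtracting the sample's contribution removes it from the statistics,
        # so no other historical sample has to be stored or revisited.
        self._update(x, y, -1.0)

    def weights(self):
        return solve(self.A, self.b)


# Hypothetical data: features are [1, t], labels are scalars.
model = RidgeStats(dim=2)
for x, y in [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 4.0)]:
    model.learn(x, y)
model.forget([1.0, 1.0], 3.0)
print(model.weights())
```

In this sketch, forgetting is exact up to floating-point rounding: the weights after `forget` match those of a model trained only on the remaining samples, and each request costs a fixed amount of work regardless of how much data was seen before.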

2606.05759 2026-06-08 cs.CV

Physics-Guided Deep Unfolding for Blind Cross-Sensor Spectral Super-Resolution via Learning the Spectral Transformation Function

物理引导的深度展开网络用于盲跨传感器光谱超分辨率:通过学习光谱变换函数

Zhaolin Li, Jinsong Chen, Shanxin Guo, Tuo Zhang, Xinglong Zhang, Pan Chen

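AI summary: Proposes PGU-Net, a physics-guided deep unfolding network that uses alternating optimization to jointly estimate the hyperspectral image and a learnable spectral transformation function, addressing blind cross-sensor spectral super-resolution.

Details
AI Chinese abstract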

高光谱成像为定量遥感提供丰富的光谱信息,然而高光谱传感器成本高昂,因此在许多无人机部署中不可用。光谱超分辨率旨在从多光谱图像重建高光谱图像。大多数现有的SSR方法假设固定且已知的光谱响应函数,因此仅限于单传感器设置。在实际的跨传感器场景中,从HSI到MSI的光谱退化是未知的,并且随传感器特性和场景内容变化,这使得HSI重建病态。本文提出一种物理引导的深度展开网络,称为PGU-Net,通过联合估计HSI和可学习的光谱变换函数来解决盲跨传感器SSR。PGU-Net将交替优化过程展开为端到端可训练的多阶段架构,每个阶段依次更新HSI和STF。两个模块结合了可学习的近端网络和可微的闭式求解器,在保持强表示能力的同时实现物理可解释性。在具有多个SRF的基准数据集(CAVE和NTIRE 2022)上的实验表明,STF(退化算子)的准确恢复以及相对于最先进SSR方法的重建性能提升。此外,在真实无人机跨传感器数据集(Headwall Nano HSI和DJI P4多光谱MSI)上的评估验证了PGU-Net在真正盲条件下的有效性和鲁棒性,并表明估计的STF可能表现出与土地覆盖相关的差异。

English abstract

Hyperspectral imaging provides rich spectral information for quantitative remote sensing, yet hyperspectral sensors remain costly and thus unavailable in many UAV deployments. Spectral super-resolution (SSR) seeks to reconstruct hyperspectral images (HSIs) from multispectral images (MSIs). Most existing SSR methods assume a fixed and known spectral response function (SRF) and are therefore limited to single-sensor settings. In practical cross-sensor scenarios, the spectral degradation from HSI to MSI is unknown and varies with sensor characteristics and scene content, which renders HSI reconstruction ill-posed. This paper proposes a physics-guided deep unfolding network, termed PGU-Net, to address blind cross-sensor SSR by jointly estimating the HSI and a learnable spectral transformation function (STF). PGU-Net unrolls an alternating optimization procedure into an end-to-end trainable multi-stage architecture, where each stage sequentially updates the HSI and the STF. Both modules combine learnable proximal networks with differentiable closed-form solvers, enabling physical interpretability while retaining strong representation capacity. Experiments on benchmark datasets (CAVE and NTIRE 2022) with multiple SRFs demonstrate accurate recovery of the STF (degradation operator) and improved reconstruction performance over state-of-the-art SSR methods. Furthermore, evaluations on a real UAV cross-sensor dataset (Headwall Nano HSI and DJI P4 Multispectral MSI) verify the effectiveness and robustness of PGU-Net under truly blind conditions, and suggest that the estimated STF may exhibit land-cover-related differences.

2510.17568 2026-06-08 cs.CV

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

PAGE-4D: 通过解耦姿态与几何估计实现VGGT-4D感知

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang

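AI summary: Proposes PAGE-4D, which extends VGGT to dynamic scenes; a dynamics-aware aggregator disentangles static and dynamic information, improving camera pose estimation, depth prediction, and point cloud reconstruction together.

Details
Comments
ICLR 2026, VGGT-4D, Dynamic VGGT
AI Chinese abstract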

最近的3D前馈模型,如视觉几何基础变换器(VGGT),在推断静态场景的3D属性方面表现出强大的能力。然而,由于这些模型通常在静态数据集上训练,它们在涉及复杂动态元素的现实场景中(例如移动的人或可变形物体如雨伞)往往表现不佳。为了解决这一限制,我们引入了PAGE-4D,一种将VGGT扩展到动态场景的前馈模型,能够实现相机姿态估计、深度预测和点云重建——全部无需后处理。多任务4D重建的一个核心挑战是任务之间的固有冲突:准确的相机姿态估计需要抑制动态区域,而几何重建则需要对其进行建模。为了解决这一矛盾,我们提出了一种动态感知聚合器,通过预测动态感知掩码来解耦静态和动态信息——抑制姿态估计的运动线索,同时放大几何重建的运动线索。大量实验表明,PAGE-4D在动态场景中始终优于原始VGGT,在相机姿态估计、单目和视频深度估计以及密集点图重建方面取得了更优的结果。必要的代码和额外演示可在链接:https://page4d.github.io/ 获取,包括训练和推理掩码变体以及仅训练掩码变体(=推理时的VGGT架构)。关键词:VGGT-4D,4D感知,动态场景重建。

English abstract

Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction, all without post-processing. A central challenge in multitask 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask, which suppresses motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Necessary code and additional demos are available at https://page4d.github.io/, including both the training-and-inference masking variant and the training-only masking variant (which matches the VGGT architecture at inference). Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.

2506.14634 2026-06-08 cs.CL cs.AI cs.CY

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

不是什么别的吗?利用大语言模型对德国开放式调查回答进行编码:调查动机

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

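AI summary: Examines how well large language models can code open-ended survey responses. Using German data on reasons for survey participation, it compares several LLMs and prompting approaches, and finds that only a fine-tuned LLM reaches satisfactory predictive performance and that uneven classification performance across categories alters the resulting category distributions.

Details
Journal ref
Survey Research Methods (2025)
Comments
to appear in Survey Research Methods
AI Chinese abstract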

近年来,大语言模型(LLM)的发展和广泛可及性引发了关于其在调查研究中应用的讨论,包括对开放式调查回答的分类。由于其语言能力,LLM可能成为耗时的手动编码和监督学习模型预训练的高效替代方案。由于现有研究大多集中在英语回答的非复杂主题或单一LLM上,尚不清楚其发现是否具有普遍性以及这些分类的质量如何与传统方法相比。本研究探讨了不同LLM在其他情境下对开放式调查回答进行编码的程度,以德国调查参与原因的数据为例。我们比较了几种最先进的LLM和提示方法,并通过人类专家编码评估LLM的性能。总体而言,LLM之间的性能差异很大,只有微调的LLM才能达到满意的预测性能。提示方法之间的性能差异取决于所用的LLM。最后,LLM在不同调查参与原因类别上的不均等分类性能导致了不同的类别分布,当不使用微调时。我们讨论了这些发现的含义,不仅对开放式回答编码的方法学研究,还对其实质分析,以及处理或实质性分析此类数据的实践者。最后,我们强调了研究人员在选择LLM时代开放式回答分类自动化方法时需要考虑的许多权衡。通过这样做,我们的研究为关于LLM在调查研究中高效、准确和可靠应用条件的日益增长的研究做出了贡献。

English abstract

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

2606.07381 2026-06-08 eess.IV cs.AI cs.CV New submission

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

合成病灶MR图像在低数据场景下自动局灶性皮质发育不良检测中的影响

Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, Simon K. Warfield

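AI summary: Synthesizes MRI data with FCD lesions using a conditional generative network and evaluates its realism and its effect on automated detection; finds that synthetic data can reduce annotation needs by about 20%, although real data remains more effective.

Details
AI Chinese abstract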

背景与目的:自动检测局灶性皮质发育不良(FCD)需要大量体素级病灶勾画的MRI数据,这些数据难以获取。本研究旨在生成呈现FCD的合成MRI数据,评估其真实性,并评估其对自动FCD检测的影响,特别是在减少手动标注需求方面。方法:回顾性研究了来自多个(3个)中心的131例FCD患者和90例健康对照的T1加权(T1w)和T2加权液体衰减反转恢复(FLAIR)MRI扫描。通过将生成网络以二元FCD掩膜为条件生成合成MRI。两位神经放射科医生从14张真实和14张合成扫描的随机集合中识别真实图像。训练了三个nnU-Net模型用于检测FCD,分别使用:(i)仅真实数据(35例FCD/35例对照),(ii)真实数据(35例FCD/35例对照)加合成增强,以及(iii)扩展的真实数据(70例FCD/70例对照)。结果:专家区分真实与合成图像的能力有限,T1w分类准确率为60%,FLAIR为70%(评分者间一致性kappa=0.86)。用合成数据增强自动FCD检测使灵敏度提高8.14%(p=0.12),并改善了模型在真实病灶部位的置信度(0.83±0.11至0.89±0.12;p=0.02)。扩展真实数据模型进一步将灵敏度提高至73.8%(p<0.001),置信度提高至0.90±0.14(p=0.01)。结论:条件生成网络可以生成逼真的合成FCD-MRI,在保持同等灵敏度的情况下减少约20%的标注数据需求。当可用时,等量的真实数据仍比合成增强更有效。

English abstract

Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.

2606.06983 2026-06-08 eess.IV cs.AI cs.CV New submission

DaX: Learning General Pathology Representations Across Scales

DaX: 跨尺度的通用病理学表示学习

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao, Minfeng Xu

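AI summary: Proposes DaX, a pathology vision foundation model that adapts DINOv3 self-supervised learning with designs such as continuous magnification training and cross-scale tissue views, and achieves the highest mean performance on 161 clinical tasks from 44 public datasets.

Details
AI Chinese abstract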

计算病理学需要能够跨不同临床终点迁移且对放大倍数、染色、扫描仪类型、切片制备和输入分辨率变化保持鲁棒的视觉表示。我们提出DaX,一个病理视觉基础模型,它将DINOv3风格的自监督学习适应到全切片组织病理学。DaX从自然图像DINOv3权重初始化,并融合了连续放大训练、跨尺度组织视图、方向无关和采集鲁棒的数据增强、多输入尺寸训练以及Gram锚定的密集一致性。这些设计旨在连接局部细胞形态与全局组织结构,同时稳定跨输入尺度的密集token级表示。我们进一步构建了一个WSI级基准,包含来自44个公共数据集的161项临床有意义任务,涵盖28,182名患者和34,394张切片,跨越四个临床领域和九个任务类别。所有模型在固定的患者级交叉验证协议下进行评估,并采用折叠级统计排名,从而实现可重复的比较,对分割依赖的变异性不敏感。在该基准上,DaX在任务中取得了最高的平均性能,并持续获得强大的任务级排名分数,其增益涵盖诊断病理学、生物标志物和分子谱分析、组织/标本背景以及风险、反应和预后。这些结果支持DaX作为计算病理学的可迁移视觉编码器,并为未来的病理基础模型提供了标准化的评估框架。项目页面:https://alibaba-damo-academy.github.io/DaX/benchboard/

English abstract

Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.

2606.07016 2026-06-08 stat.AP cs.CV New submission

An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

信号交叉口弱势道路使用者安全的集成路边感知与通信框架

Parvez Anowar

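AI summary: Proposes a framework integrating multi-modal sensing, edge computing, V2X/P2X communication, and adaptive signal control. An analysis of 53,319 annotations from the public R-LiViT dataset finds that VRUs account for about 49% of observations, that density differs substantially between day and night, that close-proximity event counts vary about 10-fold across locations, and that 83% of pedestrian bounding boxes are small, supporting multi-modal sensing and adaptive deployment.

Details
Comments
17 pages, 5 figures, 2 tables. Preprint
AI Chinese abstract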

弱势道路使用者(VRU)约占全球城市交通死亡人数的一半,而交叉口集中了不成比例的伤亡。最近关于VRU保护的感知技术综述列举了数十种单传感器和双传感器部署,但所调查的系统均未将多模态感知与边缘侧近碰撞分析以及双向车联万物(V2X)和行人联万物(P2X)消息传递集成在单个交叉口机柜中。本文提出一个信号交叉口VRU保护的综合框架,在感知层结合LiDAR、雷达、RGB相机和热成像相机,在计算层进行基于边缘的预测和替代安全分析,在通信层进行V2X和P2X消息传递,在驱动层进行自适应信号控制。该框架基于使用R-LiViT(首个公开的路边LiDAR-视觉-热成像数据集)的实证案例研究,该数据集提供了200个多模态序列和2,400个标注的RGB-T帧,来自三个德国交叉口。对53,319个检测标注的分析显示,VRU约占所有道路使用者观测的49%;从白天到夜晚,行人密度下降38%,车辆下降45%,而夜间分布显示更高的近距离比例;在三个交叉口的八个独特位置,每帧近距离事件计数变化约10倍;83%的行人边界框在图像空间中较小,表明VRU通常远离任何单个传感器。这些发现支持多模态感知、边缘侧分析和自适应上下文感知部署,而非统一的单传感器解决方案。

English abstract

Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.

2606.06537 2026-06-08 q-bio.QM cs.CV eess.IV New submission

DSU-Net: An Attention-Enhanced Dense Skip U-Net for Breast Lesion Segmentation in Mammographic Images

DSU-Net:用于乳腺X线图像中乳腺病变分割的注意力增强密集跳跃U-Net

Reza Bozorgpour, Mohammadreza Soltany Sadrabadi

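AI summary: Proposes DSU-Net, which uses dense skip connections and attention mechanisms to improve feature propagation and boundary delineation, achieving high-accuracy breast lesion segmentation on the CBIS-DDSM dataset.

Details
AI Chinese abstract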

乳腺癌仍然是全球女性癌症相关死亡的主要原因之一,因此早期检测对于有效治疗至关重要。乳腺X线摄影是主要的筛查方式;然而,可疑病变的准确勾画仍然具有挑战性,且存在观察者间差异。自动分割方法可以通过提供一致且高效的病变定位来辅助放射科医生。本研究提出了DSU-Net,一种用于乳腺X线图像中自动乳腺病变分割的注意力增强密集跳跃U-Net架构。该框架集成了密集跳跃连接和注意力机制,以改进特征传播、保留空间信息并增强病变边界描绘。实验使用了乳腺摄影筛查数字数据库的精选乳腺成像子集(CBIS-DDSM)。为了解决严重的前景-背景不平衡问题,训练中采用了结合Dice损失、焦点损失和二元交叉熵损失的复合损失函数。所提模型在验证数据集上实现了0.9421的Dice相似系数、0.8905的交并比、0.9711的准确率和0.9878的AUC-ROC。定性评估显示了对不同大小和形态病变的准确勾画,而定量结果证实了病变与背景区域之间的稳健区分。这些发现表明,DSU-Net在乳腺X线图像中提供了准确可靠的乳腺病变分割,并突出了注意力引导深度学习在计算机辅助乳腺癌筛查和诊断中的潜力。

English abstract

Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, making early detection essential for effective treatment. Mammography is the primary screening modality; however, accurate delineation of suspicious lesions remains challenging and subject to inter-observer variability. Automated segmentation methods can assist radiologists by providing consistent and efficient lesion localization. This study presents DSU-Net, an attention-enhanced Dense Skip U-Net architecture for automated breast lesion segmentation in mammographic images. The proposed framework integrates dense skip connections and attention mechanisms to improve feature propagation, preserve spatial information, and enhance lesion boundary delineation. Experiments were conducted using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). To address severe foreground-background imbalance, a composite loss function combining Dice loss, focal loss, and binary cross-entropy loss was employed during training. The proposed model achieved a Dice Similarity Coefficient of 0.9421, an Intersection over Union of 0.8905, an accuracy of 0.9711, and an AUC-ROC of 0.9878 on the validation dataset. Qualitative evaluation demonstrated accurate delineation of lesions with varying sizes and morphologies, while quantitative results confirmed robust discrimination between lesion and background regions. These findings demonstrate that DSU-Net provides accurate and reliable breast lesion segmentation in mammographic images and highlight the potential of attention-guided deep learning for computer-aided breast cancer screening and diagnosis.
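
A composite of Dice, focal, and binary cross-entropy losses can be sketched as below. The abstract gives neither the weights nor the focal exponent, so the equal weights, `gamma=2.0`, and the smoothing constant are all assumptions, and the pure-Python functions operate on a flattened mask rather than on tensors in a training framework.

```python
import math

EPS = 1e-7  # clamp probabilities to avoid log(0)


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def bce_loss(probs, targets):
    """Mean binary cross-entropy over pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clamp(p)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0):
    """Mean focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clamp(p)
        p_t = p if y == 1 else 1.0 - p  # probability given to the true class
        total -= (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * sum(p * y) + s) / (sum(p) + sum(y) + s)."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)


def composite_loss(probs, targets, w_dice=1.0, w_focal=1.0, w_bce=1.0):
    """Weighted sum of the three terms; equal weights are an assumption."""
    return (w_dice * dice_loss(probs, targets)
            + w_focal * focal_loss(probs, targets)
            + w_bce * bce_loss(probs, targets))


# Hypothetical flattened mask: one lesion pixel among mostly background.
targets = [0, 0, 0, 0, 0, 0, 0, 1]
probs = [0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.6]
print(composite_loss(probs, targets))
```

The three terms complement each other: the Dice term is driven by overlap with the lesion region rather than by the pixel count, the focal term down-weights pixels that are already classified confidently, and cross-entropy supplies a per-pixel signal.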

2606.07412 2026-06-08 cs.SE cs.AI New submission

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Socratic-SWE:通过轨迹衍生智能体技能实现自我进化的编码智能体

Chuan Xiao, Zhengbo Jiao, Shaobo Wang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang, Lin Qu

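AI summary: Proposes Socratic-SWE, a closed-loop self-evolution framework that distills an agent's historical solving traces into structured skills and uses them to generate targeted repair tasks, enabling coding agents to keep improving themselves.

Details
Comments
21 pages, 5 figures. Under review
AI Chinese abstract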

基于LLM的软件工程智能体已成为现实世界语言模型能力的核心测试平台,但其训练仍受限于高质量SWE任务的可用性。现有的合成数据方法通常通过固定突变或漏洞注入程序创建任务,导致生成的任务分布很大程度上独立于智能体自身的弱点和训练进度。我们提出Socratic-SWE,一个闭环自进化框架,将智能体的历史求解轨迹重新用作训练信号的来源。Socratic-SWE不仅将轨迹视为奖励计算的证据,还将其蒸馏为结构化的智能体技能,总结重复出现的失败模式和有效的修复模式。这些技能随后指导在真实仓库中生成针对性的修复任务。候选任务通过基于执行的验证进行检查,并使用求解器梯度对齐奖励进行评分,从而保留的任务既可验证又有助于改进求解器。更新后的求解器产生新的轨迹,使任务课程能够在连续轮次中自适应。在SWE-bench Verified、SWE-bench Lite、SWE-bench Pro和Terminal-Bench 2.0上,Socratic-SWE在相同计算预算下持续优于自我进化的基线,经过三次迭代后在SWE-bench Verified上达到50.40%。这些结果表明,求解轨迹可以作为自我进化SWE智能体的可扩展基础。

English abstract

LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.

2606.07245 2026-06-08 cs.CY cs.AI New submission

AI Sovereignty: A Qualitative Model of Strategic Competition as AI Becomes an Instrument of National Power

AI主权:当AI成为国家力量工具时的战略竞争定性模型

Timothy Clancy, Asmeret Naugle

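AI summary: Proposes definitions of AI sovereignty and a first-of-its-kind qualitative model incorporating micro, meso, and macro factors; analyzes AI-driven strategic competition between nations and identifies key leverage points (such as accelerators, electricity, and data) and how they can be activated through direct kinetic and indirect non-kinetic actions.

Details
Comments
Main article: 19 pages, 10 figures. Supplementary: 19 pages, 7 figures, 7 tables. To be presented at the 2026 International System Dynamics Conference (ISDC), July 20-24, TU Delft, Delft, Netherlands
AI Chinese abstract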

AI主权是一个国家独立控制其人工智能(AI)技术的程度。对日益复杂的前沿AI模型的竞争具有越来越重要的战略意义,各国正在考虑AI如何改善其经济状况、竞争优势和整体国家实力。然而,AI主权的成本巨大,我们缺乏定义和概念模型来应对不断演变的AI主权动态。我们通过提出与AI主权相关的定义,以及一个首次包含微观、中观和宏观因素的定性模型来填补这一空白。基于模型的定性预测突出了竞争动态和AI驱动国家实力的潜在演变。该模型识别了各国可用于增强自身增长或削弱对手的关键杠杆点,包括考虑加速器、电力、水、数据集和熟练劳动力。这些杠杆点可以通过直接动能行动(如伊朗用无人机瞄准数据中心)和间接非动能效应(包括网络、太空、信息、经济胁迫和外交)在战略和操作层面激活。如果我们的假设和假说成立,这种战略竞争可能将定义21世纪各国如何改善其经济状况、竞争优势和整体国家实力。

English abstract

AI sovereignty is the extent to which a nation independently controls its artificial intelligence (AI) technologies. The race toward ever-more-sophisticated frontier AI models is of increasing strategic importance, with nations considering how AI might improve their economic situations, competitive advantage, and overall national power. However, the costs of AI sovereignty are enormous, and we lack definitions and conceptual models to navigate evolving AI sovereignty dynamics. We address this gap with definitions relevant to AI sovereignty, along with a first-of-its-kind qualitative model that incorporates micro, meso, and macro contributors. Model-based qualitative forecasts highlight competitive dynamics and evolving potential for AI-driven national power. The model identifies key leverage points that nations can use to enhance their own growth or degrade an adversary's, including consideration of accelerators, electricity, water, data sets and skilled workforce. These leverage points can be activated at strategic and operational levels through both direct kinetic actions, such as Iran's targeting of data centers with drones, and indirect non-kinetic effects including cyber, space, information, economic coercion and diplomacy. If our assumptions and hypotheses are valid, this strategic competition may come to define how nations improve their economic situations, competitive advantage, and overall national power in the 21st Century.

2606.07205 2026-06-08 cs.DS cs.LG New submission

Towards Tight Bounds for Streaming Attention

流式注意力机制的紧界

Justin Y. Chen, Ying Feng, Piotr Indyk, Michael Kapralov, Ekaterina Kochetkova, Boris Prokhorov

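AI summary: By tightly combining three approaches to kernel density estimation (discrepancy theory, the polynomial method, and space partitioning), the paper gives nearly tight bounds on the space complexity of streaming attention approximation, and introduces a new lower-bound technique based on the INDEX problem with a large amount of side information.

Details
AI Chinese abstract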

注意力机制是现代Transformer架构的基石。然而,其表达能力以二次运行时和线性空间使用为代价。特别是,经典Transformer架构显式存储所有先前看到的输入元素(token)以生成下一个。在有限空间中实现Transformer的问题,称为KV缓存压缩,在过去几年中引起了广泛关注,推动了强大启发式算法的发展。Haris等人(COLT'25)和Kochetkova等人(NeurIPS'25)的最新工作将KV缓存压缩形式化为流式注意力近似问题,并提供了基于差异理论的上界和信息论下界。然而,这些论文在上界和下界之间留下了显著差距。例如,他们算法的空间使用随精度参数增加,但下界并未增强。在这项工作中,我们重新审视流式注意力近似问题,并给出了其空间复杂度的几乎紧界。在算法方面,我们通过核密度估计的三种不同方法(基于差异的coreset构造(如Charikar-Kapralov-Waingarten'24)、多项式方法(如Greengard-Rokhlin'87、Alman-Song'23)和空间划分(如Andoni-Laarhoven-Razenshteyn-Waingarten'17、Charikar-Kapralov-Nouri-Siminelakis'20))之间令人惊讶的紧密相互作用实现了这一结果。在下界方面,我们的主要技术贡献是一种使用大量辅助信息的INDEX问题的新技术,我们希望这将在其他高维几何估计问题中证明有用。

English abstract

The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (tokens) in order to generate the next one. The problem of implementing a transformer in limited space, known as KV cache compression, has received much interest over the past few years, spurring the development of powerful heuristics. Recent works of Haris et al, COLT'25 and Kochetkova et al, NeurIPS'25, formalized KV cache compression as the streaming attention approximation problem, providing both upper bounds (based on discrepancy theory) and information theoretic lower bounds. However, those papers left open a significant gap between the upper and lower bounds. For example, the space usage of their algorithms increases with the precision parameter, but the lower bound does not get stronger. In this work, we revisit the streaming attention approximation problem and provide nearly tight bounds on its space complexity. On the algorithmic side, we achieve the result through a surprisingly tight interplay between three distinct methods for kernel density estimation: discrepancy-based coreset constructions (e.g., Charikar-Kapralov-Waingarten'24), the polynomial method (e.g., Greengard-Rokhlin'87, Alman-Song'23), and space partitioning (e.g., Andoni-Laarhoven-Razenshteyn-Waingarten'17, Charikar-Kapralov-Nouri-Siminelakis'20). On the lower bound side, our main technical contribution is a new technique for using the INDEX problem with a large amount of side information that we hope will prove useful in other high dimensional geometric estimation problems.
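
The linear space usage described above comes from the exact KV cache, which the following sketch makes explicit. It shows only the uncompressed baseline that streaming attention approximation tries to beat, not the paper's algorithm. The 1/sqrt(d) scaling follows the usual transformer convention and may differ from the paper's formalization, and the names `attend` and `KVCache` are illustrative.

```python
import math


def attend(query, keys, values):
    """Softmax attention output for one query over all cached (key, value) pairs."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]


class KVCache:
    """Exact (uncompressed) cache: memory grows by one key and one value per token."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def query(self, q):
        return attend(q, self.keys, self.values)


# Hypothetical stream of three tokens with 2-dimensional keys and values.
cache = KVCache()
for key, value in [([1.0, 0.0], [1.0, 0.0]),
                   ([0.0, 1.0], [0.0, 1.0]),
                   ([1.0, 1.0], [0.5, 0.5])]:
    cache.append(key, value)
print(len(cache.keys), cache.query([1.0, 0.0]))
```

Each query touches every stored pair, so answering n queries over a stream of n tokens costs time quadratic in n, and the cache itself is linear in n. A streaming algorithm must instead answer queries approximately from a summary smaller than the full list of keys and values.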