arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25312 2026-06-09 cs.CL version update

P1SCO: Social Dimensions from a Perspectivist Lens

Amanda Cercas Curry, Gianmarco de Francisci Morales, Luca Maria Aiello

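Affiliations: Independent Researcher; CENTAI, Turin; IT University of Copenhagen

AI summary: Introduces the P1SCO dataset of social media comments collected from three platforms and annotated along ten social dimensions to capture the diversity of social interactions and perceptions, supporting fine-grained analysis and studies of cross-platform and individual differences.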

Abstract

We introduce P1SCO, a dataset of social media comments collected from three distinct platforms, annotated according to ten social dimensions to capture the diversity of social interactions and perceptions. The dataset is carefully disaggregated to allow analysis at the level of individual comments, annotators, and platforms. In addition to the social dimension labels, we include rich metadata on the annotators, including demographics, Big Five personality profiles, and political affiliation. This combination of comment-level annotations and annotator-level features enables nuanced analyses of how social perception varies across platforms, individual differences, and demographic factors. By preserving the diversity of annotator perspectives, our dataset supports studies of inter- and intra-annotator agreement, the influence of personality and political orientation on social interpretation, and the cross-platform dynamics of social discourse.

2605.24942 2026-06-09 cs.LG cs.AI version update

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

Narmeen Oozeer, Shivam Raval, Philip Quirke, Manikandan Ravikiran, Jeff Phillips, Shriyash Upadhyay, Amirali Abdullah

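Affiliations: Martian; Harvard University; Thoughtworks; University of Utah

AI summary: Recasts language-model steering as Riemannian geodesic computation on activation space, using an encoder trained on output-space Hellinger distances to achieve manifold steering without labels or a topology prior.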

Abstract

Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Recently introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids together with prescribed cyclic or sequential structure. These assumptions restrict where manifold steering can be applied, since existing constructions require labelled centroids and compatible boundary conditions. We recast manifold steering more broadly as Riemannian geodesic computation on activation space, recovering linear and labelled-spline steering as geodesics under particular choices of metric. A principled metric within this framework is the output-space Hellinger distance pulled back to activations; we approximate this with a learned encoder trained on output distances over a small concept-token schema - no per-prompt labels, no topology prior, and no per-task curve fitting. Empirically, the method reliably drives the model onto the target class across all tasks in a standard four-task language-model arithmetic benchmark, while following more behaviourally natural trajectories than baselines on smaller output spaces. We thereby provide a unified Riemannian framework for manifold steering together with a schema-supervised, label-free instantiation that operates without labelled centroids or prescribed boundary conditions.
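
For readers unfamiliar with the Hellinger distance named in this abstract, the following is a minimal Python sketch of that distance between two discrete output distributions. The function name hellinger and the toy next-token distributions are illustrative assumptions, not taken from the paper; the sketch does not reproduce the paper's learned encoder or the pullback of the metric to activation space.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Both inputs are assumed to be non-negative and to sum to 1.
    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical next-token distributions over a 3-token vocabulary.
p_next = [0.7, 0.2, 0.1]
q_next = [0.1, 0.2, 0.7]

print(hellinger(p_next, p_next))  # identical distributions
print(hellinger(p_next, q_next))  # distributions that disagree on the likely token
```

Because the distance is bounded and symmetric, it gives a comparable scale for "how different is the model's output" at any two points along a steering path.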

2605.24892 2026-06-09 cs.CV version update

X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

Baolu Li, Jingyu Qian, Rui Guo, Yilun Chen, Hanpeng Liu, Yuan Lin, Junhong Zhou, Ruixin Liu, Willow Yang, Yutong Zheng, Zhenli Zhang, Sean Li, Chaoda Zheng, Boyang Wang, Tenglong Gu, Zhuangzhuang Ding, Pengkun Zheng, Yu Zhang, Xianming Liu

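Affiliations: PWM Team; XPeng Inc.

AI summary: Proposes X-Foresight, which integrates a predictive world model directly into the VLA architecture and jointly learns world modeling and real-time action control through a long-horizon chunk-wise autoregressive strategy and curriculum learning, addressing low-entropy redundancy in video prediction and the difficulty of modeling long-horizon causality.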

Abstract

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer to improve visual realism. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.

2605.24890 2026-06-09 cs.CV version update

QuoVLA: Quotient Space for Vision-Language-Action Models

Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han

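Affiliations: Department of Automation

AI summary: Challenging the view that pretrained VLM latents carry insufficient action information for VLA models, proposes QuoVLA, a quotient-space framework that compresses the latents into action-sufficient representations with a quantization module and a dual-branch design, improving generalization on multiple benchmarks.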

Abstract

Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are not action-insufficient but action-sufficient: they already contain the information needed for control, yet remain overcomplete by distinguishing prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.

2603.04862 2026-06-09 cs.SD version update

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi

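Affiliations: University of California, Berkeley; University of Washington

AI summary: Proposes FTL, a plug-and-play audio enhancer that separates speech from non-speech, uses a modality router to predict the target modality, and generates a task-adaptive enhanced signal, improving LALM performance in noisy conditions without fine-tuning.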

Comments
Accepted by ICML 2026 Workshop (Machine Learning for Audio)

Abstract

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and then applies a modality router to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.

2605.23595 2026-06-09 cs.LG cs.AI cs.CV cs.ET cs.PF version update

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

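Affiliations: Griffith University; Edith Cowan University; The University of Queensland

AI summary: Proposes MetaEvaluator, a model-agnostic meta-learning framework that uses a pool of reference models to evaluate new models on unlabeled data quickly, accurately, and cost-effectively.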

Comments
Accepted by KDD 2026

Abstract

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2605.23247 2026-06-09 cs.LG version update

Accelerating Divisible Load Processing Through Machine Learning: A Practical Framework for Large-Scale Workloads

Bharadwaj Veeravalli

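Affiliations: Department of Electrical and Computer Engineering, National University of Singapore

AI summary: Proposes the first machine learning framework that uses a feedforward neural network to predict optimal processing times in single-level tree network architectures, reaching 97-99% accuracy (R-squared) and a mean absolute percentage error of 1-5%, with inference times under 1 millisecond and a 10-100x speedup over traditional methods.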

Abstract

In this paper, we introduce the first machine learning framework for predicting optimal processing times in Single-Level Tree Network (SLTN) architectures for the Divisible Load Theory (DLT) paradigm. Using a feedforward neural network (FNN) with 16 engineered features, we train a model on 100,000 synthetically generated configurations to predict optimal processing times without explicit formulation of DLT equations. The model achieves 97-99% accuracy (measured by R-squared) with a mean absolute percentage error of 1-5%, demonstrating that neural networks can effectively learn complex load distribution relationships. Feature importance analysis reveals that the model implicitly captures DLT mathematical structure, including load conservation and simultaneous finishing constraints. With inference times under 1 millisecond, the approach offers a viable alternative to traditional DLT computation, enabling applications in real-time scheduling, design space exploration, and cloud resource allocation. The method generalizes well across diverse system configurations (n = 3 to 20, load size = 1 to 100 GB) with consistent accuracy, though performance degrades slightly for very large or highly heterogeneous systems. This work demonstrates the feasibility of using machine learning to accelerate distributed computing optimization while maintaining near-optimal accuracy.
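
The two headline metrics in this abstract, R-squared and mean absolute percentage error (MAPE), can be sketched in a few lines of Python. The function names and the toy processing times below are illustrative assumptions; they are not the paper's data or evaluation code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. True values must be nonzero."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical optimal processing times (seconds) and model predictions.
true_times = [10.0, 20.0, 40.0]
pred_times = [11.0, 19.0, 40.0]

print(r_squared(true_times, pred_times))
print(mape(true_times, pred_times))
```

R-squared measures how much of the variance in the true times the predictions explain, while MAPE reports the typical relative error, so the two numbers answer different questions about the same predictions.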

2605.22863 2026-06-09 cs.LG version update

Latent Cache Flow: Model-to-Model Communication Without Text

Maximillian Rossi, Prajwal Raghunath, Eugene Wu

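Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes Latent Cache Flow (LCF), which jointly translates and compresses key-value caches for efficient model-to-model communication; when contexts differ, it improves Exact Match by 23% and is 8.5 times faster than text-based communication.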

Comments
6 pages, 5 figures

Abstract

LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode it at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and they translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a pruned 13 MB LCF adapter can be more accurate than C2C at 956 MB in shared-context settings; for different contexts, LCF improves F1 by 7.5% and Exact Match by 23% while being 8.5 times faster than text-based communication.

2604.24594 2026-06-09 cs.CL cs.AI version update

Skill Retrieval Augmentation for Agentic AI

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

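Affiliations: Department of Computer Science and Technology, Tsinghua University; ByteDance Inc.

AI summary: To address exhausted context windows and declining skill-identification accuracy as skill libraries grow in existing agent systems, proposes Skill Retrieval Augmentation (SRA), a paradigm that improves agent performance by dynamically retrieving from an external skill corpus, and builds the SRA-Bench benchmark, which exposes a bottleneck in skill incorporation.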

Abstract

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
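
The retrieval step of this paradigm can be illustrated with a minimal Python sketch: rank skills by cosine similarity to a query embedding and put only the top k into context instead of listing every skill. The skill names, the three-dimensional toy embeddings, and the function retrieve_skills are hypothetical; the abstract does not specify which retriever SRA-Bench uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_skills(query_vec, skill_vecs, k=2):
    """Return the names of the k skills whose embeddings are closest to the query."""
    ranked = sorted(
        skill_vecs,
        key=lambda name: cosine(query_vec, skill_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical skill corpus with toy embeddings.
skills = {
    "parse_pdf": [0.9, 0.1, 0.0],
    "run_sql": [0.1, 0.9, 0.1],
    "plot_chart": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # toy embedding of a task description

print(retrieve_skills(query, skills, k=2))
```

Retrieval only narrows the candidates. As the abstract notes, the agent must still decide whether to load a retrieved skill at all, which this sketch does not model.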

2605.22763 2026-06-09 cs.AI version update

Advancing Mathematics Research with AI-Driven Formal Proof Search

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Edward Lockhart, Codrut Grosu, Thomas Hubert, Matej Balog, Pushmeet Kohli, Swarat Chaudhuri

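Affiliations: Google DeepMind; Aarhus University

AI summary: Studies the use of large language models to generate formal proofs for open mathematical problems and demonstrates applications of AI-aided formal proof search in mathematics research.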

Abstract

Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.

2605.22664 2026-06-09 cs.AI version update

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, Hongseok Namkoong

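Affiliations: Decision, Risk, and Operations Division, Columbia Business School; ESB Business School, Reutlingen University

AI summary: Presents MBABench, a benchmark for evaluating LLM agents on complex end-to-end spreadsheet tasks in finance, focusing on key workflows such as financial modeling and scenario analysis and measuring solution quality with fine-grained criteria along three dimensions (Accuracy, Formula, Format).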

Abstract

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

2605.11314 2026-06-09 cs.CV cs.AI version update

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon

Affiliations: Department of Biomedical Informatics, Emory University; Shriners Children’s; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary: Proposes a markerless gait analysis method based on single-view video that quantifies the knee and ankle z-scores used in the Rodda and Graham gait classification, enabling scalable, objective gait assessment in low-resource clinical settings.

Comments
29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

Abstract

Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
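
The abstract reports agreement with 3D-IGA using the concordance correlation coefficient (CCC) alongside R-squared. The sketch below computes Lin's CCC in plain Python under the usual population-moment definition; the function name and the toy z-scores are illustrative assumptions, not the study's data or code.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between two paired samples.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    with variances and covariance computed using 1/n (population) moments.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical knee z-scores: reference (3D-IGA) vs. video-based estimates.
reference = [1.0, 2.0, 3.0]
estimate = [2.0, 3.0, 4.0]  # perfectly correlated but shifted by one unit

print(concordance_ccc(reference, reference))
print(concordance_ccc(reference, estimate))
```

Unlike Pearson correlation, CCC penalizes a systematic offset between the two measurements, which is why the shifted estimates above score below 1 even though they are perfectly correlated with the reference.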

2605.22079 2026-06-09 cs.CL version update

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

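Affiliations: ONESTRUCTION Inc.; AWS GenAI Innovation Center

AI summary: Presents Ishigaki-IDS-Bench, a benchmark for evaluating how well large language models generate industry-standard Information Delivery Specification (IDS) XML; using 166 examples written and validated by BIM/IDS experts and combining content-fidelity scoring with structural auditing, it shows the limitations of current LLMs in producing XML that satisfies the IDS standard and IFC vocabulary constraints.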

Comments
7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face

Abstract

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.

2605.21854 2026-06-09 cs.CV cs.AI version update

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

Zhi Liu

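Affiliations: Tianjin University

AI summary: Studies cross-paradigm post-training for Vision-Language-Action (VLA) models and presents CrossVLA, which contributes a flow-matching log-probability estimator for continuous actions, compares LoRA and DoRA as parameter-efficient layers, and shows how the denoising loop affects inference latency, with notable gains on LIBERO.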

Comments
Workshop draft, 14 pages, 4 figures. Code, ckpts, data: https://github.com/lz-googlefycy/vla-lab

Abstract

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz-googlefycy/vla-lab.
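
As background for the DPO step discussed in this abstract, here is a minimal Python sketch of the standard DPO loss for one preference pair, written in terms of sequence log-probabilities. The log-probabilities are assumed to be given: the sketch does not reproduce the paper's surrogate flow-matching estimator, and the function name, argument names, and numbers are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is a log-probability of an action sequence under the
    trainable policy or the frozen reference model. The loss is
    -log(sigmoid(beta * margin)), where margin compares how much the policy
    has shifted toward the chosen sample relative to the rejected one.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) equals softplus(-m); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one preference pair.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))  # policy already prefers the chosen sample
print(dpo_loss(-3.0, -1.0, -2.0, -2.0, beta=1.0))  # policy prefers the rejected sample
```

The loss only needs log-probabilities of whole samples, which is why a continuous-action backbone requires some estimator of that quantity before DPO can be applied.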

2605.21028 2026-06-09 cs.CV cs.AI 版本更新

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

DySink:动态帧 sinks 用于自回归长视频生成

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

Affiliations: School of Computer Science and Engineering, Southeast University; Key Lab. of Computer Network and Information Integration, Southeast University; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Institute of Automation, CAS

AI summary: This paper proposes DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks, improving the dynamic degree and temporal quality of autoregressive long video generation.

详情
AI中文摘要

自回归长视频生成通常采用有界内存流以提高效率,通常结合局部窗口实现短期连续性与静态早期帧 sinks 作为长程锚点。然而,这种固定分配在当前视觉状态与早期帧大幅偏离时仍会缓存早期帧,而丢弃可能更相关的中间历史。结果,保留的长程上下文可能变得不适应,并偏向过时的线索;在严重情况下,RoPE 引起的相位再对齐会homogenize 头间注意力并导致 sink 崩溃,其中内容会回归到 sink 帧。我们提出 DySink,一种基于检索的框架,维护紧凑的记忆银行并选择视觉相关的历史帧作为动态帧 sinks。DySink 将自适应检索与 sink 异常门相结合,后者检测检索上下文中的过度头间共识并抑制易崩溃的上下文。在分钟级视频上的实验表明,DySink 在动态度方面一致优于强基线,同时也实现了更高的时间质量。代码和模型权重将在 https://github.com/yebo0216best/DySink 上发布。

英文摘要

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.

2604.24199 2026-06-09 cs.SD cs.AI eess.AS eess.SP 版本更新

Speech Enhancement Based on Drifting Models

基于漂移模型的语音增强

Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

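Affiliations: Victoria University of Wellington; Lincoln University; GN Advanced Science

AI summary: This paper proposes DriftSE, a speech enhancement framework based on drifting models that formulates denoising as an equilibrium problem and achieves one-step inference, enabling high-quality enhancement that can be trained on unpaired data.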

详情
Comments
6 pages, 2 figures
AI中文摘要

我们提出了一种基于漂移模型的语音增强(DriftSE),一种新颖的生成框架,将去噪建模为一个平衡问题。与依赖迭代采样的方法不同,DriftSE通过演化映射函数的推动分布来实现单步推理,直接匹配干净语音分布。这种演化由漂移场驱动,这是一种学习到的修正向量,引导样本向干净分布的高密度区域发展,这自然促进了在未配对数据上的训练,通过匹配分布而非配对样本。我们从两种形式研究了该框架:从噪声观测到直接映射,以及从高斯先验的随机条件生成模型。在VoiceBank-DEMAND基准测试中,DriftSE在单步中实现了高保真度的增强,优于多步扩散基线,并建立了语音增强的新范式。

英文摘要

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF 版本更新

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

协同优化中的因果卸载:在对抗性贡献下的精确和近似影响反转

Ali Mahdavi, Azadeh Zamanifar, Amirfarhad Farhadi, Omid Kashefi

Affiliations: Department of Computer Engineering, SRC, Islamic Azad University, Tehran, Iran; School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran; Meta, CA, USA

AI summary: This paper proposes HF-KCU, which approximates the influence function with conjugate gradient iterations in Krylov subspaces to carry out data deletion in collaborative optimization, reducing computational cost while restoring privacy.

详情
AI中文摘要

联邦学习系统必须支持数据删除请求以符合隐私法规,但每次删除后重新训练是计算上不可行的。我们提出了HF-KCU方法,通过在Krylov子空间中进行共轭梯度迭代近似影响函数,将复杂度从O(d^3)降低到O(kd),其中k<<d。因果加权机制确保只有持有删除数据的客户端接收参数更新,防止对未受影响的客户端造成虚假变化。我们的方法设计用于处理有界对抗性扰动的Hessian和梯度,提供在现实威胁模型下的优雅退化。我们在卷积(ResNet-18,SimpleCNN)和Transformer(ViT-Lite)架构上CIFAR-10、MNIST和Fashion-MNIST数据集上验证了HF-KCU。在CIFAR-10的Dirichlet(alpha=0.5)划分下,HF-KCU在重新训练的基础上实现了47.75倍的速度提升,同时保持测试准确率在0.60%以内(71.16 vs 71.76%)。对遗忘集的成员推断攻击的成功率达到了0.499,与重新训练模型匹配,证实了有效的隐私恢复。我们提供了收敛保证,显示Krylov近似误差随着O((k^{1/2}-1)/(k^{1/2}+1))递减,其中k是Hessian条件数。因果加权机制确保了手术更新,只有持有删除数据的客户端被修改,保护了未受影响参与者的模型质量,并避免了异步联邦设置中梯度方法的不稳定性。该设计提供了可解释性,因为每个更新都可以直接追溯到删除数据的影响。该方法的效率和精度使其适用于生产联邦系统,其中删除请求异步到达且计算预算受限。

英文摘要

Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client's contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from O(d^3) to O(kd) where k << d. A causal weighting mechanism ensures that only clients holding the deleted data receive parameter updates, preventing spurious changes to unaffected clients. Our method is designed to handle bounded adversarial perturbations to the Hessian and gradient, providing graceful degradation under realistic threat models. We validate HF-KCU across convolutional (ResNet-18, SimpleCNN) and transformer (ViT-Lite) architectures on CIFAR-10, MNIST, and Fashion-MNIST. On CIFAR-10 under Dirichlet (alpha=0.5) partitioning, HF-KCU achieves a 47.75x speedup over retraining while maintaining test accuracy within 0.60% of the retrained baseline (71.16% vs 71.76%). Membership inference attacks on the forget set yield success rates of 0.499, matching the retrained model and confirming effective privacy restoration. We provide convergence guarantees showing that the Krylov approximation error decreases as O((κ^{1/2}-1)/(κ^{1/2}+1)), where κ is the Hessian condition number. The causal weighting mechanism ensures surgical updates, where only clients holding deleted data are modified, preserving model quality for unaffected participants and avoiding the instability of gradient-based approaches in asynchronous federated settings. This design provides interpretability, as each update is directly traceable to the influence of the deleted data. The method's efficiency and precision make it suitable for production federated systems where deletion requests arrive asynchronously and computational budgets are constrained.
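Influence-function unlearning needs a product of the form H^{-1} g, where H is the Hessian and g is a gradient on the data to forget. Conjugate gradient obtains it from Hessian-vector products alone, so H is never formed or inverted. The sketch below is a plain textbook CG solver in pure Python, not the HF-KCU implementation; `conjugate_gradient`, `toy_hvp`, and the 2x2 matrix are illustrative assumptions.

```python
def conjugate_gradient(hvp, g, num_iters=50, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products.

    hvp: function mapping a vector (list of floats) to H @ vector.
    g:   right-hand side, e.g. a gradient on the data to forget.
    H is assumed symmetric positive definite.
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    x = [0.0] * len(g)
    r = list(g)  # residual g - H x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for _ in range(num_iters):
        if rs_old <= tol:
            break
        hp = hvp(p)
        alpha = rs_old / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy symmetric positive-definite "Hessian" H = [[4, 1], [1, 3]] (made up).
def toy_hvp(v):
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


solution = conjugate_gradient(toy_hvp, [1.0, 2.0])  # exact answer is [1/11, 7/11]
```

Each iteration costs one Hessian-vector product, which is where an O(kd) cost for k iterations in d dimensions comes from when such a product costs O(d).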

2605.19674 2026-06-09 cs.AI 版本更新

Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

超越理性错觉:行为现实的战略分类

Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Yang Shi, Jinxuan Yang, Zhouchen Lin, Yuanlong Chen, Yuanxing Zhang, Shaowu Yang, Wenjing Yang, Haotian Wang

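Affiliations: University of Science and Technology of China

AI summary: This paper proposes a behaviorally realistic strategic classification framework grounded in prospect theory, addressing strategic manipulation by real-world agents whose decisions are shaped by psychological biases.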

详情
Comments
Accepted by ICML2026
AI中文摘要

战略分类(SC)研究了决策模型与策略性操纵特征以获得有利结果的代理之间的相互作用。现有SC框架通常依赖于理想化的假设,即代理是严格理性的。然而,行为经济学和心理学的证据一致表明,现实世界中的决策往往受到认知偏差的影响,偏离纯粹理性。为了正式化这一限制,我们识别并定义了一个新的问题设置,称为行为现实的战略分类问题,其中代理的策略性操纵由于心理偏差而偏离完全理性。受识别限制的启发,我们提出了前景引导的战略框架(Pro-SF)来解决这个问题,这是一个基于前景理论的原理框架,用于建模和学习在行为现实的战略响应下。具体来说,为了捕捉行为现实的战略操纵,我们的框架通过引入三种受前景理论启发的关键机制,重新表述了代理与决策者之间的Stackelberg式互动,包括收益与成本之间的不对称性、不同的主观参照点以及非理性的概率扭曲。在合成和现实世界数据集上的实验表明,Pro-SF是一种行为导向的战略分类方法,连接了机器学习和行为经济学,为现实世界中的更可靠部署提供了桥梁。

英文摘要

Strategic classification (SC) studies the interaction between decision models and agents who strategically manipulate their features for favorable outcomes. Existing SC frameworks typically rely on the idealized assumption that agents are strictly rational. However, evidence from behavioral economics and psychology consistently shows that real-world decision-making is often shaped by cognitive biases, deviating from pure rationality. To formalize this limitation, we identify and define a new problem setting, termed the behaviorally realistic strategic classification problem, where agents' strategic manipulations deviate from full rationality due to psychological biases. To address this problem, we propose the Prospect-Guided Strategic Framework (Pro-SF), a principled framework grounded in prospect theory for modeling and learning under behaviorally realistic strategic responses. Specifically, to capture behaviorally realistic strategic manipulations, our framework reformulates the Stackelberg-style interaction between agents and the decision-maker by incorporating three key mechanisms inspired by prospect theory: the asymmetry between benefits and costs, different subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets establish Pro-SF as a behaviorally grounded approach to strategic classification, bridging machine learning and behavioral economics for more reliable deployment in the real world.
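The three prospect-theory ingredients named in the abstract (gain/loss asymmetry, a reference point, and probability distortion) have well-known functional forms. The sketch below uses the Tversky-Kahneman (1992) value and weighting functions with their published median parameter estimates; whether Pro-SF uses these exact forms is not stated in the abstract, and `perceived_utility` is a hypothetical way to combine them for illustration only.

```python
def value(x, alpha=0.88, beta=0.88, loss_aversion=2.25):
    """Prospect-theory value of a gain or loss x measured from a reference point."""
    if x >= 0:
        return x ** alpha
    # Losses are scaled up by the loss-aversion coefficient.
    return -loss_aversion * ((-x) ** beta)


def weight(p, gamma=0.61):
    """Inverse-S probability weighting: overweights small p, underweights large p."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    num = p ** gamma
    return num / ((num + (1.0 - p) ** gamma) ** (1.0 / gamma))


def perceived_utility(gain, cost, success_prob, reference=0.0):
    """Hypothetical agent utility: distorted chance of a gain, minus a felt cost."""
    return weight(success_prob) * value(gain - reference) + value(-cost)
```

Under these parameters a loss of a given size weighs 2.25 times as much as an equal gain, so an agent may decline a manipulation that a strictly rational agent would accept.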

2605.19662 2026-06-09 cs.AI 版本更新

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

当表格基础模型遇见策略性表格数据:一种先验对齐方法

Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Jinxuan Yang, Kun Kuang, Yuanlong Chen, Mingyang Geng, Wanrong Huang, Shixuan Liu, Shaowu Yang, Wenjing Yang, Zhouchen Lin, Haotian Wang

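Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: This paper studies whether tabular foundation models generalize to strategic tabular data and proposes SPN, a strategy-aware prior alignment framework that improves robustness and predictive performance in strategic settings.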

详情
Comments
Accepted by ICML2026
AI中文摘要

基于预训练先验数据拟合网络(PFNs)的表格基础模型在多样化的表格任务上表现出强大的泛化能力,但通常设计用于非策略性设置,其中数据分布与部署分类器无关。然而,在许多现实世界决策场景中,个体可能在部署后有意识地修改特征以获得有利结果,导致部署后分布偏移。本文研究了PFN风格的表格基础模型是否能泛化到此类策略性表格数据。我们证明,策略性操纵导致了预训练期间学习的非策略性先验与操纵后的策略性先验之间的不匹配,从而产生系统性的预测偏差。为了解决这个问题,我们提出了策略性先验数据拟合网络(SPN),一种推理时策略感知的框架,能够在不重新训练的情况下将表格基础模型适应到策略性环境。SPN构建策略性上下文示例以近似操纵后的输入,并将PFN预测与诱导的策略性分布对齐。在现实世界和合成表格数据集上的实验表明,与表格基础模型和经典表格方法相比,SPN在策略性操纵下始终提高了鲁棒性和预测性能。

英文摘要

Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers. In many real-world decision scenarios, however, individuals may strategically modify their features after deployment to obtain favorable outcomes, inducing a post-deployment distribution shift. This paper studies whether PFN-style tabular foundation models can generalize to such strategic tabular data. We show that strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias. To address this issue, we propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining. SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution. Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.

2605.19266 2026-06-09 cs.CL cs.AI 版本更新

FormalASR: End-to-End Spoken Chinese to Formal Text

FormalASR: 语音中文到正式文本的端到端系统

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

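Affiliations: arXiv

AI summary: This paper proposes FormalASR, an end-to-end model that converts spoken Chinese into formal written text. By building large-scale spoken-to-formal datasets and fine-tuning Qwen3-ASR, it achieves up to 37.4% relative CER reduction over verbatim baselines while improving ROUGE-L and BERTScore, offering a lightweight on-device solution.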

详情
AI中文摘要

自动语音识别(ASR)系统通常优化于逐字转录,这保留了不连贯、填充词和非正式口语结构,这些结构往往不适合下游写作应用。常见的解决方法是ASR+LLM的两阶段流程用于后期编辑,但这种设计增加了延迟和内存成本,并且难以在设备上部署。我们提出了FormalASR,两个紧凑的端到端模型(0.6B和1.7B),可直接将中文语音转录为正式书面文本。为了实现这一目标,我们构建了WenetSpeech-Formal和Speechio-Formal两个大规模的语音到正式文本数据集,通过基于LLM的重写和质量过滤构建。然后我们使用监督微调对Qwen3-ASR进行两个规模(0.6B和1.7B)的微调。在WenetSpeech-Formal和Speechio-Formal上的实验表明,FormalASR在比原声基线减少37.4%的CER的同时,也提高了ROUGE-L和BERTScore。FormalASR在部署时不需要后处理LLM,提供了一个轻量级的设备端解决方案用于语音到正式转录。

英文摘要

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
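The headline metric here is a relative reduction in character error rate (CER). As a reminder of how such numbers are derived, the sketch below computes CER from the character-level edit distance and then the relative reduction between two systems. The helper names and example strings are illustrative assumptions; the paper's evaluation scripts may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, new_cer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_cer - new_cer) / baseline_cer
```

For example, moving from a CER of 0.20 to 0.10 is a 50% relative reduction, even though the absolute drop is only 10 percentage points.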

2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT 版本更新

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

通过分步置信度归因诊断黑盒大语言模型的多步推理失败

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

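Affiliations: University of Science and Technology of China

AI summary: This paper proposes Stepwise Confidence Attribution (SCA) for diagnosing multi-step reasoning failures in black-box LLMs. It scores the confidence of each step in a generated reasoning trace using the Information Bottleneck principle, and experiments on mathematical reasoning and multi-hop question answering support its effectiveness.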

详情
Comments
Accepted by ICML 2026
AI中文摘要

大型语言模型通过生成分步解决方案在具有客观答案的推理任务中实现了强大的性能,但诊断多步推理轨迹可能失败的位置仍然困难。置信度估计提供了一种诊断信号,但现有方法受限于最终答案或需要内部模型访问。在本文中,我们引入了分步置信度归因(SCA),一种适用于封闭源LLM的框架,该框架仅基于生成的推理轨迹分配步骤级置信度。SCA应用信息瓶颈原理:与正确解决方案中的一致结构对齐的步骤获得高置信度,而偏差则被标记为可能错误。我们提出了两种互补的方法:(1)NIBS,一种非参数化的IB方法,用于测量一致性而无需图结构,以及(2)GIBS,一种基于图的IB模型,通过可微分掩码学习子图以捕捉逻辑变化。在数学推理和多跳问答任务上的大量实验表明,SCA能够可靠地识别与推理错误高度相关的低置信度步骤。此外,使用步骤级置信度指导自我修正,比使用答案级反馈提高了13.5%的修正成功率。

英文摘要

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

2605.18643 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

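Affiliations: Frontis.AI; Kuaishou Technology; Shanghai AI Lab; TsinghuaC3I/ZEDA

AI summary: This paper proposes ZEDA, a framework that uses self-distillation to convert post-trained static MoE models into efficient dynamic ones, substantially reducing expert FLOPs and speeding up inference.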

详情
AI中文摘要

混合专家(MoE)通过稀疏专家激活高效地扩展语言模型,其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应,使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本,通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应(ZEDA),一种低成本框架,将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换,ZEDA在每个MoE层中注入无参数的零输出专家,并通过两阶段自蒸馏适应增强模型,利用原始MoE作为冻结的教师,并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试(涵盖数学、代码和指令跟随)中,ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上,ZEDA比最强的动态MoE基线分别高出6.1和4.0个点,并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers about a 1.20x end-to-end inference speedup.
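One way to picture how zero-output experts save compute: the router still picks its top-k slots, but any slot won by a zero-output expert contributes nothing and needs no forward pass. The sketch below shows only this routing bookkeeping under that reading. The abstract does not specify ZEDA's routing details, so the index convention (zero experts appended after the real ones) and the name `route_token` are assumptions.

```python
def route_token(router_logits, num_real_experts, top_k):
    """Pick top_k experts for one token.

    Indices >= num_real_experts denote zero-output experts (assumed layout).
    Returns the real experts that must run and the number of skipped slots.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Zero-output experts return a zero vector, so only real experts are executed.
    real = [i for i in chosen if i < num_real_experts]
    skipped = top_k - len(real)
    return real, skipped


# Three real experts plus one zero-output expert (index 3), top-2 routing.
real_experts, skipped_slots = route_token([2.0, 0.1, 0.3, 1.5], num_real_experts=3, top_k=2)
```

In this toy call the token runs one real expert instead of two, which is the kind of per-token saving that adds up to the reported reduction in expert FLOPs.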

2605.06317 2026-06-09 cs.CV cs.AI 版本更新

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

NavOne: 一种基于顶部向下地图的视觉语言导航的一步全局规划

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

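Affiliations: South China University of Technology

AI summary: This paper proposes NavOne, a framework for vision-language navigation on top-down maps that performs one-step global path planning over multi-modal maps, markedly improving navigation efficiency and performance.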

详情
Comments
10 pages, 7 figures
AI中文摘要

现有的视觉语言导航(VLN)方法通常采用以自身为中心的逐步导航范式,这导致误差累积并限制了效率。尽管最近的方法试图利用预建的环境地图,但它们通常依赖于逐步更新记忆图或评分离散路径提案,这限制了连续的空间推理并创建了离散瓶颈。我们提出了顶部向下VLN(TD-VLN),将导航重新表述为在预建的顶部向下地图上的一步全局路径规划问题,支持我们新构建的R2R-TopDown数据集。为了解决这个问题,我们引入了NavOne,一个统一的框架,它在单次端到端前向传递中直接预测多模态地图上的密集路径概率。NavOne具有顶部向下地图融合器,用于联合多模态地图表示,并扩展了空间感知的深度混合。在R2R-TopDown上的广泛实验表明,NavOne在基于地图的VLN方法中实现了最先进的性能,其规划阶段的速度提升比现有基于地图的基线方法快8倍,比以自身为中心的方法快80倍,从而实现了高效全局导航。

英文摘要

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

2605.17609 2026-06-09 cs.LG 版本更新

Adaptive Generate-Rank-Verify: Inference-Time Search with Costly Verification

自适应生成-排序-验证:具有高成本验证的推理时间搜索

Shaddin Dughmi, Mahdi Haghifam, Yusuf Hakan Kalayci

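Affiliations: University of Southern California; Northwestern University; University of Chicago; Toyota Technological Institute at Chicago; Simons Institute for the Theory of Computing; Data Science Institute at the University of Chicago

AI summary: This paper proposes an adaptive generate-rank-verify method that adaptively samples and verifies candidate answers drawn from an unknown distribution, finding a positive example with a guaranteed bound on expected cost. The method is supported by theoretical analysis and by experiments on mathematical reasoning and competitive programming.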

详情
Comments
33 Pages, 6 Figures, 4 Tables. Changes compared to V1: updated the related work section
AI中文摘要

许多推理时间语言模型管道结合了低成本奖励信号和高成本验证器,例如数学推理中的精确答案检查或代码生成中的隐藏测试执行。我们通过学习理论的视角将这一设置形式化为生成性主动搜索:一个成本敏感的首次正例搜索问题,在其中策略会自适应地从未知分布中采样候选者,观察低成本评分,并支付验证器标签的费用,直到找到正例。对于固定的提示,生成器和奖励模型诱导出两个未知对象:奖励分数上的分布和条件于评分的成功函数。当这些量已知时,我们使用动态规划方法来表征分布感知的最优策略。在现实和实用的设置中,当评分分布和成功函数都未知时,我们提出ADAP算法,一种分层自适应的生成-排序-验证算法,逐步增加采样的响应数量和顶部验证的数量。在单调性假设下,即更高的奖励分数不太可能通过验证,我们证明ADAP在期望成本上接近分布感知的最优。我们通过基于中心星数的学习理论下界补充这一结果,表明对评分-标签关系的结构假设是必要的。在数学推理和竞争编程上的实验验证了在固定非自适应策略和难度自适应基线上的预测优势。

英文摘要

Many inference-time language-model pipelines combine a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation. We formalize this setting using a learning-theoretic lens as generative active search: a cost-sensitive first-positive search problem in which a policy adaptively samples candidates from an unknown distribution, observes cheap scores, and pays for verifier labels until it finds a positive example. For a fixed prompt, the generator and reward model induce two unknown objects: a distribution over reward scores and a score-conditioned success function. When these quantities are known, we characterize the distribution-aware optimal policy using a dynamic programming approach. In the realistic and practical setting where both the score distribution and success function are unknown, we propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and top-ranked verifications. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, we show that ADAP achieves expected cost within a constant factor of the distribution-aware optimum. We complement this result with learning-theoretic lower bounds, based on a centered star number, showing that structural assumptions on the score--label relationship are necessary. Experiments on mathematical reasoning and competitive programming validate the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines.
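The generate-rank-verify loop can be illustrated with a toy schedule that grows both the number of sampled candidates and the number of top-ranked verifications each round. The doubling schedule, the starting sizes, and the function name below are assumptions made for illustration; the abstract does not give ADAP's actual shell sizes or cost accounting.

```python
def generate_rank_verify(sample, score, verify, max_rounds=6):
    """Toy doubling-schedule search (not the paper's ADAP schedule).

    sample(): draws one candidate; score(c): cheap reward signal;
    verify(c): expensive check. Returns (answer, samples_drawn, verifier_calls).
    """
    n_samples, n_verify = 2, 1
    drawn = calls = 0
    for _ in range(max_rounds):
        candidates = [sample() for _ in range(n_samples)]
        drawn += n_samples
        # Rank by the cheap score and pay the verifier only for the top few.
        for cand in sorted(candidates, key=score, reverse=True)[:n_verify]:
            calls += 1
            if verify(cand):
                return cand, drawn, calls
        n_samples, n_verify = 2 * n_samples, 2 * n_verify
    return None, drawn, calls
```

Ranking before verifying only helps when higher scores are at least as likely to pass, which is the monotonicity assumption the abstract relies on.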

2605.02439 2026-06-09 cs.CV cs.LG 版本更新

Anomaly-Preference Image Generation

异常偏好图像生成

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

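Affiliations: arXiv.org

AI summary: This paper proposes a new anomaly generation method that uses an implicit preference alignment mechanism and a Time-Aware Capacity Allocation module to improve the realism and diversity of generated images; experiments show it outperforms existing methods on both.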

详情
Comments
Accepted by ICML 2026
AI中文摘要

从有限数据中合成逼真且多样的异常样本对于鲁棒模型泛化至关重要。然而,现有方法难以平衡保真度和多样性,通常受分布不匹配和过拟合的阻碍。为缓解这一问题,我们引入了异常偏好优化,一种将异常生成重新表述为偏好学习问题的新范式。我们的方法核心是隐式偏好对齐机制,利用真实异常作为正例参考,直接从去噪轨迹偏差中推导优化信号,而无需昂贵的人工标注。此外,我们提出了一个时间感知能力分配模块,动态地沿扩散时间线分配模型能力,在高噪声阶段优先考虑结构多样性,在低噪声阶段增强细粒度保真度。在推理过程中,分层采样策略调节保真度与对齐的权衡,实现对生成过程的精确控制。大量实验表明,该方法显著优于现有基线,实现了真实性和多样性方面的最先进性能。

英文摘要

Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

2605.17301 2026-06-09 cs.CL cs.AI 版本更新

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG: 检测和解决检索增强生成中的知识冲突

Chenyu Wang, Yueyuan Li, Yingmin Liu, Yang Shu

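Affiliations: Zhejiang University

AI summary: This study proposes ConflictRAG, a framework that detects and resolves knowledge conflicts in retrieval-augmented generation through a two-stage conflict detection module, an Entropy-TOPSIS framework, and a Conflict-Aware RAG Score; experiments show it outperforms existing methods in conflict-detection F1 and correctness.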

详情
Comments
6 pages, 6 figures, submitted to IEEE SMC 2026
AI中文摘要

检索增强生成(RAG)系统隐式假设检索文档之间相互一致——这一假设在实践中经常失效。我们提出了ConflictRAG,一种具有冲突意识的RAG框架,能够在生成答案之前检测、分类和解决知识冲突。该框架引入了三个贡献:(1)一个两阶段冲突检测模块,结合轻量级嵌入基于MLP分类器和选择性LLM细化,使API成本降低62%,同时保持90.8%的检测准确率;(2)一个熵-TOPSIS框架用于数据驱动的来源可信度评估,比手动启发式方法提高7.1%的选取准确率;(3)一个冲突感知RAG评分(CARS)用于诊断冲突处理能力。在三个基准测试中对六个基线的实验表明,冲突检测F1达到88.7%,并且在最强的冲突感知基线中,正确性提高了5.3-6.1%。该流程能够有效跨基础LLM转移。

英文摘要

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents -- an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3--6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
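Entropy-weighted TOPSIS is a standard multi-criteria ranking recipe: the entropy weight method derives criterion weights from the data, and TOPSIS scores each alternative by its closeness to an ideal point. The sketch below is the textbook version in pure Python, not ConflictRAG's implementation. The credibility criteria in the example (recency, authority, corroboration) and all numbers are made up, and the code assumes strictly positive, benefit-type criteria with at least one criterion that varies across sources.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method: criteria that vary more across sources get more weight."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    diversities = []
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        total = sum(col)
        shares = [v / total for v in col]
        entropy = -sum(s * math.log(s) for s in shares if s > 0) / math.log(n_rows)
        diversities.append(1.0 - entropy)
    norm = sum(diversities)
    return [d / norm for d in diversities]


def topsis_scores(matrix, weights):
    """TOPSIS closeness in [0, 1]; higher means closer to the ideal source."""
    n_cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n_cols)]
    worst = [min(row[j] for row in weighted) for j in range(n_cols)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Rows are sources; columns are hypothetical criteria (recency, authority, corroboration).
sources = [[0.9, 0.8, 0.7], [0.4, 0.5, 0.2], [0.6, 0.3, 0.5]]
weights = entropy_weights(sources)
scores = topsis_scores(sources, weights)
```

Because the weights come from the data rather than from hand-set heuristics, a criterion on which all sources look alike contributes little to the ranking.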

2605.17289 2026-06-09 cs.LG cs.AI 版本更新

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

LEAP:可学习的端到端无结构剪枝大型语言模型

Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi

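Affiliations: University of Maryland

AI summary: This paper proposes LEAP, a learnable end-to-end unstructured pruning method that replaces the conventional parameterization with a Bernoulli-via-Gumbel-sigmoid relaxation, improving the end-to-end accuracy of unstructured pruning; experiments show higher average zero-shot accuracy across several LLM families.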

详情
Comments
Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM)
AI中文摘要

无结构稀疏性现在通过最近的GPU内核和数据流硬件原生加速,瓶颈从推理执行转移到了剪枝算法。最先进的无结构LLM剪枝方法是基于最优大脑外科手术原理的分层代理,牺牲了端到端准确性,尤其是在高稀疏度下。端到端替代方案如MaskLLM和PATCH表明可学习掩码可以缩小这一差距,但它们的类别-模式参数化随有效掩码数量按行数增长,并不适用于无结构设置。我们引入LEAP,用每权重伯努利-戈姆贝茨松弛替代这种不可行参数化,使端到端无结构掩码学习变得可行。在五个从0.5B到8B参数的LLM家族上,在50%和60%稀疏度下,LEAP在六个任务的零样本准确率上平均比ADMM提升+2.59点,ADMM是我们在扫掠中的最佳分层基线。

英文摘要

Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
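A per-weight Bernoulli mask can be relaxed with the Gumbel-sigmoid trick: add logistic noise to a learnable logit and squash through a tempered sigmoid, which yields a soft mask value that gradients can flow through. The sketch below shows only the sampling step in pure Python. It is a generic illustration rather than LEAP's training code, and it omits the sparsity constraint and the optimizer; the function names and defaults are assumptions.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Relaxed Bernoulli sample in [0, 1] with keep-probability sigmoid(logit)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    # The difference of two Gumbel(0, 1) samples is Logistic(0, 1) noise.
    noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((logit + noise) / temperature)


def hard_mask(logits):
    """Deterministic mask after training: keep weights whose logit is positive."""
    return [1 if l > 0 else 0 for l in logits]
```

Lower temperatures push the soft samples toward 0 or 1, so the relaxed mask approaches a true binary keep/prune decision as training anneals.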

2605.16928 2026-06-09 cs.CL cs.AI 版本更新

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

全注意力再临:在数百次训练步骤内将全注意力转化为稀疏

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma

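Affiliations: Nanjing University; Alibaba Group

AI summary: This paper proposes RTPurbo, which exploits a model's intrinsic sparsity to obtain efficient sparse attention within a few hundred training steps, substantially improving inference efficiency while keeping accuracy near lossless.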

详情
Comments
20 pages, 9 figures
AI中文摘要

大型语言模型的长上下文推理受到全注意力二次成本的限制。现有的高效替代方法通常依赖于原生稀疏训练或启发式令牌驱逐,导致效率、训练成本和准确性之间存在不理想的权衡。在本文中,我们证明全注意力LLM本质上已经是稀疏的,并且可以通过最小的适应转化为高度稀疏的模型。我们的方法基于三个观察:(1) 只有少量的注意力头真正需要完整的长上下文处理;(2) 长距离检索主要由低维子空间支配,允许相关令牌通过16维索引器高效检索;(3) 有用的令牌预算强烈依赖于查询,使得动态top-p选择比固定top-k稀疏化更合适。基于这些见解,我们提出了RTPurbo,该方法仅保留检索头的完整KV缓存,并引入轻量级令牌索引器进行稀疏注意力。通过利用模型的内在稀疏性,RTPurbo仅在数百次训练步骤内即可实现稀疏化。在长上下文基准和推理任务上的实验表明,RTPurbo在保持接近无损精度的同时,实现了显著的效率提升,包括在100万上下文下的预填充速度提升高达9.36倍,以及解码速度提升约2.01倍。这些结果表明,可以通过标准的全注意力训练获得强大的稀疏推理,而无需昂贵的原生稀疏预训练。

英文摘要

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
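The contrast between fixed top-k and dynamic top-p token selection is easy to see in code: top-p keeps the smallest set of tokens whose normalized score mass reaches a threshold, so the budget adapts to how peaked the scores are. The sketch below is a generic top-p selector over indexer scores, assuming a softmax normalization; it is not RTPurbo's kernel, and `top_p_select` and its default threshold are illustrative.

```python
import math


def top_p_select(scores, p=0.9):
    """Return indices of the smallest token set whose softmax mass reaches p."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract the max for stability
    total = sum(exps)
    order = sorted(range(len(scores)), key=lambda i: exps[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += exps[i] / total
        if mass >= p:
            break
    return kept
```

A query with one dominant token keeps a single index, while a query with flat scores keeps nearly everything, which a fixed k cannot do for both cases at once.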

2605.16823 2026-06-09 cs.LG 版本更新

VQ-Atom: Semantic Discretization of Local Atomic Environments for Molecular Representation Learning

原子作为语言:VQ-Atom:用于分子表示学习的语义离散化

Takayuki Kimura

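Affiliations: Atoms as Language, LLC

AI summary: This paper proposes VQ-Atom, a semantic discretization framework for molecular representation learning that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments.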

详情
AI中文摘要

分子表示学习已成为AI驱动药物发现中的核心方法,但现有分子分词如SMILES仍主要是语法性的,无法自然对齐具有化学意义的子结构。在本文中,我们介绍了VQ-Atom,一种语义离散化框架,将连续的原子级图表示转换为对应局部化学环境的离散标记。利用图神经网络嵌入和向量量化,原子被分配到代表化学有意义的原子上下文的代码本条目中。这些离散标记定义了一种适合基于Transformer的预训练的分子语言。我们评估了VQ-Atom在蛋白质-配体相互作用预测中的表现,采用蛋白质冷分割设置且不依赖3D结构信息。实验结果表明,与传统分词方法相比,VQ-Atom在预测性能上始终有所提升,表明语义基础的离散化可以显著增强分子表示学习。我们的发现表明,分词设计本身在使化学领域有效语言建模中起着关键作用。

英文摘要

Large language models succeed by combining large-scale pretraining with meaningful discrete tokens. In molecular machine learning, SMILES is widely used as a token representation, but it is primarily a linearization format for molecular graphs rather than a semantic decomposition of chemistry. We propose VQ-Atom, a semantic tokenization framework that assigns discrete atom-level tokens based on local chemical environments via vector quantization. Unlike SMILES tokens, VQ-Atom tokens encode graph-local chemical context and are aligned with molecular structure. On protein-cold drug--target interaction prediction using the KIBA dataset, VQ-Atom substantially improves global ranking performance, achieving AUROC of 0.79 while substantially outperforming both SMILES-based and continuous molecular representations under an identical downstream architecture. Furthermore, VQ-Atom enables approximately 3 times faster downstream training than continuous atom-level representations by replacing per-atom continuous features with reusable discrete tokens. These results suggest that molecular tokenization is not merely a preprocessing step, but a central design choice. In particular, well-structured tokens can encode substantial chemical semantics, reducing the burden on downstream learning. VQ-Atom can be interpreted as defining a molecular language, where tokens correspond to chemically meaningful atomic environments, suggesting that token design may constitute an additional axis of machine learning research alongside architecture, objectives, and optimization.
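At inference time, vector quantization reduces to a nearest-neighbor lookup: each atom's continuous embedding is replaced by the index of its closest codebook vector, and those indices serve as tokens. The sketch below shows that lookup in pure Python with a made-up two-dimensional codebook. In a real pipeline the embeddings would come from a graph neural network and the codebook would be learned; both are assumed here.

```python
def quantize(embedding, codebook):
    """Return the index of the codebook vector closest to the embedding (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(embedding, codebook[k]))


def tokenize_molecule(atom_embeddings, codebook):
    """Map each atom's continuous embedding to a discrete token id."""
    return [quantize(e, codebook) for e in atom_embeddings]


# Hypothetical 3-entry codebook and three atom embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
tokens = tokenize_molecule([[0.1, -0.1], [0.9, 1.2], [4.0, 6.0]], codebook)
```

Atoms in similar local environments land on the same code, so the resulting token sequence can be reused across molecules instead of recomputing continuous per-atom features.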

2605.16309 2026-06-09 cs.AI cs.LG cs.MA 版本更新

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ANNEAL:通过受控符号补丁学习适应大语言模型代理

Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University at Buffalo(布法罗大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Colorado Colorado Springs(科罗拉多大学科罗拉多斯普林斯分校)

AI总结 ANNEAL通过受控符号补丁学习适应大语言模型代理,解决重复故障问题,其核心机制FDKA能定位责任操作符并生成类型补丁,实现持久结构修复,优于现有方法。

Comments
Code Implementation: https://github.com/sbhakim/anneal-agents
AI中文摘要

基于大语言模型的代理可以恢复个体执行错误,但在底层过程知识未修复时,同一故障会反复失败。现有自我进化方法通过更新提示、记忆或模型权重来解决这一差距,但未直接修复编码任务执行的符号结构,且缺乏安全部署所需的治理保证。我们引入ANNEAL,一种神经符号代理,将重复失败转化为受控符号编辑过程知识图谱,而无需修改基础模型权重。其核心机制,故障驱动知识获取(FDKA),定位责任操作符,通过约束LLM生成合成类型补丁,并通过多维评分、符号护栏和金丝雀测试验证提案,再提交。每条接受的编辑都携带完整溯源和确定性回滚能力。在四个领域和27个多种子运行中,ANNEAL是唯一在测试重复故障设置中将失败率降至0%的评估系统。消融实验表明,移除FDKA会消除所有结构修复并使成功率下降最高26.7个百分点。这些结果表明,受控符号修复为持续故障消除提供了与权重级和提示级适应互补的范式。

英文摘要

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge (operator schemas, preconditions, and constraints) remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs: strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.
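
The validate-before-commit pattern with provenance and rollback described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not ANNEAL's code (see the linked repository for that): the class `GovernedStore`, its methods, the operator names, and the canary checks are all hypothetical, and the patch synthesis by a constrained LLM is replaced by a hand-written precondition.

```python
import copy


class GovernedStore:
    """Toy store of operator preconditions with validated, reversible edits."""

    def __init__(self, operators):
        self.operators = operators
        self.history = []  # (state before the edit, provenance note)

    def commit(self, operator, precondition, canary, provenance):
        """Apply the patch to a copy; keep it only if the canary check passes."""
        candidate = copy.deepcopy(self.operators)
        candidate[operator] = candidate[operator] + [precondition]
        if not canary(candidate):
            return False  # rejected: the live store is left untouched
        self.history.append((self.operators, provenance))
        self.operators = candidate
        return True

    def rollback(self):
        """Restore the state before the most recent accepted edit."""
        self.operators, provenance = self.history.pop()
        return provenance


# Hypothetical operator and patches, for illustration only.
store = GovernedStore({"deploy": ["build_passed"]})
accepted = store.commit("deploy", "tests_passed",
                        canary=lambda ops: "build_passed" in ops["deploy"],
                        provenance="fault#1")
rejected = store.commit("deploy", "",
                        canary=lambda ops: all(ops["deploy"]),
                        provenance="fault#2")
print(accepted, rejected, store.operators)
```

Because every accepted edit stores the prior state alongside its provenance note, calling `store.rollback()` restores the earlier preconditions deterministically and reports which edit was undone.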