arXivDaily arXiv每日学术速递 周一至周五更新
2605.28207 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Pruning and Distilling Mixture-of-Experts into Dense Language Models

将混合专家模型剪枝和蒸馏为密集语言模型

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

发表机构 * KRAFTON KAIST(韩国科学技术院)

AI总结 提出首个将混合专家(MoE)模型转换为标准密集架构的系统框架,通过专家评分、选择、分组、拼接和知识蒸馏,在参数匹配条件下比密集到密集剪枝平均下游准确率提升6.3个百分点,训练速度提升1.6倍。

AI中文摘要

混合专家(MoE)现在是前沿语言模型的主导架构,但它需要将所有专家参数加载到内存中,因此在内存受限的部署中不太受欢迎。现有的压缩方法减少了专家数量,但输出仍然是具有相同基本限制的MoE模型。我们提出了第一个将训练好的MoE转换为标准全密集架构的系统框架:专家被评分、选择和分组,然后拼接成密集的前馈网络(FFN),并通过MoE教师的知识蒸馏进行精炼。我们在Qwen3-30B-A3B上评估了7种评分方法、5种分组方法和2种幅度缩放方法,涵盖了多种选定的专家数量,共产生350种配置。我们发现评分方法的选择影响最大,我们提出的新颖的多样性感知评分在Qwen3-30B-A3B、DeepSeek-V2-Lite和GPT-OSS-20B上始终优于先前的方法。在参数匹配的受控比较下,经过约4B token的蒸馏,MoE到密集的转换在平均下游准确率上比密集到密集的剪枝高出6.3个百分点,训练壁钟速度提升1.6倍。

英文摘要

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
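The concatenation step can be illustrated with a minimal sketch. It assumes SwiGLU-style experts with hypothetical weight names `gate`, `up`, and `down`, and it ignores routing weights, magnitude scaling, and distillation, so it is not the authors' pipeline. It only shows that stacking the gate/up rows and joining the down-projection columns of the selected experts yields a single wider dense FFN whose output equals the sum of those experts' outputs.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def expert_ffn(x, W_gate, W_up, W_down):
    # SwiGLU-style FFN: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    hidden = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, hidden)

def concat_experts(experts):
    # Stack gate/up rows and join down-projection columns across experts.
    W_gate = [row for e in experts for row in e["gate"]]
    W_up = [row for e in experts for row in e["up"]]
    d_model = len(experts[0]["down"])
    W_down = [[v for e in experts for v in e["down"][i]] for i in range(d_model)]
    return W_gate, W_up, W_down

# Two toy experts (illustrative values) with d_model = 2 and d_ff = 2.
EXPERTS = [
    {"gate": [[0.1, 0.2], [0.3, -0.4]], "up": [[0.5, 0.1], [-0.2, 0.7]],
     "down": [[0.3, 0.6], [-0.1, 0.2]]},
    {"gate": [[-0.3, 0.8], [0.2, 0.2]], "up": [[0.4, -0.5], [0.9, 0.1]],
     "down": [[0.7, -0.2], [0.5, 0.4]]},
]
```

Because the dense hidden layer is just the experts' hidden layers laid side by side, the dense width grows linearly with the number of selected experts, which is why the scoring and selection steps matter.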

2605.27786 2026-06-09 cs.LG cs.AI 版本更新

Locality-Aware Redundancy Pruning for LLM Depth Compression

面向LLM深度压缩的局部感知冗余剪枝

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo, Minkyu Kim, Sunwoo Lee

发表机构 * University of Southern California(美国南加州大学) Neural Superintelligence Lab, MODULABS(MODULABS神经超级智能实验室) Seoul National University(首尔国立大学) Inha University(仁荷大学)

AI总结 提出LoRP,一种基于表示局部性的无训练单次深度剪枝框架,通过引入表示局部性分数(RLS)来识别和剪除冗余层,在多种LLM上提升了困惑度和下游任务准确率。

AI中文摘要

大型语言模型在跨网络深度上已知存在表示冗余,这使得深度剪枝成为提高推理效率的有效方法。现有的单次剪枝方法依赖于局部层重要性或跨架构的固定冗余假设。我们提出了局部感知冗余剪枝(LoRP),一种由表示局部性引导的无训练单次深度剪枝框架。我们表明,层间冗余可以是局部化的或全局分布的,具体取决于LLM架构。为了表征这一现象,我们引入了表示局部性分数(RLS),该分数源自全局层间隐藏状态相似性。使用小的校准集,LoRP计算成对层相似性,按表示相似性对层进行聚类,并根据残差簇内冗余分配剪枝。跨多种LLM家族的实验表明,在困惑度和下游任务准确性上均有提升。

英文摘要

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy. Official GitHub repository: https://github.com/daniel-eai/LoRP-Locality-Aware-Redundancy-Pruning/
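The pairwise layer-similarity step can be sketched as follows. This is an illustration only: it assumes mean per-token cosine similarity as the similarity measure and a hypothetical `hidden_states[layer][token]` layout, neither of which is confirmed by the abstract, and it omits the RLS computation, clustering, and pruning allocation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layer_similarity(hidden_states):
    # hidden_states[l][n] is the hidden vector of calibration token n at layer l.
    # Returns an L x L matrix of mean per-token cosine similarities.
    n_layers = len(hidden_states)
    sim = [[0.0] * n_layers for _ in range(n_layers)]
    for i in range(n_layers):
        for j in range(n_layers):
            vals = [cosine(u, v) for u, v in zip(hidden_states[i], hidden_states[j])]
            sim[i][j] = sum(vals) / len(vals)
    return sim
```

A block of adjacent layers with similarity near 1 would suggest localized redundancy, whereas high similarity spread across distant layers would suggest the globally distributed case the abstract describes.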

2605.26872 2026-06-09 cs.LG cs.AI cs.CL 版本更新

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

最强的教师并不总是最好的教师:以学生为中心的答案选择

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

发表机构 * University of Washington(华盛顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Southern California(南加州大学) Independent Researcher(独立研究者) National University of Singapore(新加坡国立大学) Microsoft(微软) Google(谷歌) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Northwestern University(西北大学) Allen Institute for AI (AI2)(艾伦人工智能研究所(AI2))

AI总结 提出以学生为中心的答案采样(SCAS)框架,通过估计学生中心的学习成本选择教师生成的答案,从而提升学生模型性能。

AI中文摘要

LLM训练越来越依赖教师生成的监督,包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据,隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败:即使多个教师对同一问题提供正确答案,最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题,我们提出以学生为中心的答案采样(SCAS),该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词梯度分解的启发,我们推导出该成本的高效前向代理,并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和6个任务上的实验表明,SCAS持续提升学生性能,表明有效的蒸馏应优先考虑与当前学生匹配的监督,而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2605.26078 2026-06-09 cs.LG 版本更新

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Wasserstein策略梯度在熵正则化强化学习中的全局收敛性

Zhaoyu Zhu, Rui Gao, Shuang Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 本文通过利用熵正则化强化学习的Bellman结构,证明了Wasserstein策略梯度(WPG)方法的全局收敛性,并建立了分布Polyak-Łojasiewicz条件。

AI中文摘要

Wasserstein策略梯度(WPG)是一种利用动作分布的最优传输几何的强化学习(RL)策略优化方法。对于熵正则化RL目标,WPG通过将每个状态条件策略沿软Q函数的动作梯度以及Langevin型扩散进行传输来演化。尽管它在连续控制问题中具有吸引力,但其全局收敛性质仍不清楚。标准的Langevin分析并不直接适用,因为RL目标通过Bellman递归而非静态凸泛函依赖于策略,且Langevin漂移由软Q函数决定,其正则性必须在策略迭代过程中加以控制。在本文中,我们通过利用熵正则化RL的Bellman结构,发展了WPG的全局收敛理论。我们表明,通常由凸性扮演的角色可以被基于Bellman的论证所取代:软Bellman残差相对于Gibbs策略具有状态级KL表示;Bellman压缩将此残差与全局最优性差距联系起来;而Bellman预解恒等式将价值改进与相对Fisher信息联系起来。结合演化Gibbs族的均匀对数Sobolev不等式(LSI),这些要素产生了分布Polyak-Łojasiewicz条件。我们进一步建立了控制离散化误差所需的正则性和一致界,从而获得直到离散化偏差的几何收缩。概念上,我们的分析表明,尽管熵正则化RL在通常的平坦意义上不是凸的,但Bellman递归诱导了一种有利的Polyak-Łojasiewicz型(PL)几何,支持WPG的全局收敛。

英文摘要

Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable Polyak-Łojasiewicz-type (PL) geometry that supports global convergence of WPG.
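As a rough illustration of the update described above, the following writes the Gibbs policy and a discretized Langevin-type step in standard notation. The symbols are assumptions introduced here, not taken from the paper: $\tau$ is the entropy temperature, $\eta$ the step size, and $Q$ the soft Q-function held fixed during the step.

```latex
\begin{aligned}
\pi_Q(a \mid s) &\propto \exp\!\big(Q(s,a)/\tau\big), \\
a_{k+1} &= a_k + \eta\,\nabla_a Q(s, a_k) + \sqrt{2\eta\tau}\,\xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I).
\end{aligned}
```

For a fixed $Q$, the continuous-time limit of this step has $\pi_Q(\cdot \mid s)$ as its stationary distribution. The difficulty the abstract points to is that $Q$ itself changes with the policy through the Bellman recursion, so this static argument alone does not give convergence.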

2605.26452 2026-06-09 cs.RO cs.LG cs.SY eess.SY 版本更新

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

鲁棒Koopman控制屏障滤波器用于安全演员-评论家强化学习

Dhruv S. Kushwaha, Zoleikha A. Biron

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出鲁棒Koopman-CBF SAC框架,通过数据驱动学习Koopman预测器、构建提升空间中的仿射CBF约束并利用二次规划安全层实施,同时通过投影残差裕度处理近似误差,实现零约束违反或减少违规。

Comments 17 pages, 7 figures

AI中文摘要

机器人系统的安全强化学习需要策略在训练和部署期间满足状态和输入约束的同时提高任务性能。控制屏障函数通过最小侵入性安全滤波器提供强制执行前向不变性的原则性机制,但其在无模型强化学习中的应用受限于对精确动力学和手工设计屏障证书的需求。我们提出鲁棒Koopman-CBF SAC,一种安全滤波的演员-评论家框架,从数据中学习有限维Koopman预测器,在提升空间中构建仿射CBF约束,并通过二次规划安全层强制执行。为考虑有限维Koopman近似误差,使用从留出轨迹数据估计的投影残差裕度收紧CBF条件。评论家在实际执行的安全动作上训练,而演员则被正则化向Koopman-CBF可行集,减少训练中对滤波器的依赖。在安全控制基准测试中,该方法在CartPole稳定和跟踪上实现零约束违反,同时匹配或超过无约束SAC的回报。在高维Safety Gymnasium运动任务中,该方法在某些设置下减少了违规,但也暴露了一阶速度屏障和线性EDMD模型的重要局限性,推动了高阶和多步Koopman-CBF扩展。这些结果表明,鲁棒Koopman-CBF滤波器是无模型强化学习和可证明安全之间的有前途桥梁,同时阐明了此类滤波器保持有效的结构条件。所有代码可在GitHub仓库获取:https://github.com/DhruvKushwaha/Koopman-CBF-Soft-Actor-Critic

英文摘要

Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor-critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
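To make the quadratic-program safety layer concrete, here is a minimal sketch for the special case of a single affine constraint, where the QP has a closed-form solution. The function name, the constraint form `a . u >= b + margin`, and the way the margin enters are assumptions for illustration; the paper's actual constraint is built from a learned Koopman predictor and may involve several constraints and a numerical QP solver.

```python
def safety_filter(u_nom, a, b, margin=0.0):
    """Solve min ||u - u_nom||^2 subject to a . u >= b + margin.

    With one affine constraint the minimizer is the Euclidean projection of
    u_nom onto the half-space; a positive margin tightens the constraint.
    """
    rhs = b + margin
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - rhs
    if slack >= 0.0:
        # The nominal action already satisfies the constraint: no intervention.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("infeasible: a is zero but b + margin is positive")
    step = -slack / norm_sq
    # Move along the constraint normal just far enough to reach the boundary.
    return [ui + step * ai for ui, ai in zip(u_nom, a)]
```

For example, `safety_filter([1.0, 0.0], [0.0, 1.0], 0.5)` leaves the first action component untouched and lifts the second to the constraint boundary, which is the "minimally invasive" behavior the abstract refers to.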

2605.26108 2026-06-09 cs.CV 版本更新

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

通过奖励倾斜分布匹配增强少步生成器

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

发表机构 * Tencent Hunyuan(腾讯混元) Hong Kong University of Science and Technology(香港科技大学) Westlake University(西湖大学)

AI总结 提出奖励倾斜分布匹配蒸馏(RTDMD)两阶段框架,结合分布匹配蒸馏与奖励引导强化学习,在仅4步推理下实现文本到图像生成的最新性能。

Comments Code and models are available at https://github.com/Harahan/RTDMD

AI中文摘要

近期少步扩散蒸馏的进展实现了高效图像生成,但将这些模型与人类偏好对齐仍具挑战。我们提出奖励倾斜分布匹配蒸馏(RTDMD),一个两阶段框架,将分布匹配蒸馏与奖励引导的强化学习统一用于少步流生成器。我们证明,最小化到奖励倾斜教师分布的KL散度自然分解为分布匹配项和奖励最大化项。在第一阶段,我们引入环境一致分布匹配蒸馏(AC-DMD),它执行子区间分布匹配,并用一致性正则化增强假分数目标,帮助假分数模型在有限更新下跟踪变化的生成器分布。在第二阶段,我们联合优化两项:对于奖励最大化项,我们推导出一个混合策略梯度,将GRPO风格的估计器用于随机中间过渡,与通过确定性最后步骤的直接奖励反向传播相结合,并进一步引入步骤子集GRPO(SubGRPO)以降低方差。在SD3、SD3.5和FLUX.2上的实验表明,RTDMD在偏好、美学和组合指标上仅用4步推理就建立了新的最先进结果,超越了先前的少步文本到图像生成方法。代码和模型见https://github.com/Harahan/RTDMD。

英文摘要

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
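The decomposition mentioned in the abstract follows from a short calculation, sketched here under assumed notation that is not taken from the paper: $p_T$ is the teacher distribution, $r$ the reward, $\beta > 0$ a temperature, $q$ the generator distribution, and $Z$ the normalizing constant of the tilted teacher.

```latex
\begin{aligned}
p^{*}(x) &= \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big), \\
\mathrm{KL}\big(q \,\|\, p^{*}\big)
  &= \mathbb{E}_{q}\!\left[\log q(x) - \log p_T(x) - \tfrac{1}{\beta} r(x)\right] + \log Z \\
  &= \underbrace{\mathrm{KL}\big(q \,\|\, p_T\big)}_{\text{distribution matching}}
   \;-\; \underbrace{\tfrac{1}{\beta}\,\mathbb{E}_{q}\big[r(x)\big]}_{\text{reward maximization}}
   \;+\; \log Z .
\end{aligned}
```

Since $\log Z$ does not depend on $q$, minimizing the left-hand side amounts to matching the teacher while maximizing expected reward, with $\beta$ setting the trade-off between the two terms.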

2605.30226 2026-06-09 cs.RO cs.AI 版本更新

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

BORA: 弥合离线强化学习与在线残差适应以实现真实世界灵巧VLA模型

Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University(上海交通大学) CASIA(中国科学院自动化研究所) Shanghai AI Laboratory(上海人工智能实验室) USTC(中国科学技术大学)

AI总结 提出BORA框架,通过离线构建动作条件价值引导的评论家,并结合在线冻结VLA基础、引入人类在环的分块残差适应机制,解决灵巧操作中高维探索导致的时间不一致、样本低效和硬件风险问题,在五个真实灵巧任务上平均成功率提升33%。

Comments 24 pages,11 figures

AI中文摘要

视觉-语言-动作(VLA)模型已成为将视觉-语言理解融入真实世界机器人操作的一种有前景的范式。然而,由于高维手部控制和复合执行误差,灵巧操作对VLA策略仍然具有挑战性,这使得真实世界的强化学习后训练对于弥合视觉基础动作生成与物理可靠灵巧执行之间的差距至关重要。然而,高维灵巧探索常常引发真实世界中的时间不一致性、样本低效和硬件风险。为应对这些挑战,我们提出BORA,一种为真实世界灵巧VLA模型设计的离线到在线强化学习后训练框架。在离线阶段,BORA构建一个以VLM的认知令牌和动作块作为输入的评论家。这种设计实现了动作条件价值引导,使评论家能够评估超越视觉上下文的灵巧手部运动。在随后的在线阶段,BORA冻结VLA基础,并引入一种轻量级、人类在环(HiL)的分块残差适应机制,以减轻真实世界执行误差并进一步在真实物理环境中纠正离线学习到的意图。通过继承离线评论家并采用干预驱动奖励,BORA有效纠正执行差异并适应真实世界物理变化,同时将预训练策略作为稳定先验。在五个复杂真实世界灵巧任务上的广泛评估表明,BORA显著优于纯模仿学习和传统解耦强化学习基线,在标准设置下平均成功率绝对提升33%,在未见物体泛化中提升高达43%。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. Yet high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency, and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.

2605.30184 2026-06-09 cs.LG physics.ao-ph 版本更新

Can AI Weather Models Predict Beyond Two Weeks? A Quantitative Benchmark and Analysis of Long Rollouts

AI天气模型能否预测两周以上?长期推演的定量基准与分析

Fanny Lehmann, Firat Ozdemir, Yun Cheng, Torsten Hoefler, Sebastian Schemm, Benedikt Soja, Siddhartha Mishra

发表机构 * ETH AI Center(ETH人工智能中心) ETH Zurich(苏黎世联邦理工学院) Swiss Data Science Center(瑞士数据科学中心) Scalable Parallel Computing Lab(可扩展并行计算实验室) Dep. of Applied Mathematics and Theoretical Physics(应用数学与理论物理系) University of Cambridge(剑桥大学) Institute of Geodesy and Photogrammetry(大地测量与摄影测量研究所) Seminar for Applied Mathematics(应用数学研讨会)

AI总结 通过九种AI天气模型的一年推演,将长期不稳定性分类为爆发、漂移和季节性丧失三种模式,并发现稳定性取决于对小时空尺度的处理。

AI中文摘要

虽然AI天气模型在短期到中期预报(最多15天)中表现出色,但在更长时间推演时经常出现定义不清的“不稳定性”。本文通过九种最先进的AI天气模型的一年推演,将这些失败形式化为三种不同的模式:爆发、漂移和季节性丧失。我们的分析表明,稳定性取决于对小时空尺度的处理:不稳定的模型放大高频能量,而稳定的模型在输入中添加噪声时起到去噪作用。我们的发现远未将这些模型简化为随机鹦鹉,而是强调稳定模型根据初始状态生成独特的天气轨迹。我们通过对架构设计选择的消融研究验证了我们的发现,这些研究使用了最先进的Vision Transformer(ViT)AI天气模型架构。

英文摘要

While AI weather models excel at short-to-medium range forecasts (up to 15 days), they frequently suffer from ill-defined "instabilities" when rolled out over longer horizons. This work addresses the lack of a formal taxonomy by categorizing these failures into three distinct regimes: blow-up, drift, and loss of seasonality, through year-long rollouts of nine state-of-the-art AI weather models. Our analysis reveals that stability hinges on the treatment of small spatio-temporal scales: unstable models amplify high-frequency energy, while stable models act as denoisers when noise is added to their inputs. Far from reducing these models to mere stochastic parrots, our findings highlight that stable models generate unique weather trajectories, conditioned on the initial state. We verify our findings through ablation studies on architectural design choices, conducted using state-of-the-art Vision Transformer (ViT) AI weather model architectures.

2605.29920 2026-06-09 cs.LG 版本更新

Midpoint Generative Models

中点生成模型

Daniil Shlenskii, Nikita Gushchin, Lev Novitskiy, Dmitry V. Dylov, Alexander Korotin

发表机构 * AXXX, Russia(俄罗斯AXXX) Applied AI Institute, Russia(俄罗斯应用人工智能研究所) Kandinsky Lab, Russia(俄罗斯康定斯基实验室)

AI总结 提出中点生成模型(MGM),利用流匹配的对称性定义中点散度,并通过变分目标训练单步生成模型,在性能上与现有方法竞争。

AI中文摘要

我们引入了中点生成模型(MGM),这是一个用于训练单步生成模型的原则性框架。MGM基于线性插值流匹配的一个简单对称性:当两个端点分布重合时,相应的漂移场在中点时间$t=1/2$处消失。我们证明该场的范数定义了分布之间的有效差异,称为中点散度。我们通过引入随机翻转插值将该散度扩展到中点之外,并通过用对称随机插值替代确定性线性流匹配插值进一步推广,得到广义中点散度。最后,我们推导了广义散度的变分形式,从而得到一个可处理的目标用于训练单步生成器。由此产生的MGM算法为生成建模提供了一种有效且理论上有依据的方法,在单步生成建模方法中取得了有竞争力的性能。

英文摘要

We introduce Midpoint Generative Models (MGM), a principled framework for training one-step generative models. MGM is based on a simple symmetry of Flow Matching with linear interpolation: when the two endpoint distributions coincide, the corresponding drift field vanishes at the midpoint time, $t=1/2$. We show that the norm of this field defines a valid discrepancy between distributions, which we call the Midpoint Divergence. We extend this discrepancy beyond the midpoint by introducing randomly flipped interpolations and further generalize it by replacing deterministic linear Flow Matching interpolations with symmetric stochastic interpolants, yielding a generalized Midpoint Divergence. Finally, we derive a variational formulation of our generalized divergence, yielding a tractable objective for training a one-step generator. The resulting MGM algorithm offers an effective and theoretically grounded approach to generative modeling, achieving competitive performance against existing one-step generative modeling methods.
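The midpoint symmetry can be seen from a short argument, sketched here under the assumption that the endpoints $(x_0, x_1)$ are drawn from an exchangeable coupling, for example independently from the same distribution. The notation $x_t$ and $u_t$ is the usual Flow Matching notation and is assumed rather than quoted from the paper.

```latex
\begin{aligned}
x_t &= (1-t)\,x_0 + t\,x_1, \qquad
u_t(x) = \mathbb{E}\big[\,x_1 - x_0 \;\big|\; x_t = x\,\big], \\
u_{1/2}(x) &= \mathbb{E}\big[\,x_1 - x_0 \;\big|\; \tfrac{1}{2}(x_0 + x_1) = x\,\big]
            = \mathbb{E}\big[\,x_0 - x_1 \;\big|\; \tfrac{1}{2}(x_0 + x_1) = x\,\big]
            = -\,u_{1/2}(x).
\end{aligned}
```

Swapping $x_0$ and $x_1$ leaves the midpoint unchanged but flips the sign of $x_1 - x_0$, so $u_{1/2} \equiv 0$ whenever the two endpoint distributions coincide. A nonzero midpoint drift therefore signals a mismatch, which is the intuition behind using its norm as a discrepancy.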

2605.29823 2026-06-09 cs.AI 版本更新

Quantifying and Optimizing Simplicity via Polynomial Representations

通过多项式表示量化和优化简单性

Tianren Zhang, Xiangxin Li, Minghao Xiao, Guanyu Chen, Feng Chen

发表机构 * [cs.AI](计算机科学与人工智能)

AI总结 提出多项式表示作为分布感知的低维神经函数代理,通过正交多项式基近似网络预测行为,以有效度作为简单性度量,并导出可微正则化器以提升泛化。

Comments ICML 2026

AI中文摘要

深度网络通常表现出对“简单”解的偏好,这种简单性偏差被广泛认为在泛化中起关键作用。然而,一种广泛适用、定量的简单性度量仍然难以捉摸。我们引入多项式表示作为分布感知的、低维神经函数代理:我们使用正交多项式基沿数据依赖的插值路径近似网络的预测行为,从而得到紧凑的函数表示。我们表明,该表示的有效度可作为实用的简单性度量,能够预测跨任务和架构的泛化,并且持续优于现有的泛化代理(如锐度)。最后,多项式表示自然产生可微的简单性正则化器,在图像和文本分类、微调对比视觉语言模型以及强化学习中持续改善泛化。

英文摘要

Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introduce polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: we approximate a network's predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that the effective degree of this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. Finally, polynomial representations naturally yield a differentiable simplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision-language models, and reinforcement learning.
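The idea of summarizing a function along a path by its polynomial coefficients can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: it uses a Legendre basis on $[-1, 1]$, a scalar function `f` of the path parameter standing in for the network's prediction along an interpolation path, and a hypothetical energy-weighted mean degree as the "effective degree".

```python
def legendre_values(t, max_degree):
    # Bonnet recurrence: (k+1) P_{k+1}(t) = (2k+1) t P_k(t) - k P_{k-1}(t).
    vals = [1.0, t]
    for k in range(1, max_degree):
        vals.append(((2 * k + 1) * t * vals[k] - k * vals[k - 1]) / (k + 1))
    return vals[: max_degree + 1]

def legendre_coeffs(f, max_degree, n=2000):
    # c_k = (2k+1)/2 * integral of f(t) P_k(t) over [-1, 1], via the midpoint rule.
    coeffs = [0.0] * (max_degree + 1)
    h = 2.0 / n
    for i in range(n):
        t = -1.0 + (i + 0.5) * h
        ft = f(t)
        for k, p in enumerate(legendre_values(t, max_degree)):
            coeffs[k] += (2 * k + 1) / 2.0 * ft * p * h
    return coeffs

def effective_degree(coeffs):
    # Assumed metric: mean degree weighted by each term's L2 energy,
    # c_k^2 * 2 / (2k + 1), with the constant term excluded.
    energy = [c * c * 2.0 / (2 * k + 1) for k, c in enumerate(coeffs)]
    total = sum(energy[1:])
    if total == 0.0:
        return 0.0
    return sum(k * e for k, e in enumerate(energy) if k >= 1) / total
```

Under this assumed metric a linear function along the path scores close to 1 and functions with more curvature score higher, which gives a scalar that could in principle be penalized during training.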

2504.19399 2026-06-09 cs.RO 版本更新

Follow Everything: A Leader-Following and Obstacle Avoidance Framework with Goal-Aware Adaptation

跟随一切:具有目标感知适应的领导者跟随与避障框架

Qianyi Zhang, Shijian Ma, Boyi Liu, Jianhao Jiao, Dimitrios Kanoulas

发表机构 * Institute of Robotics and Automatic Information System, Nankai University, China(南开大学机器人与自动化信息系统研究所) Centre for Data Science, University of Macau, China(澳门大学数据科学中心) Electrical and Computer Engineering Department, Hong Kong University of Science and Technology, China(香港科技大学电子与计算机工程系) Department of Computer Science, University College London, UK(伦敦大学学院计算机科学系) Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学航空及民航工程学系)

AI总结 提出统一框架,用分割模型替代检测模型以跟随任意形态领导者,并设计目标感知适应机制和基于图的规划器,实现领导者暂时离开视野时的鲁棒跟随与避障。

AI中文摘要

鲁棒且灵活的领导者跟随是机器人融入人类社会的一项关键能力。现有方法难以泛化到任意形态的领导者,并且在领导者暂时离开机器人视野时常常失败,本文引入了一个统一框架来应对这两个挑战。首先,用分割模型替代传统检测模型,使领导者可以是任何物体。为了增强识别鲁棒性,实现了一个距离帧缓冲区,在多个距离存储领导者嵌入,以考虑领导者跟随任务的独特特征。其次,设计了一种目标感知适应机制,根据领导者的可见性和运动来控制机器人规划状态,并辅以基于图的规划器,为每个状态生成候选轨迹,确保高效跟随和避障。在室内外环境中,使用腿式机器人跟随者与各种领导者(人、地面机器人、无人机、腿式机器人、停止标志)进行的仿真和真实世界实验显示,在跟随成功率、减少视觉丢失时长、降低碰撞率和减小领导者-跟随者距离方面取得了竞争性改进。

英文摘要

Robust and flexible leader-following is a critical capability for robots to integrate into human society. While existing methods struggle to generalize to leaders of arbitrary form and often fail when the leader temporarily leaves the robot's field of view, this work introduces a unified framework addressing both challenges. First, traditional detection models are replaced with a segmentation model, allowing the leader to be anything. To enhance recognition robustness, a distance frame buffer is implemented that stores leader embeddings at multiple distances, accounting for the unique characteristics of leader-following tasks. Second, a goal-aware adaptation mechanism is designed to govern robot planning states based on the leader's visibility and motion, complemented by a graph-based planner that generates candidate trajectories for each state, ensuring efficient following with obstacle avoidance. Simulations and real-world experiments with a legged robot follower and various leaders (human, ground robot, UAV, legged robot, stop sign) in both indoor and outdoor environments show competitive improvements in follow success rate, reduced visual loss duration, lower collision rate, and decreased leader-follower distance.

2605.29475 2026-06-09 cs.CL cs.AI cs.CE cs.HC 版本更新

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot:一个基于网络的交互式助手,用于统一探索性和细粒度科学假设发现

Hongran An, Zonglin Yang

发表机构 * Central Conservatory of Music(中央音乐学院) Nanyang Technological University(南洋理工大学)

AI总结 提出MOOSE-Copilot,通过形式化的人机交互协议,将发散性探索和收敛性细化统一,利用蓝图、路由和反馈三种信号引导生成,显著优于纯自主基线。

Comments Accepted to ACL 2026 (System Demonstrations)

AI中文摘要

大型语言模型(LLMs)在科学假设发现中展现出显著潜力。然而,现有方法存在两个关键限制:它们将发散性探索构思和收敛性细粒度细化视为孤立任务,并且自主运行,几乎没有人类指导。我们提出了MOOSE-Copilot,这是第一个通过形式化的人机交互(HAII)协议弥合这一抽象差距的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程:初始蓝图、阶段间路由和再生反馈。定量评估表明,注入这些结构化专家信号显著优于纯自主基线,并在神谕指导下建立了性能上限。此外,为了普及这一范式,我们开发了一个直观的基于网络界面,具有交互式树状可视化。这明确消除了复杂命令行代理工具的陡峭学习曲线,使跨学科研究人员能够直接利用、视觉编排并加速端到端的科学突破。

英文摘要

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback, with no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.

2605.28912 2026-06-09 cs.LG cs.CR 版本更新

Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems

基于环空间感知的电力系统自编码器盲假数据注入攻击检测

Xin Li, Chenhan Xiao, Jonathan Cohen, Aviad Elyashar, Yang Weng, Rami Puzis

发表机构 * Faculty of Computer and Information Science, Ben-Gurion-University, Be’er Sheva, Israel(计算机与信息科学学院,本·古里安大学,贝尔谢巴,以色列)

AI总结 针对自编码器利用测量流形零空间生成的盲假数据注入攻击,提出基于拓扑环空间检测器,利用最小环基实现最优泛化误差,有效检测数据驱动攻击。

Comments 13 pages, 11 figures

AI中文摘要

人工智能驱动的数据中心和大型储能系统的快速增长,使得电力系统运行越来越依赖实时测量数据和自动决策。然而,许多现有的检测方法依赖于对测量值的统计或数据驱动分析,当攻击者利用相同的数据结构构造隐蔽扰动时,这些方法可能会失效。为说明这一局限性,我们展示了一种盲假数据注入攻击(FDIA),其中自编码器学习测量流形并生成与雅可比零空间对齐的扰动,从而使得攻击能够逃避基于残差的坏数据检测器和时间序列异常检测器。为了缓解利用零空间的数据驱动FDIA,我们提出了一种拓扑感知的环空间检测器(CSD),该检测器利用网络的环空间施加结构约束,以增强零空间估计。此外,我们证明,通过使用最小环基(MCB),所提出的CSD实现了攻击检测的最优泛化误差。通过利用拓扑导出的环约束而不是仅仅依赖于数值零空间估计,所提出的方法不需要精确的线路参数,并改善了正常测量与受攻击测量之间的分离。在IEEE 14、30、57和118节点系统上的仿真结果表明,该方法在实际测量噪声下有效检测数据驱动FDIA。

英文摘要

The rapid growth of AI-driven data centers and large-scale energy storage systems is increasing the reliance of power system operation on real-time measurement data and automated decision-making. However, many existing detection methods rely on statistical or data-driven analysis of measurements and can fail when attackers exploit the same data structure to craft stealthy perturbations. To illustrate this limitation, we demonstrate a blind False Data Injection Attack (FDIA) in which an Autoencoder learns the measurement manifold and generates perturbations aligned with the Jacobian null space, thereby allowing the attack to evade both residual-based bad-data detectors and time-series anomaly detectors. To mitigate data-driven FDIAs that exploit the null space, we propose a topology-informed Cycle-Space Detector (CSD) that leverages the Cycle-Space of the network to impose structural constraints that enhance null space estimation. In addition, we prove that by using the Minimum Cycle Basis (MCB), the proposed CSD achieves the optimal generalization error for attack detection. By exploiting topology-derived cycle constraints rather than relying solely on numerical null space estimation, the proposed method does not require precise line parameters and improves the separation between normal and attacked measurements. Simulation results on IEEE 14-, 30-, 57-, and 118-bus systems demonstrate that the proposed method effectively detects data-driven FDIAs under realistic measurement noise.
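The kind of structural constraint a cycle space provides can be illustrated with a small sketch. Several things here are assumptions rather than the paper's method: the edge quantities are taken to be bus-angle differences, the cycles come from a BFS spanning tree (a fundamental cycle basis, not the Minimum Cycle Basis the paper uses), and the function names are hypothetical. The point is only that any quantity that is a difference of node potentials must sum to zero around every cycle, using topology alone and no line parameters.

```python
from collections import deque

def fundamental_cycles(n_nodes, edges):
    """One signed cycle per non-tree edge of a connected graph (BFS tree from node 0)."""
    adj = {v: [] for v in range(n_nodes)}
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx, +1))  # u -> v follows the edge orientation
        adj[v].append((u, idx, -1))  # v -> u goes against it
    parent = {0: None}  # node -> (parent node, edge index, sign of parent -> node)
    tree_edges = set()
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v, idx, sign in adj[u]:
            if v not in parent:
                parent[v] = (u, idx, sign)
                tree_edges.add(idx)
                queue.append(v)

    def path_to_root(v):
        path = []
        while parent[v] is not None:
            p, idx, sign = parent[v]
            path.append((idx, -sign))  # walking child -> parent reverses the sign
            v = p
        return path

    cycles = []
    for idx, (u, v) in enumerate(edges):
        if idx in tree_edges:
            continue
        # Closed walk: u -> v on the non-tree edge, v -> root, then root -> u.
        back = [(i, -s) for i, s in reversed(path_to_root(u))]
        cycles.append([(idx, +1)] + path_to_root(v) + back)
    return cycles

def cycle_residuals(cycles, edge_values):
    # Signed sum of edge values around each cycle; zero for potential differences.
    return [sum(s * edge_values[i] for i, s in cycle) for cycle in cycles]
```

A connected graph with $E$ edges and $V$ nodes yields $E - V + 1$ such cycles. A perturbation injected on a single edge shows up as a nonzero residual on every basis cycle through that edge.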

2605.28831 2026-06-09 cs.CL cs.AI 版本更新

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem:用于长时域交互式问答的结构化时空场景-事件记忆

Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) City University of Hong Kong(香港城市大学) The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) The University of Sydney(悉尼大学) Beihang University(北京航空航天大学)

AI总结 提出S3MEM框架,通过结构化场景-事件记忆和锚点敏感检索,在长时域交互式问答中实现比通用记忆接口更优的准确率-效率平衡。

AI中文摘要

长时域交互代理通常积累大量轨迹历史,但仍无法可靠地回答关于早期事件的问题。我们认为主要瓶颈不仅是上下文长度,而是长期记忆的轨迹到答案接口。当历史以纯文本块存储并使用标准检索增强生成(RAG)查询时,系统通常检索到局部相关但链不完整的证据,特别是对于空间、时间、重复事件和多跳状态问题。我们提出S3MEM,一种用于长时域交互式问答(QA)的结构化场景-事件情节记忆框架。S3MEM将轨迹写入结构化记忆单元,通过锚点敏感检索检索证据,并为答案时间推理提供紧凑的令牌预算感知证据接口。从这个意义上说,S3MEM是一种结构化证据利用工具,将代理轨迹转换为查询对齐的支持。我们在两个内部标题环境(Crafter、Jericho)和两个外部环境(SciWorld、ALFWorld)上评估S3MEM。在共享的冻结答案时间协议下,S3MEM在所有四个环境中一致优于Vanilla RAG,在Crafter、Jericho和ALFWorld上超过Graph-NoReader,在SciWorld上与之匹配,同时使用的证据令牌显著减少。三个改编的近期基线——A-MEM启发、MemoryOS改编和LightMem改编——在多个设置中优于Vanilla RAG,但没有一个达到S3MEM的整体准确率-效率前沿。总体而言,证据支持一个有限的结论:在当前冻结的答案时间协议下,结构化写入和锚点敏感证据路由为长时域交互式QA提供了比通用记忆接口更强的准确率-效率前沿。

英文摘要

Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene-Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor-support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches 0.48 F1 and 0.40 BLEU with 1,073 evidence tokens per question, about 15.8× fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only 189 tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.

2605.19276 2026-06-09 cs.CL cs.LG 版本更新

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass:大型语言模型的通用评估平台

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Zhiwei Fei, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Zhuozhi Xiong, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

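Affiliations OpenCompass Team; Shanghai AI Laboratory

AI summary Presents OpenCompass, a modular, highly compatible, flexible, and high-concurrency general-purpose LLM evaluation platform that supports a range of task scenarios and mainstream benchmark datasets.

Details
AI-generated Chinese abstract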

近年来,人工智能领域经历了从特定任务的小规模模型到通用大型语言模型(LLM)的范式转变。随着LLM的快速迭代,对其能力进行客观、定量和全面的评估已成为推动技术发展的关键环节。目前,基于静态基准数据集的主流评估方法面临任务类型多样性、评估标准不一致以及数据处理流程碎片化等挑战,难以高效进行跨领域和大规模模型评估。为解决上述问题,本文提出并开源了OpenCompass,一个一站式、可扩展且支持高并发的通用LLM评估平台。该平台遵循模块化和组件解耦的设计理念,具有三大核心优势:高兼容性、灵活性和高并发性。OpenCompass的核心架构包括五个关键组件:配置系统、任务划分模块、执行与调度模块、任务执行单元和结果可视化模块。其工作流程提供基于规则、LLM作为评判者和级联评估器,以适应不同任务场景的需求。平台支持知识、推理、计算、科学、语言、代码等多个领域的基准数据集,为学术界和工业界提供统一高效的LLM评估工具,有助于准确识别LLM的优缺点并进行后续优化。

English abstract

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.
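
The three evaluator types named in the abstract can be pictured with a small Python sketch. It assumes one plausible reading of "cascaded": a cheap rule-based check runs first, and only answers it rejects are passed to an LLM judge. The names `rule_based_match`, `cascade_evaluate`, and `stub_judge` are hypothetical and are not OpenCompass APIs; the judge is a stub, not a real model call.

```python
from typing import Callable, Tuple


def rule_based_match(prediction: str, reference: str) -> bool:
    """Rule-based evaluator: exact match after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def cascade_evaluate(
    prediction: str,
    reference: str,
    judge: Callable[[str, str], bool],
) -> Tuple[bool, str]:
    """Try the rule first; fall back to the (assumed) judge only on failure."""
    if rule_based_match(prediction, reference):
        return True, "rule"
    return judge(prediction, reference), "judge"


def stub_judge(prediction: str, reference: str) -> bool:
    """Stand-in for an LLM-as-a-Judge call (hypothetical): substring test."""
    return reference.strip().lower() in prediction.lower()


print(cascade_evaluate(" Paris ", "paris", stub_judge))
print(cascade_evaluate("The capital is Paris.", "Paris", stub_judge))
print(cascade_evaluate("London", "Paris", stub_judge))
```

Under this reading, the judge is only paid for on answers the rule cannot settle, which is the usual motivation for cascading.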

2605.25985 2026-06-09 cs.AI updated version

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

面向多自由变量复杂逻辑查询的神经可扩展符号搜索框架

Weizhi Fei, Hang Yin, Zihao Wang, Shukai Zhao, Wei Zhang, Yangqiu Song

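Affiliations Department of Mathematical Sciences, Tsinghua University; Squarepoint Capital; Department of Computer Science and Engineering, Hong Kong University of Science and Technology; Department of Computer Sciences, University of Rochester

AI summary To address the difficulty of joint ranking for complex knowledge-graph queries with multiple free variables, proposes the Neural Scalable Symbolic Search (NS3) framework, which approximates joint ranking through budget constraints and hypernode merging and substantially improves joint-ranking performance.

Comments 10 pages, 5 figures

Details
AI-generated Chinese abstract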

复杂查询回答(CQA)是在不完整知识图谱(KG)上进行知识表示和推理的基本任务。回答带有$k$个自由变量的存在性一阶查询(即$\text{EFO}_k$查询)是一个关键但具有挑战性的问题,因为它需要对$\mathcal{E}^k$中的答案元组进行排序,其中$\mathcal{E}$表示KG的实体集。随着$k$的增长,这很快变得难以处理。因此,现有基准和方法依赖于单个变量的边际排序;然而,边际排序是元组真实联合排序的较差代理。基于$\text{EFO}_1$查询的神经符号搜索,我们提出了神经可扩展符号搜索(NS3),这是一个预算框架,无需枚举$\mathcal{E}^k$即可近似联合排序。NS3 (i) 回答边际化子查询以获得必要的候选集,(ii) 将多个自由变量合并为超节点,其域由动态预算$B$修剪和控制,以及(iii) 逐步将$\text{EFO}_k$查询简化为在预算缩减域上的$\text{EFO}_{k-1}$查询。在三个标准KG数据集上,NS3在保持强边际准确性的同时,显著提高了联合排序性能。我们进一步发布了一个联合排序基准,将现有的$\text{EFO}_1$数据集扩展到$k=3$,从而能够系统评估多变量查询。我们的代码提供在https://github.com/HKUST-KnowComp/NS3_KDD2026。

English abstract

Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in https://github.com/HKUST-KnowComp/NS3_KDD2026.
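
The idea of ranking tuples under a budget $B$ instead of enumerating $\mathcal{E}^k$ can be illustrated with a beam-search-style Python sketch. This is not the authors' algorithm: `budgeted_joint_ranking`, `toy_joint_score`, and the toy marginal scores are hypothetical, and the "merge" step simply extends partial tuples and keeps the best `budget` of them.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def budgeted_joint_ranking(
    marginals: List[Dict[str, float]],
    joint_score: Callable[[Tuple[str, ...]], float],
    budget: int,
) -> List[Tuple[str, ...]]:
    """Rank k-tuples while keeping at most `budget` partial tuples per step."""
    if budget < 1:
        raise ValueError("budget must be at least 1")

    def top(cands: Dict[str, float]) -> List[str]:
        # Candidate set for one variable, pruned by its marginal score.
        return sorted(cands, key=cands.get, reverse=True)[:budget]

    beam = [(e,) for e in top(marginals[0])]
    for cands in marginals[1:]:
        # Merge one more free variable into each partial tuple, then prune.
        merged = [t + (e,) for t, e in product(beam, top(cands))]
        merged.sort(key=joint_score, reverse=True)
        beam = merged[:budget]
    return beam


# Toy data (assumed): both variables prefer "a" marginally.
marginals = [
    {"a": 0.9, "b": 0.8, "c": 0.1},
    {"a": 0.9, "b": 0.5, "c": 0.4},
]


def toy_joint_score(t: Tuple[str, ...]) -> float:
    """Toy joint constraint: a tuple may not repeat an entity."""
    if len(set(t)) < len(t):
        return 0.0
    return sum(marginals[i][e] for i, e in enumerate(t))


ranked = budgeted_joint_ranking(marginals, toy_joint_score, budget=2)
print(ranked)  # [('b', 'a'), ('a', 'b')]
```

In the toy data the marginal top-1 for each variable is "a", yet the tuple ("a", "a") violates the joint constraint, which is the sense in which marginal rankings can be a poor proxy for the joint ranking.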

2605.25312 2026-06-09 cs.CL updated version

P1SCO: Social Dimensions from a Perspectivist Lens

P1SCO:从视角主义视角看社会维度

Amanda Cercas Curry, Gianmarco de Francisci Morales, Luca Maria Aiello

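Affiliations Independent Researcher; CENTAI, Turin; IT University of Copenhagen

AI summary Presents the P1SCO dataset of social media comments collected from three platforms and annotated along ten social dimensions, capturing the diversity of social interactions and perceptions and supporting fine-grained analysis across platforms and individual differences.

Details
AI-generated Chinese abstract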

我们介绍了P1SCO,一个从三个不同平台收集的社交媒体评论数据集,根据十个社会维度进行标注,以捕捉社会互动和感知的多样性。该数据集经过仔细分解,允许在单个评论、标注者和平台层面进行分析。除了社会维度标签外,我们还包含了丰富的标注者元数据,包括人口统计信息、大五人格特征和政治倾向。这种评论级标注和标注者级特征的组合,能够对社会感知如何因平台、个体差异和人口因素而变化进行细致分析。通过保留标注者视角的多样性,我们的数据集支持标注者间和标注者内部一致性研究、人格和政治倾向对社会解读的影响,以及社会话语的跨平台动态分析。

English abstract

We introduce P1SCO, a dataset of social media comments collected from three distinct platforms, annotated according to ten social dimensions to capture the diversity of social interactions and perceptions. The dataset is carefully disaggregated to allow analysis at the level of individual comments, annotators, and platforms. In addition to the social dimension labels, we include rich metadata on the annotators, including demographics, Big Five personality profiles, and political affiliation. This combination of comment-level annotations and annotator-level features enables nuanced analyses of how social perception varies across platforms, individual differences, and demographic factors. By preserving the diversity of annotator perspectives, our dataset supports studies of inter- and intra-annotator agreement, the influence of personality and political orientation on social interpretation, and the cross-platform dynamics of social discourse.

2605.24942 2026-06-09 cs.LG cs.AI updated version

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

黎曼流形操控:用于无标签操控的几何感知生成自编码器

Narmeen Oozeer, Shivam Raval, Philip Quirke, Manikandan Ravikiran, Jeff Phillips, Shriyash Upadhyay, Amirali Abdullah

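Affiliations Martian; Harvard University; Thoughtworks; University of Utah

AI summary Recasts language-model steering as Riemannian geodesic computation on activation space, and achieves label-free manifold steering without a topology prior through an encoder trained on output-space Hellinger distances.

Details
AI-generated Chinese abstract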

语言模型的操控——干预其内部激活以改变下游行为——最近已从线性插值扩展到非线性方法,如角度操控和核化操控,这些方法定义了干预变换,而无需在激活空间中的路径上学习显式几何。新引入的几何感知流形方法确实学习了这样的几何,但需要带标签的类中心以及预设的循环或顺序结构。这些假设限制了流形操控的应用范围,因为现有构造需要带标签的中心和兼容的边界条件。我们将流形操控更广泛地重新定义为激活空间上的黎曼测地线计算,将线性操控和带标签样条操控恢复为特定度量选择下的测地线。该框架内一个有原则的度量是输出空间Hellinger距离拉回到激活空间;我们通过一个在小型概念-令牌模式上基于输出距离训练的学习编码器来近似该度量——无需每个提示的标签、无需拓扑先验、也无需每个任务的曲线拟合。实验上,该方法在标准四任务语言模型算术基准的所有任务中可靠地将模型驱动到目标类别,同时在较小输出空间上遵循比基线更行为自然的轨迹。因此,我们为流形操控提供了一个统一的黎曼框架,以及一个基于模式监督、无标签的实例化,该实例化无需带标签的中心或预设边界条件即可运行。

English abstract

Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Freshly introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids together with prescribed cyclic or sequential structure. These assumptions restrict where manifold steering can be applied, since existing constructions require labelled centroids and compatible boundary conditions. We recast manifold steering more broadly as \textbf{Riemannian geodesic computation} on activation space, recovering linear and labelled-spline steering as geodesics under particular choices of metric. A principled metric within this framework is the output-space Hellinger distance pulled back to activations; we approximate this with a learned encoder trained on output distances over a small concept-token schema - no per-prompt labels, no topology prior, and no per-task curve fitting. Empirically, the method reliably drives the model onto the target class across all tasks in a standard four-task language-model arithmetic benchmark, while following more behaviourally natural trajectories than baselines on smaller output spaces. We thereby provide a unified Riemannian framework for manifold steering together with a schema-supervised, label-free instantiation that operates without labelled centroids or prescribed boundary conditions.
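
For reference, the Hellinger distance between two discrete output distributions $p$ and $q$ is $H(p,q) = \frac{1}{\sqrt{2}}\lVert\sqrt{p}-\sqrt{q}\rVert_2$. The Python sketch below computes it and sums it along a discretized path of output distributions, which is one simple way to picture a path length measured in output space. The function names and the toy path are assumptions; the abstract does not specify how the authors discretize paths or train the encoder.

```python
import math
from typing import Sequence


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete distributions on one support."""
    if len(p) != len(q):
        raise ValueError("distributions must share a support")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def path_length(distributions: Sequence[Sequence[float]]) -> float:
    """Sum of Hellinger distances between consecutive points on a path."""
    return sum(
        hellinger(a, b) for a, b in zip(distributions, distributions[1:])
    )


# Toy path (assumed): output distributions over two classes at three steps.
path = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(hellinger(path[0], path[-1]))  # endpoint distance
print(path_length(path))             # length of the three-point path
```

The distance is bounded in [0, 1] and equals 1 for distributions with disjoint support, so the three-point path above is slightly longer than the direct endpoint distance.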

2605.24892 2026-06-09 cs.CV updated version

X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

X-Foresight:一种通过预测世界建模的联合视觉-动作因果预测网络

Baolu Li, Jingyu Qian, Rui Guo, Yilun Chen, Hanpeng Liu, Yuan Lin, Junhong Zhou, Ruixin Liu, Willow Yang, Yutong Zheng, Zhenli Zhang, Sean Li, Chaoda Zheng, Boyang Wang, Tenglong Gu, Zhuangzhuang Ding, Pengkun Zheng, Yu Zhang, Xianming Liu

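Affiliations PWM Team; XPeng Inc.

AI summary Proposes X-Foresight, which integrates a predictive world model directly into the VLA architecture and jointly learns world modeling and real-time action control through a long-horizon chunk-wise autoregressive strategy and curriculum learning, addressing the low-entropy redundancy of video tokens and the difficulty of modeling long-horizon causality.

Details
AI-generated Chinese abstract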

物理世界知识主要存在于视频中。赋予视觉-语言-动作(VLA)模型此类知识对于安全且可泛化的规划至关重要。预测世界建模通过从过去观测预测未来视频,使VLA能够内化物理动态和长程因果关系。然而,朴素的下一帧预测面临两个挑战:1)与语义上不同的文本标记不同,视频标记是低熵且冗余的,导致预测退化为琐碎的外推;2)世界建模存在时间困境:密集预测捕捉瞬时动态,但无法高效建模长程因果。为有效学习世界知识,我们引入X-Foresight,一种直接集成到VLA架构中的预测世界模型,以联合学习世界建模和实时动作控制。其核心是一种长程分块自回归策略,该策略解决了上述两个挑战:通过预测语义上遥远的块而非相邻帧,它避免了琐碎的外推,同时保留密集的块内帧用于瞬时动态和稀疏的块间过渡用于长程因果。课程学习计划逐步扩展预测范围并稳定长程训练。为有效捕捉长程因果,我们提出时间重要性采样,将监督集中于由自我运动和行为信号识别的安全关键块。我们进一步将逼真合成委托给基于扩散的多视图渲染器,以改善逼真外观。大量实验表明,X-Foresight在规划性能上显著优于VLA基线,同时保持强大的生成保真度,为世界知识驱动的自主系统建立了稳健的范式。

English abstract

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, improving visual appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.

2605.24890 2026-06-09 cs.CV updated version

QuoVLA: Quotient Space for Vision-Language-Action Models

QuoVLA:视觉-语言-动作模型的商空间

Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han

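Affiliations Department of Automation

AI summary Countering the view that the pretrained VLM latents in VLA models carry insufficient action information, proposes the quotient-space framework QuoVLA, which compresses latents into action-sufficient representations through a quantization module and a dual-branch design, improving generalization on multiple benchmarks.

Details
AI-generated Chinese abstract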

视觉-语言-动作(VLA)模型通常通过将视觉观察和语言指令映射到连续动作来适配预训练的视觉-语言模型(VLM)以进行机器人控制。现有方法通常采取动作不足的观点,假设预训练的VLM潜在表示要么缺乏直接可用的动作信息,要么应该屏蔽动作学习信号。与这一观点相反,我们的\textit{VLA商理论}表明,预训练的VLM潜在表示并非动作不足而是动作充分的:它们已经包含控制所需的信息,但由于区分了诱导相同最优动作行为的提示级变体而仍然过度完备。为了将这一理论付诸实践,我们提出了QuoVLA,一个用于VLA的商空间框架,将预训练的VLM潜在表示压缩为动作充分的表示。具体来说,QuoVLA通过一个量化模块和一个具有相对时间复杂度正则化的双分支设计实例化这一原则,在去除提示级冗余的同时保留动作相关信息。跨多个基准的大量实验表明,QuoVLA实现了强大的性能,在视觉、语言和环境分布偏移下的泛化方面尤其显著提升。我们的代码将公开提供。

English abstract

Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our \textit{Quotient Theory for VLA} shows that pretrained VLM latents are not action-insufficient but action-sufficient: they already contain the information needed for control, yet remain overcomplete by distinguishing prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.

2603.04862 2026-06-09 cs.SD updated version

Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

先聚焦后聆听:探索用于噪声鲁棒的大规模音频语言模型的即插即用音频增强器

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi

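Affiliations University of California, Berkeley; University of Washington

AI summary Proposes FTL, a plug-and-play audio enhancer that separates speech from non-speech, uses a modality router to predict the target modality, and generates a task-adaptive enhanced signal, improving LALM performance in noisy conditions without fine-tuning.

Comments Accepted by ICML 2026 Workshop (Machine Learning for Audio)

Details
AI-generated Chinese abstract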

大规模音频语言模型(LALMs)是一类用于音频理解的基础模型。现有的LALMs在现实世界的噪声声学条件下,当语音和非语音声音干扰时,性能往往会显著下降。虽然噪声感知微调可以提高鲁棒性,但它需要特定任务的噪声数据和昂贵的重新训练,限制了可扩展性。为了解决这个问题,我们提出了先聚焦后聆听(FTL),一种即插即用的音频增强器,可提高LALMs的噪声鲁棒性。具体来说,FTL首先将输入波形分离为语音和非语音,并应用模态路由器根据用户指令预测目标音频模态(例如,语音)。最后,一个模态感知融合模块生成任务自适应的增强信号,以改善下游感知和推理。跨多个LALMs和任务的实验表明,FTL在不同噪声水平下都能提升性能,而无需对LALMs进行微调。

English abstract

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning on LALMs.

2605.23595 2026-06-09 cs.LG cs.AI cs.CV cs.ET cs.PF updated version

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

基于元学习的成本效益模型评估

Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

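Affiliations Griffith University; Edith Cowan University; The University of Queensland

AI summary Proposes MetaEvaluator, a model-agnostic meta-learning framework that uses a pool of reference models to evaluate new models on unlabeled data quickly, accurately, and cost-effectively.

Comments Accepted by KDD 2026

Details
AI-generated Chinese abstract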

机器学习的快速发展产生了不断扩展的模型生态系统,使得在未见过的未标记数据上验证新发布模型的可靠性变得越来越具有挑战性。传统的评估流程依赖于昂贵的标注、重复的微调或无法跨模型家族迁移的狭窄假设。我们提出了MetaEvaluator,一个成本效益高、模型无关的框架,用于快速、无标签地评估跨不同架构和模态的未见模型。MetaEvaluator利用参考模型池上的元学习来获得可迁移的初始化,从而能够准确评估新模型,同时将成本分摊到整个池中,并消除了每个模型重新训练的需要。据我们所知,这是第一个能够在完全未标记数据集上评估新模型的模型无关框架。大量实验表明,与传统方法相比,MetaEvaluator以显著降低的成本产生稳定且准确的性能估计,使得在未标记数据上对新出现的模型进行可扩展的基准测试变得实用。

English abstract

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2605.23247 2026-06-09 cs.LG updated version

Accelerating Divisible Load Processing Through Machine Learning: A Practical Framework for Large-Scale Workloads

通过机器学习加速可分负载处理:大规模工作负载的实用框架

Bharadwaj Veeravalli

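Affiliations Department of Electrical and Computer Engineering, National University of Singapore

AI summary Proposes the first machine learning framework that uses a feedforward neural network to predict optimal processing times in single-level tree network architectures, reaching 97-99% accuracy (R-squared) and a mean absolute percentage error of 1-5% with inference times under 1 millisecond, a 10-100x speedup over conventional methods.

Details
AI-generated Chinese abstract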

本文介绍了首个用于可分负载理论(DLT)范式下单级树网络(SLTN)架构中预测最优处理时间的机器学习框架。使用具有16个工程特征的前馈神经网络(FNN),我们在100,000个合成生成的配置上训练模型,无需显式推导DLT方程即可预测最优处理时间。模型达到97-99%的准确率(R平方因子),平均绝对百分比误差为1-5%,表明神经网络能够有效学习复杂的负载分布关系。特征重要性分析显示,模型隐式捕捉了DLT的数学结构,包括负载守恒和同时完成约束。推理时间低于1毫秒,该方法相比传统DLT计算提供10-100倍的加速,适用于实时调度、设计空间探索和云资源分配。该方法在多样化的系统配置(n=3到20,负载大小=1到100 GB)中泛化良好,精度一致,尽管在非常大或高度异构的系统中性能略有下降。本工作证明了使用机器学习加速分布式计算优化同时保持接近最优精度的可行性。

English abstract

In this paper, we introduce the first machine learning framework for predicting optimal processing times in Single-Level Tree Network (SLTN) architectures for the Divisible Load Theory (DLT) paradigm. Using a feedforward neural network (FNN) with 16 engineered features, we train a model on 100,000 synthetically generated configurations to predict optimal processing times without explicit formulation of DLT equations. The model achieves 97-99% accuracy (R-squared) with a mean absolute percentage error of 1-5%, demonstrating that neural networks can effectively learn complex load distribution relationships. Feature importance analysis reveals that the model implicitly captures DLT mathematical structure, including load conservation and simultaneous finishing constraints. With inference times under 1 millisecond, the approach serves as a viable alternative to traditional DLT computation, enabling applications in real-time scheduling, design space exploration, and cloud resource allocation. The method generalizes well across diverse system configurations (n = 3 to 20, load size = 1 to 100 GB) with consistent accuracy, though performance degrades slightly for very large or highly heterogeneous systems. This work demonstrates the feasibility of using machine learning to accelerate distributed computing optimization while maintaining near-optimal accuracy.
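
The two reported metrics can be made concrete with a short Python sketch. It uses the standard definitions of the coefficient of determination and the mean absolute percentage error; the toy processing times are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


# Toy optimal vs. predicted processing times (assumed values, in seconds).
optimal = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 31.0, 39.0]
print(round(r_squared(optimal, predicted), 3))  # 0.992
print(round(mape(optimal, predicted), 2))       # 5.21
```

The example shows why both numbers are reported: a fit can have an R-squared above 0.99 while still being several percent off on the smallest targets.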

2605.22863 2026-06-09 cs.LG updated version

Latent Cache Flow: Model-to-Model Communication Without Text

潜在缓存流:无需文本的模型间通信

Maximillian Rossi, Prajwal Raghunath, Eugene Wu

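Affiliations University of California, Berkeley; Stanford University

AI summary Proposes Latent Cache Flow (LCF), which enables efficient model-to-model communication by jointly translating and compressing key-value caches; when contexts differ, it improves Exact Match by 23% and runs 8.5 times faster than text-based communication.

Comments 6 pages, 5 figures

Details
AI-generated Chinese abstract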

当今的LLM智能体通过文本进行通信,由于需要自回归解码共享模型的状态并在接收模型处编码,这会导致显著的延迟和信息损失。最近的工作如Cache-to-Cache(C2C;Fu等人,2026)试图通过学习适配器来交换KV缓存,该适配器将共享者的KV矩阵转换为接收者模型。然而,这些适配器体积庞大且训练成本高,并且逐词翻译,要求目标上下文完全相同。这对于LLM具有不同上下文的智能体通信来说是不合适的。我们引入了潜在缓存流(LCF)。为了解决效率问题,我们观察到键和值可以联合翻译和压缩,将适配器大小减少到C2C的约4%。为了解决上下文不同的问题,我们设计了适配器来传输目标模型所没有的新信息的摘要。我们的初步实验表明,在共享上下文设置中,一个13 MB的LCF适配器可以比956 MB的C2C适配器更准确;对于不同上下文,LCF比基于文本的通信准确率提高23%,速度提升8.5倍。

English abstract

LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a pruned 13 MB LCF adapter can be more accurate than C2C at 956 MB in shared-context settings; for different contexts, LCF improves F1 by 7.5% and Exact Match by 23% while being 8.5 times faster than text-based communication.

2604.24594 2026-06-09 cs.CL cs.AI updated version

Skill Retrieval Augmentation for Agentic AI

面向智能体AI的技能检索增强

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

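Affiliations Department of Computer Science and Technology, Tsinghua University; ByteDance Inc.

AI summary Existing agent systems run out of context window and identify skills less accurately as skill libraries grow; this paper proposes the Skill Retrieval Augmentation (SRA) paradigm, which improves agent performance by dynamically retrieving from external skill corpora, and builds the SRA-Bench benchmark, which exposes a bottleneck in skill incorporation.

Details
AI-generated Chinese abstract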

随着大型语言模型(LLMs)演变为能够自主解决问题的智能体,它们越来越依赖外部的、可复用的技能来处理超出其原生参数能力的任务。在现有的智能体系统中,整合技能的主要策略是在上下文窗口内显式枚举可用技能。然而,这种策略无法扩展:随着技能库的扩大,上下文预算迅速消耗,智能体在识别正确技能方面的准确性显著下降。为此,本文提出了技能检索增强(SRA),一种新的范式,其中智能体按需从大型外部技能库中动态检索、整合和应用相关技能。为了使该问题可衡量,我们构建了一个大规模技能库,并引入了SRA-Bench,这是首个对完整SRA流程进行分解评估的基准,涵盖技能检索、技能整合和最终任务执行。SRA-Bench包含5,400个能力密集型测试实例和636个手动构建的金标准技能,这些技能与网络收集的干扰技能混合,形成了一个包含26,262个技能的大规模语料库。大量实验表明,基于检索的技能增强可以显著提高智能体性能,验证了该范式的潜力。同时,我们揭示了技能整合中的一个基本差距:当前的LLM智能体倾向于以相似的速率加载技能,无论是否检索到金标准技能,或者任务是否实际需要外部能力。这表明技能增强的瓶颈不仅在于检索,还在于基础模型判断何时加载何种技能以及何时真正需要外部加载的能力。这些发现将SRA定位为一个独特的研究问题,并为未来智能体系统中能力的可扩展增强奠定了基础。

English abstract

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

2605.22763 2026-06-09 cs.AI updated version

Advancing Mathematics Research with AI-Driven Formal Proof Search

用AI驱动的形式证明搜索推进数学研究

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Edward Lockhart, Codrut Grosu, Thomas Hubert, Matej Balog, Pushmeet Kohli, Swarat Chaudhuri

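Affiliations Google DeepMind; Aarhus University

AI summary Studies how large language models can generate formal proofs to solve open mathematical problems, and demonstrates the use and contributions of AI-aided formal proof search in mathematics research.

Details
AI-generated Chinese abstract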

大型语言模型(LLMs)在数学推理方面日益表现出色,但其不可靠性限制了其在数学研究中的实用性。一种缓解方法是使用LLMs生成Lean等语言中的形式证明。我们首次对这种方法解决开放性问题的能力进行了大规模评估。我们的最强大代理在每个问题的成本仅为几百美元的情况下,自主解决了353个开放性埃尔德什(Erdős)问题中的9个,并证明了492个OEIS猜想中的44个,同时正被应用于组合学、优化、图论、代数几何和量子光学研究。一个交替使用基于LLM的生成和基于Lean的验证的基本代理复现了在埃尔德什问题上的成功,但在最困难的问题上成本更高。这些发现展示了AI辅助形式证明搜索的威力,并揭示了使这种技术可行的代理设计。

English abstract

Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.
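
The "basic agent" described above alternates generation with verification. A minimal Python sketch of such a loop follows, under loud assumptions: `proof_search`, `stub_propose`, and `stub_verify` are hypothetical names, the proposer stands in for an LLM call, and the verifier is a stub rather than an actual Lean invocation.

```python
from typing import Callable, Optional, Tuple

Proposer = Callable[[str, str], str]               # (statement, feedback) -> proof
Verifier = Callable[[str, str], Tuple[bool, str]]  # (statement, proof) -> (ok, msg)


def proof_search(
    statement: str,
    propose: Proposer,
    verify: Verifier,
    max_attempts: int,
) -> Tuple[Optional[str], int]:
    """Alternate generation and checking until a proof is accepted."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = propose(statement, feedback)
        accepted, feedback = verify(statement, candidate)
        if accepted:
            return candidate, attempt
    return None, max_attempts


def stub_propose(statement: str, feedback: str) -> str:
    """Stand-in generator: improves only after seeing checker feedback."""
    return "by simp" if feedback else "sorry"


def stub_verify(statement: str, candidate: str) -> Tuple[bool, str]:
    """Stand-in checker: rejects placeholder proofs, accepts anything else."""
    if candidate == "sorry":
        return False, "proof contains a placeholder"
    return True, ""


result = proof_search("theorem t : 1 + 1 = 2", stub_propose, stub_verify, 5)
print(result)  # ('by simp', 2)
```

The value of the formal checker in this design is that an accepted candidate needs no human trust in the generator. For scale, the counts reported in the abstract correspond to roughly 2.5% of the Erdős problems (9 of 353) and 8.9% of the OEIS conjectures (44 of 492).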

2605.11314 2026-06-09 cs.CV cs.AI updated version

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

从单视角视频中基于3D无标记运动学的罗达和格雷厄姆步态分类量化

Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon

Affiliations Department of Biomedical Informatics, Emory University; Shriners Children’s; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary Presents a markerless gait analysis method based on single-view video that quantifies the knee and ankle z-scores used in the Rodda and Graham gait classification, enabling scalable, objective gait assessment in resource-limited clinical settings.

Comments 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

Details
AI-generated Chinese abstract

脑瘫(CP)是一种运动神经障碍,是儿童中最常见的终身身体残疾原因。大约75%的脑瘫儿童能够行走,准确的步态评估对于保持行走功能至关重要,这种功能在四分之一到一半的脑瘫成人中在中年时会恶化。罗达和格雷厄姆分类系统利用来自3D仪器化步态分析(3D-IGA)的踝关节和膝关节z分数来量化矢状面步态偏差,但3D-IGA成本高且仅限于专业中心,而观察性评估仅显示中等的评分者间一致性。我们开发了一种无标记步态分析流程,可以直接从单视角临床步态视频中量化罗达和格雷厄姆膝踝z分数。在1,058个双侧肢体样本(来自152名儿童的529次试验,其中88名男性,63名女性,年龄12.1±4.0岁,60种不同的主要诊断,脑瘫最为常见,n=54)中,矢状面模型在膝关节z分数上达到R²=0.80±0.02和CCC=0.89±0.02,踝关节z分数上达到R²=0.57±0.02和CCC=0.72±0.02,与3D-IGA相比。二元筛查用于过量膝关节屈曲的AUROC=0.88,正确识别了83%的受影响儿童,应用罗达和格雷厄姆规则得到7类准确率为43±1%,宏AUROC=0.78±0.01,踝关节预测误差仍然是主要瓶颈。除了横断面筛查外,连续z分数支持跨访问的纵向轨迹跟踪,为监测疾病进展和治疗反应提供定量基础,这在观察性量表中是无法实现的。这些结果证明了基于视频的z分数估计、过量屈曲筛查和纵向轨迹跟踪在资源有限的临床环境中实现可扩展、客观步态评估的可行性。

English abstract

Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
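
The agreement figures above are reported as CCC, which is assumed here to mean Lin's concordance correlation coefficient (the abstract does not expand the acronym). Unlike a plain correlation, it penalizes systematic offsets between video-based and 3D-IGA z-scores. The Python sketch below uses the standard population-moment formula; the z-score values are invented for illustration.

```python
from typing import Sequence


def concordance_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's CCC: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# Toy z-scores (assumed): predictions track the reference with a +1 offset.
reference = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]
print(concordance_ccc(reference, reference))  # 1.0
print(concordance_ccc(reference, shifted))    # about 0.571 (4/7)
```

The shifted series is perfectly correlated with the reference, yet its CCC is only 4/7, which illustrates why a concordance measure suits comparison against a clinical gold standard.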

2605.22079 2026-06-09 cs.CL updated version

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ishigaki-IDS-Bench: 一个用于从BIM信息需求生成信息交付规范的基准

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

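Affiliations ONESTRUCTION Inc.; AWS GenAI Innovation Center

AI summary Presents the Ishigaki-IDS-Bench benchmark for evaluating how well large language models generate industry-standard XML Information Delivery Specifications (IDS); using 166 examples written and validated by BIM/IDS experts, together with content-fidelity evaluation and structural auditing, it shows the limits of current LLMs in generating XML that satisfies the IDS standard and IFC vocabulary constraints.

Comments 7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face

Details
AI-generated Chinese abstract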

大型语言模型(LLMs)被广泛用于生成结构化输出,如JSON、SQL和代码,但公共资源仍然有限,无法有效评估必须同时满足行业标准XML和领域词汇约束的生成能力。本文提出了Ishigaki-IDS-Bench,一个用于评估从BIM信息需求生成信息交付规范(IDS)XML能力的基准。该基准包含166个由BIM/IDS专家编写和验证的示例,这些示例是通过将83个实际场景扩展为日语和英语后生成的,对应黄金IDS文件以及输入格式、语言、轮次设置、IFC版本和建筑领域等元数据。其评估结合了基于IDSAuditTool的可操作性、结构和内容审核,以及与黄金IDS文件的内容一致性评估。在零样本评估中,10个LLM中表现最好的模型在内容一致性上达到65.6%的宏F1分数,但只有27.7%的输出通过内容审核。这些结果表明,当前LLM能够表达部分信息需求作为IDS,但仍难以稳定生成满足IDS标准和IFC词汇约束的XML。Ishigaki-IDS-Bench支持比较评估、失败分析以及开发符合领域标准的受限结构生成方法。我们已将评估脚本和基准数据以CC BY 4.0许可发布在GitHub和Hugging Face上。

English abstract

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
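
A facet-level macro-F1 of the kind described in stage (ii) might be computed along the following lines. This Python sketch is an assumption about the scoring, not the benchmark's evaluation code: it treats each facet type as a set of extracted items, scores F1 per type against the gold IDS, and averages over types. The helper names and the example facets are illustrative.

```python
from typing import Dict, Hashable, Set

Facets = Dict[str, Set[Hashable]]


def f1(pred: Set[Hashable], gold: Set[Hashable]) -> float:
    """Set-based F1; two empty sets count as perfect agreement."""
    if not pred and not gold:
        return 1.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2.0 * precision * recall / (precision + recall)


def macro_f1(pred: Facets, gold: Facets) -> float:
    """Average per-facet-type F1 over every type seen in pred or gold."""
    types = sorted(set(pred) | set(gold))
    scores = [f1(pred.get(t, set()), gold.get(t, set())) for t in types]
    return sum(scores) / len(scores)


# Toy example (assumed): the model finds the entity but misses one property.
gold = {
    "entity": {"IFCWALL"},
    "property": {("Pset_WallCommon", "FireRating"),
                 ("Pset_WallCommon", "IsExternal")},
}
pred = {
    "entity": {"IFCWALL"},
    "property": {("Pset_WallCommon", "FireRating")},
}
print(round(macro_f1(pred, gold), 3))  # 0.833
```

Macro-averaging gives each facet type equal weight, so a model that handles entities well but drops properties is still penalized visibly.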

2605.21854 2026-06-09 cs.CV cs.AI Updated version

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

Zhi Liu

Affiliation * Tianjin University

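AI summary This paper studies cross-paradigm post-training for vision-language-action (VLA) models and presents CrossVLA, which contributes a surrogate flow-matching log-probability estimator for continuous-action backbones, compares LoRA and DoRA as parameter-efficient layers, and analyzes how the denoising loop drives inference latency, reporting a mean gain of +10.4 pp over OpenVLA SFT on the LIBERO suites.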

Comments Workshop draft, 14 pages, 4 figures. Code, ckpts, data: https://github.com/lz-googlefycy/vla-lab

Details
AI Chinese abstract

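Vision-language-action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow matching (e.g. pi-0.5). However, preference alignment via Direct Preference Optimization (DPO), the de facto post-training step for language models, has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. It makes three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO run on continuous-action backbones without probability-flow ODE integration; (ii) a direct comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding that DoRA improves over OpenVLA SFT by a mean of 10.4 percentage points across the four LIBERO suites (600 trials, 3 seeds), with per-suite gains of +20.0 on Object, +11.0 on Long-horizon, +8.0 on Goal, and +2.7 on Spatial, and no seed variance on Object (38/50 on every seed); (iii) an inference-time anatomy showing that the denoising loop accounts for 78.6% of sample_actions latency, while prefix K/V caching in the style of VLA-Cache tops out at a 21% speedup ceiling, and both chunk-level and token-level caching strategies reduce success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, reaching 99.5% k-NN recall@1 (36x over random), which can serve as a downstream initialization. All code, checkpoints, training logs, and reproduction scripts are public at https://github.com/lz-googlefycy/vla-lab.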

English abstract

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz-googlefycy/vla-lab.
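
For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO objective, which only needs log-probabilities of the preferred and rejected actions under the policy and a frozen reference model. It is a minimal sketch and not the CrossVLA code: the function names and the default `beta` are assumptions, and the log-probability inputs stand in for whatever estimator supplies them (exact token log-likelihoods for an autoregressive VLA, or a surrogate estimate for a flow-matching backbone as the abstract describes).

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value; strength of the implicit KL penalty
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected"
    # than the reference model does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return neg_log_sigmoid(beta * margin)


# Policy identical to the reference: zero margin, loss equals log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy raises the chosen action and lowers the rejected one: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the loss depends on the model only through these four scalars, any estimator that ranks actions consistently can be plugged in, which is what makes a surrogate log-probability a plausible substitute for ODE-based likelihood evaluation.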

2604.24199 2026-06-09 cs.SD cs.AI eess.AS eess.SP Updated version

Speech Enhancement Based on Drifting Models

Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

Affiliations * Victoria University of Wellington; Lincoln University; GN Advanced Science

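AI summary This paper proposes DriftSE, a speech enhancement framework based on drifting models that formulates denoising as an equilibrium problem, achieving one-step inference and supporting training on unpaired data while delivering high-quality enhancement.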

Comments 6 pages, 2 figures

Details
AI Chinese abstract

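We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that models denoising as an equilibrium problem. Unlike methods that rely on iterative sampling, DriftSE achieves one-step inference by evolving the pushforward distribution of a mapping function so that it directly matches the clean speech distribution. This evolution is driven by a drifting field, a learned correction vector that guides samples toward high-density regions of the clean distribution. Because it matches distributions rather than paired samples, it naturally supports training on unpaired data. We study the framework in two forms: a direct mapping from the noisy observation, and a stochastic conditional generative model starting from a Gaussian prior. On the VoiceBank-DEMAND benchmark, DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.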

English abstract

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.