arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07515 2026-06-08 cs.CL cs.AI cs.HC math.PR New submission

How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni

Affiliations: Università degli Studi di Firenze

AI summary: On a benchmark of discrete probability problems, LLMs reach an accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones, and they show token bias and vulnerability to misleading prompts.

Abstract

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning, and evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, the reported findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success in advanced mathematical problems.
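
As an illustration of what a counterintuitive discrete probability exercise can look like, the sketch below solves one by exact enumeration. The exercise itself is a hypothetical example chosen for this note and is not taken from the paper's datasets; the helper name `conditional_probability` is likewise an assumption.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, outcomes):
    """Return P(event | given) by exact enumeration over equally likely outcomes."""
    given_outcomes = [o for o in outcomes if given(o)]
    favourable = [o for o in given_outcomes if event(o)]
    return Fraction(len(favourable), len(given_outcomes))


# All 36 equally likely outcomes of rolling two fair six-sided dice.
two_dice = list(product(range(1, 7), repeat=2))

# Hypothetical exercise (not from the paper): given that at least one die
# shows a 6, what is the probability that both dice show a 6?
p_both_sixes = conditional_probability(
    event=lambda o: o == (6, 6),
    given=lambda o: 6 in o,
    outcomes=two_dice,
)
print(p_both_sixes)  # 1/11, not the intuitive 1/6
```

Eleven outcomes contain at least one 6 and only one of them is a double 6, so the heuristic answer of 1/6 is wrong. Problems of this kind are the sort of case where the abstract reports that accuracy falls.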

2606.07514 2026-06-08 cs.CV New submission

UniSHARP: Universal Sharp Monocular View Synthesis

Meixi Song, Dizhe Zhang, Hao Ren, Ruiyang Zhang, Bo Du, Ming-Hsuan Yang, Lu Qi

Affiliations: Insta360 Research Sun Yat-sen University Beihang University Wuhan University University of California, Merced

AI summary: Proposes UniSHARP, which extends SHARP to arbitrary camera systems (including fisheye and panoramic) through a unified omnidirectional latent space and a ray-based Gaussian representation, with implicit alignment in feature and Gaussian spaces; it outperforms existing methods by a large margin on the newly constructed multi-view benchmark.

Comments
Project page: https://insta360-research-team.github.io/Unisharp-website/

Abstract

In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal monocular rendering across a continuum of camera systems, from conventional perspective cameras to wide-field-of-view, fisheye and omnidirectional panoramic settings. To overcome the pinhole-specific assumptions of SHARP, our key idea is to align various images in a unified omnidirectional latent space. Thus, we propose UniSHARP, which performs implicit alignment in both feature and Gaussian spaces. Specifically, Gaussian primitives are arranged along rays and radial distances in a ray-based universal representation, while 2D semantic and 3D spatial features extracted from UniK3D-inspired encoders are jointly decoded to generate the complete Gaussian cloud. To comprehensively evaluate our method, we construct a benchmark covering diverse imaging systems across various scenes. The benchmark is further stratified by field of view (FoV) to enable fine-grained assessment of the universal monocular rendering task. Extensive experiments on the proposed benchmark demonstrate the effectiveness of UniSHARP, outperforming alternative methods by a large margin. The project page can be found at: https://insta360-research-team.github.io/Unisharp-website/

2606.07513 2026-06-08 cs.CL New submission

Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao

Affiliations: Fudan University Johns Hopkins University University of Science and Technology of China

AI summary: Proposes the Agentopia framework, which simulates the social lives of 100 agents over 10 years, trains LLMs with a life reward to improve their social intelligence, and achieves a 15.6% improvement on role-playing benchmarks.

Comments
79 pages, 19 figures

Abstract

Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long-term growth. In this paper, we study long-term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life-long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long-term life simulation in multi-agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define life reward to mirror human well-being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well-being in simulation, and generalizes to downstream role-playing benchmarks with +15.6% improvement.

2606.07508 2026-06-08 cs.CV New submission

Streaming Video Generation with Streaming Force Control

Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv, Shenlong Wang, Huaizu Jiang

Affiliations: Northeastern University Impossible Research University of California, Berkeley University of Illinois Urbana-Champaign

AI summary: Proposes the StreamForce framework, which achieves causal, unified streaming video generation through a unified force representation and a distillation pipeline, supports local and global time-varying force control, runs at up to 16.6 FPS on a single GPU, and reaches state-of-the-art force adherence and motion realism.

Abstract

We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, or rely on non-causal processing, StreamForce is a causal and unified model that responds instantly and coherently to both local and global, time-varying forces. To achieve this, we design a unified force representation as a control signal and develop a distillation pipeline for force-controllable video generation. Our model combines autoregressive efficiency with force responsiveness, sustaining stable photometric and dynamic realism. StreamForce runs at up to 16.6 FPS on a single GPU, achieving state-of-the-art performance in both force adherence and motion realism. Project website: https://neu-vi.github.io/StreamForce/

2606.07506 2026-06-08 cs.RO New submission

Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation

Tuba Girgin, Jose Castelblanco, Gabriel Rodriguez, Emre Girgin, Cagri Kilic

Affiliations: Embry-Riddle Aeronautical University Carnegie Mellon University

AI summary: Proposes a three-level hierarchical reinforcement learning framework that uses pose and interaction-point affordances to guide the navigation and manipulation policies, achieving autonomous object manipulation in simulation and real-world settings.

Comments
This paper is submitted to Wiley Journal of Field Robotics

Abstract

The object manipulation capability of quadruped robots is an open research challenge. While previous studies have focused on low-level policy learning, task execution still relies on expert-designed high-level trajectories. Autonomous selection of both an affordable interaction point on the target object and an affordable robot base pose removes the need for pre-designed trajectories. This study proposes a three-level hierarchical reinforcement learning (RL) framework that utilizes pose affordances to guide the navigation policy, while the navigation policy drives the locomotion policy. In addition, the pedipulation policy is guided by interaction-point affordances, enabling object-centric pose alignment of the quadruped robot and effective end-effector manipulation planning. We train the proposed framework in the IsaacSim ecosystem and evaluate it in both simulation and real-world settings. We investigate the effectiveness of pose affordance across multiple scenarios in simulation, while various object interaction tasks are validated in a real-world setting, forming an object-interaction dataset. The results show that the proposed framework can autonomously identify candidate poses based on their affordance and successfully execute object manipulation tasks in the real world without human guidance.

2606.07503 2026-06-08 cs.CV New submission

Differences in Detection: Explainability Where it Matters

Johannes Theodoridis, Johannes Maucher, Andreas Schilling

Affiliations: University of Tübingen Institute for Applied AI Hochschule der Medien Stuttgart

AI summary: Proposes the DnD method, which uses a matching algorithm to compare two object detection models directly, reveals individual and shared errors, and guides explainability methods toward metric-relevant examples.

Comments
Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2026 - How Do Vision Models Work? (HOW)

Abstract

We propose Differences in Detection (DnD), an intuitive method to compare two object detection models. Based on the same matching algorithm, it complements the standard metrics of mean Average Precision ($mAP$) and TIDE error analysis with the ability to compare two models directly. More specifically, we calculate the intersection of ground truth labels that are recognized by both models, followed by the corresponding difference sets and the complement set of ground truth labels that are missed by both models. The resulting comparison is more direct and intuitive than a comparison of independent summary statistics. It reveals individual and shared mistakes and becomes particularly interesting when combined with error types. In this case, the differences in detection errors can be analyzed naturally in a standard confusion matrix. While valuable in itself, we believe that one of the best applications of DnD is to guide explainability methods such as ODAM towards metric-relevant examples, grounded in structured subsets. The code for our method is available here: https://github.com/JohannesTheo/differences-in-detection
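
The set construction described in this abstract can be sketched in a few lines. This is a minimal illustration, assuming each model's matching step has already produced the set of ground-truth label IDs it recognized; the function name `dnd_partition` and the label IDs are hypothetical and not taken from the authors' repository.

```python
def dnd_partition(ground_truth, matched_a, matched_b):
    """Split ground-truth labels by which of two detectors recognized them.

    matched_a / matched_b are the ground-truth IDs each model matched
    (assumed to come from the same matching algorithm for both models).
    """
    gt = set(ground_truth)
    a = set(matched_a) & gt
    b = set(matched_b) & gt
    return {
        "both": a & b,            # intersection: recognized by both models
        "only_a": a - b,          # difference set: model A only
        "only_b": b - a,          # difference set: model B only
        "neither": gt - (a | b),  # complement: missed by both models
    }


ground_truth = ["gt0", "gt1", "gt2", "gt3", "gt4"]
parts = dnd_partition(
    ground_truth,
    matched_a=["gt0", "gt1", "gt2"],
    matched_b=["gt1", "gt2", "gt3"],
)
for name, ids in parts.items():
    print(name, sorted(ids))
```

The four subsets are disjoint and cover every ground-truth label, so each label can be sent to exactly one bucket for follow-up analysis, for example with an explainability method.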

2606.07502 2026-06-08 cs.CL cs.IR New submission

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China Lenovo Group Limited Wuhan University

AI summary: Finds that LLM text embeddings align with high-frequency tokens, which limits how well they capture semantics; proposes EmbedFilter, which filters the high-frequency subspace of the unembedding matrix to strengthen representations and also reduces dimensionality to speed up retrieval.

Comments
preprint

Abstract

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
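
The abstract does not say how the filtered subspace is built, so the following is only a sketch of the generic operation "remove a subspace with a linear transformation": it projects an embedding onto the orthogonal complement of a few given directions. Treating those directions as unembedding rows of frequent tokens is an assumption made for illustration, and `orthonormalize` and `filter_subspace` are hypothetical names, not the authors' code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def filter_subspace(embedding, directions):
    """Remove the component of `embedding` that lies in span(directions)."""
    filtered = list(embedding)
    for b in orthonormalize(directions):
        c = dot(filtered, b)
        filtered = [fi - c * bi for fi, bi in zip(filtered, b)]
    return filtered


# Assumed stand-ins for unembedding rows of two frequent tokens.
frequent_directions = [[1, 0, 0], [1, 1, 0]]
embedding = [3, 2, 5]
filtered = filter_subspace(embedding, frequent_directions)
print(filtered)
```

After removing k independent directions from a d-dimensional embedding, the result lies in a (d - k)-dimensional subspace, which is one way to read the dimensionality reduction mentioned as a byproduct.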

2606.07500 2026-06-08 cs.LG cs.AI New submission

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

Fatema Siddika, Md Anwar Hossen, Tanwi Mallick, Ali Jannesari

Affiliations: Iowa State University Argonne National Laboratory

AI summary: Proposes the SETA framework, which decomposes parameters into sparse subspaces of task-specific and shared experts and combines adaptive elastic anchoring with routing-aware regularization to address the plasticity-stability dilemma in LLM continual learning, achieving competitive or better performance than existing methods on multiple benchmarks.

Comments
19 pages. arXiv admin note: text overlap with arXiv:2601.17616

Abstract

Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the plasticity-stability conflict through adaptive sparse subspace decomposition into task-specific expert modules. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through adaptive elastic anchoring and a routing-aware regularization that jointly protect shared knowledge at both the weight and routing levels and enable a unified gating network to automatically retrieve the correct expert combination during inference. Extensive experiments across diverse domain-specific benchmarks demonstrate that SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer on LLaMA-2 7B and Qwen3-4B.

2606.07498 2026-06-08 cs.CV New submission

Implicit Data Synthesis for Contrastive Unsupervised Data Augmentation

Patrick Kage, Trevor Hedges, N. Siddharth, Pavlos Andreadis

Affiliations: School of Informatics, The University of Edinburgh Massachusetts Institute of Technology Lincoln Laboratory

AI summary: For scientific observation data that is hard to label, proposes generating contrastive samples by perturbing network weights rather than the data, and validates the performance gains on radar observations of meteors with a SimCLR pipeline.

Comments
11 pages, 3 figures, 2 tables

Abstract

Scientific observations generate large quantities of unlabeled data which is laborious to hand-label, making unsupervised learning techniques valuable for processing datasets. Among these approaches, contrastive learning provides a convenient mechanism for extracting structural representations from unannotated datasets. For natural imagery, the general approach is to use a variety of data-space augmentation methods in order to generate synthetic samples; however, for scientific observations data-space perturbations can fundamentally alter the underlying data. Our proposed method is to generate contrastive samples by perturbing the network weights rather than the underlying data, thus more closely preserving the structure of the data. We demonstrate this technique using a SimCLR-based pipeline applied over radar observations of meteors, and show performance gains under matched protocols.

2606.07496 2026-06-08 cs.LG math.OC New submission

Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

Ming Sun, Kun Yuan

Affiliations: Peking University

AI summary: Proposes the MG-ADSGD algorithm, which combines Nesterov-type primal-dual extrapolation with multi-round fast gossip averaging and couples the gossip depth with the mini-batch size to achieve accelerated convergence and communication efficiency together, attaining the best currently known communication complexity.

Abstract

Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the condition number \(\kappa = L/\mu\) and the network spectral gap \(1-\beta\). Although deterministic decentralized methods can simultaneously achieve accelerated \(\sqrt{\kappa}\) and \(1/\sqrt{1-\beta}\) dependences, no existing stochastic method attains both improvements at once. In this paper, we propose Multi-Gossip Accelerated DSGD (MG-ADSGD), a decentralized stochastic algorithm that combines Nesterov-type primal-dual extrapolation with multi-round fast gossip averaging. The key idea is to couple the gossip depth with the mini-batch size so that additional communication rounds simultaneously improve consensus accuracy and reduce gradient variance. We show that MG-ADSGD achieves the communication complexity \[ \widetilde{\mathcal O}\!\left( \frac{\sigma^2}{\mu n\epsilon}\log\frac{1}{\epsilon} + \sqrt{\frac{\kappa}{1-\beta}}\log\frac{1}{\epsilon} \right), \] where \(\epsilon\) denotes the target accuracy, \(n\) is the number of nodes, and \(\sigma^2\) is the gradient variance. To the best of our knowledge, this bound yields the best currently available communication complexity for decentralized stochastic strongly convex optimization, up to logarithmic factors that are independent of \(\epsilon\).
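
To make "multi-round gossip averaging" concrete, the sketch below runs plain (non-accelerated) gossip on a ring of six nodes. It is an illustration under assumptions, not the MG-ADSGD algorithm: the ring topology, the mixing weights, and the function names are hypothetical, and neither the fast gossip variant, the Nesterov-type extrapolation, nor the coupling with mini-batch size is modeled.

```python
def gossip_round(values, self_weight=0.5):
    """One gossip round on a ring: each node averages with its two neighbours.

    The implied mixing matrix is symmetric and doubly stochastic, so the
    network-wide mean is preserved by every round.
    """
    n = len(values)
    neighbour_weight = (1.0 - self_weight) / 2.0
    return [
        self_weight * values[i]
        + neighbour_weight * values[(i - 1) % n]
        + neighbour_weight * values[(i + 1) % n]
        for i in range(n)
    ]


def multi_gossip(values, rounds):
    """Apply several gossip rounds in a row (the 'gossip depth')."""
    for _ in range(rounds):
        values = gossip_round(values)
    return values


def consensus_error(values):
    """Largest deviation of any node from the network average."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


local = [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]  # assumed local quantities, one per node
for depth in (0, 1, 5, 10):
    print(depth, consensus_error(multi_gossip(local, depth)))
```

More rounds per iteration buy a smaller consensus error at the price of more communication, which is the trade-off that the gossip depth controls in the abstract.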

2606.07495 2026-06-08 cs.LG New submission

Second-Order Path Kernel Interpolation Formulas in Machine Learning

Jin Guo, Roy Y. He, Jean-Michel Morel

Affiliations: City University of Hong Kong

AI summary: Proposes second-order path-kernel interpolation formulas for neural networks, introducing a curvature-weighted term and a noise-coupling term for stochastic gradient descent, and extends them to the momentum case, refining the path-kernel interpretation of predictions.

Abstract

Understanding how training data shape neural network predictions is a central problem in modern learning theory. In 2020, Pedro Domingos proposed an interpolation formula valid for every model learned by deterministic gradient descent. It expresses the model's prediction as an integral, along the optimization path, of a data-dependent kernel that aligns the model's gradients at the test and training data. Such a first-order characterization remains valid for models trained with batch-based stochastic optimization. In this paper, we develop second-order forms of these interpolation formulas. We show that the leading path-kernel interpolation is supplemented by a curvature-weighted interpolation term. For stochastic gradient descent, an additional sampling-induced component appears, coupling the curvature of the prediction with the covariance of mini-batch gradient noise. We also extend the representation to stochastic gradient descent with momentum, where the interpolation structure is preserved but with the weights modified by a memory-related factor. Moreover, we establish a concentration estimate for the terminal prediction, identifying the fluctuation scale around the expected second-order representation. Together, these results provide a refinement of the path-kernel interpretation of neural network prediction.
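
For orientation, the first-order formula that this abstract builds on can be derived in three lines. The notation below (model \(f_w\), per-example loss \(\ell\), targets \(y_i^*\), continuous-time gradient flow) is an assumption chosen for this note; the paper's own notation and its second-order terms are not reproduced here.

```latex
% Gradient flow on the training loss L(w) = \sum_{i=1}^{m} \ell(y_i^*, f_w(x_i)):
\frac{dw}{dt} = -\nabla_w L(w(t)).

% Chain rule for the prediction at a test point x
% (\ell' is the derivative of \ell in its second argument):
\frac{d f_{w(t)}(x)}{dt}
  = \nabla_w f_{w(t)}(x) \cdot \frac{dw}{dt}
  = -\sum_{i=1}^{m} \ell'\big(y_i^*, f_{w(t)}(x_i)\big)\,
     \nabla_w f_{w(t)}(x) \cdot \nabla_w f_{w(t)}(x_i).

% Integrating along the optimization path from t = 0 to t = T:
f_{w(T)}(x) = f_{w(0)}(x)
  - \int_0^T \sum_{i=1}^{m} \ell'\big(y_i^*, f_{w(t)}(x_i)\big)\, K_t(x, x_i)\, dt,
\qquad
K_t(x, x') = \nabla_w f_{w(t)}(x) \cdot \nabla_w f_{w(t)}(x').
```

The kernel \(K_t\) measures how well the model's gradients at the test point and at a training point align, which is the data-dependent kernel the abstract refers to.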

2606.07494 2026-06-08 cs.SD eess.AS New submission

Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

Xuanjun Chen, Yun-Shing Wu, Wei-Chung Lu, Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Affiliations: Graduate Institute of Communication Engineering, National Taiwan University Graduate Institute of Networking and Multimedia, National Taiwan University Department of Information Management, National Taiwan University NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)

AI summary: Proposes Domain-Shift Feature Augmentation (DSFA), which narrows the domain gap between proxy data and in-the-wild data by transforming deterministic feature statistics into stochastic distributions, reaching state-of-the-art performance on the CoSG ExtEval dataset.

Comments
Work in progress

Abstract

Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
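
One plausible reading of "transforming deterministic feature statistics into stochastic distributions" is to treat a feature vector's mean and standard deviation as random variables and re-normalize with sampled values. The sketch below follows that reading only; the Gaussian noise model, the `noise_scale` parameter, and the function name are assumptions, and the paper's actual DSFA formulation may differ.

```python
import random
import statistics


def stochastic_restyle(features, noise_scale=0.1, rng=None):
    """Re-normalize a feature vector with randomly perturbed mean and std.

    Assumed noise model: the new mean and std are drawn from Gaussians
    centred on the observed statistics.
    """
    rng = rng or random.Random()
    mean = statistics.fmean(features)
    std = statistics.pstdev(features) or 1.0  # guard against constant input
    new_mean = rng.gauss(mean, noise_scale * std)
    new_std = abs(rng.gauss(std, noise_scale * std))
    # Standardize with the observed statistics, then rescale with sampled ones.
    return [new_mean + new_std * (x - mean) / std for x in features]


features = [0.5, 1.5, 2.0, 4.0]  # hypothetical feature values
augmented = stochastic_restyle(features, noise_scale=0.1, rng=random.Random(0))
print(augmented)
```

With `noise_scale=0` the function returns the input unchanged up to rounding, so the parameter controls how far the augmented features drift from the proxy data.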

2606.07489 2026-06-08 cs.AI econ.GN q-fin.EC New submission

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

Affiliations: Harvard Business School Perplexity AI

AI summary: Based on Perplexity product data, the study finds that AI agents executing tasks end to end raise autonomous working time from 33 seconds to 26 minutes, cut completion time by 87% and cost by 94%, and broaden the scope and cognitive level of work.

Abstract

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.
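
The reported 87% time reduction can be checked directly from the completion times quoted in the abstract, as can the ratio between the two autonomous-work figures. The helper name is hypothetical; the 94% cost figure depends on cost data that the abstract does not give, so it is not recomputed here.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Completion time on matched tasks, in minutes (from the abstract).
time_reduction = percent_reduction(269, 36)

# Autonomous work per session: 26 minutes (Computer) vs 33 seconds (Search).
autonomy_ratio = (26 * 60) / 33

print(round(time_reduction))  # 87
print(round(autonomy_ratio))  # 47
```

So the 269-to-36-minute change matches the stated 87%, and Computer performs roughly 47 times as much autonomous work per session as Search.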

2606.07488 2026-06-08 cs.LG New submission

CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations

Ryan Missel, Xiajun Jiang, Linwei Wang

Affiliations: Golisano College of Computing and Information Sciences, Rochester Institute of Technology Department of Computer Science, Rowan University The University of Utah

AI summary: Proposes CoMetaPNS, a continual meta-learning framework that uses a Bayesian Gaussian mixture model over a memory buffer to identify data sources, enabling continual learning of personalized neural surrogates without catastrophic forgetting and outperforming baselines in cardiac simulation forecasting.

Abstract

Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training generalizable models. Recent work reframes this by learning the process of personalizing a surrogate using limited subject-specific context data, through few-shot generative modeling with set-conditioned surrogates and meta-learned amortized inference. These methods, however, assume a static and diverse training distribution with known task identifiers. When new data becomes available, they require costly retraining with all prior data to avoid catastrophic forgetting, a phenomenon where the model forgets earlier tasks when trained on new ones. This is a major limitation in clinical settings, where data that is often unlabeled arrives sequentially and full retraining is infeasible. This paper presents a new continual meta-learning framework to achieve personalized neural surrogates able to not only continually integrate information but also identify whether incoming data stems from a known or unknown dynamics source. By leveraging a continual Bayesian Gaussian Mixture Model over a memory buffer, our framework can infer the identifiers and relationships of data over time, as required for effective meta-learning. Empirical results on synthetic cardiac data demonstrate superior simulation forecasting, computational scalability, and resilience to catastrophic forgetting compared to existing baselines.

2606.07483 2026-06-08 cs.LG stat.ML New submission

Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach

Lei Huang

Affiliations: MIT Sloan School of Management

AI summary: Proposes the CascadeNet framework, which estimates the Jacobian of the one-step transition function with debiasing and recovers hidden influence networks without specifying a diffusion model, outperforming existing methods on simulated and COVID-19 transmission data.

Abstract

Many important outcomes unfold as dynamic cascades, including product adoption, disease spread, financial distress, and information diffusion. A central challenge is to recover the hidden influence network behind these cascades. Existing methods typically assume a specific diffusion model, and their performance degrades substantially when that assumption is misspecified. We propose CascadeNet, a Jacobian-based machine learning framework for network recovery that does not require specifying a diffusion mechanism. The key idea is that the underlying influence structure can be characterized by the Jacobian of the one-step transition function. CascadeNet first constructs a flexible estimator of the transition function, and further applies Neyman-orthogonal debiasing via the Riesz representer, so that the debiased Jacobian is $\sqrt{n}$-consistent and asymptotically normal, enabling formal inference on the network structure. We validate CascadeNet in both a simulation exercise and a real-world empirical application. In simulations, where the data-generating process is known, CascadeNet achieves the highest network recovery accuracy across nine common data-generating processes. In an empirical application to COVID-19 transmission across Spain's 52 provinces, CascadeNet recovers transmission networks that are significantly correlated with the true inter-province mobility network, whereas networks recovered by baseline methods show no significant alignment with the ground truth.
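
The key idea, that the Jacobian of the one-step transition function exposes the influence structure, can be illustrated on a toy system. In this sketch the transition function is known exactly and differentiated numerically; the tanh dynamics, the weight matrix, and the threshold are assumptions for illustration, and CascadeNet's learned estimator and Neyman-orthogonal debiasing step are not modeled.

```python
import math


def transition(state, weights):
    """Hypothetical one-step transition: node i's next activity is a squashed
    weighted sum of the current activities (weights[i][j] = influence of j on i)."""
    return [
        math.tanh(sum(w_ij * s_j for w_ij, s_j in zip(row, state)))
        for row in weights
    ]


def jacobian(f, state, eps=1e-6):
    """Central finite-difference Jacobian: jac[i][j] = d f_i / d state_j."""
    n_out, n_in = len(f(state)), len(state)
    jac = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        up, down = list(state), list(state)
        up[j] += eps
        down[j] -= eps
        f_up, f_down = f(up), f(down)
        for i in range(n_out):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return jac


# Assumed ground-truth network: node 0 influences node 1, node 1 influences node 2.
true_weights = [
    [0.5, 0.0, 0.0],
    [0.8, 0.5, 0.0],
    [0.0, 0.7, 0.5],
]
state = [0.2, 0.1, 0.3]
J = jacobian(lambda s: transition(s, true_weights), state)

# An off-diagonal entry J[i][j] that is clearly non-zero signals an edge j -> i.
recovered_edges = {
    (j, i) for i in range(3) for j in range(3) if i != j and abs(J[i][j]) > 1e-3
}
print(sorted(recovered_edges))
```

The recovered edge set coincides with the non-zero off-diagonal weights, without the procedure ever being told which diffusion mechanism generated the dynamics.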

2606.07481 2026-06-08 cs.LG New submission

Drifting Models for Surrogate Flow Modeling

Chris R. Jung, Markus Dörr, Natalie Jüngling, Jennifer Niessner, Adam T. Müller, Nicolaj C. Stache

Affiliations: Center for Machine Learning (ZML) Institute for Flow in Additively Manufactured Porous Structures (ISAPS) Heilbronn University of Applied Sciences

AI summary: Proposes a conditional drifting framework that drifts in a VAE latent space and uses label-aware masking to align samples with boundary conditions, achieving high-quality single-pass generation two orders of magnitude faster than iterative diffusion.

Comments
Accepted to the 2nd International Symposium AI and Fluid Mechanics 2026

Abstract

While Computational Fluid Dynamics (CFD) provides high-fidelity flow fields for optimizing indoor environments, its computational cost limits rapid exploration. To address this, generative surrogates offer better distribution modeling than deterministic networks, but their iterative sampling is slow. To enable high-quality, single-pass generation, we adapt the novel generative drifting framework to fluid mechanics. We introduce a conditional architecture that performs drifting in a learned VAE latent space and uses label-aware masking to align generated samples with their boundary conditions. Our label-conditioned model matches iterative diffusion in accuracy and flow consistency while running two orders of magnitude faster. Additionally, we propose a spatial-conditioning variant that establishes a promising path towards generalization to unseen geometries. Ultimately, conditional drifting serves as a highly efficient alternative to diffusion-based approaches, unlocking real-time CFD surrogates where inference speed is critical.

2606.07479 2026-06-08 cs.CL cs.AI 新提交

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

基于监督与基于演示的上下文学习在多词表达分类中的比较

Sercan Karakaş, Yusuf Şimşek

发表机构 * University of Chicago Fırat University

AI总结 研究土耳其语多词表达分类,对比监督基线(BERTurk)与指令微调LLM在零样本、单样本和少样本提示下的表现,发现提示敏感性和演示偏差影响显著。

详情
Comments
Accepted to ACL SRW 2026
AI中文摘要

土耳其语习语性轻动词结构(LVC)对多词表达处理具有挑战性,因为它们通常与完全字面义的动词-宾语组合共享相同表面形式,同时作为一个部分习语性谓词发挥作用。我们将土耳其语LVC检测定义为二元分类任务(字面义 vs. 习语义),并在手动创建的受控集(N=147)上评估,该集合包含匹配的负例:域外随机句子和域内字面义控制(NLVC),以及LVC正例。我们比较了监督土耳其语编码器基线(带有分类头的BERTurk)与来自不同家族的三个指令微调LLM,在零样本、单样本和少样本提示下的表现,并分析演示如何改变错误分布。在零样本情况下,LLM在负例上表现良好,但LVC召回率非常低。单样本提示显著提高了LVC检测,但可能引发强烈的、模型特定的偏差,导致模型过度预测或欠预测LVC。更丰富的少样本提示改善了校准,并为GPT-OSS-20B和Qwen 2.5-14B带来了稳健的整体性能。总体而言,结果突显了土耳其语元语言分类中的显著提示敏感性:监督基线仍然具有竞争力,而提示LLM在精心构建的演示下可以在LVC上匹配或超越它。

英文摘要

Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

2606.07475 2026-06-08 cs.LG cs.AI 新提交

Graph Neural Network leveraging Higher-order Class Label Connectivity for Heterophilous Graphs

利用高阶类标签连通性的图神经网络用于异配图

Takuto Takahashi, Itsuki Nakayama, Takahiro Mitani, Ryosuke Kikuchi, Yuya Sasaki, Makoto Onizuka

发表机构 * The University of Osaka

AI总结 针对异配图中节点分类性能受限问题,提出标签上下文分类器(LCC),通过四种游走生成标签上下文嵌入捕获高阶类标签连通性,并可与任意GNN自适应集成,实验表明优于现有方法。

详情
AI中文摘要

图神经网络(GNN)中的节点分类已广泛应用于图分析的各个领域。在同配图中,具有相同类标签的节点倾向于连接,GNN能实现高精度节点分类。然而,在异配图中,不同类标签的节点更可能连接,其性能仍然有限。特别是,当前基于图卷积网络的GNN无法捕获高阶类标签连通性,而这在真实世界的异配图中经常出现。为了解决这个问题,我们提出了一种新颖的分类器——标签上下文分类器(LCC),旨在捕获有向图中的高阶类标签连通性。LCC通过利用四种不同类型的游走生成的标签上下文嵌入来估计目标节点的类标签。此外,我们的方法允许通过自适应学习LCC和任意GNN的重要性来集成它们。实验结果表明,与LCC集成的GNN优于最先进的方法,并且标签上下文嵌入提高了异配有向图中的节点分类性能。

英文摘要

Node classification in graph neural networks (GNNs) has been widely applied in various fields of graph analysis. GNNs achieve high-accuracy node classification in homophilous graphs, where nodes with the same class label tend to be connected. However, their performance remains limited in heterophilous graphs, where nodes with different class labels are more likely to be connected. In particular, current GNNs derived from graph convolutional networks cannot capture higher-order class label connectivity, which is frequently observed in real-world heterophilous graphs. To address this issue, we propose a novel classifier, Label Context Classifier (LCC), designed to capture higher-order class label connectivity in directed graphs. LCC estimates the class label of a target node by leveraging label context embeddings that are generated through four distinct types of walks. In addition, our approach allows the integration of LCC and any GNN by adaptively learning their importance. Experimental results demonstrate that GNNs integrated with LCC outperform SOTA methods and the label context embeddings improve the node classification performance in heterophilous directed graphs.

2606.07474 2026-06-08 cs.LG 新提交

Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation

无监督持续聚类:通过前向-后向知识蒸馏

Mohammadreza Sadeghi, Sareh Soleimani, Zihan Wang, Narges Armanfard

发表机构 * Department of Electrical and Computer Engineering, McGill University Mila – Quebec AI Institute

AI总结 提出无监督持续聚类(UCC)问题,并设计前向-后向知识蒸馏方法(FBCC),通过持续教师网络和轻量任务学生网络的双阶段蒸馏,在不存储旧数据的情况下保留聚类结构,显著减少灾难性遗忘。

详情
Comments
Accepted at ECML PKDD 2026 (Research Track). arXiv admin note: substantial text overlap with arXiv:2405.19234
AI中文摘要

无监督持续学习(UCL)旨在使神经网络能够在没有标签或无法访问过去数据的情况下学习连续任务。该设置中的一个主要挑战是灾难性遗忘,即模型在学习新任务时会忘记先前学过的任务。由于缺乏指导学习和记忆保留的标签,这一挑战在UCL中被放大。现有的缓解策略,如知识蒸馏和重放缓冲区,常常引发内存和隐私问题。此外,当前的UCL方法大多忽略了聚类特定的目标。为了填补这一空白,我们引入了无监督持续聚类(UCC),并提出了用于持续聚类的前向-后向知识蒸馏(FBCC)。FBCC采用一个带有聚类投影仪的持续教师网络和轻量级任务特定学生网络。通过双阶段的前向-后向蒸馏过程,教师在学习新聚类的同时保留先前发现的聚类结构,而无需存储过去的数据。FBCC代表了UCC的开创性方法,展示了在连续任务中改进的聚类性能。在四个基准数据集上的实验表明,FBCC在聚类准确性上始终优于现有的持续学习基线,同时显著减少了灾难性遗忘。

英文摘要

Unsupervised Continual Learning (UCL) aims to enable neural networks to learn sequential tasks without labels or access to past data. A major challenge in this setting is Catastrophic Forgetting, where models forget previously learned tasks upon learning new ones. This challenge is amplified in UCL due to the absence of labels to guide learning and memory retention. Existing mitigation strategies, such as knowledge distillation and replay buffers, often raise memory and privacy concerns. Moreover, current UCL methods largely overlook clustering-specific objectives. To fill this gap, we introduce Unsupervised Continual Clustering (UCC) and propose Forward-Backward Knowledge Distillation for Continual Clustering (FBCC). FBCC employs a continual teacher network with a clustering projector and lightweight task-specific students. Through a dual-phase forward-backward distillation process, the teacher learns new clusters while preserving previously discovered cluster structure without storing past data. FBCC represents a pioneering approach to UCC, demonstrating improved clustering performance across sequential tasks. Experiments on four benchmark datasets demonstrate that FBCC consistently outperforms existing continual learning baselines in clustering accuracy while significantly reducing catastrophic forgetting.

2606.07473 2026-06-08 cs.SD cs.AI 新提交

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper 幻觉检测与缓解:基于隐藏表示引导和稀疏自编码器

Georgii Aparin, Vadim Popov, Tasnima Sadekova, Assel Yermekova

发表机构 * AI Foundation and Algorithm Lab National University of Science and Technology MISIS National Research University Higher School of Economics

AI总结 通过分析Whisper内部表示,提出基于稀疏自编码器的引导策略,将非语音测试集上的幻觉率从72.63%降至14.11%(small模型),接近微调方法性能。

详情
AI中文摘要

Whisper是一种广泛采用的ASR模型,已知存在幻觉问题——即对非语音音频生成与输入完全无关的连贯转录。我们研究了是否可以通过Whisper的内部表示来检测和缓解幻觉。我们提取音频编码器激活,并评估两种表示空间:原始Whisper激活和稀疏自编码器(SAE)潜在变量。我们表明,两个空间都编码了线性可分的幻觉相关信息,判别能力集中在稀疏特征子集中,并向更深编码器层增强。我们提出了两种引导策略:激活空间引导和SAE潜在空间引导。基于SAE的引导将完整非语音测试集上的幻觉率从72.63%降至14.11%(Whisper small),从86.88%降至27.33%(Whisper large-v3),同时在语音数据上WER退化很小,接近基于微调方法的性能。

英文摘要

Whisper, a widely adopted ASR model, is known to suffer from hallucinations: coherent transcriptions generated for non-speech audio that are entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
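
The abstract does not spell out how the steering is performed, so the following is only a generic sketch of one common form of activation-space steering: estimate a direction as the difference between class means and remove the component of an activation along it. The function names, the `alpha` strength parameter, and the two-dimensional toy activations are assumptions and are not taken from the paper.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steering_direction(halluc_acts, clean_acts):
    """Unit vector pointing from the clean mean to the hallucination mean."""
    diff = [h - c for h, c in zip(mean(halluc_acts), mean(clean_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def steer(activation, direction, alpha=1.0):
    """Subtract alpha times the activation's component along the direction."""
    coeff = alpha * dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical): the two classes differ along the first axis.
halluc = [[3.0, 1.0], [5.0, 1.0]]
clean = [[1.0, 1.0], [1.0, 1.0]]

direction = steering_direction(halluc, clean)
steered = steer([2.0, 5.0], direction)
print(direction, steered)
```

The same projection could in principle be applied to SAE latents instead of raw activations; whether that matches the paper's latent-space variant is not something the abstract confirms.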

2606.07464 2026-06-08 cs.RO cs.AI cs.CV 新提交

Planning-aligned Token Compression for Long-Context Autonomous Driving

面向长上下文自动驾驶的规划对齐令牌压缩

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

发表机构 * NVIDIA Research School of Computing and Data Science, The University of Hong Kong

AI总结 提出COMPACT-VA框架,基于条件VQ-VAE将长上下文压缩为有界表示,通过规划对齐实现决策关键信息保留,在动态场景中成功率提升超6%,速度提升3.3倍。

详情
Comments
9 pages
AI中文摘要

整体视觉-动作模型代表了自动驾驶中的一种新兴范式。然而,这种架构在编码用于复杂交互的扩展时间上下文时,会产生迅速超过实时计算预算的令牌序列。虽然线性变换器和外部记忆等方法试图使上下文轻量化,但令牌压缩与架构最为兼容,因为它不需要修改主干网络。然而,现有的压缩采用基于规则的启发式方法(如时间衰减),与规划解耦,存在丢失决策关键信息的风险。我们提出COMPACT-VA,一种基于条件VQ-VAE的规划对齐工作记忆框架,将扩展上下文压缩为有界表示。压缩条件同时基于历史轨迹和学习的规划意图,其中后验编码器在训练期间从未来轨迹中提炼规划意图,而先验编码器学习从压缩观测中预测它。压缩记忆与预测的潜在变量拼接,输入策略进行端到端优化,从而在保留决策关键信息的情况下进行规划。我们在历史上下文对行为正确性(如停车、让行或前行)最关键的高信号动态场景中进行评估,并相应地设计了行为指标。在可比的令牌预算下,我们在成功率上实现了超过6%的提升(68.3%),且各项指标一致提升。消融实验验证了规划对齐耦合的有效性。闭环评估证实,与未压缩处理相比,COMPACT-VA在保持一般驾驶性能的同时实现了3.3倍的速度提升和2.7倍的内存减少。

英文摘要

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve a >6% improvement in success rate (68.3%) with consistent gains across metrics. Ablations validate the effectiveness of planning-aligned coupling. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and a 2.7× memory reduction over uncompressed processing.

2606.07462 2026-06-08 cs.AI 新提交

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

像真正的研究者一样行动:一套评估前沿LLM和研究生命周期中智能体框架的基准测试套件

Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song, Lingyu Zhang, Lanxuan Xue, Luodi Chen, Zepeng Xin, Kaiyu Li, Xiangyong Cao

发表机构 * Xi’an Jiaotong University Xidian University

AI总结 提出AARR基准系列,通过AARRI-Bench评估智能体在细粒度研究场景中模拟人类研究者的专业性、全面性和细微推理能力,发现最佳配置成功率仅68.3%。

详情
AI中文摘要

随着基础模型的进步和智能体框架日益复杂,智能体在复杂、长期编码任务甚至自主实验执行中展现出卓越能力。尽管它们从研究助手演变为自主研究智能体,但这些系统在领域敏感性、研究伦理和细微科学判断方面仍存在显著局限。因此,前沿智能体仍无法完全取代人类研究者。为弥合这一差距,我们构思了AARR(像真正的研究者一样行动)基准系列。与主要评估宏观执行能力的现有基准不同,AARR关注智能体能否在细粒度研究场景中模拟人类研究者的专业性、全面性和细微推理。在这项工作中,我们提出了AARRI-Bench(像真正的研究实习生一样行动),这是该系列的第一个基准。我们在前沿模型和智能体系统上进行了大量实验,揭示即使是最佳配置(Mini-SWE-Agent与Claude Opus 4.7)也仅达到68.3%的成功率,经常忽略真实人类研究者显而易见的微妙但关键细节。我们的结果表明,开发类似研究者的AI需要进一步探索研究行为,而不仅仅是复杂的框架。我们的数据发布在 https://github.com/AARR-bench/AARRI-bench。

英文摘要

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only a 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.

2606.07457 2026-06-08 cs.LG eess.SP stat.ML 新提交

Time series Foundation Models based on Physics-Informed Synthetic Histories for Cold-Start Photovoltaic Forecasting

基于物理信息合成历史的时间序列基础模型用于冷启动光伏预测

Lorenzo Longarini, Alessandro Rongoni, Simone Silenzi, Emanuele Frontoni, Riccardo Rosati

发表机构 * European Commission

AI总结 针对光伏电站冷启动预测问题,提出利用物理信息合成历史数据,结合时间序列基础模型进行零样本预测,在440个站点上实现1.7-2倍性能提升。

详情
Comments
To be published in the 2nd ICML Workshop on Foundation Models for Structured Data
AI中文摘要

在并网调试时,光伏运营商必须在目标站点观测数据可用之前预测发电量,这限制了标准监督预测器的直接使用。针对这种冷启动场景,我们提出了一种零样本流程,通过电站元数据和气象协变量生成合成发电历史,使时间序列基础模型(TSFMs)能够通过推理时条件化进行预测。我们在严格的冷启动基线、真实反馈和自预测反馈策略下,将五种TSFM与经典基线进行了基准测试。评估涵盖了四个数据集中$440$个光伏站点以及多种气候区域。协变量感知的基础模型比基线性能提升约$1.7-2$倍:TabPFN-TS在真实反馈下实现了最低误差(MAE $0.514$, RMSE $0.721$ $kWh$ ${kWp}^{-1}$ ${d}^{-1}$),而Chronos-2在自预测反馈下最为鲁棒。性能对合成历史来源基本不敏感,表明准确性更多取决于合理的时序上下文可用性,而非特定生成器。

英文摘要

At commissioning time, Photovoltaic (PV) operators must forecast production before target-site observations are available, limiting the direct use of standard supervised forecasters. This cold-start setting is addressed with a zero-shot pipeline that generates a synthetic production history from plant metadata and meteorological covariates, enabling time-series foundation models (TSFMs) to forecast through inference-time conditioning. Five TSFMs are benchmarked against classical baselines under strict Cold-Start Baseline, Real Feedback, and Self-Forecast Feedback strategies. The evaluation spans $440$ PV sites across four datasets and diverse climate regimes. Covariate-aware foundation models outperform baselines by approximately $1.7$-$2\times$: TabPFN-TS achieves the lowest error under Real Feedback (MAE $0.514$, RMSE $0.721$ $\mathrm{kWh}\,\mathrm{kWp}^{-1}\,\mathrm{d}^{-1}$), while Chronos-2 is most robust under Self-Forecast Feedback. Performance is largely insensitive to the synthetic-history source, indicating that accuracy is driven more by the availability of plausible temporal context than by the specific generator.
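
The errors above are reported in kWh per kWp per day, which suggests daily energy normalized by installed capacity. The sketch below computes MAE and RMSE on such capacity-normalized values using the standard definitions. The normalization step, the helper names, and the sample numbers are assumptions for illustration and do not reproduce the paper's evaluation protocol.

```python
import math


def specific_yield(energy_kwh, capacity_kwp):
    """Daily energy (kWh) divided by installed capacity (kWp)."""
    return [e / capacity_kwp for e in energy_kwh]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical three-day example for a 10 kWp plant.
observed = specific_yield([40.0, 30.0, 50.0], capacity_kwp=10.0)
forecast = specific_yield([45.0, 30.0, 40.0], capacity_kwp=10.0)

mae_value = mae(observed, forecast)
rmse_value = rmse(observed, forecast)
print(mae_value, rmse_value)
```

Normalizing by capacity makes errors comparable across plants of different sizes, which matters when aggregating over hundreds of sites.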

2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG 新提交

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI: 基于稀疏自编码器的文本条件视觉表示编辑以改进视觉-语言对齐

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany Department of Language Science and Technology, Saarland University, Saarbrücken, Germany

AI总结 提出TEVI框架,利用稀疏自编码器解耦图像嵌入,并通过文本条件掩码模块选择性重构嵌入,以改善CLIP等视觉-语言模型的图像-文本对齐,在多个检索基准上取得提升。

详情
Comments
20 pages, 13 figures, 14 tables
AI中文摘要

视觉-语言模型(如CLIP)由于共享图像-文本嵌入空间,对多种任务非常有用。尽管如此,图像和文本嵌入往往对齐不佳,影响下游性能。最近的研究表明,这可以归因于信息不平衡:图像包含的信息比其标题描述的更多。在这项工作中,我们提出了TEVI,一个利用标题作为信号来决定从图像嵌入中保留哪些信息的框架。具体来说,我们使用稀疏自编码器来解耦图像嵌入,并训练一个掩码模块,根据给定的标题选择性重构嵌入。在具有合成标题的受控设置中,我们展示了TEVI在保留标题描述的属性同时丢弃其他属性方面的有效性。通过将TEVI应用于在自然图像上训练的CLIP模型,我们进一步在粗粒度短标题(MS COCO, Flickr)和细粒度长标题(IIW, DOCCI)基准上实现了改进的检索性能,在更丰富的标题上获得更强的增益,并在RoCOCO基准上提高了鲁棒性。

英文摘要

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

2606.07441 2026-06-08 cs.CL 新提交

Sycophantic Praise: Evaluating Excessive Praise in Language Models

谄媚式赞美:评估语言模型中的过度赞美

Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang

发表机构 * University of Cincinnati Carnegie Mellon University Texas A&M University

AI总结 提出参数化框架衡量赞美是否过度,发现谄媚式赞美在社交和解释性领域远多于客观推理领域,且现有方法无法可靠测量。

详情
AI中文摘要

语言模型中的谄媚通常被研究为过度同意或验证,而明确的赞美和奉承相对较少受到关注。我们认为谄媚式赞美是一个独特的对齐问题,无法使用当前方法可靠测量。我们引入了一个参数化框架,衡量赞美相对于贡献质量和预期用户能力是否过度。我们表明,在与人类标注的一致性上,我们的框架显著优于通用LLM评判者,并且谄媚式赞美在社交和解释性领域发生的频率远高于客观推理环境。这些发现共同将赞美校准定位为一个独特的对齐挑战。

英文摘要

Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment problem that cannot be reliably measured using current methods. We introduce a parameterized framework that measures whether praise is excessive relative to contribution quality and expected user ability. We show that our framework substantially outperforms generic LLM judges in agreement with human annotations, and that sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings. Together, these findings position praise calibration as a distinct alignment challenge.

2606.07437 2026-06-08 cs.RO cs.AI cs.HC cs.SE cs.SY eess.SY 新提交

Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability

重新构想自动驾驶时代的ISO 26262:通过可迁移性和可预测性增强可控性

Chaitanya Shinde, Hadi Hajieghrary, Paul Schmitt, Adam Shoemaker, Bodo Seifert, Steve Kenner

发表机构 * Torc Robotics, Inc. Reynolds & Moore Critical Systems Analysis, LLC

AI总结 针对自动驾驶汽车缺乏人类驾驶员的问题,将ISO 26262中的可控性分解为可迁移性和可预测性两个可审计维度,并给出量化框架,以支持SAE L4/L5系统的功能安全论证。

详情
AI中文摘要

ISO 26262标准通过基于严重性、暴露度和可控性的风险评估来定义道路车辆的功能安全,其基础是人类驾驶车辆范式。在自动驾驶汽车(AV)的背景下,缺乏人类驾驶员需要重新审视这些原则。本文将可控性占位符分解为ISO 26262的两个可审计证据维度,引入了两个可测量的子概念:可迁移性和可预测性。可迁移性扩展了可控性,以捕捉AV系统将控制权移交给专用后备安全机制的能力,而可预测性则捕捉外部主体预测AV行为的难易程度。可预测性基于人机交互启发原则进行形式化定义,并提供了量化它的数学框架。引入了设计能力与可实现能力之间的差距,以区分架构后备声明与场景条件相关的可实现后备能力。所提出的度量与ISO 26262和ISO/PAS 21448(SOTIF)保持一致,使后备和交互声明在ODD切片上可证伪和可追溯。这些维度补充而非替代现有标准,这些增强保留了ISO 26262的结构,同时将其适用性扩展到在SAE L4和L5级别运行的无驾驶员自动化系统。

英文摘要

The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems' ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.

2606.07436 2026-06-08 cs.CV 新提交

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Skill-3D:面向智能体3D空间推理的场景感知技能进化

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

发表机构 * Zhejiang University University of Technology Sydney OPPO Research Institute

AI总结 提出Skill-3D框架,通过场景记忆和技能库的协同进化,使智能体根据场景自适应选择工具,显著提升3D空间推理中工具使用的正确性和充分性。

详情
AI中文摘要

本文探索智能体3D空间理解,即MLLM智能体通过工具使用进行3D推理。现有方法在3D场景下常误用工具并表现出有偏的工具偏好,使得智能体范式相比非智能体策略仅有边际提升。我们揭示3D空间推理任务在不同场景下具有异质性,而这些智能体对所有场景采用统一的工具使用策略,而非根据具体场景和任务选择工具。为解决此问题,我们提出Skill-3D,一种学习自进化场景感知技能的框架。具体而言,Skill-3D识别任务场景并将智能体的工具使用轨迹记录到场景记忆中,其中来自相似场景的成功轨迹被聚合和蒸馏成可复用的场景感知技能,失败的轨迹作为教训附加到该技能上。在训练过程中,一旦相似场景再次出现,注入相应技能以引导智能体,产生新轨迹,其成功和失败进一步优化技能,形成记忆和技能库共同进化的循环。实验表明,Skill-3D显著提升了3D空间推理中的工具利用率(在VSI-Bench上从39%提升至78%),推动智能体正确且充分地使用工具。例如,在MMSI-Bench上,它将Gemini-3-Flash提升了67%。此外,我们在技能引导的轨迹上进行智能体后训练,使Qwen3-VL-8B在VSI-Bench上提升了43%。

英文摘要

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.

2606.07433 2026-06-08 cs.CV cs.AI cs.MM 新提交

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Watch, Remember, Reason: 基于多模态大语言模型的人类视角视频理解

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

发表机构 * Peking University Wuhan University Shanghai Jiao Tong University Nanyang Technological University CASIA University of Tokyo University of Liverpool Zhejiang University National University of Singapore UC Merced

AI总结 提出人类视角下视频理解的三个功能能力(观看、记忆、推理),构建统一框架分析视频MLLM的感知、记忆、推理和预测,并总结挑战、方法、应用及未来方向。

详情
AI中文摘要

视频理解正被多模态大语言模型(MLLMs)迅速变革,研究从短视频片段转向长视频、多模态和知识密集型视频场景。这些场景要求模型在有限计算预算下处理稀疏证据、长程依赖、多模态对齐和可靠推理。本文从人类视角出发,围绕三个功能能力——观看、记忆和推理——组织基于LLM的视频理解。该视角并非将视频任务视为孤立基准,而是提供一个统一结构,用于分析视频MLLM如何获取证据、保持上下文并产生有依据的输出。我们引入一个公式,通过感知表示、记忆状态、推理轨迹和最终预测来表征视频理解系统。基于此公式,我们识别出时空感知、高效长视频处理、记忆建模、流式理解和忠实推理中的挑战。代表性方法按其视频MLLM系统中的角色进行组织:观看涵盖细粒度、全面、音视频和高效感知;记忆包括离线记忆和流式记忆;推理涵盖纯文本推理和视频辅助推理。我们进一步考察了应用领域,如自我中心、体育、教学、医学和叙事视频,并涵盖了跨任务类型、监督格式、模态和能力维度的训练数据集和评估基准。最后,我们概述了可扩展、记忆感知和有依据的视频智能的开放问题和未来方向。相关工作将在 https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding 持续追踪。

英文摘要

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.

2606.07426 2026-06-08 cs.LG 新提交

Discovering Multiscale Deep Formulas in Complex Systems via Neural-Guided Lambda Calculus

通过神经引导的Lambda演算发现复杂系统中的多尺度深层公式

Hanqiao Yu, Shusen Yang, Xuebin Ren, Cong Zhao

发表机构 * Xi’an Jiaotong University

AI总结 提出Deflex方法,结合可分解深度能量模型和Lambda演算符号回归,自动从复杂系统中提取多尺度公式,效率最高提升7倍。

详情
Comments
35 pages, 5 figures; Supplementary Information available as an ancillary file (79 pages)
AI中文摘要

科学中的一个基本问题是以简洁数学公式的形式识别复杂系统的潜在模式。当前基于人工智能的方法在单尺度系统中表现出色,但在识别多尺度复杂系统中的尺度特定公式方面仍有限。我们提出Deflex,一种端到端的人工智能方法,可从复杂系统中自动提取可能具有不同形式的多尺度公式,包括不变量和分布。Deflex由两个子系统组成,分别称为Deflexformer和Deflexpressor。Deflexpressor是一个用于高阶公式的lambda演算符号回归模型。Deflexformer是一个可分解的深度能量模型,用于学习跨尺度的统一表示。Deflexpressor生成合成数据以预训练Deflexformer,然后通过解耦多尺度潜在关系来指导公式发现。在六个具有不同行为的代表性复杂系统中,Deflex实现了比最先进方法高达7倍的效率提升,同时实现了自动多尺度发现。我们的工作可能成为跨学科科学发现的有用工具。

英文摘要

A fundamental problem in science is identifying underlying patterns of complex systems in the form of concise mathematical formulas. Current Artificial Intelligence (AI)-based methods have shown strong performance in single-scale systems, yet remain limited in identifying scale-specific formulas in multiscale complex systems. We present Deflex, an end-to-end AI method to automatically extract multiscale formulas with potentially different forms, including invariants and distributions, from complex systems. Deflex consists of two subsystems named Deflexformer and Deflexpressor. Deflexpressor is a lambda-calculus symbolic regression model for higher-order formulas. Deflexformer is a decomposable deep energy model for learning unified representations across scales. Deflexpressor generates synthetic data to pre-train Deflexformer, which then guides formula discovery by decoupling multiscale latent relationships. Across six representative complex systems with diverse behaviors, Deflex achieves up to 7-fold higher efficiency than the state-of-the-art methods while enabling automated multiscale discovery. Our work could be a useful tool for scientific discovery across disciplines.

2606.07424 2026-06-08 cs.RO 新提交

Rapid co-design of Buoyancy-assisted robots for Challenging Locomotion using Gaussian Evolutionary Specialists

基于高斯进化专家的浮力辅助机器人快速协同设计以应对挑战性运动

Ankit Sinha, Nitish Sontakke, Dennis Hong, Yusuke Tanaka, Sehoon Ha

发表机构 * Georgia Institute of Technology University of California, Los Angeles ETH Zurich

AI总结 提出高斯进化专家(GES)框架,通过解耦设计空间划分与策略学习,在浮力辅助轻量腿单元(BALLU)上实现5-25%性能提升,并缩短37%设计优化时间。

详情
Comments
Submitted to RA-L
AI中文摘要

设计高性能腿式机器人需要联合优化形态和控制。无模型强化学习(RL)为开发鲁棒控制器提供了模型预测控制的替代方案,无需明确指定机器人动力学。因此,我们看到了使用RL训练控制器和评估设计以优化机器人形态。虽然RL在运动方面取得了成功,但由于重复的策略训练,将其用于协同设计内循环成本高昂。基于形态的条件通用策略提供了一种有前景的替代方案,但遭受行为多样性崩溃,收敛到单一策略,在不同设计上表现次优。另一方面,端到端混合专家(MoE)架构因其表示崩溃而失败。我们提出高斯进化专家(GES),一个将设计空间划分与策略学习解耦以显式捕获多样行为的框架。GES将专家策略分配给演化的高斯区域,并通过训练、探测和领土扩展迭代优化它们。生成的专家被集成到设计采样循环中,用直接评估替代昂贵的重新训练。在浮力辅助轻量腿单元(BALLU)上测试时,GES发现的设计比朴素通用策略性能高5-25%。在硬件上,GES优化设计克服了24厘米高的障碍——比基线BALLU设计提升3倍。此外,GES将设计优化时间缩短了37%。

英文摘要

Designing high-performance legged robots requires jointly optimizing morphology and control. Model-free Reinforcement Learning (RL) offers an alternative to model-predictive control for developing robust controllers without explicitly specifying robot dynamics. As a result, RL has been used both to train controllers and to evaluate candidate designs in robot morphology optimization. While RL has shown success in locomotion, using it in the co-design inner loop is expensive due to repeated policy training. Universal policies conditioned on morphology offer a promising alternative, but suffer from behavioral diversity collapse, converging to a single strategy that performs sub-optimally across designs. On the other hand, end-to-end Mixture-of-Experts (MoE) architectures fail due to a collapse in their representations. We propose Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning to capture diverse behaviors explicitly. GES assigns specialist policies to evolving Gaussian regions and iteratively refines them via training, probing, and territory expansion. The resulting specialists are integrated into a design sampling loop, replacing costly re-training with direct evaluation. When tested on the Buoyancy-Assisted Light Legged Unit (BALLU), GES discovers designs with 5-25% higher performance than naive universal policies. On hardware, a GES-optimized design overcomes a 24 cm tall obstacle, a 3x improvement over the baseline BALLU design. Moreover, GES curtails design optimization time by 37%.
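
The abstract says GES assigns specialist policies to Gaussian regions of the design space. One way to picture this is a routing rule that sends each candidate design to the specialist whose region gives it the highest density. The sketch below assumes isotropic Gaussians, a two-parameter design vector, and hypothetical region names. The paper's actual region parameterization and its training, probing, and territory-expansion steps are not shown and may differ.

```python
import math


def log_density(x, mean, sigma):
    """Log-density of an isotropic Gaussian N(mean, sigma^2 I) evaluated at x."""
    d = len(x)
    sq_dist = sum((a - m) ** 2 for a, m in zip(x, mean))
    return (
        -0.5 * sq_dist / sigma**2
        - d * math.log(sigma)
        - 0.5 * d * math.log(2 * math.pi)
    )


def assign_specialist(design, regions):
    """Return the name of the region with the highest log-density at the design."""
    return max(regions, key=lambda name: log_density(design, *regions[name]))


# Hypothetical regions: name -> (mean design vector, standard deviation).
regions = {
    "short_leg": ([0.2, 0.1], 0.1),
    "long_leg": ([0.6, 0.1], 0.1),
}

first = assign_specialist([0.25, 0.12], regions)
second = assign_specialist([0.55, 0.10], regions)
print(first, second)
```

With such a rule, a sampled design can be scored by evaluating the already trained specialist for its region, which is in the spirit of the "direct evaluation" that the abstract contrasts with costly re-training.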