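arXivDaily: a daily academic digest that syncs the full arXiv listing, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.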

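2606.07514 2026-06-08 cs.CV New submission

UniSHARP: Universal Sharp Monocular View Synthesis

Meixi Song, Dizhe Zhang, Hao Ren, Ruiyang Zhang, Bo Du, Ming-Hsuan Yang, Lu Qi

Affiliations: Insta360 Research; Sun Yat-sen University; Beihang University; Wuhan University; University of California, Merced

AI summary: Proposes UniSHARP, which extends SHARP to arbitrary camera systems (including fisheye and panoramic) through a unified omnidirectional latent space and a ray-based Gaussian representation, aligns images implicitly in feature and Gaussian space, and substantially outperforms existing methods on the constructed multi-view benchmark.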

Comments: Project page: https://insta360-research-team.github.io/Unisharp-website/

Abstract

In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal monocular rendering across a continuum of camera systems, from conventional perspective cameras to wide-field-of-view, fisheye and omnidirectional panoramic settings. To overcome the pinhole-specific assumptions of SHARP, our key idea is to align various images in a unified omnidirectional latent space. Thus, we propose UniSHARP, which performs implicit alignment in both feature and Gaussian spaces. Specifically, Gaussian primitives are arranged along rays and radial distances in a ray-based universal representation, while 2D semantic and 3D spatial features extracted from UniK3D-inspired encoders are jointly decoded to generate the complete Gaussian cloud. To comprehensively evaluate our method, we construct a benchmark covering diverse imaging systems across various scenes. The benchmark is further stratified by field of view (FoV) to enable fine-grained assessment of the universal monocular rendering task. Extensive experiments on the proposed benchmark demonstrate the effectiveness of UniSHARP, outperforming alternative methods by a large margin. The project page can be found at: https://insta360-research-team.github.io/Unisharp-website/

2606.07513 2026-06-08 cs.CL New submission

Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao

Affiliations: Fudan University; Johns Hopkins University; University of Science and Technology of China

AI summary: Proposes the Agentopia framework, which simulates the social lives of 100 agents over 10 years and trains LLMs with a life reward to improve their social intelligence, yielding a 15.6% improvement on role-playing benchmarks.

Comments: 79 pages, 19 figures

Abstract

Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long-term growth. In this paper, we study long-term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life-long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long-term life simulation in multi-agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define life reward to mirror human well-being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well-being in simulation, and generalizes to downstream role-playing benchmarks with +15.6% improvement.

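2606.07508 2026-06-08 cs.CV New submission

Streaming Video Generation with Streaming Force Control

Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv, Shenlong Wang, Huaizu Jiang

Affiliations: Northeastern University; Impossible Research; University of California, Berkeley; University of Illinois Urbana-Champaign

AI summary: Proposes the StreamForce framework, which uses a unified force representation and a distillation pipeline to achieve causal, unified streaming video generation with local and global time-varying force control, reaching 16.6 FPS on a single GPU with state-of-the-art force adherence and motion realism.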

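2606.07506 2026-06-08 cs.RO New submission

Affordance-Based Hierarchical Reinforcement Learning for Quadruped Pedipulation

Tuba Girgin, Jose Castelblanco, Gabriel Rodriguez, Emre Girgin, Cagri Kilic

Affiliations: Embry-Riddle Aeronautical University; Carnegie Mellon University

AI summary: Proposes a three-level hierarchical reinforcement learning framework that uses pose and interaction-point affordances to guide the navigation and pedipulation policies, achieving autonomous object manipulation in simulation and in the real world.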

Comments: This paper is submitted to Wiley Journal of Field Robotics

Abstract

The object manipulation capabilities of quadruped robots remain an open research challenge. While previous studies have focused on low-level policy learning, task execution still relies on expert-designed high-level trajectories. Autonomous selection of both an affordable interaction point on the target object and an affordable robot base pose removes the need for pre-designed trajectories. This study proposes a three-level hierarchical reinforcement learning (RL) framework that utilizes pose affordances to guide the navigation policy, while the navigation policy drives the locomotion policy. In addition, the pedipulation policy is guided by interaction-point affordances, enabling object-centric pose alignment of the quadruped robot and effective end-effector manipulation planning. We train the proposed framework in the IsaacSim ecosystem and evaluate it in both simulation and real-world settings. We investigate the effectiveness of pose affordance across multiple scenarios in simulation, while various object interaction tasks are validated in real-world settings, forming an object-interaction dataset. The results show that the proposed framework can autonomously identify candidate poses based on their affordance and successfully execute object manipulation tasks in the real world without human guidance.

2606.07503 2026-06-08 cs.CV New submission

Differences in Detection: Explainability Where it Matters

Johannes Theodoridis, Johannes Maucher, Andreas Schilling

Affiliations: University of Tübingen; Institute for Applied AI; Hochschule der Medien Stuttgart

AI summary: Proposes DnD, which uses a matching algorithm to compare two object detection models directly, revealing individual and shared errors and steering explainability methods toward metric-relevant examples.

Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2026 - How Do Vision Models Work? (HOW)

Abstract

We propose Differences in Detection (DnD), an intuitive method to compare two object detection models. Based on the same matching algorithm, it complements the standard metrics of mean Average Precision ($mAP$) and TIDE error analysis with the ability to compare two models directly. More specifically, we calculate the intersection of ground truth labels that are recognized by both models, followed by the corresponding difference sets and the complement set of ground truth labels that are missed by both models. The resulting comparison is more direct and intuitive than a comparison of independent summary statistics. It reveals individual and shared mistakes and becomes particularly interesting when combined with error types. In this case, the differences in detection errors can be analyzed naturally in a standard confusion matrix. While valuable in itself, we believe that one of the best applications of DnD is to guide explainability methods such as ODAM towards metric-relevant examples, grounded in structured subsets. The code for our method is available here: https://github.com/JohannesTheo/differences-in-detection
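
The set computation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the matching step has already produced, for each model, the ids of the ground-truth labels it detected, and the function name, ids, and example values are hypothetical.

```python
def compare_detections(ground_truth, found_by_a, found_by_b):
    """Partition ground-truth label ids by which model detected them."""
    gt = set(ground_truth)
    a = set(found_by_a) & gt  # ground-truth labels matched by model A
    b = set(found_by_b) & gt  # ground-truth labels matched by model B
    return {
        "both": a & b,                   # intersection: found by both models
        "only_a": a - b,                 # difference set: found only by A
        "only_b": b - a,                 # difference set: found only by B
        "missed_by_both": gt - (a | b),  # complement: missed by both
    }


# Hypothetical example: 8 ground-truth labels and two models' matches.
ground_truth = range(8)
model_a_hits = [0, 1, 2, 3, 5]
model_b_hits = [0, 1, 4, 5, 6]

subsets = compare_detections(ground_truth, model_a_hits, model_b_hits)
for name, ids in subsets.items():
    print(name, sorted(ids))
```

The four subsets are disjoint and together cover every ground-truth label, which is what makes them usable as structured subsets for selecting examples to explain.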

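2606.07502 2026-06-08 cs.CL cs.IR New submission

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Lenovo Group Limited; Wuhan University

AI summary: Finds that LLM text embeddings align with high-frequency tokens, which limits how well they capture semantics; proposes EmbedFilter, which filters the high-frequency subspace of the unembedding matrix to strengthen representations and also reduces dimensionality to speed up retrieval.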

Comments: preprint

Abstract

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
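
As a rough illustration of what "filtering out a subspace" with a linear transformation can look like, the sketch below projects an embedding onto the orthogonal complement of a few given directions. It is not the EmbedFilter implementation: how the paper derives the subspace from the unembedding matrix is not described in the abstract, so the `frequent_directions` here are hypothetical placeholders, and all function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def filter_subspace(embedding, basis):
    """Remove the component of `embedding` that lies in span(basis).

    `basis` must be orthonormal, so each coefficient can be taken
    against the original embedding.
    """
    out = list(embedding)
    for u in basis:
        c = dot(embedding, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out


# Hypothetical directions standing in for a high-frequency-token subspace.
frequent_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(frequent_directions)

embedding = [3.0, 2.0, 0.5, -1.0]
filtered = filter_subspace(embedding, basis)
print(filtered)
```

After removing a k-dimensional subspace, the filtered vectors occupy only a (d - k)-dimensional complement, which is one way a filter of this kind can also reduce the stored dimension.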

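2606.07500 2026-06-08 cs.LG cs.AI New submission

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

Fatema Siddika, Md Anwar Hossen, Tanwi Mallick, Ali Jannesari

Affiliations: Iowa State University; Argonne National Laboratory

AI summary: Proposes the SETA framework, which decomposes parameters into sparse subspaces for task-specific and shared experts and combines adaptive elastic anchoring with routing-aware regularization to address the plasticity-stability dilemma in LLM continual learning, outperforming existing methods on multiple benchmarks.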

Comments: 19 pages. arXiv admin note: text overlap with arXiv:2601.17616

Abstract

Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task Agnostic Continual Learning (SETA), a framework that resolves the plasticity-stability conflict through adaptive sparse subspace decomposition into task-specific expert modules. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through adaptive elastic anchoring and a routing-aware regularization that jointly protect shared knowledge at both the weight and routing levels and enable a unified gating network to automatically retrieve the correct expert combination during inference. Extensive experiments across diverse domain-specific benchmarks demonstrate that SETA achieves competitive or superior overall performance relative to state-of-the-art continual learning baselines, with particularly strong retention of early-task knowledge and improved backward transfer on LLaMA-2 7B and Qwen3-4B.

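2606.07498 2026-06-08 cs.CV New submission

Implicit Data Synthesis for Contrastive Unsupervised Data Augmentation

Patrick Kage, Trevor Hedges, N. Siddharth, Pavlos Andreadis

Affiliations: School of Informatics, The University of Edinburgh; Massachusetts Institute of Technology Lincoln Laboratory

AI summary: Targets scientific observation data that is hard to label; proposes generating contrastive samples by perturbing network weights rather than the data, and validates the performance gain with a SimCLR pipeline on radar meteor observations.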

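2606.07496 2026-06-08 cs.LG math.OC New submission

Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

Ming Sun, Kun Yuan

Affiliations: Peking University

AI summary: Proposes the MG-ADSGD algorithm, which combines Nesterov-type primal-dual extrapolation with multi-round fast gossip averaging and couples gossip depth with mini-batch size to achieve accelerated convergence and communication efficiency simultaneously, reaching the best currently known communication complexity.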

Abstract

Decentralized stochastic optimization is a fundamental paradigm for large-scale learning over networks, where agents communicate only with their neighbors and no central coordinator is required. For strongly convex problems, communication efficiency is mainly determined by the condition number $\kappa = L/\mu$ and the network spectral gap $1-\beta$. Although deterministic decentralized methods can simultaneously achieve accelerated $\sqrt{\kappa}$ and $1/\sqrt{1-\beta}$ dependences, no existing stochastic method attains both improvements at once. In this paper, we propose Multi-Gossip Accelerated DSGD (MG-ADSGD), a decentralized stochastic algorithm that combines Nesterov-type primal-dual extrapolation with multi-round fast gossip averaging. The key idea is to couple the gossip depth with the mini-batch size so that additional communication rounds simultaneously improve consensus accuracy and reduce gradient variance. We show that MG-ADSGD achieves the communication complexity \[ \widetilde{\mathcal{O}}\!\left( \frac{\sigma^2}{\mu n \epsilon}\log\frac{1}{\epsilon} + \sqrt{\frac{\kappa}{1-\beta}}\log\frac{1}{\epsilon} \right), \] where $\epsilon$ denotes the target accuracy, $n$ is the number of nodes, and $\sigma^2$ is the gradient variance. To the best of our knowledge, this bound yields the best currently available communication complexity for decentralized stochastic strongly convex optimization, up to logarithmic factors that are independent of $\epsilon$.
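
To make the role of multi-round gossip concrete, here is a minimal sketch of plain gossip averaging on a ring network. It is an assumption-laden toy, not MG-ADSGD: the ring topology, the uniform 1/3 mixing weights, and the function names are illustrative, and neither the paper's fast gossip variant nor its primal-dual updates are reproduced.

```python
def gossip_round(values):
    """One gossip round on a ring: each node averages itself with its
    two neighbours (a symmetric, doubly stochastic mixing step)."""
    n = len(values)
    return [
        (values[i - 1] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def multi_gossip(values, rounds):
    """Apply several gossip rounds in a row (the gossip depth)."""
    for _ in range(rounds):
        values = gossip_round(values)
    return values


def consensus_error(values):
    """Largest deviation of any node from the network-wide mean."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


# Hypothetical local values held by 6 nodes.
local_values = [1.0, 5.0, 3.0, 7.0, 2.0, 6.0]
for depth in (0, 1, 5, 20):
    mixed = multi_gossip(local_values, depth)
    print(depth, round(consensus_error(mixed), 6))
```

The mean is preserved by every round while the consensus error shrinks geometrically at a rate set by the mixing matrix's second-largest eigenvalue magnitude $\beta$, which is why a smaller spectral gap $1-\beta$ calls for more communication rounds.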

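2606.07495 2026-06-08 cs.LG New submission

Second-Order Path Kernel Interpolation Formulas in Machine Learning

Jin Guo, Roy Y. He, Jean-Michel Morel

Affiliations: City University of Hong Kong

AI summary: Presents second-order path-kernel interpolation formulas for neural networks, introducing a curvature-weighted term and a noise-coupling term for stochastic gradient descent, extends them to the momentum case, and refines the path-kernel interpretation of predictions.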

Abstract

Understanding how training data shape neural network predictions is a central problem in modern learning theory. In 2020, Pedro Domingos proposed an interpolation formula valid for every model learned by deterministic gradient descent. It expresses the model's prediction as an integral, along the optimization path, of a data-dependent kernel that aligns the model's gradients at the test and training data. Such a first-order characterization remains valid for models trained with batch-based stochastic optimization. In this paper, we develop second-order forms of these interpolation formulas. We show that the leading path-kernel interpolation is supplemented by a curvature-weighted interpolation term. For stochastic gradient descent, an additional sampling-induced component appears, coupling the curvature of the prediction with the covariance of mini-batch gradient noise. We also extend the representation to stochastic gradient descent with momentum, where the interpolation structure is preserved but with the weights modified by a memory-related factor. Moreover, we establish a concentration estimate for the terminal prediction, identifying the fluctuation scale around the expected second-order representation. Together, these results provide a refinement of the path-kernel interpretation of neural network prediction.
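
For orientation, the first-order formula that the abstract attributes to Domingos can be sketched in the gradient-flow limit as below. The notation ($f_w$ for the model, $\ell$ for the per-example loss with $\ell'$ its derivative in the prediction, $K_w$ for the kernel) and the continuous-time setting are assumptions made for this illustration and need not match the paper; the second-order terms are not reproduced.

```latex
% Assumed setting: gradient flow on L(w) = \sum_{i=1}^{m} \ell(y_i, f_w(x_i)).
\begin{align*}
\frac{dw}{dt}
  &= -\sum_{i=1}^{m} \ell'\bigl(y_i, f_{w(t)}(x_i)\bigr)\, \nabla_w f_{w(t)}(x_i), \\
K_w(x, x')
  &= \nabla_w f_w(x) \cdot \nabla_w f_w(x'), \\
\frac{d}{dt} f_{w(t)}(x)
  &= \nabla_w f_{w(t)}(x) \cdot \frac{dw}{dt}
   = -\sum_{i=1}^{m} \ell'\bigl(y_i, f_{w(t)}(x_i)\bigr)\, K_{w(t)}(x, x_i), \\
f_{w(T)}(x)
  &= f_{w(0)}(x)
   - \int_0^T \sum_{i=1}^{m} \ell'\bigl(y_i, f_{w(t)}(x_i)\bigr)\, K_{w(t)}(x, x_i)\, dt.
\end{align*}
```

The last line is the path integral of a data-dependent kernel that aligns the model's gradients at the test point $x$ and at each training point $x_i$.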

2606.07494 2026-06-08 cs.SD eess.AS New submission

Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

Xuanjun Chen, Yun-Shing Wu, Wei-Chung Lu, Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; Graduate Institute of Networking and Multimedia, National Taiwan University; Department of Information Management, National Taiwan University; NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)

AI summary: Proposes Domain-Shift Feature Augmentation (DSFA), which narrows the domain gap between proxy data and real-world audio by transforming deterministic feature statistics into stochastic distributions, achieving state-of-the-art performance on the CoSG ExtEval dataset.

Comments: Work in progress

Abstract

Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.
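
One way to read "transforming deterministic feature statistics into stochastic distributions" is to resample each channel's mean and standard deviation around their observed values and re-normalize the features accordingly. The sketch below shows that idea only; the Gaussian and log-normal noise models, the `noise_scale` parameter, and the function names are assumptions for illustration, not details taken from the paper.

```python
import math
import random


def channel_stats(values, eps=1e-6):
    """Mean and standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def perturb_feature_stats(features, noise_scale=0.1, rng=None):
    """Re-normalize each channel with randomly perturbed statistics.

    `features` is a list of channels, each a list of values over time.
    """
    rng = rng or random.Random()
    augmented = []
    for channel in features:
        mean, std = channel_stats(channel)
        # Assumption: Gaussian noise on the mean, log-normal noise on the std.
        new_mean = mean + rng.gauss(0.0, noise_scale) * std
        new_std = std * math.exp(rng.gauss(0.0, noise_scale))
        augmented.append(
            [(v - mean) / std * new_std + new_mean for v in channel]
        )
    return augmented


# Hypothetical 2-channel feature map with 4 time steps.
features = [[0.1, 0.4, 0.3, 0.8], [1.0, 1.5, 0.5, 2.0]]
augmented = perturb_feature_stats(features, noise_scale=0.1, rng=random.Random(0))
unchanged = perturb_feature_stats(features, noise_scale=0.0, rng=random.Random(0))
print(augmented)
```

With `noise_scale=0.0` the transform reduces to the identity, so the augmentation would only be switched on during fine-tuning.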

2606.07488 2026-06-08 cs.LG New submission

CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations

Ryan Missel, Xiajun Jiang, Linwei Wang

Affiliations: Golisano College of Computing and Information Sciences, Rochester Institute of Technology; Department of Computer Science, Rowan University; The University of Utah

AI summary: Proposes CoMetaPNS, a continual meta-learning framework that uses a Bayesian Gaussian mixture model over a memory buffer to identify data sources, enabling continual learning of personalized neural surrogates without catastrophic forgetting and outperforming baselines in cardiac simulation forecasting.

Abstract

Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training generalizable models, but not both. Recent work reframes this by learning the process of personalizing a surrogate using limited subject-specific context data, through few-shot generative modeling with set-conditioned surrogates and meta-learned amortized inference. These methods, however, assume a static and diverse training distribution with known task identifiers. When new data becomes available, they require costly retraining with all prior data to avoid catastrophic forgetting, a phenomenon in which the model forgets earlier tasks when trained on new ones. This is a major limitation in clinical settings, where unlabeled data often arrives sequentially and full retraining is infeasible. This paper presents a new continual meta-learning framework to achieve personalized neural surrogates able to not only continually integrate information but also identify whether incoming data stems from a known or unknown dynamics source. By leveraging a continual Bayesian Gaussian Mixture Model over a memory buffer, our framework can infer the identifiers and relationships of data over time, which is required for effective meta-learning. Empirical results on synthetic cardiac data demonstrate superior simulation forecasting, computational scalability, and resilience to catastrophic forgetting compared to existing baselines.

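2606.07483 2026-06-08 cs.LG stat.ML New submission

Network Recovery from Cascade Data: A Debiased Jacobian-Based Machine Learning Approach

Lei Huang

Affiliations: MIT Sloan School of Management

AI summary: Proposes the CascadeNet framework, which estimates the one-step transition function and debiases its Jacobian to recover hidden influence networks without specifying a diffusion model, outperforming existing methods on simulations and COVID-19 transmission data.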

Abstract

Many important outcomes unfold as dynamic cascades, including product adoption, disease spread, financial distress, and information diffusion. A central challenge is to recover the hidden influence network behind these cascades. Existing methods typically assume a specific diffusion model, and their performance degrades substantially when that assumption is misspecified. We propose CascadeNet, a Jacobian-based machine learning framework for network recovery that does not require specifying a diffusion mechanism. The key idea is that the underlying influence structure can be characterized by the Jacobian of the one-step transition function. CascadeNet first constructs a flexible estimator of the transition function, and further applies Neyman-orthogonal debiasing via the Riesz representer, so that the debiased Jacobian is $\sqrt{n}$-consistent and asymptotically normal, enabling formal inference on the network structure. We validate CascadeNet in both a simulation exercise and a real-world empirical application. In simulations, where the data-generating process is known, CascadeNet achieves the highest network recovery accuracy across nine common data-generating processes. In an empirical application to COVID-19 transmission across Spain's 52 provinces, CascadeNet recovers transmission networks that are significantly correlated with the true inter-province mobility network, whereas networks recovered by baseline methods show no significant alignment with the ground truth.
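
The key idea, that the influence structure is encoded in the Jacobian of the one-step transition function, can be illustrated with a toy example. The sketch below is not CascadeNet: the logistic transition, the weight matrix, and the threshold are hypothetical, the Jacobian is taken by finite differences on a known function rather than on a learned estimator, and the debiasing step is omitted.

```python
import math


def transition(state, weights):
    """Hypothetical one-step transition: node i's next value is a logistic
    function of a weighted sum of the current states."""
    n = len(state)
    return [
        1.0 / (1.0 + math.exp(-sum(weights[i][j] * state[j] for j in range(n))))
        for i in range(n)
    ]


def jacobian(f, state, h=1e-5):
    """Central finite differences: J[i][j] approximates d f_i / d x_j."""
    n = len(state)
    m = len(f(state))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(state)
        down = list(state)
        up[j] += h
        down[j] -= h
        f_up, f_down = f(up), f(down)
        for i in range(m):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return J


# Hidden influence network: node 1 influences node 0, node 0 influences node 2.
weights = [[0.0, 2.0, 0.0],
           [0.0, 0.0, 0.0],
           [1.5, 0.0, 0.0]]
state = [0.2, 0.5, 0.1]

J = jacobian(lambda s: transition(s, weights), state)
# An edge (source, target) is read off wherever the Jacobian entry is non-negligible.
edges = [(j, i) for i in range(3) for j in range(3) if abs(J[i][j]) > 1e-3]
print(edges)
```

Nonzero entries of `J` appear exactly where `weights` is nonzero, so the recovered edge list matches the hidden network in this toy case.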

2606.07481 2026-06-08 cs.LG New submission

Drifting Models for Surrogate Flow Modeling

Chris R. Jung, Markus Dörr, Natalie Jüngling, Jennifer Niessner, Adam T. Müller, Nicolaj C. Stache

Affiliations: Center for Machine Learning (ZML); Institute for Flow in Additively Manufactured Porous Structures (ISAPS); Heilbronn University of Applied Sciences

AI summary: Proposes a conditional drifting framework that drifts in a VAE latent space and uses label-aware masking to align samples with boundary conditions, achieving high-quality single-pass generation two orders of magnitude faster than iterative diffusion.

Comments: Accepted to the 2nd International Symposium AI and Fluid Mechanics 2026

Abstract

While Computational Fluid Dynamics (CFD) provides high-fidelity flow fields for optimizing indoor environments, its computational cost limits rapid exploration. Generative surrogates address this problem and offer better distribution modeling than deterministic networks, but their iterative sampling is slow. To enable high-quality, single-pass generation, we adapt the novel generative drifting framework to fluid mechanics. We introduce a conditional architecture that performs drifting in a learned VAE latent space and uses label-aware masking to align generated samples with their boundary conditions. Our label-conditioned model matches iterative diffusion in accuracy and flow consistency while running two orders of magnitude faster. Additionally, we propose a spatial-conditioning variant that establishes a promising path towards generalization to unseen geometries. Ultimately, conditional drifting serves as a highly efficient alternative to diffusion-based approaches, unlocking real-time CFD surrogates where inference speed is critical.

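2606.07479 2026-06-08 cs.CL cs.AI New submission

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

Sercan Karakaş, Yusuf Şimşek

Affiliations: University of Chicago; Fırat University

AI summary: Studies Turkish multiword expression classification, comparing a supervised baseline (BERTurk) with instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting, and finds that prompt sensitivity and demonstration-induced bias have a substantial effect.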

Comments: Accepted to ACL SRW 2026

Abstract

Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

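2606.07475 2026-06-08 cs.LG cs.AI New submission

Graph Neural Network leveraging Higher-order Class Label Connectivity for Heterophilous Graphs

Takuto Takahashi, Itsuki Nakayama, Takahiro Mitani, Ryosuke Kikuchi, Yuya Sasaki, Makoto Onizuka

Affiliations: The University of Osaka

AI summary: To address limited node classification performance on heterophilous graphs, proposes the Label Context Classifier (LCC), which captures higher-order class label connectivity through label context embeddings generated by four types of walks and can be adaptively integrated with any GNN; experiments show it outperforms existing methods.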

Abstract

Node classification in graph neural networks (GNNs) has been widely applied in various fields of graph analysis. GNNs achieve high-accuracy node classification in homophilous graphs, where nodes with the same class label tend to be connected. However, their performance remains limited in heterophilous graphs, where nodes with different class labels are more likely to be connected. In particular, current GNNs derived from graph convolutional networks cannot capture higher-order class label connectivity, which is frequently observed in real-world heterophilous graphs. To address this issue, we propose a novel classifier, Label Context Classifier (LCC), designed to capture higher-order class label connectivity in directed graphs. LCC estimates the class label of a target node by leveraging label context embeddings that are generated through four distinct types of walks. In addition, our approach allows the integration of LCC and any GNN by adaptively learning their importance. Experimental results demonstrate that GNNs integrated with LCC outperform SOTA methods and the label context embeddings improve the node classification performance in heterophilous directed graphs.

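2606.07474 2026-06-08 cs.LG New submission

Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation

Mohammadreza Sadeghi, Sareh Soleimani, Zihan Wang, Narges Armanfard

Affiliations: Department of Electrical and Computer Engineering, McGill University; Mila – Quebec AI Institute

AI summary: Introduces the unsupervised continual clustering (UCC) problem and designs Forward-Backward Knowledge Distillation for Continual Clustering (FBCC), which uses dual-phase distillation between a continual teacher network and lightweight task-specific student networks to preserve cluster structure without storing old data, significantly reducing catastrophic forgetting.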

Comments: Accepted at ECML PKDD 2026 (Research Track). arXiv admin note: substantial text overlap with arXiv:2405.19234

Abstract

Unsupervised Continual Learning (UCL) aims to enable neural networks to learn sequential tasks without labels or access to past data. A major challenge in this setting is Catastrophic Forgetting, where models forget previously learned tasks upon learning new ones. This challenge is amplified in UCL due to the absence of labels to guide learning and memory retention. Existing mitigation strategies, such as knowledge distillation and replay buffers, often raise memory and privacy concerns. Moreover, current UCL methods largely overlook clustering-specific objectives. To fill this gap, we introduce Unsupervised Continual Clustering (UCC) and propose Forward-Backward Knowledge Distillation for Continual Clustering (FBCC). FBCC employs a continual teacher network with a clustering projector and lightweight task-specific students. Through a dual-phase forward-backward distillation process, the teacher learns new clusters while preserving previously discovered cluster structure without storing past data. FBCC represents a pioneering approach to UCC, demonstrating improved clustering performance across sequential tasks. Experiments on four benchmark datasets demonstrate that FBCC consistently outperforms existing continual learning baselines in clustering accuracy while significantly reducing catastrophic forgetting.
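
The abstract does not give the FBCC loss, so the following is only a generic sketch of a distillation term between soft cluster assignments, not the authors' objective. The function names, the temperature value, and the choice of KL divergence are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of cluster logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(source_logits, target_logits, temperature=2.0):
    """KL(source || target) between softened cluster assignments.

    Hypothetical helper: passing (teacher, student) or (student, teacher)
    selects the direction in which knowledge is transferred.
    """
    p = softmax(source_logits, temperature)
    q = softmax(target_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when both networks assign a sample to clusters identically and grows as the assignments diverge.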

2606.07473 2026-06-08 cs.SD cs.AI New submission

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper 幻觉检测与缓解：基于隐藏表示引导和稀疏自编码器

Georgii Aparin, Vadim Popov, Tasnima Sadekova, Assel Yermekova

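Affiliations: AI Foundation and Algorithm Lab; National University of Science and Technology MISIS; National Research University Higher School of Economics

AI summary: Analyzes Whisper's internal representations and proposes a steering strategy based on sparse autoencoders that lowers the hallucination rate on the non-speech test set from 72.63% to 14.11% (Whisper small), approaching the performance of fine-tuning-based methods.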

AI中文摘要

Whisper是一种广泛采用的ASR模型，已知存在幻觉问题——即对非语音音频生成与输入完全无关的连贯转录。我们研究了是否可以通过Whisper的内部表示来检测和缓解幻觉。我们提取音频编码器激活，并评估两种表示空间：原始Whisper激活和稀疏自编码器（SAE）潜在变量。我们表明，两个空间都编码了线性可分的幻觉相关信息，判别能力集中在稀疏特征子集中，并向更深编码器层增强。我们提出了两种引导策略：激活空间引导和SAE潜在空间引导。基于SAE的引导将完整非语音测试集上的幻觉率从72.63%降至14.11%（Whisper small），从86.88%降至27.33%（Whisper large-v3），同时在语音数据上WER退化很小，接近基于微调方法的性能。

英文摘要

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse feature subset and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. SAE-based steering reduces hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on the full non-speech test set, with small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
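
As a rough illustration of what activation-space steering can look like, the sketch below uses a common generic recipe: estimate a direction as the normalized difference of mean activations between hallucinated and clean examples, then remove the projection onto it. This is an assumption for illustration, not the procedure from the paper; all names and the `strength` parameter are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(halluc_acts, clean_acts):
    """Unit vector pointing from the clean mean to the hallucination mean."""
    mu_h = mean_vector(halluc_acts)
    mu_c = mean_vector(clean_acts)
    diff = [h - c for h, c in zip(mu_h, mu_c)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(activation, direction, strength=1.0):
    """Subtract (a scaled copy of) the projection onto the direction."""
    proj = sum(a * d for a, d in zip(activation, direction))
    return [a - strength * proj * d for a, d in zip(activation, direction)]
```

With `strength=1.0` the component along the estimated direction is removed entirely; smaller values attenuate it instead.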

2606.07464 2026-06-08 cs.RO cs.AI cs.CV New submission

Planning-aligned Token Compression for Long-Context Autonomous Driving

面向长上下文自动驾驶的规划对齐令牌压缩

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

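Affiliations: NVIDIA Research; School of Computing and Data Science, The University of Hong Kong

AI summary: Proposes COMPACT-VA, a framework built on a conditional VQ-VAE that compresses long context into bounded representations. Planning alignment retains decision-critical information, yielding a success-rate gain of more than 6% in dynamic scenarios and a 3.3× speedup.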

Comments: 9 pages

AI中文摘要

整体视觉-动作模型代表了自动驾驶中的一种新兴范式。然而，这种架构在编码用于复杂交互的扩展时间上下文时，会产生迅速超过实时计算预算的令牌序列。虽然线性变换器和外部记忆等方法试图使上下文轻量化，但令牌压缩与架构最为兼容，因为它不需要修改主干网络。然而，现有的压缩采用基于规则的启发式方法（如时间衰减），与规划解耦，存在丢失决策关键信息的风险。我们提出COMPACT-VA，一种基于条件VQ-VAE的规划对齐工作记忆框架，将扩展上下文压缩为有界表示。压缩条件同时基于历史轨迹和学习的规划意图，其中后验编码器在训练期间从未来轨迹中提炼规划意图，而先验编码器学习从压缩观测中预测它。压缩记忆与预测的潜在变量拼接，输入策略进行端到端优化，从而在保留决策关键信息的情况下进行规划。我们在历史上下文对行为正确性（如停车、让行或前行）最关键的高信号动态场景中进行评估，并相应地设计了行为指标。在可比的令牌预算下，我们在成功率上实现了超过6%的提升（68.3%），且各项指标一致提升。消融实验验证了规划对齐耦合的有效性。闭环评估证实，与未压缩处理相比，COMPACT-VA在保持一般驾驶性能的同时实现了3.3倍的速度提升和2.7倍的内存减少。

英文摘要

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve a >6% improvement (68.3%) in success rate with consistent gains across metrics. Ablations validate the effectiveness of planning-aligned coupling. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and 2.7× memory reduction over uncompressed processing.
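
The quantization step at the core of any VQ-VAE can be sketched as a nearest-neighbor lookup in a codebook. The example below shows only that generic step; the conditioning on trajectory and planning intent described in the abstract is not modeled, and the function names and toy codebook are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to the vector."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]

def compress(tokens, codebook):
    """Map a sequence of continuous tokens to discrete codebook indices."""
    return [quantize(t, codebook)[0] for t in tokens]
```

Because every token maps to one of a fixed number of codes, the compressed representation has a bounded vocabulary regardless of how long the input context is.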

2606.07462 2026-06-08 cs.AI New submission

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

像真正的研究者一样行动：一套评估前沿LLM和研究生命周期中智能体框架的基准测试套件

Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song, Lingyu Zhang, Lanxuan Xue, Luodi Chen, Zepeng Xin, Kaiyu Li, Xiangyong Cao

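Affiliations: Xi’an Jiaotong University; Xidian University

AI summary: Proposes the AARR benchmark series; its first benchmark, AARRI-Bench, evaluates whether agents can emulate the professionalism, thoroughness, and nuanced reasoning of human researchers in fine-grained research scenarios. The best configuration reaches only a 68.3% success rate.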

AI中文摘要

随着基础模型的进步和智能体框架日益复杂，智能体在复杂、长期编码任务甚至自主实验执行中展现出卓越能力。尽管它们从研究助手演变为自主研究智能体，但这些系统在领域敏感性、研究伦理和细微科学判断方面仍存在显著局限。因此，前沿智能体仍无法完全取代人类研究者。为弥合这一差距，我们构思了AARR（像真正的研究者一样行动）基准系列。与主要评估宏观执行能力的现有基准不同，AARR关注智能体能否在细粒度研究场景中模拟人类研究者的专业性、全面性和细微推理。在这项工作中，我们提出了AARRI-Bench（像真正的研究实习生一样行动），这是该系列的第一个基准。我们在前沿模型和智能体系统上进行了大量实验，揭示即使是最佳配置（Mini-SWE-Agent与Claude Opus 4.7）也仅达到68.3%的成功率，经常忽略真实人类研究者显而易见的微妙但关键细节。我们的结果表明，开发类似研究者的AI需要进一步探索研究行为，而不仅仅是复杂的框架。我们的数据发布在此https URL。

英文摘要

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only a 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to real human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.

2606.07457 2026-06-08 cs.LG eess.SP stat.ML New submission

Time series Foundation Models based on Physics-Informed Synthetic Histories for Cold-Start Photovoltaic Forecasting

基于物理信息合成历史的时间序列基础模型用于冷启动光伏预测

Lorenzo Longarini, Alessandro Rongoni, Simone Silenzi, Emanuele Frontoni, Riccardo Rosati

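Affiliations: European Commission

AI summary: Addresses cold-start forecasting for photovoltaic plants by generating physics-informed synthetic histories and feeding them to time-series foundation models for zero-shot forecasting, achieving a roughly 1.7 to 2× improvement across 440 sites.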

Comments: To be published in the 2nd ICML Workshop on Foundation Models for Structured Data

AI中文摘要

在并网调试时，光伏运营商必须在目标站点观测数据可用之前预测发电量，这限制了标准监督预测器的直接使用。针对这种冷启动场景，我们提出了一种零样本流程，通过电站元数据和气象协变量生成合成发电历史，使时间序列基础模型（TSFMs）能够通过推理时条件化进行预测。我们在严格的冷启动基线、真实反馈和自预测反馈策略下，将五种TSFM与经典基线进行了基准测试。评估涵盖了四个数据集中$440$个光伏站点以及多种气候区域。协变量感知的基础模型比基线性能提升约$1.7-2$倍：TabPFN-TS在真实反馈下实现了最低误差（MAE $0.514$, RMSE $0.721$ $kWh$ ${kWp}^{-1}$ ${d}^{-1}$），而Chronos-2在自预测反馈下最为鲁棒。性能对合成历史来源基本不敏感，表明准确性更多取决于合理的时序上下文可用性，而非特定生成器。

英文摘要

At commissioning time, Photovoltaic (PV) operators must forecast production before target-site observations are available, limiting the direct use of standard supervised forecasters. This cold-start setting is addressed with a zero-shot pipeline that generates a synthetic production history from plant metadata and meteorological covariates, enabling time-series foundation models (TSFMs) to forecast through inference-time conditioning. Five TSFMs are benchmarked against classical baselines under strict Cold-Start Baseline, Real Feedback, and Self-Forecast Feedback strategies. The evaluation spans 440 PV sites across four datasets and diverse climate regimes. Covariate-aware foundation models outperform baselines by approximately $1.7$ to $2\times$: TabPFN-TS achieves the lowest error under Real Feedback (MAE $0.514$, RMSE $0.721$ $\mathrm{kWh\,kWp^{-1}\,d^{-1}}$), while Chronos-2 is most robust under Self-Forecast Feedback. Performance is largely insensitive to the synthetic-history source, indicating that accuracy is driven more by the availability of plausible temporal context than by the specific generator.
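
The error units in the abstract (kWh per kWp per day) suggest daily energy normalized by installed capacity. Under that assumption, MAE and RMSE for one site could be computed as in the sketch below; the function name and the per-site normalization are illustrative guesses, not the paper's evaluation code.

```python
import math

def normalized_errors(actual_kwh, forecast_kwh, capacity_kwp):
    """MAE and RMSE of daily energy forecasts, in kWh per kWp per day.

    actual_kwh / forecast_kwh: daily energy totals for one site.
    capacity_kwp: installed peak capacity of that site (assumed known).
    """
    errors = [(f - a) / capacity_kwp
              for a, f in zip(actual_kwh, forecast_kwh)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse
```

Normalizing by capacity makes errors comparable across plants of different sizes, which matters when aggregating over hundreds of sites.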

2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG New submission

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI: 基于稀疏自编码器的文本条件视觉表示编辑以改进视觉-语言对齐

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

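Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Department of Language Science and Technology, Saarland University, Saarbrücken, Germany

AI summary: Proposes TEVI, a framework that uses sparse autoencoders to disentangle image embeddings and a text-conditioned masking module to selectively reconstruct them, improving image-text alignment in vision-language models such as CLIP and yielding gains on several retrieval benchmarks.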

Comments: 20 pages, 13 figures, 14 tables

AI中文摘要

视觉-语言模型（如CLIP）由于共享图像-文本嵌入空间，对多种任务非常有用。尽管如此，图像和文本嵌入往往对齐不佳，影响下游性能。最近的研究表明，这可以归因于信息不平衡：图像包含的信息比其标题描述的更多。在这项工作中，我们提出了TEVI，一个利用标题作为信号来决定从图像嵌入中保留哪些信息的框架。具体来说，我们使用稀疏自编码器来解耦图像嵌入，并训练一个掩码模块，根据给定的标题选择性重构嵌入。在具有合成标题的受控设置中，我们展示了TEVI在保留标题描述的属性同时丢弃其他属性方面的有效性。通过将TEVI应用于在自然图像上训练的CLIP模型，我们进一步在粗粒度短标题（MS COCO, Flickr）和细粒度长标题（IIW, DOCCI）基准上实现了改进的检索性能，在更丰富的标题上获得更强的增益，并在RoCOCO基准上提高了鲁棒性。

英文摘要

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

2606.07441 2026-06-08 cs.CL New submission

Sycophantic Praise: Evaluating Excessive Praise in Language Models

谄媚式赞美：评估语言模型中的过度赞美

Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang

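Affiliations: University of Cincinnati; Carnegie Mellon University; Texas A&M University

AI summary: Proposes a parameterized framework for measuring whether praise is excessive, and finds that sycophantic praise is far more common in social and interpretive domains than in objective reasoning domains and that existing methods cannot measure it reliably.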

2606.07437 2026-06-08 cs.RO cs.AI cs.HC cs.SE cs.SY eess.SY New submission

Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability

重新构想自动驾驶时代的ISO 26262：通过可迁移性和可预测性增强可控性

Chaitanya Shinde, Hadi Hajieghrary, Paul Schmitt, Adam Shoemaker, Bodo Seifert, Steve Kenner

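Affiliations: Torc Robotics, Inc.; Reynolds & Moore; Critical Systems Analysis, LLC

AI summary: Because autonomous vehicles have no human driver, the paper decomposes ISO 26262 Controllability into two auditable dimensions, Transferability and Predictability, and gives a quantification framework to support functional-safety arguments for SAE L4/L5 systems.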

AI中文摘要

ISO 26262标准通过基于严重性、暴露度和可控性的风险评估来定义道路车辆的功能安全，其基础是人类驾驶车辆范式。在自动驾驶汽车（AV）的背景下，缺乏人类驾驶员需要重新审视这些原则。本文将可控性占位符分解为ISO 26262的两个可审计证据维度，引入了两个可测量的子概念：可迁移性和可预测性。可迁移性扩展了可控性，以捕捉AV系统将控制权移交给专用后备安全机制的能力，而可预测性则捕捉外部主体预测AV行为的难易程度。可预测性基于人机交互启发原则进行形式化定义，并提供了量化它的数学框架。引入了设计能力与可实现能力之间的差距，以区分架构后备声明与场景条件相关的可实现后备能力。所提出的度量与ISO 26262和ISO/PAS 21448（SOTIF）保持一致，使后备和交互声明在ODD切片上可证伪和可追溯。这些维度补充而非替代现有标准，这些增强保留了ISO 26262的结构，同时将其适用性扩展到在SAE L4和L5级别运行的无驾驶员自动化系统。

英文摘要

The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllability, grounded in a human-driven vehicle paradigm. In the context of autonomous vehicles (AVs), the absence of a human driver necessitates revisiting these principles. This paper decomposes the Controllability placeholder into two auditable evidence dimensions of ISO 26262 by introducing two measurable sub-concepts: Transferability and Predictability. Transferability extends Controllability to capture AV systems' ability to hand off control to dedicated fallback safety mechanisms, while Predictability captures how easily external agents can anticipate AV behavior. Predictability is formally defined from human-robot interaction-inspired principles, and a mathematical framework is provided to quantify it. A designed-versus-achievable gap is introduced to distinguish architectural fallback claims from scene-conditioned achievable fallback capability. The proposed metrics align with ISO 26262 and ISO/PAS 21448 (SOTIF), rendering fallback and interaction claims falsifiable and traceable across ODD slices. These dimensions complement rather than replace existing standards, and the enhancements preserve the structure of ISO 26262 while extending its applicability to driverless automated systems operating at SAE Levels 4 and 5.

2606.07433 2026-06-08 cs.CV cs.AI cs.MM New submission

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Watch, Remember, Reason: 基于多模态大语言模型的人类视角视频理解

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

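Affiliations: Peking University; Wuhan University; Shanghai Jiao Tong University; Nanyang Technological University; CASIA; University of Tokyo; University of Liverpool; Zhejiang University; National University of Singapore; UC Merced

AI summary: Identifies three functional abilities for human-view video understanding (watching, remembering, reasoning), builds a unified framework for analyzing perception, memory, reasoning, and prediction in video MLLMs, and summarizes challenges, methods, applications, and future directions.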

AI中文摘要

视频理解正被多模态大语言模型（MLLMs）迅速变革，研究从短视频片段转向长视频、多模态和知识密集型视频场景。这些场景要求模型在有限计算预算下处理稀疏证据、长程依赖、多模态对齐和可靠推理。本文从人类视角出发，围绕三个功能能力——观看、记忆和推理——组织基于LLM的视频理解。该视角并非将视频任务视为孤立基准，而是提供一个统一结构，用于分析视频MLLM如何获取证据、保持上下文并产生有依据的输出。我们引入一个公式，通过感知表示、记忆状态、推理轨迹和最终预测来表征视频理解系统。基于此公式，我们识别出时空感知、高效长视频处理、记忆建模、流式理解和忠实推理中的挑战。代表性方法按其视频MLLM系统中的角色进行组织：观看涵盖细粒度、全面、音视频和高效感知；记忆包括离线记忆和流式记忆；推理涵盖纯文本推理和视频辅助推理。我们进一步考察了应用领域，如自我中心、体育、教学、医学和叙事视频，并涵盖了跨任务类型、监督格式、模态和能力维度的训练数据集和评估基准。最后，我们概述了可扩展、记忆感知和有依据的视频智能的开放问题和未来方向。相关工作将在https://this https URL持续追踪。

英文摘要

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.

2606.07426 2026-06-08 cs.LG New submission

Discovering Multiscale Deep Formulas in Complex Systems via Neural-Guided Lambda Calculus

通过神经引导的Lambda演算发现复杂系统中的多尺度深层公式

Hanqiao Yu, Shusen Yang, Xuebin Ren, Cong Zhao

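Affiliations: Xi’an Jiaotong University

AI summary: Proposes Deflex, which combines a decomposable deep energy model with lambda-calculus symbolic regression to automatically extract multiscale formulas from complex systems, with up to 7-fold higher efficiency.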

Comments: 35 pages, 5 figures; Supplementary Information available as an ancillary file (79 pages)

AI中文摘要

科学中的一个基本问题是以简洁数学公式的形式识别复杂系统的潜在模式。当前基于人工智能的方法在单尺度系统中表现出色，但在识别多尺度复杂系统中的尺度特定公式方面仍有限。我们提出Deflex，一种端到端的人工智能方法，可从复杂系统中自动提取可能具有不同形式的多尺度公式，包括不变量和分布。Deflex由两个子系统组成，分别称为Deflexformer和Deflexpressor。Deflexpressor是一个用于高阶公式的lambda演算符号回归模型。Deflexformer是一个可分解的深度能量模型，用于学习跨尺度的统一表示。Deflexpressor生成合成数据以预训练Deflexformer，然后通过解耦多尺度潜在关系来指导公式发现。在六个具有不同行为的代表性复杂系统中，Deflex实现了比最先进方法高达7倍的效率提升，同时实现了自动多尺度发现。我们的工作可能成为跨学科科学发现的有用工具。

英文摘要

A fundamental problem in science is identifying underlying patterns of complex systems in the form of concise mathematical formulas. Current Artificial Intelligence (AI)-based methods have shown strong performance in single-scale systems, yet remain limited in identifying scale-specific formulas in multiscale complex systems. We present Deflex, an end-to-end AI method to automatically extract multiscale formulas with potentially different forms, including invariants and distributions, from complex systems. Deflex consists of two subsystems named Deflexformer and Deflexpressor. Deflexpressor is a lambda-calculus symbolic regression model for higher-order formulas. Deflexformer is a decomposable deep energy model for learning unified representations across scales. Deflexpressor generates synthetic data to pre-train Deflexformer, which then guides formula discovery by decoupling multiscale latent relationships. Across six representative complex systems with diverse behaviors, Deflex achieves up to 7-fold higher efficiency than the state-of-the-art methods while enabling automated multiscale discovery. Our work could be a useful tool for scientific discovery across disciplines.

2606.07424 2026-06-08 cs.RO New submission

Rapid co-design of Buoyancy-assisted robots for Challenging Locomotion using Gaussian Evolutionary Specialists

基于高斯进化专家的浮力辅助机器人快速协同设计以应对挑战性运动

Ankit Sinha, Nitish Sontakke, Dennis Hong, Yusuke Tanaka, Sehoon Ha

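Affiliations: Georgia Institute of Technology; University of California, Los Angeles; ETH Zurich

AI summary: Proposes Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning, achieving 5-25% higher performance on the Buoyancy-Assisted Light Legged Unit (BALLU) and cutting design optimization time by 37%.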

Comments: Submitted to RA-L

AI中文摘要

设计高性能腿式机器人需要联合优化形态和控制。无模型强化学习（RL）为开发鲁棒控制器提供了模型预测控制的替代方案，无需明确指定机器人动力学。因此，我们看到了使用RL训练控制器和评估设计以优化机器人形态。虽然RL在运动方面取得了成功，但由于重复的策略训练，将其用于协同设计内循环成本高昂。基于形态的条件通用策略提供了一种有前景的替代方案，但遭受行为多样性崩溃，收敛到单一策略，在不同设计上表现次优。另一方面，端到端混合专家（MoE）架构因其表示崩溃而失败。我们提出高斯进化专家（GES），一个将设计空间划分与策略学习解耦以显式捕获多样行为的框架。GES将专家策略分配给演化的高斯区域，并通过训练、探测和领土扩展迭代优化它们。生成的专家被集成到设计采样循环中，用直接评估替代昂贵的重新训练。在浮力辅助轻量腿单元（BALLU）上测试时，GES发现的设计比朴素通用策略性能高5-25%。在硬件上，GES优化设计克服了24厘米高的障碍——比基线BALLU设计提升3倍。此外，GES将设计优化时间缩短了37%。

英文摘要

Designing high-performance legged robots requires jointly optimizing morphology and control. Model-free Reinforcement Learning (RL) offers an alternative to model-predictive control for developing robust controllers without explicitly specifying robot dynamics. Thus, we have seen the use of RL to train controllers and evaluate designs for robot morphology optimization. While RL has shown success in locomotion, using it in the co-design inner loop is expensive due to repeated policy training. Universal policies conditioned on morphology offer a promising alternative, but suffer from behavioral diversity collapse, converging to a single strategy that performs sub-optimally across designs. On the other hand, end-to-end Mixture-of-Experts (MoE) architectures fail due to a collapse in their representation. We propose Gaussian Evolutionary Specialists (GES), a framework that decouples design-space partitioning from policy learning to capture diverse behaviors explicitly. GES assigns specialist policies to evolving Gaussian regions and iteratively refines them via training, probing, and territory expansion. The resulting specialists are integrated into a design sampling loop, replacing costly re-training with direct evaluation. When tested on the Buoyancy-Assisted Light Legged Unit (BALLU), GES discovers designs with 5-25% higher performance than naive universal policies. On hardware, a GES-optimized design overcomes a 24 cm tall obstacle, a 3x improvement over the baseline BALLU design. Moreover, GES curtails design optimization time by 37%.
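
The abstract says specialists are assigned to Gaussian regions of the design space but does not say how a new design is routed to one. A plausible minimal sketch, assuming diagonal Gaussians and routing by highest log-density, is shown below; the region format and function names are hypothetical.

```python
import math

def log_density(x, mean, var):
    """Log-density of a diagonal Gaussian evaluated at design vector x."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def pick_specialist(design, regions):
    """Index of the region (and thus specialist) that best covers the design.

    regions: list of dicts with "mean" and "var" lists (assumed layout).
    """
    return max(range(len(regions)),
               key=lambda i: log_density(design,
                                         regions[i]["mean"],
                                         regions[i]["var"]))
```

Routing each sampled design to an already trained specialist is what would let a design loop evaluate candidates directly instead of re-training a policy per design.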

2606.07416 2026-06-08 cs.LG New submission

Video-Based Prediction of In-Flight Particle Characteristics in Atmospheric Plasma Spraying

基于视频的大气等离子喷涂中飞行粒子特性预测

Abhijeet Praveen, Sareh Soleimani, Cormac Cureton, Aman Sidhu, Kintak Raymond Yu, Cristian Cojocaru, Narges Armanfard

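Affiliations: Department of Electrical and Computer Engineering, McGill University; Mila – Quebec AI Institute; National Research Council Canada

AI summary: Uses high-speed video of the plasma plume with models such as TabPFN and CNNs to predict in-flight particle temperature and velocity, reaching R² of up to 0.90 and 0.82 respectively and enabling non-invasive diagnostics.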

Comments: Accepted at ECML PKDD 2026 (Applied Data Science Track)

AI中文摘要

大气等离子喷涂（APS）是一种广泛使用的涂层工艺，其中飞行粒子的温度和速度强烈影响涂层质量。然而，这些粒子特性在操作过程中难以连续监测，这促使了非侵入式数据驱动诊断方法的发展。在这项工作中，我们研究了高速视频观测等离子体羽流在估计APS中飞行粒子特性方面的预测潜力。我们引入了三种不同的视频衍生特征表示，并使用Tabular Prior-Data Fitted Networks（TabPFN）、卷积神经网络（CNN）以及包括随机森林、梯度提升、支持向量回归和XGBoost在内的经典回归基线进行评估。实验采用分组留一交叉验证，对来自63次APS喷涂运行的126个标记的喷涂前后视频记录进行。在工程化特征实验中，TabPFN在温度预测方面表现最一致，使用组合特征表示达到R²=0.86。CNN模型在速度预测方面表现更强，达到R²=0.81。此外，我们评估了使用预训练CNN直接对原始视频帧进行操作的模型，发现预训练CNN加回归头实现了最高性能，温度和速度的R²分别为0.90和0.82。结果表明，视频衍生的羽流信息为APS非侵入式诊断和实时过程监测提供了有前景且可扩展的基础。

英文摘要

Atmospheric plasma spraying (APS) is a widely used coating process in which in-flight particle temperature and velocity strongly influence coating quality. However, these particle characteristics are difficult to monitor continuously during operation, motivating the development of non-invasive data-driven diagnostic methods. In this work, we investigate the predictive potential of high-speed video observations of the plasma plume for estimating in-flight particle characteristics in APS. We introduce three different video-derived feature representations and evaluate them using Tabular Prior-Data Fitted Networks (TabPFN), convolutional neural networks (CNN), and classical regression baselines including Random Forest, Gradient Boosting, Support Vector Regression, and XGBoost. Experiments are conducted using grouped leave-one-out cross-validation on 126 labeled pre- and post-spray video recordings from 63 APS spray runs. Across the engineered feature experiments, TabPFN achieves the most consistent performance for temperature prediction, reaching $R^2 = 0.86$ using the combined feature representation. CNN models perform better for velocity prediction, achieving an $R^2$ of 0.81. In addition, we evaluate models operating directly on raw video frames using pretrained CNNs and find that the highest performance is achieved by a pretrained CNN with a regression head, with $R^2$ of 0.90 and 0.82 for temperature and velocity, respectively. The results demonstrate that video-derived plume information provides a promising and scalable foundation for non-invasive APS diagnostics and real-time process monitoring.
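
Grouped leave-one-out cross-validation holds out all samples from one group per fold, so related recordings never straddle the train/test boundary. The sketch below assumes the group is the spray run, with one pre-spray and one post-spray recording per run (126 recordings from 63 runs); that grouping is inferred from the abstract, and the function name is hypothetical.

```python
def grouped_leave_one_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test

# Assumed layout: two recordings (pre- and post-spray) per spray run.
groups = [run for run in range(63) for _ in range(2)]
folds = list(grouped_leave_one_out(groups))
```

Each of the 63 folds then tests on the two recordings of one run and trains on the remaining 124, which avoids leakage between recordings of the same run.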

2606.07414 2026-06-08 cs.LG cs.NE New submission

Sparsely gated tiny linear experts

稀疏门控的微型线性专家

Simon Schug

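Affiliations: Princeton University

AI summary: Proposes networks of sparsely gated linear neurons (sgatlin). Shrinking each expert to a single neuron and removing the nonlinearity improves language-model perplexity at equal compute while making the model more interpretable.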

Comments: Code available at https://github.com/smonsays/sparsely-gated-linear

AI中文摘要

稀疏性允许在不按比例增加计算成本的情况下扩展模型参数。虽然混合专家（MoE）模型变得越来越稀疏，但单个专家通常仍然庞大且密集。在这里，我们通过将每个专家缩减为单个神经元并选择许多可用神经元中的极小一部分，进一步增加稀疏性，从而提高计算效率和可解释性。与直觉相反，实现这两者的关键是去除通常应用于专家的非线性，从而得到一个稀疏门控线性神经元（sgatlin）网络。在等计算量比较中，我们发现用sgatlin替换所有Transformer前馈层可以在不同计算预算下改善语言模型的困惑度。同时，由此产生的前馈电路的稀疏性和线性为模型可解释性提供了新的机会。在一个小规模案例研究中，我们证明sgatlin中的前馈电路可以在无需训练额外替代模型的情况下进行解释。我们发现它们形成了语义结构化的聚类，并且在因果上参与了事实回忆。我们的发现为计算高效且可解释的Transformer前馈层指明了一条可能的路径。

英文摘要

Sparsity allows scaling model parameters without proportionally increasing computational cost. While mixture of experts (MoE) models are made increasingly sparse, individual experts typically remain large and dense. Here, we demonstrate that further increasing sparsity by shrinking each expert to consist of a single neuron and selecting a tiny fraction of many available neurons can improve compute efficiency and interpretability. Counterintuitively, the key to achieving both is removing the nonlinearity typically applied to the experts, resulting in a network of sparsely gated linear neurons (sgatlin). In an isoflop comparison, we find that replacing all transformer feedforward layers with sgatlin improves perplexity in language models across different compute budgets. At the same time, the sparsity and linearity of the resulting feedforward circuits present new opportunities for model interpretability. In a small-scale case study, we demonstrate that feedforward circuits in sgatlin can be interpreted without having to train additional replacement models. We find that they form semantically structured clusters and are causally implicated in factual recall. Our findings paint a possible path towards compute-efficient and interpretable transformer feedforward layers.
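
To make the idea of single-neuron linear experts concrete, the sketch below routes an input to the top-k neurons by gate score and sums their linear contributions with no activation function. The gating rule (softmax over the selected scores) and all names are assumptions for illustration; the repository linked in the Comments line is the authoritative implementation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sparse_linear_layer(x, gate_w, in_w, out_w, k):
    """Feedforward layer built from sparsely gated linear neurons.

    gate_w[i], in_w[i]: gate and input weights of neuron i (length of x).
    out_w[i]: output weights of neuron i (length of the output).
    Only the k neurons with the highest gate scores contribute.
    """
    scores = [dot(g, x) for g in gate_w]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    out = [0.0] * len(out_w[0])
    for i in top:
        # Linear expert: gate weight times a plain dot product, no nonlinearity.
        activation = (exps[i] / total) * dot(in_w[i], x)
        for j, w in enumerate(out_w[i]):
            out[j] += activation * w
    return out, top
```

Since each selected neuron contributes a gated linear map, the output for a fixed selection is a linear function of the input, which is one reason such circuits may be easier to inspect.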

2606.07410 2026-06-08 cs.LG cs.AI New submission

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

人类与DeepSeek-R1大语言模型数学推理的全面解剖

Yuxiang Chen, Jun Wang

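Affiliations: UCL Centre for Artificial Intelligence

AI summary: Annotating 10,247 reasoning steps across all 30 AIME 2025 problems, the study finds that DeepSeek-R1 exhibits topological mimicry (reproducing the surface form of reasoning without its function), while stable use of branching and backtracking in successful traces and reflection placed within deductive inference are signals of genuine reasoning.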

AI中文摘要

大语言模型中“顿悟时刻”的出现，特别是DeepSeek-R1-0120，引发了这些系统是真正推理还是仅仅模仿推理表象的问题。我们对AIME 2025所有30道题目进行了模型与人类推理的全面实证比较，将10247个推理步骤详尽地注释为五个功能类别：分析、推理、分支、回溯和反思。我们发现了一个明显的结构差异。人类解决方案在分析和演绎之间保持紧凑交替，而DeepSeek-R1频繁重访中间结果，进行浅层且往往不必要的验证，并在局部检查中循环，而没有有意义的逻辑进展。我们将其描述为拓扑模仿：再现推理的表面形式而不发挥其功能作用。尽管如此，我们识别出两个真正推理的信号。首先，成功轨迹表现出分支和回溯的稳定使用，而失败轨迹要么过度使用要么使用不足探索性动作。其次，反思仅在置于演绎推理中时才有效；陷入分析循环的反思专注于局部数值细节而忽略全局逻辑错误。这些发现表明，当前的长链思维模型可能更多地因推理的表象而非真正的演绎进展而获得奖励。我们讨论了改进评估和训练的方向，包括测量跨轨迹稳定性、惩罚“空转”轨迹、鼓励更深层的逻辑纠正，以及将推理时间计算重新分配给演绎和回溯。总体而言，推理质量不仅取决于反思发生的多少，还取决于反思是否一致地出现在适当的逻辑尺度上。

英文摘要

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inference, Branch, Backtrace, and Reflection. We find a clear structural difference. Human solutions maintain a compact alternation between analysis and deduction, whereas DeepSeek-R1 frequently revisits intermediate results, performs shallow and often unnecessary verification, and loops through local checks without meaningful logical progress. We describe this as topological mimicry: reproducing the surface form of reasoning without its functional role. Despite this, we identify two signals of genuine reasoning. First, successful traces exhibit stable use of branching and backtracking, while failed traces either underuse or overuse exploratory actions. Second, reflection is only effective when placed within deductive inference; reflections trapped in analysis loops focus on local numerical details while missing global logical errors. These findings suggest that current long-CoT models may be rewarded more for the appearance of reasoning than for genuine deductive progress. We discuss directions for improving evaluation and training, including measuring cross-trace stability, penalising "spinning-wheel" traces, encouraging deeper logical correction, and reallocating inference-time compute toward deduction and backtracking. Overall, reasoning quality depends not simply on how much reflection occurs, but on whether reflection appears consistently and at the appropriate logical scale.
