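arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.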

2606.19921 2026-06-19 cs.AI new submission

eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization

Shengbiao Lu, Xiaodong Wei

Affiliations: Global College, Shanghai Jiao Tong University

AI summary: Proposes eCNNTO, an element-based convolutional neural network that accelerates density-based topology optimization by predicting near-optimal densities and skipping most iterations, and introduces a new training strategy that improves efficiency and generalization.

Abstract

This work proposes an element-based Convolutional Neural Network (CNN), termed eCNNTO, to accelerate density-based Topology Optimization (TO). TO generally requires a large number of iterations, with finite element analysis performed in every iteration, which creates an efficiency bottleneck, especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO builds upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, that method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs a CNN with residual connections to address this issue. On top of this, a novel training strategy is introduced to further enhance the optimization efficiency, in which the training dataset consists of final-stage density histories rather than early ones. This change also helps reduce the required training data size. eCNNTO requires only a small dataset to train, yet it generalizes to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, and non-design domains. Finally, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction of iterations, respectively.


2606.19920 2026-06-19 cs.RO cs.LG cs.MA new submission

Deep-Unfolded Coordination

Hunter Kuperman, Minchan Jung, Rahul V. Ghosh, Alex Oshin, Evangelos A. Theodorou

Affiliations: Autonomous Control and Decision Systems Laboratory, Georgia Institute of Technology, United States

AI summary: Proposes Deep Coordinator, a framework that deep-unfolds ADMM-DDP iterations and learns to adjust hyperparameters dynamically, adapting the penalty parameters of a non-convex optimizer at solve time; in simulations with fleets of cars and quadrotors it is 6.18-9.44x faster and scales to systems up to 8x larger than those seen in training.

Comments: The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables

Abstract

Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly-specialized, problem-specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.


2606.19919 2026-06-19 cs.LG new submission

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

Tingyun Li, Zishang Jiang, Jinyi Han, Xinyi Wang, Sihang Jiang, Han Xia, Zhaoqian Dai, Shuguang Ma, Fei Yu, Jiaqing Liang, Yanghua Xiao

Affiliations: School of Data Science, Fudan University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; College of Computer Science and Artificial Intelligence, Fudan University; Ant Group

AI summary: Proposes ADaPT, a token-level dual-process framework that decouples efficiency and correctness signals and introduces a mode-selection token to control fast and slow reasoning, giving precise, continuous control over the efficiency-performance trade-off at inference time and lowering inference cost while preserving strong reasoning ability.

Abstract

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
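
One way to picture the inference-time control described in this abstract is a bias added to the logit of the mode-selection token before the softmax. The sketch below is an illustrative assumption, not the authors' implementation: the token names `<fast>` and `<slow>`, the function `mode_probabilities`, and the use of an additive logit bias are all hypothetical.

```python
import math

def mode_probabilities(logits, bias):
    """Softmax over mode-token logits after adding `bias` to the
    (hypothetical) "<slow>" token's logit."""
    shifted = dict(logits)
    shifted["<slow>"] = shifted["<slow>"] + bias
    peak = max(shifted.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in shifted.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Sweep the bias: negative values favor fast answers, positive values favor
# slow (long chain-of-thought) reasoning.
logits = {"<fast>": 0.0, "<slow>": 0.0}
for bias in (-2.0, 0.0, 2.0):
    probs = mode_probabilities(logits, bias)
    print(f"bias={bias:+.1f}  P(slow)={probs['<slow>']:.3f}")
```

Because the bias is a continuous knob, sweeping it traces a smooth curve between mostly-fast and mostly-slow behavior, which is the kind of movement along a trade-off frontier the abstract describes.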


2606.19918 2026-06-19 cs.ET new submission

A Novel FeFET Differential Bit-Cell With Hybrid Volatile and Non-Volatile Memory Modes

Jianze Wang, Wei Zhang, Xuanyao Fong

AI summary: Proposes a 4T differential bit-cell built from cross-coupled FeFETs and access transistors that switches between volatile and non-volatile modes by adjusting the write conditions, needs no explicit backup and restore operations, and is smaller than a conventional 6T SRAM cell.

Abstract

Non-volatile SRAM (nvSRAM) designs have been investigated to address the high leakage power of CMOS-based SRAM and the large write latency of emerging non-volatile memory (eNVM) technologies. However, prior nvSRAM designs that combine SRAM with eNVM devices typically require backup and restore (B&R) operations and incur significant cell-area overhead. Here, we propose a differential memory bit-cell consisting of a pair of cross-coupled ferroelectric field-effect transistors (FeFETs) and a pair of access transistors, resulting in a four-transistor (4T) structure, which is smaller than conventional 6T SRAM and many prior nvSRAM designs. The proposed bit-cell can be configured to operate in either volatile or non-volatile mode by adjusting the write conditions. In the non-volatile mode, the proposed nvSRAM achieves a store power of 0.13 μW with a 2 ns store time, and no explicit B&R operation is required. The proposed bit-cell can also be viewed as a cross-coupled gain cell, enabling further applications.


2606.19915 2026-06-19 cs.CV new submission

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Jiayu Tang, Yuchen Zhou, Chao Gou

Affiliations: School of Intelligent Systems Engineering, Sun Yat-sen University

AI summary: Proposes SpatialSV, a framework that uses task-oriented visual supervision to lift an MLLM's 2D features into explicit 3D representations (depth maps, camera poses, point clouds), internalizing interpretable 3D spatial awareness without external tools and generalizing well in semi-supervised settings.

Comments: Accepted by IJCAI 2026

Abstract

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.


2606.19914 2026-06-19 cs.RO cs.AI new submission

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances

Xuetao Li, Wenke Huang, Mang Ye, Zijian Liu, Jinhua Xie, Jifeng Xuan, Miao Li

Affiliations: School of Computer Science, Wuhan University; College of Computing and Data Science, Nanyang Technological University; School of Automation, Wuhan University of Technology; School of Geodesy and Geomatics, Wuhan University; School of Robotics, Wuhan University

AI summary: Proposes Co-policy, a framework for real-time human-robot musical co-creation that combines semantic anchoring, constrained variation, and a visuomotor policy, and outperforms diffusion-policy baselines in real-robot chime experiments.

Abstract

Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.


2606.19913 2026-06-19 cs.AR new submission

Design and Evaluation of Energy-Efficient Whisper Dot-Product Kernel Offloading on a CGLA Architecture

Takuto Ando, Yu Eto, Ayumu Takeuchi, Yasuhiko Nakashima

AI summary: Offloads the Whisper dot-product kernel to IMAX, a CGLA architecture, and optimizes kernel mapping, local-memory sizing, and burst scheduling; on Whisper tiny the power-delay product (PDP) is 2.35x lower than Jetson AGX Orin and 10.48x lower than RTX 4090, offering a programmable architecture for low-power local speech recognition.

Comments: This paper is accepted at Concurrency and Computation: Practice and Experience (Wiley)

Abstract

In this paper, we implement and evaluate Whisper dot-product kernel offloading on IMAX, a programmable Coarse-Grained Linear Array (CGLA) architecture. Whisper-tiny.en profiling on an ARM Cortex-A72 shows that dot-product operations account for 90.6% of FP16 execution time and 87.1% of Q8_0 execution time. To address this kernel bottleneck, we combine kernel mapping, local-memory sizing, and burst scheduling. The implementation uses inline FP16-to-FP32 conversion, 2-way SIMD FMA on a 64-bit datapath, column-wise multithreading, and mixed execution in which aligned vector segments run on IMAX and residual segments run concurrently on the host CPU. We evaluate the design with an FPGA prototype and a 28nm ASIC projection at 840MHz. For Whisper-tiny.en, 32KB local memory and burst length 16 jointly minimize PDP and EDP. Under a TDP-based cross-platform comparison, the projected IMAX records a PDP of 11.58J for Whisper-tiny.en Q8_0, 2.35x lower than Jetson AGX Orin (27.16J) and 10.48x lower than RTX 4090 (121.38J). The same design extends to Whisper-base.en and Whisper-small.en, where the PDP gap narrows as 32KB local-memory coverage drops from 93.8% for tiny to about 66.5% for base and small. These results position IMAX as a programmable architecture for lower-PDP local ASR in the tiny-model regime.
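
The "x lower" factors quoted in this abstract follow directly from the three reported PDP values. The short check below is only an arithmetic sketch: the dictionary and the helper name `pdp_ratio` are made up for illustration, and the numbers are copied from the abstract.

```python
# PDP values in joules as reported in the abstract (Whisper-tiny.en, Q8_0).
pdp_joules = {
    "IMAX (projected)": 11.58,
    "Jetson AGX Orin": 27.16,
    "RTX 4090": 121.38,
}

def pdp_ratio(platform, reference="IMAX (projected)"):
    """How many times larger the platform's PDP is than the reference PDP."""
    return pdp_joules[platform] / pdp_joules[reference]

for name in ("Jetson AGX Orin", "RTX 4090"):
    print(f"{name}: {pdp_ratio(name):.2f}x the projected IMAX PDP")
```

Rounded to two decimals, the ratios come out to 2.35 and 10.48, matching the figures in the abstract.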


2606.19911 2026-06-19 cs.AI cs.CL cs.IR new submission

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

Affiliations: Carnegie Mellon University; University of California, Berkeley

AI summary: Proposes MATM, a framework for shared storage and retrieval of agent trajectories that lets heterogeneous agent populations reuse knowledge, improving downstream task performance and reducing interaction steps.

Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.


2606.19910 2026-06-19 cs.CL cs.SD eess.AS new submission

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

Affiliations: Qatar Computing Research Institute, Doha, Qatar

AI summary: Proposes a lightweight pronunciation assessment framework trained only on native speech resources; it discretizes speech into tokens, computes surprisal with a language model, and adds text-guided alignment features, approaching supervised performance when run unsupervised or with light calibration.

Comments: Accepted to Interspeech 2026

Abstract

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit--DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
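
The surprisal idea in this abstract can be illustrated with a toy token language model. The sketch below assumes a bigram model with add-one smoothing as a stand-in; the abstract does not say which token language model the authors use, and the function names and integer token IDs here are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Count bigrams and their left contexts over native token sequences."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, vocab_size

def mean_surprisal(seq, model):
    """Average -log2 P(cur | prev) in bits, using add-one smoothing."""
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        prob = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total += -math.log2(prob)
    return total / max(len(seq) - 1, 1)

# Toy "native" data: discrete unit IDs such as a K-means codebook might emit.
native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
model = train_bigram(native, vocab_size=4)
print(mean_surprisal([0, 1, 2, 3], model))  # transitions seen in training
print(mean_surprisal([3, 2, 1, 0], model))  # transitions never seen in training
```

The second sequence scores higher because none of its transitions occur in the native data, which mirrors the abstract's reading of high surprisal as a sign of deviation from native patterns.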


2606.19908 2026-06-19 cs.CV new submission

Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J. M. Jaspers, Francisco Caetano, Mauricio A. Alvarez, Fons van der Sommen, Sam Van der Jeught

Affiliations: Department of Electromechanics, InViLab, University of Antwerp; Department of Computer Science, University of Manchester; Department of Electrical Engineering, Eindhoven University of Technology

AI summary: Proposes a Gaussian Process Prior Variational Autoencoder (GPVAE) that replaces the factorized prior with a temporal Gaussian process prior and combines two scalable GP approximations with specular-reflection masking to interpolate and restore missing frames in endoscopic video, reducing RMSE by 21.9% on average on the C3VDv2 dataset.

Abstract

Endoscopic video analysis is essential for gastrointestinal diagnosis and computer-assisted interventions, but video sequences are routinely degraded by specular reflections, motion artifacts, and missing frames. These transient corruptions can distract clinicians, reduce image interpretability, and disrupt downstream tasks such as 3D reconstruction and navigation. Effective restoration therefore requires methods that exploit temporal continuity rather than treating frames in isolation. We introduce a Gaussian Process Prior Variational Autoencoder (GPVAE) framework for endoscopic video restoration that replaces the standard factorized latent prior with a temporal Gaussian process prior, enabling interpolation of missing frames with uncertainty-aware reconstruction. The framework combines endoscopy-specific encoders, including a convolutional EndoVAE backbone and pretrained Vision Transformer encoders from GastroNet-5M, with two scalable GP approximations: Hierarchical Prior Approximation (HPA) and Sparse Precision Approximation (SPA). Specular reflections are handled using a DUCKNet-based masking pipeline that excludes corrupted pixels from the reconstruction objective. On the C3VDv2 colonoscopy dataset, the best GPVAE variants reduced image reconstruction RMSE by 21.9% on average, and by up to 26.1%, relative to matched VAE baselines. Downstream trajectory RMSE was reduced by 12.7% on average across classical visual odometry and a pretrained PoseNet, at an average increase of 27.3% in training time per epoch. Finally, the GP posterior provides per-frame uncertainty estimates that reflect temporal support and offer a confidence signal for restored frames.


2606.19904 2026-06-19 cs.SI new submission

Toward Temporal Realism in City-Scale Crisis Response Simulation using LLM Agents

Anping Zhang, Yang Tan, Yuanbo Tang, Huaze Tang, Qiuhua Ye, Marta C. Gonzalez, Yang Li

AI summary: Addresses the lack of temporal realism in LLM social simulation; using data on volunteering in Shenzhen during the pandemic, it proposes data-calibrated self-excitation and crisis-activation mechanisms that produce bursty temporal patterns, bringing the agents' timing distribution close to reality.

Comments: 11 pages, 7 figures

Abstract

Human collective participation is rarely steady in time: it is bursty, with short episodes of intense activity separated by long quiet intervals. In crisis response and community mobilization, predicting when people act matters as much as predicting whether they act. Such settings are increasingly modeled with LLM-based social simulators, yet these simulators are validated on whether each action is individually plausible, not on whether actions are timed as in reality. Their temporal realism, the degree to which simulated activity reproduces the bursty, heavy-tailed timing of real human systems, thus remains untested. We examine this gap using a multi-year, city-scale log of offline volunteering in Shenzhen that spans the COVID-19 pandemic. Empirically, we establish that bursty timing is common at individual and tracked-group levels, that it is largely endogenous and self-exciting, and that it is amplified by the pandemic rather than produced by daily activity cycles. A standard LLM-only simulator reproduces almost none of this timing: its synchronous schedule has no self-excitation channel, so agents act on a near-regular clock. Guided by these findings, we build a simulator in which a data-calibrated self-excitation channel and a crisis-period regime decide when each agent acts and query the LLM only at those moments, leaving it to decide which task to join and whether to commit. The LLM-only baseline yields no bursty agents (median burstiness $B=-0.14$); a single data-calibrated gate is then sufficient to lift per-agent timing above the burst threshold (median $B\approx0.37$) without degrading LLM content decisions. These results indicate that temporal realism in LLM-based crisis-response simulation is best achieved by decoupling when agents act, governed by an explicit self-excitation and crisis-activation mechanism, from what they do, governed by the LLM.
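
The abstract reports burstiness values $B$ without stating the formula. A common definition, assumed here, is the Goh-Barabási coefficient $B = (\sigma - \mu)/(\sigma + \mu)$, where $\mu$ and $\sigma$ are the mean and standard deviation of the inter-event times; whether the paper uses exactly this form is not confirmed by the abstract. The function name and the toy event lists are illustrative only.

```python
import statistics

def burstiness(event_times):
    """Goh-Barabasi burstiness B = (sigma - mu) / (sigma + mu), computed on
    the gaps between consecutive events. Needs at least two distinct times."""
    times = sorted(event_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    mu = statistics.fmean(gaps)
    sigma = statistics.pstdev(gaps)  # population standard deviation of the gaps
    return (sigma - mu) / (sigma + mu)

regular = [0, 1, 2, 3, 4, 5, 6, 7, 8]        # evenly spaced, clock-like activity
bursty = [0, 1, 2, 3, 50, 51, 52, 53, 200]   # short clusters, long silences
print(burstiness(regular))
print(burstiness(bursty))
```

Under this definition $B$ ranges from -1 for perfectly regular timing, through 0 for Poisson-like timing, toward 1 for extremely bursty timing, so a negative median such as the reported $-0.14$ indicates near-regular agents and a positive median such as $0.37$ indicates bursty ones.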


2606.19901 2026-06-19 cs.CV new submission

Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

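Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin

Affiliations: Korea University; DGIST

AI summary: Proposes a linear recurrent network with a semantic modulation unit that achieves efficient image super-resolution through modulation, spatial classification, and prototype enhancement, outperforming existing methods.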

Comments Accepted to CVPR 2026 Findings

2606.19899 2026-06-19 cs.CY cs.AI new submission

Measuring Biological Capabilities and Risks of AI Agents

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

AI summary: For agentic systems such as AI scientists that autonomously carry out multi-step scientific tasks, the paper presents biological agentic evaluations as a tool that requires careful interpretation and offers experience-based considerations for defining, designing, running, scoring, and documenting evaluations, to help decision-makers interpret results cautiously and to guide investment.

Abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations, drawing from our own evaluations, that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.


2606.19898 2026-06-19 cs.DB cs.IR new submission

Query-aware Routing for Filtered Approximate Nearest Neighbors Search

Qianqian Xiong, Mengxuan Zhang

AI summary: Proposes a query-aware routing framework in which a lightweight ML model predicts each candidate method's recall and an offline benchmark table is used to select the best recall-QPS trade-off, reaching state-of-the-art performance on five unseen datasets.

Comments: 12 pages

Abstract

Filtered ANN search, which combines vector similarity with attribute predicates, is a core primitive in modern vector databases and retrieval-augmented generation. We benchmark all major categorical filtered ANN methods across multiple datasets under three predicates and find that no single method dominates. Moreover, even within a single dataset and predicate type, the best method for a query can vary. Therefore, we propose a query-aware routing framework. A lightweight ML model predicts each candidate method's recall on the query, and the router consults an offline benchmark table that maps every method and parameter setting to its measured recall and QPS, then selects the method with the best recall-QPS trade-off. Our ablation study narrows 22 candidate features to a minimal set of three, and we adopt regression rather than classification as the prediction target to sharpen accuracy. Our model is trained on six real-world datasets and applied to five unseen validation datasets. The final result shows that our router achieves state-of-the-art recall and QPS balance across all five validation datasets compared to existing filtered ANN baselines, while incurring negligible latency overhead.
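
The routing step described in this abstract can be sketched as a small selection function. Everything below is an assumption made for illustration: the abstract does not define the exact recall-QPS trade-off rule, so this sketch picks the fastest configuration expected to meet a target recall, and the method names, parameter strings, and numbers are invented.

```python
def route(predicted_recall, benchmark, target_recall):
    """Return the (method, params) with the highest measured QPS among
    configurations whose predicted and measured recall meet the target."""
    candidates = [
        (qps, method, params)
        for (method, params), (measured_recall, qps) in benchmark.items()
        if predicted_recall[method] >= target_recall
        and measured_recall >= target_recall
    ]
    if candidates:
        _, method, params = max(candidates, key=lambda c: c[0])
        return method, params
    # Nothing is expected to reach the target: fall back to the method with
    # the highest predicted recall, at its highest-recall setting.
    best = max(predicted_recall, key=predicted_recall.get)
    settings = [(rec, p) for (m, p), (rec, _) in benchmark.items() if m == best]
    return best, max(settings, key=lambda s: s[0])[1]

# Hypothetical offline table: (method, params) -> (measured recall, QPS).
benchmark = {
    ("prefilter", "ef=50"): (0.99, 800.0),
    ("graph", "ef=50"): (0.90, 5000.0),
    ("graph", "ef=200"): (0.97, 2000.0),
}
easy_query = {"prefilter": 0.99, "graph": 0.96}  # per-query model predictions
hard_query = {"prefilter": 0.99, "graph": 0.60}
print(route(easy_query, benchmark, target_recall=0.95))
print(route(hard_query, benchmark, target_recall=0.95))
```

In this toy setup the easy query goes to the faster graph configuration, while the hard query falls back to the slower but more reliable pre-filtering method, which is the per-query variation the abstract motivates.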


2606.19897 2026-06-19 cs.RO new submission

One-to-Two Acting: A Novel Framework for Single-arm Agent Action Expansion to Dual Arms

Youbin Yao, Nieqin Cao, Mingyan Li, Yan Ding, Fuqiang Gu, Chao Chen

Affiliations: Chongqing University; Xi'an Jiaotong-Liverpool University; Lumos Robotics

AI summary: Proposes ExS2D, a hierarchical action expansion framework that achieves dual-arm manipulation from single-arm supervision through temporal precedence extraction, subtask-guided action mapping, and collision-avoiding coordinated planning, cutting execution steps by 54.4% in simulation while maintaining the success rate.

Comments: 6 pages, 5 figures, 3 tables

Abstract

Dual-arm manipulation can improve throughput via parallel execution, but collecting bimanual demonstrations for training is costly and difficult. We present ExS2D, a hierarchical action expansion framework that enables dual-arm manipulation from single-arm supervision. ExS2D first generates structured subtasks from textual instructions while explicitly capturing temporal precedence. It then grounds each subtask into executable actions through subtask-guided action mapping in observation. Finally, precedence-aware action allocation and synchronized planning are performed by a coordinator driven by a multimodal large language model to select collision-free dual-arm executions. Simulation experiments demonstrate that ExS2D reduces the average number of execution steps by 54.4% while maintaining a success rate comparable to a single-arm baseline. Real-robot experiments on four tasks further demonstrate the reliability of ExS2D for dual-arm execution under few-shot single-arm samples, while using zero bimanual demonstrations.


2606.19894 2026-06-19 cs.LG New submission

Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures

Xinhe Mu, Zaijiu Shang, Zhaoqi Zhou, Chuan Zhou, Qi Meng, Guiying Yan, Zhiming Ma

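Affiliations: Shanghai Institute for Mathematics and Interdisciplinary Sciences; Huawei Technologies Co., Ltd.

AI summary: For arbitrary compactly supported distributions, proposes a score approximation method based on a discrete-mixture formulation and proves that the ReLU network complexity grows exponentially only in the upper Minkowski dimension $d$, breaking the curse of ambient dimensionality and explaining why diffusion models work on non-smooth data.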

Abstract

The remarkable success of score-based diffusion models has spurred significant efforts to establish their theoretical foundations. However, existing complexity bounds for score approximation rely heavily on restrictive assumptions like Lipschitz continuous densities or smooth manifold supports, which are routinely violated by the singularities, sharp boundaries, and disjoint clusters inherent to real-world perceptual data. This work establishes a universal score approximation theorem that works for any distribution supported on any compact set of upper Minkowski dimension $d$. Using a novel discrete-mixture formulation, we prove that the score function can be approximated with a ReLU network whose complexity grows exponentially only with $d$, thus breaking the exponential curse of ambient dimensionality. Combined with existing theories on accurately solving the backward diffusion SDE for arbitrary compact distributions, our work shows that diffusion models readily adapt to irregular, non-smooth data structures, explaining their competence in real-world generative tasks.
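As a hedged illustration of why a discrete mixture makes the score tractable, consider the following standard calculation. The notation is assumed here and is not taken from the paper: the data distribution is a finite mixture of point masses at atoms $x_i$ with weights $w_i$, and the forward process is $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with standard Gaussian $\varepsilon$. The noised density is then a Gaussian mixture whose score has a closed form:

```latex
p_t(x) = \sum_{i=1}^{N} w_i \, \mathcal{N}\!\left(x;\ \alpha_t x_i,\ \sigma_t^2 I\right),
\qquad
\nabla_x \log p_t(x) = \sum_{i=1}^{N} \pi_i(x)\, \frac{\alpha_t x_i - x}{\sigma_t^2},
\qquad
\pi_i(x) = \frac{w_i \exp\!\left(-\|x - \alpha_t x_i\|^2 / (2\sigma_t^2)\right)}
                {\sum_{j=1}^{N} w_j \exp\!\left(-\|x - \alpha_t x_j\|^2 / (2\sigma_t^2)\right)}.
```

The score is a softmax-weighted average of directions toward the scaled atoms. This suggests, informally, that the approximation cost may depend on how many atoms are needed to cover the support rather than on the ambient dimension; the paper's actual construction and bounds may differ.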

2606.19893 2026-06-19 cs.AI New submission

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

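Affiliations: School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute; Universiti Sains Malaysia

AI summary: Proposes the MetaResearcher framework, which scales the training of deep research agents in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture, aiming to improve benchmark performance and epistemic robustness.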

Abstract

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.

2606.19891 2026-06-19 cs.LG New submission

Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses

Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto

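Affiliations: Department of Informatics, Kyushu University; RIKEN AIP

AI summary: Studies adversarial bandit optimization in which the loss functions may be non-convex and non-smooth, proposes a modified bandit optimization algorithm, analyzes how the perturbation budget affects regret, and extends the globally budgeted, post-action perturbation model from linear losses to general convex and $\beta$-smooth losses.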

Abstract

We study adversarial bandit optimization in which the loss functions may be non-convex and non-smooth. In each round, the learner selects an action and observes only the loss incurred at that action. The loss consists of an underlying convex and $\beta$-smooth component and an adversarial perturbation that may be chosen after observing the learner's action. The perturbations are subject to a global budget controlling their cumulative magnitude over time. This framework extends the globally budgeted, post-action perturbation model from underlying linear losses to general convex and $\beta$-smooth losses. For this broader class, we establish expected regret guarantees that explicitly characterize the effect of the perturbation budget. To establish these guarantees, we modify a standard bandit optimization algorithm and develop an analysis that controls the additional regret caused by the perturbations. In the absence of perturbations, our results reduce to regret guarantees for the standard bandit convex optimization setting with $\beta$-smooth losses.

2606.19890 2026-06-19 cs.CY New submission

Open Weight AI Models Require Proportional Evaluation Approaches

Patricia Paskov, Christopher Rodriguez, Sunishchal Dev, Stephen Casper

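AI summary: Addressing the distinct risk factors of open-weight AI models (OWMs), proposes four proportional evaluation approaches (PE1-PE4) and systematically reviews 37 OWM families released from 2025 through April 2026, finding that only one fulfills all four.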

Abstract

Open-weight AI models (OWMs), or models released with publicly-available weights, are distributing rapidly and approaching the performance levels of leading closed-weight AI models (CWMs). While OWMs offer substantial scientific and economic benefits, their release introduces distinct risk factors for which existing evaluation practices, largely designed for CWM deployment, fail to account. In this paper, we argue that these risk factors demand distinct proportional evaluation (PE) approaches: evaluating without system-level safeguards (PE1), assessing robustness to modifications that undo model-level safeguards (PE2), testing selective capability amplification (PE3), and proxying worst-case misuse (PE4). We systematically review current evaluation practices of OWMs released in 2025 through April 2026, finding that only one of the 37 families of models reviewed fulfills PE1-4 and most do not fulfill any. This paper targets policymakers, funders, and researchers involved in AI evaluation. As OWMs grow increasingly capable, their evaluation warrants close attention from developers, funders, and governance bodies alike.

2606.19889 2026-06-19 cs.CV New submission

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

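Affiliations: The Chinese University of Hong Kong; EPFL; Imperial College London

AI summary: Proposes SurgVista, a surgical world model that uses Deformation Consistency Regularization and Drift Adaptation Training to address spatial interaction incoherence and temporal fidelity collapse, and that consistently outperforms existing methods, with gains widening over long prediction horizons.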

Abstract

Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.

2606.19888 2026-06-19 cs.LG cs.AI New submission

SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models

Feng Wu, Harsh Deep, Eric Lehman, Sanyam Kapoor, Guoshuai Zhao, Rahul Krishnan, Gari Clifford, Li-wei H Lehman

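Affiliations: Massachusetts Institute of Technology; OpenEvidence, USA; New York University; Xi’an Jiaotong University; University of Toronto; Emory University

AI summary: Proposes SL-S4Wave, a framework that combines contrastive learning with an encoder built on structured state space models and uses global convolution with multiscale subkernels to capture local and long-range dependencies in multichannel physiological waveforms, outperforming existing methods on tasks such as arrhythmia detection.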

Abstract

Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent self-supervised learning (SSL) methods, based on various encoder architectures such as convolutional neural networks, have been proposed to learn representations from unlabeled data, they often fall short in capturing long-range dependencies and noise-invariant features. Structured state space models (S4) excel at long-sequence modeling, but existing S4 architectures fail to capture the unique characteristics of multichannel physiological waveforms. In this work, we propose SL-S4Wave, a self-supervised learning framework that combines contrastive learning with a tailored encoder built on structured state space models. The encoder incorporates multi-layer global convolution using multiscale subkernels, enabling the capture of both fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel waveforms. Extensive experiments on real-world datasets demonstrate that SL-S4Wave (1) consistently outperforms state-of-the-art supervised and self-supervised baselines in a challenging arrhythmia detection task, (2) achieves high performance with significantly fewer labeled examples, showcasing strong label efficiency, (3) maintains robust performance on long waveform segments, highlighting its capacity to model complex temporal dynamics in long sequences that most existing approaches fail to model efficiently, and (4) transfers effectively to unseen arrhythmia types, underscoring its robust cross-domain generalization. We additionally evaluate SL-S4Wave on multiple EEG tasks, achieving superior performance over strong baselines and demonstrating the generalizability of our approach beyond cardiac waveforms.

2606.19887 2026-06-19 cs.CR cs.AI New submission

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

Chaeyun Kim, Daeyoung Park, Junghwan Kim, Jinyoung Jeong, Eunji Song, Yongtaek Lim, Minwoo Kim

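Affiliations: DATUMO INC.; Korea Advanced Institute of Science and Technology (KAIST); Financial Security Institute (FSI)

AI summary: Proposes the FinRED framework, which maps global financial standards to threats through an expert-guided two-level taxonomy, generates context-rich red-teaming behavioral prompts from real financial documents, and pairs them with an expert-validated evaluation rubric that reduces critical false negatives.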

Abstract

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.

2606.19883 2026-06-19 cs.LG stat.ML New submission

Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

Ananya Kunisetty, Avishek Ghosh

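Affiliations: Indian Institute of Technology Bombay

AI summary: Studies a multi-agent multi-armed bandit problem for competitive two-sided matching markets under cumulative prospect theory (CPT), proposes algorithms with optimal regret bounds, and extends them to adversarial markets.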

Comments Accepted at ECML-PKDD 2026, Naples, Italy

Abstract

We study a multi-agent multi-armed bandit problem in the competitive setup with two-sided matching markets under a human-centric decision-making model. To capture human preferences, we use cumulative prospect theory (CPT), which weighs the actions of the agent in a nonlinear fashion using an ($\alpha$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk-sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT weight-distorted rewards and obtain a player-optimal regret of $\mathcal{O}(K\log T \left(\frac{1}{\Delta}\right)^{2/\alpha})$, where $K$ denotes the number of arms, $T$ is the learning horizon, and $\Delta$ represents the (suitably defined) players' minimum preference gap. Noticing the dependence on $\Delta$ to be sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on $K$ in the dominant term and achieves improved (optimal) regret guarantees in the setting where the number of arms $K$ is significantly larger than the number of players $N$. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as the risk-sensitive measure, both when the total corruption budget is known and when it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.

2606.19882 2026-06-19 cs.CV cs.LG New submission

Multimodal Concept Bottleneck Models

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

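Affiliations: UC San Diego

AI summary: Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual Concept Bottleneck Layers to align image and text embeddings, enabling interpretable zero-shot classification and image retrieval, with up to a 51.26% average accuracy improvement across four benchmarks.

Comments Presented at NeurIPS 2025 Mechanistic Interpretability Workshop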

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited in their ability to generalize beyond a fixed set of predefined classes and are prone to non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs to CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

2606.19881 2026-06-19 cs.CL New submission

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj, Vidhan Jhawar, Ranga Prasad Chenna, Bharadwaj Y M G

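Affiliations: ServiceNow

AI summary: Proposes the REDACT benchmark, with 13,427 records, 51 entity types, and 25 languages; a strength-2 covering-array sampler controls nine generation axes, and entity-level metadata (disclosure status, disclosure form, GDPR-aligned sensitivity tier) supports stratified evaluation, revealing architecture-dependent failure modes of detectors on sensitive data.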

Comments 14 pages, 5 figures

Abstract

Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.

2606.19874 2026-06-19 cs.RO cs.CV New submission

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

Affiliations: HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; Aarhus University; University of Tokyo; Beijing University of Chemical Technology; North China Electric Power University

AI summary: Proposes MMD-SLAM, which uses the Atlanta World assumption to guide a Multi-Meta Gaussian representation and improves visual SLAM tracking accuracy and mapping quality through point-line fusion, dominant-direction encoding, and a Gaussian evolution strategy.

Comments ICRA 2026

Abstract

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.

2606.19869 2026-06-19 cs.DC New submission

EVM Workloads in the Wild: Evidence for Multi-Dimensional Gas Metering, State Growth, Delayed Execution, and Parallelism

Lioba Heimbach, Kushal Babel, Jason Milionis

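AI summary: By analyzing 2025 block traces from Ethereum (L1) and Base (L2), finds that the resource mix is unstable, state growth is underpriced, and execution outcomes are sensitive to historical state, providing an empirical basis for multi-dimensional gas metering and explicit pricing of state growth.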

Abstract

Gas metering on EVM-compatible blockchains assumes that execution conditions are stable: that the resource mix is constant enough to justify collapsing execution costs into a single scalar with fixed relative prices, and that state drift between submission and execution does not materially alter a transaction's outcome. We measure the extent to which this assumption fails. We present a trace-level measurement study of EVM workloads on Ethereum (L1) and Base (L2) throughout 2025, sampling 3,000 blocks per day per chain. We decompose each transaction into opcode-level execution gas, intrinsic gas, refunds, and persistent state deltas. To measure state sensitivity, we re-execute transactions from September 2025 on older states and record how gas usage and storage access patterns change. We find the resource mix to be far from stable: on Base, storage reads and compute account for 29.2% and 24.3% of execution gas, while Ethereum devotes 34.9% to storage writes. Ethereum's gas limit doubling during 2025 shifted its own profile toward compute-heavier, Base-like patterns. Base also exhibits a higher fraction of cold storage reads (49.7% versus 39.6% on Ethereum). Persistent state growth, a permanent cost priced as a transient one, reaches 456 GB on Base versus 38 GB on Ethereum. Execution outcomes are equally unstable: gas estimates vary across nearby historical states for 46.0% of transactions on Base, compared to 13.9% on Ethereum, with especially high sensitivity for MEV and DeFi activity. Storage access patterns also diverge across states, limiting the effectiveness of access lists and complicating parallel execution. Our findings provide an empirical foundation for multi-dimensional gas metering and explicit pricing of state growth. They also show that state-sensitive execution behavior complicates workload estimation, directly affecting transaction predictability and user experience.

2606.19868 2026-06-19 cs.AI New submission

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

Jiayi Wang, Xu-Yao Zhang

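Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences

AI summary: Systematically evaluates 24 black-box uncertainty estimation methods across 4 models and 4 datasets, finding that no single method is best everywhere, that methods which reason over and compare candidates in the answer space are generally effective, and that hybrid methods perform well under most conditions.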

Abstract

Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black-box UE especially important. However, existing work on black-box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black-box UE methods and organize them into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black-box UE methods for LLMs.
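To make the sampling-based category concrete, here is a minimal sketch that scores uncertainty from the agreement among several sampled answers to the same question. It is an illustration only, not one of the 24 benchmarked methods: the function name is hypothetical, and exact string matching after normalization is a simplifying assumption in place of the semantic comparison a real method might use.

```python
import math
from collections import Counter


def sampling_uncertainty(answers):
    """Return (majority_answer, confidence, normalized_entropy).

    answers: strings sampled from the same black-box model for one prompt.
    confidence is the share of samples that agree with the majority answer;
    normalized_entropy is 0 when all samples agree and 1 when all differ.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumption: answers are equivalent iff equal after trimming/lowercasing.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    n = len(normalized)
    top, top_count = counts.most_common(1)[0]
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_entropy = math.log(n) if n > 1 else 1.0
    return top, top_count / n, entropy / max_entropy


if __name__ == "__main__":
    samples = ["Paris", "paris ", " Paris", "Lyon"]  # hypothetical model outputs
    print(sampling_uncertainty(samples))
```

Only the model's text outputs are needed, which is what makes this family usable when logits and hidden states are unavailable.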

2606.19867 2026-06-19 cs.CV cs.AI New submission

PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

Dong Yeong Kim, Jaewon Choi, Youmin Shin, Jungyu Lee, Myeongseop Kim, Jinwook Choi, Joo Whan Kim, Young-Gon Kim

Affiliations: Interdisciplinary Program in Bioengineering, Seoul National University; Department of Transdisciplinary Medicine, Seoul National University Hospital; Department of Artificial Intelligence, Yonsei University; Department of Medicine, Seoul National University College of Medicine; Healthcare AI Research Institute, Seoul National University Hospital

AI summary: Proposes PSCT-Net, which uses differentiable back-projection to establish a spatial prior and combines it with Attention-Guided Projection and Bidirectional Mamba modules to reconstruct 3D CT from sparse bi-planar X-rays, alleviating depth ambiguity and improving osseous boundaries.

Comments 11 pages, 5 figures

Abstract

Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.

2606.19866 2026-06-19 cs.CR New submission

Low-Cost Multi-Precision Systolic Arrays for Accelerating FHE NTTs on AI ASICs

George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos

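AI summary: To address the performance bottleneck that precision mismatch causes for FHE on AI hardware, proposes a minimally modified multi-precision systolic array that natively performs full-precision output reconstruction under a uniform dataflow, achieving at least a 1.33x speedup.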

Abstract

Fully Homomorphic Encryption (FHE) ensures robust data privacy but suffers from prohibitive computational overhead. Accelerating FHE on AI hardware like Tensor Processing Units (TPUs) is promising, yet fundamentally limited by a precision mismatch: TPUs are optimized for 8-bit arithmetic, whereas FHE and its critical parts, such as the Number Theoretic Transform (NTT), demand high precision. Current approaches bridge this gap using matrix decomposition to execute NTT computations on low-precision matrix engines. However, reconstructing the full-precision results requires shift-and-add accumulation that does not match the dataflow of matrix multiplication. This forces full-precision reconstruction to be offloaded from matrix engines to vector processors, which disrupts the matrix multiplication dataflow and creates a significant performance bottleneck. To resolve this limitation, we propose a minimally modified multi-precision systolic array that performs full-precision output reconstruction natively within the array, in sync with low-precision matrix multiplication, under a uniform dataflow. Synthesized at 7nm with OpenRoad, our design incurs negligible hardware overhead. Cycle-accurate simulations using SCALE-Sim demonstrate that natively executing NTTs on the proposed architecture achieves at least a 1.33x speedup for transform sizes 2^12 to 2^16 on 128x128 matrix engines, successfully enabling standard AI hardware to support high-precision FHE acceleration.
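
The decomposition and shift-and-add reconstruction described in the abstract can be sketched in software. This is a generic illustration under stated assumptions, not the paper's dataflow or hardware: the NTT is written as a plain matrix-vector product, operands are split into 8-bit limbs, and the small modulus `Q = 7681` and all function names are chosen here only for the example.

```python
LIMB_BITS = 8  # assumed limb width, matching the 8-bit arithmetic in the abstract
Q = 7681       # small NTT-friendly prime picked for illustration (Q - 1 divisible by 4)


def split_limbs(value, num_limbs):
    """Split a non-negative integer into little-endian 8-bit limbs."""
    mask = (1 << LIMB_BITS) - 1
    return [(value >> (LIMB_BITS * i)) & mask for i in range(num_limbs)]


def find_root(n, q):
    """Find a primitive n-th root of unity mod q (n must be a power of two)."""
    for c in range(2, q):
        w = pow(c, (q - 1) // n, q)
        if pow(w, n // 2, q) == q - 1:  # w^(n/2) == -1 means the order is exactly n
            return w
    raise ValueError("no primitive root of unity found")


def ntt_matrix(n, q):
    """Build the n x n NTT matrix W[i][j] = omega^(i*j) mod q."""
    omega = find_root(n, q)
    return [[pow(omega, i * j, q) for j in range(n)] for i in range(n)]


def limb_matvec_mod(matrix, vector, q, num_limbs):
    """Compute (matrix @ vector) mod q using only 8-bit x 8-bit products."""
    n = len(vector)
    m_limbs = [[split_limbs(w, num_limbs) for w in row] for row in matrix]
    v_limbs = [split_limbs(x, num_limbs) for x in vector]
    out = []
    for row in m_limbs:
        acc = 0
        for i in range(num_limbs):
            for j in range(num_limbs):
                # One low-precision pass: dot product of limb i of the matrix
                # row with limb j of the vector.
                partial = sum(row[k][i] * v_limbs[k][j] for k in range(n))
                # Shift-and-add step that rebuilds the full-precision result.
                acc += partial << (LIMB_BITS * (i + j))
        out.append(acc % q)
    return out


coeffs = [1, 2, 3, 4]
W = ntt_matrix(4, Q)
result = limb_matvec_mod(W, coeffs, Q, num_limbs=2)
```

The inner `partial` sums correspond to work a low-precision matrix engine can do, while the shifted accumulation is the reconstruction step that, per the abstract, existing approaches offload to vector processors.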
