arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03877 2026-06-03 cs.CV

MLP Splatting: Object-Centric Neural Fields

Shinjeong Kim, Yuzhou Cheng, Xin Kong, Paul H. J. Kelly, Andrew J. Davison

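AI summary: Proposes MLP-Splatting, which uses a small number of compact MLP primitives for scene decomposition and novel-view synthesis, supports object-level editing, and is more memory- and rendering-efficient than existing methods.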

Abstract

3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but lack the ability to easily decompose scene elements into a few primitives, requiring additional segmentation or grouping for object-level manipulation. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis. MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray-primitive interactions. Our primitives are trained using RGB supervision alone, which yields primitives that represent local scene regions often corresponding to objects or object parts, enabling interactive object-level editing without segmentation masks by selecting a handful of primitives. Our method, augmented with optional semantic feature distillation, enables open-vocabulary scene interaction and open-set instance segmentation. In our experiments, we achieve substantially lower memory usage (1/15$\times$) and faster rendering (3$\times$) than state-of-the-art semantic 3DGS methods. Project Page: https://shinjeongkim.com/mlp-splatting
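
The compositing step mentioned in the abstract can be illustrated with standard front-to-back alpha compositing. The sketch below is not the authors' renderer: it assumes each ray-primitive interaction has already been reduced to a `(depth, rgb, opacity)` sample, and the name `composite_ray` is hypothetical.

```python
from typing import Sequence, Tuple

# Assumed sample format: (depth along the ray, RGB colour, opacity in [0, 1]).
Sample = Tuple[float, Tuple[float, float, float], float]


def composite_ray(samples: Sequence[Sample]) -> Tuple[float, float, float]:
    """Front-to-back alpha compositing of per-primitive samples on one ray."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for _, rgb, alpha in sorted(samples, key=lambda s: s[0]):
        weight = transmittance * alpha
        for c in range(3):
            color[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # stop early once the ray is nearly opaque
            break
    return (color[0], color[1], color[2])
```

Only the primitives a ray actually intersects contribute samples, which is where the sparsity in such a scheme would come from.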

2606.03876 2026-06-03 cs.HC cs.AI cs.MA

From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan, Elizabeth Mynatt, Varun Mishra

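AI summary: This study uses large language models (LLMs) to generate retrospective summaries from multi-modal tracking data, redesigns the system based on technology probes and interviews, significantly improves remote family members' satisfaction, perceived helpfulness, trust, and willingness to receive the summaries, and presents design implications for supporting their sensemaking shift from "What" to "How" and "Why".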

Abstract

With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting "What" data were collected, to explaining "How" is my loved one doing and "Why".

2606.03875 2026-06-03 cs.CV

Seg2Track++: Probabilistic Track Validation and Data Association for Multi-Object Tracking and Segmentation

Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes

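AI summary: Proposes the Seg2Track++ framework, which combines SAM2 instance segmentation with probabilistic track validation to perform zero-shot multi-object tracking and segmentation, improving identity preservation and suppressing false-positive propagation.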

Abstract

Autonomous systems require robust Multi-Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask-level delineation. Foundation models such as SAM2 have shown strong zero-shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false-positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero-shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence-Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false-positive propagation, and robust track management without fine-tuning.
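
The idea of validating track existence with a Bernoulli filter can be sketched as a recursive update of a per-track existence probability. The abstract does not give the exact PTV equations, so the following is a simplified binary-observation form under stated assumptions: the function names and the parameters `p_survive`, `p_birth`, `p_detect`, and `p_false` are hypothetical.

```python
def predict_existence(r: float, p_survive: float = 0.95, p_birth: float = 0.0) -> float:
    """Prediction step: the track survives, or a new one is born in its place."""
    return p_birth * (1.0 - r) + p_survive * r


def update_existence(r: float, detected: bool,
                     p_detect: float = 0.9, p_false: float = 0.1) -> float:
    """Bayes update of existence probability r from a hit/miss observation.

    Assumption: a real object is matched with probability p_detect, and a
    non-existent (ghost) track is matched with probability p_false.
    """
    if detected:
        numerator = r * p_detect
        denominator = numerator + (1.0 - r) * p_false
    else:
        numerator = r * (1.0 - p_detect)
        denominator = numerator + (1.0 - r) * (1.0 - p_false)
    return numerator / denominator if denominator > 0.0 else 0.0
```

Under this sketch, a track whose existence probability falls below a chosen threshold after repeated misses would be dropped as a ghost track.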

2606.03874 2026-06-03 cs.CV cs.RO

DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Koki Nagano, Hongyu Liu, Seonwook Park, Tianye Li, Amrita Mazumdar, Christian Jacobsen, Shengze Wang, Michael Stengel, Rajarshi Roy, Ka Chun Cheung, Simon See, Shalini De Mello

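AI summary: Proposes DyaPlex, a streaming full-duplex speech-motion model that uses a dual-tower Transformer architecture and a unified dyadic token interleaving mechanism to achieve synchronized multi-modal interaction, reaching state-of-the-art performance on monadic and dyadic interaction benchmarks.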

Comments
Project page: https://research.nvidia.com/labs/amri/projects/DyaPlex

Abstract

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

2606.03871 2026-06-03 cs.CV cs.CL cs.LG

Visual Instruction Tuning Aligns Modalities through Abstraction

Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

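AI summary: Using probing analyses and causal interventions, finds that visual instruction tuning embeds visual features directly into the intermediate semantic layers of the LLM, bypassing the early unimodal processing layers, and aligns visual and textual representations by extending and strengthening the existing abstraction phase.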

Abstract

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

2606.03868 2026-06-03 cs.CV

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

Dingrui Wang, YuAn Wang, Jinkun Liu, Yue Zhang, Mattia Piccinini, Yu Sun, Johannes Betz

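AI summary: Proposes Donk, a model that jointly models the distribution of interaction videos and hand trajectories to enable action generation and data generation for dexterous hands.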

Comments
9 pages, 5 figures

Abstract

Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.

2606.03867 2026-06-03 cs.CL cs.AI

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong

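AI summary: Proposes a training-free mixture-of-agents framework that combines large language models and knowledge graphs, decomposing summarization into specialized agents (extractive selection, knowledge-aware abstraction, iterative refinement) and unifying their outputs with a multi-perspective consistency mechanism, achieving state-of-the-art or competitive performance on English and Vietnamese datasets.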

Comments
Accepted by Neural Computing and Applications

Abstract

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

2606.03866 2026-06-03 cs.IR cs.AI cs.CL

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu, Peng Jiang, Kun Gai

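AI summary: Proposes the Taiji framework, which generates high-quality CoT data through reverse-engineered reasoning and open-ended rejection sampling, and uses Pareto Optimal Policy Optimization (POPO) to adaptively adjust cross-domain reward weights, targeting a Pareto-optimal trade-off between LLM semantic knowledge and recommendation ID features. Deployed on Kuaishou's advertising platform, it serves over 400 million daily active users.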

Comments
8 pages, 2 figures

Abstract

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

2606.03864 2026-06-03 cs.SI cs.CY cs.DL cs.LG physics.soc-ph

Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics

Thomas Maillart, Thibaut Chataing, Ntorina Antoni, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud

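AI summary: Proposes an explainable machine-learning approach that models the evolution of OpenAlex concept networks to forecast the structural precursors of scientific breakthroughs (the emergence and intensification of links between research concepts). A two-stage LightGBM model with 59 features jointly predicts the formation and future weight of concept pairs, achieving higher ROC-AUC (0.954 to 0.967) than prior methods across four technology and biomedical domains, along with better explainability.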

Comments
18 pages, 10 figures, 4 tables. An earlier version was presented at Global Tech Mining Conference 2026. Code and data: https://github.com/wazaahhh/breakthroughs-forecasting

Abstract

We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs -- the emergence and intensification of links between research concepts -- by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC-AUC in [0.954, 0.967] at all horizons without re-tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors -- particularly Adamic-Adar similarity and degree-based Hadamard measures -- consistently drive accuracy, suggesting that breakthrough-relevant recombinations emerge in tightly connected sub-networks. Two expert-anchored cases, quantum annealing and AI-enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three-layer decision architecture -- detection, expert translation, institutional integration -- that turns these forecasts into evidence-based research strategy and policy, anchored in open data and explainable features.
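
Two of the structural features named in the abstract can be computed from a concept co-occurrence graph as follows. This is a minimal sketch under stated assumptions: the toy graph and concept labels are invented for illustration, the Adamic-Adar index uses its standard definition, and `degree_hadamard` (the product of the two endpoint degrees) is only one plausible reading of "degree-based Hadamard measures", not necessarily the authors' feature.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected adjacency sets

# Hypothetical toy concept network (not from the paper's data).
TOY_GRAPH: Graph = {
    "quantum": {"annealing", "optimization"},
    "ml": {"annealing", "optimization"},
    "annealing": {"quantum", "ml", "hardware"},
    "optimization": {"quantum", "ml"},
    "hardware": {"annealing"},
}


def adamic_adar(graph: Graph, u: str, v: str) -> float:
    """Sum of 1 / log(degree) over the common neighbours of u and v."""
    common = graph[u] & graph[v]
    # Degree-1 nodes are skipped defensively because log(1) == 0.
    return sum(1.0 / math.log(len(graph[z])) for z in common if len(graph[z]) > 1)


def degree_hadamard(graph: Graph, u: str, v: str) -> int:
    """Assumed pair feature: element-wise product of the endpoint degrees."""
    return len(graph[u]) * len(graph[v])
```

Rare shared neighbours contribute more to the Adamic-Adar score than hubs do, which fits the abstract's observation that recombinations emerge in tightly connected sub-networks.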

2606.03858 2026-06-03 cs.AI

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He

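AI summary: Proposes PyraMathBench, a hierarchical benchmark that evaluates LLMs by integrating numerical processing and mathematical reasoning, and introduces the SOLVE module and the IRPO optimization method to improve numerical-mathematical synergy.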

Abstract

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.

2606.03852 2026-06-03 cs.SE cs.AI

FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement

Yinsheng Yao, Hongxiang Zhang, Weixi Tong, Tianyi Zhang

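AI summary: Proposes the FLARE framework, which uses a lightweight diagnostic model to predict line-level suspiciousness signals for bug localization and code refinement, and improves repair results through top-k candidate search.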

Abstract

Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to iteratively refine the generated code. Such signals are either too coarse-grained or too high-level, and therefore do not tell the model where to fix the bug. In this work, we present Flare, an iterative framework with a lightweight diagnostic model that predicts line-level suspiciousness signals for bug localization and code refinement. Given the inherent uncertainty of diagnostic predictions, Flare searches over the top-k suspicious regions and selects the best candidate according to execution outcomes. Experiments on LiveCodeBench and BigCodeBench with five base LLMs show that, even without candidate search (k=1), Flare outperforms the strongest baseline with an absolute improvement of 1.72% to 7.42%. Furthermore, searching over 10 candidates yields an average improvement of 8.50% compared with no candidate search. When evaluated in isolation, our lightweight diagnostic model achieves the best performance compared with recent fault localization methods, demonstrating that it can provide reliable fine-grained guidance for code refinement.
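
The top-k search described in the abstract can be sketched as follows. This is an illustration only, under simplifying assumptions: a "region" is taken to be a single line, and `repair` (standing in for the LLM refinement call) and `passed_tests` (standing in for test execution) are hypothetical callables supplied by the caller.

```python
from typing import Callable, List, Sequence


def top_k_lines(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k most suspicious lines, most suspicious first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def refine_once(
    code_lines: List[str],
    scores: Sequence[float],
    k: int,
    repair: Callable[[List[str], int], List[str]],
    passed_tests: Callable[[List[str]], int],
) -> List[str]:
    """Try one repair per suspicious line and keep the best-executing candidate."""
    best, best_passed = list(code_lines), passed_tests(code_lines)
    for idx in top_k_lines(scores, k):
        candidate = repair(code_lines, idx)  # hypothetical LLM edit at line idx
        n_passed = passed_tests(candidate)   # hypothetical execution feedback
        if n_passed > best_passed:
            best, best_passed = candidate, n_passed
    return best
```

With `k=1` this reduces to repairing only the single most suspicious line, which corresponds to the no-search setting reported in the abstract.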

2606.03851 2026-06-03 cs.LG

Two-Action Apple Tasting with Switching Costs

Tommaso Cesari, Roberto Colomboni

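AI summary: Studies the two-action apple-tasting problem with switching costs against an oblivious adversary and, by analyzing the trade-off between the revealing action and the blind action, proves that the optimal regret is Θ(√T).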

Abstract

We study the two-action apple-tasting problem with switching costs against an oblivious adversary. In an equivalent normalized formulation, at each round the learner chooses between a revealing action and a blind action: the revealing action gives reward $0$ and reveals the hidden value $x_t\in[-1,1]$ of the blind action; the blind action gives reward $x_t$ but reveals nothing. The learner pays one unit whenever they switch actions, and regret is measured against the best fixed action in hindsight. General feedback-graph algorithms with switching costs give $\widetilde O(T^{2/3})$ regret guarantees for this problem. The two-action apple-tasting graph was the natural candidate for the missing $\Omega(T^{2/3})$ obstruction in the switching-cost classification: such a lower bound would have transferred to a large family of still-unclassified feedback graphs. We prove that this obstruction is not there: the oblivious minimax expected regret for this problem satisfies \[ \frac{1}{2\sqrt3}\cdot\sqrt T \le R_T^\star \le 2\sqrt{3}\cdot \sqrt{T}. \]
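
To make the normalized formulation concrete, the sketch below computes the regret of a given action sequence against the best fixed action in hindsight. It assumes, beyond what the abstract states, that the first round incurs no switching cost; the function name `regret` and the example values are illustrative only.

```python
from typing import Sequence


def regret(xs: Sequence[float], actions: Sequence[str]) -> float:
    """Regret with unit switching costs.

    xs[t] is the hidden value of the blind action in round t (in [-1, 1]);
    actions[t] is "reveal" (reward 0) or "blind" (reward xs[t]).
    Assumption: the first action is free, so only changes are charged.
    """
    reward = sum(x for x, a in zip(xs, actions) if a == "blind")
    switches = sum(1 for prev, cur in zip(actions, actions[1:]) if prev != cur)
    # Always revealing earns 0; always playing blind earns sum(xs).
    best_fixed = max(0.0, sum(xs))
    return best_fixed - (reward - switches)
```

For example, with hidden values `[0.5, 0.5, -1.0, 1.0]` and the learner revealing once before committing to the blind action, the learner collects 0.5 in reward, pays 1 for the switch, and the best fixed action earns 1.0, giving a regret of 1.5.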

2606.03847 2026-06-03 cs.RO

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

Xiangdong Feng, Yuxuan Cheng, Chen Shi, Boyao Han, Yuxuan Yan, Yitong Hong, Zhuotao Tian, Li Jiang

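AI summary: To address the fixed execution horizon in flow-based robot policies, proposes DVAC, which uses the variance of clean-action estimates during denoising to adaptively set the execution horizon, maintaining or improving task success while reducing replanning frequency.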

Abstract

Action chunking has become a common inference strategy for flow-based robot policies, improving action coherence by modeling multi-step temporal dependencies in demonstrations. However, the execution horizon is still typically set as an empirical fixed value, overlooking that predictable free-space motions and precision-critical interaction phases often require different replanning frequencies. In this work, we first show that the denoising process of flow-based policies contains an intrinsic signal of task phases: clean-action estimates remain stable during predictable motion phases, but fluctuate more strongly around contact-rich or precision-sensitive operations. Motivated by this observation, we propose DVAC (Denoising-Variance Adaptive Chunking), a test-time method that adaptively determines how many actions to execute from each predicted chunk. DVAC measures the variance of clean-action estimates over the final denoising steps, executes the stable low-variance prefix, and replans before high-variance future actions are committed. To transfer across tasks and rollouts, DVAC further calibrates the threshold with a rolling estimate of the local variance scale. Experiments on LIBERO, RoboTwin, CALVIN, and real-world manipulation show that DVAC improves task success while reducing replanning frequency. With a $\pi_{0.5}$-based policy, DVAC improves LIBERO success from 94.75% to 98.00% and reduces replanning by 43.0%, while also yielding aggregate gains on RoboTwin and CALVIN and improving real-world execution efficiency.
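
The prefix-selection rule described in the abstract can be sketched as below. This is a simplified illustration, not the authors' implementation: actions are assumed to be scalars, the threshold is a fixed value rather than the rolling calibration the abstract mentions, and the names `stable_prefix_length` and `min_exec` are hypothetical.

```python
from statistics import pvariance
from typing import Sequence


def stable_prefix_length(
    estimates: Sequence[Sequence[float]],
    threshold: float,
    min_exec: int = 1,
) -> int:
    """Number of chunk actions to execute before replanning.

    estimates[s][t] is the clean-action estimate for chunk position t at the
    s-th of the final denoising steps (scalar actions assumed for brevity).
    """
    horizon = len(estimates[0])
    n_stable = 0
    for t in range(horizon):
        variance_t = pvariance([step[t] for step in estimates])
        if variance_t > threshold:
            break  # first unstable position: replan before committing to it
        n_stable += 1
    # Always execute at least min_exec actions so the policy makes progress.
    return max(min_exec, n_stable)
```

Positions whose estimates agree across the final denoising steps are executed; the first position where they diverge triggers a replan.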

2606.03846 2026-06-03 cs.CL cs.AI cs.LG

Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models

Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng, Yutaka Matsuo, Yusuke Iwasawa

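AI summary: Proposes a simple self-assessment method for uncertainty quantification in large language models, based on semantic clustering and multiple-choice probabilities, that outperforms baseline methods across multiple models and datasets.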

Comments
Findings of ACL 2026

Abstract

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.
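
The three steps in the abstract (cluster, build a multiple-choice question, read off option probabilities) can be sketched as follows. Several pieces are assumptions rather than the paper's method: `same_meaning` is a caller-supplied stand-in for semantic clustering, the option log-probabilities are assumed to come from the LLM, and renormalizing them with a softmax is one possible choice.

```python
import math
from typing import Callable, List, Sequence


def cluster_samples(
    samples: Sequence[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose representative matches."""
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            if same_meaning(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def build_question(question: str, clusters: Sequence[Sequence[str]]) -> str:
    """Turn one representative per cluster into a lettered answer option."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    options = [f"{letters[i]}. {c[0]}" for i, c in enumerate(clusters)]
    return "\n".join([question] + options)


def option_confidences(option_logprobs: Sequence[float]) -> List[float]:
    """Softmax over the log-probabilities the model assigns to option letters."""
    peak = max(option_logprobs)
    exps = [math.exp(lp - peak) for lp in option_logprobs]
    total = sum(exps)
    return [e / total for e in exps]
```

The confidence of the model's answer is then the probability of the option whose cluster contains that answer.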

2606.03843 2026-06-03 cs.LG cs.AI

Re-Evaluating Continual Learning with Few-Shot Adaptation

Amogh Inamdar, Matthew So, Vici Milenia, Richard Zemel

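AI summary: Proposes replacing zero-shot evaluation with few-shot evaluation to measure the stability and plasticity of continual learning systems more comprehensively, and uses a new metric, per-shot plasticity, to show that meta-learning over a short sequence of future tasks induces learning-to-learn behavior.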

Comments
21 pages, 16 figures

Abstract

Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and plasticity, the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model or method's ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive assessment of the stability and plasticity of a continual learning system. We conduct a fine-grained assessment on task sequences for continual image classification and find that this paradigm produces novel insights into the performance of popular continual learning strategies. Through few-shot evaluation with a novel metric -- per-shot plasticity -- we show that adding "foresight" to continual learning methods via the meta-learning of a short sequence of future tasks induces learning-to-learn behavior over the task sequence.

2606.03841 2026-06-03 cs.AI

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Zherui Yang, Fan Liu, Yansong Ning, Hao Liu

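AI summary: Proposes EvoDS, which combines autonomous skill acquisition and adaptive context compression with reinforcement learning so that a data science agent can self-evolve, substantially improving performance on multi-stage, iterative tasks.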

Comments
Accepted by KDD2026

Abstract

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

2606.03839 2026-06-03 cs.LG

Text-attributed Graph Condensation via Text Selection and Attribute Matching

通过文本选择和属性匹配的文本属性图压缩

Haowei Han, Yuxiang Wang, Guojia Wan, Hao Wang, Shanshan Feng, Hao Huang, Jiawei Jiang, Xiao Yan

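AI summary: Proposes TAGSAM, which condenses text-attributed graphs through subgraph text selection and attribute similarity matching, substantially reducing space and time consumption while preserving training accuracy.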

详情
AI中文摘要

文本属性图(TAG)是一种重要的图结构数据,其中每个节点都有文本描述。TAG模型通常联合训练图神经网络(GNN)和语言模型,导致高空间和时间消耗,尤其是在大型数据集上。为了缓解这一问题,我们提出了TAGSAM,一种在保持训练精度的同时压缩TAG的压缩方法。TAGSAM有两个关键设计,即子图文本选择和属性相似性匹配,分别压缩TAG的文本描述和图拓扑。对于文本,子图文本选择通过最大化互信息从多个相关文本描述中选择并合并代表性文本块。对于图拓扑,基于匹配训练轨迹(MTT)的流行压缩方法存在高方差,阻碍了精度。我们的属性相似性匹配通过对齐稳定的相似性矩阵来缓解这一问题。我们评估了TAGSAM与六个最先进的基线方法,结果显示其优越性能。在相同压缩大小下,TAGSAM在精度上平均比最佳基线提高4.9%。此外,即使将TAG压缩到仅1%的大小,它仍能保持有竞争力的训练精度。我们的代码可在以下网址获取:https://github.com/SundayVHan/TAGSAM

英文摘要

A Text-Attributed Graph (TAG) is an important type of graph-structured data in which each node has a text description. TAG models usually train a Graph Neural Network (GNN) and a language model jointly, which leads to high space and time consumption, especially on large datasets. To mitigate this, we propose TAGSAM, a condensation method that compresses TAGs while preserving training accuracy. TAGSAM comes with two key designs, subgraph text Selection and Attribute similarity Matching, which compress the text descriptions and the graph topology of a TAG, respectively. For the texts, subgraph text selection selects and merges representative text chunks from multiple related text descriptions by maximizing mutual information. For the graph topology, popular condensation methods based on Matching Training Trajectories (MTT) suffer from high variance, which hinders accuracy. Our attribute similarity matching mitigates this issue by aligning stable similarity matrices. We evaluate TAGSAM against six state-of-the-art baselines, where it shows superior performance. For the same compressed size, TAGSAM improves upon the best-performing baseline by an average of 4.9% in accuracy. Furthermore, it maintains competitive training accuracy even when the TAG is condensed to just 1% of its size. Our code is available at https://github.com/SundayVHan/TAGSAM

2606.03837 2026-06-03 cs.CV

Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

在低资源视频任务适应中,我们(不)需要时间上下文的哪些部分?

Luc P. J. Sträter, Hazel Doughty

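AI summary: This paper systematically studies how temporal context should be allocated across model adaptation strategies for video understanding, evaluating parameter-efficient fine-tuning and probing methods across settings and revealing the optimal distribution of temporal context across backbone, PEFT, and probe.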

详情
AI中文摘要

参数高效微调(PEFT)和探测使得仅使用少量可训练参数就能适应基础模型,这对于标注和计算成本高昂的视频理解具有吸引力。然而,视频PEFT主要集中于适应图像预训练模型,而标准PEFT方法也可应用于视频表示。这些设置很少被比较,并且都将时间推理限制在模型的单个组件中,从而留下了时间上下文应如何在骨干网络、PEFT和探测之间分布的问题。在这项工作中,我们提供了视频理解中模型适应策略的系统研究。我们在外观聚焦、运动聚焦和空间密集设置中评估了方法,特别关注数据有限且参数效率最有利的场景。我们的结果为跨设置的PEFT和探测提供了新的见解,并证明了时间上下文分配对于有效视频适应的重要性。

英文摘要

Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT, and probe. In this work, we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused, and spatially dense settings, with a particular focus on scenarios with limited data, where parameter efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.

2606.03834 2026-06-03 cs.RO

Let the Dynamics Flow: Stable Flow Matching Dynamical Systems

让动力学流动:稳定的流匹配动力系统

Rodrigo Pérez-Dattari, Francisco Leiva, Andrea Testa, Leonel Rozo, Javier Ruiz del Solar, Noémie Jaquier

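AI summary: Proposes the Stable Flow Matching Dynamical Systems (SFMDS) framework, which parametrizes dynamical systems via flow matching and imposes Lyapunov stability constraints to achieve stable, scalable, multimodal robot motion generation.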

详情
AI中文摘要

流匹配最近已成为模仿学习的一种强大方法,能够实现可扩展、表达力强且多模态的运动策略。然而,将这些生成模型纳入形式化的稳定性保证(确保机器人行为安全和可泛化的前提)仍然是一个重大挑战。虽然将机器人运动建模为动力系统允许这种基于稳定性的归纳偏置,但现有框架难以捕捉复杂机器人任务中固有的丰富动作分布。本文介绍了稳定流匹配动力系统(SFMDS),这是一个弥合高容量生成模型与形式化李雅普诺夫稳定性保证之间差距的新框架。SFMDS通过流匹配参数化动力系统,同时将模型约束到稳定解族。我们提出了两种变体:基于惩罚项的软约束,以及直接嵌入模型架构的硬结构约束。我们还将两种公式扩展到李群。在基准数据集、仿真和类人机器人上的实验表明,SFMDS在低维和高维状态空间中学习稳定、可扩展和多模态的动力系统,从而实现安全且富有表现力的机器人运动生成。

英文摘要

Flow matching has recently emerged as a powerful approach for imitation learning, enabling scalable, expressive, and multimodal motion policies. However, incorporating formal stability guarantees into these generative models, a prerequisite to ensure safe and generalizable robot behaviors, remains a significant challenge. While modeling robot motions as dynamical systems allows for such stability-based inductive biases, existing frameworks struggle to capture the rich action distributions inherent in complex robotic tasks. This paper introduces Stable Flow Matching Dynamical Systems (SFMDS), a novel framework that bridges the gap between high-capacity generative modeling and formal Lyapunov stability guarantees. SFMDS parametrizes dynamical systems via flow matching while simultaneously constraining the model to a family of stable solutions. We propose two variants: a soft constraint based on a penalty term, and a hard structural constraint embedded directly in the model architecture. We further extend both formulations to Lie groups. Experiments on benchmark datasets, in simulation, and on a humanoid robot show that SFMDS learns stable, scalable, and multimodal dynamical systems in low- and high-dimensional state spaces, enabling safe and expressive robot motion generation.

2606.03831 2026-06-03 cs.LG stat.ML

Online Learning with Gradient-Variation Interval Regret

基于梯度变化的区间遗憾在线学习

Yan-Feng Xie, Shuche Wang, Peng Zhao, Zhi-Hua Zhou

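AI summary: This paper proposes the first online learning algorithm whose interval regret bound scales with gradient variation. It uses a two-layer online ensemble structure, adapts to various problem-dependent quantities while attaining the minimax-optimal rate, and introduces a Lipschitz- and smoothness-agnostic variant.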

详情
AI中文摘要

本文研究使用区间遗憾度量的非平稳在线学习,该度量要求在线算法在每个时间区间内表现良好。我们提出了第一个在线学习算法,其区间遗憾界随梯度变化缩放,梯度变化是衡量在线函数梯度累积变化的基本度量,与多种问题相关量有关,并与随机优化等问题紧密相连。我们的方法采用简单高效的两层在线集成结构,实现了强大的理论保证。具体来说,它享有同时自适应多种问题相关量的遗憾界,同时在最坏情况下保持极小化最优率。此外,认识到超参数调优的挑战,我们引入了一种Lipschitz和平滑性无关的变体,自动适应这些可能未知的常数。这主要得益于一种新颖的Lipschitz自适应元算法,该算法可能具有独立的意义。除了区间遗憾,我们的方法还产生了更广泛的影响:它为区间动态遗憾(一种更强的度量,与任何区间上的变化比较器竞争)提供了通用的界,并首次为随机扩展对抗优化提供了分段刻画。理论发现通过实验得到验证。

英文摘要

This paper investigates non-stationary online learning using the metric of interval regret, which requires an online algorithm to perform well over every time interval. We propose the first online learning algorithm that achieves an interval regret bound scaling with gradient variation, a fundamental measure of the cumulative change in online function gradients, which relates to various problem-dependent quantities and is closely connected to stochastic optimization and other problems. Our method employs a simple and efficient two-layer online ensemble structure that achieves strong theoretical guarantees. Specifically, it enjoys a regret bound that simultaneously adapts to various problem-dependent quantities while also preserving the minimax-optimal rate in the worst case. Moreover, recognizing the challenge of hyperparameter tuning, we introduce a Lipschitz- and smoothness-agnostic variant that automatically adapts to these potentially unknown constants. This is primarily enabled by a novel Lipschitz-adaptive meta algorithm, which may be of independent interest. Beyond interval regret, our method also yields broader implications: it provides versatile bounds for interval dynamic regret, a stronger measure that competes with changing comparators over any interval, and yields the first piecewise characterization for stochastic extended adversarial optimization. Theoretical findings are validated by experiments.
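
For readers unfamiliar with the two quantities named in this abstract, the block below writes out the definitions commonly used in the online convex optimization literature. The abstract itself does not give notation, so the symbols ($f_t$ for the online functions, $\mathbf{x}_t$ for the learner's decisions, $\mathcal{X}$ for the feasible domain, $I = [r, s]$ for an interval) and the exact normalization are assumptions; the paper may define them slightly differently.

```latex
% Interval regret on I = [r, s] \subseteq [1, T]:
% loss of the learner on I minus the loss of the best fixed decision on I.
\mathrm{Reg}_I \;=\; \sum_{t=r}^{s} f_t(\mathbf{x}_t)
  \;-\; \min_{\mathbf{x} \in \mathcal{X}} \sum_{t=r}^{s} f_t(\mathbf{x})

% Gradient variation restricted to I:
% cumulative squared change between consecutive gradients, worst case over the domain.
V_I \;=\; \sum_{t=r+1}^{s} \sup_{\mathbf{x} \in \mathcal{X}}
  \bigl\| \nabla f_t(\mathbf{x}) - \nabla f_{t-1}(\mathbf{x}) \bigr\|_2^2
```

Under these definitions, a bound that scales with $V_I$ is small whenever consecutive loss functions have similar gradients on the interval, and it only falls back to the worst-case rate when the gradients change substantially.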

2606.03829 2026-06-03 cs.AI

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

BigFinanceBench: 一个基于工作流的金融研究智能体基准

Alex Wang, Georg Meinhardt, Jacob Katz, Joseph H. Kim, Pratyush K. Chaudhary, Chase Blagden, Eric Xu

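AI summary: To address the under-evaluation of the auditable derivation behind financial-research answers, proposes BigFinanceBench, a benchmark of 928 expert-authored tasks, each with a point-weighted rubric, that evaluates the full derivation rather than only the final answer; experiments show the best system reaches only a 58.8% rubric score, leaving substantial headroom.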

详情
AI中文摘要

金融研究答案只有在其他分析师能够审计其产生过程(包括选择哪个来源、哪个时期和会计定义、做出哪些假设以及如何进行计算)时才具有决策相关性。现有的金融基准主要评估孤立的子技能或最终答案,而忽略了可审计的推导过程本身。我们引入了BigFinanceBench,一个包含928个专家编写的开放式金融研究任务的基准,其中每个任务将一个真实参考答案与一个点权评分标准配对,该评分标准将推导过程分解为可独立检查的步骤。BigFinanceBench基于工作流,因为它评估完整的推导过程而不仅仅是最终输出。在36,241个评分点上,该基准支持部分信用评估和跨分析师工作流的故障定位。评估十个当前前沿和开放权重智能体,我们发现存在显著提升空间:最佳系统仅达到58.8%的评分,最终答案准确性是推导质量的有用但有损的代理指标,并且模型能力在金融工作流中不均匀变化。

英文摘要

Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.
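
The point-weighted rubric idea in this abstract can be illustrated with a small sketch. This is not the benchmark's actual schema or scoring code: the `RubricStep` fields, the example steps, and the point values are hypothetical, and the only assumption carried over from the abstract is that a derivation is split into independently checkable steps that each carry a point weight.

```python
from dataclasses import dataclass


@dataclass
class RubricStep:
    """One independently checkable step of a derivation (hypothetical schema)."""
    description: str
    points: int    # weight of this step in the rubric
    passed: bool   # whether a grader judged the step satisfied


def rubric_score(steps):
    """Return the fraction of rubric points earned (partial credit)."""
    total = sum(step.points for step in steps)
    if total == 0:
        raise ValueError("rubric must carry at least one point")
    earned = sum(step.points for step in steps if step.passed)
    return earned / total


def failed_steps(steps):
    """Localize failures: list the descriptions of the steps that were missed."""
    return [step.description for step in steps if not step.passed]


# Illustrative item: the final number could still be right by luck,
# but the rubric records that the accounting definition was wrong.
EXAMPLE_ITEM = [
    RubricStep("Selects the correct filing as the source", 3, True),
    RubricStep("Uses the correct fiscal period", 2, True),
    RubricStep("Applies the stated accounting definition", 3, False),
    RubricStep("Performs the calculation correctly", 2, True),
]

if __name__ == "__main__":
    print(rubric_score(EXAMPLE_ITEM))
    print(failed_steps(EXAMPLE_ITEM))
```

Scoring by weighted steps rather than by the final answer alone is what makes partial credit and failure localization possible in a setup like this.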

2606.03827 2026-06-03 cs.CV cs.AI

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

基于傅里叶运动建模的条件潜扩散模型用于虚拟人群合成

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

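AI summary: Proposes the 4D F-MeshLDM framework, which combines a convolutional mesh VAE, a truncated Fourier series motion parameterisation, and a conditional diffusion prior to generate controllable 3D+t cardiac mesh sequences, outperforming baselines on UK Biobank data.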

详情
Comments
This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026
AI中文摘要

医疗设备的计算机模拟试验需要生成虚拟解剖人群。在心血管应用中,虚拟解剖通常表示为从生成模型采样的3D+t网格。然而,大多数现有网格生成器关注静态解剖,而序列模型往往缺乏显式周期性。为此,我们提出4D F-MeshLDM,一个条件生成框架,包括用于编码网格的卷积网格VAE、使用截断傅里叶级数参数化运动的结构化潜空间,以及学习傅里叶系数令牌上潜分布的先验扩散。通过仿射调制将扩散过程条件化于临床协变量,我们实现了可控合成。采样令牌并执行逆傅里叶合成产生周期一致的潜轨迹,可解码为3D+t心脏网格序列。在5,000名UK Biobank受试者上的实验表明,4D F-MeshLDM在解剖保真度上优于最先进的基线,并实现了接近零的周期闭合误差。此外,生成的队列准确保留了临床功能指标,突显了我们的框架在可靠的心脏计算机模拟试验中的潜力。

英文摘要

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
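
A short sketch may help show why a truncated Fourier parameterisation gives cycle-consistent trajectories. The code below is not the authors' model: it treats a single scalar latent dimension, and the function names and coefficient values are hypothetical. It only relies on the general fact that a sum of sines and cosines with integer frequencies over a normalized phase in [0, 1] returns to its starting value at the end of the cycle.

```python
import math


def evaluate_fourier(a0, cos_coeffs, sin_coeffs, phase):
    """Evaluate z(phase) = a0 + sum_k a_k cos(2*pi*k*phase) + b_k sin(2*pi*k*phase).

    `phase` is the normalized position in the cycle, 0.0 at the start and 1.0 at the end.
    """
    value = a0
    for k, (a_k, b_k) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        angle = 2.0 * math.pi * k * phase
        value += a_k * math.cos(angle) + b_k * math.sin(angle)
    return value


def sample_cycle(a0, cos_coeffs, sin_coeffs, num_frames):
    """Inverse Fourier synthesis: sample the trajectory at evenly spaced frames."""
    return [
        evaluate_fourier(a0, cos_coeffs, sin_coeffs, i / num_frames)
        for i in range(num_frames)
    ]


def cycle_closure_error(a0, cos_coeffs, sin_coeffs):
    """Distance between the trajectory at the end and at the start of the cycle."""
    start = evaluate_fourier(a0, cos_coeffs, sin_coeffs, 0.0)
    end = evaluate_fourier(a0, cos_coeffs, sin_coeffs, 1.0)
    return abs(end - start)


if __name__ == "__main__":
    # Hypothetical coefficients for one latent dimension, truncated at K = 2 harmonics.
    frames = sample_cycle(0.2, [1.0, -0.3], [0.5, 0.1], num_frames=8)
    print(frames)
    print(cycle_closure_error(0.2, [1.0, -0.3], [0.5, 0.1]))
```

Because periodicity is built into the basis, the closure error is zero up to floating-point rounding for any choice of coefficients, which is consistent with the near-zero cycle closure error the abstract reports.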

2606.03825 2026-06-03 cs.LG cs.CL

Dynamic Short Convolutions Improve Transformers

动态短卷积改进Transformer

Oliver Sieberling, Bharat Runwal, Rameswar Panda, Yoon Kim

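AI summary: This paper proposes dynamic short convolutions as a new neural network primitive that augments Transformers with input-dependent filters, consistently improving language-modeling performance over standard Transformers and static short-convolution variants and yielding a compute advantage.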

详情
AI中文摘要

Transformer已成为大型语言模型的主导架构,主要得益于注意力、前馈层、残差连接和归一化的可扩展性和灵活性。本文引入动态短卷积作为改进Transformer的额外神经网络原语。与静态短卷积不同,动态卷积使用输入依赖的滤波器,在保持卷积局部性偏差的同时增加表达能力。动机实验表明,在具有挑战性的关联回忆任务中,对键、查询和值表示应用动态短卷积相比静态卷积变体提升了性能。在从150M到2B参数的语言建模实验中,动态卷积持续优于标准Transformer和用静态短卷积增强的Transformer。拟合缩放定律表明,当动态卷积应用于键、查询和值向量时,相对于计算匹配的Transformer具有1.33倍的计算优势,而在每个线性层后添加动态卷积时优势达到1.60倍。动态卷积还在线性RNN(Mamba-2/Gated DeltaNet)和混合专家架构上带来了改进。我们通过自定义Triton内核使这些增益变得实用,实现了高效的训练和可管理的端到端减速。这些结果表明,动态短卷积是一种可扩展、硬件高效且富有表现力的原语,可用于推进基于Transformer的语言模型。

英文摘要

Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations improves performance on challenging associative recall tasks compared with static convolutional variants. Across language-modeling experiments ranging from 150M to 2B parameters, dynamic convolutions consistently outperform standard Transformers and Transformers augmented with static short convolutions. Fitting scaling laws indicates a 1.33$\times$ compute advantage over compute-matched Transformers when dynamic convolutions are applied to the key, query, and value vectors, and a 1.60$\times$ advantage when adding dynamic convolutions after every linear layer. Dynamic convolutions also offer improvements on linear RNNs (Mamba-2/Gated DeltaNet) and mixture-of-experts architectures. We make these gains practical with custom Triton kernels that enable efficient training with a manageable end-to-end slowdown. These results suggest that dynamic short convolutions are a scalable, hardware-efficient, and expressive primitive for advancing Transformer-based language models.
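
The difference between static and dynamic short convolutions can be sketched on a scalar sequence. This is a toy illustration, not the paper's layer or its Triton kernels: the abstract does not say how the filters are generated, so the rule used here (each position's filter is `base[j] + gain[j] * x[t]`) and all names are assumptions chosen only to show what "input-dependent filters" means while keeping the convolution short and causal.

```python
def static_short_conv(x, weights):
    """Causal short convolution: y[t] = sum_j weights[j] * x[t - j], same filter everywhere."""
    out = []
    for t in range(len(x)):
        out.append(sum(weights[j] * x[t - j] for j in range(len(weights)) if t - j >= 0))
    return out


def dynamic_short_conv(x, base, gain):
    """Causal short convolution whose filter is recomputed from the input at each position.

    Hypothetical filter generator: w_t[j] = base[j] + gain[j] * x[t].
    """
    out = []
    for t in range(len(x)):
        w_t = [b + g * x[t] for b, g in zip(base, gain)]
        out.append(sum(w_t[j] * x[t - j] for j in range(len(w_t)) if t - j >= 0))
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print(static_short_conv(x, [0.5, 0.25]))
    # With zero gain the dynamic version reduces to the static one.
    print(dynamic_short_conv(x, [0.5, 0.25], [0.0, 0.0]))
    # With non-zero gain the filter changes from position to position.
    print(dynamic_short_conv(x, [0.5, 0.25], [0.1, 0.0]))
```

In both versions each output depends only on the current and a few preceding inputs, so the locality bias of convolution is preserved; the dynamic version adds expressivity because the mixing weights vary with the input.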

2606.03823 2026-06-03 cs.AI cs.CY cs.NE

Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization

基于遗传优化的稀疏道路观测城市交通仿真校准

Hunter Sawyer, Jesse Roberts, Simon Matei

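AI summary: Proposes a genetic algorithm-based framework that calibrates urban traffic simulations from sparse road observations without detailed employment data, producing traffic flows and job distributions that are highly correlated with real-world measurements.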

详情
AI中文摘要

城市交通仿真是基础设施规划的关键工具,包括电动汽车充电站的布局。然而,许多城市的逼真交通仿真受到两个基本数据限制的阻碍:大多数城市只有一小部分道路段有详细的真实交通测量数据,而建模通勤交通所需的就业分布数据很少以仿真所需的分辨率提供。本文提出一个基于遗传算法的框架,直接解决这两个限制,从稀疏道路观测中校准城市交通仿真,无需详细的就业位置数据。使用北卡罗来纳州格林斯博罗的SUMO交通仿真平台,我们的方法优化了就业分布和门控交通参数,使仿真交通与已知交通流率的一小部分道路样本对齐。我们证明,该方法产生的仿真交通与真实测量高度相关,能泛化到训练中未包含的道路段,并且产生的就业分布与人口普查就业数据在定性上具有良好的一致性,尽管从未直接在该就业数据上训练。这项工作表明,可以从最少的真实观测实现逼真的城市交通仿真,提供一种可扩展且数据轻量的仿真校准方法,降低了在不同城市部署交通模型的障碍。

英文摘要

Urban traffic simulation is a critical tool for infrastructure planning, including the placement of electric vehicle charging stations. However, realistic traffic simulation across many cities is hindered by two fundamental data limitations: detailed real-world traffic measurements are available for only a small fraction of road segments in most cities, and employment distribution data critical for modeling commuter traffic is rarely available at the resolution needed for simulation. This paper presents a genetic algorithm-based framework that directly addresses both limitations, calibrating urban traffic simulations from sparse road observations without requiring detailed job location data. Using the SUMO traffic simulation platform for Greensboro, North Carolina, our approach optimizes job distributions and gate-traffic parameters to align simulated traffic with a small sample of roads with known traffic-flow rates. We demonstrate that this approach produces simulated traffic that correlates well with real-world measurements, generalizes to road segments withheld from training, and produces job distributions that show promising qualitative agreement with census employment data despite never directly training on that employment data. This work demonstrates that realistic urban traffic simulation can be achieved from minimal real-world observations, offering a scalable and data-light approach to simulation calibration that reduces the barrier to deploying traffic models across diverse cities.
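
The calibration loop described in this abstract can be sketched as a small genetic algorithm. This is a toy under strong assumptions, not the authors' pipeline: the SUMO simulation is replaced by a hypothetical linear stand-in (`simulate_flows`), and the road matrix, observed flows, population size, and mutation scale are all made up for illustration. Only the overall idea comes from the abstract: search over job-distribution parameters so that simulated flows match the few roads with known flow rates.

```python
import random

# Hypothetical toy problem: 3 observed roads, 4 employment zones.
ROAD_MATRIX = [
    [0.6, 0.1, 0.2, 0.1],
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.2, 0.5, 0.1],
]
OBSERVED_FLOWS = [0.50, 0.45, 0.40]


def simulate_flows(job_weights, road_matrix):
    """Stand-in for a traffic simulation run: each road's flow is a linear mix of zone job weights."""
    return [sum(c * w for c, w in zip(row, job_weights)) for row in road_matrix]


def calibration_error(job_weights, road_matrix, observed):
    """Sum of squared differences between simulated and observed flows."""
    simulated = simulate_flows(job_weights, road_matrix)
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def calibrate(road_matrix, observed, num_zones, pop_size=30, generations=60, seed=0):
    """Evolve job weights with elitism, uniform crossover, and Gaussian mutation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(num_zones)] for _ in range(pop_size)]

    def fitness(individual):
        return calibration_error(individual, road_matrix, observed)

    history = []
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 5]  # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(parent_a, parent_b)]
            child = [max(0.0, gene + rng.gauss(0.0, 0.05)) for gene in child]
            children.append(child)
        population = elite + children
    population.sort(key=fitness)
    history.append(fitness(population[0]))
    return population[0], history


if __name__ == "__main__":
    best, history = calibrate(ROAD_MATRIX, OBSERVED_FLOWS, num_zones=4)
    print(best)
    print(history[0], history[-1])
```

Because the elite individuals are carried over unchanged, the best error never increases from one generation to the next. In a realistic setting the expensive part would be the simulator call inside the fitness function, and held-out roads would be needed to check generalization as the abstract describes.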

2606.03821 2026-06-03 cs.LG

Finding Needles in the Haystack: Transductive Active Labeling in Ecology

在干草堆中寻找针:生态学中的转导式主动标注

Rupa Kurinchi-Vendhan, Sara Beery

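AI summary: This paper proposes transductive active labeling, which addresses the long-tailed distribution of ecological data by discovering rare-class examples, and designs a hybrid stopping criterion that improves rare-class recovery.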

详情
AI中文摘要

主动学习现在已成为标注生态数据的标准做法,使生态学家能够快速处理大量野外数据以理解和监测自然环境。当前的做法归纳性地评估主动学习,在保留的测试集上估计预测性能。我们认为这种评估与大多数生态任务不一致,这些任务的目标是尽可能高效地转导式地标注整个数据池。我们证明,忽略人在回路中会低估继续标注的重要性,特别是对于长尾中的类别,这些类别可能具有不成比例的生态重要性(稀有物种、不常见行为等)。我们的分析表明,对于这个长尾,转导式目标将重要性从预测转移到发现:真正的挑战变成了在干草堆中寻找针,即嵌入在潜在几何中丰富类别密集区域内的稀有类别样本,我们通过一种新的采样难度度量来量化这一点。最后,为了将这些见解转化为实际的生态工作流程,我们提出了一种受生态稀疏曲线启发的保守混合停止准则,并表明将预测性能与发现标准相结合可以减少长尾池上的过早停止,当发现(而非分类)是限制因素时,改善稀有类别的恢复。

英文摘要

Active learning is now standard practice in labeling ecological data, enabling ecologists to quickly process large volumes of field data to understand and monitor natural environments. Current practices evaluate active learning inductively, estimating predictive performance on a held-out test set. We argue that this evaluation is misaligned with most ecological tasks, where the goal is to transductively label an entire pool of data as efficiently as possible. We demonstrate that ignoring the human-in-the-loop underestimates the importance of continuing to label, particularly for classes in the long tail which may be of disproportionate ecological importance (rare species, uncommon behaviors, etc.). Our analysis shows that, for this long tail, the transductive objective shifts importance from prediction to discovery: the true challenge becomes finding "needles in the haystack," examples of rare classes that are embedded within dense regions of abundant classes in the latent geometry, which we quantify with a novel metric of sampling difficulty. Finally, to translate these insights to practical ecological workflows, we propose a conservative hybrid stopping criterion inspired by ecological rarefaction curves, and show that combining predictive performance with discovery criteria reduces premature stopping on long-tailed pools, improving rare-class recovery when discovery, not classification, is the limiting factor.

2606.03819 2026-06-03 cs.LG

TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

TreeFlash: 用于更快推测解码的并行AR近似

Peer Rheinboldt, Frédéric Berdoz, Roger Wattenhofer

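AI summary: Proposes TreeFlash, which approximates the autoregressive distribution with an MLP layer, improving the block efficiency and speedup of tree-based speculative decoding while retaining O(1) decoding time complexity.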

详情
AI中文摘要

用于推测解码的一次性块起草器在单次前向传播中生成完整草稿,通过消除顺序令牌生成实现高吞吐量。然而,它们仅基于前缀上下文预测每个草稿令牌,而不依赖于先前生成的令牌。这种非自回归条件导致随着草稿深度增加,起草器的分布偏离验证器的真实自回归分布。在基于树的起草中,这个问题更加严重,因为不同的分支被迫共享后续令牌的相同边际分布。我们提出TreeFlash,通过引入一个以起草器隐藏状态和前一个令牌为条件的MLP层来近似自回归分布,从而解决这一问题。TreeFlash通过采用两阶段近似机制,保留了一次性起草器的O(1)解码时间复杂度。TreeFlash在各种任务和模型上实现了最先进的性能,与边际树起草相比,块效率提高了12%,加速比提高了9%。

英文摘要

One-shot block drafters for speculative decoding generate the full draft in a single forward pass, achieving strong throughput by eliminating sequential token generation. However, they predict each draft token conditioned only on the prefix context, with no dependence on previously drafted tokens. This non-autoregressive conditioning causes the drafter's distribution to diverge from the verifier's true autoregressive distribution as draft depth grows. This problem becomes more severe in tree-based drafting, where distinct branches are forced to share the same marginal distribution for subsequent tokens. We propose TreeFlash, which addresses this by incorporating an MLP layer conditioned on the drafter's hidden state and the previous token to approximate an autoregressive distribution. TreeFlash retains the $\mathcal{O}(1)$ decoding time complexity of one-shot drafters by employing a two-stage approximation mechanism. TreeFlash achieves state-of-the-art performance across a variety of tasks and models, improving on marginal tree drafting with $12\%$ higher block efficiency and $9\%$ higher speedup.

2606.03817 2026-06-03 cs.CL

Rethinking the Idiomaticity Decomposability Hypothesis: Evidence from Distributional Learning

重新思考习语可分解性假说:来自分布学习的证据

Maggie Mi, Golzar Atefi, Atsuki Yamaguchi, Felix Gers, Aline Villavicencio, Nafise Sadat Moosavi

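AI summary: This paper uses contextualised language models as controlled distributional learners to study the relationship between idiom decomposability (the extent to which constituent meanings contribute to the figurative whole) and syntactic flexibility. It finds that model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility, and that the stabilisation of idiom representations is jointly shaped by surprisal, decomposability, and frequency, with decomposability showing the strongest training-dependent effect.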

详情
Comments
ACL 2026 Main - long paper (9 pages + Appendices)
AI中文摘要

习语可以根据其可分解性进行分析,即可分解性是指构成意义对整体比喻意义的贡献程度。可分解性被认为可以预测句法灵活性。基于用法的解释则将习语行为归因于分布经验,如说话者熟悉度和可预测性。我们使用上下文语言模型作为受控分布学习者来检验这些观点。我们提出了一种模型内部的可分解性度量,并将其与人类评分、句法灵活性和可预测性相关联,同时跟踪预训练期间习语的学习过程。模型导出的可分解性与人类判断弱相关,并且与句法灵活性呈微小但一致的负相关。预训练分析表明,模型中习语表征的稳定化不能仅由频率解释。相反,惊讶度、可分解性和频率都有贡献,其中可分解性显示出最强的训练依赖效应。

英文摘要

Idioms can be analysed in terms of their decomposability, the extent to which constituent meanings contribute to the figurative whole. Decomposability is thought to predict syntactic flexibility. Usage-based accounts instead attribute idiom behaviour to distributional experience, such as speaker familiarity and predictability. We examine these views using contextualised language models as controlled distributional learners. We propose a model-internal measure of decomposability and relate it to human ratings, syntactic flexibility, and predictability while tracking idiom learning during pretraining. Model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility. Pretraining analyses show that stabilisation of idiom representations in models is not explained by frequency alone. Instead, surprisal, decomposability, and frequency all contribute, with decomposability showing the strongest training-dependent effect.

2606.03814 2026-06-03 cs.AI

Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

利用BART基于评分标准评估CS1 C++编程作业

Kelsey Rainey, Jesse Roberts

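AI summary: Proposes a rubric-aware, multitask fine-tuned BART model for automated grading of C++ programming assignments that jointly predicts numeric grades and grade buckets and matches grade distributions, bringing its grading closer to instructor behavior.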

详情
AI中文摘要

本文研究基于评分标准的多任务微调变换器模型,用于自动评分入门级C++编程作业,旨在产生比通用LLM更能反映教师评分行为的分数预测。使用多学期CS1数据,将学生提交的作业与数值分数、字母等级区间和作业评分标准配对,然后预处理为统一的序列用于变换器输入。采用带有LoRA适应的BART编码器-解码器,联合预测数值分数和等级区间,并增加分布匹配项以对齐预测分数和经验分数分布,这是以往工作中常被忽略的评估维度。实验比较了单任务和多任务训练、硬独热与模糊及基于边界的软标签、有评分标准与无评分标准条件,并增加了T5和成对预训练变体。结果表明,具有基于边界的软标签和评分标准上下文的多任务BART在平均绝对误差和分数分布对齐方面优于单任务、硬标签或仅代码基线。完全微调的T5进一步提高了分布保真度,而成对预训练以牺牲少数类敏感性为代价减少了数值误差。总体而言,研究结果表明,校准感知、评分标准引导的训练比优化准确性的替代方案能产生更像教师的评分行为。

英文摘要

This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.

2606.03812 2026-06-03 cs.AI

Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

通过智能体对话危害识别分析增强操作安全性

Sanjay Das, Ran Elgedawy, Ethan Seefried, Ryan Burchfield, Tirthankar Ghosal

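AI summary: Proposes the HAZDIAL framework, which uses structured multi-agent, multi-turn dialogue (adversarial debate and constructive discussion) to improve the quality of NLP-based hazard identification and optimizes agent interaction algorithmically; experiments show it outperforms single-pass baselines.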

详情
AI中文摘要

在工业过程控制、自主系统和安全关键系统等高风险领域,操作安全性要求可靠的危害识别。虽然大型语言模型在自动化安全分析任务中显示出潜力,但单次、整体推理是脆弱的:它缺乏安全工程师迭代应用的自校正、深思熟虑和上下文细化。在本文中,我们介绍了HAZDIAL,一个研究结构化智能体对话(多智能体、多轮交互)是否比单次基线提高基于NLP的危害识别质量的框架。我们系统地比较了两种对话模式:对抗性辩论和建设性讨论,并提出了基于算法的智能体交互优化。我们使用标准分类指标(准确率、精确率、召回率、F1)和新颖的对话指标,针对策划的金标准数据集评估所有配置。这项工作推进了对话系统、多智能体推理和AI安全的交叉领域,为对话驱动的危害分析提供了经验证据。

英文摘要

Operational safety in high-stakes domains such as industrial process control, autonomous systems, and safety-critical systems demands reliable hazard identification. While large language models (LLMs) have shown promise in automating safety analysis tasks, single-turn, monolithic inference is brittle: it lacks the self-correction, deliberation, and contextual refinement that safety engineers apply iteratively. In this paper, we introduce HAZDIAL, a framework that investigates whether structured agentic dialogue (multi-agent, multi-turn interaction) improves the quality of NLP-based hazard identification over single-pass baselines. We systematically compare two dialogue modalities, adversarial debate and constructive discussion, and propose an algorithm-based agentic interaction optimization. We evaluate all configurations against a curated golden dataset using standard classification metrics (accuracy, precision, recall, F1) and novel dialogue metrics. This work advances the intersection of dialogue systems, multi-agent reasoning, and AI safety, providing empirical evidence for dialogue-driven hazard analysis.

2606.03811 2026-06-03 cs.CR cs.AI cs.LG

AI Agents Enable Adaptive Computer Worms

AI代理实现自适应计算机蠕虫

Jonas Guan, Tom Blanchard, Hanna Foerster, Hengrui Jia, Gabriel Huang, Nicolas Papernot

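AI summary: This study shows that AI agents can generate attack strategies tailored to each target and use large language models on compromised machines to sustain themselves and spread, forming a self-sustaining AI-driven cyber threat.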

详情
AI中文摘要

计算机蠕虫是一种通过在网络中从一台机器复制到另一台机器来传播的恶意软件。传统蠕虫(如WannaCry)利用预定的漏洞,修补这些漏洞即可阻止其传播。本文表明,人工智能(AI)代理实现了一种根本性的新威胁:一种能够针对每个遭遇的目标生成定制攻击策略的蠕虫。该蠕虫寄生性地利用被感染的机器运行开放权重的大语言模型(LLM)以维持其推理能力,或扩展其攻击范围。在部署于Linux、Windows和物联网(IoT)设备的机器网络上,该蠕虫通过利用常见的现实企业网络漏洞进行传播。由于蠕虫由窃取的计算资源驱动,攻击者每次新感染所需的边际成本为零。这在攻击者和防御者之间造成了不稳定的经济不对称。此外,由于蠕虫不需要商业AI平台,集中式安全控制(如服务拒绝或速率限制)在结构上无关紧要。我们的结果表明,自持的AI驱动网络威胁不再是理论上的。我们必须为自主的生成式对手做好准备:这些恶意软件系统无需人类操作员即可传播,其定义不是固定的利用代码,而是实时推理目标、适应观察并合成攻击逻辑的能力。

英文摘要

A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, exploited predetermined vulnerabilities, and their spread can be halted by patching those vulnerabilities. Here we show that artificial intelligence (AI) agents enable a fundamentally new threat: a worm that generates attack strategies tailored to each target it encounters. The worm parasitically uses compromised machines to run open-weight large language models (LLMs) to sustain its reasoning or extend its reach for further attacks. Deployed on a network of machines spanning Linux, Windows, and IoT (Internet of Things) devices, the worm propagated by exploiting common, real-world corporate network vulnerabilities. Since the worm is powered by stolen compute, the attacker's marginal cost per new infection is zero. This creates a destabilizing economic asymmetry between attackers and defenders. Moreover, because the worm requires no commercial AI platform, centralized safety controls, such as service refusals or rate limiting, are structurally irrelevant. Our results demonstrate that self-sustaining AI-driven cyber-threats are no longer theoretical. We must prepare for autonomous generative adversaries: malware systems that propagate without human operators and are defined not by fixed exploit code, but by the capacity to reason about targets, adapt to observations, and synthesize attack logic in real time.