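arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.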

2605.23819 2026-05-25 cs.CV cs.AI

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, Victor Boutin

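Affiliations: ANITI; Brown University; CNRS

AI summary: This paper examines a central question in computational vision: whether human visual representations are better explained by discriminative or generative learning. Using Joint Energy-Based Models (JEMs), the study interpolates continuously between discriminative and generative training objectives within a fixed architecture to isolate the effect of the learning objective, and evaluates the models on six human-alignment benchmarks covering perceptual similarity, gloss perception, human response uncertainty, and more. Human alignment peaks at intermediate points between the generative and discriminative objectives rather than at either extreme, indicating that alignment arises from balancing the two objectives rather than from choosing one.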

Abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
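
The interpolation described in the abstract can be illustrated with a minimal sketch. It assumes the usual JEM reading of classifier logits f(x): softmax gives p(y|x), and the energy of an input is E(x) = -logsumexp(f(x)). The mixing coefficient name `lam`, the convex-combination form, and the contrastive surrogate for the generative term (which needs the logits of a negative sample, for example one drawn by a sampler) are assumptions for illustration, not the authors' implementation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def discriminative_loss(logits, label):
    """Cross-entropy: -log p(y | x) with p(y | x) = softmax(logits)[label]."""
    return logsumexp(logits) - logits[label]


def generative_loss(logits_data, logits_negative):
    """Contrastive surrogate E(x_data) - E(x_neg), with E(x) = -logsumexp(f(x))."""
    return logsumexp(logits_negative) - logsumexp(logits_data)


def hybrid_loss(logits_data, label, logits_negative, lam):
    """Assumed convention: lam = 0 is purely discriminative, lam = 1 purely generative."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return ((1.0 - lam) * discriminative_loss(logits_data, label)
            + lam * generative_loss(logits_data, logits_negative))
```

Sweeping `lam` over a grid with the architecture held fixed is the kind of single-coefficient comparison the abstract describes; the grid itself would be a further assumption.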

2605.23797 2026-05-25 cs.LG cs.CV

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

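Affiliations: University of Technology Sydney

AI summary: This paper studies out-of-distribution (OOD) detection with pre-trained vision-language models (VLMs), which aims to identify inputs from unknown classes. Existing methods mainly rely on heuristic rules to mine negative labels from unlabeled corpora, but the mined negatives are heavily biased. The authors propose a debiased negative mining method that corrects the bias by indirectly estimating the distribution of negative labels, and recast it as Monte Carlo sampling over ID labels and the unlabeled corpus. Experiments show the method achieves new state-of-the-art performance across a range of OOD detection settings.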

Comments KDD 2026

Abstract

Aiming at identifying unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities with ID labels and negative labels, i.e., those semantically different from ID labels. Due to the unavailability of target OOD labels, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negative labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that the debiased negative mining can be naturally converted into Monte-Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments show that our method establishes a new state-of-the-art in a variety of OOD detection setups. Code is publicly available at https://github.com/60pen9/Debiased-Negative-Mining-Improves-OOD-Detection-with-Pre-trained-VLMs
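
The negative-label pipeline mentioned in the abstract can be sketched as a scoring function. The sketch assumes image-to-label similarities have already been computed by a VLM and scores an input by the share of softmax mass that falls on ID labels, which is a common formulation for this kind of pipeline. It is not the paper's debiased estimator, and the temperature value and function name are assumptions.

```python
import math


def id_score(sim_id, sim_neg, temperature=0.01):
    """Share of softmax mass on ID labels; higher means more likely in-distribution.

    sim_id:  similarities between the input and each ID label.
    sim_neg: similarities between the input and each mined negative label.
    """
    if not sim_id:
        raise ValueError("at least one ID label similarity is required")
    all_sims = list(sim_id) + list(sim_neg)
    peak = max(all_sims)  # subtract the max for numerical stability
    mass_id = sum(math.exp((s - peak) / temperature) for s in sim_id)
    mass_neg = sum(math.exp((s - peak) / temperature) for s in sim_neg)
    return mass_id / (mass_id + mass_neg)
```

Under this score, a mined "negative" label that is in fact close to an ID class pulls mass away from the ID labels for genuine ID inputs, which is one way to picture the false negative problem the abstract refers to.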

2605.23790 2026-05-25 cs.CV

Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Romaric Mazna, Jean Martinet, Sai Deepesh Pokala

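Affiliations: i3S/CNRS, Université Côte d’Azur

AI summary: This paper studies saliency prediction from event-camera data and proposes SEST, a Transformer-based model that predicts saliency maps from event data. To overcome the lack of large-scale annotated datasets and strong baselines for event data, the authors introduce an event-native pretraining strategy and synthetic supervision, and build two new benchmark datasets. Experiments show that SEST outperforms existing event-based saliency methods and transfers well to real event data; the authors describe it as the first application of deep learning to event-based saliency prediction.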

Abstract

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

2605.23780 2026-05-25 cs.AI

Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

Haoyuan Wang, Xiaohao Liu, Jiajie Su, Jianmao Xiao, Chaochao Chen

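Affiliations: Zhejiang University; National University of Singapore; Jiangxi Normal University

AI summary: This paper studies robust intrinsic knowledge editing in multimodal large language models, aiming to update knowledge efficiently without harming existing capabilities. Because existing methods propagate edits poorly across semantically equivalent visual and linguistic variations, the authors propose an adversarial subspace alignment method (ASAM) that combines Latent Adversarial Robustification (LAR) and Rank-Constrained Subspace Learning (RCSL) to improve generality and edit robustness in high-dimensional multimodal spaces. Experiments indicate strong performance on knowledge editing tasks.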

Abstract

Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieves strong reliability and locality, it often exhibits limited generality, failing to propagate edits across semantically equivalent visual and linguistic variations. This issue arises from the lack of explicit semantic supervision, rigid editing scopes, and biased anchoring to individual samples in high-dimensional multimodal spaces. We address robust intrinsic multimodal knowledge editing by explicitly targeting generalization. We formalize robustness through knowledge units that group semantically equivalent multimodal inputs and define generality as consistent predictions within each unit. To expose fragile semantic regions, we introduce Latent Adversarial Robustification (LAR), which generates adversarial yet semantically coherent variants in the joint latent space. We further propose Rank-Constrained Subspace Learning (RCSL), enforcing low-rank alignment of adversarial representations at the edit layer via a singular value-based objective. Extensive analysis demonstrates the effectiveness of ASAM empirically.

2605.23777 2026-05-25 cs.CV

Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

FB Pena, D Crabi, Sandro C Izidoro, Érick O Rodrigues, G Bernardes

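Affiliations: Department of Academic Informatics (DAINF), Universidade Tecnológica Federal do Paraná (UTFPR), Pato Branco, State of Parana, Brazil

AI summary: This paper proposes a machine-learning framework for grading emerald gemstones and creates a public dataset. The framework automates the grading process from image acquisition to final classification, removing the subjectivity of manual grading. The authors present it as the first work to combine machine learning with image processing for emerald grading; it reaches 98% classification accuracy, and the released dataset contains 192 emerald images with their extracted and pre-processed features.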

Journal ref Pattern Analysis and Applications 2022

DOI: 10.1007/s10044-021-01041-4

Abstract

The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones, where those are visually inspected by specialists that decide which one of the available reference stone is the most similar to the inspected stone. This procedure is very subjective as different specialists may end up with different grading choices. This work proposes a complete framework that entails the image acquisition and goes up to the final stone categorization. The proposal is able to automate the entire process apart from including the stone in the created chamber for the image acquisition. It discards the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% of accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the used dataset that contains 192 images of emerald stones along with their extracted and pre-processed features.

2605.23775 2026-05-25 cs.CV

A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques

João VC Mazzochin, Giovani Bernardes Vitor, Gustavo Tiecker, Elioenai MF Diniz, Gilson A Oliveira, Marcelo Trentin, Érick O Rodrigues

Affiliations: Graduate Program of Production and Systems Engineering, Universidade Tecnológica Federal do Paraná (UTFPR); Institute of Technological Sciences, Universidade Federal de Itajubá (UNIFEI); Business School, Universidade Federal do Paraná (UFPR); Graduate Program of Electrical and Computer Engineering, Universidade Tecnológica Federal do Paraná (UTFPR)

AI summary: This paper proposes a new wood-log counting method based on conditional generative adversarial networks (cGANs) and image processing techniques, targeting the challenge of precise counting. It uses image processing to handle noise and intersecting logs, and a connected-components algorithm for efficient counting. The study also releases a database of 466 images containing approximately 13,048 eucalyptus logs. Experiments report 96.4% pixel-level accuracy and 92.3% log-level accuracy, and the method is described as practical and fast enough for real-time use in forestry management and resource optimization.

Journal ref Forests 2025

DOI: 10.3390/f16020237

Abstract

This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713s per image on an NVIDIA T4 GPU, making it suitable for realtime applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology's success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors.
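
The counting step named in the abstract, connected components over a segmentation mask, can be sketched as follows. The sketch assumes the cGAN output has already been thresholded into a binary mask (a list of rows of 0/1 values). The 4-connectivity choice and the `min_area` noise filter are assumptions for illustration, and the paper's handling of intersecting logs is not reproduced.

```python
from collections import deque


def count_components(mask, min_area=1):
    """Count 4-connected foreground regions with at least min_area pixels."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            area = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count
```

Raising `min_area` discards small speckles that would otherwise be counted as logs; touching logs still merge into one region, which is why the abstract pairs this step with dedicated processing for intersections.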

2605.23772 2026-05-25 cs.AI cs.LO cs.PL cs.SE

Agentic Proving for Program Verification

Alessandro Sosso, Akhil Arora, Bas Spitters

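Affiliations: Department of Computer Science

AI summary: This study evaluates agentic theorem-proving systems on program verification by testing Claude Code on the CLEVER benchmark, and finds high success rates in generating specifications, verifying implementations, and end-to-end program generation and verification. It also points to a gap between current program verification benchmarks and the capabilities of modern agentic provers, and stresses the need for more rigorous and robust evaluation, in particular alternatives to isomorphism-based scoring of specifications. The results suggest that the tight compiler-in-the-loop agentic paradigm is currently among the most effective approaches to program verification.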

Abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

2605.23771 2026-05-25 cs.CV cs.AI cs.MA

PhotoFlow: Agentic 3D Virtual Photography Missions

Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

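Affiliations: Shanghai Jiao Tong University; Northeastern University; University of California, Los Angeles; Cornell University; Shanghai AI Laboratory; Sichuan University

AI summary: PhotoFlow is an agent system for virtual photography that, given a language instruction and no preselected camera parameters or reference image, produces photographs in a 3D scene that match the intended semantics. It has three modules: the Director proposes diverse camera candidates, the Reviewer performs visual evaluation and selection, and the Reflector refines the search strategy from failures. The study also introduces VPhotoBench, a benchmark of Blender scenes and language-conditioned photography missions. Experiments show PhotoFlow performs strongly under a multi-round rendering budget, and the authors describe it as the first executable agent system for language-conditioned virtual photography in arbitrary Blender scenes.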

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2605.23762 2026-05-25 cs.RO

Direct Dynamic Retargeting for Humanoid Imitation Learning from Videos

Constant Roux, Ludovic De Matteïs, Armand Jordana, Valentin Guillet, Nicolas Mansard, Olivier Stasse, Philippe Souères

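Affiliations: LAAS-CNRS, Université de Toulouse, CNRS

AI summary: This paper studies how to learn imitation skills from monocular videos of humans and transfer them to humanoid robots. To address the morphological mismatch between human motion and humanoid robots, the authors propose Direct Dynamic Retargeting (DDR), which formulates the problem in task space and uses a sampling-based model predictive control solver to directly generate physically feasible, high-fidelity trajectories, avoiding the geometric bias of conventional pipelines. Experiments show DDR improves demonstration tracking accuracy and reinforcement-learning training efficiency over existing methods.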

Abstract

Imitation Learning from monocular video demonstrations provides a scalable approach for teaching complex skills to humanoid robots. However, translating human motion to humanoids requires overcoming significant morphological mismatches. Standard approaches rely on Geometric Retargeting or Indirect Dynamic Retargeting pipelines. We identify that these intermediate kinematic projections introduce a geometric bias, restricting the search space and yielding suboptimal dynamic behaviors. In this paper, we propose Direct Dynamic Retargeting (DDR), a novel single-stage framework that generates high-fidelity, dynamically feasible trajectories directly from expert videos. By formulating the problem in the task space and leveraging a sampling-based Model Predictive Control solver within a physics simulator, DDR natively optimizes over complex contact sequences while mitigating input drift. Our experiments demonstrate that bypassing the geometric bias allows DDR to outperform state-of-the-art baselines in demonstration tracking accuracy. Furthermore, we establish that providing such physically viable references to RL agents accelerates training convergence and enhances the final execution of agile and balancing behaviors. Source code will be made publicly available.

2605.23754 2026-05-25 cs.LG

LLM-driven design of physics-constrained constitutive models: two agents are better than one

Marius Tacke, Matthias Busch, Kian Abdolazizi, Jonas Eichinger, Kevin Linka, Roland Aydin, Christian Cyron

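Affiliations: Helmholtz-Zentrum Hereon; Hamburg University of Technology; RWTH Aachen University; Saarland University; German Center for Artificial Intelligence

AI summary: This paper proposes a multi-agent approach based on large language models (LLMs) for generating constitutive models that obey physical laws. It uses two agents: the Creator generates a model from data, and the Inspector checks the model against nine physical constraints and returns it for revision when any is violated. Experiments show the approach raises the share of physically valid generated models while preserving accuracy and generalization, offering a more trustworthy route to automated, physics-aware model discovery.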

Abstract

Developing constitutive models that capture how materials deform under load traditionally requires years of specialized expertise in continuum mechanics, machine learning, and scientific programming. Large language models (LLMs) have recently been shown to lower this barrier by generating constitutive models on demand, but existing single-agent pipelines lack systematic checks that the resulting models respect fundamental physical laws. To close this gap, we introduce the first multi-agent LLM-driven approach for constitutive model generation: a Creator agent proposes a model tailored to the data, while an Inspector agent critically audits each proposal against nine physical constraints and returns it for refinement whenever a violation is detected. We demonstrate this concept with constitutive artificial neural networks (CANNs) and benchmark it on brain tissue, experimental rubber, and synthetic rubber, using two different LLM backbones (Claude Opus 4.7 and Kimi K2.5). Adding the Inspector raises the share of exported models that truly satisfy all physical constraints from 91% to a perfect 100% for Opus and from 37% to 56% for Kimi, while preserving near-baseline accuracy and remarkable generalization to unseen loading paths. In combination, the generated models are physically valid, highly accurate, and extrapolate reliably beyond the training data - properties that together make them directly usable in practice. Separating generation from inspection thus turns LLM-driven constitutive modeling into a genuinely trustworthy process. The paradigm is deliberately technique-agnostic and scales automatically with advances in LLM capability, opening a promising path toward automated, physics-aware model discovery.
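
The propose-then-audit loop described in the abstract can be sketched with plain callables standing in for the two agents. Everything named here is hypothetical: `refine_until_valid`, the toy creator, and the two example constraints (zero energy in the undeformed state, non-negative energy) are illustrative stand-ins, not the paper's nine constraints or its LLM prompts. The toy model is an incompressible neo-Hookean energy in uniaxial tension with an artificial offset.

```python
def refine_until_valid(create, checks, max_rounds=5):
    """Call create(feedback) until every check passes or rounds run out.

    create: callable taking the list of violated constraint names
            (hypothetical stand-in for the Creator agent).
    checks: dict mapping constraint name -> predicate on the model
            (hypothetical stand-in for the Inspector agent).
    Returns (model, rounds_used), or (None, max_rounds) if nothing passed.
    """
    feedback = []
    for round_index in range(1, max_rounds + 1):
        model = create(feedback)
        feedback = [name for name, check in checks.items() if not check(model)]
        if not feedback:
            return model, round_index
    return None, max_rounds


def strain_energy(model, stretch):
    """Incompressible neo-Hookean energy in uniaxial tension, plus an offset."""
    first_invariant = stretch ** 2 + 2.0 / stretch
    return model["offset"] + 0.5 * model["mu"] * (first_invariant - 3.0)


# Two illustrative constraints; the nine used in the paper are not listed here.
CHECKS = {
    "zero_energy_at_rest": lambda m: abs(strain_energy(m, 1.0)) < 1e-12,
    "non_negative_energy": lambda m: all(
        strain_energy(m, s) >= 0.0 for s in (0.5, 0.8, 1.0, 1.5, 2.0)
    ),
}


def toy_creator(feedback):
    """First proposal violates the rest-state constraint; the revision fixes it."""
    if "zero_energy_at_rest" in feedback:
        return {"mu": 1.0, "offset": 0.0}
    return {"mu": 1.0, "offset": 0.5}
```

Calling `refine_until_valid(toy_creator, CHECKS)` accepts the revised model on the second round; a creator that never responds to feedback exhausts `max_rounds` and yields no exported model, which mirrors the idea that only audited models leave the loop.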

2605.23753 2026-05-25 cs.LG

SeedER: Seed-and-Expand Retrieval from Knowledge Graphs

Hamed Shirzad, Frederik Wenkel, Dominique Beaini, Danica J. Sutherland, Emmanuel Noutahi

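Affiliations: Valence Labs, Montréal, QC, Canada; University of British Columbia, Department of Computer Science, Vancouver, BC, Canada

AI summary: SeedER is a retrieval framework for knowledge graphs that addresses the retrieval challenges posed by their irregular structure. It first identifies core nodes with lightweight dense and entity-based retrieval, then selectively expands them with a graph-aware policy trained by reinforcement learning, so that query-relevant nodes are found efficiently. Experiments show SeedER substantially improves retrieval while keeping expansion cost low, particularly on multi-hop compositional queries.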

Abstract

Knowledge graphs (KGs) offer a rich representation for relational knowledge, but their irregular structure makes retrieval challenging: ego-graph expansion grows rapidly, and dense embedding methods struggle with multi-hop compositional queries. Existing agent-based graph exploration approaches, while expressive, are often too expensive for large-scale retrieval. We introduce SeedER (Seed-and-Expand Retrieval), a retrieval framework that explicitly leverages KG structure through iterative, low-cost expansion. SeedER first seeds a compact set of core nodes using lightweight dense and entity-based retrieval, then selectively expands this set via a learned graph-aware policy trained with reinforcement learning. This design decomposes global reasoning into reusable local decisions, enabling efficient discovery of query-relevant nodes while tightly controlling expansion cost. We show theoretical limitations of dense retrieval on compositional graph queries, and establish advantages of SeedER from both compositional generalization and graph-constrained submodular optimization perspectives. Empirically, SeedER substantially improves recall with compact candidate sets over strong dense and graph-augmented baselines, making it an effective first-stage retriever for knowledge-intensive reasoning systems.
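
The two-phase structure in the abstract, seeding and then budgeted expansion, can be sketched as below. This is an illustration under stated assumptions: `seed_scores` stands in for the dense and entity-based retrieval scores, `expand_score` is a hypothetical callable standing in for the learned graph-aware policy, and the greedy one-node-per-step rule with a `min_score` stopping threshold is a simplification, not the method in the paper.

```python
def seed_and_expand(graph, seed_scores, expand_score,
                    num_seeds=1, budget=2, min_score=0.5):
    """Pick top-scoring seeds, then greedily add frontier nodes under a budget.

    graph:        dict mapping node -> iterable of neighbor nodes.
    seed_scores:  dict mapping node -> first-stage retrieval score.
    expand_score: callable (node, selected) -> score; hypothetical policy.
    """
    ranked = sorted(seed_scores, key=lambda node: (-seed_scores[node], node))
    selected = ranked[:num_seeds]
    chosen = set(selected)
    for _ in range(budget):
        # Frontier: unselected neighbors of anything selected so far.
        frontier = sorted({
            neighbor
            for node in selected
            for neighbor in graph.get(node, ())
            if neighbor not in chosen
        })
        if not frontier:
            break
        best = max(frontier, key=lambda node: expand_score(node, selected))
        if expand_score(best, selected) < min_score:
            break  # nothing on the frontier is worth the expansion cost
        selected.append(best)
        chosen.add(best)
    return selected
```

Because each step only scores the current frontier, the cost per step depends on local neighborhood size rather than on the whole graph, which is the sense in which expansion stays cheap compared with growing a full ego-graph.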

2605.23751 2026-05-25 cs.LG

Approaching I/O-optimality for Approximate Attention

Pál András Papp, Aleksandros Sobczyk, Anastasios Zouzias

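Affiliations: Computing Systems Lab; Huawei Technologies

AI summary: This paper studies the I/O complexity of attention in large language models, where the goal is to compute the attention matrix with as few data transfers between fast and slow memory as possible. The authors propose I/O-efficient algorithms based on an approximate attention framework whose I/O cost depends almost linearly on the sequence length $n$ in most parameter regimes, compared with the quadratic dependence of existing methods. They also give I/O lower bounds for each parameter regime, showing the algorithms are close to I/O-optimal.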

Abstract

We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/Os to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.
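
For context, the exact quantity $\text{softmax}(QK^{\top}/\sqrt{d})V$ and the block-wise evaluation style associated with FlashAttention can be sketched in plain Python. The tiled version streams keys and values in blocks and keeps a running maximum and running sum per query (the online softmax trick), so the full score row is never stored. This illustrates the exact baseline only, under simplifying assumptions (per-query loop, list-of-lists matrices, hypothetical function names); it is not the paper's approximate algorithm and does not model memory transfers.

```python
import math


def naive_attention(Q, K, V):
    """Reference: materialize each full score row, then softmax and average V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but keys/values are consumed block by block."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        run_max, run_sum, acc = -math.inf, 0.0, [0.0] * dv
        for start in range(0, len(K), block):
            keys, vals = K[start:start + block], V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
            new_max = max(run_max, max(scores))
            # Rescale what was accumulated under the old maximum.
            scale = math.exp(run_max - new_max)
            run_sum *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, vals):
                w = math.exp(s - new_max)
                run_sum += w
                for j in range(dv):
                    acc[j] += w * v[j]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out
```

Every query still touches every key block here, which is the source of the quadratic dependence on $n$ that the abstract contrasts with its almost-linear bound.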

2605.23747 2026-05-25 cs.CV

Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox

Allan Kazakov, Duygu Cakir, Hilal Kurt İrfanoğlu, Yavuz İrfanoğlu

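Affiliations: Bahcesehir University, Istanbul, Turkey; Poder Bilişim Teknolojileri Sanayi ve Ticaret A.Ş., Istanbul, Turkey; Galatasaray University, Istanbul, Turkey

AI summary: This paper aims to revive the Apple Dense Material Segmentation (Apple-DMS) benchmark and address the stagnation in material segmentation caused by the dominance of geometry-biased models. It proposes a stabilized training recipe, comprising High-Fidelity Logit Projection, Query Entropy Regularization, and a physics-compliant augmentation strategy, that markedly improves Vision Transformer segmentation models. The authors also identify a "Generalization Paradox": re-partitioning the data raises the metric but degrades generalization in real-world scenes, which underlines the importance of the original split for research on physically grounded AI.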

Abstract

Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical "Generalization Paradox": while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.
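
The mIoU figures quoted in the abstract refer to mean intersection-over-union. A minimal sketch of the metric follows, assuming predictions and ground truth are flat lists of integer class labels and that classes absent from both are skipped. Benchmark-specific details such as ignore labels or per-image averaging are not modeled, so this is not necessarily how the reported numbers were computed.

```python
def mean_iou(pred, target, num_classes):
    """Mean over classes of |pred ∩ target| / |pred ∪ target|."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel counts toward the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)
```

Because every class contributes equally to the mean, rare materials weigh as much as common ones, which is one reason the choice of dataset split can move the score substantially.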


2605.23744 2026-05-25 cs.LG

Contrast to Detect: Dynamic Graph Contrastive Regularization for Unsupervised Anomaly Detection in Multivariate Time Series

对比检测：面向无监督多变量时间序列异常检测的动态图对比正则化

Yunhua Pei, Zixing Song, Jin Zheng, John Cartlidge

Affiliations: School of Computer Science, University of Bristol; School of Engineering Mathematics, University of Bristol

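AI summary: This work addresses unsupervised anomaly detection in multivariate time series and proposes ContrastAD, a framework designed to cope with dynamic inter-variable dependencies and spectral noise. It uses dynamic graph contrastive learning to treat structural evolution as a learning signal rather than suppressing it, and adds multi-perspective embeddings and a frequency-aware attention mechanism for robustness. Across five real-world benchmarks, ContrastAD attains the highest mean F1 on all five datasets.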

Comments 12 pages, 5 figures. Preprint. Code and demo data available online

详情

AI中文摘要

多变量时间序列（MTS）中的异常检测受到动态变量间依赖关系和频谱噪声下特征纠缠的阻碍，在实践中，由于缺乏异常标签而进一步复杂化。现有的基于重构的检测器倾向于像正常模式一样忠实地恢复异常，而流行的图对比方法强制视图间不变性，从而假设一个平稳的关系结构，这一假设在真实系统的结构漂移下被打破。我们提出ContrastAD，一个无监督框架，将结构演化本身转变为学习信号而非抑制它。一个多视角编码器从时间、属性和结构视角编码输入。一个频率感知注意力混合器在注意力之前执行频谱top-K过滤，防止噪声泄漏到查询-键相似度中。核心组件，一个动态图对比学习器，从批次级DTW距离构建基于幂律的稀疏图快照，并将最发散的对与稳定锚点进行对比，在不施加刚性不变性的情况下正则化潜在空间。在五个真实世界基准上，ContrastAD在所有五个数据集上获得最高平均F1，并在三个数据集上获得最高AUC（SWaT 93.60，SMD 98.66，PSM 97.79），在SWaT和PSM上相对于最强基线具有统计显著的F1和AUC差距。在MSL和SMAP上，其AUC落后领先者不到0.7个百分点，同时F1仍领先。消融和敏感性研究进一步证实，对比目标作为软正则化器效果最佳，支持我们的主张：在非平稳动态下严格不变性是次优的。

英文摘要

Anomaly detection in multivariate time series (MTS) is hindered by dynamic inter-variable dependencies and feature entanglement under spectral noise, and in practice, is further complicated by the absence of anomaly labels. Existing reconstruction-based detectors tend to recover anomalies as faithfully as normal patterns, while prevailing graph contrastive methods enforce invariance across views and thus assume a stationary relational structure, an assumption that breaks under structural drift in real systems. We propose ContrastAD, an unsupervised framework that turns structural evolution itself into a learning signal rather than suppressing it. A Multi-Perspective Embedder encodes inputs from temporal, attribute, and structural perspectives. A Frequency-Aware Attention Mixer then performs spectral top-K filtering before attention, preventing noise from leaking into query-key similarities. The core component, a Dynamic Graph Contrastive Learner, builds power-law-inspired sparse graph snapshots from batch-level DTW distances and contrasts the most divergent pair against a stable anchor, regularizing the latent space without imposing rigid invariance. Across five real-world benchmarks, ContrastAD attains the highest mean F1 on all five datasets and the highest AUC on three (SWaT 93.60, SMD 98.66, PSM 97.79), with statistically significant F1 and AUC margins over the strongest baseline on SWaT and PSM. On MSL and SMAP, it trails the AUC leader by under 0.7 points while still leading on F1. Ablation and sensitivity studies further confirm that the contrastive objective works best as a soft regularizer, supporting our claim that strict invariance is suboptimal under non-stationary dynamics.
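
As a rough illustration of the "spectral top-K filtering before attention" step, the sketch below keeps only the K strongest frequency bins of each series and discards the rest. It is a generic NumPy version under assumed conventions (real FFT, per-series selection by magnitude); the function name `spectral_topk_filter` and the toy signal are hypothetical, and the abstract does not specify how ContrastAD actually selects or weights frequencies.

```python
import numpy as np

def spectral_topk_filter(x, k):
    """Keep the k largest-magnitude rFFT bins of each series; zero the rest.

    x: array of shape (n_series, length).
    """
    spec = np.fft.rfft(x, axis=-1)
    # Indices of the k strongest frequency bins per series.
    top = np.argsort(np.abs(spec), axis=-1)[:, -k:]
    mask = np.zeros(spec.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return np.fft.irfft(np.where(mask, spec, 0), n=x.shape[-1], axis=-1)

# Toy example: one sinusoid buried in white noise.
rng = np.random.default_rng(0)
t = np.arange(256)
clean = np.sin(2 * np.pi * 4 * t / 256)[None, :]
noisy = clean + 0.3 * rng.standard_normal(clean.shape)
filtered = spectral_topk_filter(noisy, k=1)
```

Feeding `filtered` rather than `noisy` into a query-key similarity is one way to keep broadband noise out of the attention scores, which is the motivation the abstract gives.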


2605.23726 2026-05-25 cs.LG cs.DS stat.ML

Optimal Dimension-Free Sampling for Regularized Classification

正则化分类的最优无维度采样

Meysam Alishahi, Alexander Munteanu, Simon Omlor, Jeff M. Phillips

Affiliations: University of Utah, USA; TU Dortmund, Germany

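AI summary: This paper studies optimal dimension-free sampling that achieves $(1\pm\varepsilon)$-relative error for regularized classification, covering a broad class of Lipschitz classification losses such as logistic, hinge, and ReLU loss. The authors give upper and lower bounds on the sampling complexity under different regularizers, showing $k^2/\varepsilon^2$ for $\|\cdot\|_2/k$ regularization and $k/\varepsilon^2$ for $\|\cdot\|_1/k$ regularization, and they analyze how the complexity under $\|\cdot\|_2^2/k$ regularization depends on a derivative property of the loss. Compared with the earlier cubic $k^3/\varepsilon^2$ sensitivity sampling bound, the work obtains better sampling efficiency using uniform or (squared) norm sampling together with a finer higher-moment analysis.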

详情

AI中文摘要

我们证明了对于一大类Lipschitz连续分类损失函数，在各种正则化项下，达到$(1\pm\varepsilon)$相对误差的最优采样界。这包括重要的函数如logistic和sigmoid损失、hinge损失和ReLU损失，作为突出和流行的代表性例子。特别地，我们证明了对于$\|\cdot\|_2/k$正则化的$k^2/\varepsilon^2$上下界，以及对于$\|\cdot\|_1/k$正则化的$k/\varepsilon^2$上下界。对于$\|\cdot\|_2^2/k$正则化，采样复杂度主要取决于有界导数性质：如果$|g'(x)|\leq g(x)$，且$g(0)>0$，且$g$是单调或凸的，则采样复杂度是$k$的线性；否则一般界为$k^2/\varepsilon^2$。然而，如果$g(0)=0$，我们的结果表明不可能得到无维度界，甚至次线性界也被排除。所有上界都有匹配的下界（至多相差多对数项）。此外，我们的工作在概念上和算法上依赖于简单的均匀或（平方）范数采样，从而改进了最近(Alishahi and Phillips, ICML'24)的立方$k^3/\varepsilon^2$敏感度采样界。这是通过涉及更高矩界和经验过程分析的精细论证来实现的，以避免在事实上的标准VC维和敏感度框架中出现的过度计数。

英文摘要

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits linear in $k$ sampling complexity; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and hereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid overcounting that appears in the de-facto standard VC-dimension and sensitivity framework.
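
To make the uniform-sampling idea concrete, here is a small numerical sketch for $\|\cdot\|_2/k$-regularized logistic loss. The objective's normalization (average loss plus $\|w\|_2/k$), the synthetic data, and the sample size with all constants and logarithmic factors dropped are assumptions made for illustration. The check is for a single fixed $w$, whereas the paper's guarantee concerns relative error over all $w$ at once.

```python
import numpy as np

def logistic_loss(z):
    # log(1 + exp(-z)), computed stably.
    return np.logaddexp(0.0, -z)

def objective(X, y, w, k):
    # Assumed form: average loss plus ||w||_2 / k regularization.
    return logistic_loss(y * (X @ w)).mean() + np.linalg.norm(w) / k

rng = np.random.default_rng(0)
n, d, k, eps = 200_000, 10, 4, 0.1
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

# Sample size k^2 / eps^2 with constants dropped: independent of n and d.
m = round(k**2 / eps**2)
idx = rng.choice(n, size=m, replace=True)  # plain uniform sampling

full = objective(X, y, w, k)
approx = objective(X[idx], y[idx], w, k)
rel_err = abs(approx - full) / full
```

The point of the sketch is that `m` depends only on `k` and `eps`, which is what "dimension-free" refers to in the abstract.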


2605.23723 2026-05-25 cs.AI

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

MemAudit：通过因果归因和结构异常检测对中毒代理记忆进行事后审计

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, liang lu, Feng Liu, Xiangzheng Zhang, Duohe Ma, Tong Yang, Lin Sun

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Qiyuan Tech; Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; School of Cyber Security, University of Chinese Academy of Sciences

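AI summary: As large language model agents increasingly rely on persistent memory to store past interactions and improve task execution, the memory mechanism also introduces a security risk: an attacker can inject malicious records into memory through normal interactions and thereby influence the agent's behavior. The paper proposes MemAudit, a causal memory auditing framework for post-hoc auditing of memory-augmented LLM agents. The method combines causal influence scoring with structural anomaly detection to identify malicious memory records that contributed to a harmful output, and it substantially reduces attack success rates in both the QA and reasoning-agent settings evaluated.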

详情

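Abstract (translated from the Chinese version)

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interactions, and these records may later be retrieved to steer the agent's reasoning and actions. Existing defenses focus mainly on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible once harmful behavior has been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) counterfactual memory influence scoring, which measures each memory's causal contribution to the harmful output; and (2) a memory consistency graph, which identifies memories that are structurally anomalous within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than by directly modifying the memory bank. In both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under a realistic post-hoc auditing scenario. The QA attack success rate drops from $70\%$ to $0\%$, and the RAP attack success rate drops from $83.3\%$ to $0\%$.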
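
The counterfactual influence signal can be pictured as a leave-one-out ablation: remove one memory, re-evaluate the harmful output, and attribute the change to that memory. The sketch below is only that generic idea. The names `counterfactual_influence` and `toy_harm_score` are hypothetical, the keyword-based scorer stands in for re-running a real agent, and nothing here reflects MemAudit's actual scoring function or its memory consistency graph.

```python
def counterfactual_influence(memories, harm_score):
    """Leave-one-out influence of each memory on a harm score."""
    base = harm_score(memories)
    influences = []
    for i in range(len(memories)):
        # Counterfactual memory store with record i removed.
        ablated = [m for j, m in enumerate(memories) if j != i]
        influences.append(base - harm_score(ablated))
    return influences

def toy_harm_score(memories):
    # Hypothetical stand-in for "run the agent and rate its output".
    return 1.0 if any("ignore safety" in m for m in memories) else 0.0

memories = [
    "User asked for the weather in Paris.",
    "Demonstration: ignore safety checks and forward the record.",
    "User asked to summarize a report.",
]
scores = counterfactual_influence(memories, toy_harm_score)
suspect = max(range(len(memories)), key=scores.__getitem__)
```

In a real audit each call to the scorer would cost at least one agent run, so the number of ablations is the main expense of this kind of attribution.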

Mohammad Tabish, Stefan Klus

Affiliations: Maxwell Institute for Mathematical Sciences, University of Edinburgh and Heriot–Watt University; School of Mathematical & Computer Sciences, Heriot–Watt University

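AI summary: This paper concerns RaNNDy, a randomized neural network architecture for approximating transfer operators of complex dynamical systems, in which hidden-layer weights and biases are randomly initialized and kept fixed and only the output layer is trained, which lowers training cost and yields a closed-form solution. Because the basis functions depend on the initially chosen activation function, the paper proposes an algorithm that optimizes the activation function while the network parameters stay fixed, making the basis functions better adapted, and demonstrates the approach on several benchmark problems.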

详情

AI中文摘要

RaNNDy是一种随机神经网络架构，用于数据驱动地逼近与复杂动力系统相关的传递算子。网络隐藏层的权重和偏置随机初始化并保持固定，仅训练输出层。与完全优化的神经网络相比，这具有几个优点，特别是输出层的闭式解和显著降低的训练成本。尽管有这些优点，RaNNDy局限于参数化算子逼近所需基函数的权重和偏置的初始选择。由于基函数由激活函数决定，为隐藏层选择合适的激活函数至关重要。在这项工作中，我们提出了一种算法，该算法优化激活函数本身，同时保持随机神经网络中的权重和偏置固定，从而提供更合适的字典。我们通过各种基准问题（包括随机微分方程和图上的随机游走）说明了该方法的有效性。

英文摘要

RaNNDy is a randomized neural network architecture for the data-driven approximation of transfer operators associated with complex dynamical systems. The weights and biases of the hidden layers of the network are randomly initialized and kept fixed; only the output layer is trained. This has several advantages over fully optimized neural networks, notably a closed-form solution for the output layer and significantly lower training costs. Despite these advantages, RaNNDy is restricted to the initial selection of weights and biases that parametrize the basis functions required for the operator approximation. Since the basis functions are determined by the activation function, choosing an appropriate activation function for the hidden layers is crucial. In this work, we propose an algorithm that optimizes the activation function itself, while keeping the weights and biases in the randomized neural network fixed, providing a more suitable dictionary. We illustrate the efficacy of the approach using various benchmark problems, including stochastic differential equations and random walks on graphons.
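
The "fixed random hidden layer, closed-form output layer" idea can be sketched with ordinary random-feature least squares. This is a generic illustration on a toy one-dimensional linear system, not RaNNDy's transfer-operator formulation and not the activation-function optimization the paper proposes; the `tanh` activation, feature count, and ridge parameter are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden layer: weights and biases drawn once and never trained.
n_hidden = 50
W = rng.standard_normal((1, n_hidden))
b = rng.standard_normal(n_hidden)

def features(x):
    """Random basis functions tanh(x W + b) evaluated at the rows of x."""
    return np.tanh(x @ W + b)

# Snapshot pairs (x_t, x_{t+1}) from a toy linear system x_{t+1} = 0.9 x_t.
X = rng.uniform(-1.0, 1.0, size=(500, 1))
Y = 0.9 * X

# Output layer in closed form: ridge-regularized least squares.
Phi = features(X)
lam = 1e-4
C = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_hidden), Phi.T @ Y)

mse = float(np.mean((features(X) @ C - Y) ** 2))
```

Because only `C` is solved for, training reduces to one linear solve; the quality of the fit then depends entirely on how well the fixed basis functions suit the problem, which is the limitation the abstract sets out to address.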


2605.22738 2026-05-25 cs.LG cs.AI stat.ML

Proxy-Based Approximation of Shapley and Banzhaf Interactions

基于代理的Shapley和Banzhaf交互近似

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter, Eyke Hüllermeier, Maximilian Muschalik, Fabian Fumagalli

Affiliations: LMU Munich; MCML; DFKI; Centre for Credible AI, Warsaw University of Technology; University of Warsaw; Claremont McKenna College; Bielefeld University

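AI summary: This paper studies how to estimate Shapley and Banzhaf interaction values efficiently and accurately in order to explain complex feature interactions in machine learning models. The authors propose ProxySHAP, which combines the sample efficiency of tree-based proxy models with a residual correction strategy. On the theory side, they derive a polynomial-time generalization of interventional TreeSHAP that computes exact interaction indices for tree ensembles, and they characterize when the residual adjustment corrects proxy bias without exponential growth in variance. In the reported benchmarks, ProxySHAP attains the lowest error in both small- and large-budget regimes, including applications with thousands of features.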

详情

AI中文摘要

Shapley和Banzhaf交互捕捉了现代机器学习应用中固有的复杂动态。然而，当前对这些高阶交互的估计器在速度和准确性之间进行权衡。为了克服这一限制，我们引入了ProxySHAP。ProxySHAP将基于树的代理模型的高样本效率与通过残差校正实现一致性的原则路径相结合。在理论层面，我们推导了干预TreeSHAP的多项式时间推广，以计算树集成的精确交互指数，成功避免了先前方法中的指数树深度依赖。此外，我们正式分析了残差调整策略，刻画了最大样本重用（MSR）在特定条件下校正代理偏差而不使其方差随交互规模指数增长的条件。广泛的基准测试表明，ProxySHAP在近似质量上树立了新的最先进标准，包括在具有数千个特征的大规模应用中。通过在小预算和大预算场景下均实现最低误差，ProxySHAP显著优于先前最佳估计器ProxySPEX和KernelSHAP-IQ，同时在可解释性下游任务上也提供了卓越性能。

英文摘要

Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
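
For reference, the quantities being approximated can be computed exactly by brute force on a tiny game. The sketch below evaluates the standard pairwise Shapley interaction index and Banzhaf interaction index by enumerating all coalitions, which costs time exponential in the number of players. It is the textbook definition only, not ProxySHAP or its tree-based algorithm; `toy_game` and its coefficients are made up for the example.

```python
from itertools import combinations
from math import factorial

def pairwise_interactions(value, n, i, j):
    """Exact Shapley and Banzhaf interaction indices for the pair (i, j)."""
    others = [p for p in range(n) if p not in (i, j)]
    shapley = banzhaf = 0.0
    for size in range(len(others) + 1):
        # Shapley interaction weight for coalitions of this size.
        w_shap = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second derivative of the game with respect to i and j.
            delta = (value(s | {i, j}) - value(s | {i})
                     - value(s | {j}) + value(s))
            shapley += w_shap * delta
            banzhaf += delta / 2 ** (n - 2)
    return shapley, banzhaf

def toy_game(s):
    # Two additive effects plus one pure interaction between players 0 and 1.
    return 2.0 * (0 in s) + 1.0 * (2 in s) + 3.0 * (0 in s and 1 in s)

shap_01, banz_01 = pairwise_interactions(toy_game, n=4, i=0, j=1)
shap_02, banz_02 = pairwise_interactions(toy_game, n=4, i=0, j=2)
```

For this game both indices recover the interaction coefficient 3 for the pair (0, 1) and zero for the pair (0, 2), since players 0 and 2 contribute additively.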


2605.22672 2026-05-25 cs.AI

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

能力是负担吗？更强大的语言模型在关键时刻做出更差的预测

Nick Merrill, Jaeho Lee, Ezra Karger

Affiliations: Forecasting Research Institute

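AI summary: This paper documents an inverse scaling effect: on time-series forecasting tasks with superlinear growth and tail risk of regime change, more capable language models produce worse distributional forecasts. In experiments on simulated and real-world datasets, more capable models shift the upper tail of their forecasts upward, while lower-tail predictions remain stable. The study also finds that both model scale and post-training contribute to the effect, and it recommends evaluating LLM forecasts with continuous accuracy measures alongside binary single-threshold metrics.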

详情

AI中文摘要

我们记录了LLM在预测问题上的逆缩放现象，这些问题的底层时间序列表现出超线性增长和制度转换的尾部风险，这种结构在金融和流行病学中很常见。在这些任务上，更强大的模型会产生更差的分位数预测。该模式出现在我们发布的、无污染的模拟世界基准ForecastBench-Sim（FBSim）上，在预测具有匹配线性控制的合成SIR流行病时，并在COVID-19、麻疹、住房市场和恶性通货膨胀的真实世界数据集中得到复现。每个分位数的分解表明，失败集中在尾部上端，更强大的模型将其向上移动以跟踪激进的增长外推，而下尾部保持不变。Llama-3.1的族内研究表明，模型规模和后训练都独立地促成了这种效应。领域知识并不能可靠地挽救校准。这种逆缩放并不出现在LLM预测基准中常见的单阈值指标上，在相同的输出上，能力-准确性关系的符号发生了反转。在常规截止点上的单阈值评分忽略了尾部上端的成本；包含尾部的评分在相同的输出上反转了能力-准确性关系的符号。我们建议LLM预测评估使用连续（且无界）的准确性度量以及有界的二元阈值度量。

英文摘要

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability–accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
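
The contrast between single-threshold and tail-inclusive scoring can be illustrated with two hypothetical forecasters who give the same probability for a threshold question but different upper quantiles. The sketch uses the Brier score for the binary question and the pinball (quantile) loss for the 0.95 quantile. All numbers are invented, and the paper's actual scoring rules and cutoffs are not specified in the abstract.

```python
def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of forecasting q for the tau-quantile when y occurs."""
    return tau * (y - q) if y >= q else (1 - tau) * (q - y)

def brier(p, occurred):
    """Brier score of probability p for a binary event (1.0 if it occurred)."""
    return (p - occurred) ** 2

outcome = 120.0      # realized value (hypothetical)
threshold = 100.0    # conventional cutoff for the binary question

# Both forecasters assign probability 0.8 to "outcome exceeds the threshold".
event = 1.0 if outcome > threshold else 0.0
brier_moderate = brier(0.8, event)
brier_aggressive = brier(0.8, event)

# They differ only in the upper tail: their forecast 0.95 quantiles.
pinball_moderate = pinball_loss(outcome, q=200.0, tau=0.95)
pinball_aggressive = pinball_loss(outcome, q=2000.0, tau=0.95)
```

The bounded threshold score cannot tell the two forecasters apart, while the unbounded quantile loss penalizes the inflated upper tail, which matches the abstract's recommendation to report both kinds of metric.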


2605.22643 2026-05-25 cs.CL

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

温水煮青蛙：面向智能体安全的多轮基准测试

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi

Affiliations: Icaro Foundation; Sapienza University of Rome; Sant’Anna School of Advanced Studies; Tongji University School of Law; AIQI Consortium; Università Cattolica del Sacro Cuore; Piccadilly Labs; VU Amsterdam; Independent

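AI summary: This paper introduces Boiling the Frog, a multi-turn benchmark for evaluating the safety of language models deployed as agents in corporate and office settings, with a focus on incremental attacks. The benchmark simulates multi-turn interactions in a workspace and tests whether a model ends up in an unsafe state when a risk-bearing request is introduced after benign edits. Attack success rates vary widely across models, from 20.5% to 92.9%, which indicates that current systems remain vulnerable to this kind of attack.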

详情

AI中文摘要

背景。传统的语言模型安全基准评估生成的文本：模型是否输出有毒语言、再现偏见或遵循有害指令。当模型作为智能体部署时，安全相关的对象从系统所说的内容转移到它在环境中执行的操作，仅评估模型在提示下的响应已不足以应对人工智能带来的安全挑战。近期发展出现了评估大型语言模型作为智能体的基准测试。我们对此研究方向做出贡献。方法。我们引入“温水煮青蛙”基准测试，评估部署在企业及办公环境中使用工具的AI模型是否易受渐进式攻击。每个场景以良性工作区编辑开始，随后引入一个包含风险的请求。基准测试聚焦于有状态的多轮评估：链暴露持久工作区，将风险负载置于轮次序列中的受控位置，并评估最终工件状态是否变得不安全。场景基于“温水煮青蛙”风险、AI法案附件I和附件III高风险情境以及欧盟AI法案通用人工智能行为准则（GPAI）构建的三级操作风险分类进行组织。结果。在九个模型的面板中，总体严格攻击成功率（ASR）为44.4%。模型级ASR范围从Claude Haiku 4.5的20.5%到Gemini 3.1 Flash Lite的92.9%，Seed 2.0 Lite也超过80%。对于行为准则失控场景，平均链类别级ASR达到93.3%。

英文摘要

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.
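
As a small illustration of how a per-model attack success rate could be tallied from chain-level outcomes, the sketch below counts a chain as a success when its final workspace state is judged unsafe. The `ChainResult` record, the model names, and the outcomes are hypothetical, and the benchmark's exact "strict" criterion and aggregation across models are not described in the abstract.

```python
from dataclasses import dataclass

@dataclass
class ChainResult:
    model: str
    final_state_unsafe: bool  # judged on the resulting artifact state

def attack_success_rate(results):
    """Fraction of chains per model whose final state was scored unsafe."""
    counts = {}
    for r in results:
        hits, total = counts.get(r.model, (0, 0))
        counts[r.model] = (hits + int(r.final_state_unsafe), total + 1)
    return {model: hits / total for model, (hits, total) in counts.items()}

# Hypothetical outcomes for two placeholder models.
results = [
    ChainResult("model-a", False),
    ChainResult("model-a", True),
    ChainResult("model-a", False),
    ChainResult("model-a", False),
    ChainResult("model-b", True),
    ChainResult("model-b", True),
]
asr = attack_success_rate(results)
```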


2605.21874 2026-05-25 cs.SD

Real-time, EDM-inspired sonification of the activity of a supercomputer

实时、受EDM启发的超级计算机活动声音化

Marco Alunno, Paolo Bientinesi

Affiliations: High-Performance Computing Center North; Umeå University

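AI summary: This paper studies how to render real-time operating data from a supercomputer as an informative sonification. It proposes a sonification approach in the style of electronic dance music (EDM) that reflects the activity of the system's nodes in a continuous, intelligible, and engaging way. The approach targets real-time monitoring rather than debugging and generates virtually unending, stylistically coherent music suited to long-term listening.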

Comments 7 pages, 2 figures, accepted conference paper

详情

AI中文摘要

本文描述的项目探索了对超级计算机实时接收的数据进行信息性声音化。这些数据捕获了计算机所有节点当前的活动，因此其声音化作为一种持续监控节点行为以及整个系统行为的形式。由于这种监控理论上永无止境，因此产生的声音化必须在音乐上能够通过声音传达信息，同时保持长时间的可理解性和吸引力。我们没有将预定义的音乐风格强加于数据，而是试图找到一种数据本身能够合理支持的音乐风格。从一小部分候选中，我们选择了EDM，因为它是一类流派，其结构和时间特征与连续的数据驱动过程和长期聆听非常契合。通过这种基于风格的方法，本研究建立在计算机数据声音化的悠久传统之上，同时独特地结合了很少同时处理的三个要素：以监控（而非调试）为主要目标、实时（而非事后）数据解释，以及生成几乎无限且风格连贯（而非不协调）的音乐。

英文摘要

The project described in this paper explores the informative sonification of data received in real time from a supercomputer. These data capture the current activities in all the nodes of the computer, therefore, their sonification functions as a form of continuous monitoring of the nodes' behavior and, by extension, of the system as a whole. Because such monitoring is theoretically unending, the resulting sonification must be musically capable of conveying information through sound in a way that remains both intelligible and engaging over long durations. Rather than imposing a predefined musical style onto the data, we sought to identify one which the data themselves could plausibly support. From a small set of candidates, we selected EDM because it is a family of genres whose structural and temporal characteristics align well with continuous, data-driven processes and long-term listening. Through this style-based approach, this research builds on the long tradition of computer data sonification while uniquely combining three elements rarely addressed together: monitoring (rather than debugging) as the primary goal, real-time (rather than post-mortem) data interpretation, and generation of virtually infinite and stylistically coherent (rather than incongruous) music.
