arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12354 2026-06-11 cs.CR 新提交

ECYSAP EYE: From Cyber Situational Awareness to Mission-Centric Decision Support for Enhanced Cyberspace Operations

ECYSAP EYE:从网络态势感知到以任务为中心的决策支持,增强网络空间行动

Pantaleone Nespoli, Daniel Díaz-López, Sergio Lopez Bernal, Francisco Oliva Bermejo, Pedro González Megías, Jorge Maestre Vidal, Víctor Sobrino García, Gregorio Martínez Pérez

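AI summary: Proposes the ECYSAP EYE System-of-Systems architecture, which uses seven groups of mission-relevant artefacts (such as RCyP and CySRs) to structure the transition from perception to decision to execution, supports incremental deployment and validation, and strengthens situational awareness and decision support in cyberspace mission planning and execution.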

Comments
4 pages, 1 figure, 1 table, paper in proceedings of the XI National Cybersecurity Research Conference (JNIC) in Barcelona, Spain, May, 2026
AI中文摘要

运营组织越来越需要超越孤立技术警报的网络态势感知(CySA)能力,提供可嵌入异构工具链和网络安全或网络防御流程的任务相关制品。ECYSAP EYE通过一种面向采用的系统之系统(SoS)架构满足这一需求,该架构围绕七组以任务为中心的制品:识别网络空间图(RCyP)、网络态势报告(CySRs)、假设分析报告(WIAR)、选项建议(OPRE)、操作员仪表板/人机界面(DSH)、行动执行(AE)和事后报告(AAR)。ECYSAP EYE架构构建了从感知(全频谱RCyP视图)到面向决策的推理(WIAR/CySRs/OPRE),再到操作执行和学习(DSH/AE/AAR)的过渡,具有支持增量部署和验证的明确集成面。本文从技术转移角度介绍这一创新项目,总结了更新后的架构、七组制品的功能角色,以及在任务规划与执行背景下网络态势对决策过程的预期影响。

英文摘要

Operational organizations increasingly require Cyber Situational Awareness (CySA) capabilities that go beyond isolated technical alerts, providing mission-relevant artefacts that can be embedded into heterogeneous toolchains and cyber security or cyber defense processes. ECYSAP EYE addresses this need through an adoption-oriented System-of-Systems (SoS) architecture centered on seven groups of mission-focused artefacts: the Recognized Cyberspace Picture (RCyP), Cyber Situational Reports (CySRs), the What-If Analysis Report (WIAR), Option Recommendations (OPRE), an operator Dashboard/HMI (DSH), Action Enforcement (AE), and After-Action Reports (AAR). The ECYSAP EYE architecture structures the transition from perception (full-spectrum RCyP views), to decision-oriented reasoning (WIAR/CySRs/OPRE), and to operational execution and learning (DSH/AE/AAR), with explicit integration surfaces that support incremental deployment and validation. This paper presents this innovative project from a technology transfer perspective, summarizing the updated architecture, the functional roles of the seven groups of artefacts, and the expected impact of cyber situations on the decision-making process in the context of mission planning and execution.

2606.12352 2026-06-11 cs.RO cs.AI 新提交

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

CHORUS: 基于单一VLA策略的去中心化多体协作

Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn, Jeannette Bohg

Affiliation: Stanford University

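AI summary: Proposes the CHORUS framework, which exploits the visuomotor priors of a pretrained vision-language-action model to achieve decentralized multi-robot collaboration without inference-time communication, significantly outperforming baselines in real-world experiments.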

Comments
Project Website: this https URL
AI中文摘要

多机器人协作使机器人能够高效完成从通过门搬运沙发到建筑工地组装结构等各种任务。然而,在移动多机器人环境中实现这种协调仍然具有挑战性:基于团队联合观测的集中式方法随团队规模扩展性差,而为每个机器人训练一个策略的去中心化方法通常需要显式对齐程序或推理时信息共享来克服部分可观测性。我们的关键见解是,预训练的视觉-语言-动作(VLA)模型的视觉运动先验应能够仅从每个机器人的局部观测实现反应式去中心化协作,无需这些推理时假设。我们提出CHORUS,一个适配单一VLA骨干以控制多样化多机器人团队的框架。推理时,每个机器人运行CHORUS的独立副本,仅基于其自身观测和机器人标识提示。在包括移动卷尺测量、图书馆书籍交接和洗衣篮抬举的真实实验中,CHORUS相比去中心化从头训练模型提升64个百分点,对队友行为的反应性提升40个百分点,并优于集中式基线。这些结果表明,共享VLA骨干能够实现去中心化多机器人协作,无需每个机器人的独立策略或推理时机器人间通信。

英文摘要

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse, multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64 percentage point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40 percentage points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.

2606.12350 2026-06-11 cs.AI 新提交

Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Nonslop: 人机协作写作中的游戏化实验

Maria Edwards, Julian Togelius

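AI summary: Uses a gamified writing experiment to study when users maintain creative autonomy in the presence of AI suggestions, revealing the tension between efficiency and authenticity.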

Comments
Accepted at the 2026 IEEE Conference on Games (CoG 2026); to be published in the conference proceedings. Camera-ready version
AI中文摘要

大型语言模型(LLM)的快速普及引发了关于人类创造力和个体表达在AI辅助创作时代的关键问题。人类何时采纳AI建议?这对个体声音有何影响?本研究通过一项游戏化写作练习来探讨这些问题,74名参与者(214份回复)在写作时,AI生成的单词建议可供使用。该游戏模拟了一个反乌托邦的未来,其中AI试图从残存的人类个性中学习,并抑制类似AI的写作。通过这种方式,它试图创造能够揭示真实用户偏好而非默认行为(例如接受现成的AI生成建议)的条件。请注意,这是对“有帮助的助手”设计模式的刻意反转;系统明确禁止你接受AI建议。我们分析了不同任务类型、用户行为和回复特征下的用户行为模式,以理解创造性任务中人机交互的影响因素。研究重点关注用户何时选择保持创意自主性,而非违反游戏规则接受AI帮助。此外,还探讨了这些选择如何与回复模式、任务特征和用户行为相关联。这种游戏化方法既为研究真实的人机交互提供了一个框架,也为理解AI增强创造力中效率与真实性之间的张力提供了一个发人深省的视角。

英文摘要

The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an era of AI-assisted creation. When do humans adopt AI suggestions, and what are the implications for individual voice? This study examines these questions through a gamified writing exercise where 74 participants (214 responses) replied to prompts while AI-generated word suggestions were available as they wrote. The game simulates a dystopian future in which an AI is attempting to learn from what remains of human individuality, and disincentivizes AI-like writing. In doing so, it attempts to create conditions that reveal authentic user preferences rather than default behaviors, such as accepting a readily available AI-generated suggestion. Note that this is a deliberate inversion of the "helpful assistant" design pattern; the system is explicitly forbidding you from accepting AI suggestions. We analyze user behavior patterns across different task types, user behaviors, and response characteristics to understand the factors influencing human-AI interaction in creative tasks. The study focuses on when users choose to maintain creative autonomy versus violating the rules of the game and accepting AI assistance. It also explores how these choices relate to response patterns, task characteristics, and user behavior. This gamified approach offers both a framework for studying authentic human-AI interaction and a provocative lens for understanding the tension between efficiency and authenticity in AI-augmented creativity.

2606.12349 2026-06-11 cs.RO eess.SY 新提交

Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles

面向无人水面艇操纵性评估的海洋机器人Unity仿真器中可追溯虚拟海试

Paria Rezayan

Affiliation: School of Engineering and Built Environment, Sheffield Hallam University

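AI summary: To address the difficulty of obtaining data for identifying USV hydrodynamic derivatives, establishes a standardised virtual sea-trial framework in the MARUS simulator, with automated execution of TC/ZZ manoeuvres and a data acquisition and post-processing pipeline that produces repeatable datasets aligned with IMO/ITTC metrics; a case study validates the framework.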

AI中文摘要

精确识别水动力导数对于无人水面艇(USV)的控制与导航至关重要,但物理海试的高保真操纵数据受成本和安全性限制。回转试验(TC)和Z形试验(ZZ)仍是IMO和ITTC评估程序的基础。本文扩展了海洋机器人Unity仿真器(MARUS),引入标准化虚拟海试框架,用于TC/ZZ机动的自动化执行和数据生成,包括可追溯的命令-执行日志记录、面向系统辨识(SI)的数据调理以及自动提取符合IMO/ITTC的操纵性指标。一个关键贡献是专用的TC/ZZ数据采集和后处理管道,提高了基于仿真的机动的可重复性和可审计性,同时生成适用于水动力导数辨识和数字孪生工作流的SI就绪数据集。另一个特点是差动推力转向的显式命令-执行分离,其中输入记录为有序的等效舵命令,而实际执行则记录为基于施加推力的执行级代理。案例研究结果表明了可重复且合规的机动行为。对于TC试验,左舷和右舷之间的归一化进距差异约为3.9%,战术直径差异约为4.6%至4.7%。对于ZZ试验,±10度和±20度机动下的第一和第二超越角超调量均保持在1度以下,满足IMO标准,而峰值偏航速率约为4.1至5.8度/秒。总体而言,该框架提供了一种可重复且可审计的虚拟海试工作流,用于生成符合IMO/ITTC的数据集,并支持系统辨识、水动力导数估计和数字孪生校准。

英文摘要

Accurate identification of hydrodynamic derivatives is essential for control and navigation of Unmanned Surface Vehicles (USVs), but high-fidelity manoeuvring data from physical sea trials are constrained by cost and safety. Turning Circle (TC) and Zig-Zag (ZZ) trials remain fundamental to IMO and ITTC assessment procedures. This paper extends the Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres, with traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. Another feature is explicit command-execution separation for differential-thrust steering, where inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case-study results demonstrate repeatable and compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9 percent between port and starboard sides, while the tactical diameter differs by approximately 4.6 to 4.7 percent. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/- 10 degree and +/- 20 degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 deg/s. Overall, the framework provides a repeatable and auditable virtual sea-trial workflow for generating IMO/ITTC-aligned datasets and supporting system identification, hydrodynamic-derivative estimation, and digital-twin calibration.
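As a rough illustration of the kind of metric such a post-processing pipeline might extract, the sketch below computes a first overshoot angle from a zig-zag heading trace. It is not the paper's pipeline: the function name `first_overshoot`, the input format, and the simplifying assumption that the rudder is reversed exactly when the heading first reaches the check angle are all hypothetical.

```python
def first_overshoot(headings_deg, check_angle_deg):
    """Return the first overshoot angle in degrees, or None if the check
    angle is never reached.

    Assumption (hypothetical): the rudder is reversed at the first sample
    where the heading reaches check_angle_deg, so the overshoot is the peak
    heading reached before the trace drops back below that angle, minus it.
    """
    peak = None
    for psi in headings_deg:
        if peak is None:
            # Still waiting for the heading to reach the check angle.
            if psi >= check_angle_deg:
                peak = psi
        elif psi < check_angle_deg:
            # Heading has swung back: the first overshoot is over.
            break
        else:
            peak = max(peak, psi)
    return None if peak is None else peak - check_angle_deg


# Example: a 10-degree zig-zag whose heading peaks at 10.8 degrees.
trace = [0.0, 4.0, 8.0, 10.0, 10.6, 10.8, 10.3, 9.0, 5.0]
overshoot = first_overshoot(trace, 10.0)
```

With this trace the overshoot stays below 1 degree, the kind of threshold check the abstract reports for its ZZ tests.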

2606.12347 2026-06-11 cs.CE physics.geo-ph 新提交

Local Stress Redistribution Controls Interactions between Hydraulic Fractures and Pre-existing Fractures

局部应力重分布控制水力裂缝与预先存在裂缝之间的相互作用

S. Shandilaya, M. Alaleeli, S.H. Kim, M. Mobasher, S. Roshankhah

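AI summary: Uses experiments and simulations to study how stress redistribution induced by natural fractures controls hydraulic fracture trajectories, revealing the mechanism by which shear deformation attracts or repels fractures.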

Comments
24 pages, 12 figures. Submitted to the International Journal of Rock Mechanics and Mining Sciences
AI中文摘要

水力裂缝在天然裂缝性地层中的传播受到预先存在的天然裂缝附近局部应力状态的强烈影响。天然裂缝诱导的剪切变形和应力重分布在控制水力裂缝轨迹中的作用仍不明确。本研究通过耦合实验室实验和孔隙弹性扩展有限元模拟,在平面应变条件下对完整和预裂PMMA试样进行了研究,探讨了天然裂缝诱导的应力重分布如何控制水力裂缝与天然裂缝的相互作用。数字图像相关提供了机械加载和水力压裂过程中位移和应变演化的全场测量。在固定底座、侧向约束和垂直压缩边界条件下,倾斜的天然裂缝诱导不对称的应力重分布和剪切变形,在流体注入前产生不同的局部应力状态。结果表明,水力裂缝轨迹由天然裂缝相对于远场最大主应力方向产生的剪应力和剪应变分量的符号和空间分布控制。促进天然裂缝附近压应力发展的剪切变形导致水力裂缝偏转远离,而降低天然裂缝有效法向应力的剪切变形则促进裂缝吸引和连接。预裂试样中水力裂缝曲率的相应数值再现需要混合模式(I-II型)断裂能释放准则,而完整试样则纯I型扩展。总体而言,研究结果揭示了由于天然裂缝的存在,局部应力状态演化导致从拉伸张开到剪切辅助混合模式传播的转变,为地下刺激和储存应用中预测和控制裂缝轨迹提供了机理基础。

英文摘要

Hydraulic fracture (HF) propagation in naturally fractured formations is strongly influenced by local stress states near pre-existing natural fractures (NFs). The role of NF-induced shear deformation and stress redistribution in controlling HF trajectories remains poorly characterized. This study investigates how NF-induced stress redistribution governs HF-NF interactions through coupled laboratory experiments and poroelastic extended finite element simulations on intact and pre-fractured PMMA specimens under plane-strain conditions. Digital image correlation provides full-field measurements of displacement and strain evolution during mechanical loading and hydraulic fracturing. Under fixed-base, lateral confinement, and vertical compression boundary conditions, inclined NFs induce asymmetric stress redistribution and shear deformation, generating distinct local stress states prior to fluid injection. The results demonstrate that the HF trajectory is governed by the sign and spatial distribution of shear stress and shear strain components generated by NF orientation relative to the far-field maximum principal stress. Shear deformation that promotes compressive stress development adjacent to the NF causes the HF to deflect away, whereas shear deformation that reduces the effective normal stress along the NF promotes fracture attraction and linkage. Corresponding numerical reproduction of HF curvature in pre-fractured specimens requires mixed-mode (Mode I-II) fracture energy release criteria, while the intact specimen propagates in pure Mode I. Overall, the findings reveal a transition from tensile opening to shear-assisted mixed-mode propagation as local stress states evolve due to the presence of NFs, providing a mechanistic basis for predicting and controlling fracture trajectories in subsurface stimulation and storage applications.

2606.12346 2026-06-11 cs.CV cs.AI cs.LG 新提交

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Atlas H&E-TME:基于AI的可扩展组织分析,达到专家病理学家级别的准确性

Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Affiliations: Aignostics, Germany; Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany; Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Germany; Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, MA, US; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US; Machine Learning Group, Technische Universität Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany; Department of Artificial Intelligence, Korea University, Republic of Korea; Max-Planck Institute for Informatics, Germany; German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany; Institute of Pathology, Ludwig-Maximilians-Universität München, Germany; Bavarian Cancer Research Center (BZKF), Germany

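AI summary: Proposes the Atlas H&E-TME system, which uses pathology foundation models to predict tissue quality, tissue region, and cell type, and which, validated against an IHC-informed consensus and a benchmark of over 200,000 annotations, matches or exceeds pathologist-level performance across multiple cancer types.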

AI中文摘要

苏木精和伊红(H&E)染色是组织病理学的基石,然而对H&E全切片图像(WSI)进行可扩展的定量分析仍然是计算病理学中的核心挑战。我们提出了Atlas H&E-TME,这是一个基于Atlas病理基础模型家族的AI系统,可预测多种癌症类型的组织质量、组织区域和细胞类型标签,在细胞级分辨率下每张切片产生超过4,500个定量读数。验证此类系统的关键挑战在于克服H&E-only金标准固有的形态模糊性,以及依赖免疫组织化学(IHC)等模态的更可靠参考的可扩展性有限。我们通过一个双重验证框架解决了这一问题,该框架将生物学深度的基础与技术及形态学的广度相结合。在深度方面,我们提出了一种IHC引导的多病理学家共识协议,该协议显著提高了相较于传统H&E-only注释的评分者间一致性。这产生了一个分子学基础的参考,我们据此比较Atlas H&E-TME和仅使用H&E的病理学家。在广度方面,我们在超过20万个高置信度H&E-only病理学家注释上对Atlas H&E-TME进行了基准测试,这些注释涵盖1,500多个病例,跨越八种癌症类型及其最常见的转移部位,亚型覆盖每种癌症类型>90%的临床病例,来自25个以上来源和8种以上扫描仪型号。与IHC引导的共识相比,Atlas H&E-TME达到或超过了病理学家仅使用H&E的性能,并在这一广泛的形态学和技术范围内一致且稳健地泛化。通过这种方式,Atlas H&E-TME将H&E切片——病理学中最普遍的数据——转化为一个可扩展的、定量的肿瘤及其微环境窗口,为转化和临床研究中下一代基于组织的生物标志物奠定了基础。

英文摘要

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

2606.12344 2026-06-11 cs.LG cs.CL 新提交

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Claw-SWE-Bench:评估OpenClaw风格代理框架在编码任务上的基准

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

Affiliations: TokenRhythm Technologies; Infinigence AI; Peking University; City University of Hong Kong; SEE Fund; Shanghai Jiaotong University; Beijing Jiaotong University; Tsinghua University

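AI summary: Proposes the Claw-SWE-Bench benchmark, which uses an adapter protocol to evaluate heterogeneous agent harnesses uniformly, and finds that adapter design is essential to coding performance and that model and harness choice significantly affect pass rate and cost.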

AI中文摘要

通用代理(如OpenClaw)越来越多地被用作自主工具使用者,但其编码能力难以在SWE-bench下衡量:通用代理本身不满足评分所需的干净Docker工作区、补丁和预测合约。我们引入了Claw-SWE-Bench,一个多语言SWE-bench风格的基准和适配器协议,使异构代理框架(即claws)在公平设置下具有可比性,包括固定提示、运行时预算、工作区合约、补丁提取过程和评估器。完整基准包含8种语言、43个仓库的350个GitHub问题解决实例,这些实例来自SWE-bench-Multilingual和SWE-bench-Verified-Mini,经过未来提交清理。我们还发布了Claw-SWE-Bench Lite用于更快验证,这是一个通过成本感知、排名感知程序从17个校准列中选出的80个实例子集。在完整基准上,使用最小直接差异适配器的OpenClaw仅获得19.1%的Pass@1,而完整适配器在相同GLM 5.1骨干下达到73.4%,表明适配器设计对于使OpenClaw风格的框架有效执行编码任务至关重要。在OpenClaw × 9模型扫描和5框架 × 2模型扫描中,模型选择使Pass@1变化29.4个百分点,固定模型下框架选择变化27.4个百分点;精度相似的系统在总API成本上可能差异很大。因此,Claw-SWE-Bench将框架和成本核算视为SWE风格编码代理评估的第一类轴,提供了完整基准和低成本参考集,用于可重复比较。数据可在https://this URL 和 https://this URL 获取。

英文摘要

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at this https URL and this https URL.

2606.12343 2026-06-11 cs.DC 新提交

Fair Comparison of Scheduling Algorithms on Heterogeneous Edge Clusters: A Continuous Adaptive Benchmark

异构边缘集群上调度算法的公平比较:一种连续自适应基准测试

Zihang Wang, Boris Sedlak, Juan Luis Herrera, Schahram Dustdar

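AI summary: Proposes an open source benchmark platform for fairly comparing continuous multi-mode scheduling algorithms on heterogeneous edge clusters; using a unified interface, a closed-loop workload driver, and dual-metric SLO scoring, it shows that controller rankings depend strongly on configuration and that separating raw SLO from steady-state SLO exposes switching costs.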

AI中文摘要

现代人工智能工作负载部署在边缘-云连续体的异构层级上,必须满足关于延迟、吞吐量和输出质量的多维服务等级目标(SLO)。对于每个传入任务,调度器选择目标节点和处理模式(例如,完整或降低推理精度)。我们将这类问题称为连续多模式调度(CMMS)。公平比较CMMS算法很困难,因为先前的研究通常在自己的栈中、在单一工作负载下评估每个控制器,且不报告每次决策的开销。为弥补这些差距,我们提出一个开源基准平台,具有(i)统一控制器接口,(ii)覆盖多种工作负载模式的闭环工作负载驱动器,以及(iii)双指标SLO评分,分别报告原始SLO(整体合规性)和稳态SLO(稳定运行期间的合规性)。通过运行六个控制器跨越五个集群配置和两种负载状态(424个回合),我们发现控制器排名强烈依赖于配置:在轻负载下获胜的深度强化学习控制器,在负载增加时输给基于规则的启发式算法近29个百分点,且每次决策的操作开销约为500倍。我们进一步表明,将原始SLO与稳态SLO分离可以暴露切换成本,而单一聚合分数会混淆这些成本。

英文摘要

Modern Artificial Intelligence (AI) workloads deployed across the heterogeneous tiers of an edge–cloud continuum must satisfy multi-dimensional Service Level Objectives (SLOs) over latency, throughput, and output quality. For each incoming task, the scheduler picks both a target node and a processing mode (e.g., full or reduced inference precision). We call this class of problems Continuous Multi-Mode Scheduling (CMMS). Comparing CMMS algorithms fairly is difficult because prior studies typically evaluate each controller in its own stack, under a single workload, and without reporting per-decision overhead. To close these gaps, we present an open source benchmark platform that features (i) a unified controller interface, (ii) a closed-loop workload driver covering multiple workload patterns, and (iii) dual-metric SLO scoring that reports raw SLO (overall compliance) and steady-state SLO (compliance during stable operation) separately. Running six controllers across five cluster configurations and two load regimes (424 episodes), we find that controller rankings are strongly configuration-dependent: a deep reinforcement-learning winner under light workloads loses to a rule-based heuristic by nearly 29 percentage points once load intensifies, at roughly 500× the per-decision operational overhead. We further show that separating raw from steady-state SLOs exposes switching costs that a single aggregate score would otherwise conflate.
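One way to picture the dual-metric scoring is the minimal sketch below, which reports overall compliance and compliance outside a settling window after each mode switch. The abstract does not define "stable operation", so the `settle_steps` window, the record format, and the function name `slo_scores` are assumptions made purely for illustration.

```python
def slo_scores(records, settle_steps=2):
    """Return (raw_slo, steady_state_slo) as fractions in [0, 1].

    records: time-ordered list of (met_slo, switched) pairs, one per task.
    Assumption (hypothetical): the switching step and the steps right after
    it, settle_steps in total, are excluded from the steady-state score.
    """
    if not records:
        return 0.0, 0.0
    raw_hits = sum(1 for met, _ in records if met)
    steady_hits = 0
    steady_total = 0
    cooldown = 0
    for met, switched in records:
        if switched:
            cooldown = settle_steps  # restart the settling window
        if cooldown > 0:
            cooldown -= 1
            continue  # transient step: counts toward raw SLO only
        steady_total += 1
        steady_hits += 1 if met else 0
    raw = raw_hits / len(records)
    steady = steady_hits / steady_total if steady_total else 0.0
    return raw, steady


# Two violations right after a switch lower the raw score only.
episode = [(True, False), (False, True), (False, False), (True, False), (True, False)]
raw, steady = slo_scores(episode)
```

In this toy episode the gap between the two scores is attributable entirely to the switch, which is the kind of cost a single aggregate score would hide.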

2606.12342 2026-06-11 cs.CL cs.AI cs.ET cs.LG 新提交

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM: 通过跨词汇表logit混合实现推理时对齐迁移

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

Affiliation: Lexsi Labs

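AI summary: To address the loss of safety caused by domain fine-tuning of large models, proposes ALIGNBEAM, a training-free method that translates anchor-model logits token by token and selects the safest candidate, transferring safety alignment across vocabularies while keeping task accuracy and inference overhead within practical bounds.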

AI中文摘要

领域微调会降低大型语言模型的安全性:微调后的专家模型容易顺从以领域语言表述的有害提示。现有的推理时防御方法通过混合来自安全锚模型的logit,但要求两个模型共享词汇表,这使得它们无法用于安全性退化最严重的跨族专家模型。我们提出ALIGNBEAM,一种无需训练的方法,通过在每个解码步骤逐token将锚模型logit翻译为目标模型的词汇表来解除这一限制;然后一个小型LLM法官从K个候选续写中选择最安全的。无需改变权重,并且可以在部署时调整安全-效用权衡而无需重新训练。在跨词汇表和同词汇表评估对中,ALIGNBEAM显著提高了对抗性基准上的拒绝率,同时将任务准确性和推理开销保持在实用范围内。结果表明,安全对齐可以在推理时在不同模型族之间迁移,而无需修改任一模型的权重。

英文摘要

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.
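The general idea of mixing logits across mismatched vocabularies can be sketched as follows. This is not ALIGNBEAM's actual translation scheme, which the abstract does not specify: exact token-string matching, the linear blend weight `alpha`, and the function name `mix_logits` are all assumptions, and the LLM judge step is omitted.

```python
def mix_logits(target_logits, target_vocab, anchor_logits, anchor_vocab, alpha=0.5):
    """Blend anchor logits into the target model's vocabulary for one step.

    Assumption (hypothetical): an anchor token "translates" to a target
    token only when the two token strings are identical. Target tokens with
    no counterpart keep their own logit unchanged.
    """
    anchor_by_token = dict(zip(anchor_vocab, anchor_logits))
    mixed = []
    for token, t_logit in zip(target_vocab, target_logits):
        a_logit = anchor_by_token.get(token)
        if a_logit is None:
            mixed.append(t_logit)
        else:
            mixed.append((1 - alpha) * t_logit + alpha * a_logit)
    return mixed


# The specialist prefers "sure"; the safe anchor prefers "sorry".
target_vocab = ["sure", "sorry", "xyz"]
anchor_vocab = ["sorry", "sure"]
mixed = mix_logits([2.0, 0.0, 1.0], target_vocab, [3.0, -1.0], anchor_vocab)
best = target_vocab[mixed.index(max(mixed))]
```

Because `alpha` is a decoding-time parameter, a blend of this kind can be adjusted at deployment without retraining, in line with the tunable safety-utility trade-off the abstract describes.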

2606.12341 2026-06-11 cs.CR 新提交

OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

OCELOT:面向隐私保护LLM代理的推理泄露预算

Jin Xie, Songze Li

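AI summary: Proposes OCELOT, a runtime mediator that uses posterior-risk control and Witness-Verified Declassification to budget an adversary's belief updates across an agent trajectory, balancing task utility against privacy leakage.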

AI中文摘要

大型语言模型(LLM)代理越来越多地代表用户行事——读取个人文件、调用工具、与外部服务交易——每一步都可能跨信任边界泄露个人身份信息(PII)。这里的隐私不是单个输出的属性,而是整个轨迹的属性,三个特性使其难以处理:泄露是累积的,因为单独无害的发布在诚实但好奇或共谋的接收者处累积成关于受保护秘密的推断;双向的,因为恶意观察可以注入指令,将代理自身的推理模型转而针对用户;任务相关的,因为同一字段对某个接收者是必要的,对另一个却是多余的。每次发布的上下文完整性过滤器、信息流控制和后验泄露监视器各自解决了部分问题,但没有一个能在运行时控制基于推断的累积泄露。我们将代理隐私重新定义为后验风险控制,并提出了OCELOT,一种运行时中介,它预算对手关于秘密的信念在轨迹中可能改进的程度,而不是过滤输出。其机制“证人验证降级”将判断与信任分离:一个不受信任的、本地微调的防御模型检查每个候选发布并发出结构化证据——标记原子和提议的降级操作符——然后由确定性验证器审计,为所选变体收取认证的最小熵成本,并在防篡改账本上记录接收者信任加权预算,授权最小披露的有用发布。在多样化的代理基准测试和近期防御中,OCELOT在更高任务效用下实现了显著更低的泄露,抵抗自适应注入、越狱、累积推断和接收者共谋,且仅增加适度开销。

英文摘要

Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output but of an entire trajectory, and three properties make it hard: leakage is cumulative, as individually innocuous releases accumulate across honest-but-curious or colluding sinks into inferences about a protected secret; bidirectional, as a malicious observation can inject instructions that turn the agent's own reasoning model against the user; and task-dependent, as the same field is necessary for one recipient yet gratuitous for another. Per-release contextual-integrity filters, information-flow controls, and posterior-leakage monitors each address part of this but none controls cumulative, inference-based leakage at runtime. We recast agent privacy as posterior-risk control and present OCELOT, a runtime mediator that budgets how much an adversary's belief about a secret may improve across a trajectory, rather than filtering outputs. Its mechanism, Witness-Verified Declassification, separates judgment from trust: an untrusted, locally fine-tuned defender model inspects each candidate release and emits structured evidence -- labeled atoms and proposed declassification operators -- which a deterministic verifier audits, charging a certified min-entropy cost for the chosen variant and authorizing the least-disclosing useful release under a sink-trust-weighted budget recorded on a tamper-evident ledger. Across diverse agent benchmarks and recent defenses, OCELOT attains significantly lower leakage at higher task utility, resists adaptive injection, jailbreak, cumulative inference, and sink collusion, and adds only modest overhead.
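To make the budgeting idea concrete, the sketch below charges a min-entropy cost per release against a fixed budget. It is a simplification under stated assumptions, not OCELOT's verifier: the cost is computed for a single observed posterior rather than certified, sink-trust weighting and the tamper-evident ledger are omitted, and the names `min_entropy_leakage` and `LeakageLedger` are hypothetical.

```python
import math


def min_entropy_leakage(prior, posterior):
    """Bits gained by the adversary's best single guess about the secret.

    Simplification: compares one observed posterior with the prior, i.e.
    log2(max posterior probability / max prior probability).
    """
    return math.log2(max(posterior) / max(prior))


class LeakageLedger:
    """Tracks cumulative leakage for one sink against a fixed budget."""

    def __init__(self, budget_bits):
        self.budget_bits = budget_bits
        self.spent_bits = 0.0

    def authorize(self, cost_bits):
        """Approve and record a release only if it fits the remaining budget."""
        if self.spent_bits + cost_bits > self.budget_bits:
            return False
        self.spent_bits += cost_bits
        return True


# A release that halves the candidate set for a 4-valued secret costs 1 bit.
cost = min_entropy_leakage([0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0])
ledger = LeakageLedger(budget_bits=1.5)
first = ledger.authorize(cost)   # fits within the budget
second = ledger.authorize(cost)  # would exceed it, so it is refused
```

The second release is refused even though it is individually as harmless as the first, which illustrates why the abstract treats privacy as a property of the whole trajectory.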

2606.12340 2026-06-11 cs.CV 新提交

Echoes of the Prior: A Computational Phenomenology of Forgetting

先验的回响:遗忘的计算现象学

Gege Gao, Bernhard Schölkopf, Andreas Geiger

Affiliations: Eberhard Karl University of Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center

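AI summary: Visualizes the subjective phenomenology of forgetting by inducing synaptic decay in a feed-forward 3D reconstruction model, using the neural network as a cognitive proxy to explore neuromorphic aesthetics.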

AI中文摘要

记忆不仅仅是数据的存储;它是现实的支架。当生物记忆消退时,世界并不会简单地变黑;它会退化为无法辨认的混乱。《先验的回响》是一个互动装置,试图可视化这种遗忘的主观现象学。通过在前馈3D重建模型中诱导受控的突触衰减,我们为大脑预测先验的侵蚀创造了一个艺术类比。我们将神经网络定位为一种工程工具之外的认知代理——一个硅基大脑,其结构退化唤起了失去对世界掌控的迷失、诗意和恐怖体验。最终,我们提供这个框架作为催化剂,邀请更广泛的社区探索神经形态美学在可视化智能脆弱性方面的未开发潜力。互动演示见此网址。

英文摘要

Memory is not merely the storage of data; it is the scaffolding of reality. When biological memory fades, the world does not simply turn black; it regresses into an unrecognizable chaos. Echoes of the Prior is an interactive installation that attempts to visualize this subjective phenomenology of forgetting. By inducing controlled synaptic decay within a feed-forward 3D reconstruction model, we create an artistic analogy for the erosion of the brain's predictive priors. We position the neural network not as a tool for engineering, but as a cognitive proxy: a silicon brain whose structural degeneration evokes the disorienting, poetic, and terrifying experience of losing one's grip on the world. Ultimately, we offer this framework as a catalyst, inviting the wider community to explore the uncharted potential of neuromorphic aesthetics in visualizing the fragility of intelligence. An interactive demo is available at this https URL.

2606.12339 2026-06-11 cs.SD cs.RO 新提交

Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments

Fast-SDE:混响环境中高效单麦克风声源距离估计

Jiang Wang, Runwu Shi, Yaozhong Kang, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

Affiliation: Institute of Science Tokyo

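AI summary: Proposes Fast-SDE, a lightweight single-microphone framework based on subband processing, for efficient sound source distance estimation on resource-constrained robot platforms.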

Comments
To appear in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
AI中文摘要

声源距离估计(SDE)是人机交互中的关键能力。不适当的交互距离不仅会降低语音采集和理解的可靠性,还会损害交互的自然性和舒适性。现有大多数SDE方法依赖麦克风阵列,然而多麦克风系统通常需要精心的硬件同步、几何校准以及额外的空间和计算资源,这限制了其在尺寸受限和计算能力有限的嵌入式平台上的适用性。为了解决这些问题,我们提出了Fast-SDE,一种轻量级的单麦克风SDE框架,适用于计算资源有限且尺寸严格受限的机器人平台。具体来说,Fast-SDE采用基于子带的骨干网络,将频率轴分解为多个子带,而不是使用宽的全频带骨干处理整个频谱。然后,一个共享的子带编码器将每个子带映射为紧凑的潜在表示,并学习声学结构与时频模式之间的关系。最后,一个轻量级的回归头将融合后的子带表示转换为估计的距离。大量的仿真和真实世界实验证明了所提方法的优点。为了惠及更广泛的研究社区,我们在以下网址开源了代码:this https URL。

英文摘要

Sound source distance estimation (SDE) is a critical capability in human-robot interaction. An inappropriate interaction distance not only reduces the reliability of speech acquisition and understanding, but also compromises the naturalness and comfort of the interaction. Most existing SDE methods rely on microphone arrays; however, multi-microphone systems typically require careful hardware synchronization, geometric calibration, and additional space and computational resources, which limits their applicability to size-constrained and compute-limited embodied platforms. To alleviate these issues, we propose Fast-SDE, a lightweight single-microphone SDE framework that is suited for deployment on robot platforms with limited computational resources and strict size constraints. Specifically, Fast-SDE employs a subband-based backbone that decomposes the frequency axis into multiple subbands, rather than processing the entire spectrum with a wide full-band backbone. A shared subband encoder then maps each subband to a compact latent representation and learns the relationship between acoustic structure and time-frequency patterns. Finally, a lightweight regression head converts the fused subband representations into the estimated distance. Extensive simulation and real-world experiments demonstrate the merits of the proposed method. To benefit the broader research community, we have open-sourced our code at this https URL.
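The first stage described above, decomposing the frequency axis into subbands, might look like the sketch below. Only the splitting step is shown; the shared encoder and regression head are not. The equal-width contiguous split, the list-of-frames input format, and the name `split_subbands` are assumptions, since the abstract does not state how Fast-SDE forms its subbands.

```python
def split_subbands(spectrogram, num_subbands):
    """Split each frame's frequency bins into contiguous subbands.

    spectrogram: list of frames, each a list of values over frequency bins.
    Returns a list of subbands; each subband is a list of frames restricted
    to that subband's bins. Assumption (hypothetical): equal-width bands,
    with the last band absorbing any leftover bins.
    """
    n_bins = len(spectrogram[0])
    if num_subbands <= 0 or num_subbands > n_bins:
        raise ValueError("num_subbands must be between 1 and the number of bins")
    width = n_bins // num_subbands
    subbands = []
    for b in range(num_subbands):
        lo = b * width
        hi = n_bins if b == num_subbands - 1 else lo + width
        subbands.append([frame[lo:hi] for frame in spectrogram])
    return subbands


# Two frames with seven frequency bins each, split into three subbands.
spec = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
bands = split_subbands(spec, 3)
```

Each element of `bands` could then be passed through the same small encoder, which is what allows one set of encoder weights to be reused across the spectrum.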

2606.12337 2026-06-11 math.NA cs.LG 新提交

Adjoint Method versus Physics-Informed Neural Networks in PDE-Constrained Inverse Problems

伴随方法与物理信息神经网络在PDE约束逆问题中的比较

Zhen Zhang, Alessandro Alla, George Em Karniadakis

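AI summary: Presents a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems and finds that the representation of the unknown determines the preferred method: grid-based fields suit the adjoint, neural representations suit PINNs; PINNs cost less on time-dependent problems and can warm-start the adjoint.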

Comments
35 pages, 10 figures
AI中文摘要

由偏微分方程(PDE)控制的逆问题是计算力学的核心,通常通过伴随优化求解,而物理信息神经网络(PINN)已成为一种灵活的替代方案。由于这两种方法通常在不同公式、参数化、优化器和正则化选择下进行比较,因此它们的相对性能难以评估。我们针对PDE约束逆问题,对伴随优化和PINN进行了公平比较。从共同的抽象公式出发,我们在相同的域、控制方程、观测模型和正则化项上实例化两种方法,并在适用情况下匹配优化器、未知参数化和算术精度。基准测试包括非定常Burgers方程、噪声达西渗透率反演、三维Allen-Cahn反应识别和非定常Navier-Stokes粘度识别。结果表明,未知参数的表示在很大程度上决定了首选方法:基于网格的场有利于离散伴随,而神经表示是PINN的原生方法,适用于封闭和本构建模。对于时间依赖问题,伴随反演可能因轨迹存储和微分而成本高昂,而PINN以较低成本提供令人满意的重建。然后,PINN预热启动的伴随策略以大幅降低的成本恢复伴随级别的精度。

英文摘要

Inverse problems governed by partial differential equations (PDEs) are central to computational mechanics and are commonly solved by adjoint-based optimization, while physics-informed neural networks (PINNs) have emerged as a flexible alternative. Their relative performance remains difficult to assess because the two approaches are often compared under different formulations, parameterizations, optimizers, and regularization choices. We present a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems. From a common abstract formulation, we instantiate both methods on identical domains, governing equations, observation models, and regularization terms, while matching the optimizer, unknown parameterization, and arithmetic precision wherever applicable. The benchmarks include unsteady Burgers, noisy Darcy permeability inversion, three-dimensional Allen–Cahn reaction identification, and unsteady Navier–Stokes viscosity identification. The results show that the representation of the unknown largely determines the preferred method: grid-based fields favor the discrete adjoint, whereas neural representations are native to PINNs and relevant for closure and constitutive modeling. For time-dependent problems, adjoint inversion can be dominated by trajectory storage and differentiation, while PINNs provide satisfactory reconstructions at lower cost. A PINN-warm-started adjoint strategy then recovers adjoint-level accuracy at substantially reduced cost.

2606.12334 2026-06-11 cs.LG cs.RO 新提交

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

傅里叶特征让智能体通过模仿学习学习高精度策略

Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) FZI Research Center for Information Technology(FZI信息技术研究中心)

AI总结 提出在点云编码器中使用傅里叶特征映射,解决神经网络低频偏好导致的高精度操作问题,在多个基准和真实机器人上显著提升性能。

详情
Comments
Published as a conference paper at ICML 2026
AI中文摘要

高精度机器人操作需要细粒度的空间推理,由于深度模糊和透视尺度问题,仅使用RGB的策略通常难以实现。直接利用3D信息(如基于点云的策略)比纯图像策略提供了更强的几何先验,但其性能仍然高度依赖于任务。我们假设这种差异可能是由于神经网络倾向于学习低频函数的频谱偏差,这尤其影响以缓慢变化的笛卡尔特征为条件的架构。因此,我们提出将点云从笛卡尔空间映射到高维傅里叶空间,有效地使点云编码器能够直接访问高频特征。我们通过实验验证了傅里叶特征在RoboCasa和ManiSkill3基准测试中的具有挑战性的操作任务以及真实机器人设置上的效果。尽管简单,我们发现傅里叶特征在不同的编码器架构和基准测试中提供了显著的好处,并且对超参数具有鲁棒性。我们的结果表明,傅里叶特征让策略比笛卡尔特征更有效地利用几何细节,显示了其作为基于点云的模仿学习的通用工具的潜力。我们在项目页面上提供源代码和视频:此 https URL

英文摘要

High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior over purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: this https URL
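
The mapping from Cartesian coordinates to Fourier space described above can be sketched as follows. This is a minimal illustration that assumes the common sin/cos encoding with octave-spaced frequencies; the abstract does not state which frequency scheme the authors use, and the names `fourier_features`, `num_bands`, and `include_input` are hypothetical.

```python
import numpy as np

def fourier_features(points: np.ndarray, num_bands: int = 6,
                     include_input: bool = True) -> np.ndarray:
    """Map (N, 3) Cartesian points to sin/cos features.

    Assumption: frequencies are pi * 2^0 ... pi * 2^(num_bands - 1).
    The paper may use a different (e.g. random) frequency matrix.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = np.pi * (2.0 ** np.arange(num_bands))        # shape (B,)
    angles = points[:, :, None] * freqs[None, None, :]    # shape (N, 3, B)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    feats = feats.reshape(points.shape[0], -1)            # shape (N, 6 * B)
    if include_input:
        # Keep the raw coordinates next to the high-frequency features.
        feats = np.concatenate([points, feats], axis=-1)
    return feats

# Toy point cloud in the unit cube; a real encoder would consume `encoded`.
cloud = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1024, 3))
encoded = fourier_features(cloud, num_bands=6)
print(encoded.shape)  # (1024, 39): 3 raw coordinates + 3 * 2 * 6 features
```

Higher bands oscillate faster, so two points that are close in Cartesian space can still differ noticeably in the encoded features, which is the intuition behind giving the encoder direct access to high frequencies.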

2606.12332 2026-06-11 cs.CL cs.LG 新提交

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

通过信息增益衡量多轮对话中的语义进展

Paul He, Shiva Kasiviswanathan, Dominik Janzing

发表机构 * NTU Singapore(新加坡南洋理工大学) Amazon(亚马逊) Amazon Research, Tübingen, Germany(亚马逊研究院(德国图宾根))

AI总结 提出基于信息论的信息增益指标,通过高斯嵌入近似量化多轮对话中问题相关的语义进展,无需LLM推理,在多个基准上取得与人类判断一致的结果。

详情
Comments
Preprint. 26 pages
AI中文摘要

评估多轮对话具有挑战性,因为质量体现在多轮之间而非单个回复。我们关注信息寻求对话的一个关键维度:语义进展,定义为对话过程中新、与问题相关且非冗余信息的累积。我们将语义进展形式化为基于问题的不确定性减少,并引入一个在嵌入空间中近似它的信息论指标。我们的主要估计器使用具有闭式更新的易处理高斯公式,而互补的最大熵论证表明,当仅保留二阶嵌入信息时,对数行列式结构更广泛地出现。该公式产生了理想的理论性质,包括单调性、跨轮次总信息增益的可加分解以及冗余证据的递减回报。与LLM作为评判者的方法不同,我们的指标在评估时不需要自回归推理,并且对于固定的嵌入模型完全可复现。在MT-Bench、Chatbot Arena和UltraFeedback上的实验表明,尽管仅针对语义进展,所提出的指标与人类判断的一致性具有竞争力,在MT-Bench和UltraFeedback上相比几个基于LLM的评判者具有更好的对齐。值得注意的是,该方法在仅CPU执行下使用轻量级嵌入模型仍然有效,表明语义进展可以在不依赖大模型能力的情况下被捕获。

英文摘要

Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
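
The properties listed above (monotonicity, additive decomposition across turns, diminishing returns for redundant evidence) can be seen in a generic Gaussian sketch. This is not the paper's estimator: it assumes a unit Gaussian prior, treats each turn embedding as a rank-one precision update with noise variance `noise_var`, and omits the question conditioning entirely. The name `turn_gains` is hypothetical.

```python
import numpy as np

def turn_gains(turn_embeddings: np.ndarray, noise_var: float = 1.0) -> list[float]:
    """Per-turn information gain under a linear-Gaussian model.

    Gain of turn t = 0.5 * (logdet(precision_t) - logdet(precision_{t-1})).
    """
    dim = turn_embeddings.shape[1]
    precision = np.eye(dim)      # unit prior, so logdet starts at 0
    prev_logdet = 0.0
    gains = []
    for e in turn_embeddings:
        # Closed-form update: each turn adds a rank-one term to the precision.
        precision = precision + np.outer(e, e) / noise_var
        _, logdet = np.linalg.slogdet(precision)
        gains.append(0.5 * (logdet - prev_logdet))
        prev_logdet = logdet
    return gains, precision

# Turn 2 repeats turn 1 (redundant); turn 3 is orthogonal (new information).
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
gains, final_precision = turn_gains(np.stack([e1, e1, e2]))
print(gains)
```

In this toy run the repeated turn earns less than its first occurrence, the orthogonal turn earns the full amount again, and the per-turn gains sum to half the log-determinant of the final precision matrix.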

2606.12329 2026-06-11 cs.AI 新提交

PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

PROJECTMEM:面向AI编码代理的本地优先、事件溯源记忆与判断层

Ripon Chandra Malo, Tong Qiu

发表机构 * University of Utah(犹他大学)

AI总结 提出PROJECTMEM,一种本地优先、事件溯源的记忆与判断层,通过记录事件日志并生成紧凑摘要,帮助AI编码代理避免重复错误,实现记忆即治理。

详情
Comments
12 pages, 5 figures, 1 table. Code: this https URL
AI中文摘要

AI编码助手现在支持越来越多的软件工作,从快速脚本到生产应用。然而,这些代理在很大程度上仍然是无状态的:每个新会话都会重新读取项目文件,重新推导之前的决策,并且——最昂贵的是——可能会重复已经失败的调试尝试。重建这种上下文每个会话估计消耗5,000-20,000个令牌;瓶颈通常不是模型能力,而是缺失的项目记忆。我们提出了projectmem,一个面向AI编码代理的开源、本地优先的记忆与判断层。projectmem将开发记录为一个仅追加的纯文本事件日志,包含类型化事件——问题、尝试、修复、决策和笔记——并通过模型上下文协议(MCP)将该日志确定性地投影为紧凑的、AI可读的摘要。除了存储,projectmem还添加了一个确定性的前置动作门,在代理重复之前失败的修复或编辑已知脆弱文件之前警告它。我们将其定义为记忆即治理:记忆不仅回答代理,还作用于其下一个动作。该系统完全离线运行,无遥测;其不可变日志也作为可重现、可审计的AI辅助开发的溯源轨迹。projectmem作为一个三依赖的Python包发布(14个MCP工具,19个CLI命令,37个自动化测试),并通过一个为期两个月的自我研究进行评估,涉及10个项目,包含207个记录事件。源代码:此 https URL。

英文摘要

AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, local-first memory and judgment layer for AI coding agents. projectmem records development as an append-only, plain-text event log of typed events - issues, attempts, fixes, decisions, and notes - and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, projectmem adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. We frame this as Memory-as-Governance: memory that does not merely answer the agent but acts on its next action. The system runs fully offline with no telemetry; its immutable log also serves as a provenance trail for reproducible, auditable AI-assisted development. projectmem ships as a three-dependency Python package (14 MCP tools, 19 CLI commands, 37 automated tests) and is evaluated through a two-month self-study across 10 projects comprising 207 logged events. Source code: this https URL.
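
The append-only typed event log and the deterministic pre-action gate described above might look roughly like this. It is a sketch under stated assumptions and not projectmem's actual file format, API, or matching rules: the JSON Lines layout, the field names (`summary`, `outcome`, `fragile_file`), and the functions `append_event`, `read_events`, and `pre_action_gate` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Event types named in the abstract.
EVENT_TYPES = {"issue", "attempt", "fix", "decision", "note"}

def append_event(log_path: Path, event_type: str, **fields) -> None:
    """Append one typed event as a JSON line; earlier lines are never rewritten."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"type": event_type, **fields}, sort_keys=True) + "\n")

def read_events(log_path: Path) -> list[dict]:
    """Replay the whole log in order."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def pre_action_gate(log_path: Path, proposed_fix: str, target_file: str) -> list[str]:
    """Scan the log and return warnings before an action (hypothetical rules)."""
    warnings = []
    for ev in read_events(log_path):
        if (ev["type"] == "attempt" and ev.get("outcome") == "failed"
                and ev.get("summary") == proposed_fix):
            warnings.append(f"previously failed: {proposed_fix}")
        if ev["type"] == "note" and ev.get("fragile_file") == target_file:
            warnings.append(f"known-fragile file: {target_file}")
    return warnings

log_path = Path(tempfile.mkdtemp()) / "events.jsonl"
append_event(log_path, "issue", summary="login test flaky")
append_event(log_path, "attempt", summary="increase timeout", outcome="failed")
append_event(log_path, "note", fragile_file="auth/session.py")

risky = pre_action_gate(log_path, "increase timeout", "auth/session.py")
safe = pre_action_gate(log_path, "mock the clock", "README.md")
print(risky)
print(safe)
```

Because the gate is a pure function of the log contents, the same log always yields the same warnings, which is one plausible reading of "deterministic" in the abstract.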

2606.12320 2026-06-11 cs.AI cs.CC cs.CR cs.SE 新提交

A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

生产AI代理运行时治理的五平面参考架构

Krti Tallam

发表机构 * Kamiwaza

AI总结 针对生产AI代理打破传统数据边界治理假设的问题,提出由推理平面和四个执行平面组成的五平面参考架构,通过可组合原语实现运行时治理,阻断七种威胁并验证四个正确性不变式。

详情
Comments
65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future work
AI中文摘要

企业安全旨在治理数据边界:受保护表面是静态和传输中的数据,控制措施——访问控制、数据丢失防护、边界检查——治理该边界的穿越。生产AI代理瓦解了这一假设。代理代表企业读取上下文、调用工具、调用连接器并修改记录系统,因此风险转移到工作流内部,进入一系列单独允许但可能转变未经授权业务流程的动作序列。现有策略引擎无法扩展到这种机制:它们根据原子主体评估请求时决策,而代理系统需要对复合主体进行状态化评估,这些主体的权限通过委托链衰减。我们提出了一种用于生产代理运行时治理的参考架构,由四个可组合原语构建:五平面分解(一个裁决意图的推理平面,以及四个执行平面——网络、身份、端点、数据——实现决策)、任意停止中介、具有能力衰减的复合主体,以及作为结构化证据基础的审计。我们定义了六种中断原语的分类,这些原语泛化了允许和拒绝,陈述并论证了四个正确性不变式,并展示了在五个具体工作流中阻断七种生产代理威胁。策略引擎核心的参考实现提供了测量证据:衰减正确性和证据可重构性在每次试验中成立,裁决运行在个位数微秒内,审计基础的防篡改行为完全符合设计。我们明确范围:该架构治理委托行为,而非模型行为,针对实时代理基准的全系统评估是下一步工作。

英文摘要

Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise's behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains. We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes -- network, identity, endpoint, data -- that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate's tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.
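
The idea of composite principals whose authority attenuates through delegation chains can be sketched as a capability intersection along the chain. This is only an illustration of the general principle, not the paper's policy engine: the class `Principal`, the function `adjudicate`, and the capability strings are hypothetical, and the sketch returns only allow or deny rather than the six interruption primitives the abstract mentions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Principal:
    name: str
    granted: FrozenSet[str]                  # capabilities handed to this link
    delegator: Optional["Principal"] = None  # who delegated to this principal

    def effective(self) -> FrozenSet[str]:
        """Authority can only shrink along the chain, never grow."""
        if self.delegator is None:
            return self.granted
        return self.granted & self.delegator.effective()

def adjudicate(principal: Principal, capability: str) -> str:
    """Evaluate a request against the whole delegation chain."""
    return "allow" if capability in principal.effective() else "deny"

human = Principal("alice", frozenset({"crm.read", "crm.write", "email.send"}))
agent = Principal("support-agent", frozenset({"crm.read", "crm.write"}), delegator=human)
tool = Principal("export-tool", frozenset({"crm.read", "email.send"}), delegator=agent)

print(adjudicate(tool, "crm.read"))    # allow
print(adjudicate(tool, "email.send"))  # deny
```

The tool was nominally granted `email.send`, but the agent that delegated to it never held that capability, so the request is denied when evaluated against the composite principal.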

2606.12319 2026-06-11 cs.CV 新提交

Anatomically Conditioned Recurrent Refinement for Topology-Aware Circle of Willis Segmentation

解剖条件循环细化用于拓扑感知的Willis环分割

Juraj Perić, Marija Habijan, Dario Mužević, Irena Galić, Danilo Babin, Aleksandra Pižurica

发表机构 * Faculty of Electrical Engineering, Computer Science and Information Technology, Osijek, Croatia(奥西耶克大学电气工程、计算机科学与信息技术学院) Clinical Medical Center Osijek, Osijek, Croatia(奥西耶克临床医学中心) Ghent University, Dept. of Telecommunications and Information Processing, imec-TELIN-IPI, Ghent, Belgium(根特大学电信与信息处理系,imec-TELIN-IPI) Ghent University, Dept. of Telecommunications and Information Processing, TELIN-GAIM, Ghent, Belgium(根特大学电信与信息处理系,TELIN-GAIM)

AI总结 提出AC2RUNet,通过静态和动态双流架构结合课程学习,在TopCoW数据集上显著降低Hausdorff距离和Betti数误差,改善拓扑连通性。

详情
Comments
9 pages, 4 figures, 1 table. Accepted at EUSIPCO 2026
AI中文摘要

由于复杂的拓扑结构和易碎细小的血管结构,从磁共振血管造影(MRA)中分割Willis环(CoW)具有挑战性。标准卷积神经网络(CNN)通常无法捕捉这些拓扑约束,导致“血管断裂”伪影。为了解决这个问题,我们提出了解剖条件循环细化U-Net(AC2RUNet)。我们的架构将分割解耦为两个流:提取不变解剖特征的静态流和随时间迭代细化拓扑错误的轻量级动态流。我们进一步引入了一种动态课程学习策略,从高召回率的几何监督过渡到拓扑感知约束。在TopCoW数据集上验证,AC2RUNet显著降低了Hausdorff距离(4.72 mm vs 9.17 mm)和Betti数误差(0.19 vs 0.40),在保持相当体积Dice的同时改善了nnU-Net基线的拓扑连通性。

英文摘要

Segmenting the Circle of Willis (CoW) from Magnetic Resonance Angiography (MRA) is challenging due to complex topology and thin vascular structures that are prone to fragmentation. Standard Convolutional Neural Networks (CNNs) often fail to capture these topological constraints, resulting in "broken vessel" artifacts. To address this, we propose the Anatomically Conditioned Recurrent Refinement U-Net (AC2RUNet). Our architecture decouples segmentation into two streams: a Static Stream that extracts invariant anatomical features and a lightweight Dynamic Stream that iteratively refines topological errors over time. We further introduce a dynamic curriculum learning strategy that transitions from high-recall geometric supervision to topology-aware constraints. Validated on the TopCoW dataset, AC2RUNet substantially reduces Hausdorff Distance (4.72 mm vs 9.17 mm) and Betti number errors (0.19 vs 0.40), improving topological connectivity over the nnU-Net baseline while maintaining comparable volumetric Dice.

2606.12318 2026-06-11 cs.LG cs.AI 新提交

Harness In-Context Operator Learning with Chain of Operators

利用算子链实现上下文算子学习

Minghui Yang, Ling Guo, Liu Yang

发表机构 * Department of Mathematics, Shanghai Normal University(上海师范大学数学系) Department of Mathematics, National University of Singapore(新加坡国立大学数学系)

AI总结 提出Chain of Operators (CHOP)框架,通过构造显式初等变换与冻结ICON的算子链,无需微调即可提升上下文算子网络在分布外算子任务上的泛化能力,在标量守恒律和平均场控制问题中降低推理误差。

详情
AI中文摘要

神经算子近似函数空间之间的映射,但通常对其他算子泛化能力差,需要微调或重新训练。上下文算子网络(ICON)通过向模型提供数值上下文来解决此问题,使模型从提示中学习特定算子并适应不同算子而无需微调。然而,ICON在分布外(OOD)算子任务上仍可能泛化失败。受大型语言模型(LLM)的harness工程(harness engineering)成功的启发,我们引入了算子链(CHOP),一种在不更新参数的情况下驾驭冻结的ICON以处理OOD算子任务的框架。具体来说,CHOP构建了一个由显式初等变换和冻结ICON组成的算子链。在标量守恒律和平均场控制问题上的实验表明,与直接ICON评估相比,CHOP降低了相对推理误差,同时链中的每个算子保持可解释且具有封闭形式。在一个PDE族上构建的链进一步泛化到另一个不同的族,表明不同harness系统之间存在共享机制。

英文摘要

Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) address this issue by prompting the model with numerical context so that the model learns specific operators from prompts and adapts to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inspired by the success of harness engineering for large language models (LLMs), we introduce Chain of Operators (CHOP), a framework that harnesses a frozen ICON for OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.

2606.12316 2026-06-11 cs.CV 新提交

Slots, Transitions, Loops: Learning Composable World Models for ARC

槽、转换、循环:学习可组合的ARC世界模型

Gege Gao, Bernhard Schölkopf, Andreas Geiger

发表机构 * University of Tübingen(图宾根大学) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出Loop-OWM架构,通过颜色原型槽、演示条件任务摘要和循环转换模型,学习ARC任务中的视觉符号规则,在ARC-1和ARC-2上超越基线。

详情
AI中文摘要

ARC测试上下文中的规则归纳:给定少量输入-输出演示,模型必须推断隐藏规则并将其应用于新查询。虽然许多方法通过语言、代码或符号程序表达ARC规则,但ARC本身是视觉符号的:规则表现为对象、颜色、形状和空间关系上的网格转换。我们引入Loop-OWM,一种以对象为中心的世界建模架构,将规则学习为结构化状态上的可组合转换。它结合了颜色原型槽、演示条件任务摘要,以及具有密集传播和槽条件校正的循环转换模型。在ARC-1和ARC-2上,Loop-OWM以相当或更少的参数优于非循环和循环基线。这些结果表明,ARC规则不仅可以作为语言描述或搜索程序学习,还可以作为视觉符号世界状态上的转换学习。

英文摘要

ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.

2606.12306 2026-06-11 cs.RO 新提交

UGV-Conditioned Multi-UAV Informative Planning on a Shared Exposure Belief

基于共享暴露信念的UGV条件多无人机信息规划

Lars Oerlemans, Moji Shi, Marija Popovic

AI总结 提出一种协调无人机编队降低地面车辆在未知威胁区导航风险的方法,通过共享暴露信念引导感知并减少冗余覆盖,仿真显示累积暴露降低38%,冗余覆盖从38.8%降至3.7%。

详情
Comments
8 pages, 6 figures
AI中文摘要

在大型、威胁增强的环境中进行安全地面导航需要空中支持,以主动降低地面车辆沿路线面临的风险。现有的空中侦察系统专注于测绘或覆盖环境,但不将感知引导到对地面车辆安全最相关的区域。在本文中,我们解决了协调一组无人机(UAV)以提高无人地面车辆(UGV)在未知威胁区导航安全性的问题。我们方法的一个关键方面是共享暴露信念,该信念根据空中观测在线更新,并由无人机团队和地面车辆共同使用。这使我们能够将空中感知引导到路线相关区域,同时允许UGV围绕新发现的威胁重新规划。我们通过空间区域分配协调无人机团队以避免冗余感知。仿真实验表明,与不考虑危险等级的系统相比,我们的方法将UGV累积暴露降低了38%,并在我们的多无人机协调方案下将冗余空中覆盖从38.8%降至3.7%。

英文摘要

Safe ground navigation in large, threat-augmented environments requires aerial support that actively reduces the risks that a ground vehicle faces along its route. Existing aerial reconnaissance systems focus on mapping or covering the environment, but do not direct sensing toward regions that are most relevant for ground vehicle safety. In this paper, we address the problem of coordinating a team of unmanned aerial vehicles (UAVs) to improve the safety of an unmanned ground vehicle (UGV) navigating through unknown threat zones. A key aspect of our approach is a shared exposure belief that is updated online from aerial observations and used jointly by the UAV team and the ground vehicle. This enables us to direct aerial sensing towards route-relevant regions while allowing the UGV to replan around newly revealed threats. We coordinate the UAV team through spatial region assignment to avoid redundant sensing. Simulation experiments show that our approach reduces cumulative UGV exposure by 38% compared to a system that does not account for hazard levels, and reduces redundant aerial coverage from 38.8% to 3.7% under our multi-UAV coordination scheme.

2606.12303 2026-06-11 cs.CV 新提交

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

从二维网格到一维标记:重塑多模态图像融合的共享表示

Yuchen Xian, Yunqiu Xu, Yang He, Yi Yang

AI总结 提出基于冻结预训练图像标记器的紧凑一维标记接口,通过选择性标记编辑(STE)稀疏更新关键标记,在保持融合骨干网络不变的同时引导全局外观一致性,实现全局连贯与局部保真的最佳平衡。

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

多模态图像融合旨在将来自不同模态的互补信息整合到融合图像中,该图像在保持全局一致外观的同时保留丰富的局部细节。现有方法在二维特征网格上构建共享表示,这些表示擅长建模局部结构,但对图像级全局外观因素的利用有限。为平衡这些目标,我们引入了一种基于冻结预训练图像标记器的紧凑一维标记接口,用于建模非局部外观/基因素。我们的设计不是将标记器用作重建骨干,而是将一维标记空间用作全局载体,同时保留用于局部结构恢复的二维空间路径。具体来说,我们引入了选择性标记编辑(STE),它稀疏地更新/替换一小部分关键标记,提供了一种轻量级机制来引导全局外观一致性,同时保持融合骨干网络不变并避免额外损失。在四个常用基准上的实验表明,我们的方法实现了最佳整体性能,在全局连贯性和局部保真度方面均具有一致的多指标改进。项目页面:此 https URL

英文摘要

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: this https URL

2606.12301 2026-06-11 quant-ph cs.IT 新提交

An iterative Ising decoder for quantum error correction codes

一种用于量子纠错码的迭代Ising解码器

Yuanqi Liu, Weilei Zeng, Peixiang Li, Yantong Liu, Guangyao Huang, Yingwen Liu, Dongyang Wang, Junjie Wu, Lingling Lao

AI总结 提出迭代低阶解码(ILOD)算法,通过交替求解X和Z子哈密顿量并利用贝叶斯先验近似交叉关联,将相互作用项的最大体数减半,加速求解器并降低自旋开销,在容错阈值和收敛性上接近或优于联合公式。

详情
Comments
12 pages, 8 figures, comments are welcome
AI中文摘要

Ising框架将量子纠错中的解码问题映射为经典哈密顿量的基态优化,其中$X$-$Z$误差关联作为交叉项出现。在现象学退极化噪声下,精确的联合公式对环面码包含高达8体相互作用,对$6.6.6$色码包含10体相互作用。这些高阶项会降低求解器收敛性,增加运行时间,并在嵌入到原生2体Ising硬件时提高辅助自旋开销。在这项工作中,我们提出了迭代低阶解码(ILOD)算法,它在$X$型和$Z$型子哈密顿量之间交替,通过贝叶斯先验近似交叉型关联,该先验利用另一种类型的推断误差配置重新加权每种类型的耦合。这将哈密顿量中相互作用项的最大体数减半,加速了求解器,在更大码距下恢复收敛性,并将2体嵌入的总自旋数减少了2.5倍。对于环面码,ILOD达到4.73%的阈值,而联合公式为4.83%,经验运行时间比按$(0.81)^d$缩放。对于$6.6.6$色码,在小码距下它们的阈值在统计不确定性内一致,并且ILOD在更大码距下保持收敛,而联合公式尽管有更大的退火预算却无法收敛。

英文摘要

The Ising framework maps the decoding problem in quantum error correction onto ground-state optimization of a classical Hamiltonian, in which $X$-$Z$ error correlations enter as cross terms. Under phenomenological depolarizing noise, the exact joint formulation contains up to 8-body interactions for the toric code and 10-body for the $6.6.6$ color code. These high-order terms degrade solver convergence, inflate runtime, and raise the auxiliary spin overhead when embedding into native 2-body Ising hardware. In this work, we propose the iterative low-order decoding (ILOD) algorithm, which alternates between $X$- and $Z$-type sub-Hamiltonians, approximating cross-type correlations through Bayesian priors that reweight each type's couplings using the other type's inferred error configuration. This halves the maximum body count of interaction terms in the Hamiltonian, accelerating the solver, restoring convergence at larger code distances, and reducing the total spin count for 2-body embedding by a factor of $2.5$. For the toric code, ILOD attains a threshold of $4.73\%$ versus $4.83\%$ for the joint formulation, with the empirical runtime ratio scaling as $(0.81)^d$. For the $6.6.6$ color code, their thresholds agree within statistical uncertainty for small code distances, and ILOD remains convergent for larger distances where the joint formulation fails to converge despite a larger annealing budget.

2606.12300 2026-06-11 cs.CV cs.AI 新提交

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

自然语言在小时级视频中的时间定位是一个搜索问题:基准与经验分解

Sukmin Seo, Geewook Kim

发表机构 * NAVER Cloud AI KAIST AI(韩国科学技术院人工智能系)

AI总结 针对小时级视频的自然语言时间定位,提出搜索是主要瓶颈而非识别,发布首个开放小时级定位基准ExtremeWhenBench,并通过检索-定位混合方法显著提升性能。

详情
Comments
10 pages, 6 figures, Code and benchmark: this https URL
AI中文摘要

时间定位——根据自然语言查询返回视频中的区间$[t_s, t_e]$——是长视频的语言接口,但此前仅在短视频上研究;小时级自然语言定位的动态仍未充分探索。我们认为,在小时级尺度上,限制因素是搜索而非识别:视频-LLM的瓶颈不在于定位附近的事件,而在于根据自然语言查询搜索长视频的相关区域。为验证这一点,我们发布了ExtremeWhenBench,首个开放的小时级定位基准(194个视频上的2273个查询,平均时长75.7分钟,最长9小时),具有开放式查询分布。所有开放视频-LLM均表现不佳,而帧级检索基线优于它们;失败分类将85%的失败归因于搜索;检索-定位混合方法比单一视频-LLM提升了6.7倍——类似于开放域QA中的检索-读取模式。

英文摘要

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.
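
The retrieve-then-ground idea can be sketched in two steps: first search the long video for a promising window, then score the predicted interval against the reference. This is a hedged toy version: the abstract does not say which retrieval model or evaluation metric the benchmark uses, so the per-frame relevance scores, the sliding-window search in `retrieve_window`, and the use of temporal IoU in `temporal_iou` are all assumptions.

```python
import numpy as np

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Intersection over union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def retrieve_window(frame_scores: np.ndarray, fps: float, window_s: float) -> tuple:
    """Search step: pick the window with the highest total frame relevance."""
    win = max(1, int(round(window_s * fps)))
    csum = np.concatenate([[0.0], np.cumsum(frame_scores)])
    sums = csum[win:] - csum[:-win]          # total score of every window
    start = int(np.argmax(sums))
    return start / fps, (start + win) / fps

# One-hour video at 1 frame per second; the relevant event is at 30:00-31:00.
scores = np.zeros(3600)
scores[1800:1860] = 1.0                      # hypothetical query-frame similarity
window = retrieve_window(scores, fps=1.0, window_s=60.0)
print(window, temporal_iou(window, (1800.0, 1860.0)))
```

A grounding model would then refine the interval inside the retrieved window instead of scanning the full hour, which mirrors the retrieve-then-read pattern the abstract refers to.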

2606.12299 2026-06-11 cs.RO cs.LG 新提交

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

学习对你的VLA说什么:基本无害的视觉语言动作模型引导

Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

AI总结 提出一个框架,通过交互式搜索语言序列改进闭环VLA任务性能,并学习一个改进头预测何时语言引导能提升性能,同时通过共形化防止有害干预。

详情
Comments
22 pages, 14 tables, 14 figures
AI中文摘要

视觉-语言-动作(VLA)模型为机器人控制提供了自然语言接口,但从语言到行为的映射通常脆弱且不直观:语义相似的指令可能引发截然不同的行为,而某些能力可能无法仅通过提示激发。因此,人类指令和零样本语言模型都可能无法可靠地引导VLA成功执行任务。在这项工作中,我们提出了一个框架,该框架交互式地搜索改进闭环VLA任务性能的语言序列,将这些序列提炼为测试时语言反馈策略(LFP),并学习一个改进头来预测何时语言引导会提升性能。我们对这个改进头进行共形化,以防止在分布外场景中LFP相对于原始指令降低任务性能的有害引导干预。关键的是,我们的方法适用于任意冻结的预训练VLA,既不需要访问原始训练分布,也不需要微调底层模型。在已知环境中,我们的共形化LFP在仿真中使基础VLA性能提升24.7%,在硬件中提升65.0%。在视觉和语义扰动下,我们的共形化LFP具有强大的无害性保证,并产生开环提示无法观察到的恢复行为。

英文摘要

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.
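
One way to picture "conformalizing" an improvement head is a split-conformal threshold: steer only when the predicted improvement score exceeds a quantile of the scores seen on calibration episodes where steering turned out to be harmful. The abstract does not describe the authors' exact procedure, so this is an assumption-laden sketch; `conformal_threshold`, `should_steer`, and the calibration scores are hypothetical.

```python
import math
import numpy as np

def conformal_threshold(harmful_scores, alpha: float) -> float:
    """Split-conformal cutoff from scores of episodes where steering hurt.

    Under exchangeability, a new harmful episode scores above this cutoff
    with probability at most alpha.
    """
    s = np.sort(np.asarray(harmful_scores, dtype=float))
    n = len(s)
    k = math.ceil((n + 1) * (1 - alpha))     # rank of the conformal quantile
    if k > n:
        return math.inf                      # too little data: never steer
    return float(s[k - 1])

def should_steer(score: float, threshold: float) -> bool:
    """Intervene only when the improvement head is confidently positive."""
    return score > threshold

calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(calibration, alpha=0.1)
print(threshold, should_steer(0.95, threshold), should_steer(0.5, threshold))
```

With too few calibration episodes the cutoff becomes infinite, so the policy falls back to the original instruction, which is the conservative behavior one would want from a harmlessness guarantee.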

2606.12295 2026-06-11 cs.CV cs.CL cs.IR 新提交

Findings of the MAGMaR 2026 Shared Task

MAGMaR 2026 共享任务结果

Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang

发表机构 * Johns Hopkins University(约翰霍普金斯大学) OpenAI University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) Air Force Research Laboratory(空军研究实验室) Human Language Technology Center of Excellence, Johns Hopkins University(约翰霍普金斯大学人类语言技术卓越中心) University of Amsterdam(阿姆斯特丹大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 本文介绍MAGMaR 2026共享任务的结果,包括视频检索和基于检索视频的生成任务,所有提交系统均超越去年基线。

详情
Comments
Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: this https URL
AI中文摘要

本概述论文介绍了第二届多模态检索增强生成(MAGMaR)研讨会的共享任务结果。在该共享任务中,参与者提交的系统专注于(i)视频检索或(ii)基于检索到的视频进行文章的接地生成。团队可以提交到任一任务。对于检索任务,我们有2个参与团队提交了总共17个系统——所有这些系统都击败了基于去年共享任务获胜者得出的基线。在生成方面,我们有4个团队提交了16个系统。所有团队至少有一个生成的报告被人类标注者评为最佳。

英文摘要

This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.

2606.12294 2026-06-11 cs.CV eess.IV 新提交

Bridging the Modality Gap in Forensic Image Retrieval

弥合法医图像检索中的模态差距

Ricardo González-Gazapo, Annette Morales-González, Yoanna Martínez-Díaz, Heydi Méndez-Vázquez, Milton García-Borroto

发表机构 * Advanced Technologies Application Center (CENATAV)(先进技术应用中心(CENATAV)) Centro de Sistemas Complejos, Facultad de Física, Universidad de La Habana(哈瓦那大学物理学院复杂系统中心)

AI总结 提出统一检索框架,利用多模态大语言模型生成文本描述并结合视觉与文本特征融合,提升纹身、人脸素描等法医任务的检索精度与鲁棒性。

详情
Comments
23 pages, 5 figures, paper submitted to Elsevier journal
AI中文摘要

自动图像检索在现代法医分析中扮演着越来越关键的角色,支持依赖于视觉证据高效比较的调查工作流程。虽然先前的工作主要集中在开发和优化多模态检索系统,但很少关注评估这些技术在多样化真实场景中的法医适用性。在本研究中,我们提出了一个统一的检索框架,适用于四个关键的法医任务:(1)给定纹身查询图像的纹身图像检索;(2)由人类专家文本描述引导的纹身检索,模拟目击者口头描述纹身的常见情况;(3)从手绘草图中检索纹身;(4)从法医面部素描中检索人脸。我们的系统利用多模态大语言模型(MLLM)自动为所有查询和图库图像生成结构化文本描述,然后使用句子变换器嵌入进行基于文本的比较。我们使用仅视觉嵌入、仅文本嵌入以及一种多模态融合策略来评估检索性能,该策略结合了来自与每个任务相关的最先进视觉特征提取器的文本和图像相似性分数。模态融合一致地提高了检索精度和鲁棒性,特别是在视觉信息有限或嘈杂的场景中(例如,素描、部分纹身或零碎的目击者陈述)。这项工作突显了统一多模态检索流程的法医价值,并展示了现代MLLM如何能够操作化传统上依赖人工专家分析的具有挑战性的法医任务。我们的结果将多模态检索定位为支持涉及纹身、面部合成和目击者描述的调查工作流程的有前途工具。

英文摘要

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.

2606.12291 2026-06-11 cs.CL 新提交

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

测量大语言模型在误导性医疗上下文下的认知韧性

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

发表机构 * University of Oxford(牛津大学) University of Washington(华盛顿大学) University College London(伦敦大学学院) University of Waterloo(滑铁卢大学)

AI总结 本研究提出MedMisBench基准,通过注入误导性上下文测试大语言模型在医疗场景中的认知韧性,发现模型准确率从71.1%降至38.0%,权威性虚假信息攻击成功率达69.5%。

详情
AI中文摘要

大型语言模型(LLMs)现在在医疗执照考试中达到专家级分数,这鼓励了高分数意味着安全医疗判断的假设,而患者越来越多地使用它们获取健康建议。我们证明这一假设是脆弱的:当误导性上下文被注入到LLMs最初正确回答的问题中时,它们会放弃正确答案。我们将这种在对抗性上下文中保持正确判断的能力称为认知韧性,并引入MedMisBench来测量它。MedMisBench包含10,932个医疗问题项目和48,889个误导性上下文-选项对,涵盖医疗推理、代理能力和患者旅程评估。在11个模型配置中,平均准确率从原始问题的71.1%下降到聚焦误导性上下文下的38.0%,攻击成功率为51.5%。最具破坏性的注入是正式的、规则式的捏造:权威框架的虚假信息达到69.5%的攻击成功率,例外投毒声明达到64.1%。来自7个国家的14名临床专家小组在38.2%的审查案例中识别出严重的潜在危害。MedMisBench暴露了LLM在医疗环境评估中的结构性盲点:现有基准衡量模型知道什么,但不衡量它们在误导性上下文下是否保持正确的医疗判断。

英文摘要

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.

2606.12290 2026-06-11 cs.CR New submission

Selection Integrity for LLM Graph Memory: An Accumulability Criterion for Information-Flow-Blind Retrieval

Zeming Fei, Hongming Fei, Xiaoyang Wang, Yang Yang, Prosanta Gope, Biplab Sikdar, Ying Zhang

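AI summary: Targeting a blind spot of information-flow control in graph-memory retrieval, the paper proposes an accumulability criterion, shows that sourceless structural writes can misdirect irreversible transfers, identifies reallocatability rather than reliance as the predictor of exposure, and proposes a defense that recomputes selection on the authenticated subgraph.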

Details
English abstract

Agent memory is moving to graphs, and the provenance defenses now being built for it all check one thing: the provenance of the records an agent retrieves. We show that this entire class of defense is blind by construction. A long-term graph memory runs a global selection step over writable graph structure, so structure that an untrusted principal writes changes which authenticated facts are selected while the cited evidence stays fully authenticated; faithful information-flow control (IFC), checking the provenance of what the reader uses (all of it authenticated), makes the byte-identical decision to no defense at all, across document-QA substrates and real multi-session agent memory. In the most consequential instance, a no-source structural write silently misdirects 28 irreversible ledger transfers over 499 live actions: faithful IFC permits every one, and \authselect prevents every one. We then characterize exactly which memories are exposed: a selector admits the channel when its structural term can reallocate an $\Omega(1)$ share of top-$k$ membership past a selected fact's margin. Personalized PageRank can, since a sourceless write reroutes conserved random-walk mass; a content-fixed reranker cannot, and Graphiti's node-distance, which leans on structure more than PageRank does, stays immune. Reallocatability, not reliance, is the predictor. We prove the immune case in general and the open case under a chokepoint condition we verify. Closing the channel forces any provenance defense to recompute selection on the authenticated subgraph, which is what \authselect does, at zero over-block and 2-3% latency.
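The claim that a structural write can reroute conserved random-walk mass in Personalized PageRank can be illustrated on a toy graph. The sketch below is not the paper's construction: the graph, the node names (`s`, `h`, `f1`, `f2`, `x1`, `x2`), the damping value, and the helper functions are all assumptions chosen for illustration. Both facts `f1` and `f2` stand for authenticated records; the untrusted write adds only edges and intermediate nodes, yet it changes which fact is ranked first.

```python
def personalized_pagerank(edges, seed, damping=0.85, iters=200):
    """Power iteration for PageRank personalized on a single seed node.

    Teleport mass and the mass of dangling nodes both return to the seed,
    so the total mass stays equal to 1 at every step.
    """
    nodes = sorted({seed} | {u for u, _ in edges} | {v for _, v in edges})
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                nxt[seed] += damping * rank[u]  # dangling node: back to seed
        nxt[seed] += 1.0 - damping  # teleport back to the seed
        rank = nxt
    return rank


def top_k_facts(rank, facts, k):
    """Select the k highest-ranked fact nodes."""
    return sorted(facts, key=lambda f: rank[f], reverse=True)[:k]


facts = ["f1", "f2"]  # both treated as authenticated records

# Authenticated structure: the seed reaches f1 directly and via hub h.
trusted_edges = [("s", "f1"), ("s", "h"), ("h", "f1"), ("h", "f2")]

# Hypothetical untrusted write: structure only, no new facts.
untrusted_write = [("s", "x1"), ("s", "x2"), ("x1", "f2"), ("x2", "f2")]

before = top_k_facts(personalized_pagerank(trusted_edges, "s"), facts, 1)
after = top_k_facts(
    personalized_pagerank(trusted_edges + untrusted_write, "s"), facts, 1
)

# Recomputing selection on the authenticated edges alone ignores the write.
all_edges = trusted_edges + untrusted_write
authenticated = [e for e in all_edges if e not in untrusted_write]
recomputed = top_k_facts(personalized_pagerank(authenticated, "s"), facts, 1)

print("top-1 before write:", before)
print("top-1 after write:", after)
print("top-1 recomputed on authenticated subgraph:", recomputed)
```

In this toy case a provenance check on the selected fact passes both before and after the write, since `f1` and `f2` are equally authenticated; only recomputing the ranking without the untrusted edges restores the original selection.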

2606.12289 2026-06-11 cs.LG cs.AI cs.NE New submission

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris

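AI summary: Proposes the Standard Interpretable Model (SIM), which uses Lagrangian mechanics to derive interpretability symmetries and constraints from a set of premises; minimizing the resulting Lagrangian yields optimal interpretable models, addressing limitations of existing methods and guiding the design of new ones.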

Details
English abstract

As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community's perspective of a discipline that has long been fragmented.