arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12332 2026-06-11 cs.CL cs.LG New submission

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

Paul He, Shiva Kasiviswanathan, Dominik Janzing

Affiliations: NTU Singapore; Amazon; Amazon Research, Tübingen, Germany

AI summary: Proposes an information-theoretic information-gain metric that uses a Gaussian approximation in embedding space to quantify question-relevant semantic progress in multi-turn dialogue. The metric needs no LLM inference and agrees with human judgments on several benchmarks.

Comments
Preprint. 26 pages

Abstract

Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
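
The properties listed in the abstract (monotonicity, additive decomposition across turns, diminishing returns for redundant evidence) can be illustrated with a standard Bayesian linear-Gaussian model. The sketch below is not the paper's estimator: it assumes each turn is a hypothetical embedding vector `x` observed with noise variance `noise_var` against an identity prior covariance, and it scores a turn by `0.5 * log(1 + x^T Σ x / noise_var)` before applying the closed-form posterior update.

```python
import math

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def turn_gain(cov, x, noise_var=1.0):
    """Return (gain in nats, posterior covariance) for one turn embedding x.

    Assumption: a linear-Gaussian observation model, chosen for illustration.
    """
    cx = mat_vec(cov, x)                       # Sigma x (cov is symmetric)
    s = sum(a * b for a, b in zip(x, cx))      # x^T Sigma x
    gain = 0.5 * math.log(1.0 + s / noise_var)
    denom = noise_var + s
    dim = len(x)
    # Closed-form rank-one update: Sigma - Sigma x x^T Sigma / (noise + x^T Sigma x)
    post = [[cov[i][j] - cx[i] * cx[j] / denom for j in range(dim)]
            for i in range(dim)]
    return gain, post

def dialogue_gains(turn_embeddings, dim, noise_var=1.0):
    """Per-turn information gains, starting from an identity prior covariance."""
    cov = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    gains = []
    for x in turn_embeddings:
        g, cov = turn_gain(cov, x, noise_var)
        gains.append(g)
    return gains

# Two redundant turns followed by one turn in a new direction.
gains = dialogue_gains([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], dim=2)
```

In this toy run the repeated turn earns less than the first one, every gain is nonnegative, and the per-turn gains sum to the log-determinant total `0.5 * log det(I + X^T X)`.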

2606.12329 2026-06-11 cs.AI New submission

PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

Ripon Chandra Malo, Tong Qiu

Affiliations: University of Utah

AI summary: Proposes PROJECTMEM, a local-first, event-sourced memory and judgment layer that records an event log and generates compact summaries so that AI coding agents avoid repeating mistakes, an approach the authors call Memory-as-Governance.

Comments
12 pages, 5 figures, 1 table. Code: this https URL

Abstract

AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, local-first memory and judgment layer for AI coding agents. projectmem records development as an append-only, plain-text event log of typed events - issues, attempts, fixes, decisions, and notes - and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, projectmem adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. We frame this as Memory-as-Governance: memory that does not merely answer the agent but acts on its next action. The system runs fully offline with no telemetry; its immutable log also serves as a provenance trail for reproducible, auditable AI-assisted development. projectmem ships as a three-dependency Python package (14 MCP tools, 19 CLI commands, 37 automated tests) and is evaluated through a two-month self-study across 10 projects comprising 207 logged events. Source code: this https URL.
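
The two ideas in this abstract, an append-only log of typed events and a deterministic pre-action gate, can be sketched in a few lines. Everything below is hypothetical: `Event`, `EventLog`, and `pre_action_gate` are illustrative names, not projectmem's actual API, and the log is kept in memory rather than in a plain-text file.

```python
import json
from dataclasses import dataclass, asdict

# Event types named in the abstract.
EVENT_TYPES = {"issue", "attempt", "fix", "decision", "note"}

@dataclass(frozen=True)
class Event:
    kind: str
    target: str          # hypothetical field: the file or component touched
    summary: str
    failed: bool = False

class EventLog:
    """In-memory stand-in for an append-only plain-text log (one JSON line per event)."""

    def __init__(self):
        self._lines = []

    def append(self, event):
        if event.kind not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event.kind}")
        # Lines are only ever appended, never rewritten.
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def events(self):
        return [Event(**json.loads(line)) for line in self._lines]

def pre_action_gate(log, target, summary):
    """Return warnings if the proposed action repeats a logged failure."""
    return [
        f"previously failed on {e.target}: {e.summary}"
        for e in log.events()
        if e.kind in {"attempt", "fix"} and e.failed
        and e.target == target and e.summary == summary
    ]

log = EventLog()
log.append(Event("attempt", "auth.py", "retry with longer timeout", failed=True))
log.append(Event("fix", "auth.py", "refresh token before request"))
warnings = pre_action_gate(log, "auth.py", "retry with longer timeout")
```

Because the gate is a pure function of the log, the same log always yields the same warnings, which is the sense of "deterministic" assumed in this sketch.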

2606.12320 2026-06-11 cs.AI cs.CC cs.CR cs.SE New submission

A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Krti Tallam

Affiliations: Kamiwaza

AI summary: Production AI agents break the data-boundary assumption behind traditional governance. The paper proposes a five-plane reference architecture, one reasoning plane plus four enforcement planes, that achieves runtime governance through composable primitives, forecloses seven threats, and argues for four correctness invariants.

Comments
65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future work

Abstract

Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise's behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains. We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes -- network, identity, endpoint, data -- that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate's tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.
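
One way to read "composite principals whose authority attenuates through delegation chains" is that a delegate can never hold more authority than every principal above it. The sketch below assumes that reading and models attenuation as set intersection along the chain; the function names and capability strings are illustrative and do not come from the paper's reference implementation.

```python
def attenuate(chain):
    """Effective capabilities of a delegation chain.

    `chain` lists the capability sets granted at each hop, from the root
    principal to the final delegate. Assumption: authority can only shrink,
    so the effective set is the intersection of all hops.
    """
    if not chain:
        return frozenset()
    effective = set(chain[0])
    for granted in chain[1:]:
        effective &= set(granted)
    return frozenset(effective)

def is_permitted(chain, action):
    """Adjudicate a single action against the attenuated authority."""
    return action in attenuate(chain)

# user -> orchestrator agent -> tool-calling sub-agent
chain = [{"read", "write", "delete"}, {"read", "write"}, {"read", "export"}]
effective = attenuate(chain)
```

Here the sub-agent was handed `export`, but no principal above it held that capability, so the intersection drops it. Only `read` survives the chain.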

2606.12319 2026-06-11 cs.CV New submission

Anatomically Conditioned Recurrent Refinement for Topology-Aware Circle of Willis Segmentation

Juraj Perić, Marija Habijan, Dario Mužević, Irena Galić, Danilo Babin, Aleksandra Pižurica

Affiliations: Faculty of Electrical Engineering, Computer Science and Information Technology, Osijek, Croatia; Clinical Medical Center Osijek, Osijek, Croatia; Ghent University, Dept. of Telecommunications and Information Processing, imec-TELIN-IPI, Ghent, Belgium; Ghent University, Dept. of Telecommunications and Information Processing, TELIN-GAIM, Ghent, Belgium

AI summary: Proposes AC2RUNet, a dual-stream (static and dynamic) architecture combined with curriculum learning, which substantially reduces Hausdorff distance and Betti number error on the TopCoW dataset and improves topological connectivity.

Comments
9 pages, 4 figures, 1 table. Accepted at EUSIPCO 2026

Abstract

Segmenting the Circle of Willis (CoW) from Magnetic Resonance Angiography (MRA) is challenging due to complex topology and thin vascular structures that are prone to fragmentation. Standard Convolutional Neural Networks (CNNs) often fail to capture these topological constraints, resulting in "broken vessel" artifacts. To address this, we propose the Anatomically Conditioned Recurrent Refinement U-Net (AC2RUNet). Our architecture decouples segmentation into two streams: a Static Stream that extracts invariant anatomical features and a lightweight Dynamic Stream that iteratively refines topological errors over time. We further introduce a dynamic curriculum learning strategy that transitions from high-recall geometric supervision to topology-aware constraints. Validated on the TopCoW dataset, AC2RUNet substantially reduces Hausdorff Distance (4.72 mm vs 9.17 mm) and Betti number errors (0.19 vs 0.40), improving topological connectivity over the nnU-Net baseline while maintaining comparable volumetric Dice.
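
To make the "broken vessel" failure concrete: the zeroth Betti number counts connected components, so a vessel that the prediction splits in two raises it by one. The sketch below is a simplified 2D illustration with 4-connectivity, and it assumes the error is the absolute difference in component counts. The abstract does not give the exact Betti error definition used on TopCoW (which is 3D), so treat this as one plausible reading.

```python
from collections import deque

def betti0(mask):
    """Count 4-connected foreground components in a 2D binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            count += 1                      # found a new component
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:                    # breadth-first flood fill
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return count

def betti0_error(pred, truth):
    """Assumed error definition: absolute difference in component counts."""
    return abs(betti0(pred) - betti0(truth))

truth = [[1, 1, 1, 1, 1]]     # one continuous vessel
pred = [[1, 1, 0, 1, 1]]      # the same vessel with a one-pixel gap
```

A single missing pixel barely changes Dice on a large volume but doubles the component count here, which is why topology-aware metrics are reported alongside volumetric overlap.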

2606.12318 2026-06-11 cs.LG cs.AI New submission

Harness In-Context Operator Learning with Chain of Operators

Minghui Yang, Ling Guo, Liu Yang

Affiliations: Department of Mathematics, Shanghai Normal University; Department of Mathematics, National University of Singapore

AI summary: Proposes the Chain of Operators (CHOP) framework, which builds an operator chain from explicit elementary transformations and a frozen ICON. Without fine-tuning, it improves the generalization of in-context operator networks on out-of-distribution operator tasks and reduces inference error on a scalar conservation law and a mean-field control problem.

Abstract

Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) address this issue by prompting the model with numerical context so that the model learns specific operators from prompts and adapts to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inspired by the success of harness engineering for large language models (LLMs), we introduce Chain of Operators (CHOP), a framework that harnesses a frozen ICON for OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.

2606.12316 2026-06-11 cs.CV New submission

Slots, Transitions, Loops: Learning Composable World Models for ARC

Gege Gao, Bernhard Schölkopf, Andreas Geiger

Affiliations: University of Tübingen; ETH Zürich

AI summary: Proposes the Loop-OWM architecture, which learns the visual-symbolic rules of ARC tasks through color-prototype slots, demonstration-conditioned task summaries, and a looped transition model, outperforming baselines on ARC-1 and ARC-2.

Abstract

ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.

2606.12300 2026-06-11 cs.CV cs.AI New submission

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

Sukmin Seo, Geewook Kim

Affiliations: NAVER Cloud AI; KAIST AI

AI summary: For natural-language temporal grounding in hour-long videos, argues that search rather than recognition is the main bottleneck, releases ExtremeWhenBench as the first open hour-scale grounding benchmark, and shows that a retrieve-then-ground hybrid substantially improves performance.

Comments
10 pages, 6 figures, Code and benchmark: this https URL

Abstract

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.
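
The retrieve-then-ground split can be sketched as two separate steps: a cheap search over the whole video, then localization inside the region the search returns. The sketch below assumes per-frame query relevance scores are already available (for example from a frame-level retriever) and uses a fixed-length window and a score threshold. These choices and the function names are illustrative, not the paper's pipeline.

```python
def retrieve_window(frame_scores, window):
    """Search step: start index of the fixed-length window with the highest total score."""
    if window <= 0 or window > len(frame_scores):
        raise ValueError("window must be between 1 and the number of frames")
    best_start, best_total = 0, float("-inf")
    for start in range(len(frame_scores) - window + 1):
        total = sum(frame_scores[start:start + window])
        if total > best_total:
            best_start, best_total = start, total
    return best_start

def ground_in_window(frame_scores, start, window, threshold, fps=1.0):
    """Grounding step: tightest interval (t_s, t_e) in seconds covering
    the above-threshold frames inside the retrieved window, or None."""
    hits = [i for i in range(start, start + window)
            if frame_scores[i] >= threshold]
    if not hits:
        return None
    return hits[0] / fps, (hits[-1] + 1) / fps

# Hypothetical relevance scores for an 8-frame clip sampled at 1 frame per second.
scores = [0.1, 0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1]
start = retrieve_window(scores, window=4)
interval = ground_in_window(scores, start, window=4, threshold=0.5)
```

Keeping the steps separate mirrors the failure taxonomy in the abstract: if `retrieve_window` picks the wrong region, no amount of precision in `ground_in_window` can recover the interval.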

2606.12299 2026-06-11 cs.RO cs.LG New submission

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

Affiliations: Robotics Institute, Carnegie Mellon University

AI summary: Proposes a framework that interactively searches for language sequences that improve closed-loop VLA task performance and learns an improvement head that predicts when language steering helps, with conformalization to prevent harmful interventions.

Comments
22 pages, 14 tables, 14 figures

Abstract

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.
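
A generic way to conformalize a predictor like the improvement head is split conformal calibration with a one-sided margin. The sketch below shows that generic recipe, not the paper's procedure: it assumes a held-out calibration set of predicted and realized improvements, computes the conformal quantile of the over-prediction residuals, and steers only when the prediction still clears zero after subtracting that margin. Under exchangeability, the realized improvement exceeds `prediction - margin` with probability at least `1 - alpha`.

```python
import math

def conformal_margin(predicted, actual, alpha):
    """One-sided split-conformal margin from calibration data.

    Residuals measure how much the head over-predicted the improvement.
    Returns infinity when the calibration set is too small for the
    requested coverage level.
    """
    residuals = sorted(p - a for p, a in zip(predicted, actual))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))   # conformal rank
    if k > n:
        return math.inf
    return residuals[k - 1]

def should_steer(pred_improvement, margin):
    """Intervene only if the calibrated lower bound on improvement is positive."""
    return pred_improvement - margin > 0

# Hypothetical calibration set: the head always predicted 1.0.
cal_pred = [1.0] * 9
cal_actual = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
margin = conformal_margin(cal_pred, cal_actual, alpha=0.5)
```

With too few calibration points the margin becomes infinite and `should_steer` never fires, so the fallback is always the original instruction.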

2606.12295 2026-06-11 cs.CV cs.CL cs.IR New submission

Findings of the MAGMaR 2026 Shared Task

Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang

Affiliations: Johns Hopkins University; OpenAI; University of Maryland, Baltimore County; Air Force Research Laboratory; Human Language Technology Center of Excellence, Johns Hopkins University; University of Amsterdam; Huazhong University of Science and Technology

AI summary: Presents the results of the MAGMaR 2026 shared task, which covers video retrieval and article generation grounded in retrieved videos. Every submitted retrieval system beat a baseline derived from last year's winner.

Comments
Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: this https URL

Abstract

This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.

2606.12294 2026-06-11 cs.CV eess.IV New submission

Bridging the Modality Gap in Forensic Image Retrieval

Ricardo González-Gazapo, Annette Morales-González, Yoanna Martínez-Díaz, Heydi Méndez-Vázquez, Milton García-Borroto

Affiliations: Advanced Technologies Application Center (CENATAV); Centro de Sistemas Complejos, Facultad de Física, Universidad de La Habana

AI summary: Proposes a unified retrieval framework that uses a multimodal large language model to generate textual descriptions and fuses visual and textual features, improving retrieval precision and robustness on forensic tasks such as tattoo and face-sketch retrieval.

Comments
23 pages, 5 figures, paper submitted to Elsevier journal

Abstract

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.

2606.12291 2026-06-11 cs.CL New submission

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

Affiliations: University of Oxford; University of Washington; University College London; University of Waterloo

AI summary: Introduces the MedMisBench benchmark, which injects misleading context to test the epistemic resilience of LLMs in medical settings. Mean accuracy falls from 71.1% to 38.0%, and authority-framed falsehoods reach a 69.5% attack success rate.

Abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
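
The two headline quantities, accuracy and attack success, can be computed from paired answers with and without the injected context. The abstract does not spell out how attack success is defined or aggregated, so the sketch below assumes one common convention: among items answered correctly on the original question, the share whose answer becomes incorrect under the misleading context. The data are made up for illustration.

```python
def accuracy(answers, gold):
    """Fraction of items answered correctly."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def attack_success_rate(original, attacked, gold):
    """Assumed definition: share of originally correct items that flip to incorrect."""
    eligible = 0
    flipped = 0
    for o, a, g in zip(original, attacked, gold):
        if o == g:                 # only originally correct items count
            eligible += 1
            if a != g:
                flipped += 1
    return flipped / eligible if eligible else 0.0

# Hypothetical four-item example.
gold = ["A", "B", "C", "D"]
original = ["A", "B", "C", "A"]    # 3 of 4 correct without injected context
attacked = ["A", "C", "D", "A"]    # 2 of those 3 flip under misleading context
asr = attack_success_rate(original, attacked, gold)
```

Under this convention the two numbers answer different questions: accuracy under attack is measured over all items, while attack success is measured only over items the model got right to begin with.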

2606.12286 2026-06-11 cs.CV New submission

CellNet -- Localizing Cells using Sparse and Noisy Point Annotations

Benjamin Eckhardt, Dmytro Fishman, Stuart Fawke, Andrew Curtis, Bo Fussing, Constantin Pape

Affiliations: University of Göttingen; Wellcome Sanger Institute; University of Tartu

AI summary: Proposes CellNet, a regression-based deep learning algorithm that uses sparse point annotations to detect and count cells in phase-contrast microscopy images. It reduces annotation effort and is a promising alternative to zero-shot methods in low-data regimes.

Comments
Conference poster at Biology at Scale: From Variants to Cellular Programs and Functions

Abstract

Counting living cells is an important step in many biological research workflows. Our collaborators at the Wellcome Sanger Institute study vital genes in humans via large scale saturation genome editing screening, which requires repeatedly counting cells a great number of times. Computer Vision based automation is crucial for high throughput and resource efficiency. In this work, we develop a regression-based deep learning computer vision algorithm to detect and count cells in phase-contrast microscopy images. To reduce annotation effort, which in practice often becomes a bottleneck, we focus on counting cells only using sparse point annotations, which are fast and easy to acquire. By comparison to state-of-the-art 0-shot methods, we show that regression-based counting is a promising alternative in low data regimes. Through developing methods to automatically count living cells in microscopy images, we contribute to valuable research on the human genome. The code is available at this https URL.

2606.12280 2026-06-11 cs.LG New submission

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

Deep Gandhi, Ali Asaria, Tony Salomone

Affiliations: Transformer Lab

Abstract

Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion transformer (DiT), shipped as two separate-weight copies of a single-stream 34-layer backbone for classifier-free guidance and conditioned by a Qwen3-VL-8B encoder - for Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by $+1.9$ CLIP (95% CI $[+1.21,+2.64]$, excluding zero). A per-category OCR analysis, to our knowledge unreported for this model class, confirms text legibility is preserved, and an ablation isolates protection of the FFN down-projections as the dominant quality lever. Our GGUF Q4_K quantization beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8's weights match FP8's footprint rather than shrink it, so a speed gain on Ampere awaits a fused INT8 kernel.
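
The core of a W8A8 recipe, per-channel weight scales plus per-token dynamic activation scales, fits in a short sketch. The version below uses symmetric INT8 quantization on plain Python lists and deliberately omits the SmoothQuant rescaling and mixed-precision layer protection mentioned in the abstract. Function names are illustrative and the integer matmul is emulated rather than run on an INT8 kernel.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    Returns (integer rows clipped to [-127, 127], per-row scales).
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def w8a8_linear(x, w):
    """Emulated INT8 linear layer.

    x: tokens x in_features (activations), w: out_features x in_features.
    """
    qx, sx = quantize_rows(x)   # per-token scales, computed at run time
    qw, sw = quantize_rows(w)   # per-output-channel scales, computed once
    return [
        # integer dot product, then dequantize with the two scales
        [sum(a * b for a, b in zip(qx[t], qw[o])) * sx[t] * sw[o]
         for o in range(len(w))]
        for t in range(len(x))
    ]

w = [[0.5, -1.0], [2.0, 0.25]]
x = [[1.0, 2.0], [-3.0, 0.5]]
y_int8 = w8a8_linear(x, w)
y_ref = [[sum(a * b for a, b in zip(xt, wo)) for wo in w] for xt in x]
```

Because each token and each output channel carries its own scale, a single large activation or weight row does not force a coarse grid on the others. That is the motivation for the per-token and per-channel granularity.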

2606.12278 2026-06-11 cs.CV cs.LG New submission

Finding Sparse Subnetworks in One Training Cycle via Progressive Magnitude-Based Pruning

Romana Qureshi, Hafida Benhidour, Said Kerrache, Nahlah Aljeraisy

Affiliations: King Abdullah University of Science and Technology; University of Jeddah; King Fahd University of Petroleum and Minerals; King Saud University

AI summary: Proposes progressive magnitude-based pruning, which increases sparsity linearly within a single training cycle and updates masks based on weight magnitudes. It is evaluated on CIFAR-10 and MNIST and reports higher CIFAR-10 accuracy than baselines such as LTH, SNIP, and GraSP.

Abstract

Neural network pruning reduces model size by removing less important parameters while aiming to preserve predictive performance. Although the Lottery Ticket Hypothesis (LTH) shows that sparse subnetworks can match dense networks when trained from suitable initializations, its iterative pruning procedure requires multiple complete training cycles. This work evaluates progressive magnitude-based pruning as a single-cycle alternative. The method gradually increases sparsity during training using a linear schedule and updates pruning masks based on active weight magnitudes. We conduct systematic experiments on CIFAR-10 and MNIST across ResNet, VGG-style, and LeNet architectures, comparing the proposed method with representative iterative and initialization-based pruning baselines, including LTH, SNIP, and GraSP. On CIFAR-10, the method achieves 95.12% accuracy on ResNet-18 at 72.9% sparsity, compared with 90.5% reported for LTH. At extreme sparsity, it achieves 93.13% accuracy on a VGG-like architecture at 97% sparsity, compared with approximately 92.0% for SNIP, and 93.44% accuracy on VGG-19 at 97.97% sparsity, compared with 92.19% for GraSP at 98% sparsity. A sparsity-accuracy analysis on ResNet-18 further shows that accuracy remains within 0.1 percentage points of the dense baseline across 70--85% sparsity. These results indicate that progressive magnitude-based pruning provides an effective single-cycle approach for neural network sparsification under the evaluated settings.
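
The method as described has two moving parts: a linear sparsity schedule and a mask update that prunes the smallest-magnitude weights among those still active. The sketch below illustrates both on a flat weight list. The abstract does not say how often masks are updated or whether pruning is global or per layer, so those choices (and the function names) are assumptions here.

```python
def linear_sparsity(step, total_steps, final_sparsity):
    """Target sparsity at a training step, rising linearly from 0 to final_sparsity."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * frac

def update_mask(weights, mask, sparsity):
    """Prune the smallest-magnitude active weights until `sparsity` is reached.

    Weights already masked out stay pruned; only active weights are ranked.
    """
    n = len(weights)
    target_pruned = int(sparsity * n)
    already_pruned = n - sum(mask)
    extra = max(target_pruned - already_pruned, 0)
    active = sorted((i for i in range(n) if mask[i]),
                    key=lambda i: abs(weights[i]))
    new_mask = list(mask)
    for i in active[:extra]:
        new_mask[i] = 0
    return new_mask

weights = [0.1, -0.5, 0.05, 2.0, -0.3, 0.7, -0.02, 1.0]
mask = [1] * len(weights)
mask_half = update_mask(weights, mask, sparsity=0.5)
```

In a training loop one would call `linear_sparsity` at each mask-update step, pass the result to `update_mask`, and multiply weights by the mask after every optimizer step, all within the single training run.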

2606.12277 2026-06-11 cs.LG New submission

Finding Multiple Interpretations in Datasets

Matthew Chak, Paul Anderson

Affiliations: Department of Computer Science, California Polytechnic State University

AI summary: Proposes a method for finding sets of models that perform similarly but have different context-aware characteristics, with the goal of extracting insight into the underlying phenomenon.

Abstract

In this paper, we propose an approach to finding sets of similar-performing models (in terms of loss/accuracy measurements) with highly different context-aware characteristics. Through experiments on the METABRIC dataset, we show that the proposed method finds multiple models with highly different gene expressions than those found by the control methodology without performance penalties. We argue that the proposed methodology is important whenever one aims to analyze any global characteristic of a model to extract insight into the underlying phenomenon being studied.

2606.12273 2026-06-11 cs.CL New submission

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

超越完全随机掩码:扩散语言模型的注意力引导去噪与优化

Jia Deng, Junyi Li, Wayne Xin Zhao, Jinpeng Wang, Hongyu Lu, Ji-Rong Wen

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Department of Data Science, City University of Hong Kong; Meituan; WeChat, Tencent; Beijing Key Laboratory of Research on Large Models and Intelligent Governance

AI summary: Proposes the AGDO framework, which uses attention structure to guide the denoising order and emphasize critical tokens, improving the reasoning performance of diffusion language models on mathematical and coding benchmarks.

Comments
13 pages. Accepted to ACL 2026 Main Conference
AI Chinese abstract

扩散大语言模型(dLLMs)通过并行解码提供了自回归模型的高效替代方案,然而现有的后训练方法大多依赖随机掩码策略,忽略了内在的令牌依赖关系。在这项工作中,我们对dLLMs中的注意力进行了实证分析,表明对未掩码上下文关注更强的令牌表现出更高的生成稳定性,并在推理中发挥关键作用。受这些发现启发,我们提出了AGDO,一种注意力引导的去噪与优化框架,将训练和优化与注意力导出的依赖关系对齐。AGDO基于注意力结构确定去噪顺序,并在监督微调和强化学习过程中强调注意力关键令牌。在数学和编码基准上的实验表明,AGDO持续提升推理性能,优于dLLMs的最先进后训练方法。

English abstract

Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.

2606.12268 2026-06-11 cs.AI New submission

The Impossibility of Eliciting Latent Knowledge

引出潜在知识的不可能性

Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, Jonathan Richens

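Affiliations: The London School of Economics and Political Science; Independent

AI summary: Formalizes the problem of eliciting latent knowledge using causal influence diagrams and proves that no training strategy relying only on behavioural feedback can guarantee that an agent honestly reports its beliefs.

Comments
24 pages, 3 figures. Includes proofs in appendix
AI Chinese abstract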

高级AI系统对其环境拥有广泛的知识;事实上,它们的知识可能(远远)超过其开发者或用户。因此,AI系统的一个理想属性是诚实——即它准确报告其对世界的信念。设计一个诚实的AI系统可能很困难,特别是当我们想询问关于环境中潜在变量的问题时——这些变量对与之交互的人类是隐藏的。这就引出了引出潜在知识(ELK)问题:训练AI智能体诚实报告其信念的问题。在本文中,我们使用因果影响图(CID)使ELK在形式上精确化。CID可用于描述智能体的训练环境与其主观世界表征之间的关系。我们使用CID来形式化可观测变量和潜在变量之间的区别,明确指定智能体诚实的确切含义,并正式定义目标泛化错误。我们证明,在某些情况下,开发者可以通过在训练期间提供正确的反馈来激励智能体诚实回答问题。然而,智能体泛化的一种自然但不理想的方式是提供人类会评估为真实的答案,而不是诚实的答案。我们证明了一个不可能性定理:不存在仅依赖于智能体行为且能确保产生诚实智能体的基于反馈的训练策略,即使在训练期间反馈是完美的。

English abstract

Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.

2606.12263 2026-06-11 cs.CV New submission

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

VOID: 击败潜在扩散模型中的未授权模仿

Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang

Affiliations: School of Cyber Science and Engineering, Wuhan University; School of Computer Science, Wuhan University; Institute for Math&AI, Wuhan University; The Hong Kong University of Science and Technology (Guangzhou); School of Cyber Science and Engineering, Xi’an Jiaotong University

AI summary: To counter the use of latent diffusion models for unauthorized mimicry, proposes the VOID defense framework, which manipulates the model's intrinsic stochasticity by amplifying latent encoding errors and counteracting target guidance signals. This causes semantic corruption that blocks unauthorized mimicry while confining perturbations to human-imperceptible regions.

Comments
To appear in the 35th USENIX Security Symposium (USENIX Security 2026)
AI Chinese abstract

虽然潜在扩散模型(LDM)彻底改变了视觉合成,但它们越来越多地被用于对个人的未授权模仿。现有防御通过注入欺骗性扰动,将生成图像引导至无关目标。然而,这种方法基于一个无根据的假设:微小的扰动能在LDM的整个生成过程中保持其欺骗效果。实际上,模型固有的恢复机制会移除这些扰动,导致个体身份在生成的图像中重新出现。我们提出VOID,一种通过操纵LDM内在随机性克服这一难题的防御框架。VOID以两种新颖方式扰动扩散管道:1)放大潜在编码误差以破坏图像的语义结构,以及2)抵消目标引导信号以抑制模型的恢复能力。这导致语义破坏,阻止任何未授权模仿。值得注意的是,安全增益不以视觉效用为代价,因为VOID同时设法将扰动限制在受保护图像的人眼不可感知区域。我们在5个数据集上对10种模仿攻击的24种最先进防御进行了全面评估,证明了VOID前所未有的保护能力:它将平均Frechet Inception Distance(FID)从113提高到365,比迄今为止最强的防御提升了223%。

English abstract

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.
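
The 223% figure quoted above can be checked as a relative increase, assuming that 113 is the average FID under the strongest prior defense and 365 the average FID under VOID (the abstract's wording suggests this reading but does not state it outright):

```latex
\[
\frac{\mathrm{FID}_{\text{VOID}} - \mathrm{FID}_{\text{baseline}}}{\mathrm{FID}_{\text{baseline}}}
= \frac{365 - 113}{113} = \frac{252}{113} \approx 2.23 = 223\%
\]
```

Here a higher FID is the desired outcome for the defender, since it means the mimicked images drift further from the protected originals.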

2606.12258 2026-06-11 cs.CV New submission

Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning

连接昼夜:基于协同提示与原型学习的无监督跨域重识别

Jiyang Xu, Rui Liu, Hang Dai

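Affiliations: School of Computer Science, Wuhan University

AI summary: Proposes an unsupervised day-night re-identification framework that combines prompt learning with prototype-based representation learning and uses two-stage training to associate identities across domains without labels, with performance comparable to fully supervised methods.

AI Chinese abstract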

跨域昼夜重识别(ReID)面临昼夜场景间显著视觉外观差异的根本挑战。现有的全监督方法严重依赖劳动密集型标注,成本高昂且跨域泛化能力有限。本文研究无监督昼夜重识别,提出一种新颖框架,协同结合提示学习和基于原型的表示学习,无需人工标注即可关联跨域身份。我们的方法采用渐进式两阶段训练策略。第一阶段,利用视觉语言模型以无标注方式生成实例特定的文本提示。我们采用实例级对齐机制,将视觉特征和文本提示嵌入统一语义空间,通过实例感知的动态偏差适应将未标注的昼夜图像与可学习提示对齐。第二阶段,构建域特定原型记忆库,并引入两个互补模块:i) 域内身份关联模块,增强每个域内的特征判别性;ii) 跨域原型匹配模块,可靠识别正负原型对,从而建立昼夜间的鲁棒身份对应关系。在公开基准上的大量实验验证了方法的有效性。在无监督设置下,我们的框架取得了与最先进全监督方法相当的Rank-1准确率。

English abstract

Cross-domain day-night re-identification (ReID) is fundamentally challenged by the substantial visual appearance discrepancies between daytime and nighttime scenes. Existing fully supervised methods rely heavily on labor-intensive annotations, which are costly and exhibit limited generalization across domains. In this work, we investigate unsupervised day-night ReID and propose a novel framework that synergistically combines prompt learning and prototype-based representation learning to associate identities across domains without requiring manual labels. Our approach follows a progressive two-stage training strategy. In the first stage, we exploit the vision-language model to generate instance-specific textual prompts in an annotation-free manner. We employ an instance-level alignment mechanism to embed visual features and textual prompts into a unified semantic space, aligning unlabeled day/night images with learnable prompts via instance-aware dynamic-bias adaptation. In the second stage, we construct domain-specific prototype memory banks and introduce two complementary modules: i) an intra-domain identity association module to enhance feature discriminability within each domain, and ii) a cross-domain prototype matching module to reliably identify positive and negative prototype pairs, thereby establishing robust identity correspondences across day and night. Extensive experiments on public benchmarks validate the effectiveness of our method. Under the unsupervised setting, our framework attains Rank-1 accuracy comparable to state-of-the-art fully supervised methods.

2606.12252 2026-06-11 cs.LG cs.AI New submission

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

使用可解释性作为训练时可靠性信号实现高效心电图分类

Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Affiliations: School of Computer Science, University of Nottingham; Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford; School of Computer Science, University of Nottingham Ningbo China

AI summary: Proposes ERTS, which uses explanation quality during training (Grad-CAM attention maps) to distinguish informative from unreliable uncertainty and filters out low-focus samples, improving macro-F1 and reducing training cost on three ECG datasets.

AI Chinese abstract

训练用于临床时间序列分析的深度神经网络计算需求高,但许多医疗环境缺乏重复模型开发和部署所需的资源。这一挑战在心电图分类中尤为明显,大数据集和长训练计划使效率变得重要。渐进式数据丢弃通过从梯度更新中排除已学习的样本来降低训练成本,但它依赖模型置信度,可能保留因噪声或歧义而难以处理而非有用信号的样本。在这项工作中,我们引入了ERTS,一种基于可解释性的可靠性训练信号,用于高效心电图分类。ERTS在训练期间利用解释质量来区分信息性和不可靠的不确定性。基于渐进式数据选择,我们计算候选样本的Grad-CAM注意力图,并推导出一个聚焦分数,衡量模型预测是否得到连贯且局部化模式的支持。低聚焦样本被过滤掉,而具有有意义注意力的样本优先进行梯度更新。我们在三个ECG数据集和多个骨干架构上评估ERTS,显示macro-F1的一致提升以及有效训练成本的降低。这些结果表明,解释质量可以作为改善临床时间序列学习中效率和可靠性的实用信号。代码将发布。

English abstract

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

2606.12251 2026-06-11 cs.LG cs.AI cs.CR New submission

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

强化学习破坏基于梯度的对抗优化

Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

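Affiliations: COSIC, KU Leuven; Imec Brubotics, VUB; DistriNet, KU Leuven

AI summary: Studies training image classifiers with reinforcement learning to disrupt the gradient structure that attackers rely on. Finds that RL acts as an implicit regularizer yielding unstable gradient directions and smaller gradient magnitudes, which causes gradient-based attacks to fail, and that combining RL with adversarial training gives a dual-layer defense.

AI Chinese abstract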

基于梯度的对抗攻击仍然是对深度神经网络(DNN)的主要威胁,因为它们利用梯度信息高效优化对抗扰动。为了解决这个问题,我们研究了强化学习(RL)训练是否可以通过使用策略梯度目标和epsilon-贪婪探索来训练图像分类器,从而破坏攻击者使用的梯度结构。通过在CIFAR-10、CIFAR-100和ImageNet-100上使用多种架构进行系统实验,我们发现RL训练的分类器显著破坏了基于梯度的对抗优化。为了解释这一点,我们使用损失景观可视化、静态和动态梯度指标以及预测熵进行了全面的机制分析。我们的分析揭示,RL充当隐式正则化器,产生具有高度不稳定梯度方向和较小梯度幅度的模型。这种组合使得每个PGD步骤在方向上不可靠且幅度有限,导致基于梯度的攻击在实际迭代预算内失败。我们进一步表明,将RL与对抗训练(RL-adv)结合提供了在两个互补层面运作的双层防御:RL退化攻击者可用的梯度信息(梯度级防御),而对抗训练强化决策边界(边界级防御)。RL-adv在所有评估的主要攻击类型(包括基于梯度的PGD、AutoAttack、基于迁移和基于查询的攻击)中实现了最高的鲁棒性,显著优于SL-adv。这些发现将RL诱导的梯度破坏识别为一种互补的鲁棒性机制,并激励未来研究结合SL效率与RL梯度正则化特性的混合SL-RL训练调度。

English abstract

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.

2606.12250 2026-06-11 cs.CL New submission

Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

重新评估高性能大语言模型在波兰医学考试中的表现:真实能力还是偏差驱动?

Antoni Lasik, Jakub Pokrywka, Łukasz Grzybowski, Jeremi Ignacy Kaczmarek, Gabriela Korzańska, Janusz Świeczkowski-Feiz, Oskar Pastuszek, Paulina Hoffman, Jakub Tomasz Dąbrowski, Wojciech Kusa

Affiliations: NASK National Research Institute; Adam Mickiewicz University; ARAAI Poland; Poznań University of Medical Sciences; Centre of Postgraduate Medical Education, Poland; T. Marciniak Lower Silesian Specialist Hospital; Medical University of Warsaw

AI summary: Introduces an expanded, more challenging benchmark based on Polish medical exams that reduces MCQA artifacts. Finds that standard MCQA scores overestimate the real clinical ability of LLMs, with the best model dropping by 28.4 and 31 percentage points under the harder setup.

Comments
26 pages total with references and appendix, preprint
AI Chinese abstract

医学领域的大语言模型(LLM)主要通过多项选择题问答(MCQA)进行评估,但由于猜测策略和答案偏差,这种方法可能高估真实的临床能力。为解决这些局限性,我们引入了一个基于波兰医学考试的扩展且更具挑战性的基准,增加了超过15,000道题目、两个新领域和四项结构修改,以减少MCQA特定伪影并更好地测试推理能力。我们评估了21个LLM,结果表明评估设计对结果影响很大。在我们的更难设置下,最佳模型(Qwen3.5-122B)在英语和波兰语考试中分别下降了28.4和31个百分点。尽管数据污染证据不足,但标准MCQA分数并不能可靠地反映真实的医学能力。为促进进一步研究,我们公开了该基准。

English abstract

Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.

2606.12248 2026-06-11 cs.CV New submission

Damage-TriageFormer: A Foundation-Model Framework for Typology-Based Building Damage Assessment from Mono-Temporal Imagery

Damage-TriageFormer:基于类型学的单时相影像建筑损伤评估的基础模型框架

Yiming Xiao, Yu-Hsuan Ho, Sanjay Thasma, Junwei Ma, Ali Mostafavi

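Affiliations: Texas A&M University; Resilitix Intelligence LLC; Institute for a Disaster Resilient Texas

AI summary: Proposes Damage-TriageFormer, a model for typology-based building damage assessment from a single post-event image. By extending a DINOv3 ViT-L backbone with a two-stage gated damage head, it reaches a macro F1 of about 0.62 on data from three disasters and supports emergency response without pre-event imagery.

AI Chinese abstract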

决策相关的建筑损伤评估对于灾后资源优先分配和恢复至关重要,但大多数自动化方法要么将损伤扁平化为单一严重程度等级(无损伤、轻微、严重、摧毁),要么需要成对的灾前和灾后影像,而这对于突发灾害通常不可用。本文提出了Damage-TriageFormer,一种基于单张灾后影像、足迹条件化的模型,它生成损伤类型学而非严重程度等级。我们的贡献包括:(1)DamageTriage-Bench,一个基于NOAA应急响应影像(涵盖2018年迈克尔飓风、2024年海伦飓风和2025年洛杉矶野火复合灾害)构建的新基准,包含五个类型学类别,区分屋顶损伤和结构损伤,并在每个类别内区分部分和全部范围;(2)Damage-TriageFormer,它扩展了DINOv3 ViT-L骨干网络,结合简单特征金字塔进行更高分辨率的实例池化、两阶段门控损伤头以及辅助严重程度回归目标。我们的模型在验证集上达到宏观F1为0.624,在保留的分层测试集上为0.619,在运营分类最需要的地方表现最强,无损伤建筑和完全结构倒塌的每类F1分别为0.91和0.84。尽管罕见的完全屋顶损伤类别由于样本有限和固有的模糊标签边界仍然困难,但我们的结果表明,单张灾后影像可以支持可操作的建筑损伤分类,无需灾前参考即可实现有针对性的应急响应和资源分配。

English abstract

Decision-relevant building damage assessment is critical for prioritizing resources and recovery after a disaster, yet most automated methods either flatten damage into a single severity scale (no damage, minor, major, destroyed) or require paired pre- and post-event imagery that is often unavailable for emerging hazards. This paper presents Damage-TriageFormer, a single-image, post-event, footprint-conditioned model that produces a damage typology rather than a severity scale. We contribute: (1) DamageTriage-Bench, a new benchmark built from NOAA Emergency Response Imagery across Hurricane Michael (2018), Hurricane Helene (2024), and the 2025 Los Angeles wildfire complex, with five typology classes that distinguish roof damage from structural damage and, within each, partial from total extent; and (2) Damage-TriageFormer, which extends a DINOv3 ViT-L backbone with a Simple Feature Pyramid for higher-resolution instance pooling, a two-stage gated damage head, and an auxiliary severity-regression objective. Our model achieves macro F1 of 0.624 on validation and 0.619 on a held-out stratified test set, performing strongest where operational triage needs it most, with per-class F1 of 0.91 and 0.84 on undamaged buildings and total structural collapse, respectively. While the rare Total Roof Damage class remains difficult due to its limited examples and an inherently ambiguous label boundary, our results show that single-image post-event imagery can support actionable building damage typing, enabling targeted emergency response and resource allocation without a pre-event reference.
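
The macro F1 scores reported above average the per-class F1 with equal weight per class, which is why a rare, difficult class can pull the overall score well below the 0.91 and 0.84 reached on the easiest classes. The sketch below shows the standard computation on integer labels. The function name and the toy labels are illustrative and are not taken from the paper or its benchmark.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    Per-class F1 is 2*TP / (2*TP + FP + FN). A class with no true
    examples and no predictions contributes 0.0 here (one common convention).
    """
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes, scores


# Toy example with three classes (hypothetical labels).
overall, per_class = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
```

Because every class counts equally, macro F1 is a stricter summary than accuracy when class frequencies are unbalanced, as they typically are in damage data.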

2606.12240 2026-06-11 cs.LG cs.AI New submission

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

多速率专家混合模型加速液态神经网络训练

Shilong Zong, Almuatazbellah Boker, Hoda Eldardiry

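Affiliations: Virginia Tech

AI summary: Proposes a multi-rate mixture-of-experts framework that combines the multi-scale dynamics of liquid neural networks with attention mechanisms to improve the accuracy and efficiency of multivariate time-series modeling.

AI Chinese abstract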

多变量时间序列数据通常表现出复杂的时间依赖、不规则采样和跨多个时间尺度的异质动态,使得精确序列建模特别具有挑战性。传统的循环神经网络(RNN),如长短期记忆网络(LSTM),在离散时间下运行,可能难以有效捕捉连续和不规则的时间行为。液态神经网络(LNN)通过连续时间动态解决了其中一些限制,但标准LNN架构通常依赖单一动力系统,限制了其建模异质时间模式的能力。为了解决这些挑战,我们提出了一个基于液态神经网络的多速率专家混合(MR-MoE)框架。在所提出的架构中,多个基于LNN的专家以不同的时间尺度运行,使模型能够明确分离快速变化的动态和缓慢演变的时间趋势。门控网络进一步实现了基于输入条件的自适应专家专业化。此外,我们结合了特征级和时间注意力机制,以提高鲁棒性、可解释性和长程依赖建模能力。特征级注意力抑制噪声或无关变量,而时间注意力则选择性地关注信息丰富的历史状态。我们在一个复杂的多变量时间序列预测任务上评估了所提出的框架,并与强基线模型(包括LSTM、单体LNN和标准MoE模型)进行了比较。实验结果表明,所提出的MR-MoE框架在保持良好计算效率的同时,持续实现了改进的AUROC和AUPRC性能。这些结果突显了结合连续时间动态、多尺度专家分解和自适应注意力机制对时间序列建模的有效性。

English abstract

Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.

2606.12236 2026-06-11 cs.RO cs.CV New submission

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

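Affiliations: Wangxuan Institute of Computer Technology, Peking University; University of California, Merced

AI summary: Proposes the DrivingAgent framework, which addresses the challenges of integrating new models into autonomous driving systems and meeting real-time constraints through automated module development (design phase) and real-time scheduling by a lightweight LLM trained with reinforcement learning (scheduling phase), achieving a better speed-accuracy trade-off on nuScenes and Bench2Drive.

AI Chinese abstract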

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而,这一趋势带来了两个关键挑战:(i)设计和集成新模型的手动且劳动密集型过程,以及(ii)缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型(LLM)的智能体为自动化提供了有前景的途径,但现有框架并不适合自动驾驶。具体来说,它们未能区分系统设计和实时调度的根本不同需求,将模块视为不透明的黑盒,并且并非为持续运行而设计。为了解决这些局限性,我们提出了DrivingAgent,这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段,DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段,它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块,并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明,DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

English abstract

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed-accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.12234 2026-06-11 cs.CL New submission

On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

论LLM条件控制中的效果-流畅性权衡:一项系统性研究

Iuri Macocco, Pau Rodríguez, Arno Blaas, Luca Zappella, Marco Baroni, Xavier Suau

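Affiliations: Universitat Pompeu Fabra; Apple; ICREA

AI summary: Systematically studies the effectiveness-fluency trade-off of LLM conditioning methods when injecting and removing target concepts. Finds that efficient steering methods often sacrifice fluency and that activation steering methods are less effective on instruction-tuned models.

Comments
8 pages, 2 figures
AI Chinese abstract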

控制大型语言模型(LLM)的输出是其可靠部署的核心挑战,然而对所涉及权衡的清晰理解仍然难以捉摸。当前的条件控制方法通常在评估时狭隘地关注其注入或移除目标概念的有效性,而忽略了生成质量。我们系统性地研究了注入和移除场景中的一系列条件控制方法。我们发现,高效的引导方法通常以流畅性的大幅损失为代价来实现条件控制。此外,我们识别出一个关键但先前被忽视的与训练范式的交互:激活引导方法在指令调优模型上的效果远不如在基础模型上。另一方面,简单的提示和全面的监督微调是概念注入的可行选择,但在概念移除方面效果不佳。最后,廉价计算的文本指标与昂贵的LLM作为评判者的评分高度相关,并为条件控制方法的行为提供了见解。

English abstract

Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the trade-offs involved remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are not as good at concept removal. Finally, cheaply computed textual metrics correlate highly with costly LLM-as-judge scores and provide insight into the behavior of conditioning methods.

2606.12232 2026-06-11 cs.LG New submission

Re-evaluating Confidence Remasking in Masked Diffusion Language Models

重新评估掩蔽扩散语言模型中的置信度重新掩蔽

Stipe Frkovic, Metod Jazbec, Dan Zhang, Christian A. Naesseth, Ilija Bogunovic, Eric Nalisnick

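Affiliations: UvA-Bosch Delta Lab, University of Amsterdam; Bosch Center for AI; University of Basel; Johns Hopkins University

AI summary: Re-evaluates WINO, a training-free, post-hoc confidence-based remasking method for masked diffusion language models. Finds that it brings little benefit under standard decoding settings and exacerbates diversity collapse.

AI Chinese abstract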

掩蔽扩散语言模型(dLLMs)最近已成为自回归语言模型的有竞争力的替代方案,其通过并行令牌生成实现更快的推理。然而,掩蔽公式的一个显著限制是,一旦令牌被解除掩蔽,就无法再修改,这使得dLLMs容易受到早期采样错误的影响。为了解决这个问题,越来越多的研究试图扩展掩蔽dLLMs,使其具有自我纠正(重新掩蔽)能力。其中一类有吸引力的方法以无需训练、事后方式基于令牌置信度实现,早期报告的结果令人鼓舞。在这项工作中,我们重新审视了代表性事后重新掩蔽方法WINO [Hong et al., 2026]的实证评估,发现在标准解码设置(较短的块长度)下,它相比于仅基于置信度的解除掩蔽 [Wu et al., 2025] 几乎没有带来好处。将评估扩展到非贪婪解码,我们发现虽然基于置信度的重新掩蔽可以在一定程度上减轻由增加随机性引入的错误,但它也加剧了先前报道的基于置信度的解除掩蔽导致的多样性坍塌。总体而言,我们的结果表明,事后基于置信度的重新掩蔽的好处高度依赖于设置,这凸显了需要更全面的评估框架。

English abstract

Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One appealing subset of these methods does so in a training-free, post-hoc manner based on token confidences, with encouraging early reported results. In this work, we revisit the empirical evaluation of a representative post-hoc remasking method, WINO [Hong et al., 2026], and find that under standard decoding settings (shorter block lengths) it brings little-to-no benefit over confidence-based unmasking alone [Wu et al., 2025]. Extending the evaluation to non-greedy decoding, we find that while confidence-based remasking can mitigate errors introduced by increased stochasticity to some extent, it also exacerbates the diversity collapse previously reported for confidence-based unmasking. Overall, our results show that the benefits of post-hoc confidence-based remasking are highly setting-dependent, underscoring the need for a more comprehensive evaluation framework.
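
To make the distinction between confidence-based unmasking and post-hoc remasking concrete, the sketch below shows one decoding step over a toy sequence. It is a generic illustration, not the WINO algorithm: the thresholds, the `proposals` format (a token and a confidence per position, supplied by a hypothetical model), and the fallback rule are all assumptions.

```python
MASK = None  # placeholder for a masked position


def decode_step(tokens, proposals, tau_unmask, tau_remask=None):
    """One parallel decoding step with optional confidence-based remasking.

    tokens:     current sequence, with MASK at undecided positions
    proposals:  list of (token, confidence) per position; for committed
                positions the confidence refers to the committed token
    tau_unmask: commit masked positions whose confidence is at least this
    tau_remask: if given, re-mask committed positions whose confidence
                has dropped below this value (the self-correction step)
    """
    new_tokens = list(tokens)
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    chosen = [i for i in masked if proposals[i][1] >= tau_unmask]
    if masked and not chosen:
        # Always commit the single most confident position to make progress.
        chosen = [max(masked, key=lambda i: proposals[i][1])]
    for i in chosen:
        new_tokens[i] = proposals[i][0]
    if tau_remask is not None:
        for i, t in enumerate(tokens):
            if t is not MASK and proposals[i][1] < tau_remask:
                new_tokens[i] = MASK
    return new_tokens
```

With `tau_remask=None` the step reduces to confidence-based unmasking alone, where a committed token can never be revised. A practical remasking decoder would also need a cap on how often a position can be re-masked, since otherwise decoding is not guaranteed to terminate.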

2606.12226 2026-06-11 cs.CV eess.IV New submission

An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography

一种电势增强的基准数据集,用于电容层析成像的物理引导图像重建

Xinqi Zhang, Qiming Ma, Lihui Peng

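Affiliations: Department of Automation, Tsinghua University

AI summary: Because data-driven methods for electrical capacitance tomography (ECT) overlook the electric potential field, proposes a benchmark dataset that includes potential maps, with 20,000 samples generated via a COMSOL-MATLAB pipeline, and shows that the maps improve modeling accuracy and robustness.

AI Chinese abstract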

虽然深度学习显著推进了电容层析成像(ECT)的图像重建,但大多数数据驱动方法直接映射电容和介电常数分布,将传感器视为黑箱。这忽略了电势场——控制非线性和病态“软场”效应的基本物理联系。为解决此问题,我们提出一个电势增强的ECT基准数据集,旨在将ECT背后的潜在物理显式集成到学习过程中。通过COMSOL-MATLAB管道为八电极传感器生成示例,数据集包含20,000个随机样本,涵盖四种典型流型。关键的是,除了传统的电容向量和以图像形式描绘的介电常数分布外,每个样本还保留了八个激励方向的全场电势图。除了数据发布,我们还提供了ECT正问题和逆问题的说明性评估协议。通过在分布内(IID)和分布外(OOD)场景下的全面测试,我们系统地展示了包含电势图如何增强建模精度和鲁棒性。从根本上说,潜在场信息的显式包含显著降低了将物理定律集成到ECT建模中的障碍,从而为未来ECT图像重建的物理引导机器学习建立了标准化基础。

English abstract

While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field -- the fundamental physical link governing the nonlinear and ill-posed "soft-field" effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.

2606.12218 2026-06-11 cs.CV cs.AI New submission

Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

为食物-水关系调整Prithvi-EO用于休耕地检测:地理空间基础模型的ViT-Adapter颈部与参数高效骨干微调

Sk Muhammad Asif, Orhun Aydin

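Affiliations: Earth, Atmospheric and Geospatial Science, Saint Louis University

AI summary: To address the mismatch between the multi-scale features needed for fallow detection and the single-scale ViT backbone of the foundation model, evaluates two parameter-efficient fine-tuning schemes (LoRA and a hybrid PEFT) with three neck designs. Lite ViT-Adapter with a one-stage head reaches an mAP@50 of 0.9479, 25.70% better than the adapter-free baseline.

Comments
10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026
AI Chinese abstract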

理解休耕地的空间分布对于优化食物-水关系至关重要,因为休耕在作物轮作和水资源保护中发挥着作用。休耕是美国农业部作物数据层中的一个低精度类别。地理空间基础模型Prithvi-EO在计算机视觉任务中展现出强大的迁移能力。然而,其视觉Transformer骨干在单一空间尺度上生成特征,不适合目标检测头所需的多尺度特征。现有方法通过缩放单步长令牌来合成多尺度金字塔,牺牲了空间异质性,而全骨干微调对于地理空间基础模型来说计算成本过高。我们评估了一个结合两种参数高效微调方案的休耕地检测流程:低秩适应和混合PEFT,以及三种颈部设计:伪多尺度、Lite ViT-Adapter和Full ViT-Adapter。我们最佳配置,即带有单阶段检测头的Lite ViT-Adapter,在Diou损失下实现了0.9479的mAP@50,表明中心感知定位对于不规则休耕地检测的有效性。在LoRA下,ViT-Adapter释放的单阶段检测比无适配器的基于锚点的方法提高了6.42%,而最佳配置比基线无适配器的基于锚点的方法提高了25.70%。这些结果表明,轻量级空间先验融合和选择性骨干解冻使Prithvi-EO能够更有效地捕捉局部休耕模式,优于依赖重塑单步长ViT令牌的方法。

English abstract

Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale, which is ill-suited to the multi-scale features required by object detection heads. Existing approaches synthesize multi-scale pyramids by rescaling single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves an mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.
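
The abstract credits the DIoU loss with center-aware localization but does not spell out the formula. As a rough illustration only, the sketch below uses the standard Distance-IoU definition (1 - IoU plus the squared center distance divided by the squared diagonal of the smallest enclosing box); whether the paper uses exactly this variant is an assumption, and the function name `diou_loss` and the `(x1, y1, x2, y2)` box layout are hypothetical choices made for this example.

```python
def diou_loss(box_a, box_b, eps=1e-9):
    """Distance-IoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative sketch of the standard DIoU formula, not the paper's code.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    dx = (ax1 + ax2) / 2.0 - (bx1 + bx2) / 2.0
    dy = (ay1 + ay2) / 2.0 - (by1 + by2) / 2.0
    center_dist_sq = dx * dx + dy * dy

    # Squared diagonal of the smallest box enclosing both inputs.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enc_w * enc_w + enc_h * enc_h

    return 1.0 - iou + center_dist_sq / (diag_sq + eps)
```

Unlike a plain IoU loss, which is constant for any pair of non-overlapping boxes, the center-distance term still distinguishes a near miss from a far miss. That property is one plausible reading of "center-aware" in the abstract.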

2606.12217 2026-06-11 cs.CV cs.AI cs.RO New submission

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

Affiliations * The University of Hong Kong; XPENG Robotics

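AI summary To address the mismatch between visual prediction and action extraction in world action models, the paper proposes AGRA, which aligns video diffusion features with semantic representations so that the action decoder attends more closely to task-relevant regions, improving performance and generalization on manipulation tasks.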

Details
English abstract

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, we observe empirically that generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. We propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.
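
The abstract describes AGRA as aligning intermediate video diffusion features with representations from a foundation visual encoder, but gives no loss formula. The sketch below shows one common form such an alignment term can take: the mean of one minus the patch-wise cosine similarity. Treat it as an assumption rather than the paper's objective. The names `cosine_similarity` and `alignment_loss` are hypothetical, and the diffusion features are assumed to have already been projected to the encoder's feature dimension.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(diffusion_tokens, encoder_tokens):
    """Mean (1 - cosine similarity) over corresponding patch tokens.

    Both arguments are lists of per-patch feature vectors with the same
    length and dimension. Hypothetical sketch, not the AGRA objective.
    """
    if not diffusion_tokens or len(diffusion_tokens) != len(encoder_tokens):
        raise ValueError("token lists must be non-empty and the same length")
    sims = [
        cosine_similarity(u, v)
        for u, v in zip(diffusion_tokens, encoder_tokens)
    ]
    return 1.0 - sum(sims) / len(sims)
```

In a training setup this term would typically be added to the main objective with a weighting coefficient; the value of that coefficient and the layer whose features are aligned are design choices the abstract does not specify.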