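arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.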

2606.02068 2026-06-02 cs.CV cs.AI

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

Kaidi Zhang, Guanxu Zhu

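Affiliations: Universiti Malaya; Wuhan University

AI summary: To address the limitations of existing methods in speed, model size, and sparse-view settings, the paper proposes a fast, lightweight novel view synthesis method based on differentiable Multiplane Images (MPI). It uses point maps for geometric initialization and introduces one-step diffusion to handle holes and artifacts.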

Abstract

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image (MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative GS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios.
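For readers unfamiliar with MPIs, the sketch below shows the standard back-to-front "over" compositing commonly used to render one pixel from a stack of planar layers. The abstract does not give the rendering equation, so this illustrates the general technique under that assumption and is not the authors' implementation. The function name `composite_over` and its per-pixel interface are hypothetical, and the homography warp of each plane into the target view is omitted.

```python
def composite_over(colors, alphas):
    """Back-to-front 'over' compositing of MPI layers for a single pixel.

    colors: list of (r, g, b) tuples ordered from the farthest plane to the nearest.
    alphas: list of opacities in [0, 1], one per plane, in the same order.
    (Hypothetical helper; names are illustrative only.)
    """
    out = (0.0, 0.0, 0.0)
    for (r, g, b), a in zip(colors, alphas):
        # Each nearer layer covers a fraction `a` of what lies behind it.
        out = (
            r * a + out[0] * (1.0 - a),
            g * a + out[1] * (1.0 - a),
            b * a + out[2] * (1.0 - a),
        )
    return out


# A half-transparent blue plane in front of an opaque red plane.
pixel = composite_over([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [1.0, 0.5])
```

Because every step is a product and a sum, the composite is differentiable with respect to the per-plane colors and opacities, which is what makes gradient-based optimization of an MPI possible in principle.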

2606.02061 2026-06-02 cs.LG

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

Michał Brzozowski, Neo Christopher Chung

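Affiliations: Samsung AI Center; University of Warsaw

AI summary: The paper shows experimentally that the stability claimed for archetypal sparse autoencoders stems from using identical initialization across training runs rather than from the archetypal constraint itself, and stresses that the distinction between stability and stabilization is crucial for interpretability research.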

Abstract

Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduces polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and report more stable dictionaries at the end of training. We demonstrate that the stability claimed by archetypal SAEs is a result of setting identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be ambiguously used: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability of natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. Experiments from archetypal SAEs share a deterministic k-means decoder initialization, setting inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.

2606.02058 2026-06-02 cs.CV cs.RO

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield

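Affiliations: University of Surrey

AI summary: Proposes TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. It derives per-pixel intensity dynamics from an explicit 3D scene representation for accurate threshold-crossing prediction, uses occlusion to guide adaptive time stepping, and achieves state-of-the-art event-stream fidelity.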

Abstract

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times, a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than those of competing simulators.

2606.02054 2026-06-02 cs.AI

eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

Xiang Li, Jiwei Wei, Ke Liu, Yitong Qin, Jinyu Guo, Malu Zhang, Peng Wang, Yang Yang

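Affiliations: Center for Future Media, University of Electronic Science and Technology of China

AI summary: Proposes the eMoT framework, which treats reasoning trajectories as dynamically evolving memories through three modules (memory corrosion, symbolic anchoring, and consistency-driven refinement) to stabilize multi-step reasoning and improve accuracy and consistency.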

Abstract

While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered by critical limitations such as unconstrained hallucinations and poor numerical computation. Fundamentally, these issues arise because standard models treat reasoning as a transient, one-off generation process rather than retaining and refining successful procedural logic. To address these challenges, we propose eMoT (evolving Memory-of-Thought), a unified framework that stabilizes multi-step reasoning by treating reasoning trajectories as dynamic, evolving memories rather than static templates. The framework primarily consists of three interconnected modules: (i) a memory corrosion mechanism that reinforces high-utility reasoning structures while gradually decaying less frequent ones; (ii) a symbolic anchoring engine that utilizes Python for deterministic computation, much like a human uses a calculator; and (iii) a consistency-driven refinement process that aligns neural inference with symbolic outcomes, reducing the accumulation of logical discrepancies. Across multiple reasoning benchmarks, eMoT improves accuracy and solution consistency over standard Chain-of-Thought and structured reasoning baselines. On the traditional task Game of 24, eMoT achieves 100% accuracy, surpassing the baseline by up to 17.6%. Evaluations on the mathematical tasks GSM8K, ASDiv, SVAMP, and MGSM further show consistent gains in multi-step mathematical reasoning. In our evaluation, we achieve superior performance despite utilizing a lightweight backbone model with constrained baseline capabilities. Compared to alternative methods that rely on massively scaled models, our results demonstrate that the performance gains are fundamentally driven by the eMoT framework's reasoning control rather than sheer model size.

2606.02049 2026-06-02 cs.AI

Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings

Hallah Shahid Butt, Qiong Huang, Gökhan Demirel, Kevin Förderer, Erfan Tajalli-Ardekani, Simon Waczowicz, Luigi Spatafora, Veit Hagenmeyer, Benjamin Schäfer

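Affiliations: Karlsruhe Institute of Technology

AI summary: Proposes an explainable deep reinforcement learning framework that trains on-policy and off-policy algorithms on real-world data and uses post-hoc explanation techniques to expose battery-management decisions, reducing costs while improving transparency.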

Abstract

The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and a growing number of entities, e.g., PV systems and heat pumps, have increased the complexity and made the system harder to operate. This leads to the demand for additional control and optimization routes, including data-based controls such as reinforcement learning. While deep reinforcement learning (DRL) has emerged as a promising solution to optimize building operations in dynamic and ever more complex environments, its black-box nature impedes user trust and practical adoption. This paper presents a framework for explainable deep reinforcement learning (XRL) applied to energy management in residential buildings. We demonstrate its usage on both synthetic data and real-world data from the Living Lab Energy Campus (LLEC) at KIT. We train and compare both on-policy and off-policy DRL agents on an expanded state space that incorporates real-time measurements (demand, PV generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendrical and holiday indicators, and forecasts for demand and price. Our experimental results indicate that on-policy algorithms, particularly Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), outperform off-policy methods in terms of cumulative rewards and policy stability. To explain these models, we employ post-hoc interpretation techniques to elaborate the learned control policies. Our findings demonstrate that the XRL framework not only reduces electricity costs through optimal battery management, but also provides transparent, actionable insights into the agent's decision-making process.

2606.02048 2026-06-02 cs.AI cs.CV physics.bio-ph

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen, Jon Sporring

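Affiliations: Department of Computer Science, University of Copenhagen, Denmark; Department of Green Technology, University of Southern Denmark, Denmark; Department of Food Science, University of Copenhagen, Denmark

AI summary: Proposes a toolbox combining topological data analysis, differential box counting, multifractal partition, and local binary patterns to analyze topological and texture features of casein gelation in STED microscopy images, revealing microstructural transitions related to rheological properties.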

Abstract

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at https://github.com/Zahratabatabaei/Delifood_CV_paper.git

2606.02042 2026-06-02 cs.CV

Normality-Preserving Continual Industrial Anomaly Detection via Orthogonal LoRA Banks

Weibai Fang, Haijun Che, Feiyang Ren, Qiancheng Lao

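Affiliations: Yisu University (Yorkshire University)

AI summary: Proposes a framework built on a History Frozen Orthogonal LoRA Bank and a Hierarchical Novelty Adaptive Bank Growth module to address historical normality prior drift and catastrophic forgetting when diffusion models are used for continual industrial anomaly detection.

Comments 33 pages, 6 figures, submitted to Advanced Engineering Informatics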

Abstract

Continual industrial anomaly detection with diffusion models suffers from historical normality prior drift and catastrophic forgetting. Existing continual diffusion methods preserve previous knowledge through replay or constrained optimization, but they lack an explicit mechanism for isolating and protecting category-specific normality priors during sequential adaptation. Although low-rank adaptation provides modular residual updates, standard LoRA neither freezes historical normality subspaces nor prevents new adapters from interfering with previous ones. To address this issue, we propose a normality-preserving continual anomaly detection framework based on two modules: History Frozen Orthogonal LoRA Bank (HF-OLB) and Hierarchical Novelty Adaptive Bank Growth module (HNABG). HF-OLB freezes both the pre-trained U-Net backbone and the learned LoRA banks, and constrains new task-specific normality residuals to the orthogonal complement of historical LoRA subspaces. HNABG further allocates layer-dependent residual capacity and expands the bank only when the residual normality novelty exceeds the expressive capacity of existing banks. Extensive experiments on MVTec and VisA demonstrate the effectiveness of the proposed method. On the challenging VisA 2x6 setting, our method achieves 83.6/91.8 image and pixel level A-AUROC with 3.8/3.9 FM, improving pixel level A-AUROC over the state of the art by 3.2 points while reducing pixel level FM by 1.3. These results show that our method effectively preserves historical normality priors in long horizon continual category sequences.

2606.02041 2026-06-02 cs.CL

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang

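Affiliations: Fudan University; Shanghai AI Laboratory

AI summary: Proposes SentGuard, a sentence-level streaming guardrail that runs in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks, enabling accurate detection of unsafe content at low latency.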

Comments 16 pages, 5 figures, submitted to ARR

Abstract

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
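The waiting-buffer idea described in the abstract can be sketched roughly as follows. This is a minimal, sequential illustration under stated assumptions, not the SentGuard implementation: the real system is described as running in parallel with decoding, whereas this sketch checks each sentence inline. The function `stream_with_guard`, the punctuation-based sentence boundary rule, and the `is_safe` callback (a stand-in for the guard model) are all hypothetical.

```python
SENTENCE_END = (".", "!", "?")  # assumed boundary rule, for illustration only


def stream_with_guard(tokens, is_safe):
    """Group streamed tokens into sentence chunks and release only verified ones.

    tokens:  iterable of token strings from the target model.
    is_safe: hypothetical callback taking the text prefix so far and returning
             True when the prefix is judged safe.
    Yields sentence chunks; stops at the first chunk that fails the check.
    """
    buffer, prefix = [], ""
    for tok in tokens:
        buffer.append(tok)
        if tok.rstrip().endswith(SENTENCE_END):
            chunk = "".join(buffer)
            buffer = []
            if not is_safe(prefix + chunk):
                return  # withhold the unsafe sentence and everything after it
            prefix += chunk
            yield chunk
    if buffer:  # trailing text without closing punctuation
        chunk = "".join(buffer)
        if is_safe(prefix + chunk):
            yield chunk


released = list(
    stream_with_guard(
        ["Hello", " world", ".", " Bad", " thing", "!", " More", "."],
        lambda text: "Bad" not in text,  # toy stand-in for a guard model
    )
)
```

The guard is called once per sentence rather than once per token, which is the trade-off the abstract contrasts with response-level and token-level guardrails.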

2606.02035 2026-06-02 cs.AI cs.LG

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya

Affiliations: Human-AI Interaction (HAIx) Lab, Indian Institute of Technology Gandhinagar; Department of Computer Science and Engineering, Madhav Institute of Technology and Science Deemed University (MITS-DU); Multimedia and Information Security Research Group, Department of Computer Science and Engineering, ABV-Indian Institute of Information Technology and Management

AI summary: Proposes RL-ACRGNet, an off-policy reinforcement learning framework that combines a pre-trained DenseNet encoder with a multilevel LSTM decoder and refines visual-semantic embeddings through a metric-based reward mechanism. It outperforms baselines on the IU-Xray dataset, and evaluation on MIMIC-CXR supports its generalisation and its ability to generate clinically relevant reports.

Comments This work has been submitted to the IEEE for possible publication

Abstract

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR dataset confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports.

2606.02027 2026-06-02 cs.RO cs.LG cs.MA

World-Task Factorization for Robot Learning

Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok

Affiliations: Department of Computer Science and Technology, University of Cambridge, United Kingdom; Robotics and Biology Laboratory, Technische Universität Berlin; Science of Intelligence (SCIoI), Cluster of Excellence, Berlin, Germany; Robotics Institute Germany

AI summary: Proposes factoring the policy into world factors and task factors, pairing the differentiable graph model AICON with a compact learned policy to achieve zero-shot generalization to new configurations and transfer to real hardware.

Abstract

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam razor's penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

2606.02022 2026-06-02 cs.CV cs.AI cs.LG

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

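Affiliations: Tevian Moscow; Lomonosov Moscow State University

AI summary: The paper exposes a fundamental mismatch between the pairwise ranking metrics commonly used in multi-view object association (such as AP and FPR-95) and the assignment objective, and uses Sinkhorn-based normalization as a post-processing stress test to show that these metrics can improve without corresponding gains in assignment-level metrics.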

Abstract

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
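To make the Sinkhorn-based normalization mentioned above concrete, here is a minimal pure-Python sketch of the generic Sinkhorn iteration: exponentiate a score matrix, then alternately normalize rows and columns until it is approximately doubly stochastic. The abstract does not specify the authors' exact formulation or which post-processing parameters they tune, so the `temperature` and `n_iters` arguments and the function name `sinkhorn` are assumptions for illustration.

```python
import math


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Alternate row and column normalization of exp(scores / temperature).

    scores: square list-of-lists of pairwise similarity scores.
    Returns an approximately doubly stochastic matrix (rows and columns sum to ~1).
    (Generic Sinkhorn iteration; parameter names are illustrative assumptions.)
    """
    k = [[math.exp(s / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        k = [[v / sum(row) for v in row] for row in k]
        # Column normalization: each column sums to 1.
        col_sums = [sum(col) for col in zip(*k)]
        k = [[v / c for v, c in zip(row, col_sums)] for row in k]
    return k


raw = [[2.0, 1.0], [1.9, 0.0]]
normalized = sinkhorn(raw)
row_argmax = [row.index(max(row)) for row in normalized]
```

In this toy example the raw scores would send both rows to column 0, whereas the normalized matrix prefers distinct columns for the two rows. The normalization rescales pairwise scores using global row and column structure, which is one way pairwise ranking metrics could shift under post-processing.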

2606.02021 2026-06-02 cs.CV

PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation

Ahmad AlMughrabi, Farid Al-Areqi, David Fernández Gómez, Umair Haroon, Marc Bolaños, Ricardo Marques, Petia Radeva

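Affiliations: University of Barcelona; LogMeal; Universitat Pompeu Fabra

AI summary: Proposes the PerBite workflow, which estimates food volume from before- and after-consumption states through segmentation, 3D reconstruction, scale calibration, and mesh post-processing, and ranks first in the MetaFood challenge.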

Abstract

Can a visually plausible food mesh be trusted to estimate the volume of consumed food? PerBite investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM 3 segments the food and plate regions; Hunyuan3D/SAM 3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. PerBite ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before- and after-pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026.
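The "watertight mesh integrated to estimate volume" step can be illustrated with the standard signed-tetrahedron formula for a closed triangle mesh, together with the cubic scaling that a length calibration implies for volume. This is a sketch of the general technique under the assumption that faces are consistently oriented; it is not taken from the PerBite code, and the helper names `mesh_volume` and `to_metric_volume` are hypothetical.

```python
def mesh_volume(vertices, faces):
    """Volume of a closed, consistently oriented triangle mesh.

    vertices: list of (x, y, z) tuples.
    faces:    list of (i, j, k) vertex-index triples.
    Sums the signed volumes of tetrahedra formed by each face and the origin.
    """
    total = 0.0
    for i, j, k in faces:
        (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = vertices[i], vertices[j], vertices[k]
        # Scalar triple product v1 . (v2 x v3) = 6 * signed tetrahedron volume.
        total += (
            x1 * (y2 * z3 - y3 * z2)
            - y1 * (x2 * z3 - x3 * z2)
            + z1 * (x2 * y3 - x3 * y2)
        )
    return abs(total) / 6.0


def to_metric_volume(volume, plate_diameter_real, plate_diameter_mesh):
    """Rescale a dimensionless volume using a known plate diameter.

    Lengths scale by s = real / mesh, so volume scales by s ** 3.
    """
    s = plate_diameter_real / plate_diameter_mesh
    return volume * s ** 3


# Unit right tetrahedron: expected volume 1/6 in mesh units.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tris = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
unit_volume = mesh_volume(verts, tris)
```

The cubic dependence on the scale factor means a small error in the plate-diameter estimate is amplified in the volume estimate, which fits the abstract's point that metric scale should be evaluated separately from surface reconstruction.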

2606.02020 2026-06-02 cs.CL cs.LG

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao

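Affiliations: University of Science and Technology of China

AI summary: The paper uses entropy dynamics to reveal a two-phase structure in Chain-of-Thought reasoning (an Uncertainty Region and a Confidence Region) and proposes a training-free framework based on CUSUM change-point detection for early exit and test-time scaling, improving inference efficiency and reliability.

Comments 21 pages, 10 figures, accepted at ICML 2026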

Abstract

This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability -- answers in the confidence region become highly accurate and stable, and 2) High Redundancy -- models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2) Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
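As a rough illustration of how a CUSUM detector could flag the transition from the Uncertainty Region to the Confidence Region, the sketch below runs a textbook one-sided CUSUM for a downward shift over a sequence of per-step entropies. The abstract does not give the statistic, baseline, or thresholds the authors use, so `baseline`, `slack`, `threshold`, and the function name `cusum_drop` are assumptions chosen for illustration.

```python
def cusum_drop(entropies, baseline, slack=0.1, threshold=1.0):
    """One-sided CUSUM for a sustained drop below `baseline`.

    entropies: per-step entropy values observed so far.
    baseline:  assumed typical entropy in the exploratory phase.
    slack:     allowance subtracted each step so small dips do not accumulate.
    threshold: alarm level for the cumulative statistic.
    Returns the index of the first step where the alarm fires, or None.
    """
    s = 0.0
    for t, h in enumerate(entropies):
        # Accumulate evidence that entropy sits below the baseline; clamp at 0.
        s = max(0.0, s + (baseline - h - slack))
        if s > threshold:
            return t
    return None


trace = [2.0, 2.1, 1.9, 2.0, 0.3, 0.2, 0.3]  # toy entropy trace
change_point = cusum_drop(trace, baseline=2.0)
```

In an early-exit setting, the returned index would mark where generation could be stopped or where a trajectory starts to be treated as converged. The statistic is updated in constant time per step, so it can run alongside decoding without any training.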


2606.02016 2026-06-02 cs.LG

Evaluating Real-World Generalizability of Algorithm Selection Models

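Gjorgjina Cenikj, Jakub Kudela, Eva Tuba, Tome Eftimov

Affiliations: Computer Systems Department, Jožef Stefan Institute; Brno University of Technology

AI summary: A systematic cross-benchmark evaluation of how well algorithm selection models generalize across synthetic and real-world optimization problems, analyzing their transfer performance and the challenges of domain-specific application.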

Comments 10 pages, 12 figures

DOI: 10.1145/3795101.3805348

Abstract

Algorithm Selection (AS) aims to automatically identify the most suitable optimization algorithm for a given problem instance by leveraging measurable problem characteristics and historical performance data. In this study, we investigate the generalization ability of AS models across both synthetic and real-world optimization landscapes. We consider two widely used academic benchmark suites (BBOB and CEC) and two real-world problem sets (robotics trajectory optimization tasks and unmanned aerial vehicle path-planning problems). Through a systematic cross-benchmark evaluation, we analyze how AS models transfer between domains, identify where generalization succeeds or breaks down, and highlight the challenges that arise when applying AS in realistic, domain-specific contexts. Our findings provide insights into the robustness of current AS approaches and inform the development of more reliable, broadly applicable AS systems for real-world optimization.


2606.02011 2026-06-02 cs.AI cs.LG

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

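Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov

Affiliations: University of Washington

AI summary: Aggressive 2-bit quantized inference in large reasoning models can fail to deliver end-to-end speedup because unstable generation inflates the total token count. The paper proposes two lightweight controls, FP16 planning and loop rescue, which substantially recover accuracy while preserving real-world speed.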

Abstract

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.
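The loop-rescue idea (detect a repetitive trace, then either commit to an earlier answer or fall back to FP16) can be sketched with a simple n-gram repetition check. The abstract does not say how repetition is detected, so the n-gram heuristic, the function names, and all thresholds below are assumptions for illustration only.

```python
from collections import Counter
from typing import Sequence


def is_looping(tokens: Sequence[str], n: int = 4, window: int = 64,
               max_repeats: int = 3) -> bool:
    """Flag a trace whose most recent window repeats some n-gram too often."""
    recent = tokens[-window:]
    grams = Counter(tuple(recent[i:i + n]) for i in range(len(recent) - n + 1))
    return any(count >= max_repeats for count in grams.values())


def next_action(tokens: Sequence[str], has_candidate_answer: bool) -> str:
    """Hypothetical policy: keep decoding, commit early, or fall back to FP16."""
    if not is_looping(tokens):
        return "continue"
    return "commit_answer" if has_candidate_answer else "fallback_fp16"


looping_trace = "so x = 2".split() * 5
print(next_action(looping_trace, has_candidate_answer=True))
```

A production detector would work on token IDs and would need tuning to avoid flagging legitimate repetition, such as restating an equation.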


2606.02010 2026-06-02 cs.CL cs.AI

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

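Oleksandr Nikitin

Affiliations: tvori.info

AI summary: Introduces PlanarBench, a benchmark that evaluates LLM spatial reasoning by asking models to draw planar graphs as ASCII art from an edge list. Edge count turns out to be the main predictor of difficulty.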

Comments 12 pages, 4 figures, https://github.com/wizzard0/planar-bench-as1073

2606.02009 2026-06-02 cs.CL

Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French

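Rodrigo Wilkens, Rémi Cardon, Vincent Folny, Thomas François

Affiliations: University of Exeter; France Éducation international; Cental, IL&C, UCLouvain; Computer Science and Engineering Department, Universidad Carlos III de Madrid

AI summary: Proposes an enhanced argument-based validation framework and uses it for a multidimensional evaluation of 8 model architectures on French essay scoring, covering fairness analysis, correlations with linguistic features, prediction error, and agreement with human raters, advancing the state of the art in French automated essay scoring.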

Abstract

In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.


2606.02002 2026-06-02 cs.CV

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

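Bishr Omer Abdelrahman Adam, Xu Li

Affiliations: Northwestern Polytechnical University

AI summary: Proposes a distortion-aware fusion framework that uses a multiplicative gating mechanism to dynamically weight NSS statistical features and VLM embeddings, achieving best or competitive results on three benchmarks and revealing how the NSS contribution varies across distortion types.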

Abstract

Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
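The multiplicative gating described here can be sketched as one input-dependent scalar gate per feature stream, applied before the streams are concatenated. This is an illustrative toy and not the paper's architecture: the linear-plus-sigmoid gate, the function name `gated_fusion`, and the tiny feature vectors are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(streams: Dict[str, List[float]],
                 gate_weights: Dict[str, List[float]],
                 gate_bias: Dict[str, float]) -> Tuple[Dict[str, float], List[float]]:
    """Scale each feature stream by an input-dependent gate in (0, 1).

    Each gate is a sigmoid of a linear function of the concatenation of all
    streams (a hypothetical, minimal stand-in for a gating network).
    Returns the per-stream gates and the gated, concatenated feature vector.
    """
    names = sorted(streams)
    joint = [v for name in names for v in streams[name]]
    gates = {}
    for name in names:
        z = sum(w * v for w, v in zip(gate_weights[name], joint)) + gate_bias[name]
        gates[name] = sigmoid(z)
    fused = [gates[name] * v for name in names for v in streams[name]]
    return gates, fused


streams = {"nss": [1.0, 2.0], "vlm": [0.5, -0.5]}
weights = {"nss": [0.0] * 4, "vlm": [0.0] * 4}
gates, fused = gated_fusion(streams, weights, {"nss": 0.0, "vlm": 0.0})
print(gates, fused)
```

With zero weights and biases every gate is 0.5, so each stream is simply halved. Trained weights would let the gate suppress or amplify a stream depending on the image, which static concatenation cannot do.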


2606.02001 2026-06-02 cs.CL

Scaling Agentic Capabilities via Grounded Interaction Synthesis

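Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao, Shuqing Bian, Wei Lu, Xiaoyong Du

Affiliations: Renmin University of China; Peking University; Tencent

AI summary: Proposes the GAIS framework, which automatically generates diverse environments and complex tasks through a two-phase grounding mechanism (protocol-anchored environments and structure-guided planning), significantly improving agent performance on BFCL, τ²-Bench, and ACEBench.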

Abstract

General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random sampling of LLMs' internal priors, failing to capture the diversity and difficulty of real-world domains or construct high-fidelity, long-horizon tasks. In this work, we introduce Grounded Agentic Interaction Synthesis (GAIS), a framework that automates the scalable construction of diverse environments and complex tasks via a two-phase grounding mechanism. Specifically, we construct protocol-anchored environments derived from real-world Model Context Protocol (MCP) servers to ensure functional diversity and difficulty. Subsequently, we employ structure-guided planning to navigate these environments, actively enforcing logical dependencies and adversarial policies to generate complex tasks. Experiments on BFCL, $τ^2$-Bench, and ACEBench demonstrate that GAIS-synthesized data significantly outperforms state-of-the-art baselines, enabling base models to match or even surpass their official instruction-tuned counterparts. Furthermore, GAIS exhibits superior data efficiency and scalability, achieving exceptional capabilities with significantly less data while maintaining continuous growth where baselines stagnate. Our code and dataset are publicly available at https://github.com/Eric8932/GAIS.


2606.02000 2026-06-02 cs.CV cs.AI eess.IV

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University; INSAIT

AI summary: Proposes a render-free framework that conditions video generation directly on compressed 3D human mesh tokens, enabling precise human motion control, reducing artifacts from 2D guidance, and improving the modeling of 3D structure.

Comments Project page: https://jingyunliang.github.io/MeshToken/

Abstract

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.


2606.01999 2026-06-02 cs.LG cs.AI

Why Do Time Series Models Need Long Context Windows?

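Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

Affiliations: Università della Svizzera Italiana; EPFL; Politecnico di Milano

AI summary: Framing forecasting as two objectives, generative process identification and conditional forecasting, the paper shows that long context windows improve forecasts by reducing uncertainty about the generating process, and that even for a process with memory length $P$ the input window must be strictly larger than $P$ to reach the minimum error.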

Abstract

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
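The view of an optimal forecast as a likelihood-weighted average over candidate generating processes can be made concrete with a toy example. The sketch below assumes two hypothetical AR(1) processes with known coefficients, unit-variance Gaussian noise, and a uniform prior. None of these choices come from the paper; they only show how a longer window sharpens the posterior over processes.

```python
import math
from typing import List, Sequence


def ar1_loglik(window: Sequence[float], phi: float, sigma: float = 1.0) -> float:
    """Log-likelihood of x_t = phi * x_{t-1} + noise, conditioned on window[0]."""
    ll = 0.0
    for prev, cur in zip(window, window[1:]):
        resid = cur - phi * prev
        ll += -0.5 * math.log(2 * math.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)
    return ll


def posterior_weights(window: Sequence[float], phis: Sequence[float]) -> List[float]:
    """Posterior over the candidate processes under a uniform prior."""
    lls = [ar1_loglik(window, phi) for phi in phis]
    m = max(lls)  # subtract the max for numerical stability
    unnorm = [math.exp(ll - m) for ll in lls]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def forecast(window: Sequence[float], phis: Sequence[float]) -> float:
    """One-step forecast: per-process forecasts averaged by posterior weight."""
    weights = posterior_weights(window, phis)
    return sum(w * phi * window[-1] for w, phi in zip(weights, phis))


phis = [0.9, -0.9]                 # two candidate processes, both with memory 1
series = [1.0, 0.9, 0.81, 0.729]   # noiseless path of the phi = 0.9 process
print(posterior_weights(series[-2:], phis), posterior_weights(series, phis))
```

Both candidates have memory length 1, yet the four-point window puts more posterior mass on the true process than the two-point window does. That is the identification effect the abstract attributes to longer contexts.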


2606.01995 2026-06-02 cs.CL

CARTE: A Benchmark for Mapping Language Model Knowledge Across France

Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang, Michalis Vazirgiannis

Affiliations: École Polytechnique, Institut Polytechnique de Paris; National Technical University of Athens; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Introduces the CARTE benchmark, which uses 2,431 multiple-choice questions to evaluate LLMs' fine-grained regional knowledge across the 13 regions of France and 14 thematic domains, along with the CARTE-LV subset focused on linguistic variation. Experiments show performance differences across regions and model scales.

Abstract

We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.


2606.01993 2026-06-02 cs.CL cs.AI cs.LG

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

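Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

Affiliations: Nanjing University; Kuaishou Technology

AI summary: Proposes the MMG2Skill framework, which compiles multimodal, heterogeneous in-the-wild guides into editable skills and keeps improving them through trajectory-level root-cause feedback, significantly improving VLM agent performance on GUI control, open-ended gameplay, and strategic card play.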

Comments 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.


2606.01992 2026-06-02 cs.CV cs.AI cs.LG

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

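Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

Affiliations: Politecnico di Milano, AIRLab; S&H – Software & Hardware

AI summary: Introduces TGAD, a structured benchmark whose three scenarios progressively increase the functional role of language, to evaluate text guidance in multimodal anomaly detection systems. It finds that current systems are only superficially conditioned on language and that standard benchmarks overstate their capabilities.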

Abstract

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.


2606.01991 2026-06-02 cs.AI cs.CL cs.CY

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence; Institute for Artificial Intelligence, Peking University

AI summary: To address the power-seeking risk that LLM agents face as their action spaces expand, the paper proposes SafeMCP, a server-side defense plugin that uses an internal world model for look-ahead reasoning and implements a two-tier defense of proactive tool filtering and immediate intervention, reducing risk while preserving agent utility.

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

Abstract

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.


2606.01985 2026-06-02 cs.CV

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu

Affiliations: Apple; University of California, Los Angeles; University of Texas at Austin; Lambda, Inc

AI summary: Proposes MT-EditFlow, a flow-matching reinforcement learning framework that optimizes reward signals for multi-turn image editing, addressing the failures and error propagation of single-turn editing models in multi-turn interaction and significantly improving multi-turn editing performance.

Abstract

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing, the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
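The idea of broadcasting one aggregated advantage across a whole editing trajectory can be sketched as follows. This is a hedged illustration only: the abstract does not specify the aggregation or normalization, so averaging turn-level scores and applying a GRPO-style group normalization are assumptions, as is the function name `trajectory_advantages`.

```python
from statistics import mean, pstdev
from typing import List


def trajectory_advantages(turn_scores: List[List[float]],
                          eps: float = 1e-8) -> List[List[float]]:
    """Group-relative advantages, one value per trajectory, copied to every turn.

    turn_scores[i][t] is the reward-model score of trajectory i at turn t.
    Trajectory reward = mean over turns (assumed aggregation).
    Advantage = (reward - group mean) / (group std + eps).
    """
    rewards = [mean(scores) for scores in turn_scores]
    mu, sd = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sd + eps) for r in rewards]
    # Broadcast: every turn in a trajectory receives that trajectory's advantage.
    return [[a] * len(scores) for a, scores in zip(advantages, turn_scores)]


group = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
print(trajectory_advantages(group))
```

Under this scheme a turn that looks fine locally still receives a negative signal if its trajectory as a whole does poorly, which is one way to couple per-turn updates to multi-turn success.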


2606.01975 2026-06-02 cs.AI cs.SE

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

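Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges

Affiliations: German Aerospace Center (DLR), Institute of Software Technology, High-Performance Computing department

AI summary: Through a case study applying OpenEvolve to contraction order optimization in tensor networks, the paper examines LLM-based algorithm development, focusing on design factors such as LLM choice, evaluation metrics, and test instances. It highlights both the potential of verification-guided evolutionary coding agents and the importance and difficulty of the human scientist's role in evaluation, verification, and interpretation.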

Comments Submitted to the proceedings of the deRSE26 conference

2606.01973 2026-06-02 cs.LG cs.CV

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

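Zefeng Li, Evan Shelhamer

Affiliations: University of British Columbia and Vector Institute

AI summary: Through benchmarking and a new baseline, the paper shows that current open-set test-time adaptation methods fall short in balancing in-distribution accuracy against out-of-distribution detection.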

Comments TMLR 2026

Abstract

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
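A small numeric example shows why replacing softmax with per-class sigmoid outputs can help with OOD rejection. The logits below are hypothetical, and the sketch illustrates only the general property. It is not the paper's baseline, whose training details are not given in the abstract.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Multi-class probabilities: always sum to 1, so some class must win."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoids(logits: List[float]) -> List[float]:
    """Independent per-class probabilities: all of them can be low at once."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical logits for an input that matches no known class well.
ood_logits = [-3.0, -2.0, -4.0]
print(max(softmax(ood_logits)), max(sigmoids(ood_logits)))
```

Softmax normalizes away the fact that every logit is low and still reports a fairly confident top class, whereas the sigmoid outputs all stay small. A threshold on the maximum sigmoid score can therefore express "none of the known classes".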

2606.01970 2026-06-02 cs.RO cs.MA cs.SY eess.SY

Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions

Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions (translated from the Chinese title)

Luiz Giacomossi, Andrea Haglund, Claire Namatovu, Emily Zainali, Esaias Målqvist, Yonatan M. Beyene, Ivan Tomasic, Baran Çürüklü, Håkan Forsberg

Affiliations * KTH Royal Institute of Technology; Swedish Defence Research Agency

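AI summary: Proposes IRDS, a distributed coordination architecture that uses a reverse-auction market mechanism and a geometric consensus protocol to autonomously reallocate tasks when UAVs fail, maintaining a 93% mission success rate under 25% degradation.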

Comments 6 pages, 4 figures, accepted at MIPRO 2026

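AI abstract (translated from Chinese)

Reliable autonomous UAV swarms for search and rescue (SAR) missions need fault-tolerant coordination that can sustain operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework uses a reverse-auction market mechanism in which agents bid to service search sectors based on a distance-weighted cost function, combined with a geometric consensus protocol for target verification. We evaluate the approach in physics-based simulations (N=8 agents, 8x8 grid) with stochastic fault injection. The results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a 93% mission success rate under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested approach to self-healing coordination of aerial robots.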

English abstract

Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
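A reverse auction of the kind the abstract describes can be sketched as follows. This is a minimal illustration and not the IRDS implementation: the abstract does not give the cost function, so the `bid` formula (Euclidean distance plus a per-task load penalty), its weights, and all function names here are hypothetical. Each sector goes to the lowest bidder, and when an agent fails, only its orphaned sectors are re-auctioned among the survivors.

```python
import math


def bid(agent_pos, sector_pos, load, distance_weight=1.0, load_weight=0.5):
    """Hypothetical cost: weighted distance plus a penalty per sector already held."""
    return distance_weight * math.dist(agent_pos, sector_pos) + load_weight * load


def reverse_auction(agents, sectors, load=None):
    """Assign each sector to the lowest bidder.

    agents:  {agent_id: (x, y)}
    sectors: {sector_id: (x, y)}
    load:    optional {agent_id: sectors already held}
    Returns {sector_id: agent_id}. Ties are broken by agent id.
    """
    load = dict(load) if load is not None else {a: 0 for a in agents}
    assignment = {}
    for sector_id in sorted(sectors):
        pos = sectors[sector_id]
        winner = min(agents, key=lambda a: (bid(agents[a], pos, load[a]), a))
        assignment[sector_id] = winner
        load[winner] += 1
    return assignment


def replan_after_failure(agents, sectors, assignment, failed):
    """Re-auction only the failed agent's sectors among the surviving agents."""
    survivors = {a: p for a, p in agents.items() if a != failed}
    orphaned = {s: sectors[s] for s, a in assignment.items() if a == failed}
    kept = {s: a for s, a in assignment.items() if a != failed}
    load = {a: 0 for a in survivors}
    for a in kept.values():
        load[a] += 1
    kept.update(reverse_auction(survivors, orphaned, load))
    return kept


if __name__ == "__main__":
    agents = {"u1": (0, 0), "u2": (7, 7)}
    sectors = {"s1": (1, 0), "s2": (6, 7)}
    plan = reverse_auction(agents, sectors)
    print(plan)
    print(replan_after_failure(agents, sectors, plan, failed="u2"))
```

The sketch runs the auction centrally for brevity; a distributed version would have each agent compute its own bid and exchange bids over the network, and it omits the geometric consensus step used for target verification.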

2606.01967 2026-06-02 cs.CL

Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning

Training Prompts Matter: State-Adaptive Optimization for Robust Fine-Tuning (translated from the Chinese title)

Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao, Jinhao Dong, Pengfei Hu, Wei Lu, Xiaoyong Du

Affiliations * University of Science and Technology of China

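AI summary: Proposes State-Adaptive Prompt Optimization (SAPO), a strategy that turns the task formulation from a static input into a dynamic, state-adaptive variable, mitigating catastrophic forgetting and improving generalization, with significant performance gains on multiple benchmarks.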

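AI abstract (translated from Chinese)

While prompt engineering is essential for maximizing the capabilities of large language models (LLMs) at inference time, the role of prompts during training remains underexplored. Existing fine-tuning paradigms typically treat training prompts as surface forms, assuming that semantically equivalent instructions produce identical learning outcomes. However, we show that this equivalence is deceptive: although paraphrased prompts usually yield similar in-task performance, they induce drastically different cross-task effects on catastrophic forgetting and generalization. Crucially, these effects are positively correlated across tasks, indicating that there are superior prompts that consistently yield better performance. Moreover, we find that these superior prompts can be robustly identified by task loss before learning. Building on these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that turns the task formulation from a static input into a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness: it significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results offer insight into how training prompts shape learning dynamics and provide a practical recipe for robust fine-tuning. Our code is available at https://github.com/Eric8932/SAPO.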

English abstract

While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning. Our code is available at https://github.com/Eric8932/SAPO.
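The idea of choosing among paraphrased training prompts by task loss can be sketched in a few lines. This is not the SAPO algorithm from the paper or its repository: the abstract does not say how the loss is used, so selecting the lowest-loss paraphrase, re-selecting at every step, and the names `select_prompt`, `choose_prompts_over_training`, and `loss_at_step` are all assumptions made for illustration.

```python
def select_prompt(candidates, loss_fn):
    """Return the candidate prompt with the lowest loss.

    Preferring the lowest loss is an assumption; the abstract only says that
    superior prompts can be identified by task loss.
    """
    return min(candidates, key=loss_fn)


def choose_prompts_over_training(candidates, loss_at_step, steps):
    """Re-select the prompt at each step, so the choice tracks the model state.

    loss_at_step(step, prompt) is a hypothetical callback standing in for
    "task loss of the current model when the task is phrased with `prompt`".
    """
    chosen = []
    for step in range(steps):
        prompt = select_prompt(candidates, lambda p: loss_at_step(step, p))
        chosen.append(prompt)
        # A real loop would now run one update on examples formatted with `prompt`.
    return chosen


if __name__ == "__main__":
    paraphrases = ["Summarize the text.", "Write a short summary of the text."]
    # Toy loss table indexed by [step][prompt]; the values are made up.
    toy_losses = [
        {paraphrases[0]: 2.1, paraphrases[1]: 2.6},
        {paraphrases[0]: 1.9, paraphrases[1]: 1.4},
    ]
    print(choose_prompts_over_training(paraphrases, lambda s, p: toy_losses[s][p], steps=2))
```

In the toy run the preferred paraphrase changes between steps, which is what makes the prompt a state-dependent variable rather than a fixed input.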
