arXivDaily arXiv daily academic digest, updated Monday to Friday
2605.26340 2026-05-27 cs.AI cs.CL cs.MA

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

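AI summary Proposes the Chain-of-Evidence framework and the autonomous research system ScientistOne, which address verifiability failures through traceability and match or exceed human expert performance on multiple tasks.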

Comments Project website: https://scientist-one.github.io/

Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

2605.26339 2026-05-27 cs.LG cs.CL

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

Preetam Sharma, Kacper Dobek

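AI summary Proposes QAM-W, which combines L2 normalization, block-Hadamard rotation, and paired 2D coordinate quantization with activation-aware scaling; at about 5.5 bpw perplexity stays close to BF16, joint 2D coding outperforms polar coding, and quality is preserved in the 5-6 bpw range.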

Abstract

Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B--13B parameters) and eight quantized configurations, the activation-aware variant at $\approx 5.5$ bpw stays within $\pm 0.4\%$ of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at $32\%$ fewer weight bits. Joint 2D coding outperforms polar (amplitude $\times$ phase) coding by 2--15 pp $Δ$PPL at equal bitrate, and paired KL against BF16 tracks $Δ$PPL% at Spearman $ρ = 0.99$ across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5 bpw variant is competitive on quantization-tolerant architectures. At strict 4 bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5--6 bpw band.
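The row-wise steps named in this abstract (L2 normalization, block-Hadamard rotation, pairing into 2D coordinates, nearest-neighbour lookup in a single codebook) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the block size of 8, the 64-point codebook, the `sqrt(n)` gain that brings rotated coordinates to roughly unit variance, and the reading of "unit circular Gaussian" as a standard 2D Gaussian are all hypothetical choices, and activation-aware per-channel scaling is left out.

```python
import math
import random


def fwht(block):
    """Orthonormal Walsh-Hadamard transform; len(block) must be a power of two."""
    x = list(block)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in x]


def nearest(codebook, p):
    """Index of the codebook point closest to the 2D point p."""
    return min(range(len(codebook)),
               key=lambda i: (codebook[i][0] - p[0]) ** 2 + (codebook[i][1] - p[1]) ** 2)


def train_codebook(size=64, samples=2000, iters=10, seed=0):
    """Lloyd iterations (k-means) on samples from a standard 2D Gaussian (assumed)."""
    rng = random.Random(seed)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(samples)]
    codebook = data[:size]
    for _ in range(iters):
        acc = [[0.0, 0.0, 0] for _ in codebook]
        for p in data:
            cell = acc[nearest(codebook, p)]
            cell[0] += p[0]
            cell[1] += p[1]
            cell[2] += 1
        # Move each centre to the mean of its cell; keep empty cells unchanged.
        codebook = [(a[0] / a[2], a[1] / a[2]) if a[2] else c
                    for a, c in zip(acc, codebook)]
    return codebook


def encode_row(row, codebook, block=8):
    """Return (row norm, codebook indices); len(row) must be a multiple of block."""
    norm = math.sqrt(sum(w * w for w in row)) or 1.0
    unit = [w / norm for w in row]
    rotated = [v for i in range(0, len(unit), block) for v in fwht(unit[i:i + block])]
    gain = math.sqrt(len(row))  # hypothetical rescaling to roughly unit variance
    return norm, [nearest(codebook, (rotated[i] * gain, rotated[i + 1] * gain))
                  for i in range(0, len(rotated), 2)]


def decode_row(norm, indices, codebook, block=8):
    """Invert encode_row; the orthonormal transform is its own inverse."""
    n = 2 * len(indices)
    gain = math.sqrt(n)
    rotated = [c / gain for i in indices for c in codebook[i]]
    unit = [v for i in range(0, n, block) for v in fwht(rotated[i:i + block])]
    return [u * norm for u in unit]
```

In this sketch a 64-point codebook costs 6 bits per coordinate pair, that is 3 bits per weight before the per-row norm is counted. Calling `encode_row(row, train_codebook())` followed by `decode_row` gives a lossy reconstruction whose error can be compared against a scalar quantizer at the same bit budget.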

2605.26333 2026-05-27 cs.AI

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

Polychronis Karpodinis, Dimitris Kalles

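AI summary To address the uncertainty of LLM-generated laboratory procedures, proposes a prototype framework that extracts candidate procedural rules from structured domain representations and uncertain state-transition samples, turns them into explicit constraints, and repairs uncertain steps, with the aim of making virtual laboratory planning more reliable.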

Abstract

Educational virtual laboratories can make experimental training more scalable, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large language models can assist in this authoring process by generating detailed experimental procedures, but their output should not be treated as directly executable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational virtual laboratories, the underlying problem is more general: managing uncertain procedural knowledge for action planning in structured interactive environments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.

2605.26332 2026-05-27 cs.CV cs.AI

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

Arian Komaei Koma, Seyed Amir Kasaei, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban

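AI summary Proposes BEAP, a black-box, embedding-aware adversarial prompting attack that uses a large language model to iteratively generate effective adversarial prompts that recover unlearned concepts, improving attack success rate by more than 60%.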

Abstract

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.

2605.26330 2026-05-27 cs.RO

NightSight: Passive Computation for Navigation in Dark Using Events

Deepak Singh, Brijan Vaghasiya, Shreyas Khobragade, Nitin Sanket

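AI summary Proposes a lightweight perception approach combining a monocular event camera, a coded aperture lens, and an infrared dot projector; a convolutional neural network decodes depth-dependent blur signatures into dense depth maps, enabling real-time navigation for small aerial robots in complete darkness.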

Comments 6 pages, 7 figures

Abstract

Small aerial robots are particularly well-suited for search and rescue in confined and hazardous environments due to their agility, low cost, and ability to traverse cluttered spaces that are inaccessible to larger platforms. However, enabling autonomous navigation in complete darkness remains a significant challenge, because small aerial robots cannot easily accommodate perception systems that demand substantial payload, power, or computation. In this work, we present a lightweight perception approach that combines a monocular event camera, a coded aperture lens, and an infrared dot projector to enable navigation in such conditions. The projected pattern, when imaged through the coded aperture, produces depth-dependent blur signatures that implicitly encode scene geometry. We train a convolutional neural network to decode these signatures into dense depth maps using only synthetic data generated from a simple planar wall setup. Despite this minimal training regime, the model generalizes zero-shot to complex real-world scenes. Our system operates in real time at 20 Hz on an NVIDIA Jetson Orin Nano, demonstrating suitability for resource-constrained platforms. We further analyze the impact of different coded aperture designs on depth estimation performance. Our approach achieves high accuracy (L1 error of 7.0 cm, 2.80% error) up to a range of 2.5 m. These results highlight the potential of combining structured illumination, coded optics, and event-based sensing for enabling robust perception and navigation in complete darkness.

2605.26329 2026-05-27 cs.AI

JobBench: Aligning Agent Work With Human Will

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran

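AI summary Proposes the JobBench benchmark, which evaluates AI agents on workflows that experts identify as high-priority and centers on human needs rather than economic value. It covers 130 tasks across 35 occupations and grades outputs with a fact-anchored chain of rubrics; the strongest model reaches only 45.9%. The goal is to shift the targeted labour-market effect from replacement to enhancement.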

Abstract

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

2605.26328 2026-05-27 cs.CV

RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields

Chuhan Chen, Tianshu Huang, Akarsh Prabhakara, Chaithanya Kumar Mummadi, Zhongxiao Cong, Anthony Rowe, Matthew O'Toole, Deva Ramanan

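AI summary Proposes RadarSim, a unified differentiable renderer that leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field, addressing the low spatial resolution of radar and producing sharper geometry and Doppler range frames than radar-only reconstructions.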

Comments Accepted to 3DV 2026. Project website: https://sally-chen.github.io/radar-sim/

Abstract

Radars are an ideal complement to cameras: both are inexpensive, solid-state sensors, with cameras offering fine angular resolution, while radars provide metric depth and robustness under adverse weather. However, radar data is more difficult to interpret than camera images and varies significantly between sensors, necessitating increased reliance on simulation for prototyping sensors and processing pipelines. Recent work treating radar reconstruction as a novel view synthesis problem has shown great promise in reconstructing radar-relevant geometry and simulating low-level radar data. However, such methods are constrained by the low spatial resolution of the underlying radar. To address this, we propose a unified differentiable renderer, RadarSim, which leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field. Using a novel data set of calibrated radar camera recordings from a custom hand-held rig, we demonstrate that RadarSim produces sharper geometry and Doppler range frames than radar-only reconstructions.

2605.26327 2026-05-27 cs.LG

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin

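AI summary Reparametrizes the preconditioner so that only part of the basis is updated in a subspace via QR decomposition and BFloat16 storage is supported, reducing the compute and memory overhead of Shampoo-based methods and mitigating the performance degradation caused by low-precision storage.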

Comments Preprint, work in progress

Abstract

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain computationally expensive, these methods become time- and memory-intensive when their preconditioning matrices are large. Moreover, using BFloat16 (BFP16) storage to reduce memory usage can degrade the performance of Shampoo-based methods. We propose a reparametrization of the preconditioner that supports BFP16 storage and forms a complete basis by combining updated basis vectors with unchanged ones. By updating only part of the basis through QR decomposition in a subspace, our approach reduces computational overhead while mitigating the performance degradation caused by BFP16 storage. Our approach applies broadly to Shampoo-based methods that employ QR decomposition, including KL-Shampoo, SOAP, and KL-SOAP. In particular, it improves the performance of SOAP and KL-SOAP under BFP16 storage, enabling KL-SOAP to match or exceed KL-Shampoo. Overall, our approach makes Shampoo-based methods more memory- and time-efficient.

2605.26324 2026-05-27 cs.LG cs.AI cs.NA math.NA

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

Lennon J. Shikhman

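AI summary Proposes normalized semigroup error as a diagnostic for the consistency of temporal composition and long-horizon rollout in learned physics simulators; experiments on heat and Burgers dynamics show that it is positively associated with rollout degradation.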

Comments 10 pages, 3 figures, 3 tables. Accepted to the AI4Physics Workshop at the 43rd International Conference on Machine Learning

Abstract

Learned physics simulators are often evaluated by one-step or short-horizon prediction error, but these metrics can miss failures in temporal composition and long-horizon rollout. For autonomous, state-complete systems, exact solution maps satisfy a semigroup law: direct evolution over $s+t$ should agree with evolution over $s$ followed by $t$. We propose normalized semigroup error as a post hoc, model-agnostic diagnostic comparing these direct and composed learned predictions. On one-dimensional heat and Burgers dynamics with time-conditioned ConvNet and FNO baselines, semigroup error is positively associated with rollout degradation, with trajectory-level Spearman correlation $ρ = 0.635$ and $95\%$ CI $[0.621, 0.649]$. Semigroup regularization has mixed effects, supporting semigroup consistency primarily as an evaluation diagnostic rather than a universally beneficial training objective.
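The semigroup law described in this abstract can be made concrete with a toy example. The sketch below is an illustration, not the paper's code: it represents a 1D heat-equation state by a few Fourier mode amplitudes, uses the exact decay map as one "simulator" and a single explicit-Euler step as a stand-in for an imperfect learned model, and compares direct evolution over `s + t` with composed evolution. The L2 normalization by the direct prediction, the diffusivity `nu`, and all function names are assumptions; the paper's exact normalization may differ.

```python
import math


def exact_heat_step(u, t, nu=0.1):
    """Exact heat flow on Fourier amplitudes: mode k decays as exp(-nu * k^2 * t)."""
    return [a * math.exp(-nu * k * k * t) for k, a in enumerate(u, start=1)]


def euler_heat_step(u, t, nu=0.1):
    """One explicit-Euler step of size t; stands in for an imperfect learned model."""
    return [a * (1.0 - nu * k * k * t) for k, a in enumerate(u, start=1)]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalized_semigroup_error(step, u, s, t, eps=1e-12):
    """Relative gap between step(u, s + t) and step(step(u, s), t)."""
    direct = step(u, s + t)
    composed = step(step(u, s), t)
    gap = [d - c for d, c in zip(direct, composed)]
    return l2_norm(gap) / (l2_norm(direct) + eps)
```

With `u0 = [1.0, 0.5, 0.25]`, `normalized_semigroup_error(exact_heat_step, u0, 0.3, 0.2)` is zero up to rounding, while the Euler stand-in gives a clearly nonzero value. Note that the diagnostic needs no ground-truth trajectory, only the model's own direct and composed predictions.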

2605.26322 2026-05-27 cs.AI

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah

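AI summary Proposes the OmniToM benchmark, which evaluates how well LLMs track the mental states of different actors in a narrative through explicit belief structures (two stages: belief extraction and belief labeling), revealing bottlenecks in knowledge-access and representational decisions.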

Comments 30 pages, 8 figures, 19 tables; includes appendix

Abstract

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor's mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.

2605.26321 2026-05-27 cs.AI

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Maksim Ivanov, Abhijay Rana

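AI summary Proposes the Anchor pipeline, which uses constraint optimization programs to jointly generate the instruction, environment, ground-truth solution, and verifier, addressing artifact drift in benchmark generation, and builds the ERP-Bench benchmark to evaluate frontier models.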

Comments Accepted to RLEval '26 (Workshop at ACM Conference on AI and Agentic Systems 2026)

Abstract

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai

2605.26320 2026-05-27 cs.LG cs.CL

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

Sai Munikoti, Ian Stewart, Chengping Chai, Lisa Linville, Scott Vasquez, Sameera Horawalavithana, Karl Pazdernik

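AI summary To address the lack of multimodal data integration in seismology, builds MultiSeismo, a structured multimodal dataset of over 16K seismic events, and develops the domain-specific multimodal model SeisModal, which achieves superior performance on cross-modal seismic reasoning tasks.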

Abstract

The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of time-series waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large-scale structured multimodal seismic dataset comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event record integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set on top of the raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross-modal analysis. We leverage MISCE to finetune an existing multimodal model (Unified IO 2) enhanced with a specialized time-series encoder, which yields SeisModal, the first domain-specific multimodal model for comprehensive seismic analysis. Evaluation of state-of-the-art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general-purpose models, while demonstrating SeisModal's superior performance on seismic multimodal reasoning tasks. These results prove that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain-specific architectural adaptations.

2605.26316 2026-05-27 cs.CV cs.AI

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Qiao Gu, Lingni Ma, Adam W Harley, Richard Newcombe, Florian Shkurti, Julian Straub

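AI summary Proposes E$^3$C, a controllable video diffusion framework that uses a 3D point-cloud memory and two-channel human control (ego and exo) to generate physically consistent egocentric video.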

Comments Preprint. Project Page: https://e3c-videogen.github.io/

Abstract

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

2605.26315 2026-05-27 cs.LG cs.AI

Curriculum Learning for Safety Alignment

Sandeep Kumar, Virginia Smith, Chhavi Yadav

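AI summary Proposes Staged-Competence, a curriculum-learning framework that improves the robustness of DPO safety alignment through difficulty-graded preference data and progressive reference-model updates; averaged across three model families, it reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%.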

Comments Accepted at the ICML 2026 GlobalSouthML Workshop

Abstract

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.
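Competence-based sampling, one of the three ingredients listed in this abstract, can be illustrated with a short sketch. The abstract does not specify the schedule or the difficulty score, so everything here is an assumption: the square-root competence schedule is a common choice in the curriculum-learning literature, the initial competence `c0 = 0.1` is arbitrary, and `competence` and `sample_batch` are hypothetical names. The progressive reference-model update is not modelled.

```python
import math
import random


def competence(step, total_steps, c0=0.1):
    """Square-root schedule (assumed): fraction of the difficulty range unlocked."""
    return min(1.0, math.sqrt(step * (1.0 - c0 ** 2) / total_steps + c0 ** 2))


def sample_batch(examples, step, total_steps, batch_size, rng):
    """Sample uniformly from the easiest examples allowed at this step.

    examples: list of (difficulty, item) tuples; lower difficulty means easier.
    """
    ranked = sorted(examples, key=lambda e: e[0])
    cutoff = max(1, math.ceil(competence(step, total_steps) * len(ranked)))
    pool = ranked[:cutoff]
    return [rng.choice(pool) for _ in range(batch_size)]
```

Early in training the pool holds only the easiest preference pairs; by `step == total_steps` the whole dataset is eligible. For example, `sample_batch(data, 0, 100, 32, random.Random(0))` draws from roughly the easiest tenth of `data`.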

2605.26302 2026-05-27 cs.AI cs.CL cs.MA

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

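AI summary Proposes the AgingBench benchmark, which uses four aging mechanisms and diagnostic tools to assess the reliability degradation of deployed agents, and argues that lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair are needed.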

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8-200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

2605.26295 2026-05-27 cs.CV

Sleep-stage efficient classification using a lightweight self-supervised model

使用轻量级自监督模型的睡眠阶段高效分类

Eldiane Borges dos Santos Durães, João Batista Florindo

AI总结 本研究通过简化mulEEG自监督模型并结合线性SVM分类器,实现了高效准确的睡眠阶段分类。

详情
Journal ref
Proceedings VISAPP 2025, 972-979 (2025)
AI中文摘要

睡眠阶段的准确分类对于诊断睡眠障碍至关重要,自动化该过程可以显著增强临床评估。本研究旨在探索使用自监督模型(具体为mulEEG的改编版本)结合线性SVM分类器来改进睡眠阶段分类。方法:mulEEG模型以自监督方式学习脑电图信号表示,本文通过将ResNet-50替换为ResNet-18主干网络(使用1D卷积作为时间序列编码器)对其进行了简化。还进行了另外两项改编:第一项评估了模型的不同配置和训练数据量,第二项测试了时间序列特征、频谱图特征及其拼接作为线性SVM分类器输入的有效性。结果:结果显示,与简化模型相比,减少数据量提供了更好的成本效益比。使用ResNet-18的拼接特征也优于原始mulEEG模型的线性评估,实现了更高的分类性能。结论:简化mulEEG模型以提取特征,并将其与稳健的分类器配对,可实现更高效、更准确的睡眠阶段分类。该方法有望改善临床睡眠评估,并可扩展到其他生物信号分类任务。

英文摘要

Accurate classification of sleep stages is crucial for diagnosing sleep disorders, and automating this process can significantly enhance clinical assessments. This study aims to explore the use of a self-supervised model (more specifically, an adapted version of mulEEG) combined with a Linear SVM classifier to improve sleep stage classification. Methods: The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing the ResNet-50 with 1D convolutions, used as the time-series encoder, with a ResNet-18 backbone. Two other adaptations were conducted: the first one evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time series features, spectrogram features, and their concatenation as inputs to a Linear SVM classifier. Results: The results showed that reducing the volume of data offered a better cost-benefit ratio compared to simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. Conclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.

2605.26294 2026-05-27 cs.CV

CNNs, Transformers, Hybrid, and Vision Language Models for Skin Cancer Detection

用于皮肤癌检测的CNN、Transformer、混合模型和视觉语言模型

Durjoy Dey, Yuhong Yan, Hassan Hajjdiab

AI总结 本文在PAD-UFES-20数据集上统一评估了12种深度学习模型(包括CNN、ViT、混合卷积Transformer和视觉语言模型),结果表明混合模型和基于SigLIP的VLM在排名性能和临床相关操作点之间取得了最佳平衡。

Comments 13 pages, 3 figures, accepted at ICPRAI 2026, The Fifth International Conference on Pattern Recognition and Artificial Intelligence. To appear in Lecture Notes in Computer Science

详情
AI中文摘要

皮肤癌是一种常见且快速增长的恶性肿瘤,全球范围内发病率不断上升。早期检测对于改善预后至关重要。基于皮肤镜和临床图像训练的深度学习模型可以支持自动化和快速分诊。然而,许多研究仅评估了有限的架构,且不同研究的实验设置也各不相同。在本文中,我们在PAD-UFES-20数据集上对十二种深度学习模型进行了统一的二分类皮肤癌检测评估。这些模型涵盖四个家族:卷积神经网络(CNN)、视觉Transformer(ViT)、混合卷积Transformer骨干网络和视觉语言模型(VLM)。性能评估使用AUC、最大F1分数及其精确率和召回率,以及在80%特异性下的灵敏度,以反映筛查导向的需求。我们的结果表明,调优良好的CNN已经提供了强大的基线,但基于Transformer的家族持续改善了区分能力。混合模型(MaxViT Tiny、CoAtNet0)和基于SigLIP的VLM在排名性能和临床相关操作点之间实现了最佳整体权衡,而基于CLIP的模型提供了高精确率。所有实验的完整代码库已公开发布。这些发现共同为皮肤癌筛查中实际部署最合适的模型家族提供了实用指导,并为未来在PAD-UFES-20上的工作建立了可重复的参考点。

英文摘要

Skin cancer is a common and fast-rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures. Experimental setups also vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution-transformer backbones, and vision-language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening-oriented requirements. Our results show that well-tuned CNNs already provide strong baselines, but transformer-based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP-based VLM achieve the best overall trade-off between ranking performance and clinically relevant operating points, while the CLIP-based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real-world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.
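
The "sensitivity at 80% specificity" operating point mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name, the thresholding rule (predict positive when score >= threshold) and the toy labels and scores are all assumptions made for illustration.

```python
def sensitivity_at_specificity(labels, scores, target_specificity=0.80):
    """Return the best sensitivity among thresholds whose specificity >= target.

    labels: 1 = malignant (positive), 0 = benign (negative).
    scores: higher means "more likely positive".
    A sample is predicted positive when score >= threshold (an assumption).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    best = 0.0
    # Candidate thresholds: every observed score, plus one above the maximum
    # (which predicts everything negative, giving specificity 1.0).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    for t in candidates:
        specificity = sum(s < t for s in negatives) / len(negatives)
        sensitivity = sum(s >= t for s in positives) / len(positives)
        if specificity >= target_specificity:
            best = max(best, sensitivity)
    return best


# Hypothetical toy data: five benign and five malignant cases.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9, 0.95]
result = sensitivity_at_specificity(labels, scores)
print(result)
```

Fixing specificity first and then reading off sensitivity matches a screening setting, where the false-alarm budget is chosen up front and models are compared on how many true cases they catch within it.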

2605.26293 2026-05-27 cs.CL cs.AI

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

CroCo: 基于自生成结果的跨语言对比偏好调优

Mike Zhang, Ali Basirat, Desmond Elliott

AI总结 本文提出CroCo方法,利用英语偏好训练的奖励模型对多语言自生成结果进行对比偏好调优,无需语言特定偏好标注,在14种高低资源语言上提升模型性能,并避免灾难性遗忘。

详情
AI中文摘要

先前工作证实,通过奖励分数设置的大语言模型自生成结果之间的受控对比性,可以改善英语中的下游偏好调优。我们将此方法扩展到多种语言,并在总共14种高资源和低资源语言上,对两个模型在一系列多样化任务上进行评估。我们的核心发现是,跨语言对比偏好调优(CroCo)无需语言特定的偏好标注即可迁移。基于英语偏好(在多语言基础模型之上)训练的奖励模型,在大多数语言中产生了有用的语言内排名,并且在单语或多语设置中进行配对,在大多数设置上改进了每个模型,同时防止了监督微调的灾难性遗忘。我们观察到,这些增益需要基于策略的数据。非策略响应减少了收益,而在线偏好优化未能优于离线变体。具体来说,在结构化任务上,我们的方法在EuroLLM-9B的6/7种语言和Aya-3B的4/7种设置中匹配或超过了基础模型。在开放式生成中,两个调优模型在11种评估语言中均优于各自的基础模型。总体而言,我们展示了多语言偏好调优的有前景的方向。

英文摘要

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high- and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.

2605.26289 2026-05-27 cs.LG

Stateful Inference for Low-Latency Multi-Agent Tool Calling

面向低延迟多智能体工具调用的有状态推理

Victor Norgren

AI总结 提出一种有状态推理架构,通过持久化KV缓存和增量处理,将多智能体工具调用的每轮成本从O(n_t)降至O(Δ_t),在6轮和35轮工作流中分别实现2.1倍和4.2倍的加速。

详情
AI中文摘要

多智能体工具调用正成为基于LLM系统的主要交互模式,但现有推理框架将每次工具调用视为独立请求,从头重新处理整个对话,尽管85-95%的提示与上一轮相同。我们提出一种有状态推理架构,将传统服务的每轮O(n_t)成本转换为仅增量O(Δ_t)成本:持久KV缓存跨轮次存在,仅通过摄入新令牌前进,而基数前缀缓存将其扩展到交错的多智能体流量,提示查找推测解码器加速结构化输出。在针对新颖、完全生成的工作负载的测试中,与vLLM和SGLang相比,参考实现在6轮智能体工作流中每轮快2.1倍,在35轮工作流的中位数轮次中快4.2倍,端到端挂钟时间减半。优势来自有状态重用和推测,而非缓存。

英文摘要

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(Δ_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
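
The difference between the O(n_t) and O(Δ_t) per-turn costs can be made concrete with an idealized token count. The sketch below is only an accounting exercise under stated assumptions: the function name and the 6-turn workload sizes are hypothetical, generated tokens are ignored, and it does not model the radix prefix cache, speculative decoding, or the wall-clock speedups reported in the abstract.

```python
def prefill_tokens(turn_deltas, stateful):
    """Count prompt tokens the server must process across a conversation.

    turn_deltas[t] is the number of NEW tokens appended at turn t
    (for example a tool result plus the next instruction).
    """
    processed = 0
    context_len = 0  # n_t: total prompt length after turn t
    for delta in turn_deltas:
        context_len += delta
        # Stateless serving re-processes the whole prompt: O(n_t) per turn.
        # Stateful serving keeps the KV cache, so only the delta is ingested.
        processed += delta if stateful else context_len
    return processed


# Hypothetical 6-turn workflow: a 2000-token first prompt, then 150 new tokens per turn.
deltas = [2000] + [150] * 5
stateless_total = prefill_tokens(deltas, stateful=False)
stateful_total = prefill_tokens(deltas, stateful=True)
print(stateless_total, stateful_total)
```

In this toy workload the stateless total grows roughly quadratically with the number of turns while the stateful total grows linearly, which is why the gap would be expected to widen on longer workflows.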

2605.26287 2026-05-27 cs.CV

A multifractal-based masked auto-encoder: an application to medical images

基于多重分形的掩码自编码器:在医学图像中的应用

Joao Batista Florindo, Viviane de Moura

AI总结 提出一种利用多重分形测度(Renyi熵)优化掩码策略的掩码自编码器(MO-MAE),通过聚焦高复杂度区域提升医学图像分类性能。

详情
Journal ref
Proceedings VISAPP 2025, 769-776 (2025)
AI中文摘要

掩码自编码器(MAE)在医学图像分类中显示出巨大潜力。然而,传统MAE采用的随机掩码策略可能忽略医学图像中的关键区域,而这些区域中即使微小的变化也可能指示疾病。为解决这一局限性,我们提出了一种利用多重分形测度(Renyi熵)优化掩码策略的新方法。我们的方法称为多重分形优化掩码自编码器(MO-MAE),它采用多重分形分析来识别高复杂度和信息量丰富的区域。通过将掩码过程聚焦于这些区域,MO-MAE确保模型学习重建最具诊断相关性的特征。这种方法对医学成像特别有益,因为精细检查组织结构对于准确诊断至关重要。我们在涵盖多种疾病的多个医学数据集上评估了MO-MAE,包括MedMNIST和COVID-CT。我们的结果表明,MO-MAE取得了有前景的性能,超越了其他基线和最先进的模型。由于所提出的测度计算简单,该方法还增加了最小的计算开销。我们的发现表明,多重分形优化的掩码策略增强了模型捕获和重建复杂组织结构的能力,从而实现了更准确和高效的医学图像表示。所提出的MO-MAE框架为提高医学图像分析中深度学习模型的准确性和效率提供了一个有前景的方向,可能推动计算机辅助诊断领域的发展。

英文摘要

Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other baseline and state-of-the-art models. The proposed method also adds minimal computational overhead, as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model's ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
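
As a rough illustration of entropy-guided masking, the sketch below scores each patch by the Rényi entropy of its intensity histogram, H_α = log(Σ p_i^α) / (1 − α), and masks the highest-scoring patches. The abstract does not give the exact multifractal measure or masking rule used by MO-MAE, so the histogram binning, the choice α = 2, the mask ratio and both function names are assumptions.

```python
import math
from collections import Counter


def renyi_entropy(values, alpha=2.0, bins=8, max_value=256):
    """Rényi entropy (in nats) of the intensity histogram of one patch."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit; use alpha != 1 here")
    width = max_value / bins
    counts = Counter(min(int(v // width), bins - 1) for v in values)
    total = len(values)
    power_sum = sum((c / total) ** alpha for c in counts.values())
    return math.log(power_sum) / (1.0 - alpha)


def select_patches_to_mask(patches, mask_ratio=0.5, alpha=2.0):
    """Return indices of the highest-entropy patches (ties broken by index)."""
    n_mask = int(len(patches) * mask_ratio)
    ranked = sorted(range(len(patches)),
                    key=lambda i: (-renyi_entropy(patches[i], alpha), i))
    return sorted(ranked[:n_mask])


# Hypothetical flattened 4x4 patches with 8-bit intensities.
patches = [
    [10] * 16,                  # flat patch: a single histogram bin
    [0, 255] * 8,               # two equally likely bins
    list(range(0, 256, 16)),    # spread evenly over all 8 bins
    [100] * 15 + [200],         # almost flat
]
masked = select_patches_to_mask(patches)
print(masked)
```

A histogram-based score like this is cheap to compute per patch, which is consistent with the abstract's remark that the measure adds little overhead, though the real method may compute it differently.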

2605.26285 2026-05-27 cs.LG cs.NA math.NA

Two-Parameter Flows for Learning Population Dynamics of Physical Systems

用于学习物理系统群体动力学的双参数流

Paul Schwerdtner, Tobias Blickhan, Benjamin Peherstorfer

AI总结 提出双参数流方法,通过从基础分布到每个边际的采样时间传输学习高维概率密度动力学,并利用耦合合成轨迹回归提取物理时间速度,无需轨迹信息即可处理旋转等非梯度动力学。

详情
AI中文摘要

本文解决了在无标签样本且不假设轨迹信息的情况下,学习高维概率密度随时间演化的动力学问题。我们引入了双参数流,仅学习从基础分布到每个边际的采样时间传输,然后通过回归耦合的合成轨迹提取物理时间速度。我们证明了所得的物理时间动力学是唯一的,并且继承了采样时间传输的正则性。由于我们可以利用标准且成熟的条件流匹配技术来学习基础到边际的传输,我们的方法可扩展到高维,避免了每步最优传输耦合,同时允许可解释旋转或循环物理现象的非梯度动力学。

英文摘要

This work addresses the problem of learning the dynamics of high-dimensional probability densities over time using unlabeled samples, without assuming access to trajectory information. We introduce two-parameter flows that learn only sampling-time transports from a base distribution to each marginal and then extract a physics-time velocity by regressing on coupled synthetic trajectories. We prove that the resulting physics-time dynamics are unique and inherit regularity from the sampling-time transports. Because we can build on standard, well-developed conditional flow matching techniques for learning the base-to-marginal transports, our approach scales to high dimensions and avoids per-step optimal-transport couplings, while allowing admissible non-gradient dynamics that can naturally explain rotational or circulating physics phenomena.

2605.26284 2026-05-27 cs.RO

PhyPush: One Push is All You Need for Sensorless Physical Property Estimation with Physics-Guided Transformers

PhyPush:基于物理引导的Transformer,一次推动即可实现无传感器物理属性估计

Koyo Fujii, Luis Figueredo, Praminda Caleb-Solly, Ivan Boschi, Edoardo Ida', Marco Carricato, Aly Magassouba

AI总结 提出PhyPush框架,利用物理引导的Transformer从单次推动的末端执行器速度估计物体质量和摩擦系数,通过牛顿第二定律和库仑摩擦模型约束提升物理一致性和泛化能力。

Comments Submitted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems

详情
AI中文摘要

准确估计物体质量和摩擦是实现可靠自适应机器人操作的基础。尽管交互感知为推断此类属性提供了强大机制,但现有方法大多依赖力/力矩传感器、触觉阵列或多相机运动捕捉系统等专用硬件,限制了可扩展性和部署。本文提出PhyPush,一种物理引导的Transformer框架,仅使用单次推动中运动学推导的末端执行器速度来估计物体的质量和摩擦系数。这通常需要标准机械臂上可用的数据。该模型通过物理引导损失融入牛顿第二定律和库仑摩擦模型的约束,提高了物理一致性以及对未见物体和表面的泛化能力。在多样化的仿真和真实世界设置中,PhyPush在具有挑战性的域外条件下始终能实现更准确的质量和摩擦估计。在仿真中,与能够获取完整力信息的基线相比,误差降低超过10%;在真实世界实验中,其表现优于数据驱动损失方法。总体而言,结果表明物理引导学习能够仅依赖单次推动和现成的运动学数据,实现低成本、传感器高效的物理属性估计。

英文摘要

Accurately estimating object mass and friction is fundamental to achieving reliable and adaptive robotic manipulation. Although interactive perception provides a powerful mechanism for inferring such properties, most existing approaches depend on specialized hardware such as force/torque sensors, tactile arrays, or multi-camera motion-capture systems, limiting scalability and deployment. This paper presents PhyPush, a physics-guided Transformer framework that estimates an object's mass and friction coefficient using only kinematically derived end-effector velocity from a single push. This requires only data that is typically available on standard robotic arms. The model incorporates constraints from Newton's second law and the Coulomb friction model through a physics-guided loss, improving physical consistency and generalization to unseen objects and surfaces. Across diverse simulation and real-world setups, PhyPush consistently achieves more accurate mass and friction estimation in challenging out-of-domain conditions. In simulation, it reduces error by over 10% compared with a baseline that has privileged access to full force information, while in real-world experiments, it outperforms a data-driven loss approach. Overall, the results demonstrate that physics-guided learning can enable low-cost, sensor-efficient estimation of physical properties, relying solely on a single push and readily available kinematic data.

2605.26283 2026-05-27 cs.CV cs.LG

Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening

卷积、Transformer、混合模型及视觉语言模型在多病种视网膜筛查中的基准测试

Durjoy Dey, Aymane Ajbar, Yuhong Yan

AI总结 本研究在RFMiD数据集上对四种模型家族的12种架构进行基准测试,评估其在多病种视网膜筛查中的性能,发现基于注意力的模型(如SwinTiny、CoAtNet0、MaxViTTiny)在二元筛查和多标签分类中表现最佳,视觉语言模型与CNN基线相当但未超越最优Transformer和混合模型。

Comments 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings

详情
AI中文摘要

现代深度学习为自动化视网膜筛查提供了强大工具,但在现实多病种设置和领域偏移下,不同视觉模型家族的比较仍不明确。本研究使用视网膜眼底多病种图像数据集(RFMiD),对四种模型家族(卷积神经网络、视觉Transformer、混合CNN-Transformer骨干网络和视觉语言模型)的12种架构进行基准测试。我们评估两个任务:任何视网膜疾病的二元筛查和28个疾病类别的多标签分类。通过标准化训练、校准和评估协议,我们报告了在特异性接近80%的临床相关操作点下的AUC、F1、精确率、召回率和灵敏度。在RFMiD上,所有架构在二元筛查中表现良好,AUC均高于84%,但基于注意力的模型表现最佳。SwinTiny以及混合模型CoAtNet0和MaxViTTiny在二元筛查中取得最强结果,并在多标签设置中提高了宏F1和微F1。视觉语言模型(包括CLIP ViT-B/16和SigLIP-Base384)与CNN基线相当,但未超越最优Transformer和混合骨干网络。在Messidor-2上对可转诊糖尿病视网膜病变进行外部验证时,AUC范围为66.8%至84.7%,混合模型和Transformer模型再次表现出强劲性能。这些结果为多病种视网膜筛查中的模型选择提供了可重复的参考,并指导未来用于临床部署的自动化筛查工具。

英文摘要

Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.

2605.26282 2026-05-27 cs.LG

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

通过扩散策略优化扩展世界模型强化学习

Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu

AI总结 针对世界模型强化学习中搜索与价值学习之间的结构错位问题,提出基于扩散策略优化的模型基方法MBDPO,统一搜索与策略优化,实现可扩展的策略学习。

详情
AI中文摘要

基于模型的强化学习可以通过使用世界模型在大规模下得到有效支持。然而,在实践中,扩展此类方法仍然受到根本性限制。一个普遍公认的挑战是模型偏差和误差累积,这会降低长期预测的质量。除了这些问题,我们识别出一个更关键但尚未充分探索的瓶颈:现有世界模型方法中搜索与价值学习之间的结构错位。特别是,策略改进通常依赖于由独立的非搜索策略诱导的价值函数,导致训练不一致并最终产生次优学习。为了解决这一限制,我们在世界模型中提出基于模型的扩散策略优化(MBDPO),该框架通过扩散策略表示统一搜索和策略优化,从而释放世界模型在可扩展策略学习中的潜力。我们不在学习到的世界模型上构建显式规划器,而是将策略优化重新表述为潜在世界模型中搜索轨迹上的扩散过程。从这个视角,我们从收集的数据集中提取一个隐式能量函数来锚定策略,使MBDPO能够细化用于策略优化的分数场,同时缓解错位问题。我们在多种设置下评估MBDPO,包括多任务离线预训练、在线学习以及离线到在线微调。在离线场景中,我们进一步通过在大规模数据集上预训练来研究其扩展行为,观察到随着模型容量增加,性能持续单调提升。

英文摘要

Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.

2605.26279 2026-05-27 cs.AI cs.CE

Constraint acquisition needs better benchmarks

约束获取需要更好的基准测试

Rafał Stachowiak, Tomasz P. Pawlak

AI总结 针对约束获取(CA)和数学规划(MP)模型验证与增强研究缺乏合适基准的问题,提出MPMMine基准套件,通过统一结构、开放格式和多样化数据支持算法评估。

Comments 12 pages, 1 figure, for the associated dataset, see https://github.com/MPMMine/MPMMine

详情
AI中文摘要

约束获取(CA)及基于领域知识工件对数学规划(MP)模型进行验证和增强的相关研究,目前因缺乏合适的基准而受到限制。这一缺陷阻碍了可重复性和跨研究可比性,减缓了CA方法的成熟。现有基准是为求解器评估而非CA算法评估而设计的。它们组织松散,对单个问题的处理不一致,并且省略了CA方法所需的领域知识工件。本工作提出了MPMMine,一个旨在评估使用多样化领域知识工件发现、验证和增强MP模型的算法的基准套件。MPMMine以一致性、标准化、完整性、可扩展性、开放性和版本控制为指导。它采用统一结构并依赖开放格式:MiniZinc、CommonMark和JSON。它为每个问题提供多个模型,每个模型提供数十个实例,以及整数和连续域中的数千个解和非解,同时附带自然语言描述以支持文本到模型方法。

英文摘要

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by inadequate benchmarks. This deficiency impedes reproducibility and cross-study comparability, slowing the maturation of CA methods. Existing benchmarks were designed for solver evaluation rather than for assessing CA algorithms. They are loosely organized, treat individual problems inconsistently, and omit the domain knowledge artifacts required by CA methods. This work presents MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance MP models using diverse domain knowledge artifacts. MPMMine is guided by consistency, standardization, completeness, extensibility, openness, and version control. It adopts a uniform structure and relies on open formats: MiniZinc, CommonMark, and JSON. It provides multiple models per problem, tens of instances per model, and thousands of solutions and non-solutions in both integer and continuous domains, alongside natural-language descriptions to support text-to-model methods.

2605.26275 2026-05-27 cs.CL

SPEAR: Code-Augmented Agentic Prompt Optimization

SPEAR: 代码增强的智能体提示优化

Mengyin Lu, Cong Feng, Huimin Han, Guangming Lu, Yu Sun, Xiaonan Ding, Shihui Long, Fengyi Li, Tanvi Motwani

AI总结 提出SPEAR方法,将代码执行作为智能体工具进行提示优化,通过Python沙箱实现结构错误分析,并在工业任务和基准测试中取得显著提升。

Comments 19 pages, 3 figures, EMNLP 2026 submission

详情
AI中文摘要

自动提示工程(APE)重写提示以改进下游任务性能,但现有的APE循环将优化器本身视为固定流水线。我们将CodeAct(Wang等人,2024a)的代码即行动范式移植到APE,并提出SPEAR(沙盒化主动回滚提示工程师),一个具有四个工具(评估、python、set_prompt、finish)的自由形式智能体优化器,自主决定如何使用这些工具。独特的工具是Python沙箱:优化器在当前评估DataFrame上编写并执行任意Python代码,进行智能体自身编写的结构错误分析(混淆矩阵、错误聚类、每组指标)。两个护栏将长时程智能体转变为单调改进的优化器:指标回归时的自动回滚,以及可选的防护指标下限。我们在三个工业级LLM作为评判者的套件(涵盖招聘人员面试、对话记忆和查询改进系统的13个评判任务)以及七个BBH任务和GSM8K上进行评估。SPEAR在每个工业任务的主要指标上获胜(工具选择上κ 0.857 vs 0.359;过滤相关性上F1-macro 0.815 vs 0.763;最难提取维度上κ 0.254 vs 0.218)。在BBH-7上,SPEAR平均准确率0.938,而GEPA为0.628,TextGrad为0.484。消融实验表明,Python工具是复杂评判任务上最大的单一杠杆(在5类工具选择评判任务上Δ≈+0.79κ,在移除时最难提取维度上Δ≈+0.35κ);其不可替代的贡献是类对混淆聚合,而长上下文LLM无法从原始评估DataFrame中可靠提取该信息。

英文摘要

Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools -- evaluate, python, set_prompt, finish -- that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per-group metrics) that the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($κ$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $κ$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7, SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ($Δ\approx +0.79κ$ on the 5-class tool-selection judge, $Δ\approx +0.35κ$ on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.
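
Two ideas from the abstract, class-pair confusion aggregation and auto-rollback on metric regression, can be sketched in a few lines. This is a hedged illustration rather than SPEAR's implementation: it uses plain dictionaries instead of a DataFrame, the function names and row fields (`gold`, `pred`) are hypothetical, and treating a tie on the primary metric as acceptable is an assumption.

```python
from collections import Counter


def confusion_pairs(rows):
    """Count (gold, predicted) pairs over misclassified rows, most common first."""
    errors = Counter((r["gold"], r["pred"]) for r in rows if r["gold"] != r["pred"])
    return errors.most_common()


def accept_or_rollback(best_prompt, best_score, candidate_prompt, candidate_score,
                       guard_score=None, guard_floor=None):
    """Keep the candidate only if the primary metric did not regress and the
    optional guard metric stays at or above its floor; otherwise roll back."""
    guard_ok = guard_floor is None or (guard_score is not None
                                       and guard_score >= guard_floor)
    if candidate_score >= best_score and guard_ok:
        return candidate_prompt, candidate_score
    return best_prompt, best_score  # roll back to the last accepted prompt


# Hypothetical evaluation rows for a tool-selection judge.
rows = [
    {"gold": "search", "pred": "search"},
    {"gold": "search", "pred": "lookup"},
    {"gold": "search", "pred": "lookup"},
    {"gold": "email", "pred": "search"},
]
pairs = confusion_pairs(rows)
print(pairs)
```

The aggregated pairs tell the optimizer which two classes the prompt fails to separate, and the rollback rule means the accepted score can never decrease across iterations.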

2605.26273 2026-05-27 cs.CV

Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation

频率引导的RGB-热红外语义分割融合

İsmail Emre Canıtez, Özgür Erkent

AI总结 提出一种基于双ConvNeXt V2骨干网络的多模态融合架构,通过频率分解和置信门控残差机制融合RGB与热红外特征,在MFNet和PST900上以较低参数量实现先进性能。

Comments 9 pages, 7 figures, To be Presented at Perception Beyond the Visible Spectrum workshop series (IEEE PBVS) at CVPR, 2026

详情
AI中文摘要

在城市驾驶场景等复杂环境中,语义分割在光照条件不佳时仍具挑战性,仅凭RGB图像提供的信息不足。RGB-热红外融合利用可见光和红外图像的互补优势来提升场景理解;然而,在不同特征抽象层次上有效整合这些异质模态仍是一个开放问题。本文提出一种基于双ConvNeXt V2骨干网络的多模态融合架构,采用分阶段、模态自适应的融合策略。对于早期特征,我们引入基于频率的融合模块,通过高斯滤波将红外特征分解为低频和高频分量,应用双分支空间注意力选择性强调热模式与精细边界,并通过置信门控残差机制将其与RGB特征融合。对于后期特征,我们设计了一个具有跨模态注意力和多尺度深度可分离卷积的语义融合模块,以捕捉模态间的语义对应关系。融合后的特征通过带有深度监督的PANet风格双向解码器进行解码。在MFNet和PST900上的实验表明,我们最轻量化的变体分别达到61.73%和86.24%的mIoU,仅需35.43M参数,在显著减少参数和计算成本的同时优于近期方法。代码可在https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION获取。

英文摘要

Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力:视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

AI总结 针对视频扩散模型中KV缓存量化导致注意力权重系统性偏差的问题,提出基于Jensen偏差的在线逐注意力分数校正方法,在INT2量化下恢复接近BF16的视频质量,且内存减半。

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

详情
AI中文摘要

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算,但随着视频变长,该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力,但降低了视频质量。我们表明,这种降低的一个关键驱动因素是注意力权重的系统性偏差:由于softmax注意力中指数的凸性,量化噪声膨胀了缓存键的贡献,我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正,在期望中消除此偏差,该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似,额外的计算开销可忽略不计,且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化,我们的校正恢复了因激进量化而损失的大部分质量,达到接近BF16的视频质量,并且在使用50%更少内存的情况下优于INT4量化。

英文摘要

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
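
The Jensen bias can be checked numerically on a toy example. The sketch below assumes a simple noise model that the abstract does not spell out: each key coordinate receives independent rounding noise uniform in [−step/2, step/2], so the score noise q·e has variance Σ (q_i · step_i)² / 12. The correction subtracts log(1 + σ²/2), the second-order Taylor estimate of log E[exp(noise)]. All function names, the query, and the step sizes are hypothetical, and softmax scaling is assumed to be folded into the query.

```python
import math
import random


def score_noise_variance(query, step_sizes):
    """Variance of q . e when e_i ~ Uniform(-step_i/2, step_i/2), independent."""
    return sum((q * d) ** 2 / 12.0 for q, d in zip(query, step_sizes))


def corrected_logit(logit, query, step_sizes):
    """Subtract log(1 + sigma^2 / 2) from the logit of a quantized key."""
    sigma2 = score_noise_variance(query, step_sizes)
    return logit - math.log1p(0.5 * sigma2)


def mean_exp_noise(query, step_sizes, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[exp(q . e)] under the uniform-noise assumption."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = sum(q * rng.uniform(-d / 2, d / 2) for q, d in zip(query, step_sizes))
        total += math.exp(eps)
    return total / n_samples


query = [0.8, -0.5, 0.3, 0.6]
steps = [1.0, 1.0, 1.0, 1.0]  # hypothetical coarse (low-bit) step sizes
inflation = mean_exp_noise(query, steps)                  # empirical E[exp(noise)]
taylor = 1.0 + 0.5 * score_noise_variance(query, steps)   # second-order estimate
print(inflation, taylor)
```

Because the exponential is convex, the empirical average exceeds 1 even though the noise itself has zero mean, so quantized keys gain attention weight relative to unquantized ones; under these assumptions the Taylor estimate tracks that inflation closely while needing only the query and the step sizes.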

2605.26262 2026-05-27 cs.CV

Dimensional Distribution Emotion State: Leveraging Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis

维度分布情感状态:利用效价和唤醒作为视觉情感分析的通用嵌入空间

Émile Bergeron, Tadagbé Dhossou, Sébastien Tremblay, Jean-François Lalonde

AI总结 提出一种新的情感表示方法DDES,结合连续双维情感空间和多数据集训练流程,以辅助博物馆策展人预测艺术品引发的情感反应。

详情
AI中文摘要

博物馆是传播文化艺术的重要场所。它们是植根于历史和传统的机构;其展览通常旨在突出这些方面。最近,该领域正在探索一种新方法:基于情感的展览。这些展览专门设计用于引发游客的情感,以最大化参与度,并作为民主化艺术接触和吸引更广泛、更多样化观众的一种方式。为此,必须首先提取艺术品的情感内容,然而,由专家手动标注艺术品是一个劳动密集且成本高昂的过程,并且存在引入策展人个人偏见的风险。为了协助博物馆策展人设计这些展览,我们希望开发一种能够预测艺术作品所引发的情感反应的工具。在本文中,我们利用连续的双维情感空间来增强情感表示和深度学习模型的训练过程。借鉴现有的分类和维度情感表示,我们引入了一种新的表示方法——维度分布情感状态(DDES),以及一个多数据集训练流程。我们表明,与广泛使用的表示相比,DDES提供了多种优势,同时表现出相似的基线性能。

英文摘要

Museums are important sites for the dissemination of culture and art. They are institutions rooted in history and tradition; their exhibitions are often designed to highlight these aspects. A new approach has recently been explored in the field: emotion-based exhibitions. These exhibitions are designed specifically to elicit emotions in visitors, both to maximize engagement and to democratize access to art and attract a wider, more diverse audience. To do so, the emotional content of the artworks must first be extracted; however, manual annotation of artworks by experts is a prohibitively labor-intensive process and risks introducing the personal bias of curators. To assist museum curators in designing these exhibitions, we aim to develop a tool that can predict the emotional response evoked by a work of art. In this article, we leverage a continuous bi-dimensional emotion space to enhance emotion representations and the training process of deep learning models. Drawing inspiration from existing categorical and dimensional emotion representations, we introduce a new representation, Dimensional Distribution Emotion State (DDES), along with a pipeline for multi-dataset training. We show that DDES provides multiple advantages compared to widely used representations while exhibiting similar baseline performance.

2605.26256 2026-05-27 cs.AI

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

个性化具身多模态大语言模型代理在长期用户交互中的应用

Jeongeun Lee, Chanyoung Park, Dongha Lee

AI总结 提出POLAR框架,通过多模态知识图谱记忆机制增强具身代理在长期交互中的个性化能力,显著提升多步推理和用户上下文跟踪性能。

详情
AI中文摘要

基于多模态大语言模型的具身代理在物理环境中解决复杂任务方面展现出强大潜力。然而,个性化辅助不仅需要遵循通用指令或识别物体类别。在现实场景中,目标通常仅通过先前的交互隐式指定,要求代理利用随时间积累的个性化上下文。在这项工作中,我们提出了POLAR,一个用于长期用户交互中个性化具身代理的多模态记忆增强框架。POLAR将先前的交互组织成一个多模态知识图谱,该图谱捕获用于个性化上下文和视觉概念的语义记忆,以及用于代理轨迹等具身经验的情景记忆。为了执行具身任务,POLAR检索相关记忆以解释当前请求并指导任务执行。我们在多个MLLM骨干网络和多样化的评估场景下评估POLAR,以研究记忆在长期个性化中的作用。结果表明,所提出的记忆机制通过更有效地利用先前交互中积累的信息,持续提升性能。当代理需要在多个交互中进行推理、执行多跳推理或随时间跟踪用户特定上下文的更新时,性能提升尤为显著。

英文摘要

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.