arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24530 2026-05-26 cs.CL cs.CV

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

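AI summary: Proposes the Unveil framework, which achieves robust document retrieval through visual-textual embeddings and knowledge distillation, capturing both layout and semantic information.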

Comments ACL 2025 Main Conference

Abstract

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
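
One way to picture the distillation step in this abstract is to train the visual-only student so that its page embeddings line up with those of the visual-textual teacher. The sketch below is a minimal illustration under stated assumptions: the cosine-distance objective and the names `cosine_similarity` and `distillation_loss` are hypothetical, since the abstract does not specify the loss that Unveil uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(teacher_embs, student_embs):
    """Mean (1 - cosine similarity) over paired teacher/student embeddings.

    Assumed objective for illustration only; not taken from the paper.
    """
    losses = [
        1.0 - cosine_similarity(t, s)
        for t, s in zip(teacher_embs, student_embs)
    ]
    return sum(losses) / len(losses)
```

The loss is zero when every student embedding points in the same direction as its teacher embedding, regardless of scale, which suits retrieval setups that rank by cosine similarity.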

2605.22794 2026-05-26 cs.AI cs.LG

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Huajiang Zheng, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

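AI summary: Proposes MOSS, a system that lets autonomous agent systems self-evolve through source-level rewriting, using automatically curated batches of production-failure evidence and a deterministic multi-stage pipeline; on OpenClaw it raises the mean grader score from 0.25 to 0.61 in a single cycle.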

Comments 12 pages, 3 figures, 2 tables. Preprint. Code: https://github.com/hkgai-official/Moss

Abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

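AI summary: Proposes the AnyMo framework, which combines physics-based simulation of diverse IMU signals, graph-encoder pre-training, and LLM alignment to achieve zero-shot activity recognition, cross-modal retrieval, and motion captioning across devices and datasets, with significant performance gains.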

Abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.
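
For readers unfamiliar with the retrieval metric quoted in this abstract, mean reciprocal rank (MRR) averages the reciprocal of the rank at which the correct item appears for each query. The sketch below uses the standard definition; the function name and the toy inputs are illustrative and are not taken from the AnyMo evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct item, one ranked list per query.

    A query whose target is missing from its ranked list contributes 0.
    """
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(ranked_lists)
```

With two queries whose correct captions are ranked first and second, the MRR is (1 + 1/2) / 2 = 0.75.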

2605.22684 2026-05-26 cs.LG

ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification

José Alberto Rodríguez, Luis Balderas, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

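AI summary: Proposes ChronoVAE-HOPE, a next-generation time series foundation model built on a VAE and HOPE blocks (with Titans modules and a Continuum Memory System), which separates trend and seasonal components through a disentangled latent space and performs strongly on UCR benchmark classification tasks.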

Abstract

Time Series Foundation Models (TSFMs) have become a new component of the state-of-the-art in general time series forecasting. However, adapting them to specialized classification tasks remains constrained by two interconnected challenges: the quadratic cost of standard attention mechanisms and the inability to disentangle the structural components underlying time series variability. This technical report introduces ChronoVAE-HOPE, a next-generation TSFM that reconciles massive generalization with structured latent representation for time series classification. The core of the proposal is a Variational Autoencoder (VAE) framework built upon the HOPE Block, which replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. A key architectural novelty is the disentangled latent space, which factorizes representations into independent trend and seasonal components via dedicated encoder heads and separate decoder pathways. ChronoVAE-HOPE undergoes self-supervised pre-training on the Monash archive, combining a Masked Time Series Modeling (MTSM) auxiliary objective with a disentangled VAE reconstruction loss. The pre-trained encoder is subsequently frozen and used to generate fixed-length embeddings for downstream classification on the UCR benchmark datasets. Empirical results demonstrate strong performance across diverse temporal domains, particularly in settings characterized by strict causal structure. ChronoVAE-HOPE establishes a robust and interpretable framework for the adaptation of foundation models to time series classification through structured generative representations.

2605.22337 2026-05-26 cs.AI

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu

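AI summary: Proposes Meta-Soft, a dynamic compression framework that synthesizes meta-tokens with a learnable orthogonal basis matrix and a Gumbel-Softmax selector network, and retains the information of discarded context through an attention-flow integration mechanism, addressing information loss and context breaks in KV cache compression.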

Comments 9 pages, 2 figures

Abstract

The KV cache in large language models grows linearly with context length, so LLMs face memory blow-up and reduced decoding efficiency on long contexts. KV cache eviction has become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) use a static parameter set as the query for scoring the importance of KV pairs, so they can neither adapt dynamically to different input prompts nor precisely capture complex, changing task relevance. Moreover, evicted KV pairs are discarded permanently, causing irreversible information loss and context breaks. To address these problems, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$ and use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, dynamically synthesizing the $k$ most targeted Soft Tokens from the input prompt features. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow-based integration mechanism that redistributes the semantic information of removed tokens into retained tokens, preserving the dropped context. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV cache compression.
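
The selection step in this abstract, a Gumbel-Softmax selector that mixes rows of a basis matrix into a soft token, can be sketched as follows. This is a rough illustration under stated assumptions: the temperature value, the function names, and the plain-list representation of the basis are hypothetical, and the selector network that would produce `logits` from prompt features is not modeled.

```python
import math
import random


def gumbel_softmax(logits, temperature=0.5, rng=random):
    """Sample relaxed selection weights that are non-negative and sum to 1."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_soft_token(basis, logits, temperature=0.5, rng=random):
    """Mix the basis rows with Gumbel-Softmax weights into one soft token."""
    weights = gumbel_softmax(logits, temperature, rng)
    dim = len(basis[0])
    return [
        sum(w * row[d] for w, row in zip(weights, basis))
        for d in range(dim)
    ]
```

Lower temperatures push the weights toward a one-hot choice of a single basis row, while higher temperatures blend several rows. Producing $k$ Soft Tokens would mean repeating the mix with $k$ sets of logits.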

2605.22242 2026-05-26 cs.LG physics.ao-ph

Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations

Birgit Kühbacher, Daan Crommelin, Niki Kilbertus

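AI summary: Using the two-scale Lorenz 1996 system, this study compares multiple ensemble configurations and parameterization strategies to systematically analyze how intrinsic variability, initial-condition perturbations, and stochastic model uncertainty affect ensemble spread, showing that stochastic parameterizations, especially those with temporally persistent structure, enhance early spread growth and improve spread-error consistency.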

Abstract

Weather and climate forecasts are inherently uncertain due to chaotic dynamics, imperfect initial conditions, and incomplete representation of the underlying physical processes. Operational ensemble forecasts aim to represent these uncertainties through forecast spread, yet many approaches yield underdispersive estimates, with spread that grows too slowly relative to forecast error. Using the two-scale Lorenz 1996 system as a widely used, controlled testbed, we design a systematic approach to disentangle intrinsic variability, initial-condition perturbations, and stochastic model uncertainty. We compare multiple ensemble configurations and parameterization strategies, including existing deterministic and autoregressive as well as novel Bayesian and flow-based approaches. Our results show that ensemble perturbations do not increase the system's long-term variance; rather, they regulate how rapidly trajectories decorrelate and explore the invariant measure. Stochastic parameterizations, particularly those with temporally persistent structure, enhance early spread growth and improve spread-error consistency. Overall, we bring clarity to how different sources of uncertainty interact in a chaotic system and provide guidance for the design and evaluation of stochastic parameterizations in weather and climate models.
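
For context, the two-scale Lorenz 1996 testbed couples K slow variables X to J fast variables Y per slow variable. The sketch below computes the time tendencies in the commonly used textbook form. The default values of `F`, `h`, `b`, and `c` are frequently used settings assumed here for illustration; the abstract does not state which configuration the paper uses.

```python
def l96_two_scale_tendencies(X, Y, F=10.0, h=1.0, b=10.0, c=10.0):
    """Return (dX/dt, dY/dt) for the two-scale Lorenz 1996 system.

    X holds K slow variables; Y holds K*J fast variables stored as one
    periodic ring, with Y[k*J:(k+1)*J] coupled to X[k].
    """
    K = len(X)
    J = len(Y) // K
    n = K * J
    coupling = h * c / b

    dX = []
    for k in range(K):
        fast_sum = sum(Y[k * J:(k + 1) * J])
        # Negative indices wrap around, giving the periodic boundary.
        advection = -X[k - 1] * (X[k - 2] - X[(k + 1) % K])
        dX.append(advection - X[k] + F - coupling * fast_sum)

    dY = []
    for j in range(n):
        advection = -c * b * Y[(j + 1) % n] * (Y[(j + 2) % n] - Y[j - 1])
        dY.append(advection - c * Y[j] + coupling * X[j // J])
    return dX, dY
```

A parameterization in this setting replaces the `coupling * fast_sum` term with a function of X alone, deterministic or stochastic, so that the slow variables can be integrated without resolving Y.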

2605.21740 2026-05-26 cs.AI

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani

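AI summary: Proposes the SMDD-Bench benchmark, which evaluates LLMs on real-world small molecule drug design using 502 multi-turn, long-horizon task instances, and finds that the best model, GPT5.4, solves only 40.2% of tasks.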

Abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require having strong chemical and biological reasoning and 3D intuition, understanding specialized tool use, and displaying planning expertise over a limited number of oracle calls. We benchmark 7 frontier open and closed source LLMs and find even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.

2605.21652 2026-05-26 cs.CV cs.AI

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Nassir Navab, Zhongliang Jiang

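AI summary: Proposes a framework that mimics the sonographer's cognitive workflow, using a Zoom-then-Diagnose paradigm and an uncertainty-aware reward based on Group Relative Policy Optimization to improve lesion localization and diagnostic performance in ultrasound visual question answering.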

Abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
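
The idea of using agreement across a group of stochastic rollouts as a confidence proxy can be sketched as follows. The reward shaping shown here, which scales a correct or incorrect signal by the group's agreement rate, is an assumption made for illustration; the abstract does not give the paper's actual reward formula, and both function names are hypothetical.

```python
from collections import Counter


def group_consistency(answers):
    """Return the majority answer and the fraction of rollouts that agree with it."""
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


def uncertainty_aware_rewards(answers, label):
    """Assumed shaping: +confidence for correct rollouts, -confidence otherwise.

    When rollouts disagree (low confidence), rewards shrink in magnitude,
    so ambiguous cases push the policy less than clear-cut ones.
    """
    _, confidence = group_consistency(answers)
    return [confidence if a == label else -confidence for a in answers]
```

In a GRPO-style update these per-rollout rewards would then be normalized within the group to form advantages.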

2605.21417 2026-05-26 cs.CV cs.AI

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

Junghyun Lee, Hyunseo Kim, Hanna Jang, Junhyug Noh

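AI summary: Proposes a rank-aware multi-encoder framework that uses an attention-based gating module to select the most informative encoders for fusion, decouples prediction into presence and salience heads, and adds unsupervised domain adaptation, placing 2nd in the blended emotion recognition challenge.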

Comments Accepted at IEEE FG 2026 Workshops. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

Abstract

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
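
The top-n selective fusion described in this abstract can be illustrated with a small sketch: gate scores are turned into weights, only the n highest-weighted encoders are kept, and their features are averaged with renormalized weights. The softmax gate and the renormalization are assumptions made for illustration, and the features are assumed to be already projected into a shared space of equal dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_fusion(features, gate_scores, n):
    """Fuse only the n encoders with the highest gate weights.

    Returns the fused feature vector and the indices of the kept encoders.
    """
    weights = softmax(gate_scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    top = ranked[:n]
    norm = sum(weights[i] for i in top)
    dim = len(features[0])
    fused = [
        sum(weights[i] / norm * features[i][d] for i in top)
        for d in range(dim)
    ]
    return fused, top
```

Because the gate scores are computed per sample, different inputs can keep different encoders, which is what distinguishes this from fusing a fixed subset.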

2605.21190 2026-05-26 cs.CV

Semantic Granularity Navigation in Image Editing

Liangsi Lu, Minzhe Guo, Xuhang Chen, Yang Shi

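AI summary: Proposes NaviEdit, a training-free inference-time controller that uses a self-consistency constraint to decouple edit progress from model scale, improving semantic editability while preserving structural fidelity.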

Comments Accepted by ICML 2026

Abstract

Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

2605.20416 2026-05-26 cs.LG physics.comp-ph

Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning and generation with Vision-Language Models

Qinwu Xu, Xiaofu Ma, Yifan Jiang

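AI summary: Studies whether multimodal large language models can use Miller indices as a structured latent representation to reason about fracture geometry; experiments show that the models can perform latent inference under idealized conditions and can reject the representation when the physics does not support it.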

Abstract

We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices $z = (h,k,l)$ as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image. Through extensive experiments spanning synthetic data, controlled 2D--3D geometric pairs, and real-world fracture images across multiple material classes -- including ceramics, glass, metals, and concrete -- we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it. As an exploratory extension, we further examine AI-generated fracture sequences and observe qualitatively plausible brittle-fracture progression behaviors, suggesting that multimodal generative models may encode partial implicit physical priors related to material failure dynamics. These results suggest that MLLMs can act as physics-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled.
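
To make the latent variable $z = (h,k,l)$ concrete: in a cubic crystal the plane $(h,k,l)$ has normal direction $[h,k,l]$, so the angle between two planes follows from the dot product of their index vectors. The sketch below assumes a cubic lattice, which the abstract does not specify; other crystal systems need the metric tensor and are not covered here.

```python
import math


def cubic_plane_angle_deg(plane_a, plane_b):
    """Angle in degrees between two Miller-index planes in a cubic lattice."""
    dot = sum(a * b for a, b in zip(plane_a, plane_b))
    norm_a = math.sqrt(sum(a * a for a in plane_a))
    norm_b = math.sqrt(sum(b * b for b in plane_b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))
```

For example, the (100) and (110) planes meet at 45 degrees, and (100) and (111) at roughly 54.74 degrees.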

2605.20278 2026-05-26 cs.LG cs.AI cs.CV

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

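AI summary: Proposes the ClaimDiff-RL framework, which uses atomic claim differences as the reward unit: a multimodal judge enumerates visual differences and assigns error types and severity levels, addressing the tradeoff between factuality and coverage in RL for long-form captioning.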

Abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness and coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination--missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
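
To show how per-difference statistics could make hallucination and omission separately tunable, here is a toy reward composition. The record fields (`kind`, `severity`), the two weights, and the linear penalty are all hypothetical choices for illustration; the abstract does not describe the actual composition rule.

```python
def claim_diff_reward(differences, hallucination_weight=1.0, omission_weight=0.5):
    """Negative weighted sum of severities, split by error kind.

    Each difference is a dict with a "kind" ("hallucination" or "omission")
    and a numeric "severity"; both fields are assumed for this sketch.
    """
    hallucination_penalty = sum(
        d["severity"] for d in differences if d["kind"] == "hallucination"
    )
    omission_penalty = sum(
        d["severity"] for d in differences if d["kind"] == "omission"
    )
    return -(
        hallucination_weight * hallucination_penalty
        + omission_weight * omission_penalty
    )
```

Raising `omission_weight` relative to `hallucination_weight` would discourage the policy from lowering hallucination simply by saying less, which is the failure mode the abstract attributes to holistic scalar rewards.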

2605.20025 2026-05-26 cs.AI

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang, Caiming Xiong, James Zou, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

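AI summary: Proposes AutoResearchClaw, a multi-agent autonomous research system that combines structured debate, self-healing execution, verifiable reporting, seven human-in-the-loop collaboration modes, and cross-run evolution, outperforming AI Scientist v2 by 54.7% on the ARC-Bench benchmark.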

Abstract

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

2605.20023 2026-05-26 cs.AI cs.MA

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

Samuel Jacob Chacko, James Hugglestone, Chashi Mahiul Islam, Xiuwen Liu

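AI summary: By re-analyzing a controlled experiment, this paper finds that when environment-feedback bandwidth is high, the marginal benefit of Skills for agent performance vanishes or even turns negative, and it proposes a falsifiable hypothesis.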

Comments Accepted as a poster at ACM CAIS 2026 AgentSkills Workshop

Abstract

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (591, 12865, 17253, and 36001 tokens) and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp ($p = 0.71$, $\chi^2$; $p = 0.25$, Cochran--Armitage trend test; five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side-channel setting), curated Skills actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
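
The effect-size measure cited in this abstract, Cohen's $h$, compares two proportions after an arcsine transform: $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$. The sketch below implements that standard formula. The pass rates in the usage note are hypothetical, chosen only to be 8.9 percentage points apart, because the abstract reports the spread but not the underlying rates.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))
```

With hypothetical pass rates of 0.50 and 0.411, `cohens_h(0.50, 0.411)` falls below the conventional 0.2 small-effect threshold. The same 8.9 pp gap gives a larger $h$ when the rates sit close to 0 or 1, so the threshold comparison depends on the baseline rate.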

2605.19491 2026-05-26 cs.CV

Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning

尺度思考:通过自适应连续推理加速千兆像素病理图像分析

Jiusong Ge, Yingkang Zhan, Wenjie Zhao, Di Zhang, Ke Wang, Jiashuai Liu, Chunze Yang, Chengzu Li, Jian Zhang, Yuxin Dong, Ni Zhang, Qidong Liu, Mireia Crispin-Ortuzar, Huazhu Fu, Chen Li, Zeyu Gao

AI总结 提出PathCTM模型,通过动态尺度切换和注意力引导的区域剪枝实现高效连续推理,大幅减少计算开销并保持诊断性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

传统的全切片图像(WSI)分析方法通常依赖于多实例学习(MIL)范式,该范式在高倍率下提取补丁级特征并进行聚合以进行切片级预测。然而,这种详尽的补丁级处理计算成本高,严重限制了WSI分析的效率和可扩展性。为应对这一挑战,我们提出了PathCTM(面向病理学的连续思维模型),该模型能够对千兆像素WSI进行令牌高效的尺度空间连续推理。PathCTM将诊断推理表述为动态的序列信息追踪。它逐步从低倍率全局检查过渡到高倍率局部检查,并在收集到足够证据以有效限制决策不确定性时自适应终止推理。具体而言,它使用条件计算进行动态尺度切换,并采用注意力引导的区域剪枝,结合置信度感知的早期停止。大量实验表明,与基于标准MIL的方法相比,PathCTM将所需图像补丁数量减少了95.95%,推理时间缩短了约95.62%,同时AUC没有下降。代码可在https://github.com/JSGe-AI/PathCTM获取。

英文摘要

Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at https://github.com/JSGe-AI/PathCTM.
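The idea of moving from coarse to fine scales and stopping once the decision is confident enough can be illustrated with a toy loop. This is only a sketch under stated assumptions: the per-scale logits, the additive evidence rule, and the `confidence` threshold are all hypothetical and do not reproduce PathCTM's actual architecture.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_to_fine(logits_per_scale, confidence=0.9):
    """Accumulate evidence scale by scale; stop once the top class is confident.

    logits_per_scale: per-class logit lists ordered from low to high magnification
    (hypothetical inputs). Returns (predicted_class, number_of_scales_used).
    """
    evidence = [0.0] * len(logits_per_scale[0])
    for used, logits in enumerate(logits_per_scale, start=1):
        evidence = [e + l for e, l in zip(evidence, logits)]
        probs = softmax(evidence)
        if max(probs) >= confidence:
            break  # enough evidence: skip the remaining, more expensive scales
    return probs.index(max(probs)), used

# An easy case stops at the first scale; a harder one needs all three.
print(coarse_to_fine([[3.0, 0.0], [1.0, 0.0], [1.0, 0.0]]))
print(coarse_to_fine([[0.5, 0.0], [0.5, 0.0], [2.0, 0.0]]))
```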

2605.19430 2026-05-26 cs.RO

Neuromorphic Control of a Flapping-Wing Robot on Resource-Constrained Hardware

资源受限硬件上扑翼机器人的神经形态控制

Rim El Filali, Chenrui Feng, Chao Gao, Weibin Gu

AI总结 针对重量小于30克的蝴蝶仿生扑翼机器人,提出一种层次化神经形态控制框架,在低成本ESP32微控制器上部署两个轻量级脉冲神经网络实现状态估计与控制,通过模仿学习训练,在无系留飞行中实现稳定俯仰和航向跟踪,相比传统人工神经网络延迟降低36%、功耗降低18%。

详情
AI中文摘要

扑翼微型飞行器(FWMAV)具有卓越的机动性和气动效率,但由于非线性动力学和严格的大小、重量和功率(SWaP)约束(例如重量小于30克的蝴蝶仿生机器人),给机载控制带来了重大挑战。为此,我们提出了一种层次化神经形态控制框架,能够在广泛可用、资源受限的ESP32微控制器(单价约5美元)上实现完全机载的闭环飞行。具体而言,我们的方法在机载部署了两个轻量级脉冲神经网络(SNN):一个用于从原始传感器反馈进行状态估计,另一个通过调节中央模式发生器(CPG)进行翅膀驱动控制。通过模仿学习训练,该系统在无系留真实飞行中实现了稳定的俯仰和航向角跟踪。实验结果进一步表明,与传统人工神经网络(ANN)基线相比,基于SNN的控制器推理延迟降低了36%(从1059微秒降至680微秒),功耗降低了18%(从0.033瓦降至0.027瓦),证明了无需专用硬件的脉冲计算可行性。据我们所知,这项工作首次展示了FWMAV自主飞行的完全机载神经形态控制,突显了SNN在严格SWaP约束下实现节能自主的潜力。视觉摘要:http://bit.ly/4nI8ECY 代码:https://anonymous.4open.science/r/Espikify-76E3/

英文摘要

Flapping-Wing Micro Aerial Vehicles (FWMAVs) provide exceptional maneuverability and aerodynamic efficiency but pose significant challenges for onboard control due to nonlinear dynamics and stringent Size, Weight, and Power (SWaP) constraints, as exemplified by a butterfly-inspired robot weighing less than 30 grams. To this end, we present a hierarchical neuromorphic control framework that enables fully onboard, closed-loop flight on a widely available, resource-constrained ESP32 microcontroller with a unit cost of approximately $5. Specifically, our method deploys two lightweight Spiking Neural Networks (SNNs) onboard: one for state estimation from raw sensory feedback and another for control via modulation of a Central Pattern Generator (CPG) for wing actuation. Trained by imitation learning, the system achieves stable pitch and heading angle tracking during untethered real-world flight. Experimental results further reveal that the SNN-based controller reduces inference latency by 36% (1059 µs to 680 µs) and inference power by 18% (0.033 W to 0.027 W) compared to the conventional Artificial Neural Network (ANN) baseline, demonstrating the viability of spike-based computation without specialized hardware. To the best of our knowledge, this work constitutes the first demonstration of fully onboard neuromorphic control for autonomous flight of an FWMAV, highlighting the potential of SNNs to enable energy-efficient autonomy under stringent SWaP constraints. Visual abstract: http://bit.ly/4nI8ECY Code: https://anonymous.4open.science/r/Espikify-76E3/
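The quoted percentages follow directly from the raw latency and power figures in the abstract. A short check, using only those four numbers (the helper name is illustrative):

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

latency = relative_reduction(1059, 680)    # microseconds, from the abstract
power = relative_reduction(0.033, 0.027)   # watts, from the abstract

print(f"latency: {latency:.1%}, power: {power:.1%}")
# latency: 35.8%, power: 18.2%
```

Both values round to the 36% and 18% stated in the abstract.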

2605.19409 2026-05-26 cs.LG cs.AI

Unlocking the Potential of Continual Model Merging: An ODE Perspective

解锁持续模型合并的潜力:ODE视角

Lihong Lin, Haidong Kang

AI总结 提出ODE-M框架,将持续模型合并建模为参数空间中的轨迹,通过整流时变速度场和效用感知时间调度平衡历史知识与新任务,提升长任务流性能。

Comments 21 pages, 8 figures

详情
AI中文摘要

持续模型合并(CMM)通过顺序整合任务适配模型实现基础模型的快速定制,无需重复训练。然而,现有合并规则通常通过固定代数或基于投影的操作更新部署模型,对保留多少先前积累的知识相对于新任务模型的控制有限。这种限制导致长任务流中保留不稳定和性能下降,当任务具有异构效用时更为明显。我们提出ODE驱动的合并(ODE-M),一个可控框架,将每次持续合并视为参数空间中的轨迹而非一步端点更新。受模式连通性启发,ODE-M使用整流时变速度场构建屏障感知轨迹,其中来自小型校准集的轻量级一阶反馈抑制损失增加的运动,同时保持向新模型的进展。然后通过沿该轨迹选择操作点(通过效用感知时间调度)获得下一个合并模型,为平衡保留的历史知识和新任务专业知识提供显式机制。在标准CMM基准上的大量实验表明,ODE-M在CLIP ViT骨干、流长度和异构任务效用设置上持续优于强持续合并基线。

英文摘要

Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without repeated retraining. However, existing merging rules usually update the deployed model through fixed algebraic or projection-based operations, providing limited control over how much previously accumulated knowledge should be retained relative to the incoming task model. This limitation leads to unstable retention and performance degradation in long task streams, and becomes more pronounced when tasks have heterogeneous utilities. We propose ODE-driven Merging (ODE-M), a controllable framework that formulates each continual merge as a trajectory in parameter space rather than a one-step endpoint update. Motivated by mode connectivity, ODE-M constructs a barrier-aware trajectory using a rectified time-dependent velocity field, where lightweight first-order feedback from a small calibration set suppresses loss-increasing motion while preserving progress toward the incoming model. The next merged model is then obtained by selecting an operating point along this trajectory through a utility-aware time schedule, providing an explicit mechanism for balancing retained historical knowledge and incoming task expertise. Extensive experiments on standard CMM benchmarks show that ODE-M consistently improves over strong continual merging baselines across CLIP ViT backbones, stream lengths, and heterogeneous task-utility settings.
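The contrast between a one-step endpoint update and a trajectory with a rectified velocity field can be sketched with plain Euler integration. Everything below is an illustrative assumption rather than the ODE-M algorithm: the straight-line base velocity, the projection used as the "rectification", the `grad_fn` callback standing in for calibration-set feedback, and the `t_stop` operating point are all hypothetical choices.

```python
import numpy as np

def rectified_merge(theta_old, theta_new, grad_fn, t_stop=1.0, steps=100):
    """Integrate from theta_old toward theta_new, removing loss-increasing motion.

    grad_fn(theta) returns the gradient of a calibration loss (assumed given).
    t_stop in [0, 1] selects the operating point along the trajectory.
    """
    theta = np.asarray(theta_old, dtype=float).copy()
    theta_new = np.asarray(theta_new, dtype=float)
    dt = 1.0 / steps
    for k in range(int(round(t_stop * steps))):
        t = k * dt
        v = (theta_new - theta) / (1.0 - t)   # base velocity toward incoming model
        g = grad_fn(theta)
        dot = float(v @ g)
        if dot > 0.0:                          # first-order loss increase
            v = v - dot / (float(g @ g) + 1e-12) * g  # project that component out
        theta = theta + dt * v
    return theta

old, new = np.array([0.0, 0.0]), np.array([2.0, 4.0])
flat = lambda th: np.zeros_like(th)            # no barrier: plain interpolation
print(rectified_merge(old, new, flat, t_stop=0.5))
```

With a zero gradient the trajectory reduces to linear interpolation, so `t_stop` behaves like an ordinary merge coefficient; a non-zero gradient bends the path away from directions that raise the calibration loss.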

2605.18840 2026-05-26 cs.LG cs.AI cs.CL

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

前沿模型的成长之痛:当排行榜不再区分以及接下来衡量什么

Adil Amin

AI总结 本文通过分解SWE-bench和GPQA Diamond分数为种群耦合趋势和每版本残差(h场),诊断前沿模型能力之间的协作与权衡,并提供三步诊断法、每实验室测量优先级表及七个可证伪预测。

Comments 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." ( https://doi.org/10.48550/arXiv.2605.18838 ). Code: https://github.com/adilamin89/cape-scaling . Dashboard: https://zehenlabs.com/cape/

详情
AI中文摘要

排行榜在独立轴上对前沿模型进行排名,但并未揭示能力在版本间是相互增强还是权衡——而在前沿,这种相互作用是更具信息量的信号。我们将配对的SWE-bench和GPQA Diamond分数分解为种群耦合趋势和每版本残差(h场),该残差从两个公开基准分数诊断能力重点。在来自10个实验室的34个模型(2024-2026)中,能力相互协作(r = +0.72,p < 10^{-6}),但协作程度系统性地变化:每个实验室的耦合斜率跨度达5倍(谷歌1.15 vs. DeepSeek 0.23),且实验室发生转向——DeepSeek从推理密集型逆转为编码优先(Δh = 15.9个百分点);Anthropic在编码偏离和恢复之间振荡。种群回归作为等斜线相边界:用于识别基础尺度耦合转变的相同分类器√[(a/b)·B₁] [Amin, 2026] 对前沿模型进行分类,并已在下一个转变处检测到混合相行为(两个模型低于GPQA-IFEval等斜线)。h场不仅具有诊断性——它还告诉你需要改变什么。预训练建立耦合为0.871,而RLHF增加0.081 [Amin, 2026]:预训练级别的转变是永久的(DeepSeek的四个版本逆转持续存在),后训练转变是可逆的(Anthropic的三次编码偏离均在单个版本内恢复),仅推理计算在不重新训练的情况下将h改变+7.8个百分点。知道哪个组件占主导地位决定了是重新训练还是等待。我们提供了三步诊断法(定位、分类、预测)、每实验室测量优先级表以及七个带有时间戳标准的可证伪预测。五个截止日期后的版本落在95%预测区间内。代码、数据和交互式仪表盘:https://zehenlabs.com/cape/。

英文摘要

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024-2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot -- DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA-IFEval isocline). The $h$-field is not just diagnostic -- it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$ pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
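The decomposition into a population trend plus a per-release residual can be illustrated with an ordinary least-squares fit. This is a sketch under assumptions: the scores below are made up, and regressing SWE-bench on GPQA Diamond (rather than the reverse, or an orthogonal fit) is a hypothetical choice that the abstract does not specify.

```python
import numpy as np

def h_field(x, y):
    """Residuals of y around the least-squares line fitted to (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return residuals, slope, intercept

# Hypothetical paired scores for five releases (not data from the paper).
gpqa = [55.0, 62.0, 70.0, 78.0, 85.0]
swe = [30.0, 41.0, 44.0, 60.0, 63.0]

h, slope, intercept = h_field(gpqa, swe)
for i, r in enumerate(h):
    print(f"release {i}: h = {r:+.1f} pp")
```

Under this convention a positive residual marks a release that scores higher on the y-axis benchmark than the population trend predicts, and a negative residual marks the opposite emphasis.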

2605.18657 2026-05-26 cs.LG cs.AI

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

KairosHope: 一种基于双记忆架构的下一代时间序列基础模型,用于专门分类

Luis Balderas, José Alberto Rodríguez, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

AI总结 针对标准注意力计算瓶颈和经典统计知识缺失问题,提出KairosHope模型,通过双记忆系统(Titans模块和连续记忆系统CMS)替代二次注意力,并融合深度表示与统计特征的混合决策头,在UCR基准上实现优越分类性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)在通用预测任务中取得了显著成功;然而,它们对专门分类问题的适应仍然受到标准注意力的计算瓶颈和对经典统计知识的系统性忽略的限制。本技术报告介绍了KairosHope,一种下一代TSFM,旨在协调大规模泛化与分类任务中的分析精度。该提案的核心是HOPE块,一种用双记忆系统替代二次注意力的架构:用于动态短期保留的Titans模块和用于长期历史上下文抽象的连续记忆系统(CMS)。为了丰富归纳偏差,引入了混合决策头,它将深度潜在表示与通过tsfeatures包提取的确定性统计特征融合。KairosHope在大型Monash档案上进行自监督预训练,结合了掩码时间序列建模(MTSM)和对比学习(InfoNCE)。随后,通过严格的线性探测和全微调(LP-FT)协议在UCR基准数据集上进行适应,以防止灾难性遗忘。实验结果表明,在具有严格时间因果关系的领域(如HAR或传感器数据)中,性能优越。因此,KairosHope为基础模型适应时间序列分析建立了一个稳健高效的框架。

英文摘要

Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to specialized classification problems remains constrained by the computational bottleneck of standard attention and the systematic omission of classical statistical knowledge. This technical report introduces KairosHope, a next-generation TSFM designed to reconcile massive generalization with analytical precision in classification tasks. The core of the proposal is the HOPE block, an architecture that replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. To enrich the inductive bias, a Hybrid Decision Head is introduced, which fuses deep latent representations with deterministic statistical features extracted via the tsfeatures package. KairosHope undergoes self-supervised pre-training on the massive Monash archive, combining Masked Time Series Modeling (MTSM) and contrastive learning (InfoNCE). Its subsequent adaptation to the UCR benchmark datasets is conducted through a rigorous Linear Probing and Full Fine-Tuning (LP-FT) protocol to prevent catastrophic forgetting. Empirical results demonstrate superior performance in domains characterized by strict temporal causality, such as HAR or sensor data. Consequently, KairosHope establishes a robust and efficient framework for the adaptation of foundation models to time series analysis.
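The contrastive objective named in the abstract, InfoNCE, has a compact standard form. The NumPy sketch below shows the in-batch version with cosine similarity; the temperature value and the use of other batch items as negatives are common defaults assumed here, not details confirmed for KairosHope.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """In-batch InfoNCE: row i of `positives` is the positive for row i of `anchors`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # cross-entropy on the diagonal

rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.05 * rng.normal(size=(8, 16))   # lightly perturbed second view
print(info_nce(views_a, views_b))
```

When every embedding is identical the loss equals $\log N$, the value for a uniform guess over the batch; well-separated positive pairs push it toward zero.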

2605.17531 2026-05-26 cs.CV

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

不要猜测,只需询问:通过多轮澄清解决指代分割中的歧义

Yuting Yang, Haichao Jiang, Tianming Liang, Quan Zhang, Jian-Fang Hu

AI总结 提出IC-Seg框架,通过多轮对话主动澄清用户意图,并引入Hi-GRPO分层优化策略,有效解决指代分割中用户查询歧义问题。

详情
AI中文摘要

指代分割旨在根据文本查询分割图像或视频中的目标对象。尽管过去几年取得了显著进展,现有工作总是假设用户提供的查询已经精确且清晰。然而,这种假设不切实际。在现实场景中,期望所有用户仔细审查其视觉内容并确保查询唯一且无歧义是不现实的。遇到此类情况时,现有分割模型倾向于任意猜测用户偏好,常常导致不理想的结果。为解决这一限制,我们提出IC-Seg,一种新颖的智能体框架,在分割前通过多轮对话主动澄清用户意图。为有效激励这种能力,我们进一步引入Hi-GRPO,一种新的分层优化策略,在轨迹、轮次和步骤层面注入密集且信息丰富的监督信号。该策略鼓励高效的意图澄清,有效消除冗余交互并提高整体对话质量。为评估,我们建立了Ambi-RVOS,一个带有模糊用户查询的指代视频对象分割基准。大量实验表明,IC-Seg不仅在解决模糊查询方面大幅优于现有方法,而且在标准推理分割基准上保持最先进性能。代码和数据将在https://github.com/iSEE-Laboratory/IC-Seg发布。

英文摘要

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.

2605.17268 2026-05-26 cs.AI cs.CV cs.RO

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实?自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

AI总结 通过分析300次VLA推理,发现输出推理与轨迹的忠实度仅42.5%,存在大量漏检行人、轨迹脆弱及推理-动作不一致问题,并提出了信息论忠实度形式化定义与安全架构。

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

详情
AI中文摘要

我们首次系统研究了视觉-语言-动作(VLA)驾驶模型的忠实度,分析了100个多样化PhysicalAI-AV场景中300次Alpamayo-R1-10B推理。主要发现是,输出带有轨迹的自然语言推理可能显著不忠实:(i) 整体推理保真度仅为42.5%,因果链与场景现实匹配不到一半;(ii) 在三分之一涉及行人的场景中漏检了94个行人;(iii) 在轻微视觉扰动下轨迹脆弱性达97.7%;(iv) 平均推理-动作一致性仅为48.3%,53.3%的推理表现出一致性低,其中37.9%声称停止但模型继续前行。我们从信息论角度形式化定义了忠实度,定义了实体和动作保真度及验证标准,并概述了与这些结果一致的四组件安全架构。

英文摘要

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that output natural-language rationales with trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.

2605.16591 2026-05-26 cs.LG cs.AI

How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning

少样本示例如何累加:上下文学习中函数向量的因果分解

Entang Wang, Yiwei Wang, Aleksandra Bakalova, Michael Hahn

AI总结 本文通过因果分解揭示少样本提示中函数向量由示例级子向量线性组合而成,并发现模型通过注意力重加权机制根据上下文调整示例贡献。

Comments Accepted at ICML 2026. 70 pages, 65 figures

详情
AI中文摘要

上下文学习(ICL)擅长从极少量示例中学习新任务,但我们仍缺乏对少样本提示如何塑造模型函数向量(FV)——一种驱动ICL查询任务行为的因果激活方向——的机制性解释。跨任务和模型,一个$n$样本FV可以通过示例级子FV的线性组合很好地近似,表明来自单个演示的贡献具有加性和可组合性。除了加性之外,我们展示了模型基于先前示例对单个示例的表示进行上下文化,以自适应地重新加权哪些演示主导FV:注意力转向在上下文中信息量更大、歧义更少的示例。最后,因果分解将查询-键路由与值更新分离,发现上下文化对FV质量最一致的贡献来自查询-键对齐——尤其是在歧义设置中——而值介导的效应则更加异质。综合起来,这些结果将加性叠加与上下文相关的注意力重加权统一为一个机制性的、可检验的说明,解释少样本提示如何实现任务。

英文摘要

In-context learning (ICL) excels at new tasks from minimal examples, yet we still lack a mechanistic explanation of how few-shot prompts shape a model's function vector (FV)--a causal activation direction that drives task behavior on the ICL query. Across tasks and models, an $n$-shot FV is well-approximated by a linear combination of example-level sub-FVs, suggesting additive and composable contributions from individual demonstrations. Beyond additivity, we show that models contextualize individual examples' representations based on prior examples to adaptively reweight which demonstrations dominate the FV: attention shifts toward examples that are more informative and less ambiguous under the context. Finally, a causal decomposition separates Query-Key routing from Value updates, finding that contextualization's most consistent contributions to FV quality arise from Query-Key alignment--particularly in ambiguous settings--while Value-mediated effects are more heterogeneous. Together, these results unify additive superposition with context-dependent attention reweighting into a mechanistic, testable account of how few-shot prompts implement tasks.
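The claim that an $n$-shot FV is well approximated by a linear combination of example-level sub-FVs can be tested with a least-squares fit. The sketch below runs on synthetic vectors; the shapes, the noise-free target, and the cosine-similarity quality measure are assumptions for illustration and not the paper's evaluation protocol.

```python
import numpy as np

def fit_mixture(sub_fvs, target_fv):
    """Least-squares weights w so that sub_fvs.T @ w approximates target_fv.

    sub_fvs: array of shape (n_examples, d); target_fv: array of shape (d,).
    Returns the weights and the cosine similarity of the reconstruction.
    """
    weights, *_ = np.linalg.lstsq(sub_fvs.T, target_fv, rcond=None)
    approx = sub_fvs.T @ weights
    cos = approx @ target_fv / (np.linalg.norm(approx) * np.linalg.norm(target_fv))
    return weights, float(cos)

rng = np.random.default_rng(0)
sub_fvs = rng.normal(size=(4, 64))             # synthetic per-example sub-FVs
true_w = np.array([0.1, 0.2, 0.3, 0.4])        # hypothetical mixing weights
target = true_w @ sub_fvs                      # synthetic n-shot FV

weights, cos = fit_mixture(sub_fvs, target)
print(np.round(weights, 3), round(cos, 4))
```

Unequal fitted weights are what the attention-reweighting account would predict: some demonstrations contribute more to the FV than others.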

2605.16409 2026-05-26 cs.CV cs.CL cs.LG

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

多语言OCR感知微调和提示引导的链式思维推理用于多模态大语言模型

Qinwu Xu, Yifan Jiang, Haoyu Ren

AI总结 提出一种多语言OCR感知的多模态训练框架,通过合成数据生成、OCR感知微调和结构化视觉链式思维提示,提升多模态大语言模型在复杂视觉条件下的OCR完整性和多语言翻译准确性。

详情
AI中文摘要

光学字符识别(OCR)和多语言文本理解仍然是多模态大语言模型(MLLMs)的主要失败模式,尤其是在包含杂乱布局、小字体、模糊、遮挡和复杂排版的真实世界图像中。我们提出了一种OCR感知的多语言多模态训练框架,该框架结合了(i)大规模合成OCR到翻译数据生成,(ii)使用LoRA适配的OCR感知监督微调(SFT),以及(iii)在不确定视觉条件下进行推理的结构化视觉链式思维(CoT)提示。使用基于LLaMA的多模态架构,所提出的框架在OCR完整性、多语言翻译准确性和退化视觉条件下的鲁棒性方面有了显著提升。在多语言收据、菜单、海报、标志、手写文本和文档图像上的实验结果表明,与基线模型相比,视觉-文本对齐显著改善。特别是,所提出的OCR感知后训练框架提高了对小、模糊、空间分散和部分遮挡文本的提取,同时减少了对不确定OCR条件下语言先验的依赖。与前沿多模态系统(包括GPT-5类和Gemini系列模型)的定性比较进一步表明,在噪声和视觉模糊的OCR场景下,OCR对齐得到改善,幻觉减少。总体而言,结果表明,以数据为中心的OCR感知多模态后训练为改进多语言OCR和基于OCR的视觉问答系统提供了一种有效且可扩展的方向。

英文摘要

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.
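LoRA adaptation, mentioned in step (ii), keeps the pretrained weight frozen and learns a low-rank update scaled by `alpha / rank`. The NumPy sketch below shows only the forward pass of that standard formulation; the dimensions, rank, and `alpha` are illustrative values, not the paper's configuration.

```python
import numpy as np

class LoRALinear:
    """Linear layer computing x @ (W + (alpha / rank) * B @ A).T with W frozen."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T                     # low-rank path
        return base + self.scale * update

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 10))
layer = LoRALinear(W)
x = rng.normal(size=(2, 10))
print(np.allclose(layer(x), x @ W.T))  # B starts at zero, so the output matches the base layer
```

Zero-initialising `B` means fine-tuning starts exactly from the pretrained behaviour, and only `A` and `B` (here 4 x 10 and 6 x 4 entries) would be trained.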

2605.15971 2026-05-26 cs.RO

OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

OHP-RL:在线人类偏好作为机器人操作强化学习中的指导

Yunyang Mo, Jian Li, Qiwei Wu, Yihang Kang, Renjing Xu

AI总结 提出OHP-RL框架,利用人类干预作为偏好信息,通过状态依赖偏好门自适应调节策略学习,在Franka机器人接触丰富的操作任务中实现高成功率、快速收敛和低人类干预。

详情
AI中文摘要

虽然强化学习使机器人能够自主获取技能,但其在实际部署中受到低效和不安全探索的严重限制。人类在环干预提供了一种实用的解决方案,但现有方法通常将这些干预作为辅助训练信号,未能充分捕捉它们提供的关于何时以及如何引导自主性的更丰富信息。人类干预通常编码了在安全和任务约束下对行为的相对偏好,而不是规定要模仿的精确动作。受此观点启发,我们提出在线人类偏好作为强化学习中的指导(OHP-RL),这是一个利用人类干预作为偏好信息来指导策略学习的框架。OHP-RL引入了一个状态依赖的偏好门,自适应地调节人类干预应在何时以及多大程度上塑造策略学习。这种设计使智能体能够从间歇性和不完美的人类反馈中受益,同时保持自主探索和稳定的策略优化。我们在Franka机器人上的三个具有挑战性的真实世界接触丰富操作任务中评估了OHP-RL。在所有任务中,OHP-RL始终实现了高成功率、更快的收敛以及比先前方法显著更低的人类干预努力。此外,学习到的策略在整个训练过程中表现出更稳定和与人类一致的行为。

英文摘要

While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.

2605.15777 2026-05-26 cs.AI

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

SaaS-Bench:计算机使用代理能否利用真实世界SaaS解决专业工作流程?

Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

AI总结 提出SaaS-Bench基准,包含23个可部署SaaS系统和106个真实工作场景任务,评估计算机使用代理在长期规划、跨应用协调等能力上的表现,发现最强模型端到端任务完成率不足4%。

Comments 24 pages, 11 figures

详情
AI中文摘要

计算机使用代理(CUA)正迅速将大型语言模型(LLM)从基于文本的推理扩展到更复杂环境中的行动执行,例如网络浏览器和图形用户界面(GUI)。然而,现有的网络和GUI代理基准通常依赖于简化设置、孤立任务或短周期交互,难以评估代理在现实专业工作流程中的能力。软件即服务(SaaS)环境是CUA评估的自然选择,因为它们承载了现代数字工作的很大一部分,并且自然涉及动态系统状态、跨应用协调、领域特定知识和长期依赖。为此,我们引入了SaaS-Bench,一个基于23个可部署SaaS系统(涵盖六个专业领域)的基准,包含106个基于现实工作场景的任务。这些任务需要长期执行,涵盖纯文本和多模态设置,并通过加权验证检查点进行评估,以衡量严格任务完成和部分进展。实验表明,代表性的基于LLM的代理在SaaS-Bench上表现不佳,即使最强的模型端到端完成任务也少于4%,暴露了在规划、状态跟踪、跨应用上下文维护和错误恢复方面的局限性。代码可在https://github.com/UniPat-AI/SaaS-Bench获取以进行复现。

英文摘要

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess the capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.
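Weighted verification checkpoints yield two numbers per task: a partial-progress score and a strict pass/fail. A minimal sketch of one way to compute both, assuming each checkpoint is a `(weight, passed)` pair; the benchmark's real data format and weighting scheme are not given in the abstract.

```python
def score_task(checkpoints):
    """checkpoints: list of (weight, passed) pairs for one task (hypothetical format)."""
    total = sum(weight for weight, _ in checkpoints)
    earned = sum(weight for weight, passed in checkpoints if passed)
    partial = earned / total                           # weighted partial progress in [0, 1]
    strict = all(passed for _, passed in checkpoints)  # end-to-end completion
    return partial, strict

# Hypothetical task: the agent clears two of three checkpoints.
print(score_task([(2, True), (1, False), (1, True)]))
# (0.75, False)
```

An agent can therefore accumulate substantial partial credit while still counting as a failure under the strict end-to-end metric.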

2605.15759 2026-05-26 cs.CL

DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

DimMem:面向高效长期智能体记忆的维度结构化

Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

AI总结 提出DimMem维度记忆框架,通过原子化、类型化、自包含的记忆单元(含时间、地点、原因等显式字段)实现维度感知检索与更新,在LoCoMo-10和LongMemEval-S上分别达到81.43%和78.20%准确率,且每查询token成本降低24%。

详情
AI中文摘要

大型语言模型(LLM)智能体需要长期记忆来利用过去交互中的信息。然而,现有的记忆系统常常面临保真度与效率之间的权衡:原始对话历史成本高昂,而扁平化的事实或摘要可能丢弃精确回忆所需的结构。我们提出DimMem,一种轻量级维度记忆框架,将每条记忆表示为一个原子化、类型化、自包含的单元,并带有显式字段,如时间、地点、原因、目的和关键词。这种表示暴露了维度感知检索、记忆更新和选择性助手上下文回忆所需的结构,而无需在模型上下文中存储完整历史。在LoCoMo-10和LongMemEval-S上,DimMem分别达到81.43%和78.20%的整体准确率,优于现有的轻量级记忆系统,同时将LoCoMo每查询token成本降低24%。我们进一步证明,维度记忆提取可通过紧凑模型学习:在DimMem模式上微调后,Qwen3-4B提取器在两个基准测试上均超越使用GPT-4.1-mini的LightMem,并在关键设置中达到与更大提取器相当或更优的性能。这些结果表明,显式维度结构化是LLM智能体长期记忆有效且高效的基础。代码见https://github.com/ChowRunFa/DimMem。

英文摘要

Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.
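A memory unit with explicit fields can be pictured as a small typed record, with retrieval filtering on those fields. The sketch below is hypothetical: the field names follow the abstract (time, location, reason, purpose, keywords), but the `kind` field, the `retrieve` helper, and exact-match filtering are illustrative assumptions and not the DimMem schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MemoryUnit:
    """One atomic, self-contained memory with explicit dimensions (hypothetical schema)."""
    content: str
    kind: str                               # e.g. "event" or "preference" (assumed types)
    time: Optional[str] = None
    location: Optional[str] = None
    reason: Optional[str] = None
    purpose: Optional[str] = None
    keywords: List[str] = field(default_factory=list)

def retrieve(memories, **dims):
    """Return memories matching every requested dimension; `keyword` checks membership."""
    def matches(m):
        for name, wanted in dims.items():
            if name == "keyword":
                if wanted not in m.keywords:
                    return False
            elif getattr(m, name) != wanted:
                return False
        return True
    return [m for m in memories if matches(m)]

store = [
    MemoryUnit("Booked flight to Lisbon", "event", time="2024-05", location="Lisbon",
               purpose="conference", keywords=["travel", "flight"]),
    MemoryUnit("Prefers aisle seats", "preference", keywords=["travel", "seat"]),
]
print([m.content for m in retrieve(store, keyword="travel", location="Lisbon")])
```

Filtering on a field such as `location` narrows the candidate set before any text is placed in the model context, which is the kind of saving the per-query token figure points at.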

2605.15011 2026-05-26 cs.CL

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

科学贡献图:基于文献的规模化自动技术路线图绘制

Peter A. Jansen

AI总结 提出从学术论文中提取科学贡献并链接其前提条件的自动技术路线图任务,构建包含200万贡献和1250万前提边的AI/NLP领域科学贡献图,并引入科学前提预测任务,实验表明现有模型在该任务上表现快速提升。

Comments 8 pages, 5 figures

详情
AI中文摘要

科学贡献很少孤立发展,而是建立在先前发现的基础上。我们将自动技术路线图的任务定义为从学术文章中提取科学贡献并将其与前提条件联系起来。我们提出了科学贡献图,这是一个大规模的人工智能/自然语言处理领域资源,包含从23万篇开放获取论文中提取的200万个详细科学贡献,并通过1250万条前提边连接。我们进一步引入了科学前提预测,这是一项科学发现任务,模型预测哪些现有技术可以促成未来的发现,并表明当代模型在该任务上迅速改进,在使用时间过滤回测评估时达到0.48 MAP。我们预计这样的技术路线图资源将支持科学影响评估和自动科学发现。

英文摘要

Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.
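The 0.48 figure is a mean average precision (MAP). As a reminder of how that ranking metric is computed, here is a minimal sketch using the standard definition; the example rankings are made up and the paper's exact evaluation setup (candidate pools, temporal filtering) is not reproduced.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical predictions of prerequisites for two future contributions.
queries = [
    (["a", "b", "c", "d"], {"a", "c"}),   # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                  # AP = 1/2
]
print(round(mean_average_precision(queries), 4))
```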

2605.14890 2026-05-26 cs.CL cs.AI

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

分词器生育率与基础模型在乌克兰法律文本上的零样本性能:一项比较研究

Volodymyr Ovcharov

AI总结 本研究比较了七种基础模型在乌克兰法律文本上的分词器生育率和零样本性能,发现分词器生育率差异达1.6倍,Qwen 3模型比Llama系列多消耗60%的token,而NVIDIA Nemotron Super 3 (120B)以更低的成本取得最佳性能,同时揭示了少样本提示在形态丰富语言上的退化以及战时法律语言对模型泛化的影响。

Comments 25 pages, 13 tables, 5 figures; v2 adds cross-temporal generalization experiment and classical baseline

详情
AI中文摘要

在乌克兰法律文本上,不同基础模型的分词器生育率差异达1.6倍,然而这一成本关键维度在模型选择实践中被忽视。我们使用来自乌克兰国家登记册(EDRSR)的273份经过验证的法院判决,对来自五个提供商的七个模型进行了基准测试,测量了分词器生育率以及在三个任务上的零样本性能。发现了四个结果。(1)Qwen 3模型在相同输入上比Llama系列模型多消耗60%的token,使得分词器分析成为成本高效部署的前提。(2)NVIDIA Nemotron Super 3 (120B)取得了最高综合得分(83.1),以三分之一的API成本超越了Mistral Large 3(总参数多5.6倍)——模型规模并不能很好地代表领域性能。(3)少样本提示使性能下降高达26个百分点;分层和提示敏感性消融实验证实,这是乌克兰语演示的内在问题,而非示例选择的伪影。(4)跨时间泛化实验表明,在战前法院判决(2008-2013)上训练的分类器,应用于全面入侵时期的判决(2022-2026)时,性能下降27.9个百分点,并呈现出显著的前后不对称性:较新的模型向后迁移效果更好(比向前迁移高14.6个百分点),但较旧的模型在战时法律语言上完全失败。对于从业者:分词器分析应优先于模型选择,对于形态丰富的语言,零样本比少样本更可靠。为了支持可重复性并解决乌克兰语在法律NLP基准中的缺失,我们发布了一个包含14,452份法院判决的公开数据集,时间跨度为2008-2026年,标注了三个时间段的七个结果标签,这些时间段捕捉了武装冲突对司法程序的影响。

英文摘要

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost; model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court decisions (2008-2013) lose 27.9 percentage points when applied to decisions from the full-scale invasion era (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.
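Tokenizer fertility is commonly measured as the average number of tokens per word. The sketch below uses that convention with two toy tokenizers that split words into fixed-size character chunks; these stand-ins are hypothetical and unrelated to the Qwen, Llama, or other tokenizers in the study, and the paper's exact definition of fertility may differ.

```python
def chunk_tokenizer(size):
    """Toy tokenizer (assumption): split each word into chunks of `size` characters."""
    def tokenize(text):
        return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
    return tokenize

def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    tokens = sum(len(tokenize(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

corpus = ["Верховний Суд України постановив"]   # illustrative sentence, not from EDRSR
coarse = fertility(chunk_tokenizer(4), corpus)
fine = fertility(chunk_tokenizer(2), corpus)
print(f"coarse: {coarse:.2f}, fine: {fine:.2f}, ratio: {fine / coarse:.2f}x")
```

Since API pricing is usually per token, a fertility ratio of 1.6x between two tokenizers translates into roughly 60% more billed tokens for the same input text.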

2605.14559 2026-05-26 cs.AI math.OC

PyCSP3-Scheduling: A Scheduling Extension for PyCSP3

PyCSP3-Scheduling: PyCSP3的调度扩展

Sohaib Afifi

AI总结 提出PyCSP3 Scheduling库,通过53个专用约束和27个表达式为PyCSP3添加调度抽象,并编译为标准约束,在261个实例上验证了与原始公式的目标一致性,但运行时性能因编译开销而异。

详情
AI中文摘要

PyCSP$^3$提供了一种高效构建约束模型以解决组合约束问题的方法,并将其导出为XCSP$^3$,保持了建模与求解的完全分离。然而,它缺乏对调度抽象(如区间变量、序列变量和资源函数)的原生支持。因此,即使PyCSP$^3$已经提供了如NoOverlap和Cumulative等整数数组上的全局约束,调度模型仍需通过低层整数变量和手动通道约束进行编码。我们提出了PyCSP$^3$ Scheduling,一个通过53个专用约束和27个表达式为PyCSP$^3$添加调度抽象的库,并将其编译为标准PyCSP$^3$/XCSP$^3$约束,维护了支撑PyCSP$^3$生态系统的建模/求解分离。在17个模型家族(每个5次运行)的261个配对实例上,两种公式在所有72个双重证明最优对以及近一半的家族(8/17)中产生了相同的目标值,且在编译后结构保持不变;然而,运行时性能在不同家族间存在差异,部分家族有显著提升(高达5.8倍),而其他家族由于编译分解的开销出现性能下降。代码和基准测试可在以下网址获取:https://github.com/sohaibafifi/pycsp3-scheduling

英文摘要

PyCSP$^3$ provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP$^3$, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP$^3$ already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP$^3$ Scheduling, a library that adds scheduling abstractions to PyCSP$^3$ through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP$^3$/XCSP$^3$ constraints, maintaining the modeling/solving separation that underpins the PyCSP$^3$ ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3-scheduling
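To make the idea of compiling a scheduling abstraction down to integer variables more concrete, here is a minimal pure-Python sketch. It does not use PyCSP3 or the PyCSP3 Scheduling API: the `Interval` class, `compile_no_overlap`, and `min_makespan` are hypothetical names, and a brute-force enumeration stands in for a real solver. It only illustrates how an interval with a fixed length can be lowered to an integer start variable plus pairwise disjunctive constraints.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Interval:
    """Hypothetical fixed-length interval variable (not the library's class)."""
    name: str
    length: int


def compile_no_overlap(intervals: List[Interval], horizon: int):
    """Lower intervals to integer start domains plus the pairs that must not overlap."""
    domains = {iv.name: range(0, horizon - iv.length + 1) for iv in intervals}
    pairs = [(a, b) for i, a in enumerate(intervals) for b in intervals[i + 1:]]
    return domains, pairs


def min_makespan(intervals: List[Interval], horizon: int) -> Optional[Tuple[int, Dict[str, int]]]:
    """Enumerate start times (stand-in for a solver) and keep the best makespan."""
    domains, pairs = compile_no_overlap(intervals, horizon)
    names = [iv.name for iv in intervals]
    length = {iv.name: iv.length for iv in intervals}
    best = None
    for starts in product(*(domains[n] for n in names)):
        s = dict(zip(names, starts))
        # Disjunctive constraint: for each pair, one interval ends before the other starts.
        feasible = all(
            s[a.name] + a.length <= s[b.name] or s[b.name] + b.length <= s[a.name]
            for a, b in pairs
        )
        if feasible:
            makespan = max(s[n] + length[n] for n in names)
            if best is None or makespan < best[0]:
                best = (makespan, s)
    return best


tasks = [Interval("cut", 3), Interval("weld", 2), Interval("paint", 4)]
result = min_makespan(tasks, horizon=10)
print(result)
```

On a single machine the optimal makespan equals the sum of the task lengths, so a horizon shorter than that sum leaves no feasible schedule and the function returns `None`.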

2605.14552 2026-05-26 cs.CV

LiWi: Layering in the Wild

LiWi: 野外分层

Yu He, Fang Li, Haoyang Tong, Lichen Ma, Xinyuan Shan, Jingling Fu, Dong Chen, Luohang Liu, Junshi Huang, Yan Li

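AI Summary Proposes an agent-driven data decomposition pipeline and a framework that jointly optimizes photometric fidelity and alpha boundary accuracy, enabling high-fidelity layered decomposition of in-the-wild natural images; builds the LiWi-100k dataset and reaches state-of-the-art performance.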

Comments Project Page https://rassetmusty.github.io/LiWi

详情
AI中文摘要

生成模型的最新进展使得令人印象深刻的分层图像生成成为可能,但其成功主要局限于图形设计领域。野外图像的分层仍然是一个未充分探索的问题,限制了细粒度编辑和图像在真实场景中的应用。具体而言,可扩展的分层数据和自然图像中对象交互(如光照效果和结构边界)的建模仍面临挑战。为解决这些瓶颈,我们提出了一种用于高保真自然图像分解的新框架。首先,我们引入了一种代理驱动数据分解(ADD)流水线,该流水线协调代理和工具以合成分层数据,无需人工干预。利用该流水线,我们构建了一个大规模数据集LiWi-100k,包含超过10万张高质量的分层野外图像。其次,我们提出了一个新框架,联合改进光度保真度和alpha边界精度。具体而言,阴影引导学习显式建模光照效果,退化-恢复目标通过从退化图像恢复干净前景图像提供边界校正监督。大量实验表明,我们的框架在自然图像分解中达到了最先进的性能,在RGB L1和Alpha IoU指标上优于现有模型。我们将很快发布代码和数据集。

英文摘要

Recent advances in generative models have enabled impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in obtaining scalable layered data and in modeling object interactions in natural images, such as illumination effects and structural boundaries. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and a degradation-restoration objective provides boundary-correction supervision by recovering a clean foreground image from a degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models on the RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.
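The abstract names RGB L1 and Alpha IoU as evaluation metrics without defining them. The sketch below shows one plausible reading, assuming RGB L1 is a mean absolute error over color values in [0, 1] and Alpha IoU is an intersection-over-union of alpha mattes binarized at a threshold. The function names, the flat-list data layout, and the 0.5 threshold are assumptions for illustration and may differ from the paper's evaluation protocol.

```python
from typing import List


def rgb_l1(pred: List[float], target: List[float]) -> float:
    """Mean absolute error over flattened RGB values in [0, 1] (assumed definition)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def alpha_iou(pred_alpha: List[float], target_alpha: List[float],
              threshold: float = 0.5) -> float:
    """IoU of alpha mattes after binarizing at `threshold` (assumed definition)."""
    if len(pred_alpha) != len(target_alpha):
        raise ValueError("inputs must be of equal length")
    p = [a >= threshold for a in pred_alpha]
    t = [a >= threshold for a in target_alpha]
    inter = sum(1 for x, y in zip(p, t) if x and y)
    union = sum(1 for x, y in zip(p, t) if x or y)
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


# Tiny made-up example: three color values and a four-pixel alpha matte.
l1 = rgb_l1([0.0, 0.5, 1.0], [0.0, 0.0, 1.0])
iou = alpha_iou([0.9, 0.8, 0.1, 0.6], [1.0, 0.0, 0.0, 1.0])
print(f"RGB L1: {l1:.4f}  Alpha IoU: {iou:.4f}")
```

Lower RGB L1 and higher Alpha IoU indicate a closer match to the reference layer under these assumed definitions.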