arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.24545 2026-05-26 cs.LG cs.AI

Rethinking Federated Unlearning via the Lens of Memorization

Jiaheng Wei, Yanjun Zhang, He Zhang, Leo Yu Zhang, Chao Chen, Kok-Leong Ong, Jun Zhang, Yang Xiang

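Affiliations: Royal Melbourne Institute of Technology; Griffith University; Swinburne University of Technology

AI summary: To address ineffective unlearning and unfairness between clients caused by overlap between the unlearning data and the remaining data in federated learning, the paper proposes a federated memorization pruning method built on Grouped Memorization Evaluation, which achieves efficient unlearning by resetting the redundant parameters responsible for memorization.

Comments: Accepted by SIGKDD 2026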

Abstract

Federated learning (FL) increasingly needs machine unlearning to comply with privacy regulations. However, existing federated unlearning approaches may overlook the overlapping information between the unlearning and remaining data, leading to ineffective unlearning and unfairness between clients. In this work, we revisit federated unlearning through the lens of memorization. We argue that unlearning should mainly remove the unique memorized information attributable to the data to be forgotten, while preserving overlapping patterns that are also supported by the remaining data. Specifically, we propose Grouped Memorization Evaluation, an example-level metric that separates memorized knowledge from overlapping knowledge. Building on this metric, we introduce Federated Memorization Pruning (FedMemPrune), a pruning-based unlearning approach that resets redundant parameters responsible for memorization. Extensive experiments show that FedMemPrune closely matches retraining-based unlearning baselines while more effectively eliminating memorization than existing federated unlearning algorithms, yielding strong unlearning performance without sacrificing the utility of retained knowledge.
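The pruning step described above, resetting the parameters most responsible for memorization, might be sketched as follows. This is only a minimal illustration: the per-parameter scores, the `reset_fraction` argument, and the function name `reset_memorizing_params` are hypothetical. The abstract does not specify how Grouped Memorization Evaluation scores are mapped to parameters or which values parameters are reset to, so the sketch assumes a reset to the initial values.

```python
def reset_memorizing_params(params, init_params, scores, reset_fraction=0.1):
    """Reset the highest-scoring parameters to their initial values.

    params, init_params, scores: equal-length lists of floats.
    scores[i] is a hypothetical memorization-attribution score for
    parameter i (higher means more responsible for memorization).
    """
    if not (len(params) == len(init_params) == len(scores)):
        raise ValueError("all inputs must have the same length")
    n_reset = int(len(params) * reset_fraction)
    # Indices sorted by descending score; the first n_reset are reset.
    ranked = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)
    to_reset = set(ranked[:n_reset])
    return [init_params[i] if i in to_reset else p for i, p in enumerate(params)]


trained = [0.9, -1.2, 0.3, 2.0, -0.5]
initial = [0.0, 0.0, 0.0, 0.0, 0.0]
mem_scores = [0.1, 0.8, 0.05, 0.9, 0.2]  # illustrative values only
pruned = reset_memorizing_params(trained, initial, mem_scores, reset_fraction=0.4)
print(pruned)
```

In this toy call, only the two parameters with the largest scores are reset, while the rest keep their trained values, which mirrors the stated goal of removing memorized information while preserving overlapping knowledge.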

2605.24543 2026-05-26 cs.AI cs.SY eess.SY

Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration

Ninglin Ou, Mohammad A. Razzaque, Iftekher Islam Shovon, Shafkat Khan Siam, Shafiuzzaman K Khadem, Krishnendu Guha, Mayeen U Khandaker, Md. Noor-A-Rahim

Affiliations: nasc Research, School of Computer Science & IT, University College Cork, Ireland; School of Computing, Engineering & Digital Technologies, Teesside University, UK; International Energy Research Centre, Tyndall National Institute, Cork, Ireland; Radiation Technologies Group, CCDCU, Faculty of Engineering Technology, Sunway University, Malaysia; Department of Physics, College of Science, Korea University, Republic of Korea

AI summary: Proposes an emission-aware reinforcement learning strategy based on the Soft Actor-Critic algorithm that optimizes EV charging schedules through a multi-objective reward function, achieving up to 87% carbon emission reduction and a 52% renewable self-consumption rate on the EV2Gym platform.

Comments: Submitted to Engineering Applications of Artificial Intelligence (Elsevier)

Abstract

The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, and transformer overloads from uncoordinated charging. While Model Predictive Control (MPC) and standard Reinforcement Learning (RL) methods have addressed these issues, existing approaches rarely treat real-time carbon intensity or fluctuating renewable energy (RE) availability as primary scheduling objectives, leaving substantial decarbonisation potential unrealised. This paper proposes an emission-aware RL strategy based on the Soft Actor Critic (SAC) algorithm, with a multi-objective reward that penalises carbon emissions, curtailed on-site renewables, and unmet user demand. The agent is trained within a unified benchmarking framework on the EV2Gym platform, incorporating behind-the-meter solar and wind profiles, time-varying EirGrid carbon intensity data, and realistic workplace EV behaviour across 25 Electric Vehicle Supply Equipment (EVSE) units. Nine control strategies, including heuristics, emission-aware MPC variants, and the proposed RL agent, are compared under five renewable penetration scenarios (0%-50%) over ten independent runs each. The RL agent achieves a carbon intensity as low as 23.96 grams of carbon dioxide per kilowatt-hour under 50% wind penetration, representing up to 87% emission reduction versus the uncontrolled baseline, and outperforms the external graph-based Power Distribution Network (PDN) benchmark. Transformer overload remains below 7 kWh across scenarios, against up to 1093 kWh for the As Fast As Possible (AFAP) heuristic, and renewable self-consumption reaches 52% under combined wind and solar supply. Embedding carbon intensity forecasts into the RL state and reward aligns charging with low-emission periods while preserving grid compliance and user satisfaction.
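The multi-objective reward mentioned above penalises carbon emissions, curtailed on-site renewables, and unmet user demand. One plausible form is a negative weighted sum of the three terms, sketched below. The weights, units, and the function name `emission_aware_reward` are assumptions for illustration; the abstract does not give the actual reward formula.

```python
def emission_aware_reward(grid_kwh, carbon_intensity_g_per_kwh,
                          curtailed_re_kwh, unmet_demand_kwh,
                          w_carbon=1.0, w_curtail=1.0, w_unmet=1.0):
    """Negative weighted sum of three penalty terms for one time step.

    All weights are hypothetical; in practice they would need tuning.
    """
    # Emissions attributed to energy drawn from the grid, in kg CO2.
    emissions_kg = grid_kwh * carbon_intensity_g_per_kwh / 1000.0
    penalty = (w_carbon * emissions_kg
               + w_curtail * curtailed_re_kwh
               + w_unmet * unmet_demand_kwh)
    return -penalty


# Same grid energy at a high-carbon and a low-carbon time step.
dirty = emission_aware_reward(10.0, 400.0, 0.0, 0.0)
clean = emission_aware_reward(10.0, 50.0, 0.0, 0.0)
print(dirty, clean)
```

Under this form, charging the same energy in a low-carbon period yields a higher reward, which is the mechanism by which a learned policy would shift charging toward low-emission periods.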

2605.24541 2026-05-26 cs.LG cs.AI cs.CL cs.IR

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Natalia Trukhina, Vadim Vashkelis

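Affiliations: Embedded Intelligence Lab (EMILAB)

AI summary: Proposes the SemanticZip framework, in which text is compressed into compact codes that an LLM decompresses into task-relevant semantics. Six representations, including structured prose and JSON, are evaluated: structured prose has the highest recovery (WAR = 0.956, 19.1% token gain), while CCL-Min offers the best balance (39.4% token gain, WAR = 0.874).

Comments: 13 pages, 1 figure, 2 tables. Pilot framework paper; code and supplementary artifacts available in ancillary files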

Abstract

Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.
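The scores named above (Critical Atom Recall, Weighted Atom Recall, and tokenizer gain) could be computed along the following lines. The definitions here are assumptions inferred from the metric names, not the paper's exact formulas: recall is taken over atom identifiers, and token gain is taken as the fractional reduction in token count. The atom names and weights are made up for the example.

```python
def weighted_atom_recall(atom_weights, recovered):
    """atom_weights: dict of atom id -> weight; recovered: set of recovered atom ids."""
    total = sum(atom_weights.values())
    hit = sum(w for atom, w in atom_weights.items() if atom in recovered)
    return hit / total if total else 0.0


def critical_atom_recall(critical_atoms, recovered):
    """Fraction of critical atoms that the decoder recovered."""
    critical = set(critical_atoms)
    return len(critical & set(recovered)) / len(critical) if critical else 0.0


def token_gain(original_tokens, compressed_tokens):
    """Fractional token saving of the compressed form relative to the original."""
    return 1.0 - compressed_tokens / original_tokens


atoms = {"dose": 3.0, "deadline": 2.0, "greeting": 1.0}  # hypothetical atoms
decoded = {"dose", "deadline"}
war = weighted_atom_recall(atoms, decoded)
car = critical_atom_recall(["dose"], decoded)
gain = token_gain(200, 121)
print(war, car, gain)
```

Weighting recall by atom importance reflects the design principle in the abstract: losing a low-weight atom costs little, while losing a critical one is penalised separately.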

2605.24539 2026-05-26 cs.AI

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

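Lirong Che, Yuzhe Yang, Peiwen Lin, Chuang Wang, Xueqian Wang, Jian Su

Affiliations: Tsinghua University; AgiBot

AI summary: Proposes DemoEvolve, which uses human demonstrations to guide harness evolution and addresses the fragility of self-generated rollouts under sparse feedback and high variance in long-horizon stochastic environments; its effectiveness is validated on Liar's Dice and Balatro.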

Abstract

Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

2605.24534 2026-05-26 cs.CL

Generating Legal Commentaries from Case Databases via Retrieval, Clustering, and Generation

Max Prior, Niklas Wais, Matthias Grabmair

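Affiliations: Technical University of Munich

AI summary: Proposes a fully automated pipeline that uses retrieval, clustering, and generation to produce legal commentaries from court decisions without a handcrafted doctrinal framework.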

Abstract

We present a fully automated pipeline that transforms large collections of court decisions into legal commentaries for statutes, without providing any handcrafted doctrinal framework. Using 4,555 decisions of the German Federal Court of Justice that cite sections 242, 280, 812 and 823 of the German Civil Code (BGB), we extract paragraph-level chunks, summarize their reasoning, and derive keywords, which are embedded and clustered. For each cluster, an LLM generates headings and synthesizes citation-rich sections, which are then merged into coherent commentaries by four state-of-the-art LLMs. We evaluate along five dimensions (topical relevance, heading match, citation faithfulness, cluster distinction, and logical ordering) using both a human expert and an LLM judge. Our results show that commentary-like argument mining from court decisions is feasible and yields reports that can be refreshed within minutes at minimal cost, yet they also highlight limitations arising from restricted sources and the normativity of legal reasoning.

2605.24533 2026-05-26 cs.CV

Learnable Shape Prototypes with Occlusion-Geometry-Guided Injection for Amodal Instance Segmentation

Fufan Zhang, Jingxiang Wang, Xiangjie Ye

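Affiliations: School of Mechanical Engineering and Automation, Northeastern University; School of Information Science and Engineering, Northeastern University

AI summary: Proposes a gated reliability-adaptive shape prior framework that generates instance-adaptive shape priors from learnable prototypes via cross-attention and modulates injection intensity using the signed distance field of the visible mask, outperforming existing methods under multiple evaluation settings.

Comments: 13 pages, 7 figures, 5 tables. Submitted to IEEE Transactions on Circuits and Systems for Video Technology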

Abstract

Amodal instance segmentation aims to predict the complete object mask including occluded regions that lack pixel-level observations and must be inferred with the aid of shape priors. Existing methods acquire shape priors through fixed-capacity encoding spaces or expensive generative models, and inject them uniformly across all spatial positions without adapting to the varying prior demand between visible and occluded regions. In this paper, we propose a gated reliability-adaptive shape prior framework, which introduces a shape prior memory module that combines learnable prototypes via cross-attention to produce instance-adaptive shape priors through weighted prototype combination rather than generation. A spatial adaptive reliability gate then employs the signed distance field of the visible mask to modulate injection intensity at each position according to its occlusion depth, preserving reliable features in visible regions while directing shape compensation toward occluded areas. Experiments on two mainstream amodal instance segmentation benchmarks demonstrate that the proposed method outperforms existing approaches under multiple evaluation settings, improving the mean intersection-over-union over occluded regions by over 11 percentage points on one of the two benchmarks under the standard setting, while using approximately one-third of the total parameters. Linear probing analysis further reveals that the visible-mask cross-attention module implicitly encodes occlusion geometry into visual token representations, explaining the effectiveness of the proposed module decomposition.
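The spatial gate described above uses the signed distance field of the visible mask to decide how strongly the shape prior is injected at each position. A minimal sketch of that idea follows, assuming a brute-force distance computation on a tiny grid and a sigmoid gate. The sign convention, the `temperature` parameter, and both function names are illustrative choices, not details taken from the paper.

```python
import math


def signed_distance_field(mask):
    """Brute-force signed distance: negative inside the visible mask, positive outside.

    mask: 2D list of 0/1. Distance is measured to the nearest pixel of the
    opposite class, which is adequate for a small illustrative grid.
    """
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            targets = outside if mask[r][c] else inside
            if not targets:
                continue  # degenerate mask (all 0 or all 1): no boundary
            d = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
            sdf[r][c] = -d if mask[r][c] else d
    return sdf


def reliability_gate(sdf, temperature=1.0):
    """Sigmoid gate in (0, 1): small inside the visible mask, larger away from it."""
    return [[1.0 / (1.0 + math.exp(-v / temperature)) for v in row] for row in sdf]


visible = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
gate = reliability_gate(signed_distance_field(visible))
# A fused feature at each position could then be: feature + gate * shape_prior
print([round(g, 3) for g in gate[1]])
```

With this convention the gate stays low where visible evidence is reliable and rises with distance from the visible region, so shape compensation is directed toward positions that are more likely occluded.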

2605.24532 2026-05-26 cs.CV

Image-Conditioned Instance Prompt Network for Referring Remote Sensing Image Segmentation

Biaoyu Ren, Qingsheng Wang, Cun Xu, Dingkang Yang, Wenxuan Wang

Affiliations: School of Computer Science, Northwestern Polytechnical University, Xi'an, China; College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China; Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen, China

AI summary: Proposes the Image-Conditioned Instance Prompt Network (ICIPNet), which uses self-adaptive visual and semantic representations and a Bilateral Information Fusion module to alleviate the cross-modal feature fusion bottleneck and improve referring remote sensing image segmentation.

Comments: 6 pages, 3 figures. Equal contribution: Biaoyu Ren and Qingsheng Wang. Corresponding authors: Dingkang Yang and Wenxuan Wang

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a situated, task-driven cross-modal task related to the embodied perception paradigm, requiring models to align visual-spatial features with linguistic intentions for precise target perception. Recent research has focused on refining the granularity of textual features and optimizing image-text feature fusion to better guide target feature representations. However, insufficient descriptive granularity and sensitivity to semantic shifts can cause bottlenecks in cross-modal feature fusion. To address these issues, we propose the Image-Conditioned Instance Prompt Network (ICIPNet) with Bilateral Information Fusion, which is designed to alleviate bottlenecks in cross-modal feature fusion. ICIPNet introduces an Image-Conditioned Instance Prompt (ICIP) module to generate self-adaptive visual and semantic representations without external knowledge. The Bilateral Information Fusion (BIF) module enhances feature fusion along the token and channel dimensions. Experiments demonstrate that the proposed ICIPNet outperforms existing RRSIS models.

2605.24531 2026-05-26 cs.CV

NudgeVAD: Language-Nudged End-to-End Driving via FiLM Residuals

Chieh-Chi Yang, Yu-Hsiang Chen, Yi-Ting Chen

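Affiliations: National Yang Ming Chiao Tung University

AI summary: Proposes the NudgeVAD framework, which uses language as a calibrated nudge signal; with identity-initialized FiLM and a zero-initialized residual head, it substantially improves driving trajectory prediction when commands are unreliable.

Comments: Technical report for the doScenes Instructed Driving Challenge, CVPR 2026 DriveX Workshop. 1st place in the Ablation track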

Abstract

Natural-language instructions promise controllable end-to-end driving, but their benefit can be hidden when planners already receive reliable high-level commands. We propose NudgeVAD, a frozen-planner residual framework that uses language as a calibrated nudge to a VAD trajectory. With identity-initialized FiLM and a zero-initialized residual head, NudgeVAD is equivalent to the frozen planner at initialization, so learned deviations arise only from language-conditioned residuals. We evaluate NudgeVAD along a command-reliability axis. With reliable commands, language improves the initial planner but becomes nearly redundant once compared against VAD-FT (UNCOND), a compute-matched VAD model fine-tuned without language. With random commands, however, language becomes essential: detaching text degrades ADE6s to 3.166 m, while NudgeVAD with text recovers 2.806 m and outperforms VAD-FT (UNCOND) by 0.312 m. These results show that language is not universally additive; it is most valuable when the categorical command channel is unreliable.
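The claim that the model equals the frozen planner at initialization follows from two choices: FiLM starts as the identity (scale 1, shift 0) and the residual head starts at zero. The sketch below illustrates this with plain Python lists. The class name `FiLMResidualNudge`, the layer shapes, and the flat trajectory vector are hypothetical simplifications, not the actual NudgeVAD architecture.

```python
def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


class FiLMResidualNudge:
    """Language-conditioned FiLM plus a residual head, both initialized as no-ops."""

    def __init__(self, feat_dim, text_dim, traj_dim):
        # Zero weights give gamma = 1 and beta = 0 at initialization (identity FiLM).
        self.w_gamma = zeros(feat_dim, text_dim)
        self.w_beta = zeros(feat_dim, text_dim)
        # Zero-initialized residual head: the trajectory offset starts at 0.
        self.w_head = zeros(traj_dim, feat_dim)

    def forward(self, planner_feat, text_emb, frozen_traj):
        gamma = [1.0 + g for g in matvec(self.w_gamma, text_emb)]
        beta = matvec(self.w_beta, text_emb)
        modulated = [g * f + b for g, f, b in zip(gamma, planner_feat, beta)]
        residual = matvec(self.w_head, modulated)
        return [t + r for t, r in zip(frozen_traj, residual)]


model = FiLMResidualNudge(feat_dim=3, text_dim=2, traj_dim=4)
frozen = [0.0, 1.0, 0.5, 2.0]
out = model.forward([0.2, -0.4, 1.1], [0.7, -0.3], frozen)
print(out == frozen)
```

Because the output matches the frozen trajectory before any training, whatever deviation appears later can be attributed to the learned language-conditioned parameters.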

2605.24530 2026-05-26 cs.CL cs.CV

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

Affiliations: State Key Laboratory of General Artificial Intelligence, Peking University; School of Intelligence Science and Technology, Peking University; Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Target Cognition and Application Technology; Beijing Institute of Technology; Ucap Cloud

AI summary: Proposes the Unveil framework, which achieves robust document retrieval through visual-textual embedding and knowledge distillation, capturing both layout and semantic information.

Comments: ACL 2025 Main Conference

Abstract

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
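The distillation step above transfers knowledge from a visual-textual teacher to a visual-only student. One common way to distill embedding models is to align the student's embedding of each document with the teacher's, for example with a cosine-distance loss. The sketch below shows that generic technique; it is an assumption for illustration, since the abstract does not state which distillation objective Unveil uses.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(student_embs, teacher_embs):
    """Mean (1 - cosine similarity) between paired student and teacher embeddings."""
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)


teacher = [[1.0, 0.0], [0.0, 2.0]]     # visual-textual model (frozen teacher)
aligned = [[2.0, 0.0], [0.0, 1.0]]     # student pointing the same way
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # student orthogonal to the teacher
loss_aligned = distillation_loss(aligned, teacher)
loss_misaligned = distillation_loss(misaligned, teacher)
print(loss_aligned, loss_misaligned)
```

A cosine objective ignores embedding magnitude, which suits retrieval settings where documents are ranked by angular similarity.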

2605.22794 2026-05-26 cs.AI cs.LG

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Qianshu Cai, Yonggang Zhang, Xianzhang Jia, Huajiang Zheng, Wei Xue, Jun Song, Xinmei Tian, Yike Guo

Affiliations: University of Science and Technology of China; Hong Kong Generative AI Research & Development Center; The Hong Kong University of Science and Technology; Hong Kong Baptist University

AI summary: Proposes MOSS, a system that enables self-evolution of autonomous agent systems through source-level rewriting, using automatically curated batches of production-failure evidence and a deterministic multi-stage pipeline; on OpenClaw it lifts the mean grader score from 0.25 to 0.61 in a single cycle.

Comments: 12 pages, 3 figures, 2 tables. Preprint. Code: https://github.com/hkgai-official/Moss

Abstract

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

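Affiliations: The University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary: Proposes the AnyMo framework, which combines physics-based simulation of diverse IMU signals, graph encoder pre-training, and LLM alignment to enable zero-shot activity recognition, cross-modal retrieval, and motion captioning across devices and datasets, with substantial performance gains.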

Abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.22684 2026-05-26 cs.LG

ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification

José Alberto Rodríguez, Luis Balderas, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

Affiliations: Department of Computer Science and Artificial Intelligence; DiCITS; iMUDS; DaSCI; University of Granada; Advanced Medical Imaging Group; Instituto de Investigación Biosanitaria de Granada; Department of Software Engineering; Department of Rural Engineering; University of Córdoba

AI summary: Proposes ChronoVAE-HOPE, a next-generation time series foundation model based on a VAE and HOPE blocks (comprising Titans modules and a Continuum Memory System), which separates trend and seasonal components through a disentangled latent space and performs strongly on UCR benchmark classification tasks.

Abstract

Time Series Foundation Models (TSFMs) have become a new component of the state-of-the-art in general time series forecasting. However, adapting them to specialized classification tasks remains constrained by two interconnected challenges: the quadratic cost of standard attention mechanisms and the inability to disentangle the structural components underlying time series variability. This technical report introduces ChronoVAE-HOPE, a next-generation TSFM that reconciles massive generalization with structured latent representation for time series classification. The core of the proposal is a Variational Autoencoder (VAE) framework built upon the HOPE Block, which replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. A key architectural novelty is the disentangled latent space, which factorizes representations into independent trend and seasonal components via dedicated encoder heads and separate decoder pathways. ChronoVAE-HOPE undergoes self-supervised pre-training on the Monash archive, combining a Masked Time Series Modeling (MTSM) auxiliary objective with a disentangled VAE reconstruction loss. The pre-trained encoder is subsequently frozen and used to generate fixed-length embeddings for downstream classification on the UCR benchmark datasets. Empirical results demonstrate strong performance across diverse temporal domains, particularly in settings characterized by strict causal structure. ChronoVAE-HOPE establishes a robust and interpretable framework for the adaptation of foundation models to time series classification through structured generative representations.

2605.22337 2026-05-26 cs.AI

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu

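Affiliations: Guangdong Institute of Intelligence Science and Technology; University of Macau; Chinese University of Hong Kong, Shenzhen; Hong Kong University of Science and Technology

AI summary: Proposes Meta-Soft, a dynamic compression framework that synthesizes meta-tokens using a learnable orthogonal basis matrix and a Gumbel-Softmax selector network, and combines them with an attention-flow integration mechanism that retains information from discarded context, addressing information loss and context breaks in KV cache compression.

Comments: 9 pages, 2 figures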

Abstract

The KV cache used in large language models grows linearly with context length, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. KV cache eviction has therefore become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) rely on a static parameter set as the query to evaluate the importance of KV pairs, so they cannot adapt dynamically to different input prompts or precisely capture complex and changing task relevance. In addition, evicted KV pairs are discarded permanently, which causes irreversible information loss and context breaks. To address these problems, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$ and use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, which dynamically synthesize the $k$ most targeted Soft Tokens from the input prompt features. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow-based integration mechanism, which redistributes the semantic information of removed tokens into retained tokens and thereby preserves the dropped context information. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV cache compression.
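The selection step above relies on Gumbel-Softmax to obtain near-sparse yet differentiable combination weights over a basis. The following sketch shows the standard Gumbel-Softmax relaxation applied to mixing basis rows into one soft token. The basis values, logits, temperature, and the function name `synthesize_soft_token` are hypothetical; in the described method the logits would come from a selector network and the basis would be learned.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_soft_token(basis, logits, temperature, rng):
    """Mix the rows of a basis matrix using Gumbel-Softmax weights."""
    weights = gumbel_softmax(logits, temperature, rng)
    dim = len(basis[0])
    token = [sum(w * row[d] for w, row in zip(weights, basis)) for d in range(dim)]
    return weights, token


rng = random.Random(0)
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # orthonormal rows
selector_logits = [2.0, 0.5, -1.0]  # stand-in for selector network output
weights, token = synthesize_soft_token(basis, selector_logits, temperature=0.5, rng=rng)
print(weights, token)
```

Lower temperatures push the weights toward a one-hot vector, which is how the relaxation approximates a sparse selection while remaining differentiable.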

2605.22242 2026-05-26 cs.LG physics.ao-ph

Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations

利用学习随机参数化分解 Lorenz '96 中的集合离散度

Birgit Kühbacher, Daan Crommelin, Niki Kilbertus

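Affiliations * Technical University of Munich; Helmholtz Munich; Munich Center for Machine Learning (MCML); Centrum Wiskunde & Informatica (CWI); Korteweg-de Vries Institute for Mathematics, University of Amsterdam

AI summary Using the two-scale Lorenz 1996 system, this study compares multiple ensemble configurations and parameterization strategies to systematically analyze how intrinsic variability, initial-condition perturbations, and stochastic model uncertainty affect ensemble spread. It finds that stochastic parameterizations, particularly those with temporally persistent structure, enhance early spread growth and improve spread-error consistency.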

AI Chinese abstract

由于混沌动力学、不完美的初始条件以及对底层物理过程的不完全表示,天气和气候预报本质上具有不确定性。业务集合预报旨在通过预报离散度来表示这些不确定性,然而许多方法产生的离散度估计不足,即离散度相对于预报误差增长过慢。利用双尺度 Lorenz 1996 系统作为广泛使用的受控测试平台,我们设计了一种系统方法来区分内在变率、初始条件扰动和随机模型不确定性。我们比较了多种集合配置和参数化策略,包括现有的确定性和自回归方法以及新颖的贝叶斯和基于流的方法。我们的结果表明,集合扰动不会增加系统的长期方差;相反,它们调节轨迹去相关和探索不变测度的速度。随机参数化,特别是那些具有时间持续结构的参数化,增强了早期离散度增长并改善了离散度-误差一致性。总体而言,我们阐明了不同不确定性来源在混沌系统中如何相互作用,并为天气和气候模型中随机参数化的设计和评估提供了指导。

English abstract

Weather and climate forecasts are inherently uncertain due to chaotic dynamics, imperfect initial conditions, and incomplete representation of the underlying physical processes. Operational ensemble forecasts aim to represent these uncertainties through forecast spread, yet many approaches yield underdispersive estimates, with spread that grows too slowly relative to forecast error. Using the two-scale Lorenz 1996 system as a widely used, controlled testbed, we design a systematic approach to disentangle intrinsic variability, initial-condition perturbations, and stochastic model uncertainty. We compare multiple ensemble configurations and parameterization strategies, including existing deterministic and autoregressive as well as novel Bayesian and flow-based approaches. Our results show that ensemble perturbations do not increase the system's long-term variance; rather, they regulate how rapidly trajectories decorrelate and explore the invariant measure. Stochastic parameterizations, particularly those with temporally persistent structure, enhance early spread growth and improve spread-error consistency. Overall, we bring clarity to how different sources of uncertainty interact in a chaotic system and provide guidance for the design and evaluation of stochastic parameterizations in weather and climate models.
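For readers unfamiliar with the testbed, the following is a minimal sketch of the standard two-scale Lorenz 1996 equations and of ensemble spread under initial-condition perturbations only. The parameter values (F = 20, h = 1, b = 10, c = 10), the small system size, the step size, and the perturbation amplitude are assumptions chosen for illustration, not the paper's configuration; no learned or stochastic parameterization is included.

```python
import math
import random


def tendencies(x, y, F=20.0, h=1.0, b=10.0, c=10.0):
    """Time derivatives of the slow variables x (length K) and fast variables y (length K*J)."""
    K, JK = len(x), len(y)
    J = JK // K
    dx = []
    for k in range(K):
        coupling = (h * c / b) * sum(y[k * J:(k + 1) * J])
        dx.append(-x[k - 1] * (x[k - 2] - x[(k + 1) % K]) - x[k] + F - coupling)
    dy = []
    for i in range(JK):
        dy.append(-c * b * y[(i + 1) % JK] * (y[(i + 2) % JK] - y[i - 1])
                  - c * y[i] + (h * c / b) * x[i // J])
    return dx, dy


def rk4_step(x, y, dt):
    """Advance (x, y) by one classical fourth-order Runge-Kutta step."""
    def shift(dx, dy, s):
        return [a + s * d for a, d in zip(x, dx)], [a + s * d for a, d in zip(y, dy)]

    def combine(state, a, b_, c_, d):
        return [s + dt / 6 * (p + 2 * q + 2 * r + t)
                for s, p, q, r, t in zip(state, a, b_, c_, d)]

    k1 = tendencies(x, y)
    k2 = tendencies(*shift(*k1, dt / 2))
    k3 = tendencies(*shift(*k2, dt / 2))
    k4 = tendencies(*shift(*k3, dt))
    return (combine(x, k1[0], k2[0], k3[0], k4[0]),
            combine(y, k1[1], k2[1], k3[1], k4[1]))


def ensemble_spread(members):
    """Square root of the site-averaged ensemble variance of the slow variables."""
    n, K = len(members), len(members[0])
    total = 0.0
    for k in range(K):
        mean = sum(m[k] for m in members) / n
        total += sum((m[k] - mean) * (m[k] - mean) for m in members) / (n - 1)
    return math.sqrt(total / K)


# Five members that differ only by a tiny perturbation of the slow variables.
K, J, F = 8, 4, 20.0
rng = random.Random(0)
members = [([F + 1e-3 * rng.gauss(0.0, 1.0) for _ in range(K)], [0.0] * (K * J))
           for _ in range(5)]
initial_spread = ensemble_spread([x for x, _ in members])
for _ in range(500):
    members = [rk4_step(x, y, dt=0.001) for x, y in members]
final_spread = ensemble_spread([x for x, _ in members])
```

Comparing `initial_spread` with `final_spread` over longer integrations gives the spread-growth curve that spread-error diagnostics are built on.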

2605.21740 2026-05-26 cs.AI

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

SMDD-Bench: 大语言模型能否解决真实世界的小分子药物设计任务?

Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani

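Affiliations * Carnegie Mellon University; Stealth; Pennsylvania State University

AI summary Proposes the SMDD-Bench benchmark, which evaluates LLMs on real-world small molecule drug design through 502 multi-turn, long-horizon task instances; the best-performing model, GPT5.4, solves only 40.2% of tasks.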

AI Chinese abstract

LLM智能体在科学发现应用中具有巨大潜力。然而,LLM智能体在跨不同化学空间和靶标的真实世界小分子药物设计(SMDD)任务上的表现尚不明确。当前的评估方法要么是临时的,对于真实发现过于简单,规模有限,或局限于单轮问答。为了标准化LLM智能体在小分子设计上的评估,我们引入了SMDD-Bench,一个具有挑战性的多轮长时智能体基准,包含502个保证可解的任务实例,涵盖5种任务类型:2D药效团识别、相互作用点发现、骨架跃迁、先导化合物优化和片段组装。SMDD-Bench任务覆盖广泛的化学空间,涉及102个独特的蛋白质靶标。完全解决该基准需要具备强大的化学和生物学推理能力及3D直觉,理解专业工具的使用,并在有限的oracle调用次数内展示规划专业知识。我们对7个前沿的开源和闭源LLM进行了基准测试,发现性能最好的LLM GPT5.4仅解决了40.2%的任务。我们希望SMDD-Bench能提供一个标准化的测试平台,激励该领域训练和评估用于全自动计算药物设计的LLM智能体。我们在smddbench.com上托管了一个公共排行榜。

English abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are either ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require strong chemical and biological reasoning and 3D intuition, an understanding of specialized tool use, and planning expertise under a limited number of oracle calls. We benchmark 7 frontier open- and closed-source LLMs and find that even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.

2605.21652 2026-05-26 cs.CV cs.AI

Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming

Look-Closer-Then-Diagnose: 通过主动缩放实现置信度感知的超声VQA

Yue Zhou, Erxuan Wu, Yikang Sun, Hongjoo Lee, Yuan Bi, Huixiong Xu, Nassir Navab, Zhongliang Jiang

Affiliations * Computer Aided Medical Procedures (CAMP), TU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; Zhongshan Hospital, Fudan University, China; The University of Hong Kong, Hong Kong, China

AI summary Proposes a framework that mimics the sonographer's cognitive workflow, using a Zoom-then-Diagnose paradigm and an uncertainty-aware reward based on Group Relative Policy Optimization to improve lesion localization and diagnostic performance in ultrasound visual question answering.

AI Chinese abstract

视觉-语言模型(VLM)显著推进了医学视觉问答,但在超声领域性能仍不理想。临床实践中,超声医师在制定报告时会明确关注病灶区域,尽管诊断解释有时因固有的主观性而存在差异。然而,现有VLM并未明确设计为在诊断前交互式地放大病灶;此外,它们通常将标注视为无偏真值,未能考虑其固有的主观性和模糊性。在本文中,我们提出了一个专门考虑超声医师认知工作流的框架。我们首先引入了一个结构化的“缩放-诊断”范式,该范式复制了交互式搜索过程以实现病灶聚焦推理。此外,在组相对策略优化(GRPO)框架内,我们引入了一个基于随机组 rollout 的不确定性感知奖励,以估计预测一致性作为模型置信度的代理。这两个组件共同鼓励模型在清晰案例上强化准确预测,同时在模糊情况下保持谨慎。在肝脏、乳腺和甲状腺数据集上的实验表明,我们的框架将病灶定位提高了39.3%,证明我们的模型学会了主动靠近观察并诊断的能力。

English abstract

Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains suboptimal. In clinical practice, sonographers explicitly focus on lesion regions to formulate reports, though diagnostic interpretations sometimes vary due to inherent subjectivity. However, existing VLMs are not explicitly structured to interactively zoom into lesions prior to diagnosis; moreover, they typically treat annotations as unbiased ground truths, failing to account for their inherent subjectivity and ambiguity. In this paper, we propose a framework specifically designed to consider the sonographer's cognitive workflow. We first introduce a structured Zoom-then-Diagnose paradigm, which replicates the interactive search process to enable lesion-focused reasoning. Furthermore, within the Group Relative Policy Optimization (GRPO) framework, we introduce an uncertainty-aware reward derived from stochastic group-wise rollouts to estimate prediction consistency as a proxy for model confidence. Together, these two components encourage the model to reinforce accurate predictions on clear cases while remaining cautious under ambiguity. Experiments across liver, breast, and thyroid datasets show that our framework improves lesion localization by 39.3%, demonstrating that our model has learned the ability to actively look closer and diagnose.
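One way to picture "prediction consistency as a proxy for model confidence" is the sketch below. The group-relative normalization is the usual GRPO-style advantage (reward minus group mean, divided by group standard deviation). The reward shaping in `rewards`, which scales a correct/incorrect signal by the group's agreement rate, is purely an assumption made for illustration and is not taken from the paper.

```python
import statistics
from collections import Counter


def consistency(answers):
    """Fraction of rollouts in the group that agree with the majority answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def rewards(answers, label):
    """Hypothetical shaping: +/- the group's consistency, so ambiguous groups give small signals."""
    conf = consistency(answers)
    return [conf if answer == label else -conf for answer in answers]


def group_relative_advantages(group_rewards, eps=1e-8):
    """Normalize rewards within the group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


# Four stochastic rollouts for one hypothetical ultrasound question.
answers = ["benign", "benign", "benign", "malignant"]
group_rewards = rewards(answers, label="benign")
advantages = group_relative_advantages(group_rewards)
```

With this shaping, a group that splits evenly between answers produces rewards of smaller magnitude than a group that agrees, which is the qualitative behavior the abstract describes.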

2605.21417 2026-05-26 cs.CV cs.AI

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

排序重要:面向混合情感识别的排名感知选择性融合

Junghyun Lee, Hyunseo Kim, Hanna Jang, Junhyug Noh

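Affiliations * Department of Artificial Intelligence and Software

AI summary Proposes a rank-aware multi-encoder framework that uses an attention-based gating module to select the most informative encoders for fusion and decouples prediction into presence and salience heads; combined with unsupervised domain adaptation, the system ranked 2nd in the blended emotion recognition challenge.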

Comments Accepted at IEEE FG 2026 Workshops. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures

AI Chinese abstract

混合情感识别具有挑战性,因为情感通常表现为微妙且重叠的多模态线索的混合,而非单一主导信号。我们提出了一种排名感知的多编码器框架,该框架选择性地结合来自不同预提取视频和音频编码器的互补表示。我们的方法将异构编码器特征投影到共享潜在空间,通过基于注意力的门控模块估计样本级编码器重要性,并仅融合前n个最具信息量的编码器。为了更好地建模混合情感,我们将预测解耦为存在性和显著性头,并通过概率级融合对齐它们。我们进一步引入了无需伪标签的特征级无监督域适应,以提高在分布偏移下的鲁棒性。在BlEmoRE挑战赛上的实验表明,所提出的框架优于强单个编码器和朴素的多编码器融合基线。我们的最终系统在比赛中排名第二,支持了排名感知选择性融合在细粒度混合情感识别中的有效性。

English abstract

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
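The "fuse only the top-n encoders" step can be sketched as follows, assuming the encoder features have already been projected into a shared latent space and that a gating module has produced one score per encoder for the current sample. The softmax renormalization over the kept encoders is an assumption; the abstract does not specify how the selected encoders are weighted.

```python
import math


def top_n_fusion(features, gate_scores, n):
    """Fuse only the n encoders with the highest gate score for this sample."""
    order = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    keep = order[:n]
    peak = max(gate_scores[i] for i in keep)
    exps = {i: math.exp(gate_scores[i] - peak) for i in keep}  # softmax over kept encoders
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    dim = len(features[0])
    fused = [sum(weights[i] * features[i][d] for i in keep) for d in range(dim)]
    return fused, weights


# Three hypothetical encoders in a 2-dimensional shared space; the third is gated out.
features = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
gate_scores = [2.0, 2.0, -1.0]
fused, weights = top_n_fusion(features, gate_scores, n=2)
```

Because the discarded encoder receives zero weight rather than a small one, an uninformative encoder cannot dilute the fused representation for that sample.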

2605.21190 2026-05-26 cs.CV

Semantic Granularity Navigation in Image Editing

图像编辑中的语义粒度导航

Liangsi Lu, Minzhe Guo, Xuhang Chen, Yang Shi

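Affiliations * Guangdong University of Technology, Guangzhou, China; Huizhou University, Huizhou, China

AI summary Proposes NaviEdit, a training-free, inference-time control method that decouples edit progress from model scale through a self-consistency constraint, improving semantic editability while preserving structural fidelity.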

Comments Accepted by ICML 2026

AI Chinese abstract

尽管扩散模型和流模型具有生成能力,真实图像编辑仍然受到语义可编辑性与结构保真度之间持续权衡的限制。我们将此限制的一个主要原因追溯到现有范式中编辑进度与模型尺度的隐式耦合。在这种耦合下,更强的编辑通常需要访问更嘈杂的状态,这在语义变化被良好定位之前,将计算用于破坏布局。我们引入NaviEdit,一种无需训练的推理时控制器,通过严格的自一致性契约将编辑进度与模型尺度遍历解耦。NaviEdit在rollout级别运行,不改变底层预训练模型。它将尺度视为控制输入,并将固定的步数预算重新分配给语义响应的中间尺度,而不是破坏性的高噪声区域。实验表明,在兼容的编辑器和流骨干网络上,平均增益为正,支持解耦作为一种可移植的推理时控制原则。

English abstract

Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

2605.20416 2026-05-26 cs.LG physics.comp-ph

Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning and generation with Vision-Language Models

基于米勒指数的潜在晶体学断裂面推理与生成:视觉-语言模型方法

Qinwu Xu, Xiaofu Ma, Yifan Jiang

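Affiliations * Independent research

AI summary Investigates whether multimodal large language models can use Miller indices as a structured latent representation to reason about fracture geometry; experiments show that the models can perform latent inference under idealized conditions and can reject the representation when the physics does not support it.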

AI Chinese abstract

我们研究多模态大语言模型(MLLMs)是否能够利用晶体学平面指数(米勒指数)作为结构化潜在表示来推理断裂几何。我们将米勒指数 $z = (h,k,l)$ 形式化为控制理想平面断裂的潜在变量,并评估两种互补能力:(i) 潜在推理,即模型在物理有效条件下将视觉观测映射到平面假设;(ii) 潜在适用性评估,即模型判断这种表示对于给定断裂图像是否有意义。通过涵盖合成数据、受控的2D-3D几何对以及多种材料类别(包括陶瓷、玻璃、金属和混凝土)的真实断裂图像的广泛实验,我们表明MLLMs能够在理想设置下可靠地进行潜在推理,并且关键的是,当底层物理不支持时,能够拒绝该潜在表示。作为探索性扩展,我们进一步检查了AI生成的断裂序列,并观察到定性上合理的脆性断裂进展行为,这表明多模态生成模型可能编码了与材料失效动力学相关的部分隐式物理先验。这些结果表明,只要明确建模有效性域,MLLMs可以作为基于结构化潜在先验的物理感知推理系统。

English abstract

We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices $z = (h,k,l)$ as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image. Through extensive experiments spanning synthetic data, controlled 2D-3D geometric pairs, and real-world fracture images across multiple material classes (including ceramics, glass, metals, and concrete), we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it. As an exploratory extension, we further examine AI-generated fracture sequences and observe qualitatively plausible brittle-fracture progression behaviors, suggesting that multimodal generative models may encode partial implicit physical priors related to material failure dynamics. These results suggest that MLLMs can act as physics-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled.

2605.20278 2026-05-26 cs.LG cs.AI cs.CV

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

ClaimDiff-RL: 通过视觉声明比较进行细粒度描述强化学习

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

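Affiliations * The Chinese University of Hong Kong; MiniMax

AI summary Proposes the ClaimDiff-RL framework, which uses atomic claim differences as the reward unit: a multimodal judge enumerates visual differences and assigns error types and severity levels, addressing the tradeoff between factuality and coverage in long-caption reinforcement learning.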

AI Chinese abstract

长格式图像描述揭示了强化学习中的奖励粒度问题:描述被整体判断,而重要错误发生在单个视觉声明层面。一个好的密集描述应既忠实又信息丰富,避免幻觉而不遗漏显著细节。然而,成对偏好、基于参考的指标和整体标量奖励将这些局部错误压缩为单个序列级信号,模糊了事实性与覆盖度之间的权衡。我们引入ClaimDiff-RL框架,该框架使用基于参考的原子声明差异作为描述强化学习的奖励单元。给定一张图像、一个演员描述和一个参考描述,多模态判断器枚举视觉上可区分的差异,针对图像验证每个差异,分配开放词汇的错误类型和严重程度,并生成每个差异的统计信息用于奖励组合。这使得幻觉声明和遗漏的显著事实可以分别测量和调整。实验表明,整体标量奖励可以通过增加遗漏事实来减少幻觉,而ClaimDiff-RL揭示了这种忠实性与覆盖度的权衡,并实现了更平衡的操作点。在包含160张图像的人工标注诊断基准、公开描述基准和VQA基准上,ClaimDiff-RL改善了幻觉-遗漏事实平衡,保留了通用能力,甚至在多个细粒度能力维度(如物体计数、空间关系和场景识别)上超越了Gemini-3-Pro-Preview。这些结果表明,类型化、可验证的声明差异是细粒度且可诊断的描述强化学习的有效奖励单元。

English abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
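To make "separately measurable and tunable" concrete, here is a minimal sketch of composing a reward from per-difference records. The record fields (`kind`, `severity`, `verified`), the two weights, and the linear penalty are all hypothetical; the abstract does not give the actual reward formula.

```python
def claim_reward(differences, w_hallucination=1.0, w_missing=0.5):
    """Compose a scalar reward from verified claim differences, keeping both totals visible."""
    hallucination = sum(d["severity"] for d in differences
                        if d["verified"] and d["kind"] == "hallucination")
    missing = sum(d["severity"] for d in differences
                  if d["verified"] and d["kind"] == "missing")
    reward = -(w_hallucination * hallucination + w_missing * missing)
    return reward, {"hallucination": hallucination, "missing": missing}


# Hypothetical judge output for one caption; the unverified difference is ignored.
differences = [
    {"kind": "hallucination", "severity": 2, "verified": True},
    {"kind": "missing", "severity": 1, "verified": True},
    {"kind": "hallucination", "severity": 3, "verified": False},
]
reward, stats = claim_reward(differences)
```

Keeping the two totals separate means the hallucination and omission weights can be tuned independently, instead of both being folded into one holistic score.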

2605.20025 2026-05-26 cs.AI

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

AutoResearchClaw: 基于人机协作的自我强化自主研究

Jiaqi Liu, Shi Qiu, Mairui Li, Bingzhou Li, Haonian Ji, Siwei Han, Xinyu Ye, Peng Xia, Zihan Dong, Meng Chen, Congyu Zhang, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Lu Feng, Xujiang Zhao, Haifeng Chen, Jiawei Zhou, Xiao Wang, Weitong Zhang, Hongtu Zhu, Yun Li, Jieru Mei, Hongliang Fei, Jiaheng Zhang, Linjie Li, Linjun Zhang, Yuyin Zhou, Sheng Wang, Caiming Xiong, James Zou, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

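Affiliations * Carnegie Mellon University; Rutgers University; NEC Labs America; Meta; Stanford University; Google; University of Washington

AI summary Proposes AutoResearchClaw, a multi-agent autonomous research system that combines structured debate, self-healing execution, verifiable reporting, seven human-in-the-loop collaboration modes, and cross-run evolution; it outperforms AI Scientist v2 by 54.7% on the ARC-Bench benchmark.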

AI Chinese abstract

自动化科学发现需要的不仅仅是根据想法生成论文。真正的研究是迭代的:假设从多个角度受到挑战,实验失败并为下一次尝试提供信息,经验在循环中积累。现有的自主研究系统通常将此过程建模为线性流水线:它们依赖单智能体推理,在执行失败时停止,并且不跨运行携带经验。我们提出AutoResearchClaw,一个基于五种机制的多智能体自主研究流水线:用于假设生成和结果分析的结构化多智能体辩论;带有Pivot/Refine决策循环的自愈执行器,将失败转化为信息;防止虚构数字和幻觉引用的可验证结果报告;具有七种干预模式的人机协作,涵盖从完全自主到逐步监督;以及将过去错误转化为未来保障的跨运行进化。在ARC-Bench(一个25个主题的实验阶段基准)上,AutoResearchClaw比AI Scientist v2高出54.7%。跨七种干预模式的人机协作消融实验表明,在高杠杆决策点上的精确、有针对性的协作始终优于完全自主和详尽的逐步监督。我们将AutoResearchClaw定位为一种研究放大器,增强而非取代人类的科学判断。代码可在https://github.com/aiming-lab/AutoResearchClaw获取。

English abstract

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at https://github.com/aiming-lab/AutoResearchClaw.

2605.20023 2026-05-26 cs.AI cs.MA

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

当技能无济于事:关于程序性知识在进攻性网络安全中工具型智能体的负面结果

Samuel Jacob Chacko, James Hugglestone, Chashi Mahiul Islam, Xiuwen Liu

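Affiliations * Florida State University

AI summary By re-analyzing a controlled study, this paper finds that when environment-feedback bandwidth is high, the marginal benefit of Skills for agent performance vanishes or even turns negative, and it puts forward a falsifiable hypothesis.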

Comments Accepted as a poster at ACM CAIS 2026 AgentSkills Workshop

AI Chinese abstract

智能体技能(Agent Skills)是程序性知识的结构化包,在推理时加载到LLM智能体中,据报道在不同领域平均将任务通过率提高了16.2个百分点。然而,相同的基准测试显示出很大的方差,84个任务中有16个在引入技能后出现了负增量。社区尚未阐明技能何时有帮助以及何时只是冗余开销的清晰机制。我们重新分析了一项最近发表的180次运行的控制研究,该研究涉及一个基于MCP的自主夺旗(CTF)智能体,在四种文档条件(591、12865、17253和36001个token)下,并表明这些条件几乎完全对应于无技能、经验技能、策划技能和全面技能的消融。在进攻性网络安全(一个现有技能基准未深入覆盖的领域)中,技能的边际效益崩溃。无技能和全面技能条件之间的差距仅为8.9个百分点($p = 0.71$,$\chi^2$;$p = 0.25$,Cochran-Armitage趋势检验;六对Cohen's $h$值中有五对低于$0.2$的小效应阈值)。我们认为缺失的变量是环境反馈带宽。当智能体的工具层返回严格、模式验证、低延迟的观察时,环境本身提供了通常需要技能提供的程序性校正信号。因此,策划技能的边际效益显著降低,并且在某些情况下(例如,我们的时序侧信道设置)会主动降低性能。我们阐述了一个可证伪的假设,概述了其对复合AI系统的设计启示,并将发布重新分析管道以支持复制。

English abstract

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (591, 12865, 17253, and 36001 tokens) and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp ($p = 0.71$, $\chi^2$; $p = 0.25$, Cochran-Armitage trend test; five of six pairwise Cohen's $h$ values fall below the $0.2$ small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and in some cases (e.g., our timing side-channel setting) curated Skills actively degrade performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
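For reference, Cohen's $h$ compares two proportions after an arcsine transform: $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$. The sketch below computes it for a pair of pass rates 8.9 percentage points apart. The two rates are hypothetical, since the abstract reports only the spread and not the per-condition pass rates.

```python
import math


def cohens_h(p1, p2):
    """Effect size for the difference between two proportions (arcsine-transformed)."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


SMALL_EFFECT = 0.2  # conventional "small effect" threshold for |h|

# Hypothetical pass rates that differ by 8.9 percentage points.
h = cohens_h(0.500, 0.411)
is_below_small = abs(h) < SMALL_EFFECT
```

Because of the arcsine transform, the same percentage-point gap yields a larger $h$ when the rates sit near 0 or 1, so the effect size depends on where the pass rates lie and not only on their difference.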

2605.19491 2026-05-26 cs.CV

Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning

尺度思考:通过自适应连续推理加速千兆像素病理图像分析

Jiusong Ge, Yingkang Zhan, Wenjie Zhao, Di Zhang, Ke Wang, Jiashuai Liu, Chunze Yang, Chengzu Li, Jian Zhang, Yuxin Dong, Ni Zhang, Qidong Liu, Mireia Crispin-Ortuzar, Huazhu Fu, Chen Li, Zeyu Gao

Affiliations * School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China; Department of Transmedia Art, Xi’an Academy of Fine Arts, Xi’an, China; Department of Oncology, University of Cambridge, Cambridge, U.K.; Language Technology Lab, University of Cambridge, Cambridge, U.K.; Institute of High Performance Computing, Agency for Science, Technology

AI summary Proposes PathCTM, which performs efficient continuous reasoning through dynamic scale switching and attention-guided region pruning, sharply reducing computational overhead while maintaining diagnostic performance.

Comments Accepted to ICML 2026

AI Chinese abstract

传统的全切片图像(WSI)分析方法通常依赖于多实例学习(MIL)范式,该范式在高倍率下提取补丁级特征并进行聚合以进行切片级预测。然而,这种详尽的补丁级处理计算成本高,严重限制了WSI分析的效率和可扩展性。为应对这一挑战,我们提出了PathCTM(面向病理学的连续思维模型),该模型能够对千兆像素WSI进行令牌高效的尺度空间连续推理。PathCTM将诊断推理表述为动态的序列信息追踪。它逐步从低倍率全局检查过渡到高倍率局部检查,并在收集到足够证据以有效限制决策不确定性时自适应终止推理。具体而言,它使用条件计算进行动态尺度切换,并采用注意力引导的区域剪枝,结合置信度感知的早期停止。大量实验表明,与基于标准MIL的方法相比,PathCTM将所需图像补丁数量减少了95.95%,推理时间缩短了约95.62%,同时AUC没有下降。代码可在https://github.com/JSGe-AI/PathCTM获取。

English abstract

Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at https://github.com/JSGe-AI/PathCTM.

2605.19430 2026-05-26 cs.RO

Neuromorphic Control of a Flapping-Wing Robot on Resource-Constrained Hardware

资源受限硬件上扑翼机器人的神经形态控制

Rim El Filali, Chenrui Feng, Chao Gao, Weibin Gu

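Affiliations * Institute for AI Industry Research (AIR); Tsinghua University; Department of Computer Science and Technology; Xinchen Qihang Inc.

AI summary For a butterfly-inspired flapping-wing robot weighing less than 30 grams, proposes a hierarchical neuromorphic control framework that deploys two lightweight spiking neural networks on a low-cost ESP32 microcontroller for state estimation and control. Trained by imitation learning, the system achieves stable pitch and heading tracking in untethered flight, with 36% lower latency and 18% lower power consumption than a conventional artificial neural network.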

AI Chinese abstract

扑翼微型飞行器(FWMAV)具有卓越的机动性和气动效率,但由于非线性动力学和严格的大小、重量和功率(SWaP)约束(例如重量小于30克的蝴蝶仿生机器人),给机载控制带来了重大挑战。为此,我们提出了一种层次化神经形态控制框架,能够在广泛可用、资源受限的ESP32微控制器(单价约5美元)上实现完全机载的闭环飞行。具体而言,我们的方法在机载部署了两个轻量级脉冲神经网络(SNN):一个用于从原始传感器反馈进行状态估计,另一个通过调节中央模式发生器(CPG)进行翅膀驱动控制。通过模仿学习训练,该系统在无系留真实飞行中实现了稳定的俯仰和航向角跟踪。实验结果进一步表明,与传统人工神经网络(ANN)基线相比,基于SNN的控制器推理延迟降低了36%(从1059微秒降至680微秒),功耗降低了18%(从0.033瓦降至0.027瓦),证明了无需专用硬件的脉冲计算可行性。据我们所知,这项工作首次展示了FWMAV自主飞行的完全机载神经形态控制,突显了SNN在严格SWaP约束下实现节能自主的潜力。视觉摘要:http://bit.ly/4nI8ECY 代码:https://anonymous.4open.science/r/Espikify-76E3/

English abstract

Flapping-Wing Micro Aerial Vehicles (FWMAVs) provide exceptional maneuverability and aerodynamic efficiency but pose significant challenges for onboard control due to nonlinear dynamics and stringent Size, Weight, and Power (SWaP) constraints, as exemplified by a butterfly-inspired robot weighing less than 30 grams. To this end, we present a hierarchical neuromorphic control framework that enables fully onboard, closed-loop flight on a widely available, resource-constrained ESP32 microcontroller with a unit cost of approximately $5. Specifically, our method deploys two lightweight Spiking Neural Networks (SNNs) onboard: one for state estimation from raw sensory feedback and another for control via modulation of a Central Pattern Generator (CPG) for wing actuation. Trained by imitation learning, the system achieves stable pitch and heading angle tracking during untethered real-world flight. Experimental results further reveal that, compared to the conventional Artificial Neural Network (ANN) baseline, the SNN-based controller reduces inference latency by 36% (1059 µs to 680 µs) and inference power by 18% (0.033 W to 0.027 W), demonstrating the viability of spike-based computation without specialized hardware. To the best of our knowledge, this work constitutes the first demonstration of fully onboard neuromorphic control for autonomous flight of an FWMAV, highlighting the potential of SNNs to enable energy-efficient autonomy under stringent SWaP constraints. Visual abstract: http://bit.ly/4nI8ECY Code: https://anonymous.4open.science/r/Espikify-76E3/

2605.19409 2026-05-26 cs.LG cs.AI

Unlocking the Potential of Continual Model Merging: An ODE Perspective

解锁持续模型合并的潜力:ODE视角

Lihong Lin, Haidong Kang

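Affiliations * Northeastern University, Shenyang, China

AI summary Proposes the ODE-M framework, which models continual model merging as a trajectory in parameter space and balances historical knowledge against new tasks through a rectified time-dependent velocity field and a utility-aware time schedule, improving performance on long task streams.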

Comments 21 pages, 8 figures

AI Chinese abstract

持续模型合并(CMM)通过顺序整合任务适配模型实现基础模型的快速定制,无需重复训练。然而,现有合并规则通常通过固定代数或基于投影的操作更新部署模型,对保留多少先前积累的知识相对于新任务模型的控制有限。这种限制导致长任务流中保留不稳定和性能下降,当任务具有异构效用时更为明显。我们提出ODE驱动的合并(ODE-M),一个可控框架,将每次持续合并视为参数空间中的轨迹而非一步端点更新。受模式连通性启发,ODE-M使用整流时变速度场构建屏障感知轨迹,其中来自小型校准集的轻量级一阶反馈抑制损失增加的运动,同时保持向新模型的进展。然后通过沿该轨迹选择操作点(通过效用感知时间调度)获得下一个合并模型,为平衡保留的历史知识和新任务专业知识提供显式机制。在标准CMM基准上的大量实验表明,ODE-M在CLIP ViT骨干、流长度和异构任务效用设置上持续优于强持续合并基线。

English abstract

Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without repeated retraining. However, existing merging rules usually update the deployed model through fixed algebraic or projection-based operations, providing limited control over how much previously accumulated knowledge should be retained relative to the incoming task model. This limitation leads to unstable retention and performance degradation in long task streams, and becomes more pronounced when tasks have heterogeneous utilities. We propose ODE-driven Merging (ODE-M), a controllable framework that formulates each continual merge as a trajectory in parameter space rather than a one-step endpoint update. Motivated by mode connectivity, ODE-M constructs a barrier-aware trajectory using a rectified time-dependent velocity field, where lightweight first-order feedback from a small calibration set suppresses loss-increasing motion while preserving progress toward the incoming model. The next merged model is then obtained by selecting an operating point along this trajectory through a utility-aware time schedule, providing an explicit mechanism for balancing retained historical knowledge and incoming task expertise. Extensive experiments on standard CMM benchmarks show that ODE-M consistently improves over strong continual merging baselines across CLIP ViT backbones, stream lengths, and heterogeneous task-utility settings.

2605.18840 2026-05-26 cs.LG cs.AI cs.CL

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

前沿模型的成长之痛:当排行榜不再区分以及接下来衡量什么

Adil Amin

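Affiliations * Zehen Labs

AI summary Decomposes SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual (the h-field) to diagnose cooperation and tradeoffs among frontier-model capabilities, and provides a three-step diagnostic, a per-lab measurement-priority table, and seven falsifiable predictions.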

Comments 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." ( https://doi.org/10.48550/arXiv.2605.18838 ). Code: https://github.com/adilamin89/cape-scaling . Dashboard: https://zehenlabs.com/cape/

AI Chinese abstract

排行榜在独立轴上对前沿模型进行排名,但并未揭示能力在版本间是相互增强还是权衡——而在前沿,这种相互作用是更具信息量的信号。我们将配对的SWE-bench和GPQA Diamond分数分解为种群耦合趋势和每版本残差(h场),该残差从两个公开基准分数诊断能力重点。在来自10个实验室的34个模型(2024-2026)中,能力相互协作(r = +0.72,p < 10^{-6}),但协作程度系统性地变化:每个实验室的耦合斜率跨度达5倍(谷歌1.15 vs. DeepSeek 0.23),且实验室发生转向——DeepSeek从推理密集型逆转为编码优先(Δh = 15.9个百分点);Anthropic在编码偏离和恢复之间振荡。种群回归作为等斜线相边界:用于识别基础尺度耦合转变的相同分类器√[(a/b)·B₁] [Amin, 2026] 对前沿模型进行分类,并已在下一个转变处检测到混合相行为(两个模型低于GPQA-IFEval等斜线)。h场不仅具有诊断性——它还告诉你需要改变什么。预训练建立耦合为0.871,而RLHF增加0.081 [Amin, 2026]:预训练级别的转变是永久的(DeepSeek的四个版本逆转持续存在),后训练转变是可逆的(Anthropic的三次编码偏离均在单个版本内恢复),仅推理计算在不重新训练的情况下将h改变+7.8个百分点。知道哪个组件占主导地位决定了是重新训练还是等待。我们提供了三步诊断法(定位、分类、预测)、每实验室测量优先级表以及七个带有时间戳标准的可证伪预测。五个截止日期后的版本落在95%预测区间内。代码、数据和交互式仪表盘:https://zehenlabs.com/cape/。

English abstract

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024-2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot: DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA-IFEval isocline). The $h$-field is not just diagnostic: it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$ pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
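The decomposition into a population trend plus a per-release residual can be pictured with an ordinary least-squares fit, as in the sketch below. The scores are made up, and regressing one benchmark on the other with a plain linear fit is an assumption; the paper's exact definition of the $h$-field may differ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) * (x - mean_x) for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Per-release residual: observed score minus the population trend at the same x."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical paired scores (in percent) for five releases.
reasoning_scores = [50.0, 60.0, 70.0, 80.0, 90.0]
coding_scores = [30.0, 45.0, 48.0, 66.0, 71.0]
h_field = residuals(reasoning_scores, coding_scores)
```

In this toy setup a positive residual marks a release that scores higher on the second benchmark than the population trend predicts, and a negative residual marks the opposite emphasis.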

2605.18657 2026-05-26 cs.LG cs.AI

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

KairosHope: 一种基于双记忆架构的下一代时间序列基础模型,用于专门分类

Luis Balderas, José Alberto Rodríguez, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

Affiliations: Department of Computer Science and Artificial Intelligence; DiCITS, iMUDS, DaSCI; University of Granada; Advanced Medical Imaging Group; Instituto de Investigación Biosanitaria de Granada (ibs.Granada); Department of Software Engineering; Department of Rural Engineering; University of Córdoba

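AI summary: To address the computational bottleneck of standard attention and the omission of classical statistical knowledge, KairosHope replaces quadratic attention with a dual-memory system (Titans modules and a Continuum Memory System, CMS) and adds a hybrid decision head that fuses deep representations with statistical features, achieving superior classification performance on the UCR benchmark.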

AI Chinese abstract

时间序列基础模型(TSFMs)在通用预测任务中取得了显著成功;然而,它们对专门分类问题的适应仍然受到标准注意力的计算瓶颈和对经典统计知识的系统性忽略的限制。本技术报告介绍了KairosHope,一种下一代TSFM,旨在协调大规模泛化与分类任务中的分析精度。该提案的核心是HOPE块,一种用双记忆系统替代二次注意力的架构:用于动态短期保留的Titans模块和用于长期历史上下文抽象的连续记忆系统(CMS)。为了丰富归纳偏差,引入了混合决策头,它将深度潜在表示与通过tsfeatures包提取的确定性统计特征融合。KairosHope在大型Monash档案上进行自监督预训练,结合了掩码时间序列建模(MTSM)和对比学习(InfoNCE)。随后,通过严格的线性探测和全微调(LP-FT)协议在UCR基准数据集上进行适应,以防止灾难性遗忘。实验结果表明,在具有严格时间因果关系的领域(如HAR或传感器数据)中,性能优越。因此,KairosHope为基础模型适应时间序列分析建立了一个稳健高效的框架。

English abstract

Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to specialized classification problems remains constrained by the computational bottleneck of standard attention and the systematic omission of classical statistical knowledge. This technical report introduces KairosHope, a next-generation TSFM designed to reconcile massive generalization with analytical precision in classification tasks. The core of the proposal is the HOPE block, an architecture that replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. To enrich the inductive bias, a Hybrid Decision Head is introduced, which fuses deep latent representations with deterministic statistical features extracted via the tsfeatures package. KairosHope undergoes self-supervised pre-training on the massive Monash archive, combining Masked Time Series Modeling (MTSM) and contrastive learning (InfoNCE). Its subsequent adaptation to the UCR benchmark datasets is conducted through a rigorous Linear Probing and Full Fine-Tuning (LP-FT) protocol to prevent catastrophic forgetting. Empirical results demonstrate superior performance in domains characterized by strict temporal causality, such as HAR or sensor data. Consequently, KairosHope establishes a robust and efficient framework for the adaptation of foundation models to time series analysis.
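The abstract names InfoNCE as the contrastive objective used in pre-training. As a rough illustration of that loss in its standard form, the NumPy sketch below treats row i of `anchors` and row i of `positives` as a matching pair and all other rows in the batch as negatives. The function name `info_nce`, the cosine-similarity choice, and the temperature value are assumptions for this sketch; the abstract does not state how KairosHope builds its views or sets these details.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Standard in-batch InfoNCE loss (illustrative sketch).

    anchors, positives: arrays of shape (batch, dim); row i of each
    forms a positive pair, other rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))
```

When every embedding in the batch is identical, the loss equals the log of the batch size, since the positive cannot be told apart from the negatives; well-separated pairs drive it toward zero.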

2605.17531 2026-05-26 cs.CV

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

不要猜测,只需询问:通过多轮澄清解决指代分割中的歧义

Yuting Yang, Haichao Jiang, Tianming Liang, Quan Zhang, Jian-Fang Hu

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Province Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education

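AI summary: Proposes IC-Seg, a framework that proactively clarifies user intent through multi-turn dialogue, and introduces Hi-GRPO, a hierarchical optimization strategy, to resolve ambiguous user queries in referring segmentation.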

AI Chinese abstract

指代分割旨在根据文本查询分割图像或视频中的目标对象。尽管过去几年取得了显著进展,现有工作总是假设用户提供的查询已经精确且清晰。然而,这种假设不切实际。在现实场景中,期望所有用户仔细审查其视觉内容并确保查询唯一且无歧义是不现实的。遇到此类情况时,现有分割模型倾向于任意猜测用户偏好,常常导致不理想的结果。为解决这一限制,我们提出IC-Seg,一种新颖的智能体框架,在分割前通过多轮对话主动澄清用户意图。为有效激励这种能力,我们进一步引入Hi-GRPO,一种新的分层优化策略,在轨迹、轮次和步骤层面注入密集且信息丰富的监督信号。该策略鼓励高效的意图澄清,有效消除冗余交互并提高整体对话质量。为评估,我们建立了Ambi-RVOS,一个带有模糊用户查询的指代视频对象分割基准。大量实验表明,IC-Seg不仅在解决模糊查询方面大幅优于现有方法,而且在标准推理分割基准上保持最先进性能。代码和数据将在https://github.com/iSEE-Laboratory/IC-Seg发布。

English abstract

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.

2605.17268 2026-05-26 cs.AI cs.CV cs.RO

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实?自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

Affiliations: School of Computer Science and Engineering; Central South University; School of Computer Science; University of Wollongong in Dubai

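AI summary: An analysis of 300 VLA inferences finds that overall reasoning fidelity is only 42.5%, with many missed pedestrians, fragile trajectories, and reasoning-action inconsistencies; the paper also proposes an information-theoretic formalization of faithfulness and a safety architecture.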

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

AI Chinese abstract

我们首次系统研究了视觉-语言-动作(VLA)驾驶模型的忠实度,分析了100个多样化PhysicalAI-AV场景中300次Alpamayo-R1-10B推理。主要发现是,输出带有轨迹的自然语言推理可能显著不忠实:(i) 整体推理保真度仅为42.5%,因果链与场景现实匹配不到一半;(ii) 在三分之一涉及行人的场景中漏检了94个行人;(iii) 在轻微视觉扰动下轨迹脆弱性达97.7%;(iv) 平均推理-动作一致性仅为48.3%,53.3%的推理表现出一致性低,其中37.9%声称停止但模型继续前行。我们从信息论角度形式化定义了忠实度,定义了实体和动作保真度及验证标准,并概述了与这些结果一致的四组件安全架构。

English abstract

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that output natural-language rationales with trajectories may be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.

2605.16409 2026-05-26 cs.CV cs.CL cs.LG

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

多语言OCR感知微调和提示引导的链式思维推理用于多模态大语言模型

Qinwu Xu, Yifan Jiang, Haoyu Ren

Affiliations: Meta AI; UT Austin

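AI summary: Proposes a multilingual OCR-aware multimodal training framework that combines synthetic data generation, OCR-aware fine-tuning, and structured visual chain-of-thought prompting to improve the OCR completeness and multilingual translation accuracy of multimodal large language models under complex visual conditions.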

AI Chinese abstract

光学字符识别(OCR)和多语言文本理解仍然是多模态大语言模型(MLLMs)的主要失败模式,尤其是在包含杂乱布局、小字体、模糊、遮挡和复杂排版的真实世界图像中。我们提出了一种OCR感知的多语言多模态训练框架,该框架结合了(i)大规模合成OCR到翻译数据生成,(ii)使用LoRA适配的OCR感知监督微调(SFT),以及(iii)在不确定视觉条件下进行推理的结构化视觉链式思维(CoT)提示。使用基于LLaMA的多模态架构,所提出的框架在OCR完整性、多语言翻译准确性和退化视觉条件下的鲁棒性方面有了显著提升。在多语言收据、菜单、海报、标志、手写文本和文档图像上的实验结果表明,与基线模型相比,视觉-文本对齐显著改善。特别是,所提出的OCR感知后训练框架提高了对小、模糊、空间分散和部分遮挡文本的提取,同时减少了对不确定OCR条件下语言先验的依赖。与前沿多模态系统(包括GPT-5类和Gemini系列模型)的定性比较进一步表明,在噪声和视觉模糊的OCR场景下,OCR对齐得到改善,幻觉减少。总体而言,结果表明,以数据为中心的OCR感知多模态后训练为改进多语言OCR和基于OCR的视觉问答系统提供了一种有效且可扩展的方向。

English abstract

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.
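The abstract mentions supervised fine-tuning with LoRA adaptation. For readers unfamiliar with the technique, the sketch below shows the general LoRA idea on a single linear layer: the pretrained weight stays frozen and a low-rank update scaled by alpha / rank is added to its output. The class name `LoRALinear`, the rank, the scaling value, and the initialization scale are illustrative assumptions; nothing here reflects the authors' actual configuration or code.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a low-rank additive update (illustrative sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                      # frozen pretrained weight
        # A starts small and random, B starts at zero, so the adapted
        # layer initially matches the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction B @ A.
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update
```

Only `A` and `B` would be trained in this setup, which is what keeps the number of trainable parameters small relative to the full weight matrix.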