2605.24534 2026-05-26 cs.CL

AnyMo: Geometry-Aware and Setup-Agnostic Modeling of Human Motion in the Wild

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

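Affiliations: The University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary: Proposes AnyMo, a framework that combines physics-based simulation of diverse IMU signals, graph-encoder pre-training, and LLM alignment to support zero-shot activity recognition, cross-modal retrieval, and motion captioning across devices and datasets, with substantial performance gains.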


Abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.


2605.22684 2026-05-26 cs.LG

ChronoVAE-HOPE: Beyond Attention -- A Next-Generation VAE Foundation Model for Specialized Time Series Classification

José Alberto Rodríguez, Luis Balderas, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

Affiliations: Department of Computer Science and Artificial Intelligence; DiCITS; iMUDS; DaSCI; University of Granada; Advanced Medical Imaging Group; Instituto de Investigación Biosanitaria de Granada; Department of Software Engineering; Department of Rural Engineering; University of Córdoba

AI summary: Proposes ChronoVAE-HOPE, a next-generation time series foundation model built on a VAE and HOPE blocks (Titans modules plus a Continuum Memory System). A disentangled latent space separates trend and seasonal components, and the model performs strongly on UCR benchmark classification tasks.


Abstract

Time Series Foundation Models (TSFMs) have become a new component of the state-of-the-art in general time series forecasting. However, adapting them to specialized classification tasks remains constrained by two interconnected challenges: the quadratic cost of standard attention mechanisms and the inability to disentangle the structural components underlying time series variability. This technical report introduces ChronoVAE-HOPE, a next-generation TSFM that reconciles massive generalization with structured latent representation for time series classification. The core of the proposal is a Variational Autoencoder (VAE) framework built upon the HOPE Block, which replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. A key architectural novelty is the disentangled latent space, which factorizes representations into independent trend and seasonal components via dedicated encoder heads and separate decoder pathways. ChronoVAE-HOPE undergoes self-supervised pre-training on the Monash archive, combining a Masked Time Series Modeling (MTSM) auxiliary objective with a disentangled VAE reconstruction loss. The pre-trained encoder is subsequently frozen and used to generate fixed-length embeddings for downstream classification on the UCR benchmark datasets. Empirical results demonstrate strong performance across diverse temporal domains, particularly in settings characterized by strict causal structure. ChronoVAE-HOPE establishes a robust and interpretable framework for the adaptation of foundation models to time series classification through structured generative representations.


2605.22337 2026-05-26 cs.AI

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu

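Affiliations: Guangdong Institute of Intelligence Science and Technology; University of Macau; Chinese University of Hong Kong, Shenzhen; Hong Kong University of Science and Technology

AI summary: Proposes Meta-Soft, a dynamic compression framework that synthesizes meta-tokens using a learnable orthogonal basis matrix and a Gumbel-Softmax selector network, and uses an attention-flow integration mechanism to retain information from discarded context, addressing information loss and context fragmentation in KV cache compression.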

Comments 9 pages, 2 figures


Abstract

The KV cache used in large language models (LLMs) grows linearly with context length, so LLMs face memory blow-up and reduced decoding efficiency on long contexts. KV cache eviction has therefore become an important research direction. However, existing methods based on fixed soft tokens (e.g., Judge Q) use a static parameter set as the query that scores the importance of KV pairs, so they can neither adapt to different input prompts nor precisely capture complex, changing task relevance. In addition, evicted KV pairs are discarded permanently, which causes irreversible information loss and context fragmentation. To address these problems, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$ and use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, which dynamically synthesize the $k$ most targeted soft tokens from the input prompt features. We append these soft tokens to the end of the input sequence to probe for key information. We also introduce an attention-flow-based integration mechanism that redistributes the semantic information of removed tokens into retained tokens, effectively preserving the dropped context. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV cache compression.
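The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a tiny hand-written orthonormal basis and fixed selector scores in place of the learned basis matrix and selector network, and it synthesizes a single soft token rather than $k$ of them. The names `gumbel_softmax`, `synthesize_soft_token`, `basis`, and `selector_logits` are hypothetical and are not taken from the paper.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return relaxed one-hot weights over the entries of `logits`."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                         # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_soft_token(basis, weights):
    """Combine basis rows (one row per basis vector) into a single soft token."""
    dim = len(basis[0])
    return [sum(w * row[d] for w, row in zip(weights, basis)) for d in range(dim)]


# Hypothetical orthonormal basis and selector scores for one prompt (assumptions).
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
selector_logits = [2.0, 0.1, -1.0]
weights = gumbel_softmax(selector_logits, temperature=0.5, rng=random.Random(0))
soft_token = synthesize_soft_token(basis, weights)
```

Lowering `temperature` pushes the weights toward a one-hot vector, which is one way to obtain the sparse yet differentiable combinations the abstract mentions.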


2605.22242 2026-05-26 cs.LG physics.ao-ph

Decomposing Ensemble Spread in Lorenz '96 With Learned Stochastic Parameterizations

Birgit Kühbacher, Daan Crommelin, Niki Kilbertus

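Affiliations: Technical University of Munich; Helmholtz Munich; Munich Center for Machine Learning (MCML); Centrum Wiskunde & Informatica (CWI); Korteweg-de Vries Institute for Mathematics, University of Amsterdam

AI summary: Using the two-scale Lorenz 1996 system, this study compares multiple ensemble configurations and parameterization strategies to systematically analyze how intrinsic variability, initial-condition perturbations, and stochastic model uncertainty affect ensemble spread. It finds that stochastic parameterizations, especially those with temporally persistent structure, enhance early spread growth and improve spread-error consistency.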


Abstract

Weather and climate forecasts are inherently uncertain due to chaotic dynamics, imperfect initial conditions, and incomplete representation of the underlying physical processes. Operational ensemble forecasts aim to represent these uncertainties through forecast spread, yet many approaches yield underdispersive estimates, with spread that grows too slowly relative to forecast error. Using the two-scale Lorenz 1996 system as a widely used, controlled testbed, we design a systematic approach to disentangle intrinsic variability, initial-condition perturbations, and stochastic model uncertainty. We compare multiple ensemble configurations and parameterization strategies, including existing deterministic and autoregressive as well as novel Bayesian and flow-based approaches. Our results show that ensemble perturbations do not increase the system's long-term variance; rather, they regulate how rapidly trajectories decorrelate and explore the invariant measure. Stochastic parameterizations, particularly those with temporally persistent structure, enhance early spread growth and improve spread-error consistency. Overall, we bring clarity to how different sources of uncertainty interact in a chaotic system and provide guidance for the design and evaluation of stochastic parameterizations in weather and climate models.
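For readers unfamiliar with the testbed, the sketch below writes out the tendencies of the two-scale Lorenz 1996 system in its standard textbook form. The default parameter values (`F`, `h`, `b`, `c`) are commonly used settings and are an assumption here; the abstract does not state the configuration used in the paper. The function name `l96_two_scale_tendency` is hypothetical.

```python
def l96_two_scale_tendency(X, Y, F=10.0, h=1.0, b=10.0, c=10.0):
    """Time derivatives of the slow variables X (length K) and fast variables Y (length K*J).

    Y is stored flat, with the J fast variables of slow variable k at Y[k*J:(k+1)*J].
    Both scales use periodic boundary conditions.
    """
    K = len(X)
    J = len(Y) // K
    N = K * J
    coupling = h * c / b

    dX = []
    for k in range(K):
        fast_sum = sum(Y[k * J:(k + 1) * J])
        advection = -X[k - 1] * (X[k - 2] - X[(k + 1) % K])
        dX.append(advection - X[k] + F - coupling * fast_sum)

    dY = []
    for i in range(N):
        advection = -c * b * Y[(i + 1) % N] * (Y[(i + 2) % N] - Y[i - 1])
        dY.append(advection - c * Y[i] + coupling * X[i // J])

    return dX, dY
```

A parameterization in this setting replaces the `coupling * fast_sum` term with a function of the slow variables alone, deterministic or stochastic, so that the fast variables no longer need to be integrated.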


2605.21740 2026-05-26 cs.AI

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Barati Farimani

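Affiliations: Carnegie Mellon University; Stealth; Pennsylvania State University

AI summary: Introduces SMDD-Bench, a benchmark of 502 multi-turn, long-horizon task instances for evaluating LLMs on real-world small molecule drug design, and finds that the best model, GPT5.4, solves only 40.2% of tasks.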


Abstract

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world small molecule drug design (SMDD) tasks across diverse chemistries and targets is unclear. Current evaluation methods are ad hoc, too simple for real-world discovery, limited in scale, or restricted to single-turn question answering. In an effort to standardize the evaluation of LLM agents on small molecule design, we introduce SMDD-Bench, a challenging, multi-turn, long-horizon agentic benchmark consisting of 502 guaranteed-solvable task instances spanning 5 task types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. SMDD-Bench tasks span a wide region of chemical space and involve 102 unique protein targets. Completely solving the benchmark would require strong chemical and biological reasoning and 3D intuition, an understanding of specialized tool use, and planning expertise over a limited number of oracle calls. We benchmark 7 frontier open- and closed-source LLMs and find that even the most performant LLM, GPT5.4, solves only 40.2% of tasks. We hope SMDD-Bench provides a standardized testbed to invigorate the field towards training and evaluating LLM agents for fully autonomous computational drug design. We host a public leaderboard at smddbench.com.


2605.21652 2026-05-26 cs.CV cs.AI

Latent Crystallographic Fracture-Plane Reasoning and Generation Based on Miller Indices: A Vision-Language Model Approach

Qinwu Xu, Xiaofu Ma, Yifan Jiang

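Affiliations: Independent research

AI summary: Studies whether multimodal large language models can use Miller indices as a structured latent representation for reasoning about fracture geometry. Experiments show that the models can perform latent inference under idealized conditions and can reject the representation when the physics does not support it.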


Abstract

We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices $z = (h,k,l)$ as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image. Through extensive experiments spanning synthetic data, controlled 2D--3D geometric pairs, and real-world fracture images across multiple material classes -- including ceramics, glass, metals, and concrete -- we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it. As an exploratory extension, we further examine AI-generated fracture sequences and observe qualitatively plausible brittle-fracture progression behaviors, suggesting that multimodal generative models may encode partial implicit physical priors related to material failure dynamics. These results suggest that MLLMs can act as physics-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled.


2605.20278 2026-05-26 cs.LG cs.AI cs.CV

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

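Affiliations: The Chinese University of Hong Kong; MiniMax

AI summary: Proposes ClaimDiff-RL, a framework that uses atomic claim differences as the reward unit: a multimodal judge enumerates visual differences and assigns error types and severity levels, addressing the tradeoff between factuality and coverage in long-caption reinforcement learning.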


Abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
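To make the idea of composing a reward from per-difference statistics concrete, here is a minimal sketch. It assumes each judged difference carries a kind, a numeric severity, and a verification flag, and that the reward is a negative weighted sum of verified severities. The abstract does not specify the actual reward formula, so `ClaimDifference`, `compose_reward`, and both weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClaimDifference:
    kind: str        # "hallucination" or "omission" (hypothetical labels)
    severity: float  # hypothetical severity score in [0, 1]
    verified: bool   # True if the judge confirmed the difference against the image


def compose_reward(differences, w_hallucination=1.0, w_omission=1.0):
    """Negative weighted sum of verified severities; 0.0 means no confirmed errors."""
    penalty = 0.0
    for diff in differences:
        if not diff.verified:
            continue  # unverified differences do not affect the reward
        weight = w_hallucination if diff.kind == "hallucination" else w_omission
        penalty += weight * diff.severity
    return -penalty
```

Keeping the two weights separate is what lets hallucinated claims and omitted facts be tuned independently, rather than being folded into one scalar judgment.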


2605.20025 2026-05-26 cs.AI

Neuromorphic Control of Flapping-Wing Robots on Resource-Constrained Hardware

Rim El Filali, Chenrui Feng, Chao Gao, Weibin Gu

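Affiliations: Institute for AI Industry Research (AIR); Tsinghua University; Department of Computer Science and Technology; Xinchen Qihang Inc.

AI summary: For a butterfly-inspired flapping-wing robot weighing less than 30 grams, proposes a hierarchical neuromorphic control framework that deploys two lightweight spiking neural networks on a low-cost ESP32 microcontroller for state estimation and control. Trained by imitation learning, the system achieves stable pitch and heading tracking in untethered flight, with 36% lower latency and 18% lower power consumption than a conventional artificial neural network.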


Abstract

Flapping-Wing Micro Aerial Vehicles (FWMAVs) provide exceptional maneuverability and aerodynamic efficiency but pose significant challenges for onboard control due to nonlinear dynamics and stringent Size, Weight, and Power (SWaP) constraints, as exemplified by a butterfly-inspired robot weighing less than 30 grams. To this end, we present a hierarchical neuromorphic control framework that enables fully onboard, closed-loop flight on a widely available, resource-constrained ESP32 microcontroller with a unit cost of approximately $5. Specifically, our method deploys two lightweight Spiking Neural Networks (SNNs) onboard: one for state estimation from raw sensory feedback and another for control via modulation of a Central Pattern Generator (CPG) for wing actuation. Trained by imitation learning, the system achieves stable pitch and heading angle tracking during untethered real-world flight. Experimental results further reveal that the SNN-based controller reduces inference latency by 36% (1059 µs to 680 µs) and power by 18% (0.033 W to 0.027 W) compared to the conventional Artificial Neural Network (ANN) baseline, demonstrating the viability of spike-based computation without specialized hardware. To the best of our knowledge, this work constitutes the first demonstration of fully onboard neuromorphic control for autonomous flight of an FWMAV, highlighting the potential of SNNs to enable energy-efficient autonomy under stringent SWaP constraints. Visual abstract: http://bit.ly/4nI8ECY Code: https://anonymous.4open.science/r/Espikify-76E3/
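The reported percentages follow directly from the raw latency and power figures in the abstract. The short check below recomputes them; `percent_reduction` is a helper name introduced here for illustration.

```python
def percent_reduction(baseline, improved):
    """Relative reduction from `baseline` to `improved`, in percent."""
    return 100.0 * (baseline - improved) / baseline


latency_cut = percent_reduction(1059.0, 680.0)  # inference latency in microseconds
power_cut = percent_reduction(0.033, 0.027)     # inference power in watts
```

Rounded to the nearest integer, the two values match the 36% and 18% quoted in the abstract.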


2605.19409 2026-05-26 cs.LG cs.AI

Unlocking the Potential of Continual Model Merging: An ODE Perspective

Lihong Lin, Haidong Kang

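Affiliations: Northeastern University, Shenyang, China

AI summary: Proposes the ODE-M framework, which models continual model merging as a trajectory in parameter space and balances historical knowledge against new tasks through a rectified time-dependent velocity field and utility-aware time scheduling, improving performance on long task streams.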

Comments 21 pages, 8 figures


Abstract

Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without repeated retraining. However, existing merging rules usually update the deployed model through fixed algebraic or projection-based operations, providing limited control over how much previously accumulated knowledge should be retained relative to the incoming task model. This limitation leads to unstable retention and performance degradation in long task streams, and becomes more pronounced when tasks have heterogeneous utilities. We propose ODE-driven Merging (ODE-M), a controllable framework that formulates each continual merge as a trajectory in parameter space rather than a one-step endpoint update. Motivated by mode connectivity, ODE-M constructs a barrier-aware trajectory using a rectified time-dependent velocity field, where lightweight first-order feedback from a small calibration set suppresses loss-increasing motion while preserving progress toward the incoming model. The next merged model is then obtained by selecting an operating point along this trajectory through a utility-aware time schedule, providing an explicit mechanism for balancing retained historical knowledge and incoming task expertise. Extensive experiments on standard CMM benchmarks show that ODE-M consistently improves over strong continual merging baselines across CLIP ViT backbones, stream lengths, and heterogeneous task-utility settings.
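The trajectory view can be illustrated with a toy sketch. It assumes a straight-line base direction toward the incoming model, a gradient projection as the rectification step, and forward Euler integration stopped at a chosen time `t_stop`. These choices stand in for the paper's velocity field and time schedule, which the abstract does not specify; `rectify`, `merge_along_trajectory`, and `grad_fn` are hypothetical names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rectify(velocity, grad):
    """Remove the component of `velocity` that increases the loss to first order."""
    uphill = dot(velocity, grad)
    norm_sq = dot(grad, grad)
    if uphill <= 0.0 or norm_sq == 0.0:
        return list(velocity)  # already non-increasing; leave unchanged
    return [v - (uphill / norm_sq) * g for v, g in zip(velocity, grad)]


def merge_along_trajectory(theta_old, theta_new, grad_fn, t_stop, steps=100):
    """Integrate from theta_old toward theta_new and stop at time t_stop in [0, 1].

    grad_fn(theta) returns the calibration-loss gradient at theta (an assumption).
    """
    theta = list(theta_old)
    dt = t_stop / steps
    direction = [n - o for n, o in zip(theta_new, theta_old)]
    for _ in range(steps):
        velocity = rectify(direction, grad_fn(theta))
        theta = [p + dt * v for p, v in zip(theta, velocity)]
    return theta
```

With a zero gradient the result reduces to linear interpolation between the two models, so `t_stop` plays the role of the operating point that trades retained knowledge against incoming expertise.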


2605.18840 2026-05-26 cs.LG cs.AI cs.CL

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Adil Amin

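Affiliations: Zehen Labs

AI summary: Decomposes SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual (the h-field) to diagnose cooperation and tradeoffs among frontier model capabilities, and provides a three-step diagnostic, a per-lab measurement-priority table, and seven falsifiable predictions.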

Comments 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." ( https://doi.org/10.48550/arXiv.2605.18838 ). Code: https://github.com/adilamin89/cape-scaling . Dashboard: https://zehenlabs.com/cape/


Abstract

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024-2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot: DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA-IFEval isocline). The $h$-field is not just diagnostic: it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$ pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
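The decomposition into a population trend plus a per-release residual can be sketched with ordinary least squares. This is a minimal illustration that assumes the trend is a straight line fitted across all models and that the residual is the vertical distance from that line. Which benchmark is regressed on which, and any further normalization, are not stated in the abstract; `fit_line` and `residuals` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Per-point offsets from the fitted population trend."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

Under this reading, a release with a positive residual scores higher on the second benchmark than the population trend predicts from its first score, and a negative residual indicates the opposite emphasis.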


2605.18657 2026-05-26 cs.LG cs.AI

KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture

Luis Balderas, José Alberto Rodríguez, Miguel Lastra, Antonio Arauzo-Azofra, José M. Benítez

Affiliations: Department of Computer Science and Artificial Intelligence; DiCITS, iMUDS, DaSCI; University of Granada; Advanced Medical Imaging Group; Instituto de Investigación Biosanitaria de Granada (ibs.Granada); Department of Software Engineering; Department of Rural Engineering; University of Córdoba

AI summary: To address the computational bottleneck of standard attention and the omission of classical statistical knowledge, proposes KairosHope, which replaces quadratic attention with a dual-memory system (Titans modules and a Continuum Memory System, CMS) and adds a hybrid decision head that fuses deep representations with statistical features, achieving superior classification performance on the UCR benchmark.


Abstract

Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to specialized classification problems remains constrained by the computational bottleneck of standard attention and the systematic omission of classical statistical knowledge. This technical report introduces KairosHope, a next-generation TSFM designed to reconcile massive generalization with analytical precision in classification tasks. The core of the proposal is the HOPE block, an architecture that replaces quadratic attention with a dual-memory system: Titans modules for dynamic short-term retention and a Continuum Memory System (CMS) for the abstraction of long-term historical context. To enrich the inductive bias, a Hybrid Decision Head is introduced, which fuses deep latent representations with deterministic statistical features extracted via the tsfeatures package. KairosHope undergoes self-supervised pre-training on the massive Monash archive, combining Masked Time Series Modeling (MTSM) and contrastive learning (InfoNCE). Its subsequent adaptation to the UCR benchmark datasets is conducted through a rigorous Linear Probing and Full Fine-Tuning (LP-FT) protocol to prevent catastrophic forgetting. Empirical results demonstrate superior performance in domains characterized by strict temporal causality, such as HAR or sensor data. Consequently, KairosHope establishes a robust and efficient framework for the adaptation of foundation models to time series analysis.


2605.17531 2026-05-26 cs.CV

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

Yuting Yang, Haichao Jiang, Tianming Liang, Quan Zhang, Jian-Fang Hu

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Province Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education

AI summary: Proposes IC-Seg, a framework that proactively clarifies user intent through multi-turn dialogue, together with Hi-GRPO, a hierarchical optimization strategy, to resolve ambiguous user queries in referring segmentation.


Abstract

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.


2605.17268 2026-05-26 cs.AI cs.CV cs.RO

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

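Affiliations: School of Computer Science and Engineering; Central South University; School of Computer Science; University of Wollongong in Dubai

AI summary: An analysis of 300 VLA reasoning runs finds that the faithfulness between output reasoning and trajectory is only 42.5%, with many missed pedestrians, fragile trajectories, and reasoning-action inconsistencies. The paper also proposes an information-theoretic formalization of faithfulness and a safety architecture.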

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

2605.16409 2026-05-26 cs.CV cs.CL cs.LG

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Qinwu Xu, Yifan Jiang, Haoyu Ren

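Affiliations: Meta AI; UT Austin

AI summary: Proposes a multilingual OCR-aware multimodal training framework that combines synthetic data generation, OCR-aware fine-tuning, and structured visual chain-of-thought prompting to improve the OCR completeness and multilingual translation accuracy of multimodal large language models under complex visual conditions.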


English abstract

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.
