arXivDaily: daily arXiv digest, updated Monday to Friday
2604.26498 2026-06-09 cs.LG q-bio.QM updated version

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Jinjiang Guo, Sheng Ding

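Affiliations Global Health Drug Discovery Institute; School of Pharmaceutical Sciences

AI summary Evaluating 26 ADME, toxicity and bioactivity endpoints, this paper finds that classical machine learning takes the largest share of best-performing entries, that larger models are competitive only on some of the harder splits, and that performance depends on the fit between model, task and validation scenario rather than on scale alone.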

Comments Improved benchmark design and reproducibility, replaced restricted datasets with public benchmarks in primary analyses, and added sensitivity analyses supporting the interpretation of model scaling and evaluation protocol effects in molecular prediction

Abstract

The rapid growth of molecular foundation models and large language models (LLMs) has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models. We test this assumption across 26 ADME, toxicity and bioactivity endpoints, covering 165,541 endpoint level compound label records. The benchmark contains 78 endpoint and split entries evaluated under random, Murcko scaffold and structure separated 5-fold cross validation protocols, representing increasing chemical generalization difficulty. Across 156 task and metric comparisons, classical machine learning (ML) provides the largest share of best performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (21.8%) and LLM based SAR baselines (1.9%). Classical ML dominates random split interpolation and remains the largest winner family overall. GNN and sequence models are competitive in selected harder splits, but their strict winner shares decrease under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between individual models should not be read as decisive victories. SAR knowledge from training folds improves GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
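
The paired bootstrap mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact procedure, so everything here is an assumption: the function names, the 95% percentile interval, and the use of accuracy as the metric. The idea is to resample the same test compounds for both models and look at the spread of the metric difference; an interval that contains zero suggests the numerical gap is not a decisive win.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of matching labels (illustrative metric, not the paper's)."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap(y_true, pred_a, pred_b, metric, n_boot=2000, seed=0):
    """Percentile interval (2.5%, 97.5%) for metric(A) - metric(B).

    The same resampled indices are used for both models, which is what
    makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        score_a = metric(yt, [pred_a[i] for i in idx])
        score_b = metric(yt, [pred_b[i] for i in idx])
        diffs.append(score_a - score_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

If the returned interval straddles zero, the two models are hard to separate on that endpoint, which matches the abstract's caution about small numerical differences.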

2503.01125 2026-06-09 cs.RO updated version

TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning

Zikang Yin, Canlun Zheng, Shiliang Guo, Zhikun Wang, Shiyu Zhao

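Affiliations College of Computer Science and Technology, Zhejiang University; WINDY Lab, Department of Artificial Intelligence, Westlake University

AI summary This paper proposes TACO, a target-and-command-oriented reinforcement learning framework that handles different acrobatic maneuvers in a unified way and supports online parameter changes. A spectral normalization method with input-output rescaling improves the policy's smoothness and symmetry, and the approach is validated on high-speed circular flights and continuous multi-flips.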

Comments For the experiment video, please refer to https://youtu.be/x1v7nD2iHIk

Abstract

Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. Additionally, we propose a spectral normalization method with input-output rescaling to enhance the policy's temporal and spatial smoothness, independence, and symmetry, thereby overcoming the sim-to-real gap. We validate the TACO approach through extensive simulation and real-world experiments, demonstrating its capability to achieve high-speed circular flights and continuous multi-flips.
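
As background for the spectral normalization mentioned above, here is a minimal sketch of plain spectral normalization of a single weight matrix using power iteration. It is not the paper's method: the input-output rescaling is not described in the abstract and is not reproduced here, and the function names and the `max_gain` parameter are assumptions for illustration. Bounding the largest singular value of each layer limits how much the layer can amplify input changes, which is one common route to smoother policies.

```python
import math


def _matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def _mat_t_vec(W, u):
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iter=100):
    """Estimate the largest singular value of W by power iteration.

    Assumes the all-ones start vector is not orthogonal to the top
    right-singular vector (true for generic matrices).
    """
    v = [1.0] * len(W[0])
    nv = _norm(v)
    v = [x / nv for x in v]
    for _ in range(n_iter):
        u = _matvec(W, v)
        nu = _norm(u)
        if nu == 0.0:
            return 0.0
        u = [x / nu for x in u]
        v = _mat_t_vec(W, u)
        nv = _norm(v)
        v = [x / nv for x in v]
    return _norm(_matvec(W, v))


def spectrally_normalize(W, max_gain=1.0):
    """Rescale W so its largest singular value is at most max_gain."""
    sigma = spectral_norm(W)
    if sigma <= max_gain:
        return [row[:] for row in W]
    scale = max_gain / sigma
    return [[w * scale for w in row] for row in W]
```

For example, `spectrally_normalize([[3.0, 0.0], [0.0, 1.0]])` scales the matrix by roughly one third, so the layer's gain is capped near 1.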

2605.14211 2026-06-09 cs.AI cs.LG updated version

ASH: Agents that Self-Hone via Embodied Learning

Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun

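Affiliations University of Waterloo; National Research Council Canada

AI summary Proposes ASH, a system that learns an embodied policy from unlabeled internet video using a self-improvement loop and an inverse dynamics model, and that clearly outperforms baselines on long-horizon tasks.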

Comments Published as a workshop paper at ICML 2026 Workshop on Scalable Learning and Optimization for Efficient Multimodal AI Agents

Abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

2605.13768 2026-06-09 cs.LG cs.AI cs.IT math.IT updated version

High-Rate Quantized Matrix Multiplication II

Or Ordentlich, Yury Polyanskiy

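Affiliations Hebrew University of Jerusalem; MIT

AI summary This paper studies high-rate quantized matrix multiplication when the covariance matrix of the columns of the second factor is known, shows how waterfilling can improve LLM quantization algorithms, and analyzes how close the WaterSIC scheme comes to the information-theoretic limit.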

Abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as ``WaterSIC'') that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
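
The classical reverse waterfilling rule referred to above can be sketched for independent Gaussian components. This is the textbook rate allocation, not the WaterSIC or GPTQ algorithm itself, and the function names are assumptions. Given component variances and a total distortion budget, each component receives distortion min(θ, σ²), where the water level θ is chosen so the distortions sum to the budget; components with variance below θ get zero rate. The second function evaluates the 2πe/12 factor quoted in the abstract in bits.

```python
import math


def reverse_waterfilling(variances, total_distortion, tol=1e-12):
    """Return (water level theta, per-component rates in bits).

    Assumes all variances are positive and that
    0 < total_distortion <= sum(variances).
    """
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("distortion budget out of range")
    lo, hi = 0.0, max(variances)
    # Bisection on theta: total distortion is non-decreasing in theta.
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


def scalar_quantizer_gap_bits():
    """Rate equivalent of the 2*pi*e/12 distortion factor, in bits per entry."""
    return 0.5 * math.log2(2 * math.pi * math.e / 12)
```

With variances 4, 1 and 0.25 and a budget of 0.75, the water level is 0.25 and the rates are about 2, 1 and 0 bits, so the unequal allocation is visible directly. The gap function evaluates to roughly 0.25 bit, consistent with the figure in the abstract.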

2605.11212 2026-06-09 cs.CL updated version

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

Affiliations University of British Columbia; Microsoft Research

AI summary ReVision removes redundant visual patches, reducing token usage while improving success rate, so that agents can process longer trajectories.

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. This has resulted in no or very limited improvement in the performance when using history unlike other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed.
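
The idea of dropping patches that repeat across consecutive screenshots can be sketched as follows. This is a simplified illustration, not ReVision: the paper uses a learned patch selector, whereas this sketch swaps in a fixed distance threshold, and the data layout (a frame as a grid of patch feature vectors) and all names are assumptions. Each kept patch carries its (row, column) position, so the spatial layout stays recoverable after patches are removed.

```python
def changed_patches(prev_frame, curr_frame, threshold=0.1):
    """Keep patches of curr_frame that differ from the same position in prev_frame.

    A frame is a list of rows; each row is a list of patch feature vectors.
    Returns a list of ((row, col), patch) pairs.
    """
    kept = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            # Largest per-feature change; a stand-in for a learned selector.
            dist = max(abs(a - b) for a, b in zip(p, q))
            if dist > threshold:
                kept.append(((r, c), q))
    return kept


def compress_trajectory(frames, threshold=0.1):
    """Keep the first frame whole, then only the changed patches of later frames."""
    first = [((r, c), patch)
             for r, row in enumerate(frames[0])
             for c, patch in enumerate(row)]
    compressed = [first]
    for prev_frame, curr_frame in zip(frames, frames[1:]):
        compressed.append(changed_patches(prev_frame, curr_frame, threshold))
    return compressed
```

On GUI trajectories most of the screen is unchanged between steps, so later frames shrink to a handful of patches; that is the source of the token savings the abstract describes.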

2605.12213 2026-06-09 cs.AI updated version

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, Scott Sanner

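Affiliations University of Toronto; Vector Institute for Artificial Intelligence

AI summary This paper proposes Goal-Mem, a framework that uses goal-oriented reasoning to improve RAG-based memory on complex tasks, with the largest gains on multi-hop reasoning and implicit inference.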

Abstract

LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent's ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user's utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
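
Backward chaining from a goal to atomic subgoals can be sketched in a few lines. This is a toy illustration under strong assumptions, not Goal-Mem: memory retrieval is replaced by exact string lookup in a set, the decomposition rules are a hand-written dictionary rather than something an LLM produces, and all names are hypothetical. It shows only the control flow: try to satisfy the goal from memory, otherwise decompose it and resolve each subgoal in turn.

```python
def backward_chain(goal, rules, memory, depth=0, max_depth=5):
    """Return the memory facts that support `goal`, or None if unresolved.

    rules:  dict mapping a goal to a list of alternative subgoal lists.
    memory: set of stored facts (stand-in for a retrieval module).
    """
    if goal in memory:
        return [goal]
    if depth >= max_depth:
        return None
    for subgoals in rules.get(goal, []):
        evidence = []
        for sub in subgoals:
            found = backward_chain(sub, rules, memory, depth + 1, max_depth)
            if found is None:
                break  # this decomposition fails; try the next alternative
            evidence.extend(found)
        else:
            return evidence
    return None


memory = {"Alice lives in Paris", "Paris is in France"}
rules = {"Alice lives in France": [["Alice lives in Paris", "Paris is in France"]]}
support = backward_chain("Alice lives in France", rules, memory)
```

Here `support` holds the two stored facts that jointly answer a question neither fact answers alone, the multi-hop case the abstract highlights.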

2605.11855 2026-06-09 cs.LG cs.AI cs.AR updated version

Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications

Julien Brandoit, Arthur Fyon, Damien Ernst, Guillaume Drion

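Affiliations University of Cambridge

AI summary This paper proposes the CMRU and αCMRU, whose cumulative update formulation restores gradient flow while preserving persistent memory. The formulation improves convergence stability and reduces initialization sensitivity, and the units perform well across diverse benchmarks, especially on tasks requiring discrete long-range retention.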

Comments Accepted as a spotlight at ICML 2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9

Abstract

Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the $α$CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and $α$CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.

2605.08384 2026-06-09 cs.CL updated version

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

Florian Hönicke, Michael Günther, Andreas Koukounas, Mohammad Kalim Akram, Scott Martens, Saba Sturua, Han Xiao

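Affiliations Jina by Elastic

AI summary This paper proposes GELATO, which builds multimodal embeddings from frozen, aligned towers to produce a single semantic space; training is efficient and text embeddings stay identical to those of the base text models.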

Comments 11 pages, 9 figures, 5 tables

Abstract

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modality by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.

2605.11502 2026-06-09 cs.CL updated version

Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations

Shufan Ming, Joe D. Menke, Neil R. Smalheiser, Halil Kilicoglu

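Affiliations School of Information Sciences; University of Illinois Urbana-Champaign; Department of Psychiatry; University of Illinois Chicago

AI summary This paper proposes an evaluation framework based on controlled semantic perturbations and uses entity masking and domain-adversarial training to make biomedical publication type classification more robust. It finds that selectively suppressing non-task-defining features mitigates the trade-off between robustness and in-domain accuracy.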

Comments Accepted by IEEE ICHI 2026

Abstract

Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI

2605.11484 2026-06-09 cs.AI updated version

Engagement Process: Rethinking the Temporal Interface of Action and Observation

Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

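Affiliations University of Science and Technology of China

AI summary This paper proposes the Engagement Process (EP), a model whose explicit temporal interface handles actions and observations that unfold on different time scales. EP supports multi-rate coordination and composition of subsystems, exposes hidden temporal behaviors, and lets policies adapt to explicit time costs.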

Abstract

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation-action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action-observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.

2601.23286 2026-06-09 cs.CV cs.AI cs.LG updated version

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

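Affiliations University of California, Berkeley

AI summary VideoGPA distills geometry priors to improve the 3D consistency of video generation, using a data-efficient self-supervised framework to guide video diffusion models; it markedly improves temporal stability, geometric plausibility and motion coherence.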

Comments 8 pages, 5 figures, ICML 2026

Abstract

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

2605.10376 2026-06-09 cs.CV updated version

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Affiliations Indian AI Research Organization (IAIRO); ĸragya Lab, BITS Pilani Goa; University of Dhaka; Delhi Technological University; Apple; Meta

AI summary This paper proposes SleepWalk, a benchmark for instruction-grounded trajectory prediction that targets localized, interaction-centric embodied reasoning and reveals systematic failures of current VLMs in spatial reasoning.

Abstract

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increases. In general, current VLMs can somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.

2412.01324 2026-06-09 cs.RO updated version

Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control

Kai Pfeiffer, Quan Zhang, Yuqing Chen, Gordon Boateng, Yuquan Wang, Vincent Bonnet, Aberrahmane Kheddar

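Affiliations University of Cambridge

AI summary This paper proposes an efficient nonlinear programming framework that integrates hierarchical decision-making with whole-body inverse kinematic planning and control, addressing problems such as inverse kinematic planning with simultaneous selection of end-effector locations.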

Comments Accepted paper to "Robotics: Science and Systems" (2026)

Abstract

This work presents a novel and efficient nonlinear programming framework that tightly integrates hierarchical decision-making with whole-body inverse kinematic planning and control. Decision-making plays a central role in many aspects of robotics, from sparse inverse kinematic control with a minimal number of joints, to inverse kinematic planning while simultaneously selecting a discrete end-effector location from multiple candidates. Current approaches often rely on heavy computations using mixed-integer nonlinear programming, separate decision-making from inverse kinematics (sometimes approximated by reachability methods), or employ efficient but less versatile $\ell_1$-norm formulations of linear sparse programming, without addressing the underlying nonlinear problem formulations. In contrast, the proposed sparse hierarchical nonlinear programming solver is efficient, versatile, and accurate by exploiting sparse hierarchical structure and leveraging the $\ell_0$-norm which is rarely used in robotics. The solver efficiently tackles complex nonlinear hierarchical decision-making problems previously unaddressed in the literature, such as inverse kinematic planning with simultaneous prioritized selection of end-effector locations from a large set of candidates, or inverse kinematic control with simultaneous selection of bi-manual grasp locations on a randomly rotated box.

2605.08876 2026-06-09 cs.LG updated version

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

Xinyu Li, Ronghui Mu, Lin Li, Tianjin Huang, Gaojie Jin

Affiliations Department of Computer Science, University of Exeter; Department of Computer Science, University of Oxford; Department of Mathematics and Computer Science, Eindhoven University of Technology

AI summary OTora is the first unified two-stage red-teaming framework for reasoning-level denial-of-service attacks. It optimizes adversarial triggers and generates agent-aware reasoning payloads, inflating reasoning token counts and latency while preserving task accuracy.

Comments Accepted to ICML 2026

Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute tool-augmented, multi-step tasks, where latency is a critical factor for real-world applications. Yet an overlooked threat is Reasoning-Level Denial-of-Service (R-DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent's reasoning depth or tool-use budget. We introduce OTora, the first unified, two-stage red-teaming framework for instantiating R-DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion-aware scoring and dynamic target co-evolution, supporting both black-box and white-box settings. Stage II generates agent-aware reasoning payloads via an ICL-guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA-70B and GPT-OSS-120B, OTora achieves up to 10 times increases in reasoning tokens and order-of-magnitude latency slowdowns, all while preserving near-baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes. The code is available at https://github.com/llm2409/OTora.

2605.04913 2026-06-09 cs.CL cs.LG updated version

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su

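Affiliations Independent Researcher; D 4 Lab; Southeast University

AI summary This paper proposes LoPT, a local-learning post-training strategy that places a gradient boundary at the transformer midpoint, lowering memory cost, improving training efficiency and preserving pretrained capabilities.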

Comments 35 pages

Abstract

LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT (Local-Learning Post-Training), a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
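The gradient boundary can be made concrete with a toy scalar model and hand-written gradients. The following Python sketch is an illustration under strong assumptions, not the LoPT implementation: each "block" is a single weight, the reconstruction objective is taken to be "a small decoder `d` should recover the input from the first block's feature", and the name `lopt_step` is hypothetical.

```python
def lopt_step(w1, w2, d, x, t, lr=0.01):
    """One update of a two-block scalar model with a gradient boundary.

    w1: first-half block weight, w2: second-half block weight,
    d: hypothetical lightweight decoder used only by the local objective,
    x: input, t: task target.
    """
    h = w1 * x   # first-half feature
    y = w2 * h   # second-half output; h is treated as a constant here

    # Task loss (y - t)^2 reaches only w2: the backward path stops at h.
    g_w2 = 2.0 * (y - t) * h

    # Local loss (d*h - x)^2 updates the first block and the decoder.
    r = d * h
    g_w1 = 2.0 * (r - x) * d * x
    g_d = 2.0 * (r - x) * h

    return w1 - lr * g_w1, w2 - lr * g_w2, d - lr * g_d
```

With `w1=2.0, d=0.5` the feature already reconstructs the input, so `w1` stays put even when the task error is large. Under end-to-end training the same step would push a task gradient of `2*(y-t)*w2*x` into `w1`.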

2605.06384 2026-06-09 cs.LG cs.AI cs.FL 版本更新

MinMax Recurrent Neural Cascades

MinMax 循环神经网络级联

Alessandro Ronca

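Affiliations * IRIS-AI

AI summary MinMax RNCs are built on the MinMax algebra and combine strong expressivity, efficient evaluation, stable dynamics, and non-vanishing state gradients; they perform strongly on synthetic tasks, handle long sequences, and outperform conventional recurrent baselines.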

Comments Code: https://github.com/minmaxrnc/model

详情
AI中文摘要

我们引入MinMax循环神经网络级联(MinMax RNCs),一种基于MinMax代数新形式递归的循环神经网络。我们展示了MinMax RNCs具有一些难以同时获得的关键性质:强大的形式表达性、高效的评估、稳定的动态和非消失的状态梯度。首先,其形式表达性对应正则语言,可能是有限记忆系统的最大表达性。其次,除了递归形式的评估外,它们还允许并行扫描评估,具有对数深度和线性工作量。第三,其状态和激活在所有序列长度下均被统一限制。第四,其损失梯度几乎处处存在且在所有序列长度下均被统一限制。第五,它们不表现出消失的状态梯度:状态相对于过去状态的梯度可以独立于状态之间的时距保持范数一。经验上,我们发现这些理论性质转化为强大的实际性能。MinMax RNCs完美解决了考虑的合成任务,能够泛化到长序列,并在实验中超越了考虑的循环基线。我们还训练了一个1.12亿参数的MinMax RNC进行下一个token预测,获得与其规模相竞争的性能,提供了初始证据表明MinMax递归可以扩展到现实世界的序列建模任务。

英文摘要

We introduce MinMax Recurrent Neural Cascades (MinMax RNCs), a class of recurrent neural networks built from a novel form of recurrence over the MinMax algebra. We show that MinMax RNCs enjoy key properties that are difficult to obtain simultaneously: strong formal expressivity, efficient evaluation, stable dynamics, and non-vanishing state gradients. First, their formal expressivity corresponds to the regular languages, arguably the maximal expressivity for finite-memory systems. Second, in addition to evaluation in recurrent form, they also admit parallel-scan evaluation with logarithmic depth and linear work in the input length. Third, their states and activations are uniformly bounded for all sequence lengths. Fourth, their loss gradients exist almost everywhere and are uniformly bounded for all sequence lengths. Fifth, they do not exhibit vanishing state gradients: the gradient of a state with respect to a past state can retain norm one independently of the temporal distance between the states. Empirically, we find that these theoretical properties translate into strong practical performance. MinMax RNCs solve the considered synthetic tasks perfectly, generalise to long sequences, and outperform the recurrent baselines considered in our experiments. We also train a 112M-parameter MinMax RNC for next-token prediction, obtaining competitive performance for its size and providing initial evidence that MinMax recurrence can scale to real-world sequence-modelling tasks.
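To see why a recurrence built from min and max can admit parallel-scan evaluation, consider the following Python sketch. The abstract does not give the actual MinMax RNC recurrence, so the clamp recurrence `h_t = max(lo_t, min(hi_t, h_{t-1}))` used here is an assumed stand-in, and all function names are hypothetical. The point it illustrates is that composing two clamps yields another clamp, so the per-step maps form an associative operator that a scan can combine in a logarithmic number of passes.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def step_sequential(h0, bounds):
    """Evaluate h_t = clamp(h_{t-1}, lo_t, hi_t) one step at a time."""
    states, h = [], h0
    for lo, hi in bounds:
        h = clamp(h, lo, hi)
        states.append(h)
    return states


def compose(first, second):
    """Clamp equivalent to applying `first`, then `second` (needs lo <= hi in both)."""
    lo2, hi2 = second
    return (clamp(first[0], lo2, hi2), clamp(first[1], lo2, hi2))


def step_scan(h0, bounds):
    """Same states via a Hillis-Steele inclusive scan over `compose`."""
    prefix = list(bounds)
    d = 1
    while d < len(prefix):
        # Every position in a pass is independent, so a pass could run in parallel.
        # This simple variant does O(n log n) work; a work-efficient scan
        # would be needed to reach linear work.
        prefix = [prefix[i] if i < d else compose(prefix[i - d], prefix[i])
                  for i in range(len(prefix))]
        d *= 2
    return [clamp(h0, lo, hi) for lo, hi in prefix]
```

In this stand-in, the derivative of a state with respect to an earlier state is 1 whenever every intermediate clamp is inactive and 0 otherwise, which gives a simple picture of a gradient that can keep norm one over long distances.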

2605.06582 2026-06-09 cs.LG cs.CL cs.SD 版本更新

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

PairAlign:一种通过自对齐的序列标记化框架及其在音频标记化中的应用

Adhiraj Banerjee, Vipul Arora

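Affiliations * Department of Electrical Engineering, Indian Institute of Technology, Kanpur

AI summary PairAlign achieves compact audio tokenization through sequence-level self-alignment, using conditional sequence generation to improve token consistency, length control, and edit similarity.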

Comments 57 pages main content, 109 total pages, 9 Figures, pre-print, Under Review

详情
AI中文摘要

许多感官数据的操作——比较、记忆、检索和推理——自然地在离散符号结构上表达。在语言中,这种接口由标记提供;在音频中,必须学习。现有音频标记器依赖于量化、聚类或编解码器重建,将标记局部分配,因此序列一致性、紧凑性、长度控制、终止和编辑相似性很少被直接优化。我们引入PairAlign,一种通过序列级自对齐实现紧凑音频标记化的框架。PairAlign将标记化视为条件序列生成:编码器将语音映射为连续条件,自回归解码器从BOS开始生成标记,学习标记身份、顺序、长度和EOS位置。给定两个保持内容的视图,每个视图的序列在另一个视图的表示下被训练为可能,而无关示例提供竞争序列。这为可扩展的编辑距离保留代理,同时抑制许多对一的坍缩。PairAlign从VQ式标记化开始,并通过EMA教师目标、交叉配对教师强制、前缀损坏、似然对比和长度控制进行优化。在3秒语音上,PairAlign学习紧凑、非退化的序列,具有广泛的词汇使用和强跨视图一致性。在检索测试中,它保留编辑距离搜索,同时将存档标记数量减少55%。连续扫频探针显示其局部重叠低于密集几何标记器,但具有更强的长度控制和在100毫秒移位下的受约束编辑轨迹。PairAlign是一种序列符号预测学习者:像JEPA式目标一样,它从另一个视图预测一个抽象目标作为学习的可变长度符号序列,而不是连续潜在变量。

英文摘要

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.
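Edit-distance search over token sequences, which the retrieval tests rely on, can be sketched in a few lines of Python. This is the standard Levenshtein dynamic program, not PairAlign code; the helper `nearest` and the archive layout (a dict from item name to token list) are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest(query, archive):
    """Return the archive key whose token sequence is closest to `query`."""
    return min(archive, key=lambda name: edit_distance(query, archive[name]))
```

Because the cost of this search grows with sequence length, shorter token sequences per archived item translate directly into cheaper retrieval, which is why the token-count reduction reported above matters.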

2605.05136 2026-06-09 cs.CV 版本更新

CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization

CPCANet:基于共同主成分分析的深度展开方法用于领域泛化

Yu-Hsi Chen, Abd-Krim Seghouane

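Affiliations * The University of Melbourne

AI summary This paper proposes CPCANet, which implements common principal component analysis by deep-unfolding the Flury-Gautschi algorithm, improving domain generalization and reaching state-of-the-art zero-shot transfer on four benchmarks.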

Comments 9 pages, 5 tables

详情
AI中文摘要

领域泛化(DG)旨在学习在分布外转移下仍具鲁棒性的表示,并有效推广到未见目标领域。尽管最近的不变学习策略和架构进步已取得良好性能,但通过二阶统计显式发现结构化的领域不变子空间仍被忽视。本文提出CPCANet,一种基于共同主成分分析(CPCA)的新型框架,将迭代的Flury-Gautschi(FG)算法展开为完全可微的神经层。该方法将CPCA的统计特性整合到端到端可训练框架中,强制在不同领域间发现共享子空间,同时保持可解释性。在四个标准DG基准测试中,CPCANet在零样本转移中达到最先进性能。此外,CPCANet架构无关,无需特定数据集调优,提供了一种简单高效的鲁棒表示学习方法以应对分布偏移。代码可在https://github.com/wish44165/CPCANet获取。

英文摘要

Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen target domains. While recent invariant learning strategies and architectural advances have achieved strong performance, explicitly discovering a structured domain-invariant subspace through second-order statistics remains underexplored. In this work, we propose CPCANet, a novel framework grounded in Common Principal Component Analysis (CPCA), which unrolls the iterative Flury-Gautschi (FG) algorithm into fully differentiable neural layers. This approach integrates the statistical properties of CPCA into an end-to-end trainable framework, enforcing the discovery of a shared subspace across diverse domains while preserving interpretability. Experiments on four standard DG benchmarks demonstrate that CPCANet achieves state-of-the-art (SOTA) performance in zero-shot transfer. Moreover, CPCANet is architecture-agnostic and requires no dataset-specific tuning, providing a simple and efficient approach to learning robust representations under distribution shift. Code is available at https://github.com/wish44165/CPCANet.

2605.03862 2026-06-09 cs.AI cs.CL 版本更新

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

正确性不足:通过执行器导向的奖励训练推理计划器

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su

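Affiliations * D4 Lab; Independent Researcher

AI summary This paper proposes the TraceLift framework, which improves reasoning quality through executor-grounded rewards and uses a rubric-based Reasoning Reward Model to assess the reliability and usefulness of reasoning traces.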

Comments 36 pages

详情
AI中文摘要

可验证奖励的强化学习已成为提升大语言模型显式推理的常见方法,但仅凭最终答案正确性无法揭示推理轨迹的忠实性、可靠性或对消费模型的效用。为此,我们提出TraceLift,将推理视为可消费的中间产物。在计划器训练中,计划器生成标记化的推理。冻结的执行器将此推理转化为最终产物供验证器反馈,同时执行器导向的奖励塑造中间轨迹。此奖励乘以基于rubric的Reasoning Reward Model评分,乘以在相同冻结执行器上测量的提升,奖励高质量且有用的轨迹。为使推理质量直接可学习,我们引入TRACELIFT-GROUPS数据集,包含数学和代码种子问题。每个示例是同一问题组,包含高质量参考轨迹和多个可能的错误轨迹,通过局部扰动降低推理质量或解决方案支持,同时保持任务相关性。在代码和数学基准上的广泛实验表明,执行器导向的推理奖励提高了两阶段计划器-执行器系统,表明推理监督应不仅评估轨迹是否看起来好,还应评估其是否帮助消耗模型。

英文摘要

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
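The multiplicative reward described above can be written as a one-line function. The Python sketch below is an interpretation, not the TraceLift code: it assumes that "uplift" means the frozen executor's pass rate with the trace minus its pass rate without it, and that negative uplift is clipped to zero. Both choices, and the function name, are hypothetical.

```python
def executor_grounded_reward(rm_score, pass_rate_with_trace, pass_rate_without_trace):
    """Reward = rubric score x measured uplift on the same frozen executor.

    rm_score: rubric-based reasoning score, assumed to lie in [0, 1].
    pass_rate_*: executor success rates, assumed to lie in [0, 1].
    """
    uplift = pass_rate_with_trace - pass_rate_without_trace
    # Assumed clipping: a trace that hurts the executor earns no reward.
    return rm_score * max(0.0, uplift)
```

The product form means a polished trace that does not help the executor scores zero, and so does a helpful trace that the rubric rates as poor reasoning, which is the "both high-quality and useful" condition stated in the abstract.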

2605.01320 2026-06-09 cs.CV 版本更新

PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression

PACE:用于学习LiDAR点云压缩的后因果熵建模

Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

Affiliations * School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China; School of Electronic Science and Engineering, Nanjing University, Nanjing, China

AI summary PACE improves LiDAR point cloud compression efficiency with a non-causal backbone and a lightweight predictor, reducing decoding latency by over 90% and delivering BD-BR savings.

详情
AI中文摘要

LiDAR点云压缩对自动驾驶系统处理高分辨率传感器数据至关重要。尽管基于八叉树结构的学得熵建模能获得高压缩增益,但面临两个关键瓶颈:1)解码时因因果、多阶段上下文建模导致的延迟过高;2)性能-延迟权衡的刚性,使单一模型难以适应变化约束。这些限制源于上下文聚合骨干与概率预测之间的紧密耦合。为此,我们提出PACE,一种新的框架,将祖先上下文聚合重新表述为非因果骨干,并将因果性限制在轻量级、阶段可扩展的预测器中,消除重复骨干执行并减少计算开销。预测器支持任意数量的预测阶段,使模型能够无缝适应多样化的性能-延迟权衡,而无需重新加载参数。实验表明,PACE在压缩效率上达到新状态,实现显著的BD-BR节省,并在自回归模式下将解码延迟降低超过90%,使其在实际应用中具有吸引力。

英文摘要

LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state of the art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, making it attractive for practical applications.

2605.01171 2026-06-09 cs.CV cs.LG 版本更新

CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization

CADFit:基于混合优化的精确网格到CAD程序生成

Ghadi Nehme, Eamon Whalen, Faez Ahmed

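Affiliations * Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary Proposes the CADFit framework, which recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback; it outperforms existing methods on multiple benchmarks and substantially lowers the invalid ratio.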

详情
AI中文摘要

尽管最近取得了进展,但从几何输入(如网格或点云)恢复参数化CAD构造序列仍然是设计和制造的关键挑战,因为现有的CAD重建和生成方法主要局限于难以编辑的格式(如网格或Breps)或可编辑的简单草图-拉伸流水线和低复杂度数据集。我们引入了CADFit,一个基于混合优化的CAD重建框架,通过使用几何反馈增量拟合和验证参数化操作,从网格中恢复复杂、可编辑的CAD构造序列。我们的方法的特点是将重建公式化为对结构化CAD程序的IoU驱动优化,并支持丰富的操作集,包括拉伸、旋转、圆角和倒角。在多个CAD基准上的实验表明,CADFit在体积交并比和倒角距离方面优于最先进的网格到CAD方法,同时显著降低了重建CAD程序的无效比率,特别是对于复杂设计。我们进一步提出了一个多模态流水线,通过将基于图像的几何重建与CADFit相结合,实现从图像端到端重建CAD构造序列。通过实现更高复杂度CAD模型的精确重建,CADFit为生成更丰富的数据集和推进未来基于学习的CAD逆向工程方法提供了实用基础。代码可在:https://github.com/ghadinehme/CADFit 获取。

英文摘要

Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, remains a key challenge for design and manufacturing: existing CAD reconstruction and generation methods are largely restricted either to difficult-to-edit formats such as meshes or B-reps, or to editable but simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering. The code is available at: https://github.com/ghadinehme/CADFit.
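The idea of accepting an operation only when geometric feedback improves can be shown with a toy voxel example in Python. This is a deliberately simplified sketch, not CADFit: real operations (extrusions, revolutions, fillets, chamfers) are replaced by pre-voxelized axis-aligned boxes, only unions are tried, and the names `voxel_iou`, `box`, and `greedy_fit` are hypothetical.

```python
def voxel_iou(a, b):
    """Volumetric IoU between two sets of occupied voxel coordinates."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def box(x0, x1, y0, y1, z0, z1):
    """Voxels of an axis-aligned box with half-open integer bounds."""
    return {(x, y, z)
            for x in range(x0, x1) for y in range(y0, y1) for z in range(z0, z1)}


def greedy_fit(target, candidates):
    """Keep each candidate solid only if adding it raises IoU with the target."""
    shape, program = set(), []
    best = voxel_iou(shape, target)
    for name, solid in candidates:
        trial = shape | solid
        score = voxel_iou(trial, target)
        if score > best:  # validate the operation with geometric feedback
            shape, program, best = trial, program + [name], score
    return program, best
```

For a target made of a base plate plus a small boss, a stray box far from the part lowers IoU and is rejected, while the base and the boss are kept, so the returned program lists only the operations that explain the geometry.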

2506.20588 2026-06-09 cs.CV 版本更新

TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

TRIM:一种最大化时间相对信息和代表性的自监督视频摘要框架

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

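Affiliations * Pompeu Fabra University; Universitat Autònoma de Barcelona

AI summary The TRIM framework achieves efficient video summarization through self-supervised learning without complex components such as attention, outperforming existing unsupervised methods and challenging the reliance on complex architectures.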

详情
AI中文摘要

随着视频内容的普及,视频摘要和亮点提取成为关键研究领域。然而,许多先进方法依赖监督标注或注意力模型,计算成本高且在分布变化时表现不稳定。我们提出一种新颖的自监督视频摘要模型,无需注意力、RNN或Transformer,通过马尔可夫过程驱动的损失度量和两阶段自监督学习范式,实现性能与效率的平衡。TRIM在SUMME和TVSUM数据集上达到最佳性能,超越所有现有无监督方法,并与最佳监督模型相当,展示了高效无标注架构的潜力,为更通用的视频摘要技术铺平道路,并挑战现有复杂架构的依赖。

英文摘要

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle under distribution shifts, hindering cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SumMe and TVSum datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

2605.00647 2026-06-09 cs.LG 版本更新

Label-Conditioned Cross-Modal Fusion for Adult-to-Pediatric ECG Transfer via Curriculum-Gated Contrastive Alignment

基于标签的跨模态融合用于成人到儿童ECG转移 via 课程门控对比对齐

Xinran Liu, Yuwen Li, Hongxiang Gao, Heyang Xu, Jianqing Li, Zongmin Wang, Chengyu Liu

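Affiliations * School of Instrument Science and Engineering, Southeast University; Nanjing Medical University; Zhengzhou University

AI summary This paper proposes the PEACE framework, which improves pediatric ECG diagnosis through pretraining and adaptive fusion, using contrastive learning and a curriculum adaptation strategy to reach high accuracy with limited labels.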

详情
AI中文摘要

自动化的儿童心电图(ECG)解释仍具挑战性,因为心率、间隔和波形的发育差异限制了主要在成人数据上训练的模型的可转移性,同时专家标注的儿童ECG数据集稀缺。我们提出PEACE(通过跨模态增强的儿童-成人ECG对齐),一个在MIMIC-IV ECG上预训练并适应于儿童目标的成人到儿童ECG转移框架。PEACE整合标签特定的双向对比学习(LSBC)以对齐ECG表示与诊断语义,并采用课程适应融合(CAF)以在有限的儿童监督下稳定优化。标签条件的短文本描述在训练期间提供辅助语义监督,而推理仅需ECG信号。在ZZU-pECG上,PEACE在零样本、50样本和全微调设置下分别达到宏平均AUCs为59.39%、81.74%和91.56%,优于ECG-only、多模态和通用领域适应基线,包括DANN和MMD。在PTB-XL上,经过全微调后,其在九个和谐标签上的宏平均AUC达到96.90%。基于梯度的注意力图显示在与房间相关RVH相关的QRS电压和形态区域以及与LQTS相关的QRS到T/复极化间隔区域的显著性增加,与常规解释中常见的ECG区域一致。这些结果表明,成人规模的ECG预训练结合节律、形态和ST-T复极化语义描述在标签稀缺的情况下提高了可转移的儿童诊断,同时保持了临床可解释的波形焦点。

英文摘要

Automated pediatric electrocardiogram (ECG) interpretation remains challenging because developmental differences in heart rate, intervals, and waveforms limit the transferability of models trained mainly on adult data, while expert-labeled pediatric ECG cohorts are scarce. We propose PEACE (Pediatric-Adult ECG Alignment via Cross-modal Enhancement), an adult-to-pediatric ECG transfer framework pretrained on MIMIC-IV ECGs and adapted to pediatric targets. PEACE integrates label-specific bidirectional contrastive learning (LSBC) to align ECG representations with diagnostic semantics and curriculum adaptive fusion (CAF) to stabilize optimization under limited pediatric supervision. Label-conditioned short text descriptors provide auxiliary semantic supervision during training, whereas inference requires ECG signals only. On ZZU-pECG, PEACE achieves macro-average AUCs of 59.39%, 81.74%, and 91.56% under zero-shot, 50-shot, and full fine-tuning settings, respectively, outperforming ECG-only, multimodal, and generic domain adaptation baselines including DANN and MMD. On PTB-XL, it reaches 96.90% macro-average AUC after full fine-tuning over nine harmonized labels with nonzero mapped incidence. Gradient-based attention maps show increased saliency around QRS voltage and morphology regions for chamber-related RVH and around QRS-to-T/repolarization intervals for LQTS, broadly consistent with ECG regions commonly inspected during routine interpretation. These results suggest that adult-scale ECG pretraining coupled with rhythm, morphology, and ST-T repolarization semantic descriptors improves transferable pediatric diagnosis under label scarcity while preserving clinically interpretable waveform focus.
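Since the results above are reported as macro-average AUC, a short Python sketch of that metric may help readers interpret the numbers. It uses the pairwise-ranking definition of AUC and averages over labels; skipping labels that lack either class is an assumption loosely mirroring the "nonzero mapped incidence" wording, and the data layout and function names are hypothetical, not the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(labels_by_class, scores_by_class):
    """Unweighted mean of per-label AUCs; labels with a single class are skipped."""
    aucs = [roc_auc(labels_by_class[k], scores_by_class[k]) for k in labels_by_class]
    aucs = [a for a in aucs if a is not None]
    if not aucs:
        raise ValueError("no label has both positive and negative examples")
    return sum(aucs) / len(aucs)
```

Because the mean is unweighted, a rare diagnosis counts as much as a common one, so macro-average AUC rewards models that handle low-incidence labels well.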

2605.00358 2026-06-09 cs.CL cs.CV 版本更新

From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

从反向传播到正向回放:重新审视LLM参数编辑中的目标构造

Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee

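Affiliations * University of Cambridge; University of Edinburgh

AI summary This paper revisits target construction in LLM parameter editing and proposes a simpler alternative that replaces backward spreading with forward propagation, improving the accuracy and mutual compatibility of target hidden states.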

Comments ICML 2026, code: https://github.com/jugechengzi/FE

详情
AI中文摘要

LLM参数编辑方法通常依赖于计算目标层的理想隐藏状态(称为锚点)并将其分布到多个前层(通常称为反向传播)以实现协同编辑。尽管长期广泛使用,其基础理论尚未系统研究。本文首先系统研究其基础,有助于明确其能力边界、实际考虑和潜在失败模式。然后,我们提出了一种简单优雅的替代方法,用正向传播代替反向传播。不优化最后一层的靶标,而是在第一编辑层优化锚点,然后将其传播到后续所有编辑层,以获得准确且相互兼容的目标隐藏状态。这种方法达到与现有方法相同计算复杂度,同时产生更准确的层间目标。我们的方法简单,不影响初始目标隐藏状态的计算或后续编辑流程的其他组件,因此对广泛的LLM参数编辑方法有益。

英文摘要

LLM parameter editing methods commonly rely on computing an ideal target hidden state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although this practice has been widely used for a long time, its underlying basis has not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden states for all subsequent editing layers. This approach has the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple and does not interfere with either the computation of the initial target hidden state or any other component of the subsequent editing pipeline, and thus benefits a wide range of LLM parameter editing methods.

2605.00273 2026-06-09 cs.CV cs.AI 版本更新

When Do Diffusion Models Learn to Generate Multiple Objects?

扩散模型何时学会生成多个物体?

Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

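Affiliations * University of California, Berkeley; University of Cambridge; University of Washington; University of Toronto

AI summary This study examines the limitations of diffusion models in multi-object generation, finding that scene complexity matters more than concept imbalance and that counting is harder to learn in low-data regimes.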

Comments ICML2026

详情
AI中文摘要

文本到图像的扩散模型实现了出色的视觉保真度,却在多物体生成中仍不可靠。尽管有大量实证证据表明这些失败,但其根本原因仍不清楚。我们首先探讨这种限制有多大源于数据本身。为了区分数据影响,我们考虑了不同数据集大小下的两种模式:(1)概念泛化,其中每个单独的概念在训练期间可能在不平衡的数据分布下被观察到;(2)组合泛化,其中特定的概念组合被系统性地排除。为了研究这些模式,我们引入了mosaic(多物体空间关系、属性、计数),一种受控的数据集生成框架。通过在mosaic上训练扩散模型,我们发现场景复杂性起主导作用,而非概念不平衡,并且在低数据模式中计数尤为难以学习。此外,随着训练过程中排除更多概念组合,组合泛化能力会崩溃。这些发现突显了扩散模型的根本限制,并促使更强的归纳偏见和数据设计以实现稳健的多物体组合生成。

英文摘要

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
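The compositional-generalization regime, where specific concept combinations are held out while every individual concept is still seen in training, can be sketched as a data split in Python. This is an illustration only, not the mosaic framework: it uses two concept axes (objects and attributes), and the function name and error check are assumptions.

```python
from itertools import product


def compositional_split(objects, attributes, held_out):
    """Split all (object, attribute) pairs into train and held-out test sets.

    Raises ValueError if holding out the requested pairs would remove an
    individual concept from training entirely.
    """
    held = set(held_out)
    pairs = list(product(objects, attributes))
    train = [p for p in pairs if p not in held]
    test = [p for p in pairs if p in held]
    seen_objects = {o for o, _ in train}
    seen_attributes = {a for _, a in train}
    if seen_objects != set(objects) or seen_attributes != set(attributes):
        raise ValueError("every individual concept must still appear in training")
    return train, test
```

Holding out more pairs shrinks the set of combinations seen in training while leaving each concept observable on its own, which is the axis along which the abstract reports compositional generalization collapsing.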

2604.27810 2026-06-09 cs.LG 版本更新

Hyper-Dimensional Fingerprints as Molecular Representations

超维指纹作为分子表示

Jonas Teufel, Luca Torresi, André Eberhard, Pascal Friederich

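Affiliations * Karlsruhe Institute of Technology (KIT), Institute of Nanotechnology (INT); Karlsruhe Institute of Technology (KIT), Institute of Anthropomatics and Robotics (IAR)

AI summary This paper proposes hyperdimensional fingerprints (HDF), which generate deterministic molecular representations through algebraic operations on high-dimensional vectors without training; they perform well across a range of property prediction tasks and preserve molecular similarity consistently at low dimensions.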

Comments Code: https://doi.org/10.5281/zenodo.19373621

详情
AI中文摘要

计算分子表示是虚拟筛选、性质预测和材料发现的基础。传统指纹效率高但因基于哈希的压缩丢失结构信息,特别是在低维情况下。通过图神经网络学习的表示恢复了这种表达性,但需要任务特定的训练和大量计算资源。本文引入超维指纹(HDF),用高维向量的代数运算替代消息传递神经网络的学习转换,生成无需训练的确定性分子表示。在多样化的属性预测基准上,HDF在大多数任务中优于传统指纹,且在不同数据集和模型间表现出更高的一致性。关键的是,HDF嵌入保持分子相似性:在32维时,HDF空间的距离与图编辑距离的皮尔逊相关系数达到0.9,而摩根指纹在同等尺寸下仅为0.55。这种结构保真度在低维情况下持续,允许简单的最近邻回归在64个组件中保持预测性。进一步在贝叶斯分子优化中展示了实际影响,HDF基于的替代模型在摩根指纹表现与随机搜索相当的领域中显著提高了样本效率。HDF因此提供了一种通用的、无需训练的替代方案,表明传统固定长度指纹中接受的信息损失是哈希编码方案的限制,而非指纹范式本身。

英文摘要

Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.
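The general idea of replacing learned message passing with fixed algebraic operations on high-dimensional vectors can be illustrated with a small Python sketch. This is not the HDF encoding from the paper: the bipolar atom vectors, the single round of "bundle neighbours by summation, bind by elementwise product", the 64-component size, and all names are assumptions drawn from generic hyperdimensional computing.

```python
import random

DIM = 64  # hypothetical vector size


def atom_vector(symbol, dim=DIM):
    """Deterministic bipolar (+1/-1) vector for an element symbol."""
    rng = random.Random(symbol)  # seeding with the symbol makes this repeatable
    return [rng.choice((-1, 1)) for _ in range(dim)]


def fingerprint(atoms, bonds, dim=DIM):
    """One message-passing round with no trained parameters.

    atoms: list of element symbols; bonds: list of (i, j) atom-index pairs.
    """
    base = [atom_vector(a, dim) for a in atoms]
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    total = [0] * dim
    for i, vec in enumerate(base):
        for k in range(dim):
            # Bundle: sum the neighbour vectors. Bind: multiply by the atom's own vector.
            bundled = sum(base[j][k] for j in neighbours[i])
            total[k] += vec[k] * bundled
    return total
```

The result depends only on the molecular graph, not on atom ordering, and requires no training: ethanol written as C-C-O or O-C-C yields the same vector, while dimethyl ether (C-O-C) yields a different one.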

2604.27476 2026-06-09 cs.CV 版本更新

EdgeFM: Efficient Edge Inference for Vision-Language Models

EdgeFM: 为视觉-语言模型高效边缘推理设计的框架

Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An

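Affiliations * Go Further. AI; School of Data Science, Fudan University; RUYi Dynamics Co. Ltd; Independent Researcher

AI summary EdgeFM uses a lightweight, agent-driven framework to optimize vision-language model inference for edge deployment, improving cross-platform performance and portability and achieving faster inference than conventional toolchains.

Comments Technical report version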

详情
AI中文摘要

视觉-语言模型(VLMs)在边缘工业应用中表现出强大的适用性,但其部署受到确定性低延迟和资源限制下稳定执行的严重限制。现有框架要么依赖臃肿的通用设计,要么迫使开发者进入封闭的硬件特定生态系统,导致硬件锁定和较差的跨平台适应性。观察到现代AI代理可以高效搜索和调整配置以生成高度优化的低级内核,我们提出EdgeFM,一种轻量级、代理驱动的VLM/LLM推理框架,专为跨平台工业边缘部署设计。EdgeFM通过移除非必要功能来降低单次请求延迟,并将代理调优的内核优化封装为可重用的模块化库。通过允许直接调用这些技能而不是等待封闭源代码实现,它有效缩小了长期以来由专有工具链主导的性能差距。该框架原生支持主流平台,包括x86和NVIDIA Orin SoCs,并代表了首个在国产Horizon Journey平台上的端到端VLA部署,增强了跨平台可移植性。在大多数情况下,它比传统供应商特定工具链的推理性能更优,实现NVIDIA Orin平台上比TensorRT-Edge-LLM快1.49倍的速度提升。实验结果表明,EdgeFM提供了有利的端到端推理性能,为多样化的边缘工业场景提供了开源、生产级的解决方案。

英文摘要

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to a 1.49x speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

2604.27273 2026-06-09 cs.SD 版本更新

Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?

少样本合成方言语音用于ASR微调:什么有助于什么?

Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson, Dilek Hakkani-Tür, Volodymyr Kindratenko

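Affiliations * University of Washington

AI summary This study compares how useful different kinds of synthetic accented speech are for ASR fine-tuning, finding that random phoneme perturbations recover most of the gain of targeted accent phoneme edits and that mixing real and synthetic speech stabilizes low-resource fine-tuning.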

Comments Accepted as a contributed talk and poster at the ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

合成方言语音是一种在真实方言录音稀缺时提升自动语音识别(ASR)性能的有希望的方法。我们探讨了什么使此类数据对ASR微调有用:目标方言音素编辑暴露识别器于方言特定发音,或随机音素扰动在音素空间中充当增强。在少样本TTS流程中,我们比较了LLM生成的方言编辑与匹配速率的随机替换和oracle控制,使用真实方言音素和语调。随机替换恢复了大部分ASR增益:LLM目标方言编辑仅比随机替换略好,真实音素接近随机基线并随着合成ASR微调集增大接近它,添加真实语调仅带来小幅增益。混合合成与真实方言语音也稳定了低资源微调,但固定合成预算后期会稀释真实数据信息,显示真实-合成比例的重要性。

英文摘要

Synthetic accented speech is a promising way to improve automatic speech recognition (ASR) when real accented recordings are scarce. We ask what makes such data useful for ASR fine-tuning: target-accent phoneme edits that expose the recognizer to accent-specific pronunciations, or random phoneme perturbations that act as augmentation in phoneme space. In a few-shot TTS pipeline, we compare LLM-generated accent edits with matched-rate random substitutions and oracle controls using ground-truth accented phonemes and prosody. Random substitutions recover much of the ASR gain: LLM target-accent edits improve over random by only a small margin, ground-truth phonemes stay close to the random baseline and nearly converge with it as the synthetic ASR fine-tuning set grows larger, and adding ground-truth prosody yields only a modest further gain. Mixing synthetic with real accented speech also stabilizes low-resource fine-tuning, but a fixed synthetic budget can later dilute the information in real data, showing that the real--synthetic ratio matters.

2604.26985 2026-06-09 cs.LG cs.AI 版本更新

Simple Self-Conditioning Adaptation for Masked Diffusion Models

简单自条件适应用于掩码扩散模型

Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

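Affiliations * University of Virginia

AI summary This paper proposes a simple and effective post-training adaptation that uses self-conditioned prediction to improve the generative ability of masked diffusion models, reducing generative perplexity and improving image synthesis and molecular generation quality.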

详情
AI中文摘要

掩码扩散模型(MDMs)通过迭代去噪在吸收掩码过程中生成离散序列。在标准掩码扩散中,如果一个token在反向更新后仍被掩码,模型会丢弃该位置的干净状态预测。因此,仍被掩码的位置必须反复从掩码token本身推断。这种设计限制了跨步骤的细化。为解决这一限制,本文提出了一种简单但有效的后训练适应方法,使每个去噪步骤都基于模型自身之前的干净状态预测。所提出的方法称为自条件掩码扩散模型(SCMDM),需要最小的架构更改,不引入递归的潜在状态路径,不依赖辅助参考模型,并在采样过程中不增加额外的去噪器评估。这与部分自条件方法形成重要区别,后者需要昂贵的从头模型训练。特别是,本文表明,在后训练阶段,部分自条件,包括用于从头训练自条件模型的常用50% dropout策略,是次优的。相反,一旦模型自生成的干净状态估计变得有信息,专业化于细化优于混合条件和无条件目标。SCMDM在多个领域进行了评估,显示出对普通MDM基线的一致改进,实现了在OWT训练模型上的生成困惑度几乎减少50%(从42.89到23.72),同时在离散图像合成质量、小分子生成和基因组分布建模的保真度方面也取得了显著改进。

English abstract

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple yet effective post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, specializing in refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains and shows consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small-molecule generation, and fidelity in genomic distribution modeling.
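The data flow described above can be sketched with a toy sampler. This is a sketch under stated assumptions, not the SCMDM implementation: `toy_denoiser` is a random stand-in for a trained network, the even unmasking schedule is arbitrary, and the way the previous prediction is consumed (a simple average) is purely illustrative. What the sketch does show is that the previous clean-state prediction is kept and fed back for still-masked positions, and that the denoiser is called once per step with or without self-conditioning.

```python
import random

MASK = -1   # sentinel for a masked position (hypothetical encoding)
VOCAB = 4   # toy vocabulary size


def toy_denoiser(tokens, prev_pred, rng):
    """Stand-in for a trained network: one distribution over VOCAB per position.

    prev_pred (or None) is the self-conditioning input. Here it is simply
    averaged with a fresh random guess, which is NOT how a real model uses it.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            dist = [1.0 if v == tok else 0.0 for v in range(VOCAB)]
        else:
            raw = [rng.random() for _ in range(VOCAB)]
            total = sum(raw)
            dist = [r / total for r in raw]
            if prev_pred is not None:
                dist = [(a + b) / 2 for a, b in zip(dist, prev_pred[i])]
        out.append(dist)
    return out


def sample(length, steps, self_condition, seed=0):
    """Unmask a fully masked sequence in `steps` reverse updates."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    prev_pred = None
    calls = 0
    for step in range(steps):
        pred = toy_denoiser(tokens, prev_pred if self_condition else None, rng)
        calls += 1  # exactly one denoiser evaluation per step in both modes
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_unmask = -(-len(masked) // (steps - step))  # ceiling division
        for i in rng.sample(masked, n_unmask):
            tokens[i] = rng.choices(range(VOCAB), weights=pred[i])[0]
        # A standard MDM would discard `pred` for positions that stay masked;
        # with self-conditioning it is carried into the next step instead.
        prev_pred = pred
    return tokens, calls


tokens_sc, calls_sc = sample(length=8, steps=4, self_condition=True)
tokens_plain, calls_plain = sample(length=8, steps=4, self_condition=False)
```

Both runs make the same number of denoiser calls, which mirrors the claim that self-conditioning adds no extra denoiser evaluations during sampling.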

2604.25781 2026-06-09 cs.CV cs.GR Updated version

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Sketch2Arti:基于草图的CAD物体关节建模

Yi Yang, Hao Pan, Yijing Cui, Alla Sheffer, Changjian Li

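Affiliations: University of Edinburgh; Tsinghua University; University of British Columbia

AI summary: Presents Sketch2Arti, a system that uses 2D sketches drawn by the user from a chosen viewpoint to automatically discover the movable parts of a CAD model and predict their motion parameters. It supports iterative modeling of multiple articulations on complex objects, requires no category information, and generalizes to diverse objects.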

Comments Project page: https://arlo-yang.github.io/Sketch2Arti

Details
AI-generated Chinese abstract

关节建模旨在推断3D物体的可动部件及其运动参数,实现交互式动画、模拟和形状编辑。本文提出Sketch2Arti,首个基于草图的CAD物体关节建模系统。我们的关键观察是,设计师通过轻量级草图(如箭头和笔画)自然地传达关节意图,指示部件应如何移动,但将这些草图转化为关节3D模型仍主要依赖手动操作。Sketch2Arti通过允许用户从选定视角绘制简单2D草图来指定关节,弥合了这一差距。给定CAD模型和用户草图,我们的方法自动发现对应的可动部件并预测其运动参数,支持对复杂物体进行多个关节的迭代建模,并实现精细控制。重要的是,Sketch2Arti以类别无关的方式训练,无需物体类别信息,从而对现有关节数据集之外的多样物体具有强泛化能力。此外,对于缺乏内部结构的壳体模型,Sketch2Arti支持由用户草图引导的可控内部补全,生成与现有几何和预测运动约束一致的合理内部组件。综合实验和用户评估证明了Sketch2Arti的有效性、可控性和泛化性。代码、数据集和原型系统见https://arlo-yang.github.io/Sketch2Arti。

English abstract

Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and prototype system are available at https://arlo-yang.github.io/Sketch2Arti.
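The abstract does not spell out which motion parameters Sketch2Arti predicts. As a hedged illustration of what such parameters can look like, the Python sketch below assumes a hinge-like (revolute) joint described by a pivot point and an axis direction, and applies a rotation to a point on the movable part using Rodrigues' rotation formula. The class `RevoluteJoint` and the function `rotate_point` are hypothetical names, not part of the Sketch2Arti code.

```python
import math
from dataclasses import dataclass


@dataclass
class RevoluteJoint:
    pivot: tuple  # a point on the rotation axis
    axis: tuple   # direction of the axis (need not be unit length)


def rotate_point(p, joint, angle):
    """Rotate point p about the joint axis by `angle` radians (Rodrigues)."""
    norm = math.sqrt(sum(a * a for a in joint.axis))
    k = tuple(a / norm for a in joint.axis)              # unit axis
    v = tuple(pi - ci for pi, ci in zip(p, joint.pivot))  # pivot-relative
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    cross = (
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    )
    dot = sum(ki * vi for ki, vi in zip(k, v))
    rotated = tuple(
        v[i] * cos_a + cross[i] * sin_a + k[i] * dot * (1 - cos_a)
        for i in range(3)
    )
    return tuple(r + c for r, c in zip(rotated, joint.pivot))


# Hypothetical door hinge: vertical axis through (1, 0, 0), opened by 90 degrees.
hinge = RevoluteJoint(pivot=(1.0, 0.0, 0.0), axis=(0.0, 0.0, 2.0))
moved = rotate_point((2.0, 0.0, 0.0), hinge, math.pi / 2)
```

Applying `rotate_point` to every vertex of a discovered movable part would animate that part about the hinge. A sliding (prismatic) part would instead be translated along its axis.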