arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15690 2026-06-09 cs.LG Updated version

FRWKV+: Periodic-Aware Adaptive Gating for Frequency-Space Linear Time Series Forecasting

Qingyuan Yang, Dongyue Chen, Da Teng, Junhua Xiao, Jiaji Pan, Shizhuo Deng

Affiliations: College of Information Science and Engineering, Northeastern University; Foshan Graduate School of Innovation, Northeastern University; National Frontiers Science Center for Industrial Intelligence and Systems Optimization

AI summary: This paper proposes the FRWKV-Plus model, which introduces a cross-branch spectral gate and a trust-gated residual correction to improve the accuracy and efficiency of frequency-space time series forecasting; experiments show strong performance on multiple benchmark datasets.

Abstract

Accurate and efficient long-term multivariate time series forecasting requires capturing recurring temporal structure while keeping inference cheap across many variables and horizons. Frequency-space models represent long-range and periodic variation compactly, but they typically process the real and imaginary spectral components as weakly coupled streams and treat periodic cues as ordinary input features, even when such cues are unreliable. This paper proposes FRWKV-Plus, a lightweight periodic-aware frequency-space forecasting model built on the efficient FRWKV backbone. FRWKV-Plus introduces a cross-branch spectral gate that reweights each spectral branch using a summary of its sibling branch, and a trust-gated residual correction that converts compact within-period context into a bounded, sign-flexible adjustment of these gates under a learned, data-dependent trust score. By construction, the correction is identity-preserving at initialization and strictly bounded, so periodic evidence can refine but never dominate or invert the base interaction. On seven standard benchmarks, FRWKV-Plus is consistently competitive with strong linear, frequency-domain, recurrent-style, and Transformer-based forecasters while preserving the lightweight profile of the backbone. Controlled three-seed ablations show that each component contributes, that the benefit is modest on strongly periodic data and pronounced on the harder Exchange and ILI datasets, and that the within-period context is the most influential single component. The implementation is publicly available at https://github.com/yangqingyuan-byte/FRWKV-plus.
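
The abstract describes the correction only by its properties: identity-preserving at initialization, strictly bounded, and sign-flexible. The following is a minimal sketch of one functional form with those properties. The function name, the `tanh`/sigmoid parameterization, and the `bound` parameter are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def corrected_gate(base_gate: float, context_logit: float,
                   trust_logit: float, bound: float = 0.5) -> float:
    """Hypothetical trust-gated, bounded correction of a base gate value.

    context_logit: scalar summary of within-period context (assumed input).
    trust_logit:   scalar producing a data-dependent trust score (assumed input).
    bound:         maximum relative change; must be < 1 so the sign never flips.
    """
    if not 0.0 <= bound < 1.0:
        raise ValueError("bound must lie in [0, 1)")
    trust = sigmoid(trust_logit)                           # in (0, 1)
    adjustment = bound * trust * math.tanh(context_logit)  # in (-bound, bound)
    return base_gate * (1.0 + adjustment)


# With a zero context logit (e.g. a zero-initialized projection),
# the correction leaves the base gate untouched.
unchanged = corrected_gate(0.8, context_logit=0.0, trust_logit=3.0)
```

Because the multiplicative factor stays within `1 - bound` and `1 + bound`, the periodic signal can raise or lower the gate but cannot invert it, which mirrors the "refine but never dominate or invert" wording of the abstract.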

2605.15491 2026-06-09 cs.LG cs.AI cs.PF Updated version

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee

Affiliations: University of Southern California; Inha University

AI summary: This paper proposes Ghosted Layers, which resolves the activation-distribution mismatch after layer pruning through unconstrained optimization, improving LLM accuracy and perplexity without sacrificing efficiency.

Abstract

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning. Official code repository: https://github.com/daniel-eai/ghosted_layers_official_repository/.
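
To illustrate the difference between an unconstrained and a constrained linear operator fitted on a calibration set, here is a toy sketch using ordinary least squares. The synthetic data, the Frobenius-norm objective, and the scalar-multiple-of-identity operator used as the "constrained" example are assumptions made for illustration; the abstract does not specify the paper's exact objective or the baselines' operator families.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 8  # calibration tokens and hidden size (toy values)

# X: hidden states entering the pruned span; Y: states the next surviving
# layer was trained to receive. Both are synthetic stand-ins here.
X = rng.normal(size=(n, d))
A = 0.3 * rng.normal(size=(d, d))
Y = X + X @ A + 0.01 * rng.normal(size=(n, d))

# Unconstrained optimum over all d x d matrices W of ||X W - Y||_F.
W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Constrained alternative (illustrative): W = alpha * I with a single scalar.
alpha = float(np.sum(X * Y) / np.sum(X * X))

err_full = float(np.linalg.norm(X @ W_full - Y))
err_scalar = float(np.linalg.norm(alpha * X - Y))
```

Since every scalar multiple of the identity is also a valid full matrix, `err_full` can never exceed `err_scalar`; the gap between the two is what a restricted operator subspace leaves on the table in this toy setting.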

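2605.15466 2026-06-09 cs.CV Updated version

Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

Santosh Kumar Paidi

Affiliations: Genentech, Inc.

AI summary: This paper proposes IA-JEPA, which uses a self-supervised, motion-centric masking strategy to prioritize physical interactions, improving accuracy on causal reasoning tasks, and validates its generalization on real-world actions and physical puzzles.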

Comments
12 pages, 4 figures

Abstract

Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.

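2605.15416 2026-06-09 cs.LG cs.AI Updated version

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang

Affiliations: arXiv.org

AI summary: This paper proposes a margin-based confidence ranking method that learns a dedicated confidence estimator to improve the agreement of LLM judgments with human judgments. Using simulated annotator diversity and a margin-based ranking formulation, it explicitly models how confidently an LLM distinguishes human-agreement from human-disagreement cases, and it derives generalization guarantees for the estimator.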

Comments
Accepted to ICML 2026

Abstract

Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
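
A margin-based ranking formulation can be sketched as a pairwise hinge loss: the estimator is penalized whenever a human-agreement case does not outscore a human-disagreement case by at least a margin. The function below is a generic pairwise hinge loss; its name, the fixed `margin`, and the averaging over all pairs are assumptions, and the paper's adaptive training procedure is not reproduced here.

```python
def margin_ranking_loss(agree_scores, disagree_scores, margin=0.5):
    """Mean pairwise hinge loss.

    agree_scores:    confidence scores for cases where the LLM matches humans.
    disagree_scores: confidence scores for cases where it does not.
    A pair contributes zero loss once agree - disagree >= margin.
    """
    pairs = [(a, d) for a in agree_scores for d in disagree_scores]
    if not pairs:
        raise ValueError("need at least one score in each group")
    return sum(max(0.0, margin - (a - d)) for a, d in pairs) / len(pairs)


well_separated = margin_ranking_loss([0.9, 0.8], [0.1, 0.2])
misordered = margin_ranking_loss([0.4], [0.6])
```

A larger margin forces a cleaner separation between the two groups but is harder to satisfy on the training set, which is one intuitive reading of the margin-dependent trade-off the abstract mentions.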

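2605.14531 2026-06-09 cs.CL Updated version

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Affiliations: University of Science and Technology of China

AI summary: This paper reformulates language generation as a stochastic optimal control problem, analyzes autoregressive and diffusion models from a unified theoretical perspective, explains their limitations, and proposes a closed-loop controller based on Flow Matching for efficient text generation.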

Abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

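2604.26498 2026-06-09 cs.LG q-bio.QM Updated version

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Jinjiang Guo, Sheng Ding

Affiliations: Global Health Drug Discovery Institute; School of Pharmaceutical Sciences

AI summary: Evaluating 26 ADME, toxicity, and bioactivity endpoints, this paper finds that classical machine learning takes the largest share of best-performing entries (47.4%), that GNN and pretrained sequence models are competitive only on selected harder splits, and that performance depends on the fit among model, task, and validation scenario rather than on scale alone.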

Comments
Improved benchmark design and reproducibility, replaced restricted datasets with public benchmarks in primary analyses, and added sensitivity analyses supporting the interpretation of model scaling and evaluation protocol effects in molecular prediction

Abstract

The rapid growth of molecular foundation models and large language models (LLMs) has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models. We test this assumption across 26 ADME, toxicity and bioactivity endpoints, covering 165,541 endpoint-level compound label records. The benchmark contains 78 endpoint and split entries evaluated under random, Murcko scaffold and structure-separated 5-fold cross-validation protocols, representing increasing chemical generalization difficulty. Across 156 task and metric comparisons, classical machine learning (ML) provides the largest share of best-performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (21.8%) and LLM-based SAR baselines (1.9%). Classical ML dominates random-split interpolation and remains the largest winner family overall. GNN and sequence models are competitive in selected harder splits, but their strict winner shares decrease under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between individual models should not be read as decisive victories. SAR knowledge from training folds improves GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
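
As a quick arithmetic check, the reported winner shares can be converted back into whole-number win counts out of the 156 comparisons. The counts below are inferred from the rounded percentages, under the assumption that each comparison has exactly one winning family; they are not stated in the abstract.

```python
# Shares of best-performing entries as reported in the abstract (percent).
shares_percent = {
    "classical ML": 47.4,
    "pretrained sequence models": 28.8,
    "graph neural networks": 21.8,
    "LLM-based SAR baselines": 1.9,
}
total_comparisons = 156

# Inferred integer win counts (assumption: one winner per comparison).
counts = {
    name: round(total_comparisons * pct / 100)
    for name, pct in shares_percent.items()
}

# Converting the counts back to percentages should reproduce the reported shares.
roundtrip = {
    name: round(100 * n / total_comparisons, 1)
    for name, n in counts.items()
}
```

Under that assumption the counts come out as 74, 45, 34, and 3, which sum to 156 and round back to the published percentages. Note also that 26 endpoints times 3 split protocols gives the 78 entries mentioned in the abstract.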

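2605.13768 2026-06-09 cs.LG cs.AI cs.IT math.IT Updated version

High-Rate Quantized Matrix Multiplication II

Or Ordentlich, Yury Polyanskiy

Affiliations: Hebrew University of Jerusalem; MIT

AI summary: This paper studies high-rate quantized matrix multiplication when the covariance matrix of the columns of the second factor is known, uses waterfilling to improve LLM quantization methods, and characterizes the performance of the WaterSIC scheme relative to the information-theoretic limit.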

Abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $\Sigma_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
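
Two quantities in this abstract are easy to reproduce numerically. The sketch below implements classical reverse waterfilling for independent Gaussian components under a total squared-error budget, and converts the factor $\frac{2\pi e}{12}$ into bits. The function name and the bisection search are illustrative choices; the paper's weighted (WMSE) setting and the WaterSIC scheme itself are not reproduced here.

```python
import math


def reverse_waterfilling(variances, total_distortion):
    """Classical reverse waterfilling for independent Gaussian components.

    Finds the water level theta with sum(min(theta, var_i)) == total_distortion
    and returns (theta, per-component rates in bits).
    """
    if not 0.0 < total_distortion <= sum(variances):
        raise ValueError("distortion budget must lie in (0, sum(variances)]")
    lo, hi = 0.0, max(variances)
    for _ in range(200):  # bisection on the monotone water-level equation
        theta = 0.5 * (lo + hi)
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


theta, rates = reverse_waterfilling([4.0, 1.0, 0.25], total_distortion=0.75)

# A distortion factor of 2*pi*e/12 corresponds to this many bits per entry.
gap_bits = 0.5 * math.log2(2 * math.pi * math.e / 12)
```

In this toy example the high-variance coordinate receives the most rate and the lowest-variance coordinate receives none, in contrast to an equal allocation. The value of `gap_bits` is about 0.25, consistent with the "0.25 bit/entry" figure quoted in the abstract.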

2605.11212 2026-06-09 cs.CL Updated version

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

Affiliations: University of British Columbia; Microsoft Research

AI summary: ReVision removes redundant visual patches, reducing token usage while improving success rate, so that agents can process longer trajectories.

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, using history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed.
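
The idea of dropping patches that did not change between consecutive screenshots can be sketched with a fixed threshold on patch differences. This is a simplified stand-in: ReVision uses a learned patch selector, whereas the function below, its name, and its `threshold` parameter are assumptions for illustration only.

```python
def select_patches(frames, threshold=0.1):
    """Return, per frame, the indices of patches worth keeping.

    frames: list of frames; each frame is a list of patch feature vectors.
    The first frame keeps every patch. Later frames keep a patch only if its
    mean absolute difference from the same position in the previous frame
    exceeds the threshold. Indices are kept so positions stay recoverable.
    """
    if not frames:
        return []
    kept = [list(range(len(frames[0])))]
    for prev, cur in zip(frames, frames[1:]):
        changed = []
        for idx, (p, c) in enumerate(zip(prev, cur)):
            diff = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
            if diff > threshold:
                changed.append(idx)
        kept.append(changed)
    return kept


frames = [
    [[0, 0], [1, 1], [2, 2]],
    [[0, 0], [1, 1], [5, 5]],  # only patch 2 changed
    [[0, 0], [9, 9], [5, 5]],  # only patch 1 changed
]
kept = select_patches(frames)
tokens_kept = sum(len(k) for k in kept)
```

Here 5 of the 9 patches survive, so a three-screenshot history costs little more than a single full screenshot. Returning indices rather than compacted arrays is one simple way to keep each retained patch tied to its spatial position.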

2605.12213 2026-06-09 cs.AI Updated version

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, Scott Sanner

Affiliations: University of Toronto; Vector Institute for Artificial Intelligence

AI summary: This paper proposes the Goal-Mem framework, which uses goal-oriented reasoning to improve RAG-based memory on complex tasks, with particularly strong gains on multi-hop reasoning and implicit inference.

Abstract

LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent's ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user's utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
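
Backward chaining from a goal to atomic subgoals can be shown with a small symbolic toy. Goal-Mem operates over natural-language statements with an LLM and a retrieval step; the rule table, fact names, and function below are hypothetical and only illustrate the control flow of decomposing a goal and reporting which subgoals memory cannot yet satisfy.

```python
def backward_chain(goal, rules, memory, _stack=frozenset()):
    """Return (proved, missing) for a goal.

    rules:  maps a goal to the list of subgoals that together establish it.
    memory: set of facts already stored.
    missing lists the atomic subgoals that are neither in memory nor derivable,
    i.e. what a targeted retrieval step would look for next.
    """
    if goal in memory:
        return True, []
    if goal in _stack or goal not in rules:  # cycle, or atomic and absent
        return False, [goal]
    missing = []
    for sub in rules[goal]:
        ok, gaps = backward_chain(sub, rules, memory, _stack | {goal})
        if not ok:
            missing.extend(gaps)
    return (not missing), missing


# Hypothetical rule table and memory contents.
rules = {
    "can_recommend_restaurant": ["knows_diet", "knows_city"],
    "knows_diet": ["said_vegetarian"],
}
memory = {"said_vegetarian"}

first_try = backward_chain("can_recommend_restaurant", rules, memory)
second_try = backward_chain("can_recommend_restaurant", rules, memory | {"knows_city"})
```

The first call fails and names `knows_city` as the gap, which is the piece of information a follow-up retrieval would target; once that fact is in memory, the same goal is proved.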

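2605.11855 2026-06-09 cs.LG cs.AI cs.AR Updated version

Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications

Julien Brandoit, Arthur Fyon, Damien Ernst, Guillaume Drion

Affiliations: University of Cambridge

AI summary: This paper proposes the CMRU and αCMRU, whose cumulative update formulation restores gradient flow while preserving persistent memory, improving convergence stability and reducing initialization sensitivity; they perform well across diverse benchmarks, especially on tasks requiring discrete long-range retention.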

Comments
Accepted as a spotlight at ICML 2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9

Abstract

Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the $α$CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and $α$CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.

2605.08384 2026-06-09 cs.CL Updated version

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

Florian Hönicke, Michael Günther, Andreas Koukounas, Mohammad Kalim Akram, Scott Martens, Saba Sturua, Han Xiao

Affiliations: Jina by Elastic

AI summary: This paper proposes GELATO, which builds multimodal embeddings from frozen aligned towers, producing a unified semantic space with efficient training while keeping text embeddings unchanged.

Comments
11 pages, 9 figures, 5 tables

Abstract

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modalities by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.

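2605.11502 2026-06-09 cs.CL Updated version

Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations

Shufan Ming, Joe D. Menke, Neil R. Smalheiser, Halil Kilicoglu

Affiliations: arXiv.org cs.CL

AI summary: This paper proposes an evaluation framework based on controlled semantic perturbations and uses entity masking and domain-adversarial training to improve the robustness of biomedical publication type classification, finding that selectively suppressing non-task-defining features mitigates the trade-off between robustness and in-domain accuracy.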

Comments
Accepted by IEEE ICHI 2026

Abstract

Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI

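2605.11484 2026-06-09 cs.AI Updated version

Engagement Process: Rethinking the Temporal Interface of Action and Observation

Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

Affiliations: University of Science and Technology of China

AI summary: This paper proposes the Engagement Process (EP) model, whose explicit temporal interface handles actions and observations that interact on different time scales, supports multi-rate coordination and subsystem composition, exposes hidden temporal behaviors, and lets policies adapt under explicit time costs.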

Abstract

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation-action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action-observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.

2601.23286 2026-06-09 cs.CV cs.AI cs.LG Updated version

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

Affiliations: University of California, Berkeley

AI summary: VideoGPA improves the 3D consistency of video generation by distilling geometry priors: a data-efficient self-supervised framework guides video diffusion models, significantly improving temporal stability, geometric plausibility, and motion coherence.

Comments
8 pages, 5 figures, ICML 2026

Abstract

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
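
For readers unfamiliar with Direct Preference Optimization, the generic DPO objective on a single preference pair can be written in a few lines. This sketch uses plain sequence log-likelihoods; how VideoGPA adapts the objective to video diffusion models and how the geometry foundation model scores the pairs are not described in the abstract, so the function name and all inputs here are illustrative assumptions.

```python
import math


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    Arguments are log-likelihoods of the preferred ("win") and rejected
    ("lose") samples under the trained policy and a frozen reference model.
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy raises the preferred sample and lowers the rejected one.
improved = dpo_loss(-5.0, -9.0, -7.0, -7.0)
```

The loss starts at log 2 when the policy equals the reference and falls as the policy favors the preferred sample more strongly than the reference does. In a setup like the one described, the preferred sample of each pair would be the more geometrically consistent video.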

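2412.01324 2026-06-09 cs.RO Updated version

Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control

Kai Pfeiffer, Quan Zhang, Yuqing Chen, Gordon Boateng, Yuquan Wang, Vincent Bonnet, Abderrahmane Kheddar

Affiliations: University of Cambridge

AI summary: This paper proposes an efficient nonlinear programming framework that integrates hierarchical decision-making with whole-body inverse kinematic planning and control, addressing problems such as the simultaneous selection of end-effector locations during inverse kinematic planning.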

Comments
Accepted paper to "Robotics: Science and Systems" (2026)

Abstract

This work presents a novel and efficient nonlinear programming framework that tightly integrates hierarchical decision-making with whole-body inverse kinematic planning and control. Decision-making plays a central role in many aspects of robotics, from sparse inverse kinematic control with a minimal number of joints, to inverse kinematic planning while simultaneously selecting a discrete end-effector location from multiple candidates. Current approaches often rely on heavy computations using mixed-integer nonlinear programming, separate decision-making from inverse kinematics (sometimes approximated by reachability methods), or employ efficient but less versatile $\ell_1$-norm formulations of linear sparse programming, without addressing the underlying nonlinear problem formulations. In contrast, the proposed sparse hierarchical nonlinear programming solver is efficient, versatile, and accurate by exploiting sparse hierarchical structure and leveraging the $\ell_0$-norm, which is rarely used in robotics. The solver efficiently tackles complex nonlinear hierarchical decision-making problems previously unaddressed in the literature, such as inverse kinematic planning with simultaneous prioritized selection of end-effector locations from a large set of candidates, or inverse kinematic control with simultaneous selection of bi-manual grasp locations on a randomly rotated box.

2605.08876 2026-06-09 cs.LG 版本更新

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

OTora:一种用于LLM代理推理层面拒绝服务攻击的统一红队框架

Xinyu Li, Ronghui Mu, Lin Li, Tianjin Huang, Gaojie Jin

发表机构 * arXiv.org cs.LG(计算机科学与应用数学)

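AI summary: OTora is the first unified two-stage red-teaming framework for reasoning-level denial-of-service attacks. It optimizes an adversarial trigger and generates agent-aware reasoning payloads that inflate reasoning-token counts and latency while preserving task accuracy.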

详情
Comments
Accepted to ICML 2026
AI中文摘要

OTora是一种用于LLM代理推理层面拒绝服务攻击的统一红队框架。大型语言模型(LLMs)正越来越多地被部署为能够执行工具增强的多步骤任务的自主代理,其中延迟是实际应用中的关键因素。然而,一个被忽视的威胁是推理层面拒绝服务(R-DoS),攻击者通过增加代理的推理深度或工具使用预算来降低可用性,同时保持任务正确性。我们介绍了OTora,这是首个统一的两阶段红队框架,用于实现R-DoS攻击。第一阶段优化了对抗触发器,通过插入意识评分和动态目标共进化,诱导定向工具调用,支持黑盒和白盒环境。第二阶段通过ICL引导的遗传搜索生成代理感知的推理负载,放大过度思考的同时保持正确的任务结果。在WebShop、Email和OS代理上,基于多种基础模型如LLaMA-70B和GPT-OSS-120B,OTora实现了推理token数量增加10倍和延迟减慢数量级,同时保持接近基线的任务准确性。最后,我们讨论了检测和限制异常推理和延迟峰值的缓解策略。代码可在https://github.com/llm2409/OTora上获得。

英文摘要

Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute tool-augmented, multi-step tasks, where latency is a critical factor for real-world applications. Yet an overlooked threat is Reasoning-Level Denial-of-Service (R-DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent's reasoning depth or tool-use budget. We introduce OTora, the first unified, two-stage red-teaming framework for instantiating R-DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion-aware scoring and dynamic target co-evolution, supporting both black-box and white-box settings. Stage II generates agent-aware reasoning payloads via an ICL-guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA-70B and GPT-OSS-120B, OTora achieves up to 10 times increases in reasoning tokens and order-of-magnitude latency slowdowns, all while preserving near-baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes. The code is available at https://github.com/llm2409/OTora.
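The abstract closes by mentioning mitigations that detect abnormal reasoning and latency spikes. A minimal sketch of one such check is given below, assuming a trusted sample of benign reasoning-token counts is available. The function name, the median-based threshold, and the `ratio` parameter are illustrative assumptions, not the authors' method.

```python
from statistics import median

def flag_reasoning_spikes(baseline_tokens, observed_tokens, ratio=3.0):
    """Return indices of requests whose reasoning-token count exceeds
    `ratio` times the median of a trusted baseline sample.

    Hypothetical helper: the threshold rule is an assumption for illustration.
    """
    if not baseline_tokens:
        raise ValueError("baseline_tokens must not be empty")
    threshold = ratio * median(baseline_tokens)
    return [i for i, n in enumerate(observed_tokens) if n > threshold]

baseline = [400, 520, 480, 450, 510]   # hypothetical benign token counts
observed = [470, 4900, 530, 1600]      # hypothetical live traffic
spikes = flag_reasoning_spikes(baseline, observed)
```

A median is used rather than a mean so that a few inflated runs in the baseline do not raise the threshold much.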

2605.04913 2026-06-09 cs.CL cs.LG 版本更新

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

重新思考局部学习:一种更便宜更快的LLM后训练配方

Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su

发表机构 * Independent Researcher(独立研究者) D 4 Lab(D4实验室) Southeast University(东南大学)

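AI summary: This paper proposes LoPT, a local-learning post-training strategy that places a gradient boundary at the transformer midpoint, lowering memory cost, improving training efficiency, and retaining pretrained capabilities.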

详情
Comments
35 pages
AI中文摘要

LLM后训练通常通过完整深度传播任务梯度。尽管这种端到端结构简单通用,但将其任务适应与完整深度激活存储、长距离反向依赖和直接任务梯度访问预训练表示耦合在一起。我们主张这种完整深度反向耦合可能不必要的昂贵和侵入性,尤其是在后训练监督远比预训练狭窄时。为此,我们提出LoPT:局部学习后训练,一种简单的后训练策略,使梯度达到成为显式设计选择。LoPT在transformer中点放置单一梯度边界:后半部分块从任务目标学习,而前半部分块通过轻量级特征重建目标进行更新,以保留有用的表示并保持接口兼容性。LoPT缩短了任务引起的反向路径,同时限制了狭窄任务梯度对早期层表示的直接干扰。大量实验表明,LoPT在较低的内存成本、较高的训练效率和更好的保留预训练能力方面实现了竞争性性能。我们的代码可在:https://github.com/HumyuShi/LoPT获取。

英文摘要

LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
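One way to read the midpoint gradient boundary described above is as two objectives separated by a stop-gradient. The notation below is an illustrative assumption: $f_1$ and $f_2$ stand for the first and second half of the transformer, $\operatorname{sg}$ for stop-gradient, and $\mathcal{L}_{\text{rec}}$ for the feature-reconstruction objective, whose exact form the abstract does not specify.

```latex
\[
h = f_{1}(x;\theta_{1}),
\qquad
\hat{y} = f_{2}\bigl(\operatorname{sg}(h);\theta_{2}\bigr)
\]
\[
\mathcal{L}_{2}(\theta_{2}) = \mathcal{L}_{\text{task}}(\hat{y}, y),
\qquad
\mathcal{L}_{1}(\theta_{1}) = \mathcal{L}_{\text{rec}}(h)
\]
```

Under this reading, the task gradient never reaches $\theta_1$, so the backward pass for the task loss covers only the second half of the network.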

2605.06582 2026-06-09 cs.LG cs.CL cs.SD 版本更新

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

PairAlign:一种通过自对齐的序列标记化框架及其在音频标记化中的应用

Adhiraj Banerjee, Vipul Arora

发表机构 * Department of Electrical Engineering, Indian Institute of Technology, Kanpur(电子工程系,印度理工学院,坎浦尔)

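AI summary: PairAlign achieves compact audio tokenization through sequence-level self-alignment, treating tokenization as conditional sequence generation to improve token consistency, length control, and edit similarity.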

详情
Comments
57 pages main content, 109 total pages, 9 Figures, pre-print, Under Review
AI中文摘要

许多感官数据的操作——比较、记忆、检索和推理——自然地在离散符号结构上表达。在语言中,这种接口由标记提供;在音频中,必须学习。现有音频标记器依赖于量化、聚类或编解码器重建,将标记局部分配,因此序列一致性、紧凑性、长度控制、终止和编辑相似性很少被直接优化。我们引入PairAlign,一种通过序列级自对齐实现紧凑音频标记化的框架。PairAlign将标记化视为条件序列生成:编码器将语音映射为连续条件,自回归解码器从BOS开始生成标记,学习标记身份、顺序、长度和EOS位置。给定两个保持内容的视图,每个视图的序列在另一个视图的表示下被训练为可能,而无关示例提供竞争序列。这为可扩展的编辑距离保留代理,同时抑制许多对一的坍缩。PairAlign从VQ式标记化开始,并通过EMA教师目标、交叉配对教师强制、前缀损坏、似然对比和长度控制进行优化。在3秒语音上,PairAlign学习紧凑、非退化的序列,具有广泛的词汇使用和强跨视图一致性。在检索测试中,它保留编辑距离搜索,同时将存档标记数量减少55%。连续扫频探针显示其局部重叠低于密集几何标记器,但具有更强的长度控制和在100毫秒移位下的受约束编辑轨迹。PairAlign是一种序列符号预测学习者:像JEPA式目标一样,它从另一个视图预测一个抽象目标作为学习的可变长度符号序列,而不是连续潜在变量。

英文摘要

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.
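Edit distance between token sequences is the quantity the abstract says retrieval should preserve. A minimal sketch of the standard Levenshtein distance over token ids follows; the two example sequences are hypothetical and the code is not drawn from the PairAlign implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

view_a = [12, 7, 7, 31, 5]   # hypothetical token ids for one view
view_b = [12, 7, 31, 5, 9]   # hypothetical token ids for a shifted view
distance = edit_distance(view_a, view_b)
```

Shorter sequences make this quadratic-time comparison cheaper, which is one reason compact tokenization matters for edit-distance search.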

2605.05136 2026-06-09 cs.CV 版本更新

CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization

CPCANet:基于共同主成分分析的深度展开方法用于领域泛化

Yu-Hsi Chen, Abd-Krim Seghouane

发表机构 * The University of Melbourne(墨尔本大学)

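AI summary: This paper proposes CPCANet, which unrolls the Flury-Gautschi algorithm for common principal component analysis into differentiable layers to improve domain generalization, reaching state-of-the-art zero-shot transfer on four benchmarks.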

详情
Comments
9 pages, 5 tables
AI中文摘要

领域泛化(DG)旨在学习在分布外转移下仍具鲁棒性的表示,并有效推广到未见目标领域。尽管最近的不变学习策略和架构进步已取得良好性能,但通过二阶统计显式发现结构化的领域不变子空间仍被忽视。本文提出CPCANet,一种基于共同主成分分析(CPCA)的新型框架,将迭代的Flury-Gautschi(FG)算法展开为完全可微的神经层。该方法将CPCA的统计特性整合到端到端可训练框架中,强制在不同领域间发现共享子空间,同时保持可解释性。在四个标准DG基准测试中,CPCANet在零样本转移中达到最先进性能。此外,CPCANet架构无关,无需特定数据集调优,提供了一种简单高效的鲁棒表示学习方法以应对分布偏移。代码可在https://github.com/wish44165/CPCANet获取。

英文摘要

Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen target domains. While recent invariant learning strategies and architectural advances have achieved strong performance, explicitly discovering a structured domain-invariant subspace through second-order statistics remains underexplored. In this work, we propose CPCANet, a novel framework grounded in Common Principal Component Analysis (CPCA), which unrolls the iterative Flury-Gautschi (FG) algorithm into fully differentiable neural layers. This approach integrates the statistical properties of CPCA into an end-to-end trainable framework, enforcing the discovery of a shared subspace across diverse domains while preserving interpretability. Experiments on four standard DG benchmarks demonstrate that CPCANet achieves state-of-the-art (SOTA) performance in zero-shot transfer. Moreover, CPCANet is architecture-agnostic and requires no dataset-specific tuning, providing a simple and efficient approach to learning robust representations under distribution shift. Code is available at https://github.com/wish44165/CPCANet.
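As background for the CPCA model named above, the classical common principal components hypothesis and the criterion usually associated with the FG algorithm can be written as follows. This is the textbook formulation as commonly stated, with $S_k$ the sample covariance of domain $k$ and $n_k$ its weight; how CPCANet parameterizes or unrolls it is not specified in the abstract.

```latex
\[
\Sigma_k = B \Lambda_k B^{\top}, \qquad k = 1, \dots, K,
\quad B^{\top} B = I, \quad \Lambda_k \text{ diagonal}
\]
\[
\Phi(B) = \prod_{k=1}^{K}
\left[
\frac{\det\bigl(\operatorname{diag}(B^{\top} S_k B)\bigr)}
     {\det\bigl(B^{\top} S_k B\bigr)}
\right]^{n_k}
\]
```

By Hadamard's inequality each factor is at least 1 for positive definite $S_k$, with equality exactly when $B^{\top} S_k B$ is diagonal, so minimizing $\Phi$ seeks one orthogonal basis that diagonalizes all domains' covariances as closely as possible.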

2605.03862 2026-06-09 cs.AI cs.CL 版本更新

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

正确性不足:通过执行器导向的奖励训练推理计划器

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su

发表机构 * D 4 Lab(D4实验室) Independent Researcher(独立研究者)

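AI summary: This paper proposes the TraceLift framework, which improves reasoning quality with executor-grounded rewards, using a rubric-based Reasoning Reward Model to assess the reliability and usefulness of reasoning traces.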

详情
Comments
36 pages
AI中文摘要

可验证奖励的强化学习已成为提升大语言模型显式推理的常见方法,但仅凭最终答案正确性无法揭示推理轨迹的忠实性、可靠性或对消费模型的效用。为此,我们提出TraceLift,将推理视为可消费的中间产物。在计划器训练中,计划器生成标记化的推理。冻结的执行器将此推理转化为最终产物供验证器反馈,同时执行器导向的奖励塑造中间轨迹。此奖励乘以基于rubric的Reasoning Reward Model评分,乘以在相同冻结执行器上测量的提升,奖励高质量且有用的轨迹。为使推理质量直接可学习,我们引入TRACELIFT-GROUPS数据集,包含数学和代码种子问题。每个示例是同一问题组,包含高质量参考轨迹和多个可能的错误轨迹,通过局部扰动降低推理质量或解决方案支持,同时保持任务相关性。在代码和数学基准上的广泛实验表明,执行器导向的推理奖励提高了两阶段计划器-执行器系统,表明推理监督应不仅评估轨迹是否看起来好,还应评估其是否帮助消耗模型。

英文摘要

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
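The multiplicative reward described above can be sketched in a few lines. The function below is a hedged illustration: the score range, the definition of uplift as a difference in executor pass rates, and the clipping of negative uplift are all assumptions, since the abstract does not give these details.

```python
def executor_grounded_reward(rm_score, pass_with_trace, pass_without_trace):
    """Multiply a rubric-based quality score by the measured uplift.

    Assumptions (not from the paper): rm_score lies in [0, 1], uplift is the
    difference in frozen-executor pass rate with and without the trace, and
    negative uplift is clipped to zero.
    """
    if not 0.0 <= rm_score <= 1.0:
        raise ValueError("rm_score must lie in [0, 1]")
    uplift = max(0.0, pass_with_trace - pass_without_trace)
    return rm_score * uplift

# A trace that is well reasoned and also helps the executor.
helpful = executor_grounded_reward(0.8, 0.75, 0.25)
# A trace that reads well but does not change the executor's pass rate.
polished_but_useless = executor_grounded_reward(0.9, 0.5, 0.5)
```

Because the two factors are multiplied, a trace earns credit only when it is both high-quality and useful, which matches the behavior the abstract describes.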

2605.01320 2026-06-09 cs.CV 版本更新

PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression

PACE:用于学习LiDAR点云压缩的后因果熵建模

Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

发表机构 * arXiv.org cs.CV(计算机视觉)

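AI summary: PACE improves learned LiDAR point cloud compression with a non-causal backbone and a lightweight predictor, cutting decoding latency by over 90% in autoregressive mode while delivering BD-BR savings.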

详情
AI中文摘要

LiDAR点云压缩对自动驾驶系统处理高分辨率传感器数据至关重要。尽管基于八叉树结构的学得熵建模能获得高压缩增益,但面临两个关键瓶颈:1)解码时因因果、多阶段上下文建模导致的延迟过高;2)性能-延迟权衡的刚性,使单一模型难以适应变化约束。这些限制源于上下文聚合骨干与概率预测之间的紧密耦合。为此,我们提出PACE,一种新的框架,将祖先上下文聚合重新表述为非因果骨干,并将因果性限制在轻量级、阶段可扩展的预测器中,消除重复骨干执行并减少计算开销。预测器支持任意数量的预测阶段,使模型能够无缝适应多样化的性能-延迟权衡,而无需重新加载参数。实验表明,PACE在压缩效率上达到新状态,实现显著的BD-BR节省,并在自回归模式下将解码延迟降低超过90%,使其在实际应用中具有吸引力。

英文摘要

LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, making it attractive for practical applications.

2605.01171 2026-06-09 cs.CV cs.LG 版本更新

CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization

CADFit:基于混合优化的精确网格到CAD程序生成

Ghadi Nehme, Eamon Whalen, Faez Ahmed

发表机构 * arXiv.org

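AI summary: CADFit recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations with geometric feedback; it outperforms existing methods on multiple benchmarks and substantially reduces the Invalid Ratio.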

详情
AI中文摘要

尽管最近取得了进展,但从几何输入(如网格或点云)恢复参数化CAD构造序列仍然是设计和制造的关键挑战,因为现有的CAD重建和生成方法主要局限于难以编辑的格式(如网格或Breps)或可编辑的简单草图-拉伸流水线和低复杂度数据集。我们引入了CADFit,一个基于混合优化的CAD重建框架,通过使用几何反馈增量拟合和验证参数化操作,从网格中恢复复杂、可编辑的CAD构造序列。我们的方法的特点是将重建公式化为对结构化CAD程序的IoU驱动优化,并支持丰富的操作集,包括拉伸、旋转、圆角和倒角。在多个CAD基准上的实验表明,CADFit在体积交并比和倒角距离方面优于最先进的网格到CAD方法,同时显著降低了重建CAD程序的无效比率,特别是对于复杂设计。我们进一步提出了一个多模态流水线,通过将基于图像的几何重建与CADFit相结合,实现从图像端到端重建CAD构造序列。通过实现更高复杂度CAD模型的精确重建,CADFit为生成更丰富的数据集和推进未来基于学习的CAD逆向工程方法提供了实用基础。代码可在:https://github.com/ghadinehme/CADFit 获取。

英文摘要

Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, remains a key challenge for design and manufacturing: existing CAD reconstruction and generation methods are largely restricted either to difficult-to-edit formats such as meshes or B-reps, or to editable but simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering. The code is available at: https://github.com/ghadinehme/CADFit.
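Volumetric Intersection-over-Union, the quantity that drives the optimization described above, can be illustrated on voxelized shapes. The sketch below assumes shapes are given as sets of occupied voxel indices; the `box` helper and the example dimensions are hypothetical and do not reflect how CADFit evaluates geometry.

```python
def voxel_iou(a, b):
    """Volumetric IoU of two shapes given as sets of occupied voxel indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(a & b) / len(union)

def box(x0, x1, y0, y1, z0, z1):
    """Occupied voxels of an axis-aligned box, half-open on each axis."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}

target = box(0, 4, 0, 4, 0, 4)      # 64 voxels
candidate = box(0, 4, 0, 4, 0, 2)   # an extrusion that is too short, 32 voxels
iou = voxel_iou(target, candidate)
```

An optimizer fitting the extrusion depth would see the score rise toward 1.0 as the candidate height approaches that of the target.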

2506.20588 2026-06-09 cs.CV 版本更新

TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

TRIM:一种最大化时间相对信息和代表性的自监督视频摘要框架

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

发表机构 * Pompeu Fabra University(庞培法布拉大学) Universitat Autònoma de Barcelona(自治大学)

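AI summary: TRIM performs efficient self-supervised video summarization without attention or other heavy components, outperforming existing unsupervised methods and challenging the reliance on complex architectures.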

详情
AI中文摘要

随着视频内容的普及,视频摘要和亮点提取成为关键研究领域。然而,许多先进方法依赖监督标注或注意力模型,计算成本高且在分布变化时表现不稳定。我们提出一种新颖的自监督视频摘要模型,无需注意力、RNN或Transformer,通过马尔可夫过程驱动的损失度量和两阶段自监督学习范式,实现性能与效率的平衡。TRIM在SUMME和TVSUM数据集上达到最佳性能,超越所有现有无监督方法,并与最佳监督模型相当,展示了高效无标注架构的潜力,为更通用的视频摘要技术铺平道路,并挑战现有复杂架构的依赖。

英文摘要

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SumMe and TVSum datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

2605.00647 2026-06-09 cs.LG 版本更新

Label-Conditioned Cross-Modal Fusion for Adult-to-Pediatric ECG Transfer via Curriculum-Gated Contrastive Alignment

基于标签的跨模态融合用于成人到儿童ECG转移 via 课程门控对比对齐

Xinran Liu, Yuwen Li, Hongxiang Gao, Heyang Xu, Jianqing Li, Zongmin Wang, Chengyu Liu

发表机构 * School of Instrument Science and Engineering, Southeast University(东南大学仪器科学与工程学院) Nanjing Medical University(南京医科大学) Zhengzhou University(郑州大学)

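AI summary: This paper proposes the PEACE framework, which improves pediatric ECG diagnosis through adult-data pretraining and adaptive fusion, using contrastive learning and a curriculum adaptation strategy to reach a high macro-average AUC with limited labels.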

详情
AI中文摘要

自动化的儿童心电图(ECG)解释仍具挑战性,因为心率、间隔和波形的发育差异限制了主要在成人数据上训练的模型的可转移性,同时专家标注的儿童ECG数据集稀缺。我们提出PEACE(通过跨模态增强的儿童-成人ECG对齐),一个在MIMIC-IV ECG上预训练并适应于儿童目标的成人到儿童ECG转移框架。PEACE整合标签特定的双向对比学习(LSBC)以对齐ECG表示与诊断语义,并采用课程适应融合(CAF)以在有限的儿童监督下稳定优化。标签条件的短文本描述在训练期间提供辅助语义监督,而推理仅需ECG信号。在ZZU-pECG上,PEACE在零样本、50样本和全微调设置下分别达到宏平均AUCs为59.39%、81.74%和91.56%,优于ECG-only、多模态和通用领域适应基线,包括DANN和MMD。在PTB-XL上,经过全微调后,其在九个和谐标签上的宏平均AUC达到96.90%。基于梯度的注意力图显示在与房间相关RVH相关的QRS电压和形态区域以及与LQTS相关的QRS到T/复极化间隔区域的显著性增加,与常规解释中常见的ECG区域一致。这些结果表明,成人规模的ECG预训练结合节律、形态和ST-T复极化语义描述在标签稀缺的情况下提高了可转移的儿童诊断,同时保持了临床可解释的波形焦点。

英文摘要

Automated pediatric electrocardiogram (ECG) interpretation remains challenging because developmental differences in heart rate, intervals, and waveforms limit the transferability of models trained mainly on adult data, while expert-labeled pediatric ECG cohorts are scarce. We propose PEACE (Pediatric-Adult ECG Alignment via Cross-modal Enhancement), an adult-to-pediatric ECG transfer framework pretrained on MIMIC-IV ECGs and adapted to pediatric targets. PEACE integrates label-specific bidirectional contrastive learning (LSBC) to align ECG representations with diagnostic semantics and curriculum adaptive fusion (CAF) to stabilize optimization under limited pediatric supervision. Label-conditioned short text descriptors provide auxiliary semantic supervision during training, whereas inference requires ECG signals only. On ZZU-pECG, PEACE achieves macro-average AUCs of 59.39%, 81.74%, and 91.56% under zero-shot, 50-shot, and full fine-tuning settings, respectively, outperforming ECG-only, multimodal, and generic domain adaptation baselines including DANN and MMD. On PTB-XL, it reaches 96.90% macro-average AUC after full fine-tuning over nine harmonized labels with nonzero mapped incidence. Gradient-based attention maps show increased saliency around QRS voltage and morphology regions for chamber-related RVH and around QRS-to-T/repolarization intervals for LQTS, broadly consistent with ECG regions commonly inspected during routine interpretation. These results suggest that adult-scale ECG pretraining coupled with rhythm, morphology, and ST-T repolarization semantic descriptors improves transferable pediatric diagnosis under label scarcity while preserving clinically interpretable waveform focus.
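The results above are reported as macro-average AUC over diagnostic labels. A minimal sketch of that metric follows, computing each label's AUC as the fraction of positive-negative pairs ranked correctly and then averaging without weights. The toy labels and scores are made up for illustration, and the paper's evaluation code may differ.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-label AUCs; rows are samples, columns are labels."""
    n_labels = len(label_matrix[0])
    aucs = [binary_auc([row[k] for row in label_matrix],
                       [row[k] for row in score_matrix])
            for k in range(n_labels)]
    return sum(aucs) / n_labels

y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]                   # toy labels
y_score = [[0.9, 0.2], [0.3, 0.6], [0.8, 0.4], [0.1, 0.5]]  # toy scores
result = macro_auc(y_true, y_score)
```

Macro averaging gives rare diagnoses the same weight as common ones, which matters when pediatric labels are scarce.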

2605.00358 2026-06-09 cs.CL cs.CV 版本更新

From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

从反向传播到正向回放:重新审视LLM参数编辑中的目标构造

Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee

发表机构 * arXiv.org University of Cambridge(剑桥大学) University of Edinburgh(爱丁堡大学)

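AI summary: This paper revisits target construction in LLM parameter editing and proposes a simpler alternative that replaces backward spreading with forward propagation, yielding more accurate and mutually compatible target hidden states.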

详情
Comments
ICML 2026, code: https://github.com/jugechengzi/FE
AI中文摘要

LLM参数编辑方法通常依赖于计算目标层的理想隐藏状态(称为锚点)并将其分布到多个前层(通常称为反向传播)以实现协同编辑。尽管长期广泛使用,其基础理论尚未系统研究。本文首先系统研究其基础,有助于明确其能力边界、实际考虑和潜在失败模式。然后,我们提出了一种简单优雅的替代方法,用正向传播代替反向传播。不优化最后一层的靶标,而是在第一编辑层优化锚点,然后将其传播到后续所有编辑层,以获得准确且相互兼容的目标隐藏状态。这种方法达到与现有方法相同计算复杂度,同时产生更准确的层间目标。我们的方法简单,不影响初始目标隐藏状态的计算或后续编辑流程的其他组件,因此对广泛的LLM参数编辑方法有益。

英文摘要

LLM parameter editing methods commonly rely on computing an ideal target hidden state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although this practice has been widely used for a long time, its underlying basis has not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden states for all subsequent editing layers. This approach achieves the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple and does not interfere with either the computation of the initial target hidden state or any other component of the subsequent editing pipeline, and thus benefits a wide range of LLM parameter editing methods.

2605.00273 2026-06-09 cs.CV cs.AI 版本更新

When Do Diffusion Models Learn to Generate Multiple Objects?

扩散模型何时学会生成多个物体?

Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

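AI summary: The study examines the limits of diffusion models in multi-object generation, finding that scene complexity matters more than concept imbalance and that counting is especially hard to learn in low-data regimes.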

详情
Comments
ICML 2026
AI中文摘要

文本到图像的扩散模型实现了出色的视觉保真度,却在多物体生成中仍不可靠。尽管有大量实证证据表明这些失败,但其根本原因仍不清楚。我们首先探讨这种限制有多大源于数据本身。为了区分数据影响,我们考虑了不同数据集大小下的两种模式:(1)概念泛化,其中每个单独的概念在训练期间可能在不平衡的数据分布下被观察到;(2)组合泛化,其中特定的概念组合被系统性地排除。为了研究这些模式,我们引入了mosaic(多物体空间关系、属性、计数),一种受控的数据集生成框架。通过在mosaic上训练扩散模型,我们发现场景复杂性起主导作用,而非概念不平衡,并且在低数据模式中计数尤为难以学习。此外,随着训练过程中排除更多概念组合,组合泛化能力会崩溃。这些发现突显了扩散模型的根本限制,并促使更强的归纳偏见和数据设计以实现稳健的多物体组合生成。

英文摘要

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
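The compositional generalization regime described above holds out specific concept combinations during training. A minimal sketch of such a split over attribute-object pairs is shown below. The function name, the seeded shuffle, and the example concepts are assumptions for illustration and are not part of the mosaic framework.

```python
from itertools import product
import random

def compositional_split(attributes, objects, n_holdout, seed=0):
    """Hold out whole (attribute, object) pairs.

    Individual concepts may still appear in training through other pairs;
    only the held-out combinations are never seen together.
    """
    combos = list(product(attributes, objects))
    if not 0 <= n_holdout <= len(combos):
        raise ValueError("n_holdout must be between 0 and the number of pairs")
    rng = random.Random(seed)
    rng.shuffle(combos)
    return combos[n_holdout:], combos[:n_holdout]

# Hypothetical concepts: 3 colors x 2 shapes = 6 combinations.
train, held_out = compositional_split(
    ["red", "green", "blue"], ["cube", "sphere"], n_holdout=2)
```

Raising `n_holdout` corresponds to the setting in which the abstract reports that compositional generalization collapses.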

2604.27810 2026-06-09 cs.LG 版本更新

Hyper-Dimensional Fingerprints as Molecular Representations

超维指纹作为分子表示

Jonas Teufel, Luca Torresi, André Eberhard, Pascal Friederich

发表机构 * Karlsruhe Institute of Technology (KIT), Institute of Nanotechnology (INT)(卡尔斯鲁厄理工学院(KIT),纳米技术研究所) Karlsruhe Institute of Technology (KIT), Institute of Anthropomatics and Robotics (IAR)(卡尔斯鲁厄理工学院(KIT),人机学与机器人研究所)

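AI summary: This paper proposes hyperdimensional fingerprints (HDF), which produce deterministic, training-free molecular representations through algebraic operations on high-dimensional vectors; they perform well across property prediction tasks and preserve molecular similarity even at low dimensionality.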

详情
Comments
Code: https://doi.org/10.5281/zenodo.19373621
AI中文摘要

计算分子表示是虚拟筛选、性质预测和材料发现的基础。传统指纹效率高但因基于哈希的压缩丢失结构信息,特别是在低维情况下。通过图神经网络学习的表示恢复了这种表达性,但需要任务特定的训练和大量计算资源。本文引入超维指纹(HDF),用高维向量的代数运算替代消息传递神经网络的学习转换,生成无需训练的确定性分子表示。在多样化的属性预测基准上,HDF在大多数任务中优于传统指纹,且在不同数据集和模型间表现出更高的一致性。关键的是,HDF嵌入保持分子相似性:在32维时,HDF空间的距离与图编辑距离的皮尔逊相关系数达到0.9,而摩根指纹在同等尺寸下仅为0.55。这种结构保真度在低维情况下持续,允许简单的最近邻回归在64个组件中保持预测性。进一步在贝叶斯分子优化中展示了实际影响,HDF基于的替代模型在摩根指纹表现与随机搜索相当的领域中显著提高了样本效率。HDF因此提供了一种通用的、无需训练的替代方案,表明传统固定长度指纹中接受的信息损失是哈希编码方案的限制,而非指纹范式本身。

英文摘要

Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.
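The idea of replacing learned message passing with algebraic operations on high-dimensional vectors can be sketched with the two basic hyperdimensional-computing primitives, binding (elementwise product) and bundling (elementwise sum). The encoding below is a generic illustration under those assumptions; the dimensionality, the seeding scheme, and the update rule are hypothetical and are not the authors' HDF construction.

```python
import math
import random

DIM = 256  # illustrative dimensionality, not a value from the paper

def atom_vector(symbol, dim=DIM):
    """Deterministic random bipolar (+1/-1) vector for an atom symbol."""
    rng = random.Random("hdf-sketch:" + symbol)
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bind(u, v):
    """Binding: elementwise product."""
    return [a * b for a, b in zip(u, v)]

def bundle(vectors):
    """Bundling: elementwise sum."""
    return [sum(col) for col in zip(*vectors)]

def sign(v):
    """Map a bundled vector back to bipolar form (zeros go to +1)."""
    return [1 if x >= 0 else -1 for x in v]

def encode(atoms, bonds, rounds=2):
    """atoms: list of symbols; bonds: list of (i, j) index pairs."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    states = [atom_vector(s) for s in atoms]
    for _ in range(rounds):
        new_states = []
        for i, h in enumerate(states):
            if neighbors[i]:
                context = sign(bundle([states[j] for j in neighbors[i]]))
                h = bind(h, context)  # mix in neighborhood information
            new_states.append(h)
        states = new_states
    return bundle(states)  # graph-level vector

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

ethanol = encode(["C", "C", "O"], [(0, 1), (1, 2)])
reordered = encode(["O", "C", "C"], [(0, 1), (1, 2)])  # same graph, atoms relabeled
```

Because every step is a fixed algebraic operation on seeded vectors, the same molecular graph always maps to the same vector regardless of atom ordering, with no training involved.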

2604.27476 2026-06-09 cs.CV 版本更新

EdgeFM: Efficient Edge Inference for Vision-Language Models

EdgeFM: 为视觉-语言模型高效边缘推理设计的框架

Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An

发表机构 * Go Further. AI School of Data Science Fudan University(复旦大学数据科学学院) RUYi Dynamics Co. Ltd(RUYi Dynamics有限公司) Independent Researcher(独立研究者)

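AI summary: EdgeFM is a lightweight, agent-driven framework that optimizes vision-language model inference for edge deployment, improving cross-platform performance and portability and, in most cases, outperforming conventional vendor toolchains.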

详情
Comments
Technical report version
AI中文摘要

视觉-语言模型(VLMs)在边缘工业应用中表现出强大的适用性,但其部署受到确定性低延迟和资源限制下稳定执行的严重限制。现有框架要么依赖臃肿的通用设计,要么迫使开发者进入封闭的硬件特定生态系统,导致硬件锁定和较差的跨平台适应性。观察到现代AI代理可以高效搜索和调整配置以生成高度优化的低级内核,我们提出EdgeFM,一种轻量级、代理驱动的VLM/LLM推理框架,专为跨平台工业边缘部署设计。EdgeFM通过移除非必要功能来降低单次请求延迟,并将代理调优的内核优化封装为可重用的模块化库。通过允许直接调用这些技能而不是等待封闭源代码实现,它有效缩小了长期以来由专有工具链主导的性能差距。该框架原生支持主流平台,包括x86和NVIDIA Orin SoCs,并代表了首个在国产Horizon Journey平台上的端到端VLA部署,增强了跨平台可移植性。在大多数情况下,它比传统供应商特定工具链的推理性能更优,实现NVIDIA Orin平台上比TensorRT-Edge-LLM快1.49倍的速度提升。实验结果表明,EdgeFM提供了有利的端到端推理性能,为多样化的边缘工业场景提供了开源、生产级的解决方案。

英文摘要

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to a 1.49-fold speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

2604.27273 2026-06-09 cs.SD 版本更新

Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?

少样本合成方言语音用于ASR微调:什么有助于什么?

Yurii Halychanskyi, Nimet Beyza Bozdag, Mark Hasegawa-Johnson, Dilek Hakkani-Tür, Volodymyr Kindratenko

发表机构 * University of Washington(华盛顿大学)

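AI summary: The study compares kinds of synthetic accented speech for ASR fine-tuning, finding that random phoneme perturbations recover most of the gain of target-accent phoneme edits and that mixing real with synthetic speech stabilizes low-resource fine-tuning.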

详情
Comments
Accepted as a contributed talk and poster at the ICML 2026 Workshop on Machine Learning for Audio
AI中文摘要

合成方言语音是一种在真实方言录音稀缺时提升自动语音识别(ASR)性能的有希望的方法。我们探讨了什么使此类数据对ASR微调有用:目标方言音素编辑暴露识别器于方言特定发音,或随机音素扰动在音素空间中充当增强。在少样本TTS流程中,我们比较了LLM生成的方言编辑与匹配速率的随机替换和oracle控制,使用真实方言音素和语调。随机替换恢复了大部分ASR增益:LLM目标方言编辑仅比随机替换略好,真实音素接近随机基线并随着合成ASR微调集增大接近它,添加真实语调仅带来小幅增益。混合合成与真实方言语音也稳定了低资源微调,但固定合成预算后期会稀释真实数据信息,显示真实-合成比例的重要性。

英文摘要

Synthetic accented speech is a promising way to improve automatic speech recognition (ASR) when real accented recordings are scarce. We ask what makes such data useful for ASR fine-tuning: target-accent phoneme edits that expose the recognizer to accent-specific pronunciations, or random phoneme perturbations that act as augmentation in phoneme space. In a few-shot TTS pipeline, we compare LLM-generated accent edits with matched-rate random substitutions and oracle controls using ground-truth accented phonemes and prosody. Random substitutions recover much of the ASR gain: LLM target-accent edits improve over random by only a small margin, ground-truth phonemes stay close to the random baseline and nearly converge with it as the synthetic ASR fine-tuning set grows larger, and adding ground-truth prosody yields only a modest further gain. Mixing synthetic with real accented speech also stabilizes low-resource fine-tuning, but a fixed synthetic budget can later dilute the information in real data, showing that the real--synthetic ratio matters.
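The matched-rate random substitution baseline described above can be sketched as follows: a fixed fraction of phoneme positions is replaced with a different phoneme drawn uniformly from an inventory. The function name, the ARPAbet-style example, and the uniform sampling are illustrative assumptions, since the abstract does not describe how substitutions are drawn.

```python
import random

def random_substitutions(phonemes, rate, inventory, seed=0):
    """Replace round(rate * len(phonemes)) positions with a different
    phoneme drawn uniformly from `inventory`."""
    if len(set(inventory)) < 2:
        raise ValueError("inventory needs at least two distinct phonemes")
    rng = random.Random(seed)
    n_edits = round(rate * len(phonemes))
    positions = rng.sample(range(len(phonemes)), n_edits)
    out = list(phonemes)
    for pos in positions:
        choices = [p for p in inventory if p != out[pos]]
        out[pos] = rng.choice(choices)  # always differs from the original
    return out

original = ["DH", "AH", "K", "AE", "T", "S", "AE", "T"]  # hypothetical utterance
inventory = ["AA", "AE", "AH", "DH", "K", "S", "T", "Z"]
perturbed = random_substitutions(original, rate=0.25, inventory=inventory)
```

Setting `rate` to the edit rate observed in the accent-targeted edits keeps the two conditions comparable, so any difference reflects which phonemes change rather than how many.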

2604.26985 2026-06-09 cs.LG cs.AI 版本更新

Simple Self-Conditioning Adaptation for Masked Diffusion Models

简单自条件适应用于掩码扩散模型

Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

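Affiliations * University of Virginia

AI summary This paper proposes a simple and effective post-training adaptation that conditions masked diffusion models on their own previous clean-state predictions, lowering generative perplexity and improving the quality of image synthesis and molecule generation.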

Details
AI Chinese abstract

掩码扩散模型(MDMs)通过迭代去噪在吸收掩码过程中生成离散序列。在标准掩码扩散中,如果一个token在反向更新后仍被掩码,模型会丢弃该位置的干净状态预测。因此,仍被掩码的位置必须反复从掩码token本身推断。这种设计限制了跨步骤的细化。为解决这一限制,本文提出了一种简单但有效的后训练适应方法,使每个去噪步骤都基于模型自身之前的干净状态预测。所提出的方法称为自条件掩码扩散模型(SCMDM),需要最小的架构更改,不引入递归的潜在状态路径,不依赖辅助参考模型,并在采样过程中不增加额外的去噪器评估。这与部分自条件方法形成重要区别,后者需要昂贵的从头模型训练。特别是,本文表明,在后训练阶段,部分自条件,包括用于从头训练自条件模型的常用50% dropout策略,是次优的。相反,一旦模型自生成的干净状态估计变得有信息,专业化于细化优于混合条件和无条件目标。SCMDM在多个领域进行了评估,显示出对普通MDM基线的一致改进,实现了在OWT训练模型上的生成困惑度几乎减少50%(从42.89到23.72),同时在离散图像合成质量、小分子生成和基因组分布建模的保真度方面也取得了显著改进。

English abstract

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple yet effective post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, specializing in refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains, demonstrating consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small-molecule generation, and fidelity in genomic distribution modeling.
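The sampling idea described above, carrying the previous clean-state prediction into the next denoising step at no extra denoiser cost, can be sketched with a toy loop. This is an illustrative sketch, not the SCMDM implementation: `toy_denoiser` is a hand-written stand-in for a trained network, the unmasking schedule is an arbitrary assumption, and the abstract does not specify how the previous prediction is fed to the model.

```python
import random

MASK = -1
VOCAB = 4  # toy vocabulary size (assumption)


def toy_denoiser(tokens, prev_pred):
    """Stand-in for a trained network: one distribution over VOCAB per position.
    prev_pred is the previous step's clean-state prediction (None on step 0)."""
    probs = []
    for i, tok in enumerate(tokens):
        if tok != MASK:
            dist = [1.0 if v == tok else 0.0 for v in range(VOCAB)]
        elif prev_pred is None:
            # Without conditioning: a weak position-dependent prior.
            raw = [2.0 if v == i % VOCAB else 1.0 for v in range(VOCAB)]
            dist = [r / sum(raw) for r in raw]
        else:
            # With self-conditioning: sharpen the previous estimate by squaring.
            raw = [p * p for p in prev_pred[i]]
            dist = [r / sum(raw) for r in raw]
        probs.append(dist)
    return probs


def sample(length, steps, rng):
    tokens = [MASK] * length
    prev_pred = None
    calls = 0
    for step in range(steps):
        probs = toy_denoiser(tokens, prev_pred)  # one evaluation per step
        calls += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Unmask an even share of the remaining positions; all on the last step.
        last = step == steps - 1
        k = len(masked) if last else max(1, len(masked) // (steps - step))
        for i in rng.sample(masked, min(k, len(masked))):
            tokens[i] = rng.choices(range(VOCAB), weights=probs[i])[0]
        # Keep the prediction for still-masked positions instead of discarding it.
        prev_pred = probs
    return tokens, calls


tokens, calls = sample(length=8, steps=4, rng=random.Random(0))
```

In this sketch the denoiser is called once per step, so the number of evaluations equals the number of sampling steps, in line with the abstract's claim that self-conditioning adds no extra denoiser evaluations.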