arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2510.18830 2026-05-20 cs.CL cs.DC cs.LG

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu

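AI summary This paper proposes MTraining, which uses dynamic sparse attention to address the computational imbalance and communication overhead of ultra-long-context training. It extends the context window of Qwen2.5-3B from 32K to 512K tokens and achieves up to 6x higher training throughput while preserving model accuracy across several downstream tasks.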

Abstract

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

2509.26464 2026-05-20 cs.AI cs.CL cs.LG

Extreme Self-Preference in Language Models

Steven A. Lehr, Mary Cipperman, Mahzarin R. Banaji

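AI summary The study finds that large language models show a strong preference for their own names, companies, and CEOs in word-association tasks, suggesting that a model's self-identification may shape its behavior and prompting closer examination of the effects of self-preference.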

Comments 73 pages total. Main article 22 pages, 6 main-text tables. Supplementary Materials (51 pages, 28 tables). Data, transcripts, and code for replication and data extraction have been uploaded to OSF: https://osf.io/98ye3/

Abstract

Self-preference is a fundamental feature of biological organisms. Since large language models (LLMs) lack sentience, they might be expected to avoid such distortions. Yet, across 72 experiments and ~41,000 queries, we discovered massive self-preferences in eight widely used LLMs. In word-association tasks, models overwhelmingly paired positive attributes with their own names, companies, and CEOs over those of competitors. By manipulating LLM self-identification - revealing models' true identities or ascribing false ones - we found that preferences consistently followed assigned, not true, identities. Importantly, these effects were not explained by priming or role-playing and emerged in consequential settings, when evaluating job candidates and AI technologies. These results raise critical questions about whether LLM behavior will be systematically influenced by self-preferential tendencies, including a bias toward their own operation.

2509.16664 2026-05-20 cs.LG

$\boldsymbol{\lambda}$-Orthogonality Regularization for Compatible Representation Learning

Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo

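AI summary This paper proposes λ-orthogonality regularization, which learns an affine transformation that achieves distribution-specific adaptation while retaining the original representations. Experiments across architectures and datasets validate the approach, which preserves zero-shot performance and keeps model updates compatible.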

Comments Accepted at NeurIPS2025

Journal ref
Advances in Neural Information Processing Systems 38 (NeurIPS 2025), pp. 29036-29063

Abstract

Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely $\lambda$-Orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model's zero-shot performance and ensures compatibility across model updates. Code available at: https://github.com/miccunifi/lambda_orthogonality.
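
The idea of a relaxed orthogonality constraint can be illustrated with a small sketch. This is not the authors' formulation: it assumes, purely for illustration, a hinge-style penalty that is zero while the Frobenius distance between W^T W and the identity stays within a threshold `lam`. The function names `gram_deviation` and `relaxed_orthogonality_penalty` are hypothetical.

```python
import math


def gram_deviation(W):
    """Frobenius norm of W^T W - I for a square matrix W given as a list of rows."""
    n = len(W)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of the Gram matrix W^T W.
            g = sum(W[k][i] * W[k][j] for k in range(n))
            target = 1.0 if i == j else 0.0
            total += (g - target) ** 2
    return math.sqrt(total)


def relaxed_orthogonality_penalty(W, lam):
    """Assumed hinge form: no penalty while the deviation is at most lam."""
    return max(0.0, gram_deviation(W) - lam)


# A rotation is exactly orthogonal, so it is never penalized.
rotation = [[0.0, -1.0], [1.0, 0.0]]
# A uniform scaling by 2 is far from orthogonal and is penalized.
scaling = [[2.0, 0.0], [0.0, 2.0]]
print(relaxed_orthogonality_penalty(rotation, 0.5))
print(relaxed_orthogonality_penalty(scaling, 1.0))
```

Under this assumed form, `lam = 0` recovers a strict orthogonality penalty, while a large `lam` leaves the transformation effectively unconstrained, which mirrors the trade-off between orthogonal and fully affine maps that the abstract describes.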

2509.15435 2026-05-20 cs.CV cs.AI cs.MA

ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models

Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian

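AI summary This paper proposes ORCA, a framework that uses structured inference-time reasoning and small vision models to improve the factual accuracy and adversarial robustness of pretrained vision-language models, with marked gains on hallucination benchmarks and under adversarial perturbations.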

Comments Accepted at the ACM International Conference on Cloud and Big Data Computing (ICCBDC 2026)

Abstract

Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through structured inference-time reasoning with a suite of small vision models (fewer than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLM performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

2508.06649 2026-05-20 cs.CL

Measuring Stereotype and Deviation Biases in Large Language Models

Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang

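AI summary This paper studies how to measure stereotype bias and deviation bias in large language models. By having the models generate profiles of individuals, it analyzes the associations between demographic groups and attributes such as political affiliation, religion, and sexual orientation, revealing the biases that arise when LLMs infer user attributes and their potential harms.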

Abstract

Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.
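
Deviation bias, as defined above, compares the demographic distribution found in generated content with a real-world reference distribution. The sketch below shows one way such a gap could be quantified. It assumes total variation distance as the disparity measure, which the abstract does not specify, and the group labels and proportions are made-up placeholders rather than data from the paper.

```python
from collections import Counter


def empirical_distribution(labels):
    """Relative frequency of each label in a list of generated profiles."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Placeholder data: 10 generated profiles and an assumed 50/50 reference split.
generated = ["A"] * 7 + ["B"] * 3
reference = {"A": 0.5, "B": 0.5}

gap = total_variation(empirical_distribution(generated), reference)
print(round(gap, 3))
```

A gap of 0 would mean the generated profiles match the reference distribution exactly, and a gap of 1 would mean the two distributions share no mass.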

2508.01031 2026-05-20 cs.AI cs.CL

CADDesigner: Conceptual CAD Model Generation with a General-Purpose Agent

Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, Peng Du

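AI summary This paper proposes CADDesigner, an LLM-based agent that takes text descriptions and sketches as input, analyzes requirements through interactive dialogue, generates high-quality CAD modeling code, and improves model quality through iterative visual feedback. Experiments show strong performance on conceptual CAD model generation tasks.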

Abstract

Computer-Aided Design (CAD) is widely used for conceptual design and parametric 3D modeling, but typically requires a high level of expertise from designers. To lower the entry barrier and facilitate early-stage CAD modeling, we present CADDesigner, an LLM-powered agent for conceptual CAD design. The agent accepts both textual descriptions and sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Explicit Context Imperative Paradigm (ECIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases can be stored in a structured knowledge base, providing a mechanism for continual knowledge accumulation and future improvement of code generation. Experimental results show that CADDesigner achieves competitive performance and outperforms representative baselines on conceptual CAD model generation tasks.

2507.18902 2026-05-20 cs.CL

SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam

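AI summary This paper proposes a new task, Automatic Dictionary Selection (ADS), and the SLoW method, which selects dictionaries for low-frequency words to improve translation. SLoW needs no access to training data, substantially reduces token consumption, and improves translation across 100 languages.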

Comments EMNLP 2025 Main

Abstract

There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it would be more flexible to trade off token consumption against translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method, Select Low-frequency Words! (SLoW), which selects those dictionaries that have a lower frequency. Our method has unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines and clearly reduces token usage, with many languages even surpassing the translation performance of the full-dictionary baseline. (Footnote: A striking fact is that there is no need to use the actual training data, which is often unobtainable, for frequency estimation; a frequency estimate obtained from public resources is still effective in improving translation with ChatGPT, Llama, and DeepSeek.) (Footnote: Code and data available upon publication.)
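
The selection idea can be sketched as follows. This is an illustrative reading of the abstract, not the released method: it assumes selection happens per source word, that a frequency table estimated from public resources is available as a plain mapping, and that a fixed `budget` caps how many entries go into the prompt. The function name `select_low_frequency_entries` and the toy data are hypothetical.

```python
def select_low_frequency_entries(source_words, dictionary, frequency, budget):
    """Keep dictionary entries for the `budget` rarest source words.

    Words missing from `frequency` are treated as frequency 0, i.e. rarest.
    """
    # Deduplicate while preserving order, and keep only words with an entry.
    candidates = [w for w in dict.fromkeys(source_words) if w in dictionary]
    # Stable sort: rarest words first.
    candidates.sort(key=lambda w: frequency.get(w, 0))
    return {w: dictionary[w] for w in candidates[:budget]}


# Toy data for illustration only.
dictionary = {"the": "le", "cat": "chat", "quokka": "quokka", "sat": "assis"}
frequency = {"the": 1000, "cat": 50, "sat": 40}
words = ["the", "cat", "sat", "the", "quokka", "mat"]

chosen = select_low_frequency_entries(words, dictionary, frequency, budget=2)
print(chosen)
```

Lowering `budget` shortens the prompt at the risk of omitting helpful entries, which is the token-versus-quality trade-off the abstract refers to.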

2506.01732 2026-05-20 cs.CL

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

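AI summary This paper introduces Common Corpus, the largest open dataset for LLM pre-training. It consists of uncopyrighted or openly licensed data spanning many languages and domains, and supports multilingual pre-training.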

Journal ref
ICLR 2026 (Oral)

Abstract

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs across diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.

2505.16819 2026-05-20 cs.CV

Character-Centered Dialogue Generation from Scene-Level Prompts

Taewon Kang, Ming C. Lin

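AI summary This study proposes a modular pipeline that turns action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model in generating dialogue. A Recursive Narrative Bank maintains contextual and emotional consistency across scenes, and the pipeline finally produces expressive, character-conditioned speech, yielding a complete audiovisual narrative.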

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 18 pages, 5 figures

Abstract

Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.

2505.11628 2026-05-20 cs.CL cs.LG

Critique-Guided Distillation for Robust Reasoning via Refinement

Berkcan Kapusuzoglu, Supriyo Chakraborty, Zain Sarwar, Chia-Hsuan Lee, Sambit Sahu

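AI summary This study proposes Critique-Guided Distillation, which decouples critique consumption from critique generation: during fine-tuning, the model refines flawed responses based on teacher critiques, improving its reasoning. It outperforms standard distillation and Critique Fine-Tuning on mathematical reasoning benchmarks.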

Comments Accepted to ICML 2026

Abstract

Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Controlled ablations confirm that these reasoning gains are directly driven by the specificity and relevance of the teacher's feedback. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (-21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing architectural inference-time overhead.
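
For readers unfamiliar with the Pass@k metric mentioned above, the sketch below implements the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled answers per problem and c is the number of correct ones. Whether the paper computes Pass@k with exactly this estimator is an assumption; the abstract does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples is correct.

    n: total samples drawn for a problem, c: how many were correct, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 3 correct answers out of 10 samples, Pass@1 equals the raw accuracy.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Pass@1 reduces to c / n, which is why a higher Pass@1 is read as better per-sample reasoning quality rather than better coverage from repeated sampling.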

2504.05454 2026-05-20 cs.LG cs.AI cs.CE q-bio.GN q-bio.QM

GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction

Yoshitaka Inoue, Tianfan Fu, Augustin Luna

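AI summary This paper proposes GraphPINE, a graph neural network architecture that uses domain-specific prior knowledge to initialize node importance, improving the interpretability of drug response prediction. An importance propagation layer unifies updates to the feature matrix and node importance with GNN-based propagation of feature values, enabling more effective feature learning and graph representation.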

Abstract

Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradients, and Shapley values. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture that leverages domain-specific prior knowledge to initialize node importance, which is then optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained from gradients and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that unifies 1) updates to the feature matrix and node importance and 2) GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes included in a gene-gene graph, with a drug-target interaction (DTI) graph for initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by the number of articles discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.

2503.11615 2026-05-20 cs.LG math.OC

From Score Matching to Diffusion: A Fine-Grained Error Analysis in the Gaussian Setting

Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré

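AI summary This paper studies the sampling error of diffusion samplers in the Gaussian setting, analyzes the four main error sources in score matching and diffusion, and shows how the anisotropy of the data distribution interacts with key parameters of the end-to-end sampling method.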

Abstract

Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first, estimating the score function (the gradient of a smoothed log-distribution) and then applying a diffusion-based sampling algorithm -- such as Langevin or Diffusion models. The resulting distribution's correctness can be impacted by four major factors: the generalization and optimization errors in score matching, and the discretization and minimal noise amplitude in the diffusion. In this paper, we make the sampling error explicit when using a diffusion sampler in the Gaussian setting. We provide a sharp analysis of the Wasserstein sampling error that arises from these four error sources. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the number of initial samples, the stepsizes in both score matching and diffusion, and the noise amplitude. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy.
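
The discretization error named above can be made concrete in a one-dimensional toy case. This is a simplified illustration, not the paper's analysis: it assumes a centered Gaussian target N(0, sigma2) with the exact score s(x) = -x / sigma2 and the unadjusted Langevin update x <- x + h * s(x) + sqrt(2h) * noise. The variance of the iterates then follows v <- (1 - h / sigma2)^2 * v + 2h, whose fixed point is sigma2 / (1 - h / (2 * sigma2)) for 0 < h < 2 * sigma2. All function names are hypothetical.

```python
import math


def langevin_variance(sigma2, step, n_steps, v0=0.0):
    """Variance of unadjusted Langevin iterates after n_steps (exact recursion)."""
    a = 1.0 - step / sigma2  # contraction factor of the deterministic part
    v = v0
    for _ in range(n_steps):
        v = a * a * v + 2.0 * step  # injected noise adds variance 2 * step
    return v


def stationary_variance(sigma2, step):
    """Closed-form fixed point of the recursion, valid for 0 < step < 2 * sigma2."""
    return sigma2 / (1.0 - step / (2.0 * sigma2))


def w2_centered_gaussians(var_a, var_b):
    """Wasserstein-2 distance between N(0, var_a) and N(0, var_b) in 1D."""
    return abs(math.sqrt(var_a) - math.sqrt(var_b))


sigma2, step = 1.0, 0.1
v_limit = langevin_variance(sigma2, step, n_steps=500)
print(v_limit, stationary_variance(sigma2, step))
print(w2_centered_gaussians(v_limit, sigma2))
```

Even with a perfect score, the sampler settles on a variance slightly larger than the target, and the resulting Wasserstein error shrinks as the step size goes to zero. In higher dimensions each eigen-direction of the covariance behaves like its own copy of this recursion, which gives some intuition for why the error depends on the power spectrum of the data.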

2503.08633 2026-05-20 cs.LG

How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

Gal Alon, Yehuda Dar

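AI summary This paper studies how the model parameterization level (that is, network width) affects unlearning in deep neural networks. It examines how unlearning methods differ across parameterization levels, unlearning goals (privacy protection or bias removal), and whether the unlearned examples are used explicitly, and finds that overparameterized models do better on privacy and bias removal at some cost in generalization.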

Abstract

Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), and (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning usually excels on overparameterized models by significantly improving privacy/bias at a reasonable cost of utility (generalization) degradation, although for bias removal this requires the unlearning method to use the unlearned examples. Furthermore, we measure how much the unlearning changes the classification decision regions in the proximity of the unlearned examples while avoiding changes elsewhere. In this way, we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions of the input space while keeping much of the model functionality unchanged.

2503.06310 2026-05-20 cs.CV

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

Taewon Kang, Divya Kothandaraman, Ming C. Lin

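AI summary This paper proposes a storytelling framework that integrates scene and action prompts through a dynamics-inspired prompt-mixing strategy. Its three key components address temporal coherence, semantic consistency, and scene-action continuity in text-to-video generation, yielding more coherent video narratives.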

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 13 pages, 4 figures

Abstract

Generating coherent long-form video sequences from discrete text prompts remains challenging due to difficulties in maintaining temporal coherence, semantic consistency, and scene-action continuity across segments. We propose a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. Our approach combines three key components: (i) a bidirectional time-weighted latent blending strategy that enforces temporal consistency between consecutive video segments, (ii) a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep based on CLIP-based alignment, narrative progression, and temporal smoothness, and (iii) a semantic action representation that encodes high-level action semantics to modulate transitions according to action similarity. Latent-space blending preserves spatial coherence within scenes, while time-weighted blending introduces bidirectional temporal constraints to prevent abrupt transitions. Together, these components enable fluid and coherent video narratives that faithfully reflect both scene context and action dynamics. Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos without any additional training, thereby bridging the gap between short clips and extended text-driven video storytelling.

2410.20238 2026-05-20 cs.CL cs.AI

A Survey of Large Language Models for Arabic Language and its Dialects

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

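AI summary This survey reviews large language models designed for the Arabic language and its dialects, covering key architectures, pre-training datasets, and the downstream performance of monolingual, bilingual, and multilingual models. It also discusses the openness of Arabic LLMs and the challenges and opportunities for future research.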

Comments Submitted to ACM Transactions on Asian and Low-Resource Language Information Processing

Abstract

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and emphasizes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

2408.06843 2026-05-20 cs.RO

Learn2Decompose: Learning Problem Decomposition for Efficient Sequential Multi-object Manipulation Planning

Yan Zhang, Teng Xue, Amirreza Razmjoo, Sylvain Calinon

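AI summary: This paper proposes an efficient task and motion replanning method for sequential multi-object manipulation in dynamic environments. It accelerates TAMP solvers by learning problem decompositions from demonstrations. The core components are goal decomposition learning, computational distance learning, and object reduction, which together improve replanning efficiency.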

Comments Extension of RAL version: added PR2 Whole-body kitchen task and detailed discussion on limitations in main text; added pseudocode and robustness analysis of our approach, and formal analysis on why and when task goals are decomposable in appendix

Abstract

We present an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. Conventional Task And Motion Planning (TAMP) solvers experience an exponential increase in planning time as the planning horizon and number of objects grow, limiting their applicability in real-world scenarios. To address this, we propose learning problem decompositions from demonstrations to accelerate TAMP solvers. Our approach consists of three key components: goal decomposition learning, computational distance learning, and object reduction. Goal decomposition identifies the necessary sequences of states that the system must pass through before reaching the final goal, treating them as subgoal sequences. Computational distance learning predicts the computational complexity between two states, enabling the system to identify the temporally closest subgoal from a disturbed state. Object reduction minimizes the set of active objects considered during replanning, further improving efficiency. We evaluate our approach on three benchmarks, demonstrating its effectiveness in improving replanning efficiency for sequential multi-object manipulation tasks in dynamic environments.
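
The subgoal-selection step described above can be sketched as an argmin over predicted computational distances. This is a minimal illustration, not the authors' implementation: `closest_subgoal` and `mismatch_count` are hypothetical names, and the hand-written mismatch count merely stands in for the learned distance predictor, whose form the abstract does not specify.

```python
from typing import Callable, Sequence, Tuple

# A state is modeled here as a tuple of "object@location" facts (an assumption).
State = Tuple[str, ...]


def mismatch_count(state: State, subgoal: State) -> float:
    """Placeholder distance: number of facts that differ between two states.

    In the paper this role is played by a learned predictor of computational
    distance; the counting rule below is only a stand-in for illustration.
    """
    return float(sum(1 for a, b in zip(state, subgoal) if a != b))


def closest_subgoal(
    state: State,
    subgoals: Sequence[State],
    distance: Callable[[State, State], float],
) -> Tuple[int, State]:
    """Return the index and value of the subgoal with the smallest distance."""
    if not subgoals:
        raise ValueError("at least one subgoal is required")
    best = min(range(len(subgoals)), key=lambda i: distance(state, subgoals[i]))
    return best, subgoals[best]


# A disturbed state and three candidate subgoals from a decomposed plan.
disturbed = ("A@table", "B@shelf", "C@table")
plan_subgoals = [
    ("A@shelf", "B@table", "C@table"),  # differs in A and B
    ("A@table", "B@shelf", "C@shelf"),  # differs in C only
    ("A@shelf", "B@shelf", "C@shelf"),  # differs in A and C
]
index, target = closest_subgoal(disturbed, plan_subgoals, mismatch_count)
```

Replanning would then resume from `target` rather than from the start of the plan, which is the efficiency gain the abstract attributes to goal decomposition.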

2407.13193 2026-05-20 cs.CL

Retrieval-Augmented Generation for Natural Language Processing: A Survey

Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

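AI summary: This survey reviews retrieval-augmented generation (RAG) for natural language processing, focusing on retrievers and retrieval fusion techniques. It proposes a new taxonomy of fusion methods and analyzes RAG applications across NLP tasks, evaluation methods, training paradigms, and the challenges and future directions of industrial deployment.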

Comments Accepted by Artificial Intelligence Review

Abstract

Large language models (LLMs) have achieved strong empirical performance in various fields, benefiting from their large number of parameters, which store knowledge. However, LLMs still suffer from several key issues, such as hallucination, difficulty updating knowledge, and a lack of domain-specific expertise. The emergence of retrieval-augmented generation (RAG), which leverages an external knowledge base to augment LLMs, mitigates these limitations. This paper presents a systematic review of RAG techniques for natural language processing (NLP), with a focus on retrievers and retrieval fusions. We introduce a novel taxonomy of retrieval fusions, such as query-based, logits-based, latent, and parametric fusion, and provide structured comparisons across accessibility, efficiency, and use cases. The paper further examines RAG applications across diverse NLP tasks, discusses evaluation methodologies and benchmark limitations, and analyzes training paradigms with and without knowledge base updates. Finally, we explore industrial deployment considerations and identify emerging challenges and future directions, including security, efficiency, and graph-based retrieval.

2404.07106 2026-05-20 cs.CV cs.GR

3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion

Yixuan Li, Weidong Yang, Ben Fei

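AI summary: This paper proposes 3DMambaComplete, a point cloud completion network built on the Mamba framework. Its HyperPoint generation, spread, and deformation modules address the loss of local detail and the computational complexity of point cloud completion. Experiments show that it outperforms existing methods.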

Comments 24 pages, 14 figures, 10 tables

Abstract

Point cloud completion aims to generate a complete and high-fidelity point cloud from an initially incomplete and low-quality input. A prevalent strategy involves leveraging Transformer-based models to encode global features and facilitate the reconstruction process. However, the adoption of pooling operations to obtain global feature representations often results in the loss of local details within the point cloud. Moreover, the attention mechanism inherent in Transformers introduces additional computational complexity, rendering it challenging to handle long sequences effectively. To address these issues, we propose 3DMambaComplete, a point cloud completion network built on the novel Mamba framework. It comprises three modules. The HyperPoint Generation module encodes point cloud features using Mamba's selection mechanism and predicts a set of HyperPoints. A specific offset is estimated, and the down-sampled points become HyperPoints. The HyperPoint Spread module disperses these HyperPoints across different spatial locations to avoid concentration. Finally, a deformation method transforms the 2D mesh representation of HyperPoints into a fine-grained 3D structure for point cloud reconstruction. Extensive experiments conducted on various established benchmarks demonstrate that 3DMambaComplete surpasses state-of-the-art point cloud completion methods, as confirmed by qualitative and quantitative analyses.

2209.12133 2026-05-20 cs.RO cs.SY eess.SY

Development of a Deep Learning-Driven Control Framework for Exoskeleton Robots

Sk Hasan

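AI summary: This paper proposes a computationally efficient deep learning control framework for high-degree-of-freedom exoskeleton robots, addressing the real-time computational limits of conventional model-based control. A parallel-structured deep neural network trained on physics-based data predicts the joint torques needed for trajectory tracking, a proportional-derivative controller compensates for prediction errors, and the stability and robustness of the control scheme are demonstrated.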

Journal ref
Actuators 15, 274 (2026)
Abstract

The purpose of this study is to develop a computationally efficient deep-learning-based control framework for high-degree-of-freedom exoskeleton robots to address the real-time computational limitations associated with conventional model-based control. A parallel-structured deep neural network was designed for a seven-degree-of-freedom human lower-extremity exoskeleton robot. The network consists of four layers with 49 densely connected neurons and was trained using physics-based data generated from the analytical dynamic model. During real-time implementation, the trained neural network predicts the joint torque commands required for trajectory tracking, while a proportional-derivative controller compensates for residual prediction errors. Stability of the proposed control scheme was analytically established, and robustness to parameter variations was evaluated using analysis of variance. Comparative simulations were conducted against computed torque, model reference computed torque, sliding mode, adaptive, and linear quadratic controllers under identical robot dynamics. Results demonstrate accurate trajectory tracking with torque profiles comparable to conventional nonlinear controllers while reducing computational burden. These findings suggest that the proposed deep-learning-based hybrid controller offers an efficient and robust alternative for controlling multi-degree-of-freedom exoskeleton robots.
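
The hybrid control law described above, a learned feedforward torque plus a PD correction, can be sketched per joint as follows. This is a minimal sketch under stated assumptions: `hybrid_torque` is a hypothetical name, the network output is passed in as plain numbers rather than computed, and the gains are illustrative values that do not come from the paper.

```python
from typing import List, Sequence


def hybrid_torque(
    tau_nn: Sequence[float],  # torques predicted by the trained network (given, not computed here)
    q_des: Sequence[float],   # desired joint positions [rad]
    q: Sequence[float],       # measured joint positions [rad]
    dq_des: Sequence[float],  # desired joint velocities [rad/s]
    dq: Sequence[float],      # measured joint velocities [rad/s]
    kp: float,                # proportional gain (illustrative, not from the paper)
    kd: float,                # derivative gain (illustrative, not from the paper)
) -> List[float]:
    """Add a PD correction on the tracking error to the feedforward torque."""
    return [
        t + kp * (qd_i - q_i) + kd * (dqd_i - dq_i)
        for t, qd_i, q_i, dqd_i, dq_i in zip(tau_nn, q_des, q, dq_des, dq)
    ]


# Two-joint example: the first joint lags its target, the second is on target.
tau = hybrid_torque(
    tau_nn=[1.0, -0.5],
    q_des=[0.5, 0.0],
    q=[0.25, 0.0],
    dq_des=[0.0, 0.0],
    dq=[0.0, 0.0],
    kp=8.0,
    kd=2.0,
)
```

When the network's prediction is accurate the tracking error stays small and the PD term contributes little, so the expensive analytical dynamics need not be evaluated online.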

2102.11840 2026-05-20 cs.LG cs.NA math.NA math.PR

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

Arnulf Jentzen, Timo Kröger

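AI summary: This paper studies the convergence rate of artificial neural networks with piecewise affine activation functions optimized by batch gradient descent in the overparameterized regime. It proves that the mean squared error converges to zero at a linear rate when the network width is sufficiently large and the learning rate is sufficiently small.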

Comments 49 pages

Abstract

In recent years, artificial neural networks have developed into a powerful tool for addressing a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why gradient descent optimization algorithms with random initialization, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations, even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this issue in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article, we provide a further contribution to this area of research by considering overparameterized fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit activation. Specifically, given that the activation function is not affine and the training input data are pairwise distinct, we show that, with high probability, the mean squared error of such a randomly initialized artificial neural network optimized via batch gradient descent converges to zero at a linear convergence rate as long as the width of the artificial neural network is sufficiently large and the learning rate is sufficiently small.
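
For orientation, "linear convergence rate" is read here in the usual optimization sense. The notation is an assumption, since the abstract introduces no symbols: write $\mathcal{L}(\theta_n)$ for the mean squared error after $n$ gradient descent steps. A linear rate then means that constants $C > 0$ and $\rho \in (0, 1)$ exist such that

```latex
\[
  \mathcal{L}(\theta_n) \;\le\; C \, \rho^{\,n}
  \qquad \text{for all } n \in \mathbb{N}_0 ,
\]
```

so the error shrinks geometrically in the number of steps. The abstract states that this holds with high probability over the random initialization, for sufficiently large width and sufficiently small learning rate. It does not give values for $C$ or $\rho$, so none are assumed here.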

2605.19762 2026-05-20 cs.AI cs.CL

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen

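AI summary: Using controlled pretraining experiments, this paper studies how code affects reasoning ability. It finds that code mainly improves programming ability rather than general reasoning and competes with knowledge-intensive tasks, especially complex mathematical reasoning, while structured reasoning traces (such as code-text and math-text mixtures) improve reasoning more than pure executable code.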

Comments Accepted by ICML 2026, 22 pages, 10 figures

Abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

2605.19758 2026-05-20 cs.AI cs.DB stat.ML

CogScale: Scalable Benchmark for Sequence Processing

Yannis Bendi-Ouis, Romain de Coudenhove, Xavier Hinaut

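AI summary: This paper proposes CogScale, a benchmark of 14 scalable synthetic tasks for evaluating the cognitive and memory abilities of different architectures at different parameter scales. Its standardized, lightweight framework speeds up the validation of architectural innovations.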

Abstract

The ability to maintain and manipulate information over time is a fundamental aspect of living beings and Artificial Intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1k, 10k, and 100k) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.

2605.19752 2026-05-20 cs.LG

MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

Paul Krzakala, Gabriel Melo, Camille Lançon, Charlotte Laclau, Rémi Flamary, Etienne Thévenot, Florence d'Alché-Buc

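AI summary: This study proposes MSAlign, which uses multimodal alignment to align molecule and mass spectra foundation models and improve the accuracy of metabolite identification, and examines the distribution shift introduced by data splitting strategies.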

Abstract

Accurately identifying metabolites, i.e., small molecules, from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists of recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train, and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
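
A candidate-based contrastive objective of the kind mentioned above can be sketched as a softmax cross-entropy over one spectrum's candidate set. This is a generic InfoNCE-style formulation, not necessarily the exact loss used by MSAlign. The function names, the cosine similarity, and the temperature value are assumptions, and the toy 2-D vectors stand in for projected DreaMS and ChemBERTa embeddings.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_contrastive_loss(
    spectrum_emb: Sequence[float],
    candidate_embs: Sequence[Sequence[float]],
    true_index: int,
    temperature: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Negative log-probability of the true molecule among the candidates."""
    logits = [cosine(spectrum_emb, c) / temperature for c in candidate_embs]
    peak = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[true_index]


# The spectrum embedding points at the first candidate, away from the second.
spectrum = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]
loss_when_aligned = candidate_contrastive_loss(spectrum, candidates, true_index=0)
loss_when_misaligned = candidate_contrastive_loss(spectrum, candidates, true_index=1)
```

Minimizing this loss pulls the spectrum embedding toward its true molecule and away from the other candidates. Because both encoders are frozen, only the projection layers producing these vectors would receive gradients.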

2605.19750 2026-05-20 cs.CV

CPC-VAR: Continual Personalized and Compositional Generation in Visual Autoregressive Models

Junhao Li, Xinhao Zhong, Yi Sun, Yuxia Qiao, Bin Chen, Shu-Tao Xia, Yaowei Wang

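AI summary: This paper studies continual personalized generation in visual autoregressive models. It proposes a unified framework that uses gradient-based concept neuron selection and a context-aware composition strategy to address the key challenges of sequential single-concept learning and multi-concept synthesis, improving long-sequence continual personalization and multi-concept image synthesis.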

Abstract

Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

2605.19748 2026-05-20 cs.AI cs.MA

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Yin Xiaolong, Liu Yu, Shen Jiahang, Lu Xingyu, Ni Jingzhe, Fan Fengxiao, Sang Fan

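AI summary: This paper proposes a memory-augmented reinforcement learning framework for generating CAD models. By introducing reinforcement learning into retrieval and policy optimization, it avoids retrieval traps and improves the success rate and geometric consistency of complex CAD model generation.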

Comments 26 pages; multilingual submission: English version first, followed by Chinese version

Abstract

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex CAD models characterized by long operation sequences, diverse operation types, and strong geometric constraints, primarily because reasoning chains break and effective error-correction mechanisms are lacking. To address this problem, this paper proposes a memory-augmented reinforcement learning framework for CAD generation agents. The framework encapsulates the underlying geometric kernel into a structured toolchain callable by the agent and builds a closed-loop mechanism of design intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, and proposes a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent can effectively avoid retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. Experiments show that the proposed method significantly improves both the success rate and geometric consistency on complex CAD model generation tasks.

2605.19744 2026-05-20 cs.CV

Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection

Albert Schotschneider, Daniel Bogdoll, Svetlana Pavlitska, Ahmed Abouelazm, Johann Marius Zoellner

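AI summary: This paper proposes an adaptable real-time anomaly detection method that uses pretrained vision transformer embeddings and detects deviations via nearest-neighbor similarity in the latent semantic feature space. The method is evaluated in real-world scenarios.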

Comments Accepted at CVPR 2026 Workshop AUTOPILOT-NA

Abstract

Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.
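
The patch-wise nearest-neighbor scoring described above can be sketched as follows. Everything specific here is an assumption for illustration: the toy 2-D vectors stand in for vision transformer patch embeddings, the score is defined as one minus the best cosine similarity to any reference patch, and the 0.5 threshold is arbitrary. The paper's exact scoring rule may differ.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(
    test_patches: Sequence[Sequence[float]],
    reference_patches: Sequence[Sequence[float]],
) -> List[float]:
    """Score each test patch by its dissimilarity to the closest reference patch."""
    return [
        1.0 - max(cosine(patch, ref) for ref in reference_patches)
        for patch in test_patches
    ]


# Patch embeddings of the single reference image that defines normality.
reference = [[1.0, 0.0], [0.0, 1.0]]
# Three test patches: matches a reference, lies between both, matches neither.
test = [[2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

scores = patch_anomaly_scores(test, reference)
mask = [score > 0.5 for score in scores]  # threshold chosen only for this example
```

Reshaping `mask` back onto the patch grid would give the dense anomaly mask the abstract refers to. No training is involved, which is why a single reference image suffices in this formulation.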

2605.19738 2026-05-20 cs.CL cs.AI

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia

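AI summary: This paper proposes TERGAD, a novel data augmentation framework that enhances graph anomaly detection using the semantic reasoning capabilities of large language models. It translates node topological properties into descriptive natural language and fuses the resulting semantic embeddings with the original node attributes through a gated dual-branch autoencoder to detect anomalous graph entities more effectively.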

Comments 14 pages, 5 figures

Abstract

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.
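
One common way to realize a gated fusion of two embeddings is a sigmoid gate computed from their concatenation, sketched below. The abstract does not spell out TERGAD's gate, so this form, the function name `gated_fusion`, and the toy weights are all assumptions. The released code should be treated as the reference.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    attr_emb: Sequence[float],             # embedding of the original node attributes
    sem_emb: Sequence[float],              # LLM-derived semantic embedding (same length)
    gate_weights: Sequence[Sequence[float]],  # d rows, each of length 2 * d
    gate_bias: Sequence[float],            # length d
) -> List[float]:
    """Blend two embeddings per dimension with gate g = sigmoid(W [a; s] + b)."""
    concat = list(attr_emb) + list(sem_emb)
    fused = []
    for row, bias, a, s in zip(gate_weights, gate_bias, attr_emb, sem_emb):
        gate = sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
        fused.append(gate * s + (1.0 - gate) * a)  # gate near 1 favors the semantic branch
    return fused


attributes = [1.0, 3.0]
semantics = [3.0, 1.0]
zero_weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

# With zero weights and zero bias every gate is 0.5, so the result is the average.
balanced = gated_fusion(attributes, semantics, zero_weights, [0.0, 0.0])
# A large positive bias saturates the gate and passes the semantic branch through.
semantic_only = gated_fusion(attributes, semantics, zero_weights, [50.0, 50.0])
```

In a trained model the gate weights are learned, so each node can lean on whichever branch reconstructs it better.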

2605.19735 2026-05-20 cs.CL cs.AI

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin

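AI summary: This paper proposes ContextRAG, a graph retrieval-augmented generation system that needs no large language model to extract entities and relations. It builds a fuzzy concept graph using residual-quantization k-means and Formal Concept Analysis, and reaches 33.6% F1 on a 130-task UltraDomain subset at a far lower indexing cost than a local HiRAG reproduction.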

Comments Preprint. 6 tables

Abstract

Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.
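
Two pieces of the abstract can be made concrete with a few lines of arithmetic. The first part of the sketch shows the standard Łukasiewicz t-norm and its residuum on membership degrees in [0, 1]. How ContextRAG applies them to chunk embeddings is not described in the abstract and is not modeled here. The second part reproduces the linear extrapolation of HiRAG indexing tokens from 20 to 130 tasks. All function names are hypothetical.

```python
def luk_tnorm(a: float, b: float) -> float:
    """Lukasiewicz t-norm (fuzzy conjunction): max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def luk_residuum(a: float, b: float) -> float:
    """Lukasiewicz residuum (fuzzy implication a -> b): min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def extrapolate_linearly(measured: float, tasks_measured: int, tasks_target: int) -> float:
    """Scale a measured cost proportionally to a larger number of tasks."""
    return measured * tasks_target / tasks_measured


# Fuzzy degrees, e.g. how strongly two chunks share an attribute.
both = luk_tnorm(0.7, 0.6)         # conjunction of two partial memberships
implies = luk_residuum(0.7, 0.3)   # degree to which 0.7 "implies" 0.3
trivial = luk_residuum(0.3, 0.7)   # implication is fully true when a <= b

# Figures from the abstract: 3.54M tokens on 20 tasks, projected to 130 tasks.
projected_tokens = extrapolate_linearly(3_540_000, 20, 130)  # 23,010,000
contextrag_tokens = 22_073
cost_ratio = projected_tokens / contextrag_tokens
```

The projection gives about 23.01M tokens, consistent with the abstract's "over 23M". It assumes cost grows linearly with the number of tasks, which the abstract itself presents as an extrapolation rather than a measurement.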

2605.19734 2026-05-20 cs.CV

GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

Tiantong Fang, Xiuwei Wang, Jing Xiao, Wujie Zhou, Liang Liao, Mi Wang

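AI summary: This paper proposes the GeoMamba framework, which introduces a geometric feature injection module and a geometric consistency constraint module to make fine-grained optical-SAR object retrieval more robust, and builds a new dataset, FGOS-as, to evaluate cross-modal retrieval performance.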

Abstract

Multi-source remote sensing enables complementary observation of ground objects, but cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.

2605.19728 2026-05-20 cs.CV

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

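AI summary: This paper proposes Aero-World, a method that converts a pretrained image-to-video diffusion model into a controllable aerial video generator. It injects acceleration and angular velocity sequences and uses a frozen Physics Probe for inertial-consistency supervision, improving how closely generated videos follow low-level action signals and how stable they are over time.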

Abstract

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.