arXivDaily: arXiv daily academic digest, updated Monday through Friday

2506.08134 2026-06-10 cs.AI cs.CY Updated version

Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem

Qiyao Wei, Samuel Holt, Jing Yang, Markus Wulfmeier, Mihaela van der Schaar

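Affiliations: University of Amsterdam; University of Cambridge; ETH Zurich; University of California, Berkeley

AI summary: To address the peer-review crisis caused by surging manuscript submissions in ML, this paper argues that AI-assisted review should be a research priority. It proposes using large language models as collaborative tools to strengthen factual verification, reviewer guidance, author improvement, and decision support, and stresses the need for more granular review data.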

Comments 18 pages, 3 figures. Accepted (Oral) at the ICML 2026 Position Paper Track

Abstract

Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.

2505.23341 2026-06-10 cs.CV Updated version

Dual-stream attention-guided learning for weakly supervised whole slide image classification

Daoxi Cao, Hangbei Cheng, Yijin Li, Ruolin Zhou, Xuehan Zhang, Xinyi Li, Binwei Li, Xuancheng Gu, Jianan Zhang, Xueyu Liu, Yongfei Wu

Affiliations: College of Computer Science and Technology, College of Data Science, Taiyuan University of Technology; College of Humanities, Law and Foreign Languages, Taiyuan University of Technology; College of Artificial Intelligence, Taiyuan University of Technology; School of Cyberspace Security, Beijing University of Posts and Telecommunications; School of Mathematics, Taiyuan University of Technology

AI summary: Proposes a dual-stream attention-guided learning framework that uses a teacher-student dual-stream architecture and attention-guided pseudo labels to identify key regions and model instance relationships in whole slide images under weak supervision; it outperforms existing methods on synthetic and real pathology datasets.

Abstract

Whole slide images (WSIs) play a crucial role in cancer diagnosis due to their ultra-high resolution and rich morphological information, and multiple instance learning (MIL) has become a prevalent paradigm for coping with the massive size of WSIs and the scarcity of fine-grained instance-level annotations. However, most existing MIL methods struggle to accurately identify diagnostically critical local regions (instances) using only slide-level labels, and they do not model the relationships between instances efficiently. To address these shortcomings, we propose a Dual-Stream Attention-Guided Learning (DSAGL) framework. DSAGL bridges slide-level supervision and instance-level learning through a teacher-student dual-stream architecture, and mitigates instance ambiguity by generating attention-guided pseudo labels. The framework employs a shared lightweight encoder to efficiently model long-range dependencies and an attention-based fusion mechanism to enhance sensitivity to sparse, informative regions. Extensive experiments on synthetic benchmarks and real-world pathological WSI datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL methods, achieving superior discriminative performance and robustness under weak supervision.

2506.14753 2026-06-10 cs.CV cs.LG Updated version

Cost-Aware Routing for Efficient Text-To-Image Generation

Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy

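Affiliations: Tandon School of Engineering, New York University; Google; Eigen 4D Inc.

AI summary: Proposes a cost-aware routing framework that automatically selects a number of denoising steps or a model according to prompt complexity, lowering computational cost while maintaining high quality and outperforming any single model.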

Comments Accepted by TMLR

Abstract

Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone. Code is available at https://github.com/winglicopy/CATImage.
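
The routing idea in this abstract can be illustrated with a toy rule. The sketch below is not the authors' learned router: it assumes that a predicted quality score per generation option is already available for each prompt, and it simply picks the option that maximizes predicted quality minus a weighted cost. The option names, costs, scores, and the `trade_off` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One text-to-image generation choice (hypothetical names and costs)."""
    name: str
    cost: float  # relative compute cost, illustrative only


def route(predicted_quality: dict[str, float], options: list[Option], trade_off: float) -> Option:
    """Pick the option maximizing predicted quality minus trade_off * cost."""
    return max(options, key=lambda o: predicted_quality[o.name] - trade_off * o.cost)


options = [Option("distilled-4-steps", 1.0), Option("full-100-steps", 25.0)]

# Hypothetical per-prompt quality predictions, e.g. from a learned estimator.
simple_prompt = {"distilled-4-steps": 0.80, "full-100-steps": 0.82}
complex_prompt = {"distilled-4-steps": 0.40, "full-100-steps": 0.85}

print(route(simple_prompt, options, trade_off=0.01).name)   # distilled-4-steps
print(route(complex_prompt, options, trade_off=0.01).name)  # full-100-steps
```

With these made-up numbers the expensive option is reserved for the prompt where it is predicted to help substantially, which is the behavior the abstract describes.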

2406.08726 2026-06-10 cs.CL Updated version

Standard Language Ideology in AI-Generated Language

Genevieve Smith, Eve Fleisig, Ishita Rustagi, Xavier Yin

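Affiliations: Stanford University; UC Berkeley

AI summary: Presents a taxonomy showing how large language models reinforce standard language ideology and marginalize language varieties, and discusses the societal implications along with recommendations for addressing them.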

Comments To appear in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

Abstract

Large language models (LLMs) generate text that reinforces standard language ideology: a bias towards certain language varieties that are granted more prestige, authority, and legitimacy than others. This paper contributes a sociotechnically grounded faceted taxonomy that illustrates how generative AI systems reproduce standard language ideology and its societal implications. We introduce the concept of standard AI-generated language ideology to explain how AI systems confer legitimacy on certain language varieties while marginalizing others, structuring patterns of performance disparity, stereotyping, appropriation, and erasure. We then discuss ongoing tensions around what constitutes desirable system behavior, as well as advantages and drawbacks of generative AI tools attempting or refusing to imitate different language varieties. To address the power relations shaping generative AI and the mechanisms identified in our taxonomy--legitimation, stereotyping, appropriation, and erasure--we offer recommendations that emphasize accountability, community agency, control, and ownership. These recommendations recognize linguistic diversity as a resource to be protected, valued, and sustained as part of a just AI future.

2506.09171 2026-06-10 cs.LG cs.AI cs.CL Updated version

Fact-Augmented Lookahead Planning for LLM Agents

Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

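Affiliations: University of Cambridge

AI summary: Proposes the LWM-Planner framework, which extracts key facts from trajectories and uses them to condition action proposal, world-model simulation, and state-value estimation, improving online planning without parameter updates; it outperforms ReAct/Reflexion and search-only baselines across several environments.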

Comments Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026). Camera-ready version. 9-page main text plus appendices (63 pages total), 1 figure

Abstract

Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insufficient. We introduce LWM-Planner, a fact-augmented lookahead planning framework that improves agent behavior purely through in-context learning. After each episode, the agent extracts task-critical atomic facts from its trajectories, validates candidates with a lightweight predictive-consistency filter (and optionally compresses them), and uses the resulting fact set to condition action proposal, single-step latent world-model simulation, and state-value estimation. Planning then proceeds via recursive, depth-limited lookahead over candidate trajectories conditioned on the accumulated facts and recent history, enabling online improvement without parameter updates. We provide abstraction-style motivation: treating facts as reducing state aliasing (proxy $\epsilon_{\mathrm{sim}}$) and fact-conditioned simulation as lowering one-step error (proxy $\delta_{\mathrm{model}}$), without claiming formal guarantees. Empirically, on text FrozenLake variants, CrafterMini, and ALFWorld, the approach improves cumulative return over ReAct/Reflexion and search-only baselines, suggesting that additional test-time search is most useful when grounded by compact, experience-derived facts.

2506.03411 2026-06-10 cs.LG cs.GT Updated version

A Machine Learning Theory Perspective on Strategic Litigation

Melissa Dutz, Han Shao, Avrim Blum, Aloni Cohen

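Affiliations: Toyota Technological Institute at Chicago; University of Maryland; The University of Chicago

AI summary: From a machine learning theory perspective, models how a strategic litigator in a common law system influences the lower court's decision rule through case selection, analyzes the litigator's influence and optimal strategies, and finds counterintuitive phenomena.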

Abstract

Strategic litigation involves bringing a case to court with the goal of having an impact beyond resolving the particular dispute at hand. In a common law system, one way a case may have far-reaching impact is by establishing new legal precedent that later courts must follow. In this paper, we explore strategic litigation from the perspective of machine learning theory. We consider an abstract model of a common law legal system where a lower court decides new cases by applying a decision rule learned from a higher court's past rulings. In this model, we explore the power of a strategic litigator, who strategically brings cases to the higher court to influence the decision rule applied by the lower court in future cases. We explore questions including: What impact can a strategic litigator have? Which cases should a strategic litigator bring to court? Does it ever make sense for a strategic litigator to bring a case when they are sure the court will rule against them? We show that this strategic case selection problem has interesting structure, with even simple settings exhibiting counterintuitive phenomena. When cases are represented by points in one dimension and the lower court's learning algorithm is nearest neighbor, or as points in d dimensions and the lower court's learning algorithm is a support vector machine, we characterize the set of inducible decision rules and develop algorithms for selecting an optimal set of cases to bring to the higher court given the strategic litigator's objectives.
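
To make the one-dimensional nearest-neighbor setting concrete, here is a minimal sketch under assumptions of our own: cases are real numbers, each precedent carries a ruling label, and the lower court copies the ruling of the closest precedent. The function name, the labels, and the numbers are hypothetical, and the sketch does not reproduce the paper's algorithms for choosing an optimal set of cases.

```python
def lower_court_ruling(precedents: list[tuple[float, str]], new_case: float) -> str:
    """Return the ruling of the precedent closest to new_case (1-nearest neighbor)."""
    if not precedents:
        raise ValueError("at least one precedent is required")
    _, ruling = min(precedents, key=lambda p: abs(p[0] - new_case))
    return ruling


# Two existing higher-court rulings on a one-dimensional case space.
precedents = [(0.2, "plaintiff"), (0.8, "defendant")]
print(lower_court_ruling(precedents, 0.55))  # defendant (closest precedent is 0.8)

# A strategic litigator brings a case at 0.6 and wins it, shifting the rule.
precedents.append((0.6, "plaintiff"))
print(lower_court_ruling(precedents, 0.55))  # plaintiff (closest precedent is now 0.6)
```

A single added precedent moves the decision boundary for every future case near it, which is the kind of influence the abstract sets out to characterize.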

2411.05698 2026-06-10 cs.CV cs.AI cs.LG Updated version

Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification

Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla

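Affiliations: Politecnico di Milano

AI summary: Proposes the Visual-TCAV framework, which combines Concept Activation Vectors with Integrated Gradients to generate class-agnostic saliency maps and estimate concept attributions; in a controlled experiment it aligns with ground-truth explanations better than TCAV.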

Comments Accepted in TMLR

Abstract

Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.

2505.11702 2026-06-10 cs.LG stat.ML Updated version

Post-Training Augmentation Invariance

Keenan Eikenberry, Lizuo Liu, Yoonsang Lee

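Affiliations: Department of Mathematics, Dartmouth College

AI summary: Proposes a post-training augmentation invariance framework in which a lightweight MLP adapter network on the latent space of a pretrained model achieves approximate invariance without fine-tuning while preserving the original features.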

Abstract

This work develops a framework for post-training augmentation invariance, in which our goal is to add invariance properties to a pretrained network without altering its behavior on the original, non-augmented input distribution. We define this notion precisely and additionally introduce augmented encoders, which are probabilistic encoders that formalize augmentation-based encoding processes and that serve as our fundamental object of study. We introduce two losses for augmented encoders, namely, Markov-Wasserstein minimization and Wasserstein correlation maximization, and we demonstrate empirically that both losses can be used to train lightweight, one-hidden-layer MLP adapter networks $E_{\theta}$ that, when appended to the latent space of a pretrained network $F$, do indeed lead to (approximate) post-training augmentation invariance. For example, on STL10 with $F=\text{DINO}$ features, the composite network $C\circ E_{\theta}\circ F$, where $C$ is a linear classifier and where $E_{\theta}$ is one of our proposed adapter networks, achieves 94% classification accuracy on arbitrarily rotated images, whereas a network of the form $C\circ F$ without the adapter $E_{\theta}$ drops to 71% accuracy. Similarly, we can boost noise-invariant classification results from 58% up to 86%. Significantly, we obtain these results with no fine-tuning (the weights of $F$ remain frozen throughout), and our methods introduce little corruption to the original features, since $E_{\theta}$ acts nearly isometrically on the non-augmented latent distribution. In contrast, we show that adapter networks trained with alternative candidate losses, specifically SimCLR and HSIC maximization, produce uncompetitive classification results and fundamentally corrupt the original latent space. Code available at https://github.com/keenan-eikenberry/augmentation_invariance

2505.11034 2026-06-10 cs.CV cs.AI cs.LG Updated version

CleanPatrick: A Benchmark for Image Data Cleaning

Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly

Affiliations: University of Basel; Lucerne University of Applied Sciences and Arts; University Hospital of Basel; Northwestern University; Northeast Dermatology Associates; Medical University of Vienna; Banner Health

AI summary: Introduces CleanPatrick, the first large-scale benchmark for image data cleaning, built on the Fitzpatrick17k dermatology dataset; it collects a large set of crowdsourced annotations aggregated with item-response theory, formalizes issue detection as a ranking task, and evaluates a range of methods.

Comments Accepted at Journal of Data-centric Machine Learning Research (DMLR)

Abstract

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (32%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and employs standard ranking metrics that mirror real audit workflows. We benchmark classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, FINE, BHN, and SelfClean. On CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and detecting implausible labels under conservative human judgment remains challenging for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies.
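
Framing issue detection as a ranking task means a cleaning method outputs an ordering of samples, and the score reflects how early the true issues appear in it. The abstract does not say which ranking metrics the benchmark uses, so the sketch below assumes average precision, one common choice, purely for illustration; the function name and the example ranking are hypothetical.

```python
def average_precision(ranked_is_issue: list[bool]) -> float:
    """Average precision of a ranking, where True marks a sample with a real issue.

    Samples are ordered from most to least suspicious. Precision is measured at
    each position holding a true issue and averaged over all true issues.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_issue in enumerate(ranked_is_issue, start=1):
        if is_issue:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# A reviewer inspects four samples in ranked order; the 1st and 3rd are real issues.
print(round(average_precision([True, False, True, False]), 4))  # 0.8333
```

A metric of this kind rewards methods that surface real issues near the top of the list, which matters when an auditor can only review a limited number of samples.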

2501.00745 2026-06-10 cs.CL cs.AI cs.GT cs.IR econ.TH Updated version

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

Xiyang Hu

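Affiliations: Arizona State University

AI summary: Models ranking manipulation attacks as an Infinitely Repeated Prisoners' Dilemma and analyzes the conditions for sustaining cooperation; it finds that lowering attack success rates can instead incentivize attacks and that defensive measures are ineffective in some scenarios.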

Comments New Frontiers in Game-Theoretic Learning Workshop, International Conference on Machine Learning (ICML) 2026

Abstract

The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
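
The role of forward-looking players can be seen in the textbook repeated Prisoners' Dilemma with a grim-trigger strategy. The sketch below uses the standard payoffs (temptation T, reward R, punishment P, with T > R > P) rather than the paper's own model, which additionally involves attack costs and attack success rates. Cooperating forever yields R / (1 - d), while deviating once yields T + d * P / (1 - d), so cooperation is sustainable when the discount factor d is at least (T - R) / (T - P). All names and numbers are illustrative assumptions.

```python
def critical_discount_factor(temptation: float, reward: float, punishment: float) -> float:
    """Smallest discount factor at which grim trigger sustains cooperation."""
    if not temptation > reward > punishment:
        raise ValueError("expected temptation > reward > punishment")
    return (temptation - reward) / (temptation - punishment)


def cooperation_sustainable(discount: float, temptation: float, reward: float, punishment: float) -> bool:
    """True if a player with this discount factor prefers not to deviate."""
    return discount >= critical_discount_factor(temptation, reward, punishment)


# Illustrative payoffs: T = 5, R = 3, P = 1 give a threshold of 0.5.
print(critical_discount_factor(5.0, 3.0, 1.0))       # 0.5
print(cooperation_sustainable(0.8, 5.0, 3.0, 1.0))   # True: patient player cooperates
print(cooperation_sustainable(0.3, 5.0, 3.0, 1.0))   # False: impatient player attacks
```

In this simplified setting, the more a player values future payoffs, the easier cooperation is to sustain, which matches the abstract's statement about forward-looking players.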

2505.08213 2026-06-10 cs.RO Updated version

HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands

Huang Junda, Honghao Guo, Hao Wu, Zhengyang Liu, Marcelo H Ang, Jianshu Zhou

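Affiliations: The Chinese University of Hong Kong; National University of Singapore

AI summary: Proposes HandCept, the first visual-inertial proprioception framework, which uses zero-shot learning and a latency-free Extended Kalman Filter to fuse a wrist-mounted RGB-D camera with 9-axis IMUs, achieving joint angle estimation errors of 2°-4° without drift and outperforming visual-only and inertial-only methods.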

Comments 8 pages, 7 figures, conference

Abstract

As robotics progresses toward general manipulation, dexterous hands are becoming increasingly critical. However, proprioception in dexterous hands remains a bottleneck due to limitations in volume and generality. In this work, we present HandCept, the first visual-inertial proprioception framework designed to overcome the challenges of traditional joint angle estimation methods for dexterous hands. HandCept addresses the difficulty of achieving accurate and robust joint angle estimation in dynamic environments where both visual and inertial measurements are prone to noise and drift. It leverages a zero-shot learning approach using a wrist-mounted RGB-D camera and 9-axis IMUs, fused in real time via a latency-free Extended Kalman Filter (EKF). Our results show that HandCept achieves joint angle estimation errors generally between $2^{\circ}$ and $4^{\circ}$ without observable drift, outperforming visual-only and inertial-only methods. Furthermore, we validate the stability and uniformity of the IMU system, demonstrating that a common base frame across IMUs simplifies system calibration. To support sim-to-real transfer, we also open-source our high-fidelity rendering pipeline, which is essential for training without real-world ground truth. This work offers a robust, generalizable solution for proprioception in dexterous hands, with significant implications for robotic manipulation and human-robot interaction. Code: https://github.com/huangjund/blenderYCB

2505.01458 2026-06-10 cs.RO cs.AI Updated version

A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

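Affiliations: Department of Computer Science, City University of Hong Kong; School of Electrical and Electronic Engineering, Nanyang Technological University; Department of Informatics, Universität Hamburg

AI summary: Surveys the key properties, task support, and hardware requirements of physics simulators for narrowing the sim-to-real gap in navigation and manipulation for Embodied AI, and provides resources covering benchmark datasets, metrics, platforms, and methods.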

Comments Under Review

Abstract

Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly, time-consuming, and unsafe. Therefore, sim-to-real transfer has emerged as a key approach, yet the sim-to-real gap persists. This survey examines how physics simulators address this gap by analyzing properties that have received limited attention in prior surveys. We also analyze their features for navigation and manipulation tasks, as well as their hardware requirements. Additionally, we offer a resource with benchmark datasets, metrics, simulation platforms, and methods to help researchers select suitable tools while accounting for hardware constraints.

2504.18424 2026-06-10 cs.CV Updated version

LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka

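Affiliations: ETH Zurich; Adobe Research

AI summary: Proposes LaRI, which uses layered point maps to predict the intersections of camera rays with multiple surfaces, achieving complete scene reconstruction in a single feed-forward pass and supporting both object-level and scene-level tasks.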

Comments Project page: https://ruili3.github.io/lari

Journal ref ICML 2026

Abstract

We present Layered Ray Intersections (LaRI), a fully supervised method for occluded geometry reasoning from a single image. Unlike conventional depth estimation, which is limited to visible surfaces, LaRI predicts multiple surfaces intersected by the camera rays using layered point maps. Compared to existing approaches that leverage neural implicit representations or iterative refinement, LaRI achieves complete scene reconstruction in one feed-forward pass, enabling efficient and view-aligned geometric reasoning to underpin both object-level and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. To better underpin and evaluate this task, we build an annotation pipeline using rendering engines and construct annotations for five public datasets, including synthetic and real-world data covering 3D objects and scenes. As a generic method, LaRI's performance is validated in object-level and scene-level reconstruction tasks.

2504.03118 2026-06-10 cs.CV cs.AI Updated version

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Hai Jin, Yun Yang

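Affiliations: National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab; Swinburne University of Technology; Deakin University

AI summary: For edge devices that only need to recognize specific classes, proposes NuWa, which removes detrimental weights through self-knowledge purification and efficiently derives compact ViTs via closed-form optimization, improving class-specific accuracy and accelerating inference without retraining.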

Comments Accepted at CVPR 2026

Abstract

Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69x pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss. Project Page: https://github.com/CGCL-codes/NuWa.

2502.09928 2026-06-10 cs.CV cs.AI 版本更新

Deep Tree Tensor Networks

深度树张量网络

Chang Nie

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 提出深度树张量网络(DTTN),通过多线性运算捕获指数阶特征交互,在多个基准上超越现有方法。

详情
AI中文摘要

源自量子物理的张量网络(TNs)已被广泛用作识别任务中的指数机器和参数分解器。典型的TN模型,如矩阵乘积态(MPS),在自然图像识别中尚未取得成功应用。当它们被使用时,主要是在现有网络中压缩参数,从而失去了捕获指数阶特征交互的独特能力。本文提出了一种名为深度树张量网络(DTTN)的新架构,它通过多线性运算捕获跨特征的$2^L$阶乘法交互,同时本质上展开为具有参数共享属性的树状TN拓扑。DTTN由多个反对称交互模块(AIMs)堆叠而成,这种设计便于高效实现。此外,我们的理论分析证明了量子启发的TN模型与多项式/多线性网络在特定条件下的等价性。我们认为DTTN可以促进该领域内更具可解释性的研究。所提出的模型在多个基准和领域上进行了评估,显示出优于同行方法和最先进架构的性能。我们的代码在 https://github.com/NieCha/deep_tree_tensor_network 公开提供。

英文摘要

Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image recognition. When employed, they primarily serve to compress parameters within pre-existing networks, thereby losing their distinctive capability to capture exponential-order feature interactions. This paper introduces a novel architecture named Deep Tree Tensor Network (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a tree-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interaction modules (AIMs), and this design facilitates efficient implementation. Furthermore, our theoretical analysis demonstrates the equivalence between quantum-inspired TN models and polynomial/multilinear networks under specific conditions. We posit that the DTTN could catalyze more interpretable research within this field. The proposed model is evaluated across multiple benchmarks and domains, demonstrating superior performance compared to both peer methods and state-of-the-art architectures. Our code is publicly available at https://github.com/NieCha/deep_tree_tensor_network.
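
The claim that stacked multilinear layers reach $2^L$-order interactions can be illustrated with a small sketch. It assumes, purely for illustration, that each layer is the elementwise product of two linear maps of its input; this is not the paper's AIM definition, and the names `interaction_layer`, `stack`, and `random_layers` are hypothetical. Each such layer is degree 2 in its input, so L stacked layers are degree 2^L in the original features.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def interaction_layer(A, B, x):
    """Elementwise product of two linear maps of x (hypothetical layer form).

    The output is a degree-2 polynomial in the entries of x.
    """
    return [a * b for a, b in zip(matvec(A, x), matvec(B, x))]

def stack(layers, x):
    """Apply the layers in order; every layer doubles the polynomial degree."""
    for A, B in layers:
        x = interaction_layer(A, B, x)
    return x

def random_layers(num_layers, dim, seed=0):
    """Build `num_layers` pairs of random dim x dim matrices (illustrative weights)."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    return [(mat(), mat()) for _ in range(num_layers)]
```

One way to check the degree is homogeneity: with L = 3 layers, scaling the input by 2 should scale every output by 2^(2^3) = 256, which is what a degree-8 homogeneous polynomial does.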

2501.14717 2026-06-10 cs.CL 版本更新

What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects

表格LLM真正重要的是什么?模型与数据影响的元评估

Naihao Deng, Sheng Zhang, Henghui Zhu, Shuaichen Chang, Jiani Zhang, Alexander Hanbo Li, Chung-Wei Hang, Hideo Kobayashi, Yiqun Hu, Patrick Ng

发表机构 * University of Michigan(密歇根大学) AWS AI Labs(AWS人工智能实验室) Figma OKX Google(谷歌)

AI总结 通过指令微调12个模型并在16个基准上评估,发现基座模型选择比训练数据对性能影响更大,泛化与推理仍是挑战。

Comments EACL 2026 Findings

详情
AI中文摘要

表格建模已经发展了数十年。在这项工作中,我们重新审视了这一轨迹,并强调了LLM时代出现的新挑战,特别是选择悖论:在表格指令微调的背景下,由于基础模型和训练集的多样性,难以将性能提升归因于特定因素。我们通过指令微调三个基础模型在四个现有数据集上,复制了四个表格LLM,共得到12个模型。然后我们在16个表格基准上评估这些模型。我们的研究首次定量分离了训练数据和基础模型选择的影响,揭示了基础模型选择比训练数据本身起更主导的作用。泛化和推理仍然具有挑战性,需要未来在表格建模上继续努力。基于我们的发现,我们分享了对表格建模未来方向的思考。

英文摘要

Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diverse base models and training sets in the context of table instruction tuning. We replicate four table LLMs by instruction-tuning three foundation models on four existing datasets, yielding 12 models. We then evaluate these models across 16 table benchmarks. Our study is the first to quantitatively disentangle the effects of training data and base model selection, revealing that base model choice plays a more dominant role than the training data itself. Generalization and reasoning remain challenging, inviting future effort on table modeling. Based on our findings, we share our thoughts on the future directions for table modeling.

2412.11449 2026-06-10 cs.SD cs.AI cs.CL cs.LG eess.AS 版本更新

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

Whisper-GPT -- 语音和音乐的连续离散混合表示语言模型

Prateek Verma

发表机构 * Stanford University(斯坦福大学)

AI总结 提出Whisper-GPT,一种结合连续音频表示(如频谱图)和离散音频令牌的生成式大语言模型,解决了离散令牌方法上下文长度过长的问题,在语音和音乐的下一个令牌预测中降低了困惑度和负对数似然。

Comments 6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India

详情
AI中文摘要

我们提出了WHISPER-GPT:一种用于语音和音乐的生成式大语言模型(LLM),它允许我们在单个架构中同时处理连续音频表示和离散令牌。近年来,利用神经压缩算法(例如ENCODEC)导出的离散音频令牌的生成式音频、语音和音乐模型激增。然而,这种方法的主要缺点之一是处理上下文长度。如果必须考虑不同频率下的所有音频内容来进行下一个令牌预测,那么对于高保真生成架构来说,上下文长度会急剧增长。通过结合连续音频表示(如频谱图)和离散声学令牌,我们保留了两者的优点:在单个令牌中拥有来自音频特定时间实例的所有必要信息,同时允许LLM预测未来令牌,从而获得采样和离散空间提供的其他好处。我们展示了与基于令牌的语音和音乐LLM相比,我们的架构如何提高下一个令牌预测的困惑度和负对数似然分数。

英文摘要

We propose WHISPER-GPT, a generative large language model (LLM) for speech and music that works with continuous audio representations and discrete tokens simultaneously within a single architecture. There has been a huge surge in generative audio, speech, and music models that use discrete audio tokens derived from neural compression algorithms such as ENCODEC. However, a major drawback of this approach is the context length, which blows up for high-fidelity generative architectures if next-token prediction has to account for all the audio content at various frequencies. By combining a continuous audio representation such as the spectrogram with discrete acoustic tokens, we retain the best of both worlds: all the information needed from the audio at a specific time instance is available in a single token, yet the LLM still predicts the future token, which allows for sampling and the other benefits that a discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for next-token prediction compared to a token-based LLM for speech and music.
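
The context-length problem described above can be made concrete with a short back-of-the-envelope calculation. The settings below are assumptions chosen for illustration and are not taken from the paper: a neural codec running at 75 frames per second with 8 residual codebooks per frame, where a purely discrete model flattens every codebook index into its own sequence position. The helper name `context_length` is hypothetical.

```python
def context_length(seconds, frames_per_second, tokens_per_frame):
    """Number of sequence positions needed to cover `seconds` of audio."""
    return int(round(seconds * frames_per_second * tokens_per_frame))

# Assumed, illustrative settings (not taken from the paper).
FRAMES_PER_SECOND = 75  # frame rate of a neural codec operating on 24 kHz audio
NUM_CODEBOOKS = 8       # residual codebooks per frame

# Flattened discrete tokens: every codebook index occupies its own position.
flattened = context_length(10, FRAMES_PER_SECOND, NUM_CODEBOOKS)
# One position per time instance, as in a per-frame hybrid representation.
per_frame = context_length(10, FRAMES_PER_SECOND, 1)

print(flattened, per_frame)  # prints: 6000 750
```

Under these assumptions, ten seconds of audio needs eight times fewer positions when each time instance is carried by a single token.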

2403.00420 2026-06-10 cs.LG cs.AI 版本更新

Robust Deep Reinforcement Learning Through Adversarial Attacks and Training: A Survey

通过对抗攻击和训练实现鲁棒深度强化学习:综述

Lucas Schott, Josephine Delas, Hatem Hajri, Elies Gherbi, Reda Yaich, Nora Boulahia-Cuppens, Frederic Cuppens, Sylvain Lamprier

发表机构 * Institut de Recherche Technologique SystemX(SystemX技术研究院)

AI总结 本文综述了深度强化学习中的对抗攻击与训练方法,系统分类并比较其目标与机制,以提升模型对环境变化和扰动的鲁棒性。

Comments 83 pages, 17 figures, 3 tables, 15 algorithms

详情
AI中文摘要

深度强化学习是机器学习的一个子领域,用于训练在复杂环境中执行序列动作的自主智能体。尽管在已知环境中表现出色,但它仍容易受到微小条件变化的影响,引发对其在实际应用中可靠性的担忧。为了提高可用性,深度强化学习必须展示出可信赖性和鲁棒性。提升深度强化学习对环境条件未知变化和可能扰动的鲁棒性的一种方法是通过对抗训练,即针对观测和环境动态的合适对抗攻击来训练智能体。针对这一关键问题,我们的工作深入分析了当代对抗攻击和训练方法,系统地对它们进行分类,并比较了它们的目标和操作机制。

英文摘要

Deep Reinforcement Learning (DRL) is a subfield of machine learning for training autonomous agents that take sequential actions across complex environments. Despite its significant performance in well-known environments, it remains susceptible to minor condition variations, raising concerns about its reliability in real-world applications. To improve usability, DRL must demonstrate trustworthiness and robustness. A way to improve the robustness of DRL to unknown changes in the environmental conditions and possible perturbations is through Adversarial Training, by training the agent against well-suited adversarial attacks on the observations and the dynamics of the environment. Addressing this critical issue, our work presents an in-depth analysis of contemporary adversarial attack and training methodologies, systematically categorizing them and comparing their objectives and operational mechanisms.

2411.02817 2026-06-10 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs

条件 Vendi 分数:生成式 AI 模型和 LLM 的提示感知多样性评估

Mohammad Jalali, Azim Ospanov, Amin Gohari, Farzan Farnia

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong(计算机科学与工程系,香港中文大学) Department of Information Engineering, The Chinese University of Hong Kong(信息工程系,香港中文大学)

AI总结 针对文本提示引导的生成模型,提出条件 Vendi 和条件 RKE 分数,通过条件熵分离模型自身多样性,并证明收敛性及在多个任务中恢复真实多样性排序。

详情
AI中文摘要

由文本提示引导的生成模型在保真度和提示对齐方面被广泛评估,但其产生多样化输出的能力仍未被充分探索。现有的多样性指标(如基于核矩阵的 von Neumann 和 Rényi 熵的 Vendi 和 RKE)是为无条件模型开发的,无法区分提示引起的变异和模型引起的变异。我们通过引入 Conditional-Vendi 和 Conditional-RKE 来解决这一差距,这些多样性度量源自正半定矩阵的条件熵。这些分数在提示引导生成中分离出模型引起的多样性,其中 Conditional-RKE 具有 $O(1/\sqrt{n})$ 的收敛速度。对于 Conditional-Vendi,我们引入了一种截断谱近似,产生可扩展且一致的估计。在文本到图像、图像字幕和 LLM 任务上的实验表明,条件分数恢复了真实多样性排序,并且还可以引导扩散模型生成更多样化的样本。代码库可从 https://github.com/mjalali/conditional-vendi 获取。

英文摘要

Generative models guided by text prompts are widely evaluated for fidelity and prompt alignment, yet their ability to produce diverse outputs remains underexplored. Existing diversity metrics such as Vendi and RKE, which are based on the von Neumann and Rényi entropies of kernel matrices, were developed for unconditional models and cannot distinguish prompt-induced from model-induced variability. We address this gap by introducing Conditional-Vendi and Conditional-RKE, diversity measures derived from the conditional entropy of positive semidefinite matrices. These scores isolate model-induced diversity in prompt-guided generation, with Conditional-RKE enjoying an $O(1/\sqrt{n})$ convergence rate. For Conditional-Vendi, we introduce a truncated-spectrum approximation that yields scalable and consistent estimates. Experiments on text-to-image, image-captioning, and LLM tasks show that the conditional scores recover ground-truth diversity orderings and can also guide diffusion models toward more diverse samples. The codebase is available at https://github.com/mjalali/conditional-vendi.
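
A minimal sketch of how a conditional, kernel-based diversity score might be computed, under stated assumptions. The unconditional RKE score is the exponential of the order-2 Rényi entropy of the normalized kernel matrix, which equals 1 / ||K/n||_F^2 for a kernel with unit diagonal. The conditional form below is an assumption made for illustration: the joint kernel is taken as the elementwise product of the output and prompt kernels, and the score is exp(H2(joint) - H2(prompt)). The function names are hypothetical and this is not the authors' released implementation.

```python
def frobenius_sq(K):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in K for v in row)

def rke(K):
    """exp of the order-2 Renyi entropy of K/n, i.e. 1 / ||K/n||_F^2.

    Assumes K is a kernel matrix with unit diagonal, so trace(K/n) = 1.
    """
    n = len(K)
    return (n * n) / frobenius_sq(K)

def hadamard(A, B):
    """Elementwise product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def conditional_rke(K_out, K_prompt):
    """Assumed conditional score: exp(H2(K_out * K_prompt) - H2(K_prompt))."""
    return rke(hadamard(K_out, K_prompt)) / rke(K_prompt)
```

Two sanity cases: if all four samples share one prompt (an all-ones prompt kernel) and the outputs are mutually dissimilar (an identity output kernel), the score is 4. If the outputs are fully determined by the prompt (the two kernels coincide and have 0/1 entries), the score is 1, meaning no model-induced diversity.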

2409.04111 2026-06-10 cs.LG 版本更新

Active-Passive Federated Learning for Vertically Partitioned Multi-view Data

面向垂直分区多视角数据的主动-被动联邦学习

Jiyuan Liu, Siqi Wang, Xinhang Wan, Yi Zhang, Junsong Chen, Xin Lu, Xinwang Liu

发表机构 * National University of Defense Technology(国防科技大学)

AI总结 提出主动-被动联邦学习框架,主动客户端独立构建完整模型,被动客户端仅辅助训练,解决推理时客户端协作不可靠问题,通过重构损失和对比损失实例化两种分类方法并验证有效性。

详情
AI中文摘要

垂直联邦学习是一种自然且优雅的方法,用于集成跨设备(客户端)垂直分区的多视角数据,同时保护其隐私。除了模型训练,现有方法在模型推理时需要所有客户端的协作。然而,模型推理可能长期维持服务,而协作(尤其是当客户端属于不同组织时)在现实场景中不可预测,例如合同取消、网络不可用等,导致推理失败。为了解决这个问题,我们首次尝试提出了一种灵活的主动-被动联邦学习(APFed)框架。具体来说,主动客户端是学习任务的发起者,负责构建完整模型,而被动客户端仅作为辅助。一旦模型构建完成,主动客户端可以独立进行推理。此外,我们将APFed框架实例化为两种分类方法,分别在被动客户端上采用重构损失和对比损失。同时,这两种方法在一系列实验中进行了测试,并取得了理想的结果,验证了它们的有效性。

英文摘要

Vertical federated learning is a natural and elegant approach to integrating multi-view data vertically partitioned across devices (clients) while preserving their privacy. Apart from model training, existing methods require the collaboration of all clients during model inference. However, model inference is likely to be maintained as a service for a long time, while the collaboration, especially when the clients belong to different organizations, is unpredictable in real-world scenarios (e.g., cancellation of a contract or network unavailability), which causes inference to fail. To address this issue, we make a first attempt at a flexible Active-Passive Federated learning (APFed) framework. Specifically, the active client is the initiator of a learning task and is responsible for building the complete model, while the passive clients only serve as assistants. Once the model is built, the active client can perform inference independently. In addition, we instantiate the APFed framework as two classification methods that employ the reconstruction loss and the contrastive loss on passive clients, respectively. Both methods are tested in a set of experiments and achieve the desired results, validating their effectiveness.

2206.02178 2026-06-10 cs.AI cs.LG 版本更新

Belief Acquisition as Stochastic Filtering

信念获取作为随机滤波

Dawei Chen, John Lloyd, Samuel Yang-Zhao, Kee Siong Ng

发表机构 * School of Computing, Australian National University(计算机学院,澳大利亚国立大学)

AI总结 本文提出将信念获取视为随机滤波问题,通过分解条件滤波器在高维状态空间中同时跟踪状态和估计参数,并在流行病跟踪等实验中验证有效性。

Comments 51 pages

详情
AI中文摘要

本文研究如何利用随机滤波实现信念获取。首先,概述了经验信念的理论基础。然后,研究了该背景下的随机滤波。本文引入了因子化条件滤波器,这是一种新的滤波算法,用于在高维状态空间中同时跟踪状态和估计参数。算法的条件性质用于估计参数,因子化性质用于将状态空间分解为低维子空间,使得在这些子空间上的滤波得到的分布的乘积是对整个状态空间上分布的良好近似。算法成功应用的条件是:观测在子空间级别可用,且转移模式可以分解为近似局限于子空间的局部转移模式;这些条件在计算机科学、工程和地球物理滤波应用中广泛满足。在大型接触网络上跟踪流行病和估计参数的实验结果显示了该方法的有效性。

英文摘要

This paper studies how belief acquisition can be accomplished using stochastic filtering. First, a theoretical foundation for empirical beliefs is outlined. Then stochastic filtering in this context is studied. The paper introduces factored conditional filters, new filtering algorithms for simultaneously tracking states and estimating parameters in high-dimensional state spaces. The conditional nature of the algorithms is used to estimate parameters and the factored nature is used to decompose the state space into low-dimensional subspaces in such a way that filtering on these subspaces gives distributions whose product is a good approximation to the distribution on the entire state space. The conditions for successful application of the algorithms are that observations be available at the subspace level and that the transition schema can be factored into local transition schemas that are approximately confined to the subspaces; these conditions are widely satisfied in computer science, engineering, and geophysical filtering applications. Experimental results on tracking epidemics and estimating parameters in large contact networks show the effectiveness of the approach.

2310.05264 2026-06-10 cs.LG cs.CV 版本更新

The Emergence of Reproducibility and Generalizability in Diffusion Models

扩散模型中可重复性与泛化性的出现

Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, Qing Qu

发表机构 * University of Michigan(密歇根大学)

AI总结 研究发现扩散模型在相同初始噪声和确定性采样器下,不同模型输出高度相似,且这种可重复性在记忆和泛化两种训练模式下均存在,对训练效率、模型隐私等有重要启示。

Comments NeurIPS Diffusion Model Workshop 2023 (best paper award), the Forty-first International Conference on Machine Learning (ICML 2024)

详情
AI中文摘要

在这项工作中,我们研究了扩散模型的一个有趣且普遍的现象,我们称之为“一致模型可重复性”:给定相同的起始噪声输入和确定性采样器,不同的扩散模型通常会产生非常相似的输出。我们通过全面的实验证实了这一现象,这意味着不同的扩散模型一致地达到相同的数据分布和评分函数,无论扩散模型框架、模型架构或训练过程如何。更引人注目的是,我们的进一步研究表明,扩散模型学习到的不同分布受到训练数据大小的影响。这一点得到了以下事实的支持:模型可重复性表现在两种不同的训练机制中:(i)“记忆机制”,其中扩散模型过拟合到训练数据分布,以及(ii)“泛化机制”,其中模型学习底层数据分布。我们的研究还发现,这一有价值的特性推广到许多扩散模型的变体,包括用于条件使用、解决逆问题和模型微调的变体。最后,我们的工作提出了许多有趣的理论问题供未来研究,并强调了关于训练效率、模型隐私和扩散模型受控生成的实际意义。

英文摘要

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term as "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and scoring function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models are learning distinct distributions affected by the training data size. This is supported by the fact that the model reproducibility manifests in two distinct training regimes: (i) "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional use, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.

2404.11716 2026-06-10 cs.AI 版本更新

A Survey on Semantic Modeling for Building Energy Management

建筑能源管理的语义建模综述

Miracle Aniakor, Vinicius V. Cogo, Pedro M. Ferreira

发表机构 * LASIGE, DI, Faculdade de Ciências, Universidade de Lisboa, Portugal(葡萄牙里斯本大学理学院信息学系LASIGE实验室)

AI总结 综述建筑运行阶段语义建模,分析60个模型和20多个用例,提出本体证据完备性指标,发现物理结构覆盖好而动态概念覆盖不足,指出提升互操作性和泛化能力的方向。

Comments 52 pages, 7 figures, 5 tables

详情
AI中文摘要

建筑能源管理(BEM)对于减少建筑领域的能源消耗和二氧化碳排放至关重要。尽管物联网技术现在提供了广泛的运行数据,但异构数据模型、设备描述和上下文表示仍然限制了语义互操作性,阻碍了通用、自主、上下文感知的BEM应用的发展。本体通过提供结构化、机器可解释的建筑数据、系统和运行上下文表示来解决这一挑战。本综述考察了建筑运行阶段的BEM语义建模。它回顾了60个语义模型,分析了20多个基于本体的BEM用例,并进一步量化了这些用例中的本体实例化率(OIR)和缺失概念。为了支持基于证据的本体使用评估,我们引入了本体证据完备性(OEC)的概念,这是一种衡量研究是否将运行概念明确映射到用于表示它们的本体类别的度量。结果表明,当前的语义模型在表示物理建筑结构、技术系统、传感设备和可观察的运行数据方面比抽象和动态的运行概念更一致。诸如关键绩效指标、评估、服务、控制逻辑、优化任务和计算工作流等概念的覆盖仍然不够一致。因此,应用的BEM研究经常依赖于本体重用、集成、专门化、外部继承或特定应用扩展来解决BEM中的覆盖和互操作性差距。通过综合这些模式,本综述阐明了现有语义模型的能力,并指出了更可互操作、更通用和更上下文感知的BEM系统的发展方向。

英文摘要

Building Energy Management (BEM) is central to reducing energy use and CO2 emissions in the building sector. Although IoT technologies now provide extensive operational data, heterogeneous data models, device descriptions, and contextual representations continue to limit semantic interoperability, hindering the development of generalisable, autonomous, context-aware BEM applications. Ontologies address this challenge by providing structured, machine-interpretable representations of building data, systems, and operational context. This survey examines semantic modelling for BEM during the building operational phase. It reviews 60 semantic models and analyses more than 20 ontology-based BEM use cases. It further quantifies Ontology Instantiation Rates (OIR) and missing concepts across those use cases. To support evidence-based assessment of ontology use, we introduce the notion of Ontology Evidence Completeness (OEC), a measure of whether studies explicitly map operational concepts to the ontology classes used to represent them. Findings show that current semantic models represent physical building structure, technical systems, sensing devices, and observable operational data more consistently than abstract and dynamic operational concepts. Concepts such as key performance indicators, assessments, services, control logic, optimisation tasks, and computational workflows remain less consistently covered. Applied BEM studies therefore frequently depend on ontology reuse, integration, specialisation, external inheritance, or application-specific extension to address coverage and interoperability gaps across BEM. By synthesising these patterns, this survey clarifies the capabilities of existing semantic models and identifies directions for more interoperable, generalisable, and context-aware BEM systems.

2012.15621 2026-06-10 cs.CL 版本更新

Open Korean Corpora: A Practical Report

开放韩语语料库:一份实践报告

Won Ik Cho, Sangwhan Moon, Youngsook Song

发表机构 * AI Center, Samsung Electronics(三星电子AI中心) Google LLC(谷歌公司) Lablup Inc.(Lablup公司)

AI总结 本文梳理并评述了现有韩语开放语料库,涵盖机构级资源及各类任务数据集,并针对低资源语言提出了开源数据集构建与发布的建议。

Comments Published (v1) in NLP-OSS @EMNLP2020; May 2023 (v2) added with new datasets; June 2026 (v3) added analyses

详情
AI中文摘要

韩语在研究界常被视为低资源语言。虽然这一说法部分正确,但也因为资源的可用性没有得到充分的宣传和管理。本工作整理并评述了一份韩语语料库列表,首先描述了机构级别的资源开发,然后进一步遍历了当前针对不同任务类型的开放数据集。最后,我们提出了针对低资源语言应如何进行开源数据集构建和发布以促进研究的方向。

英文摘要

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development and then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.

2203.03018 2026-06-10 cs.RO cs.SY eess.SY 版本更新

RAPTOR: Rapid Aerial Pickup and Transport of Objects by Robots

RAPTOR: 机器人快速空中抓取与运输物体

Aurel Appius, Erik Bauer, Marc Blöchlinger, Aashi Kalra, Robin Oberson, Arman Raayatsanati, Pascal Strauch, Sarath Suresh, Marco von Salis, Robert K. Katzschmann

发表机构 * Soft Robotics Lab, ETH Zurich, Switzerland(软机器人实验室,苏黎世联邦理工学院,瑞士)

AI总结 提出一种结合软材料Fin Ray夹爪和Fast DDS中间件的四旋翼平台RAPTOR,实现高速飞行中对不同几何形状物体的灵活抓取,平均抓取成功率83%,有效载荷达先前工作的四倍。

Comments 7 pages, 10 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022. Video: https://youtu.be/KHkBlBABsC8 Project page: https://srl-ethz.github.io/RAPTOR

详情
AI中文摘要

通过机器人进行快速空中抓取可以推动许多利用物体快速动态抓取和放置的应用。传统用于空中机械臂的刚性夹爪需要高精度和特定物体几何形状才能成功抓取。我们提出RAPTOR,一种四旋翼平台结合定制Fin Ray夹爪,利用软材料的特性增加夹爪与物体之间的接触面,从而实现对不同几何形状物体的更灵活抓取。为了减少通信延迟,我们提出一种基于Fast DDS(数据分发服务)的新型轻量级中间件解决方案,作为ROS(机器人操作系统)的替代方案。我们展示了RAPTOR在真实环境中以平均1 m/s的速度抓取四种不同几何形状物体时,平均抓取成功率达到83%。在高速设置下,RAPTOR的有效载荷是先前工作的四倍。我们的结果突显了空中无人机在自动化仓库和其他需要速度、敏捷性和鲁棒性且在难以到达区域操作的操作应用中的潜力。

英文摘要

Rapid aerial grasping through robots can lead to many applications that utilize fast and dynamic picking and placing of objects. Rigid grippers traditionally used in aerial manipulators require high precision and specific object geometries for successful grasping. We propose RAPTOR, a quadcopter platform combined with a custom Fin Ray gripper to enable more flexible grasping of objects with different geometries, leveraging the properties of soft materials to increase the contact surface between the gripper and the objects. To reduce the communication latency, we present a new lightweight middleware solution based on Fast DDS (Data Distribution Service) as an alternative to ROS (Robot Operating System). We show that RAPTOR achieves an average of 83% grasping efficacy in a real-world setting for four different object geometries while moving at an average velocity of 1 m/s during grasping. In a high-velocity setting, RAPTOR supports up to four times the payload compared to previous works. Our results highlight the potential of aerial drones in automated warehouses and other manipulation applications where speed, swiftness, and robustness are essential while operating in hard-to-reach places.

2102.05314 2026-06-10 cs.LG math.ST stat.ML stat.TH 版本更新

Time series forecasting from partial observations via Non-negative Matrix Factorization

基于非负矩阵分解的部分观测时间序列预测

Yohann de Castro, Luca Mencarelli

发表机构 * Institut Camille Jordan, Ecole Centrale Lyon(卡米尔·若尔当研究所,里昂中央理工学院) Institut Universitaire de France(法国大学研究院)

AI总结 提出滑动掩码方法(SMM)结合非负矩阵补全进行多非负时间序列预测,通过掩码原型矩阵分解(mAMF)和掩码归一化非负矩阵分解(mNMF)实现,理论证明恢复误差与噪声成比例,实验优于Transformer、LSTM等方法。

详情
AI中文摘要

在现代时间序列问题中,我们旨在预测可能包含缺失值和噪声的多重时间序列。本文引入滑动掩码方法(SMM),通过非负矩阵补全来预测多个非负时间序列:将观测到的噪声值和预测/缺失值收集成矩阵形式,并通过将其行表示为少量非负向量(称为原型)的凸组合来实现学习。我们提出了两种估计方法,掩码原型矩阵分解(mAMF)和掩码归一化非负矩阵分解(mNMF),它们可以与SMM方法结合。我们证明这些估计能以与噪声成比例的误差恢复真实原型。我们使用近端交替线性化方法(PALM)来计算原型和凸组合权重。我们在真实数据上将我们的估计器与最先进的方法(Transformer、LSTM、SARIMAX...)进行了多时间序列预测比较,结果表明我们的方法在大多数实验中优于它们。

英文摘要

In modern time series problems, one aims at forecasting multiple time series with possible missing and noisy values. In this paper, we introduce the Sliding Mask Method (SMM) for forecasting multiple nonnegative time series by means of nonnegative matrix completion: observed noisy values and forecast/missing values are collected into matrix form, and learning is achieved by representing its rows as a convex combination of a small number of nonnegative vectors, referred to as the archetypes. We introduce two estimates, the mask Archetypal Matrix factorization (mAMF) and the mask normalized Nonnegative Matrix Factorization (mNMF), which can be combined with SMM. We prove that these estimates recover the true archetypes with an error proportional to the noise. We use a proximal alternating linearized method (PALM) to compute the archetypes and the convex combination weights. We compare our estimators with state-of-the-art methods (Transformers, LSTM, SARIMAX, ...) for multiple time series forecasting on real data and find that our method outperforms them in most of the experiments.
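
The idea of forecasting by nonnegative matrix completion can be sketched as follows. Note the substitutions: this sketch uses weighted multiplicative updates rather than the PALM solver named in the abstract, it does not enforce the convex-combination (simplex) constraint on the weights, and it does not reproduce the sliding-mask construction itself. The mask `M` marks observed entries with 1 and forecast or missing entries with 0; all function names are hypothetical.

```python
import random

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def apply_mask(M, A):
    """Zero out the entries of A that the mask marks as unobserved."""
    return [[m * a for m, a in zip(mr, ar)] for mr, ar in zip(M, A)]

def masked_loss(X, M, W, H):
    """Squared reconstruction error on observed entries only."""
    R = matmul(W, H)
    return sum(m * (x - r) ** 2
               for xr, mr, rr in zip(X, M, R)
               for x, m, r in zip(xr, mr, rr))

def masked_nmf(X, M, rank, iters=200, seed=0, eps=1e-12):
    """Fit X ~ W H on observed entries with nonnegative W (n x rank), H (rank x t)."""
    rng = random.Random(seed)
    n, t = len(X), len(X[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(t)] for _ in range(rank)]
    MX = apply_mask(M, X)
    for _ in range(iters):
        # Update H with W fixed (weighted multiplicative rule).
        Wt = transpose(W)
        num, den = matmul(Wt, MX), matmul(Wt, apply_mask(M, matmul(W, H)))
        H = [[h * a / (b + eps) for h, a, b in zip(hr, ar, br)]
             for hr, ar, br in zip(H, num, den)]
        # Update W with H fixed.
        Ht = transpose(H)
        num, den = matmul(MX, Ht), matmul(apply_mask(M, matmul(W, H)), Ht)
        W = [[w * a / (b + eps) for w, a, b in zip(wr, ar, br)]
             for wr, ar, br in zip(W, num, den)]
    return W, H
```

After fitting, the value predicted for a masked entry (i, j) is read off as `matmul(W, H)[i][j]`; the rows of `H` play the role of the nonnegative archetypes.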

2606.04833 2026-06-10 cs.LG cs.AI

Signed Dual Attention: Capturing Signed Dependencies in Time Series Forecasting

符号双注意力:在时间序列预测中捕捉符号依赖关系

Balthazar Courvoisier, Tristan Cazenave

发表机构 * Queensfield AI Technologies

AI总结 提出符号双注意力机制,通过双消息传递方案同时捕捉正负依赖关系,无需额外参数,提升时间序列预测性能。

Comments 5 pages, 3 figures, accepted at AAAI 2026 AI4TS Workshop

详情
AI中文摘要

最初为自然语言处理开发的Transformer架构和注意力机制,现在已成为各种深度学习模型的核心,包括时间序列预测应用。然而,标准注意力机制隐含地假设同质性交互,限制了其对具有正负依赖关系(如时间序列)的数据建模能力。在这项工作中,我们引入了符号双注意力,一种新颖的注意力公式,无需额外参数即可捕捉正负关系模式。通过利用受相关结构启发的双消息传递方案,符号双注意力在单个共享块内传播支持和对比信息,有效实现了两个头注意力的表达能力而无需额外参数。该模块可以无缝集成到现有架构中,并在需要符号关系建模的某些情况下带来性能提升。这种方法为构建更具表达力和参数效率的Transformer开辟了道路。

英文摘要

Initially developed for natural language processing, Transformer architectures and attention mechanisms are now central to a wide range of deep learning models, including applications in time series forecasting. A standard attention mechanism, however, implicitly assumes homophilic interactions, limiting its ability to model data with positive and negative dependencies, such as time series. In this work, we introduce Signed Dual Attention, a novel attention formulation that captures both positive and negative relational patterns without additional parameters. By leveraging a dual message-passing scheme inspired by correlation structures, Signed Dual Attention propagates both supportive and contrastive information within a single shared block, effectively achieving the expressiveness of two-head attention without additional parameters. This module can be seamlessly integrated into existing architectures and can yield performance gains in certain situations that require signed relational modeling. This approach opens a pathway toward more expressive and parameter-efficient transformers.
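
One way to picture a signed, dual message-passing attention step is sketched below. The exact formulation is not given in the abstract, so everything here is an assumption made for illustration: the same query-key scores are reused for two branches, a "supportive" branch with weights softmax(S) and a "contrastive" branch with weights softmax(-S), and the contrastive message is subtracted from the supportive one. No parameters beyond the shared scores are introduced. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def signed_dual_attention(q, keys, values):
    """Single-query signed attention (assumed form, for illustration only).

    Positive weights attend to keys aligned with q, negative weights attend
    to keys anti-aligned with q, and both reuse the same scaled dot products.
    """
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    pos = softmax(scores)                 # supportive branch
    neg = softmax([-s for s in scores])   # contrastive branch
    dim = len(values[0])
    return [sum((p - n) * v[j] for p, n, v in zip(pos, neg, values))
            for j in range(dim)]
```

With a query strongly aligned with the first key and anti-aligned with the second, the output approaches the first value minus the second value. When all scores are equal, the two branches cancel and the output is zero.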

2606.00097 2026-06-10 cs.RO cs.MA

RocketSmith: An Agentic System for High-Powered Rocket Design and Manufacturing

RocketSmith: 一种用于高功率火箭设计与制造的智能系统

Peter Pak, Jesse Barkley, Rumi Loghmani, Derek Baich, Ananya Pamal, Amir Barati Farimani

发表机构 * Graduate Research Assistant, Mechanical Engineering(机械工程研究生助理) AI Fellow, Mechanical Engineering(人工智能研究员,机械工程) Undergraduate Student, Mechanical Engineering(机械工程本科生) Senior Member, Pittsburgh Prefecture One(高级会员,匹兹堡郡一区) Russell V. Trader Associate Professor, Mechanical Engineering(Russell V. Trader副教授,机械工程)

AI总结 本文提出RocketSmith,一种基于智能体系统的自动化设计、制造与优化框架,通过子智能体与技能实现零样本和人在回路的飞行参数优化,并利用增材制造成功开发并测试了四枚高功率火箭。

详情
AI中文摘要

本文介绍了RocketSmith,一种能够完成高功率火箭开发中设计、制造和优化过程的智能系统。该系统实现了软件工具的智能自动化,不仅能够验证飞行稳定性等因素,还能生成火箭组件的参数化设计。通过一组子智能体和技能,该系统能够在零样本和人在回路的工作流程中通过迭代优化飞行参数。利用该系统,结合增材制造的独特设计能力,开发了四种不同电机和组件配置的高功率火箭。这些组件使用各种FDM打印机打印,手动评估飞行准备状态,并在发射活动中进行了飞行测试。测试中,所有火箭均实现了稳定发射,其中两枚火箭成功回收并具备再次飞行条件。在收集的飞行数据中,实测远地点与飞行模拟计算值的准确率达到84%。

英文摘要

This work presents RocketSmith, an agentic system capable of carrying out the design, manufacturing, and optimization processes in high-power rocket development. The system enables the intelligent automation of software tools so as not only to validate factors such as flight stability but also to generate the parametric design components for the rocket assembly. A collection of subagents and skills enables iterative optimization of flight parameters in both zero-shot and human-in-the-loop workflows. With this system, four distinct high-power rockets with various motor and assembly configurations were developed utilizing the unique design capabilities of additive manufacturing. These assembly components were fabricated using various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. From these tests, all rockets achieved a stable launch, and two of the four rockets were successfully recovered in reflyable condition. Within the collected flight data, an 84% accuracy was achieved when comparing measured apogee to that calculated in flight simulations.

2509.04154 2026-06-10 cs.LG cs.AI

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

鲁棒滤波注意力:自注意力作为精度加权状态估计

Peter Racioppo

发表机构 * Independent Researcher, Los Angeles, CA, USA(独立研究者,美国加利福尼亚州洛杉矶)

AI总结 提出鲁棒滤波注意力(RFA),将自注意力建模为基于线性随机微分方程的状态估计,在语言建模中实现优于RoPE的困惑度与零样本外推稳定性。

详情
AI中文摘要

我们引入鲁棒滤波注意力(RFA),一种将自注意力表述为鲁棒状态估计的方法。每个令牌被视为由线性随机微分方程(SDE)控制的潜在轨迹的带噪声观测,注意力权重由该模型下的一致性决定,而非静态特征相似性。在各向同性噪声和衰减假设下,RFA的计算复杂度与标准注意力相当。在语言建模基准上,RFA在训练窗口内实现了比RoPE更低的困惑度,同时在零样本外推到更长上下文时保持稳定。该框架还提供了标准位置机制的动力学解释,将旋转嵌入和近因偏差与随机动力学引起的传输和不确定性传播联系起来。

英文摘要

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI:一种多智能体框架用于集成通用机器人操作

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

发表机构 * Department of Electrical Engineering, Sharif University of Technology(谢里夫理工大学电气工程系)

AI总结 MALLVI通过多智能体协作实现闭环反馈驱动的机器人操作,提升泛化能力和零样本任务成功率。

Comments Some fundamental changes to the text and codebase

详情
AI中文摘要

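利用大语言模型(LLM)进行机器人操作的任务规划是一个新兴领域。先前的方法依赖专用模型、微调或提示调优,并且通常以开环方式运行,缺乏稳健的环境反馈,因此在动态环境中较为脆弱。MALLVI提出了一个多智能体大语言与视觉框架,实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVI为机械臂生成可执行的原子动作。动作执行后,视觉语言模型(VLM)评估环境反馈,并决定是重复该过程还是进入下一步。MALLVI并非使用单一模型,而是协调Decomposer、Localizer、Thinker和Reflector等专用智能体,以管理感知、定位、推理和高层规划。可选的Descriptor智能体提供初始状态的视觉记忆。Reflector通过仅重新激活相关智能体来支持有针对性的错误检测与恢复,避免完全重新规划。仿真和真实环境中的实验表明,迭代式闭环多智能体协作提升了泛化能力,并提高了零样本操作任务的成功率。代码见 https://github.com/iman1234ahmadi/MALLVI 。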

英文摘要

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.