arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.22629 2026-06-12 cs.CV version update

CRAG: Can 3D Generative Models Help 3D Assembly?

Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang

Affiliations: arXiv.org; University of California, Berkeley

AI summary: Proposes CRAG, which jointly optimizes 3D assembly and shape generation; generating complete shapes and predicting part poses reinforce each other, achieving state-of-the-art performance across varied geometries, part counts, and missing-part cases.

Comments: 15 pages, 8 figures

Abstract

Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Project Page: this https URL
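The pose-estimation view mentioned in the abstract, where each observed part is moved by a rigid transformation, can be illustrated with a small sketch. The function name `rigid_transform`, the yaw-only rotation, and the sample points are assumptions made for illustration; none of them come from CRAG.

```python
import math

def rigid_transform(points, yaw, translation):
    """Rotate 3D points about the z-axis by `yaw` radians, then translate them."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]

def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical two-point "part", rotated by 90 degrees and lifted by one unit.
part = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
posed_part = rigid_transform(part, math.pi / 2, (0.0, 0.0, 1.0))
```

A rigid transformation preserves distances within a part, so it can reposition geometry but cannot create missing geometry, which is the limitation the abstract attributes to pose-only methods.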

2602.00462 2026-06-12 cs.CV cs.AI version update

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

Affiliations: University of Cambridge

AI summary: Proposes LatentLens, which makes visual tokens interpretable by nearest-neighbor matching them against contextualized token representations from a text corpus, and finds that most visual tokens are interpretable at every layer.

Comments: ICML 2026 (Camera Ready)

Abstract

Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
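The nearest-neighbor lookup described in the abstract can be sketched in a few lines. The abstract does not state which similarity measure is used, so cosine similarity is an assumption here, as are the function names and the hand-made three-dimensional vectors (real contextualized representations would come from an LLM layer).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_descriptions(visual_vec, corpus_bank, k=2):
    """Return the k context strings whose stored vectors are closest to visual_vec.

    corpus_bank is a list of (context_string, vector) pairs; the bracketed word
    marks the token whose contextualized representation was stored.
    """
    scored = [(cosine(visual_vec, vec), text) for text, vec in corpus_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy stand-ins for stored contextualized token representations (assumed values).
corpus_bank = [
    ("a dog running on the [grass]", [0.9, 0.1, 0.0]),
    ("the stock [market] fell", [0.0, 0.2, 0.9]),
    ("green [lawn] in the park", [0.8, 0.3, 0.1]),
]
visual_token = [1.0, 0.2, 0.0]
top = nearest_descriptions(visual_token, corpus_bank, k=2)
```

Because each stored vector carries its surrounding sentence, the retrieved neighbors read as short descriptions rather than single vocabulary items, which is consistent with the abstract's claim of finer-grained interpretations than individual tokens.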

2512.24787 2026-06-12 cs.IR cs.AI version update

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

Affiliations: Platform and Content Group, Tencent; Sun Yat-sen University

AI summary: Proposes HiGR, which uses structured semantic IDs and a hierarchical decoder to address planning efficiency and slate-quality alignment for generative recommendation at industrial scale, improving offline quality by over 10% with a 5x inference speedup.

Abstract

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs, directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a $5\times$ inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.
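To make the idea of a semantic ID with a meaningful prefix concrete, the sketch below shows plain residual quantization: each level picks the nearest codeword and passes the leftover residual to the next level. This is the generic technique only, not PCRQ-VAE (it has no VAE and no prefix-contrastive objective), and the codebooks, item vectors, and function names are invented for illustration.

```python
def nearest_index(vec, codebook):
    """Index of the codeword with the smallest squared distance to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize vec level by level; return (semantic_id, final_residual)."""
    residual = list(vec)
    sid = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        sid.append(idx)
        # Subtract the chosen codeword so the next level encodes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid), residual

# Hypothetical two-level codebooks: level 0 is coarse, level 1 refines.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]],
]
sid_a, _ = residual_quantize([1.1, 0.0], codebooks)
sid_b, _ = residual_quantize([0.9, 0.0], codebooks)
sid_c, _ = residual_quantize([0.0, 1.1], codebooks)
```

In this toy setup the two nearby items share the first SID token and differ only in the second, while the distant item gets a different first token. That shared-prefix behaviour is the property the abstract says PCRQ-VAE enforces so that planning can happen at the prefix level.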

2602.18154 2026-06-12 cs.CL cs.AI cs.DB version update

FENCE: A Financial and Multimodal Jailbreak Detection Dataset

Mirae Kim, Seonghun Jeong, Youngjun Kwak

Affiliations: arXiv

AI summary: Addresses the scarcity of multimodal jailbreak-detection resources in finance with FENCE, a bilingual Korean-English dataset of text and images for training and evaluating detectors; a baseline detector reaches 99% in-distribution accuracy.

Comments: LREC 2026 accepted paper

Abstract

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.

2602.15424 2026-06-12 cs.RO version update

Lyapunov-Based PI-Like Control for Robust Trajectory Tracking of a Four-Wheel Independently Driven and Steered Robot: Design and Experimental Validation

Branimir Ćaran, Vladimir Milić, Marko Švaco, Bojan Jerbić

Affiliations: Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb; Regional Centre of Excellence for Robotic Technology (CRTA); Croatian Academy of Sciences and Arts

AI summary: Proposes a Lyapunov-based PI-like controller with model-based feedforward compensation for robust trajectory tracking of a four-wheel independently driven and steered robot, and experimentally benchmarks it against PI and sliding-mode controllers.

Comments: This work has been submitted to the IEEE for possible publication

Abstract

In this paper, a Lyapunov-based synthesis of a PI-like controller is proposed for robust trajectory tracking of an independently driven and steered four-wheel mobile robot. For the robot considered in this work, an explicit structurally verified mathematical model is used to enable systematic controller design with rigorous stability guarantees suitable for real-time implementation. An augmented Lyapunov-based practical stability analysis is developed for the combined velocity-error and integral-error dynamics of the inner loop, yielding explicit bounds and sufficient conditions for practical stability and uniform ultimate boundedness of the combined velocity-error and integral-error state. The resulting control law retains a PI-like structure with model-based feedforward compensation, making it suitable for implementation on standard embedded platforms while improving robustness against configuration-dependent residual dynamics and unmodelled effects. The effectiveness and robustness of the proposed design are demonstrated experimentally on a four-wheel independently steered and independently driven mobile robot platform, under both horizontal and vertical operating conditions, and benchmarked against a PI controller and a sliding-mode controller.

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG version update

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

Affiliations: arXiv.org; University of Science and Technology of China

AI summary: Proposes the InnoEval framework, which combines heterogeneous deep knowledge retrieval with a multi-perspective review board for knowledge-grounded, multi-dimensional decoupled evaluation, outperforming baselines on point-wise, pair-wise, and group-wise evaluation tasks.

Comments: ICML 2026

Abstract

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE version update

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

Affiliations: Stanford University; UC Berkeley

AI summary: Introduces MT-AgentRisk, a multi-turn tool-use safety benchmark, finds that attack success rate rises by 16% on average in multi-turn settings, and designs ToolShield, a training-free, tool-agnostic self-exploration defense that lowers attack success rate by 30% on average.

Abstract

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at this https URL.

2602.07294 2026-06-12 cs.CE cs.AI version update

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

Affiliations: Tongji University; University of California, San Diego; Yale University; Goldman Sachs

AI summary: Responding to the need for LLMs to analyze complex regulatory filings in finance, introduces Fin-RATE, a benchmark built on SEC filings that evaluates models along three task pathways and finds that performance drops markedly in cross-document and cross-period analysis.

Abstract

With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations, temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency--limitations that existing benchmarks have yet to formally categorize or quantify.

2602.12753 2026-06-12 cs.LG version update

Hierarchical Successor Representation for Robust Transfer

Changmin Yu, Máté Lengyel

Affiliations: University of Cambridge; DeepMind

AI summary: Proposes the Hierarchical Successor Representation (HSR), which builds robust state features through temporal abstraction and, combined with non-negative matrix factorisation, yields sparse low-rank representations that support efficient task transfer and exploration in multi-compartmental environments.

Abstract

The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
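For readers unfamiliar with the classical successor representation that the abstract builds on, the sketch below computes it for a fixed policy on a tiny three-state chain. With a policy-induced transition matrix P and discount gamma, the SR is M = sum over t of gamma^t P^t, and the value of any reward vector r is V = M r. The chain, the discount, and the function names are illustrative assumptions; the code covers the classical SR only, not HSR or HSR-NMF.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def successor_representation(P, gamma, iterations=500):
    """Iterate M <- I + gamma * P M, which converges to sum_t gamma^t P^t."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iterations):
        PM = matmul(P, M)
        M = [[identity[i][j] + gamma * PM[i][j] for j in range(n)]
             for i in range(n)]
    return M

def values(M, reward):
    """State values V = M r for a per-state reward vector."""
    return [sum(m * r for m, r in zip(row, reward)) for row in M]

# Random-walk policy on a 3-state chain (assumed toy environment).
P = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
M = successor_representation(P, gamma=0.9)
v_right = values(M, [0.0, 0.0, 1.0])  # reward at the right end
v_left = values(M, [1.0, 0.0, 0.0])   # reward moved to the left end, same M
```

Changing the reward only changes the final product with M, but changing the policy changes P and therefore M itself, which is the policy dependence the abstract identifies as the limitation of the classical SR.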

2505.11846 2026-06-12 cs.LG math.AG version update

Learning on a Razor's Edge: Identifiability and Singularity of Polynomial Neural Networks

Vahid Shahverdi, Giovanni Luca Marchetti, Kathlén Kohn

Affiliations: Department of Mathematics, KTH Royal Institute of Technology

AI summary: Studies the function spaces (neuromanifolds) of MLPs and CNNs with polynomial activations, shows that the MLP parametrization is generically finite-to-one and the CNN parametrization generically one-to-one, and characterizes singularities as arising from sparse subnetworks, which explains the sparsity bias of MLPs.

Comments: Published at ICLR 2026

Abstract

We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.
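One elementary reason an MLP parametrization cannot be one-to-one is that hidden units can be permuted without changing the function. The sketch below illustrates only this familiar symmetry on a one-hidden-layer network with scalar input and no biases; the activation polynomial, weights, and function names are arbitrary choices for illustration, and the example does not reproduce the paper's algebraic-geometry argument.

```python
def activation(z):
    """An arbitrary polynomial activation (coefficients chosen for illustration)."""
    return z ** 3 + 2 * z ** 2 - z

def mlp(x, hidden_weights, output_weights):
    """One-hidden-layer network: sum_i output_weights[i] * activation(hidden_weights[i] * x)."""
    return sum(w_out * activation(w_in * x)
               for w_in, w_out in zip(hidden_weights, output_weights))

# Two different parameter settings that differ by swapping the hidden units.
params_a = ([2, -3], [5, 7])
params_b = ([-3, 2], [7, 5])

# Integer arithmetic keeps the comparison exact.
same_function = all(mlp(x, *params_a) == mlp(x, *params_b) for x in range(-5, 6))
```

A network with n hidden units in a layer has n! such reorderings, a finite number, which fits the abstract's statement that almost every function in the MLP neuromanifold has only finitely many parameter choices.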

2602.12024 2026-06-12 cs.RO version update

Adaptive-Horizon Conflict-Based Search for Closed-Loop Multi-Agent Path Finding

Jiarui Li, Federico Pecora, Runyu Zhang, Gioele Zardini

Affiliations: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology; Schwarzman College of Computing

AI summary: Proposes ACCBS, which dynamically adjusts the planning horizon and reuses a constraint tree to produce high-quality feasible solutions quickly under a limited computational budget, while remaining asymptotically optimal and adaptable to disturbances.

Abstract

MAPF is a core coordination problem for large robot fleets in automated warehouses and logistics. Existing approaches are typically either open-loop planners, which generate fixed trajectories and struggle to handle disturbances, or closed-loop heuristics without reliable performance guarantees, limiting their use in safety-critical deployments. This paper presents ACCBS, a closed-loop algorithm built on a finite-horizon variant of CBS with a horizon-changing mechanism inspired by iterative deepening in MPC. ACCBS dynamically adjusts the planning horizon based on the available computational budget, and reuses a single constraint tree to enable seamless transitions between horizons. As a result, it produces high-quality feasible solutions quickly while being asymptotically optimal as the budget increases, exhibiting anytime behavior. Extensive case studies demonstrate that ACCBS combines flexibility to disturbances with strong performance guarantees, effectively bridging the gap between theoretical optimality and practical robustness for large-scale robot deployment.

2602.10132 2026-06-12 physics.plasm-ph cs.AI version update

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

Cécile Rousseau, Samuel Jackson, Rodrigo H. Ordonez-Hurtado, Nicola C. Amorisco, Tobia Boschi, George K. Holt, Andrea Loreti, Eszter Székely, Alexander Whittle, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Sue Thorne, Mykhaylo Zayats

Affiliations: IBM Research Europe; UK Atomic Energy Authority; STFC Hartree Centre

AI summary: To address fusion data that is scarce, fragmented, and inconsistently annotated, introduces the TokaMark benchmark of 14 tasks, which unifies access to multi-modal fusion data and evaluation protocols and provides a baseline model to accelerate data-driven AI plasma modeling.

Abstract

Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensor readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, and highlight the promise of modern data-native approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to unify access to multi-modal fusion data and standardize evaluation protocols. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The dataset, benchmark, documentation, and tooling are open-sourced under this https URL.

2602.09379 2026-06-12 cs.MA cs.CL version update

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

Affiliations: Tianqiao and Chrissy Chen Institute; EverMind AI Inc.; Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine

AI summary: Proposes LingxiDiagBench, a multi-agent framework with a dataset of 16K EMR-aligned synthetic consultation dialogues, to evaluate LLMs on static diagnosis and dynamic consultation; finds low accuracy on depression-anxiety comorbidity recognition and 12-way differential diagnosis, and that dynamic consultation often underperforms static evaluation.

Abstract

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at this https URL.

2602.09730 2026-06-12 cs.CV cs.LG math.NA version update

Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith

Affiliations: Dept. of Mathematics, LMU Munich; Munich Center for Machine Learning; Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY; Fachbereich Mathematik, University of Hamburg; CIT School, Technical University of Munich

AI summary: Proposes a hybrid method that models crack detection as an inverse problem, using a deep generative model as a prior for the painting together with a Mumford-Shah-type variational functional and a crack prior; joint optimization yields a pixel-level crack localization map.

Abstract

Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford-Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.

2602.07698 2026-06-12 cs.SE cs.CL version update

On Sequence-to-Sequence Models for Automated Log Parsing

Adam Sorrenti, Andriy Miranskyy

Affiliations: Toronto University

AI summary: Systematically evaluates four sequence modelling architectures (Transformer, Mamba, and mono- and bidirectional LSTM) for automated log parsing, finds that the Transformer performs best while Mamba is competitive at lower computational cost, and analyzes the effects of representation choice, sequence length, and data efficiency.

Comments: Added a comparison with large language models

Abstract

Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. Objectives: This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. Methods: We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Results: Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Conclusion: Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
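The evaluation metric named in this abstract, relative Levenshtein edit distance, can be sketched as follows. This is a minimal illustration, not the authors' code: the normalisation by the longer of the two strings is an assumption (the abstract does not say how the distance is normalised), and the log line and template in the usage note are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance divided by the longer string length (assumed normalisation)."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, reference) / longest
```

For example, `relative_edit_distance("Connection from <*> closed", "Connection from <*> closed")` is 0.0 for a perfectly parsed template, and the value grows towards 1.0 as the predicted template diverges from the reference, which matches the abstract's "lower values are better".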

2602.07106 2026-06-12 cs.CV cs.AI cs.CL 版本更新

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Ex-Omni:为全模态大语言模型赋能3D面部动画生成

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) LIGHTSPEED Independent Researcher(独立研究员)

AI总结 提出Ex-Omni模型,通过混合形状感知语音单元生成器和解码器解耦语义推理与时间生成,并引入统一令牌查询门控融合机制,实现全模态大语言模型同步生成语音和3D面部动画。

详情
AI中文摘要

全模态大语言模型旨在统一多模态理解和生成,然而,尽管自然的人机交互至关重要,但扩展它们以联合生成语音和3D面部动画仍基本未被探索。一个关键挑战是LLM的离散语义推理与3D面部运动所需的密集时间动态之间的不匹配。我们提出Expressive Omni (Ex-Omni),一个开源模型,通过原生语音伴随的3D面部动画增强OLLM。Ex-Omni通过混合形状感知语音单元生成器和混合形状解码器将语义推理与时间生成解耦,其中语音单元提供时间支架,隐藏语音表示携带面部相关线索。我们进一步引入统一的令牌查询门控融合机制用于受控语义注入,以及InstructS2SF-1200K,一个包含1200K样本的预训练数据集。大量实验表明,Ex-Omni在保持竞争性语音理解和生成能力的同时,实现了比级联管道更好的音视频同步和更低的面部生成延迟。

英文摘要

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

2602.05121 2026-06-12 eess.SY cs.RO 版本更新

Trojan Attacks on Neural Network Controllers for Robotic Systems

针对机器人系统神经网络控制器的木马攻击

Farbod Younesi, Walter Lucia, Amr Youssef

发表机构 * Concordia University(康科德大学) Concordia Institute for Information Systems Engineering(康科德信息系统工程研究所) Fonds de recherche du Québec – Nature et Technologies(魁北克自然与技术研究基金) National Cybersecurity Consortium(国家网络安全联盟)

AI总结 针对机器人神经网络控制器,设计轻量级并行木马网络,在特定触发条件下篡改控制指令,通过仿真验证攻击有效性。

详情
Comments
Paper submitted to the 2026 IEEE Conference on Control Technology and Applications (CCTA)
AI中文摘要

神经网络控制器越来越多地应用于机器人系统中,用于轨迹跟踪和姿态稳定等任务。然而,它们对可能不可信的训练流程或供应链的依赖引入了显著的安全漏洞。本文以差速驱动移动机器人平台为案例,研究针对神经控制器的后门(木马)攻击。具体来说,假设机器人的跟踪控制器实现为神经网络,我们设计了一个轻量级的并行木马网络,可以嵌入到控制器中。该恶意模块在正常操作期间保持休眠,但在检测到由机器人姿态和目标参数定义的高度特定触发条件时,会破坏主控制器的轮速命令,导致不良且可能不安全的机器人行为。我们提供了所提出的木马网络的概念验证实现,并通过两种不同攻击场景下的仿真进行了验证。结果证实了所提出攻击的有效性,并表明基于神经网络的机器人控制系统面临潜在的关键安全威胁。

英文摘要

Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.

2602.04675 2026-06-12 cs.LG 版本更新

Generalized Schrödinger Bridge on Graphs

图上的广义薛定谔桥

Panagiotis Theodoropoulos, Juno Nam, Evangelos Theodorou, Jaemoo Choi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GSBoG框架,通过似然优化学习图上可控连续时间马尔可夫链策略,满足端点边际分布并优化中间状态成本,实现可扩展的拓扑感知运输。

详情
AI中文摘要

图上的运输是许多领域中的一个基本挑战,决策必须尊重拓扑和操作约束。尽管需要可执行的策略,现有的图运输方法缺乏这种表达能力。它们依赖于限制性假设,无法在稀疏拓扑上泛化,并且随着图大小和时间范围的增加而扩展性差。为了解决这些问题,我们引入了图上的广义薛定谔桥(GSBoG),这是一种新颖的可扩展数据驱动框架,用于在状态成本增强动力学下学习任意图上的可执行受控连续时间马尔可夫链(CTMC)策略。值得注意的是,GSBoG学习轨迹级策略,避免了密集的全局求解器,从而增强了可扩展性。这是通过似然优化方法实现的,满足端点边际分布,同时优化状态依赖运行成本下的中间行为。在具有挑战性的真实世界图拓扑上的大量实验表明,GSBoG能够可靠地学习准确、尊重拓扑的策略,同时优化特定应用的中间状态成本,突出了其广泛的适用性,并为一般图上的成本感知动态运输开辟了新途径。

英文摘要

Transportation on graphs is a fundamental challenge across many domains, where decisions must respect topological and operational constraints. Despite the need for actionable policies, existing graph-transport methods lack this expressivity. They rely on restrictive assumptions, fail to generalize across sparse topologies, and scale poorly with graph size and time horizon. To address these issues, we introduce Generalized Schrödinger Bridge on Graphs (GSBoG), a novel scalable data-driven framework for learning executable controlled continuous-time Markov chain (CTMC) policies on arbitrary graphs under state cost augmented dynamics. Notably, GSBoG learns trajectory-level policies, avoiding dense global solvers and thereby enhancing scalability. This is achieved via a likelihood optimization approach, satisfying the endpoint marginals, while simultaneously optimizing intermediate behavior under state-dependent running costs. Extensive experimentation on challenging real-world graph topologies shows that GSBoG reliably learns accurate, topology-respecting policies while optimizing application-specific intermediate state costs, highlighting its broad applicability and paving new avenues for cost-aware dynamical transport on general graphs.

2601.09693 2026-06-12 cs.LG stat.ML 版本更新

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

对比几何学习实现统一的结构与配体药物设计

Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

发表机构 * DeepMind Ltd(DeepMind有限公司)

AI总结 提出对比几何模型ConGLUDe,统一结构与配体训练,实现虚拟筛选、靶标钓鱼和配体条件口袋预测,在多项基准测试中表现优异。

详情
Comments
Forty-Third International Conference on Machine Learning
AI中文摘要

基于结构和基于配体的计算药物设计传统上依赖于不相关的数据源和建模假设,限制了它们在大规模上的联合使用。在这项工作中,我们引入了用于统一计算药物设计的对比几何学习(ConGLUDe),这是一个单一的对比几何模型,统一了基于结构和基于配体的训练。ConGLUDe将产生全蛋白质表示和预测结合位点的隐式嵌入的几何蛋白质编码器与快速配体编码器耦合,消除了对预定义口袋的需求。通过对比学习将配体与全局蛋白质表示和多个候选结合位点对齐,ConGLUDe除了支持虚拟筛选和靶标钓鱼外,还支持配体条件口袋预测,同时在蛋白质-配体复合物和大规模生物活性数据上联合训练。在多种基准测试中,ConGLUDe实现了具有竞争力的零样本虚拟筛选性能,在具有挑战性的靶标钓鱼任务上显著优于现有方法,并展示了最先进的配体条件口袋选择。这些结果突显了统一结构-配体训练的优势,并将ConGLUDe定位为迈向药物发现通用基础模型的一步。

英文摘要

Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for predefined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

2505.13102 2026-06-12 cs.LG cs.AI eess.SP 版本更新

Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

轻量级可解释Transformer:基于混合图算法展开的交通预测

Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao

发表机构 * arXiv.org University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种通过展开混合图优化算法构建的轻量级可解释类Transformer网络,用于时空交通预测,在保持竞争性能的同时大幅减少参数。

详情
Comments
24 pages, 7 figures, 11 tables
AI中文摘要

与采用经典自注意力机制的传统“黑箱”Transformer不同,我们通过展开基于混合图的优化算法构建了一个轻量级且可解释的类Transformer神经网络,用于具有空间和时间维度的交通预测。我们构建了两个图:一个无向图$\mathcal{G}^u$捕捉跨地理的空间相关性,以及一个有向图$\mathcal{G}^d$捕捉时间上的序列关系。我们预测信号$\mathbf{x}$的未来样本,假设其相对于$\mathcal{G}^u$和$\mathcal{G}^d$都是“平滑的”,为此我们设计了新的$\ell_2$和$\ell_1$范数变分项来量化并促进有向图上的信号平滑性(低频重构)。我们基于交替方向乘子法(ADMM)设计了一个迭代算法,并将其展开为一个前馈网络以进行数据驱动的参数学习。我们周期性地插入用于$\mathcal{G}^u$和$\mathcal{G}^d$的图学习模块,这些模块扮演自注意力的角色。实验表明,我们的展开网络在交通预测性能上与最先进的预测方案相当,同时大幅减少了参数数量。

英文摘要

Unlike conventional "black-box" transformers with a classical self-attention mechanism, we build a lightweight and interpretable transformer-like neural net by unrolling a mixed-graph-based optimization algorithm to forecast traffic with spatial and temporal dimensions. We construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We predict future samples of signal $\mathbf{x}$, assuming it is "smooth" with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We design an iterative algorithm based on the alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We periodically insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$ that play the role of self-attention. Experiments show that our unrolled networks achieve traffic forecast performance competitive with state-of-the-art prediction schemes, while reducing parameter counts drastically.

2602.02181 2026-06-12 cs.RO 版本更新

Extending the Law of Intersegmental Coordination: Implications for Powered Prosthetic Controls

扩展节段间协调定律:对动力假肢控制的启示

Elad Siman Tov, Nili E. Krausz

发表机构 * Faculty of Mechanical Engineering, Technion – Israel Institute of Technology(机械工程系,技术学院–以色列理工学院)

AI总结 针对下肢截肢者步行代谢成本问题,提出基于节段间协调定律的假肢控制框架,通过分析三维运动学数据扩展出力矩协调定律,并开发了开源工具包。

详情
Comments
Submitted to 2026 IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob)
AI中文摘要

动力假肢能够为截肢者提供净正功,并在过去二十年中取得了进步。然而,降低截肢者步行代谢成本仍是一个未解决的问题。节段间协调定律(ISC)已在多种步态中被观察到,并先前被认为与步行能量消耗有关,但很少在下肢截肢者步态背景下进行分析或应用。该定律指出,大腿、小腿和足部在步态周期中的仰角是协变的。在这项工作中,我们开发了一种方法,用于分析下肢三维运动学数据的节段间协调,以简化ISC分析。此外,受运动控制、生物力学和机器人学文献的启发,我们使用该方法将ISC扩展为一种新的力矩协调定律。我们发现了这些仰角空间力矩(ESM),并展示了显示健全步态基于力矩的协调的结果。我们还分析了使用动力和被动假肢的截肢者步态的ISC,发现虽然仰角保持平面性,但ESM缺乏平面协调。我们提出了一个ISC驱动的动力假肢控制框架,使用健康协调作为约束来预测小腿角度/力矩,以补偿由于被动足部引起的改变。我们开发了ISC3d工具箱,该工具可在线免费获取,可用于计算三维运动学和动力学ISC。这为进一步研究协调在步态中的作用提供了手段,并可能有助于解决人类运动神经控制的基本问题。

英文摘要

Powered prostheses are capable of providing net positive work to amputees and have advanced in the past two decades. However, reducing amputee metabolic cost of walking remains an open problem. The Law of Intersegmental Coordination (ISC) has been observed across gaits and previously implicated in energy expenditure of walking, yet it has rarely been analyzed or applied within the context of lower-limb amputee gait. This law states that the elevation angles of the thigh, shank and foot over the gait cycle covary. In this work, we developed a method to analyze intersegmental coordination for lower-limb 3D kinematic data, to simplify ISC analysis. Moreover, inspired by motor control, biomechanics and robotics literature, we used our method to extend ISC to a new law of coordination of moments. We find these Elevation Space Moments (ESM), and present results showing a moment-based coordination for able bodied gait. We also analyzed ISC for amputee gait with powered and passive prostheses, and found that while elevation angles remained planar, the ESM lacked planar coordination. We present an ISC-driven powered prosthetic control framework, using healthy coordination as a constraint to predict the shank angles/moments to compensate for alterations due to a passive foot. We developed the ISC3d toolbox that is freely available online, which may be used to compute kinematic and kinetic ISC in 3D. This provides a means to further study the role of coordination in gait and may help address fundamental questions of the neural control of human movement.

2602.01572 2026-06-12 cs.CL cs.IR 版本更新

LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

基于LLM的嵌入:注意力值比隐藏状态更好地编码句子语义

Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen

发表机构 * arXiv.org cs.CL(计算机语言学)

AI总结 本文提出Value Aggregation方法,利用LLM的注意力值向量而非隐藏状态来生成句子嵌入,在无训练设置下超越现有方法,甚至匹配或超越集成方法MetaEOL。

详情
AI中文摘要

句子表示是许多自然语言处理(NLP)应用的基础。虽然近期方法利用大型语言模型(LLM)来推导句子表示,但大多数依赖于最终层的隐藏状态,这些隐藏状态针对下一个词预测进行了优化,因此通常无法捕捉全局的句子级语义。本文引入了一个新颖的视角,证明注意力值向量比隐藏状态更有效地捕捉句子语义。我们提出了值聚合(VA),一种简单的方法,它跨多个层和词索引池化标记值。在无训练设置中,VA优于其他基于LLM的嵌入,甚至匹配或超越了基于集成的MetaEOL。此外,我们证明,当与合适的提示配对时,层注意力输出可以被解释为对齐的加权值向量。具体来说,最后一个标记的注意力分数充当权重,而输出投影矩阵($W_O$)将这些加权值向量与LLM残差流的公共空间对齐。这种改进的方法,称为对齐加权VA(AlignedWVA),在无训练的基于LLM的嵌入中达到了最先进的性能,大幅超越了高成本的MetaEOL。最后,我们强调了通过微调值聚合来获得强LLM嵌入模型的潜力。

英文摘要

Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
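The pooling step described for Value Aggregation can be pictured with a small sketch. It assumes the attention value vectors have already been extracted into a nested list indexed as `values[layer][token]`, and it assumes plain mean pooling; the abstract only says that values are pooled across layers and token indices, so both the data layout and the pooling operator are illustrative choices rather than the authors' implementation.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def value_aggregation(values: Sequence[Sequence[Vector]],
                      layers: Sequence[int],
                      tokens: Optional[Sequence[int]] = None) -> Vector:
    """Mean-pool value vectors over the chosen layers and token positions.

    values[layer][token] is one value vector (hypothetical layout).
    If tokens is None, every token position of each chosen layer is used.
    """
    pooled: Optional[Vector] = None
    count = 0
    for layer in layers:
        token_ids = range(len(values[layer])) if tokens is None else tokens
        for t in token_ids:
            vec = values[layer][t]
            if pooled is None:
                pooled = [0.0] * len(vec)
            for d, x in enumerate(vec):
                pooled[d] += x
            count += 1
    if pooled is None or count == 0:
        raise ValueError("no value vectors selected")
    return [x / count for x in pooled]
```

The weighted variant mentioned in the abstract would replace the uniform average with the last token's attention scores and then apply $W_O$; that step is left out here because it depends on model internals the abstract does not spell out.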

2602.00343 2026-06-12 cs.DC cs.AI cs.PF 版本更新

Standardized Methods and Recommendations for Green Federated Learning

绿色联邦学习的标准化方法与建议

Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru

发表机构 * Children’s National Hospital(儿童医院) NVIDIA(英伟达) Children’s National Hospital George Washington University(儿童医院乔治华盛顿大学)

AI总结 提出基于NVFlare和CodeCarbon的联邦学习碳核算方法,通过实验验证系统慢速和协调效应可显著增加碳排放,强调标准化碳核算对可复现绿色FL评估的必要性。

详情
AI中文摘要

联邦学习(FL)能够在隐私敏感的分布式数据上进行协作模型训练,但由于不一致的测量边界和异构的报告方式,其环境影响难以跨研究进行比较。我们提出了一种实用的碳核算方法,用于FL的CO2e跟踪,使用NVIDIA NVFlare和CodeCarbon进行显式的、阶段感知的任务(初始化、每轮训练、评估和空闲/协调)。为了捕捉非计算效应,我们还根据网络可配置能量模型,从传输的模型更新大小估计通信排放。我们在两个代表性工作负载上验证了所提出的方法:CIFAR-10图像分类和视网膜视盘分割。在CIFAR-10中,受控的客户端效率场景表明,在原本固定的FL协议下,系统级慢速和协调效应可能对碳足迹产生显著影响,相对于高效率基线,总CO2e增加了8.34倍(中等)和21.73倍(低)。在视网膜分割中,交换GPU层级(H100 vs. V100)产生了1.7倍的运行时间差距(290 vs. 503分钟),同时在不同站点间总能量和CO2e的变化不均匀,强调了按站点和按轮报告的必要性。总体而言,我们的结果支持一种标准化的碳核算方法,作为可复现的“绿色”FL评估的前提。我们的代码可在以下网址获取:this https URL。

英文摘要

Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at this https URL.
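The idea of phase-aware, per-site accounting can be sketched with a small ledger. This is a hypothetical illustration only: it does not use the NVFlare or CodeCarbon APIs, the class and method names are invented for the example, and it assumes the usual conversion of energy to emissions as energy (kWh) times grid carbon intensity (kg CO2e per kWh). The only figures taken from the abstract are the two runtimes, 290 and 503 minutes.

```python
from collections import defaultdict
from typing import Dict


class PhaseLedger:
    """Accumulates CO2e per (site, phase) so totals can be reported either way."""

    def __init__(self, intensity_kg_per_kwh: float) -> None:
        # Grid carbon intensity is an assumed, user-supplied constant.
        self.intensity = intensity_kg_per_kwh
        self._by_phase: Dict[str, float] = defaultdict(float)
        self._by_site: Dict[str, float] = defaultdict(float)

    def record(self, site: str, phase: str, energy_kwh: float) -> None:
        """Convert measured energy to kg CO2e and book it under site and phase."""
        kg = energy_kwh * self.intensity
        self._by_phase[phase] += kg
        self._by_site[site] += kg

    def total_by_phase(self) -> Dict[str, float]:
        return dict(self._by_phase)

    def total_by_site(self) -> Dict[str, float]:
        return dict(self._by_site)


def runtime_gap(slow_minutes: float, fast_minutes: float) -> float:
    """Ratio of the slower runtime to the faster one."""
    return slow_minutes / fast_minutes


# Runtimes quoted in the abstract for the retinal segmentation workload.
gap = runtime_gap(503, 290)  # about 1.73, consistent with the reported 1.7x
```

Keeping separate totals per phase and per site is what allows idle or coordination time to be reported on its own, which is the reporting granularity the abstract argues for.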

2512.12571 2026-06-12 cs.CV 版本更新

Measurement Plasticity: Sensor-Level Adaptation for Vision-Language Models

测量塑性:面向视觉-语言模型的传感器级自适应

Boyeong Im, Wooseok Lee, Yoojin Kwon, Hyung-Sin Kim

发表机构 * arXiv.org University of Seoul(首尔大学)

AI总结 提出多视角物理提示(MVP)用于测试时自适应,通过将相机曝光三角(ISO、快门速度、光圈)作为物理提示,在传感器层面进行自适应,无需梯度或模型修改,在ImageNet-ES上优于数字方法。

详情
Comments
Accepted to the ICML 2026 Workshop on Continual Adaptation at Scale
AI中文摘要

我们提出用于测试时自适应(TTA)的多视角物理提示(MVP),这是一种前向传播框架,通过将相机曝光三角(即ISO、快门速度、光圈)视为物理提示,将TTA从令牌层面转移到光子层面。在推理时,MVP使用源亲和度得分获取选定的多个物理视角,评估每个保留视角的数字增强变体并过滤最低熵预测,然后通过硬投票聚合预测。这种先选择后投票的设计简单、易于校准,且无需梯度或模型修改。在ImageNet-ES和ImageNet-ES-Diverse上,MVP在自动曝光以及与传统传感器控制结合的情况下均优于纯数字TTA。在减少参数候选以降低捕获延迟的情况下,MVP仍然有效,展示了其实用性。

英文摘要

We propose Multi-View Physical-prompt (MVP) for Test-Time Adaptation (TTA), a forward-only framework that moves TTA from tokens to photons by treating the camera exposure triangle (i.e., ISO, shutter speed, and aperture) as physical prompts. At inference, MVP acquires selected multiple physical views using a source-affinity score, evaluates digitally augmented variants of each retained view and filters the lowest-entropy predictions, and aggregates predictions with hard voting. This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP outperforms digital-only TTA on both Auto-Exposure and a combination with conventional sensor control. MVP remains effective under reduced parameter candidates that lower capture latency, demonstrating its practicality.
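The selection-then-vote step can be sketched as entropy filtering followed by a majority vote. This is a simplified reading of the abstract, not the authors' code: it assumes each captured view (and each of its digital augmentations) has already been turned into a class-probability vector, it omits the source-affinity scoring used to choose which physical views to capture, and the `keep` parameter is a hypothetical knob.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_then_vote(predictions: List[Sequence[float]], keep: int) -> int:
    """Keep the `keep` most confident (lowest-entropy) predictions, then hard-vote."""
    if keep < 1 or not predictions:
        raise ValueError("need at least one prediction and keep >= 1")
    kept = sorted(predictions, key=entropy)[:keep]
    # Hard voting: each kept prediction casts one vote for its argmax class.
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in kept)
    return votes.most_common(1)[0][0]
```

Because only forward passes and a sort are involved, nothing here needs gradients or changes to the model, which is the property the abstract highlights.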

2601.22594 2026-06-12 cs.CL cs.AI 版本更新

Language Model Circuits Are Sparse in the Neuron Basis

语言模型电路在神经元基上是稀疏的

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

发表机构 * Stanford University(斯坦福大学)

AI总结 本文实证发现MLP神经元与稀疏自编码器一样是稀疏特征基,并基于此开发了端到端梯度归因流水线,在多项任务中揭示了因果有效的神经元电路。

详情
Comments
ICML Spotlight, camera-ready
AI中文摘要

神经网络用于计算的高层概念不一定与单个神经元对齐(Smolensky, 1986)。因此,语言模型可解释性研究转向了将神经元基分解为更可解释的模型计算单元的技术,例如稀疏自编码器(SAEs)。然而,并非所有基于神经元的表示都不可解释。我们首次实证表明,MLP神经元与SAEs一样是稀疏的特征基。利用这一发现,我们开发了一个端到端的基于梯度的归因流水线,用于在MLP神经元基上进行电路追踪,从而在多种任务中揭示因果有效的神经元。在标准的主谓一致基准测试(Marks et al., 2025)上,约$10^2$个MLP神经元的电路足以控制模型行为。在(Lindsey et al., 2025)的多跳城市-州-首都任务中,我们发现了一个电路,其中小部分神经元编码特定的潜在推理步骤(例如将城市映射到其所在州),并且可以通过引导来改变模型的输出。因此,这项工作在不增加额外训练成本的情况下推进了语言模型的自动化可解释性。

英文摘要

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.

2601.22090 2026-06-12 cs.RO 版本更新

ReactEMG Stroke: Healthy-to-Stroke Few-shot Adaptation for sEMG-Based Intent Detection

ReactEMG 中风:基于表面肌电图的意图检测的健康到中风少样本适应

Runsheng Wang, Katelyn Lee, Xinyue Zhu, Lauren Winterbottom, Dawn M. Nilsen, Joel Stein, Matei Ciocarlie

发表机构 * Department of Mechanical Engineering, Columbia University in the City of New York(哥伦比亚大学纽约市机械工程系) Department of Computer Science, Columbia University in the City of New York(哥伦比亚大学纽约市计算机科学系) Department of Rehabilitation and Regenerative Medicine, Columbia University Irving Medical Center(哥伦比亚大学伊文思医疗中心康复与再生医学系)

AI总结 提出一种健康到中风的适应流程,利用大规模健康受试者sEMG预训练模型,仅用少量中风患者数据微调,显著提升意图检测准确率和鲁棒性。

详情
AI中文摘要

表面肌电图(sEMG)是一种有前景的控制信号,用于中风后按需辅助手部康复,但从瘫痪肌肉检测意图通常需要长时间、特定于受试者的校准,并且对变异性很脆弱。我们提出了一种健康到中风的适应流程,该流程从在大规模健全受试者sEMG上预训练的模型初始化意图检测器,然后仅使用少量特定于受试者的数据为每个中风参与者进行微调。使用从三位慢性中风患者收集的新数据集,我们比较了适应策略(仅头部调优、参数高效的LoRA适配器和全端到端微调),并在包含现实分布偏移(如会话内漂移、姿势变化和臂带重新定位)的保留测试集上评估。在各种条件下,与相同数据预算下的零样本迁移和仅中风训练相比,健康预训练适应一致地改善了中风意图检测;最佳适应方法将平均转换准确率从0.42提高到0.61,原始准确率从0.69提高到0.78。这些结果表明,迁移可复用的健康域EMG表示可以减少校准负担,同时提高实时中风后意图检测的鲁棒性。我们的项目网站、视频、代码和数据集可在以下网址获取:this https URL。

英文摘要

Surface electromyography (sEMG) is a promising control signal for assist-as-needed hand rehabilitation after stroke, but detecting intent from paretic muscles often requires lengthy, subject-specific calibration and remains brittle to variability. We propose a healthy-to-stroke adaptation pipeline that initializes an intent detector from a model pretrained on large-scale able-bodied sEMG, then fine-tunes it for each stroke participant using only a small amount of subject-specific data. Using a newly collected dataset from three individuals with chronic stroke, we compare adaptation strategies (head-only tuning, parameter-efficient LoRA adapters, and full end-to-end fine-tuning) and evaluate on held-out test sets that include realistic distribution shifts such as within-session drift, posture changes, and armband repositioning. Across conditions, healthy-pretrained adaptation consistently improves stroke intent detection relative to both zero-shot transfer and stroke-only training under the same data budget; the best adaptation methods improve average transition accuracy from 0.42 to 0.61 and raw accuracy from 0.69 to 0.78. These results suggest that transferring a reusable healthy-domain EMG representation can reduce calibration burden while improving robustness for real-time post-stroke intent detection. Our project website, video, code, and dataset are available at: this https URL.

2601.22003 2026-06-12 stat.ML cs.LG stat.CO 版本更新

Efficient Stochastic Optimisation via Sequential Monte Carlo

通过序贯蒙特卡洛实现高效随机优化

James Cuin, Davide Carbone, Yanbo Tang, O. Deniz Akyildiz

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对梯度难以计算的优化问题,提出用序贯蒙特卡洛(SMC)采样器替代昂贵的内采样循环,实现高效随机优化,并在能量模型奖励调优中验证有效性。

详情
Comments
Accepted to ICML 2026
AI中文摘要

在机器学习和统计学中,从最大边际似然估计过程到生成模型的微调,经常出现优化具有难处理梯度函数的问题。针对这类问题的随机近似方法通常需要内部采样循环来获得(有偏的)随机梯度估计,这很快会变得计算昂贵。在这项工作中,我们开发了用于优化具有难处理梯度函数的序贯蒙特卡洛(SMC)采样器。我们的方法用高效的SMC近似替代昂贵的内部采样方法,这可以带来显著的计算收益。我们为我们的方法所定义的基本递归建立了收敛结果,这些递归由SMC采样器近似。我们在各种设置下对能量模型的奖励调优展示了我们方法的有效性。

英文摘要

The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

2601.21570 2026-06-12 cs.AI cs.RO 版本更新

From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence

从数字到物理:数字代理作为物理智能的自主教练

Zixing Lei, Genjia Liu, Yuanshuo Zhang, Qipeng Liu, Yuzhu Cai, Sixiang Chen, Jixian Wu, Yunhong Wang, Weixin Li, Chuan Wen, Bo Zhao, Shanghang Zhang, Wenzhao Lian, Siheng Chen

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China(上海交通大学人工智能学院) Zhongguancun Academy, Beijing, China(中关村学院) School of Integrated Circuits, Shanghai Jiao Tong University, Shanghai, China(上海交通大学集成电路学院) School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China(北京大学计算机科学学院多媒体信息处理国家重点实验室)

AI总结 提出EmboCoach-Bench基准,评估LLM代理自主设计具身策略的能力,通过迭代调试和优化,代理在平均成功率上超越人工基线26.5%,并具备自我修正能力。

详情
Comments
53 pages, 12 figures
AI中文摘要

具身AI领域正朝着通用机器人系统快速发展,得益于高保真模拟和大规模数据收集。然而,这种扩展能力仍然受到劳动密集型人工监督的严重瓶颈,从复杂的奖励塑造到跨异构后端的超参数调整。受LLM在软件自动化和科学发现中成功的启发,我们引入了EmboCoach-Bench,一个评估LLM代理自主设计具身策略能力的基准。涵盖32个专家精选的RL和IL任务,我们的框架将可执行代码作为通用接口。我们超越静态生成,评估动态闭环工作流,其中代理利用环境反馈迭代地起草、调试和优化解决方案,涵盖从物理信息奖励设计到扩散策略等策略架构的改进。广泛评估得出三个关键见解:(1)自主代理在平均成功率上可以定性超越人工设计的基线26.5%;(2)具有环境反馈的代理工作流有效增强了策略开发,并显著缩小了开源和专有模型之间的性能差距;(3)代理对病态工程案例表现出自我修正能力,通过迭代仿真循环调试成功从近乎完全失败中恢复任务性能。最终,这项工作为自我进化的具身智能奠定了基础,加速了具身AI领域从劳动密集型手动调优到可扩展自主工程的范式转变。

英文摘要

The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight, from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs' success in software automation and scientific discovery, we introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) an agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in the embodied AI field.

2601.21324 2026-06-12 stat.ML cs.LG 版本更新

Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination

批量校准的置信模糊集:样本外污染下的快速、可处理决策

Mengqi Chen, Thomas B. Berrett, Theodoros Damoulas, Michele Caprio

发表机构 * University of Bristol(布里斯托大学) University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校) University of Oxford(牛津大学)

AI总结 提出批量校准置信模糊集,通过分离批量内污染和尾部贡献,得到闭式有限风险目标,转化为线性或二阶锥规划,实现高效鲁棒优化。

详情
Comments
Accepted for publication (spotlight) at ICML 2026
AI中文摘要

分布鲁棒优化(DRO)在模糊集上最小化最坏情况期望损失,该模糊集可捕捉样本外环境中的分布偏移。虽然Huber(线性-空)污染是$\varepsilon$分数任意扰动的经典最小假设模型,但将其纳入模糊集可能导致最坏情况风险无穷大,且DRO目标变得无意义,除非施加强有界性或支撑假设。我们通过引入批量校准的置信模糊集来解决这些挑战:我们从数据中学习一个高质量批量集,同时考虑批量内的污染,并分别约束剩余尾部贡献。这导致一个闭式、有限的$\mathrm{mean}+\sup$鲁棒目标,以及针对常见损失和批量几何结构的可处理线性或二阶锥规划。通过该框架,我们强调并利用上期望(不精确概率概念)与最坏情况风险之间的等价性,展示IP置信集如何转化为具有可解释容忍水平的DRO目标。在重尾库存控制、地理偏移房价回归和人口偏移文本分类上的实验显示了竞争性的鲁棒性-准确性权衡和高效的优化时间,使用了贝叶斯、频率学派或经验参考分布。

英文摘要

Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an $\varepsilon$-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite $\mathrm{mean}+\sup$ robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.
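The abstract's point that Huber contamination gives a "mean + sup" objective, and that this objective blows up without a bound on the loss, can be seen in the textbook case. For the set of distributions $(1-\varepsilon)P + \varepsilon Q$ with $Q$ arbitrary, the worst-case expected loss is $(1-\varepsilon)\,\mathbb{E}_P[\ell] + \varepsilon \sup \ell$. The sketch below evaluates that expression on an empirical sample; it illustrates the classical formula only, not the paper's bulk-calibrated objective, and the function name and the idea of passing the supremum as an argument are assumptions made for the example.

```python
from typing import Sequence


def huber_worst_case_risk(losses: Sequence[float],
                          epsilon: float,
                          loss_sup: float) -> float:
    """Worst-case expected loss under epsilon-contamination of the empirical law.

    losses   : observed losses under the reference (empirical) distribution
    epsilon  : contamination fraction in [0, 1]
    loss_sup : supremum of the loss over the region the contamination may reach
               (for instance a bounded set); float('inf') if unbounded
    """
    if not losses:
        raise ValueError("need at least one observed loss")
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    mean = sum(losses) / len(losses)
    if epsilon == 0.0:
        return mean  # no contamination; also avoids 0 * inf
    return (1.0 - epsilon) * mean + epsilon * loss_sup
```

With an unbounded loss, `loss_sup` is infinite and so is the risk for any positive `epsilon`, which is the vacuity problem the abstract describes; restricting the supremum to a bounded set keeps the objective finite.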

2601.13346 2026-06-12 cs.CL 版本更新

AfroScope: A Framework for Studying the Linguistic Landscape of Africa

AfroScope:研究非洲语言景观的框架

Sang Yun Kwon, AbdelRahim Elmadany, Muhammad Abdul-Mageed

发表机构 * The University of British Columbia(不列颠哥伦比亚大学)

AI总结 提出AfroScope框架,包含覆盖640种语言的数据集和模型套件,通过层次分类和专用嵌入模型解决近亲语言混淆问题,提升宏F1分数1.57点,并分析跨语言迁移和领域效应。

详情
AI中文摘要

语言识别(LID)是确定给定文本语言的任务,是影响下游NLP应用可靠性的基本预处理步骤。尽管近期工作扩展了非洲LID,现有系统在语言覆盖范围以及近亲语言和变体的细粒度区分方面仍然有限。我们引入了AfroScope,一个统一的非洲LID框架,包括AfroScope-Data(覆盖640种语言的数据集)和AfroScope-Models(一套具有广泛非洲语言覆盖的强LID模型)。为了解决近亲语言之间持续存在的混淆问题,我们提出了一种层次分类方法,利用AfroScope-Mirror(一种专门用于目标消歧的嵌入模型),在易混淆子集上相比最佳基础模型提升了1.57个宏F1分数。我们进一步分析了跨语言迁移和领域效应,展示了语言家族结构、脚本兼容性和领域覆盖如何影响LID性能。我们将非洲LID定位为大规模测量数字文本中非洲语言景观的使能技术,并在线发布了AfroScope-Data和AfroScope-Models。

英文摘要

Language Identification (LID), the task of determining the language of a given text, is a fundamental preprocessing step that shapes the reliability of downstream NLP applications. While recent work has expanded African LID, existing systems remain limited in both language coverage and fine-grained discrimination among closely related languages and varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 640 languages, and AfroScope-Models, a suite of strong LID models with broad African language coverage. To address persistent confusions among closely related languages, we propose a hierarchical classification approach that leverages AfroScope-Mirror, a specialized embedding model for targeted disambiguation, improving macro-F1 by 1.57 points on the confusable subset compared to our best base model. We further analyze cross-lingual transfer and domain effects, showing how language-family structure, script compatibility, and domain coverage shape LID performance. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text, and release AfroScope-Data and AfroScope-Models online.
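The reported gain is stated in macro-F1, which averages per-language F1 scores without weighting by how many examples each language has. A minimal sketch of the metric follows; taking the label set as the union of reference and predicted labels is one common convention and is assumed here, and the language codes in the usage note are illustrative.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=str)
    if not labels:
        raise ValueError("no labels to score")
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never right scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For instance, with references `["swa", "swa", "zul", "xho"]` and predictions `["swa", "zul", "zul", "zul"]`, the per-language scores are 2/3, 1/2 and 0. Because every language counts equally, confusions among low-resource, closely related languages pull the average down sharply, which is why the metric suits the confusable-subset comparison in the abstract.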