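arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and more.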

2510.00936 2026-05-29 cs.CV

Resolution as a Direction: Vector-Panning Feature Alignment for Cross-Resolution Re-Identification

Zanwu Liu, Chao Yuan, Bo Li, Xiaowei Zhang, Guanglin Niu

Affiliations: School of Artificial Intelligence, Beihang University, Beijing, China; School of Computer Science and Engineering, Beihang University, Beijing, China; College of Computer Science and Technology, Qingdao University, Qingdao, China

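AI summary: Proposes Vector Panning Feature Alignment (VPFA), which pans low-resolution features along a learned resolution direction to obtain pseudo-high-resolution representations, enabling lightweight and efficient cross-resolution person re-identification.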

Abstract

Cross-resolution person re-identification (CR-ReID) remains challenging in practical surveillance, where camera quality and capture distance lead to substantial resolution gaps between low-resolution (LR) queries and high-resolution (HR) gallery images. Prior approaches commonly rely on super-resolution (SR) or resolution-invariant representation learning, which often increases system complexity and may not directly address the feature mismatch induced by resolution degradation. In this work, we report a new empirical finding from a dedicated analysis in which identity-specific variation is averaged out: the HR-LR feature discrepancy produced by standard ReID backbones exhibits a consistent, resolution-related semantic direction in the embedding space. We further support this observation with statistical analyses based on Canonical Correlation Analysis (CCA) and Pearson correlation analysis. Motivated by this finding, we propose Vector Panning Feature Alignment (VPFA), a lightweight post-hoc module that learns to pan LR features along the learned resolution direction to obtain pseudo-HR representations. VPFA operates after feature extraction and can be integrated into existing ReID systems with negligible overhead. Extensive experiments on multiple CR-ReID benchmarks show that VPFA achieves state-of-the-art performance while improving efficiency compared to SR-based or jointly trained alternatives.
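To make the idea of panning a feature along a resolution direction concrete, here is a minimal sketch. It is not the authors' VPFA module: it assumes the direction is simply the mean HR-minus-LR difference over paired features, whereas the abstract describes a learned direction. The function names and the toy feature values are hypothetical.

```python
# Illustrative sketch only: a fixed mean-difference direction stands in for
# the learned resolution direction described in the abstract.

def mean_direction(hr_feats, lr_feats):
    """Average the HR-minus-LR difference over paired feature vectors."""
    n = len(hr_feats)
    dim = len(hr_feats[0])
    return [
        sum(hr[j] - lr[j] for hr, lr in zip(hr_feats, lr_feats)) / n
        for j in range(dim)
    ]

def pan(lr_feat, direction, step=1.0):
    """Shift one LR feature along the direction to get a pseudo-HR feature."""
    return [x + step * d for x, d in zip(lr_feat, direction)]

# Toy paired features (hypothetical values: two identities, 3-D embeddings).
hr = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0]]
lr = [[0.5, 2.0, 2.0], [1.5, 1.0, 3.0]]

direction = mean_direction(hr, lr)
pseudo_hr = [pan(f, direction) for f in lr]
```

In this toy case every pair differs by the same offset, so the shifted LR features land exactly on their HR counterparts. Real features would only move closer to them on average.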

2509.24895 2026-05-29 cs.LG

Towards Understanding the Shape of Representations in Protein Language Models

Kosio Beshkov, Anders Malthe-Sørenssen

Affiliations: Department of Physics, University of Oslo

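AI summary: Analyzes the representation space of protein language models (PLMs) using square-root velocity representations and graph filtrations. In ESM2 models, the Karcher mean and effective dimension vary non-linearly with layer depth. PLMs preferentially encode local relations between residues, and the most structurally faithful representations appear close to, but before, the last layer.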

Comments Accepted as a poster at ICLR 2026. OpenReview: https://openreview.net/forum?id=Dnn8SSBJaY

Journal ref International Conference on Learning Representations (ICLR), 2026

Abstract

While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences into hidden representations, as well as the information encoded in such representations, is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before, the last layer of the models, indicating that training a folding model on top of these layers might lead to improved folding performance.

2509.23730 2026-05-29 cs.AI

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

Affiliations: ByteDance BandAI

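AI summary: Proposes the Expert-Assisted Policy Optimization (EAPO) framework, which enhances exploration through multi-turn interactions with external experts during training to address sparse rewards and inefficient exploration in reinforcement learning, averaging a 5-point gain over self-exploration RL on AIME 2024/2025 and AIMO 2025.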

Comments Accepted by ICML 2026

Abstract

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model's inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. On AIME 2024/2025 and AIMO 2025, EAPO consistently outperforms expert-assisted, expert-distilled, and RL baselines, averaging a 5-point gain over self-exploration RL, and also generalizes to non-math benchmarks, including HumanEval, HLE, GPQA, MMLU, EvalPlus, HotpotQA, and SimpleQA.

2509.22504 2026-05-29 cs.AI cs.LG

Estimating the Empowerment of Language Model Agents

Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

Affiliations: Massachusetts Institute of Technology; University of Washington

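AI summary: Proposes EELMA, an evaluation framework based on the information-theoretic notion of empowerment, which approximates effective empowerment from multi-turn text interactions. Experiments show that empowerment correlates strongly with task performance and can serve as a general evaluation metric complementary to task-success measures.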

Comments Published at the International Conference on Machine Learning (ICML) 2026. 9 pages, 9 figures; camera-ready version

Abstract

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable evaluation frameworks beyond costly, manually designed benchmarks. We propose information-theoretic evaluation based on empowerment, an information-theoretic measure of an agent's influence on future states through its actions. To handle the unique challenges of text-based environments, we introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We demonstrate EELMA on textual games and realistic web and tool-use environments, showing that empowerment strongly correlates with average task performance. We further analyze how empowerment varies across models, environment complexity, and agent configurations, and show that high-empowerment states and actions often mark pivotal moments for general capabilities. These results establish empowerment as a goal-agnostic metric that complements task-success measures for LM-agent evaluation.
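For readers unfamiliar with empowerment, the textbook definition for a small discrete environment is the channel capacity between actions and next states, that is, the maximum over action distributions of the mutual information I(A; S'). The sketch below computes it with the Blahut-Arimoto iteration. It is not EELMA, which the abstract describes as an approximation from multi-turn text interactions. The function name and the toy transition tables are hypothetical.

```python
import math

def empowerment_bits(transition, iters=200):
    """Channel capacity max_p I(A; S') in bits, via Blahut-Arimoto.

    transition[a][s] is P(next state = s | action = a) for one fixed
    current state (an assumed, fully known toy environment).
    """
    n_actions = len(transition)
    n_states = len(transition[0])
    p = [1.0 / n_actions] * n_actions  # start from a uniform action distribution
    capacity = 0.0
    for _ in range(iters):
        # Marginal next-state distribution under the current action distribution.
        q = [sum(p[a] * transition[a][s] for a in range(n_actions))
             for s in range(n_states)]
        # exp(KL(P(.|a) || q)) for each action; zero-probability terms are skipped.
        c = []
        for a in range(n_actions):
            kl = sum(t * math.log(t / q[s])
                     for s, t in enumerate(transition[a]) if t > 0)
            c.append(math.exp(kl))
        z = sum(p[a] * c[a] for a in range(n_actions))
        p = [p[a] * c[a] / z for a in range(n_actions)]  # reweight actions
        capacity = math.log2(z)
    return capacity

# Two actions with distinct, deterministic outcomes: one bit of control.
controllable = empowerment_bits([[1.0, 0.0], [0.0, 1.0]])
# Two actions with the same outcome: no influence on the future state.
powerless = empowerment_bits([[1.0, 0.0], [1.0, 0.0]])
```

The two toy cases bracket the intuition in the abstract: an agent whose actions change what happens next has high empowerment, and one whose actions make no difference has none.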

2509.21979 2026-05-29 cs.CV cs.AI

Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

Juangui Xu, Zikun Guo, Jingwei Lv, Hongbin Lin, Shu Yang, Jun Wen, Di Wang, Lijie Hu

Affiliations: MBZUAI; Saarland University; HKUST(GZ); KAUST

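AI summary: To address sycophancy in medical vision language models, proposes a hierarchical medical visual question answering benchmark and the VIPER strategy, which reduces sycophancy by filtering out non-evidence-based social cues and improves model robustness.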

Comments 19 figures, 61 pages. The first two authors contributed equally

Abstract

Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation with model size or overall accuracy. We discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.

2509.21154 2026-05-29 cs.LG cs.AI

GRPO is Secretly a Process Reward Model

Michael Sullivan, Alexander Koller

Affiliations: Department of Language Science and Technology, Saarland Informatics Campus, Saarland University, Saarbrücken, Germany

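AI summary: Proves theoretically that the GRPO reinforcement learning algorithm with an outcome reward model is equivalent to a Monte-Carlo-based process reward model, identifies a flaw in the objective, and proposes λ-GRPO as a fix that improves performance on reasoning tasks.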

Comments 16 pages, 9 figures; accepted at ICML 2026

Abstract

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect (λ-GRPO), and show that LLMs tuned with λ-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
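As background for the outcome-reward setting the abstract starts from, the sketch below shows the group-relative advantage commonly used in standard GRPO: each sampled response's scalar reward is normalized against the other responses drawn for the same prompt. It does not implement the paper's PRM equivalence or the λ-GRPO modification. The population standard deviation and the `eps` stabilizer are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of sampled responses.

    rewards: one scalar outcome reward per response to the same prompt.
    Returns one advantage per response. With an outcome reward model,
    every token of a response shares that response's advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed convention
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1 if correct and 0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. A group in which every response earns the same reward yields all-zero advantages and therefore no learning signal.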

2508.19282 2026-05-29 cs.CL cs.AI

Less Is More: Elevating RAG via Performance-Driven Context Compression

Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma

Affiliations: City University of Hong Kong, Hong Kong SAR, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Huazhong University of Science and Technology; Peking University, Beijing, China; Shenzhen Technology University, Shenzhen, China

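AI summary: Proposes the CORE-RAG framework, which uses task performance as a feedback signal to iteratively refine the compression policy, improving the average Exact Match score by 3.3 points at a 3% compression ratio.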

Comments Accepted by ICML 2026

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual accuracy of large language models. However, incorporating a large volume of retrieved documents significantly increases input length, leading to prohibitive computational costs. Existing compression approaches often compromise task performance, primarily due to their reliance on predefined heuristics. These heuristics fail to ensure that the compressed context is conducive to the generation tasks. To address these limitations, we propose CORE-RAG, a novel framework for context compression in RAG systems. CORE eliminates reliance on proxy heuristics through a performance-driven learning framework, which directly utilizes task performance as a feedback signal to iteratively refine the compressor policy. Prior to this optimization process, we incorporate a knowledge distillation phase to initialize the compressor with a robust policy. Extensive experiments demonstrate the superiority of our approach. At a high compression ratio of 3%, CORE not only avoids performance degradation but also improves the average Exact Match (EM) score by 3.3 points compared to using full documents. Our code is available at https://github.com/ziqiangcui/CORE-RAG-ICML26.
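The Exact Match score cited above is usually computed by normalizing the predicted and reference answers and checking for string equality. The sketch below follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace). Whether CORE-RAG uses exactly this normalization is an assumption, and the function names are hypothetical.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """1.0 if the prediction equals any reference after normalization, else 0.0."""
    target = normalize(prediction)
    return float(any(target == normalize(ref) for ref in references))

def average_em(predictions, reference_lists):
    """Mean Exact Match over a dataset, reported on a 0-100 scale."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, reference_lists)]
    return 100.0 * sum(scores) / len(scores)

score = average_em(["Paris", "London"], [["paris"], ["Berlin"]])
```

On this scale, the 3.3-point improvement reported in the abstract corresponds to roughly 3.3 more exactly matched answers per 100 questions, averaged over the evaluated datasets.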

2508.19202 2026-05-29 cs.CL

Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li, Yixin Liu, Arpan Sarkar, Doug Downey, Arman Cohan

Affiliations: Department of Computer Science, Yale University; Department of Molecular & Cellular Biology, Harvard University; Allen Institute for AI; Department of Computer Science, Northwestern University

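AI summary: Proposes the SciReas benchmark suite and the KRUX probing framework to systematically evaluate the roles of knowledge and reasoning in LLM scientific reasoning, finding that knowledge retrieval is the main bottleneck and that both external in-context knowledge and reasoning enhancement improve performance.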

Comments 33 pages, 18 figures

Journal ref ICML 2026 Main Conference

Abstract

Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge.

2508.15180 2026-05-29 cs.AI

PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

Kai Xiong, Yanwei Huang, Rongjunchen Zhang, Kun Chen, Haipang Wu, Yingcai Wu

Affiliations: HiThink Research; HKUST; Zhejiang University

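AI summary: Proposes the PuzzleClone framework, which uses a DSL-driven approach to synthesize large-scale, reliable, and diverse verifiable mathematical and logical datasets, and builds the PC-83K benchmark. Experiments show that post-training substantially improves LLM performance on logic and mathematical tasks.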

Abstract

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using a novel DSL-driven approach. Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct PC-83K, a benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. Experimental results show that post-training (SFT and RL) on PC-83K yields substantial improvements not only on the test set but also on various logic and mathematical benchmarks. Post-training raises average performance on PC-83K from 14.5 to 66.0 and delivers consistent improvements across 7 logic and mathematical benchmarks, by up to 18.4 absolute percentage points (SATBench from 51.6 to 70.0). Our code and data are available at https://github.com/HiThink-Research/PuzzleClone.

2508.14610 2026-05-29 cs.RO

TRUST-Planner: Topology-guided Robust Trajectory Planner for AAVs with Uncertain Obstacle Spatial-temporal Avoidance

Junzhi Li, Teng Long, Jingliang Sun, Jianxin Zhong

Affiliations: School of Aerospace Engineering, Beijing Institute of Technology; Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education

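AI summary: Proposes TRUST-Planner, a topology-guided hierarchical planning framework that achieves robust spatial-temporal obstacle avoidance in complex dynamic environments using a dynamic enhanced visible probabilistic roadmap, a uniform terminal-free minimum control polynomial, and a dynamic distance field, reaching a 96% success rate and millisecond-level computation efficiency.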

Comments Accepted by IEEE Transactions on Industrial Electronics (TIE) for publication. The final version will be available online at https://ieeexplore.ieee.org/ after publication

DOI: 10.1109/TIE.2026.3695224

Abstract

Despite extensive developments in motion planning of autonomous aerial vehicles (AAVs), existing frameworks face the challenges of local minima and deadlock in complex dynamic environments, leading to increased collision risks. To address these challenges, we present TRUST-Planner, a topology-guided hierarchical planning framework for robust spatial-temporal obstacle avoidance. In the frontend, a dynamic enhanced visible probabilistic roadmap (DEV-PRM) is proposed to rapidly explore topological paths for global guidance. The backend utilizes a uniform terminal-free minimum control polynomial (UTF-MINCO) and dynamic distance field (DDF) to enable efficient predictive obstacle avoidance and fast parallel computation. Furthermore, an incremental multi-branch trajectory management framework is introduced to enable spatio-temporal topological decision-making, while efficiently leveraging historical information to reduce replanning time. Simulation results show that TRUST-Planner outperforms baseline competitors, achieving a 96% success rate and millisecond-level computation efficiency in tested complex environments. Real-world experiments further validate the feasibility and practicality of the proposed method.

2507.23270 2026-05-29 cs.RO cs.SY eess.SY

Simulation-based planning of Motion Sequences for Automated Procedure Optimization in Multi-Robot Assembly Cells

Loris Schneider, Marc Ungen, Elias Huber, Jan-Felix Klein

Affiliations: Institute for Material Handling and Logistics (IFL), Karlsruhe Institute of Technology; Bosch Corporate Research, Robert Bosch GmbH

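AI summary: Proposes a simulation-based method that separates assembly steps into core operations and traverse operations and optimizes scheduling with a decomposition-based motion planning strategy, generating efficient, collision-free multi-robot motion sequences that reduce assembly time.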

Comments Accepted for publication at IEEE CASE 2026

Abstract

Reconfigurable multi-robot cells offer a promising approach to meet fluctuating assembly demands. However, the recurrent planning of their configurations introduces new challenges, particularly in generating optimized, coordinated multi-robot motion sequences that minimize the assembly duration. This work presents a simulation-based method for generating such optimized sequences. The approach separates assembly steps into task-related core operations and connecting traverse operations. While core operations are constrained and predetermined, traverse operations offer substantial optimization potential. Scheduling the core operations is formulated as an optimization problem, requiring feasible traverse operations to be integrated using a decomposition-based motion planning strategy. Several solution techniques are explored, including a sampling heuristic, tree-based search and gradient-free optimization. For motion planning, a decomposition method is proposed that identifies specific areas in the schedule, which can be solved independently with modified centralized path planning algorithms. The proposed method generates efficient and collision-free multi-robot assembly procedures that outperform a baseline relying on decentralized, robot-individual motion planning. Its effectiveness is demonstrated through simulation experiments.

2507.16880 2026-05-29 cs.CV cs.AI cs.LG

Finding DoRI: Discovery of Retained Images in Diffusion Models

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

Affiliations: CISPA Helmholtz Center for Information Security; German Research Center for Artificial Intelligence (DFKI); Technical University of Darmstadt; Hessian Center for AI (Hessian.AI); Centre for Cognitive Science, Technical University of Darmstadt

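AI summary: Challenges the assumption that memorization can be localized, showing that small perturbations to text embeddings can re-trigger data replication and that memorization is inherently non-local, and proposes adversarial fine-tuning for more robust mitigation.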

Comments Published at ICML 2026

Abstract

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

2507.09266 2026-05-29 cs.CV

SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

Affiliations: CVSSP, University of Surrey, United Kingdom

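AI summary: Proposes a segment-aware visual tokenization framework that uses sign segmentation to convert continuous video into discrete visual tokens. Combined with token-level contrastive alignment and dual-level supervision, it reduces sequence length by up to 50% while outperforming existing methods on the PHOENIX14T benchmark.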

Comments Accepted in International Conference on Computer Vision (ICCV) Workshops. Code released at https://github.com/JianHe0628/SAGE

Abstract

Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performance without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence lengths, validating the potential of our tokenization and alignment strategies.

2507.00037 2026-05-29 cs.LG cs.AI

Model Fusion via Retrofitting

Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

Affiliations: ETH Zürich

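AI summary: Proposes a neuron-centric fusion algorithm that groups intermediate neurons of the parent models into target representations and trains the fused model's sub-networks to approximate them, using neuron attribution scores to bias alignment toward salient features. It applies to any architecture that can be modularized as a directed acyclic graph of levels and shows its largest gains in zero-shot and non-IID scenarios.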

Comments 5 figures, 15 tables, 23 pages

Abstract

Model fusion seeks to combine independently trained neural networks into a single model without retraining, but is complicated by representational divergence arising from permutation invariance, random initialization, and heterogeneous training data. Existing methods struggle particularly in zero-shot settings under non-IID data distributions, and are often limited to specific architectures or pairwise fusion. We introduce a neuron-centric family of fusion algorithms that frames fusion as a principled representation-matching problem: intermediate neurons across parent models are grouped into target representations, which the fused model's corresponding sub-networks are then trained to approximate. Unlike prior work, our approach incorporates neuron attribution scores to bias alignment toward salient features, and can be applied to any architecture modularizable as a DAG of levels -- empirically validated on VGGs, ResNets, and ViTs. Experiments across standard benchmarks show consistent improvements over existing fusion methods, with the largest gains in zero-shot and non-IID scenarios. Code is available at https://github.com/AndrewSpano/model-fusion-via-retrofitting.

2506.12815 2026-05-29 cs.LG

TrojanTO: Action-Level Backdoor Attacks against Trajectory Optimization Models

Yang Dai, Oubo Ma, Longfei Zhang, Xingxing Liang, Xiaochun Cao, Shouling Ji, Jiaheng Zhang, Jincai Huang, Li Shen

Affiliations: Laboratory for Big Data and Decision, National University of Defense Technology; Zhejiang University; Shenzhen Campus of Sun Yat-sen University; National University of Singapore

AI总结提出TrojanTO，首个针对轨迹优化模型的动作级后门攻击方法，通过交替训练增强触发器与目标动作的关联，并利用轨迹过滤和批量投毒提高隐蔽性，在低攻击预算下有效植入后门。

Comments 23 pages, 6 figures

Journal ref International Conference on Learning Representations (ICLR), 2026

AI中文摘要

轨迹优化（TO）模型的最新进展在离线强化学习中取得了显著成功。然而，它们对后门攻击的脆弱性尚不清楚。我们发现，现有的强化学习后门攻击基于奖励操纵，由于TO模型固有的序列建模特性，这些攻击对其基本无效。此外，高维动作空间带来的复杂性进一步加剧了动作操纵的挑战。为解决这些问题，我们提出了TrojanTO，这是首个针对TO模型的动作级后门攻击。TrojanTO采用交替训练来增强触发器与目标动作之间的关联，以提高攻击有效性。为提高攻击隐蔽性，它通过轨迹过滤进行精确投毒以保持正常性能，并通过批量投毒确保触发器一致性。大量评估表明，TrojanTO能够在低攻击预算（0.3%的轨迹）下，跨不同任务和攻击目标有效植入后门。此外，TrojanTO对DT、GDT和DC具有广泛的适用性，突显了其跨多种TO模型架构的可扩展性。

英文摘要

Recent advances in Trajectory Optimization (TO) models have achieved remarkable success in offline reinforcement learning. However, their vulnerabilities against backdoor attacks are poorly understood. We find that existing backdoor attacks in reinforcement learning are based on reward manipulation, which are largely ineffective against the TO model due to its inherent sequence modeling nature. Moreover, the complexities introduced by high-dimensional action spaces further compound the challenge of action manipulation. To address these gaps, we propose TrojanTO, the first action-level backdoor attack against TO models. TrojanTO employs alternating training to enhance the connection between triggers and target actions for attack effectiveness. To improve attack stealth, it utilizes precise poisoning via trajectory filtering for normal performance and batch poisoning for trigger consistency. Extensive evaluations demonstrate that TrojanTO effectively implants backdoor attacks across diverse tasks and attack objectives with a low attack budget (0.3% of trajectories). Furthermore, TrojanTO exhibits broad applicability to DT, GDT, and DC, underscoring its scalability across diverse TO model architectures.

2505.13745 2026-05-29 cs.LG stat.ML

Synthetic Non-stationary Data Streams for Recognition of the Unknown

用于未知识别的合成非平稳数据流

Joanna Komorniczak

发表机构 * Wrocław University of Science and Technology（弗罗茨瓦夫科技大学）

AI总结提出一种同时包含概念漂移和新类出现的合成数据流生成策略，并评估无监督漂移检测器在开放集识别任务中的表现。

DOI: 10.1007/978-3-032-19102-1_9

AI中文摘要

数据非平稳性问题在数据流处理中常被讨论。在动态环境中，方法应持续准备分析时变数据——因此，它们应支持增量训练并应对概念漂移。非平稳数据流环境中另一个同样重要的变化是新的、先前未知类别的出现。通常，方法专注于这两种现象之一——检测概念漂移或检测新类别——而数据流中可能同时出现这两种困难。此外，关于先前未知的观测，开放类别集的话题近年来变得尤为重要，方法的目标是在已知类别内高效分类，并识别模型能力范围外的对象。本文提出一种合成数据流生成策略，其中同时出现概念漂移和代表未知对象的新类别。所呈现的研究展示了无监督漂移检测器如何处理检测新类别和概念漂移的任务，并演示了生成的数据流如何用于开放集识别任务。

英文摘要

The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods should continuously be ready to analyze time-varying data -- hence, they should enable incremental training and respond to concept drifts. An equally important variability typical for non-stationary data stream environments is the emergence of new, previously unknown classes. Often, methods focus on one of these two phenomena -- detection of concept drifts or detection of novel classes -- while both difficulties can be observed in data streams. Additionally, concerning previously unknown observations, the topic of open set of classes has become particularly important in recent years, where the goal of methods is to efficiently classify within known classes and recognize objects outside the model competence. This article presents a strategy for synthetic data stream generation in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors address the task of detecting novelty and concept drifts and demonstrates how the generated data streams can be utilized in the open set recognition task.
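The kind of stream described above, with both a concept drift and a newly emerging class, can be illustrated with a small generator. This is only a minimal sketch and not the generation strategy from the paper: the function name `make_stream`, the 1-D Gaussian features, the chunk indices `drift_at` and `novel_at`, and all numeric values are assumptions chosen for illustration.

```python
import random


def make_stream(n_chunks=6, chunk_size=50, drift_at=2, novel_at=4, seed=0):
    """Yield (chunk_index, X, y) with 1-D Gaussian features.

    Hypothetical toy setup: classes 0 and 1 exist from the start, their means
    shift at chunk `drift_at` (concept drift), and a previously unseen class 2
    starts appearing at chunk `novel_at` (novel class).
    """
    rng = random.Random(seed)
    for c in range(n_chunks):
        shift = 3.0 if c >= drift_at else 0.0      # abrupt drift of known classes
        means = {0: -2.0 + shift, 1: 2.0 + shift}
        if c >= novel_at:
            means[2] = 10.0                        # unknown class enters the stream
        labels = sorted(means)
        X, y = [], []
        for _ in range(chunk_size):
            label = rng.choice(labels)
            X.append(rng.gauss(means[label], 1.0))
            y.append(label)
        yield c, X, y


if __name__ == "__main__":
    for c, X, y in make_stream():
        print(c, sorted(set(y)), round(sum(X) / len(X), 2))
```

A drift detector or an open set recognition model could then be evaluated chunk by chunk, since the positions of the drift and of the first novel-class samples are known by construction.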

2505.02604 2026-05-29 cs.LG

Connecting Independently Trained Modes via Layer-Wise Connectivity

通过逐层连接性连接独立训练的模式

Yongding Tian, Zaid Al-Ars, Maksim Kitsak, Peter Hofstee

发表机构 * Computer Engineering Lab, Delft University of Technology, Delft, NL（代尔夫特理工大学计算机工程实验室）； Network and Architecture Service, Delft University of Technology, Delft, NL（代尔夫特理工大学网络与架构服务）； IBM Infrastructure, TX, USA（IBM基础设施）

AI总结提出一种新的经验算法，通过逐层连接性构建独立训练神经网络模型之间的连续低损失路径，在多种现代架构上实现更一致的模式连接。

Comments 28 pages, 22 figures, accepted in ICML 2026: https://openreview.net/forum?id=4VOTzpH9MO

AI中文摘要

实证研究表明，可以在独立训练的神经网络模型之间构建连续的低损失路径。这种现象称为模式连接性，指参数空间中不同模式（即训练良好的解）之间存在这样的路径。然而，现有的经验方法无法可靠地连接独立训练的模式，并且主要在一组较窄的架构（例如基本的CNN、VGG和ResNet）上进行评估，因而它们在较新模型上的有效性尚不清楚。在这项工作中，我们提出了一种新的经验算法，用于连接独立训练的模式，该算法可推广到传统架构之外，支持更广泛的网络，包括MobileNet、ShuffleNet、EfficientNet、RegNet、深度层聚合（DLA）和紧凑卷积Transformer（CCT）。除了适用性更广之外，所提出的方法在独立训练的模式对之间产生更一致的连接路径，并支持连接使用不同训练超参数获得的模式。

英文摘要

Empirical studies have shown that continuous low-loss paths can be constructed between independently trained neural network models. This phenomenon, known as mode connectivity, refers to the existence of such paths between distinct modes, i.e., well-trained solutions in parameter space. However, existing empirical methods do not reliably connect independently trained modes and have been evaluated mainly on a narrow set of architectures (e.g., basic CNNs, VGG, and ResNet), leaving their effectiveness on newer models unclear. In this work, we propose a new empirical algorithm for connecting independently trained modes that generalizes beyond traditional architectures and supports a broader range of networks, including MobileNet, ShuffleNet, EfficientNet, RegNet, Deep Layer Aggregation (DLA), and Compact Convolutional Transformers (CCT). In addition to broader applicability, the proposed method yields more consistent connectivity paths across independently trained mode pairs and supports connecting modes obtained with different training hyperparameters.
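To make the notion of a low-loss path concrete, the sketch below measures the loss barrier along the straight line between two parameter vectors. It is a baseline illustration only and not the layer-wise algorithm proposed in the paper: the helper names `interpolate` and `loss_barrier` and the toy `double_well` loss are assumptions. A barrier near zero would indicate that the two modes are linearly connected; a large barrier means a better path has to be searched for.

```python
def interpolate(theta_a, theta_b, t):
    """Point at fraction t on the straight line from theta_a to theta_b."""
    return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    la, lb = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for k in range(steps):
        t = k / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, t))
        barrier = max(barrier, path_loss - ((1 - t) * la + t * lb))
    return barrier


def double_well(theta):
    """Toy loss (assumption): each coordinate has minima at -1 and +1."""
    return sum((w * w - 1.0) ** 2 for w in theta)


if __name__ == "__main__":
    # Two distinct minima of the toy loss, separated by a bump at 0.
    print(loss_barrier(double_well, [-1.0], [1.0]))
```

In practice `loss_fn` would evaluate a network on held-out data with the interpolated weights loaded, which is far more expensive than this toy function.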

2505.02069 2026-05-29 cs.LG stat.ML

Neural Logistic Bandits

神经逻辑老虎机

Seoungbin Bae, Dabeen Lee

发表机构 * Department of Industrial & Systems Engineering, KAIST ； Department of Mathematical Sciences & Research Institute of Mathematics, Seoul National University ； Interdisciplinary Program in Artificial Intelligence, Seoul National University

AI总结针对神经逻辑老虎机问题，利用一种新型的自归一化向量值鞅的Bernstein型不等式，提出两种算法NeuralLog-UCB-1和NeuralLog-UCB-2，分别实现与有效维度相关的遗憾上界，改进了现有结果。

AI中文摘要

我们研究了神经逻辑老虎机问题，其主要任务是通过神经网络学习逻辑链接函数内的未知奖励函数。现有方法要么对$κ$（其中$1/κ$表示奖励分布的最小方差）有不利的依赖，要么直接依赖于特征维度$d$，而在基于神经网络的设置中$d$可能非常大。在这项工作中，我们引入了一种新型的自归一化向量值鞅的Bernstein型不等式，旨在绕过对环境维度的直接依赖。这使我们能够推导出一个遗憾上界，该上界随有效维度$\widetilde{d}$增长，而不是特征维度，同时保持对$κ$的最小依赖。基于该集中不等式，我们提出了两种算法NeuralLog-UCB-1和NeuralLog-UCB-2，它们分别保证了$\widetilde{O}(\widetilde{d}\sqrt{κT})$和$\widetilde{O}(\widetilde{d}\sqrt{T/κ})$阶的遗憾上界，改进了现有结果。最后，我们在合成数据集和真实数据集上报告了数值结果，以验证我们的理论发现。

英文摘要

We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencies on $κ$, where $1/κ$ represents the minimum variance of reward distributions, or suffer from direct dependence on the feature dimension $d$, which can be huge in neural network-based settings. In this work, we introduce a novel Bernstein-type inequality for self-normalized vector-valued martingales that is designed to bypass a direct dependence on the ambient dimension. This lets us deduce a regret upper bound that grows with the effective dimension $\widetilde{d}$, not the feature dimension, while keeping a minimal dependence on $κ$. Based on the concentration inequality, we propose two algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, that guarantee regret upper bounds of order $\widetilde{O}(\widetilde{d}\sqrt{κT})$ and $\widetilde{O}(\widetilde{d}\sqrt{T/κ})$, respectively, improving on the existing results. Lastly, we report numerical results on both synthetic and real datasets to validate our theoretical findings.

2504.06022 2026-05-29 cs.CV

CamC2V: Context-aware Controllable Video Generation

CamC2V: 上下文感知的可控视频生成

Luis Denninger, Sina Mokhtarzadeh Azar, Juergen Gall

发表机构 * University of Bonn（波恩大学）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔机器学习与人工智能研究所）

AI总结提出CamC2V模型，通过集成多图像条件与3D约束及相机控制，实现上下文感知的连贯视频生成，在RealEstate10K数据集上FVD提升24.09%。

Comments Published at 3DV 2026

AI中文摘要

近年来，图像到视频（I2V）扩散模型展示了令人印象深刻的场景理解和生成质量，通过引入图像条件来指导生成。然而，这些模型主要将静态图像动画化，而不扩展其提供的上下文。引入额外的约束，如相机轨迹，可以增强多样性，但往往会降低视觉质量，限制了它们在需要忠实场景表示的任务中的适用性。我们提出了CamC2V，一种上下文到视频（C2V）模型，它将多个图像条件作为上下文与3D约束以及相机控制集成在一起，以丰富全局语义和细粒度视觉细节。这使得视频生成更加连贯且上下文感知。此外，我们论证了有效上下文表示中时间感知的必要性。我们在RealEstate10K数据集上的全面研究表明，视觉质量和相机可控性提高了24.09%（FVD）。我们的代码公开在：https://github.com/LDenninger/CamC2V。

英文摘要

Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrades visual quality, limiting the applicability of these models to tasks requiring faithful scene representation. We propose CamC2V, a context-to-video (C2V) model that integrates multiple image conditions as context with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates a 24.09% (FVD) improvement in visual quality and camera controllability. Our code is publicly available at: https://github.com/LDenninger/CamC2V.

2502.21004 2026-05-29 cs.CV

Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition

软化掩码：自适应时间软掩码用于高效动态面部表情识别

Meng-zhu Li, Quanxing Zha, Hongjun Wu

发表机构 * Beijing Union University（北京联合大学）； Huaqiao University（华侨大学）； Beijing University of Posts and Telecommunication（北京邮电大学）

AI总结提出一种结合自监督重建与监督分类的AdaTosk网络，通过自适应时间软掩码（类不可知和类语义软掩码）增强关键表情时刻并减少语义冗余，在降低计算成本的同时保持竞争性能。

Comments 6 pages, 3 figures

DOI: 10.1109/ICME59968.2025.11209787

AI中文摘要

动态面部表情识别（DFER）通过非语言交流促进对心理意图的理解。现有方法难以管理无关信息（如背景噪声和冗余语义），影响效率和有效性。本文提出一种新颖的监督式时间软掩码自编码器网络用于DFER，即AdaTosk，它将并行监督分类分支与自监督重建分支相结合。自监督重建分支应用随机二元硬掩码生成多样化的训练样本，促进可见令牌中的有意义的特征表示。同时，分类分支采用自适应时间软掩码，根据时间重要性灵活地掩盖可见令牌。其两个关键组成部分，即类不可知软掩码和类语义软掩码，分别用于增强关键表情时刻并随时间减少语义冗余。在广泛使用的基准测试上进行的大量实验表明，与当前最先进方法相比，我们的AdaTosk显著降低了计算成本，同时仍保持竞争性能。

英文摘要

Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which impacts both efficiency and effectiveness. In this work, we propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with the self-supervised reconstruction branch. The self-supervised reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, serve respectively to enhance critical expression moments and to reduce semantic redundancy over time. Extensive experiments conducted on widely used benchmarks demonstrate that our AdaTosk remarkably reduces computational costs compared with current state-of-the-art methods while still maintaining competitive performance.

2502.20954 2026-05-29 cs.LG

Robust and Efficient Writer-Independent IMU-Based Handwriting Recognition

鲁棒且高效的独立于书写者的基于IMU的手写识别

Jindong Li, Tim Hamann, Jens Barth, Peter Kämpf, Dario Zanca, Björn Eskofier

发表机构 * Machine Learning and Data Analytics Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany（埃尔朗根-纽伦堡大学机器学习与数据分析实验室，德国埃尔朗根）； STABILO International GmbH, Heroldsberg, Germany（STABILO国际有限公司，德国黑罗尔茨贝格）； Translational Digital Health Group, Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany（慕尼黑亥姆霍兹中心－德国环境健康研究中心健康人工智能研究所转化数字健康组，德国诺伊尔贝格）

AI总结提出一种结合CNN编码器和BiLSTM解码器的模型，在IMU数据上实现独立于书写者的手写识别，在OnHW数据集和自建数据集上分别达到7.37%和9.44%的字符错误率，并展现出对未见书写风格的鲁棒性。

Comments Accepted at iWOAR 2025. Published in Springer LNCS, 2026. Code available at https://github.com/jindongli24/REWI

Journal ref Sensor-Based Activity Recognition and Artificial Intelligence (iWOAR 2025), Lecture Notes in Computer Science, pp. 261-286, Springer, Cham, 2026

DOI: 10.1007/978-3-032-13312-0_16

AI中文摘要

使用惯性测量单元（IMU）数据进行手写识别（HWR）由于书写风格的多样性和数据集的有限性仍然具有挑战性。以往的方法往往难以处理未见过的书写者的手写，使得独立于书写者（WI）的识别成为一个关键但困难的问题。本文提出了一种模型，旨在提高基于IMU数据的WI HWR性能，该模型使用CNN编码器和基于BiLSTM的解码器。我们的方法对未见过的书写风格表现出强大的鲁棒性，在公共OnHW数据集和我们基于单词的数据集的WI划分上均优于现有方法，分别实现了7.37%和9.44%的字符错误率（CER），以及15.12%和32.17%的词错误率（WER）。鲁棒性评估表明，我们的模型在不同年龄组中保持优越性能，并且从一个组学到的知识相比其他方法能更好地泛化到另一个组。在我们基于句子的数据集上的评估进一步展示了识别完整句子的潜力。通过全面的消融研究，我们表明我们的设计选择在性能和效率之间实现了良好的平衡。这些发现支持开发更适应和可扩展的HWR系统用于实际应用。

英文摘要

Handwriting recognition (HWR) using inertial measurement unit (IMU) data remains challenging due to variations in writing styles and the limited availability of datasets. Previous approaches often struggle with handwriting from unseen writers, making writer-independent (WI) recognition a crucial yet difficult problem. This paper presents a model designed to improve WI HWR on IMU data, using a CNN encoder and BiLSTM-based decoder. Our approach demonstrates strong robustness to unseen handwriting styles, outperforming existing methods on the WI splits of both the public OnHW dataset and our word-based dataset, achieving character error rates (CERs) of 7.37% and 9.44%, and word error rates (WERs) of 15.12% and 32.17%, respectively. Robustness evaluation shows that our model maintains superior performance across different age groups, with knowledge learned from one group generalizing better to another compared to other approaches. Evaluation on our sentence-based dataset further demonstrates the potential for recognizing full sentences. Through comprehensive ablation studies, we show that our design choices achieve a strong balance between performance and efficiency. These findings support the development of more adaptable and scalable HWR systems for real-world applications.
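The character and word error rates quoted above are edit-distance metrics. The sketch below shows one common way to compute them, pooling errors over the whole corpus; whether the paper pools over the corpus or averages per sample is not stated here, so that choice, along with the function names `edit_distance` and `error_rate`, is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Corpus-level CER (unit='char') or WER (unit='word')."""
    split = list if unit == "char" else str.split
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return errors / total


if __name__ == "__main__":
    refs, hyps = ["hello world"], ["hallo world"]
    print(error_rate(refs, hyps, unit="char"))  # one wrong character out of 11
    print(error_rate(refs, hyps, unit="word"))  # one wrong word out of 2
```

Because a single wrong character invalidates the whole word, WER is typically much higher than CER on the same predictions, which matches the gap between the two metrics reported in the abstract.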

2502.20838 2026-05-29 cs.SD cs.AI cs.LG eess.AS

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

长时生物声学数据中鲸叫声的弱监督检测与时间定位

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

发表机构 * Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Japan（东京科学大学工学院系统与控制工程系）

AI总结提出DSMIL-LocNet框架，利用弱监督多实例学习仅使用录音级标签实现鲸叫声的分类和时间定位，在长录音上优于全监督基线。

Comments Accepted in European Signal Processing Conference (EUSIPCO) 2026

详情

AI中文摘要

被动声学监测（PAM）系统会生成持续数月的连续录音，但鲸叫声的自动化生物声学分析需要两种独立的标注工作：用于分类的二元存在标签和用于定位的精确时间边界。为一段数分钟的录音分配二元标签只需几秒钟，但为其中每一次叫声标注时间戳则需要专家花费数小时。在实际业务规模上同时提供两者并不可行。我们提出DSMIL-LocNet，一个弱监督多实例学习（MIL）框架，仅使用录音级的存在/缺失标签即可完成分类和时间定位。我们的双流架构整合频谱和时间特征，可处理2至30分钟的录音，且无需时间压缩，而这种压缩正是现有CNN方法在长输入上性能退化的原因。在AcousticTrends BlueFinLibrary上，DSMIL-LocNet在300至1800秒的录音上取得0.88至0.91的F1分数，而全监督CNN基线则退化至0.19至0.64。它还能提供时间定位，而这些基线在没有帧级标注的情况下无法做到。代码：https://github.com/Ragib-Amin-Nihal/DSMIL-LocNet

英文摘要

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2--30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88--0.91 on recordings of 300--1800s, where fully supervised CNN baselines degrade to 0.19--0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
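The core idea of multiple instance learning with recording-level labels can be sketched with plain max pooling: a recording (bag) is scored by its most call-like window (instance), and the windows that exceed a threshold give a coarse temporal localization. This is a generic illustration under stated assumptions, not the dual-stream DSMIL-LocNet architecture; the names `mil_predict` and `windows_to_seconds`, the threshold, and the window and hop lengths are all hypothetical.

```python
import math


def mil_predict(instance_scores, threshold=0.5):
    """Max-pooling MIL over per-window logits.

    Returns the bag-level probability and the indices of windows whose
    probability reaches `threshold`.
    """
    probs = [1.0 / (1.0 + math.exp(-s)) for s in instance_scores]
    bag_prob = max(probs)  # the bag is as positive as its most positive window
    located = [i for i, p in enumerate(probs) if p >= threshold]
    return bag_prob, located


def windows_to_seconds(indices, window_s=15.0, hop_s=15.0):
    """Map window indices to (start, end) times in seconds (assumed framing)."""
    return [(i * hop_s, i * hop_s + window_s) for i in indices]


if __name__ == "__main__":
    # Hypothetical logits for five consecutive windows of one recording.
    bag_prob, located = mil_predict([-3.0, -2.5, 2.0, 1.5, -4.0])
    print(round(bag_prob, 3), windows_to_seconds(located))
```

Only the bag-level probability needs a training label, which is why recording-level presence/absence annotation is sufficient; the per-window scores come for free at inference time.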

2502.10330 2026-05-29 cs.LG

Diffusion-based learning framework for Constrained Nonconvex Optimization with Weighted Bootstrapped Refinement

基于扩散的约束非凸优化学习框架与加权自举细化

Shutong Ding, Yimiao Zhou, Ke Hu, Xi Yao, Junchi Yan, Xiaoying Tang, Ye Shi

发表机构 * ShanghaiTech University（上海科技大学）； MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration（智能感知与人机协同教育部重点实验室）； Shanghai Jiao Tong University（上海交通大学）； China Mobile Communications Company Limited Research Institute（中国移动通信有限公司研究院）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出DiOpt框架，通过监督预热和自举训练两阶段学习噪声到约束区域的映射，解决扩散模型在约束非凸优化中的分布错位问题，实现高约束满足和最优性。

Comments accepted by ICML2026

AI中文摘要

扩散模型的最新进展显示出利用其多模态性加速非凸问题求解的潜力。然而，现有的大多数基于扩散的优化方法依赖于监督学习，并且缺乏强制满足约束的机制，而约束满足在现实应用中是必需的。为此，我们研究并从理论上分析了监督扩散求解器的固有问题，并识别出分布错位问题，即生成的解分布在可行区域上的概率质量通常较低。为了解决这个问题，我们提出了DiOpt，一种新的基于扩散的约束非凸优化学习框架，它能有效学习从噪声到约束区域的映射。具体来说，该框架分两个阶段运行：先是通过监督学习实现的预热阶段，随后是自举训练阶段。这种两阶段设计旨在迭代地细化解，从而在高度满足约束的前提下改进目标函数。最后，我们还在推理中采用解选择技术以获得更好的最优性。值得注意的是，DiOpt首次成功地将扩散求解器集成到约束非凸优化中。在多种非凸任务上的评估表明了DiOpt在最优性和约束满足方面的优越性。我们的官方页面发布在https://dingsht.tech/diopt-webpage。

英文摘要

Recent advances in diffusion models show promising potential to accelerate nonconvex problem solving by leveraging their multimodality. However, most existing diffusion-based optimization approaches rely on supervised learning and lack a mechanism to enforce constraint satisfaction, which is required in real-world applications. In that case, we investigate and theoretically analyze the inherent problem of supervised diffusion solvers and identify the distributional misalignment problem, i.e., the generated solution distribution often exhibits low probability mass on the feasible region. To resolve this issue, we propose DiOpt, a new diffusion-based learning framework for constrained nonconvex optimization, which effectively learns the mapping from noise to the constraint region. Specifically, this framework operates in two distinct phases: an initial warm-start phase, implemented via supervised learning, followed by a bootstrapping training phase. This dual-phase architecture is designed to iteratively refine solutions, thereby improving the objective function with high constraint satisfaction. Finally, we also employ a solution selection technique in inference for better optimality. Notably, DiOpt is the first successful integration of the diffusion solver in constrained nonconvex optimization. Evaluations on diverse nonconvex tasks demonstrate the superiority of DiOpt in both optimality and constraint satisfaction. Our official page is released at https://dingsht.tech/diopt-webpage.

2502.10205 2026-05-29 cs.LG

Looking around you: external information enhances representations for event sequences

环顾四周：外部信息增强事件序列的表示

Petr Sokerin, Maria Kovaleva, Ekaterina Boyarina, Pavel Tikhomirov, Denis Vorobiyov, Alexey Zaytsev

发表机构 * LARSS Laboratory, AI Center, Skoltech（Skoltech人工智能中心LARSS实验室）

AI总结针对事件序列表示学习中忽略同时发生序列上下文的问题，提出通过聚合多个用户表示来增强特定用户表示的方法，其中可学习注意力机制在多个数据集上显著提升指标。

AI中文摘要

Extending Explainable Ensemble Trees (E2Tree) to regression contexts

将可解释集成树（E2Tree）扩展到回归场景

Massimo Aria, Agostino Gnasso, Carmela Iorio, Marjolein Fokkema

发表机构 * Department of Economics and Statistics, University of Naples Federico II（那不勒斯费德里科二世大学经济学与统计学系）； Institute of Psychology, Leiden University（莱顿大学心理学研究所）

AI总结本文通过引入新的不相似度度量，将可解释集成树方法从分类扩展到回归，并在真实数据集上验证其解释能力。

Journal ref Applied Stochastic Models in Business and Industry, Vol. 42, No. 1, e70064 (2026)

DOI: 10.1002/asmb.70064

AI中文摘要

集成方法如随机森林通过聚合多个弱学习器提供了高精度的预测，改变了监督学习的格局。然而，尽管它们有效，这些方法往往缺乏透明度，阻碍了用户理解随机森林模型如何得出预测。可解释集成树（E2Tree）是一种解释随机森林的新方法，提供了响应变量与预测变量之间关系的图形表示。E2Tree的一个显著特点是它不仅考虑预测变量对响应的影响，还通过计算和使用不相似度度量来考虑预测变量之间的关联。E2Tree方法最初是为分类任务提出的。在本文中，我们将该方法扩展到回归场景。为了展示所提算法的解释能力，我们在真实数据集上进行了演示。

英文摘要

Ensemble methods such as random forests have transformed the landscape of supervised learning, offering highly accurate prediction through the aggregation of multiple weak learners. However, despite their effectiveness, these methods often lack transparency, impeding users' comprehension of how RF models arrive at their predictions. Explainable ensemble trees (E2Tree) is a novel methodology for explaining random forests, that provides a graphical representation of the relationship between response variables and predictors. A striking characteristic of E2Tree is that it not only accounts for the effects of predictor variables on the response but also accounts for associations between the predictor variables through the computation and use of dissimilarity measures. The E2Tree methodology was initially proposed for use in classification tasks. In this paper, we extend the methodology to encompass regression contexts. To demonstrate the explanatory power of the proposed algorithm, we illustrate its use on real-world datasets.
