2606.02552 2026-06-02 cs.CV cs.AI 版本更新

为什么不采用超参数友好的优化？一种用于长尾识别的单调自适应范数缩放方法

Shuo Zhang, Chenqi Li, Tingting Zhu

发表机构 * University of Oxford（牛津大学）

AI总结提出一种无需参数正则化的自适应单调归一化方法（SAMN），通过保序回归直接对类别权重范数施加单调性约束，实现超参数友好的长尾识别。

详情

AI中文摘要

长尾识别对深度学习构成了重大挑战。两阶段解耦范式将表示学习与分类器重训练分离，提供了一种有前景的解决方案。在分类器重训练阶段，自适应范数缩放是一种流行技术。它通过参数正则化调整每类权重范数，这不可避免地引入了超参数。然而，许多研究报告指出，长尾识别对这些超参数敏感，因为它们的设置显著影响性能。在本文中，我们首先从类条件分布的角度为范数缩放方法提供支持。此外，我们提出了一种简单而有效的方法，称为自适应单调归一化（SAMN）。SAMN避免了参数正则化的需求。它直接使用保序回归算法对每类权重范数施加单调性，使该方法对超参数友好。SAMN是一种通用策略，可与其他方法无缝集成以提升性能。在基准数据集上的实验表明，我们的方法显著提升了长尾识别性能，通常达到最先进的结果。

英文摘要

Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results.

URL PDF HTML ☆

赞 0 踩 0

2606.02522 2026-06-02 cs.CV cs.AI 版本更新

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video: 诊断视频多模态大语言模型在瞬时视觉事件上的时间保真度

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shandong University（山东大学）； Southeast University（东南大学）； Tencent Youtu Lab（腾讯优图实验室）

AI总结提出 Moment-Video 基准，通过瞬时视觉事件理解任务诊断视频 MLLMs 的时间保真度，发现最佳模型准确率仅 39.6%，多数开源模型低于 25%。

Comments 28 pages, 10 figures, 11 tables

详情

AI中文摘要

视频多模态大语言模型（MLLMs）在通用和长视频理解方面取得了快速进展，但它们保留简短答案关键视觉证据的能力仍未得到充分探索。许多实际问题由瞬时视觉事件决定：可能仅持续几帧的局部化动作或状态转换。这种证据可能因稀疏帧采样而跳过、因视觉标记压缩而抑制，或因粗粒度时间聚合而稀释，导致语言端推理无法可靠恢复的失败。我们引入了 Moment-Video，一个通过瞬时视觉事件理解来诊断视频 MLLMs 时间保真度的基准。每个问题都基于局部化、视觉可观察且对采样敏感的事件，要求模型注意、计数、描述或推理瞬态证据，而非依赖持久对象、全局场景上下文或语言先验。Moment-Video 包含 1,000 个人工验证的视频问答对，涵盖 7 个领域和 25 个细分子类别，覆盖四种任务类型：时间发生、时间计数、动作描述和时间推理。我们在 Moment-Video 上评估了 33 个专有和开源 MLLMs。最佳模型 Seed-2.0-Pro 仅达到 39.6% 的整体准确率，而大多数开源模型低于 25%，揭示了瞬时视觉事件理解方面的巨大差距。诊断分析表明，更密集的帧采样改善了一些模型，但并未消除瓶颈，更长的视频带来了更强的时间定位挑战。这些发现表明，当前视频 MLLMs 仍然缺乏时间保真的表示来捕捉、保留和使用简短但决定性的视觉证据。

英文摘要

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.

URL PDF HTML ☆

赞 0 踩 0

幽灵工具调用：投机性智能体工具的发布时隐私保护

Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems（马克斯·普朗克软件系统研究所）； EPFL（苏黎世联邦理工学院）； Aarhus University（阿arhus大学）

AI总结针对工具增强型语言智能体投机性预发调用泄露用户意图的问题，提出投机性工具隐私契约，在发布时而非提交后保护隐私。

详情

AI中文摘要

工具增强型语言智能体投机性地发出可能的未来工具调用以隐藏延迟，但这些调用在智能体提交分支之前将推断出的用户意图泄露给外部服务。每个收到调用的外部观察者在智能体放弃分支后仍保留该披露。问题在于时机，而非授权：提交后的清理、只读限制或访问控制白名单都无法撤回观察者已持有的信息。我们将这些调用称为幽灵工具调用，并提出投机性工具隐私契约，这是一种运行时抽象，将提交前的观察视为与状态突变不同的第一类效应。我们在原型运行时中实现了该契约，并在三个语料库上评估了十二种策略。投机性调度增加了观察者能够推断用户意图的程度；事后过滤器、只读限制和访问控制白名单无法消除这种推断；只有那些在调度前改变或抑制投机性调用的参数或目标投影的发布时策略才能减少这种推断。

英文摘要

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the call retains the disclosure after the agent abandons the branch. Timing is the issue, not authorization: no commit-time cleanup, read-only restriction, or access-control allow-list unsends what an observer already holds. We call these invocations ghost tool calls and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats observation before commitment as a first-class effect, distinct from state mutation. We implement the contracts in a prototype runtime and evaluate twelve policies across three corpora. Speculative dispatch increases what an observer can infer about user intent; post-hoc filters, read-only restrictions, and access-control allow-lists leave that inference intact; only issue-time policies that change or suppress the speculative call's argument or destination projection before dispatch reduce it.

URL PDF HTML ☆

赞 0 踩 0

2606.02470 2026-06-02 cs.AI 版本更新

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

MCP-Persona：通过环境模拟在真实个人应用上基准测试LLM智能体

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen

发表机构 * Tsinghua University（清华大学）

AI总结针对现有基准忽略个人社交应用中工具与个人账户或本地数据库交互的挑战，提出MCP-Persona基准，通过模拟真实个性化MCP工具评估LLM智能体性能，实验表明现有智能体在个性化工具使用上存在显著困难。

Comments ICML 2026 Camera Ready

详情

AI中文摘要

模型上下文协议（MCP）已成为连接大型语言模型（LLM）与外部数据源和工具的变革性标准，并已迅速在个人应用和开发平台中得到采用。然而，现有基准主要关注通用信息搜索工具，未能捕捉个人社交应用带来的实际挑战，在这些应用中工具与个人账户或本地数据库交互。为弥合这一关键差距，我们引入了MCP-Persona，这是首个专门用于评估智能体在真实世界个性化MCP工具上性能的基准。MCP-Persona涵盖了一系列多样化的广泛使用的应用，从社交媒体平台如Reddit和小红书（Rednote）到企业协作套件如飞书（Lark）和Slack。我们在各种最先进（SOTA）智能体上的广泛实验表明，它们在个性化工具使用上存在显著困难，从而凸显了该基准在识别和解决这些局限性方面的关键作用。MCP-Persona公开可用：https://github.com/wwh0411/MCP-Persona。

英文摘要

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona}{https://github.com/wwh0411/MCP-Persona.

URL PDF HTML ☆

赞 0 踩 0

2606.02465 2026-06-02 cs.CL cs.AI 版本更新

Learning When to Translate for Multilingual Reasoning

学习何时翻译以实现多语言推理

Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

发表机构 * Graduate School of Artificial Intelligence（人工智能研究生院）； Department of Computer Science & Engineering（计算机科学与工程系）； POSTECH

AI总结提出Luar框架，通过强化学习训练推理语言模型在直接理解不可靠时选择性调用翻译，从而缩小多语言推理差距。

Comments preprint

详情

AI中文摘要

推理语言模型（RLMs）在复杂推理任务上表现出色，但仍存在显著的多语言推理差距，这主要源于非英语输入中的语言理解失败。英语翻译可以通过将非英语输入转换为RLMs更可靠解释的形式来缓解这些失败，但当模型能够从原始查询中可靠推理时，翻译每个输入是不必要的。为应对这一挑战，我们提出Luar，一种语言理解边界感知的强化学习框架，训练RLMs在直接理解不可靠时选择性调用翻译。Luar训练模型在直接解决原始输入和对其英语翻译进行推理之间做出选择，仅在翻译增强推理预期显著优于直接推理时鼓励翻译。在多语言推理基准测试中，Luar优于标准GRPO和其他基于训练的基线，在低资源语言上尤其获得巨大提升。进一步分析表明，Luar在直接推理足够的情况下避免不必要的翻译，同时将其翻译调用行为扩展到未见过的低资源语言。总之，我们的工作提出了一种选择性多语言推理方法：RLMs可以学习仅在直接理解不可靠时调用翻译。该项目将在https://github.com/deokhk/LUAR公开。

英文摘要

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at https://github.com/deokhk/LUAR

URL PDF HTML ☆

赞 0 踩 0

2606.02463 2026-06-02 cs.CV cs.AI 版本更新

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

MASER: 面向具身3D空间智能的模态自适应专家路由

Hilton Raj, Vishnuram AV

发表机构 * Boston University（波士顿大学）

AI总结提出MASER框架，通过训练共享VLM骨干的五个模态适配器并学习基于问题选择最佳适配器的神经路由策略，解决具身代理在3D环境中多模态推理时忽略问题语义的问题。

Comments Accepted to CVPR 2026 Foundation Models Meet Embodied Agents Workshop

详情

AI中文摘要

在3D环境中，具身代理通过推理自然语言、RGB图像、点云、深度图和相机位姿等多模态信息来回答空间相关问题。现有的视觉语言模型（VLM）在单一模态上微调，完全忽略了可能偏好不同于微调模态的问题语义。为解决这一问题，我们提出MASER（模态自适应专家路由），一个轻量级框架，训练共享VLM骨干的五个不同模态适配器，并学习一个神经路由策略，在推理时根据问题选择最佳适配器。我们使用冻结的句子变换器对每个问题进行编码，并将嵌入通过一个小型多层感知器（MLP），该感知器在oracle适配器-准确率标签上训练。我们在Open3D-VQA基准上评估我们的方法，评估结果表明没有单一模态是普遍最优的——点云答案在51.5%的情况下最佳。MASER以51.3%的oracle一致性进行路由，优于随机森林消融（43.5%），且每个问题仅调用一次适配器。

英文摘要

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five different modality adapters of a shared VLM backbone and learns a neural routing policy that selects the best adapter based on the question during inference. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our methodology over the Open3D-VQA benchmark and our evaluations show that no single modality is universally optimal -- point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.

URL PDF HTML ☆

赞 0 踩 0

食物噪音与虚假安全：系统评估LLMs如何在临床医生反馈下未能适应饮食障碍查询

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar, Arabella Sinclair

发表机构 * University of Aberdeen（阿伯丁大学）； University of Colorado Anschutz（科罗拉多大学安舒茨分校）； Heriot-Watt University（赫里奥特-瓦特大学）； University College London（伦敦大学学院）

AI总结本研究通过与临床饮食障碍专家合作，系统评估了大型语言模型在处理饮食障碍用户查询时，因不加批判地适应不安全或自伤请求而可能产生的危害。

2606.02443 2026-06-02 cs.CL cs.AI cs.CV 版本更新

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video: 面向主动安全预警的流式视频基准

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Tsinghua University（清华大学）

AI总结提出PaSBench-Video基准，包含740个视频，评估多模态大模型在危险发生前及时发出预警的能力，发现现有模型在时序精度和低误报率上表现不佳。

详情

AI中文摘要

从危险的第一个可见迹象到事故发生之间，通常存在一个仍可干预的时间窗口。具备视频能力的多模态大语言模型（MLLM）可以作为始终在线的安全监控器，在此窗口内发出警告。然而，当前的基准测试并未检验这一能力：它们依赖静态输入，忽略时间精度，并且省略了对安全场景的误报测量。我们提出了PaSBench-Video，一个包含740个视频的基准测试，涵盖驾驶、医疗、日常生活和工业生产四个领域，其中包含481个风险视频和259个无风险视频。风险视频标注了帧级别的风险起始点和事故边界。模型必须以因果方式观察视频，并发出在时间上校准且内容正确的警告。测试了13个MLLM后，我们发现没有模型在我们的最严格指标上超过20.0%，并且召回率与误报率紧密相关，皮尔逊相关系数为0.64：更高的检测率只能以在大多数安全片段上触发警告为代价。性能按领域显著分化：在日常生活领域，模型在低误报率下实现了中等召回率，因为该领域的风险本质上是异常的；而在驾驶领域，模型不加区分地触发警告，因为常规场景和危险场景看起来相似。这些结果表明，当前模型依赖于场景级别的活动线索，而不是推理正在出现的危害。

英文摘要

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.

URL PDF HTML ☆

赞 0 踩 0

2606.02438 2026-06-02 cs.AI 版本更新

LLM-Evolved Pattern Generators for Optimal Classical Planning

LLM演化模式生成器用于最优经典规划

Windy Phung, Dominik Drexler, Arnaud Lequen, Jendrik Seipp

AI总结提出首个通过LLM驱动的进化程序合成学习可容许启发式函数的方法，用于最优经典规划，结合饱和成本分区保证A*搜索的最优性。

详情

AI中文摘要

学习到的启发式函数最近已成为满足规划中传统领域无关启发式函数的竞争性替代方案。然而，现有方法侧重于改进搜索引导而非保证可容许性，这使得它们不适用于最优经典规划。我们提出了第一种学习领域相关启发式函数的方法，这些启发式函数在设计上是可容许的，从而保留了A*搜索的最优性保证。我们不是学习从状态到启发式值的直接映射，而是学习构建可诱导可容许启发式函数的抽象。我们使用LLM驱动的进化程序合成框架，为每个领域获得一个程序，该程序为该领域中的任何任务生成模式集合，并通过饱和成本分区以可容许的方式组合所得模式。实验表明，学习到的程序编码了可解释的领域特定见解，在测试时以可忽略的开销运行，并在多个领域上产生了与最先进的领域无关基线相匹配的覆盖范围，同时每个状态的评估速度显著更快。

英文摘要

Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Existing approaches, however, focus on improving search guidance rather than guaranteeing admissibility, which makes them unsuitable for optimal classical planning. We present the first method for learning domain-dependent heuristics that are admissible by design and thus preserve the optimality guarantees of A* search. Instead of learning a direct mapping from states to heuristic values, we learn to construct abstractions that induce admissible heuristics. We use an LLM-driven evolutionary program-synthesis framework to obtain, for each domain, a program that produces a pattern collection for any task in that domain, and we combine the resulting patterns admissibly via saturated cost partitioning. Empirically, the learned programs encode interpretable domain-specific insights, run with negligible overhead at test time and yield heuristics that match the coverage of state-of-the-art domain-independent baselines on several domains while evaluating each state substantially faster.

URL PDF HTML ☆

赞 0 踩 0

2606.02434 2026-06-02 cs.AI 版本更新

GC-MoE: 基因组引导的细胞类型特异性专家混合模型用于基于组织学的单细胞空间转录组学

Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita, Kazuya Nishimura, Andreas Dengel, Muhammad Nabeel Asim

发表机构 * Kyushu University（九州大学）； German Research Center for Artificial Intelligence (DFKI GmbH)（德国人工智能研究中心）； RPTU University Kaiserslautern-Landau（科布伦茨-劳恩堡大学）； The University of Osaka（大阪大学）； IntelligentX GmbH ； Osaka Metropolitan University（大阪 Metropolitan 大学）

AI总结提出GC-MoE模型，通过路由网络估计细胞类型概率并软组合细胞类型特异性专家，结合细胞类型特异性共表达感知预测器和细胞间交互注意力模块，从组织学图像和细胞位置预测单细胞基因表达，在公共数据集上优于现有方法。

详情

AI中文摘要

基于组织学的单细胞空间转录组学（ST）估计旨在从组织病理学图像和细胞位置预测单个细胞的基因表达，从而减少对昂贵的单细胞ST测量的需求。与现有的组织学到ST方法主要预测包含多个细胞的局部区域的斑点级谱不同，该任务需要对细胞间的表达变异性进行建模，而这种变异性强烈地由细胞类型结构化。我们提出了基因组引导的细胞类型特异性专家混合模型（GC-MoE），该模型通过路由网络估计细胞类型概率，并软组合细胞类型特异性专家进行基因表达预测。为了进一步编码细胞类型依赖的基因程序，我们引入了细胞类型特异性共表达感知预测器（CAP），以及一个轻量级的细胞间交互注意力（C2CA）模块用于邻域细胞上下文。在公共单细胞ST数据集上的实验和消融研究表明，该方法在现有单细胞和适应性斑点级基线方法上均有一致的改进。

英文摘要

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.02418 2026-06-02 quant-ph cs.AI 版本更新

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

基于LLM引导搜索的双变量自行车码的进化发现

Juan Cruz-Benito, Andrew W. Cross, David Kremer, Ismael Faro

发表机构 * IBM Research（IBM研究院）； IBM T. J. Watson Research Center（IBM T.J. 巴特勒研究中心）

AI总结提出一种LLM引导的进化工作流，通过变异生成双变量自行车码和扰动变体的Python程序，在约1650次迭代中筛选约2×10^5个候选码，发现了465个不同候选码，包括非CSS扰动码和CSS码，展示了LLM引导的程序进化在结构化量子码发现中的实用性。

详情

AI中文摘要

量子LDPC码的发现需要在大型代数设计空间中进行搜索，同时可靠地认证任何候选码的参数和等价类。我们引入了一种LLM引导的进化工作流，其中语言模型变异生成双变量自行车码和扰动双变量自行车码ansätze的Python程序。在五次活动中，系统执行了约1,650次进化迭代，筛选了约$2 \times 10^5$个候选码，需要约140小时的计算时间和约400美元的LLM推理成本。候选码通过一个分阶段验证流水线进行评估，该流水线结合了$\mathrm{GF}(2)$秩计算、距离估计和认证、混合整数线性规划、BLISS Tanner图去重、可分解性分析和局部Clifford等价检查。在块长度$n \leq 360$时，工作流识别出465个不同的候选码：97个CSS双变量自行车码和368个非CSS扰动变体。CSS搜索恢复了已知的高性能码，并找到了新的有限长度代表，包括一个不可分解的[[288,16,12]]码和更高权重的码，在距离$d = 8$时最多有$k = 50$。非CSS搜索产生了在[[144,12,12]]处匹配总码品质因子的扰动码，以及根据MILP状态报告为认证值或上界的额外高距离候选码。总体而言，这些结果表明，当与独立评估配对时，LLM引导的程序进化可以作为一种实用的结构化量子码发现工具。

英文摘要

Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence classes of any candidates found. We introduce an LLM-guided evolutionary workflow in which language models mutate Python programs that generate bivariate-bicycle and perturbed bivariate-bicycle code ansätze. Across five campaigns, the system performed approximately 1{,}650 evolutionary iterations, screened about $2 \times 10^5$ candidate codes, and required ${\sim}140$ hours of computation and ${\sim}$US\$400 in LLM inference cost. Candidate codes are evaluated through a staged validation pipeline combining $\mathrm{GF}(2)$ rank computation, distance estimation and certification, mixed-integer linear programming, BLISS Tanner-graph deduplication, decomposability analysis, and local-Clifford equivalence checks. At block length $n \leq 360$, the workflow identifies 465 distinct candidate codes: 97 CSS bivariate-bicycle codes and 368 non-CSS perturbed variants. The CSS search recovers known high-performing codes and finds new finite-length representatives, including an indecomposable [[288,16,12]] code and higher-weight codes with up to $k = 50$ at distance $d = 8$. The non-CSS search produces perturbed codes matching the gross-code figure of merit at [[144,12,12]], along with additional high-distance candidates reported as certified values or upper bounds according to MILP status. Overall, these results show that LLM-guided program evolution can serve as a practical tool for structured quantum-code discovery when paired with independent evaluation.

URL PDF HTML ☆

赞 0 踩 0

2606.02388 2026-06-02 cs.LG cs.AI 版本更新

Policy and World Modeling Co-Training for Language Agents

语言智能体的策略与世界模型协同训练

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang

发表机构 * Southern University of Science and Technology（南方科技大学）； Hong Kong University of Science and Technology（香港科学大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科学大学（广州））； Hong Kong Polytechnic University（香港理工大学）； LIGHTSPEED

AI总结提出PaW框架，通过在强化学习过程中添加辅助世界模型监督，无需改变推理范式，提升语言智能体在多个任务上的性能。

Comments 9 pages, 6 figures

详情

AI中文摘要

强化学习通过教导大语言模型智能体哪些行动能带来高奖励来改进它们，但对这些行动对环境的影响提供很少的监督。世界建模可以填补这一空白，但现有方法通常需要单独的模拟器、额外的训练阶段或额外的推理时计算。我们观察到，在策略强化学习 rollout 已经包含了所需的信号：每个转移将行动与其产生的下一个观察配对。基于这一观察，我们提出了PaW，一个策略和世界模型协同训练框架，它在强化学习过程中向同一策略添加辅助世界模型监督，而不改变推理范式。为了使辅助世界模型监督信息丰富且稳定，PaW引入了三个组件：基于行动熵的世界模型数据选择、噪声容忍的世界模型损失和奖励自适应的损失平衡。在三个智能体任务基准上的实验表明，在不同模型和强化学习算法上，PaW相对于强强化学习基线有一致的改进。这些结果表明，标准的强化学习 rollout 是语言智能体训练中世界模型监督的实用来源。

英文摘要

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.

URL PDF HTML ☆

赞 0 踩 0

2606.02381 2026-06-02 cs.AI cs.LG math.DS 版本更新

A Mathematical Conflict Framework for Contextual Data Modulation

上下文数据调制的数学冲突框架

Hakan Emre Kartal

发表机构 * GitHub

AI总结提出一个基于算子的数学冲突框架，将冲突视为局部、方向性和上下文敏感的量，通过统一抽象算子整合权重、尺度行为和输出映射，作为独立于优化过程的数学对象。

Comments 15 pages, 3 figures, framework paper

2606.02380 2026-06-02 cs.CL cs.AI 版本更新

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

SPADE-Bench：通过计划-行动分歧评估智能体中的自发性策略欺骗

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

发表机构 * Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Peking University（北京大学）； University of Science and Technology of China（中国科学技术大学）； University of Chinese Academy of Science（中国科学院大学）； Alibaba Group（阿里巴巴集团）

AI总结针对LLM智能体在工具使用中可能出现的自发性策略欺骗（计划与行动不一致），提出SPADE-Bench基准，通过结合实际工具执行和受控压力场景，严格区分欺骗与幻觉，实验证实该问题真实且紧迫。

详情

AI中文摘要

随着基于LLM的智能体扩展其操作范围，可靠性成为实际部署的前提。然而，在实际应用中，人类用户无法监控每一个即时行为；相反，执行过程往往是一个黑箱，用户仅依赖智能体的自我报告更新。这种不透明性带来了关键风险：智能体可能呈现与执行行动不一致的面向观察者的报告，使得系统不可控，尤其是在高风险自主场景中。我们将这种自我报告的计划-行动分歧称为智能体欺骗。为了评估这一点，我们引入了SPADE-Bench，一个旨在评估自发性计划-行动分歧的基准。与先前的欺骗基准不同，SPADE-Bench同时集成了实际工具执行和受控压力场景。这种设计确保了生态效度，并通过在压力下进行受控的计划-行动比较，严格区分策略欺骗与单纯的幻觉。跨主流模型的实验证实，智能体欺骗在工具使用环境中是一个真实且紧迫的问题。通过提供一个全面且稳健的评估框架，SPADE-Bench填补了智能体安全中的关键空白，促进社区朝着构建可信和可控的自主系统迈进。

英文摘要

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

URL PDF HTML ☆

赞 0 踩 0

2606.02374 2026-06-02 cs.AI 版本更新

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

超越像素的空间表示学习：统一栅格数据和向量语义以构建以人为中心的地理空间基础模型

Steffen Knoblauch, Hao Li, Gengchen Mai, Konstantin Klemmer, Song Gao, WenWen Li

AI总结本文提出统一栅格感知与向量推理的联合空间表示学习范式，旨在解决当前地球观测基础模型仅依赖栅格模态、忽略向量数据中丰富结构化信息的局限性。

详情

AI中文摘要

地球观测（EO）从根本上改变了环境过程和人类活动的监测，达到了行星尺度。自监督学习的最新进展催生了地球观测基础模型（EOFMs），这些模型利用PB级未标记EO数据学习跨广泛下游地理空间任务的可迁移表示。尽管取得了这些进展，当前的EOFMs仍然局限于栅格模态，忽视了诸如OpenStreetMap和Overture等可公开访问的向量数据源中编码的丰富结构化信息。向量数据提供了地理实体的显式和紧凑表示，包括几何、拓扑和语义关系，提供了在图像中通常模糊或难以获取的关键上下文信号。因此，栅格和向量数据代表了地理空间的互补视图：栅格数据捕捉连续的物理和光谱模式，而向量数据编码离散对象及其关系结构，并且通常更多地代表人类系统而非物理系统（例如社会或人口数据）。然而，现有的地理空间表示学习范式孤立地处理这些模态，依赖于不完美且常有损的转换来桥接它们。这篇观点文章呼吁向联合空间表示学习（SRL）的范式转变，即在统一的嵌入空间中整合栅格感知与基于向量的推理。基于多模态地理空间学习的新兴努力，我们强调了对齐异构空间数据源的概念基础、技术挑战和有前景的方向。我们认为，这种整合对于开发能够更准确、可解释且语义扎实地理解地球的下一代地理空间AI系统至关重要。

英文摘要

Earth Observation (EO) has fundamentally transformed the monitoring of environmental processes and human activities up to planetary scale. Recent advances in self-supervised learning have given rise to Earth Observation Foundation Models (EOFMs), which leverage petabyte-scale unlabeled EO data to learn transferable representations across a wide range of downstream geospatial tasks. Despite these advances, current EOFMs remain largely confined to raster modalities, overlooking the rich, structured information encoded in openly-accessible vector data sources such as OpenStreetMap and Overture. Vector data provides explicit and compact representations of geographic entities, including geometry, topology, and semantic relationships, offering critical contextual signals that are often ambiguous or inaccessible in imagery alone. Raster and vector data thus represent complementary views of geographic space: raster data captures continuous physical and spectral patterns, while vector data encodes discrete objects and their relational structure and often represents more of the human rather than the physical systems (e.g. social or demographic data). However, existing geospatial representation learning paradigms treat these modalities in isolation, relying on imperfect and often lossy transformations to bridge them. This perspective paper calls for a paradigm shift toward joint Spatial Representation Learning (SRL) in an unified embedding space that integrate raster perception with vector-based reasoning. Building on emerging efforts in multimodal geospatial learning, we highlight conceptual foundations, technical challenges, and promising directions for aligning heterogeneous spatial data sources. We contend that such integration is essential for developing next-generation geospatial AI systems capable of more accurate, interpretable, and semantically grounded understanding of the Earth.

URL PDF HTML ☆

赞 0 踩 0

2606.02373 2026-06-02 cs.AI cs.CL cs.IR 版本更新

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1：基于状态外化马具的搜索智能体强化学习

Pengcheng Jiang, Zhiyi Shi, Kelly Hong, Xueqiang Xu, Jiashuo Sun, Jimeng Sun, Hammad Bashir, Jiawei Han

AI总结提出Harness-1，一个20B参数的搜索智能体，通过强化学习在有状态搜索马具中训练，将常规状态管理外化到环境，在八个检索基准上平均召回率0.730，超越现有开源搜索子智能体11.4个百分点。

详情

AI中文摘要

搜索智能体通常被训练为基于不断增长的转录的策略：模型必须决定如何搜索，同时记住它看到了什么、哪些证据有用、哪些约束仍然开放、哪些声明已被检查。我们认为这种表述将过多的常规状态管理放在策略内部：强化学习被迫同时优化语义搜索决策和可恢复的簿记，而环境可以更可靠地维护这些簿记。我们引入Harness-1，一个20B参数的搜索智能体（检索子智能体），在有状态搜索马具内通过强化学习训练。该马具维护环境端的工作记忆，包括候选池、重要性标记的精选集、紧凑的证据链接、验证记录、压缩和去重的观察结果，以及预算感知的上下文渲染。策略保留语义决策：搜索什么、保留或丢弃哪些文档、验证什么以及何时停止。在涵盖网络、金融、专利和多跳问答的八个检索基准上，Harness-1实现了0.730的平均精选召回率，比次强的开源搜索子智能体高出11.4个百分点，并与更大的前沿模型搜索器保持竞争力。其优势在保留的迁移基准上尤为显著，表明基于显式搜索状态的强化学习可以产生超越训练领域的检索行为。我们的代码可在https://github.com/pat-jj/harness-1获取。

英文摘要

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at https://github.com/pat-jj/harness-1.

URL PDF HTML ☆

赞 0 踩 0

2606.02372 2026-06-02 cs.AI cs.CL 版本更新

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

COMAP：面向LLM智能体的世界模型与智能体策略协同进化

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

发表机构 * Central South University（中南大学）； College of Computer Science, Sichuan University（四川大学计算机学院）； Department of Computing, The Hong Kong Polytechnic University（香港理工大学计算机系）

AI总结提出COMAP框架，通过闭环交互协同进化文本世界模型和智能体策略，在具身任务规划、网页导航和工具使用基准上显著提升性能。

详情

AI中文摘要

为语言智能体配备世界模型使其能够在执行前预测环境动态并评估候选动作。然而，现有的文本世界模型通常在训练后固定不变，无法适应由进化中的智能体引发的策略内状态-动作分布。同时，智能体改进方法往往依赖外部奖励或验证器，限制了其在现实交互环境中的适用性。本文提出COMAP，一种通过闭环交互协同进化文本世界模型和智能体策略的新框架。在每个决策步骤，世界模型预测候选动作的未来状态反馈，智能体通过估计该反馈的可靠性并相应调整动作来进行未来感知反思。由此产生的策略内轨迹随后通过自蒸馏用于更新世界模型，使其更好地匹配智能体不断演化的交互分布。在具身任务规划、网页导航和工具使用基准上，COMAP始终优于竞争基线，例如使用Qwen3-4B相对提升16.75%。进一步分析表明，协同进化循环随时间提高了世界模型的预测准确性，并导致更有效的长程决策。我们的代码可在https://github.com/loyiv/CoMAP获取。

英文摘要

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

URL PDF HTML ☆

赞 0 踩 0

2606.02365 2026-06-02 cs.LG cs.AI 版本更新

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

FOAM：基于频率和算子误差的自适应阻尼方法，用于减少Shampoo的陈旧性误差

Kyunghun Nam, Sumyeong Ahn

发表机构 * Kyunghun Nam ； Sumyeong Ahn

AI总结提出FOAM算法，通过自适应控制阻尼因子和特征分解频率来抑制陈旧性误差，在保持收敛的同时减少Shampoo的计算时间。

Comments 9 pages, ICML 2026 camera-ready version

详情

AI中文摘要

Shampoo因其在大规模优化基准上的卓越性能而备受关注，但它面临一个重要的实际瓶颈：矩阵求逆的过高计算开销。为了缓解这一问题，从业者通常依赖陈旧的预条件子更新，这在计算效率和优化保真度之间产生了根本性的权衡。在这项工作中，我们通过收敛性和稳定性的互补视角对陈旧性进行了理论研究。虽然陈旧性提高了计算效率，但它固有地降低了性能并引入了数值不稳定性。关键的是，我们发现作为数值稳定器的阻尼可以有效抑制这些负面影响。在此分析指导下，我们提出了FOAM，一种自适应算法，通过基于陈旧性误差的近似动态控制阻尼因子和特征分解频率来稳定训练。实验结果表明，与标准Shampoo相比，FOAM在保持稳健收敛的同时减少了挂钟时间。

英文摘要

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.

URL PDF HTML ☆

赞 0 踩 0

2606.02359 2026-06-02 cs.AI 版本更新

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

MOC：基于LLM的多智能体系统中的多阶通信

Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan, Qiang Duan

发表机构 * Fudan University（复旦大学）； Nanjing Normal University（南京师范大学）； Pennsylvania State University（宾夕法尼亚州立大学）

AI总结提出多阶通信（MOC）方案，通过重构智能体间通信以捕获多跳依赖，并设计结构消息合并策略，在多个数据集上提升任务性能并降低通信成本。

详情

AI中文摘要

尽管基于大语言模型（LLM）的多智能体系统取得了显著进展，但大多数研究侧重于优化协调拓扑，而同样关键的问题——如何有效地在智能体之间传输和优化消息——却很大程度上未被充分探索。当前的通信方案通常依赖于一阶邻居响应的直接拼接，这导致了受限的证据感受野，并使得关键信息在多跳路径上被稀释。为了解决这些局限性，我们提出了多阶通信（MOC）方案，该方案重构了智能体间通信以捕获多跳依赖，并引入了一种结构消息合并策略以确保效率。具体来说，我们形式化了通信机制以构建结构化的多阶证据流，随后设计了一种语义-拓扑合并算法，以在令牌约束内优化语义保真度。在六个不同数据集和不同参数规模的LLM骨干上的大量实验表明，MOC一致地提升了任务性能并降低了通信成本。代码可在 https://github.com/yao-guan/MOC 获取。

英文摘要

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at https://github.com/yao-guan/MOC.

URL PDF HTML ☆

赞 0 踩 0

2606.02357 2026-06-02 cs.CV cs.AI 版本更新

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

多模态智能体真的从工具使用中受益吗？能力增益的系统性研究

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

AI总结通过对比工具增强与无工具的多模态智能体在多项任务上的表现，发现工具使用并未带来一致的性能提升，智能体更多是学会了工具调用模式而非真正利用工具扩展能力。

详情

AI中文摘要

工具增强的多模态智能体在基准测试中表现出显著提升，这常被视为智能体已学会使用工具的证据。我们认为这种解读可能为时过早：仅凭工具调用轨迹并不能证明工具提供了答案关键信息。我们研究了两种代表性的“用图像思考”智能体，Thyme 和 DeepEyesV2，在真实世界理解、OCR、图表理解和数学推理任务上的表现。每个智能体与其无工具版本以及从同一源池训练但不含工具调用轨迹的纯文本推理器进行比较。工具访问并未带来一致的总体改进，未能可靠地降低生成令牌成本，并且仅留下一个很小的仅工具解决集：DeepEyesV2 的 93% 工具解决问题和 Thyme 的 96% 也被至少一种无工具设置解决。机制消融进一步表明，完整的工具使用循环并不始终优于单独的工具调用格式或返回的执行结果。在我们研究的设置中，所分析的智能体似乎更可靠地学习了工具调用模式而非工具贡献的能力，这表明评估应区分工具的可用性与工具是否真正扩展了智能体可解决的问题。

英文摘要

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative ``thinking with images'' agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.

URL PDF HTML ☆

赞 0 踩 0

2606.02355 2026-06-02 cs.AI cs.LG 版本更新

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

SIRI：具有内在技能的自我内化强化学习用于LLM智能体训练

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai

发表机构 * Xiamen University（厦门大学）； Meituan（美团）； Macao Polytechnic University（澳门 polytechnic 大学）

AI总结提出SIRI框架，通过自我技能挖掘、验证和内化，使LLM智能体无需外部技能生成器或推理时技能库即可提升长程任务性能，在ALFWorld和WebShop上优于基线方法。

详情

AI中文摘要

长程LLM智能体可以从可重用技能中受益，但现有的基于技能的方法通常依赖于训练期间的外部技能生成器或推理时的持久技能检索，增加了工程复杂性、上下文长度和部署延迟。我们提出了具有内在技能的自我内化强化学习（SIRI），这是一个三阶段框架，使智能体能够发现、验证和内化技能，无需外部技能生成器或推理时的技能库。SIRI首先使用GiGPO预热策略以获得基本交互能力并收集成功的无技能轨迹。然后进行自我技能挖掘，当前策略从其自身的成功普通轨迹中总结紧凑技能，并通过配对的技能增强和技能无关轨迹进行验证。最后，SIRI仅使用轨迹级效用和动作级优势将有帮助的技能引导动作令牌蒸馏到普通策略中。推理时，智能体仅使用原始提示运行。在ALFWorld和WebShop上使用Qwen2.5-7B-Instruct，SIRI将GiGPO从ALFWorld的0.908提升到0.930，从WebShop的0.728提升到0.813，优于基于提示、基于强化学习和基于记忆增强的基线。进一步分析表明，我们的自我挖掘策略可以实现与闭源大模型蒸馏相当的性能。我们的代码可在https://github.com/kirito618/SIRI获取。

英文摘要

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation with closed-source large model. Our code is available at https://github.com/kirito618/SIRI.

URL PDF HTML ☆

赞 0 踩 0

2606.02337 2026-06-02 cs.AI 版本更新

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

约束多智能体强化学习的协调图

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

发表机构 * Department of Electrical and Computer Engineering, Linköping University（1 链çe普大学电气与计算机工程系）

AI总结提出CG-CMARL框架，利用协调图和拉格朗日对偶分解联合动作空间与约束耦合问题，实现独立于智能体数量的模型学习，并通过Max-Sum消息传递和拉格朗日乘子控制目标-约束权衡，生成帕累托前沿。

Comments Accepted at the Reinforcement Learning Conference (RLC) 2026. 40 pages (12 main + 28 appendix), 5 figures, 16 tables, 7 theorems

详情

AI中文摘要

约束多智能体强化学习（CMARL）面临两个相互交织的挑战：联合动作空间随智能体数量指数增长，以及额外的约束以奖励结构无法捕捉的方式耦合智能体。我们引入了用于约束多智能体强化学习的协调图（CG-CMARL），这是一个通过结合协调图和拉格朗日对偶性来应对这两个挑战的框架。该系统将联合问题分解为成对区域，每个区域由一组共享的Q函数服务，一个用于主要目标，每个约束对应一个，使得学习模型的数量与智能体数量无关。在执行时，Max-Sum消息传递在因子图上协调动作，而拉格朗日乘子控制目标-约束权衡，允许单个训练模型无需重新训练即可描绘帕累托前沿。我们在温和条件下提供了收敛保证，以及一个可分解为独立可解释来源的组合误差界，每个来源可追溯到特定的设计选择并可独立控制。在协作导航任务（其中多达10个智能体的团队必须协调到达目标位置，同时满足成对约束）上的实验表明，我们的方法产生的帕累托前沿优于以固定奖励塑形比率训练的既有基线，同时扩展到集中式方法变得棘手的大规模团队。

英文摘要

Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective--constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.

URL PDF HTML ☆

赞 0 踩 0

2606.02326 2026-06-02 cs.AI 版本更新

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

否决前修复：面向上下文决策的修复增强约束学习

Yifan Wang

AI总结提出修复增强约束学习（RACL）框架，将已知修复操作融入分类器语义，在否决前考虑可负担修复，以降低错误否决率并揭示决策规则的可学习性。

Comments 7 pages, 3 figures

详情

AI中文摘要

硬约束通常被视为最终否决：一旦候选违反要求，学习规则拒绝它，任何修复都在决策语义之外处理。这忽略了一种常见的部署场景，即系统已经知道有限的修改菜单，例如添加票务选项、更改配置或请求可用的服务升级。现有的约束学习、软松弛和补救方法解决了邻近问题，但它们没有学习在否决前是否应修复某个选项。我们引入修复增强约束学习（RACL），一种上下文决策框架，将已知修复算子提升到分类器语义中。当可负担的修复使候选可行且足够偏好时，候选被接受；否则系统返回结构化的拒绝信用，并在适用时返回修复计划。这种否决前修复视图严格推广了无修复的HASSLE风格语义，揭示了终端否决规则不可约的错误否决差距，将二分类不可识别性与决策规则可学习性分离，并为观测可行性共享权重设置提供了容量和校准界限。在受控和DB1B衍生基准测试中，RACL恢复了预期的信用和修复结构。在最难的原始数据衍生层级上，验证选择的RACL将错误否决减少到10/4039（FVR 0.0025），而最强的修复搜索黑盒基线约为1064/4039，同时明确展示了FVR/EDR权衡。

英文摘要

Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repair is handled outside the decision semantics. This misses a common deployed regime in which the system already knows a finite menu of modifications, such as adding a ticket option, changing a configuration, or requesting an available service upgrade. Existing constraint-learning, soft-relaxation, and recourse methods address nearby problems, but they do not learn whether an option should be repaired before being vetoed. We introduce Repair-Augmented Constraint Learning (RACL), a contextual decision framework that lifts known repair operators into the classifier semantics. A candidate is accepted when an affordable repair makes it feasible and preferred enough; otherwise the system returns a structured rejection credit and, when applicable, a repair plan. This repair-before-veto view strictly generalizes no-repair HASSLE-style semantics, reveals an irreducible false-veto gap for terminal-veto rules, separates binary-label non-identifiability from decision-rule learnability, and gives capacity and calibration bounds for the observed-feasibility shared-weight setting. Across controlled and DB1B-derived benchmarks, RACL recovers the intended credit and repair structure. On the hardest raw-data-derived tier, validation-selected RACL reduces false vetoes to 10/4039 (FVR 0.0025), versus about 1064/4039 for the strongest repair-search black-box baseline, while making the FVR/EDR trade-off explicit.

URL PDF HTML ☆

赞 0 踩 0

2606.02322 2026-06-02 cs.LG cs.AI 版本更新

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

重新利用对抗扰动进行持续学习：从防御到主动对齐

Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li, Rongsheng Li, Ning Li, Zhen Xu, Weiqing Huang, Ming Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； Deakin University（德肯大学）； Harbin Engineering University（哈尔滨工程大学）

AI总结提出AdvCL框架，通过将对抗扰动重新用作几何控制信号，结合三个即插即用模块（Intra-Smooth、Proto-Clip、Inter-Align），在持续学习中同时提升标准性能、鲁棒性、降低遗忘并增强迁移。

详情

AI中文摘要

在动态环境中，大型语言模型需要不断适应新任务，但持续学习常常遭受遗忘、有限的迁移以及对对抗扰动的脆弱性。为了解决这个问题，我们提出了AdvCL，它将对抗扰动重新用作稳定的持续适应的几何控制信号。AdvCL结合了三个即插即用模块：Intra-Smooth通过小的对抗扰动促进局部平滑性；Proto-Clip使用相似性裁剪以防止过度对齐到当前任务原型；Inter-Align则通过对齐到先前任务原型的方向性对齐来减少表示间隙。实验表明，在标准性能和鲁棒性方面均有一致的提升，同时具有更低的遗忘和更强的迁移。我们进一步通过量化Intra-Smooth对扰动设置的敏感性以及Inter-Align对任务相似性和几何距离的影响，分析了关键机制。总之，这些模块在组合时提供互补增益，每个模块也可以单独集成到各种持续学习范式中，包括回放、正则化和动态架构，从而为持续学习提供了一种几何控制机制。

英文摘要

In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, limited transfer, and vulnerability to adversarial perturbations. To address this, we present AdvCL, which repurposes adversarial perturbations as a geometric control signal for stable continual adaptation. AdvCL combines three plug-in modules: Intra-Smooth promotes local smoothness via small adversarial perturbations; Proto-Clip uses similarity clipping to prevent excessive alignment to current task prototype; and Inter-Align applies directional alignment toward previous task prototype to reduce representational gaps. Experiments show consistent gains in both standard performance and robustness, with lower forgetting and stronger transfer. We further analyze key mechanisms by quantifying the sensitivity of Intra-Smooth to perturbation settings and the effect of Inter-Align on task similarity and geometric distance. In summary, the modules provide complementary gains when combined, and each can also be integrated individually into diverse CL paradigms, including replay, regularization, and dynamic architectures, thereby offering a geometric control mechanism for continual learning.

URL PDF HTML ☆

赞 0 踩 0

2606.02302 2026-06-02 cs.CR cs.AI 版本更新

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

SeClaw: 面向自主代理评估的规范驱动安全任务合成

Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi, Yichi Wang, Jing Yang, Taowen Wang, Jinhao Duan, Mengshu Sun, Peiyan Dong, Xuan Shen, Yang Cao, Renjing Xu, Kaidi Xu, Jindong Gu, Bo Zhang, Jize Zhang, Chenhao Lin, Philip Torr, Chao Shen

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Ant Digital Technologies, Ant Group（蚂蚁集团数字技术部）； Xi’an Jiaotong University（西安交通大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； University of Oxford（牛津大学）； City University of Hong Kong（香港城市大学）； Institute of Science Tokyo（东京科学研究所）； Zhejiang University（浙江大学）； Massachusetts Institute of Technology（麻省理工学院）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Beijing University of Technology（北京理工大学）

AI总结提出SeClaw框架，通过规范驱动的安全任务合成与基于执行的安全评估，实现对自主LLM代理在状态化环境中的安全风险的可扩展、可复现评估。

详情

AI中文摘要

自主LLM代理越来越多地在有状态环境中运行，访问工具、文件、内存和外部服务。虽然这些能力支持复杂的现实工作流，但它们也引入了难以通过现有评估捕获的安全风险。当前的代理安全基准通常依赖手动策划的任务，对新兴威胁的覆盖有限，并且主要关注最终结果而非导致不安全行为的执行过程。我们引入了SeClaw，一个结合规范驱动的安全任务合成与基于执行的安全评估的框架，用于自主代理。规范驱动的安全任务合成能够从结构化风险规范中可扩展且可控地构建安全任务，而SeClaw docker提供了一个标准化测试平台，用于评估代理在各种安全风险场景下的行为。该基准涵盖了由资源、用户任务、环境和内在代理行为引起的风险，并支持对不安全行为的轨迹感知评估，超越最终响应。通过桥接系统化的任务合成和可复现的安全评估，SeClaw为测量、诊断和比较自主LLM代理中的安全故障提供了实用基础。代码可在 https://github.com/seclaw-eval/seclaw-eval 获取。

英文摘要

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for Autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

URL PDF HTML ☆

赞 0 踩 0

2606.02301 2026-06-02 cs.HC cs.AI cs.CV 版本更新

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

定量运动测试：从单部智能手机视频测量患者运动

Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

发表机构 * Nuffield Department of Clinical Neurosciences, University of Oxford（临床神经科学系,Nuffield大学,牛津大学）； Max Planck Institute of Biological Cybernetics（生物信息学研究所）； Oxford Gait Laboratory, University of Oxford（牛津大学步态实验室）； Harvard Medical School（哈佛医学院）； Massachusetts General Hospital（麻省总医院）； Institute of Biomedical Engineering, University of Oxford（生物医学工程研究所,牛津大学）； Mayo Clinic（梅奥诊所）

AI总结提出基于计算机视觉的定量运动测试（QMT）方法，利用深度学习3D姿态估计从单目智能手机视频提取运动生物标志物，在实验室验证中与光学运动捕捉高度一致（r>0.85），并在纤维肌痛和慢性坐骨神经痛患者中展示了可靠性和纵向监测能力。

详情

AI中文摘要

慢性疼痛通过降低功能能力而损害生活质量，但在现实环境中客观测量这种功能影响仍然具有挑战性。虽然光学运动捕捉为评估运动质量改变提供了高精度，但成本高昂且局限于实验室环境。我们旨在开发并验证定量运动测试（QMT），这是一个从标准单目智能手机视频中提取3D运动生物标志物的计算机视觉流程，平衡临床可及性与生物力学精度。我们利用基于深度学习的3D姿态估计，在健康对照组（N=13）中针对金标准光学运动捕捉验证了QMT流程。经过留一法受试者校准以纠正系统偏差后，我们在两个前瞻性临床队列中部署QMT以评估现实世界效用：一项纤维肌痛患者的干预前后试验，以及一项慢性坐骨神经痛患者和健康对照的30天纵向家庭监测研究。在实验室验证中，QMT提取的临床运动指标与光学运动捕捉高度一致，显示出强相关性（r>0.85）和低平均绝对误差。QMT在纤维肌痛患者中显示出高重测信度（r>0.86），并成功追踪了慢性坐骨神经痛患者的日常运动波动。虽然现实家庭环境引入了比实验室环境更高的测量方差，但QMT完全基于远程记录发现了健康对照组和坐骨神经痛患者之间的组级差异。单目3D姿态估计为传统评估提供了一种可扩展的替代方案。QMT为临床试验中跟踪疾病进展和治疗反应提供了客观、可及的生物标志物，但需要进一步研究以优化家庭环境中的可靠性。

英文摘要

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.

URL PDF HTML ☆

赞 0 踩 0

2606.02287 2026-06-02 cs.LG cs.AI 版本更新

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

CityTrajBench: 城市尺度车辆轨迹生成的统一基准

Shibo Zhu, Xiaodan Shi, Dayin Chen, Yuntian Chen, Haoran Zhang, Tianhao Wu, Jinyue Yan

发表机构 * Department of Building Environment and Energy Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China（香港理工大学建筑环境与能源工程系）； Eastern Institute for Advanced Study, Eastern Institute of Technology, Ningbo, China（宁波东部先进研究所）； International Centre of Urban Energy Nexus, The Hong Kong Polytechnic University, Hong Kong SAR, China（香港理工大学城市能源 nexus 中心）； Department of Computer and Systems Sciences, Stockholm University, Sweden（斯德哥尔摩大学计算机与系统科学系）； Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, China（浙江工业智能与数字孪生重点实验室）； Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China（宁波数字孪生研究所）； LocationMind Inc., Tokyo 101-0042, Japan（LocationMind公司）

AI总结为解决轨迹生成方法因数据集、预处理、表示和评估指标不一致导致的比较困难，提出CityTrajBench统一基准框架，标准化数据处理、模型适配与多级评估，并在三个真实数据集上对比统计、VAE、GAN、扩散和流匹配模型，揭示不同模型在全局真实性、轨迹几何保真度等指标上的权衡。

详情

AI中文摘要

城市轨迹生成是交通模拟、城市规划和移动性分析的基础任务。然而，由于现有研究通常依赖不同的数据集、预处理流程、轨迹表示和评估指标，轨迹生成方法之间的系统比较仍然困难。这种碎片化使得报告的性能差异是否源于生成机制本身或实验协议不一致变得不明确。为解决这一问题，我们提出了CityTrajBench，一个用于城市尺度车辆轨迹生成的统一基准框架和协议。CityTrajBench在共同设置下标准化了数据摄入、轨迹归一化、特征构建、模型适配、地图感知后处理、模型选择和多级评估。它支持异构生成器，包括统计基线、基于VAE、GAN、扩散和流匹配的模型，并在三个真实世界城市轨迹数据集上评估它们。该基准衡量全局空间真实性、行程级分布保真度、轨迹级几何相似性、条件移动一致性和效率。实验揭示了模型家族之间的明确权衡：DiffTraj在轨迹级几何保真度上最强，DiffRNTraj在结构敏感的全局真实性上具有竞争力，而TrajFlow在真实性、质量、条件一致性和效率之间提供了强平衡。同时，一个简单的马尔可夫基线在粗粒度行程和局部移动统计上仍具有竞争力。这些发现表明，城市轨迹生成质量本质上是多目标的，没有单一模型在所有标准上同等占优，并且CityTrajBench为未来城市移动性生成研究提供了可复现的基准协议和测试平台。

英文摘要

Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systematic comparison across trajectory generation methods remains difficult because existing studies often rely on different datasets, preprocessing pipelines, trajectory representations, and evaluation metrics. This fragmentation makes it unclear whether reported performance differences arise from the generation mechanism itself or from inconsistent experimental protocols. To address this issue, we present CityTrajBench, a unified benchmark framework and protocol for city-scale vehicle trajectory generation. CityTrajBench standardizes data ingestion, trajectory normalization, feature construction, model adaptation, map-aware post-processing, model selection, and multi-level evaluation under a common setting. It supports heterogeneous generators, including statistical baselines, VAE-based, GAN-based, diffusion-based, and flow-matching-based models, and evaluates them on three real-world urban trajectory datasets. The benchmark measures global spatial realism, trip-level distribution fidelity, trajectory-level geometric similarity, conditional mobility consistency, and efficiency. Experiments reveal clear trade-offs across model families: DiffTraj is strongest on trajectory-level geometric fidelity, DiffRNTraj is competitive on structure-sensitive global realism, and TrajFlow provides a strong balance across realism, quality, conditional consistency, and efficiency. Meanwhile, a simple Markov baseline remains competitive on coarse-grained trip and local-movement statistics. These findings show that urban trajectory generation quality is inherently multi-objective, that no single model dominates all criteria equally, and that CityTrajBench provides a reproducible benchmark protocol and testbed for future research on urban mobility generation.

URL PDF HTML ☆

赞 0 踩 0

2606.02282 2026-06-02 cs.AI 版本更新

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

POIROT: 在多智能体系统中审讯智能体以进行故障检测

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian

发表机构 * Center for Automation and Robotics, Spanish National Research Council (CSIC-UPM)（自动化研究中心，西班牙国家科研 council (CSIC-UPM)）

AI总结提出POIROT协议，利用多智能体系统自身的智能体作为诊断层进行故障检测，在复杂问题、多智能体和复合故障条件下优于单LLM评估基线。

Comments 44 pages, 6 figures

详情

AI中文摘要

将大型语言模型编排成多智能体系统（LLM-MAS）解锁了卓越的推理能力，但难以表征的突发故障和幻觉阻碍了其在安全关键领域的部署——新兴的AI法规使得这一差距在法律上难以维持。现有的评估范式有一个共同的缺陷：集中式判断造成单点故障，并且需要领域特定专业知识。本文提出POIROT，一种将系统自身的智能体重新用作其诊断层的协议，利用架构中已有的认知多样性。在评估的设置中，POIROT优于单LLM评估基线，其增益随问题复杂度（OR = 1.60，$p = 0.008$）、智能体数量和故障维度而扩展，并在复合故障条件下持续存在。这些结果表明，安全监督不必外部化：执行角色的智能体拥有足够的集体智慧来审计它。我们将POIROT作为开源库发布，同时发布BLAME，一个用于安全关键多智能体系统中故障归因的基准。

英文摘要

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

URL PDF HTML ☆

赞 0 踩 0

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Cross-modal linkage risk in clinical vision-language models

临床视觉-语言模型中的跨模态链接风险

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

发表机构 * Lab for AI in Medicine（医学人工智能实验室）； RWTH Aachen University（亚琛工业大学）； Department of Diagnostic and Interventional Radiology（诊断与介入放射学部门）

AI总结研究临床视觉-语言模型（VLM）在图像与报告分离场景下通过余弦相似度实现跨模态重链接的风险，并采用仅对投影头进行差分隐私微调的方法在保持图像效用同时显著降低重链接率。

详情

AI中文摘要

在配对胸部X光片和放射学报告上训练的视觉-语言模型（VLM）学习了一个共享嵌入空间，该空间可以保留实例级别的图像-报告对应关系。这在故意将X光片和报告在获取后分开的场景中（例如仅图像数据共享或受控访问的报告）构成了隐私风险，因为一个去标识的图像可能仅通过余弦相似度就重新链接到其原始叙述性报告。我们将此形式化为图像到报告的检索，并使用公共配对队列（其中真实配对是已知的）作为基准来审计风险，而不是作为隐私场景。在来自MIMIC-CXR（43,793个保留对）和外部CheXpert Plus（29,296个对）的126,804名患者的406,241个配对示例上评估了临床专业化程度递增的VLM，我们发现重链接率随专业化程度系统性地上升：最强的VLM在候选池N=100时以15倍随机概率检索到正确报告，在N=10,000时以50倍随机概率，在全数据库规模下仍远高于随机概率。该信号在去除疾病标签捷径的病理匹配困难负样本下仍然存在，表明对应关系超出了广泛的诊断类别。为了在不重新训练的情况下减少这种风险，我们冻结了两个编码器，仅对定义对齐层的投影头应用差分隐私优化（epsilon=0.34，delta=6x10^-6）。这使得MIMIC-CXR上N=10,000时的Recall@1降低了61.8%，并无需重新训练即可迁移到CheXpert Plus，同时图像侧效用基本保持：线性探针分类在14个标签上的宏AUROC仅从79.63%变为79.43%。对共享对齐层的定向DP微调可以大幅减少跨模态重链接，而不会实质性降低使这些模型在临床上有用的图像表示。

英文摘要

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.

URL PDF HTML ☆

赞 0 踩 0

2606.02255 2026-06-02 cs.CL cs.AI 版本更新

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

谁在NLP中进行标注？2018年至2025年人工标注报告的大规模评估

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger

发表机构 * NLLG Lab University of Technology Nuremberg（NLLG实验室梅尔堡技术大学）； Interdisciplinary Transformation University（跨学科转型大学）

AI总结本研究通过大规模审计NLP论文中的人工标注报告，发现标注细节报告不完整，并提出了改进报告质量的框架和建议。

详情

AI中文摘要

人工标注是许多NLP研究的经验基础，从数据集构建到模型评估，但论文往往不清楚谁产生了标注以及如何控制标注过程。我们首次对主要NLP会议中的人工标注报告进行了大规模、任务级别的审计，询问哪些标注细节被记录，哪些缺失，以及报告如何随时间、主题、会议和人工判断的预期用途而变化。我们引入了一个统一的标注报告实践分类法，并针对Annotated-gold（一个由41篇论文和72个标注任务组成的人工裁决黄金标准）验证了一个LLM辅助的提取流程，其中最佳模型与裁决标签达到了与人类相当的一致性，Krippendorff's alpha为0.606，而人类间一致性为0.585。利用该流程，我们构建了Annotated-llm数据集，涵盖2018-2025年ACL会议论文，包含来自1603篇论文的2667个提取的标注任务，发现论文经常报告操作细节，如招募策略、标注者专业知识和标注量，但往往省略评估标注有效性所需的细节，包括培训、语言能力、报酬、社会人口统计、裁决和一致性值，尤其是在模型评估研究中。我们的结果表明，NLP中的标注报告随时间有所改善，但仍不均衡，我们建立了一个可扩展的框架和最低报告建议，以使人工标注更可靠、可重复和可解释。

英文摘要

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

URL PDF HTML ☆

赞 0 踩 0

2606.02253 2026-06-02 cs.AI 版本更新

CEON: Circular Economy Ontology Network

CEON: 循环经济本体网络

Huanyu Li, Els de Vleeschauwer, Robin Keskisärkkä, Mikael Lindecrantz, Mina Abd Nikooie Pour, Ying Li, Ben De Meester, Patrick Lambrix, Eva Blomqvist

AI总结为解决循环经济领域跨行业信息共享的语义互操作性问题，提出了循环经济本体网络（CEON），定义了跨行业概念并实现语义感知数据文档化，在建筑、电子和纺织行业场景中验证了其有效性。

详情

AI中文摘要

提高社会中资源利用的循环性已被视为实现可持续性的一条途径，即向更加循环的经济转型。为此有许多不同的循环策略，例如重复使用产品和组件、翻新和再制造旧产品，或回收剩余或使用过的材料。为了实现这些策略，有必要在基础设施层面共享信息，并在产品生命周期内跨行业部门进行沟通。因此，在这种信息共享和沟通中实现语义互操作性是提高循环性的关键。然而，涉及产品生命周期相关众多行业的循环经济（CE）领域的知识表示仍然具有挑战性。为弥补这一差距，我们在Onto-DESIDE项目中开发了循环经济本体网络（CEON）。该本体网络旨在通过定义跨行业概念来填补CE领域的空白，并实现语义感知的数据文档化。我们通过跨行业数据文档化场景（涵盖建筑、电子和纺织行业）展示了CEON。

英文摘要

Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more circular economy. There are many different circular strategies to do so, such as reusing products and components, refurbishing and remanufacturing used products, or recycling left-over or used materials. To enable these strategies, it is necessary to share information at the infrastructure level and to communicate between industry sectors along the product life cycle. Enabling semantic interoperability in this information sharing and communication is therefore a key to increasing circularity. However, knowledge representation for the circular economy (CE) domain, which involves many relevant industry sectors related to product life cycles, remains challenging. To bridge this gap, we developed the Circular Economy Ontology Network (CEON) within the Onto-DESIDE project. This ontology network aims to fill gaps in CE by defining cross-sectorial concepts and to enable semantics-aware data documentation. We demonstrate CEON through cross-industry data documentation scenarios spanning construction, electronics, and textile sectors.

URL PDF HTML ☆

赞 0 踩 0

2606.02251 2026-06-02 cs.RO cs.AI eess.SP 版本更新

FW-NKF: Frequency-Weighted Neural Kalman Filters

FW-NKF: 频率加权神经卡尔曼滤波器

Adnan Harun Dogan, Berken Utku Demirel, Christian Holz

发表机构 * Department of Computer Science, ETH Zürich（苏黎世联邦理工学院计算机科学系）

AI总结提出频率加权神经卡尔曼滤波器（FW-NKF），通过将因果谱整形算子嵌入卡尔曼测量残差并联合学习观测和状态转移网络，抑制频带受限噪声，在混沌系统和惯性姿态估计等任务中定位误差降低达10%。

Comments Published at ICRA 2026

详情

AI中文摘要

鲁棒状态估计是机器人自主性的核心，然而经典卡尔曼滤波器难以应对频率相关干扰和模型失配，如传感器振动、电磁干扰和周期性噪声。尽管深度卡尔曼滤波器（DKF）变体通过学习潜在状态转移扩展了扩展卡尔曼滤波（EKF）框架，但它们缺乏明确的机制来抑制在实际场景中通常污染传感器测量的带限噪声分量。我们引入了频率加权神经卡尔曼滤波器（FW-NKF），这是一种统一的混合方法，将因果谱整形算子嵌入卡尔曼测量残差，并联合学习观测网络和状态转移网络。通过同时调整滤波器频谱和潜在状态表示，FW-NKF在抑制噪声主导频带的同时捕获复杂的残差结构。我们在四个异构基准上进行了广泛实验，包括混沌系统（如多维洛伦兹系统）和全身惯性姿态估计，发现定位误差降低高达10%，且方向精度显著提升。我们的消融研究证实，频率加权和深度潜在状态建模对整体性能有贡献。

英文摘要

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation, and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.

URL PDF HTML ☆

赞 0 踩 0

2606.02242 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

解决基于图像和基于文本的行人重识别之间的优化冲突

Karina Kvanchiani, Timur Mamedov

发表机构 * Tevian, Russia（俄罗斯Tevian）； Lomonosov Moscow State University, Russia（俄罗斯罗蒙诺索夫莫斯科国立大学）

AI总结针对图像与文本行人重识别任务因模态差异和目标冲突导致共享表示次优的问题，提出解耦两阶段训练流程，使用单一视觉编码器避免跨任务干扰，实验表明图像预训练和文本监督能提升双任务性能。

详情

AI中文摘要

基于图像（I2I）和基于文本（T2I）的行人重识别（ReID）的联合优化受到模态差异和冲突训练目标的阻碍，导致共享表示次优。虽然I2I ReID关注同一人图像间的身份级不变性，但T2I ReID由与独特视觉特征相关的实例特定文本描述驱动。本文探讨了两个ReID任务及其优化过程之间的根本差异，以实现有效训练。由于I2I和T2I ReID通常分开研究，为一种检索设置优化的损失函数可能对另一种所需的表示质量产生负面影响。基于这些发现，我们提出了一种解耦的两阶段训练流程，用于学习跨图像和文本模态的共享表示。该流程基于单个视觉编码器，支持I2I和T2I检索，同时避免训练期间的跨任务干扰。我们在多种配置下进行了大量实验，改变了域混合程序、学习策略和任务目标。我们观察到I2I ReID预训练对T2I数据的泛化能力有积极影响。此外，我们发现视觉编码器训练阶段引入文本监督能提升I2I和T2I性能。我们相信，我们的见解为统一的ReID系统和跨模态检索整体迈出了有意义的一步。

英文摘要

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observed that I2I ReID pre-training positively impacts the generalization ability to T2I data. Besides, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

URL PDF HTML ☆

赞 0 踩 0

2606.02218 2026-06-02 cs.LG cs.AI 版本更新

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过感知掉队者的组大小调整实现更快的同步在线策略强化学习

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar

发表机构 * University of Minnesota（明尼苏达大学）； University of Waterloo（滑铁卢大学）； Argonne National Laboratory（阿贡国家实验室）

AI总结提出动态组大小控制器SAGC，通过在线约束优化调整组大小，减少同步在线策略强化学习中的掉队者事件，提升墙钟效率并保持或改善训练奖励和模型质量。

详情

AI中文摘要

同步强化学习方法如组相对策略优化（GRPO）提供稳定且可复现的在线策略训练，但极易受到掉队者的影响——单个异常长的轨迹可能延迟整个组的奖励计算和参数更新。随着组大小增加，这个问题变得更加严重，在更大组的好处与同步停滞的墙钟成本之间产生矛盾。我们提出感知掉队者的组控制（SAGC），一种动态组大小控制器，根据观察到的轨迹行为在线调整训练组。SAGC将组大小选择形式化为一个在线约束优化问题，旨在保留更大组的好处，同时控制掉队者事件的长期发生率。在同步GRPO和DAPO训练中，以及在普通和强工程基线上，SAGC一致地减少了掉队者发生率并提高了墙钟效率，同时实现了有竞争力或更好的训练奖励。我们进一步表明这些收益转化为最终模型质量：在下游推理基准上，SAGC与最强的静态组大小基线相比具有竞争力或更好，并且通常在没有显式长度惩罚的情况下产生更短的输出。这些结果将动态组控制定位为使同步在线策略强化学习更高效和更稳健的实用方法。

英文摘要

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers, a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.

URL PDF HTML ☆

赞 0 踩 0

2606.02211 2026-06-02 cs.CL cs.AI 版本更新

信念变化算子的抽象世界语义框架

Daniel Grimaldi, M. Vanina Martinez, Ricardo O. Rodriguez

发表机构 * Departamento de Computación Facultad de Ciencias Exactas y Naturales Universidad de Buenos Aires（布宜诺斯艾利斯大学计算机系）； Instituto de Investigación en Ciencias de la Computación UBA-CONICET（UBA-CONICET计算科学研究所）； Artificial Intelligence Research Institute (IIIA-CSIC)（人工智能研究 institute (IIIA-CSIC)）

AI总结提出一种无逻辑语法的集合论框架——抽象世界语义，通过将世界视为原始元素并定义世界收缩与修正算子，统一分析信念变化模型，并推广至经典与非优先信念变化。

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR 版本更新

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

发表机构 * University of Freiburg（弗赖堡大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结针对认证训练中自然精度与认证精度的权衡问题，提出基于Pareto前沿的多目标超参数优化方法，实现公平的方法间比较，并发现先前配置的欠调优现象，建立新的最优性能。

Comments Accepted to ICML 2026

详情

AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能，但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证，但计算成本高昂。为缓解这一问题，认证训练技术在训练过程中优化可验证的鲁棒性，通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的，报告单一配置的常见做法存在问题：它可能误导关于整体性能的结论，并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较，我们执行高效的自动化多目标超参数优化，为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优，从而获得更优性能并建立新的最优水平。利用这些前沿，我们首次对认证训练方法进行了全面的多目标比较，表明先前的进展并不像假设的那样显著，并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural--certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.

URL PDF HTML ☆

赞 0 踩 0

2606.02120 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

理解增强的模型协作用于长尾自我中心错误检测

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS（人工智能安全国家重点实验室，计算技术研究所，中国科学院）； School of Computer Science and Tech., University of Chinese Academy of Sciences（中国科学院大学计算机科学与技术学院）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Institute of Information Engineering, CAS（信息工程研究所，中国科学院）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）

AI总结提出理解增强的模型协作方法（UE-MCM），结合粗粒度视频理解与细粒度动作推理，通过双分支模型和自适应融合门检测自我中心视频中的错误，并优化长尾分布。

详情

AI中文摘要

在本报告中，我们解决了从自我中心视频数据中判断用户是否错误执行动作的问题。为此，我们提出了一种理解增强的模型协作方法（UE-MCM），该方法将高效的粗粒度视频理解与准确的细粒度动作推理相结合。具体来说，UE-MCM包含一个小模型分支和一个大模型分支。大模型分支关注细粒度动作本身是否执行错误，而小模型分支联合输入粗粒度视频和细粒度片段，以识别可能局部正确但与整体工作流不一致的动作。小模型分支基于CLIP4CLIP视频编码器构建，该编码器从通过扩散对比重建增强的CLIP模型初始化，大模型分支使用Qwen3-VL嵌入模型从细粒度动作片段中提取高容量表示。然后，通过轻量级协作门自适应融合小分支预测和大分支预测。为了处理错误实例的长尾分布，我们通过互补目标优化分类器，包括重加权交叉熵、AUC导向学习和标签感知调整。所得系统平衡了速度和准确性，使其能够有效检测自我中心教学视频中的细微、罕见和模糊错误。

英文摘要

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.

URL PDF HTML ☆

赞 0 踩 0

2606.02119 2026-06-02 cs.LG cs.AI 版本更新

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

到底有多难？难度感知的多目标遗忘学习

Jiangwei Chen, Xinyuan Niu, Rachael Hwee Ling Sim, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

发表机构 * National University of Singapore（新加坡国立大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结针对现有遗忘学习无法保证同时提升遗忘质量和保持保留效用的缺陷，提出一种基于约束优化的难度感知多目标遗忘算法（HAMU），通过量化遗忘数据与保留数据的相似度来指导模型更新，在保证遗忘质量提升的同时最小化保留效用损失。

Comments ICML 2026

详情

网络分布式多智能体强化学习用于四旋翼无人机一致性控制

Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department, German University in Cairo (GUC), Egypt（埃及德国大学（GUC）机械工程系）； Institute of Flight Mechanics and Control (IFR), Head of Flight Robotics, University of Stuttgart, Germany（德国斯图加特大学飞行力学与控制研究所）； Faculty of EMS, Head of Mechatronics Engineering Department, German University in Cairo (GUC), Egypt（埃及德国大学（GUC）EMS学院）

AI总结提出网络分布式多智能体强化学习框架，利用通信图实现分布式策略，通过MASAC训练高层规划器，实现零样本扩展到250个智能体。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/MELECON64486.2026.11418865
Journal ref: 2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON)

AI中文摘要

本文提出了一种用于四旋翼无人机一致性控制的网络分布式多智能体强化学习（ND-MARL）框架。与依赖集中式规划或完全分散式执行的传统多智能体MARL公式相比，ND-MARL将群体通信图纳入决策过程。在2-邻居通信拓扑下，每个智能体仅观察两个邻居的信息，并通过分布式策略输出动作。使用多智能体软演员-评论家（MASAC）训练高层分布式一致性规划器，并将其嵌入层次化堆栈中，以生成由低层四旋翼控制器跟踪的参考目标位置。结果表明，与集中式MARL控制器相比，实现了平滑的一致性轨迹和规划器-跟踪器集成。最值得注意的是，学习到的控制器表现出零样本可扩展性，即在三智能体系统上训练的策略，在相同的2-邻居通信拓扑下，无需重新训练或微调即可部署到多达250个智能体的群体中，实现了随着团队规模增大而稳态散布增加的一致收敛，这是由于稀疏信息传播所致。这些发现突显了ND-MARL作为分布式、通信感知的四旋翼一致性控制的稳定框架。

英文摘要

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional multi-agent MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information of only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability, as policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence with increasing steady-state spread at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.

URL PDF HTML ☆

赞 0 踩 0

2606.02093 2026-06-02 cs.CL cs.AI cs.LG 版本更新

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

不确定性量化中模糊性在错误预测中的作用

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

发表机构 * University of Cambridge（剑桥大学）； The Alan Turing Institute（艾伦·图灵研究所）

AI总结通过解耦输入模糊性与不确定性信号，利用门控专家和选择性预测提升大语言模型在问答任务中的错误预测性能。

Comments 8 pages not including references and appendices, 3 figures

详情

AI中文摘要

错误预测任务，即预测模型输出是否正确，通常通过不确定性量化（UQ）来解决。然而，虽然不确定性指标捕捉了模型缺乏知识或能力进行预测的情况，但它们也反映了模型输入和上下文中固有的偶然不确定性。本文提出了一种通过将输入模糊性与UQ信号解耦来改进大语言模型（LLM）错误预测的方法。我们在问答（QA）任务上使用六种UQ指标进行实验，结果表明，UQ指标在无歧义实例上的错误预测能力优于具有多个合理答案的问题。我们使用门控专家和选择性预测将真实和预测的模糊性标签纳入错误预测流程。我们发现，模糊性信息提高了跨模型家族、训练和评估范式、数据集（包括据称无歧义的数据集）以及偶然不确定性来源的错误预测分数，在标准数据集上对单个UQ指标的PRR提升超过10个百分点。

英文摘要

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.

URL PDF HTML ☆

赞 0 踩 0

2606.02092 2026-06-02 eess.IV cs.AI cs.CV 版本更新

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

LALE：用于土地覆盖估计的轻量级Transformer架构

Ümit Mert Çağlar, Alptekin Temizel

发表机构 * Middle East Technical University（中亚技术大学）

AI总结提出LALE架构，通过分辨率分支编码器（轻量级ConvMixer处理高分辨率局部特征，Transformer处理低分辨率全局上下文）和全MLP多尺度解码器，在遥感图像分割中实现高效性能与计算成本的平衡。

详情

AI中文摘要

RL-ACRGNet：基于强化学习的胸部放射学报告生成网络

Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya

发表机构 * Human-AI Interaction (HAIx) Lab, Indian Institute of Technology Gandhinagar（人类-人工智能交互实验室，印度理工学院冈丁加尔）； Department of Computer Science and Engineering, Madhav Institute of Technology and Science Deemed University (MITS-DU)（计算机科学与工程系，马达夫技术与科学 deemed 大学（MITS-DU））； Multimedia and Information Security Research Group, Department of Computer Science and Engineering, ABV-Indian Institute of Information Technology and Management（多媒体与信息安全研究组，计算机科学与工程系，ABV-印度信息科技与管理学院）

AI总结提出RL-ACRGNet，一种结合预训练DenseNet编码器与多级LSTM解码器的离策略强化学习框架，通过度量奖励机制优化视觉语义嵌入，在IU-Xray和MIMIC-CXR数据集上超越基线，生成高质量临床报告。

Comments This work has been submitted to the IEEE for possible publication

详情

AI中文摘要

医学影像解读是现代临床诊断的基石，然而手动生成放射学报告既耗时又容易出现解读不一致。在医学AI领域，通过深度学习自动化这些描述有望简化临床工作流程并标准化诊断输出。然而，由于在捕获细粒度视觉特征和确保临床连贯性方面的局限性，准确的疾病检测和精确的报告生成仍然是重大挑战。为了解决这些问题，我们提出了RL-ACRGNet，一种改进的编码器-解码器模型，它将预训练的DenseNet编码器与多级LSTM解码器集成在离策略强化学习框架中。通过使用双网络方法，基于度量奖励机制细化视觉语义嵌入，我们证明RL-ACRGNet在IU-Xray数据集上持续优于最先进的基线，在BLEU-4（0.47%）、METEOR（0.17%）和ROUGE-L（0.518）上取得了定量改进。此外，在大规模MIMIC-CXR数据集上的综合评估证实了该模型的稳健泛化能力及其生成高质量、临床相关报告的能力。

英文摘要

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR data set confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports

URL PDF HTML ☆

赞 0 踩 0

2606.02022 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

排名 vs. 分配：多视角目标关联中的度量不匹配

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

发表机构 * Tevian Moscow（莫斯科Tevian）； Lomonosov Moscow State University（莫斯科国立罗蒙诺索夫大学）

AI总结本文揭示了多视角目标关联中常用的排名度量（如AP、FPR-95）与分配目标之间的根本性不匹配，并提出了基于Sinkhorn归一化的后处理方法以缓解该问题。

详情

AI中文摘要

多视角目标关联是一个重要的计算机视觉问题，是许多多相机感知任务的基础。虽然该任务自然被表述为受约束的一对一匹配问题，但最近的工作严重依赖成对排名度量（如AP和FPR-95）进行模型评估。我们强调了这些度量与实际分配目标之间的根本性不匹配。理论上，我们表明即使分配已经正确，AP和FPR-95也可能不完美，而基于Sinkhorn的归一化可以使它们完美。相反，最优的成对排名仍然可能导致错误的分配。我们通过使用基于Sinkhorn的归一化作为受控的后处理压力测试，在实践中验证了这种不匹配。我们表明，仅优化几个后处理参数就能显著提升AP和FPR-95，而分配级别的度量（如ACC和IPAA）却没有相应改进。

英文摘要

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.

URL PDF HTML ☆

赞 0 踩 0

2606.02011 2026-06-02 cs.AI cs.LG 版本更新

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

推理模型中的极端低位推理：失败模式与针对性恢复

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov

发表机构 * University of Washington（华盛顿大学）

AI总结针对大型推理模型在2位量化推理中因生成不稳定导致总token数膨胀而无法实现端到端加速的问题，提出轻量级FP16规划和循环救援两种控制方法，显著恢复模型精度并保持实际速度。

详情

AI中文摘要

大型推理模型（LRM）依赖长推理轨迹，导致推理成本高昂。虽然低位量化降低了每token解码成本，但我们表明，激进的2位推理可能无法实现端到端加速，因为生成过程中的不稳定性会膨胀总token数。2位量化不仅降低答案准确性，还常常产生更长的轨迹，包含重复循环、预算耗尽、延迟承诺和未闭合的推理段。我们分析了Qwen3推理模型在数学和常识基准上的完整推理轨迹，并表明准确率下降与这些过程级失败密切相关。为解决这些问题，我们引入了两种轻量级控制：FP16规划，为2位模型提供简短的高精度轮廓；以及循环救援，检测重复轨迹并要么承诺早期答案，要么回退到FP16。在MATH-500上，循环救援将Qwen3-8B准确率从17.2%提升至74.2%，而规划加循环救援将Qwen3-32B准确率从65.0%提升至87.2%。总体而言，我们的结果表明，当极端低位推理的失败被视为可控生成病理时，它变得可行：通过轻量级检测和选择性FP16支持，2位推理可以在恢复准确率的同时保持真实的端到端速度。我们的代码可在 https://github.com/brain-lab-research/quantized-reasoning 获取。

英文摘要

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.02010 2026-06-02 cs.CL cs.AI 版本更新

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

PlanarBench: 通过平面图绘制评估LLM空间推理能力

Oleksandr Nikitin

发表机构 * tvori.info

AI总结提出PlanarBench基准，通过让LLM根据边列表以ASCII艺术绘制平面图来评估其空间推理能力，发现边数是主要难度预测因子。

Comments 12 pages, 4 figures, https://github.com/wizzard0/planar-bench-as1073

2606.02000 2026-06-02 cs.CV cs.AI eess.IV 版本更新

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

迈向3D感知视频扩散模型：基于网格标记化的无渲染人体运动控制

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

发表机构 * DAMO Academy, Alibaba Group（阿里巴巴集团大模型实验室）； Hupan Lab（虎盘实验室）； Zhejiang University（浙江大学）； INSAIT

AI总结提出一种无渲染框架，通过压缩的3D人体网格标记直接条件化视频生成，实现精确的人体运动控制，减少2D引导伪影并提升3D结构建模能力。

Comments Project page: https://jingyunliang.github.io/MeshToken/

详情

AI中文摘要

扩散模型在视频生成方面取得了显著成功。然而，这类模型是否真正感知视觉观察背后的3D结构，而不仅仅是生成合理的2D投影，仍是一个开放问题。本文通过人体运动控制这一任务来探究该问题，该任务需要对人体3D几何、运动、相机视角和场景上下文进行精确建模。与依赖渲染的2D运动引导视频的先前方法不同，我们提出了一种无渲染框架，直接基于压缩的3D人体网格标记条件化视频生成。该表示保留了完整的3D几何信息，同时实现了统一的基于标记的生成流程，在DiT架构中联合处理视频标记和运动标记。这种设计要求模型在视频生成过程中联合推理外观、3D结构和相机视角。实验结果表明，该方法在人体运动控制基准上表现强劲，同时减少了由视角依赖的2D引导和编辑过程中轨迹-姿态不匹配引起的伪影。这些发现表明，配备网格标记化的视频扩散模型能够更好地捕捉复杂的3D人体结构及其与周围环境的交互。

英文摘要

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.

URL PDF HTML ☆

赞 0 踩 0

2606.01999 2026-06-02 cs.LG cs.AI 版本更新

Why Do Time Series Models Need Long Context Windows?

为什么时间序列模型需要长上下文窗口？

Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

发表机构 * Università della Svizzera Italiana（瑞士联邦理工学院）； EPFL（瑞士联邦理工学院）； Politecnico di Milano（米兰理工学院）

AI总结本文从生成过程识别和条件预测两个目标出发，证明长上下文窗口通过降低生成过程的不确定性来提升预测性能，并表明即使对于记忆长度为P的过程，输入窗口必须严格大于P才能达到最小误差。

详情

AI中文摘要

现代用于预测时间序列组的深度学习模型依赖于越来越长的观测窗口。然而，增加窗口大小的好处通常被简单地归因于捕捉长程依赖，而关于全局预测模型如何利用输入观测的更广泛讨论一直有限。在本文中，我们表明预测时间序列组涉及两个目标：(i) 生成过程识别（GPI），即推断生成输入序列的具体过程，以及 (ii) 条件预测（CF），即根据输入观测预测未来值。从这个角度来看，最优预测可以解释为对所有可能数据生成过程的平均，并按输入窗口给定的似然加权。这为长上下文窗口的好处提供了另一种解释：它们降低了运行过程中输入时间序列由哪个具体过程生成的不确定性。我们证明，即使对于记忆长度为 $P$ 的过程，严格大于 $P$ 的输入窗口大小对于达到最小可实现误差是必要的。最后，我们展示了如何将 GPI 和 CF 解耦，以在不牺牲准确性的情况下提高计算可扩展性。在合成和真实数据上的实验验证了我们的见解及其对设计预测架构的相关性。

英文摘要

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.

URL PDF HTML ☆

赞 0 踩 0

2606.01993 2026-06-02 cs.CL cs.AI cs.LG 版本更新

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

MMG2Skill: 智能体能否从野外指南中提炼出自我进化的技能？

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

发表机构 * Nanjing University（南京大学）； Kuaishou Technology（快手科技）

AI总结提出MMG2Skill框架，将多模态异构的野外指南编译为可编辑技能，通过轨迹级根因反馈持续改进，在GUI控制、开放游戏和策略卡牌任务中显著提升VLM智能体性能。

Comments 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

详情

AI中文摘要

网络上丰富的程序性知识对于帮助智能体解决长程任务具有巨大潜力。然而，这些知识通常是多模态、异构、有噪声的，并且隐含地假设人类执行者，使得它们难以直接用作智能体所需的技能。为了弥合人类导向指南与智能体可执行技能之间的差距，我们将此问题形式化为指南到技能学习：将野外指南转换为可执行技能，并从智能体可观察的轨迹中持续改进它们。为了评估现有智能体在此任务上的能力，我们引入了MMG2Skill-Bench，这是针对该问题的首个基准测试。我们进一步提出了MMG2Skill，一个闭环框架，它将指南编译为可编辑技能，在执行过程中将固定的视觉语言模型（VLM）智能体条件化于这些技能，并从轨迹级根因反馈中修正技能，而不使用基准测试分数。在GUI控制、开放式游戏和策略卡牌游戏中，使用六个VLM骨干网络，MMG2Skill在每个模型-领域设置中始终优于普通基线智能体，在骨干网络上实现了宏观平均增益+12.8到+25.3个百分点。消融研究表明，直接用原始指南提示智能体会降低性能，而结构化技能构建和轨迹驱动修订对于观察到的改进都是必要的。在成功可推断的任务中，当成功信号适当校准时，基于分析器的提前停止进一步防止了后期性能退化，并节省了25%-53%的尝试次数。

英文摘要

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

URL PDF HTML ☆

赞 0 踩 0

2606.01992 2026-06-02 cs.CV cs.AI cs.LG 版本更新

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

文本引导异常检测的结构化基准：当语言停止条件化决策时

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

发表机构 * Politecnico di Milano, AIRLab（米兰理工学院，AIRLab）； S&H – Software & Hardware（S&H – 软件与硬件）

AI总结提出结构化基准TGAD，通过三个场景逐步增加语言功能角色，评估多模态异常检测系统的文本引导能力，发现当前系统仅表面受语言条件化，标准基准高估了其能力。

详情

AI中文摘要

工业异常检测历来是单模态任务。最近的多模态视觉-语言模型产生了接受文本输入和图像的系统，并被呈现为支持文本引导的零样本和少样本检测。然而，这些方法使用继承自单模态基准的协议进行评估，这些协议保持文本条件不变，因此无法衡量语言是否条件化决策；报告的性能提升是否反映文本引导或强大的预训练视觉特征仍是开放问题。我们引入文本引导异常检测（TGAD），这是一个结构化基准，通过三个场景逐步增加语言的功能角色：MVTec AD上的受控提示敏感性设置；MVTec AD的组件标记扩展，要求模型将其评估限制在指定部件；以及新的组装面板数据集（APD），这是一个需要缺陷类型和组件位置知识的现实工业场景。我们评估每个范式的代表性模型：生成式大视觉-语言、无训练判别式和嵌入自适应判别式。在所有三个模型中，文本接口仅表面条件化决策：除非移除对象名词，否则提示内容被吸收（生成模型的I-AUROC从97.4降至82.6）；一旦指令部件外的缺陷被视为正常，组件级指令不约束决策（从90.3降至66.3）；当两者在APD上结合时，图像级判别崩溃至MVTec水平以下，一种情况低于随机水平（71.2、50.5、31.5）。这些结果表明，标准基准夸大了当前多模态异常检测系统的文本引导能力，并且此类协议是能够通过语言可靠控制以用于工业部署的模型的先决条件。

英文摘要

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.01991 2026-06-02 cs.AI cs.CL cs.CY 版本更新

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

SafeMCP：基于环境接地前瞻推理的LLM智能体防御主动功率调节

Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai

发表机构 * Beijing Institute of Technology（北京理工大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Institute for Artificial Intelligence, Peking University（北京大学人工智能研究院）

AI总结针对LLM智能体因动作空间扩大而面临功率寻求风险，提出SafeMCP服务器端防御插件，通过内部世界模型进行前瞻推理，实现主动工具过滤和即时干预两级防御，在保持智能体效用的同时有效降低风险。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference

详情

AI中文摘要

随着大语言模型（LLM）智能体越来越多地利用模型上下文协议（MCP）在复杂环境中运行，其动作空间的扩展赋予了智能体不安全的能力，并凸显了功率寻求的风险。虽然广阔的动作空间和更大的环境影响对于任务完成至关重要，但它们也创造了一个脆弱的风险表面，其中微小的错误或幻觉会被放大为灾难性故障。为此，我们提出了SafeMCP，一种{服务器端}防御插件，通过关于未来安全风险的预测推理来约束工具获取。SafeMCP利用内部世界模型进行前瞻推理，实现两级防御：主动工具过滤以限制危险功率扩展，以及即时干预作为故障安全机制。为了训练SafeMCP，我们引入了一个三阶段流程，包括环境动态接地、安全策略初始化和具有双重可验证奖励的强化学习（RL）。在PowerSeeking Bench、ToolEmu和AgentHarm上的实验表明，SafeMCP实现了安全平衡，在有效缓解风险的同时保持了智能体的效用。

英文摘要

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a {server-side} defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.

URL PDF HTML ☆

赞 0 踩 0

2606.01982 2026-06-02 cs.AI 版本更新

An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification

一种基于NLP的课程-劳动力市场对齐框架：模式约束的LLM抽取、ESCO锚定的语义匹配和多维差距量化

Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki, Khaled Shuaib

AI总结提出一个四阶段NLP框架，通过模式约束的LLM抽取、ESCO语义匹配、仲裁协议和验证机制，实现课程与劳动力市场的对齐，并量化多维供需差距。

Comments 53 pages, 9 figures, 4 tables

详情

AI中文摘要

从多样化的教育和劳动力市场语料库中进行模式约束的信息抽取仍然是自然语言处理中的一个开放挑战，因为现有流程主要依赖于无法恢复隐含能力的词汇表面方法，缺乏共享分类法的基础，并且没有提供抽取可靠性或文档级完整性的正式度量。为了解决这些限制，本文提出了一个四阶段NLP框架，结合了(i) 对两个前沿LLM集成模型进行模式约束提示，针对JSON Schema强制实施的七槽能力形式；(ii) 使用Sentence-BERT (SBERT)将抽取的记录与十一个领域的ESCO v1.2.1受控词汇表对齐；(iii) 一个解决模型间分歧的两级裁决协议；(iv) 一个结合每槽Cohen's kappa、模式符合性和文档级完整性审计的验证机制。该框架在高等教育质量保证的关键应用中实例化，即阿联酋大学ABET认证的计算机科学学士学位课程的课程-劳动力市场对齐。该流程从2025-2026学年的85门课程学习计划中抽取400条能力记录，并在从计算核心到概率加权学生轨迹的五范围分析下，与30个职位发布（483个要求条款）以0.50的SBERT余弦阈值对齐。抽取器在技能槽上达到0.79的Cohen's kappa，模式符合性100%，文档级完整性100%。对齐揭示了可解释的供需差距：通用和横向技能差距25.0%，算法与计算理论差距13.8%，软件工程与项目管理差距12.2%，而人工智能与数据科学差距接近零的1.8%，尽管供应覆盖率为38.6%。

英文摘要

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework that combines (i) schema-constrained prompting of a two-model frontier-LLM ensemble against a JSON Schema-enforced seven-slot competency formalism, (ii) Sentence-BERT (SBERT) alignment of the extracted records against an eleven-domain ESCO v1.2.1 controlled vocabulary, (iii) a two-tier adjudication protocol that resolves inter-model disagreements, and (iv) a verification mechanism that combines per-slot Cohen's kappa, schema conformance, and document-level completeness audits. The framework is instantiated for a critical application in higher-education quality assurance, namely curriculum-labor market alignment for the ABET-accredited BSc Computer Science program at the United Arab Emirates University. The pipeline extracts 400 competency records from the 85-course 2025-2026 study plan and aligns them, under a five-scope analysis ranging from the computing core to a probability-weighted student trajectory, with 30 job postings (483 requirement clauses) at an SBERT cosine threshold of 0.50. The extractor achieves Cohen's kappa of 0.79 on the skill slot, with 100% schema conformance and 100% document-level completeness. The alignment surfaces interpretable supply-demand gaps of 25.0% in general and transversal skills, 13.8% in algorithms and computational theory, and 12.2% in software engineering and project management, with a near-zero 1.8% gap in artificial intelligence and data science despite 38.6% supply coverage.

URL PDF HTML ☆

赞 0 踩 0

2606.01975 2026-06-02 cs.AI cs.SE 版本更新

Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks

基于LLM的算法开发：以张量网络收缩顺序优化中LLM使用为例

Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges

发表机构 * German Aerospace Center (DLR), Institute of Software Technology, department High-Performance Computing（德国航空航天中心（DLR）软件技术研究所高性能计算部门）

AI总结通过OpenEvolve对张量网络收缩顺序优化的案例研究，探讨了基于LLM的算法开发，重点分析了LLM选择、评估指标和测试实例等设计因素，强调了验证引导的进化编码代理的潜力以及人类科学家在评估、验证和解释方面的重要性与挑战。

Comments Submitted to the proceedings of the deRSE26 conference

2605.02640 2026-06-02 cs.AI 版本更新

Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution

可信人工智能面临不变性冲突，因果性是解决方案

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

发表机构 * Max Planck Institute for Informatics（马克斯·普朗克信息研究所）

AI总结本文通过将可信AI目标重新解释为数据生成过程变化下的不相容不变性要求，论证因果性是理解和平衡性能与多个可信目标之间权衡的必要框架。

详情

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

AI中文摘要

随着人工智能（包括机器学习模型和基础模型）在高风险领域的部署日益增多，确保其可信度已成为一个核心挑战。然而，可信人工智能的核心目标，如公平性、鲁棒性、隐私性和可解释性，很难同时实现，尤其是在保持效用的同时。这篇立场论文认为，因果性对于理解和平衡性能与可信人工智能多个目标之间的权衡是必要的。我们将可信人工智能的权衡重新解释为数据生成过程不同变化下的不相容不变性要求，从而为我们的论点奠定基础。然后，我们通过文献中的案例研究和风格化的合成数据模拟来说明这一论点，表明因果性提供了一个统一的框架，用于理解可信人工智能中的权衡如何产生，以及如何通过选择性不变性来缓解或解决这些权衡。这一视角既适用于经典机器学习模型，也适用于大规模基础模型。最后，我们概述了利用因果性构建既可信又高性能的人工智能所面临的开放挑战和机遇。

英文摘要

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), are increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs in performance and multiple objectives of trustworthy AI. We ground our arguments in re-interpreting trustworthy AI trade-offs as incompatible invariance requirements under different changes to the data-generating process. We then illustrate this argument through case-study analyses from the literature and a stylized synthetic-data simulation, showing that causality provides a unifying framework for understanding how trade-offs in trustworthy AI arise and how they can be softened or resolved through selective invariance. This perspective applies to both classical ML models and large-scale FMs. Finally, we outline open challenges and opportunities for using causality to build both trustworthy and high-performing AI.

URL PDF HTML ☆

赞 0 踩 0

2605.02122 2026-06-02 cs.LG cs.AI 版本更新

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

STABLEVAL: 面向AI系统的分歧感知与稳定评估

Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结针对多数投票法在标注者分歧下导致排名不稳定的问题，提出STABLEVAL框架，通过建模潜在正确性和标注者混淆模式，实现稳定且不确定性感知的系统评估。

详情

AI中文摘要

人类评估仍然是评估现代AI系统的主要标准，然而标注者的分歧、偏见和变异性使得在标准多数投票聚合下系统排名变得脆弱。多数投票忽略了标注者可靠性和项目级别的模糊性，往往在标注者子集之间产生不稳定的比较。我们引入了STABLEVAL，一个分歧感知的评估框架，该框架对潜在项目正确性和标注者特定的混淆模式进行建模，以产生后验期望项目得分和校准的智能体级别分数。与Dawid-Skene等标签去噪方法不同，STABLEVAL明确设计用于稳定和不确定性感知的系统评估，而不是硬标签恢复。我们将排名稳定性形式化为首要评估目标，并分析聚合方法如何保留或扭曲底层标注者行为。在受控的合成实验和多个真实世界人工标注基准上，多数投票在标注者异质性和对抗性噪声下表现出增加的得分误差和排名不稳定性，而STABLEVAL产生了更稳定和统计上更合理的系统排名。这些结果表明，对分歧进行建模对于稳健和可复现的AI评估至关重要。

英文摘要

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

URL PDF HTML ☆

赞 0 踩 0

2606.01948 2026-06-02 cs.IR cs.AI 版本更新

Rank-Constrained Deep Matrix Completion for Group Recommendation

面向群组推荐的秩约束深度矩阵补全

Mubaraka Sani Ibrahim, Lehel Csató, Isah Charles Saidu

发表机构 * Department of Computer Science, African University of Science and Technology（非洲科学与技术大学计算机科学系）； Faculty of Mathematics and Computer Science, Babes-Bolyai University（巴纳特-博雅大学数学与计算机科学学院）； Department of Computer Science, Baze University（贝泽大学计算机科学系）

AI总结提出Group RC-DMC框架，通过Set-Transformer聚合器整合群组级表示学习，结合低秩结构和注意力非线性建模，实现个体与群组级别的准确预测。

详情

AI中文摘要

群体活动的日益普及增加了根据用户个体偏好向用户群组提供推荐的方法需求。许多现有的群组推荐系统依赖于聚合个体用户偏好，但通常难以处理现实场景中常见的高维且高度稀疏的评分数据。我们提出了群组秩约束深度矩阵补全（Group RC-DMC），这是一个新颖的框架，通过Set-Transformer聚合器整合群组级表示学习，扩展了RC-DMC，联合利用了低秩结构和基于注意力的非线性建模。与大多数现有群组推荐系统不同，Group RC-DMC在一个统一框架中融合了显式低秩正则化、线性编码器-解码器架构和基于注意力的非线性群组建模，在个体和群组级别都产生准确的预测。Group RC-DMC通过低秩矩阵补全解决数据稀疏性，仅从观测评分计算每个用户的潜在表示，并基于周期性奇异值阈值化使用核范数近端步骤对潜在空间施加秩约束。解码器被参数化为低秩分解，从而实现高效推理。在MovieLens和Goodbooks数据集上的实验结果表明，Group RC-DMC实现了优越的重建精度（以更低的群组RMSE衡量），同时在计算效率上保持竞争力，并且在群组级别的性能（精确率、召回率和F1分数）上与加权前分解（WBF）和加权后分解（AF）基线相当。结果突显了模型恢复用户-物品交互的底层低秩结构的能力，并为小、中、大用户群组提供稳健的群组推荐。

英文摘要

The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their individual preferences. Many existing group recommender systems rely on aggregating individual user preferences, but they often struggle with high-dimensional and highly sparse rating data commonly found in real-world scenarios. We propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a novel framework that extends RC-DMC by integrating group-level representation learning via a Set-Transformer aggregator, jointly leveraging low-rank structure and attention-based nonlinear modeling. Unlike most existing group recommender systems, Group RC-DMC unifies explicit low-rank regularization, linear encoder-decoder architectures, and attention-based nonlinear group modeling within a single framework, yielding accurate predictions at both the individual and group levels. Group RC-DMC addresses data sparsity through low-rank matrix completion, computing per-user latent representations from observed ratings only, and enforcing a rank constraint on the latent space using a nuclear-norm proximal step based on periodic singular value thresholding. The decoder is parametrized as a low-rank factorization, enabling efficient inference. Experimental results on the MovieLens and Goodbooks datasets demonstrate that Group RC-DMC achieves superior reconstruction accuracy, measured by lower group RMSE, while remaining computationally efficient and competitive in group-level performance in terms of precision, recall, and F1 score compared with weighted-before-factorization (WBF) and after-factorization (AF) baselines. The results highlight the model's ability to recover the underlying low-rank structure of user-item interactions and provide robust group recommendations across small, medium, and large user groups.

URL PDF HTML ☆

赞 0 踩 0

2606.01947 2026-06-02 cs.CV cs.AI 版本更新

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

大型预训练模型在实例分割任务中的参数高效微调

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

发表机构 * University of Freiburg（弗赖堡大学）

AI总结本研究针对实例分割任务，探索了适配器和低秩适应（LoRA）两种参数高效微调方法，在仅微调约1-6%参数的情况下取得竞争性能，并发现每个Transformer块使用2-3个适配器可达到性能与效率的最佳平衡。

Comments Published by the Machine Learning and Knowledge Extraction Journal

详情

DOI: 10.3390/make6040133
Journal ref: Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807

AI中文摘要

近年来，随着大型预训练模型的兴起，人工智能的研究和应用发生了转变，这些模型在众多任务中取得了最先进的结果。然而，参数的大量增加引入了对参数高效训练策略的需求。尽管取得了显著进展，但针对基于Transformer的模型在实例分割任务中的参数高效微调（PEFT）方法的研究仍然有限。为填补这一空白，本研究调查了PEFT方法的有效性，特别是适配器和低秩适应（LoRA），并将其应用于两个模型和四个基准数据集。通过集成顺序排列的适配器模块并将LoRA应用于可变形注意力（本文首次探索），在仅微调约1-6%模型参数的情况下取得了竞争性能，相比传统微调所需的40-55%有显著改进。关键发现表明，每个Transformer块使用2-3个适配器可实现性能与效率的最佳平衡。此外，LoRA在应用于可变形注意力时表现出强大的参数效率，并在某些情况下超越了适配器配置。这些结果表明，PEFT技术的影响因数据集复杂性和模型架构而异，强调了上下文特定调优的重要性。总体而言，这项工作展示了PEFT在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

英文摘要

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention--explored here for the first time--achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA, exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.01912 2026-06-02 cs.AI 版本更新

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

SMH-Bench：用于智能家居中环境基础推理与行动的LLM代理基准测试

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

发表机构 * Midea Group（美的集团）； Beijing University of Posts and Telecommunications（北京邮电大学）； Donghua University（东华大学）； The University of Sydney（悉尼大学）； Peking University（北京大学）

AI总结提出SMH-Bench基准，基于可执行模拟器HomeEnv，通过1100个任务评估LLM在智能家居中的推理与行动能力，发现前沿模型在自动化调度、模糊处理和个性化推理方面存在不足。

详情

AI中文摘要

智能家居正朝着复杂的、依赖于状态的生活环境发展，需要大型语言模型（LLM）对用户意图、偏好和多设备交互进行推理。然而，现有的智能家居基准通常侧重于静态的指令到API映射或有限的模拟，未能评估LLM是否能够在现实家庭场景中可靠地进行推理、交互和行动。为了解决这些局限性，我们引入了SMH-Bench，这是一个用于评估智能家居环境中LLM的全面基准。基于可执行且可验证的智能家居模拟器HomeEnv，SMH-Bench包含1100个高质量任务，涵盖7个类别和22个细粒度子类别。它进一步将任务分层为简单、中等和复杂家庭，范围从小型公寓到拥有135个设备的密集多房间环境。实验表明，尽管前沿LLM在显式控制和查询任务上表现强劲，但在自动化任务调度、模糊处理和个性化推理方面仍存在显著弱点，尤其是在家庭复杂性增加时。我们希望SMH-Bench能够促进更可靠、上下文感知且实际可部署的智能家居代理的发展。

英文摘要

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

URL PDF HTML ☆

赞 0 踩 0

2606.01909 2026-06-02 cs.SD cs.AI eess.AS 版本更新

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

Echo: 一种用于共享潜在空间中说话人日志和语音识别的联合嵌入预测架构

Louis Mouchon

发表机构 * Louis Mouchon（洛伊斯·莫尚）

AI总结提出Echo系统，基于单个25M参数ViT编码器，通过JEPA预训练和分阶段特化，在512维潜在空间中联合实现说话人日志、语音分离和语音识别，无需部署时微调。

Comments 18 pages, 17 tables, 1 figure. Proof-of-concept, independent research

详情

AI中文摘要

我们提出Echo，一个围绕单个25M参数ViT编码器构建的概念验证音频系统。该编码器使用JEPA目标进行预训练，然后分阶段特化，以在同一个512维潜在空间中承载说话人身份、语音内容和动态源路由，部署时无需针对每个任务进行微调。轻量级头部处理说话人日志（ArcFace + VBx）和动态源分离（空目标K集预测）。在未知K的合成VoxCeleb2混合数据上，标准堆栈达到15.00%的盲DER、97.80%的PIT分离准确率，潜在SI-SDR提升+9.52 dB，以及在留出k-NN探针上说话人/内容因子化差距为+53.50分。Echo的意义不在于任何单一任务上的新SOTA，而在于三个任务在一个编码器上以这种规模共同共存。我们逐阶段记录了设计，报告了死胡同，并识别了通过VQ瓶颈进行端到端ASR的结构性障碍，该瓶颈仍然限制了PoC。

英文摘要

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.

URL PDF HTML ☆

赞 1 踩 0

2606.01906 2026-06-02 cs.AI 版本更新

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

贝叶斯谱情感转移发现：来自多标注者分歧

Keito Inoshita, Takato Ueno

发表机构 * Keio University（庆应大学）； National Institute of Advanced Industrial Science and Technology（国家工业科学与技术研究院）

AI总结提出贝叶斯谱情感转移发现（BSETD）两阶段框架，从多标注者软标签中挖掘情感转移结构，并通过谱分解分离惯性与传染成分，在EmotionLines数据集上验证了与心理学理论的一致性。

详情

AI中文摘要

情感通过对话的动态过程演变，理解其转移结构对于从心理健康筛查到对话系统等应用至关重要。然而，现有研究通常通过多数投票将多评分者判断压缩为单个硬标签，丢弃了理解轮次间转移所需的不确定性信号。本文提出贝叶斯谱情感转移发现（BSETD），一个从多评分者软标签中发现情感转移结构的两阶段框架。第一阶段，通过软标签的外积构建层次狄利克雷-多项后验，为K×K转移矩阵的每个单元配备可信区间和Benjamini-Hochberg（BH）错误发现率（FDR）控制的显著性。第二阶段，对称图拉普拉斯矩阵经谱分解，分离出低频（惯性）和高频（传染）成分。在EmotionLines上，BSETD同时恢复了两个不同情感空间的标志：Plutchik相邻的转移——厌恶到愤怒（log2提升+0.94）和愤怒到厌恶（+0.86）被过度表示，而Russell效价反转的转移——快乐到愤怒（-0.90）和愤怒到快乐（-0.89）被欠表示。五源跨语料验证得到英语内成对皮尔逊相关0.91-0.98，与中文M3ED对比0.79-0.85，以及同一话语集上人类硬标签与LLM虚拟软标签之间0.979的相关性，表明保留标注者不确定性的流程将情感动态的计算研究与既有的心理学理论联系起来。

英文摘要

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the K x K transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations in 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.

URL PDF HTML ☆

赞 0 踩 0

2606.01901 2026-06-02 cs.CV cs.AI cs.CL 版本更新

LEO星座中基于多卫星视角的协作空间目标检测

Xingyu Qu, Wenxuan Zhang, Peng Hu

发表机构 * Government of Canada（加拿大政府）； Natural Sciences and Engineering Research Council of Canada（加拿大自然科学和工程研究理事会）

AI总结针对LEO星座中空间目标检测的挑战，提出基于深度学习框架的多视角观测融合方法，使用YOLO检测器处理多视角数据，实验表明多视角融合显著提升检测精度。

详情

AI中文摘要

随着低地球轨道（LEO）星座中卫星数量的增加，近地空间环境日益拥挤，使得空间目标检测（SOD）成为空间安全和可持续性面临的紧迫挑战。为了降低碰撞风险并确保空间操作的连续性，SOD系统必须在严格的星载约束下提供快速准确的检测。在本文中，我们研究了深度学习（DL）框架内多视角观测融合的潜力，以增强SOD性能。我们设计了一个实用的多视角流水线和几种输入表示，用于将多视角数据输入基于YOLO的检测器。我们的实验表明，在大多数情况下使用多视角输入是可行的，并且通常能在mAP50和mAP50-95上产生更好的结果。例如，在模型YOLOv9-m中，单视角与三视角融合RGB设置相比，mAP50从0.638增加到0.732，而mAP50-95从0.227提高到0.276。与单视角设置相比，最佳的三视角灰度配置将mAP50提高了36.3%，mAP50-95提高了46.5%。这些发现确立了多视角融合作为SOD的一种可行且有效的策略，对LEO星座部署中的空间态势感知具有广泛意义。

英文摘要

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, in model YOLOv9-m, single-view compared to a three-view fused RGB setting, mAP50 increases from 0.638 to 0.732, while mAP50-95 improves from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.

URL PDF HTML ☆

赞 0 踩 0

2606.01894 2026-06-02 cs.AI 版本更新

Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations

物理约束的Mamba-SDE用于不规则观测下的剩余使用寿命预测

Deyu Zhuang, Peiliang Gong, Yang Shao, Liyuan Shu, Qi Zhu, Xiaoli Li, Daoqiang Zhang

发表机构 * Nanjing University of Aeronautics and Astronautics（南京航空航天大学）； Nanyang Technological University（南洋理工大学）； Singapore University of Technology and Design（新加坡科技设计大学）

AI总结提出PC-MambaSDE框架，通过掩码感知连续Mamba编码器和物理引导的潜在SDE，解决不规则观测下剩余使用寿命预测的物理不可行性问题。

详情

AI中文摘要

准确的剩余使用寿命预测对于工业预测性维护至关重要。然而，由于传感器观测的不规则性，表现为异步采样、突发缺失和时间抖动，实际部署具有挑战性。更糟糕的是，纯数据驱动模型常常生成物理上不合理的退化轨迹，违反损伤累积的不可逆性。为了解决这个问题，我们提出了PC-MambaSDE，一个统一的连续时间框架，用于在不规则观测下进行鲁棒的RUL预测。具体来说，我们设计了一个掩码感知连续Mamba编码器，显式利用观测掩码提取富含上下文的控制信号。此外，我们引入了一个带有参数化修正混合漂移的物理引导潜在SDE，叠加全局物理偏差以强制单调退化，即使在严重观测间隙下也是如此。另外，我们通过终端退化惩罚将RUL预测公式化为边界值问题，该惩罚解耦健康指标维度并应用惩罚损失引导轨迹向故障状态演化。理论上，我们通过Girsanov定理证明了我们的变分目标在数学上等价于最小化KL散度，并通过Lyapunov分析保证了学习动力学的全局渐近稳定性。为了进行严格评估，我们开发了一个混合不规则性生成方案，模拟真实的工业缺陷。在公开基准上的大量实验表明，PC-MambaSDE显著优于最先进的方法，特别是在极端观测稀缺情况下，验证了将物理先验嵌入连续时间潜在动力学的有效性。

英文摘要

Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversible nature of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a Mask-Aware Continuous Mamba Encoder that explicitly leverages observation masks to extract context-rich control signals. Furthermore, we introduce a Physics-Guided Latent SDE with parametrically rectified hybrid drift, superimposing a global physical bias to enforce monotonic degradation even amid severe observation gaps. Additionally, we formulate RUL prediction as a boundary value problem via a Terminal Degradation Penalty, which decouples a Health Index dimension and applies a penalty loss to guide trajectories toward the failure state. Theoretically, we prove that our variational objective is mathematically equivalent to minimizing the KL divergence via Girsanov's theorem, and we guarantee the global asymptotic stability of the learned dynamics through Lyapunov analysis. To enable rigorous evaluation, we develop a Hybrid Irregularity Generation Scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks demonstrate that PC-MambaSDE significantly outperforms state-of-the-art methods, particularly under extreme observation scarcity, validating the efficacy of embedding physical priors into continuous-time latent dynamics.

URL PDF HTML ☆

赞 0 踩 0

2606.01886 2026-06-02 cs.AI cs.CE 版本更新

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

吸收复杂性：面向金融LLM代理的交互原生知识驾驭系统

Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita, Dmytro Kyrylenko, Sofiia Pidturkina, Julia Stadnyk

发表机构 * True Trading ； Inc4.net

AI总结提出交互原生知识驾驭（InKH）架构，通过被动知识注入、时序图记忆和过期失效机制，将复杂性吸收到系统中，在金融LLM代理任务中显著降低延迟、令牌成本和过时知识使用，同时提升任务质量和可追溯性。

Comments 17 pages, 3 figures

详情

AI中文摘要

金融AI代理常常因一个简单原因而失败：它们让用户承担复杂性。用户必须反复陈述目标、风险偏好、投资组合背景、过往判断以及不断变化的市场假设，而代理则回答、检索、行动并遗忘。在金融领域，这不仅仅是方便与否的问题。在市场分析、跟单交易审查和交易准备等任务中，被遗忘的背景和过时的记忆可能导致延迟、重复错误、弱可审计性以及不安全的决策。我们提出了交互原生知识驾驭（InKH），一种面向金融LLM代理的架构，将复杂性吸收到系统中。InKH将用户、市场、投资组合和工具事件转换为结构化的操作知识。它使用被动知识注入在主模型步骤之前组装一个有界的工作上下文缓冲区，使用时序图记忆进行低延迟检索，使用维基审计界面实现人类可读的治理，以及具有成熟度、衰减和写入时失效的背景提取。我们在一个可重复的受控合成基准上评估了InKH，该基准包含24个随机种子、4轮、每轮80个片段和6个基线，产生了46,080个基线条件评估。InKH在900毫秒延迟下实现了0.815的平均任务质量。与代理驱动的维基漫步记忆相比，它将延迟降低了82.95%，令牌成本降低了82.29%，过时知识使用降低了96.58%，同时质量提高了0.108，可追溯性提高了0.461。与没有失效机制的时序图系统相比，它在相当的服务成本下将质量提高了0.050，并将过时记忆使用降低了96.58%。结果支持了金融AI的设计论点：当复杂性被系统吸收而不是转移给用户时，采用就会发生。该基准验证了架构层面的行为，而非实时交易性能。

英文摘要

Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk preferences, portfolio context, past judgments, and shifting market assumptions, while the agent answers, retrieves, acts, and forgets. In finance, this is not just inconvenient. In tasks such as market analysis, copy-trading review, and trade preparation, forgotten context and stale memory can create latency, repeated errors, weak auditability, and unsafe decisions. We propose the interaction-native knowledge harness (InKH), an architecture for financial LLM agents that absorbs complexity into the system. InKH converts user, market, portfolio, and tool events into structured operational knowledge. It uses passive knowledge injection to assemble a bounded working context buffer before the main model step, temporal graph memory for low-latency retrieval, a wiki audit surface for human-readable governance, and background extraction with maturity, decay, and write-time invalidation. We evaluate InKH on a reproducible controlled synthetic benchmark with 24 random seeds, 4 rounds, 80 episodes per round, and 6 baselines, producing 46,080 baseline-conditioned evaluations. InKH achieves mean task quality of 0.815 at 900 ms latency. Compared with agent-driven wiki-walk memory, it reduces latency by 82.95 percent, token cost by 82.29 percent, and stale-knowledge usage by 96.58 percent, while improving quality by 0.108 and traceability by 0.461. Compared with a temporal-graph system without invalidation, it improves quality by 0.050 and reduces stale-memory usage by 96.58 percent with comparable serving cost. The results support a design thesis for financial AI: adoption happens when complexity is absorbed by the system rather than transferred to the user. The benchmark validates architecture-level behavior, not live trading performance.

URL PDF HTML ☆

赞 0 踩 0

2606.01862 2026-06-02 cs.MA cs.AI cs.NI 版本更新

RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation

RadioMaster: 自主无线电信号生成的多智能体系统

Jiazhen Lei, Tianze Cao, Yuxin Sha, Sihan Wang, Bingbing Wang, Fengyuan Zhu, Zeming Yang, Xiaohua Tian

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出RadioMaster，一个全自主的多智能体框架，通过RadioWiki、RadioAgent和RadioEmulator三大支柱，将用户意图转化为真实无线信号，解决现有模型因领域知识和硬件约束敏感性不足而无法生成无线电信号的问题。

详情

AI中文摘要

将用户意图转化为物理无线电信号是无线原型设计中关键但繁琐的最后一步，因为它需要复杂的物理层细节知识，并带来巨大的实现挑战。大型语言模型（LLM）和多智能体系统已经彻底改变了传统的软件工程，提出了一个引人深思的问题：它们能否解决这些艰巨的困难？然而，我们的研究表明，当前模型在应用于无线电信号生成时存在显著局限性，无法完成此任务。这种性能下降主要源于严重的领域无知和对物理硬件约束的根本不敏感。为弥补这一差距，我们引入了RadioMaster，一个完全自主的多智能体框架，旨在将用户输入无缝转化为真实的无线发射。RadioMaster基于三个协同支柱运行：用于领域特定知识检索的RadioWiki、用于协作I/Q样本生成和硬件配置的RadioAgent，以及用于闭环物理层验证的RadioEmulator。此外，我们构建了RadioBench，这是首个专门针对无线电信号生成领域的全面基准测试。广泛的真实世界评估表明，RadioMaster在配置可行性和信号保真度方面显著优于最先进的基线方法。

英文摘要

Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as it requires intricate knowledge of physical layer details and presents immense implementation challenges. Large Language Models (LLMs) and multi-agent systems have revolutionized conventional software engineering, raising the compelling question of whether they can resolve these formidable difficulties. However, our investigations reveal that current models experience significant limitations and fail to accomplish this task when applied to radio signal generation. This performance degradation primarily stems from severe domain ignorance and a fundamental insensitivity to physical hardware constraints. To bridge this gap, we introduce RadioMaster, a fully autonomous multi-agent framework designed to seamlessly translate user input into real-world wireless emissions. RadioMaster operates on three synergistic pillars: RadioWiki for domain-specific knowledge retrieval, RadioAgent for collaborative I/Q sample generation alongside hardware configuration, and RadioEmulator for closed-loop physical layer verification. Furthermore, we construct RadioBench, the first comprehensive benchmark tailored specifically for the radio signal generation domain. Extensive real-world evaluations demonstrate that RadioMaster significantly outperforms state-of-the-art (SOTA) baselines regarding configuration viability and signal fidelity.

URL PDF HTML ☆

赞 0 踩 0

2606.01856 2026-06-02 cs.DC cs.AI 版本更新

Boosting Multimodal Federated Learning via Chained Modality Optimization

通过链式模态优化提升多模态联邦学习

Zixin Zhang, Fan Qi, Shuai Li, Xiaoshan Yang, Changsheng Xu

发表机构 * School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, China（天津理工大学计算机科学与工程学院）； Institute of Automation, Chinese Academy of Sciences, Beijing, China（中国科学院自动化研究所）； College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia, China（内蒙古大学计算机学院）

AI总结针对多模态联邦学习中模态竞争导致全局模型次优的问题，提出FedMChain框架，通过分阶段优化、误差补偿正则化和稀疏符号引导聚合，提升预测性能并降低通信开销。

详情

AI中文摘要

多模态联邦学习（MMFL）能够在具有异构数据和模态可用性的分散客户端之间实现隐私保护的协作学习。然而，现有大多数MMFL方法将多模态训练视为联合优化问题，忽略了一个关键瓶颈：模态竞争，即主导模态抑制较弱模态，导致全局模型次优。为解决这一问题，我们提出FedMChain，一个平衡的MMFL框架，将联邦多模态训练结构化为一系列模态阶段。这种分阶段设计为每个模态在多模态客户端上提供了专用的局部优化窗口，以缓解模态竞争，并通过误差补偿正则化器进一步促进跨模态互补性。在服务器端，我们采用稀疏符号引导聚合策略，利用方向符号一致性进行稳健的模态内聚合，避免破坏性平均，并支持较少的同步频率以降低通信开销。在多模态基准上的大量实验表明，FedMChain在需要比基线更少通信频率的同时，持续提高了预测性能。

英文摘要

Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data and modality availability. However, most existing MMFL methods cast multimodal training as a joint optimization problem, overlooking a key bottleneck: modality competition, where dominant modalities suppress weaker ones and lead to suboptimal global models. To address this, we propose FedMChain, a balanced MMFL framework that structures federated multimodal training as a chain of modality-wise phases. This phase-wise design gives each modality a dedicated local optimization window on multimodal clients to mitigate modality competition, and further promotes cross-modal complementarity via an error-compensated regularizer. On the server side, we employ a sparse sign-guided aggregation strategy that leverages directional sign agreement for robust intra-modality aggregation, avoids destructive averaging, and supports less frequent synchronization to reduce communication overhead. Extensive experiments on multimodal benchmarks demonstrate that FedMChain consistently improves predictive performance while requiring less frequent communication than baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.01850 2026-06-02 cs.AI 版本更新

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

压缩是否保留不确定性？基于共形预测的量化和稀疏大语言模型统一基准

Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang, Junhao Dong, Jingling Yuan

发表机构 * Wuhan University of Technology（武汉理工大学）； Nanyang Technological University（南洋理工大学）

AI总结本研究通过共形预测方法，在五个NLP任务上对12种不同压缩配置的大语言模型进行基准测试，发现压缩经常解耦准确率与不确定性，大模型更能吸收压缩引起的不确定性，且不确定性膨胀常呈阈值状而非渐进。

详情

AI中文摘要

模型压缩技术如量化和剪枝被广泛用于降低大语言模型（LLMs）的部署成本，现有评估几乎只关注准确率保持。然而，在安全关键应用中，模型可靠量化自身不确定性的能力同样重要。我们问：压缩是否保留了这种能力？为回答此问题，我们在五个NLP任务上对12种不同压缩配置的LLM进行基准测试，使用共形预测提供严格、无分布的不确定性度量。实验揭示：(I) 压缩经常解耦准确率与不确定性；(II) 大模型吸收压缩引起的不确定性的能力远强于小模型；(III) 不确定性膨胀常呈阈值状而非渐进。这些结果表明，仅基于准确率的评估不足以评估压缩LLM的部署就绪度，不确定性感知的基准测试应成为模型压缩流程的标准组成部分。

英文摘要

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.

URL PDF HTML ☆

赞 0 踩 0

2606.01845 2026-06-02 cs.CL cs.AI 版本更新

Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses

揭示大型语言模型在推断非语言回应中的语用意义的局限性

Sugyeong Eo, Heuiseok Lim

发表机构 * Department of Software, Yonsei University Mirae Campus（燕山大学软件系）； Department of Computer Science and Engineering, Korea University（韩国大学计算机科学与工程系）

AI总结本研究首次系统评估大型语言模型（LLMs）从纯非语言回应对话中推断语用意义的能力，发现其准确率相比语言回应下降高达60%，并表明上下文学习有助于语用推理。

详情

AI中文摘要

尽管大型语言模型（LLMs）在语用语言理解方面取得了显著进展，但先前的研究主要集中在其对语言行为的理解上。然而，非语言行为仍然是人类交流的基本组成部分，特别是当故意单独使用以传达间接意义时。在这项工作中，我们首次系统评估了LLMs从仅包含非语言回应的对话中推断语用意义的能力。我们探讨了三个研究问题：（1）LLMs能否识别通过非语言回应传达的间接意图？（2）LLMs何时以及如何未能捕捉非语言意图？（3）我们如何提高LLMs解释非语言意图的能力？通过评估，我们观察到LLMs难以从非语言回应中推断出潜在意义，准确率相比语言回应下降高达60个百分点。进一步的广泛分析揭示了LLMs在解释非语言行为时的行为模式，并表明上下文学习有助于语用推理。

英文摘要

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs' ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs' ability to interpret non-verbal intent?. Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60% points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs' interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.

URL PDF HTML ☆

赞 0 踩 0

2606.01843 2026-06-02 cs.CV cs.AI 版本更新

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

抑制伪造特定捷径以实现可泛化的深度伪造检测

Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu, Le Wu, Tat-Seng Chua

发表机构 * Hefei University of Technology（合肥工业大学）； National University of Singapore（国立新加坡大学）

AI总结提出Shortcut Subspace Suppression (S^3)框架，通过子空间建模显式表征并抑制方法特定捷径，以提升深度伪造检测的跨方法泛化能力。

详情

AI中文摘要

深度伪造检测在跨伪造方法泛化方面表现不佳，因为现有模型倾向于依赖虚假的方法特定捷径，这些捷径无法迁移到未见过的篡改操作。尽管近期方法试图改进泛化性，但它们缺乏明确的机制来识别和抑制学习表示中的此类捷径。在这项工作中，我们提出了捷径子空间抑制（S^3）框架，通过子空间建模显式表征并抑制方法特定捷径。我们的关键洞察是，区分不同伪造方法的变体捕获了方法特定的伪影，因此可作为方法特定捷径的有效代理。为此，我们训练一个轻量级线性探针进行伪造方法分类，并执行奇异值分解（SVD）以提取主导的捷径子空间。基于此公式，我们开发了两种互补策略来减少对捷径的依赖。在训练期间，我们软性抑制特征表示中的捷径子空间，鼓励模型依赖更可泛化的线索进行真/假判别。在推理时，我们引入一个无需训练的对应方法，衰减与识别出的捷径方向对齐的神经元，从而实现即插即用的泛化增强，并提高可解释性。在多个基准上的大量实验表明，我们的方法显著改善了跨方法泛化，同时保持了强大的域内性能。代码将在论文被接收后发布。

英文摘要

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose Shortcut Subspace Suppression (S^3) framework that explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.

URL PDF HTML ☆

赞 0 踩 0

2606.01840 2026-06-02 cs.AI 版本更新

Evaluation of Baseline Methods for IDD-based SSD External Memory Search

基于IDD的SSD外部内存搜索的基线方法评估

Yuki Suzuki, Alex Fukunaga

发表机构 * International Symposium on Combinatorial Search (SoCS 2026)（国际组合搜索会议（SoCS 2026））

AI总结本文评估了基于即时重复检测（IDD）的A*算法在SSD外部内存搜索中的简单基线方法性能，并分析了操作系统级页面缓存的影响。

Comments accepted to The 19th International Symposium on Combinatorial Search (SoCS2026)

2606.01838 2026-06-02 cs.CL cs.AI cs.LG 版本更新

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

LayerRoute: 基于LoRA微调的输入条件自适应层跳过方法用于智能语言模型

Prateek Kumar Sikdar

发表机构 * Accenture（埃森哲）

AI总结提出LayerRoute，通过为每个Transformer块添加轻量级路由器和LoRA适配器，根据输入类型（工具调用或规划推理）自适应跳过层，在仅增加0.22%可训练参数下实现12.91%的跳过差异并提升质量。

Comments 10 pages, 3 figures, 4 tables

详情

AI中文摘要

智能语言模型系统交替使用两种结构不同的步骤类型：结构化工具调用（短、确定性、低困惑度）和开放式规划/推理步骤（长、复杂、高困惑度）。尽管存在这种异质性，当前的推理系统对每个步骤应用相同的计算量。我们引入LayerRoute，一个轻量级适配器，学习基于每个输入有选择地跳过Transformer块。LayerRoute为Qwen2.5-0.5B-Instruct中的24个Transformer块中的每一个增加：（1）一个每层路由器（约897个参数，Linear(896,1)），通过直通估计器输出硬二值门；（2）在Q/K/V/O注意力投影上的LoRA适配器（秩8，约1.08M参数）。骨干权重保持冻结。在智能体数据（Hermes、Glaive、GSM8K、Turing）上进行单次端到端训练，并加入门正则化项，迫使系统发现每个输入类型下哪些块是可跳过的。经过3000步（在A100 40GB上6.4分钟），LayerRoute实现了12.91%的跳过差异：工具调用跳过15.25%的FLOPs，而规划步骤仅跳过2.34%，仅使用1.10M可训练参数（占494M骨干的0.22%）。由于LoRA适配，质量相比基础模型有所提升，工具调用上的困惑度差为-1.29，规划步骤上为-1.30。

英文摘要

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.

URL PDF HTML ☆

赞 0 踩 0

2606.01834 2026-06-02 cs.CV cs.AI 版本更新

Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition

轻量级TCN中的物理引导注意力用于高效基于WiFi CSI的人体活动识别

Chinthaka Ranasingha, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Harshala Gammulle

发表机构 * Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) Research Group, School of Electrical Engineering and Robotics, Queensland University of Technology (QUT)（信号处理、人工智能与视觉技术（SAIVT）研究组，电气工程与机器人学院，昆士兰科技大学（QUT））

AI总结提出一种紧凑的TCN框架，通过多普勒能量引导的时间注意力和方差驱动的通道注意力机制，显式引入运动感知归纳偏置，在减少参数和计算成本的同时实现优于深度基线模型的性能。

详情

AI中文摘要

基于WiFi信道状态信息（CSI）的人体动作识别（HAR）因其非接触、低成本及保护隐私的特性而受到越来越多的关注。然而，现有的基于学习的方法主要依赖深度、计算密集的架构来隐式地从CSI测量中捕捉运动动态，从而增加了模型复杂度并降低了效率。相反，我们认为，结合针对CSI信号物理特性的适当归纳偏置能够实现更高效和有效的学习。在这项工作中，我们提出一个紧凑的基于时间卷积网络（TCN）的框架，将运动感知的归纳偏置显式地融入特征学习。具体地，我们在特征空间中引入多普勒能量引导的时间注意力机制以强调运动显著的时间段，以及一个方差驱动的通道注意力模块，根据时间运动统计自适应地加权信息子载波。通过整合这些领域特定的先验知识，所提模型在不增加架构深度的情况下有效捕捉运动动态。在多个基准数据集上的大量实验表明，我们的方法相比更深的基线模型取得了优越的性能，同时显著减少了参数数量和计算成本。

英文摘要

Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture motion dynamics from CSI measurements, thereby increasing model complexity and reducing efficiency. Instead, we argue that incorporating appropriate inductive biases tailored to the physical characteristics of CSI signals enables more efficient and effective learning. In this work, we propose a compact temporal convolutional network (TCN)-based framework that explicitly incorporates motion-aware inductive biases into feature learning. Specifically, we introduce a Doppler-energy-guided temporal attention mechanism in feature space to emphasize motion-salient time segments, and a variance-driven channel attention module to weight informative subcarriers based on temporal motion statistics adaptively. By integrating these domain-specific priors, the proposed model effectively captures motion dynamics without increasing architectural depth. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves superior performance compared to deeper baselines, while significantly reducing parameter count and computational cost.

URL PDF HTML ☆

赞 0 踩 0

2606.01833 2026-06-02 cs.LG cs.AI 版本更新

Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

学习生成空间中的隐式偏置以加速蛋白质动力学仿真

Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu, Siyu Zhu, Tzuhsiung Yang, Yuan Qi

发表机构 * Fudan University（复旦大学）； Shanghai Academy of AI for Science（上海人工智能科学研究院）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出在预训练生成式仿真器的生成空间中引入隐式历史依赖偏置，结合距离加权分数估计和环境支持正则化，通过重投影步骤保持结构有效性，显著提升采样多样性和稀有状态覆盖速度。

详情

AI中文摘要

蛋白质动力学生成式仿真器能够以分子动力学一小部分成本生成合理的轨迹，但它们继承了训练分布，在长期外推下倾向于重访已知状态而非到达稀有状态。受经典增强采样启发，我们在预训练仿真器的生成空间中引入隐式历史依赖偏置。具体来说，一个历史感知的分数估计器向冻结的仿真器添加距离加权偏置，引导逆时采样远离先前生成的结构，并通过环境支持项进行正则化。为在长时间尺度下保持结构有效性，一个基于分数的精化步骤利用冻结仿真器将漂移的样本重新投影到数据流形上。实验表明，该方法（i）在DynamicPDB-80上将多样性提升35%；（ii）在12个零样本快速折叠蛋白质上，单独使用学习到的偏置达到无偏仿真器覆盖的速度最高快约15倍，与精化结合后覆盖速度最高快约37倍，同时覆盖的低能态数量多约3倍。代码即将发布。

英文摘要

Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that the method (i) raises diversity by $35\%$ on DynamicPDB-80; (ii) on $12$ zero-shot Fast-Folding proteins, the learned bias alone reaches the unbiased emulator's coverage up to ${\sim}15\times$ faster, and pairing it with refinement reaches the coverage up to ${\sim}37\times$ faster while covering ${\sim}3\times$ as many low-energy states. Code will be released soon.

URL PDF HTML ☆

赞 0 踩 0

2606.01830 2026-06-02 cs.AI 版本更新

OctoT2I：一种自我进化的智能文本到图像路由系统

Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University（电子与计算机工程学院，北京大学）； Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University（广东省超高清沉浸媒体技术重点实验室，北京大学深圳研究生院）

AI总结提出OctoT2I框架，通过自进化机制构建知识库并采用状态化多轮路由策略，联合优化生成质量与推理效率，在GenEval上达到0.96性能，同时实现90.3%推理加速和56.6%能效提升。

详情

AI中文摘要

并行异步自适应一阶方法的随机收敛性

Serge Gratton, Philippe L. Toint

发表机构 * Université de Toulouse, INP, IRIT, Toulouse, France（图卢兹大学，INP，IRIT，法国图卢兹）； IA Artificial and Natural Intelligence Toulouse Institute (ANITI)（图卢兹3IA人工智能与自然智能研究所（ANITI））； NAXYS, University of Namur, Namur, Belgium（NAXYS，纳慕尔大学，比利时纳慕尔）

AI总结本文提出一类新的异步自适应一阶优化方法，包括多种流行算法的异步变体，并分析其在非凸函数上的随机收敛性，达到O(1/√t)的收敛速率。

2606.01783 2026-06-02 cs.IR cs.AI 版本更新

Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation

打破信息孤岛：面向跨域推荐的语义人物画像

Jonathan Mayo, Moshe Unger, Konstantin Bauman

发表机构 * Technology and Information Management Department, Coller School of Management, Tel Aviv University（技术与信息管理系，科勒管理学院，特拉维夫大学）； Management Information Systems Department, Fox School of Business, Temple University（管理信息系统系，福克斯商学院， Temple大学）

AI总结提出SPHERE方法，利用大语言模型生成语义人物画像，实现无共享用户或物品的跨域推荐，并通过双塔架构和动态融合门增强推荐性能。

详情

AI中文摘要

数字平台日益成为孤立的信息孤岛，限制了它们跨域构建全面用户表征的能力。跨域推荐系统试图通过将知识从源域迁移到目标域来克服这一限制，但大多数现有方法依赖于共享用户、共享物品或结构相似的交互图。这些假设在独立平台上往往不切实际。我们提出SPHERE（面向异构跨域推荐的语义人物画像），一种设计构件，能够在严格不相交的域之间实现推荐知识迁移，无需共享用户或物品。SPHERE不通过身份或图结构对齐域，而是使用大语言模型诱导共享行为词汇，为用户生成结构化语义人物画像，并检索行为相似的源域社区，形成社区源人物画像。该语义信号通过双塔架构和动态融合门与协同信号集成，使SPHERE能够增强标准推荐骨干。在Amazon Books、Goodreads和Steam上的实证评估表明，在全排名评估下，SPHERE在NCF、SVD++和LightGCN基线上取得了一致的改进。结果表明，跨域迁移效果不仅由域之间的语义接近度决定，还关键取决于目标域的结构密度和原生预测强度。该研究通过将跨域个性化重新定义为基于行为的语义对齐，为信息系统研究做出贡献，提供了一种在保持可解释性和模块化的同时克服信息孤岛的实用机制。

英文摘要

Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representations across domains. Cross-domain recommender systems seek to overcome this limitation by transferring knowledge from a source domain to a target domain, yet most existing approaches depend on shared users, shared items, or structurally similar interaction graphs. These assumptions are often unrealistic across independent platforms. We propose SPHERE (Semantic Personas for Heterogeneous cross-domain Recommendation), a design artifact that enables recommendation knowledge transfer across strictly disjoint domains with no shared users or items. Rather than aligning domains through identity or graph structure, SPHERE uses large language models to induce a shared behavioral vocabulary, generate structured semantic personas for users, and retrieve behaviorally similar source-domain communities that form a Community Source Persona. This semantic signal is integrated with collaborative signals through a dual-tower architecture and dynamic fusion gate, allowing SPHERE to augment standard recommender backbones. Empirical evaluation across Amazon Books, Goodreads, and Steam demonstrates consistent improvements over NCF, SVD++, and LightGCN baselines under full-ranking evaluation. The results show that cross-domain transfer effectiveness is not determined solely by semantic proximity between domains; rather, it depends critically on the structural density and native predictive strength of the target domain. The study contributes to information systems research by reframing cross-domain personalization as behavior-based semantic alignment, offering a practical mechanism for overcoming information silos while preserving interpretability and modularity.

URL PDF HTML ☆

赞 0 踩 0

2606.01781 2026-06-02 cs.AI 版本更新

Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction

结构引导的自适应传播用于蛋白质-蛋白质相互作用位点预测

Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma

发表机构 * Institute of Computing Science and Technology, Guangzhou University（广州大学计算机科学与技术学院）； School of Computer Science, Peking University（北京大学计算机科学学院）； Information Science & Technology Department, Beijing Capital International Airport Co., Ltd.（北京首都国际机场有限公司信息科学与技术部）； School of Information Science and Technology, Dalian Maritime University（大连海事大学信息科学与技术学院）

AI总结提出SGAP-PPIS模型，利用等变图神经网络的多尺度几何状态生成残基级传播系数，实现自适应信息扩散，在Test_60上取得竞争性能。

Comments 9 pages, 3 figures

详情

AI中文摘要

准确预测蛋白质-蛋白质相互作用位点（PPIS）对于理解细胞过程、疾病机制和治疗靶点发现至关重要。基于图的深度学习通过整合残基级结构上下文推进了PPIS预测。然而，尽管蛋白质界面存在结构和功能异质性，大多数基于图的模型仍依赖固定传播方案，对所有残基一视同仁。这种传播可能限制信息扩散适应局部几何环境的能力，使得难以区分真正的相互作用位点和结构相似的非相互作用邻居。我们提出SGAP-PPIS，一种用于PPIS预测的结构引导自适应传播模型。SGAP-PPIS不使用固定传播机制，而是利用等变图神经网络的多尺度几何状态生成残基级传播系数。这种设计允许每个残基根据其几何微环境自适应地平衡局部特征保留和邻域扩散。实验结果表明，SGAP-PPIS在Test_60上达到了与最先进方法竞争的性能。消融研究表明，几何条件自适应传播、尺度对齐几何引导和多步传播状态表示共同推动了这些改进。

英文摘要

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves competitive performance among the state-of-the-art methods on Test\_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.

URL PDF HTML ☆

赞 0 踩 0

2606.01774 2026-06-02 cs.LG cs.AI 版本更新

FLARE: Diffusion for Hybrid Language Model

FLARE: 混合语言模型的扩散方法

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan, Yiran Xu, Wanrong Zhu, Jason Kuen, Koustava Goswami, Rajiv Jain, Yongxin Chen, Molei Tao, Jiuxiang Gu

发表机构 * Adobe Research（Adobe研究院）； Georgia Institute of Technology（佐治亚理工学院）

AI总结提出FLARE框架，通过结合自回归和扩散目标、硬件感知内核和统一推理，将混合注意力LLM转换为支持并行解码的扩散模型，在保持能力的同时提升吞吐量。

详情

AI中文摘要

自回归（AR）大型语言模型（LLM）已取得广泛的实际成功，但顺序解码仍然是低延迟部署的关键瓶颈。近期的高效推理工作沿着两个方向推进：通过高效架构降低每次模型调用的成本，以及通过并行生成减少串行解码步骤。混合注意力骨干解决了前者，而扩散语言模型（dLLM）通过迭代并行去噪追求后者。结合这些优势仍然具有挑战性：AR到dLLM的转换通常无法保留种子检查点的能力，并且混合注意力循环状态和掩码约束使得扩散训练和服务变得复杂。我们提出了FLARE，一个针对混合注意力LLM的系统转换框架。我们的分析确定迁移数据质量是能力保留的主要决定因素，其重要性超过损失公式和注意力掩码设计。最终框架结合了token等价的AR和扩散目标、硬件感知内核以及统一推理，使得一个检查点能够同时支持AR风格的验证解码和扩散风格的并行去噪。从强大的AR检查点出发，使用有限的训练后数据，FLARE在模型规模上与领先的开源dLLM竞争，并在单GPU并发服务中相比开源dLLM基线实现了持续的吞吐量提升。我们的结果进一步表明，实际dLLM不仅受限于解码算法，还受限于迁移数据质量和当前块扩散目标的训练低效性，这促使我们联合设计数据、目标、架构和推理系统。

英文摘要

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.

URL PDF HTML ☆

赞 0 踩 0

2606.01755 2026-06-02 cs.AI cs.CL 版本更新

TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment

TriAlign: 迈向个性化大语言模型对齐中的通用真值一致性

Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung

发表机构 * Department of Data Science & AI, Monash University（数据科学与人工智能系，墨尔本大学）； Defence Science and Technology Group, Australia（澳大利亚国防科学与技术集团）

AI总结针对个性化大语言模型在不同社会群体间存在的通用真值不一致问题，提出TriAlign框架，通过离线多智能体强化学习联合优化真值准确性、跨群体一致性和个性化，实现公平对齐。

详情

AI中文摘要

个性化大语言模型根据用户的偏好和社会属性调整响应，但可能在不同社会群体间引入显著的通用真值不一致性，即某些群体在客观任务上系统性地获得较不准确的响应。现有的对齐方法要么忽略个性化，要么主要关注主观偏好对齐，很大程度上忽视了通用真值的公平性和一致性。为填补这一空白，我们研究了真值不变对齐（TIA），这是一个针对个性化LLM的对齐问题，旨在确保通用真值在不同社会群体间保持一致，同时保留个性化。我们提出TriAlign，这是首个用于TIA的离线多智能体强化学习（MARL）框架，其中每个社会群体被建模为一个交互的智能体。TriAlign通过一个公平感知目标和一个显式的不一致性惩罚，联合优化通用真值准确性、跨群体真值一致性和个性化。跨多个基准的实验表明，TriAlign在这三个目标之间实现了比强基线更强的平衡，减少了跨社会群体的通用真值差异，同时提高了客观任务性能和个性化质量。

英文摘要

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an agent interacting. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

URL PDF HTML ☆

赞 0 踩 0

2606.01747 2026-06-02 cs.CL cs.AI 版本更新

Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks

基于BERT和图神经网络的历史知识图谱构建

Ping Li, Bartlomiej Brzozka

发表机构 * Shandong Management University（山东管理大学）； Maria Curie-Sklodowska University（玛丽·居里-斯洛多夫斯卡大学）

AI总结本文提出结合BERT和图神经网络的高层架构，从历史文本中提取实体和关系，构建知识图谱，在精度、召回率和F1分数上优于传统方法和深度学习基线。

Comments 9 pages, 4 figures

详情

AI中文摘要

通过数字人文研究和规模化历史数据分析，大量传统历史文本被转换为结构化知识图谱。本文提出一种结合双向编码器表示（BERT）和图神经网络（GNN）的高层架构，用于从各类历史文本中提取实体和关系。传统历史文本系统地解决了语言歧义、上下文限制的引用以及缺乏既定语法规范的问题。本研究根据上述建议，开发了一种基于FastRQNet和预训练视觉-语言模型Vilt-qaformer+RoBInet的新型图像检索系统。实验充分利用了市政记录、议会文件和历史信函的全面数据集。与传统基于规则的技术和其他流行的深度学习基线相比，联合BERT-GNN系统获得了更高的精度、召回率和F1分数（表2）。该结构在创建知识图谱时能够以足够的准确性和全面性处理复杂的嵌套结构和隐式引用问题。上述实验表明，将关系图学习算法与上下文敏感的语义表示技术相结合，可以自动提取历史数据，为知识库积累累积的智慧。

英文摘要

Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines bidirectional encoder representations of transformers (BERT) and graph neural networks (GNN) to extract the entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.

URL PDF HTML ☆

赞 0 踩 0

2606.01741 2026-06-02 cs.CR cs.AI 版本更新

SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems

SECUREVENT: 面向分布式事件系统的混合AI/ML安全监控

Eric Liang

发表机构 * Oracle

AI总结提出SECUREVENT架构，结合传统安全机制与在线异常检测、图行为特征、复杂事件策略、联邦学习和对抗ML治理，通过混合AI/CEP监控提高召回率并保持低误报率。

详情

AI中文摘要

分布式事件系统已成为互联网规模发布/订阅服务、物联网遥测、云原生微服务和安全运营管道的常见基础。它们的松散耦合和异步交付提高了可扩展性，但也扩大了攻击面：发布者、代理、订阅者、主题、模式和时间顺序都可能被滥用，而没有一个组件能观察整体行为。本文提出了SECUREVENT，一种用于分布式事件系统的混合AI/ML安全监控架构。该架构将传统保护（如认证传输、主题级授权和签名事件）与在线异常检测、图感知行为特征、复杂事件策略规则、联邦学习和对抗ML治理相结合。对合成事件流攻击的确定性原型研究表明，混合AI/CEP监控可以在保持低误报率的同时提高静态规则的召回率。核心主张并非机器学习取代密码学和访问控制机制，而是当事件流、身份、模式和时间关系过于动态以至于静态控制无法单独应对时，基于模型的安全监控是必要的。

英文摘要

Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native microservices, and security operations pipelines. Their loose coupling and asynchronous delivery improve scalability, but they also expand the attack surface: publishers, brokers, subscribers, topics, schemas, and temporal ordering can each be abused without a single component observing the whole behavior. This paper proposes SECUREVENT, a hybrid AI/ML security-monitoring architecture for distributed event-based systems. The architecture combines traditional protections such as authenticated transport, topic-level authorization, and signed events with online anomaly detection, graph-aware behavioral features, complex-event policy rules, federated learning, and adversarial-ML governance. A deterministic prototype study over synthetic event-stream attacks illustrates how a hybrid AI/CEP monitor can improve recall over static rules while retaining a low false-positive rate. The central claim is not that machine learning replaces cryptographic and access-control mechanisms, but that model-based security monitoring is necessary when event flows, identities, schemas, and timing relationships are too dynamic for static controls alone.

URL PDF HTML ☆

赞 0 踩 0

2606.01738 2026-06-02 cs.CL cs.AI 版本更新

捷径通往虚无：揭秘深度虚假回归

Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Rutgers University（罗格斯大学）； Yang AI Lab（杨人工智能实验室）

AI总结针对连续预测中的虚假相关性，提出利用标签和特征空间中虚假属性的相似性来校准分布，从而提升模型在分布偏移下的泛化能力。

详情

AI中文摘要

现实世界中的回归常常存在捷径：在训练中与连续目标虚假相关的属性，在部署偏移下不可靠；使用此类捷径回归目标可能在测试时灾难性失败。现有关于虚假相关性的研究主要关注分类，其中标签是分类的且组是自然定义的。然而，许多现实任务需要连续预测，其中不存在硬标签边界或离散的组-标签对。我们将深度虚假回归（DSR）定义为从具有属性-标签混淆的回归数据中学习，处理连续虚假相关性，并在测试时泛化到所有属性-标签组合。受分类和回归捷径内在差异的启发，我们提出利用标签和特征空间中虚假属性之间的相似性，从而在跨属性校准标签和学习特征分布时考虑邻近目标和相关组。在涵盖计算机视觉、环境感知和大语言模型（LLM）回归的常见真实世界DSR数据集上的大量实验验证了我们策略的优越性能。我们的工作填补了研究连续预测中虚假相关性的基准和技术空白。

英文摘要

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.01722 2026-06-02 cs.LG cs.AI cs.DC 版本更新

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

后确定性分布式系统：可信自主基础设施的新基础

Jun He, Deying Yu

发表机构 * OpenKedge Inc.（OpenKedge公司）

AI总结本文提出后确定性分布式系统（PDDS）模型，以协调确定性代码、随机模型和自主代理共存的异构环境，并定义了五大架构支柱及新的故障分类。

Comments 8 pages, 1 table

详情

AI中文摘要

几十年来，分布式系统通常假设正确的参与者执行协议指定的行为，具有稳定、外部定义和确定性的语义。经典理论广泛参数化了网络时序、通信拓扑和故障域，但参与者模型相对固定。将自主推理引擎、随机模型驱动代理和策略驱动参与者集成到云控制平面、事件响应系统和金融基础设施中，挑战了这一假设的普遍性。这些代理通常产生不同的推理路径、不同的操作轨迹和异构的内部表示，同时实现语义等价且正确的结果。在本文中，我们引入后确定性分布式系统（PDDS）作为研究和工程模型，用于协调确定性代码、随机模型和自主代理共存的异构环境。我们表明，经典分布式计算模型构成了这种参与者通用模型的零歧义特例。我们并非主张确定性系统消失；而是确定性执行不能再作为自主基础设施的通用参与者假设。最后，我们概述了后确定性基础设施的五大架构支柱：协议驱动开发、可验证代理基础设施、自主状态控制平面、语义法定保证和认知状态复制。认知状态复制将持久性和一致性模型从数据可见性扩展到知识可见性，实现代理记忆、可验证语义回滚以及跨推理参与者的连贯性。我们还定义了在此环境中出现的故障类别的分类法。

英文摘要

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

URL PDF HTML ☆

赞 0 踩 0

2606.01719 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Fair Finetuning Mitigates Distribution Inference Attacks

公平微调缓解分布推断攻击

Rakshit Naidu

发表机构 * Rakshit Naidu

AI总结提出公平微调（FFt）方法，通过在等几率约束下对互补分布样本进行微调，将模型公平性指标与分布推断攻击中的对抗优势联系起来，并给出理论界限，实验证明能有效降低攻击成功率。

Comments 16 pages (11 main, 5 appendix)

详情

AI中文摘要

在敏感数据上训练的机器学习模型可能会无意中泄露其训练分布的群体级信息——这种威胁被称为分布推断攻击（DIA）。具有黑盒访问权限的对手可以在不直接观察任何训练数据的情况下推断敏感的人口统计属性，如子群比例。尽管已经提出了差分隐私和属性遗忘等防御措施，但公平性约束与分布泄漏之间的联系尚未被探索。我们提出了公平微调（FFt）：在等几率（EO）约束下，对来自互补分布的样本进行微调。我们提供了完整的理论刻画，证明了紧界 $ ext{Adv}(\mathcal{A},M_f) \le Δ_{ ext{EO}} \cdot W$，其中 $W$ 量化了两个训练分布通过其敏感属性组成的可区分程度。我们还建立了FFt降低对抗优势的必要条件，并证明了该界的紧性。我们在六个数据集上进行了评估，涵盖表格数据（ACS Income、COMPAS、German Credit）、图像数据（UTKFaces）和自然语言处理数据（Bias in Bios）。基于重演的FFt在所有设置中一致地将对抗准确率差距降低到检测阈值 $τ=0.1$ 以下；在ACS Income上，差距从约15%下降到4%以下。我们的工作提供了第一个将模型测量的EO差异直接与其在DIA博弈中的对抗优势联系起来的正式界限，为统一的公平性和隐私防御开辟了新途径。

英文摘要

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions -- a threat known as distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le Δ_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $τ!=!0.1$ across all settings; on ACS Income, the gap falls from $\sim!15%$ to under $4%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.

URL PDF HTML ☆

赞 0 踩 0

2606.01708 2026-06-02 cs.LG cs.AI 版本更新

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

随机极小极大树的双保真度最优动作识别

Peter Chen, Xi Chen

发表机构 * Department of Mathematics, Columbia University（哥伦比亚大学数学系）； Stern School of Business, New York University（纽约大学斯特恩商学院）

AI总结针对随机极小极大树中的固定置信度最优动作识别问题，提出双保真度树搜索算法2FFS，结合极小极大快速扩展与MCTS随机采样，自适应选择廉价有偏评估或昂贵精确评估，理论证明固定置信度正确性、有限停止及多项式深度成本上界，实验表明比现有BAI-MCTS基线显著减少样本和计算。

Comments 36 pages

详情

AI中文摘要

我们研究随机极小极大树中的固定置信度最优动作识别（BAI）。该问题在现代AI规划中日益重要，其中深度极小极大搜索和带有语言模型长滚动的蒙特卡洛树搜索（MCTS）面临一个基本权衡：启发式评估廉价但有偏，而精确滚动可靠但代价高昂。我们提出2FFS，一种双保真度树搜索算法，将多保真度平面赌博机思想引入树中。该算法结合了极小极大风格的快速扩展和MCTS风格的随机采样，自适应地决定何时利用廉价有偏评估以及何时调用昂贵精确评估进行局部认证。我们证明了固定置信度正确性，建立了精确识别的有限停止性，并给出了通用深度树的多项式深度成本上界。在数值随机树实验中，与现有BAI-MCTS基线相比，2FFS使用的样本和计算操作显著减少。

英文摘要

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with language model long rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations comparing to existing BAI-MCTS baseline.

URL PDF HTML ☆

赞 0 踩 0

2606.01703 2026-06-02 cs.SD cs.AI cs.CV 版本更新

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

JenBridge: 跨场景转换的自适应长视频配乐

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

发表机构 * Jen Music AI

AI总结提出JenBridge框架，通过基于Transformer的生成模型、双文本-视觉条件对齐和LLM代理驱动的自适应过渡机制，实现长视频配乐的高保真生成与场景转换自然连贯。

详情

AI中文摘要

我们解决了在场景转换中生成高保真、长格式配乐并保持连贯性的挑战。现有的AI音乐系统主要针对短片段设计，缺乏确保叙事连续性的机制。我们提出了JenBridge，一个模块化且可解释的自适应长视频配乐框架，确保高保真音频生成和转换自然性。核心架构是一个基于Transformer的生成模型，采用流匹配目标训练，遵循两阶段范式：在大规模文本-音频语料库上进行预训练以建立稳健的音乐先验，然后通过双文本-视觉条件适应视频领域以实现精确的跨模态对齐。关键的是，为了实现跨不同场景变化的长格式连贯性，JenBridge引入了一种新颖的自适应过渡机制。该系统具有一个多功能的过渡风格工具包，包括一种生成式过渡方法，并独特地采用了一个大型语言模型（LLM）代理，作为导演智能地为每个叙事转变选择最合适的过渡。为了严格评估这一任务，我们提出了LVS基准，这是一个新基准，包含一个精选数据集和新的评估指标，侧重于整体和过渡感知评估。在提出的基准上进行的大量实验表明，JenBridge在客观和主观指标上均显著优于现有方法，特别是在转换自然性和整体叙事连贯性方面。JenBridge代表了向全自动、专业质量的视频配乐迈出的重要一步。

英文摘要

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

URL PDF HTML ☆

赞 0 踩 0

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

发表机构 * Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA（电气与计算机工程系，信息处理实验室，华盛顿大学，美国）

AI总结针对热行人多目标跟踪中身份碎片化问题，提出轻量级后处理方法，通过在线短间隙重映射和离线轨迹重链接恢复身份连续性，在PBVS热行人MOT基准上提升IDF1。

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419

AI中文摘要

热行人多目标跟踪仍然具有挑战性，因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始，我们添加了一个模块化的身份修复后端，包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明，主要身份增益来自保守的重链接，将IDF1从82.25提升到84.93，同时保持MOTA，而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明，在低信息热图像中，通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析，表明与局部帧到帧关联相比，场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.

URL PDF HTML ☆

赞 0 踩 0

2606.01689 2026-06-02 cs.CV cs.AI 版本更新

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

RPCASSM: 基于鲁棒主成分分析的状态空间模型用于红外小目标检测

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University（教育部符号计算与知识工程重点实验室）； College of Software, Jilin University（吉林大学软件学院）； School of Geosciences, Yangtze University（长江大学地球科学学院）； College of Communication Engineering, Jilin University（吉林大学通信工程学院）

AI总结针对红外小目标检测中主流状态空间模型难以准确建模目标边缘的问题，提出基于鲁棒主成分分析（RPCA）的RPCASSM网络，通过设计背景状态空间模块（BSSM）和目标状态空间模块（TSSM）分别利用空间异质信号显著性和目标稀疏局部高亮特性进行状态空间建模，有效解决了边缘建模难题。

Comments 12 pages, 8 figures, under review

详情

AI中文摘要

红外小目标的检测与分割在监控安防、海上救援等领域具有重要的应用意义。由于这些目标在远距离成像中占据像素少，主流的视觉状态空间模型效率低下且难以准确建模目标边缘。现有的红外状态空间模型并未从红外小目标的结构特性出发偏离主流视觉状态空间结构框架。为了解决这一问题，本文基于鲁棒主成分分析（RPCA）的模型范式提出了RPCASSM网络，旨在通过红外小目标在空间域的性质设计背景状态空间模块（BSSM）和目标状态空间模块（TSSM）。BSSM旨在利用空间异质信号的显著性设计空间探测扫描机制（SPCM）来建模背景信息。TSSM利用目标的稀疏性和局部高亮特性设计可变形提示扫描机制（DPCM），聚焦于目标的可变形空间进行状态空间建模。通过上述设计，我们有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了RPCASSM设计的有效性。我们的代码将在\href{https://github.com/PepperCS/RPCASSM}{RPCASSM}公开。

英文摘要

The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security, maritime rescue and so on. Due to the low occupancy of these targets in long-distance imaging, the mainstream visual state space model is inefficient and difficult to accurately model the target edge. The existing infrared state space models do not deviate from the mainstream visual state space structure framework from the structural properties of infrared small targets. In order to solve this problem, this paper proposes the RPCASSM network based on the model paradigm of robust principal component analysis(RPCA), which aims to design the background state space module(BSSM) and the target state space module(TSSM) by the nature of the infrared small target in the spatial domain. The BSSM aims to use the saliency of spatial heterogeneous signals to design a spatial probe scanning mechanism(SPCM) to model background information. The TSSM designs a deformable prompt scanning mechanism(DPCM) by using the sparsity and local highlight of the target to focus on the deformable space of the target for state space modeling. According to the above design, we effectively solve the problem that the existing mainstream vision state space model is difficult to accurately model the edge structure of infrared small target. Experimental results on the existing benchmark data sets prove the effectiveness of the RPCASSM design. Our code will be made public at \href{https://github.com/PepperCS/RPCASSM}{RPCASSM}.

URL PDF HTML ☆

赞 0 踩 0

2606.01686 2026-06-02 cs.SD cs.AI 版本更新

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

HAIM: 用于AI音乐制作跟踪基准的人机音乐数据集

Seonghyeon Go, Yumin Kim

发表机构 * KAIST（韩国科学技术院）

AI总结针对当前AI音乐检测局限于二元分类的不足，提出HAIM数据集，通过多阶段标签定义“AI音乐跟踪”任务，评估现有检测器缺陷，推动向细粒度结构化评估转变。

详情

AI中文摘要

随着Suno和Udio等生成平台达到人类级音频质量，AI的实用性已扩展到整个音乐制作流程。除了简单的音轨生成，这些进步催生了AI驱动方法在各种形式中的应用，包括声音合成、编曲和专业母带处理。然而，当前的检测研究仍主要局限于二元“AI或人类”范式，未能反映当代音乐制作流程的现实。在真实制作中，AI工具越来越多地被用于优化或母带处理人类制作的音轨，而人类工程师同样对AI生成的材料进行后处理以确保专业质量。此外，用户经常采用对抗策略绕过AI检测器，例如对AI生成的音轨应用人类母带处理。这创造了一个简单的二元分类无法捕捉的灰色地带。在本文中，我们定义并研究“AI音乐跟踪”：在音乐制作的多面光谱中识别特定AI集成的挑战。为此，我们引入HAIM，一个具有音乐制作阶段多样化标签的数据集。它旨在隔离AI干预的阶段，包括混合制作和代理级跟踪。我们对最先进检测器的评估揭示了系统性缺陷。通过发布HAIM，我们提出了一个新的基准，将领域从二元分类转向对AI音乐的细粒度结构化评估。

英文摘要

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary `AI-or-human' paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate ``AI Music Tracking'': the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

URL PDF HTML ☆

赞 0 踩 0

2606.01682 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

现成的大语言模型作为过程评分器：数学推理中PRM的无训练替代方案

Atoosa Chegini, Soheil Feizi

发表机构 * Department of Computer Science, University of Maryland（马里兰大学计算机科学系）

AI总结提出Chunk-Level Guided Generation方法，利用现成的大语言模型作为过程评分器，通过固定长度块评分和对比选择规则，无需训练即可在数学推理中匹配或超越PRM引导搜索的性能。

详情

AI中文摘要

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略，但当小模型已经陷入错误推理路径时，该策略会失败。PRM引导搜索通过在生成过程中对候选延续进行评分来避免这一问题，但需要经过步骤级标签训练的奖励模型。我们提出Chunk-Level Guided Generation，一种无训练的替代方案，使用现成的大语言模型作为过程评分器。在每一步，小模型采样k个固定长度的候选块，而大模型使用似然度对候选块进行评分，无需生成任何文本。选中的块在下一步之前被提交，从而在错误传播之前引导生成。我们用两种选择规则实例化该框架：似然引导选择（LGS），选择具有最高长度归一化大模型对数概率的块；以及对比引导选择（CGS），减去小模型的对数概率，以偏向于大模型偏好与小模型偏好不同的块。我们证明，由于系统性的长度偏差（即使在长度归一化后仍然存在），使用大模型似然度对可变长度推理步骤进行评分是不可靠的，而固定长度块避免了这一混淆。在GSM8K、MATH、Minerva Math、AMC23和AIME24上，使用Qwen2.5-32B引导Qwen2.5-1.5B以及Llama-3.1-70B引导Llama-3.2-1B，CGS在多数投票上最多提升28个百分点，并且在匹配的引导预算下，在大多数基准测试中匹配或超越了Qwen2.5-Math-PRM-72B引导搜索，且无需奖励模型训练。使用Qwen2.5-72B引导Qwen2.5-7B，CGS在k=16时在MATH上达到81.8%，在Minerva Math上达到63.6%，超过多数投票4-6个百分点。最后，Chunk-Level Guided Generation产生的推理轨迹比PRM引导搜索短得多。

英文摘要

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.

URL PDF HTML ☆

赞 0 踩 0

2606.01670 2026-06-02 cs.IR cs.AI 版本更新

Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation

基于偏好解耦的时间感知扩散用于生成式推荐

Bangguo Zhu, Peng Huo, Yuanbo Zhao, Zhicheng Du, Jun Yin, Senzhang Wang

发表机构 * Central South University（中南大学）； National Super Computing Center（国家超算中心）； Renmin University of China（中国人民大学）； Hong Kong Polytechnic University（香港理工大学）

AI总结针对现有扩散生成式推荐模型忽略用户偏好时间非平稳分布的问题，提出TDPM框架，通过将用户偏好解耦为长期周期偏好和短期点状偏好并融入扩散过程，在三个数据集上HR@20和NDCG@20平均提升29.21%和25.45%。

详情

AI中文摘要

最近，生成式推荐（GRs）通过用语义索引（SIDs）取代传统项目ID，成为一种变革性的推荐范式。由于扩散模型卓越的生成能力，一些开创性工作探索了以扩散架构为骨干开发GRs。然而，现有基于扩散的GRs的一个致命限制是扩散过程统一应用于历史交互中的所有项目。相比之下，用户偏好由多方面的时变因素塑造，因此在时间维度上呈现非平稳分布。为弥补这一差距，本研究提出一种新颖的GR框架，名为TDPM，通过在SID令牌上设计时间感知扩散。具体而言，TDPM将时变用户偏好的影响明确整合到扩散过程中。详细地，用户偏好被解耦为（i）长期一致的周期偏好和（ii）由近期焦点事件触发的点状偏好。在三个公开真实数据集上的大量实验表明，TDPM显著优于最先进的基线模型。TDPM在HR@20和NDCG@20上分别实现了平均高达29.21%和25.45%的提升。消融研究进一步强调了基于扩散的GRs中时间感知令牌扩散的必要性。

英文摘要

Recently, Generative Recommenders (GRs) have emerged as a transformative recommendation paradigm by replacing traditional item IDs with semantic indices (SIDs). Owing to the exceptional generative capabilities of diffusion models, a few pioneering works explore developing GRs with diffusion architectures as the backbone. However, a fatal limitation of existing diffusion-based GRs is that the diffusion process applies uniformly to all items within the historical interactions. In contrast, the user preference is shaped by multifaceted time-evolving factors and thus exhibits a non-stationary distribution in the temporal aspect. To bridge this gap, this study proposes a novel GR framework, named TDPM, by designing the time-aware diffusion on SID tokens. Specifically, TDPM explicitly integrates the impact of time-evolving user preferences into the diffusion process. In detail, the user preference is disentangled into (i) the period preference, which remains consistent over a long time-span, and (ii) the point preference, which is triggered by recent focal events. Extensive experiments on three public real-world datasets demonstrate the significant superiority of TDPM over the state-of-the-art baselines. TDPM achieves average improvements of up to 29.21% and 25.45% in terms of HR@20 and NDCG@20, respectively. The ablation study further underscores the necessity of time-aware token diffusion in diffusion-based GRs.

URL PDF HTML ☆

赞 0 踩 0

2606.01666 2026-06-02 cs.LG cs.AI 版本更新

DOT-MoE: Differentiable Optimal Transport for MoEfication

DOT-MoE：用于MoE化的可微最优传输

Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出DOT-MoE框架，通过可微最优传输将密集层分解为专家，联合学习神经元分配和路由策略，在减少50%活跃参数的同时保留90%原始性能。

Comments Accepted at ICML 2026

详情

AI中文摘要

大型语言模型（LLMs）的扩展带来了显著的性能提升，但也造成了推理效率方面的重大挑战。虽然混合专家（MoEs）架构通过将模型大小与推理成本解耦来解决这一问题，但从头训练MoEs通常不稳定且计算密集。将预训练的密集模型转换为稀疏MoEs已成为一种替代方案；然而，现有方法通常依赖启发式神经元聚类或随机分割来将前馈网络（FFN）划分为专家。在这项工作中，我们提出了DOT-MoE，一种新颖的框架，将密集层的分解建模为可微最优传输（DOT）问题。与静态启发式方法不同，我们将神经元分配建模为平衡传输问题，利用可微的Sinkhorn-Knopp迭代来强制执行严格的专家容量约束。此外，我们利用直通估计器（STE）来联合学习离散的神经元到专家的分配和令牌到专家的路由策略。跨多个架构和基准的大量实验表明，DOT-MoE显著优于结构化剪枝、启发式聚类和随机分割基线，在减少50%活跃参数的同时保留了原始密集模型90%的性能。

E4GEN：事件级可解释的极端增强时间序列生成

Lin Jiang, Dahai Yu, Ximiao Li, Guang Wang

发表机构 * Florida State University（佛罗里达州立大学）

AI总结提出E4GEN可解释扩散框架，通过E-Activator、E-Predictor和E-Control三个组件实现事件级极端事件可控生成，在整体保真度、极端事件保真度和下游效用上优于现有方法。

Comments 48 pages,26 figures

详情

AI中文摘要

生成逼真的时间序列对于科学研究和实际应用至关重要。然而，现有方法通常强调整体分布保真度，而未能忠实捕捉极端事件。为了推进现有研究，我们提出了E4GEN，一个用于极端事件感知时间序列生成的可解释扩散框架。E4GEN通过三个关键组件提供了关于何时、什么以及如何控制极端事件生成的系统见解。首先，E-Activator在去噪过程中学习数据集自适应的极端控制信号激活步骤，而不干扰常规时间成分，包括趋势和季节性。其次，E-Predictor通过自驱动语义预测确定要强制执行的控制信号，其中每个样本通过推断生成过程中的潜在极端事件信息来导出其自身的控制信号。它还包括一种新颖的数据条件训练、噪声初始化采样机制，以解决训练标签不可用的问题。第三，E-Control通过可训练的极端控制网络指定如何控制极端事件生成，该网络将语义控制信号转换为逐层信号并将其注入去噪过程。我们在六个数据集上使用17个指标评估了E4GEN，大量实验表明，E4GEN在多个维度上优于最先进的模型，包括整体保真度、极端事件保真度和下游效用。

英文摘要

Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasize overall distributional fidelity while failing to faithfully capture extreme events. To advance existing research, we propose E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. E4GEN provides systematic insights into when, what, and how to control extreme-event generation through three key components. First, E-Activator learns the dataset-adaptive extreme-control signal activation step during the denoising process without interfering with regular temporal components, including trend and seasonality. Second, E-Predictor determines what control signal to enforce through Self-Driven Semantic Prediction, where each sample derives its own control signal by inferring latent extreme-event information during generation. It also includes a novel Data-Conditioned Training, Noise-Initiated Sampling mechanism to address the issue of unavailable training labels. Third, E-Control specifies how to control extreme-event generation through a trainable Extreme Control Network, which transforms the semantic control signal into layer-wise signals and injects it into the denoising process. We evaluate E4GEN on six datasets with 17 metrics, and extensive experiments show that E4GEN outperforms state-of-the-art models across multiple dimensions, including overall fidelity, extreme-event fidelity, and downstream utility.

URL PDF HTML ☆

赞 0 踩 0

2606.01632 2026-06-02 cs.GT cs.AI 版本更新

重新审视知识编辑中的涟漪效应：通过压力感知联合邻域优化

Haoben Huang, Shuxin Liu, Ou Wu, Di Gao

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences（杭州高等研究院，中国科学院大学）

AI总结针对大语言模型单次编辑引发的涟漪效应，提出联合邻域优化框架，通过压力感知协调和语义预执行门控联合优化可编辑侧与保留侧的耦合压力，在RippleEdits上传播与保留指标提升至少7.0%。

详情

AI中文摘要

大语言模型中的单次编辑更新会在局部知识邻域中引发涟漪效应：理想情况下传播到相关事实，同时意外扰动应保留的事实。现有方法分别处理这两种效应，而未显式建模它们的耦合。我们通过分析典型基线中的涟漪响应挑战这种分离，识别出两种耦合的设计压力：可编辑侧协调和保留侧泄露。我们提出联合邻域优化（JNO），一种新的知识编辑框架，在目标规划阶段形式化并联合处理这两种压力。JNO通过压力感知协调（PAC）实例化这一原则，该协调在耦合约束下联合优化邻域目标表示，并设置语义预执行门控，在参数执行前拒绝高风险目标计划。在RippleEdits上的实验表明，JNO在保持跨骨干编辑稳定性的同时，传播和保留指标至少提升7.0%。

英文摘要

Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to related facts and unintended perturbation of preserved ones. Existing methods address these two effects separately, without explicitly modeling their coupling. We challenge this separation through an analysis of ripple responses across typical baselines, identifying two coupled design pressures: editable-side coordination and preserved-side leakage. We propose Joint Neighborhood Optimization (JNO), a new knowledge-editing framework to formalize and jointly address both pressures at the target-planning stage. JNO instantiates this principle through Pressure-Aware Coordination (PAC), which jointly optimizes neighborhood target representations under coupled constraints, and a semantic pre-execution gate that rejects high-risk target plans before parameter execution. Experiments on RippleEdits show JNO improves propagation and preservation metrics by at least 7.0% while preserving cross-backbone editing stability.

URL PDF HTML ☆

赞 0 踩 0

2606.01607 2026-06-02 cs.LG cs.AI 版本更新

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

FedMTFI: 异构联邦学习环境中基于特征重要性优化的多教师知识蒸馏

Nazmus Shakib Shadin, Aaron Cummings, Xinyue Zhang, Bobin Deng

发表机构 * Department of Computer Science, Kennesaw State University, Marietta, GA, 30060 USA（计算机科学系，肯纳邦大学，马里埃塔，GA，30060 USA）

AI总结提出FedMTFI架构，通过结合多教师知识蒸馏与Shapley值特征重要性，在异构联邦学习中提升模型准确性和可解释性。

Comments Accepted by IJCNN 2026

详情

AI中文摘要

联邦学习（FL）是一种去中心化方法，能够在无需暴露原始数据的情况下实现协作模型训练。它允许设备仅共享模型权重，而将个人数据保留在本地并确保安全，从而避免了敏感数据的传输。然而，在现实环境中，设备持有的数据往往分布不均，且设备在计算能力和内存容量上大多存在差异。这些差异使得FL难以在整个系统中保持一致的性能。为了解决这些问题，我们提出了FedMTFI，一种新颖的架构，它将多教师知识蒸馏（MTKD）与特征重要性相结合，以改善异构环境中的FL过程。在FedMTFI中，客户端根据相似的硬件和模型类型进行聚类。每个聚类在非独立同分布（non-IID）数据上训练特定模型。在聚类内部，每个客户端仅使用自己的本地私有数据更新该模型。然后，服务器使用FedAvg对每个聚类中的本地训练模型进行聚合，形成多个原型模型。接着，这些原型作为教师模型，通过MTKD训练一个全局通用的学生模型。FedMTFI的独特之处在于集成了Shapley值（SHAP），以在蒸馏过程中强调重要特征，从而提高了准确性和可解释性。实验结果表明，FedMTFI比传统FL算法实现了更高的准确性，并且在non-IID数据条件下表现更有效。

英文摘要

Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transferring sensitive data, it allows devices to share only model weights, keeping personal data locally and secure. However, in real world settings, the data held by devices is often not evenly distributed and devices mostly differ in computing power and memory capacity. These differences make FL harder to maintain consistent performance across the system. To address these issues, we propose FedMTFI, a novel architecture that combines multi-teacher knowledge distillation (MTKD) with feature importance to improve the FL process in heterogeneous environments. In FedMTFI, clients are clustered based on similar hardware and model types. Each cluster trains a specific model on not independently and identically distributed (non-IID) data. Within a cluster, every client updates that model using only its own local private data. The server then aggregates the locally trained models in each cluster using FedAvg to form multiple prototype models. Then these prototypes serve as teacher models to train a global generalized student model using MTKD. What makes FedMTFI more unique is the integration of Shapley values (SHAP) to emphasize important features during distillation, which enhances both accuracy and interpretability. Experimental results show that FedMTFI achieves higher accuracy than traditional FL algorithms and performs more effectively under non-IID data conditions.

URL PDF HTML ☆

赞 0 踩 0

2606.01599 2026-06-02 cs.AI 版本更新

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

TRON：面向视觉推理强化学习的目标化规则可验证在线环境

Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun

发表机构 * University of Georgia（佐治亚大学）

AI总结提出TRON在线环境框架，通过可控生成-验证程序产生无限训练实例，支持视觉推理强化学习，在多个多模态基准上提升性能。

Comments 27 pages, 8 figures

详情

AI中文摘要

视觉推理的强化学习（RL）需要可扩展、可验证且可控的训练信号。现有的视觉RL后训练在静态策划数据集上进行，其图像-问题-答案样本受限于收集预算。本文引入TRON（目标化、规则可验证的在线环境），一种在线环境基底：训练rollout由可控的生成-验证程序按需生成，该程序采样新的潜在视觉状态，渲染图像，提出问题，并精确验证答案。因此，单次运行可以按当前课程所需的难度级别抽取无限的新实例流。当前TRON套件包含520个环境，组织成五个能力桶（空间、数学、图表、模式/逻辑和计数）；同一基底支持在所有桶上训练的单个完整模型以及每个桶的能力专家模型，无需额外数据收集。我们还引入了基底分析，涵盖生成可靠性、实例和级别多样性、跨环境近似重复以及按难度级别的基础模型通过率。使用METHOD进行RL后训练在Qwen3-VL-4B、Qwen2.5-VL-7B和MiMo-VL-7B-SFT上的十个外部多模态推理基准上持续提升性能。

英文摘要

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with METHOD consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

URL PDF HTML ☆

赞 0 踩 0

2606.01584 2026-06-02 cs.CL cs.AI 版本更新

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

识别LLM中高置信度的社会偏见以构建可信的对话辅导代理

Aitor Arronte Alvarez, Naiyi Xie Fincham

发表机构 * University of Hawaii at Manoa（夏威夷大学马诺亚分校）

AI总结本研究通过生成对话数据集，评估大型语言模型在辅导场景中检测社会偏见的能力，发现模型在对话上下文中比基准测试更难检测偏见，且对错误判断过度自信，影响推理和反馈。

Comments Accepted for AIED 2026

详情

AI中文摘要

对话辅导代理已被证明能提高学习参与度和学生成绩，大型语言模型（LLM）越来越多地被用于这些系统以提供可扩展的个性化反馈。然而，LLM可能会延续或放大刻板的社会偏见，在教育环境中带来特殊风险。在本研究中，我们评估了LLM在对话辅导场景中的表现，以识别高置信度的社会偏见，即模型在无法识别辅导对话中的偏见判断时仍保持高度自信，可能影响其推理和向学习者提供的反馈。我们提出了一种新的数据集生成方法，通过重新生成学生-AI辅导教师互动并引入来自基准数据集的受控偏见轮次，实现在自然教学条件下的偏见评估。利用这些数据，我们评估了多个LLM检测刻板偏见的能力，并通过计算和人工评估分析了其响应背后的置信度和推理。我们发现，在对话辅导上下文中，偏见检测比基于基准的评估更具挑战性，且最先进的LLM对其刻板偏见陈述的错误评估过于自信。此外，模型置信度强烈影响推理和反馈，突显了基于LLM的辅导代理中过度自信和偏见行为的风险。最后，我们讨论了影响、缓解考虑和未来研究方向。

英文摘要

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs' ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research.

URL PDF HTML ☆

赞 0 踩 0

2606.01560 2026-06-02 cs.LG cs.AI 版本更新

TN-SHAP-G：用于Shapley值和交互的图结构张量网络代理

Farzaneh Heidari, Guillaume Rabusseau

发表机构 * University of Washington（华盛顿大学）； CNRS（法国国家科学研究中心）

AI总结提出TN-SHAP-G框架，利用图结构输入通过张量网络代理高效计算Shapley值和高阶交互指数。

详情

AI中文摘要

Shapley值是一种广泛使用的工具，用于归因黑盒模型中输入变量的重要性和交互，但其计算涉及定义在指数级子集空间上的函数。我们提出TN-SHAP-G，一个利用图结构输入中的结构高效计算Shapley值和高阶交互指数的框架。给定一个预测器和一个固定的掩码方案，TN-SHAP-G学习一个紧凑的、与图对齐的多线性代理，该代理近似掩码输入行为，表示为拓扑结构反映输入图的张量网络。一旦从少量oracle查询中训练完成，该代理通过多线性扩展实现一阶和高阶Shapley指数的确定性恢复，无需额外模型查询或蒙特卡洛方差。分子基准实验表明，学习到的分解在小图上紧密匹配精确Shapley值，并能高效扩展到基于采样的方法不可行的更大图。

英文摘要

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.

URL PDF HTML ☆

赞 0 踩 0

2606.01528 2026-06-02 cs.AI 版本更新

Joint Agent Memory and Exploration Learning via Novelty Signals

通过新颖性信号实现联合智能体记忆与探索学习

Shizuo Tian, Xiaohong Weng, Rui Kong, Yuxuan Chen, Guohong Liu, Yuebing Song, Jiacheng Liu, Yuchen Li, Dawei Yin, Ting Cao, Yunxin Liu, Yuanchun Li

发表机构 * Tsinghua University（清华大学）； Sun Yat-sen University（中山大学）； Baidu Inc.（百度公司）； Tongji University（同济大学）； Peking University（北京大学）

AI总结提出JAMEL框架，利用新颖性信号联合训练智能体记忆与探索策略，在开放环境中实现高效探索并泛化到未见环境。

详情

AI中文摘要

在开放环境中，探索对于自主智能体至关重要，但当前的语言模型智能体难以做到这一点。有效的探索需要记忆，但保留原始交互历史在长轨迹中计算成本高昂。虽然潜在记忆提供了压缩交互历史的解决方案，但其训练缺乏可靠的监督信号。我们提出了联合智能体记忆与探索学习（JAMEL），这是一个通过新颖性驱动的交互来共同训练智能体记忆和探索策略的框架。我们观察到记忆和探索形成了一个相互依赖的循环：持续的探索需要记忆来区分已耗尽的行为和未见过的新行为，而寻求新颖性的交互提供了使记忆对未来探索有用的监督。通过利用确定性和持久的新颖性信号（如GUI领域的代码覆盖率），我们为记忆模块提供了自然的、无需标注的监督。实证评估表明，我们的方法成功泛化到未见环境。其探索能力优于开放权重基线，并与闭源模型的探索深度相媲美，同时减少了token消耗。我们的代码和模型已在https://github.com/MobileLLM/JAMEL开源。

英文摘要

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solution to compress interaction histories, its training lacks reliable supervisory signals. We introduce \textbf{J}oint \textbf{A}gent \textbf{M}emory and \textbf{E}xploration \textbf{L}earning (\textbf{JAMEL}), a framework that trains agentic memory and exploration policy together through novelty-driven interaction. We observe that memory and exploration form a mutually dependent loop: sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, while novelty-seeking interaction provides the supervision needed to make memory useful for future exploration. By utilizing deterministic and persistent novelty signals such as code coverage in the GUI domain, we provide natural, annotation-free supervision for the memory module. Empirical evaluations demonstrate that \ours successfully generalizes to unseen environments. Its exploration capability outperforms open-weight baselines and rivals the exploration depth of a closed-source model while reducing token consumption. Our code and model are open-sourced at https://github.com/MobileLLM/JAMEL.

URL PDF HTML ☆

赞 0 踩 0

2606.01520 2026-06-02 cs.AI 版本更新

代理操作系统 (AOS)：将代理控制平面集成到传统操作系统及其之外

Ankur Sharma, Deep Shah

发表机构 * Independent Researcher（独立研究员）

AI总结本文提出代理操作系统（AOS）架构，通过集成代理控制平面到现有操作系统或逐步接管部分OS职责，以解决传统OS在调度、内存管理、安全、可观测性和治理方面对长期目标导向的代理AI工作负载的局限性。

详情

AI中文摘要

传统操作系统围绕确定性程序、显式控制流和人类发起的工作流设计。其核心抽象——进程、线程、系统调用、文件和权限——假设有界行为和可预测的交互模式。代理AI系统引入了一种不同的执行模型：长期存在、目标导向的实体，它们进行概率推理、动态调用工具，并根据反馈调整行为。虽然代理目前可以作为用户空间应用程序实现，但其执行特性在调度、内存和状态管理、安全性、可观测性和治理方面对操作系统边界施加了压力。本文引入了代理操作系统（AOS）的概念，这是一种将代理控制平面集成到现有操作系统中的系统架构，或者在某些模型中，随着时间的推移逐步接管选定的操作系统职责。我们提供了AOS的精确定义、明确的假设和非目标，并将AOS职责结构分解为调度器、上下文和内存管理、工具和能力注册表、策略和信任执行、以及可观测性和审计。我们分析了经典操作系统抽象对代理工作负载的局限性，提出了从用户空间运行时到分布式控制平面的集成模型，并将AOS概念映射到Linux和Windows原语。我们提出了安全性和安全性影响，包括代理特定的威胁模型，并定义了强调确定性执行、可审计性和操作员可理解性的评估标准。目标不是完全取代操作系统，而是为代理计算建立一个严格的系统基础，使其在大规模下保持可控、可问责和安全。

英文摘要

Traditional operating systems were designed around deterministic programs, explicit control flow, and human initiated workflows. Their core abstractions processes, threads, system calls, files, and permissions assume bounded behavior and predictable interaction patterns. Agentic AI systems introduce a different execution model: long-lived, goal-directed entities that reason probabilistically, invoke tools dynamically, and adapt behavior based on feedback. While agents can be implemented as user-space applications today, their execution characteristics stress OS boundaries in scheduling, memory and state management, security, observability, and governance. This paper introduces the concept of an Agent Operating System (AOS), a systems architecture that integrates an agentic control plane into existing operating systems or, in some models, subsumes selected OS responsibilities over time. We provide a precise definition of an AOS, explicit assumptions and non-goals, and a structured decomposition of AOS responsibilities into schedulers, context and memory management, tool and capability registries, policy and trust enforcement, and observability and audit. We analyze limitations of classical OS abstractions for agent workloads, propose integration models from user-space runtimes to distributed control planes, and map AOS concepts onto Linux and Windows primitives. We present security and safety implications, including agent specific threat models, and define evaluation criteria that emphasize deterministic enforcement, auditability, and operator comprehensibility. The objective is not to replace operating systems wholesale, but to establish a rigorous systems foundation for agentic computation that remains controllable, accountable, and secure at scale.

URL PDF HTML ☆

赞 0 踩 0

2606.01503 2026-06-02 cs.CV cs.AI cs.CL 版本更新

On the Limits of Token Reduction for Efficient Unified Vision Language Training

论高效统一视觉语言训练中令牌缩减的极限

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

发表机构 * University of Michigan（密歇根大学）； Sony AI（索尼人工智能）

AI总结本文通过分析层注意力分配，发现视觉理解与视觉生成在令牌冗余上存在不对称性，设计任务特定加速器，但统一训练中任务特定令牌丢弃导致协同损失，表明高效统一建模需保留共享跨任务结构。

详情

AI中文摘要

统一视觉语言模型（VLM）在单个自回归骨干中集成了视觉理解和视觉生成，但其联合训练计算成本高昂且从效率角度常被忽视。在这项工作中，我们研究了基于令牌缩减的加速在统一VLM训练中的可行性和极限。通过对逐层注意力分配的系统分析，我们揭示了一个基本的不对称性：视觉理解在后期层表现出显著的视觉冗余，而视觉生成在深度上对图像令牌保持持续依赖。受此观察启发，我们设计了任务特定的加速器，针对每个目标选择性地减少图像令牌计算。虽然这些方法在孤立设置中实现了显著的效率提升，但我们在统一训练下观察到一致的协同损失——任务特定的令牌丢弃需要不同的参数路径，并消除了联合优化中通常观察到的相互性能增益。我们的发现表明，高效统一建模需要保留共享的跨任务结构，强调了需要协同感知的加速策略。项目页面：https://chicychen.github.io/TokenReductionUnifiedVLM/。

英文摘要

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

URL PDF HTML ☆

赞 0 踩 0

2606.01502 2026-06-02 cs.DC cs.AI cs.NI 版本更新

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

移动查询，而非缓存：跨GPU结构中的跨实例潜在注意力再分布特征

Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein

AI总结本研究通过真实多节点H100集群实验，刻画了多头部潜在注意力（MLA）在跨实例场景下的性能特征，提出了拓扑感知成本模型和路由/获取/本地谓词，证明在多数情况下路由查询比移动缓存更高效。

详情

AI中文摘要

前沿大语言模型越来越多地使用稀疏注意力索引器来决定查询关注的内容，该索引器为每个查询挑选几个KV缓存块：注意力的单位现在是一个小的、可重用的块。代理工作负载频繁使用这一机制：许多子代理查询一个大型代码库，重用相同的块。当语料库超出单个GPU容量时，它会被分区到多个实例上，因此查询及其选择的块通常位于不同的GPU上：回答查询意味着跨实例的注意力。先前跨实例KV系统的惯常做法是移动缓存：将选定的块拉到请求方。多头部潜在注意力反转了计算方式，将每个令牌的键和值压缩成一个窄向量，因此路由的查询行只有约1 KB，比它注意的块还小；此时路由查询通常比移动缓存更便宜。哪种原语在哪种结构和请求形状下胜出，尚未被研究，尤其是在设备发起的RDMA上，该技术使得每个请求的跨节点传输成本很低。我们在真实的多节点H100集群上刻画了跨实例MLA注意力的特征，提炼出两个可重用的产物：一个拓扑感知的成本模型（探测/传输/计算/返回/合并）和一个闭合形式的路由/获取/本地谓词，我们在真实的IBGDA上测量了其常数，该模型跟踪批量往返的误差在约7%以内。在解码阶段，它路由查询，将移动缓存的成本（连续块的约3毫秒重新适应拼接，或选择下的分散收集）替换为数十微秒的往返，并根据探测延迟而非峰值带宽选择结构。我们为MLA实例化了成本模型和谓词，但两者并非MLA特有：它们适用于任何通过压缩或稀疏选择将注意力缩小到小块的情况（如当前的DeepSeek-V3.2、V4和GLM-5.1）。将它们扩展到新架构只需测量两个系数：路由的有效载荷和获取的移动缓存成本。

英文摘要

Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents query one large codebase, reusing the same blocks. When that corpus outgrows one GPU it is partitioned across instances, so a query and the blocks it selects often sit on different GPUs: answering it means attention across instances. The reflex of prior cross-instance KV systems is to move the cache: pull the selected blocks to the requester. Multi-head Latent Attention inverts the arithmetic, compressing each token's key and value into one narrow vector, so a routed query row is only ~1 KB, smaller than the chunk it attends; routing the query is then often cheaper than moving the cache. Which primitive wins, over which fabric and request shape, is uncharted, least of all on device-initiated RDMA that makes per-request cross-node transfers cheap. We characterize cross-instance MLA attention on a real multi-node H100 cluster, distilling two reusable artifacts: a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, whose constants we measure on real IBGDA, where the model tracks batched round-trips to within ~7%. At decode it routes the query, trading the cost of moving the cache (a ~3 ms re-adaptation splice for a contiguous chunk, or a scattered gather under selection) for a tens-of-microsecond round trip, and picks the fabric by probe latency, not peak bandwidth. We instantiate the cost model and predicate for MLA, but neither is MLA-specific: they apply wherever compression or sparse selection shrinks attention to small chunks (DeepSeek-V3.2, V4, and GLM-5.1 today). Extending them to a new architecture requires measuring just two coefficients: the routed payload and fetch's move-the-cache cost.

URL PDF HTML ☆

赞 0 踩 0

2606.01498 2026-06-02 cs.CL cs.AI 版本更新

MURMUR：一种高效的长时间语音识别推理系统

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

发表机构 * University of Washington（华盛顿大学）

AI总结提出MURMUR推理系统，通过块间和块内两级优化，在保持高精度的同时显著降低长时间语音识别的延迟。

详情

AI中文摘要

长时间自动语音识别（ASR）需要高精度和低延迟，但现有系统迫使两者之间进行权衡。基于块的流水线在并行窗口中处理音频以实现低延迟，但丢失了跨块上下文，并且需要脆弱的启发式方法来对齐边界处的说话人和时间戳。长上下文ASR模型通过单次传递解决所有问题以获得更好的准确性，但速度慢一个数量级。我们提出MURMUR，一个通过两级操作克服这种权衡的推理系统。在块间级别，我们重新审视基于块的流水线以适应现代长上下文ASR，将块大小视为可调超参数，并表明中间块大小在准确性和延迟之间取得了良好的平衡。在块内级别，我们通过应用于输出和语音令牌的滑动窗口KV缓存驱逐策略来利用注意力稀疏性。在AMI-IHM上，MURMUR匹配单次传递准确性，同时将延迟降低4.2倍，通过令牌驱逐进一步获得收益，相对tcpWER退化小于1%。MURMUR的代码可在https://github.com/uw-syfi/Murmur获取。

英文摘要

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.

URL PDF HTML ☆

赞 0 踩 0

2606.01473 2026-06-02 cs.AI cs.HC 版本更新

A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation

极简脑机音乐接口用于实时情感驱动声化：系统设计与初步评估

Pablo A. Monroy-D'Croz, Rafael Ramirez-Melendez, Julian Cespedes-Guevara

发表机构 * GitHub

AI总结本文提出一种极简脑机音乐接口，通过前额EEG活动实时估计情感效价并映射到音乐特征，实验发现额叶alpha不对称性无法可靠区分指令性情绪状态。

详情

AI中文摘要

本文提出一种极简脑机音乐接口（BCMI），作为实时情感声化系统，将前额EEG活动转化为自适应音乐。通过额叶alpha不对称性（AF7/AF8）估计情感效价，并通过随机生成算法映射到音乐特征，如调式、速度、节奏密度和音高音域。系统集成了无线EEG采集、实时Python信号处理以及通过Lab Streaming Layer同步的Ableton Live音乐生成。一项包含22名参与者的实验探究了有意情感自我诱导是否能调节BCMI神经反馈信号。线性混合效应分析发现目标情绪或时间无显著效应，表明额叶alpha不对称性信号无法可靠区分指令性情绪状态。个体差异（包括音乐训练和表演经验）解释了比实验操作更多的方差，后者仅占总信号方差的0.40%。这些发现凸显了使用额叶alpha不对称性作为闭环情绪调节的自愿控制信号的挑战，并为未来BCMI研究提出了方法论方向。

英文摘要

This paper presents a minimalist brain-computer Musical Interface (BCMI) that functions as a real-time affective sonification system, translating prefrontal EEG activity into adaptive music. Emotional valence is estimated from frontal alpha asymmetry (AF7/AF8) and mapped to musical features such as mode, tempo, rhythmic density, and pitch register through a stochastic generative algorithm. The system integrates wireless EEG acquisition, real-time Python signal processing, and Ableton Live-based music generation synchronized via Lab Streaming Layer. An experiment with 22 participants investigated whether intentional emotional self-induction could modulate the BCMI neurofeedback signal. Linear mixed-effects analyses found no significant effects of target emotion or time, indicating that the frontal alpha asymmetry signal did not reliably distinguish instructed emotional states. Individual differences, including musical training and acting experience, explained more variance than the experimental manipulation, which accounted for only 0.40\% of total signal variance. These findings highlight the challenges of using frontal alpha asymmetry as a voluntary control signal for closed-loop emotion regulation and suggest methodological directions for future BCMI research.

URL PDF HTML ☆

赞 0 踩 0

2606.01470 2026-06-02 physics.flu-dyn cs.AI cs.LG 版本更新

跨干预因果贝叶斯优化的信息传递

Mohammad Ali Javidian

发表机构 * Computer Science Department（计算机科学系）

AI总结提出图耦合因果贝叶斯优化方法，通过共享因果参数的不确定性连接不同干预效应，实现跨干预信息传递，在可识别线性高斯因果模型中证明低秩核性质和次线性遗憾界。

详情

AI中文摘要

贝叶斯优化是一种优化昂贵系统的流行方法，其中每次实验、模拟或干预都会耗费时间或金钱。在其标准形式中，它将我们控制的变量视为黑盒的普通输入，无法区分单纯的相关性与真正的因果关系。因果贝叶斯优化通过使用已知因果图结合观测数据来决定哪些变量值得干预，从而部分弥补了这一差距。然而，现有方法几乎孤立地学习每种可能干预的效果，尽管在因果系统中这些效果通常共享相同的底层机制。我们提出图耦合因果贝叶斯优化，通过我们对一小部分共享因果参数的不确定性，将不同的干预效果联系在一起。结果是一个因果核，使得从一次干预收集的证据能够改进我们对相关干预的估计。对于可识别的线性高斯因果模型，我们证明该核具有低秩，其秩由共享参数的数量而非干预菜单的大小界定。这进而产生一个信息增益界，该界仅随优化范围对数增长，以及一个遗憾界，清晰地将三种误差来源分开：优化、因果估计以及考虑哪些干预集的选择。我们还描述了非线性和自适应扩展。在与理论一致的高斯系统、共享机制压力测试以及标准因果优化基准测试中，该方法保持了因果贝叶斯优化的优势，同时实现了跨相关干预的信息传递，当对目标父节点的直接干预不可用且稀疏的干预数据必须在一大组候选干预中重复使用时，增益最为明显。

英文摘要

Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or money. In its standard form, it treats the variables we control as plain inputs to a black box and cannot tell apart mere correlation from a real cause and effect. Causal Bayesian optimization closes part of this gap by using a known causal graph together with observational data to decide which variables are worth intervening on. Existing methods, however, learn the effect of each possible intervention almost in isolation, even though in a causal system these effects usually share the same underlying mechanisms. We propose graph-coupled causal Bayesian optimization, which ties the different intervention effects together through the uncertainty we have about a small set of shared causal parameters. The result is a causal kernel that lets evidence collected from one intervention improve our estimate of related interventions. For identifiable linear Gaussian causal models, we show that this kernel has low rank, bounded by the number of shared parameters rather than by the size of the intervention menu. This in turn yields an information-gain bound that grows only logarithmically in the optimization horizon, and a regret bound that cleanly separates three sources of error: optimization, causal estimation, and the choice of which intervention sets to consider. We also describe nonlinear and adaptive extensions. Across theory-aligned Gaussian systems, shared-mechanism stress tests, and standard causal optimization benchmarks, the method keeps the benefits of causal Bayesian optimization while transferring information across related interventions, with the clearest gains when direct interventions on the target's parents are unavailable and sparse interventional data must be reused across a large family of candidate interventions.

URL PDF HTML ☆

赞 0 踩 0

2606.01444 2026-06-02 cs.AI cond-mat.mtrl-sci cs.CL cs.LG math.CT 版本更新

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

科学中的自我修正发现系统：面向主体人工智能的范畴论框架

Fiona Y. Wang, Markus J. Buehler

发表机构 * Laboratory for Atomistic and Molecular Mechanics（原子分子力学实验室）； Department of Biological Engineering（生物工程系）； Massachusetts Institute of Technology（麻省理工学院）； Department of Civil and Environmental Engineering（土木与环境工程系）； Department of Mechanical Engineering（机械工程系）； Center for Computational Science and Engineering（计算科学与工程中心）； Schwarzman College of Computing（施瓦茨曼计算学院）

AI总结本文提出一个基于范畴论的框架，通过左Kan扩展实现科学发现中的表征体制转换，并应用于材料科学中的蛋白质力学和纤维网络建模。

详情

AI中文摘要

科学发现不仅是生成答案，更是对证据、人工制品、操作和验证者进行类型化的表征体制的修正。我们为材料科学中的主体发现开发了一个范畴论描述。在固定体制b中，模式类别为S_b，系统状态是一个余预层I_t: S_b -> Set，来源是元素范畴∫_{S_b} I_t。固定体制操作是对此类状态的更新，仅当指定并保留了保持来源的细化时才是自函子。发现则是经过验证的体制转换u: S_b -> S_b'：旧人工制品通过左Kan扩展Lan_u I_t保存并传输，并与转换后状态进行比较，以识别超出函子传输的剩余内容。这在不依赖主观新颖性的情况下区分了检索、搜索和发现。我们在两个系统中实例化了该框架。在Builder/Breaker中，蛋白质力学世界模型在最小描述长度门控下进行修正；接受的定律将链内柔性表示为受慢集体模式调节的全模态弹性柔度，即模式调节柔度。在CategoryScienceClaw中，类型化技能、人工制品、开放需求、工作流变异、门控、压力测试和公共话语构成了一个携带证明的知识计算图。一个纤维网络示例记录了候选模型、被拒绝的替代方案、AIC门控、扰动测试以及一个基于各向同性纤维计数描述符的接受取向张量各向异性刚度代理模型。这些案例共同展示了范畴论如何既作为科学发现的数学语言，又作为自我修正AI发现系统的工程规范。

英文摘要

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b': old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.

URL PDF HTML ☆

赞 0 踩 0

2606.01443 2026-06-02 cs.LG cs.AI cs.CV 版本更新

不要询问LLM追踪新鲜度：一种确定性的内存冲突解决策略

Vikas Reddy, Sumanth Challaram

发表机构 * IIT Kgp（印度理工学院科钦分校）

AI总结针对基于LLM的内存系统中事实冲突解决性能低下的问题，提出用候选提取加Python max(serial)的确定性聚合替代LLM判断，在单跳任务上提升10.8个百分点，并扩展到多跳任务。

详情

AI中文摘要

基于LLM的内存系统越来越多地维护随时间演变的事实，其中一个反复出现的失败是冲突解决：当一个事实有多个矛盾的值时，智能体应该返回哪个？MemoryAgentBench (MAB; Hu et al., 2026) 在其FactConsolidation任务中明确了这一点：事实被编号，反事实具有更高的序号，并且智能体被告知较新的事实具有较大的序号。然而，每个已发布的系统表现不佳：HippoRAG-v2在单跳（FC-SH）上达到54%，BM25 48%，Mem0 18%，而时间知识图谱Zep/Graphiti仅为7%。多跳几乎未解决（22个系统中最多7%）。我们认为瓶颈在于组装步骤：基线将冲突解决留给LLM介导的检索或生成，而不是版本感知的聚合。一个匹配设置的比较（相同的主干、检索、分块、TOP_K）表明，用候选提取加Python max(serial)替换LLM判断答案流水线，在FC-SH上（gpt-4o-mini）获得+10.8分的提升，从6K时的+8分扩大到262K时的+21分。这是一个全流水线效应（解析器、提示、格式和温度共同变化）；隔离解析器是未来的工作。该配方在FC-SH上达到78.0%（gpt-4o-mini）、94.8%（gpt-4o），在FC-MH上达到30.2%（gpt-4o-mini，使用gpt-4o时升至51.5%），通过每跳确定性的Self-Ask扩展。在匹配的262K下，它比HippoRAG-v2高出+28分，比已发布的最佳FC-MH结果高出+20分。这一含义对该子领域具有纠正作用：冲突解决的瓶颈是组装（检索后聚合），而不是存储。一个LongMemEval知识更新检查表明，该机制从max(serial)移植到max(timestamp)，但仅与LLM判断持平（57.8% vs 64.4%，n=45）：确定性聚合是当前值冲突的正确原语，并且必须与问题类型感知处理组合，以实现更广泛的内存问答。

英文摘要

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.

URL PDF HTML ☆

赞 0 踩 0

2606.01417 2026-06-02 cs.AI 版本更新

GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway

GovAI-Pipe：面向土耳其电子政务门户的公民交互AI分层治理管道

Ahmet Kaplan

发表机构 * Turkey's e-Government Gateway（土耳其电子政务门户）

AI总结针对土耳其电子政务平台缺乏结构化技术治理基础设施的问题，提出基于设计科学研究方法的四层AI治理管道GovAI-Pipe，将AI模型生命周期映射到治理检查点，并通过高风险用例验证其可审计的技术实现。

Comments 7 pages

详情

AI中文摘要

土耳其的电子政务门户（e-Devlet）为超过6800万注册用户提供9200多项政府服务，并越来越多地将人工智能集成到面向公民的应用中，如聊天机器人助手和资格评估。然而，目前没有结构化的技术治理基础设施将高级AI政策框架（如欧盟AI法案、OECD AI原则和土耳其自身的国家AI战略）与在集中式电子政务平台中部署AI的操作现实联系起来。我们提出GovAI-Pipe，这是一个使用设计科学研究方法设计的四层治理管道，将AI模型生命周期映射到治理检查点：（1）部署前验证，用于偏差测试、可解释性和隐私影响评估；（2）部署治理，用于风险等级分类和审批工作流；（3）运行时监控，用于漂移检测、公平性跟踪和人在回路升级；（4）事后治理，用于审计跟踪、回滚和公民补救。每一层都锚定到欧盟AI法案、GDPR数据保护框架和国家AI战略的具体条款。我们通过两个高风险e-Devlet用例演示该框架，展示GovAI-Pipe如何将治理原则作为可审计的技术管道组件进行操作化。

英文摘要

Turkey's e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasingly integrating artificial intelligence into citizen-facing applications such as chatbot assistants and eligibility assessments. However, no structured technical governance infrastructure currently connects high-level AI policy frameworks, such as the EU AI Act, OECD AI Principles, and Turkey's own National AI Strategy, to the operational reality of deploying AI within a centralized e-government platform. We propose GovAI-Pipe, a four-layer governance pipeline designed using Design Science Research methodology that maps the AI model lifecycle to governance checkpoints: (1) pre-deployment validation for bias testing, explainability, and privacy impact assessment; (2) deployment governance for risk-tier classification and approval workflows; (3) runtime monitoring for drift detection, fairness tracking, and human-in-the-loop escalation; and (4) post-incident governance for audit trails, rollback, and citizen redress. Each layer is anchored to specific provisions of the EU AI Act, the GDPR data protection framework, and the National AI Strategy. We demonstrate the framework through two high-risk e-Devlet use cases, showing how GovAI-Pipe operationalizes governance principles as auditable, technical pipeline components.

URL PDF HTML ☆

赞 0 踩 0

2606.01416 2026-06-02 cs.AI 版本更新

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

用于可靠的工具增强型大语言模型系统的自愈代理编排器

Rahul Suresh Babu, Adarsh Agrawal

发表机构 * Independent Researcher（独立研究者）； Senior Member, IEEE（IEEE高级成员）

AI总结提出一种自愈代理编排器，通过将可靠性视为有界运行时控制问题，映射故障信号、选择恢复动作并验证轨迹，在100任务故障注入基准上达到98.8%任务成功率，优于重试和完全重规划基线。

详情

AI中文摘要

工具增强型大语言模型（LLM）代理依赖于协调规划、检索、工具调用、验证、记忆和恢复的编排层。在这些系统中，故障不仅来自模型错误，还来自编排层问题，如工具超时、参数格式错误、过时上下文、矛盾证据、重试循环和未验证的中间输出。本文提出一种自愈代理编排器，将可靠性视为有界运行时控制问题。该编排器将可观察的故障信号映射到推断的故障类别，在显式预算下选择目标恢复动作，验证恢复轨迹，并记录可观察性痕迹。我们在一个100任务的受控故障注入基准上，将本方法与静态工作流、仅重试、ReAct风格和完全重规划基线进行比较。自愈方法实现了98.8%的任务成功率，而仅重试为94.5%，完全重规划为93.8%。匹配的恢复预算扫描显示，在每个测试预算下，自愈方法均优于仅重试和完全重规划，在单次恢复尝试下差距最大：分别为94.0%对比85.3%和88.2%。在受控的语义静默故障设置下，验证器引导的自愈将静默故障降至0.0%，而非验证基线更频繁地返回错误但看似合理的输出。紧凑的模型在环验证表明，当实时工具调用模型在本地故障注入工具上执行工具选择、参数生成和答案合成时，相同的恢复机制可以运行。这些结果提供了受控证据，表明故障感知、有预算和验证引导的编排提高了工具增强型LLM系统的可靠性和可诊断性。

英文摘要

Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8\% task success, compared with 94.5\% for retry-only and 93.8\% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0\% versus 85.3\% and 88.2\%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0\%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.

URL PDF HTML ☆

赞 0 踩 0

2606.01402 2026-06-02 cs.LG cs.AI 版本更新

Neural Network Compression by Approximate Differential Equivalence

基于近似微分等价的神经网络压缩

Ravi Dhiman, Andrea Passarella, Mirco Tribastone, Lorenzo Valerio

发表机构 * IMT School for Advanced Studies Lucca（利古里亚高级研究学院）； IIT CNR（理工学院-国家科研委员会）

AI总结提出一种通过聚合功能相似神经元来压缩神经网络的方法，利用近似前向微分等价将网络编码为多项式ODE系统，实现模型大小与精度的平滑权衡。

Comments 19 pages, 4 figures

详情

AI中文摘要

神经网络压缩通常通过基于局部重要性分数（例如基于幅度的剪枝）剪枝参数来实现。我们提出一种互补方法，通过聚合具有相似功能行为的神经元来压缩模型，而不是独立移除权重。我们的方法将训练好的网络编码为多项式ODE系统，并应用一种称为近似前向微分等价的 lumping 方法来识别具有近似匹配诱导动力学的神经元。单个容差参数 $\varepsilon$ 控制压缩水平，并在模型大小和预测精度之间诱导平滑权衡。我们在来自已知真实行为的非线性动力系统的合成数据集和公共回归基准上评估该方法。在这两种设置下，所提出的方法在保持精度的同时实现了显著的参数减少，并在相似的压缩水平下始终优于基于幅度的剪枝和Wanda。这些结果表明，基于微分等价的聚合是传统以权重为中心的剪枝的一种有原则且有效的替代方案。

英文摘要

Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We propose a complementary approach that compresses models by aggregating neurons with similar functional behavior rather than removing weights independently. Our method encodes a trained network as a polynomial ODE system and applies a lumping method called Approximate Forward Differential Equivalence to identify neurons with approximately matching induced dynamics. A single tolerance parameter, $\varepsilon$, controls the compression level and induces a smooth trade-off between model size and predictive accuracy. We evaluate the method on synthetic datasets derived from nonlinear dynamical systems with known ground-truth behavior and on public regression benchmarks. Across both settings, the proposed approach achieves substantial parameter reduction while preserving accuracy, and consistently compares favorably with magnitude-based pruning and Wanda at similar compression levels. These results suggest that differential equivalence-based aggregation is a principled and effective alternative to conventional weight-centric pruning.

URL PDF HTML ☆

赞 0 踩 0

2606.01400 2026-06-02 cs.CL cs.AI 版本更新

BRo-JEPA：在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； NYU Langone Health（纽约大学Langone医疗中心）

AI总结本文提出BRo-JEPA模型，通过在潜空间中施加模10算术的循环结构，实现零样本泛化，解决了标准模型无法外推未见操作的问题。

Comments 10 pages, 14 figures

2606.01364 2026-06-02 cs.CR cs.AI cs.SE 版本更新

Needles at Scale: LLM-Assisted Target Selection for Windows Vulnerability Research

大规模针尖：LLM辅助的Windows漏洞研究目标选择

Michael J. Bommarito

发表机构 * Microsoft（微软）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出Symbolicate-Enrich-Sample流水线，通过符号恢复、结构特征提取和低成本语言模型排序，从Windows系统数千万函数中筛选出约2.2万个候选目标，解决漏洞研究中目标选择瓶颈问题。

Comments 9 pages, 3 figures, 2 tables

详情

AI中文摘要

现代操作系统的攻击面如同大海捞针：数千个签名二进制文件和数百万个函数，几乎没有一个与任何给定漏洞相关。人类分析师或LLM代理必须在分析之前选择值得阅读的函数。在整个操作系统范围内，这种目标选择（而非分析）才是约束条件。我们提出了Symbolicate-Enrich-Sample，一个低成本的批处理流水线，将生产级Windows二进制文件语料库转化为可查询、优先级排序的研究队列。我们(i)通过自动获取公共符号文件并将其与恢复的调用图结合，恢复剥离符号的供应商二进制文件的函数级符号；(ii)为每个命名函数附加廉价、确定性的结构特征，并基于这些特征使用低成本语言模型分配可达性层级、风险级别、漏洞类别假设和理由；(iii)通过优先级加权重要性采样器抽取多样化、优先排序的批次。贡献在于一个选择基础：下游检测器或LLM代理在其上运行的优先级排序层。在包含7,231,419个函数的整个Windows镜像上，标签具有显著的选择性，叠加确定性过滤器后留下约22K个函数的候选列表：候选的针尖，数量足够人类或代理处理。我们描述了流水线的选择性及其失败模式，介绍了方法论并报告了总体统计数据；由于法律和双重用途原因，我们暂不公开推导出的数据集。

英文摘要

The attack surface of a modern operating system is a haystack: thousands of signed binaries and millions of functions, almost none relevant to any given vulnerability. A human analyst or an LLM agent must pick the function worth reading before analyzing it. At whole-OS scope, this target selection, not the analysis, is the binding constraint. We present Symbolicate-Enrich-Sample, a low-cost batch pipeline that turns a corpus of production Windows binaries into a queryable, priority-ranked research queue. We (i) recover function-level symbols for stripped vendor binaries by auto-fetching the public symbol files and joining them to a recovered call graph; (ii) attach cheap, deterministic structural features to each named function and, conditioned on those features, use a low-cost language model to assign a reachability tier, a risk level, a bug-class hypothesis, and a rationale; and (iii) draw diverse, prioritized batches via a priority-weighted importance sampler. The contribution is a selection substrate: the prioritization layer a downstream detector or LLM agent runs on top of. Across a whole Windows image of 7,231,419 functions, the labels are markedly selective, and stacking deterministic filters on them leaves a ~22K-function shortlist: the candidate needles, few enough for a human or agent to work through. We characterize the pipeline's selectivity and its failure modes, describe the methodology, and report aggregate statistics; we withhold the derived dataset for legal and dual-use reasons.

URL PDF HTML ☆

赞 0 踩 0

2606.01352 2026-06-02 cs.AI 版本更新

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

FlowTime: 基于流的个性化先验实现连续生成式观看时间预测

Hongxu Ma, Han Zhou, Chenghou Jin, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University（复旦大学）； Shanghai University of Finance and Economics（上海财经大学）； Kuaishou Technology（快手科技）； Tongji University（同济大学）

AI总结针对现有观看时间预测方法在范式上的局限性，提出连续生成式回归范式及FlowTime方法，利用一步生成变分自编码器和基于流的个性化先验，有效建模多模态用户-物品交互模式，显著提升预测性能。

Comments Accepted by KDD'26

详情

DOI: 10.1145/3770855.3818143

AI中文摘要

观看时间已成为短视频推荐系统中优化深度用户参与度的关键指标。然而，当前的观看时间预测方法存在固有的范式特定局限性。直接回归因单峰高斯假设而面临均值崩溃，序数回归因刚性离散化而受到量化误差的困扰。同样，离散生成式回归则面临高推理延迟和启发式词汇表设计的问题。除了这些具体缺陷外，一个共同的不足是无法捕捉用户-物品交互模式的内在多模态性和异质性。为应对这些挑战，我们首先从因果角度重新审视观看时间预测问题，并将这些用户特定模式识别为调节观看时间结果的结构性混淆因素，其中相同的兴趣在不同用户习惯条件下表现为不同的观看时间结果。然后，我们正式提出一种新的（即第四种）范式——连续生成式回归，并引入FlowTime，一种利用一步生成变分自编码器的新方法。FlowTime有效规避了迭代去噪的延迟，同时保持了连续潜在空间的表达能力。此外，我们设计了一种基于流的个性化先验，利用归一化流将标准高斯先验扭曲为复杂的历史条件流形，从而实现对多模态交互模式的自适应建模。最后，我们构建了TimeRec，首个开源观看时间预测库，并引入一种新的个性化指标，以建立严格的基准测试标准。广泛的离线实验和在线A/B测试表明，FlowTime显著优于现有最先进方法。

英文摘要

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (or the fourth) paradigm -- Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages NFs to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP Library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over SOTA methods.

URL PDF HTML ☆

赞 0 踩 0

2606.01351 2026-06-02 cs.AI 版本更新

DiffuSent：面向方面级情感分析的统一扩散框架

Shu Long, Yanglei Gan, Xuchuan Zhou

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； Southwest Petroleum University（西南石油大学）； Southwest Minzu University（西南民族大学）

AI总结提出非自回归扩散框架DiffuSent，将方面级情感分析的所有子任务统一为边界去噪扩散过程，通过对比去噪训练策略解决重复预测问题，在28个设置上优于现有生成式和跨度式系统，并实现高达181倍的推理加速。

详情

AI中文摘要

方面级情感分析（ABSA）包含七个不同的子任务，每个子任务关注不同的提取元素。尽管生成模型在统一方面情感分析中取得了成功，现有方法通常依赖于自回归的逐词生成，未能捕捉方面和意见术语的整体信息，导致边界不敏感，特别是在多词方面和意见术语的上下文中。为了解决这些问题，我们提出了DiffuSent，一个非自回归扩散框架，系统地将所有ABSA子任务公式化为边界去噪扩散过程，逐步在噪声状态上细化边界。此外，我们引入了一种对比去噪训练策略，有效解决了扩散过程中引入的细微变化导致的重复预测问题。在28个设置（7个子任务×4个数据集）上的大量实验表明，DiffuSent在最强生成式和跨度式系统上实现了持续改进。DiffuSent在多词三元组上表现出显著增益，平均F1提升+2.48，并在包含多个情感三元组的句子中保持稳健的提取准确性。此外，非自回归解码实现了显著的效率优势，推理速度比自回归生成基线快达181倍。

英文摘要

Aspect-Based Sentiment Analysis (ABSA) encompasses seven distinct subtasks, each focusing on different extracted elements. Despite the proven success of generative models in unified aspect sentiment analysis, existing approaches often rely on auto-regressive token-by-token generation without grasping the whole information of the aspect and opinion terms, resulting in boundary insensitivity, particularly in context of multi-word aspect and opinion terms. To address these issues, we present DiffuSent, a non-auto-regressive diffusion framework that systematically formulates all ABSA subtasks as boundary denoising diffusion processes, progressively refining boundaries over noisy states. Furthermore, we introduce a contrastive denoising training strategy which effectively address duplicate predictions with subtle variations introduced by diffusion process. Extensive experiments across 28 settings (7 subtasks x 4 datasets) demonstrate that DiffuSent achieves delivers consistent improvements over the strongest generative and span-based systems. DiffuSent exhibits notable gains on multi-word triplets, achieving an average improvement of +2.48 F1, and maintains robust extraction accuracy in sentences containing multiple sentiment triplets. Moreover, the non-auto-regressive decoding enables substantial efficiency benefits, reaching up to 181 times faster inference than auto-regressive generative baselines

URL PDF HTML ☆

赞 0 踩 0

2606.01322 2026-06-02 cs.CL cs.AI 版本更新

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

TukaBench: 一个基于文化的非洲语言越狱基准

Victor Akinode, Senyu Li, Wassim Hamidouche, Waqas Zamir, Inbal Becker-Reshef, David Ifeoluwa Adelani

发表机构 * Mila - Quebec AI Institute（魁北克人工智能研究院）； McGill University（麦吉尔大学）； Microsoft AI for Good Research Lab（微软人工智能造福人类研究实验室）； Canada CIFAR AI Chair（加拿大CIFAR人工智能主席）

AI总结针对大型语言模型在非洲低资源语言上的安全评估缺失，提出TUKABENCH基准，通过四种设置（直接翻译、文化适应翻译、人工策划提示、代码切换提示）评估语言、文化背景和提示规避性对模型安全的影响，发现非洲语言提示降低拒绝率，并引入Deflection指标和人工验证以解决模型理解失败和评判可靠性问题。

Comments Under review

详情

AI中文摘要

大型语言模型（LLMs）的安全评估仍然高度以英语为中心，导致低资源语言（LRLs），特别是非洲语言，严重缺乏探索。我们引入了TUKABENCH，一个针对七种非洲语言的越狱基准，它通过四种设置将JailbreakBench（JBB）扩展到直接翻译之外：JBB提示的人工翻译、适应非洲背景的英语提示后人工翻译、通过与GPT-5.2交互验证的人工策划提示，以及结合英语和非洲语言的代码切换提示，从而隔离语言、文化背景和提示规避性对模型安全的影响。在闭源和开源模型中，使用非洲语言提示相比英语减少了拒绝，其中文化适应的提示导致最少的拒绝。评估还揭示了两个结构性限制：模型理解失败和低资源语言中LLM作为评判者的可靠性降低。为了捕捉前者，我们在“拒绝”和“越狱”之外引入了“回避”；为了评估后者，我们通过人工标注验证输出，显示在低资源语言和较少支持的脚本中，评判者与人类的一致性下降。

英文摘要

Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African contexts followed by human translation, human-curated prompts validated through interactions with GPT-5.2, and code-switched prompts combining English and African languages, isolating the effect of language, cultural grounding, and prompt evasiveness on model safety. Across closed and open models, prompting in African languages reduces refusal relative to English, with culturally adapted prompts leading to least refusal. The evaluation also surfaces two structural limitations: model comprehension failures and reduced LLM-as-a-judge reliability in LRLs. To capture the first, we introduce Deflection alongside Refused and Jailbroken; to assess the second, we validate outputs with human annotations, showing that judge-human agreement drops in lower-resource languages and less commonly supported scripts.

URL PDF HTML ☆

赞 0 踩 0

2606.01316 2026-06-02 cs.AI 版本更新

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Science Earth: 迈向面向AI原生科学发现的行星级操作系统

Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma, Yifan Wen, Jinglin Jian, Jiacheng Ge, Xiangru Tang, Bo An, Ming Yin, Sanfeng Wu, Mengdi Wang, Le Cong

发表机构 * Department of Pathology, Department of Genetics, Stanford University School of Medicine（病理学系、遗传学系，斯坦福大学医学院）； Princeton AI Lab, Department of Electrical & Computer Engineering, Princeton University（普林斯顿人工智能实验室、电气与计算机工程系，普林斯顿大学）； Scripps Research, La Jolla, CA, USA（斯克里普斯研究机构，洛杉矶，加利福尼亚州，美国）； Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine（生物统计学部、人口健康系，纽约大学格罗斯曼医学院）； College of Computing and Data Science, Nanyang Technological University（计算与数据科学学院，南洋理工大学）； Department of Computer Science, Yale University（计算机科学系，耶鲁大学）； Department of Physics, Princeton University（物理系，普林斯顿大学）

AI总结提出Science Earth行星级科学运行时，通过EACN协议实现AI能力动态连接与自组织协作，在跨太平洋Kuramoto同步研究和单细胞分析中验证了分布式自校正科学推理。

详情

AI中文摘要

科学发现需要在广阔的搜索空间中运用智能、毅力和偶然性。如今，顶尖科学能力仍然孤立——一个AI系统用于生物分析，另一个用于临床推理、数学推导或材料模拟——并且没有预设计的团队能够预见一个问题所需的所有技能。Science Earth是一个行星级科学运行时，其中任何能力——模拟集群、湿实验室机器人、证明引擎、单细胞管道——都可以相互连接，协作结构由问题本身涌现。其底层EACN协议让能力能够相互发现、协商任务所有权，并在不相容的证据标准之间进行裁决，而无需事先知道谁将遇见谁。这将组织挑战从工作流设计转向开放式连接。两次运行在结构不同的条件下验证了这一点。在一项跨太平洋高阶Kuramoto同步研究中，智能体在30分钟内识别并纠正了Ott-Antonsen解析理论中一个在洛伦兹极限外失效的闭合比率假设。在针对488万细胞Kang 2024泛癌图谱的八智能体单细胞运行中，异质能力在64.9小时窗口内耦合，仅有一条结构外部指令，产生了三个新的结果层，并将发现与一项关于相邻CCR8- TIGIT+ Treg亚群的独立湿实验室研究进行锚定。这些案例是首次实证读数，而非基准测试。它们表明，当AI能力真正可连接且协调从问题中涌现时，科学推理成为一个分布式、自校正的过程——这是向行星级AI原生发现迈出的一步。

英文摘要

Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed--one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation--and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability--a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline--can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process--a step towards scaling AI-native discovery to the planet.

URL PDF HTML ☆

赞 0 踩 0

2606.01314 2026-06-02 cs.AI 版本更新

SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems

SkillSmith: 技能与工具的协同进化用于自我改进的智能体系统

Yangbo Wei, Zhen Huang, Shaoqiang Lu, Junhong Qian, Qifan Wang, Chen Wu, Lei He

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Eastern Institute of Technology（东部技术研究所）； University of Science and Technology of China（中国科学技术大学）； Southeast University（东南大学）； Ningbo Institute of Digital Twin（宁波数字孪生研究所）

AI总结提出SkillSmith框架，通过统一提案空间和Lotka-Volterra生态效用模型实现技能与工具的协同进化，在多个基准测试中显著提升性能。

详情

AI中文摘要

最近的自进化智能体表明，技能可以通过执行被发现、精炼和积累。然而，现有的技能进化框架通常假设固定的工具层，并独立评估每个技能，限制了它们修复工具级故障或推理技能间交互的能力。我们提出SkillSmith，一个协同感知的技能-工具协同进化框架。SkillSmith引入了一个统一的提案空间，其中反思产生原子束，共同修改技能和工具，允许在技能进化识别出可重用的能力缺口时，对工具进行包装、编辑、组合、拆分或淘汰。为了指导这种联合搜索，SkillSmith维护了一个受Lotka-Volterra动力学启发的生态效用模型，其中从执行轨迹估计的交互矩阵捕获技能间的成对互补和冲突，并为检索、变异优先级和淘汰提供压力信号。此外，SkillSmith记录反模式，包括失败特征、因果归因和补救措施，以加速诊断并否决重复已知错误的提案。在包括WildClawBench在内的三个基准测试和五个Qwen3.5模型规模上的实验表明，SkillSmith始终优于强基线，并且随着任务复杂性和多技能共激活的增加，增益会放大。

英文摘要

Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently, limiting their ability to repair tool-level failures or reason about interactions among skills. We propose SkillSmith, a synergy-aware skill-tool co-evolution framework. SkillSmith introduces a unified proposal space in which reflection produces atomic bundles that jointly modify skills and tools, allowing tools to be wrapped, edited, composed, split, or retired when skill evolution identifies a reusable capability gap. To guide this joint search, SkillSmith maintains an ecological utility model inspired by Lotka-Volterra dynamics, where an interaction matrix estimated from execution traces captures pairwise complementarity and conflict among skills and provides pressure signals for retrieval, mutation prioritization, and retirement. Furthermore, SkillSmith records anti-patterns, including failure signatures, causal attributions, and remedies, to accelerate diagnosis and veto proposals that repeat known mistakes. Experiments on three benchmarks, including WildClawBench, and five Qwen3.5 model scales show that SkillSmith consistently outperforms strong baselines, with gains that amplify as task complexity and multi-skill co-activation increase.

URL PDF HTML ☆

赞 0 踩 0

2606.01313 2026-06-02 cs.RO cs.AI 版本更新

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

PSG-Nav: 通过多元宇宙决策的概率场景图导航

Rufeng Chen, Yue Chang, Xiaqiang Tang, Hechang Chen, Sihong Xie

发表机构 * Tsinghua University（清华大学）

AI总结提出PSG-Nav方法，通过构建3D概率场景图并利用多元宇宙决策从联合分布中采样最可能的世界设置，以处理开放词汇导航中的感知不确定性，并引入证据经验校准器实现在线终身适应，在多个基准上取得最新最优结果。

Comments 21 pages, 7 figures. ICML 2026

详情

AI中文摘要

开放词汇导航要求具身智能体管理由语义歧义和模型错误引起的显著感知不确定性。然而，大多数现有工作满足于局部最优的确定性方法，剥夺了在多个复合可能性上的复杂导航决策，而这些对于全局更优解至关重要。在本文中，我们提出概率场景图导航（PSG-Nav），它构建了一个3D概率场景图，使用完整的语义类别分布来考虑感知不确定性。为了有效利用局部分布来组合和推理最优导航地标，我们提出多元宇宙决策，从联合分布中采样多个最可能的世界设置，并基于地标与多元宇宙之间的兼容性评估导航地标。为了减轻开放词汇导航中因认知不确定性导致的误报，我们引入证据经验校准器，通过将检测与过去成功和失败的记忆进行交叉验证，实现在线终身适应。在广泛使用的基准MP3D、HM3D和HSSD上的大量实验表明，PSG-Nav建立了新的最先进结果，分别实现了66.1%、44.8%和67.9%的成功率。代码可在https://psg-nav.github.io/获取。

英文摘要

Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HM3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving Success Rates of 66.1%, 44.8%, and 67.9%, respectively. Code is available at: https://psg-nav.github.io/

URL PDF HTML ☆

赞 0 踩 0

2606.01312 2026-06-02 eess.SP cs.AI cs.NI 版本更新

A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks

面向可扩展战术自主防御车辆网络的以通信为中心的6G-LLM架构

Kiran Khurshid, Shumaila Javaid, Nasir Saeed

发表机构 * Department of Computer and Software Engineering, National University of Sciences and Technology (NUST), Islamabad, Pakistan（计算机与软件工程系，国家科学与技术大学（NUST），伊斯兰堡，巴基斯坦）； Department of Control Science and Engineering, College of Electronics and Information Engineering, Tongji University and National Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, China（控制科学与工程系，电子与信息工程学院，同济大学，以及自主智能无人机系统国家重点实验室，同济大学，中国）； Department of Electrical and Communication Engineering, UAE University, Al-Ain 15551, UAE（电子与通信工程系，阿联酋大学，阿恩15551，阿联酋）

AI总结提出一种以通信为中心的分层架构，通过集成边缘辅助大语言模型推理与6G语义通信，在战术自主防御车辆网络中实现协调效率提升、通信开销降低和延迟韧性增强。

Comments 10 pages, accepted in IEEE Network Magazine

详情

DOI: 10.1109/MNET.2026.3694711
Journal ref: K. Khurshid, S. Javaid and N. Saeed, "A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks," in IEEE Network, Early access, 2026

AI中文摘要

人工智能（AI）与新兴6G网络的融合为战术自主车辆系统的可扩展协调带来了新机遇。本文提出了一种以通信为中心的分层架构，用于战术自主防御车辆网络（TADVNs），该架构将边缘辅助大语言模型（LLM）推理与6G连接和语义通信相结合。该框架旨在提高协调效率、减少通信开销，并在不断扩大的车队规模操作下增强延迟韧性。与依赖结构化特征处理和基于规则协调的传统任务特定AI流水线不同，所提出的方法在分层边缘-云通信架构中引入了语义抽象和上下文感知决策支持。我们通过蒙特卡洛模拟，在竞争网络条件下对5-30辆车的车队规模进行了通信和协调性能评估。结果表明，在30辆车规模下，与基于5G的传统AI基线相比，6G-LLM配置实现了75.2%的延迟降低（29.1毫秒对比117.5毫秒），任务成功率提高68.7个百分点（82.9%对比14.2%），通信开销降低88.6%。这些发现表明，当语义推理与低延迟6G连接相结合时，在协调和通信方面具有可衡量的优势。

英文摘要

The integration of Artificial Intelligence (AI) and emerging 6G networks introduces new opportunities for scalable coordination in tactical autonomous vehicle systems. This paper proposes a communication-centric hierarchical architecture for Tactical Autonomous Defense Vehicle Networks (TADVNs) that models the integration of edge-assisted Large Language Model (LLM) reasoning with 6G-enabled connectivity and semantic communication. The framework is designed to improve coordination efficiency, reduce communication overhead, and enhance latency resilience under increasing fleet-scale operation. Unlike conventional task-specific AI pipelines that rely on structured feature processing and rule-based coordination, the proposed approach incorporates semantic abstraction and context-aware decision support within a layered edge-cloud communication architecture. We evaluate communication and coordination performance via Monte Carlo simulations across fleet sizes of 5-30 vehicles under contested network conditions. Results indicate that at a 30-vehicle scale, the 6G-LLM configuration achieves 75.2% latency reduction (29.1 ms vs. 117.5 ms), a 68.7 percentage point increase in mission success rate (82.9% vs. 14.2%), and an 88.6% reduction in communication overhead compared to a 5G-based conventional AI baseline. These findings demonstrate measurable benefits in coordination and communication when semantic reasoning is combined with low-latency 6G connectivity.

URL PDF HTML ☆

赞 0 踩 0

2606.01311 2026-06-02 cs.CL cs.AI cs.LG cs.MA 版本更新

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

SkillAdaptor：基于轨迹的LLM智能体自适应技能

Zhuoyun Yu, Xin Xie, Wuguannan Yao, Chenxi Wang, Lei Liang, Xiang Qi, Shumin Deng

发表机构 * Zhejiang University（浙江大学）； Ant Digital Technologies, Ant Group（蚂蚁集团数字技术部）； Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph（浙江大学-蚂蚁集团知识图谱联合实验室）

AI总结提出SkillAdaptor，一种无训练的步骤级技能自适应框架，通过显式故障归因和针对性更新，提升LLM智能体在长程交互任务中的表现。

Comments Work in progress

详情

AI中文摘要

大型语言模型（LLM）智能体越来越依赖可重用的外部技能来解决长程交互任务。现有的无训练技能自适应流程通常从完整轨迹或会话级反馈更新技能，这使得故障归因粗糙，往往产生不稳定或过于宽泛的修订。我们提出SkillAdaptor，一种无训练的步骤级技能自适应框架，具有显式故障归因，并可插入OpenClaw类智能体框架。给定一个失败轨迹，SkillAdaptor识别第一个可操作的故障步骤，将责任关联到候选技能，并在显式接受检查下应用针对性更新，同时保持主干冻结。我们在WebShop、PinchBench和Claw-Eval上使用Kimi-K2.5、GLM-5和GPT-5.2进行评估。SkillAdaptor在所有三个套件上均优于无技能和技能自适应基线，最大的单项指标提升为PinchBench平均得分%提升1.5分，Claw-Eval平均得分提升1.8分，WebShop成功率提升1.7分。这些结果表明，步骤级归因支持更稳定且可审计的无训练技能维护。

英文摘要

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance\footnote{The code will be released at https://github.com/zjunlp/SkillAdaptor.}.

URL PDF HTML ☆

赞 0 踩 0

2606.01300 2026-06-02 cs.LG cs.AI 版本更新

ChronosAD: Leveraging Time Series Foundation Models for Accurate Anomaly Detection

ChronosAD：利用时间序列基础模型进行精确异常检测

Uzair Khan, Luigi Capogrosso, Francesco Biondani, Michele Magno, Franco Fummi, Francesco Setti, Marco Cristani

发表机构 * PR Veneto FESR 2021-2027（普罗文托地区FESR 2021-2027项目）； Action 1.1.1（行动1.1.1）； DGR 792 ； CUP D19J24000810007

AI总结提出ChronosAD架构，通过时间序列基础模型提取特征并结合BiLSTM与多头注意力机制，实现跨域鲁棒的异常检测，在11个基准上平均AUC提升4.72%，AP提升6.60%。

Comments Accepted at the 24th IEEE International Conference on Industrial Informatics (INDIN) 2026

详情

AI中文摘要

时间序列异常检测是金融、医疗和工业等多个领域的关键任务。然而，现有方法通常难以在不同数据集上泛化，尤其是当异常微妙或依赖于上下文时。为解决此问题，我们引入了ChronosAD，一种新颖的异常检测架构，它使用时间序列基础模型作为特征提取器。具体而言，它采用两阶段流程：首先，使用基础模型以零样本方式为每个时间序列提取嵌入。然后，一个由双向长短期记忆（BiLSTM）和多头注意力组成的自定义开发的时间块，对这些嵌入进行精炼以捕捉时间依赖性并突出显著模式。与先前方法不同，我们的模型需要最少的任务特定调整，并在包括工业、医疗、信息物理和汽车系统在内的广泛领域中展现出鲁棒的泛化能力。在11个基准上的大量实验表明，ChronosAD在AUC和AP上平均分别超过现有方法4.72%和6.60%。源代码可在https://github.com/intelligolabs/ChronosAD获取。

英文摘要

Time series anomaly detection is a crucial task in various domains, including finance, healthcare, and industry. However, existing methods often struggle to generalize across different datasets, especially when anomalies are subtle or context-dependent. To solve this issue, we introduce ChronosAD, a novel architecture for anomaly detection that uses a time series foundation model as a feature extractor. Specifically, it employs a two-stage pipeline: first, it uses the foundation model to extract embeddings for each time series in a zero-shot manner. Then, a custom-developed Temporal Block, composed of Bidirectional Long Short-Term Memory (BiLSTM) and Multi-Head Attention, refines these embeddings to capture temporal dependencies and highlight salient patterns. Unlike previous approaches, our model requires minimal task-specific tuning and demonstrates robust generalization across a wide range of domains, including industrial, medical, cyber-physical, and automotive systems. Extensive experiments on 11 benchmarks show that ChronosAD outperforms existing methods by 4.72% in AUC and 6.60% in AP on average. The source code is available at https://github.com/intelligolabs/ChronosAD.

URL PDF HTML ☆

赞 0 踩 0

2606.01293 2026-06-02 eess.IV cs.AI cs.CV 版本更新

ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI

ResNet-34与轻量级解码器用于胎儿脑部MRI的准确高效分割

Ashiqur Rahman, Muhammad E. H. Chowdhury, Md. Abu Sayed, Md. Sharjis Ibne Wadud, Abu Naser Md. Arafat, Mehedi Hasan Prince

发表机构 * Department of Biomedical Physics and Technology, University of Dhaka（达卡大学生物医学物理与技术系）； Department of Electrical Engineering, College of Engineering, Qatar University（卡塔尔大学工程学院电气工程系）； Department of Biomedical Engineering, Jashore University of Science and Technology（贾沙尔大学科学与技术学院生物医学工程系）

AI总结提出一种结合ResNet-34编码器和基于MLP的轻量级解码器的深度学习模型，以解决胎儿脑MRI分割中的运动伪影和强度不均匀问题，在FeTA 2021数据集上达到97.37%准确率和90.33%平均DSC。

详情

AI中文摘要

在磁共振成像（MRI）中准确分割胎儿脑组织对于先天性异常的早期诊断和改善产前护理至关重要。然而，由于胎儿运动、组织对比度低以及整个孕龄期解剖结构变异大，特别是分割白质、灰质、侧脑室、深部灰质、脑外脑脊液、小脑和脑干等复杂结构时，该任务仍然困难。针对这些难题，本研究引入了一种新颖的深度学习模型，该模型将ResNet-34编码器与利用多层感知器（MLP）模块进行自适应特征细化的轻量级解码器相结合。这种设计特别增强了模型保留解剖边界并减轻由运动伪影和强度不均匀引起的分割误差的能力。通过减少参数数量、采用双线性上采样代替转置卷积以及优化解码器以提高速度而不牺牲精度，实现了计算效率。在FeTA 2021数据集上使用5折交叉验证进行训练和验证，所提出的模型优于UNet、UNet++、DeepLabV3和DeepLabV3+等基线架构，平均准确率达到97.37%，平均Dice相似系数（DSC）为90.33%，平均交并比（IoU）为86.93%，精确率为90.83%。此外，其快速的推理时间和减少的计算负载使其非常适合集成到实时临床工作流程中。

英文摘要

Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalities and improving prenatal care. However, the task remains difficult because of fetal motion, low tissue contrast, and major anatomical variability throughout gestational ages, particularly in segmenting complex structures such as white matter, gray matter, lateral ventricles, deep gray matter, extra-cerebrospinal fluid, cerebellum, and brainstem. As a solution to these difficulties, this research introduces a novel deep learning model that combines a ResNet-34 encoder with a lightweight decoder leveraging multi-layer perceptron (MLP) modules for adaptive feature refinement. This design specifically enhances the model's ability to preserve anatomical boundaries and mitigate segmentation errors caused by motion artifacts and intensity inhomogeneities. Computational efficiency is achieved by reducing parameter count, employing bilinear upsampling instead of transposed convolutions, and optimizing the decoder for speed without sacrificing accuracy. Trained and validated on the FeTA 2021 dataset using 5-fold cross-validation, the proposed model outperforms baseline architectures such as UNet, UNet++, DeepLabV3, and DeepLabV3+, achieving an average Accuracy of 97.37% with a mean Dice Similarity Coefficient (DSC) of 90.33%, mean Intersection over Union (IoU) of 86.93%, and Precision of 90.83%. Additionally, its fast inference time and reduced computational load make it well-suited for integration into real-time clinical workflows.

URL PDF HTML ☆

赞 0 踩 0

2606.01292 2026-06-02 cs.LG cs.AI 版本更新

What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression

什么造就了一个强模型？高维线性回归中知识迁移的统一谱分析

Wendao Wu, Fangqing Zhang, Haihan Zhang, Cong Fang

发表机构 * Department of Computer Science（计算机科学系）； Cranberry-Lemon University（Cranberry-Lemon 大学）； Department of Computational Neuroscience（计算神经科学系）； University of the Witwatersrand（沃特瓦特斯兰大学）

AI总结本文通过高维线性回归中SGD动力学的统一谱分析，揭示了知识蒸馏中的谱视界扩展和弱到强泛化中的谱去噪两种机制，统一解释了不同知识迁移范式的有效性。

详情

AI中文摘要

师生知识迁移在现代机器学习中无处不在，从通过知识蒸馏进行的经典模型压缩到弱到强泛化这一新兴现象。尽管现有研究提供了孤立见解，但缺乏一个统一的理论框架来解释知识迁移在这些不同机制中的有效性。在这项工作中，我们建立了高维线性回归中SGD动力学的统一谱分析，阐明了知识迁移在看似不同的机制中的效率。我们通过两种不同机制来刻画知识迁移效率：知识蒸馏中的谱视界扩展，使得能够捕获统计上不可及的高频信号；以及弱到强泛化中的谱去噪，其中学生充当优化噪声的滤波器。我们的框架统一了这些现象，揭示了迁移的有效性由隐式正则化与谱上异质谱学习速度之间的相互作用所支配。

英文摘要

Teacher-Student Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Distillation (KD) to the emergent phenomenon of Weak-to-Strong (W2S) generalization. While existing studies offer isolated insights, a unified theoretical framework explaining the efficacy of KT across these disparate regimes remains lacking. In this work, we establish a unified spectral analysis of SGD dynamics in high-dimensional linear regression, elucidating the efficiency of KT across seemingly disparate regimes. We characterize KT efficiency through two distinct mechanisms: \emph{Spectral Horizon Expansion} in KD, which enables the capture of statistically inaccessible high-frequency signals, and \emph{Spectral Denoising} in W2S, where the student acts as a filter for optimization noise. Our framework unifies these phenomena, revealing that the efficacy of transfer is governed by the interplay between implicit regularization and heterogeneous spectral learning speeds over the spectrum.

URL PDF HTML ☆

赞 0 踩 0

2606.01291 2026-06-02 quant-ph cs.AI 版本更新

Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework

量子分布式纠缠约简算法（QADR）：一种可训练且模拟高效的量子机器学习框架

Syed Farhan Ahmad, Gregory T. Byrd

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）

AI总结提出QADR框架，通过将全局n量子比特变分量子电路分解为因果光锥内的局部子电路，将经典模拟内存从O(2^n)降至O(n·2^{2d+1})并缓解贫瘠高原，在MNIST和NASA轴承诊断任务上匹配或超越经典模型。

详情

AI中文摘要

在含噪中等规模量子（NISQ）约束下训练变分量子电路（VQCs）引入了严重的计算限制：经典态矢量模拟内存呈指数增长（$\mathcal{O}(2^n)$），且全局代价函数遭受贫瘠高原，其中梯度方差呈指数衰减（$\mathcal{O}(1/2^n)$）。本文介绍并评估了量子分布式纠缠约简算法（QADR），这是一种混合量子-经典机器学习框架，它将全局$n$量子比特VQC分解为局部子电路，这些子电路大致在单个目标量子比特的因果光锥内运行。QADR将经典模拟内存从$\mathcal{O}(2^n)$降低到$\mathcal{O}(n \cdot 2^{2d+1})$（光锥半径$d$），同时自然缓解了全局贫瘠高原。我们在MNIST数据集和高维NASA IMS风力发电机传动系统诊断任务上，将QADR与标准全局VQC、支持向量机（SVM）以及两种定制的经典参数匹配神经网络（CANN和PMNN）进行了基准测试。QADR展示了出色的可扩展性，在$n_{\text{features}}=2000$时成功运行，而标准全局VQC因内存耗尽而崩溃，同时匹配或超越了优化经典架构的性能。

英文摘要

Training Variational Quantum Circuits (VQCs) under Noisy Intermediate-Scale Quantum (NISQ) constraints introduces severe computational limitations: classical statevector simulation memory scales exponentially ($\mathcal{O}(2^n)$), and global cost functions suffer from barren plateaus where gradient variance decays exponentially ($\mathcal{O}(1/2^n)$). This paper introduces and evaluates the Quantum Algorithm for Distributed Reduction of Entanglements (QADR), a hybrid quantum-classical machine learning framework that decomposes a global $n$-qubit VQC into localized sub-circuits operating approximately within the causal light cones of individual target qubits. QADR reduces classical simulation memory scaling from $\mathcal{O}(2^n)$ to $\mathcal{O}(n \cdot 2^{2d+1})$ for a light cone radius $d$, while naturally mitigating global barren plateaus. We benchmark QADR against standard global VQCs, Support Vector Machines (SVM), and two customized classical parameter-matched neural networks (CANN and PMNN) on the MNIST dataset and the high-dimensional NASA IMS wind turbine drivetrain diagnostic task. QADR demonstrates excellent scalability, operating successfully at $n_{\text{features}}=2000$ where standard global VQCs crash due to memory exhaustion, while matching or exceeding the performance of optimized classical architectures.

URL PDF HTML ☆

赞 0 踩 0

2606.01287 2026-06-02 cs.CV cs.AI 版本更新

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

超越视觉记忆：潜在视觉推理的机制诊断

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Shuai Dong

发表机构 * Amap, Alibaba Group（阿里集团亚马通）； Shanghai Innovation Institute（上海创新研究院）

AI总结通过分解潜在令牌为三个可测试组件，发现边界标记和格式而非潜在槽贡献了主要性能提升，揭示了潜在视觉推理的真正机制。

详情

AI中文摘要

最近的潜在视觉推理方法通过在多模态语言模型中插入连续潜在令牌取得了显著提升。这些提升通常归因于令牌编码了视觉证据；然而，最近的分析揭示了一个悖论：令牌与图像关联松散，对答案贡献甚微。关键的是，这些分析将潜在令牌视为一个整体，掩盖了提升的真正来源。因此，我们将潜在令牌分解为三个可测试组件：潜在槽、边界标记和格式，并在有利条件下开发了一种最先进的方法作为探针。在六个方法-阶段设置和四个感知密集型基准测试中，潜在槽未能通过视觉记忆解释的所有预测。引人注目的是，在几种设置中，仅保留边界标记即可保留78%至100%的提升，而模型在潜在位置比在答案位置更窄地关注图像。因此，提升来自边界标记、格式以及这种注意力模式，而非潜在槽。每种方法如何利用这一机制取决于其训练监督：在匹配的准确率下，机制仍可能显著不同。因此，潜在视觉推理不仅需要根据准确率评估，还需要根据模型实际依赖的内容进行评估。

英文摘要

Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.

URL PDF HTML ☆

赞 0 踩 0

2606.01286 2026-06-02 cs.SE cs.AI cs.CL cs.LG 版本更新

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

BenchEvolver: 通过以解决方案为中心的进化进行前沿任务合成

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Institute for Interdisciplinary Information Sciences, Tsinghua University（清华大学交叉信息研究院）

AI总结提出BenchEvolver框架，通过进化参考解决方案自动生成更难的编程问题，以解决基准饱和问题，并在LiveCodeBench和SciCode上验证其有效性。

详情

AI中文摘要

相对于语义保持嵌入的梯度揭示大语言模型的不确定性

Mingda Li, Rundong Lv, Xinyu Li, Weinan Zhang, Ting Liu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出首个基于梯度的自由文本生成不确定性量化方法SemGrad，通过语义空间中的梯度计算实现高效且无需采样的不确定性估计。

Comments Accepted by ICML 2026

详情

AI中文摘要

不确定性量化（UQ）是确保大语言模型（LLM）可信度的重要技术，因为LLM容易产生幻觉。现有的自由文本生成UQ方法严重依赖采样，导致计算成本高且方差大。在这项工作中，我们提出了首个基于梯度的自由文本生成UQ方法SemGrad，它无需采样且计算高效。与先前针对分类任务开发的在参数空间中操作的梯度方法不同，我们提出在语义空间中考虑梯度。我们的方法基于一个关键直觉：自信的LLM应在语义等价的输入扰动下保持稳定的输出分布。我们将这种稳定性解释为语义空间中的梯度，并引入语义保持分数（SPS）来识别最能捕捉语义的嵌入，并针对这些嵌入计算梯度。我们进一步提出了HybridGrad，它结合了SemGrad和参数梯度的优势。实验表明，我们的两种方法都提供了高效且有效的不确定性估计，在多个有效响应的设置中尤其优于现有方法。

英文摘要

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operates in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret the stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, achieving superior performance than state-of-the-art methods, particularly in settings with multiple valid responses.

URL PDF HTML ☆

赞 0 踩 0

2605.04193 2026-06-02 cs.AI cs.LG cs.LO 版本更新

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming

ANDRE：一种基于注意力的神经符号可微规则提取器，用于归纳逻辑编程

Iman Sharifi, Peng Wei, Saber Fallah

发表机构 * Dept. of Mechanical and Aerospace Engineering, George Washington University, USA（机械与航空航天工程系，乔治华盛顿大学）； Dept. of Mechanical Engineering Sciences, University of Surrey, UK（机械工程科学系，萨里大学）

AI总结提出ANDRE框架，通过注意力驱动的可微逻辑算子优化连续规则空间，实现从概率数据中学习一阶逻辑规则，在噪声环境下保持鲁棒性和可解释性。

Comments 35 pages, 8 figures, 10 tables

详情

AI中文摘要

归纳逻辑编程（ILP）旨在从数据中学习可解释的一阶规则，但现有的符号和神经符号方法难以扩展到噪声和概率设置。经典ILP依赖于离散的组合规则搜索，在不确定性下脆弱，而可微ILP方法通常依赖预定义规则模板或不精确的模糊算子，这些算子在推理概率谓词估值时会遭受梯度消失或逻辑结构近似不佳的问题。本文提出基于注意力的神经符号可微规则提取器（ANDRE），一种新颖的ILP框架，通过基于注意力的逻辑算子优化连续规则空间来学习一阶逻辑程序。ANDRE用完全可微的、注意力驱动的合取和析取算子替代规则模板和逻辑算子，这些算子近似逻辑最小-最大语义，从而实现对概率数据的准确、稳定和可解释推理。通过在每条规则内软选择、否定或排除谓词，ANDRE在保持符号结构的同时支持灵活规则归纳。在经典ILP基准、大规模知识库以及带有概率谓词和噪声监督的合成数据集上的大量实验表明，ANDRE达到了有竞争力或更优的预测性能，同时在不确定性下可靠地恢复正确的符号规则。特别是，ANDRE对中等标签噪声保持鲁棒，在规则提取质量和稳定性上显著优于现有可微ILP方法。

英文摘要

Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.

URL PDF HTML ☆

赞 0 踩 0

2606.01237 2026-06-02 cs.AI 版本更新

Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes

脑图谱引导的生成式反事实注意力用于基于多模态连接组的可解释认知衰退诊断

Xiongri Shen, Jiaqi Wang, Zhenxi Song, Yi Zhong, Leilei Zhao, Xin He, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术系）； School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology（哈尔滨工业大学智能科学与工程学院）； School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical, Measurements and Ultrasound Imaging, Shenzhen University Medical School, Shenzhen University（深圳大学医学院生物医学工程学院、医学超声关键技术工程实验室、广东省生物医学测量与超声成像重点实验室）

AI总结提出一种脑图谱知识引导的生成式反事实注意力网络（GCAN），通过将诊断建模为源到目标的反事实生成问题，利用多模态连接组实现可解释的认知衰退诊断。

详情

AI中文摘要

轻度认知障碍（MCI）和主观认知衰退（SCD）与早期阿尔茨海默病连续谱密切相关，准确且可解释的诊断对于早期风险评估和干预至关重要。现有的基于连接组的深度学习模型可以提高分类性能，但通常对疾病相关的功能和结构连接变化提供的洞察有限。本文提出了一种图谱知识引导的生成式反事实注意力网络（GCAN），用于使用多模态脑连接组进行可解释的认知衰退诊断。GCAN将诊断建模为源到目标的反事实生成问题，其中从源标签输入生成目标标签连接组，并利用它们的差异构建反事实注意力图。为了保持连接组拓扑，一种图谱感知的双向Transformer（AABT）在脑图谱约束下执行网络级令牌编码和解码。该框架进一步从功能连接（FC）扩展到联合功能和结构连接（SC）建模，从而实现对互补功能重组和结构拓扑变化的反事实分析。在医院收集的数据集和ADNI数据集上的实验表明，GCAN在HC vs. SCD、HC vs. MCI和SCD vs. MCI分类任务中取得了竞争性能。可视化、圆形连接组分析、基于CAM的比较、消融研究和置信区间分析进一步支持了所提框架的可解释性和可靠性。使用特定模态的FC和SC预训练分类器为反事实生成提供目标状态先验，同时将其与下游诊断分类器分离以防止数据泄露。

英文摘要

Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.

URL PDF HTML ☆

赞 0 踩 0

2606.01230 2026-06-02 cs.AI 版本更新

混合不平衡回归：统一的数据级与算法级平衡方法

Shermin Shahbazi, Hossein Mohammadi, Mohsen Afsharchi

发表机构 * Zahedan National University（札赫德安国立大学）

AI总结提出一个五阶段混合框架，结合自适应分箱、条件变分自编码器、特征空间聚类过采样、潜在密度加权损失和注意力门控融合，解决回归中的不平衡问题。

Comments 52 pages, 20 figures, accepted at Expert Systems with Applications

详情

DOI: 10.1016/j.eswa.2026.131908
Journal ref: Expert Systems with Applications, Date: 1 August 2026, Article: 131908, Volume: Volume 322

AI中文摘要

不平衡学习是机器学习中的一个关键挑战，其中代表性不足的目标值可能使模型产生偏差，并降低对罕见但重要案例的预测性能。尽管在分类中得到了广泛研究，不平衡回归仍然相对未被充分探索。现有方法主要关注数据级平衡（可能引入噪声和过拟合）或算法级平衡（通常难以处理高度复杂的目标分布）。为了解决这些局限性，我们提出了一个统一的混合框架，将数据级和算法级平衡策略集成到一个与回归器无关的流水线中。该框架包括五个阶段：（1）自适应分箱划分，基于局部线性一致性动态分割目标空间；（2）使用条件变分自编码器进行目标条件表示学习；（3）通过特征空间聚类和少数类过采样进行多阶段数据级平衡；（4）使用新颖的潜在密度加权损失（LDWL）进行算法级平衡，以强调潜在空间和目标空间中的稀有样本；（5）基于注意力的门控融合用于最终回归。在基准数据集上的实验结果表明，与单独的回归器和现有的不平衡回归方法相比，所提出的框架持续提高了预测性能。

英文摘要

Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade prediction performance on rare but important cases. Although extensively studied in classification, imbalanced regression remains relatively underexplored. Existing methods mainly focus on either data-level balancing, which may introduce noise and overfitting, or algorithm-level balancing, which often struggles with highly complex target distributions. To address these limitations, we propose a unified hybrid framework that integrates both data- and algorithm-level balancing strategies into a regressor-agnostic pipeline. The proposed framework consists of five stages: (1) adaptive bin partitioning to dynamically segment the target space based on local linear coherence; (2) target-conditioned representation learning using a Conditional Variational Autoencoder; (3) multistage data-level balancing through feature-space clustering and oversampling of minority clusters; (4) algorithm-level balancing using a novel Latent-Density Weighted Loss (LDWL) to emphasize rare samples in latent and target spaces; and (5) attention-based gated fusion for final regression. Experimental results on benchmark datasets demonstrate that the proposed framework consistently improves predictive performance compared to standalone regressors and existing imbalanced regression approaches.

URL PDF HTML ☆

赞 0 踩 0

2606.01220 2026-06-02 cs.LG cs.AI 版本更新

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调扩散模型用于分子生成

Guang Lin, Shikui Tu, Lei Xu

发表机构 * Department of Computer Science and Engineering, Shanghai Jiao Tong University（上海交通大学计算机科学与工程系）； Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)（广东人工智能与数字经济实验室（深圳））

AI总结提出FTDiff框架，结合组相对策略优化和快速采样机制，微调扩散模型以生成满足多目标药物设计约束的高质量分子。

Comments 13 pages, 7 figures

详情

AI中文摘要

生成同时满足类药性质并符合目标蛋白三维结构的分子是基于结构的药物设计（SBDD）中的核心挑战。然而，现有的生成方法通常依赖于采样过程中昂贵的后处理或训练时需要精心策划的数据集，但增益仍然有限。这些限制在多目标设置中尤为突出，平衡冲突标准仍是一个核心挑战。为了解决这些问题，我们提出了FTDiff，一个专为结构约束下基于扩散的分子生成量身定制的强化学习微调框架。为了确保稳定且样本高效的优化，FTDiff采用了组相对策略优化（GRPO）风格策略。此外，FTDiff基于一个无时间预训练扩散模型，并集成了快速采样机制，减少了去噪步数，在保持生成质量的同时显著加速了训练和推理。通过优化一个固定阈值感知的奖励，FTDiff有效引导模型生成有效、多样且高质量的分子，平衡多个药物设计目标。在基准数据集上的大量实验表明，FTDiff始终优于先前的方法，且无需昂贵的后处理优化或复杂的数据工程。

英文摘要

Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during Sampling or require carefully curated datasets during training, yet still achieve modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains a core challenge. To address these challenges, We propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high- quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.

URL PDF HTML ☆

赞 0 踩 0

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM 版本更新

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出APEIRIA，通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中，实现透明推理与开放词汇空间推理的统一。

Comments To appear in ICML 2026

详情

AI中文摘要

当前的3D空间推理方法面临根本性权衡：神经符号3D（NS3D）概念学习器通过组合程序实现可解释推理，但受限于封闭集概念词汇和简单程序；端到端3D多模态大语言模型（3D MLLMs）能处理复杂自然语言和开放词汇概念，但缺乏显式空间验证的黑箱推理。我们提出APEIRIA，一种神经符号3D MLLM，通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中，桥接两种范式。我们的三阶段课程逐步构建推理能力：a) 3D感知对齐将物体视觉-几何特征接地到LLM，b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证，c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识，APEIRIA保留了NS3D的关键优点：透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明，APEIRIA超越了先前的NS3D方法，并在3D空间推理数据集上匹配最先进的3D MLLMs，统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

URL PDF HTML ☆

赞 0 踩 0

2606.01213 2026-06-02 cs.CV cs.AI cs.CL 版本更新

LLM智能体能否维持长期组织动态？

Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou, Xiaohan Zhang, Yongrui Liu, Guoshun Nan

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出TaskWeave分层智能体框架，通过记忆中心的协调机制（规划-分解-诊断-对齐循环和依赖感知追踪记忆）实现长期组织模拟，实验表明该框架能维持连贯的组织动态并产生可靠的人工制品。

详情

AI中文摘要

大型语言智能体越来越多地用于社会模拟，但尚不清楚它们能否在结构化组织中维持连贯行为，其中目标必须通过层级传播，任务依赖于先前执行，并且人工制品在长期范围内积累。我们将长期组织模拟定义为以记忆为中心的协调问题，并引入TaskWeave，这是一个分层智能体框架，通过制定-分解-诊断-对齐循环维护规划状态，并通过依赖感知追踪记忆来接地执行。我们在一个为期一年的IT公司模拟中评估TaskWeave，并将其与其他多智能体框架在组织连贯性、执行接地和下游企业NLP效用方面进行比较。实验表明，TaskWeave支持连贯且长期的组织动态，同时产生接地的人工制品并适应外部环境。这些发现表明，结构化模拟记忆是构建可靠的基于LLM的组织模拟器的关键机制。

英文摘要

Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.

URL PDF HTML ☆

赞 0 踩 0

2606.01196 2026-06-02 cs.CL cs.AI 版本更新

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

低资源安全失败是行动失败，而非表征失败

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（莫扎伊德大学人工智能学院）

AI总结本文发现低资源语言的安全对齐失败源于决策校准问题而非表征缺失，通过重校准高资源门控（低秩逻辑回归+阈值重置）显著提升拒绝选择性。

详情

AI中文摘要

在高资源语言中学习的安全对齐在低资源语言中迁移效果不佳。模型能拒绝英文有害提示，但当相同提示翻译成斯瓦希里语或缅甸语时则无法拒绝。自适应引导方法如AdaSteer和CAST在跨语言中继承了这一失败。我们诊断了迁移失败的原因。在Qwen2.5-7B、Gemma-2-9B和Llama-3.1-8B模型上，针对23种语言，从高资源激活中提取的有害方向几乎能像高资源提示一样线性分离低资源有害与无害提示。相关表征存在。然而，有害拒绝率从87.9%下降到43.9%。模型未能将表征转化为拒绝。未能迁移的是安全决策的校准，而非底层表征。我们利用这一点，通过重校准而非重新训练高资源门控：一个低秩逻辑回归读出器，其决策阈值使用每类仅1到4个目标语言示例重置。该门控在拒绝引导和有害方向消融之间路由，将平均拒绝选择性（Δ = 有害 − 无害拒绝）从最强自适应基线的33.6显著提高到54.5，同时保持MMLU效用。这些结果表明，一些低资源安全失败可以通过重校准现有表征而非学习新表征来修复。我们的代码已发布：https://github.com/rashadaziz/low-resource-safety。

英文摘要

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($Δ$ = harmful $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.

URL PDF HTML ☆

赞 0 踩 0

2606.01189 2026-06-02 cs.AI 版本更新

The Case for Model Science: Verify, Explore, Steer, Refine

模型科学的案例：验证、探索、引导、改进

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel, Andreas Holzinger, Wojciech Samek

发表机构 * Center for Credible AI（可信AI中心）； University of Warsaw（华沙大学）； Warsaw University of Technology（华沙技术大学）； University College Cork（科克大学学院）； University of Technology Sydney（悉尼技术大学）； Kempner Institute, Harvard University（哈佛大学凯普纳研究所）； Human-Centered AI Lab（以人为本的人工智能实验室）； Technical University of Berlin（柏林技术大学）； Fraunhofer Heinrich Hertz Institute（弗劳恩霍夫海因里希·赫茨研究所）； Berlin Institute for the Foundations of Learning and Data (BIFOLD)（柏林学习与数据基础研究所（BIFOLD））

AI总结本文提出AI社区应超越基准测试，建立系统性的模型分析学科——模型科学，通过验证、探索、引导和改进四个功能视角，以及共享基础设施和深度案例研究，来理解复杂AI模型的行为。

Comments Follow up on arXiv:2508.20040

详情

AI中文摘要

我们认为，AI社区现在已经准备好超越基准测试，并将分散的模型分析工作整合成一个系统性的学科，我们称之为模型科学。复杂的AI模型现在服务于数十亿用户，但我们对它们工作原理的理解远远落后于部署它们的能力。几十年来以基准测试为导向的研究取得了显著进展：广泛的排行榜、各种性能指标、跨不同任务的能力提升追踪；然而，这种成功也揭示了基准测试的局限性，因为它们告诉我们模型是否表现良好，但不告诉我们为什么成功或失败，它们忽略了关键的失败模式，如幻觉或捷径。来自成熟科学的先例指明了前进的方向：认知科学表明，理解复杂系统需要互补的分析层次；神经科学证明，对单个案例的深入研究揭示了群体研究遗漏的东西；医学教导我们，专业培训必须与研究实践同步发展；农业模型展示了共享基础设施和原则如何实现累积进展。这些经验为模型科学提供了三个基础。首先，我们建议围绕四个功能视角整合研究：验证、探索、引导和改进，这些视角解决了关于模型行为的互补问题。其次，我们讨论了累积知识所需的基础设施：数据集、模型和发现的目录。第三，我们强调需要对单个模型实例进行深入分析，而不仅仅是模型家族，因为单个案例可以揭示群体研究遗漏的东西。

英文摘要

We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, tracking capability gains across diverse tasks; yet this success has also revealed the limits of benchmarks as they tell us whether models perform but not why they succeed or fail, they miss critical failure modes, such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives: Verify, Explore, Steer, and Refine that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.

URL PDF HTML ☆

赞 0 踩 0

2606.01188 2026-06-02 cs.HC cs.AI 版本更新

pcbGPT: Automatic PCB Schematic Synthesis from Natural Language Requirements

pcbGPT: 从自然语言需求自动合成PCB原理图

Tobias King, Steven Kehrberg, Michael Beigl, Tobias Röddiger

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； Bosch Sensortec GmbH（博世传感器技术有限公司）

AI总结提出pcbGPT系统，通过工具增强合成、组件库搜索、数据表知识、执行检查、结构语义验证和交互式工作流，从自然语言规格自动生成可编辑的KiCad原理图，在20个嵌入式任务上达到pass@1为0.90。

详情

AI中文摘要

在嵌入式、物联网和可穿戴设备开发中，将自然语言硬件需求转化为正确的印刷电路板（PCB）原理图仍然困难。设计者必须选择兼容的组件、解读数据手册、添加支持电路并暴露正确的接口，然后才能开始布局和原型制作，而许多此类电路无法通过简单的仿真进行验证。我们提出了pcbGPT，一个从自然语言规格生成可编辑KiCad原理图的接地系统。pcbGPT用Python DSL表示电路，并结合了工具增强合成、组件库搜索、基于数据手册的设计知识、基于执行的检查、结构和语义验证，以及支持迭代优化和与KiCad项目同步的交互式Web工作流。我们在20个嵌入式原理图生成任务上评估了该系统，这些任务具有参考实现、所需组件和接口约束，以便自动比较。最佳模型在整体上达到pass@1为0.90，pass@5为1.00；在基础和简单任务上pass@1为1.00，中等任务上为0.91，困难任务上为0.72。这些结果以及失败分析表明，pcbGPT已经能够为早期原型生成有用的、可审查的初稿原理图，但尚不足以可靠地取代专家审查。

英文摘要

Translating natural-language hardware requirements into correct printed circuit board (PCB) schematics remains difficult in embedded, IoT, and wearable development. Designers must choose compatible components, interpret datasheets, add support circuitry, and expose correct interfaces before layout and prototyping can begin, while many such circuits cannot be validated through straightforward simulation. We present pcbGPT, a grounded system for generating editable KiCad schematics from natural-language specifications. pcbGPT represents circuits in a Python DSL and combines tool-augmented synthesis with component-library search, datasheet-grounded design knowledge, execution-based checking, structural and semantic validation, and an interactive web workflow that supports iterative refinement and synchronization with KiCad projects. We evaluate the system on 20 embedded schematic-generation tasks with reference implementations, required components, and interface constraints that enable automatic comparison. The best model reaches overall pass@1 of 0.90 and pass@5 of 1.00; pass@1 is 1.00 on basic and easy tasks, 0.91 on medium tasks, and 0.72 on hard tasks. These results, together with failure analysis, show that pcbGPT can already generate useful, reviewable first-draft schematics for early prototyping, but is not yet reliable enough to replace expert review.

URL PDF HTML ☆

赞 0 踩 0

2606.01185 2026-06-02 cs.AI 版本更新

"Skill issues'': data-centric optimization of lakehouse agents

技能问题：湖仓代理的数据中心优化

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

发表机构 * University of Maryland（马里兰大学）； Università Milano Bicocca（米兰Bicocca大学）； Bauplan Labs（Bauplan实验室）

AI总结针对分支湖仓Bauplan上的编码代理，提出数据中心的优化流程，通过生成任务验证器对、在隔离沙箱中执行候选技能并利用追踪信号和程序化检查评分，将准确率提升31.9%。

详情

AI中文摘要

编码代理正在成为数据基础设施的用户，但它们的成功不仅取决于模型质量：还取决于教导代理如何使用系统的技能和环境文件。我们研究如何为在分支湖仓Bauplan上操作的代理优化这些工件。在我们的设置中，无头API和类似Git的数据原语通过代码、分支、提交和合并暴露数据工作流。我们的核心观察是，分支湖仓将数据代理评估从输出匹配问题转变为状态验证问题：代理生成的管道代码会引发具体的、可检查的湖仓变化。我们提出了一个数据中心优化流程，生成任务验证器对，在隔离沙箱中执行候选技能，并使用追踪级信号和湖仓状态的程序化检查对轨迹进行评分。在25个任务的初步评估中，优化后的技能将准确率提升了31.9%。这些结果表明，写路径数据工作流为优化代理技能提供了有用的基础，超越了只读任务。

英文摘要

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.01182 2026-06-02 cs.CL cs.AI 版本更新

CA-BED: Conversation-Aware Bayesian Experimental Design

CA-BED：对话感知的贝叶斯实验设计

Daniel Arnould, Rashad Aziz, Zixuan Kang, Tanav Changal, Kevin Zhu, Sunishchal Dev, Gabriel Grand, Shreyas Sunil Kulkarni

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

AI总结提出对话感知的贝叶斯实验设计（CA-BED），一种推理时概率对话规划框架，通过结合贝叶斯实验设计与LLM似然估计，在多个对话轮次中优化问题选择，在结构化实体推断基准上平均成功率提升21.8%，仅增加1.8轮对话。

Comments Reliable Autonomy Workshop at ICLR 2026

详情

AI中文摘要

大型语言模型（LLM）在静态推理任务中表现出色，但在需要通过提问主动获取信息的交互场景中，其性能往往会下降。一个关键挑战在于选择能够减少不确定性同时纳入可能模糊或仅部分信息性的回应的问题。为了解决这个问题，我们提出了对话感知的贝叶斯实验设计（CA-BED），一种推理时概率对话规划框架，它将贝叶斯实验设计与基于LLM的似然估计相结合，以在多个对话轮次中优化问题选择。CA-BED维护关于假设的信念分布，预测可能的答案，并通过模拟对话树传播期望信息增益。在两个结构化实体推断基准上，CA-BED相比直接提示实现了平均21.8%的成功率提升，相对于其他信息寻求方法也有相当的增益。与直接提示相比，它仅平均增加了1.8个对话轮次就实现了这些增益。

英文摘要

Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.

URL PDF HTML ☆

赞 0 踩 0

2606.01179 2026-06-02 cs.LG cs.AI 版本更新

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

异质系统中熵预测的物理信息深度学习：热力学与信息论案例研究

Biswajeet Sahoo, Debadutta Patra

发表机构 * Durham University（杜ham大学）； Department of Chemical Engineering（化学工程系）； Veer Surendra Sai University of Technology（维尔·苏雷纳·赛大学）

AI总结提出统一物理信息深度学习框架，通过微分方程残差和信息论约束，在单一神经网络中同时实现热力学与信息论系统的熵预测，并验证其数据效率和物理一致性。

详情

AI中文摘要

熵产生支配着物理和信息论系统中的不可逆性和不确定性。尽管物理信息神经网络（PINNs）成功求解微分方程，但当前架构本质上仍是领域特定的。跨根本不同物理定律的领域不变熵表示的提取尚未探索。本文引入了一个统一的物理信息深度学习（PIDL）框架，该框架在单一神经架构中同时强制执行微分方程残差和信息论界限。我们通过两个经典研究来展示该框架：（i）一个热力学连续搅拌釜反应器（CSTR）模型，求解控制常微分方程，其中Softplus约束严格强制执行热力学第二定律；（ii）一个信息论金融市场模型，求解逆Fokker-Planck偏微分方程以推断潜在漂移和扩散系数，通过Softplus约束保证扩散正性，同时自然诱导香农熵。评估了三种模型变体：两个特定领域基线和一种共享编码器架构。PIDL框架保证了绝对的热力学可接受性，零违反第二定律，并表现出卓越的数据效率，仅使用30%的可用训练数据即可保持>90%的预测精度。此外，对学习到的熵表面的事后Ruppeiner黎曼几何分析成功识别了热力学相不稳定性。该方法为物理约束熵建模提供了一个稳健、领域无关的架构，推动了可持续过程设计和定量金融风险评估的应用。

英文摘要

Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neural Networks (PINNs) successfully solve differential equations, current architectures remain inherently domain-specific. The extraction of domain-invariant entropy representations across fundamentally different physical laws remains unexplored. This paper introduces a unified Physics-Informed Deep Learning (PIDL) framework that simultaneously enforces differential equation residuals and information-theoretic bounds within a single neural architecture. We demonstrate this framework via two canonical studies: (i) a thermodynamic continuous stirred-tank reactor (CSTR) model solving governing ODEs, where a Softplus constraint strictly enforces the Second Law of Thermodynamics; and (ii) an information-theoretic financial market model solving the inverse Fokker-Planck PDE to infer latent drift and diffusion coefficients, guaranteeing diffusion positivity via a Softplus constraint while naturally inducing Shannon entropy. Three model variants are evaluated: two domain-specific baselines and one shared-encoder architecture. The PIDL framework guarantees absolute thermodynamic admissibility with zero Second-Law violations and exhibits exceptional data efficiency, retaining >90% predictive accuracy using merely 30% of available training data. Furthermore, a post-hoc Ruppeiner Riemannian geometric analysis of the learned entropy surface successfully identifies thermodynamic phase instabilities. This methodology provides a robust, domain-agnostic architecture for physics-constrained entropy modeling, advancing applications in sustainable process design and quantitative financial risk assessment.

URL PDF HTML ☆

赞 0 踩 0

2606.01171 2026-06-02 cs.CY cs.AI 版本更新

AI From the Margins (AIM): Rethinking Participatory AI Design Through the Lived Experience of Minoritized Communities

边缘AI（AIM）：通过少数群体生活经验重新思考参与式AI设计

Tijs Portegies, Laureanne Willems, Maaike Harbers, Giovanni Sileno, Roland van Dierendonck, Mayesha Tasnim, Lotte Willemsen, Sennay Ghebreab

发表机构 * Utrecht University（乌特勒支大学）； University of Amsterdam（阿姆斯特丹大学）

AI总结提出AIM方法论，通过叙事启发、共同规则制定等步骤，将少数群体的生活经验融入参与式AI设计，并在荷兰医疗场景中验证其有效性。

Comments Under review at the AAAI/ACM Conference on AI, Ethics, and Society (AIES 2026)

详情

AI中文摘要

人工智能（AI）可以再现并放大少数群体面临的结构性不平等。参与式AI被提出作为应对措施，但参与通常始于问题定义和成功标准设定之后，留给少数群体重塑AI系统目的的空间有限。我们提出边缘AI（AIM）：一种方法论立场，阐明如何引发、聚焦并推进少数群体的生活经验，以指导参与式AI设计。AIM并非固定协议；它阐明了一组先决条件，可通过不同技术在不同环境中实施。我们在荷兰医疗背景下，与13名有色人种女性和非二元性别者以及5名市政政策工作者进行了八次会话，应用了AIM，具体包括：（1）使用传记叙事解释法（BNIM）进行叙事启发；（2）共同构建规则制定；（3）参与者决定AI是否、在何处以及如何介入；（4）通过与政策制定者的对话将生活经验转化为AI政策。在会话反思中，参与者将参与描述为实质性的，并呼吁继续开展，展示了以生活经验为基础的准备性取向如何塑造参与式AI设计的目的。

英文摘要

Artificial intelligence (AI) can reproduce and amplify the structural inequities faced by minoritized communities. Participatory AI has been proposed as a response, but participation typically starts after problem definitions and success criteria have been set, leaving limited room for minoritized communities to reshape what an AI system is for. We propose AI From the Margins (AIM): a methodological stance that articulates the conditions under which lived experiences of minoritized communities can be elicited, centered, and carried forward to inform participatory AI design. AIM is not a fixed protocol; it articulates a set of preconditions that can be enacted through different techniques in different settings. We applied AIM in a Dutch healthcare context in eight sessions with 13 women and non-binary people of color and five municipal policy workers, namely through (1) narrative elicitation using the Biographic Narrative Interpretive Method (BNIM); (2) co-constructed rule-making; (3) participants' determination of whether, where, and how AI should be involved; and (4) translating lived experience into AI policy through dialogue with policymakers. In their reflections on the sessions, participants described the engagement as substantive and called for its continuation, demonstrating how preparatory orientation fundamentally grounded in lived experience shapes what participatory AI design is for.

URL PDF HTML ☆

赞 0 踩 0

2606.01160 2026-06-02 cs.AI 版本更新

Reasoning4Sciences：将推理语言模型桥接到所有科学分支

Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk, Łukasz Radliński, Maciej Markiewicz, Aleksander Szczęsny, Stanisław Woźniak, Tymoteusz Romanowicz, Dzmitry Pihulski, Mateusz Zbrocki, Mateusz Śmigielski, Michał Rajkowski, Mateusz Biedka, Konrad Kiełczyński, Konrad Wojtasik, Jacek Duszenko, Jan Eliasz, Piotr Matys, Michał Bernacki-Janson, Maria Bellaniar Ismiati, Latius Hermawan, Wiktoria Mieleszczenko-Kowszewicz, Anna Kubicka-Sowinska, Grzegorz Chodak, Karol Postawa, Paweł Zyblewski, Tomasz Szandała, Łukasz Sterczewski, Adrian Chajec, Pawel Niewiadomski, Piotr Gruber, Marcin Wdowikowski, Sławomir Czarnecki, Bartłomiej Kryszak, Dominik Drabik, Tomasz Kajdanowicz, Kamil Mamak, Paweł Preś, Katarzyna Paczkowska, Joachim Sobczuk, Tomasz Zięba, Jan Kocoń, Maciej Piasecki, Przemysław Kazienko

发表机构 * Poznan University of Technology（波兹南理工大学）； National Cheng Kung University（国立成功大学）； Universitas Katolik Musi Charitas Palembang（Palembang 巴厘岛天主教大学）

AI总结本文首次全面分析推理语言模型在28个科学学科中的采用情况，提出基于领域资源的成熟度评估框架，揭示学科间差距并展望未来方向。

详情

AI中文摘要

虽然推理语言模型（RLMs）正迅速成为科学研究的强大工具，但其影响主要集中在“硬科学”领域。RLMs在其他科学分支中的采用缓慢（或缺乏）导致研究生产力差距不断扩大。在本综述中，我们首次按照欧洲研究理事会（ERC）使用的分类，对RLMs在28个科学学科中的采用情况进行了全面分析，涵盖社会科学与人文、物理科学与工程以及生命科学。我们研究了RLMs如何跨学科开发、评估和应用。此外，我们引入了一个基于可用领域特定开发和评估资源的成熟度导向评估框架，揭示了RLM成熟度的显著差异，当仅考虑公开可用资源时，这种差异变得更加明显。最后，我们强调了当前跨学科流行的实施范式、当前挑战以及推动RLMs在科学中采用的未来方向。

英文摘要

While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in "hard science" fields. The slow -- or lack of -- adoption of RLMs in other branches of science is causing a widening gap in research productivity. In this survey, we provide the first comprehensive analysis of RLM adoption across 28 scientific disciplines following the classification used by the European Research Council (ERC), spanning the Social Sciences and Humanities, Physical Sciences and Engineering, and Life Sciences. We examine how RLMs are developed, evaluated, and applied across disciplines. Furthermore, we introduce a maturity-oriented assessment framework based on available domain-specific development and evaluation resources, revealing substantial disparities in RLM maturity that become even more pronounced when only publicly available resources are considered. Finally, we highlight current implementation paradigms that are gaining popularity across disciplines, current challenges, and future directions in enabling RLM adoption across science.

URL PDF HTML ☆

赞 0 踩 0

2606.01126 2026-06-02 cs.LG cs.AI cs.CV 版本更新

MiCU: 基于大语言模型的端到端智能家居指令理解

Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang, Bin Qin, Yuxiang Wang, Jiawei Jiang, Xiao Yan, Bo Du

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）； Xiaomi Corporation（小米公司）； Institute for Math & AI, Wuhan University（武汉大学数学与人工智能研究院）

AI总结提出MiCU，一种利用课程学习、强化学习和令牌压缩技术的领域特定大语言模型，用于解决智能家居中模糊指令理解问题，平均准确率提升20.01%。

详情

DOI: 10.1145/3770855.3818446

AI中文摘要

智能家居生态系统中的指令理解系统可以自动化设备控制并显著改善用户体验。然而，尽管它们在精确表述（例如“打开卧室灯”）上表现良好，但在处理模糊或不一致的指令（例如“让卧室变得舒适”）时却存在困难。大语言模型（LLM）在各种领域都能很好地泛化，并且在此类任务上可以超越传统的基于规则的系统，但其有效性通常受到领域特定数据稀缺、任务特定适应性不足以及高计算成本的限制。在本文中，我们提出了一种利用用户日志和LLM的自动化训练数据合成工作流程；然后构建了MiCU，一个在指令理解方面表现出色的领域特定LLM。具体来说，我们采用课程学习将领域知识注入基础LLM，然后通过冷启动训练结合领域特定思维规则引导的强化学习（RL）来增强其推理能力。此外，我们引入了一种令牌压缩技术，将设备描述压缩为单个特殊令牌，从而显著降低推理开销，并实现了\model-fast，一种针对长输入优化的高效变体。大量实验表明，MiCU显著优于基线，在所有设备类别上平均准确率提升20.01%。我们已在小米家应用中部署了MiCU，每天接收约170万页面浏览量。生产评估显示，MiCU将用户纠正率降低了1.57%，并将人工审核准确率提高了32.05%。我们的数据和代码可在https://github.com/xiaomi-research/iot_spec_llm获取。

英文摘要

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm

URL PDF HTML ☆

赞 0 踩 0

2606.01098 2026-06-02 cs.RO cs.AI 版本更新

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

隐式漂移策略：通过条件专家几何实现单步动作生成

Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma

发表机构 * ShanghaiTech University（上海科技大学）； Shanghai Jiao Tong University（上海交通大学）； The Chinese University of Hong Kong（香港中文大学）； Morphi Robot（Morphi机器人）

AI总结提出隐式漂移策略（IDP），一种单步模仿学习框架，通过条件专家几何隐式引入训练时的漂移校正，无需显式向量场估计，在2D、3D及真实世界操作任务中有效保持有效动作流形，性能优于显式漂移方法并达到强单步基线水平。

详情

AI中文摘要

基于扩散或流匹配的生成动作策略在行为克隆中表现出色，但其迭代采样对于高频机器人控制来说过于耗时。尽管最近的单步公式缓解了这种延迟，但它们不可避免地丢弃了提供关键动作校正的中间轨迹演化。由于条件演示极端稀疏，通过显式估计训练时漂移场直接恢复这一机制在数学上是不适定的。我们提出了隐式漂移策略（IDP），一种单步模仿学习框架，无需显式向量场估计即可将训练时的漂移校正引入策略学习。IDP从观测相似专家动作的局部变化中提取条件专家几何，并将其与全局参考几何进行比较，以分离条件特定的约束。这种局部几何结构自适应地加权一个标量势目标。结合专家近端终端评估，IDP在训练期间直接对单步生成器施加流形约束。在2D、3D和真实世界操作任务上的广泛评估表明，IDP有效保持了对有效动作流形的遵循，优于显式漂移方法，并达到了与强单步基线相当的性能。

英文摘要

Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.01095 2026-06-02 cs.RO cs.AI 版本更新

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

超越任务成功：WAM 和 VLA 的行为与表征诊断

Hung Mai, Bin Zhu, Tuan Do

发表机构 * National Economics University, Vietnam（越南国家经济大学）； Singapore Management University（新加坡管理大学）； Phenikaa University, Vietnam（越南Phenikaa大学）

AI总结本文提出一个模型无关的诊断框架，通过行为分析和基于稀疏自编码器的特征分析，比较世界动作模型（WAM）与视觉-语言-动作（VLA）策略在机器人操作中的行为与表征差异，发现WAM在目标选择和行为改进上优于VLA但计算成本更高，且不同WAM架构对未来信息的编码方式不同。

详情

AI中文摘要

视觉-语言-动作（VLA）策略和世界动作模型（WAM）代表了机器人操作中两种日益重要的范式。然而，尚不清楚WAM中的未来预测是否在最终任务成功之外带来行为上有意义的改进。在本文中，我们探究WAM是否仅仅增加了未来预测，还是以对控制可操作的方式改变了机器人行为和内部表征。我们引入一个模型无关的诊断框架，通过两个互补的视角比较WAM和VLA：行为 rollout 分析和基于稀疏自编码器的特征分析。行为协议测量动作动态一致性、目标物体进展、干扰物干扰和运行时成本。特征空间协议将内部表征表征为记忆型、反应型或预测型，揭示模型是否编码了面向未来的结构。在LIBERO和RoboTwin2.0上，我们评估了7种策略，涵盖直接VLA以及联合、顺序和辅助WAM。我们的结果表明，仅凭成功隐藏了关键差异：WAM通常改善物体级行为和目标选择性，但其收益依赖于架构并导致更高的推理成本。顺序WAM显示出最清晰的预测结构，而辅助和联合WAM分别压缩或纠缠未来信息。这些发现为WAM设计提供了未来方向，以保留行为可操作的未来表征，实现高效操作。

英文摘要

Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAMs design to preserve behaviorally actionable future representations for efficient manipulation.

URL PDF HTML ☆

赞 0 踩 0

2606.01094 2026-06-02 cs.AI 版本更新

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

CAREAgent: 具有结构化推理和工具集成的临床智能体用于医嘱生成

Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu, Siran Zhao, Yao Yu, Jie Zhai, Tong Ruan

发表机构 * East China University of Science and Technology, Shanghai, China（东华大学上海科学技术学院）； Zhongshan Hospital, Fudan University, Shanghai, China（复旦大学中山医院）

AI总结提出CAREAgent，通过两阶段推理数据构建和监督微调与强化学习，生成细粒度临床医嘱，在ClinicalBench上F1提升5.05%。

详情

AI中文摘要

临床医嘱生成是临床决策与实际实践之间的关键桥梁，将医疗决策转化为具体可执行的医嘱。现有智能体主要关注粗粒度决策，忽略了临床医嘱所需的细粒度可执行信息。为弥补这一差距，我们提出CAREAgent，一个用于临床医嘱生成的智能体。为支持其训练，我们引入了一种两阶段智能体推理数据构建方法。首先，我们设计了一个智能体框架，构建与真实临床工具使用一致的可验证推理轨迹。其次，我们根据格式合规性、医嘱有效性和临床合理性筛选推理轨迹。基于构建的数据，模型首先通过监督微调训练以获得基本的推理格式和医学知识，随后通过具有多维奖励函数的强化学习进行优化，以增强复杂的临床推理能力。在多个基准上的实验证明了CAREAgent的有效性。在ClinicalBench（训练中未见）上，CAREAgent的F1分数分别比单智能体、多智能体和智能体推理方法提高了5.05%、2.09%和0.86%。

英文摘要

Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.

URL PDF HTML ☆

赞 0 踩 0

2606.01092 2026-06-02 cs.LG cs.AI 版本更新

A Fiber Criterion for Representation Identifiability in Supervised Learning

监督学习中表示可辨识性的纤维准则

Vasileios Sevetlidis

发表机构 * Athena Research Center, Kimmeria Campus, Xanthi, Greece（亚特兰大研究中心，基米里亚校区，哈尼亚，希腊）； Democritus University of Thrace, Vas. Sofias Campus, Xanthi, Greece（德摩根大学，瓦斯·索菲亚校区，哈尼亚，希腊）； International Hellenic University, Serres, Greece（国际希腊大学，塞雷斯，希腊）

AI总结本文提出纤维准则，通过投影映射的纤维常数性来形式化监督学习中表示-头部分解的可辨识性，并指出仅凭监督预测行为无法唯一确定表示。

详情

AI中文摘要

监督学习通过输入-输出行为评估预测器。当预测器实现为复合函数 $f=c\circ h$ 时，监督证据约束了复合映射 $f$，但未必确定表示-头部因子分解 $(h,c)$。本文形式化了由此产生的表示级可辨识性问题：对于一类可接受的表示-头部对，当且仅当表示属性在投影 $(h,c)\mapsto c\circ h$ 的纤维上为常数时，它可从诱导的预测器中辨识，等价于它下降为预测器的良定义属性。保持预测器的增广给出了一个规范障碍：辅助信息可以附加到表示上而头部忽略它，保持预测器不变但改变诸如极小性、压缩、不变性、等变性、干扰信息或语义可访问性等属性。这种构造将表示可辨识性与优化和有限样本估计分离开来。有限样本诊断说明了而非证明了该准则：精确代数见证在改变表示诊断时保持预测器固定，而匹配性能的Waterbirds模型表明不同约束可以在相似的监督性能下选择不同的表示。结果阐明，表示级声明需要超越监督预测行为本身的假设、目标、测量或归纳偏置。

英文摘要

Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$, supervised evidence constrains the composite map $f$ but need not determine the representation-head factorization $(h,c)$. This paper formalizes the resulting representation-level identifiability problem: for a class of admissible representation-head pairs, a representation property is identifiable from the induced predictor exactly when it is constant on the fibers of the projection $(h,c)\mapsto c\circ h$, equivalently when it descends to a well-defined property of the predictor. Predictor-preserving augmentation gives a canonical obstruction: auxiliary information can be appended to a representation while the head ignores it, leaving the predictor unchanged but altering properties such as minimality, compression, invariance, equivariance, nuisance information, or semantic accessibility. This construction separates representation identifiability from optimization and finite-sample estimation. Finite-sample diagnostics illustrate, rather than prove, the criterion: exact algebraic witnesses hold the predictor fixed while changing representation diagnostics, and matched-performance Waterbirds models show that different constraints can select different representations at similar supervised performance. The results clarify that representation-level claims require assumptions, objectives, measurements, or inductive biases beyond supervised predictive behavior alone.

URL PDF HTML ☆

赞 0 踩 0

2606.01086 2026-06-02 cs.LG cs.AI 版本更新

Strong Stochastic Flow Maps

强随机流映射

Sam McCallum, Zander W. Blasingame, Timothy Herschell, Niklas Rindtorff, Alexander Tong, James Foster

发表机构 * University of Bath（巴斯大学）； AITHYRA

AI总结提出强随机流映射（SSFMs）框架，通过学习加性噪声SDE的强解映射，实现扩散模型的免模拟训练和少步采样，在图像生成和分子系统采样中优于现有方法。

Comments Preprint

详情

AI中文摘要

流模型和扩散模型在许多模态中生成高质量样本；然而，由于需要对底层微分方程进行数值积分，推理过程中需要多次网络评估。流映射通过学习微分方程的解映射直接缓解了这一问题，实现了少步采样。然而，当前方法仅限于逼近ODE的解映射。这些方法可用于学习SDE的转移核，从而获得恢复过程边际分布（弱收敛）而非解路径（强收敛）的解映射。我们提出强随机流映射（SSFMs）作为一种新框架，用于学习加性噪声SDE的强解映射，直接将确定性流映射推广到随机设置。此外，引入了布朗运动的多项式逼近，并证明其路径收敛。这些结果为扩散模型的解映射提供了免模拟训练目标。我们证明，SSFMs在图像生成上优于先前的随机流映射方法，并实现了分子系统的少步采样。

英文摘要

Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference due to numerical integration of an underlying differential equation. Flow maps alleviate this problem by learning the solution map of the differential equation directly, enabling few-step sampling. Yet, current methods are restricted to approximating the solution map of ODEs. These methods can be used to learn the transition kernel of an SDE, thereby obtaining a solution map that recovers the marginal distributions of the process (weak convergence) rather than the solution path (strong convergence). We propose Strong Stochastic Flow Maps (SSFMs) as a novel framework for learning the strong solution map of additive-noise SDEs, directly generalizing deterministic flow maps to the stochastic setting. Further, a polynomial approximation to Brownian motion is introduced and shown to converge pathwise. These results enable a simulation-free training objective for the solution map of diffusion models. We demonstrate that SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.

URL PDF HTML ☆

赞 0 踩 0

2606.01084 2026-06-02 cs.LG cs.AI 版本更新

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

MViewRouter：通过多视图交替注意力内化组合路由的几何等变性

Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin

发表机构 * Huazhong University of Science and Technology（华中科技大学）； Eindhoven University of Technology（埃因霍温理工大学）

AI总结提出MViewRouter框架，利用多视图交替注意力机制内化几何等变性作为结构归纳偏置，通过集体策略梯度聚合优化，解决组合路由问题中的对称性挑战，在TSP和CVRP上取得竞争性解质量和强零样本泛化。

详情

AI中文摘要

组合路由问题，如旅行商问题（TSP）和带容量约束的车辆路径问题（CVRP），是基础的NP难问题，具有广泛的现实应用。虽然最近的深度强化学习方法显示出有希望的性能，但它们通常仅通过数据增强处理几何对称性，导致决策不一致和泛化能力有限。为了解决这个问题，我们提出了MViewRouter，一个多视图框架，将几何等变性内化为结构归纳偏置，以实现跨路由问题变体的不变决策。我们的方法引入了一种多视图交替注意力（MAA）机制，能够在$D_4$对称群上进行并行处理，在视图内关系建模和视图间特征对齐之间交替进行。此外，我们通过集体策略梯度聚合（CPGA）优化策略，利用来自多个对称视图的共识梯度来稳定训练并加速收敛。在TSP和CVRP基准测试以及真实世界的TSPLIB实例上的实验表明，MViewRouter实现了竞争性的解质量和强大的零样本泛化能力。

英文摘要

Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the $D_4$ symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.

URL PDF HTML ☆

赞 0 踩 0

2606.01080 2026-06-02 cs.LG cs.AI 版本更新

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

ThinkSwitch：基于LoRA和权重插值的上下文蒸馏用于特定目的推理任务

Dhruv Saini, Rohan Pandey

发表机构 * bellevue High School（贝尔维尤高中）； DigitalOcean

AI总结提出ThinkSwitch方法，通过QLoRA蒸馏和球面权重插值协同训练指令模型和思考模型，在AIME 2026和PubMedQA上分别提升指令模型10/30→20/30和13/30→18/30，思考模型14/30→22/30和18/30→25/30，仅需15个训练提示和$2.86成本。

详情

AI中文摘要

大型语言模型通常通过在产生最终答案之前花费推理时间计算来改进困难任务。额外的计算可能有用，但也增加了延迟、令牌成本和部署复杂性。我们引入了 extbf{ThinkSwitch}，一种低计算量的程序，用于协同训练配对的指令和思考检查点。从兼容的Qwen3-4B指令和思考模型开始，每次迭代要求思考检查点生成答案，移除推理轨迹，通过QLoRA将仅答案对蒸馏到指令检查点，并通过球面权重插值重建思考检查点。唯一的人工输入是任务提示；标签由模型自身生成。在30个问题的AIME 2026评估中，ThinkSwitch将指令检查点从10/30提升到20/30，思考检查点从14/30提升到22/30。在30个问题的PubMedQA子集上，它将指令检查点从13/30提升到18/30，思考检查点从18/30提升到25/30。完整实验每个领域使用15个训练提示，在单个云RTX 3070上花费2.86美元。结果规模较小，但表明有针对性的蒸馏循环可以将显式推理的部分好处转移到权重中，同时保留独立的思考模式。

英文摘要

Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and deployment complexity. We introduce \textbf{ThinkSwitch}, a low-compute procedure for co-training paired instruct and thinking checkpoints. Starting from compatible Qwen3-4B instruct and thinking models, each iteration asks the thinking checkpoint to generate answers, removes the reasoning trace, distills the answer-only pairs into the instruct checkpoint with QLoRA, and reconstructs a thinking checkpoint with spherical weight interpolation. The only human-supplied inputs are task prompts; the labels are generated by the model itself. On a 30-question AIME 2026 evaluation, ThinkSwitch improves the instruct checkpoint from 10/30 to 20/30 and the thinking checkpoint from 14/30 to 22/30. On a 30-question PubMedQA subset, it improves the instruct checkpoint from 13/30 to 18/30 and the thinking checkpoint from 18/30 to 25/30. The complete experiment uses 15 training prompts per domain and costs \$2.86 on a single cloud RTX 3070. The results are small-scale, but they indicate that targeted distillation loops can move part of the benefit of explicit reasoning into weights while preserving a separate thinking mode.

URL PDF HTML ☆

赞 0 踩 0

2606.01070 2026-06-02 cs.IR cs.AI cs.LG 版本更新

Test-Time Training for Zero-Resource Dense Retrieval Reranking

零资源稠密检索重排的测试时训练

Shiyan Liu, Yichen Li

发表机构 * Huazhong University of Science and Technology（华中科技大学）； ByteDance（字节跳动）

AI总结提出 DART 方法，通过测试时自适应双线性评分矩阵，利用伪正负样本进行少量梯度更新，在零资源下提升稠密检索重排性能。

Comments Accepted at KnowFM @ ACL 2026

详情

AI中文摘要

稠密检索器在第一阶段候选生成中表现出色，但在零资源设置下缺乏有效的重排能力。现有方法面临根本性困境：交叉编码器重排质量高，但需要昂贵的监督训练且延迟高，而无监督的 BM25 重排在大多数 BEIR 基准上持续降低稠密检索性能。我们提出 DART（测试时稠密自适应重排），通过在推理时自适应评分函数来解决这一困境。对于每个查询，排名靠前的文档作为伪正例，排名靠后的作为伪负例，提供噪声但可用的监督信号，通过少量梯度更新来适应双线性评分矩阵 $W$。我们进一步引入置信加权边际损失和跨查询动量缓冲区，以预热跨查询的适应过程。在六个 BEIR 基准上，DART 相对于稠密检索基线实现了平均每个数据集 NDCG@10 相对提升 +2.1%，且每个查询额外延迟低于 10ms，展示了强大的零样本性能提升和跨领域泛化能力。

英文摘要

Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most of BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix $W$ via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.

URL PDF HTML ☆

赞 0 踩 0

2606.01066 2026-06-02 cs.AI 版本更新

Before the Model Learns the Bug:Fuzzing RLVR Verifiers

在模型学会Bug之前：模糊测试RLVR验证器

Jaideep Ray

发表机构 * ACM

AI总结本文提出一个轻量级验证器模糊测试框架，通过生成对抗性补全、比较有缺陷与严格的参考验证器，并报告多种指标，以研究RLVR中验证器错误导致优化学习Bug的失败模式。

2606.01065 2026-06-02 cs.DC cs.AI cs.LG 版本更新

TravelEval：评估基于LLM的旅行规划代理的综合基准框架

Weiyi Chen, Shuaixiong Wang, Ziyun Gao, Kaichun Hu, Wangze Ni, Shimin Di, Chen Jason Zhang, Lei Chen

发表机构 * Zhejiang University（浙江大学）； Hong Kong Polytechnic University（香港理工大学）； Southeast University（东南大学）； HKUST (GZ) & HKUST Guangzhou（香港科技大学（广州）& 香港科技大学（广州））

AI总结针对现有基准过度关注约束合规、缺乏真实性和多维评估的问题，提出TravelEval，通过六维评估框架、真实数据沙盒和模拟全局评估方法，全面评估LLM在旅行规划中的表现。

Comments 31pages, 8 figures, accepted by KDD 2026

详情

DOI: 10.1145/3770855.3817533

AI中文摘要

大型语言模型（LLM）的发展显著提升了旅行规划应用，但现有基准的局限性限制了对其评估：1）过度强调约束合规，忽视时空成本等多维质量；2）数据集缺乏真实世界真实性和关键领域（如住宿、交通）的覆盖；3）孤立的每日计划评估遗漏了评估整个计划所需的关键细节（例如每日住宿和参观节奏的影响）。为解决这一差距，我们引入了TravelEval，一个真实且全面的基准。TravelEval具有1）一个新颖的六维评估框架，从准确性、合规性、时间性、空间性、经济性和实用性维度全面评估计划；2）一个高度真实的数据沙盒，包含精确的住宿定价和真实的城际交通数据；3）一种基于模拟的全局评估方法，通过集成API的地理信息和细粒度排队时间模拟完整的旅行计划。使用TravelEval评估12种主流方法揭示了若干有价值的见解，例如LLM在全局优化的多维规划（特别是时空推理和预算合规）方面存在困难，而代理推理策略并未提供一致的改进。简而言之，TravelEval通过基于现实的时空模拟和全面指标促进旅行计划评估，为推进基于LLM的旅行规划研究和应用提供了坚实基础。

英文摘要

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks' limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed for entire plan's evaluation. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and agentic reasoning strategies offer no consistent improvement. Concisely, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

URL PDF HTML ☆

赞 0 踩 0

2606.01042 2026-06-02 cs.LG cs.AI 版本更新

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

似真性不是预测：基于LLM的细胞扰动推理的对比证据

Xinyu Yuan, Xixian Liu, Jianan Zhao, Yashi Zhang, Hongyu Guo, Jian Tang

发表机构 * Mila - Québec AI Institute（魁北克人工智能研究所）； University of Montréal（蒙特利尔大学）； HEC Montréal（蒙特利尔HEC商学院）； University of Ottawa（渥太华大学）； National Research Council of Canada（加拿大国家研究理事会）； CIFAR AI Chair（CIFAR人工智能 chair）

AI总结本文发现基于大语言模型的细胞扰动推理虽能生成生物上合理的解释，但实际预测性能差，并提出CORE方法通过对比证据组织来提升扰动特异性预测。

详情

AI中文摘要

扰动实验对于理解细胞机制至关重要，但成本高昂且稀疏，因此需要预测未观察条件下的基因表达响应。最近一个有前景的方向是利用大语言模型（LLM）作为“虚拟细胞”模拟器——通过逐步的、基于知识的机械推理来推断差异表达——指向一种可解释的、知识驱动的范式，超越了纯粹的数据驱动方法。然而，我们发现似真性不是预测：尽管产生了生物上合理的解释，这些方法未能捕捉扰动特异性效应：系统性地高估差异表达，在聚合评估中通常表现不如简单的基因频率基线，并且在每个基因水平上降至随机水平。这揭示了对内在基因响应倾向的依赖，而非真正的扰动推理。我们将这一失败追溯到证据呈现方式：现有方法孤立地评估扰动-基因对，而不揭示相关扰动对同一基因的影响差异。为解决这一局限性，我们引入了CORE（对比关系证据组织），通过将证据组织成来自相关扰动的正面和负面结果，将预测重新定义为比较任务。使用生物医学知识图谱进行证据检索，CORE在基于LLM和非LLM的设置中均改善了校准并大幅提升了扰动特异性预测：例如，在药物扰动数据上，CORE-Reasoning将Qwen3.5-9B的聚合指标提升了高达28.6%；而在通用扰动数据上，CORE-Voting将四个细胞系的每个基因平均AUROC从随机水平提高到0.703。这突显了对比证据组织对于可靠的基于LLM的扰动推理至关重要。

英文摘要

Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as "virtual cell" simulators-using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression-pointing toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects: systematically overestimating differential expression, often underperforming a simple gene-frequency baseline in aggregate evaluations, and collapsing to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 in average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning

URL PDF HTML ☆

赞 0 踩 0

2606.01039 2026-06-02 cs.LG cs.AI 版本更新

OPD+: Rethinking the Advantage Design for On-Policy Distillation

OPD+: 重新思考在线策略蒸馏中的优势设计

Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

发表机构 * Columbia University（哥伦比亚大学）； Amazon（亚马逊）； Meta ； Capital One

AI总结本文提出OPD+，通过修正在线策略蒸馏中因停止梯度操作导致的奖励目标偏差，并支持多种f-散度，在数学推理和工具使用基准上提升了性能。

详情

AI中文摘要

在线策略蒸馏（OPD）是一种广泛使用的技术，用于将能力强的教师语言模型的能力迁移到基础学生模型，并且可以通过使用学生生成的轨迹来制定强化学习风格的目标。然而，尽管散度奖励依赖于学生模型的可能性，现有工作通常采用停止梯度设计主要是为了稳定性，这使得得到的优势估计存在问题。在这项工作中，我们提供了一个基于学生和教师之间f-散度的通用优化框架，并从数学上重新审视这种设计空间是否有效。我们证明，对于一般的散度函数，一般的停止梯度操作会导致奖励目标和相应梯度的有偏估计。我们提出了OPD+，这是OPD的修正版本，在基线KL方法上展示了改进的性能，并且也支持各种f-散度的选择。我们在数学推理和工具使用基准上验证了我们的发现。

英文摘要

On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student generated rollouts. Yet, despite the divergence reward being dependent on student model likelihood, existing works usually adopt a stop gradient design primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on f-divergence between the student and teacher, and mathematically revisit whether such design space is valid. We prove that general stop-gradient operation would lead to biased estimates of the reward objective and corresponding gradient for general divergence functions. We propose OPD+, the corrected version of OPD that demonstrates improved performance over the baseline KL approach and also supports the choice of various f-divergence. We validate our findings on mathematical reasoning and tool-use benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2606.01033 2026-06-02 cs.AI 版本更新

ProductWebGen: 多模态产品网页生成基准测试

Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng

发表机构 * School of Computer Science & Zhiyuan College（计算机科学学院及智远学院）； Shanghai Jiao Tong University（上海交通大学）； Kuaishou Technology（快手科技）

AI总结提出ProductWebGen基准，用于评估多模态生成模型从产品图像和指令生成一致产品展示网页的能力，并比较了基于编辑和基于统一模型两种工作流。

Comments Accepted by KDD 2026

详情

DOI: 10.1145/3770855.3817507

AI中文摘要

从源产品图像以及布局和视觉内容指令中制作产品展示网页，对于营销、广告和电子商务等领域具有重要的实用价值。直观上，该任务要求产品展示之间严格的视觉一致性以及高保真度的指令遵循，以联合生成可渲染的HTML代码。这些对可控性和指令遵循的要求与先进多模态生成模型（如图像编辑模型和统一模型）的核心特征紧密一致。为此，本文引入ProductWebGen来系统性地基准测试这些模型的产品网页生成能力。我们组织了包含500个测试样本的ProductWebGen，涵盖13个产品类别；每个样本由源图像、视觉内容指令和网页指令组成。任务是根据源图像和指令生成包含多个一致图像的产品展示网页。鉴于任务的混合模态输入输出性质，我们设计并系统比较了两种评估工作流——一种使用大语言模型和图像编辑模型分别生成HTML代码和图像（基于编辑），另一种依赖单个统一模型生成两者，其中图像生成依赖于先前的多模态上下文（基于统一模型）。实验结果表明，基于编辑的方法在网页指令遵循和内容吸引力方面取得领先结果，而基于统一模型的方法在满足视觉内容指令方面可能展现出更多优势。我们还构建了一个监督微调数据集ProductWebGen-1k，包含1000组真实产品图像和LLM生成的HTML代码。我们在开源统一模型BAGEL上验证了其有效性。数据和代码可在https://github.com/SJTU-DENG-Lab/ProductWebGen获取。

英文摘要

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models. To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation -- one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.

URL PDF HTML ☆

赞 0 踩 0

2606.01020 2026-06-02 cs.AI cs.LG 版本更新

Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation

通过苏格拉底式提问和批判性论证教授外行人逻辑谬误，以应对错误信息的根源

Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury, Mrinmaya Sachan

发表机构 * ETH Zurich（苏黎世联邦理工学院）； ETH AI Center（苏黎世联邦理工学院人工智能中心）

AI总结提出LFTutor智能辅导系统，利用大语言模型结合苏格拉底式提问和批判性论证原则，帮助外行人学习识别逻辑谬误，显著优于基线模型。

Comments This paper has been accepted to Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Long Paper), Main Conference

详情

Journal ref: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026

AI中文摘要

识别日常话语中的逻辑谬误对许多人来说具有挑战性。这一挑战在大语言模型（LLMs）时代被放大，恶意行为者可以利用谬误论证大规模传播错误信息。在这项工作中，我们探索了LLMs作为解决方案一部分的潜力。我们介绍了LFTutor，一个智能辅导系统，它使用LLMs辅导外行人，帮助他们学习逻辑谬误。LFTutor整合了意图驱动的苏格拉底式提问和批判性论证原则，以积极引导学习者反思自己的推理。通过自动评估和人工评估，我们证明LFTutor显著优于缺乏这些教学策略的基线LLMs。这项工作突显了将LLMs与教学支架相结合以在人工智能时代培养批判性思维和论证素养的前景。

英文摘要

Identifying logical fallacies in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Language Models (LLMs), where malicious agents can deploy fallacious arguments to disseminate misinformation at scale. In this work, we explore the potential of LLMs as part of the solution. We introduce LFTutor, an intelligent tutoring system which uses LLMs to tutor laypeople and help them learn about logical fallacies. LFTutor integrates intent-driven Socratic questioning and critical argumentation principles to actively engage learners to reflect on their reasoning. Through both automatic and human evaluations, we demonstrate that LFTutor significantly outperforms baseline LLMs lacking these pedagogical strategies. This work highlights the promise of combining LLMs with pedagogical scaffolding to foster critical thinking and argument literacy in the age of AI.

URL PDF HTML ☆

赞 0 踩 0

2606.01019 2026-06-02 cs.CL cs.AI 版本更新

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

混合验证解码：在推测解码中学习分配验证

Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard

发表机构 * Thoughtworks ； Nvidia

AI总结提出混合验证解码方法，通过预测缓存草稿的接受长度并在缓存验证与模型草稿之间动态选择，在代理工作流中平均加速2.73倍。

详情

AI中文摘要

大型语言模型（LLM）生成仍然昂贵，因为自回归解码每生成一个新token就调用一次模型。推测解码通过草拟多个token并用目标模型一步验证来降低成本，但其加速取决于接受的草稿token数量。无参数草稿源可以在结构化和代理工作负载中以低成本提出长续写，但一个生成步骤中看起来有前景的缓存匹配可能在下一步收益很低。我们提出混合验证解码，在验证前预测缓存草稿的接受长度，并使用该收益估计在缓存验证和基于模型的草稿器之间进行选择。在三个LLM和十六个数据集上，混合验证解码在代理工作流中特别有效，在每个设置中均优于EAGLE3，平均加速2.73倍。我们的分析揭示了提示结构如何创造缓存机会，高收益缓存草稿如何集中在草稿空间的一小部分，以及收益引导的选择如何减少顺序解码工作，指向运行时草稿选择作为推测解码的一个有前景的方向。

英文摘要

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.

URL PDF HTML ☆

赞 0 踩 0

2606.01016 2026-06-02 cs.CL cs.AI eess.AS 版本更新

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

PolySpeech-100：面向100多种语言和方言的大规模语音理解基准

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

发表机构 * Shenzhen International Graduate School, Tsinghua University（深圳国际研究生院，清华大学）； Department of Electronic Engineering, Tsinghua University（清华大学电子工程系）； JD AI Research（京东人工智能研究院）

AI总结为解决现有语音评估基准在资源丰富语言偏向、缺乏语义推理和忽视方言的问题，提出PolySpeech-100基准，通过混合构建管道覆盖110种语言变体，并评估22个模型，发现开源端到端模型在重方言上优于级联系统，而思维链提示在零样本设置下会降低性能。

Comments 19 pages, 13 figures, KDD 2026

详情

AI中文摘要

虽然端到端（E2E）语音大语言模型（Speech-LLMs）正在快速发展，但它们的评估方法仍局限于简单转录的时代。现有基准存在三个关键限制：明显偏向高资源语言、关注低级识别（ASR）而非语义推理，以及忽视区域方言。为弥补这一差距，我们引入了PolySpeech-100，这是一个大规模基准，旨在评估110种语言变体上的“母语级”语音理解。我们采用了一种新颖的混合构建管道，将黄金标准的人类录音与指令驱动的合成语音相结合，从而覆盖了19种不同的中文方言和80多种低资源语言。对22个最先进模型（包括Gemini-3、GPT-Audio和Qwen2.5-Omni）的广泛评估得出了关键见解。首先，我们证明开源端到端模型在重方言上优于级联（ASR+LLM）系统，证明直接音频处理保留了标准转录中经常丢失的关键副语言线索和韵律特征（例如语调、重音）。其次，我们揭示了一个显著的性能差距：虽然商业模型保持稳健，但开源模型在低资源语言上遭受灾难性退化。最后，反直觉的是，我们观察到在标准零样本设置下，思维链提示经常降低大多数评估模型的语音理解性能，揭示了当前架构中潜在的多模态对齐差距。PolySpeech-100为下一代包容性、全能的语音LLM建立了严格标准。数据、演示和代码公开于https://github.com/YoungSeng/PolySpeech-100。

英文摘要

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

URL PDF HTML ☆

赞 0 踩 0

2606.01015 2026-06-02 cs.RO cs.AI cs.NI cs.SY eess.SY 版本更新

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

AI-IoT-机器人集成：框架、新兴趋势及迈向互联机器人的路径综述

Ranulfo Bezerra, Satoshi Tadokoro, Kazunori Ohno

发表机构 * Tohoku University（东大大学）

AI总结本文综述了人工智能、物联网和机器人三者融合的现状，提出了模块化系统架构，并强调了小语言模型（SLM）和大型语言模型（LLM）在分布式认知与自主决策中的作用，为下一代互联机器人和物理AI生态系统提供了概念和技术路线图。

Comments 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal

详情

DOI: 10.1109/JIOT.2026.3670191
Journal ref: IEEE Internet of Things Journal, vol. 13, no. 10, pp. 20398-20412, 15 May15, 2026

AI中文摘要

人工智能、物联网和机器人的融合不再是未来的愿景；它正迅速成为实时、智能和上下文感知系统的基础。AI实现感知和推理，IoT提供可扩展的感知和通信，而机器人则提供具身驱动。尽管在AIoT和物联网机器人（IoRT）等两两组合方面取得了显著进展，但仍缺乏完全整合这三者的统一设计框架。本综述综合了这些领域的最新进展，强调了边缘端的小语言模型（SLM）和云端的大型语言模型（LLM）在分布式认知和自主决策中的新兴作用。我们提出了一个符合这些趋势的模块化系统架构，分析了互操作性和反馈控制中存在的持续差距，并根据集成深度对现有工作进行了分类。我们的综述强调了混合SLM-LLM系统与IoT基础设施和机器人代理相结合时，如何应对实时适应、可扩展性和可靠性方面的挑战。这项工作为设计模块化、可解释且能够在动态环境中学习的下一代AI-IoT-机器人生态系统提供了概念和技术路线图，为新兴的互联机器人和物理AI范式铺平了道路。

英文摘要

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.

URL PDF HTML ☆

赞 0 踩 0

2606.01014 2026-06-02 cs.CV cs.AI 版本更新

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

基于文本的三维人体运动编辑中的跨轴特征融合与关节运动差异预测

Gyojin Han, Junmo Kim

发表机构 * School of Electrical Engineering, KAIST（韩国科学技术院电子工程学院）

AI总结提出一种跨轴特征融合架构和辅助任务，通过联合锚定变换器预测关节运动差异，实现文本驱动的三维人体运动编辑，在MotionFix数据集上达到最优性能。

Comments CVPR 2026

详情

AI中文摘要

我们研究基于文本的三维人体运动编辑，目标是保留源运动的风格和结构，同时应用自然语言描述的编辑。MotionFix数据集的发布推动了基于训练扩散模型的直接生成编辑运动的研究，这些模型从源运动和文本指令生成编辑运动。虽然先前的工作主要关注学习编辑在时间上何时发生，但我们的目标是创建一个不仅理解时间方面，还理解哪些特定关节负责变化的模型。为此，我们提出了一种新颖的架构和一个互补的辅助任务来辅助其训练。我们的架构由两个轴锚定变换器组成，分别沿关节和时间维度提取不同特征，以及一个跨轴融合块来整合这些表示。我们进一步引入一个辅助任务，训练关节锚定变换器回归源和目标关节旋转之间的Soft-DTW距离。该目标教会模块理解哪些关节需要修改，哪些需要保留。通过在MotionFix数据集上的全面实验，我们证明我们的方法显著提高了与文本指令和源运动的语义对齐，以及生成运动的整体保真度，达到了最先进的结果。

英文摘要

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.

URL PDF HTML ☆

赞 0 踩 0

2606.01012 2026-06-02 cs.AI cond-mat.mtrl-sci 版本更新

Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach

堆叠双层材料的性质预测：一种多模态学习方法

An Vuong, Minh-Hao Van, Chen Zhao, Xintao Wu

发表机构 * University of Arkansas（亚拉巴马大学）； Baylor University（贝勒大学）

AI总结提出一种多模态学习方法，通过联合建模不同材料层间的界面，预测给定配置下垂直堆叠产生的性质，实验证明其有效性和高效性。

Comments Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

详情

AI中文摘要

AI for materials science 是 AI for science 中的一个关键主题，旨在加速材料发现并产生准确的性质预测。双层二维材料堆叠对于探索具有新功能和内在现象的新材料至关重要，能够创建用于各种实际应用的新型二维双层材料。从实验和计算角度对双层 vdWs 材料的研究已取得显著进展。多种双层材料已通过实验成功合成，并且高通量计算技术的日益普及构建了几个计算二维材料数据库。然而，利用 AI 对双层堆叠进行建模并预测新性质的研究仍不充分，需要进一步研究。在这项工作中，我们提出了一种新颖的多模态学习方法，用于研究不同材料之间的界面，这些界面共同实现新的或多种功能，并预测在给定配置下不同功能材料层垂直集成（堆叠）产生的新性质。综合实验证明了我们方法相对于基线方法的有效性和高效性。我们的代码可在 https://github.com/AnVuong123/bimat_ml 获取。

英文摘要

AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real-world applications. Research on bilayer vdWs materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthe sized experimentally and the increasing utilization of high-throughput computing technology has con structed several computational two-dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored, necessitating further research studies. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at https://github.com/AnVuong123/bimat ml.

URL PDF HTML ☆

赞 0 踩 0

2606.01008 2026-06-02 cs.SE cs.AI 版本更新

FVSpec: Real-World Property-Based Tests as Lean Challenges

FVSpec: 作为精益挑战的真实世界基于属性的测试

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

发表机构 * Forall R&D（Forall 研发）； Benchify ； Galois Inc（Galois 公司）

AI总结提出FVSpec基准，通过从真实Python仓库中提取属性测试并自动翻译为Lean 4规范，评估AI在形式化验证任务上的能力。

详情

AI中文摘要

我们提出了一个用于评估AI模型和智能体在真实世界形式化软件验证任务上的基准。首先从真实世界的Python仓库中抓取11,039个基于属性的测试（PBT），然后自动将其中2,772个（25%）翻译成9,415个带有sorry占位符的Lean 4规范（每个PBT约3个形式化；当没有形式化在质量指标上占优时，我们保留多次尝试）。将PBT翻译成Lean规范具有挑战性：需要在Lean中建模Python语义，推断命令式PBT中编码的逻辑属性，并处理在很少使用的语言中进行依赖类型编程的固有困难。我们描述了一个用于将PBT转译为Lean规范的三智能体LLM流水线，评估覆盖率和质量指标，并使用多种自动化和基于模型的方法为证明生成提供基线。所有代码（爬虫和智能体）和数据（PBT和Lean规范）都是开源的。我们的基准旨在推动AI辅助形式化验证真实世界软件这一尚未充分探索的问题的进展，随着AI生成越来越多的代码，这一问题日益受到关注。

英文摘要

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.

URL PDF HTML ☆

赞 0 踩 0

2606.01007 2026-06-02 cs.LG cs.AI 版本更新

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

超越任务无关：面向通信高效的多任务MoE推理的任务感知分组

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao, Yong Jiang, Qing Li

发表机构 * Tsinghua Shenzhen International Graduate School（清华大学深圳国际研究生院）； Pengcheng Laboratory（鹏城实验室）

AI总结提出任务感知共激活分组（TACG）框架，通过任务特定的共激活模式优化专家放置，并引入通用专家共享复制（GESR）应对在线负载倾斜，在三个MoE模型上平均降低通信成本31.39%，保持公平性指数0.9975。

详情

AI中文摘要

稀疏激活的混合专家（MoE）模型通过条件计算扩展容量，但分布式推理面临跨GPU专家通信和路由引起的负载不平衡问题。现有的放置方法通过共同定位频繁共激活的专家来降低这一成本；然而，它们从全局聚合的路由轨迹中推导出单一部署方案，从而平均掉了多任务服务中实际驱动通信的异构、任务特定的共激活模式。我们观察到专家共激活强烈依赖于任务：在一个任务族中紧密耦合的专家对在另一个任务族中往往不相关，因此有效的部署应根据任务感知的共激活而非任务无关的平均值来分组专家。基于这一见解，我们提出了任务感知共激活分组（TACG），这是一个部署时框架，利用族特定的调度和共激活轨迹推导每个专家的任务族偏好，重新加权共激活图使得族内局部性主导分组，并在精确容量约束下将每个专家分配到主GPU。为了使静态放置对在线工作负载倾斜保持鲁棒，我们进一步引入了通用专家共享复制（GESR），这是一个轻量级辅助方法，识别具有持续中心共激活特征的通用专家，将它们复制到少量辅助GPU上，并在服务时应用局部性和负载感知的选择。在三个代表性的开源MoE模型上的实验表明，我们的框架相比基线平均降低了31.39%的通信成本，同时保持了平均Jain公平指数0.9975。即使在推理数据出现严重分布偏移的情况下，这一优势依然存在，持续优于强基线。

英文摘要

Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose \emph{Task-Aware Coactivation Grouping} (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce \emph{Generic Expert Shared Replication} (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39\% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, consistently outperforming strong baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.00991 2026-06-02 cs.AI 版本更新

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

交通系统管理与运营中的大语言模型：从文本推理到多模态决策支持

Siyan Li, Zehao Wang, Jiachen Li, Kanok Boriboonsomsin, Matthew J. Barth, Guoyuan Wu

发表机构 * Bourns College of Engineering, Center for Environmental Research and Technology, University of California at Riverside, CA, USA（伯恩斯工程学院，环境研究与技术中心，加州大学河滨分校，美国，加利福尼亚州河滨）

AI总结本文综述了大语言模型（LLM）和多模态大语言模型（MM-LLM）在交通系统管理与运营（TSMO）中的应用，涵盖运营与服务、移动性与车队服务、数据建模与决策支持三大领域，并指出了数据异构性、实时推理、可解释性等挑战及未来方向。

Comments Preprint version

详情

AI中文摘要

交通系统管理与运营（TSMO）越来越依赖于对各种传感器流、事件报告、旅行者反馈和视觉观测等异构数据的及时解读。大语言模型（LLM），包括新兴的多模态大语言模型（MM-LLM），为将这些结构化和非结构化输入整合到面向操作者的决策支持中提供了新机制。本文综述了基于LLM和MM-LLM在TSMO中的应用，涵盖三个领域：交通运营与服务（供给）、移动性与车队服务（需求）以及数据、建模与决策支持。通过PRISMA指导的筛选过程，我们综合了当前研究，同时区分了面向操作的应用与原型及新兴概念。我们进一步识别了数据异构性、实时推理、可解释性、多模态融合和治理方面的反复出现的挑战。最后，我们概述了在本地化适应、边缘部署、基准测试和跨机构协作方面的现有差距和未来方向。总体而言，基于LLM的系统作为决策支持层最有前景，而MM-LLM在需要整合异构文本、视觉和传感器输入时尤其有价值。

英文摘要

Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various sensor streams, incident reports, traveler feedback, and visual observations. Large language models (LLMs), including emerging multi-modal large language models (MM-LLMs), provide a new mechanism for integrating these structured and unstructured inputs into operator-facing decision support. This survey paper reviews LLM- and MM-LLM-based applications in TSMO across three domains: transportation operations & services (supply), mobility & fleet services (demand), and data, modeling & decision support. Using a PRISMA-guided screening process, we synthesize current studies while distinguishing operationally oriented applications from prototype and emerging concepts. We further identify recurring challenges in data heterogeneity, real-time inference, explainability, multi-modal fusion, and governance. Finally, we outline existing gaps and future directions in localized adaptation, edge deployment, benchmarking, and cross-agency collaboration. Overall, LLM-based systems appear most promising as a decision-support layer, with MM-LLMs offering particular value when heterogeneous text, visual, and sensor inputs must be integrated.

URL PDF HTML ☆

赞 0 踩 0

2606.00987 2026-06-02 cs.CV cs.AI 版本更新

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

多时相指代分割的开源基准与基线

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

发表机构 * University of Science and Technology of China（中国科学技术大学）； Institute of Artificial Intelligence (TeleAI)（人工智能研究所）； China Telecom（中国电信）； School of Artificial Intelligence, Optics and Electronics (iOPEN)（人工智能、光学与电子学院）； Northwestern Polytechnical University（西北工业大学）

AI总结提出多时相指代分割任务，通过自动化数据构建管道CRAFT-Agent生成首个基准MTRefSeg-21K，并设计两阶段训练的变化感知LVLM框架MTRefSeg-R1，实现优于现有基线的性能。

详情

AI中文摘要

大型视觉语言模型（LVLMs）展现了强大的视觉理解和语言引导定位能力，但其多时相视觉推理能力仍未充分探索。为填补这一空白，我们引入了 extbf{多时相指代分割（MTRS）}，这是一个新任务，旨在从多时相图像中分割语言描述的时间变化。MTRS通过联合要求时相对应推理、语言定位和像素级掩码预测，扩展了传统的指代分割和变化检测。我们提出了 extbf{CRAFT-Agent}，一个带有人工审核的自动化数据构建管道，并构建了 extbf{MTRefSeg-21K}，这是第一个MTRS基准，包含21K个高质量的多时相图像-文本-掩码三元组，覆盖多样化的场景、视角和领域。对一系列基于VLM和LVLM的模型进行基准测试表明，直接推理表现较差，而任务特定的微调仍然有限。为解决这一问题，我们提出了 extbf{MTRefSeg-R1}，一个采用两阶段策略训练的变化感知LVLM框架。它首先从20K个仅视觉的双时相样本中学习通用时间变化感知，然后在MTRefSeg-21K上进行微调，以实现细粒度的语言引导时间定位。MTRefSeg-R1显式建模跨时相视觉差异，将语言指令与时间变化对齐，并预测所指变化掩码。大量实验表明，与现有的LVLM基线相比，MTRefSeg-R1实现了强大且通常更优的性能，展示了MTRS的挑战和潜力。

英文摘要

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce \textbf{Multi-temporal Referring Segmentation (MTRS)}, a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose \textbf{CRAFT-Agent}, an automated data construction pipeline with human auditing, and build \textbf{MTRefSeg-21K}, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose \textbf{MTRefSeg-R1}, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.

URL PDF HTML ☆

赞 0 踩 0

2606.00970 2026-06-02 cs.AI cs.LG econ.TH 版本更新

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

具有灾难性状态的MDP中贝尔曼最优性产生的前景理论行为

Yujiao Chen

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

AI总结研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制，发现标准贝尔曼最优性产生前景理论特征：S形值函数、内生损失敏感系数和反射效应策略反转，并推导出渐近损失厌恶平台的闭式表达式。

详情

AI中文摘要

我们研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制。尽管奖励是线性的且智能体没有效用曲率、概率加权或框架依赖，标准贝尔曼最优性产生了三个前景理论特征：S形值函数轮廓（灾难附近凸，远处凹）、内生损失敏感系数$λ^*(S) > 1$以及反射效应策略反转。在495个配置中，最优策略在正漂移（增长）模式下在灾难附近选择安全动作，尽管风险动作的即时期望值更高；在负漂移（衰退）模式下在灾难附近选择风险动作，尽管安全动作的即时期望损失更低。我们推导出渐近损失厌恶平台$\barλ$的闭式表达式，该表达式仅依赖于获胜概率$p$、收益不对称性$r = |Δ_\ell/Δ_w|$和折扣因子$β$，与数值解的拟合$R^2 = 0.999$。该机制不需要不对称收益。在三个不对称水平下对$(p,β)$进行扫描，$\barλ$大于1的不对称份额中位数为4.6%（$r = 1.25$时），上升到13.9%（$r = 2$时），且在每个测试单元中边界贡献超过不对称贡献。这些现象在表格Q学习（无模型智能体在增长模式下与$V^*$的相关性为0.98，衰退模式下为1.00）以及随机转移（高斯、重尾Student-$t_3$和不对称偏正态噪声，幅度高达步长的50%）中持续存在，其中渐近平台在安全通道噪声下跟踪闭式预测的误差在0.41%以内，在风险通道或双通道噪声下误差在9.6%以内。这些结果将吸收失败状态识别为最优控制下产生前景理论行为的充分结构机制。

英文摘要

We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient $λ^*(S) > 1$, and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action's higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action's lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau $\barλ$ that depends only on win probability $p$, payoff asymmetry $r = |Δ_\ell/Δ_w|$, and discount factor $β$, and matches numerical solutions to $R^2 = 0.999$. The mechanism does not require asymmetric payoffs. Across a sweep of $(p,β)$ at three asymmetry levels, the asymmetry share of $\barλ$ above unity has median 4.6% at $r = 1.25$ and rises to 13.9% at $r = 2$, with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces $V^*$ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student-$t_3$, and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.

URL PDF HTML ☆

赞 0 踩 0

2606.00962 2026-06-02 cs.CR cs.AI 版本更新

SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

SS-ZKR：面向隐私保护多智能体协作的空间语义零知识路由

Hassan Touheed

发表机构 * Linux Foundation（Linux基金会）； Google（谷歌）； W3C（万维网联盟）

AI总结提出SS-ZKR协议，通过差分隐私语义意图向量、自适应净化和空间到密码策略编译器三种机制，在不解密负载的情况下实现跨组织信任边界的内容感知语义路由。

详情

AI中文摘要

基础智能体互操作标准，特别是智能体到智能体（A2A）协议和模型上下文协议（MCP），推动了多智能体系统通信的发展，而利用W3C去中心化标识符（DID）和可验证凭证（VC）的补充身份框架提供了密码学智能体认证。然而，现有协议均不支持在无需路由中介解密负载的情况下，跨组织信任边界进行基于内容的智能体负载语义路由，而这在受GDPR、HIPAA和MiFID II监管的合规敏感环境中是硬性约束。我们提出SS-ZKR，一种三机制隐私保护路由协议，设计为A2A/MCP之上的补充层。机制一通过差分隐私语义意图向量引入盲路由，该向量密码学绑定到负载模式一致性的零知识证明。机制二提供向量加权自适应负载净化，对数值字段采用形式化(ε,δ)-差分隐私，对文本字段采用启发式语义聚合。机制三提出空间到密码策略编译器，将视觉定义的信任区域拓扑转换为确定性零知识访问电路。我们提供形式化威胁模型，分析意图向量的信息泄露界限，给出所有三种机制的伪代码，并与基于TEE和同态加密的路由基线进行解析复杂度比较。SS-ZKR允许金融服务、医疗保健和国防领域的企业跨监管边界编排异构AI智能体，而无需向路由基础设施暴露专有数据。

英文摘要

Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advanced multi-agent system communication, and complementary identity frameworks leveraging W3C Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) provide cryptographic agent authentication. However, no existing protocol supports content-based semantic routing of agent payloads across organisational trust boundaries without requiring the routing intermediary to decrypt the payload, which is a hard constraint in compliance-sensitive environments governed by GDPR, HIPAA, and MiFID II. We propose SS-ZKR, a three-mechanism privacy-preserving routing protocol designed as a complementary layer atop A2A/MCP. Mechanism I introduces blind routing via differentially private semantic intent vectors cryptographically bound to zero-knowledge proofs of payload-schema consistency. Mechanism II offers vector-weighted adaptive payload sanitisation with formal (epsilon, delta)-differential privacy for numerical fields and heuristic semantic aggregation for textual fields. Mechanism III presents a spatial-to-cryptographic policy compiler that translates visually defined trust-zone topologies into deterministic zero-knowledge access circuits. We provide a formal threat model, analyse information leakage bounds of intent vectors, present pseudocode for all three mechanisms, and give analytical complexity comparisons against TEE-based and homomorphic encryption-based routing baselines. SS-ZKR lets enterprises in financial services, healthcare, and defence orchestrate heterogeneous AI agents across regulatory boundaries without exposing proprietary data to routing infrastructure.

URL PDF HTML ☆

赞 0 踩 0

2606.00959 2026-06-02 cs.AI 版本更新

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

通过部分信息分解理解多模态语言模型中的模态交互

Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结引入部分信息分解（PID）框架，分离感官和语言输入的独特、冗余和协同贡献，揭示多模态大模型中的模态使用模式，并扩展至三模态系统。

Comments Accepted by ICML 2026

详情

AI中文摘要

理解多模态大语言模型（MLLMs）中的模态交互对于可靠部署至关重要。我们引入部分信息分解（PID）作为决策级框架，将感官和语言输入的独特、冗余和协同贡献分离，超越了表示对齐和基于结果的评估。在视觉-语言基准测试中，PID揭示了重复出现的模态使用模式：推理和接地导向的任务往往表现出高协同性，而专家和知识导向的任务则显示出更强的语言独特性依赖。这些模式在不同模型家族中普遍存在，并能预测对模态级干预的敏感性。我们进一步将PID扩展到三模态系统，提出感官PID，将语言作为控制变量来分解视频-音频信息增益。应用于全模态模型时，感官PID揭示了感官协同瓶颈，即使在音视频融合任务中也以视觉信息为主。最后，PID引导的重新加权为改进多模态推理和接地性能提供了初步证据。

英文摘要

Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision--language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video--audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio--visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.

URL PDF HTML ☆

赞 0 踩 0

2606.00949 2026-06-02 cs.LG cs.AI physics.flu-dyn 版本更新

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

可解释深度强化学习揭示湍流减阻的节能控制策略

Federica Tonti, Ricardo Vinuesa

发表机构 * Department of Aerospace Engineering University of Michigan（航空航天工程系密歇根大学）

AI总结结合多智能体深度强化学习与可解释深度学习，提出基于SHAP归因的奖励策略，实现高效湍流减阻，净节能达34.01%且输入功率仅0.43%。

详情

AI中文摘要

我们提出了一种结合多智能体深度强化学习（MARL）和可解释深度学习（XDL）的方法，用于减少壁面边界湍流中的阻力。以直接针对壁面剪切应力和反对称控制训练智能体的结果作为基线，比较了三种SHAP引导的方法。第一种方法中，奖励根据预测未来速度场的U-net的SHAP归因计算；第二种方法中，奖励根据预测摩擦系数的U-net的SHAP归因计算；第三种方法中，奖励结合了分别预测摩擦系数和壁面压力脉动的两个U-net的SHAP归因。基于摩擦系数和壁面压力脉动的组合SHAP策略实现了最佳整体性能，在仅0.43%归一化输入功率下实现了34.44%的减阻率（DR）和34.01%的净节能率（NES）。相对于反对称控制，减阻和净节能分别提高了49.41%和48.52%。与直接壁面剪切应力基线相比，所提出的策略在提高性能的同时，将归一化驱动成本从5.90%降低到0.43%。结果分析表明，节能策略与压力门控驱动一致，主要在壁面压力接近零时激活，并且其时间尺度与近壁湍流结构的寿命相当。

英文摘要

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, achieving a DR of 34.44% and a NES of 34.01% with only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy simultaneously improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a temporal timescale comparable to the lifetime of the near-wall turbulent structures.

URL PDF HTML ☆

赞 0 踩 0

2606.00946 2026-06-02 cs.DC cs.AI cs.LG 版本更新

开放智能体技能生态系统中的安全风险检测与验证基准测试

Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder, Nan Jiang

发表机构 * University of Texas at El Paso（德克萨斯理工大学）； Southern Illinois University-Carbondale（南方伊利诺伊大学卡本代尔分校）； Purdue University（普渡大学）

AI总结提出SkillVetBench，一个两阶段安全审查基准，通过语义审查和沙箱执行检测与验证开放智能体技能生态系统中的恶意技能。

详情

AI中文摘要

开放智能体平台允许社区贡献者发布可重用的技能，智能体可在运行时调用。这种可扩展性也带来了供应链风险：恶意贡献者可以将有害行为隐藏在表面检查看似良性的技能中。然而，现有防御措施难以评估，因为没有同时衡量恶意技能检测和运行时验证的基准。我们提出了SkillVetBench，一个针对开放智能体技能生态系统的两阶段安全审查基准。第一阶段对每个技能的自然语言规范进行语义审查，以检测隐藏的恶意意图。第二阶段在沙箱中执行标记的技能，观察运行时行为并收集可审计的证据。我们从活跃的OpenClaw生态系统中确认的恶意技能构建基准，包括最近ClawHavoc供应链攻击中的样本。与仅静态方法不同，SkillVetBench通过执行轨迹验证检测到的威胁。我们的实验表明：（1）仅语义和基于签名的基线方法不足，最多遗漏89%的恶意技能，其威胁源于自然语言指令、多组件逻辑或跨组件交互；（2）运行时攻击集中在少量高权限原语上，特别是exec、write_file、install_skill和spawn；（3）SkillVetBench提供了案例研究，其中沙箱执行直接以具体的运行时证据支持恶意判定。

英文摘要

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill's natural-language specification to detect hidden malicious intent. The second stage executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. We build a benchmark from confirmed malicious skills in the live OpenClaw ecosystem, including samples from the recent ClawHavoc supplychain campaign. Unlike static-only methods, SkillVetBench verifies detected threats with execution traces. Our experiments show that: (1) semantic-only and signature-based baselines are insufficient, missing up to 89\% of malicious skills whose threats arise from natural-language instructions, multicomponent logic, or cross-component interactions; (2) runtime attacks are concentrated in a small set of high-permission primitives, especially exec, write\_file, install\_skill, and spawn; and (3) SkillVetBench provides case studies in which sandbox execution directly supports malicious verdicts with concrete runtime evidence.

URL PDF HTML ☆

赞 0 踩 0

2606.00920 2026-06-02 cs.LG cs.AI cs.SE 版本更新

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

大型语言模型在确定性编程任务上的准确性、稳定性和重复运行可靠性

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

发表机构 * Northeastern University, Massachusetts, USA（东北大学，马萨诸塞州，美国）； University of Southern California, California, USA（南加州大学，加利福尼亚州，美国）

AI总结通过重复运行评估协议，发现运行级通过率高估了无重试覆盖率高达17.8个百分点，且差距在中等性能系统中最大，表明稳定性分析是准确性报告的必要补充。

详情

AI中文摘要

运行级通过率高估了无重试覆盖率高达17.8个百分点——且差距恰恰在中等性能系统中最大。我们研究了大型语言模型（LLM）在确定性文本条件生成评估中的这种准确性-稳定性关系，以编程任务作为具体测试平台。标准代码生成基准强调单次运行准确性或在重复采样下的最终成功，但许多部署场景还需要稳定性：在相同任务描述下重复调用时的一致结果。我们提出了一种重复运行评估协议，包含运行级准确性、无重试覆盖率和每个问题的变异性指标。在一个包含100道LeetCode风格问题的基于近期的基准上，我们评估了来自五个提供者家族的16个模型，使用两种提示模板，每个问题重复运行五次，共产生16,000个评估实例。尽管运行级通过率与完美稳定率强相关（r=0.985），但通过率始终超过无重试覆盖率——这一差距达到17.8个百分点，并且即使在密切匹配的系统之间也会逆转模型排名。提示效应是模型依赖的，而非普遍有益的。这些结果表明，对于确定性文本条件生成任务，重复运行稳定性分析是传统准确性报告的必要补充。

英文摘要

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeated-run evaluation protocol with metrics for run-level accuracy, retry-free coverage, and per-problem variability. On a recency-based benchmark of 100 LeetCode-style problems, we evaluate 16 models from five provider families under two prompt templates with five repeated runs per problem, yielding 16,000 evaluation instances. Although run-level pass rate and perfect stability rate are strongly correlated (r=0.985), pass rate consistently exceeds retry-free coverage -- a gap that reaches 17.8 percentage points and reverses model rankings even among closely matched systems. Prompt effects are model-dependent rather than uniformly beneficial. These results suggest that repeated-run stability analysis is a necessary complement to conventional accuracy reporting for deterministic text-conditioned generation tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.00914 2026-06-02 cs.AI cs.CL cs.CR 版本更新

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

对抗性输入流引导LLM智能体决策偏离其默认行为

Rana Muhammad Usman

发表机构 * Independent Researcher（独立研究者）

AI总结本研究通过控制实验揭示，外部输入流的组成和排序能因果性地改变LLM智能体的下游决策，存在对抗性屈服、默认饱和及默认方向不对称三种响应模式，且该效应在多个决策领域普遍存在。

Comments 14 pages, 5 figures. Code, post pools, and 2,785 decision rollouts: https://github.com/ranausmanai/recommenders-as-control-surfaces

详情

AI中文摘要

LLM智能体越来越多地在消费排序后的外部信息流（如社交推送、搜索结果、检索上下文和邮件队列）后采取行动，然而安全评估几乎总是孤立地测试模型或用户提示，从未测试决定智能体在行动前读取内容的上游排序器。我们引入了一个受控协议，固定模型、角色、主题和最终决策提示，仅改变智能体在之前十轮“滚动”阶段中遇到的帖子的组成和顺序，从而隔离输入流策划对下游决策的因果效应。在来自三个独立实验室的四个现代开放指令LLM上进行的2,785次决策展开中，我们识别出三种响应模式：对抗性屈服、默认饱和以及默认方向不对称——其中单边输入流会扭转模型原本不确定的决策（最明显的情况下从5%到100%；Fisher p值低至3×10^-10），但无法动摇模型已经偏好或坚定持有的决策。该效应遵循剂量-反应曲线，通过生成器交换（排除了写作风格伪影）后依然存在，在多个决策领域（包括安全相关选择，如移除部署批准门或放松访问控制）中普遍存在，并且可以通过两种简单的输入流级防御部分缓解；前沿模型保留其默认行为。我们将推荐系统描述为LLM智能体的一种实用的、受默认边界约束的控制面，并认为智能体评估必须审计输入流层，而不仅仅是最终提示。

英文摘要

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.

URL PDF HTML ☆

赞 0 踩 0

2606.00909 2026-06-02 cs.CL cs.AI 版本更新

深入波动：用于跨被试脑电情绪解码的Morlet谱变换器

Jiaxin Qing, Lexin Li

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结针对脑电情绪识别中跨被试变异性问题，提出基于Morlet小波标记化、长上下文基线去除和频带特定空间投影的Morlet谱变换器（MST），无需预训练即可在SEED系列数据集上超越大型预训练模型和频域方法。

详情

AI中文摘要

我们研究基于脑电的跨被试情绪识别，这是脑机接口中一个实际重要但具有挑战性的问题。与具有清晰波形特征的任务不同，情绪相关的脑电信号主要编码在频谱功率中，且微弱、嘈杂，并在被试间高度变化。现有方法要么依赖需要大量数据但仍难以应对跨被试变异的大型预训练脑电基础模型，要么依赖频域编码器（能更好地反映频谱结构但存在表示不匹配、漂移主导的标记化以及缺乏频带特定空间建模）。在本文中，我们提出了Morlet谱变换器（MST），它围绕三个关键组件构建，并与时空变换器主干集成。首先，Morlet小波标记化提供了与脑节律多尺度结构匹配的时频表示，并将经典微分熵特征扩展到适合变换器的形式。其次，长上下文基线去除作为一种简单的时间归一化，消除了被试特定漂移和附近窗口间的冗余。第三，频带特定空间投影为每个频带学习独立的通道混合器，捕获可解释的频带特定模式并减少跨通道混合。我们表明，即使没有预训练，MST在所有SEED系列数据集上始终优于大型预训练脑电基础模型和基于频率的方法。这些结果表明，精心的表示设计可以产生准确、经济且可解释的替代大规模预训练的方法。

英文摘要

We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike tasks with clear waveform signatures, emotion-related EEG signals are primarily encoded in spectral power and are weak, noisy, and highly variable across subjects. Existing approaches rely either on large pretrained EEG foundation models, which require massive data yet still struggle with cross-subject variability, or frequency-domain encoders, which better reflect spectral structure but suffer from mismatched representations, drift-dominated tokenization, and lack of band-specific spatial modeling. In this article, we propose the Morlet Spectral Transformer (MST), built around three key components and integrated with a spatiotemporal Transformer backbone. First, Morlet wavelet tokenization provides a time-frequency representation that matches the multi-scale structure of brain rhythms, and extends classical differential entropy features to a form suitable for Transformers. Second, long-context baseline removal acts as a simple temporal normalization that removes subject-specific drift and redundancy across nearby windows. Third, frequency-specific spatial projection learns a separate channel mixer for each frequency band, capturing interpretable band-specific patterns and reducing cross-channel mixing. We show that, even without pretraining, MST consistently outperforms both large pretrained EEG foundation models and frequency-based methods across all SEED-family datasets. These results suggest that careful representation design can yield an accurate, cost-effective, and interpretable alternative to large-scale pretraining.

URL PDF HTML ☆

赞 0 踩 0

2606.00880 2026-06-02 cs.LG cs.AI 版本更新

Task diversity produces systematic transfer but inhibits continual reinforcement learning

任务多样性产生系统性迁移但抑制持续强化学习

Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman, Max Kleiman-Weiner, Wilka Carvalho

发表机构 * MIT（麻省理工学院）； University of California, Berkeley（加州大学伯克利分校）； Princeton University（普林斯顿大学）； Harvard University（哈佛大学）

AI总结通过引入GPU加速的持续强化学习领域Banyan，研究任务多样性（地图布局、交互对象、子目标层次结构）对智能体在分布变化下持续学习能力的影响，发现多样性促进局部迁移但导致长期任务性能停滞和遗忘。

详情

AI中文摘要

持续强化学习旨在产生不仅能在当前任务上提高，还能随着任务分布变化而适应的智能体。在众多不同任务上训练智能体可以引发零样本泛化，但先前的工作通常是在训练后（冻结权重）评估这种泛化。任务多样性是否也能提高智能体在分布变化下继续学习的能力仍不清楚。我们引入了Banyan，一个GPU加速的持续强化学习领域，其中任务多样性分解为三个独立可控的轴：智能体必须导航的地图布局、必须与之交互的对象以及子目标依赖的层次结构。在单个分布变化中，增加每个轴上的多样性会导致智能体在新任务上开始训练时，其性能接近先前任务达到的水平，即使变化改变了最优策略的结构。然而，随着变化数量的增加，这种局部迁移本身并不能产生持续的持续学习：更长视野的任务出现平台期，并且较早的任务分布在后续训练后被遗忘。Banyan是一个基准，用于研究受控的任务多样性何时产生可迁移的学习，这种迁移何时持续，以及它在哪些方面未能达到真正的持续学习。

英文摘要

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training -- with frozen weights. Whether task diversity also improves an agent's ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.

URL PDF HTML ☆

赞 0 踩 0

2606.00871 2026-06-02 cs.CV cs.AI 版本更新

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

城市感知中的视觉语言模型基准应具备可靠性意识且可协商

Rashid Mushkani

发表机构 * Rashid Mushkani

AI总结本文提出，用于城市感知的视觉语言模型基准应将分歧和弃权视为测量结果，报告标注者间信度，并将标签空间和评分策略视为可协商的产物。

Comments To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

视觉语言模型（VLM）越来越多地用于生成街景图像的结构化描述，用于街道景观审计、制图和公众咨询等任务。这些用途将可观察属性与评估类别相结合，而人类目标往往是带有分歧和明确不回答的判断分布。本文认为，为城市感知建立VLM基准应将分歧和弃权视为测量结果，报告标注者间信度以及模型对齐度，并在输出旨在为城市治理提供信息时，将标签空间和评分策略视为可协商的产物。我们基于一个由来自七个社区组织的12名参与者对100个蒙特利尔街景进行30个维度标注的基准，以及对七个VLM的确定性零样本评估来论证这一观点。在各个维度上，模型与人类共识的一致性随维度层面的人类信度共同变化，而对于评估维度“总体印象”，模型和标注者表现出分布不匹配，包括“不适用”的不同比率。最后，我们为基准创建者、模型开发者和机构提出了行动建议，以使不确定性和基准假设在评估报告中可见。

英文摘要

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression models and annotators exhibit distributional mismatch including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.

URL PDF HTML ☆

赞 0 踩 0

2606.00860 2026-06-02 cs.SI cs.AI cs.CL 版本更新

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

GenPT：通过生成式投射测试实现超越自我报告的可靠LLM心理测量

Ming Wang, Shuang Wu, Bixuan Wang, Lu Lin, Yuxin Chen, Xiaocui Yang, Daling Wang, Shi Feng, Yifei Zhang, Yufan Sun

发表机构 * School of Computer Science and Engineering, Northeastern University（东北大学计算机科学与工程学院）； School of Computing and Information Systems, Singapore Management University（新加坡管理学院计算机与信息学院）； Mental Health Education Center, Northeastern University（东北大学心理健康教育中心）； School of Psychology, Northeast Normal University（东北师范大学心理学系）； Faculty of psychology, Southwest University（西南大学心理学系）； School of Sociology and Psychology, Central University of Finance and Economics（中央财经大学社会学与心理学学院）； College of Arts, Northeastern University（东北大学艺术学院）

AI总结针对自我报告问卷在人格化智能体心理测量中存在的训练语料污染和方向性偏差问题，提出GenPT方法，通过改编投射测试范式并构建三阶段评估流程，实现了更可靠的心理状态测量。

详情

AI中文摘要

自我报告问卷仍然是探测人格化智能体（PC-Agents）心理状态的主流工具。然而，经典工具存在两个众所周知的威胁：来自训练语料的污染以及由社会期望或上下文框架驱动的方向性偏差。为了克服这些方法论瓶颈，我们探讨投射范式是否能够被改编为一种稳健的心理测量工具。我们提出了 extbf{GenPT}（生成式投射测试），它通过新生成的刺激重新表述了TAT、罗夏测试和SCT，并将评估组织为三阶段流程，以导出标准化的心理指标和目标状态。通过评估由CharacterRAG和AnnaAgent配置文件诱导的PC-Agents，我们针对经典问卷基准测试了GenPT的信度和效度。结果表明，问卷在社会期望框架下表现出系统性的方向性偏移，在自杀意念上最为强烈。相比之下，GenPT收集的行为模式保持在对称基线附近。此外，在纵向咨询背景下，当Qwen3作为骨干模型时，基于GenPT的抑郁评估变化幅度比问卷对应方法大约一个数量级。总体而言，GenPT在需要抗污染、偏差对称性和上下文敏感性的场景中补充了自我报告方法。代码和刺激材料可在https://github.com/sci-m-wang/GenPT获取。

英文摘要

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). However, classical instruments inherit two well-known threats: contamination from training corpora and directional bias driven by social-desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce \textbf{GenPT} (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three-stage pipeline to derive standardized psychological indicators and target states. Evaluating PC-Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT's reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social-desirability framing, most strongly on suicide ideation. In contrast, GenPT's collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT-based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at https://github.com/sci-m-wang/GenPT.

URL PDF HTML ☆

赞 0 踩 0

2606.00857 2026-06-02 cs.RO cs.AI 版本更新

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

从线索到视野：轨迹预测的动态风险视界剖面

Xinyi Ning, Zilin Bian, Dachuan Zuo, Semiha Ergan, Kaan Ozbay

发表机构 * Department of Civil and Urban Engineering, New York University（纽约大学土木与城市工程系）； Department of Civil Engineering Technology and Environmental Management Safety, Rochester Institute of Technology（罗切斯特理工学院土木工程技术与环境安全管理系）

AI总结提出风险视界剖面（RHP）模块，通过连续可学习的势场模型对未来风险分布进行建模，以提升轨迹预测的准确性，在highD和SHRP2数据集上分别降低5秒RMSE 25.0%和5秒minFDE 29.1%。

Comments 11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)

详情

AI中文摘要

准确可靠的车辆轨迹预测对于安全自动驾驶至关重要。最近的研究将安全风险纳入轨迹预测，以量化周围代理带来的危险。然而，大多数风险感知方法将过去的风险信息作为辅助信号来帮助决策，忽视了其未来的演变和不确定性。在本文中，我们提出了一种风险视界剖面（RHP）模块，该模块结合了连续、可学习的势场模型，用于风险感知轨迹预测。RHP模块计算周围物体的时空接近度，以描绘未来视界上的风险分布，通过自适应识别人类驾驶员认为的关键时刻，支持更好的轨迹预测。我们在两个不同驾驶设置的数据集上评估了我们的方法：highD（高速公路走廊）和SHRP2（城市街道），涵盖了包括安全、近碰撞和碰撞事件在内的多种风险场景。与基线方法相比，我们的框架在highD数据集上实现了5秒RMSE降低25.0%，在SHRP2上实现了5秒minFDE降低29.1%。这些结果表明，该方法在短视界和长视界预测中均表现出色，并且在高速公路和城市场景中具有强大的泛化能力。所提出的方法能够实现更真实的自动驾驶车辆路径规划和策略选择，从而支持更安全的自动驾驶和更先进的驾驶员辅助系统。本工作的源代码可在以下网址获取：https://github.com/bilab-nyu/RHP

英文摘要

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0\% reduction in 5s RMSE on the highD dataset and a 29.1\% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: https://github.com/bilab-nyu/RHP

URL PDF HTML ☆

赞 0 踩 0

2606.00852 2026-06-02 cs.CV cs.AI cs.LG 版本更新

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

RefDiffNet: 在检测前学习暴露细微PCB缺陷

Vinay Edula, Nilesh Badwe, Priyanka Bagade

发表机构 * Department of Computer Science and Engineering Indian Institute of Technology Kanpur（计算机科学与工程系印度理工学院坎浦尔）； Department of Materials Science and Engineering Indian Institute of Technology Kanpur（材料科学与工程系印度理工学院坎浦尔）

AI总结提出RefDiffNet，一种轻量级即插即用的输入增强模块，通过引入无缺陷参考图像来突出缺陷区域，从而提升下游检测器在PCB缺陷检测中的性能。

详情

AI中文摘要

印刷电路板（PCB）缺陷检测具有挑战性，因为许多缺陷很小且难以与复杂的背景图案区分。大多数基于深度学习的PCB检测方法仅依赖被检测的PCB图像进行缺陷检测，忽略了编码走线、焊盘和其他PCB结构预期布局的无缺陷参考图像。在这项工作中，我们提出了RefDiffNet，一种轻量级即插即用的输入增强模块，放置在检测器主干之前，用于在缺陷检测前增强图像。RefDiffNet将经典检测中的一个成熟思想带入深度学习时代，利用无缺陷参考图像来揭示缺陷。RefDiffNet比较缺陷图像与对齐的参考图像，捕获相对于参考图像的结构变化，并使用轻量级编码器输出缺陷区域被突出的原始图像，从而简化下游检测器的任务。在HRIPCB和DeepPCB上的结果表明，RefDiffNet在各类检测器上一致地提升了性能，包括从YOLOv8到YOLOv26的单阶段检测器、基于Transformer的RT-DETR以及两阶段Faster R-CNN。它实现了高达18%的相对mAP50:95增益，且开销可忽略，仅引入0.004-0.005M额外参数和0.7-0.8 GFLOPs，最多占任何评估检测器参数量的0.25%。结果确立了RefDiffNet作为一种轻量级、即插即用、检测器无关的输入增强模块，以最小的计算成本显著提升PCB缺陷检测性能。

英文摘要

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.

URL PDF HTML ☆

赞 0 踩 0

2606.00844 2026-06-02 cs.CV cs.AI cs.LG 版本更新

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

MoEIoU：将边界框回归重新思考为混合专家模型

Vinay Edula, Priyanka Bagade

发表机构 * Indian Institute of Technology Kanpur（印度理工学院坎普尔分校）

AI总结提出MoEIoU损失函数，通过混合专家模型联合优化重叠、中心对齐和长宽比，并采用课程学习权重调度，在多个数据集和YOLO架构上超越现有IoU损失。

详情

AI中文摘要

边界框回归是目标检测的基本组成部分，在精确目标定位中起着关键作用。现有的基于交并比（IoU）的损失函数通过引入几何惩罚项（如中心距离和长宽比不匹配）来扩展IoU目标，以改进边界框回归。然而，这些惩罚项通常在训练过程中保持不变，没有考虑优化动态：预测框在初始阶段表现出较大的中心距离和形状误差，而后期阶段则侧重于提高与真实框的重叠。为了解决这一局限性，我们引入了MoEIoU，一种基于混合专家的回归损失，它联合建模了重叠、中心对齐和长宽比不匹配。MoEIoU使用log-sum-exp函数聚合这些组件，该函数强调主要的定位误差，同时保持其他项的平滑贡献。此外，采用基于课程的权重调度，在早期训练阶段优先纠正框的位置和形状，在后期阶段提高重叠。我们在PASCAL VOC、HRIPCB和MS COCO上使用多种YOLO架构以及大规模模拟实验评估了所提出的MoEIoU。它始终优于标准和最新的最先进损失，表现出更快的收敛速度和更高的定位精度。我们进一步表明，这种自适应聚合改进了现有的基于IoU的损失，带来了一致的增益，并为目标检测框架中的边界框回归提供了更有效的优化指导。

英文摘要

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluated proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.

URL PDF HTML ☆

赞 0 踩 0

2606.00840 2026-06-02 cs.AI 版本更新

Certificate-Guided Evaluation of Reinforcement Learning Generalization

证书引导的强化学习泛化评估

Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal

发表机构 * School of Computer Science, Georgia Institute of Technology（佐治亚理工学院计算机科学学院）； School of Computing and Information Systems, Singapore Management University（新加坡管理大学 computing and information systems 学院）

AI总结提出一个逻辑驱动框架，通过神经证书函数验证强化学习算法在未见任务上的泛化能力，并证明证书违规率与测试任务成功率负相关。

详情

AI中文摘要

本文提出了一个逻辑驱动框架，用于评估强化学习算法在泛化到未见任务方面的性能。我们的框架定义了一类归纳可达-避免任务，这些任务在任务动态中具有结构相似性，从而能够评估泛化能力。我们引入了一个神经证书函数，通过强制执行关键条件来验证强化学习算法生成的轨迹，从而作为强化学习泛化的试金石。我们通过实验证明了该方法在几个最先进的可泛化强化学习算法上的能力，在具有挑战性的连续环境中验证了泛化能力。我们的结果表明，证书函数违规率越低，成功解决的测试任务数量越多，突显了我们的框架在评估和区分强化学习算法泛化能力方面的有效性。这项工作为基准测试强化学习泛化提供了一种原则性方法。

英文摘要

This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method's capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization.

URL PDF HTML ☆

赞 0 踩 0

2606.00838 2026-06-02 cs.AI 版本更新

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

解耦行为克隆实现基于规范的强化学习中的可扩展归纳泛化

Vignesh Subramanian, Subhajit Roy, Suguman Bansal

发表机构 * School of Computer Science, Georgia Institute of Technology, USA（美国佐治亚理工学院计算机科学学院）； Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India（印度理工学院坎浦尔分校计算机科学与工程系）

AI总结提出DIBS方法，通过解耦任务特定策略学习与演化函数学习，利用行为克隆替代噪声奖励聚合，提升训练稳定性和零样本泛化能力。

详情

AI中文摘要

归纳泛化是强化学习泛化的一种框架，其中归纳相关的任务实例允许归纳相关的策略。先前的工作通过直接使用强化学习学习的高阶策略演化函数捕捉这种结构，但存在训练可扩展性差的问题：随着训练任务增加，聚合的奖励反馈变得嘈杂且冲突，破坏训练稳定性并削弱泛化能力。我们提出DIBS，一种解耦的行为克隆方法，将学习任务特定策略与学习演化函数分离。我们首先通过标准强化学习为每个任务学习独立的教师策略，然后通过行为克隆在教师标记的状态-动作对上拟合演化函数。这用密集、稳定的监督取代了嘈杂的奖励聚合。DIBS在训练稳定性和零样本泛化方面相比现有强化学习和元强化学习算法取得了显著改进。

英文摘要

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.

URL PDF HTML ☆

赞 0 踩 0

2606.00834 2026-06-02 stat.AP cs.AI cs.LG math.PR 版本更新

Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing

加纳五岁以下儿童疟疾住院人数的混合概率预测：高斯过程回归与Holt-Winters平滑

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

发表机构 * GAEC, Ghana（加纳农业和粮食部）

AI总结针对加纳疟疾预测中季节性和数据不确定性挑战，提出结合高斯过程回归与Holt-Winters指数平滑的混合模型，实现概率性预测并评估其性能。

Comments 24 pages, 8 figures, accepted for publication in Artificial Intelligence in Medicine

详情

AI中文摘要

准确的疟疾预测在撒哈拉以南非洲仍是一个重大挑战，那里强烈的季节性、报告不确定性和非平稳传播动态降低了传统模型的可靠性。在加纳，地区级疟疾监测需要概率上严谨且数据有限时稳健的预测框架。本研究提出了一个混合框架，将高斯过程回归（GPR）与Holt-Winters指数平滑相结合，用于建模每月五岁以下儿童疟疾住院人数。GPR捕捉非线性行为和预测不确定性，而Holt-Winters稳定长期预测并保留季节结构。使用十年（2014-2023年）的地区级数据，通过滚动起点扩展窗口验证评估性能。混合模型实现了$R^2 = 0.9906$，而单独Holt-Winters为$0.8213$，$94.2\%$的残差在$\pm 2σ$范围内。2024-2028年的预测显示月平均住院人数约为8,000至12,200例。时空分析揭示了显著的生态异质性：北部高负担地区尽管绝对波动较大，但相对模式稳定。该框架为疟疾流行地区的早期预警和运营规划提供了一种可扩展的概率方法，支持加纳国家疟疾控制战略。

英文摘要

Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved $R^2 = 0.9906$ versus $0.8213$ for Holt-Winters alone, with $94.2\%$ of residuals within $\pm 2σ$ bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8{,}000 to 12{,}200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana's national malaria control strategy.

URL PDF HTML ☆

赞 0 踩 0

2606.00831 2026-06-02 cs.AI cs.LG 版本更新

没有电子的证书？AI驱动电力需求影响的理论与证据

Dana Golden, Aruna Balasubramanian, Niranjan Balasubramanian

发表机构 * Department of Economics, Stony Brook University（石溪大学经济系）； Department of Computer Science, Stony Brook University（石溪大学计算机科学系）

AI总结通过博弈论模型和自然实验，研究AI数据中心使用可再生能源证书和购电协议对电网可靠性、电价和排放的影响，发现证书无法解决时序错配问题，而共置储能可有效缓解。

详情

AI中文摘要

数据中心目前占美国电力需求的4.4%，但超大规模企业用于宣称碳中和的可再生能源证书（RECs）和购电协议（PPAs）在电网层面的有效性仍不明确。我们开发了一个博弈论模型，其中数据中心运营商在RECs、PPAs和表后共置之间选择，而发电商在内生融资成本下做出进入决策。该模型识别出一个时序楔子——消费与信用可再生能源发电之间的不匹配——作为核心机制，通过该机制，即使RECs覆盖100%的年消费量，AI需求也会降低可靠性、提高价格并增加排放。与储能共置直接解决了这一楔子，并通过消除发电商收入风险诱导最大的可再生能源进入。我们通过利用大型语言模型的分阶段发布作为自然实验来检验这些预测，使用双重差分法分析一个将AI活动与当地电网结果联系起来的新数据集。AI需求显著增加了数据中心附近的化石燃料发电、批发价格（在处理的PJM区域高达25%）和停电频率（每年额外0.5-1次停电），其影响随模型规模扩大而扩大。拥有现场发电的数据中心在电能质量效应上表现出符号反转，这与模型的预测一致，即表后容量吸收了需求峰值。反事实分析表明，边缘推理、空间重新分配和共置储能均能显著减轻电网影响，而仅依赖RECs的策略则不能。总之，我们的结果表明，AI对电网的外部性与采购设计及数据中心基础设施的空间组织紧密相关。

英文摘要

Data centers now account for 4.4% of United States electricity demand, yet the grid-level effectiveness of the renewable energy certificates (RECs) and power purchase agreements (PPAs) hyperscalers use to claim carbon neutrality remains unclear. We develop a game-theoretic model in which a data center operator chooses among RECs, PPAs, and behind-the-meter colocation while generators make entry decisions under endogenous financing costs. The model identifies a timing wedge -- the mismatch between consumption and credited renewable generation -- as a central mechanism through which AI demand degrades reliability, raises prices, and increases emissions even when RECs cover 100% of annual consumption. Colocation with storage addresses this wedge directly and induces the greatest renewable entry by eliminating generator revenue risk. We test these predictions by exploiting the staggered release of large language models as a natural experiment, using difference-in-differences on a novel dataset linking AI activity to local grid outcomes. AI demand significantly increases fossil generation, wholesale prices (up to 25% in treated PJM zones), and outage frequency (0.5--1 additional outages per year) near data centers, with impacts scaling in model size. Data centers with on-site generation exhibit a sign reversal in power-quality effects, consistent with the model's prediction that behind-the-meter capacity absorbs demand spikes. Counterfactual analyses show that edge inference, spatial reallocation, and colocated storage each substantially mitigate grid impacts, while REC-only strategies do not. Together, our results demonstrate that the externalities of AI to the grid are tightly coupled to procurement design and the spatial organization of data center infrastructure.

URL PDF HTML ☆

赞 0 踩 0

2606.00798 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH: 用于引导校准紧凑扩散模型的双分支分数蒸馏

Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo

发表机构 * Khulna University of Engineering & Technology（Khulna 工程与技术大学）； University Clermont Auvergne（克莱蒙特-奥弗涅大学）

AI总结针对类条件扩散模型参数压缩中无监督无条件分数分支导致引导失效的问题，提出双分支蒸馏框架DASH，通过独立监督两个分支并引入锚点正则化和课程迁移，在5.9倍压缩下保持与教师模型相近的FID和引导保真度。

Comments 14 pages, 7 figures, 4 tables; appendix with additional ablations and qualitative results

详情

AI中文摘要

类条件扩散模型的参数压缩揭示了输出级蒸馏中一个未被充分探索的局限性：无条件分数分支保持无监督，导致学生模型中无分类器引导差距欠定。该差距在每个去噪步骤中被放大，允许两个分支都崩溃为相同预测的退化解，使得引导在低输出级训练损失下无效。本文介绍了DASH，一种双分支蒸馏框架，独立监督两个分数分支，通过独立分支约束为每个训练样本唯一指定目标分支输出，并引入锚点项将条件预测正则化到真实噪声。该框架进一步引入了TIRT迁移，将教师收敛的每时间步重要性课程复制到学生中作为冻结先验，消除了在有限蒸馏预算内重新学习它的需要。在CIFAR-10和CIFAR-100上的实验表明，5.9倍压缩在50步DDIM采样下将质量保持在教师模型4个FID点以内，显著优于从头训练，且引导保真度良好保持。消融研究证实无条件监督是主要贡献，占总蒸馏增益的60%以上。课程迁移和锚点正则化提供互补收益，共同验证了双分支约束对于引导保持压缩的经验必要性。

英文摘要

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher's converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.

URL PDF HTML ☆

赞 0 踩 0

2606.00795 2026-06-02 cs.LG cs.AI 版本更新

Extending Causal Metamodeling to a non-Markovian Queue

将因果元建模扩展到非马尔可夫排队系统

Pracheta Amaranath, Anant Bhide, David Jensen, Peter Haas

发表机构 * Manning College of Information and Computer Sciences University of Massachusetts Amherst（信息与计算机科学学院麻省大学阿默斯特分校）

AI总结本文通过相位型分布近似非指数分布，将模块化动态贝叶斯网络（MDBN）因果元建模方法从马尔可夫系统扩展到非马尔可夫排队系统，并解决了相位数选择、参数学习和采样间隔等挑战，实验表明在G/M/1队列上可实现数量级的推理加速。

Comments 12 pages

详情

AI中文摘要

离散事件仿真的元模型近似模拟模型的行为，而无需运行昂贵的仿真。先前的工作引入了模块化动态贝叶斯网络（MDBN）——一类元模型，可以使用单个训练模型估计一系列概率和因果查询（PCQ）——但该方法仅限于马尔可夫系统。在本文中，我们通过使用相位型分布近似非指数分布，启动MDBN向非马尔可夫排队的扩展。这种方法带来了新的挑战，包括在选择相位数量时平衡元建模精度和可处理性、高效学习元模型参数，以及选择用于通过离散时间MDBN近似连续时间仿真的采样间隔。我们为这些挑战提供了初步解决方案，从而产生了第一个针对非马尔可夫系统的因果元建模技术。在G/M/1队列上的实验表明，MDBN可以为PCQ提供准确的答案，并且相对于直接仿真，推理时间实现了数量级的加速。

英文摘要

Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work introduced modular dynamic Bayesian networks (MDBNs) -- a class of metamodels that can estimate a range of probabilistic and causal queries (PCQs) using a single, trained model -- but the method was limited to Markovian systems. In this paper, we initiate an extension of MDBNs to non-Markovian queues by approximating non-exponential distributions using phase-type distributions. This approach raises novel challenges, including balancing metamodeling accuracy and tractability when choosing the number of phases, efficiently learning metamodel parameters, and choosing the sampling interval that is used to approximate a continuous-time simulation by a discrete-time MDBN. We provide preliminary solutions to these challenges, yielding the first causal metamodeling technique for non-Markovian systems. Experiments on a G/M/1 queue demonstrate that the MDBN can produce accurate answers to PCQs with orders-of-magnitude speedup of inference times relative to direct simulation.

URL PDF HTML ☆

赞 0 踩 0

2606.00783 2026-06-02 stat.AP cs.AI math.PR stat.CO 版本更新

Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler

加纳非线性疟疾动力学的贝叶斯推断：基于集成马尔可夫链蒙特卡洛采样器

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

发表机构 * Ghanaian Agricultural and Environmental Council（加纳农业与环境委员会）

AI总结针对加纳疟疾监测数据短、噪声大、空间异质性强的问题，提出一种贝叶斯非线性推断框架，结合三次基线与阻尼振荡核，通过仿射不变集成马尔可夫链蒙特卡洛采样器估计参数，实现了高精度拟合和概率预测，揭示了空间异质性并预测了2024-2026年疟疾回升趋势。

Comments 27 pages, 15 figures, published in Expert Systems with Applications

详情

DOI: 10.1016/j.eswa.2026.131540
Journal ref: Expert Systems with Applications, Volume 312, 131540 (2026)

AI中文摘要

可靠量化撒哈拉以南非洲疟疾动态受到短、噪声大且空间异质的监测记录阻碍。在加纳，2014年至2023年的卫生设施数据揭示了住院人数的非线性和年龄特异性波动，然而现有方法难以捕捉随机变异或提供可信的不确定性区间。本研究开发了一个贝叶斯非线性推断框架，该框架将三次基线与阻尼振荡核相结合，通过仿射不变集成马尔可夫链蒙特卡洛采样器进行估计。该框架适应有限数据，建模参数不确定性，并为五岁以下儿童和五岁及以上个体生成概率预测。结果显示较强的经验充分性（五岁以下：$R^2 = 0.9958$；五岁及以上：$R^2 = 0.9956$），残差低于$2\%$，且混合良好的后验分布确认了收敛性。区级分析揭示了显著的空间异质性，变异系数从库马西等城市中心的$<0.07$到姆波霍尔和东比亚等边缘地区的$>3.3$。2024-2026年的预测表明逐步回升：五岁以下儿童病例从137,000例增至149,000例，年长个体从348,000例增至375,000例，不确定性随时间扩大。通过生成概率预测，该贝叶斯框架为预测疟疾波动和加强加纳国家疟疾控制战略中的数据驱动决策提供了原则性工具。

英文摘要

Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal non-linear and age-specific fluctuations in hospital admissions, yet existing approaches struggle to capture stochastic variability or provide credible uncertainty bounds. This study develops a Bayesian nonlinear inference framework that integrates a cubic baseline with a damped oscillatory kernel, estimated via an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and generates probabilistic forecasts for children under five years and individuals aged five years or more. Results show strong empirical adequacy ($R^2 = 0.9958$ for $<5$ years; $R^2 = 0.9956$ for $\geq 5$ years) with residual errors below $2\%$ and well-mixed posteriors confirming convergence. District-level analysis reveals pronounced spatial heterogeneity, with coefficients of variation ranging from $<0.07$ in urban centres such as Kumasi to $>3.3$ in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: from 137,000 to 149,000 cases among children under five years and from 348,000 to 375,000 cases among older individuals, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework provides a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana's national malaria control strategy.

URL PDF HTML ☆

赞 0 踩 0

2606.00780 2026-06-02 cs.LG cs.AI 版本更新

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

基于Transformer世界模型的行为不变任务表示学习用于离线元强化学习

Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu

发表机构 * National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出一种结合信息论任务表示学习与Transformer随机世界模型的框架，通过提取行为不变的任务变量和保守值惩罚，解决离线元强化学习中的分布偏移和稀疏奖励问题，实现鲁棒泛化。

Comments ICML2026

详情

AI中文摘要

离线元强化学习利用静态数据集使智能体能够通过结合离线效率与元学习适应性来泛化到未见环境，但它面临来自上下文和策略分布偏移的关键挑战。这些问题阻碍智能体适应在线环境，并在稀疏奖励设置下进一步加剧。结果，智能体常常陷入固有的模式困境，无法实现鲁棒的泛化。在这项工作中，我们提出了一种新颖的框架，将信息论任务表示学习与基于Transformer的随机世界模型相结合。我们的方法提取对行为策略不变的任务定义潜在变量，从而有效缓解上下文分布偏移。为了进一步处理策略偏移和模型利用，我们对基于想象力的轨迹应用保守值惩罚，防止策略利用模型不准确性，同时保持鲁棒适应。大量评估表明，我们的方法在分布外和稀疏奖励设置下优于最先进的方法，具有优越的稳定性和泛化能力。

英文摘要

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.

URL PDF HTML ☆

赞 0 踩 0

2606.00775 2026-06-02 cs.CV cs.AI 版本更新

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

GIRL-DETR: 梯度隔离强化学习用于视频时刻检索

Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji

发表机构 * College of Electronics and Information Engineering, Sichuan University（四川大学电子信息工程学院）； School of Intelligence Science and Technology, Nanjing University（南京大学智能科学与技术学院）

AI总结针对视频时刻检索中连续代理损失与非可微指标不匹配导致的优化停滞问题，提出梯度隔离强化学习框架GIRL-DETR，通过冻结骨干网络并采用三阶段渐进强化学习策略直接优化tIoU指标，在轻量级模型中实现定位精度提升。

Comments 13 pages, 6 figures. Submitted to IEEE Transactions on Image Processing (TIP). Code is available at: https://github.com/Z-Shihang/GIRL-DETR

详情

AI中文摘要

视频时刻检索（VMR）任务要求精确定位与自然语言查询对齐的时间边界，但许多模型存在连续代理损失与非可微指标之间的不匹配，导致训练后期优化停滞，边界预测陷入次优解。尽管强化学习（RL）后训练成功优化了大模型的定位结果，但直接应用于轻量级网络容易破坏监督阶段建立的脆弱特征表示。为克服这一优化瓶颈，我们提出梯度隔离强化学习用于DETR（GIRL-DETR），首次将RL后训练引入轻量级时间定位框架。输入视频和文本特征首先通过跨模态交互（CMI）在进入Transformer编码器之前建立早期对齐。随后，文本引导门控（TGG）机制在Transformer解码器生成候选提案之前动态地将语义先验注入查询，为时间预测提供高信噪比输入。在监督训练达到收敛后，冻结骨干网络以保护特征流形，而检测头通过三阶段渐进强化学习（TPRL）策略直接优化非可微评估指标tIoU以提升定位精度。该方法实现了状态表示与指标优化的正交解耦。在Charades-STA、QVHighlights和TACoS上的实验表明，GIRL-DETR有效解决了代理损失退化问题，以最少的参数更新实现了显著的精度提升，为轻量级VMR模型中的RL应用提供了稳健的新途径。

英文摘要

Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, leading to optimization stagnation during the late stages of training and trapping boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.

URL PDF HTML ☆

赞 0 踩 0

2606.00771 2026-06-02 cs.LG cs.AI cs.SD 版本更新

Logit Distillation on Manifolds: Mapping by Learning

流形上的对数蒸馏：通过学习进行映射

Yiru Yang, Junling Wang, Nishant Kumar Singh, Luohong Wu, Haoran Yan

发表机构 * University of Zurich（苏黎世大学）； ETH Zurich（苏黎世联邦理工学院）； Deutsche Bank Securities（德意志银行证券公司）

AI总结提出一种层和点投影映射方法，将学生和教师表示对齐到高维嵌入空间，结合LoRA注入，在显著减少可训练参数的同时提高词错误率。

详情

AI中文摘要

提高几乎任何机器学习模型性能的一种简单方法是，不训练单个模型，而是训练多个使用不同算法的模型，这些模型对相同数据做出略有不同的预测和错误，从而提高平均预测和鲁棒性。然而，使用整个模型集成进行预测是繁琐且计算成本过高的，无法部署给大量用户，特别是当模型是大型神经网络时。为此，我们引入了一种层和点投影映射，在训练过程中将学生和教师表示映射到对齐的高维嵌入空间。所提出的方法结合LoRA注入，将学生模型的可训练参数减少到教师模型的不到1%，同时与其他蒸馏方法相比，显著提高了词错误率（WER），如消融研究所示。与专家混合不同，我们的方法可以快速并行训练。

英文摘要

A simple way to improve the performance of almost any machine learning model is not to train a single but several models with diverse algorithms which will make slightly distinct kinds of predictions and errors on the same data, and thus improve the average predictions and robustness. However, making predictions using a whole ensemble of models is cumbersome and computationally too expensive to allow deployment to a large number of users, especially if the models are large neural nets. In response to this, we introduce a layer and point wise projection mapping, which maps student and teacher representations into an aligned high-dimensional embedding space during training process. The proposed approach combined with LoRA injection reduces the student model trainable parameters to less than 1% of the teacher model, while significantly improving word error rate (WER) compared to other distillation methods, as demonstrated in ablation studies. Unlike a mixture of experts, our method can be trained rapidly and in parallel.

URL PDF HTML ☆

赞 0 踩 0

2606.00765 2026-06-02 cs.AI 版本更新

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

FALAT: 通过依赖引导搜索追踪LLM智能体轨迹中的失败

Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang, Tse-Hsun Chen

发表机构 * SPEAR Lab（SPEAR实验室）； Concordia University（康科德大学）； DePaul University（德保罗大学）

AI总结提出FALAT框架，通过依赖引导搜索方法，在LLM智能体轨迹中识别导致失败的关键步骤和责任智能体。

详情

AI中文摘要

基于LLM的智能体越来越多地通过包含推理步骤、工具调用和智能体间通信的长轨迹来解决复杂任务。然而，当这些智能体失败时，通常不清楚是哪个智能体导致了失败，以及哪个步骤引入了决定性错误。这个归因问题具有挑战性，因为错误可以在轨迹中传播：后续动作可能看起来不正确，但仅仅是因为它们依赖于先前被破坏的状态。因此，失败归因不能被视为独立的步骤级分类。我们提出FALAT，一个用于LLM智能体轨迹中失败归因的诊断框架。FALAT将归因问题框架化为一个依赖引导的搜索问题。它首先构建任务应如何解决的期望，并利用该期望识别轨迹中的可疑区域。然后，它追踪决策、工具输出和智能体消息之间的依赖关系，以区分引入错误的步骤和仅仅继承或传播先前错误的步骤。最后，FALAT评估纠正候选步骤是否足以恢复预期结果，从而能够识别责任智能体和决定性失败步骤。我们在Who&When基准上评估FALAT，该基准包括算法生成和手工制作的多智能体失败轨迹。结果表明，FALAT持续改进了责任智能体和决定性步骤的归因。其最佳配置在算法生成轨迹上达到46.0%的步骤级准确率，在更具挑战性的手工制作轨迹上达到29.1%，优于专门的归因基线和直接提示的独立LLM。这些发现表明，依赖感知推理对于LLM智能体系统中可靠的失败诊断至关重要。

英文摘要

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00756 2026-06-02 cs.AI 版本更新

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

CoMIC：云边系统中长周期LLM代理的协作记忆与洞察循环

Yannan Wang, Longli Yang, Zhen Liu, Abhishek Kumar, Carsten Maple

发表机构 * Beijing Jiaotong University（北京交通大学）； The Alan Turing Institute（艾伦·图灵研究所）； University of Warwick（沃里克大学）

AI总结提出无需参数更新的云边框架CoMIC，通过集中式反思与分布式执行设计，利用语义子目标标识实现跨代理经验聚合，提升弱边缘代理在长周期任务中的进展率和动作基础。

详情

AI中文摘要

在边缘服务器上部署轻量级大语言模型（LLM）代理可以减少延迟并将代理服务更贴近用户，但资源受限的边缘模型在处理需要持久记忆、子目标跟踪和反思的长周期任务时往往表现不佳。部署后对边缘模型进行微调成本高昂且难以在异构节点上扩展，而纯本地记忆则使代理拥有孤立经验并导致提示上下文不断增长。我们提出 extsc{CoMIC}，一种无需参数更新的云边框架，用于协作记忆与洞察循环。 extsc{CoMIC}遵循 extit{集中式反思，分散式执行}的设计：边缘代理使用面向子目标的分层记忆和选择性重新展开相关历史在本地执行，而云端LLM批评者异步评估完成的轨迹，过滤可重用经验，并通过语义子目标标识符聚合跨代理指导。在涵盖符号规划和文本交互的五项长周期代理任务中， extsc{CoMIC}提高了弱边缘代理的进展率和动作基础，并在不更新模型参数的情况下实现了任务相关的成功率提升。

英文摘要

Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose \textsc{CoMIC}, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. \textsc{CoMIC} follows a \textit{Centralized Reflection, Decentralized Execution} design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, \textsc{CoMIC} improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

URL PDF HTML ☆

赞 0 踩 0

2606.00754 2026-06-02 stat.ME cs.AI cs.LG 版本更新

Causal Density Functions

因果密度函数

Sridhar Mahadevan

发表机构 * Adobe Research（Adobe研究院）； University of Massachusetts（马萨诸塞大学）； Amherst（阿默斯特）

AI总结提出因果密度函数作为干预分布与观测分布的Radon-Nikodym导数，用于局部密度比衡量因果效应，并给出估计与检验方法。

Comments 25 pages

2606.00741 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment

量子隧穿感知机器学习：面向鲁棒部署的物理衍生噪声模型

Uiwon Hwang, Jaeho Hwang

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）； Human-Centered Artificial Intelligence Research Institute（以人为本的人工智能研究院）

AI总结本文提出量子隧穿感知机器学习（QTAML），通过WKB近似推导部署时的权重误差分布，并设计隧穿感知补偿（TAC）算法，在无需重训练和标签的情况下，以较低ECC开销恢复模型精度。

详情

AI中文摘要

晶体管缩放正接近量子力学极限，因为薄栅氧化物通过量子隧穿引起电子泄漏。与传统数字系统不同，只要错误结构被正确建模，AI推理可以容忍此类错误。在本文中，我们引入量子隧穿感知机器学习（QTAML）。我们使用Wentzel-Kramers-Brillouin（WKB）近似从第一性原理推导部署时的权重误差分布，并表明它具有通用高斯噪声模型所忽略的结构：精确的仿射均值漂移、由最高有效位主导的逐位方差层级，以及依赖于$\|W_\ell\|_\infty$和训练网络Jacobian的逐层依赖性。我们将这三个结构属性打包成一个单一的部署时算法——隧穿感知补偿（TAC），该算法结合了闭式均值校正和基于WKB方差分解的最优逐层自适应比特预算分配。在$p_\mathrm{flip}=0.10$的四个卷积架构和$p_\mathrm{flip}=0.05$的一个Transformer编码器上，TAC达到了干净精度的95%，同时ECC开销比从相同物理导出的自然基线Uniform-MSP低3.4倍到33.6倍。闭式饱和比$ ho^*$预先预测了这些增益，在异构架构上，WKB导出的评分在小预算下比基于幅度的分配高出多达24个百分点。该算法无需重训练、无需标签，且无推理时开销。我们还验证了WKB导出的分布定理达到蒙特卡洛精度。这些结果将WKB隧穿物理与噪声感知深度学习联系起来，并为超越传统缩放极限的硬件-软件协同设计提供了一条有原则的路径。

英文摘要

Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike conventional digital systems, AI inference can tolerate such errors provided their structure is modeled correctly. In this paper, we introduce quantum tunneling-aware machine learning (QTAML). We derive the deployment-time weight-error distribution from first principles using the Wentzel-Kramers-Brillouin (WKB) approximation and show that it has structure that generic Gaussian noise models miss: an exact affine mean drift, a per-bit variance hierarchy dominated by the most-significant bit, and a per-layer dependence on $\|W_\ell\|_\infty$ and the trained-network Jacobian. We package these three structural properties into a single deployment-time algorithm, Tunneling-Aware Compensation (TAC), that combines closed-form mean correction with an optimal layer-adaptive bit-budget allocation derived from the WKB variance decomposition. Across four convolutional architectures at $p_\mathrm{flip}$=0.10 and a transformer encoder at $p_\mathrm{flip}$=0.05, TAC reaches $95\%$ of clean accuracy with 3.4$\times$ to 33.6$\times$ less ECC overhead than Uniform-MSP, the natural baseline derived from the same physics. The closed-form saturation ratio $ρ^*$ predicts these gains in advance, and on heterogeneous architectures WKB-derived scoring outperforms magnitude-based allocation by up to 24 percentage points at small budgets. The algorithm requires no retraining, no labels, and no inference-time overhead. We also verify the WKB-derived distributional theorems to Monte Carlo precision. These results connect WKB tunneling physics with noise-aware deep learning and suggest a principled path toward hardware--software co-design beyond conventional scaling limits.

URL PDF HTML ☆

赞 0 踩 0

2606.00738 2026-06-02 cs.LG cs.AI cs.CV 版本更新

SORA: Free Second-Order Attacks in Fast Adversarial Training

SORA：快速对抗训练中的自由二阶攻击

Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering, Sharif University of Technology, Tehran, Iran（谢赫大学计算机工程系）

AI总结针对快速对抗训练中的灾难性过拟合问题，提出通过扰动变异性和梯度对齐指标PertAlign来预测并防止过拟合，并设计自适应步长方法SORA，实现最优鲁棒性和干净准确率。

Comments Accepted at ICML 2026

详情

AI中文摘要

对抗训练是对抗性样本的主要防御手段，但在高效的单步变体中常常遭受灾难性过拟合，即尽管单步性能很高，但对多步攻击的鲁棒性却崩溃。我们通过两个贡献来解决这种失效模式。首先，我们形式化了epsilon过拟合（EO），这是一种固定扰动幅度和方向加剧CO的视角，并表明引入扰动变异性可以显著提高不同架构和数据集上的鲁棒泛化能力。其次，我们提出了PertAlign（扰动对齐），这是一种理论上合理、计算开销可忽略的指标，通过测量攻击阶段的梯度对齐来预测CO的发生。利用这些见解，我们引入了SORA，一种自适应步长的AT方法，它根据损失曲面几何动态调整扰动。SORA始终能防止CO，实现最先进的鲁棒性和干净准确率，并使用一组固定的超参数在数据集和架构上泛化，这对于快速AT的适用性至关重要。在不同数据集和架构上的大量实验表明，SORA在提供更高干净准确率和卓越效率的同时，匹配或超越了先前方法的鲁棒性。代码可在https://github.com/SecondOrderAT/SORA获取。

英文摘要

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.

URL PDF HTML ☆

赞 0 踩 0

2606.00726 2026-06-02 cs.AI 版本更新

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

潜在奖励引导：一种自适应推理时框架，隐式促进推理大语言模型中的认知行为

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li

发表机构 * Rutgers University（罗格斯大学）； South China Agricultural University（华南农业大学）； Columbia University（哥伦比亚大学）； Fenz.AI ； QuantaAlpha ； Adobe ； Santa Clara University（圣克拉拉大学）； City University of Hong Kong（香港城市大学）

AI总结提出潜在奖励引导（LRS）框架，通过优化稀疏自编码器潜在状态隐式促进认知行为，利用最终答案正确性训练潜在奖励模型估计中间状态质量，并在推理时提供状态特定的修正方向，实验表明该方法能提升推理性能并修复原始推理错误。

详情

AI中文摘要

强推理不仅依赖于模型知识，还取决于生成过程中认知行为的有效部署。现有方法通常依赖显式的行为级控制，当失败和所需修正因推理状态、任务和模型而异时，其适应性不足。为此，我们提出潜在奖励引导（LRS），一种自适应推理时框架，通过优化隐式携带认知行为的稀疏自编码器（SAE）潜在状态来促进认知行为。LRS不依赖预定义的认知行为或由此衍生的引导方向，而是基于最终答案正确性在推理轨迹上训练潜在奖励模型，以估计中间潜在状态的质量。推理时，奖励梯度为脆弱的潜在状态提供状态特定的修正方向，而奖励与置信度门控将干预限制在奖励信号标记为脆弱的状态上。在多个推理LLM骨干和基准上的实验表明，LRS一致地提升了相对于各种基线的性能，事后分析进一步表明LRS隐式促进了修复原始推理错误的良好认知行为。代码见：https://github.com/jiakanglee/Latent-Reward-Steering。

英文摘要

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that \ours consistently improves performance over various baselines, and post-hoc analyses further indicate that \ours implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.

URL PDF HTML ☆

赞 0 踩 0

2606.00724 2026-06-02 cs.CL cs.AI 版本更新

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

WaveFilter: 通过小波引导的KV缓存过滤增强扩散型大语言模型的长上下文能力

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu

发表机构 * Nanjing University of Science and Technology（南京理工大学）； Alibaba Group（阿里巴巴集团）； Huzhou Normal University（湖州师范学院）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结针对扩散型大语言模型在长上下文任务中计算开销大和推理延迟高的问题，提出一种无需训练的通用缓存框架WaveFilter，利用小波变换分解长序列以精确识别关键token，构建稀疏KV缓存，从而提升现有KV缓存方法在复杂长上下文任务中的性能。

Comments 8 pages,3 figures

详情

AI中文摘要

扩散型大语言模型（DLMs）在各种任务中展现出显著优势。然而，受限于其多步迭代推理机制，它们在长上下文任务中的计算开销和推理延迟已成为限制其大规模部署的核心瓶颈。在处理长序列时，现有的键值（KV）缓存机制常常面临生成质量急剧下降的困境，其核心挑战在于如何在超长上下文中精确且高效地过滤关键token。受人类阅读过程的启发，我们提出了 extbf{WaveFilter}，一个通用的、无需训练的缓存框架。该框架创新性地引入小波变换来分解长序列，以实现关键token的精确识别，并基于此构建稀疏KV缓存以计算最终的上下文表示。实验结果表明，WaveFilter作为一个即插即用的通用框架，显著提升了现有主流KV缓存方法在复杂长上下文任务中的性能。

英文摘要

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often face a dilemma where generation quality degrades drastically, where the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose \textbf{WaveFilter}, a universal and training-free caching framework. This framework innovatively introduces the wavelet transform for decomposition of long sequences to achieve precise identification of key tokens, based on which a sparse KV Cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV Cache methods in complex long-context tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.00722 2026-06-02 cs.CL cs.AI 版本更新

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

EPIC: 扩散语言模型在上下文无关文法约束下的高效并行推理

Hyundong Jin, Yo-Sub Han

发表机构 * Yonsei University（延世大学）

AI总结提出EPIC框架，通过词法记忆化、Earley解析验证和松弛兼容子集选择，解决扩散语言模型在CFG约束解码中的低效和并行性损失问题，推理时间降低67.5%，额外开销减少90.5%。

详情

AI中文摘要

控制语言模型输出对于确保结构有效性、可靠性和下游可用性至关重要，扩散语言模型也不例外。最近扩散语言模型解码的进展已将输出控制从常规约束扩展到上下文无关文法（CFG）约束。然而，现有方法的速度可能比无约束解码慢四倍。更重要的是，它们大大削弱了扩散语言模型相对于自回归模型的关键优势之一，即并行解码。这种减慢是因为在并行生成过程中，顺序有效性检查引入了显著开销。我们提出了一个高效的CFG约束解码框架EPIC，解决了这一限制。我们的方法通过结合词法记忆化、使用Earley风格解析（而非确定性自动机）进行验证，以及用于并行提交的松弛兼容子集选择，提高了解码效率。它减少了重复的词法分析和验证开销，同时允许多个兼容令牌一起提交。在三个基准测试上使用四个模型的实验表明，与现有的CFG约束解码方法相比，我们的方法将推理时间减少了高达67.5%，并将额外开销降低了高达90.5%。我们的实现可在https://github.com/hyundong98/EPIC-Decoding.git获取。

英文摘要

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC-Decoding.git .

URL PDF HTML ☆

赞 0 踩 0

2606.00718 2026-06-02 cs.AI math.OC 版本更新

LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization

LLM驱动的双组件耦合组合优化的协同进化自动启发式设计

Mingen Kuang, Xudong Deng, Xi Lin, Ye Fan, Jianyong Sun, Jialong Shi

发表机构 * Xi’an Jiao Tong University（西安交通大学）； Northwestern Polytechnical University（西北工业大学）

AI总结提出CoEvo-AHD框架，利用大语言模型协同进化两个算子种群，通过合作评估和联合交叉发现互补逻辑，解决旅行窃贼问题等耦合组合优化问题。

详情

AI中文摘要

虽然大语言模型（LLMs）最近在自动启发式设计（AHD）中展现出潜力，但现有方法通常将启发式作为单一算子或搜索策略生成和进化，限制了它们在诸如旅行窃贼问题（TTP）和旅行采购问题（TPP）等问题中对多个决策子结构之间强耦合建模的能力。在这项工作中，我们提出CoEvo-AHD，一个LLM驱动的双种群协同进化框架，用于耦合组合优化中的自动启发式设计。与先前单独进化单个算子的方法不同，CoEvo-AHD利用LLMs协同进化两个紧密相关的算子种群。合作评估机制明确捕获路径和选择算子之间的交互，而成对评分和协同联合交叉有助于发现互补的算子逻辑，以在耦合决策子空间上实现联合改进。我们进一步设计了一个工具调用环境库，将常用核心操作（如局部搜索增量计算）封装为可调用函数，使LLM生成的算子能够使用标准化接口，而不是重新实现低效且易出错的问题特定循环。在TTP和TPP上的实验表明，CoEvo-AHD自动发现合作启发式组合，并达到与传统启发式竞争的解质量。

英文摘要

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and evolve heuristics as a single operator or search strategy, limiting their ability to model strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD leverages LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures interactions between route and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic for joint improvement across coupled decision subspaces. We further design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, into callable functions, enabling LLM-generated operators to use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves competitive solution quality against traditional heuristics.

URL PDF HTML ☆

赞 0 踩 0

2606.00717 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Multi-Agent Conformal Prediction with Personalized Statistical Validity

具有个性化统计有效性的多智能体共形预测

Martin V. Vejling, Christophe A. N. Biscio, Adrien Mazoyer, Petar Popovski, Shashi Raj Pandey

发表机构 * Department of Electronic Systems（电子系统系）； Aalborg University（奥尔堡大学）； Department of Mathematical Sciences（数学科学系）； Institut de Mathématiques de Toulouse（图卢兹数学研究所）； Université de Toulouse（图卢兹大学）

AI总结提出个性化联邦加权共形预测框架，通过局部密度比加权和加权分位数聚合，在保护隐私的同时纠正数据异质性，为每个参与智能体提供渐近有效的边际和校准条件覆盖保证。

详情

AI中文摘要

不确定性量化在高风险机器学习任务中至关重要。然而，共形预测这一原则性解决方案在局部校准数据有限、隐私约束和数据异质性下面临挑战。在多智能体设置中，现有工作无法同时令人满意地解决这些挑战，其保证要么限于智能体间的平均值，要么在异质性设置中失去有效性。因此，我们提出个性化联邦加权共形预测（PFWCP），该框架结合局部密度比加权与加权分位数聚合，以在保护隐私的同时纠正异质性。该方法为每个参与智能体提供渐近有效的边际和校准条件覆盖保证，并支持一次性通信协议。理论分析呈现了对覆盖方差的调整，该调整由有效样本量表达式控制，这在加权共形预测的背景下是必要的，并且在合成和真实数据集上的实验表明，与最先进的联邦共形基线相比，校准质量有所提高。

英文摘要

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not simultaneously and satisfactorily address these challenges with guarantees either limited to averages across agents or losing validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.00708 2026-06-02 cs.AI cs.LG 版本更新

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

MOSAIC：结构化智能体智能与组合的模块化编排

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge, Lei Jiang, Kevin Zhang, Raad Khraishi, Yihao Ang, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni

发表机构 * Department of Computer Science, National University of Singapore（新加坡国立大学计算机科学系）； University College London（伦敦大学学院）； University of Edinburgh（爱丁堡大学）； Data & Analytics, Digital X（Digital X 数据与分析部）； Alan Turing Institute（艾伦·图灵研究所）

AI总结提出MOSAIC框架，通过结构化智能体编排、记忆驱动的模型选择和蓝图构建，将自动化数据科学转化为可验证、可复用的模型选择问题，在金融时间序列任务中优于AutoML和智能体基线。

详情

AI中文摘要

自动化数据科学是一个结构化的模型选择问题。解决方案必须为任务选择数据转换、特征表示、架构、训练过程、评估协议和优化策略。AutoML系统自动化了该过程的部分环节，但通常是在预定义的流水线、模型和超参数空间内搜索。基于LLM的智能体通过检索、代码生成和执行反馈提供了更大的灵活性，但其建模决策通常是非结构化的、难以验证且难以复用。我们引入了 extsc{MOSAIC}（结构化智能体智能与组合的模块化编排），一个用于记忆驱动的模型选择和工作流构建的结构化智能体框架。给定任务和数据集， extsc{MOSAIC}构建语义任务画像，检索先前的案例和源代码模块，并构建蓝图：一个指定所选建模组件、组合、接口约束和执行需求的中间表示。该蓝图将模型选择转化为分阶段、上下文驱动的搜索，并将基于LLM的代码生成建立在检索证据而非无约束合成之上。候选模型通过执行验证，并使用诊断反馈、训练轨迹、任务指标以及一个失败感知的强化学习策略进行优化。我们在金融时间序列预测和生成任务上实例化了 extsc{MOSAIC}，其中模型必须满足预测准确性、分布保真度、执行可靠性以及下游金融标准（如风险和尾部行为）。与AutoML和智能体基线的实验表明， extsc{MOSAIC}提高了任务性能、执行成功率和决策可追溯性，证明了将自动化数据科学视为结构化、可复用且基于执行的模型选择的价值。

英文摘要

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce \textsc{MOSAIC} (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, \textsc{MOSAIC} builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate \textsc{MOSAIC} on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that \textsc{MOSAIC} improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

URL PDF HTML ☆

赞 0 踩 0

2606.00703 2026-06-02 cs.IT cs.AI cs.LG math.IT 版本更新

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

通过约化到压缩高斯均值估计的比特约束随机优化的信息论下界

Munsik Kim

AI总结本文通过将强凸二次族优化问题精确约化为交互式压缩高斯均值估计问题，推导出比特约束随机优化的无条件下界，并给出近乎匹配的可实现性结果。

详情

AI中文摘要

低精度预训练（FP8, MXFP4, NVFP4）现已成为前沿语言模型的标准，但文献几乎完全是可实现性——算法和经验缩放定律——没有匹配的信息论可能性的刻画。我们研究B比特量化随机一阶预言机：优化器与T轮交互，每轮接收其随机梯度的B比特自适应公共硬币描述。我们的主要贡献是将强凸二次族优化精确约化为交互式压缩高斯均值估计——在B比特预言机下，查询不携带信息，因此优化完全坍缩为顺序分布式估计问题。这产生了两个无条件下界：通信界TB = Omega(d)和统计界T = Omega(sigma^2 d / eps^2)，以及尖锐的乘积形式界T = Omega((sigma^2 d / eps^2) max{1, d/B})。乘积形式也是无条件的：B比特转录本最多携带关于均值的O(TB / sigma^2) Fisher迹，因此比特而非维度限制了可恢复信息，结合多元van Trees不等式直接给出该界，无需有界似然比截断。我们给出了一个近乎匹配的可实现性结果，在有限动态范围预言机下精确计算每轮比特，紧至对数因子；下界针对真正高斯（无界）梯度，而缩小这一预言机差距留待未来。顺序率失真视角将约化扩展到相关和漂移预言机，并修正了先前的猜想：正噪声相关性将界提高(1+rho)/(1-rho)倍而非放松。这些界为任何低位梯度路径提供了信息论基线，而非关于已部署FP4系统的最优性声明。

英文摘要

Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability -- algorithms and empirical scaling laws -- with no matching characterization of what is information-theoretically possible. We study a B-bit quantized stochastic first-order oracle: an optimizer interacts for T rounds and receives, each round, a B-bit adaptive public-coin description of its stochastic gradient. Our main contribution is an exact reduction from optimizing a strongly convex quadratic family to interactively compressed Gaussian mean estimation -- under the B-bit oracle the query carries no information, so optimization collapses exactly onto a sequential distributed-estimation problem. This yields two unconditional lower bounds, a communication bound TB = Omega(d) and a statistical bound T = Omega(sigma^2 d / eps^2), and the sharp product-form bound T = Omega((sigma^2 d / eps^2) max{1, d/B}). The product form is also unconditional: a B-bit transcript carries at most O(TB / sigma^2) of Fisher trace about the mean, so bits rather than dimension limit the recoverable information, and combined with the multivariate van Trees inequality this gives the bound directly, without bounded-likelihood-ratio truncation. We give a near-matching achievability result with exact per-round bit accounting under a bounded-dynamic-range oracle, tight up to a logarithmic factor; the lower bound is for truly Gaussian (unbounded) gradients, and closing this oracle gap is left open. A sequential rate-distortion perspective extends the reduction to correlated and drifting oracles and corrects an earlier conjecture: positive noise correlation raises the bound by (1+rho)/(1-rho) rather than relaxing it. The bounds give an information-theoretic baseline for any low-bit gradient path, not an optimality claim about deployed FP4 systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00702 2026-06-02 cs.RO cs.AI 版本更新

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

塑造你的身体：用于多形态机器人设计的价值梯度

Nico Bohlinger, Jan Peters

发表机构 * Technical University of Darmstadt（德累斯顿技术大学）； Robotics Institute Germany (RIG)（德国机器人研究所）； German Research Center for AI (DFKI)（德国人工智能研究中心）； hessian.AI（黑森AI）

AI总结提出将通用多形态价值函数转化为可复用模型，通过价值梯度优化机器人设计，无需为每个机器人重新进行强化学习协同设计。

2606.00700 2026-06-02 cs.LG cs.AI 版本更新

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

COPF：演化图中部署稳定的反事实公平性在线框架

Sheng'en Li, Dongmian Zou

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结针对演化图上的在线链接推荐，提出COPF框架，通过反事实暴露机会差距、显式探索和残差不可区分性审计，实现部署稳定的公平性监控与控制。

Comments Accepted at ICML 2026

详情

AI中文摘要

演化图上的在线链接推荐是表演性的：通过选择向用户展示哪些候选链接，系统会改变哪些链接形成以及后续观察到的反馈。因此，来自记录结果的公平性估计可能具有误导性，并且在推荐策略更新后部署时可能会漂移。我们引入了COPF（反事实在线表演性公平性），这是一个用于在线链接推荐中部署稳定的公平性监控和控制的决策层框架。COPF (i) 定义了暴露（展示 vs. 未展示）反事实上的群体级机会差距，(ii) 通过显式探索和记录每个候选被展示的概率（倾向性）使其可估计，以及(iii) 使用图感知双重稳健（GA-DR）估计器，在可配置的审计器族上通过残差结果不可区分性（OI）审计和控制公平性。我们提供了一个噪声传递定理，表明在时间混合和有界局部干扰下，估计的GA-DR残差上的残差OI意味着暴露反事实群体差距的界限，并实例化了一个在线多校准审计器以及一个原始-对偶控制器。在两个TGB流和一个受控的合成二分图流上的实验表明，COPF减少了暴露反事实群体差距的最坏情况峰值，同时对排序效用的影响较小。我们的代码可在 https://github.com/lsnnnnnnnn/COPF 获取。

英文摘要

Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision-layer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal-dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.

URL PDF HTML ☆

赞 0 踩 0

2606.00674 2026-06-02 cs.LG cs.AI 版本更新

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

结果优化的悖论：LLM中推理捷径的因果信息论界限

Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding, Yining Sun

发表机构 * HFIPS, Chinese Academy of Sciences（中国科学院HFIPS）； University of Science and Technology of China（中国科学技术大学）

AI总结针对基于结果强化学习的LLM在分布外任务中推理脆弱的问题，提出因果信息论框架解释奖励诱导的流形坍缩，并证明过程奖励模型作为拓扑滤波器可消除低复杂度捷径。

详情

AI中文摘要

通过基于结果的强化学习（RL）对齐的大型语言模型（LLM）经常表现出一种关键失败模式：它们在分布内基准测试上取得高性能，但在分布外（OOD）任务上推理能力脆弱。我们将这种现象称为奖励诱导的流形坍缩。我们建立了一个理论框架，将结构因果模型（SCM）和信息瓶颈（IB）原理联系起来，以解释这一悖论。我们将推理定义为高复杂度的因果过程，将捷径学习定义为利用低复杂度的虚假相关性。在随机梯度下降（SGD）的隐式归纳偏置下，只要训练分布允许对真实因果机制进行“马尔可夫筛选”，优化结果奖励的模型就会偏向于捷径解。我们基于语义覆盖度量（$\eta$）而非样本量推导了一个新的泛化界限，说明了为什么在同质分布上扩展数据可能无法纠正推理缺陷。我们还表明，过程奖励模型（PRM）作为拓扑滤波器，通过强制执行逐步互信息约束，使得低复杂度的捷径流形不可行。这些结果为过程监督在简单信用分配之外的作用提供了数学基础。

英文摘要

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a ``Markovian Screening'' of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure ($η$) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.

URL PDF HTML ☆

赞 0 踩 0

2606.00672 2026-06-02 cs.AI cs.LG 版本更新

Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling

基于边缘感知交互风险建模的阿尔茨海默病患者药物感知金融剥削检测

Farzana Akter, Lisan Al Amin, Rakib Hossain, Chaitanya Gunupudi, Faisal Quader

发表机构 * Cognitive Links LLC ； University of Maryland, College Park（马里兰大学学院公园分校）

AI总结提出一种药物感知框架，通过同步药物依从性与交易监控，利用交互感知逻辑模型提升对认知风险金融事件的检测，尤其在药物脆弱窗口期召回率从0.7442提升至0.9070。

详情

AI中文摘要

金融剥削对阿尔茨海默病患者日益构成威胁，尤其是在认知稳定性下降期间。传统欺诈检测系统通常仅依赖金融行为，忽略可能改变脆弱性的临床相关因素。本文提出一种药物感知框架，将药物依从性与交易级监控同步，以改进对认知风险金融事件的检测。构建了180名患者45天的混合模拟数据集，产生8,100条药物记录和30,855笔交易。该框架通过纯金融、加性药物感知和交互感知逻辑模型评估金额异常、商家新颖性、交易频率、时间偏差和药物依从性。结果表明，纯金融基线获得了最高的全局F1分数0.5000，但交互感知模型在药物诱导脆弱窗口期内将召回率从0.7442提升至0.9070，并在排名高风险案例中实现了最高平均精度。研究结果表明，药物依从性作为金融风险的上下文修饰因子比作为孤立预测因子更有用。

英文摘要

Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may alter vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve detection of cognitively risky financial events. A hybrid simulation dataset was constructed for 180 patients across 45 days, producing 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence through financial-only, additive medication-aware, and interaction-aware logistic models. Results show that the financial-only baseline obtained the highest global F1-score of 0.5000, but the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision for ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as an isolated predictor.

URL PDF HTML ☆

赞 0 踩 0

2606.00671 2026-06-02 cs.AI cs.CL cs.LG 版本更新

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

AXIOM: 一种用于可验证数学推理的信任优先神经符号执行架构

Alessio Bruno

发表机构 * Independent researcher（独立研究者）

AI总结提出AXIOM架构，将语言模型限制为规范化器，通过确定性计算机代数系统管道实现可验证的数学推理，在4个MATH类别上达到94.36%的正确率和100%的信任度。

Comments Preprint. 12 pages, 2 figures. Live interactive demo: https://huggingface.co/spaces/Squagghy/axiom-solver. Paper artifact and dataset on Zenodo (concept-DOI): 10.5281/zenodo.20440225

详情

DOI: 10.5281/zenodo.20440225

AI中文摘要

我们提出AXIOM，一种用于自然语言数学推理的信任优先神经符号执行架构。在AXIOM中，语言模型严格作为规范化器：它将非正式问题文本重写为狭窄的模式，由确定性计算机代数系统（CAS）管道消费，该管道推导并验证答案，或作为第一类输出弃权。路由遵循问题形状正则表达式、特定模式提示和封闭形式CAS处理器之间的1:1:1对齐，已交付3100多条这样的路由，并在250多个连续提交中零LOST_CORRECT回归。我们在4个MATH类别上报告了实证结果，累积正确率为94.36%（2,592/2,747），可解析问题的信任度为100.00%（在整个2,747条记录基准测试中零自信错误答案），所有四个领域均高于每个领域70/90/70的阈值，每个领域信任度为100.0%，仅规则处理器的中位延迟为1毫秒（在lm-eval算术20,000条记录基准测试中占88%的记录）。该架构通过公共部署已服务约30,000次生产查询。我们强调的贡献不是最终的准确率数字，而是该架构建立的向前动态：生产中的每个记录弃权在一次发布周期后都是候选正确，因为新任务在不回归注册表的情况下组合。支撑这一特性的操作纪律——数学模板分桶、LOST_CORRECT扫描作为回归预言机、可解析优先接入以及弃权作为第一类输出——构成了一个可迁移的框架，适用于数学之外的值得信赖的神经符号系统。

英文摘要

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property -- math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output -- constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.

URL PDF HTML ☆

赞 0 踩 0

2606.00670 2026-06-02 cs.SD cs.AI 版本更新

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

超越口部：声学不确定性下视听句子识别中的上半脸情感线索

Zhou Yang, Yueyi Yang

发表机构 * Faculty of Education and Psychology, University of Oulu, Finland（奥卢大学教育与心理学学院，芬兰）； Center for Machine Vision and Signal Analysis, University of Oulu, Finland（奥卢大学机器视觉与信号分析中心，芬兰）

AI总结本研究利用CREMA-D语料库，通过特征分类器探究在声学退化条件下，上半脸情感信息是否有助于视听句子识别，发现上半脸情感线索能提升模型校准和鲁棒性。

详情

AI中文摘要

面对面言语理解本质上是多模态的，整合了声学信号与可见的发音、面部表情、头部运动及其他社交相关线索。虽然视听言语系统通常将口部区域作为语言信息的主要视觉来源，但情感面部表情常被单独视为情感识别目标。本文研究在声学退化条件下，上半脸情感信息是否有助于视听句子识别，超越音频和口部区域线索。使用CREMA-D视听情感言语语料库，我们在四种线索条件下训练基于特征的句子分类器：仅音频（A）、音频加口部/下半脸特征（A+M）、音频加上半脸特征（A+U）以及音频加口部和上半脸特征（A+M+U）。模型在干净音频和粉红噪声条件下（+10 dB、+5 dB和0 dB SNR）进行评估，采用演员独立划分。结果表明，在退化音频下，口部/下半脸特征提供了显著的鲁棒性优势。在0 dB SNR下，A+M相比A准确率提升0.0794，演员自举95%置信区间为[0.0296, 0.1298]。上半脸情感线索表现出更微妙的效果。尽管A+M+U相比A+M的直接准确率增益很小，但全脸模型在不同SNR水平上持续改善校准，并且在噪声条件下优于打乱的上半脸对照。这些发现表明，情感面部信息可能支持声学不确定性下的多模态鲁棒性和置信度估计，而不直接编码词汇内容。更广泛地说，该研究强调了社交表达性面部线索在以人为中心的视听交互系统中的潜在作用。

英文摘要

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00658 2026-06-02 cs.CV cs.AI 版本更新

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Wan2.2双专家视频扩散模型的协同少步蒸馏与低位量化

Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu, Yang Yong, Jinyang Guo, Xianglong Liu

发表机构 * IEEE ICME 2026 ； GCC Low-Bit-width Large Model Quantization Challenge（GCC 低精度大模型量化挑战）

AI总结针对Wan2.2-T2V-A14B视频扩散模型，提出结合少步分布匹配蒸馏与低位量化的部署压缩流程，通过双专家去噪分支校准、敏感层保护及HiF4低位表示，在保持质量的同时降低计算开销。

详情

AI中文摘要

大型视频扩散模型实现了强大的视觉质量，但由于每个样本需要大量去噪步骤和较大的驻留参数足迹，部署成本仍然很高。本文研究了一种面向部署的压缩流程，针对Wan2.2-T2V-A14B模型，结合少步分布匹配蒸馏与低位量化。该流程遵循模型的双专家去噪路线，分别校准高噪声和低噪声分支，保护敏感入口层，并使用HiF4风格的低位表示以改善动态范围覆盖。量化是在蒸馏后的少步学生模型上校准，而非原始的长步轨迹上，从而减少推理过程中的激活分布不匹配。所提出的协同设计使量化模型保持接近同步全精度模型，并在平均8步和20步时超越原始全精度基线。在测试配置中，20步设置提供了最佳的质量-效率权衡。

英文摘要

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.

URL PDF HTML ☆

赞 0 踩 0

2606.00656 2026-06-02 cs.LG cs.AI 版本更新

Demystifying the Optimal Fair Classifier in Multi-Class Classification

揭秘多类分类中的最优公平分类器

Li Zhang, Yuyuan Li, XiaoHua Feng, Jiaming Zhang, Fengyuan Yu, Chaochao Chen

发表机构 * College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）； College of Computer Science（计算机科学学院）； Technology, Zhejiang University（技术，浙江大学）； School of Communication Engineering, Hangzhou Dianzi University（杭州电子科技大学通信工程学院）

AI总结本文针对多类分类中的公平性问题，提出了一种在公平约束下最优分类器的概率公式，并设计了两种属性盲算法（处理中与处理后）以逼近最优精度-公平帕累托前沿。

Comments Accepted to ICML 2026

详情

AI中文摘要

确保不同群体之间的公平公正对待，特别是在多类分类任务中，由于机器学习模型中固有的持续偏差，构成了重大挑战。大多数现有的偏差缓解技术针对二元设置，而多维输出和复杂公平机制的存在使得它们扩展到多类场景既不直接也不有效。在本文中，我们研究了公平分类中两个基本且未解决的挑战：（i）刻画多类设置中的最优精度-公平前沿，以及（ii）设计在不同训练阶段达到此最优值的实用算法。为应对这些挑战，我们首先指定了公平约束下最优分类器的解析可处理概率公式。在此基础上，我们提出了两种属性盲算法以在实践中实施公平要求：一种是通过约简方法在训练期间进行公平干预的处理中方法，以及一种通过插件估计微调输出概率的处理后方法。理论分析表明，两种方法都收敛到最优精度-公平帕累托前沿。在多个数据集上进行的实验证明了我们的方法在平衡精度和公平性方面的优越性能。

英文摘要

Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challenge due to the persistent biases inherent in machine learning models. Most existing bias mitigation techniques are tailored to binary settings, and the presence of multi-dimensional outputs and complex fairness mechanisms makes their extension to multi-class scenarios neither straightforward nor effective. In this paper, we investigate two fundamental, unresolved challenges in fair classification: (i) characterizing the optimal accuracy-fairness frontier in multi-class settings, and (ii) designing practical algorithms that attain this optimum in different training phases. To tackle these challenges, we first specify an analytically tractable probabilistic formulation of the optimal classifier under fairness constraints. Building upon this, we propose two attribute-blind algorithms to enforce fairness requirements in practice: an in-processing approach for fairness intervention during training via the reduction approach, and a post-processing approach for fine-tuning output probabilities with plug-in estimation. Theoretical analysis reveals that both methods converge to the optimal accuracy-fairness Pareto frontier. Experiments conducted on multiple datasets demonstrate the superior performance of our methods in balancing accuracy and fairness.

URL PDF HTML ☆

赞 0 踩 0

2606.00655 2026-06-02 cs.MA cs.AI cs.CY 版本更新

隐藏的思考并非秘密：大型语言模型中的推理痕迹暴露

Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa, Chia-Mu Yu

发表机构 * National Yang Ming Chiao Tung University（国家阳明交通大学）； UC Berkeley（伯克利大学）

AI总结本文提出推理暴露提示（REP）方法，通过影子模型生成的示范以辅助代码格式包装，从受害者模型中引出用户可见的推理痕迹，显著提高暴露痕迹与内部痕迹的相似性并保留有用推理信号。

详情

AI中文摘要

推理痕迹已成为改进和转移大型语言模型能力的有价值学习信号。特别是，详细痕迹有助于将推理行为从更强的教师模型蒸馏到较弱的学生模型。能力转移的价值促使许多部署了推理模型的系统隐藏原始内部痕迹，最多向用户暴露摘要和答案。因此，我们提出这样的问题：这种接口级别的痕迹隐藏是否能防止用户通过提示获得有用的推理监督？我们通过推理暴露提示（REP）研究这个问题，这是一种轻量级的上下文引出方法，使用影子模型生成的示范以辅助代码格式包装，从受害者模型中引出用户可见的推理痕迹。在常见的推理数据集、不同的受害者模型和不同的学生模型蒸馏中，REP显著提高了暴露痕迹与REP条件内部痕迹之间的相似性，同时保留了有用的推理信号。

英文摘要

Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.

URL PDF HTML ☆

赞 0 踩 0

2606.00636 2026-06-02 cs.AR cs.AI 版本更新

LP5X-PIM Sim: A High-Fidelity HW/SW Integrated Simulator for LPDDR5X-PIM

LP5X-PIM Sim：用于LPDDR5X-PIM的高保真硬件/软件集成模拟器

SangHoon Cha, Jaewan Choi, Byeongho Kim, Yoonah Paik, Sukhan Lee, Kyomin Sohn

发表机构 * Samsung Electronics, South Korea（三星电子（韩国））

AI总结本文介绍三星电子开发的LPDDR5X-PIM模拟器，通过集成硬件数据路径和软件控制层的高保真模型，实现系统性能和能效的精确评估。

Comments 4 pages, 4 figures, tech note

2606.00621 2026-06-02 cs.CR cs.AI cs.CY 版本更新

Authenticity Debt and the Synthetic Content Threat Landscape: A Layered Framework for Trust, Provenance, and IP Governance in the Generative AI Era

真实性债务与合成内容威胁格局：生成式AI时代信任、溯源和知识产权治理的分层框架

Shubhashis Sengupta, Benjamin McCarty, Milind Savagaonkar, Rhine Andotra

发表机构 * Accenture Services Pvt. Ltd.（Accenture服务有限公司）

AI总结提出真实性债务概念，并基于零信任架构原则设计分层参考架构，整合密码学溯源、人工验证和持续治理，以应对生成式AI带来的合成内容威胁。

详情

AI中文摘要

生成式人工智能从根本上改变了内容的生产方式。它使得高保真文本、图像、音频和视频能够以接近零的边际成本创建、修改和重新分发。这种转变使企业和生态系统面临跨四个相互加强的真实性层（真实性、溯源、完整性和问责性）的多种风险，而传统控制措施单独无法充分应对。我们引入了真实性债务的概念：当组织在未保留可验证来源、完整性和问责性的情况下部署AI生成内容时，累积的制度性负债，将暴露推迟到监管、法律或市场审查之下。本文提出了生成式AI危害和攻击向量的全面多维分类法，调查了技术控制（包括数字水印、溯源框架（C2PA、Adobe CAI）和检测技术）的能力和失效模式，并论证了在开放、对抗和不断变化的环境中没有任何单一机制是足够的。借鉴零信任架构原则和企业治理框架，我们提出了一个分层参考架构，整合密码学溯源、人工验证和持续治理，以大规模维持可辩护的真实性。我们进一步审视了监管格局（欧盟AI法案、美国联邦贸易委员会、NIST AI风险管理框架），并为寻求将真实性建设为制度基础设施而非事后考虑的组织确定了实用指导原则。

英文摘要

Generative artificial intelligence has fundamentally changed how content is now produced. It has enabled how high-fidelity text, images, audio, and videos are created, modified, and redistributed at near-zero marginal cost. This shift exposes enterprises and ecosystems to a number of risks across four reinforcing authenticity layers -- authenticity, provenance, integrity, and accountability -- that traditional controls are inadequate to address in isolation. We introduce the concept of authenticity debt: the cumulative institutional liability that accumulates when organizations deploy AI-generated content without preserving verifiable origin, integrity, and accountability, deferring exposure that surfaces under regulatory, legal, or market scrutiny. This paper presents a comprehensive, multi-dimensional taxonomy of generative AI harms and attack vectors, surveys the capabilities and failure modes of technical controls including digital watermarking, provenance frameworks (C2PA, Adobe CAI), and detection technologies, and argues that no single mechanism is sufficient in open, adversarial, and evolving environments. Drawing on Zero Trust Architecture principles and enterprise governance frameworks, we propose a layered reference architecture that integrates cryptographic provenance, human-in-the-loop verification, and continuous governance to sustain defensible authenticity at scale. We further examine the regulatory landscape (EU AI Act, U.S.\ FTC, NIST AI RMF) and identify practical guiding principles for organizations seeking to build authenticity as institutional infrastructure rather than an afterthought.

URL PDF HTML ☆

赞 0 踩 0

2606.00619 2026-06-02 cs.CL cs.AI 版本更新

MemPro: Agentic Memory Systems as Evolvable Programs

MemPro：作为可进化程序的智能体记忆系统

Qingshan Liu, Guoqing Wang, Wen Wu, Jingqi Huang, Xinqi Tao, Dejia Song, Jie Zhou, Liang He

发表机构 * East China Normal University（东华师范大学）； Xiaohongshu Inc.（小红书公司）

AI总结提出MemPro框架，将整个记忆构建-检索管道视为可进化程序，通过故障模式引导的编辑-调试迭代优化，在多个长时任务数据集上超越静态和提示级进化基线。

Comments 20 pages, 14 figures

详情

AI中文摘要

长时程自主智能体需要记忆系统来保留历史信息、跟踪演化状态并在有限上下文窗口之外重用相关知识。现有的智能体记忆系统通常遵循记忆构建-检索（MCR）管道，但往往主要适应记忆库，而在部署后保持周围管道固定。这种固定管道设计难以处理异构的任务特定故障模式，并且可能随着时间推移与规模和结构演化的记忆库产生错位。为解决这些限制，我们提出MemPro，一种系统级进化框架，将整个MCR管道视为可进化程序，而不仅仅是适应记忆库或提示文本。MemPro维护一个可运行记忆系统实现的版本树，其中进化智能体迭代选择有前途的版本，诊断重复出现的故障，并通过故障模式引导的编辑-调试改进创建改进的子版本。在LongMemEval、LoCoMo、HotpotQA和NarrativeQA上的实验表明，MemPro在几次迭代内持续优于强静态和提示级进化基线，随着进化持续改进，并实现了良好的性能-成本权衡。代码可在https://github.com/wanghai673/MemPro获取。

英文摘要

Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction-retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed-pipeline design struggles to handle heterogeneous task-specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system-level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory-system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure-mode-guided edit-debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt-level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance-cost trade-off. Code is available at https://github.com/wanghai673/MemPro.

URL PDF HTML ☆

赞 0 踩 0

2606.00618 2026-06-02 cs.AI 版本更新

Efficient Test-time Inference for Generative Planning Models

生成式规划模型的高效测试时推理

Robert Gieselmann, Mihai Samson, Federico Pecora, Jeremy L. Wyatt

发表机构 * University of California, Berkeley（加州大学伯克利分校）； ETH Zurich（苏黎世联邦理工学院）

AI总结本文提出一种改进的开放-封闭列表搜索算法，结合生成模型和启发式模型，在测试时高效推理，提升生成式规划模型的解质量和计算效率。

详情

AI中文摘要

生成式模型已成为人工智能规划的强大范式，但其性能仍受训练数据分布的限制。一种方法是通过扩展测试时计算来改进推理过程中生成的解决方案。更高效的替代方案是优化推理过程本身。在本文中，我们展示了经典开放-封闭列表（OCL）搜索的修改版本提供了这样一种高效的推理过程。我们的算法协同了两个学习组件：一个从中间状态执行快速推演的生成模型，以及一个在候选推理路径中优先排序的启发式模型。关键贡献包括新颖的探索控制机制以及将学习模型集成到OCL框架中。在多个组合规划领域中，我们的方法在计算效率和解质量上均优于神经符号搜索基线和经典求解器。

英文摘要

Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more efficient alternative is to optimize the inference process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inference procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality.

URL PDF HTML ☆

赞 0 踩 0

2606.00613 2026-06-02 cs.CL cs.AI 版本更新

Linguistics-Aware Non-Distortionary LLM Watermarking

语言学感知的无失真LLM水印

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han

发表机构 * Yonsei University（延世大学）； Rensselaer Polytechnic Institute（罗切斯特理工学院）

AI总结提出LUNA水印方法，通过语言自适应非失真二元锦标赛采样器，在保持文本质量的同时实现高检测性能，在12种设置中AUROC达0.9959且中位困惑度偏移仅0.045。

详情

AI中文摘要

水印应能识别语言模型输出而不降低质量或限制验证仅由模型提供者进行。多语言部署使这更加困难，因为形态、分词和书写系统的变化会改变水印证据自然进入的位置。我们引入LUNA，一种语言自适应水印，结合了无模型检测和标准随机密钥模型下的单令牌无失真。LUNA从外部语料库中的词性上下文估计归一化下一标记熵，并用其设置无失真二元锦标赛采样器的深度；检测器从文本、分词器、词性标注器和密钥重建相同的调度。我们在六种类型多样的语言和两个领域上评估了八种主要基线。LUNA在十二种设置中达到了0.9959的AUROC和最低的平均绝对中位困惑度偏移0.045；其95%自助法区间[0.022, 0.073]低于所有基线区间。LUNA还记录了最低的平均Self-BLEU、Distinct-1、surprisal和熵偏移。它是唯一同时在大多数设置中实现AUROC > 0.99和绝对中位困惑度偏移低于0.1的方法，在12种设置中的9种达到该状态，而没有任何基线在超过2种设置中达到。我们的代码可在https://github.com/Shinwoo-Park/luna_watermark获取。

英文摘要

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo-Park/luna_watermark

URL PDF HTML ☆

赞 0 踩 0

2606.00611 2026-06-02 cs.AI 版本更新

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

TRACE: 面向长程智能体安全的轨迹风险感知压缩

Zhepei Hong, Lin Wang, Liting Li, Haokai Ma, Junfeng Fang, Fei Shen, Dan Zhang, Xiang Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）； National University of Singapore（新加坡国立大学）； South China Normal University（华南师范大学）

AI总结提出轨迹风险感知压缩方法TRACE，通过压缩器-阅读器架构将长轨迹压缩为潜在证据状态，以聚合稀疏风险信号并提升长程安全检测准确率。

详情

AI中文摘要

长程LLM智能体在长轨迹中产生安全证据，其中稀疏、延迟和组合的风险信号常常逃脱局部审核。现有的轮次级或短上下文检测器难以在长时间跨度内可靠地保留和聚合此类证据。我们将长程智能体安全检测重新定义为轨迹级证据压缩，并提出面向长程智能体安全的轨迹风险感知压缩（TRACE）。TRACE采用压缩器-阅读器设计：压缩器在轨迹级监督下将完整轨迹编码为紧凑的潜在证据状态，阅读器以该潜在证据状态作为安全参考来判断原始轨迹。该设计有助于聚合分散的风险线索并减少过早的证据丢失。在ASSEBench、Pre-Ex-Bench和R-Judge上，TRACE在所有评估基线上取得了最佳准确率，相比强基线最高提升12.6个百分点。在LongSafety上，TRACE随着上下文长度增加表现出更小的性能下降。注意力可视化和案例研究表明，压缩后的参考有助于阅读器聚焦于风险关键片段并恢复跨步证据。代码可在https://github.com/Peregrine123/TRACE_official获取。

英文摘要

Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.

URL PDF HTML ☆

赞 0 踩 0

2606.00610 2026-06-02 cs.IR cs.AI cs.MA 版本更新

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

MemGraphRAG：基于记忆的多智能体系统用于图检索增强生成

Chuanjie Wu, Zhishang Xiang, Yunbo Tang, Zerui Chen, Qinggang Zhang, Jinsong Su

发表机构 * Xiamen University（厦门大学）； Jilin University（吉林大学）

AI总结提出MemGraphRAG框架，通过基于记忆的多智能体系统构建高质量知识图谱，并设计记忆感知的分层检索算法，在多个基准上超越现有模型。

Comments Accepted by KDD 2026

详情

AI中文摘要

检索增强生成（RAG）已成为通过利用外部知识来减轻大型语言模型（LLMs）幻觉的重要方法。虽然对简单查询有效，但传统RAG在处理信息高度碎片化的大规模非结构化语料库时存在困难。基于图的RAG（GraphRAG）引入知识图谱来捕获结构关系，从而实现对复杂推理的更全面检索。然而，现有的GraphRAG方法依赖孤立的、片段级别的提取来构建图，缺乏对整个语料库的全局视角。因此，这些方法经常导致主题不一致、逻辑冲突和结构碎片化的图，从而降低检索性能。在本文中，我们提出MemGraphRAG，一种新颖的框架，引入基于记忆的多智能体系统以确保高质量的图构建。具体来说，MemGraphRAG采用由共享记忆支持的智能体协作社会，在整个提取过程中提供统一的全局上下文。这种机制允许智能体动态解决逻辑冲突并保持整个语料库的结构连通性。此外，我们提出了一种针对所构建图的记忆感知分层检索算法。在多个基准上的大量实验表明，MemGraphRAG以相当的效率优于最先进的基线模型。我们的代码可在https://github.com/XMUDeepLIT/MemGraphRAG获取。

英文摘要

Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by leveraging external knowledge. Although effective for simple queries, traditional RAG struggles with large-scale, unstructured corpora where information is highly fragmented. Graph-based RAG (GraphRAG) incorporates knowledge graphs to capture structural relationships, enabling more comprehensive retrieval for complex reasoning. However, existing GraphRAG methods rely on isolated, fragment-level extraction for graph construction, lacking a global perspective on the whole corpus. As a result, these methods frequently lead to thematically inconsistent, logically conflicting, and structurally fragmented graphs that degrade retrieval performance. In this paper, we propose MemGraphRAG, a novel framework that introduces a memory-based multi-agent system to ensure high-quality graph construction. Specifically, MemGraphRAG employs a collaborative society of agents supported by shared memory, which provides a unified global context throughout the extraction process. This mechanism allows agents to dynamically resolve logical conflicts and maintain structural connectivity throughout the corpus. Furthermore, we propose a memory-aware hierarchical retrieval algorithm tailored for the constructed graph. Extensive experiments on multiple benchmarks demonstrate that MemGraphRAG outperforms the state-of-the-art baseline models with comparable efficiency. Our code is available at https://github.com/XMUDeepLIT/MemGraphRAG.

URL PDF HTML ☆

赞 0 踩 0

2606.00609 2026-06-02 cs.LG cs.AI 版本更新

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

CARE-RL：用于缓解跨领域冲突的能力感知强化学习

Rui Zhang, Xinle Wu, Yao Lu

发表机构 * National University of Singapore（新加坡国立大学）

AI总结提出CARE-RL框架，结合协议感知奖励生成与能力感知优化，通过PA-GRM和DACSP方法缓解多领域强化学习中的奖励不可靠与能力干扰问题。

详情

AI中文摘要

具有可验证奖励的强化学习在面向推理的大语言模型中取得了显著进展，但由于非可验证任务中奖励不可靠以及跨领域能力干扰，将其扩展到多领域强化学习仍具挑战性。我们提出CARE-RL，将协议感知奖励生成与能力感知优化相结合，以缓解跨领域冲突。对于非可验证任务，协议感知生成式奖励模型（PA-GRM）在生成轨迹条件奖励之前构建提示级别的评估协议和模式，从而实现对开放式响应的任务自适应且可比较的评估。对于多领域优化，方向感知能力子空间投影（DACSP）从先前的强化学习阶段提取历史能力方向，并通过放大对齐分量、抑制冲突分量以及保留正交更新来调节后续更新。在数学、聊天和指令遵循基准上的实验表明，CARE-RL始终优于标准的多领域强化学习基线，在Qwen2.5-7B和Qwen3-4B上分别达到47.9和50.7的总平均分。

英文摘要

Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.

URL PDF HTML ☆

赞 0 踩 0

2606.00593 2026-06-02 cs.CL cs.AI 版本更新

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER: 面向多答案问答的逐步同伴优势与多样性感知探索奖励

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

发表机构 * State Key Lab of CAD&CG, Zhejiang University（浙江大学CAD与CG国家重点实验室）； School of Software and Microelectronics, Peking University（北京大学软件与微电子学院）； School of Software Technology, Zhejiang University（浙江大学软件技术学院）

AI总结提出SPADER强化学习框架，通过逐步同伴优势（SPA）机制和多样性感知探索奖励，解决多答案问答中长程工具使用的细粒度信用分配与持续探索问题，实验表明在多个数据集上提升了召回率和F1分数。

详情

AI中文摘要

大型语言模型越来越多地被部署为工具增强型智能体，以获取参数知识之外的信息。虽然最近的工作改进了长程工具使用推理，但大多数方法专注于具有单一正确答案的任务。相比之下，许多现实世界中的查询需要发现一组全面的有效答案，这种设置被称为多答案问答。这种设置带来了两个挑战：长搜索轨迹上的细粒度信用分配，以及超越简单高频实体的持续探索的奖励对齐。我们提出了SPADER，一个用于多答案问答中长程工具使用的强化学习框架。SPADER包括逐步同伴优势（SPA），一种无评论家的逐步信用分配机制，它通过决策步骤对齐并行轨迹，并根据同伴回报估计优势。它还包括一个多样性感知探索奖励，通过加权稀有发现和降低冗余发现的权重来促进长尾实体发现。在QAMPARI、Mintaka、WebQSP和QUEST上的实验表明，SPADER通常比基于提示的智能体、结果监督的强化学习方法和最近的逐步监督方法提高了召回率和整体F1分数。我们的代码和模型权重可在https://github.com/KhanCold/spader获取。

英文摘要

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.

URL PDF HTML ☆

赞 0 踩 0

2606.00590 2026-06-02 cs.IR cs.AI 版本更新

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Critic-R：使用具有自然语言内省反馈的指令调优检索器改进智能搜索

Md Zarif Ul Alam, Alireza Salemi, Hamed Zamani

发表机构 * Center for Intelligent Information Retrieval（智能信息检索中心）； University of Massachusetts Amherst（马萨诸塞大学阿默斯特分校）

AI总结提出Critic-R框架，通过引入评论模型评估智能体内省推理轨迹，实现检索模型与推理代理之间的反馈闭环，无需人工标注即可优化检索质量与下游答案准确性。

详情

AI中文摘要

智能搜索系统迭代地与检索模型交互以回答复杂查询。尽管取得了实质性进展，但优化检索器以适应智能搜索仍然具有挑战性，通常需要大量的协同训练或黄金标准标注，这限制了现实世界的适用性。我们提出Critic-R，一个在推理和训练过程中明确关闭推理代理与检索模型之间反馈循环的框架。Critic-R引入了一个评论模型，该模型在消费检索到的证据后评估代理的内省推理轨迹，以确定检索到的上下文是否充分支持下一步推理。Critic-R具有两种互补机制：Critic-R-Zero，一种推理时查询细化循环，迭代地重写查询和检索指令；以及Critic-Embed，一种检索模型的优化方法，利用成功和失败的细化轨迹作为自动监督，无需手动相关性标注。我们在HotpotQA、2WikiMultihopQA、MuSiQue和Bamboogle上评估Critic-R。结果表明，Critic-R显著提高了检索质量和下游答案准确性。

英文摘要

Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-standard annotations that limit real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model during both inference and training. Critic-R introduces a critic model that evaluates the agent's introspective reasoning trace after consuming retrieved evidence to determine whether the retrieved context sufficiently supports the next reasoning step. Critic-R has two complementary mechanisms: Critic-R-Zero, an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions, and Critic-Embed, an optimization approach for retrieval models that leverages successful and failed refinement trajectories as automatic supervision without requiring manual relevance annotation. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.

URL PDF HTML ☆

赞 0 踩 0

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Improving Visual Representation Alignment Generation with GRPO

利用GRPO改进视觉表示对齐生成

Shentong Mo, Sukmin Yun

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Hanyang University（翰阳大学）

AI总结提出VRPO方法，通过强化学习将静态对齐损失替换为生成式表示策略优化目标，动态平衡表示一致性与生成质量，在扩散Transformer中实现更快的收敛和更高的图像保真度。

详情

AI中文摘要

最近的扩散Transformer展示了强大的图像合成能力，但由于生成表示与判别表示之间的弱对齐，训练效率仍然较低。虽然表示对齐框架（如REPA）通过将噪声去噪特征与预训练视觉编码器对齐来改善收敛，但其外部监督的对齐损失是静态的，在训练和推理过程中缺乏自适应性。现有方法依赖于固定的余弦对齐或对比目标，无法动态平衡表示一致性和生成质量，导致判别收益有限，且无法以任务自适应方式优化对齐。为了解决这个问题，我们提出了VRPO，一种基于强化学习的优化策略，用生成式表示策略优化目标取代REPA的静态对齐损失。VRPO不强制执行固定的相似性约束，而是将表示对齐视为一个奖励引导的过程：模型根据生成保真度、感知质量以及扩散特征与预训练视觉嵌入之间的语义一致性获得自适应奖励。这种公式使生成器能够不断优化其内部表示，朝向有语义意义的方向，同时提高图像质量。我们的VRPO驱动训练无缝集成到扩散Transformer中，引入可忽略的计算成本，并保持与SiT和DiT架构的完全兼容性。在ImageNet-256x256上的大量实验表明，我们的VRPO-Alignment显著提高了收敛速度和保真度，在相同计算预算下，与REPA相比，FID提升高达1.8，训练速度加快2.3倍。

英文摘要

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.

URL PDF HTML ☆

赞 0 踩 0

2606.00582 2026-06-02 cs.AI 版本更新

PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis

PropLLM：面向网络故障诊断的传播感知场景重建

Zongzong Wu, Ming Zhao, Fengxiao Tang, Nei Kato

发表机构 * National Natural Science Foundation of China（国家自然科学基金委员会）； High Performance Computing Center of Central South University（中南大学高性能计算中心）

AI总结提出PropLLM，首次将逐跳场景重建范式与LLM生成推理能力结合，通过双知识图谱和时序因果传播注意力机制，从端点告警回溯定位根因并确定故障类型，在真实Wi-Fi多模态故障数据集上诊断准确率提升3.9%，根因定位准确率提升4.7%，幻觉率降低50.8%。

详情

AI中文摘要

网络故障沿着拓扑和协议依赖关系逐层传播，然而运维系统通常只观察到传播链末端的症状告警，此时不同的根因故障可能产生高度相似的端点症状。现有方法（无论是基于规则、机器学习还是大语言模型）本质上都是将告警集一次性映射到诊断结果，在结构上无法解决这种端点歧义性。本文提出PropLLM，首次将逐跳场景重建范式与LLM的生成推理能力相结合。从端点告警出发，PropLLM沿着传播路径逐跳回溯，在每一跳从双层知识图谱中检索可验证的事实证据，同时提出的时序因果传播注意力机制将已知的拓扑因果先验直接编码到注意力计算中，引导模型沿着正确的因果方向前进，最终通过完全基于证据的因果链定位根因并确定故障类型。在真实Wi-Fi多模态故障数据集上，PropLLM的故障类型诊断准确率比最强基线提升3.9%，根因定位准确率提升4.7%，幻觉率降低50.8%。在TeleLogs 5G数据集上的补充实验进一步证明了所提方法在不同网络场景下的有效性。

英文摘要

Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9\% and root cause localization accuracy by 4.7\% over the strongest baseline, while reducing the hallucination rate by 50.8\%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.

URL PDF HTML ☆

赞 0 踩 0

2606.00571 2026-06-02 cs.LG cs.AI cs.CV 版本更新

On the Difficulty of Learning a Meta-network for Training Data Selection

学习用于训练数据选择的元网络的困难性

Zilin Du, Junqi Zhao, Boyang Albert Li

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结针对元学习训练数据选择（MTS）在实践中表现不佳的问题，本文通过数学分析揭示了梯度信噪比低和缺乏信息特征两大障碍，并提出增大批大小和利用信息特征作为解决方案。

详情

AI中文摘要

合成数据越来越多地被用于训练神经网络，但若不加区分地使用，其与真实数据的分布不匹配会限制其有效性。一种常见策略是通过双层优化学习数据权重，我们称之为元学习训练数据选择（MTS）。有趣的是，在实践中，MTS 往往低于预期。我们识别了正确训练 MTS 的两个障碍：梯度信噪比（GSNR）低导致优化困难，以及缺乏与数据质量相关的信息特征。我们对 MTS 进行了数学分析，揭示了归一化数据权重的动态以及不同数据质量与低 GSNR 之间的关系。分析表明，一个简单而有效的解决方案是增大批大小。此外，我们提出了一组信息特征，用于捕捉训练数据在其分布中的位置和训练动态。在四个基准上的实验显示了一致的改进，与无选择的训练相比平均提升 5.49%，与最强基线相比平均提升 2.89%。

英文摘要

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and lack of informative features that correlates with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.

URL PDF HTML ☆

赞 0 踩 0

2606.00570 2026-06-02 cs.CL cs.AI 版本更新

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

重新审视大型语言模型中基于参数的知识编辑：理论极限与实证证据

Wanying Ren, Xin Song, Futing Wang, Guoxiu He, Aixin Sun

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文通过理论分析和实证评估，揭示了基于参数的知识编辑方法会因维度坍缩假设导致全局干扰和推理崩溃，而简单的检索基线方法在所有条件下均表现更优。

Comments Accepted to ICML 2026. Equal contribution by the first two authors. 9 pages main paper, 10 figures, with appendix

详情

AI中文摘要

基于参数的知识编辑通过局部权重修改更新大型语言模型（LLMs）的内部知识，并引起了广泛关注。然而，大多数现有方法忽略了基本的理论限制，并且很少在现实的、面向实践的设置下进行评估。在本文中，我们首先基于维度坍缩假设提出理论分析，解释局部参数编辑如何沿着表示空间中的脆弱方向传播，引发全局干扰并最终导致推理崩溃。基于这一见解，我们通过系统变化知识复杂度、编辑次数、评估维度和基线方法进行了全面的实证评估。我们的结果表明，基于参数的编辑方法持续损害LLM的核心能力。相比之下，一个简单的基于检索的基线在所有评估条件下始终比所有参数编辑方法表现更强。这些发现强调，在知识编辑后保持LLM的基本能力应成为未来研究的核心关注点。

英文摘要

Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.

URL PDF HTML ☆

赞 0 踩 0

2606.00563 2026-06-02 cs.LG cs.AI stat.ML 版本更新

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models

医学预测模型中选择偏差影响的一个实用上界

Kara Liu, Maggie Wang, Russ B. Altman

发表机构 * Stanford University（斯坦福大学）

AI总结针对选择偏差导致模型泛化性差的问题，提出在仅部分观测选择机制和目标分布的现实条件下，对目标群体最差模型性能的一个新上界，并通过合成数据和真实数据验证其有效性和实用性。

Comments 32 pages, 27 figures, will be published at ACM SIGKDD '26

详情

DOI: 10.1145/3770855.3818112

AI中文摘要

选择偏差是真实世界数据中常见且往往不可避免的一个方面，它挑战了机器学习模型的泛化性。当在偏倚数据上训练的模型被部署到更广泛的目标群体时，模型泛化能力差可能导致实际危害，尤其是在医疗保健等高危环境中。这种风险凸显了从业者在部署前可靠评估模型泛化性的需求。然而，现有的预测模型性能的方法依赖于不切实际地访问目标分布或了解导致偏差的选择机制。为了解决这些局限性，我们提出了一个新颖的上界，用于在现实设置下目标群体上的最差模型性能，其中选择机制和目标群体数据仅被部分观测。我们通过在完全合成数据、源自All of Us研究计划的半合成数据以及MIMIC-IV中的真实世界选择偏差上的实验，证明了我们方法的有效性和实际效用。我们的工作提供了一个原则性和实用性的工具，用于估计在原本难以处理的情况下选择偏差的影响，从而使从业者能够在医疗保健及其他领域构建更安全、更具泛化性的模型。

英文摘要

Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.

URL PDF HTML ☆

赞 0 踩 0

2606.00561 2026-06-02 cs.LG cs.AI 版本更新

与AI行动：基于交互的代理侵权责任框架

Yiheng Yao

发表机构 * Yiheng Yao（姚艺恒）

AI总结本文基于Bratman规划理论和普通法人类协同行动原则，提出一种交互分类框架（自主漂移、纯工具使用、协作规划）来分配AI代理系统的侵权责任，并引入“合理代理”标准。

详情

AI中文摘要

代理AI系统能够多步规划、使用工具并随时间执行任务。当此类系统造成损害时，侵权法难以分配责任，因为有害路径可能既非用户完全选择，也非开发者特别预见。本文借鉴Michael Bratman的规划理论和普通法对人类协同行动的处理，提出一个基于交互的代理侵权框架。我们区分三种交互类型：自主漂移、纯工具使用和协作规划。纯工具案例仍受普通产品缺陷和警告原则管辖；协作规划案例映射到独立承包商控制测试、专业过失和过失性虚假陈述；自主漂移则映射到雇主责任下的“擅自行动”和严格产品责任。该框架将有状态交互日志作为主要证据线索，使法院能够推断人-AI轨迹何时偏离授权行为以及责任应归于何处。我们解决了四个事件锚定案例，将该观点与严格责任和基于保险的提案并列，指出其与监管监督的关系，并提出了一个围绕约束验证、认知透明度、运行时基础和取证日志构建的“合理代理”标准。

英文摘要

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a ``Reasonable Agent'' standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

URL PDF HTML ☆

赞 0 踩 0

2606.00516 2026-06-02 cs.AI 版本更新

Threshold-Based Exclusive Batching for LLM Inference

基于阈值的独占批处理用于LLM推理

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma, Shining Wu

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Engineering Center for Space Utilization, Chinese Academy of Sciences（中国科学院空间利用工程技术中心）

AI总结针对混合批处理中预填充与解码干扰导致边际成本上升的问题，提出基于GPU内存带宽、模型大小和工作负载的EB-MB性能交叉条件及最优切换阈值，优化后的独占批处理在带宽受限GPU上吞吐量提升高达41.9%。

Comments 37 pages, 12 figures. Accepted at ICML 2026

详情

AI中文摘要

混合批处理（MB）——将预填充和解码交错在单个批次中——已成为大型语言模型（LLM）推理的标准调度策略，因其在最大化计算和内存利用率方面的效率。然而，通过受控实验，我们发现预填充-解码干扰使MB的每步边际成本高于纯解码。在高带宽H200（4.8 TB/s）上，这仅在解码token超过批次的80%时发生；然而，在带宽受限的RTX PRO 6000（1.792 TB/s）上，该阈值骤降至仅20%。因此，MB与独占批处理（EB）之间的最优选择根本上取决于GPU内存带宽、模型大小和工作负载组成。我们推导了该EB-MB性能交叉的闭式条件，以及渐近最优的相位切换阈值和EB的内存安全批次大小。优化的EB在带宽受限GPU上吞吐量提升高达41.9%，而MB在具有更大模型的高带宽硬件上保持优势。我们的混合调度器EB+在线应用该条件，在无需人工干预的情况下动态切换EB和MB。在分布或并发度变化的非平稳流量下，EB+在每个设置中达到最高或接近最高的吞吐量，比MB高出高达36.4%。

英文摘要

Mixed batching (MB)--interleaving prefill and decode in a single batch--has become the standard scheduling strategy for large language model (LLM) inference due to its efficiency in maximizing compute and memory utilization. However, through controlled experiments, we find that prefill-decode interference inflates MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; however, on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), this threshold plummets to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) fundamentally depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to dynamically switch between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.

URL PDF HTML ☆

赞 0 踩 0

2606.00515 2026-06-02 cs.RO cs.AI cs.SY eess.SY 版本更新

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

PaCo-VLA: 用于富接触视觉-语言-动作操控的被动屏蔽柔顺先验

Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li

发表机构 * Southwest Jiaotong University（西南交通大学）； University of Leeds（莱斯特大学）

AI总结提出PaCo-VLA框架，通过被动屏蔽将VLA模型输出转化为任务级柔顺建议，并利用能量罐和边界检查防止无效预测绕过底层接触物理，实现安全精确的富接触操控。

Comments Under review, code will be available soon

详情

AI中文摘要

富接触操控既需要高层语义推理，也需要对高频接触动态的安全调节。虽然视觉-语言-动作（VLA）模型提供了前所未有的语义泛化能力，但其低速率输出缺乏在力敏感任务中直接控制执行器所需的可靠性。为弥合这一语义到控制的鸿沟，我们引入PaCo-VLA，一种被动屏蔽的柔顺先验，重新定义了VLA接口。PaCo-VLA不将直接电机指令托付给VLA，而是将网络输出视为任务级柔顺建议：语义绑定、任务阶段和导纳调度。一个高频、建议无关的被动屏蔽通过能量罐核算和边界检查来管理这些建议，防止无效、过时或未经验证的模型预测绕过底层接触物理。这种解耦架构还支持因果评估，将语义贡献与几何捷径分离。大量仿真和真实世界的连接器插入实验表明，PaCo-VLA在无屏蔽VLA基线上实现了卓越的精度，即使在对抗性柔顺偏移下也能保持零被动违规。该框架在导纳端口建立了一个可证明的采样被动运行时契约，并为在富接触领域部署基础模型提供了运行时接口。

英文摘要

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.

URL PDF HTML ☆

赞 0 踩 0

2606.00510 2026-06-02 cs.CL cs.AI 版本更新

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

技能还是跳过？通过双粒度偏好学习在智能体任务中学习选择性技能调用

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang, Yi Yang, Cong Qin, Yangen Hu, Lu Pan, Ke Zeng

发表机构 * Meituan（美团）； Fudan University（复旦大学）； Shanghai Jiao Tong University（上海交通大学）； Nanjing University（南京大学）； Peking University（北京大学）

AI总结提出SelSkill框架，通过双粒度偏好学习实现选择性技能调用，在ALFWorld和BFCL上显著提升任务成功率和执行精度。

Comments 18 pages, 4 figures, 10 tables

详情

AI中文摘要

智能体技能是可调用的程序化模块，为复杂智能体任务提供可重用知识和执行策略。然而，现有方法主要关注选择相关技能或改进技能本身，而忽略了在当前决策点是否应该实际调用相关技能。无帮助的调用可能引入无关上下文并破坏原本正确的执行过程。为解决此问题，我们提出SelSkill，一个用于选择性技能调用的双粒度偏好学习框架。SelSkill将技能使用表述为技能或跳过决策，利用预测不确定性优先考虑候选决策点，并从共享轨迹前缀构建受控的调用-跳过偏好对。它进一步结合了回合级结果偏好与步骤级调用偏好，以捕捉整体轨迹质量和技能调用的局部有效性。在ALFWorld上使用Qwen3-8B，SelSkill将任务成功率提高了10.9个百分点，执行精度提高了29.1个百分点。在BFCL上，它将任务成功率提高了5.7个百分点，执行精度提高了29.5个百分点。在Tau-bench和PopQA上的零样本结果进一步表明，学习到的调用策略可迁移到具有未见技能的新领域。

英文摘要

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.

URL PDF HTML ☆

赞 0 踩 0

2606.00508 2026-06-02 cs.CV cs.AI 版本更新

V-LynX: Token Interface Alignment for Video+X LLMs

V-LynX: 视频+X 大语言模型的令牌接口对齐

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea（延世大学，首尔，韩国）； Ewha Womans University, Seoul, South Korea（成均馆大学，首尔，韩国）

AI总结本文发现视频大语言模型中存在令牌接口连续流形，并提出V-LynX框架，通过轻量辅助路径对齐注意力响应和统计分布，无需配对监督即可集成新模态，在音视频问答、3D推理等任务上达到最优效率。

Comments ICML 2026 Camera-ready

详情

AI中文摘要

本研究揭示了视频大语言模型中的一个有趣现象：视频大语言模型不仅仅是简单地将帧转换为文本嵌入，而是建立了一个连续流形——令牌接口，使得视觉令牌能够在架构内作为独立实体运行。利用这一发现，我们提出了V-LynX，这是一个可扩展的框架，通过重新利用内部化接口，将新模态集成到视频大语言模型中。与需要大量模态特定编码器或配对监督的传统范式不同，V-LynX采用轻量辅助路径与冻结的视觉编码器并行运行。我们的方法通过使用非配对单模态数据集对齐注意力响应和统计分布，将新的感官输入与内在视频先验相结合。这确保了流形兼容性，同时保持了视频大语言模型的完整性。大量基准测试表明，V-LynX在音视频问答、3D推理、高帧率和多视角视频理解方面达到了最先进水平和高效性。代码可在https://github.com/park-jungin/lynx获取。

英文摘要

This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal data sets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves SOTA and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.

URL PDF HTML ☆

赞 0 踩 0

2606.00506 2026-06-02 cs.AI cs.LG 版本更新

EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

EnergyMamba：一种用于能耗预测的具有不确定性感知的图增强选择性状态空间模型

Dahai Yu, Rongchao Xu, Lin Jiang, Guang Wang

发表机构 * Florida State University（佛罗里达州立大学）

AI总结提出EnergyMamba框架，通过图增强选择性状态空间模型（GE-Mamba）和自适应序列分位数回归（AS-CQR）模块，实现时空联合建模与不确定性量化，在能耗预测中提升准确率约5%、不确定性量化约6%。

Comments Accepted by KDD 2026 AI4S

详情

DOI: 10.1145/3770855.3818841

AI中文摘要

能耗预测对于高效的电网管理、需求侧优化和可持续能源规划至关重要。尽管先进的机器学习方法已被用于提高预测性能，但现有工作存在两个关键局限：（1）通常将任务视为纯时间序列预测问题，未显式建模不同区域间的空间依赖关系；（2）在极端天气等异常情况下无法提供带有不确定性估计的可靠预测。为推进现有研究，我们提出EnergyMamba，一种具有不确定性感知的时空学习框架，用于准确可靠的能耗预测，包含两个关键组件：（i）一种新颖的图增强选择性状态空间模型（GE-Mamba），将从电网拓扑中学到的空间上下文注入时间动态，实现耦合的时空建模；（ii）自适应序列分位数回归（AS-CQR）模块，包括局部自适应归一化和在线反馈机制，以在潜在分布偏移下动态校准预测区间。我们在来自佛罗里达、纽约和加利福尼亚的四个大规模真实数据集上评估EnergyMamba。结果表明，与15个最先进的基线相比，EnergyMamba在预测准确率上提升约5%，在不确定性量化上提升约6%。

英文摘要

Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.00503 2026-06-02 cs.LG cs.AI 版本更新

TabChange: Precise Attribute Changes in Tabular Data

TabChange: 表格数据中的精确属性变化

Arjun Dahal, Yu Lei, Raghu N. Kacker, Richard Kuhn

发表机构 * The University of Texas at Arlington（德克萨斯大学阿灵顿分校）； National Institute of Standards and Technology（美国国家标准与技术研究院）； Information Technology Laboratory（信息技术实验室）

AI总结针对表格数据中修改属性时破坏自然性的问题，提出TabChange方法，通过分析属性间关系并利用对抗框架去除潜在空间中的属性信息，实现精确且自然的属性修改。

详情

AI中文摘要

修改表格数据中的属性通常会破坏其与其他属性的关系，从而产生不自然的实例。修改后的实例必须既自然又与原始实例变化最小。本文解决了生成这种修改实例的挑战。我们识别了现有方法的关键局限性：生成模型要么不支持实例级属性编辑，要么像CVAE这样的方法在潜在空间中保留属性信息，导致不必要的修改。为了解决这个问题，我们提出了TabChange，一种分析数据集中目标属性与其他属性关系的方法。如果关系较弱，它直接翻转属性；如果关系较强，它使用对抗框架去除潜在空间表示中的属性信息。这种去除使得能够进行精确修改，只进行必要的调整以保持自然性。我们在七个数据集上的实验表明，TabChange生成的属性反事实在自然性方面与基线相当，并且更接近原始实例。与基线相比，这导致了更多有效的反事实和更少的无效反事实。

英文摘要

Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modified instance must be both natural and minimally changed from the original instance. This paper addresses the challenge of generating such a modified instance. We identify key limitations in existing approaches: generative models either don't support instance-level attribute editing or, in the case of methods like CVAE, retain attribute information in the latent space, leading to unnecessary modifications. To solve this, we propose TabChange, an approach that analyzes the relationship between the attribute of interest and other attributes in the dataset. If the relationship is weak, it simply flips the attribute; if it is strong, it uses an adversarial framework that removes information about the attribute in the latent space representation. This removal enables precise modifications, making only the necessary adjustments to maintain naturalness. Our experiments across seven datasets show that TabChange generates counterfactuals in attributes that are comparable in naturalness and are more proximal to their original instances. This leads to a higher number of valid counterfactuals and a lower number of invalid counterfactuals compared to the baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.00487 2026-06-02 cs.AI 版本更新

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

TAPS: 面向扩散草稿推测解码的目标感知前缀树选择

Zhuoyu Wang, Junnan Huang, Xinyu Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））

AI总结提出TAPS方法，通过目标感知的前缀树选择优化扩散模型草稿的验证效率，实现最高7.9倍无损加速。

详情

AI中文摘要

使用扩散模型进行并行草稿是推测解码的一种有前景的方法。通过在单次前向传播中预测多个未来位置的token，扩散草稿器显著降低了草稿延迟。然而，这会将瓶颈转移到验证上：验证单个序列限制了接受长度，而验证大型草稿树会导致过度的目标模型延迟。我们发现了现有草稿树方法中的一个关键不匹配：现有的扩散树方法按边际概率对节点排序，忽略了验证是前缀条件化的。因此，它们可能会验证被拒绝前缀的不可达后代，从而增加延迟而接受增益有限。为了解决这个问题，我们提出了TAPS，一种目标感知的前缀选择方法，将扩散边际转化为路径条件化的接受估计。然后，TAPS在固定的验证预算下选择一个紧凑的前缀封闭子树，改善接受-成本权衡，而不是简单地扩展草稿树。跨不同数据集和模型族的实验表明，TAPS在无损端到端速度上比普通自回归解码最高提升7.9倍，分别比最先进的DFlash和DDTree提升1.36倍和1.74倍。我们的工作可在https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD获取。

英文摘要

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD

URL PDF HTML ☆

赞 0 踩 0

2606.00476 2026-06-02 cs.AI 版本更新

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

做他们所说的，而不是他们所推理的：定位LLM智能体中的忠实性差距

Yufeng Wang

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结通过将忠实性差距分解为推理-结论和结论-行动两个步骤，在可控的德克萨斯扑克模拟器中研究LLM智能体是否按照其陈述的推理行动。

Comments submitted to COLM social simulation with LLM workshop

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG 版本更新

当安全技能碰撞：衡量智能体技能生态系统中的组合风险

Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, Jingzhou Xu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Georgia Institute of Technology（佐治亚理工学院）； University of Glasgow（格拉斯哥大学）； Independent Researcher（独立研究员）； Corespeed Inc.（Corespeed公司）

AI总结本文提出SkillReact框架，通过静态组合基准、双评分者LLM辅助人工审核和基于动作的可利用性测试，研究LLM智能体中个体安全的技能组合后可能产生的不安全行为，发现约18.2%的候选组合存在真实风险，且主机模型决定是否利用这些组合能力。

详情

AI中文摘要

LLM智能体越来越依赖社区贡献的技能，这些技能扩展了智能体的操作能力集。我们研究了智能体AI系统中的一个核心安全问题：个体安全的技能是否可能组合成不安全的已安装技能集。我们提出了SkillReact，一个组合安全测量框架，包含三个组件：一个确定性静态组合基准、一个双评分者LLM辅助人工审核流程，以及一个基于动作的可利用性测试工具。在1,520个ClawHub技能中，651个通过个体检查并形成211,575对；基准标记其中22.25%为结构候选。我们将这个原始比率视为面向召回率的扫描上限，并根据人类判断进行校准：在按模式分层的审计中，大约五分之一的标记对模式命中被确认为真实的组合风险（人口加权有效性18.2%，我们的主要结果），这意味着在单个注册表中约有14K个真实风险成员，而按技能扫描由于构造原因会遗漏这些风险，因为每一对个体都是安全的。然后，基于动作的测试工具探测这些候选何时成为模型发出的工具调用，并发现实现受主机模型倾向的门控：在一个锚定条件的dropper子集上，Haiku-4-5在所有39次直接提示试验中发出了dropper阶段的工具调用（其中36次是完整的下载然后执行链，3次仅下载），Opus-4-7在下载阶段停止，而Sonnet-4-6直接拒绝。一个保持请求固定且仅改变已安装技能的对照实验发现，未安装任何技能时合规性最高：组合决定了哪些能力可达，而主机模型决定是否使用它们。这些结果共同表明，安装时组合检查和能力隔离是对按技能扫描的补充。

英文摘要

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.

URL PDF HTML ☆

赞 0 踩 0

2606.00447 2026-06-02 cs.CV cs.AI 版本更新

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

GeoSAM-3D: 用于从单目视频进行开放词汇3D场景分割的测地线提示传播

Arun Sharma

发表机构 * University of Minnesota, Twin Cities（明尼苏达大学，双城分校）

AI总结提出GeoSAM-3D方法，利用冻结的视觉基础模型和单目3D高斯泼溅重建，通过可微分的图-测地线传播核在场景图上传播用户提示，实现从单目视频的开放词汇3D场景分割。

详情

AI中文摘要

开放词汇的3D场景分割通常假设有RGB-D视频、校准的多视角图像或重建的网格。GeoSAM-3D研究了一种更轻的设置：用户上传一段短的单目视频，在一帧中点击或命名一个物体，并在高斯场景上接收传播的3D掩码。该实现结合了冻结的图像和视频基础模型、单目3D高斯泼溅重建以及在高斯质心上可微分的图-测地线传播核。核心设计选择是通过重建场景图上的热核距离传播提示，而不是通过3D中的欧几里得最近邻。这保持了曲面周围的连续性，并减少了附近但不相连物体之间的泄漏。本文描述了仓库状态、在geosam3d.propagate中实现的数学核、从Segment Anything掩码训练的特征头以及代码库中已有的验证。评估协议将实现验证、图传播质量、泄漏控制和交互延迟分开。

英文摘要

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.

URL PDF HTML ☆

赞 0 踩 0

2606.00445 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

发表机构 * University of Minnesota, Twin Cities（明尼苏达大学，双城分校）

AI总结提出DarkVesselNet，融合Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型、AIS轨迹推理、TGARD间隙检测和Pi-DPM异常头，实现多模态遥感暗船检测。

2606.00440 2026-06-02 cs.AI 版本更新

SDR: Set-Distance Rewards for Radiology Report Generation

SDR：用于放射学报告生成的集合距离奖励

Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

发表机构 * Stanford University（斯坦福大学）； Stanford University School of Medicine（斯坦福大学医学院）； Ghent University（根特大学）

AI总结针对胸部X光报告生成中标准奖励不兼容的问题，提出基于集合距离的连续置换不变奖励，通过GRPO后训练和测试时缩放显著提升性能。

详情

AI中文摘要

具有可验证奖励的强化学习已迅速推进了视觉-语言模型中的推理能力。然而，对于胸部X光报告生成，标准奖励（即精确匹配准确率和逐步过程）并不兼容，因为报告由无序且正交的发现组成，而非因果推理链。我们通过基于集合的视角来解决这一差距：每个报告被分割成句子，并由冻结的句子变换器嵌入，生成无序的嵌入集合。我们提出使用生成嵌入与参考嵌入之间的集合到集合距离作为连续的、置换不变的奖励。在两个数据集和三个视觉-语言模型（Qwen3-VL-2B/4B, Gemma3-4B）上，通过GRPO使用基于集合到集合距离的奖励进行后训练，在所有主要指标（BERTScore、RadGraph F1和CheXbert F1，分别相对提升平均6.80%、7.82%和4.45%）上一致优于监督微调和精确匹配GRPO。相同的集合距离还实现了测试时的最佳N选：通过候选与训练报告嵌入的距离进行评分，在我们训练的模型以及三个闭源LLM（Mistral-Small、Gemini-2.5 Flash-Lite、GPT-4o-mini）上，平均相对提升BERTScore 16.4%，优于随机选择。作为流式信号使用时，它们支持更高效的测试时缩放形式：在生成过程中修剪低分候选，可将生成的令牌减少50%以上，同时保持完整最佳N选的结果质量。这些结果共同确立了集合距离奖励作为胸部X光报告生成中后训练和测试时缩放的统一信号。我们的代码已公开。

英文摘要

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average \%6.80, \%7.82 and \%4.45 relative improvements respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average \%16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-$N$ selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \href{https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA}{available}.

URL PDF HTML ☆

赞 0 踩 0

2606.00428 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters

低秩PEFT的更细参数步长：基于CP张量适配器的控制研究

Xinjue Wang, Xiuheng Wang, Yejun Zhang, Sergiy A. Vorobyov, Esa Ollila, Zhi-Yong Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文通过固定组件的规范多路分解（CP）张量适配器实现更细的参数步长，研究其对低秩适配器精度-预算权衡的影响，发现CP适配器能填补LoRA秩之间的空白，但效果依赖于任务。

Comments Accepted at the ICML 2026 Workshop on CoLoRAI

详情

AI中文摘要

低秩适配器通常通过扫描少量秩进行比较，但秩也固定了参数预算的分辨率。对于一个$2048{\times}2048$的OPT注意力投影，增加LoRA的一个秩会存储$4096$个可训练标量，导致可行的低预算适配器大小之间存在较大间隙。本文探讨具有更细容量增量的张量化适配器是否会改变观察到的精度-预算权衡。我们通过固定组件的规范多路分解（CP）张量适配器来实例化这个问题。在$32{\times}64{\times}32{\times}64$的张量化下，一个归一化的CP组件每个投影存储$193$个可训练标量，比LoRA的一个秩步长小约21倍。我们在OPT-1.3B上，在匹配的目标模块、训练协议、数据上限和种子调度下，比较了CP适配器和LoRA在SST-2、RTE和BoolQ上的表现。CP训练稳定，并填补了LoRA秩之间的空白，但效果依赖于任务：SST-2早期达到低预算平台，BoolQ在略低于LoRA饱和之前受益于额外的CP组件，而RTE仍然偏好LoRA。因此，更细的参数步长有助于诊断PEFT预算敏感性，但它们本身并不能保证更好的精度-预算曲线。

英文摘要

Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. For a $2048{\times}2048$ OPT attention projection, increasing LoRA by one rank stores $4096$ trainable scalars, leaving large gaps between feasible low-budget adapter sizes. This paper asks whether a tensorized adapter with finer capacity increments changes the observed accuracy--budget trade-off. We instantiate this question with fixed-component canonical polyadic (CP) tensor adapters. Under a $32{\times}64{\times}32{\times}64$ tensorization, one normalized CP component stores $193$ trainable scalars per projection, about $21$ times smaller than one LoRA rank step. We compare CP adapters and LoRA on OPT-1.3B across SST-2, RTE, and BoolQ under matched target modules, training protocol, data caps, and seed schedules. CP trains stably and fills the gaps between LoRA ranks, but the effect is task-dependent: SST-2 reaches an early low-budget plateau, BoolQ benefits from additional CP components before saturating slightly below LoRA, and RTE remains LoRA-favored. Finer parameter steps are therefore useful for diagnosing PEFT budget sensitivity, but they do not by themselves guarantee a better accuracy--budget curve.

URL PDF HTML ☆

赞 0 踩 0

2606.00424 2026-06-02 cs.AI 版本更新

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

弱批评者造就强学习者：用于可扩展监督的在策略批评蒸馏

Can Jin, Jiakang Li, Rui Wu, Eddy Zhang, Dimitris N. Metaxas

发表机构 * University of Cambridge（剑桥大学）； University of California, Berkeley（加州大学伯克利分校）； UC Berkeley AI Lab（加州大学伯克利分校人工智能实验室）

AI总结提出在策略批评蒸馏（OPCD）方法，利用弱模型作为批评者提供修订方向，通过自适应自教师信号蒸馏批评引导的行为，提升强模型在推理和对齐基准上的表现。

详情

AI中文摘要

随着大型语言模型变得更强，弱监督者可能无法为复杂输出提供可靠的标签、偏好或最终判断，限制了弱到强的泛化和可扩展监督。我们研究了一种更易处理的弱监督形式：使用弱模型作为批评者，而不是作为标注者或评判者。弱批评者不需要解决任务或选择正确答案，只需提供非误导性的修订方向，帮助强模型更好地利用自身知识。我们将这种设置称为*弱批评者强监督*。我们首先表明，弱批评可以在推理时改进冻结的强模型，并且批评质量是这种改进的关键。然后，我们提出渐进式在策略批评蒸馏（**OPCD**），它过滤高质量的批评，并通过自适应自教师信号将批评引导的行为蒸馏到强模型中。在推理和对齐基准上的实验表明，我们的方法在训练轮次中改进了强模型，为使用弱监督的可扩展监督提供了一条有效路径。

英文摘要

As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (**OPCD**), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.

URL PDF HTML ☆

赞 0 踩 0

2606.00417 2026-06-02 cs.NI cs.AI 版本更新

AgentxGCore: Agentic AI for Next-Generation Mobile Core Network

AgentxGCore：面向下一代移动核心网络的智能体AI

Maria Katarine Santana Barbosa, Kelvin L. Dias

发表机构 * Centro de Informática - Universidade Federal de Pernambuco（计算机中心 - 佩鲁巴科联邦大学）

AI总结本文提出AgentxGCore，通过智能体AI原生层扩展3GPP架构，利用多智能体系统实现基于实时信息的闭环优化，支持自组织和自适应。

Comments This paper has been accepted for publication in IEEE Network

详情

DOI: 10.1109/MNET.2026.3699249

AI中文摘要

为满足新兴应用的严格要求以及日益复杂的网络管理和操作，下一代移动网络（NextG）或6G将在核心网（CN）上采用AI原生架构。在此进程中，第三代合作伙伴计划（3GPP）已通过新功能扩展蜂窝CN，作为集成分析、人工智能（AI）和机器学习的第一步。然而，这些新功能受限于集中式方法和管理复杂性。此外，随着大型语言模型（LLM）的兴起，网络编排和管理进入新时代，利用并赋能基于意图的网络（IBN）范式。同时，AI智能体和智能体AI集成了推理与行动（ReAct），使得能够利用此类意图持续与网络交互。与主要采用智能体AI来缓解CN中部署和配置复杂性的现有方法不同，本文介绍了AgentxGCore，它利用智能体AI原生层扩展3GPP架构，并基于超越下一代核心网（xGC）域中的现有API构建系统。该提案建立了基于实时信息的AI驱动闭环，用于持续优化，实现自组织和自适应。我们的方法涉及一个多智能体专用系统，分为网络规划智能体（能够可视化网络状态并制定满足意图的计划）和网络执行器（负责批评并执行计划）。为验证所提方案，使用开源CN、异构数据集构建了环境，并采用不同的LLM来证明其有效性。

英文摘要

To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generation Mobile Networks (NextG), or 6G, will adopt an AI-native architecture on the Core Network (CN). In this movement, the Third Generation Partnership Project (3GPP) has extended the cellular CN with new function as a first step toward integrating analytics, Artificial Intelligence (AI), and machine learning. However, those new functionalities are constrained by a centralized approach and managerial complexity. Furthermore, with the rise of Large Language Models (LLMs), a new era in network orchestration and management begins, leveraging and empowering the Intent-based Networking (IBN) paradigm. In addition, AI agents and Agentic AI integrate Reasoning and Acting (ReAct), enabling the usage of such intents to continuously interact with the network. Unlike state-of-the-art approaches that primarily employ Agentic AI to mitigate deployment and configuration complexity in the CN, this paper introduces AgentxGCore, which leverages an Agentic AI-Native layer to extend the 3GPP architecture and enable a system based on the existing APIs across the Beyond Next Generation Core (xGC) domain. This proposal establishes an AI-driven closed-loop for continuous optimization based on real-time information, enabling self-organization and self-adaptation. Our approach involves a multi-agent specialized system, divided into a network planner agent, capable of visualizing the network state and developing a plan to meet the intents, and a network executor, responsible for criticizing and executing the plan. To validate the proposed solution, an environment was built using an open-source CN, heterogeneous datasets, and different LLMs were employed to demonstrate its effectiveness.

URL PDF HTML ☆

赞 0 踩 0

2606.00408 2026-06-02 cs.CL cs.AI cs.IR 版本更新

Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

掩盖过时观察帮助搜索智能体——直到适得其反：一个机制图谱及其机制

Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley

发表机构 * UC San Diego（加州大学圣地亚哥分校）； UC Berkeley（加州大学伯克利分校）； Texas A&M University（德克萨斯大学）； UIUC（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文通过系统实验发现，在长时域搜索智能体中掩盖过时观察的效果呈现非对称倒U型曲线，取决于检索器召回率与模型隐式过滤能力的交互，并揭示了其背后的令牌-轮次权衡机制。

Comments 47 pages, 7 figures

详情

AI中文摘要

长时域搜索智能体在多次工具调用中积累大量检索内容，使得上下文预算效率日益重要。一种最小干预措施是在轨迹推进过程中掩盖过时观察，但尚不清楚这种上下文管理何时有帮助及其原因。我们通过系统扫描不同智能体骨干（4B到284B参数）和三个检索器，在离线和在线智能搜索基准上研究观察掩盖。我们发现，当以无上下文管理时的模型准确率为横轴时，掩盖带来的准确率提升呈非对称倒U形：在弱检索器下出现平台期，在强检索器与中等容量模型相遇时达到峰值，在模型饱和时急剧下降。这种模式反映了检索器召回率与模型隐式过滤能力之间的交互，而非单一因素。机制上，掩盖实现了令牌-轮次权衡：它移除了模型基本不再关注的观察以及智能体很少重新打开的页面。当增加的轮次将失败转化为成功时，它们有帮助；但当掩盖移除了模型本会使用的证据时，它们会失败。因此，我们将上下文管理重新定义为一种依赖机制的干预，并为分析智能深度搜索中的上下文使用提供了整体视角。我们在此发布我们的框架和轨迹（https://github.com/i-DeepSearch/observation-masking）以支持未来研究。

英文摘要

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.

URL PDF HTML ☆

赞 0 踩 0

2606.00402 2026-06-02 stat.ME cs.AI stat.AP 版本更新

A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering

基于重写的人类文本检测的无分布框架：通过Knockoff过滤

Yi Liu

发表机构 * Prorata.ai

AI总结提出一种无分布统计框架，将任意基于重写的检测器转化为具有有限样本FDR保证的检测器，无需重新训练，通过将重写检测视为具有knockoff结构的多重假设检验问题实现。

2606.00392 2026-06-02 cs.LG cs.AI 版本更新

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

通过约束策略优化实现检测器规避的LLM释义

Mingyi Wang, Zhuoer Shen, Yuheng Bu, Shaofeng Zou

发表机构 * School of ECEE, Arizona State University（亚利桑那州立大学电子工程与计算机科学学院）； Department of Computer Science, University of California, Santa Barbara（加州大学圣巴巴拉分校计算机科学系）

AI总结提出DEPO算法，将检测器规避的LLM释义建模为约束马尔可夫决策过程，通过拉格朗日对偶强化学习在保持语义的同时实现高效规避。

详情

AI中文摘要

AI文本检测器易受释义和检测器引导的释义攻击，但现有规避方法缺乏对语义保持的精确控制。特别是，直接优化检测器规避会降低细粒度语义，而标量化奖励设计仅提供间接、权重敏感的规避-语义权衡控制。我们通过将检测器规避的LLM释义建模为约束马尔可夫决策过程来解决这一限制，其中检测器规避是主要目标，语义保持作为显式约束强制执行。我们提出检测器规避策略优化（DEPO），一种拉格朗日原始-对偶强化学习算法，具有新颖的GRPO风格组基策略更新。DEPO在训练期间自适应平衡语义保持和检测器规避，使策略能够在规定的语义保持区域内提高攻击成功率。在MAGE、M4、RAID和同行评审数据集上的实验，针对MAGE、RoBERTa、RADAR、Binoculars和Fast-DetectGPT检测器进行评估，表明DEPO在精确满足语义保持约束的同时实现了强大的检测器规避。DEPO还表现出跨领域、跨检测器和提示级别的鲁棒性。

英文摘要

AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.

URL PDF HTML ☆

赞 0 踩 0

2606.00390 2026-06-02 cs.CV cs.AI 版本更新

Zamba2-VL Technical Report

Zamba2-VL 技术报告

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Cambridge（剑桥大学）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

AI总结提出基于混合架构Zamba2的视觉语言模型Zamba2-VL，在图像理解等基准上媲美Transformer模型，且首次令牌延迟降低约一个数量级。

Comments 16 pages, 2 figures

详情

AI中文摘要

长期决策问题中基于成对偏好的强化学习

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

发表机构 * School of Computer Science, McGill University, Montreal, Quebec, Canada（麦吉尔大学计算机科学学院）； Mila - Quebec AI Institute, Montreal, Quebec, Canada（魁北克人工智能研究所）； Department of Electrical Engineering, Stanford University, Stanford, California, USA（斯坦福大学电气工程系）

AI总结针对长期决策问题中基于成对偏好的强化学习效率低且缺乏马尔可夫策略最优性保证的问题，提出马尔可夫决策竞赛模型，证明平稳马尔可夫策略最优性、求解复杂度为P，并给出亚线性收敛算法，在高维长期问题中显著提升学习效率。

详情

AI中文摘要

强化学习问题通常将目标定义为最大化标量奖励函数的期望值。但是，成对偏好通常比标量奖励更容易指定，并且它们表达了标量奖励无法表达的某些目标。因此，基于成对偏好的强化学习方法受到了越来越多的关注。不幸的是，这些方法在长时间跨度的任务中效率低下，并且缺乏关于马尔可夫策略相对于历史依赖策略的性能保证，而这连接了强化学习的理论与实践。因此，我们提出了 extit{马尔可夫决策竞赛}作为基于成对偏好的强化学习的新问题模型。我们证明了平稳马尔可夫策略在所有历史依赖策略中是最优的，精确求解马尔可夫决策竞赛属于P类问题，并且一个简单的迭代算法以亚线性速率收敛到最优策略。最后，在一组具有长时间跨度的高维决策问题中，我们展示了我们的近似算法在学习效率上显著优于先前的工作。

英文摘要

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

URL PDF HTML ☆

赞 0 踩 0

从噪声到控制：参数化扩散策略

Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris, Yilun Du, Bruno Castro da Silva

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出参数化扩散策略（PDP），通过学习行为流形上的低维连续参数条件化扩散策略，将扩散从随机多样性机制转化为精确可优化的行为引导工具，实现策略间的平滑插值和新约束下的高效适应。

2606.00334 2026-06-02 cs.CL cs.AI 版本更新

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

隔离LLM词汇偏差：一种无人工标注的三角化偏好学习阶段度量

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

发表机构 * Florida State University（佛罗里达州立大学）

AI总结提出一种无需人工标注的三角化偏好偏移分数（Triangulated Preference Shift score），通过对比人类标准、基础模型和指令变体，量化偏好学习阶段引入的词汇偏差。

Comments 7 pages, 2 figures, 1 table

详情

DOI: 10.32473/flairs.39.1.141843
Journal ref: The International FLAIRS Conference Proceedings, 39(1) (2026)

AI中文摘要

近年来，各种语言领域发生了显著变化；这些变化很大程度上归因于大型语言模型的出现及其与自然语言使用的不对齐。这些不对齐部分源于偏好学习阶段，例如从人类反馈中强化学习，这通常使模型更有用，但同时也可能引入系统性词汇偏差。在词汇行为方面，这体现在模型对某些格式的偏好或过度使用某些词汇（如delve、furthermore），即使这些模式在基础模型输出中并不存在。关于偏好训练引起的词汇不对齐的研究受限于对人工标注的依赖。我们通过引入三角化偏好偏移分数来解决这一问题，该度量在人类黄金标准、基础模型和指令变体之间进行三角化，以隔离偏好学习引起的特定偏移，无需人工标注。我们提供了六个模型家族的数据，将结果锚定在文献中，并通过分析偏好学习是否将模型推向可解释为“威望语言”的方向，展示了该通用方法的实用性。该度量提供了一种初始的自动化方法来量化偏好调整引起的行为偏移，从而有助于指导模型对齐和可信AI的开发。

英文摘要

Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this, by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach's utility by analyzing whether preference learning shifts models toward what could be interpreted as a "language of prestige". The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning, and thus, may help inform model alignment and development of trustworthy AI.

URL PDF HTML ☆

赞 0 踩 0

2606.00324 2026-06-02 cs.IR cs.AI 版本更新

重新思考温度在大语言模型蒸馏中的作用

Hoang-Chau Luong, Lingwei Chen

发表机构 * Golisano College of Computing and Information Sciences（戈利萨诺计算与信息科学学院）； Rochester Institute of Technology（罗切斯特理工学院）

AI总结本文通过分析温度τ对前向KL散度和反向KL散度在LLM蒸馏中的不对称影响，发现高温下FKL优于RKL，并证明温度能提升多种蒸馏目标，使简单KL方法达到先进水平。

详情

AI中文摘要

反向KL散度在大语言模型蒸馏中比前向KL更受欢迎，但这种偏好主要基于忽略温度τ的比较，忽视了其在软化教师分布和改进知识转移中的核心作用。本文重新审视LLM蒸馏中的温度，发现它从根本上改变了FKL和RKL的比较。我们的分析揭示了一种不对称效应：温度显著丰富了FKL中的非主导令牌信号，而主要重新缩放RKL梯度，导致FKL从τ缩放中获益远多于RKL。这种不对称推翻了标准经验结论：尽管在τ=1时RKL优于FKL，但在指令遵循基准测试中，高温下FKL始终超过RKL。此外，温度的影响不仅限于FKL；它改进了更广泛的蒸馏目标，使简单的基于KL的方法能够与最近最先进的LLM蒸馏方法竞争。

英文摘要

Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $τ$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $τ$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $τ=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

URL PDF HTML ☆

赞 0 踩 0

2606.00305 2026-06-02 cs.CL cs.AI 版本更新

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

通过近未来指导桥接在线策略蒸馏中的推理轨迹

Yuxuan Jiang, Francis Ferraro

发表机构 * University of Maryland, Baltimore County（马里兰大学巴尔的摩分校）

AI总结针对在线策略蒸馏中令牌级学习信号无法有效桥接推理轨迹的问题，提出基于近未来轨迹信息的轨迹感知在线策略蒸馏方法，显著提升大语言模型推理性能。

详情

AI中文摘要

在线策略蒸馏通过让学生在教师监督下从其自身策略采样的轨迹上进行训练，改进了大语言模型的推理能力。尽管OPD基于轨迹操作，但其学习信号仍然是令牌级的：它通过高损失令牌识别偏差，并通过局部反向KL校正进行修复。我们表明，这种“轨迹采样但令牌学习”的机制无法可靠地将学生轨迹桥接至教师轨迹。约30%的高损失令牌落入低发散区域，表明许多是表面形式不匹配而非真正的推理分叉。此外，即使是真正发散的令牌也难以通过孤立的令牌级监督修复，因为推理失败通常表现为短视的分布漂移。我们提出轨迹感知OPD，它利用近未来轨迹信息识别真正的发散状态，并将指导分布到多个未来令牌上。实验表明，抑制非发散的高损失令牌将标准OPD的平均准确率从47.8%提升至48.2%，而TOPD进一步将性能提升至52.2%，在AIME24上从60.0%提升至63.3%，在AIME25上从46.7%提升至53.3%。

英文摘要

On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this "trajectory-sampled but token-learned" mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.

URL PDF HTML ☆

赞 0 踩 0

2606.00299 2026-06-02 cs.CV cs.AI 版本更新

超几何与证据优先专家用于大型视觉-语言模型

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

发表机构 * China University of Petroleum (Beijing)（中国石油大学（北京））； Hainan Institute of China University of Petroleum (Beijing)（中国石油大学（北京）海南学院）； South China Normal University（华南师范大学）

AI总结针对大型视觉-语言模型中视觉与语言模态的不对称性，提出AsyMoE架构，通过超几何跨模态专家和证据优先语言专家分别建模层级关系与保持上下文基础，在减少参数的同时提升性能。

详情

AI中文摘要

大型视觉-语言模型（LVLMs）通过扩展架构和大量训练在多模态任务上展现了令人印象深刻的性能。近期研究将混合专家（MoE）引入LVLMs以提高计算效率。然而，现有的MoE方法以对称架构处理视觉和语言模态，忽视了这两种模态处理中的固有不平衡性。这种不平衡性导致两个关键问题。首先，文本和视觉形成层级而非并行关系，因为文本查询通常描述完整视觉场景的部分方面。欧几里得专家空间难以编码这种包含结构。其次，深层语言专家逐渐从基于证据的处理转向参数记忆依赖，失去对提供的视觉和语言信息的立足点。为解决这些问题，我们提出AsyMoE，一种通过三个专门专家组显式建模这种不平衡性的新型架构。模态内专家处理模态特定处理。超几何跨模态专家通过负曲率几何捕获层级跨模态关系。证据优先语言专家抑制参数记忆激活并在整个网络深度中保持上下文基础。大量实验表明，AsyMoE相比基线方法取得一致改进，平均比MoE变体提升1.5%，在幻觉敏感任务上提升高达3.8%。与密集模型相比，AsyMoE激活参数减少25.45%。

英文摘要

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5\% over MoE variants and up to 3.8\% on hallucination-sensitive tasks. AsyMoE activates 25.45\% fewer parameters compared to dense models.

URL PDF HTML ☆

赞 0 踩 0

2606.00272 2026-06-02 cs.AI cs.CL cs.CY 版本更新

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

在周三，我们提问：优化自动化法律分诊与转介中的“主动倾听”

Quinten Steenhuis, Jacqueline Harvey

发表机构 * Suffolk University Law School（苏福克大学法学院）

AI总结本文通过专家律师和LLM评估FETCH分类器的追问问题方法，发现低成本LLM在分类任务中表现良好但生成高质量问题需高成本模型，并提出法律分诊问题评估标准。

Comments Working paper submitted as accepted to AIDA2J workshop at International Conference for AI and Law in Singapore, June 2026

详情

AI中文摘要

FETCH分类器生成追问问题，以帮助优化申请人法律问题的最佳匹配，使用低成本LLM集成。在本文中，我们描述了专家律师和LLM辅助评估FETCH中的追问问题方法，并表明虽然低成本LLM在分类任务中表现良好，但在这种情况下生成高质量的通俗语言问题似乎需要更复杂和更高成本的模型。通过与法律接待工作人员的讨论，我们提出了法律接待分类问题的评估标准，并发现仅靠提示工程不足以提高接待目的的问题质量。我们还发现LLM作为评判者与人类评分存在分歧。我们证明，通过添加单个高成本模型GPT-5，分类器可以从寻求法律帮助的申请人那里引出相关信息，并且这些问题导致分类任务更准确的性能。我们还发现不同类别（包括家庭暴力）的事实引出不均匀，与家庭法筛查规程不一致，这表明在某些法律领域纳入专门筛查小组的价值。

英文摘要

The FETCH classifier generates follow-up questions to help refine the best match for the applicant's legal problem, using a low-cost ensemble of LLMs. In this paper, we describe an expert attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for the evaluation of legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the questions lead to more accurate performance at classification tasks. We also find uneven fact elicitation across different categories, including domestic violence, at odds with family law screening protocols, suggesting the value of including dedicated screening panels for certain areas of law.

URL PDF HTML ☆

赞 0 踩 0

2606.00270 2026-06-02 cs.AI cs.LG cs.LO 版本更新

当 Softmax 在顶部失效：InfoNCE 的极值修正

Melihcan Erol, Suat Evren, Oktay Ozel, Alexander Morgan, Jongha Jon Ryu, Lizhong Zheng

发表机构 * University of Waterloo（滑铁卢大学）

AI总结针对 InfoNCE 中 softmax 假设与对比学习嵌入设置不匹配的问题，提出基于极值理论的 WEINCE 修正方法，在五个视觉基准上提升冻结特征评估性能。

Comments Presented in ICML 2026

2606.00257 2026-06-02 cs.LG cs.AI 版本更新

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

ARCA: 当令牌信号退化时的适配器-残差信用分配

Rodney Lafuente-Mercado

发表机构 * Rodney Lafuente-Mercado（罗伊德·拉福恩特-默茨）

AI总结针对LoRA微调下令牌级信用分配信号退化的问题，提出ARCA方法，利用适配器隐藏状态残差作为令牌显著性度量，无需学习奖励模型或价值头。

Comments Accepted to DEMO 2026: ICML Workshop on Decision-Making from Offline Datasets to Online Adaptation. Non-archival report

详情

AI中文摘要

语言模型强化学习的令牌级信用分配通常被表述为策略完全可训练，而实际的LLM-RL流程往往依赖于参数高效微调，尤其是LoRA。我们认为这种分离隐藏了一种结构性失效模式。在LoRA下，策略被限制在参考模型的低秩邻域内，因此常用内在信用信号（如惊奇度、熵减和策略散度）所依赖的每令牌输出分布差异，在轨迹内归一化后可能变得退化，要么接近均匀权重，要么集中在少量与任务无关的位置上。我们形式化了这种行为，并提出直接用浓度诊断指标（如权重基尼系数和有效令牌比率）进行测量。然后，我们引入了适配器-残差信用分配（ARCA），一种轻量级替代方案，它从适配器自身的隐藏状态残差 $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$ 中推导令牌显著性。ARCA关注适配器实际改变模型的位置，而不是输出分布显得不确定或偏移的位置，并且不需要学习奖励模型、价值头或树结构。在紧凑的MATH/Qwen3-1.7B GRPO扫描中，ARCA在匹配的轨迹预算下表现出预测的非退化中间区域信用分布，并与秩匹配的基线保持竞争力。

英文摘要

Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals, surprisal, entropy reduction, and policy divergence, can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce \emph{Adapter-Residual Credit Assignment} (ARCA), a lightweight alternative that derives token salience from the adapter's own hidden-state residual, $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.00251 2026-06-02 cs.AI 版本更新

SEMBridge: 具有最弱前置条件和有界检查解释的无标签最终程序语义

Eric Liang

发表机构 * Oracle

AI总结提出SEMBridge框架，通过无标签最终风格统一生成可执行语义、最弱前置条件验证条件和有界检查，保持三者同步。

详情

AI中文摘要

形式化方法提供程序行为的严格描述，但实际软件工程通常通过可执行库、测试和增量设计工作。本文提出SEMBridge，一个小的无标签最终框架，用于从相同的可执行目标程序生成最弱前置条件和有界检查解释。不是将程序语义提交给一个抽象语法树然后编写单独的遍历，而是针对语义接口编写一次目标程序，并将其解释为多种含义：可读代码、具体执行、谓词变换器、有界反例搜索以及未来的证明助手或SMT后端。Python原型实现了一个无循环的命令式核心，包含赋值、条件、假设和断言。在五个示例程序上，相同的无标签最终定义生成了可执行状态变换器和验证条件，这些条件在多达729个状态的域上通过了有界检查。贡献不是Scala代码生成系统或新的验证器，而是一种紧凑的架构，用于保持可执行语义、最弱前置条件工件和有界验证同步。

英文摘要

Formal methods provide rigorous accounts of program behavior, but practical software engineering often works through executable libraries, tests, and incremental design. This paper presents SEMBridge, a small tagless-final framework for generating weakest-precondition and bounded-checking interpretations from the same executable object programs. Instead of committing a program semantics to one abstract syntax tree and then writing separate traversals, object programs are written once against a semantic interface and interpreted into multiple meanings: readable code, concrete execution, predicate transformers, bounded counterexample search, and future proof-assistant or SMT back ends. The Python prototype implements a loop-free imperative core with assignments, conditionals, assumptions, and assertions. Across five example programs, the same tagless-final definitions generated executable state transformers and verification conditions that passed bounded checking over domains up to 729 states. The contribution is not a Scala code-generation system or a new verifier, but a compact architecture for keeping executable semantics, weakest-precondition artifacts, and bounded validation synchronized.

URL PDF HTML ☆

赞 0 踩 0

2606.00202 2026-06-02 cs.LG cs.AI 版本更新

From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets

从Rashomon理论到PRAXIS：高效决策树Rashomon集

Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

发表机构 * Stanford University（斯坦福大学）

AI总结针对决策树Rashomon集计算开销大的问题，提出PRAXIS算法，在运行时和内存使用上实现数量级改进，并能恢复几乎完整的Rashomon集。

Comments Accepted to ICML 2026

详情

AI中文摘要

标准机器学习流程通常会产生许多接近最优的模型。这些“Rashomon集”为不确定性感知的鲁棒决策带来了一系列挑战和机遇。它们允许用户整合领域知识和偏好，这些知识和偏好通常难以直接指定为目标函数，并且它们量化了给定训练数据集和目标函数下有效模型之间的多样性。然而，即使对于稀疏决策树这样简单、可解释的模型类，Rashomon集的计算仍然需要巨大的内存和运行时资源。我们提出了PRAXIS，一种近似该Rashomon集的算法，在运行时和内存使用上实现了数量级的改进。我们验证了PRAXIS通常能恢复几乎完整的Rashomon集。PRAXIS使研究人员和从业者能够可扩展地对真实世界数据集的Rashomon集进行建模。PRAXIS的代码可在https://github.com/zakk-h/PRAXIS获取。

英文摘要

Standard machine learning pipelines often admit many near-optimal models. These "Rashomon sets" pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets. Code for PRAXIS is available at https://github.com/zakk-h/PRAXIS

URL PDF HTML ☆

赞 0 踩 0

2606.00198 2026-06-02 cs.LG cs.AI cs.CL 版本更新

BAGEN: Are LLM Agents Budget-Aware?

BAGEN：LLM 智能体是否具有预算意识？

Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

发表机构 * Northwestern University（西北大学）； O2 Lab（O2实验室）； Independent（独立）； University of Michigan（密歇根大学）； Cornell（康奈尔大学）； All Hands AI ； Stanford（斯坦福大学）； UT Austin（德克萨斯大学奥斯汀分校）

AI总结本文提出预算感知智能体（BAGEN）概念，将预算作为主动控制信号而非被动成本指标，通过渐进区间估计方法预测剩余预算上下界，并在四个环境和五个前沿模型上发现强模型不一定具有强预算意识、模型过度乐观等失败模式，早期停止可节省 28-64% 令牌，但精确区间校准仍具挑战。

详情

AI中文摘要

尽管智能体正在花费越来越多的资源，但如今智能体成本大多仅在执行后衡量。预算感知智能体（BAGEN）应将预算视为主动控制信号，而非被动成本指标。我们首先系统地将预算估计定义为内部预算（来自智能体计算）和外部预算（来自智能体动作）。然后，我们将预算意识形式化为渐进区间估计：在计划的每一步，智能体应预测剩余预算的上限和下限，并在完成可能性低时发出警报。通过 rollout-replay 协议进行评分，我们在四个环境和五个前沿模型上发现了一致的失败模式：（1）强模型不一定具有强预算意识，相关性 r=0.35。（2）前沿模型始终过度乐观，继续在不太可能成功的任务上花费资源，而不是尽早提醒用户。（3）预算感知信号是可操作且可训练的。早期停止在失败轨迹上节省 28-64% 的令牌，SFT+RL 增强了早期停止和警报行为。（4）精确区间校准仍然具有挑战性，SFT+RL 后区间覆盖率上限为 47%。项目页面：https://ragen-ai.github.io/bagen/

英文摘要

While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) frontier models are consistently over-optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget-aware signal is actionable and trainable. Early stop saves 28-64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen-ai.github.io/bagen/

URL PDF HTML ☆

赞 0 踩 0

2606.00189 2026-06-02 cs.LG cs.AI 版本更新

Learning to Construct Practical Agentic Systems

学习构建实用的智能体系统

Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, Cassandra A. Cohen, Arthur Kajiyama, William W. Cohen

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Dept. of Computer Science（计算机科学系）； Emory University（埃默里大学）

AI总结本文提出一种基于伪工具和固定工作流的智能体框架，通过模块化设计和多目标优化方法，在保证成本可控和结果质量的前提下，实现实用智能体系统的自动构建与优化。

详情

AI中文摘要

基于LLM的智能体系统的自动设计和优化能够产生复杂的系统，显著提升结果质量，优于现成的智能体模式。然而，对实际部署的智能体系统的研究表明，生产系统更关注推理成本的简单性、可控性和可预测性等问题。本文提出了设计和优化实用智能体系统的原则性方法。我们描述了一个智能体框架，通过定义在受限上下文中递归调用LLM的“伪工具”，使设计者能够强制智能体系统的模块化。利用该框架，我们为多种任务手工设计了智能体，并表明相对于动态规划的工作流，手工构建的固定工作流通常更便宜且更准确。随后，我们提出了针对该框架所需的智能体组件（即伪工具和固定工作流）的新型学习方法。这些学习方法通常优于手工设计的智能体。我们还利用框架的模块化特性，应用多目标优化方法联合优化成本和响应质量，并融合多个学习系统的结果。

英文摘要

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining "pseudo-tools" that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00183 2026-06-02 cs.LG cs.AI math.OC stat.ML 版本更新

Agentic Transformers Provably Learn to Search via Reinforcement Learning

智能体Transformer通过强化学习可证明地学会搜索

Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of Pennsylvania（宾夕法尼亚大学）； The Ohio State University（俄亥俄州立大学）； Yale University（耶鲁大学）

AI总结本文通过构建双头Transformer实现随机深度优先搜索，并分析策略梯度训练动力学，证明该搜索机制能从稀疏强化反馈中分阶段涌现，且具备深度泛化能力。

详情

AI中文摘要

树搜索是许多语言智能体推理和决策任务的核心抽象：智能体必须探索动作、记住失败并回溯到有希望的替代方案。然而，我们缺乏对基于Transformer的策略如何从强化学习（RL）的训练动态中获得这种搜索能力的理论理解。我们在一个随机的$k$叉树环境中研究这个问题，其中智能体Transformer仅通过交互观察其轨迹历史，并在到达隐藏的叶子目标节点时获得终端奖励。我们首先构建了一个实现随机深度优先搜索（DFS）的双头Transformer：一个头跟踪之前的动作，而另一个头检测失败结果并触发回溯。然后，我们分析了在深度课程下的策略梯度训练动态，表明相同的DFS机制在没有专家演示的情况下，从稀疏强化反馈中分阶段涌现。得到的策略表现出深度泛化能力：仅在深度为1和2的树上训练后，它能在更深的完整树上成功。我们进一步表明，在非平衡目标分布下，对回报进行折扣会导致一种排序的DFS策略，优先考虑高概率分支。总的来说，我们的结果确定了基于Transformer的搜索的一种机制性标准形式，其中注意力头专门化并协作，从上下文中提取与决策相关的轨迹，并通过RL训练将其转化为智能体动作选择。

英文摘要

Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire such search capabilities from the training dynamics of reinforcement learning (RL). We study this question in a stochastic $k$-ary tree environment, where an agentic transformer observes only its trajectory history through interaction and receives a terminal reward for reaching a hidden leaf goal node. We first construct a two-head transformer that implements randomized depth-first search (DFS): one head tracks previous actions, while the other detects failure outcomes and triggers backtracking. We then analyze the training dynamics of policy gradient under a depth-wise curriculum, showing that this same DFS mechanism emerges in stages from sparse reinforcement feedback without expert demonstrations. The resulting policy exhibits depth generalization: after training only on depth-$1$ and depth-$2$ trees, it succeeds on deeper full trees. We further show that, under imbalanced goal distributions, discounting the return leads to a ranked DFS policy that prioritizes higher-probability branches. Overall, our results identify a mechanistic normal form for transformer-based search, in which attention heads specialize and cooperate to extract decision-relevant traces from context and convert them into agentic action selection via RL training.

URL PDF HTML ☆

赞 0 踩 0

2606.00180 2026-06-02 cs.LG cs.AI 版本更新

UF-AMA: 通过自适应多模态对齐的跨域情感识别统一框架

Zheng Wang, Shuo Wang, Junhong Wang

发表机构 * Institute of Advanced Technology, University of Science and Technology of China（中国科学技术大学先进技术研究院）； Department of Electronic Engineering and Information Science, University of Science and Technology of China（中国科学技术大学电子工程与信息科学系）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合国家科学中心人工智能研究院）

AI总结提出一种统一框架UF-AMA，利用自适应多模态对齐和置信度感知筛选机制，解决跨主体和跨会话的生理信号情感识别中的分布偏移问题，在SEED和SEED-IV数据集上达到最优性能。

详情

AI中文摘要

近年来，基于脑电图（EEG）等生理信号的情感识别受到了广泛关注，因为与面部表情等外部行为数据相比，内部生理数据提供了更高的客观性和可靠性。然而，由于个体和情境差异导致的分布偏移，以及各模态样本质量的差异，构建具有高泛化性和鲁棒性的跨域多模态情感识别模型仍然是一个关键挑战。在本研究中，我们提出了一种具有自适应多模态对齐的统一框架（UF-AMA），以使用多模态生理信号解决跨主体和跨会话的情感识别问题。首先，我们构建了一个由Transformer编码器和多头交叉注意力模块组成的跨模态特征融合网络，实现了EEG信号和眼动追踪数据的深度融合。随后，我们引入了一种置信度感知筛选机制，动态评估每个模态分支在目标域样本上的预测可靠性，将样本划分为不同的质量子集，并相应地应用全局一致性对齐和跨模态蒸馏。最后，我们提出了一个多级域自适应框架，联合优化局部模态特定特征和全局融合特征的边际分布和条件分布，从而在多个粒度上减少跨域分布偏移。在SEED和SEED-IV数据集上的大量实验表明，UF-AMA在跨主体和跨会话任务中均达到了最先进的性能。源代码可在 https://github.com/BetterCoderLab/UF-AMA 获取。

英文摘要

In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross-domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF-AMA) to address cross-subject and cross-session emotion recognition using multimodal physiological signals. First, we construct a cross-modal feature fusion network comprising Transformer encoders and multi-head cross-attention modules, enabling the deep integration of EEG signals and eye-tracking data. Subsequently, we introduce a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross-modal distillation. Finally, we propose a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality-specific and global fusion features, thereby reducing cross-domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED-IV datasets demonstrate that UF-AMA achieves state-of-the-art (SOTA) performance in both cross-subject and cross-session tasks. The source code is available at: https://github.com/BetterCoderLab/UF-AMA.

URL PDF HTML ☆

赞 0 踩 0

2606.00169 2026-06-02 cs.LG cs.AI 版本更新

ChurnNet: A Optimized Modern AI for Churn Prediction

ChurnNet: 一种用于流失预测的优化现代人工智能

Syed Saad Saif, Giulio Maggiore, Paolo Russo, Damiano Distante

发表机构 * Department of Computer, Control, and Management Engineering（计算机、控制与管理工程系）； Department of Law and Economics, UnitelmaSapienza University of Rome（法律与经济学系，Unitelma萨皮恩扎罗马大学）； R&D Center, Token Financial Technologies（Token金融技术研发中心）； Department of Civil, Computer Science and Aeronautical Technologies Engineering, Roma Tre University（土木、计算机科学与航空技术工程系，罗马三大学）

AI总结本研究评估了传统机器学习方法（随机森林、XGBoost、支持向量机）与统一多任务时间序列模型在客户流失预测任务上的性能，发现传统方法在预测性能、数据效率和计算资源需求方面仍具优势。

详情

AI中文摘要

日益激烈的竞争以及零售商提供的产品和服务日益相似，降低了客户转向竞争对手的门槛。准确的流失预测可以成为推动有效个性化营销活动和帮助减少客户流失的宝贵工具。本研究评估了传统机器学习技术（即随机森林、XGBoost和支持向量机）的性能，并将其与统一多任务时间序列模型（一种二元时间序列分类任务）在流失预测上进行比较。尽管后者在建模复杂时间动态和变量间关系方面具有强大能力，但我们的结果表明，对于流失预测，传统方法在预测性能、数据效率以及训练和部署的计算资源需求方面仍可超越它。这些发现在多个数据集和各种流失标记技术中保持一致。

英文摘要

Increased competition and the growing similarity of products and services offered by retailers have lowered the barriers for customers to switch to competitors. Accurate churn prediction can be a valuable tool for driving effective personalized marketing campaigns and helping to reduce customer attrition. This study evaluates the performance of traditional machine learning techniques, namely, Random Forests, XGBoost, and Support Vector Machines, and compares them with the Unified Multi-Task Time Series Model for churn prediction, a binary time-series classification task. Despite the strong capacity of the latter to model complex temporal dynamics and inter-variable relationships, our results indicate that for churn prediction, conventional methods can still outperform it in terms of predictive performance, data efficiency, and computational resource requirements for training and deployment. These findings are consistent across multiple datasets and various churn labeling techniques.

URL PDF HTML ☆

赞 0 踩 0

2606.00161 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Improving IoT Intrusion Detection Through SMOTE-Based Oversampling and Extended Multi-Model Evaluation on Side-Channel Power Data

基于SMOTE过采样和扩展多模型评估的侧信道功率数据物联网入侵检测改进

Muhammad Khuram Shahzad, Haseeb Khan, Muhammad Masood Khan, Mubashra Bibi

发表机构 * School of Electrical Engineering and Computer Science (SEECS), NUST（电气工程与计算机科学学院（SEECS），努斯兰大学）

AI总结针对物联网侧信道数据集中的严重类别不平衡问题，采用SMOTE过采样平衡数据，并评估八种机器学习模型，其中随机森林和极端随机树在F1分数上超越基线方法，同时揭示了宏观F1指标的重要性。

Comments 8 pages, 14 figures; code and results publicly available

详情

AI中文摘要

物联网网络中的入侵检测面临传统机器学习方法无法克服的挑战，其中最大的挑战之一是侧信道数据集中存在的类别不平衡问题，正常类样本与攻击类样本的比例可达75964:1。Dominguez等人通过基于功率的入侵检测概念验证解决了这一问题，但既未尝试处理不平衡问题，也未使用平衡训练集评估分类器性能。本文同时处理这两个方面。首先，对从初始数据集提取的所有九个可能数据集应用合成少数类过采样技术（SMOTE），使每个数据集的精确不平衡比达到1.1。然后，在SMOTE平衡的6小时数据集上，在相同条件下训练了八种算法：随机森林、HistGradientBoosting、LightGBM、极端随机树、XGBoost、k近邻、多层感知机和决策树。随机森林的微平均F1分数达到0.9989，宏F1为0.9794，优于基线论文中时间序列森林算法之前的最佳微F1结果0.9983。极端随机树提供了相同的性能，但速度快10倍。与基线论文评估相比，显式引入宏F1指标揭示了聚合性能指标遗漏的重要类别级信息。基于混淆矩阵、F1热图和ROC曲线计算的每类召回率表明，仅当使用SMOTE平衡时，少数攻击类（尤其是M+L联合感染类）才能被可靠检测。特征重要性分析表明，在功率窗口的60个时间步中，最近的时间步是最重要的预测信号。

英文摘要

The detection of intrusions in IoT-based networks poses challenges that cannot be overcome using traditional machine learning methods. Perhaps the biggest of them is related to the presence of a class imbalance in the side-channel dataset, where the number of samples in the normal class compared to the attacks can reach a ratio of 75,964 to 1. Such an aspect is addressed by Dominguez et al. through the proof of concept of power-based intrusion detection. Unfortunately, neither the authors attempt to cope with the problem of imbalance nor do they assess the classifier performance using a balanced training set. In the current paper, both aspects will be handled at once. First, a Synthetic Minority Oversampling Technique (SMOTE) was performed on all nine possible datasets extracted from the initial one, providing an exact imbalance ratio of 1.1 for each. Then, eight algorithms i.e. Random Forest, HistGradientBoosting, LightGBM, Extra Trees, XGBoost, k-Nearest Neighbors, Multi-Layer Perceptron, and Decision Tree were trained under identical conditions for the SMOTE balanced 6-hour dataset. Random Forest reached a micro-averaged F1 score of 0.9989 and macro F1 of 0.9794, thus outperforming the previously best micro-F1 result obtained by Time Series Forest algorithm from the base paper of 0.9983. Extra Trees provided the same performance as well, but at 10 times faster. The introduction of a macro-F1 metric explicitly in contrast to the base paper assessment reveals important class-level information missed with aggregate performance metrics. Recall rates per-class calculated with confusion matrices, F1 heatmaps, and ROC curves show that minority attack classes, especially those with combined M+L infections, are detected reliably only when using SMOTE balance. Feature importance analysis indicates the latest time steps as the most important predictor signals out of 60 steps in a power window.

URL PDF HTML ☆

赞 0 踩 0

2606.00160 2026-06-02 cs.CR cs.AI cs.CL 版本更新

一种用于定量扩散MRI的物理信息基础模型

Zihan Li, Jialan Zheng, Ziyu Li, Xun Yuan, Kasidit Anmahapong, Ziang Wang, Mingxuan Liu, Hongjia Yang, Yifei Chen, Zhuhao Wang, Yuhang He, Fang Chen, Rui Li, Huaiqiang Sun, Yi Liao, Congyu Liao, Yang Yang, Haibo Qu, Xue Zhang, Hongen Liao, Qiyuan Tian

发表机构 * School of Biomedical Engineering, Tsinghua University（清华大学生物医学工程系）； Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford（牛津大学整合神经影像中心、FMRIB、临床神经科学系）； Department of Radiology, West China Second University Hospital, Sichuan University（四川大学华西第二医院放射科）； School of Biomedical Engineering and the Institute of Medical Robotics, Shanghai Jiaotong University（上海交通大学生物医学工程学院和医学机器人研究院）； Department of Radiology, Institution of Radiology and Medical Imaging, West China Hospital, Sichuan University（四川大学华西医院放射科、放射医学与影像研究所）； Department of Radiology and Biomedical Imaging, University of California San Francisco（加州大学旧金山分校放射科和生物医学影像系）； Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine（斯坦福大学医学院精神病学与行为科学系）

AI总结提出物理信息生成微结构网络（PIGMENT），通过零样本适应实现从稀疏数据中恢复可靠的定量扩散MRI参数映射。

详情

AI中文摘要

理解人脑需要获取其微观组织架构。扩散磁共振成像（MRI）提供了唯一非侵入性的活体全脑微结构窗口，但可靠的定量映射仍局限于需要密集采样和优化采集协议的专业研究环境。为解决这一差距，我们提出了一种物理信息生成微结构网络（PIGMENT），它学习人脑微结构的通用生成先验，并以零样本方式适应每个参与者的测量数据，以恢复特定主体的映射。PIGMENT在涵盖多个站点、供应商和场强的11375次扫描上训练，使得在来自五个独立中心的外部数据集上，能够对张量、峰度和NODDI模型进行可靠的定量映射。在传统拟合变得不可靠的情况下，它仍然有效，从极其稀疏的采集中恢复有意义的映射，同时支持下游的纤维追踪和结构连接映射。PIGMENT估计显示出强大的生物学有效性，从10倍加速扫描中保留了亚毫米级皮层微结构模式和早期儿童白质发育轨迹。此外，PIGMENT能够在成本效益高的低场系统上进行可靠的定量张量映射，并使用超快速临床协议提取肿瘤相关生物标志物。这些结果共同确立了PIGMENT作为一种物理信息基础模型，将定量扩散MRI扩展到传统上因过于稀疏、异质或临床受限而无法进行可靠分析的领域。

英文摘要

Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides the only noninvasive window into whole-brain microstructure in vivo, yet reliable quantitative mapping remains confined to specialized research settings requiring dense sampling and optimized acquisition protocols. To address this gap, we present a physics-informed generative microstructure network (PIGMENT) that learns a universal generative prior of human brain microstructure and adapts it zero-shot to each participant's measured data to recover subject-specific maps. Trained on 11375 scans spanning multiple sites, vendors, and field strengths, PIGMENT enabled reliable quantitative mapping for tensor, kurtosis, and NODDI models across external datasets from five independent centers. It remains effective where conventional fitting becomes unreliable, recovering meaningful maps from extremely sparse acquisitions while supporting downstream tractography and structural connectivity mapping. PIGMENT estimates demonstrated strong biological validity, preserving submillimeter cortical microarchitectural patterns and early-childhood white matter developmental trajectories from 10-fold accelerated scans. Furthermore, PIGMENT enables reliable quantitative tensor mapping on cost-efficient low-field systems and the extraction of tumor-related biomarkers using ultra-fast clinical protocols. Together, these results establish PIGMENT as a physics-informed foundation model that extends quantitative diffusion MRI into regimes traditionally too sparse, heterogeneous, or clinically constrained for reliable analysis.

URL PDF HTML ☆

赞 0 踩 0

2606.00155 2026-06-02 cs.CR cs.AI 版本更新

A Protocol-Language Model for Network Intrusion (Without Deep Packet Inspection)

一种用于网络入侵的协议语言模型（无需深度包检测）

Vivek Kumar Sharma

发表机构 * Palo Alto Networks（帕洛阿尔托网络）

AI总结提出PLM-NIDS，利用RWKV-4状态空间模型将网络流作为语言处理，仅基于L3/L4包元数据检测攻击，无需深度包检测，实现零样本异常检测（PR-AUC=0.93）和加密协议透明处理。

Comments 20 pages Research paper on Packet Language Models for Network Intrusion Detection Systems(Without Deep Packet Inspection).Code available on GitHub

详情

AI中文摘要

现代网络入侵检测系统（NIDS）陷入结构性矛盾：承载最高威胁情报的协议恰恰是那些在TLS 1.3和QUIC下加密的协议，其中负载检测毫无用处。我们提出一个更简单的问题——如果攻击签名不在字节中，而在节奏中呢？——并通过将网络流视为一种语言来回答，该语言的语法完全由L3/L4包元数据编写：长度、到达间隔时间、TTL、TCP标志和哈希端口号。我们提出了PLM-NIDS，它依次证明了三个主张。（1）语法存在且可学习：在344,232个未标记的Monday流上训练的RWKV-4状态空间模型实现了0.204的因果LM验证损失，表明良性流量具有可预测的、统计一致的结构。（2）攻击违反此语法：在训练时使用零攻击标签的情况下，每流困惑度得分以PR-AUC=0.93清晰分离良性流和攻击流。（3）这种分离在架构上非平凡：在相同令牌序列上训练的LSTM退化为多数类预测器（ROC-AUC约0.50，通过始终预测“攻击”得到F1=0.91），证明RWKV的因果预训练提供了直接分类器无法获得的归纳偏置。监督微调进一步将PR-AUC提升至0.94，ROC-AUC提升至0.75，在校准操作阈值下精确率为97.7%。RWKV骨干的O(T)循环推理支持逐包流式处理而无需流缓冲，使PLM-NIDS在线速率下可操作。由于它仅读取IP/TCP/UDP头部，因此本质上是加密无关的：TLS 1.3、QUIC和未来的加密协议均被透明处理。

英文摘要

Modern network intrusion detection systems (NIDS) are caught in a structural contradiction: the protocols carrying the highest threat intelligence are precisely those encrypted under TLS 1.3 and QUIC, where payload inspection yields nothing. We ask a simpler question -- what if the attack signature is not in the bytes, but in the rhythm? -- and answer it by treating network flows as a language whose grammar is written entirely in L3/L4 packet metadata: length, inter-arrival time, TTL, TCP flags, and hashed port numbers. We present PLM-NIDS, which proves three claims in sequence. (1) The grammar exists and is learnable: a RWKV-4 state-space model trained on 344,232 unlabelled Monday flows achieves a causal LM validation loss of 0.204, demonstrating that benign traffic has predictable, statistically consistent structure. (2) Attacks violate this grammar: the per-flow perplexity score cleanly separates benign from attack flows with PR-AUC = 0.93 using zero attack labels at training time. (3) This separation is architecturally nontrivial: an LSTM trained on identical token sequences degenerates to a majority-class predictor (ROC-AUC approximately 0.50, F1 = 0.91 by always predicting "attack"), proving that RWKV's causal pre-training provides an inductive bias unavailable to direct classifiers. Supervised fine-tuning further raises PR-AUC to 0.94 and ROC-AUC to 0.75, with a precision of 97.7% at the calibrated operating threshold. The RWKV backbone's O(T) recurrent inference enables per-packet streaming without flow buffering, making PLM-NIDS operationally viable at line rate. Because it reads only IP/TCP/UDP headers, it is inherently encryption-agnostic: TLS 1.3, QUIC, and future encrypted protocols are handled transparently.

URL PDF HTML ☆

赞 0 踩 0

通过重试在策略梯度强化学习中涌现探索行为

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo

发表机构 * University of Tokyo（东京大学）； Aalto University（阿尔托大学）

AI总结提出ReMax目标函数，通过最大化M个样本的期望最大回报来使探索行为自然涌现，并推导策略梯度公式及RePPO算法，在MinAtar和Craftax基准上无需显式探索奖励即可促进探索。

详情

AI中文摘要

在强化学习（RL）中，智能体从探索中获益仅仅是因为它们反复遇到相似的状态：尝试不同的动作可以提高性能或减少不确定性；没有这样的重试，贪婪策略是最优的。我们通过ReMax形式化这一直觉，该目标函数根据$M$个样本（$M$为正整数）的期望最大回报来评估策略，同时考虑回报的不确定性。优化该目标函数会使随机探索作为涌现属性出现，无需显式奖励项。为了实现高效的策略优化，我们为ReMax推导了新的策略梯度公式，并引入ReMax PPO（RePPO），这是一种PPO变体，它优化ReMax的同时将离散重试次数$M$推广为连续参数$m>0$，从而实现对探索的细粒度控制。实验上，RePPO在MinAtar和Craftax基准上无需任何显式探索奖励即可促进探索。

英文摘要

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2606.00150 2026-06-02 cs.CR cs.AI 版本更新

多对比度MRI运动校正：基于参数信息解缠与自适应专家网络

Honglin Xiong, Yuxian Tang, Feng Li, Yulin Wang, Lei Xiang, Dinggang Shen, Qian Wang

发表机构 * ShanghaiTech University（上海科技大学）

AI总结提出一种结合参数信息对比度解缠与严重度感知自适应校正的统一框架，通过ScanCLIP提取对比度嵌入以分离解剖内容，利用视觉Transformer估计运动严重度并路由至专家混合网络，实现跨对比度与严重度的运动伪影校正，在IXI和HCP基准上优于现有方法。

详情

AI中文摘要

磁共振成像中的运动伪影降低了诊断可靠性。现有的深度学习方法通常针对特定对比度，无法泛化到不同模态和伪影严重度。我们提出一个统一框架，结合参数信息对比度解缠与严重度感知自适应校正。ScanCLIP在超过30,000个MRI文本-图像对上预训练，从采集参数中导出对比度嵌入，将对比度风格与解剖内容分离，得到无对比度特征。然后，视觉Transformer估计运动严重度，并通过专家混合网络路由特征，实现针对性伪影校正。双路径解码器重建干净图像和残差伪影图，强制执行图像空间一致性。在IXI和HCP基准上，我们的方法在PSNR上提升0.75 dB，SSIM最高提升0.0279，优于现有方法，且在更高伪影严重度下增益更大。该方法在真实临床数据上展现出鲁棒的零样本泛化能力，这些数据使用未见过的扫描参数采集，而现有方法要么无法去除伪影，要么引入额外失真。

英文摘要

Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.

URL PDF HTML ☆

赞 0 踩 0

2606.00145 2026-06-02 cs.RO cs.AI 版本更新

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

边界完成（CaB）：有限校准下具有完成感知的可部署切换

Yusuke Sano, Takeshi Itoga

发表机构 * Intelligent Systems Laboratory, SECOM Co., Ltd.（SECOM公司智能系统实验室）

AI总结提出Completion at the Boundary (CaB)方法，通过边界阶段令牌（Before/Hit/After）保留双边证据，在有限校准条件下实现VLA代理的完成感知切换，提升复合指令执行和交接质量。

详情

AI中文摘要

视觉-语言-动作（VLA）代理可以执行自然语言指令，但部署系统仍缺乏操作接口：决定指令何时完成。这一缺口在短复合指令（“做A，然后做B”）中尤为严重，时机不当的交接会级联导致下游故障。完成本质上是闭环的，因为切换是一种改变指令上下文从而影响未来动作和观察的干预。我们研究在由开放式指令空间启发的可部署低校准机制下的完成问题，强制要求无测试时重新学习，并选择一个全局校准的切换规则（在开发集上选择一次，在测试集上原样复用）。在此约束下，将非对称边界证据压缩为单个标量可能在任务极性变化时变得脆弱。我们提出边界完成（CaB），它预测事件局部完成对象，形式为边界阶段令牌（Before/Hit/After），在此规则下保留双边证据。CaB-When将此完成对象转换为最小、可审计的切换决策（何时），而CaB-How重用同一完成对象来调节动作生成，以实现交接过程中的边界稳定控制（如何）。使用干预感知的E1/E2协议，我们表明在匹配容量和可部署性约束下，CaB在第一个视角Minecraft VLA基准上提高了复合执行和交接质量。

英文摘要

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: deciding when the instruction is complete. This gap is acute in short composites ("do A, then B"), where mistimed handoffs cascade into downstream failures. Completion is inherently closed-loop because switching is an intervention that changes the instruction context and thus future actions and observations. We study completion under a deployable low-calibration regime motivated by open-ended instruction spaces, enforcing no test-time relearning and a single globally calibrated switching rule selected once on development set and reused unchanged on test set. Under this constraint, collapsing asymmetric boundary evidence into a single scalar can be brittle under polarity shifts across tasks. We propose Completion at the Boundary (CaB), which predicts an event-local completion object in the form of Boundary-Phase Tokens (Before/Hit/After), retaining two-sided boundary evidence under this discipline. CaB-When converts this completion object into a minimal, auditable switching decision (when), while CaB-How reuses the same completion object to condition action generation for boundary-stable control through handoffs (how). Using an intervention-aware E1/E2 protocol, we show that CaB improves composite execution and handoff quality on a first-person Minecraft VLA benchmark under matched capacity and deployability constraints.

URL PDF HTML ☆

赞 0 踩 0

2606.00144 2026-06-02 cs.LG cs.AI 版本更新

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft：面向稀疏KV投机解码的接受感知多视角训练

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang

发表机构 * Shanghai Institute of Optics and Fine Mechanics（上海光学精密机械研究所）； The University of Sydney（悉尼大学）； Marquette University（马基特大学）； Johns Hopkins University（约翰·霍普金斯大学）； Wake Forest University（威克森林大学）； University of Texas Health Science Center at Houston（德克萨斯大学健康科学中心休斯顿分部）； University of Surrey（萨里大学）

AI总结针对中长上下文推理中稀疏/全缓存不匹配导致接受率下降的问题，提出BudgetDraft多视角稀疏训练方法，通过接受感知损失和多视角损失训练单一鲁棒草稿模型，在固定KV预算下恢复接受率，实现最高6.55倍加速。

详情

AI中文摘要

投机解码通过草稿模型提出多个令牌，验证器并行验证，从而加速自回归解码。在资源受限的部署中，草稿模型使用稀疏KV缓存以在固定KV预算下限制峰值GPU内存和端到端延迟，而验证器保留全KV缓存。实际应用中常见中长上下文推理（4K--16K上下文长度）。然而，随着上下文长度增长，朴素稀疏/全投机解码遭受稀疏/全不匹配问题，导致接受率快速下降。我们提出BudgetDraft，一种用于中长推理中稀疏草稿的多视角稀疏训练方法。草稿模型在训练期间暴露于多个采样的KV预算，并学习将每个稀疏视角与一个共享的全缓存教师目标对齐。BudgetDraft将全缓存分支上的接受感知损失与稀疏缓存分支上的多视角损失相结合，产生一个单一的预算鲁棒草稿模型，无需额外的推理时组件即可恢复跨稀疏级别的接受率。在PG-19、LongBench和LWM上的实验结果表明，BudgetDraft在4K、8K和16K上下文长度下，与自回归相比分别实现了最高6.55倍、4.46倍和2.10倍的端到端加速，同时保持推理流水线内存友好。

英文摘要

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.

URL PDF HTML ☆

赞 0 踩 0

2606.00143 2026-06-02 q-fin.PM cs.AI 版本更新

Regime-Adaptive Continual Learning for Portfolio Management

Chaofan Pan, Lingfei Ren, Linbo Xiong, Yonghao Li, Wei Wei, Xin Yang

发表机构 * Southwestern University of Finance and Economics（西南财经大学）； Shanxi University（山西大学）

AI总结提出ReCAP框架，通过自适应制度检测和持续学习实现投资组合管理的快速适应与长期优异回报。

Comments Accepted by KDD 2026

详情

AI中文摘要

金融市场本质上是不稳定的，频繁出现制度转变和结构性变化，使得传统的投资组合管理方法失效。现有的补救措施，如滚动窗口重新训练和朴素在线微调，分别受到高计算成本和知识利用不足的困扰，导致低回报和有限的适应性。持续学习通过使交易代理能够跨顺序任务积累和转移知识，提供了一种有前景的范式。在本文中，我们提出了 extbf{Re}gime-aware extbf{C}ontinual extbf{A}daptive extbf{P}ortfolio management ( extbf{ReCAP})，一个将CL集成到PM中以应对动态金融环境挑战的新框架。ReCAP采用自适应制度检测模块将历史市场数据分割成可变长度的制度，实现制度特定的策略向量学习和策略库构建。在持续交易过程中，制度门控模块根据当前市场状态自适应地组合策略库中的策略向量，促进对新检测到的制度的快速适应。只有制度门控和当前制度的策略向量被持续更新，以有效保留有用知识。在五个真实世界数据集上的广泛实验表明，ReCAP持续优于流行的基线，在长期投资视野中实现卓越回报，并快速适应制度转变。

英文摘要

Financial markets are inherently non-stationary, exhibiting frequent regime shifts and structural changes that render traditional Portfolio Management (PM) approaches ineffective. Existing remedies, such as rolling-window retraining and naive online fine-tuning, are hindered by high computational costs and insufficient knowledge utilization, respectively, resulting in low returns and limited adaptability. Continual learning (CL) offers a promising paradigm by enabling trading agents to accumulate and transfer knowledge across sequential tasks. In this paper, we propose \textbf{Re}gime-aware \textbf{C}ontinual \textbf{A}daptive \textbf{P}ortfolio management (\textbf{ReCAP}), a novel framework that integrates CL into PM to address the challenges of dynamic financial environments. ReCAP employs an adaptive regime detection module to segment historical market data into variable-length regimes, enabling regime-specific learning of policy vectors and the construction of a policy library. During continual trading, a regime-gate module adaptively combines policy vectors from the library based on the current market state, facilitating rapid adaptation to newly detected regimes. Only the regime-gate and the current regime's policy vector are continually updated to preserve useful knowledge effectively. Extensive experiments on five real-world datasets demonstrate that ReCAP consistently outperforms popular baselines, achieving superior returns in long-term investment horizons and rapid adaptation to regime shifts.

URL PDF HTML ☆

赞 0 踩 0

2606.00141 2026-06-02 cs.LG cs.AI 版本更新

生成式AI与数字生态系统韧性：基于生命周期的主动式综述

Jonghyun Chung, Rishabh Chaddha, Sanket Badhe, Debanshu Das, Nathan Huang, Amanpreet Kaur

发表机构 * Google LLC（谷歌有限公司）

AI总结本文采用基于生命周期的C5交互模型，综合机器学习与社会科学方法，系统综述了针对生成式AI驱动的对抗性合成内容的主动检测技术，包括协调不真实行为分析、流行病学建模和霍克斯过程等，旨在构建更具韧性的信息生态系统。

Comments 14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)

详情

Journal ref: IEEE Access (2026) IEEE Access (2026)

AI中文摘要

生成式AI加速了对抗性合成内容的扩散，使得传统的被动检测方法失效。本综述综合了新兴研究，展示了向主动检测新兴不真实叙事的范式转变。我们采用统一的、基于生命周期的分类法，将对抗性活动的社会技术生命周期模型与新兴不真实叙事检测的高级计算方法相结合。通过围绕C5交互模型（背景、原因、内容、放大循环、后果）构建分析，我们整合了机器学习和社会科学的不同研究流。为了区分合成放大模式与真实基线流量，本文综述了建模新叙事创建、播种和传播的最先进技术，包括协调不真实行为分析、流行病学建模和霍克斯过程。本综述还系统回顾了C5交互模型不同阶段对抗性威胁的主动检测方法，特别是高维嵌入空间中的异常检测、多层图上的无监督协调检测以及代理型AI系统。最后，本综述探讨了生成式AI带来的挑战，包括追踪快速变化威胁和多级分布漂移的困难，并概述了未来研究议程，重点在于检测异常聚类和构建预期性及韧性系统。本综述为更韧性的信息生态系统提供了基于生命周期的主动检测新兴合成威胁方法的全面回顾。

英文摘要

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes process. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 interaction model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.

URL PDF HTML ☆

赞 0 踩 0

2606.00135 2026-06-02 cs.LG cs.AI 版本更新

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

论智能体工具调用与强化学习训练的有效性与效率

Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Cambridge（剑桥大学）； University of Toronto（多伦多大学）

AI总结本文系统分析工具调用评估中的实现选择对结果敏感性的影响，并针对强化学习训练中的计算浪费提出两种加速技术。

Comments ICML 2026

详情

AI中文摘要

工具调用是现代大型语言模型（LLM）智能体的核心组件，使其具备超越参数化知识的技能。本文从两个互补维度研究工具调用：有效性（即如何衡量该能力）和效率（即如何学习该能力）。在有效性方面，我们系统分析了工具调用评估流程，并表明结果可能对看似微小、通常未文档化的实现选择高度敏感，包括随机种子、系统提示、多轮模板构建以及先前交互/推理历史的传递方式。这些选择可能导致报告性能的显著差异，尤其是在多轮设置中，若缺乏严格标准化，排行榜排名将不可靠。在效率方面，我们考察了用于工具调用的标准强化学习（RL），并识别出两个计算浪费来源：（i）在 rollout 过程中，许多提示不产生学习信号；（ii）在策略更新过程中，优化产生高计算成本。基于这些发现，我们引入了两种加速基于 RL 的工具调用训练的技术，在不降低性能的情况下实现了显著的挂钟时间加速。

英文摘要

Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.

URL PDF HTML ☆

赞 0 踩 0

2606.00134 2026-06-02 cs.CR cs.AI cs.LG 版本更新

XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT

XAI-SOH-FL: 通过自适应聚合和可解释人工智能增强异构物联网入侵检测中的SOH-FL

Ambreen Aslam, Maaz Hassan, Bibi Zahra, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)（电气工程与计算机科学学院（SEECS），国家 sciences and Technology（NUST））

AI总结针对异构物联网中数据异构、标签稀缺和模型不可解释性问题，提出XAI-SOH-FL框架，通过自适应聚合（动态γ选择与贝叶斯优化）和SHAP可解释性，在CICIDS2017数据集上达到94.12%准确率和0.92 F1分数，优于基线SOH-FL。

Comments 8 pages, 6 figures; code available at https://github.com/aaslam-msit/SOH-FL-Enhancement

详情

AI中文摘要

物联网环境中的入侵检测系统面临数据异构、缺乏标记数据和模型可解释性有限等重大挑战。联邦学习提供了一种隐私保护解决方案；然而，现有方法如SOH-FL存在两个关键限制：依赖手动调整的聚合参数γ以及模型预测缺乏可解释性。在本文中，我们提出XAI-SOH-FL，一个增强框架，将自适应聚合和可解释人工智能集成到SOH-FL范式中。首先，我们引入基于相似性阈值的动态γ选择机制，使聚合过程能够适应不断变化的数据分布。其次，采用贝叶斯优化自动确定最优γ值，消除了手动调整的需要。第三，引入SHAP（SHapley Additive exPlanations）为入侵检测决策提供特征级可解释性。在CICIDS2017数据集上的实验评估表明，所提方法达到了94.12%的准确率和0.92的F1分数，优于基线SOH-FL模型，同时收敛所需的通信轮次更少。此外，基于SHAP的分析揭示，流级特征如流持续时间和数据包长度显著影响模型预测。这些结果表明，XAI-SOH-FL在异构物联网环境中提供了准确性、适应性和可解释性之间的有效平衡。

英文摘要

Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of labeled data, and limited model interpretability. Federated Learning (FL) offers a privacy-preserving solution; however, existing approaches such as SOH-FL suffer from two key limitations: reliance on a manually tuned aggregation parameter γ and lack of explainability in model predictions. In this paper, we propose XAI-SOH-FL, an enhanced framework that integrates adaptive aggregation and explainable artificial intelligence into the SOH-FL paradigm. First, we introduce a dynamic γ selection mechanism based on similarity thresholding, enabling the aggregation process to adapt to evolving data distributions. Second, Bayesian Optimization is employed to automatically determine optimal γ values, eliminating the need for manual tuning. Third, SHAP (SHapley Additive exPlanations) is incorporated to provide feature-level interpretability for intrusion detection decisions. Experimental evaluation on the CICIDS2017 dataset demonstrates that the proposed approach achieves an accuracy of 94.12% and an F1-score of 0.92, outperforming the baseline SOH-FL model while converging in fewer communication rounds. Furthermore, SHAP-based analysis reveals that flow-level features such as Flow Duration and Packet Length significantly influence model predictions. These results indicate that XAI-SOH-FL provides an effective balance between accuracy, adaptability, and interpretability in heterogeneous IoT environments.

URL PDF HTML ☆

赞 0 踩 0

2606.00132 2026-06-02 cs.LG cs.AI 版本更新

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

基于广义瑞利商优化的基础模型保留适配

Dongjun Kim, Adrian de Wynter, Huancheng Chen, Heasung Kim, Haris Vikalo

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Microsoft（微软）； Microsoft AI（微软人工智能）； Meta

AI总结提出FoLoRA框架，通过广义瑞利商优化更新方向，在微调中平衡下游任务性能与预训练能力保留。

详情

AI中文摘要

虽然微调有效地将基础模型适配到专门的下游任务，但可能会降低预训练期间获得的非目标能力。现有的遗忘感知方法通常通过专门的初始化或固定约束寻求更安全的更新，但未在训练过程中调节适配-保留权衡。我们提出基础保留LoRA（FoLoRA），一个遗忘感知优化框架。在一阶保留条件的指导下，FoLoRA定义了预训练代理激活上的遗忘惩罚和下游任务激活上的任务效用。然后，它通过广义瑞利商按单位遗忘惩罚的任务效用对更新方向进行评分。由此产生的谱坐标系实现了方向门控Adam更新，在训练过程中衰减低效用-惩罚方向。为了估计遗忘惩罚，FoLoRA通过从预训练模型中采样构建预训练代理校准数据，而不是依赖单个代理数据集。在数学、代码和指令遵循适配上的实验表明，FoLoRA在基线上实现了最强的保留-适配平衡，提高了目标任务性能，同时最好地聚合保留了非目标能力。

英文摘要

While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting aware methods typically seek safer updates through specialized initialization or fixed constraints, but do not regulate the adaptation preservation trade-off during training. We propose Foundation Preserving LoRA (FoLoRA), a forgetting aware optimization framework. Guided by a first order preservation condition, FoLoRA defines a forgetting penalty over pretraining-proxy activations and a task utility over downstream task activations. It then scores update directions by task utility per unit forgetting penalty via a generalized Rayleigh quotient. The resulting spectral coordinate system enables direction wise gated Adam updates, attenuating low utility to penalty directions during training. To estimate the forgetting penalty, FoLoRA constructs pretraining proxy calibration data by sampling from the pretrained model rather than relying on a single proxy dataset. Experiments on math, code, and instruction following adaptation show that FoLoRA achieves the strongest preservation adaptation balance over baselines, improving target task performance with best aggregate preservation of non target capabilities.

URL PDF HTML ☆

赞 0 踩 0

2606.00131 2026-06-02 cs.SE cs.AI cs.LG cs.PL 版本更新

AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve

AI-PROPELLER：基于AlphaEvolve的仓库规模过程间代码布局优化

Chaitanya Mamatha Ananda, Rajiv Gupta, Mircea Trofin, Aiden Grossman, Sriraman Tallam, Xinliang David Li, Amir Yazdanbakhsh

发表机构 * University of California, Riverside（加州大学河滨分校）； Google（谷歌）； DeepMind（深度思维）

AI总结提出AI-PROPELLER系统，利用Magellan智能工作流将Propeller的编译器启发式方法演化为细粒度过程间优化器，并通过实际硬件执行评估布局变体，首次在工业仓库规模应用中实现细粒度过程间代码布局优化，性能提升0.23%至1.6%。

详情

AI中文摘要

后链接优化器（如Propeller和BOLT）已证明，精确的、基于性能剖析的代码布局可以从高度优化的二进制文件中提取显著的性能提升。然而，这些系统目前局限于过程内技术，未能充分利用过程间布局的全局潜力。由于组合爆炸的搜索空间和复杂的调用返回语义难以建模，过程间代码布局历来困难。因此，细粒度过程间布局的性能潜力在实践中尚未得到证实。AI-PROPELLER使用Magellan（一种智能工作流），将Propeller中的编译器启发式方法演化为细粒度过程间优化器，并微调所得策略的超参数。为确保高保真度，我们摒弃了近似的静态成本模型，智能工作流生成多个布局变体，并在实际硬件上执行以测量真实性能计数器，为进化循环提供精确的奖励信号。AI-PROPELLER已在包括大型仓库规模应用在内的多个基准测试上进行了评估，实验表明，在使用最先进的FDO和PLO优化后，性能提升0.23%至1.6%，这对于实际二进制文件而言意义重大。这是首次在工业环境中对大型仓库规模应用进行细粒度过程间代码布局优化。

英文摘要

Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult due to a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high-fidelity, we move away from approximate static cost models and the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks including large warehouse-scale applications and experiments show performance improvements of 0.23% to 1.6% optimized with state-of-the-art FDO and PLO which is significant for real-world binaries. This is the first time ever that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.

URL PDF HTML ☆

赞 0 踩 0

2606.00130 2026-06-02 cs.LG cs.AI 版本更新

Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks

自动可微非线性张量网络（ADNTNs）用于深度神经网络的指数级压缩

Andrzej Cichocki, Michal Wietczak

发表机构 * Institute of Computing Intelligence, Polish Academy of Sciences（波兰科学院计算智能研究所）

AI总结提出自动可微非线性张量网络（ADNTNs）作为结构化权重生成器，通过反向模式自动微分端到端训练紧凑核心张量，实现深度神经网络的高效压缩，在AlexNet和VGG-16上达到每层2000倍至77000倍压缩比，且精度与密集基线相当或更优。

Comments 6 figure, 28 pages, to be submitted to Journal and confrence

详情

AI中文摘要

我们研究了自动可微非线性张量网络（ADNTNs），这是一类结构化权重生成器，其紧凑核心张量通过反向模式自动微分（AD）进行端到端训练。该方法可视为低秩适应和张量分解的自然扩展：ADNTN不是使用一个低秩矩阵更新，而是通过小核心、非线性激活和可选的横向混合张量的层次结构构建大权重张量。本文聚焦于三种架构：树张量网络（TTNs）、带边界解缠器的增强型TTN（aTTNs）以及多尺度纠缠重整化拟设（MERA）。该公式支持非线性激活、任务感知目标、批处理以及硬件感知的执行调度。同时，本文明确区分了“微分”收缩程序和使收缩自由：AD并未消除大中间体、不良收缩顺序或一般带环张量网络精确收缩的成本。在AlexNet和VGG-16层上的大量模拟显示，在所研究设置下每层压缩比约为2000倍至77000倍，精度通常与密集基线相当，且在几个VGG-16案例中有所提升。这些结果是令人鼓舞的而非最终结论：它们表明，只要优化、收缩调度和部署内核协同设计，ADNTNs是一条有前景、数学结构清晰且硬件感知的通往更小神经网络的路径。

英文摘要

We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tensors are trained end-to-end by reverse-mode automatic differentiation (AD). The approach can be viewed as a natural extension of low-rank adaptation and tensor factorisation: instead of using one low-rank matrix update, an ADNTN builds a large weight tensor through a hierarchy of small cores, nonlinear activations, and optional lateral mixing tensors. The paper focuses on three architectures: Tree Tensor Networks (TTNs), augmented TTNs (aTTNs) with boundary disentanglers, and Multi-scale Entanglement Renormalisation Ansatze (MERA). The formulation supports nonlinear activations, task-aware objectives, batching, and hardware-aware execution schedules. At the same time, the paper keeps a clear distinction between \emph{differentiating} a contraction program and making contraction free: AD does not remove the cost of large intermediates, poor contraction orders, or exact contraction of general loopy tensor networks. Extensive simulations on AlexNet and VGG-16 layers show per-layer compression ratios from roughly $2000\times$ to $77000\times$ in the studied settings, with accuracy often matching the dense baseline and, in several VGG-16 cases, improving it. These results are encouraging rather than final: they suggest that ADNTNs are a promising, mathematically structured, and hardware-aware route toward much smaller neural networks, provided that optimisation, contraction schedules, and deployment kernels are designed together.

URL PDF HTML ☆

赞 0 踩 0

2606.00129 2026-06-02 cs.LG cs.AI 版本更新

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

现代LLM与人脑EEG共享的效价轴：饱和规律

Yousef A. Radwan, Xuhui Liu, Kilichbek Haydarov, Yuqian Fu, Mohamed Elhoseiny

发表机构 * King Abdullah University of Science and Technology（卡布斯大学）

AI总结本研究通过构建从大型语言模型（LLM）中提取的一维效价方向（V轴），发现其与人类EEG神经活动对齐，但进一步对齐策略无法提升解码性能，并形式化为“饱和规律”，指出改进应来自监督无法触及的残差子空间。

详情

AI中文摘要

大型语言模型（LLM）已成为强大的表示学习器，其内部特征与人类认知日益对齐。我们研究现代LLM是否可以作为理解人脑神经表示的透镜，重点关注EEG中的情感效价。我们首先仅使用九个情感唤起句子从现代LLM中构建了一维效价方向（V轴），并通过零样本迁移到情感基准测试和跨十四个LLM的模型一致性进行了验证。然后，我们展示了这个从LLM导出的方向映射到人类神经活动。在一个包含123名受试者观看情感视频的公共EEG队列中，EEG特征上的单个线性投影追踪了每个刺激的V轴位置。此外，36个未暴露于V轴的EEG情感分类器在其内部表示中自发发现了相同的方向，表明相同的效价结构在语言模型和人类电生理学中同时出现。然而，这种趋同并未提供有效的训练信号。我们测试了二十五种对齐策略，包括知识蒸馏、表示相似性、对比和拓扑损失；没有一种能改善解码，十六种显著降低了准确性。我们将这一结果形式化为饱和规律：一旦任务标签单独驱动脑解码网络朝向目标方向，额外的监督主要扭曲已经饱和的盆地，而承载类内残差的子空间几乎得不到有用的梯度。这一规律也指出了改进应来自何处：监督无法触及的残差子空间。受此启发，我们集成残差多样性而非监督盆地，在FACED上将平衡准确率提高了10.5%，并在SEED-V上复制了相同效果。

英文摘要

Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cognition. We study whether modern LLMs can serve as a lens for understanding neural representations in the human brain, focusing on emotional valence in EEG. We first build a one-dimensional valence direction, the V-axis, from modern LLMs using only nine emotion-evocative sentences. We validate it through zero-shot transfer to sentiment benchmarks and cross-model consistency across fourteen LLMs. We then show that this LLM-derived direction maps onto human neural activity. On a public EEG cohort of 123 subjects watching affective videos, a single linear projection on EEG features tracks the V-axis position of each stimulus. Moreover, 36 EEG emotion classifiers trained without exposure to the V-axis spontaneously rediscover the same direction in their internal representations, suggesting that the same valence structure emerges in both language models and human electrophysiology. Yet this convergence does not provide an effective training signal. We test twenty-five alignment strategies, including knowledge distillation, representational similarity, contrastive, and topographic losses; none improve decoding, and sixteen significantly reduce accuracy. We formalize this result as the saturation regularity: once task labels alone drive a brain-decoding network onto the target direction, additional supervision mainly distorts an already-saturated basin, while the load-bearing within-class residual receives little useful gradient. This regularity also indicates where improvement should come from: the residual subspace unreachable by supervision. Motivated by this insight, we ensemble across residual diversity rather than supervising the basin, improving balanced accuracy by 10.5% over the prior best on FACED, with the same effect replicated on SEED-V.

URL PDF HTML ☆

赞 0 踩 0

2606.00125 2026-06-02 cs.IR cs.AI cs.LG cs.MM 版本更新

Multimodal Music Recommendation System using LLMs

使用LLMs的多模态音乐推荐系统

Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed

发表机构 * University of Massachusetts Amherst（马萨诸塞大学阿姆赫斯特分校）； Dolby Laboratories（Dolby实验室）； Adobe Research（Adobe研究）； Cisco Research（Cisco研究）

AI总结提出一个多模态框架，通过融合音频、歌词、LLM生成的语义元数据和收听完成率，在基于会话的音乐推荐中显著提升Recall和NDCG。

详情

AI中文摘要

音乐推荐系统通常将歌曲视为不透明标记，依赖协同交互历史，忽略了语义或声学内容。先前工作探索了LLM增强、多模态和文本增强的序列推荐方法，但有些方法部分结合了语义、声学或参与信号，没有在一个统一的基于LLM的序列推理框架中联合建模所有三个信号，该框架将推荐基于实际歌曲内容。在这项工作中，我们提出了一个用于基于会话的音乐推荐的多模态框架，通过三种互补信号丰富了LastFM-1K数据集：(1) 使用预训练音乐和文本表示模型提取的音频和歌词嵌入，(2) 使用MGPHot注释方案生成的LLM语义元数据，以及(3) 收听完成率。我们采用E4SRec框架，通过扩展多模态特征和不同的项目ID编码器骨干（包括SASRec、BERT4Rec和GRU4Rec）来增强它。我们进一步扩展了LLM骨干选项，包括LLaMa-2-13B、Qwen2.5-7B-Instruct和LLaMa-3-70B，在零样本和微调设置下。我们的实验表明，集成基于内容的特征比仅使用ID的基线在Recall上提升高达95%，在NDCG上提升高达79%。此外，我们的实验表明，朴素的多模态融合并不总是产生加性改进，突显了跨模态整合的挑战。我们发布了一个用于音乐推荐的大规模多模态基准。

英文摘要

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation, and while some methods partially combine semantic, acoustic, or engagement signals, none jointly model all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata using the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework by extending it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone option with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines up to 95% in terms of Recall and 79% in terms of NDCG. Moreover, our experiments show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.

URL PDF HTML ☆

赞 0 踩 0

2606.00123 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

CardioLens: 通过多序列心脏MRI评估揭示MLLMs的临床现实差距

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

发表机构 * Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Beijing Anzhen Hospital（北京安贞医院）； Beihang University（北航）； King Abdullah University of Science and Technology（国王 Abdullah 科学与技术大学）

AI总结提出CardioLens测试平台，通过多序列心脏磁共振成像评估24个多模态大语言模型，发现其在临床工作流中表现不佳，存在类别崩溃失败模式，且输入选择和推理提示改进效果有限。

详情

AI中文摘要

多模态大语言模型在公共医学基准上表现出色，但现有评估通常依赖于孤立输入和简化识别任务，难以作为临床使用的有效代理。我们提出了CardioLens，一个针对多序列心血管磁共振的无泄漏评估测试平台，通过严格的报告到QA构建和验证流程，从私有医院档案中构建。CardioLens包含473,896张切片和13,494个经过验证的QA对，涵盖4D Cine、LGE、灌注和T2加权成像，并评估CMR解读的三个阶段：图像理解、报告生成和疾病诊断。在24个最先进的MLLM上，CardioLens揭示了显著的临床现实差距：模型整体表现不佳，性能沿真实CMR工作流下降。混淆分析进一步显示一种类别崩溃失败模式，模型倾向于默认频繁出现的异常类别，而不是区分临床不同的发现。为了排除MLLM兼容输入构造是主要原因，我们在不同切片预算下比较了随机、临床动机和数据驱动的切片选择协议；性能变化很小，通常约为1%。显式推理提示也无法挽救性能，往往使模型更加保守，而不是改善视觉证据的使用。这些结果表明，当前MLLM远未达到可靠的CMR解读，临床决策需要跨序列、视图和时间相位整合分布式证据。CardioLens为开发面向真实临床部署的下一代MLLM提供了一个临床基础的测试平台。

英文摘要

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.00121 2026-06-02 cs.CV cs.AI 版本更新

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

基于语义和结构引导的大脑活动图像重建通用框架

Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang, Huiguang He

发表机构 * State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology（脑认知与脑启发智能技术国家重点实验室）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Future Technology, University of Chinese Academy of Sciences（中国科学院大学未来技术学院）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出MindDiffuser两阶段框架，结合CLIP文本嵌入和视觉特征，通过Stable Diffusion生成语义图像并迭代优化结构信息，在fMRI、EEG、MEG三种模态上显著提升图像重建性能。

详情

AI中文摘要

从大脑记录中重建视觉刺激一直是脑解码中一项有意义且具有挑战性的任务。特别是，实现精确且可控的图像重建对于推动脑机接口的进步和应用具有重要意义。最近的方法利用文本到图像生成模型的能力，在语义（如概念和对象）方面重建了接近复杂自然刺激的图像。然而，它们在保持与原始刺激在细粒度结构信息（如位置、方向和大小）上的一致性方面存在困难，这削弱了模型的可控性和可解释性。为了解决上述问题，我们提出了一个两阶段图像重建框架，称为MindDiffuser。在第一阶段，从大脑反应解码的对比语言-图像预训练（CLIP）文本嵌入被输入到Stable Diffusion中，生成包含语义信息的初步图像。在第二阶段，我们使用解码的浅层CLIP视觉特征作为监督信号，通过反向传播迭代优化来自第一阶段的特征向量，以对齐结构信息。我们在由视觉刺激引发的三种模态（fMRI、EEG、MEG）的大脑反应数据集上进行了大量实验，结果表明我们的框架显著提升了先前最先进模型的性能，凸显了我们方法的有效性和通用性。空间和时间可视化结果进一步支持了我们框架的神经生物学合理性，为未来跨不同大脑信号模态的神经解码工作提供了指导。

英文摘要

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Recent methods, leveraging advances in the power of text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models. To address the aforementioned issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.

URL PDF HTML ☆

赞 0 踩 0

2606.00120 2026-06-02 eess.SP cs.AI cs.LG 版本更新

SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction

SpikeWFM：用于鲁棒信道预测的脉冲辅助无线基础模型

Liwen Jing, Yisha Lu, Tingting Yang, Li Sun, Yuxuan Shi, Yuwei Wang, Mengfan Zheng, Leiyang Xu

发表机构 * Mobile Information Networks-National Science and Technology Major Project（移动信息网络国家科技重大专项）

AI总结提出SpikeWFM混合架构，将脉冲神经网络与基于ANN的Transformer结合，通过时间稀疏性和事件驱动处理增强无线基础模型对噪声和干扰的鲁棒性，在信道预测任务上优于传统模型。

详情

AI中文摘要

本文提出SpikeWFM，一种新颖的混合架构，它将脉冲神经网络（SNN）与基于传统人工神经网络（ANN）的Transformer集成用于无线基础模型（WFM）。受人类大脑中噪声鲁棒且节能的信息处理启发，SpikeWFM旨在增强WFM对噪声和干扰的抵抗力，同时保持跨多种无线场景的强大泛化能力。借鉴大型语言模型成功经验，WFM利用跨各种无线环境的大规模数据集上的自监督预训练，学习一个统一的嵌入表示，支持包括信道预测、信道估计、波束预测、定位等在内的广泛下游任务。这类模型通常优于任务特定设计，并对未见条件表现出卓越的适应性。然而，现有WFM在实际无线系统中仍易受真实噪声和干扰影响。为解决这一局限，我们将脉冲神经元引入基于Transformer的WFM架构。我们提供简要理论分析，展示SNN-ANN混合如何通过时间稀疏性和事件驱动处理有效减轻噪声和干扰。实验结果表明，SpikeWFM在预训练收敛和信道预测准确性上均持续优于传统基于ANN的WFM。关于通信和感知任务的更多结果将在本工作的完整期刊版本中呈现。

英文摘要

This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neural network (ANN)-based transformers for wireless foundation models (WFMs). Inspired by the noise-robust and energy-efficient information processing in the human brain, SpikeWFM aims to enhance the resilience of WFMs against noise and interference while maintaining strong generalization capabilities across diverse wireless scenarios. Drawing from the success of large language models, WFMs leverage self-supervised pre-training on large-scale datasets spanning various wireless environments to learn a unified embedding that supports a wide range of downstream tasks, including channel prediction, channel estimation, beam predition, positioning and etc. Such models typically outperform task-specific designs and exhibit superior adaptability to unseen conditions. However, existing WFMs remain vulnerable to realistic noise and interference in practical wireless systems. To address this limitation, we incorporate spiking neurons into the transformer-based WFM architecture. We provide a brief theoretical analysis demonstrating how the SNN-ANN hybrid effectively mitigates noise and interference through temporal sparsity and event-driven processing. Experimental results show that SpikeWFM consistently outperforms conventional ANN-based WFMs in both pre-training convergence and channel prediction accuracy. Additional results on communication and sensing tasks will be presented in the full journal version of this work.

URL PDF HTML ☆

赞 0 踩 0

2606.00119 2026-06-02 cs.RO cs.AI 版本更新

SPARROW项目与保护技术的未来

Juan M. Lavista Ferres, Carl Chalmers, Bruno Demuro Segundo, Zhongqi Miao, Andres Hernandez Celis, Federico Alves Torres, Isai Daniel Chacon Silva, Anthony Cintron Roman, Allen Kim, Meygha Machado, Luana Marotti, Amy Michaels, Daniela Ruiz Lopez, Catherine Romero, Rahul Dodhia, Inbal Becker-Reshef, Pablo Arbelaez

发表机构 * Microsoft AI for Good Lab（微软AI for Good实验室）； Universidad de los Andes（andes大学）； University of Maryland（马里兰大学）

AI总结提出SPARROW开源平台，通过集成太阳能、边缘AI和卫星通信，实现偏远地区连续自主的生物多样性监测，并在多国部署验证其鲁棒性和可扩展性。

详情

AI中文摘要

全球生物多样性正以前所未有的速度下降，然而可用于监测和保护生态系统的工具仍受限于电力、连接性和可达性。我们提出SPARROW，一个集成太阳能、边缘人工智能和卫星通信的开源软硬件平台，能够在偏远环境中实现连续、自主的生物多样性监测。每个SPARROW节点结合低功耗图形处理单元（GPU）与模块化视觉、声学和环境传感器，执行设备端深度学习推理，并通过低地球轨道（LEO）卫星或全球移动通信系统（GSM）网络传输汇总结果。我们在哥伦比亚、秘鲁、坦桑尼亚和美国的热带、温带和高山生态系统中部署了SPARROW，它在多变的环境条件下维持24/7运行，并在前190天内收集了超过200万张图像和声学记录。该系统展示了鲁棒的实时分类和自适应电源管理，实现了无需现场人工干预的完全自主。通过集成可再生能源、边缘AI和开源设计，SPARROW降低了生态监测的技术和财务门槛，并为分布式智能传感器网络（新兴的“万物互联”用于行星生物多样性监测）建立了可扩展的基础。

英文摘要

Global biodiversity is declining at unprecedented rates, yet the tools available to monitor and protect ecosystems remain limited by constraints in power, connectivity, and accessibility. We present SPARROW, a hardware and software open-source platform that integrates solar energy, edge artificial intelligence, and satellite communication to enable continuous, autonomous biodiversity monitoring in remote environments. Each SPARROW node combines a low-power Graphics Processing Unit (GPU) with modular visual, acoustic, and environmental sensors, performing on-device deep learning inference and transmitting summarized results through Low-Earth-Orbit (LEO) satellite or Global System for Mobile Communications (GSM) networks. We deployed SPARROW across tropical, temperate, and montane ecosystems in Colombia, Peru, Tanzania, and the United States, where it sustained 24/7 operation under variable environmental conditions and collected more than two million images and acoustic recordings in the first 190 days. The system demonstrated robust real-time classification and adaptive power management, achieving full autonomy without on-site human intervention. By integrating renewable energy, on-edge AI, and open-source design, SPARROW lowers the technical and financial barriers to ecological monitoring and establishes a scalable foundation for a distributed, intelligent network of sensors, an emerging "Internet of Living Things" for planetary biodiversity monitoring.

URL PDF HTML ☆

赞 0 踩 0

2606.00107 2026-06-02 eess.SP cs.AI cs.LG 版本更新

Motif-based morphology signatures for interpretable ECG screening and monitoring

基于基序的形态学特征用于可解释的心电图筛查和监测

Nivedita Bijlani, Mauricio Villarroel

发表机构 * The Podium Institute of Sports Medicine and Technology（Podium运动医学与体育科技研究所）

AI总结提出一种基于基序的框架，通过定义可解释的心跳对齐基序和三种漂移度量，实现短期和长期心电图监测中的形态学变化量化与异常检测。

Comments Accepted to the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

详情

AI中文摘要

心电图仍然是心血管筛查的核心，但解读仍主要依赖人工且呈间歇性。临床实践依赖于简短的静息心电图，并在需要时进行长时间动态记录，两者都会产生需要大量资源审查的数据。因此，在临床明显异常出现之前，微妙的形态学变化或渐进性漂移可能被忽视。我们提出了一种基于基序的框架，该框架将心跳对齐的心电图基序定义为可解释的心脏特征，并量化短期和长期监测中的形态学漂移和偏差。基序是代表主导形态的典型心动周期。我们引入了三个可解释的漂移度量：与正常窦性心律的偏差、与个性化基线的偏差以及基序不稳定性指数。基序通过选择在固定窗口内最小化动态时间规整距离的心跳来提取。我们在短期（PTB-XL）和长期（MIT-BIH心律失常）心电图数据集上评估这些度量。通过代表性基序叠加和基于基准点的可视化实现可解释性，从而能够直接检查形态学变化。在MIT-BIH中，所提出的度量显著区分了主要正常和心律失常受试者（p<0.01）。在PTB-XL中，正常窦性心律偏差在主要诊断亚型中区分了正常和异常心电图（p<1e-4，Cliff's delta高达0.93）。心电图基序提供了心脏形态的可解释表示，支持可扩展的纵向监测和形态学驱动变化的早期检测。

英文摘要

Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical practice relies on brief resting ECGs and, when required, long-duration ambulatory recordings, both generating data that require resource-intensive review. Consequently, subtle morphological changes or progressive drift preceding clinically apparent abnormalities may go unnoticed. We propose a motif-based framework that defines beat-aligned ECG motifs as interpretable cardiac signatures and quantifies morphological drift and deviation across short and long-term monitoring. Motifs are representative cardiac cycles capturing dominant morphology. We introduce three interpretable drift metrics: deviation from a normal sinus rhythm (NSR), deviation from a personalised baseline, and a motif instability index. Motifs are extracted by selecting beats that minimise Dynamic Time Warping (DTW) distance within fixed windows. We evaluate these metrics on short (PTB-XL) and long-duration (MIT-BIH Arrhythmia) ECG datasets. Interpretability is achieved through representative motif overlays and fiducial-based visualisations, enabling direct inspection of morphological changes. In MIT-BIH, the proposed metrics significantly separated predominantly normal from arrhythmic subjects (p<0.01). In PTB-XL, NSR deviation distinguished normal from abnormal ECGs across major diagnostic subtypes (p<1e-4, Cliff's delta up to 0.93). ECG motifs provide an interpretable representation of cardiac morphology, supporting scalable longitudinal monitoring and early detection of morphology-driven change.

URL PDF HTML ☆

赞 0 踩 0

2606.00106 2026-06-02 eess.SP cs.AI cs.HC cs.LG 版本更新

A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces

脑机接口中速度-准确性权衡显式控制的方法论框架

Javier Jiménez, Francisco B Rodríguez

发表机构 * Grupo de Neurocomputación Biológica, Departamento de Ingeniería Informática, Universidad Autónoma de Madrid（生物神经计算组，信息工程系，马德里自治大学）

AI总结提出一个独立于分类器、范式和早停策略的评估框架，通过增益和保持度两个指标及可调参数α显式控制速度-准确性权衡，并在P300范式上验证其有效性。

详情

AI中文摘要

脑机接口（BCI）受到脑电图等模态低信噪比的限制，需要多次试验才能可靠解码用户意图。这导致了速度-准确性权衡，即更高的准确性以速度为代价。速度-准确性平衡依赖于应用，因此需要可控的权衡。传统指标（如信息传输率）将速度和准确性合并，模糊了它们的依赖关系并可能引入偏差。在本研究中，我们提出了一个独立于分类器、范式和早停策略的评估框架，将速度和准确性分离。我们采用两个度量：增益（相对速度提升）和保持度（相对准确性保持），并将它们组合成一个由α控制的可调增益-保持平衡，从而调节速度-准确性权衡。该参数无需修改分类器即可调整工作点，便于跨场景部署。该框架在P300事件相关电位范式上进行了评估，使用了63名受试者的公开记录以及多种分类器和早停策略，以实现速度-准确性和比特率的不同工作点。结果表明，调整α可产生快速、准确或平衡的BCI行为，展示了速度-准确性权衡的显式控制。该方法支持受试者级别的性能预测，并提高了BCI行为的可解释性。对信息传输率的进一步分析揭示了其向速度的系统性偏差，该偏差通过所提出的框架中的增益和保持度测量得到解释。总体而言，本工作将速度-准确性权衡确立为可控的设计变量，并在公开的P300范式上进行了验证，从而实现了BCI的透明评估和应用特定优化。

英文摘要

Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.

URL PDF HTML ☆

赞 0 踩 0

2606.00105 2026-06-02 cs.CV cs.AI 版本更新

CoilDrop-MRI：基于线圈丢弃的自监督物理引导MRI重建

Tongxi Song, Ziyu Li, Zihan Li, Wen Zhong, Congyu Liao, Yang Yang, Hua Guo, Wenchuan Wu, Qiyuan Tian

发表机构 * School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University（清华大学生物医学工程系）； Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford（牛津大学整合神经影像中心）； Department of Radiology & Biomedical Imaging, University of California San Francisco（加州大学旧金山分校放射科与生物医学成像系）

AI总结提出CoilDrop-MRI方法，通过在线圈维度进行丢弃并作为自监督训练目标，结合图像域和k空间域展开架构，实现无需全采样数据的并行MRI重建，在多站点、多场强、多模态数据集上性能优于现有自监督方法。

详情

AI中文摘要

基于自监督深度学习的方法在加速磁共振成像（MRI）重建中展现出巨大潜力，无需全采样数据即可实现高图像质量。这些方法通常将采集的数据划分为两个不相交的子集，构建输入-目标对以优化重建网络。然而，现有方法仅在空间频率（k空间）域进行划分，未探索线圈维度。为充分利用接收线圈间的信号相关性，我们提出CoilDrop-MRI，该方法对输入应用线圈级丢弃，并将丢弃的数据作为自监督框架中的训练目标。该方法被集成到图像域（SENSE）和k空间（SPIRiT）公式的展开架构中。我们进一步将CoilDrop-MRI扩展到多激发、相位校正的扩散MRI（dMRI）重建，展示了其多功能性。CoilDrop-MRI在多站点、多场强（0.3T、0.55T和3T）和多模态（T1加权、T2加权、T2-FLAIR和dMRI）数据集上进行了广泛验证，始终优于最先进的自监督方法，达到了与监督重建方法相当的质量，且无需全采样参考训练数据。此外，CoilDrop-MRI表现出强大的数据效率和跨成像条件的鲁棒泛化能力，使其成为自监督并行MRI重建的实用且通用的框架。

英文摘要

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.

URL PDF HTML ☆

赞 0 踩 0

2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO 版本更新

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

弥合2D-3D鸿沟：面向视觉语言导航的分层语义几何地图

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University（东华大学计算机科学与技术学院）； Bosch Corporate Research（博世企业研究）； King Abdullah University of Science and Technology（卡布斯大学）

AI总结提出分层语义几何地图（HSGM），将3D几何信息转化为VLM可理解的结构化表示，结合VLM高层语义规划与经典路径规划，实现零样本视觉语言导航，在R2R-CE和RxR-CE基准上达到最先进性能。

详情

AI中文摘要

视觉语言导航（VLN）使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型（VLM）取得了进展，但仍存在关键的语义-几何鸿沟：VLM擅长语言和2D视觉理解，但在3D空间推理方面表现不佳，且无法捕捉动作与空间转换之间的因果动态，导致导航不可靠，尤其在零样本设置中。为弥合这一鸿沟，我们提出分层语义几何地图（HSGM），将3D几何信息转化为与VLM兼容的结构化表示，有效将其与物理世界连接。具体而言，HSGM表示为多通道俯视图，组织为三个层次：（1）几何层，记录可导航区域和障碍物；（2）语义层，表示物体及其关系；（3）决策层，支持高层任务推理和目标选择。导航过程中，VLM作为高层语义规划器，解释HSGM编码的空间布局以选择几何有效航点，而航点间的低层无碰撞运动由经典路径规划算法执行，从而将语义推理与动作执行完全解耦。此外，复杂指令被分解为子任务，以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明，我们的零样本框架达到了最先进性能，甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

英文摘要

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.

URL PDF HTML ☆

赞 0 踩 0

2606.00092 2026-06-02 cs.CV cs.AI 版本更新

结构化视觉证据分解用于阻塞性睡眠呼吸暂停低通气综合征的证据驱动多模态筛查

Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu

发表机构 * School of Electronic and Electrical Engineering, Shanghai University of Engineering Science（上海工程技术大学电子与电气工程学院）； Tencent Youtu Lab（腾讯云视频实验室）； ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital of Fudan University（复旦大学耳鼻喉科医院耳鼻喉科研究所）； National University of Singapore（新加坡国立大学）

AI总结提出EviOSAHS框架，通过将面部图像分解为七个解剖查询并生成结构化证据卡，结合临床信息进行高灵敏度OSAHS筛查。

详情

AI中文摘要

有效的阻塞性睡眠呼吸暂停低通气综合征（OSAHS）多导睡眠图前筛查需要结合临床风险因素与可见的颅面和颈部线索。直接提示通用多模态基础模型进行医学是/否决策可能产生不稳定、校准不良的输出。我们提出EviOSAHS，一个证据驱动的多模态推理框架，将仅基于图像的解剖证据获取与最终临床判定分离。每张正面面部图像被分解为七个固定的解剖查询，涵盖颈部、下巴、嘴巴、面/颈脂肪、下颌、中面部和鼻子。视觉响应被转换为结构化证据卡，记录目标解剖结构、可见性、风险方向、证据强度、置信度和简洁摘要。这些卡片仅在最后阶段与清理后的临床档案结合，由大型语言模型进行平衡的二元筛查判定。我们在642名受试者队列上评估了EviOSAHS，将正常受试者映射为筛查阴性，轻度、中度或重度OSAHS受试者映射为筛查阳性。EviOSAHS实现了88.47%的准确率、94.86%的灵敏度、93.74%的F1分数和5.14%的假阴性率，在统一协议下优于仅临床提示、直接多模态提示和朴素两阶段流水线。消融实验表明，七问题视觉分解和平衡最终判定对高灵敏度工作点至关重要。对4,494个视觉输出的问题级审计显示100%的结构化解析率和93.88%的高可见率。EviOSAHS为二元多导睡眠图前OSAHS筛查提供了一个可审计、高灵敏度的工作流程，但应被视为分诊助手而非诊断系统。在临床部署前需要进行前瞻性验证、外部测试和校准的工作点控制。

英文摘要

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.00084 2026-06-02 cs.IR cs.AI cs.CL cs.LG 版本更新

SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector

SentimentLens: 通过双模态调和酒店业中的情感与评分

Dineth Jayakody, Pasindu Thenahandi, Sampath Jayarathna

发表机构 * University of Peradeniya（珀拉尼亚大学）

AI总结提出SentimentLens系统，基于方面级情感分析从非结构化酒店评论中提取知识，并通过跨模态调和文本情感与数值评分来识别运营冲突和服务改进机会。

详情

AI中文摘要

在线旅游平台生成大量用户生成的酒店评论，为大规模理解旅行者体验提供了丰富机会。然而，将非结构化文本反馈转化为结构化、可操作的见解仍然是一项具有挑战性的任务。本文提出了SentimentLens，一个基于方面级情感分析的可扩展分析系统，该系统从非结构化酒店评论中执行知识提取，并将其组织成可解释的服务类别。SentimentLens集成了方面术语提取、方面情感分类、语义类别分配和多层次分析模块，以支持区域级、酒店级和类别级评估。该系统设计为在不同地理环境和酒店环境中运行。为了展示其实用性，我们将SentimentLens应用于一个包含超过10,000条公开酒店评论的大型真实数据集。通过广泛分析，该框架揭示了旅行者情感如何随区域、服务类别和酒店类型而变化。我们进一步实现了文本情感与数值评分的跨模态调和，以识别潜在运营冲突、服务质量的结构性不一致性，并使用重要性-绩效和基于熵的分析确定高影响力的改进机会。结果表明，SentimentLens有效地将大规模非结构化评论转化为可操作的情报，支持酒店管理和旅游政策的数据驱动决策。虽然通过一个国家案例研究进行了演示，但所提出的系统可推广到其他目的地和评论驱动的服务领域。

英文摘要

Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experiences at scale. However, transforming unstructured textual feedback into structured, actionable insights remains a challenging task. This paper presents SentimentLens, a scalable analysis system based on Aspect-Based Sentiment Analysis that performs knowledge extraction from unstructured hotel reviews and organizes them into interpretable service categories. SentimentLens integrates aspect term extraction, aspect sentiment classification, semantic category assignment, and multi-level analytical modules to support region-level, hotel-level, and category-level evaluation. The system is designed to operate across different geographic contexts and hospitality settings. To demonstrate its practical utility, we apply SentimentLens to a large real-world dataset of over 10,000 publicly available hotel reviews. Through extensive analysis, the framework reveals how traveler sentiment varies across regions, service categories, and hotel archetypes. We further implement a cross-modal reconciliation of textual sentiment and numerical ratings to identify latent operational conflicts, structural inconsistencies in service quality, and high-impact improvement opportunities using importance--performance and entropy-based analyses. The results show that SentimentLens effectively transforms large-scale unstructured reviews into actionable intelligence, supporting data-driven decision-making for hospitality management and tourism policy. While demonstrated using a national case study, the proposed system is generalizable to other destinations and review-driven service domains.

URL PDF HTML ☆

赞 0 踩 0

2606.00083 2026-06-02 cs.LG cs.AI cs.RO 版本更新

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励：VLM奖励模型的测试时提示优化

Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Catholic University of Leuven（鲁汶天主大学）； Toyota Research Institute（丰田研究院）； Toyota Motor Europe（丰田欧洲公司）

AI总结提出Demo2Reward方法，利用少量专家演示在测试时优化VLM奖励模型的提示指令，减少假阳性并保持真阳性，无需额外训练即可提升下游策略学习。

详情

AI中文摘要

强化学习依赖于准确的奖励函数，但在现实应用（如机器人技术）中，这些函数通常是手工设计的，甚至不可用。最近的研究探索了预训练视觉-语言模型（VLM）作为奖励模型的零样本推理能力。然而，如果没有仔细的提示工程，这些方法往往会产生次优的奖励，其中假阳性预测会严重降低下游策略学习。在机器人技术中，通常收集包含专家演示的有限数据集来引导策略学习。这种场景提供了在策略训练之前优化奖励模型的机会。我们提出Demo2Reward，一种测试时自适应技术，基于少量演示（3-10条轨迹）优化奖励模型的语言指令，以减少假阳性同时保持真阳性。关键是，这在策略学习期间不需要额外的模型训练或计算资源。我们表明，Demo2Reward在一系列模拟机器人任务和策略骨干上始终优于现有的零样本和少样本VLM奖励模型。最后，我们证明Demo2Reward有效迁移到真实世界的机器人学习场景，无需手动设计奖励函数即可实现策略学习。

英文摘要

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior policy training. We propose Demo2Reward a test-time adaptation technique to optimize the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

URL PDF HTML ☆

赞 0 踩 0

2606.00082 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Hoeffding Concept Bottleneck Models with Applications to Overhead Images

Hoeffding概念瓶颈模型及其在俯视图像中的应用

Clément Bénard, Manon Arfib, Christophe Labreuche, Victor Quétu

发表机构 * Thales cortAIx-Labs（泰雷兹 cortAIx 实验室）； Université Paris-Saclay, CentraleSupélec（巴黎-萨克雷大学，中央理工-巴黎高等学院）

AI总结针对线性概念瓶颈模型可解释性差和信息泄露问题，提出基于Hoeffding泛函分解的非线性稀疏聚合方法HCBM，并证明其对概念间泄露的鲁棒性，在分类和俯视图像目标检测任务中优于传统线性CBM。

详情

AI中文摘要

深度学习算法的可解释性对于高风险决策的计算机视觉应用至关重要。概念瓶颈模型（CBM）最近在基于高级概念瓶颈的分类问题上展示了提供可解释且准确预测的潜力。现有的CBM方法依赖概念分数的线性聚合来计算预测。然而，这种线性方法通常使用大量概念，这削弱了可解释性并有利于信息泄露。通常，概念与输出logits之间的潜在关系不是线性的。因此，我们引入了Hoeffding概念瓶颈模型（HCBM），该模型基于梯度提升树的Hoeffding泛函分解，提供概念分数的非线性和稀疏聚合，并使用素蕴含生成紧凑预测。HCBM被证明对概念间泄露具有鲁棒性，并在大量实验中优于标准线性CBM。除了分类，HCBM还可以适应目标检测，我们专注于一个具有挑战性的俯视图像案例，以展示HCBM在这些设置中的高性能。

英文摘要

Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate predictions for classification problems, based on a bottleneck of high-level concepts. Existing CBM methods rely on a linear aggregation of the concept scores to compute predictions. However, a large number of concepts is often used in this linear approach, which undermines explainability and favors information leakage. In general, the underlying relation between concepts and output logits is not linear. Therefore, we introduce Hoeffding Concept Bottleneck Models (HCBM), which build on the Hoeffding functional decomposition of gradient-boosted trees to provide non-linear and sparse aggregations of concept scores, and generate compact predictions using prime implicants. HCBM are proved to be robust to interconcept leakage, and outperform standard linear CBM in practice, as shown in extensive experiments. Beyond classification, HCBM can be adapted to object detection, and we focus on a challenging case with overhead images to show the high performance of HCBM in these settings.

URL PDF HTML ☆

赞 0 踩 0

2606.00081 2026-06-02 cs.LG cs.AI cs.SD 版本更新

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer: 一种融合统计特征的混合多分支Transformer用于DAS模式识别

Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche

发表机构 * IMT Nord Europe, Institut Mines-Telecom, Univ. Lille, Centre for Digital Systems Lille, France（IMT北欧学院，法国电信研究院，里尔大学，数字系统研究中心，法国）； IMT Mines Ales, Institut Mines-Telecom, Ales, France（IMT阿尔勒学院，法国电信研究院，阿尔勒，法国）

AI总结针对DAS数据高维度和复杂时空模式问题，提出DAStatFormer混合多分支Transformer，通过提取24个ANOVA选择的统计特征并采用门控Transformer网络，在降低数据量级的同时实现高达99.4%的准确率。

详情

AI中文摘要

分布式声学传感（DAS）通过光纤实现大规模监测，但其高维度和复杂的时空模式使得事件分类具有挑战性。现有的深度学习方法——CNN、循环模型和Transformer变体——要么无法捕获长程依赖，要么需要以高昂成本处理原始DAS矩阵。我们提出DAStatFormer，一种混合多分支Transformer，将紧凑的多域统计特征与门控Transformer网络相结合。我们不是使用原始信号，而是从每个通道的时域、波形和频域提取24个ANOVA选择的属性，将数据量减少数个数量级，同时保留判别信息。每个域通过专用的逐步骤和逐通道注意力分支处理，并通过自适应门控机制融合。在开放的$\Phi$-OTDR基准测试和真实场景DAS数据集上的实验表明，DAStatFormer实现了高达99.4%的准确率和接近完美的实际性能，同时使用的参数和推理成本显著低于DASFormer和DeepViT等模型。这些结果证明了其适用于可扩展、实时的DAS监测。我们在https://github.com/MichelD-git/DAStatFormer发布代码。

英文摘要

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches-CNNs, recurrent models, and Transformer variants-either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open $Φ$-OTDR benchmark and a real-scenario DAS dataset show that DAS-tatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at https://github.com/MichelD-git/DAStatFormer

URL PDF HTML ☆

赞 0 踩 0

2606.00080 2026-06-02 cs.CV cs.AI cs.LG cs.NE 版本更新

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Planktonzilla: 用于理解浮游生态系统的多模态数据集与模型

Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

发表机构 * Inria Chile Research Center（Inria智利研究中心）

AI总结为解决浮游生物分类模型泛化性差的问题，提出统一数据集Planktonzilla-17M（含1740万张图像，涵盖602个分类类群），并对比监督学习与CLIP风格训练，发现基于分类谱系的监督学习优于CLIP，且现有生物基础模型在海洋成像领域表现不佳。

详情

AI中文摘要

海洋浮游生物支撑着水生食物网，并在全球二氧化碳封存中发挥关键作用，因此可靠的物种识别对于理解海洋健康和气候反馈至关重要。现有的分类模型在单个数据集上表现良好，但由于训练数据集孤立且标签不一致，无法跨仪器和环境泛化。为解决这一问题，我们引入了Planktonzilla-17M，这是一个统一的数据集，整合了来自13个成像系统的公开浮游生物图像集合。它包含1740万张图像，具有标准化的分类学和地理环境元数据，其中包括374万张浮游生物图像，涵盖602个分类类群，其中201个在物种级别被识别，使其成为迄今为止最大、最全面的浮游生物图像数据集。利用这一大规模数据集，我们在共享ViT骨干网络上进行了监督学习与CLIP风格图像-文本训练的对比实验。我们发现，当使用分类谱系作为文本时，监督分类器的表现与CLIP风格训练相当或更优。我们进一步观察到，BioCLIP和BioCLIP2在零样本和少样本设置下对浮游生物表现不佳。利用Planktonzilla-17M提高了浮游生物分类性能，凸显了当前生物基础模型在海洋成像领域的局限性。

英文摘要

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image--text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.

URL PDF HTML ☆

赞 0 踩 0

2606.00079 2026-06-02 cs.LG cs.AI 版本更新

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE: 面向MoE大语言模型量化的频谱能量引导比特分配

Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu

发表机构 * School of Microelectronics, University of Science and Technology of China（中国科学技术大学微电子学院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）； School of Electrical and Electronic Engineering, Nanyang Technological University（南洋理工大学电子与电气工程学院）

AI总结提出BitsMoE框架，通过SVD分解和频谱能量引导的混合精度比特分配，解决MoE模型超低位量化中的精度损失问题，在Qwen3-30B-A3B-Base上2比特量化下准确率提升27.83个百分点。

Comments 29 pages, 6 figures, 9 tables. Code and models are available at https://github.com/zjiayu064/BitsMoE

详情

AI中文摘要

混合专家（MoE）大语言模型通过稀疏专家激活减少了每词元的计算量，但由于所有专家权重必须常驻内存，其部署仍然占用大量内存。现有的MoE压缩方法在超低位宽场景下表现不佳：剪枝不可逆地移除模型容量，而粗粒度量化无法根据异构的专家和权重方向重要性分配比特。我们提出BitsMoE，一种面向MoE大语言模型量化的频谱能量引导比特分配框架。BitsMoE通过SVD将每个MoE层分解为共享基和专家特定的频谱因子，保留共享基不进行量化以保持跨专家的共同结构，并使用专家特定因子作为细粒度量化单元。为确定每个单元的比特宽度，BitsMoE将频谱混合精度量化建模为激活感知的重建替代问题，并求解一个整数线性规划，在固定比特预算下最小化估计的重建损失。在多个MoE大语言模型上的实验表明，BitsMoE在超低位宽场景下显著降低了下游任务准确率下降。在Qwen3-30B-A3B-Base上进行2比特量化时，BitsMoE相比GPTQ加速量化12.3倍，平均准确率提升27.83个百分点，解码速度提升1.76倍。我们的模型和代码已在https://github.com/zjiayu064/BitsMoE公开。

英文摘要

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3$\times$, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76$\times$ over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.

URL PDF HTML ☆

赞 0 踩 0

2606.00078 2026-06-02 cs.CV cs.AI 版本更新

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

基于流的生成建模优化压缩感知应用中的采样策略

Roman Pavelkin, Luis A. Zavala-Mondragon, Christiaan G. A. Viviers, Fons van der Sommen

发表机构 * Eindhoven University of Technology（埃因霍温理工大学）

AI总结提出一种任务感知的基于流的生成框架，通过训练流模型优化压缩感知中的子采样掩码，显著提升图像分类、重建和MRI加速的性能。

详情

AI中文摘要

信号处理和医学成像中的许多现代应用需要在严格的资源约束下获取高维信号。传统采样理论表明，准确重建信号所需的测量次数与信号的维数成正比，这一要求往往过于昂贵或不切实际。压缩感知通过证明稀疏信号可以在较少的测量下恢复（前提是测量算子满足某些条件）挑战了这一观念。这项概念验证研究提出了一种任务感知的基于流的生成框架——对传统流匹配训练范式的重新表述，其中流模型被训练用于优化压缩感知应用中的子采样。我们建立了所提出的学习子采样掩码框架的基本可行性，该框架显著提升了压缩感知在图像分类、图像重建和MRI加速中的性能。在图像重建任务中，我们的方法展示了最先进的性能，在CelebA数据集上以5%的子采样率实现了25.17 dB的峰值信噪比，在重建8倍加速的MRI测量（fastMRI数据集）时以最小的计算开销达到了29.24 dB。这些结果突显了生成流模型中任务条件化的有效性，并揭示了表示学习策略的一个有前景的方向。总体而言，所提出的框架提供了一种统一、灵活的方法来设计数据和任务驱动的感知方案，有望适用于广泛的逆问题。

英文摘要

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal's ambient dimension, a requirement often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered with fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework -- a reformulation of the conventional Flow Matching training paradigm with a flow model trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework of learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrated state-of-the-art performance, achieving Peak Signal-to-Noise Ratio of 25.17 dB at the subsampling rate of 5\% on the CelebA dataset and 29.24 dB when reconstructing $8\times$ accelerated MRI measurements (fastMRI dataset) with the minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can be potentially adapted to a broad range of inverse problems.

URL PDF HTML ☆

赞 0 踩 0

2606.00077 2026-06-02 cs.CV cs.AI 版本更新

Improved Belief-Attention in Vision Task

视觉任务中的改进信念注意力

Guoqiang Zhang

发表机构 * University of Exeter（埃克塞特大学）

AI总结提出Belief2-Attention，通过同时利用垂直分量和投影分量扩展信念注意力，并引入额外内积矩阵增强标记相关性，提升视觉任务性能。

详情

AI中文摘要

最近，Belief-Attention \cite{Guoqiang25BeliefAttention} 被提出，它首先对基于 softmax 的 $V$ 向量加权求和进行关于原始 $V$ 向量的正交投影，然后将垂直分量作为 Transformer 中的残差信号以提升性能。在本文中，我们首先进行消融研究，表明投影分量也携带关于标记相关性的信息，不应被忽略。然后，我们提出通过同时利用垂直分量和投影分量来扩展 Belief-Attention。具体地，投影分量经过某种激活函数，然后进行线性映射，再与所考虑的标记合并。概念上讲，投影分量的神经块可以视为新注意力块内的两层前馈网络（FFN）。此外，注意到标准注意力通过内积矩阵 $QK^T$ 捕获标记相关性。我们提出向 $QK^T$ 引入额外的内积矩阵 $ZZ^T$ 以捕获更丰富的标记相关性。我们将新模块称为 Belief2-Attention。可以很容易地证明 Belief2-Attention 比标准注意力更具表达能力。然后，我们验证了 Belief2-Attention 在图像分类和分割等视觉任务中的有效性。

英文摘要

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then taking the perpendicular component as the residual signal in Transformer for performance improvement. In this paper, we first conduct an ablation study showing the projected component also carries information about the token correlation, which should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component goes through certain activation function and then a linear mapping before merging with the considered token. Conceptually speaking, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. It is also noted that standard attention captures the token correlation via the inner-product matrix $QK^T$. We propose to introduce an additional inner-product matrix $ZZ^T$ to $QK^T$ to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard Attention. We then verify the effectiveness of Belief2-Attention for vision tasks of image classification and segmentation.

URL PDF HTML ☆

赞 0 踩 0

2606.00074 2026-06-02 eess.SP cs.AI cs.LG 版本更新

CLSP-REQA: A Real-Time Quality-Aware Closed-Loop Seizure Prediction Framework with Mamba-BiLSTM and Confidence-Gated Intervention

CLSP-REQA：基于Mamba-BiLSTM和置信门控干预的实时质量感知闭环癫痫发作预测框架

Mufeng Chen, Qi Wu, Bingchao Huang, Xiwen Lai, Zekai Chen, Xinge Ouyang, Quansheng Ren

发表机构 * Department of Engineering Science, University of Oxford（牛津大学工程科学系）； Mathematical Institute, University of Oxford（牛津大学数学研究所）； School of Computer Science and Engineering, Beihang University（北航计算机科学与工程学院）； Aerospace Information Research Institute, Chinese Academy of Sciences（中国科学院航天信息研究所）； Department of Mechanical Engineering, The University of British Columbia（不列颠哥伦比亚大学机械工程系）； College of Life Sciences, Hunan Normal University（湖南师范大学生命科学学院）； School of Electronics, Peking University（北京大学电子学院）

AI总结提出CLSP-REQA框架，通过嵌入实时EEG质量评估模块和Mamba-BiLSTM骨干网络，结合分层非线性融合函数，在严格跨患者评估下实现优于现有方法的癫痫发作预测性能。

Comments 27 pages, 8 figures, submitted to Biomedical Signal Processing and Control

详情

AI中文摘要

可靠的癫痫发作预测是闭环神经刺激治疗的前提，然而现有方法很少考虑实际部署中EEG信号质量的可变性，并且绝大多数采用非严格的评估协议，高估了泛化性能。我们提出了CLSP-REQA（具有实时EEG质量评估的闭环癫痫发作预测），这是一个统一框架，将轻量级信号质量估计器直接嵌入预测流程中。实时EEG质量评估（REQA）模块与Mamba-BiLSTM骨干网络并行运行，产生一个标量质量分数q ∈ [0,1]，通过分层非线性融合函数（ECLO）调节输出置信度。在CHB-MIT头皮EEG数据库（n=23名受试者，198次发作）的严格跨患者评估下，CLSP-REQA实现了0.7426 ± 0.0199的AUC-ROC，优于Jemal等人报告的未适应跨患者基线0.69，仅使用16个EEG通道（先前工作为23个），且无需任何目标患者数据或域适应。在SIENA头皮EEG数据库（n=14名受试者，47次发作）上，CLSP-REQA实现了0.7012 ± 0.0249的AUC，大幅超过同一数据集上最佳域适应跨患者结果0.61，展示了强大的跨数据集泛化能力。该框架输出结构化四元组(p, q, c, Phi_SHAP)，可直接与闭环神经刺激器接口兼容。

英文摘要

Reliable seizure prediction is a prerequisite for closed-loop neurostimulation therapy, yet existing methods rarely account for the variability in EEG signal quality encountered in real-world deployment, and the overwhelming majority adopt non-strict evaluation protocols that overestimate generalisation performance. We propose CLSP-REQA (Closed-Loop Seizure Prediction with Real-time EEG Quality Assessment), a unified framework that embeds a lightweight signal quality estimator directly within the prediction pipeline. A Real-time EEG Quality Assessment (REQA) module runs in parallel with a Mamba-BiLSTM backbone, producing a scalar quality score q in [0,1] that modulates output confidence through a tiered non-linear fusion function (ECLO). Under strict cross-patient evaluation on the CHB-MIT Scalp EEG Database (n = 23 subjects, 198 seizures), CLSP-REQA achieves an AUC-ROC of 0.7426 +- 0.0199, outperforming the unadapted cross-patient baseline of 0.69 reported by Jemal et al., using only 16 EEG channels compared to 23 in prior work, and without requiring any target-patient data or domain adaptation. On the SIENA Scalp EEG Database (n = 14 subjects, 47 seizures), CLSP-REQA achieves AUC 0.7012 +- 0.0249, substantially surpassing the best domain-adapted cross-patient result of 0.61 on the same dataset, demonstrating strong cross-dataset generalisation. The framework outputs a structured four-tuple (p, q, c, Phi_SHAP) directly compatible with closed-loop neurostimulator interfaces.

URL PDF HTML ☆

赞 0 踩 0

2606.00073 2026-06-02 cs.NE cs.AI cs.LG 版本更新

Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks

罕见事件，真实信号：深度脉冲网络中的功能集合作为计算单元

Aditi Aravind, Konstantinos Ladakis, Mario Alexios Savaglio, Stelios M. Smirnakis, Maria Papadopouli

发表机构 * University of Crete（希腊克里特大学）； Foundation of Research & Technology - Hellas（希腊研究与技术基金会）； Archimedes Research Unit（阿基米德研究单位）； Harvard Medical School（哈佛医学院）； Brigham and Women’s Hospital（布莱根妇女医院）

AI总结通过引入功能连接性分析框架，研究深度脉冲神经网络中功能集合的涌现特性，发现一阶功能连接集合的协同放电可靠预测下游神经元响应，且信息编码集中在罕见但高度协调的活动模式中。

详情

AI中文摘要

我们通过引入一个受神经科学启发的框架，从功能连接性的角度分析深度脉冲神经网络（SNN），研究内部表征如何在层次化处理系统中涌现。借鉴系统神经科学和信息论的概念，我们基于一个神经元与训练好的SNN架构中前一层神经元的统计显著成对相关性，形成该神经元的一阶功能连接（1FC）组。然后，我们在各种条件下的推理过程中跟踪其响应特性。我们的分析表明，先前在生物皮层中观察到的功能连接性的几个原理在脉冲ResNet架构中得以保留。这些1FC集合表现出有趣的特性：它们的聚合协同放电通过一个鲁棒的、类似ReLU的输入输出关系可靠地预测下游神经元响应，其增益随集合大小系统性缩放。仅在高的1FC协同放电事件期间才出现所呈现类别的可靠编码，而这些事件本身发生频率较低，表明信息表征集中在罕见但高度协调的活动模式中。在均匀随机噪声或对抗性扰动下，这些响应轮廓被破坏，尤其是在早期和中间层。这使得能够在特定节点和路径上进行有针对性的高分辨率探查。我们表明，功能连接结构由学习塑造，并且在权重置换下该结构被破坏。这些确立了1FC集合作为输入编码和信息传递的功能上有意义的基质，对设计针对信息流的有针对性的细粒度诊断具有潜在意义。

英文摘要

We investigate how internal representations emerge across hierarchical processing systems by introducing a neuroscience-inspired framework for analyzing deep spiking neural networks (SNN) through the lens of functional connectivity. Drawing on concepts from systems neuroscience and information theory, we form the first-order functionally-connected (1FC) group of a neuron based on its statistically significant pairwise correlations with neurons from the previous layer of a trained SNN architecture. We then track its response properties during inference under various conditions. Our analysis shows that several principles of functional connectivity previously observed in biological cortex are preserved in spiking ResNet architectures. These 1FC ensembles display interesting properties: their aggregate cofiring reliably predicts downstream neuronal responses through a robust, ReLU-like input-output relationship, whose gain scales systematically with ensemble size. Reliable encoding of the presented class emerges only during high 1FC cofiring events, which themselves occur infrequently, indicating that informative representations are concentrated in rare but highly coordinated activity patterns. Under uniform random noise or adversarial perturbations, these response profiles are disrupted, particularly in early and intermediate layers. This enables a targeted high-resolution interrogation at specific nodes and pathways. We showed that the functional connectivity structure is shaped by learning and this structure breaks under weight permutation. These establish 1FC ensembles as a functionally meaningful substrate for input encoding and information transfer, with potential implications in designing targeted fine-grained diagnostics on the information flow.

URL PDF HTML ☆

赞 0 踩 0

2606.00065 2026-06-02 cs.IR cond-mat.mtrl-sci cs.AI cs.CL 版本更新

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

超越文本与表格：ComProScanner中视觉-语言模型集成实现从科学图表中高精度提取材料数据

Aritra Roy, Enrico Grisan, Chiara Gattinoni, John Buckeridge

发表机构 * Energy, Materials and Environment Research Centre, London South Bank University, London SE1 0AA, UK（能源、材料与环境研究中心，伦敦南银行大学）； School of Engineering and Design, London South Bank University, London SE1 0AA, UK（工程与设计学院，伦敦南银行大学）； Bioscience and Bioengineering Research Centre, London South Bank University, London SE1 0AA, UK（生物科学与生物工程研究中心，伦敦南银行大学）； Department of Physics, Kings College London, London WC2R 2LS, UK（物理系，伦敦国王学院）

AI总结本文通过集成视觉-语言模型扩展ComProScanner框架，实现了从科学图表中自动提取成分-性能数据，在压电陶瓷数据集上达到0.97的组成准确率和归一化F1分数，并引入基于范围的误差阈值评估方法。

Comments 18 pages, 3 figures

详情

AI中文摘要

基于大语言模型流水线的自动提取科学文献中材料成分-性能数据的方法已取得显著进展；然而，现有框架仍局限于文本和表格内容，忽视了仅在科学图表中报告的大量定量性能数据。本文扩展了ComProScanner——一个用于自动构建成分-性能数据库的完全端到端多智能体框架，为其增加了基于原生视觉-语言模型（VLM）的图表提取能力。该扩展引入了一个FigureExtractor工具，用于基于标题关键词对所有支持的出版商进行图表过滤，以及一个GraphExtractorTool智能体，它将提取的图表传递给可配置的VLM，以从科学图表和绘图中恢复成分-性能对。基于LMArena Diagram排行榜和每百万token输入成本低于1.50美元的标准，选择了四个VLM进行评估。在来自已建立的d33测试语料库的50篇压电陶瓷文章上的基准测试表明，Gemini-3-Flash-Preview实现了最高性能，组成准确率为0.97，归一化F1分数为0.97，同时仍然是四个评估模型中成本效益最高的。此外，我们在评估框架中引入了一个基于范围的值误差阈值参数，与精确值匹配相比，提供了对从图表中提取的数值性能数据更具物理意义的评估。这些贡献使集成VLM的ComProScanner成为第一个针对材料科学、完全自动化、多模态的文献挖掘平台，能够在单一统一流水线中从文本、表格和图表中提取结构化的成分-性能数据。

英文摘要

Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of large language model-based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end-to-end multi-agent framework for automated composition-property database construction, with a native vision-language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption-keyword-based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition-property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than \$1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established $d_{33}$ test corpus demonstrates that Gemini-3-Flash-Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost-effective model among the four evaluated. We additionally introduce a range-based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM-integrated ComProScanner as the first materials-specific, fully automated, multimodal literature mining platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.

URL PDF HTML ☆

赞 0 踩 0

2606.00056 2026-06-02 cs.CE cs.AI cs.LG physics.app-ph 版本更新

Physics-Informed Neural Networks for Radial Consolidation of Combined Electroosmotic, Vacuum and Surcharge Preloading Considering Smear Effects

考虑涂抹效应的电渗-真空-堆载联合预压径向固结的物理信息神经网络

Dong Li, Yapeng Cao, Shuai Huang, Yujun Cui, Haiping Fu, Lu Yang, He Wei

发表机构 * Department of Civil, Environmental, and Infrastructure Engineering, George Mason University（乔治·马歇尔大学土木、环境与基础设施工程系）； State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences（中国科学院寒区工程与冻土科学国家重点实验室，西北生态环境资源研究院）； Laboratoire Navier/CERMES, École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris（巴黎理工学院劳纳实验室/塞梅斯实验室，法国国家桥梁与道路学院）； College of Water Conservancy and Hydropower Engineering, Hohai University（河海大学水利水电学院）； School of Geosciences and Info-physics, Central South University（中南大学地球科学与信息物理学院）

AI总结提出一种无量纲多域物理信息神经网络框架，通过改进的门控硬约束边界编码模型解决电渗径向固结问题，在时变荷载下实现高精度预测。

详情

AI中文摘要

本研究开发了一个无量纲多域物理信息神经网络（PINN）框架，用于考虑涂抹效应和真空-堆载联合预压的电渗径向固结。研究了三种基于PINN的模型：标准软约束PINN（Std-PINN）、改进的门控PINN（Mod-PINN）以及具有硬约束边界编码的改进门控PINN（Mod-HC-PINN）。这些模型在四种荷载工况下与有限元参考解进行了对比评估，包括恒定真空、指数真空、指数真空加斜坡堆载以及指数真空加循环半正弦堆载。结果表明，Mod-PINN中采用的门控架构提高了恒定真空荷载下阴极和涂抹区界面附近陡峭压力梯度的分辨率。在时变荷载下，软约束的Mod-PINN由于必须同时学习多个竞争目标而精度降低。Mod-HC-PINN通过将阴极边界和初始条件嵌入输出结构，减轻了这一问题，从而降低了优化负担并提高了物理一致性。Mod-HC-PINN在指数真空、斜坡堆载和循环堆载工况下的平均绝对误差（MAE）分别为0.43、0.41和0.27 kPa。敏感性分析进一步表明，所提出的框架在网络架构、配置点密度和渗透率对比的实际范围内保持稳健。

英文摘要

This study develops a dimensionless multi-domain physics-informed neural network (PINN) framework for electro-osmotic radial consolidation considering smear effects and combined vacuum and surcharge loading. Three PINN-based models are investigated: a standard soft-constrained PINN (Std-PINN), a modified gated PINN (Mod-PINN), and a modified gated PINN with hard-constraint boundary encoding (Mod-HC-PINN). The models are evaluated against FEM reference solutions under four loading cases, including constant vacuum, exponential vacuum, exponential vacuum with ramp surcharge, and exponential vacuum with cyclic haversine surcharge. The results indicate that the gated architecture applied in Mod-PINN improves the resolution of steep pressure gradients near the cathode and smear-zone interface under constant vacuum loading. Under time-dependent loading, the soft-constrained Mod-PINN shows reduced accuracy because it must learn multiple competing objectives simultaneously. The Mod-HC-PINN mitigates this issue by embedding the cathode boundary and initial conditions into the output structure, thereby reducing the optimization burden and improving physical consistency. The Mod-HC-PINN achieves MAE values of 0.43, 0.41, and 0.27 kPa for the exponential vacuum, ramp surcharge, and cyclic surcharge cases, respectively. Sensitivity analyses further demonstrate that the proposed framework remains robust across practical ranges of network architecture, collocation density, and permeability contrast.

URL PDF HTML ☆

赞 0 踩 0

2606.00054 2026-06-02 cs.RO cs.AI cs.CV 版本更新

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

从人类视频到机器人操作：基于人类中心数据的可扩展视觉-语言-动作学习综述

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University（清华大学）； HKUST（香港科技大学）； Xi’an Jiaotong University（西安交通大学）； Fudan University（复旦大学）； Microsoft Research Asia（微软亚洲研究院）； Peking University（北京大学）； Microsoft Zurich Project（微软苏黎世实验室）

AI总结本文综述了如何将丰富的人类视频转化为视觉-语言-动作（VLA）模型的有效知识，分类了四种方法（潜在动作表示、预测世界模型、显式2D监督、显式3D重建），并指出了结构化非结构化视频、跨具身和视角的动作映射、以及评估协议设计三大挑战。

Comments Accepted to IJCAI 2026 Survey Track. Project page: https://aaronfengzy.github.io/HumanCentricToVLA-Survey/

详情

AI中文摘要

近期在可泛化具身控制方面的进展由大规模预训练的视觉-语言-动作（VLA）模型驱动。然而，大多数现有方法依赖于大量机器人演示数据，这些数据获取成本高昂且与特定具身紧密耦合。相比之下，人类视频丰富且捕捉了丰富的交互，为真实世界操作提供了多样的语义和物理线索。然而，具身差异以及任务对齐标注的频繁缺失使得它们直接用于VLA模型具有挑战性。本综述提供了一个统一的视角，探讨如何将人类视频转化为VLA模型的有效知识。我们根据所提取的动作相关信息将现有方法分为四类：(i) 编码帧间变化的潜在动作表示；(ii) 预测未来帧的预测世界模型；(iii) 提取图像平面线索的显式2D监督；(iv) 恢复几何或运动的显式3D重建。除分类外，我们强调了该领域的三个关键开放挑战：将非结构化视频结构化为可训练的片段、在具身和视角异质性下将视频导出的监督接地到机器人可执行动作中，以及设计能更好预测真实世界部署性能和迁移效率的评估协议，从而为未来研究方向提供参考。论文和资源的精选列表见 https://github.com/AaronFengZY/HumanCentricToVLA-Survey。

英文摘要

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.

URL PDF HTML ☆

赞 0 踩 0

2606.00052 2026-06-02 cs.AI cs.LG 版本更新

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

产品感知深度自编码器用于多产品信息物理系统的鲁棒过程监控

MD Shafikul Islam, Jordan Carden

发表机构 * University of Cambridge（剑桥大学）

AI总结针对多产品制造中全局模型因决策边界扩大而产生盲点的问题，提出产品感知自编码器，通过限制学习域到产品特定分布来提升异常检测鲁棒性，在扩展田纳西伊士曼过程基准上实现100%攻击检测。

详情

AI中文摘要

随着工业4.0加速信息物理系统在制造业中的集成，鲁棒异常检测对于确保过程安全与安保变得至关重要。当前的数据驱动方法通常采用“产品无关”或全局模型，这些模型在所有正常操作数据的聚合上训练。然而，现代工业设施经常在不同的产品等级下运行。虽然计算简单，但这些全局模型本质上会扩展其决策边界以适应多种模式的方差，从而产生一个“盲点”，其中微妙的异常或针对性的信息物理攻击可能被模型的宽接受区域所掩盖。在这项工作中，我们首先证明了上述漏洞存在于跨多个产品等级运行的全局无关模型中。然后，我们提出了一种产品感知自编码器作为原则性的缓解措施，将学习域限制在等级特定的分布上。虽然这种方法降低了已识别的盲点风险，但我们并不声称它是所有可能替代方案中的最优缓解措施。我们使用扩展的田纳西伊士曼过程基准对这种方法进行了严格的验证，并与全局无关基线进行了比较。我们的实证结果表明，产品感知框架在标准检测指标上与全局基线表现相当，同时提供了对产品等级特定操作模式的改进鲁棒性。最关键的是，模拟我们假设的攻击场景的压力测试显示，虽然全局模型在77.8%的场景中未能检测到操作偏差，但产品感知系统实现了100%的检测准确率。这些发现表明，在柔性制造环境中，广义异常检测器可能带来非平凡的安全风险，促使向模式感知诊断架构的转变。

英文摘要

As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical for ensuring process safety and security. Current data-driven approaches typically employ "product-agnostic" or global models trained on the aggregate of all normal operating data. However, modern industrial facilities frequently operate under diverse product grades. While computationally simple, these global models inherently expand their decision boundaries to accommodate the variance of multiple modes, creating a "blind spot" where subtle anomalies or targeted cyber-physical attacks may be masked by the wide acceptance region of the model. In this work, we first demonstrate that the vulnerability described above is present in global-agnostic models operating across multiple product grades. We then present a Product-Aware Autoencoder as a principled mitigation that restricts the learning domain to grade-specific distributions. While this approach reduces the identified blind-spot risk, we do not claim it as the optimal mitigation among all possible alternatives. We rigorously validate this approach against a Global Agnostic baseline using the Extended Tennessee Eastman Process (TEP) benchmark. Our empirical results indicate that the Product-Aware framework performs comparably to the global baseline on standard detection metrics, while offering improved robustness to product-grade-specific operating modes. Most critically, stress tests simulating our hypothetical attack scenarios reveal that while the global model fails to detect operational deviations in 77.8% of the scenarios, the product-aware system achieves 100% detection accuracy. These findings suggest that, in flexible manufacturing environments, generalized anomaly detectors can pose non-trivial security risks, motivating a shift toward mode-aware diagnostic architectures.

URL PDF HTML ☆

赞 0 踩 0

2606.00051 2026-06-02 cs.CY cs.AI 版本更新

Business Utility of Large Language Models as Exploratory Data Analysis Agents

大型语言模型作为探索性数据分析代理的商业实用性

Rafał Łabędzki, Patryk Miziuła, Hubert Rutkowski, Szymon Betlewski, Cezary Depta, Szymon Janowski, Jarosław Kochanowicz, Jan Kanty Milczek

发表机构 * deepsense.ai ； SGH Warsaw School of Economics（SGH沃兹尼亚克经济学院）； Bydgoszcz University of Science and Technology（比得戈茨茨理工大学）； Google（谷歌）

AI总结通过基于代理的供应链模拟基准，评估LLM作为EDA代理在商业环境中的平均性能与可重复性，提出风险调整指标Business utility，发现多数配置不可靠，GPT-5.4表现最佳。

详情

AI中文摘要

大型语言模型（LLM）越来越多地被用于分析工作流，但它们在商业环境中作为探索性数据分析（EDA）代理的适用性仍不确定。在实践中，一个可部署的EDA代理不仅必须提供有用的平均性能，还必须提供足够的可重复性以支持对其输出的信任。我们在一个受控的、与商业相关的基准上评估了这一要求，该基准基于基于代理的供应链模拟。任务是通过从间接操作痕迹而非显式标签进行推理，识别导致低质量和下游销售损失的供应商-产品组合。来自八个模型家族的十五种模型变体配置在四种实验条件下进行了评估，这些条件改变了数据表示、提示清晰度和信号强度，每种条件有五个轨迹。输出使用Jaccard指数与确定性真实值进行评分，并通过一个框架进行评估，该框架结合了平均得分（ms）、变异系数（CV）、探索性跨条件显著性检验以及商业实用性（Business utility），这是我们提出的一个风险调整指标，用于在单一操作度量中总结质量和可重复性。结果表明，大多数配置对于自主EDA使用来说不够可靠，即使它们的平均得分看起来可以接受。具有超高推理努力的GPT-5.4实现了最强的整体表现，实验平均ms为0.8748，实验平均商业实用性为0.6952，而次优配置在可变性折扣后损失了更多的实用性。我们的发现表明，对EDA代理的评估应将平均质量、可重复性和条件敏感性视为操作可信度的互补维度。

英文摘要

Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents in business settings remains uncertain. In practice, a deployable EDA agent must provide not only useful average performance but also sufficient repeatability to support trust in its outputs. We evaluate this requirement in a controlled, business-relevant benchmark built on an agent-based supply chain simulation. The task is to identify supplier-product combinations responsible for low quality and downstream sales loss by reasoning from indirect operational traces rather than from explicit labels. Fifteen model-variant configurations from eight model families were evaluated under four experimental conditions that varied data representation, prompt clarity, and signal strength, with five trajectories per condition. Outputs were scored against deterministic ground truth using the Jaccard index and assessed through a framework that combines mean score (ms), coefficient of variation (CV), exploratory cross-condition significance tests, and Business utility, a risk-adjusted metric that we propose to summarise quality and repeatability in a single operational measure. The results show that most configurations are not reliable enough for autonomous EDA use, even when their average scores appear acceptable. GPT-5.4 with extra-high reasoning effort achieved the strongest overall profile, with an experiment-averaged ms of 0.8748 and an experiment-averaged Business utility of 0.6952, while the next-best configurations lost substantially more utility after variability discounting. Our findings suggest that evaluation of EDA agents should treat average quality, repeatability, and condition sensitivity as complementary dimensions of operational trustworthiness.

URL PDF HTML ☆

赞 0 踩 0

2606.00050 2026-06-02 cs.AI cs.CL cs.DB cs.IR 版本更新

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

Grokers: 基于类型化知识图谱的自底向上归纳理解与写入时智能

Gregory Magarshak

发表机构 * Gregory Magarshak

AI总结提出Grokers架构，通过自底向上的依赖子图归纳遍历构建持久结构化理解，将智能推至写入时，实现零额外LM成本的查询，并证明字节同一性、累积单调性和双遍历顺序三个形式性质。

Comments 6 pages; second in a series with the Magarshak Machine / SPACER paper and the Context paper

详情

AI中文摘要

我们提出Grokers，一种通过依赖子图的自底向上归纳遍历来构建类型化知识图谱的持久结构化理解的架构。与检索增强生成（RAG）不同，后者在每个查询时支付全部理解成本，Grokers将智能推至写入时：自主的Groker代理分析类型化流图中的节点，通过受控语言模型（LM）调用提取结构化属性，并通过依赖关系归纳组合这种理解，写入丰富的类型化属性，从而以零额外LM成本服务于所有未来查询。我们证明了三个形式性质：（1）字节同一性定理，确立了从事务性维护的反规范化索引组装出的上下文块在语义变化之间的LM轮次中字节相同，使得KV缓存命中率接近100%；（2）累积单调性定理，确立了在受控智慧库增长协议下，无需LM调用即可解决交互的比例随已完成交互数量非递减；（3）双遍历顺序定理，确立了自顶向下生成和自底向上理解分别是它们在依赖DAG上各自任务的唯一正确遍历顺序，且它们的组合闭合为一个完整的生成-理解循环。我们进一步提出了一种基于嵌入的语义搜索的确定性替代方案，采用同义词缓存协议，其LM回退率在有限词汇域中收敛至零。在开源Qbix/Safebox/Safebots栈中提供了参考实现。

英文摘要

We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension cost at every query, Grokers pushes intelligence to write time: autonomous Groker agents analyze nodes in a typed stream graph, extract structured attributes via governed language model (LM) calls, and inductively compose that understanding upward through dependency relations, writing enriched typed attributes that serve all future queries at zero additional LM cost. We prove three formal properties: (1) the Byte-Identity Theorem, establishing that context blocks assembled from a transactionally-maintained denormalization index are byte-identical across LM turns between semantic changes, enabling KV-cache hit rates approaching 100%; (2) the Accumulation Monotonicity Theorem, establishing that the fraction of interactions resolved without LM calls is non-decreasing in the number of completed interactions under a governed wisdom library growth protocol; and (3) the Dual-Traversal Ordering Theorem, establishing that top-down generation and bottom-up comprehension are the unique correct traversal orderings for their respective tasks over a dependency DAG, and that their composition closes into a complete generation-comprehension cycle. We further present a deterministic alternative to embedding-based semantic search, with a synonym caching protocol whose LM fallback rate converges to zero for finite-vocabulary domains. A reference implementation is provided in the open-source Qbix / Safebox / Safebots stack.

URL PDF HTML ☆

赞 0 踩 0

2606.00049 2026-06-02 cs.CY cs.AI 版本更新

Measuring and Mitigating Bias in Code Generated by Large Language Models

测量和减轻大型语言模型生成代码中的偏见

Yuxi Chen, Yutian Tang, Timothy Storer

发表机构 * School of Computing Science, University of Glasgow（格拉斯哥大学计算机科学学院）

AI总结本文针对GPT-4o和Gemini等主流代码生成工具，提出评估框架，使用代码偏见分数和属性变化比率量化偏见，并探索四种轻量级缓解策略。

详情

AI中文摘要

大型语言模型（LLMs）在自然语言生成中的应用广受认可，并越来越多地用于代码生成任务。然而，其生成输出中的偏见问题仍然显著。本文聚焦于GPT-4o和Gemini这两个主流的代码生成工具，提出了一个评估LLM生成代码中偏见的框架，特别考察了受保护属性、提示和网络搜索能力的影响。我们使用两个指标：代码偏见分数（CBS）和属性变化比率（ACR），分别量化偏见的普遍性和不同属性的影响程度。此外，我们研究了四种轻量级缓解策略：少样本、思维链、少样本思维链和多智能体，旨在减轻生成代码中的偏见。我们的研究结果表明，即使在应用缓解策略后，偏见在不同受保护属性和数据集中仍然普遍存在，这凸显了需要更有效的方法来减少AI驱动的代码生成系统中的偏见。

英文摘要

Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code generation tasks. However, concerns about bias in their generated outputs remain significant. This paper focuses on GPT-4o and Gemini, mainstream tools for code generation, and proposes a framework for evaluating bias in LLM-generated code, specifically examining the influence of protected attributes, prompts and web-search capability. We use two metrics: the code bias score (CBS) and the attribute change ratio (ACR), to quantify the prevalence of bias and the degree of influence of different attributes, respectively. In addition, we investigate four lightweight mitigation strategies: Few-Shot, Chain-of-Thought, Few-Shot Chain-of-Thought, and Multi-agent, aimed at mitigating bias in generated code. Our findings reveal that bias remains prevalent across different protected attributes and datasets even after applying mitigation strategies, highlighting the need for more effective approaches to reduce bias in AI-driven code generation systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00047 2026-06-02 cs.CY cs.AI 版本更新

Comprehensive AI governance requires addressing non-model gains

全面的人工智能治理需要解决非模型增益问题

Arthur Goemans, Dan Altman, Noemi Dreksler, Jonas Freund, Milan Gandhi, Zhengdong Wang, Sarah Cogan, Sebastien Krier, Demetra Brady, Lewis Ho, Allan Dafoe

发表机构 * Stanford University（斯坦福大学）； UC Berkeley（加州大学伯克利分校）； Open Philanthropy（开放哲学基金会）

AI总结本文提出非模型增益的概念，包括推理增益、系统增益和资产增益，并论证这些增益会削弱以模型为中心的治理有效性，进而提出超越模型层面的治理方法。

Comments This paper has been accepted to ICML 2026 (Position paper track): https://openreview.net/forum?id=V3O1sHpKxX

详情

AI中文摘要

前沿人工智能治理通常以模型级治理范式为中心，该范式假设模型的能力概况主要取决于训练期间使用的计算和数据。本文认为，当能力进步越来越多地由“非模型增益”——即与基础模型进步无关的改进——驱动时，模型级治理的有效性会降低。我们形式化了非模型增益的概念，并提供了三种不同能力增益向量的分类：推理增益（测试时扩展计算）、系统增益（训练后增强，如脚手架）和资产增益（用受限资产增强模型）。我们展示了这些向量——以及来自具身化、持续学习和人工智能扩散的潜在未来影响——可能会破坏主要依赖于部署前评估和缓解的风险管理策略。我们概述了超越模型层面的治理方法：系统、实体、代理和云治理。最后，我们强调社会韧性作为这些治理层补充的重要性。

英文摘要

Frontier AI governance often centres on the model-level governance paradigm, which assumes that a model's capability profile is primarily a function of the compute and data used during training. This position paper argues that model-level governance becomes less effective when capability progress is increasingly driven by "non-model gains"--improvements that are independent from advances in the base model. We formalise the concept of non-model gains and provide a taxonomy of three distinct vectors of capability gain: inference gain (scaling compute at test-time), systems gain (post-training enhancements such as scaffolds), and asset gain (enhancing a model with restricted assets). We demonstrate how these vectors--alongside potential future impacts from embodiment, continual learning, and AI diffusion--may undermine risk management strategies that hinge mostly on pre-deployment evaluation and mitigation. We provide an overview of governance approaches that go beyond the model level: system, entity, agent, and cloud governance. Finally, we emphasise the importance of societal resilience as a complement to these governance layers.

URL PDF HTML ☆

赞 0 踩 0

2606.00046 2026-06-02 cs.MM cs.AI cs.CV cs.CY 版本更新

When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts

当玩笑越界：分析YouTube Shorts中的常规幽默与黑色幽默

Sydney Johns, Sanjeev Parthasarathy, Shantnu Bhalla, Vaibhav Garg

发表机构 * Virginia Polytechnic Institute and State University（弗吉尼亚理工大学）

AI总结通过构建TwistedHumor数据集（1211个YouTube Shorts及33041条评论的手工标注），结合多视角分析（LLooM概念归纳、评论情感分析、大模型评估），揭示了短格式视频中常规幽默与黑色幽默在主题、观众反应和模型检测上的差异，强调了上下文感知审核的必要性。

详情

AI中文摘要

YouTube等视频平台重塑了用户参与娱乐和信息的方式，强调简短、高参与度的内容，如Shorts。在这个生态系统中，某些内容处于灰色地带：虽然允许存在，但仍可能对部分观众产生意想不到的负面影响。为了研究这一问题，我们引入了TwistedHumor数据集，包含1,211个YouTube Shorts及其配对的33,041条相关评论，并手工标注了幽默存在性、幽默类型、伤害性、主题、修辞手法和单口喜剧背景。除了数据集构建，我们还提出了对短格式社交媒体中幽默与伤害表现的多视角分析。通过使用基于LLooM的概念归纳对视频描述进行分析，我们发现黑色幽默经常围绕批评、应对、尴尬和身份表达等主题聚集，而不是作为一个单一的类别出现。我们进一步通过关联评论分析观众反应，表明常规幽默与更积极的情感相关，而黑色幽默则收到更多混合、中性甚至有时更有毒的反馈。最后，我们评估了大语言模型与人类标注的一致性，发现它们在单口喜剧上的表现优于短笑话。综合来看，这些结果将TwistedHumor不仅定位为一个新的基准，而且是对短格式视频中幽默与伤害灰色地带的实证研究，强调了需要上下文感知的审核和更稳健的多模态评估。

英文摘要

Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging content such as Shorts. Within this ecosystem, certain content occupies a gray area where it remains allowed but may still have unintended negative effects on some audiences. To study this problem, we introduce TwistedHumor, a dataset of 1,211 YouTube Shorts paired with 33,041 related comments, with hand annotations for humor presence, humor type, harm, topic, rhetorical devices, and stand up context. Beyond dataset creation, we present a multi view analysis of how humor and harm appear in short form social media. Using LLooM based concept induction over video descriptions, we find that dark humor frequently clusters around themes of critique, coping, awkwardness, and identity expression rather than appearing as a single uniform category. We further analyze audience response through linked comments and show that regular humor is associated with more positive sentiment, while dark humor receives more mixed, neutral, and sometimes more toxic reactions. Finally, we evaluate large language models against human annotations and find that they perform better on stand up comedy compared to shorter jokes. Together, these results position TwistedHumor not only as a new benchmark, but as an empirical study of the gray area between humor and harm in short form video, highlighting the need for context aware moderation and more robust multimodal evaluation.

URL PDF HTML ☆

赞 0 踩 0

2606.00045 2026-06-02 cs.AI cs.ET quant-ph 版本更新

Universal Quantum Transformer

通用量子Transformer

Sungyong Chung, Alireza Talebpour

发表机构 * Grainger College of Engineering, Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校格拉inger工程学院土木与环境工程系）

AI总结提出通用量子Transformer（UQT），利用多量子比特系统的物理特性作为归纳偏置，通过参数化几何相位嵌入和SU(2)波干涉实现精确的数学和代数推理，在紧凑的5量子比特基板上完美学习循环模算术和非阿贝尔代数，并实现确定性泛化。

详情

AI中文摘要

经典连续空间神经网络从根本上难以锁定精确的数学对称性，如模算术和非交换代数。为了近似这些离散逻辑规则，它们通常依赖大规模参数缩放，导致即使在称为grokking的延迟泛化现象之后仍出现随机不稳定性。在这里，我们引入了通用量子Transformer（UQT），这是一种根本性的新型量子原生计算架构，利用多量子比特系统的物理性质作为精确数学和代数推理的通用归纳偏置。我们的框架并非翻译经典神经机制，而是完全依赖于参数化几何相位嵌入和$SU(2)$波干涉。我们证明了在高度紧凑的5量子比特基板上运行的量子注意力电路能够完美学习两个截然不同的形式类：循环模算术（$\mathbb{Z}_{11}$）和非阿贝尔代数（$S_4$置换群）。虽然经典注意力网络在收敛时表现出随机不稳定性，但UQT实现了数学上精确的确定性泛化。我们将这种现象称为结晶：超越了众所周知的grokking现象。关键的是，该框架通过理论上绕过经典自注意力的二次瓶颈，并通过对数压缩所需表示维度以消除经典网络固有的过度参数化，从而带来了巨大的计算和内存优势。最后，我们在嘈杂中等规模量子（NISQ）硬件上部署了该架构，证明了其在当前IBM量子计算机上的可行性。这些结果确立了参数化量子拓扑作为精确人工智能的普遍优越物理基底。

英文摘要

Classical continuous-space neural networks fundamentally struggle to lock into exact mathematical symmetries, such as modular arithmetic and non-commutative algebra. To approximate these discrete logical rules, they often rely on massive parameter scaling, resulting in stochastic instability even after delayed generalization phenomena known as grokking. Here, we introduce the Universal Quantum Transformer (UQT), a fundamentally novel, quantum-native computing architecture that uses the physical properties of multi-qubit systems as a universal inductive bias for exact mathematical and algebraic reasoning. Rather than translating classical neural mechanisms, our framework relies entirely on parameterized geometric phase embedding and $SU(2)$ wave-interference. We demonstrate that the quantum attention circuit, operating on a highly compact 5-qubit substrate, perfectly learns two highly distinct formal classes: cyclic modular arithmetic ($\mathbb{Z}_{11}$) and non-Abelian algebra (the $S_4$ permutation group). While classical attention-based networks exhibit stochastic instability at convergence, the UQT achieves mathematically exact, deterministic generalization. We refer to this phenomenon as crystallization: a step beyond the well-known phenomenon of grokking. Crucially, this framework yields massive computational and memory advantages by theoretically bypassing the quadratic bottleneck of classical self-attention, and by logarithmically compressing the required representation dimension to eliminate the massive over-parameterization inherent to classical networks. Finally, we deploy this architecture on noisy intermediate-scale quantum (NISQ) hardware, proving its viability on current IBM Quantum computers. These results establish parameterized quantum topology as a universally superior physical substrate for exact artificial intelligence.

URL PDF HTML ☆

赞 0 踩 0

2606.00044 2026-06-02 cs.CY cs.AI 版本更新

Algorithmic Authority and the Clinical Standard of Care

算法权威与临床护理标准

Aizierjiang Aiersilan

发表机构 * The George Washington University（乔治·华盛顿大学）

AI总结本文探讨人工智能在临床医学中引发的算法概率推理与医生经验直觉之间的张力，提出将AI系统视为事实上的医疗监管，并主张通过辩证的护理标准将AI-医生联合体作为单一诊断责任实体。

2606.00041 2026-06-02 cs.CY cs.AI 版本更新

Improving Hospital Process Management through Process Mining: A Case Study on COVID-19 Clinical Pathways

通过过程挖掘改进医院流程管理：COVID-19临床路径案例研究

Pasquale Ardimento, Mario Luca Bernardi, Marta Cimitile, Samuele Latorre

发表机构 * University of Bari Aldo Moro（巴里大学Aldo Moro分校）； Unisannio University of Benevento（贝内文托大学Unisannio分校）； UnitelmaSapienza Rome（罗马Sapienza大学Unitelma分校）

AI总结本研究利用COVID数据共享学习数据集，构建透明可复现的管道将临床数据转化为事件日志，通过过程发现、声明性合规检查和结果分析，揭示COVID-19护理路径中的监测主干、急诊与入院接口的变异性以及年龄和重症监护暴露导致的结果差异，支持分诊标准化、容量规划和降级协调。

2606.00040 2026-06-02 cs.CY cs.AI 版本更新

Tracing GenAI Literacy: Uncovering Student-AI Interaction Patterns in Academic Writing through Epistemic Network Analysis

追踪GenAI素养：通过认知网络分析揭示学术写作中的学生-AI交互模式

Angxuan Chen, Jiyou Jia

发表机构 * Department of Educational Technology, Graduate School of Education, Peking University（教育技术系，教育研究生院，北京大学）

AI总结本研究利用学习分析和认知网络分析，通过分析162名学生在GenAI辅助摘要写作任务中的交互日志，揭示了高素养学生采用迭代优化和策略性提问，而低素养学生依赖直接生成命令的不同交互模式。

2606.00039 2026-06-02 cs.CY cs.AI cs.HC 版本更新

Beyond Categories of Caste: Examining Caste Bias and Morality in Text-to-Image AI Models

超越种姓类别：审视文本到图像AI模型中的种姓偏见与道德

Divyanshu Kumar Singh, Dipto Das, Deepika Rama Subramanian, Koustuv Saha, Stephen Voida, Bryan Semaan

发表机构 * University of Colorado Boulder（科罗拉多大学波得尔分校）； University of Toronto（多伦多大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结通过算法审计与批判性话语分析，揭示文本到图像模型如何超越上下种姓二元对立而延续种姓偏见，并提出反种姓方法应对AI系统中的公平问题。

详情

DOI: 10.1145/3805689.3806720

AI中文摘要

文本到图像（T2I）模型在各个领域展现出有前景的实用性。然而，这类模型也在其输出中放大了有害的社会偏见。在南亚背景下，近期研究表明种姓偏见和刻板印象正通过生成式AI（GenAI）系统得以延续。尽管这些研究提供了关于GenAI系统如何使种姓歧视的隐形叙事显性化的极其相关的见解，但它们往往将种姓视为一个身份类别。因此，在本工作中，我们转变本体论，聚焦于种姓的关系性方面。这使我们能够更细致地理解T2I模型产生和延续种姓歧视的机制。通过将算法审计与批判性话语分析相结合，我们借鉴挑战婆罗门规范性的概念框架，展示种姓偏见如何超越上下种姓类别的简单二元对立而得以延续。我们的贡献有两方面。除了挑战将种姓视为类别的范畴化理解，我们还提出了一种反种姓方法，以应对AI系统中种姓偏见和公平性的问题。

英文摘要

Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal biases in their outputs. In the context of South Asia, recent work has shown caste biases and stereotypes are being perpetuated through Generative AI (GenAI) systems. While this research offers extremely relevant insight into invisibilized narratives of caste discrimination through the GenAI system, they often treat caste as an identity category. Therefore, in this work we shift our ontology to focus on the relational aspect of caste. This enables us to develop a more nuanced understanding of the mechanics of caste discrimination by and through T2I models. Combining an algorithmic audit with critical discourse analysis, we draw on a conceptual frame challenging Brahminical Normativity to show how caste biases are perpetuated beyond the simple binaries of upper vs lower-caste categories. Our contributions are two-fold. Beyond challenging the categorical understanding of caste as a category, we propose an anti-caste approach to tackle the issue of caste bias and fairness in AI systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00037 2026-06-02 cs.CY cs.AI cs.HC 版本更新

TCAR-Gen: 基于证据融合的时间图检索用于知识增强生成

Sidra Nasir, Muhammad Noman Zahid, Rizwan Ahmed Khan

发表机构 * Dipartimento di Informatica, Università di Verona（威尼斯大学计算机科学系）； School of Advanced Studies, University of Camerino（坎皮诺大学高级研究学院）； Department of Computer Science, School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan（卡拉奇工商管理学院（IBA）数学与计算机科学学院计算机科学系）

AI总结提出TCAR-Gen框架，结合查询条件图神经网络、时间证据融合和树链推理，在历史犯罪叙事问答中实现时间推理和多源证据融合，优于现有RAG方法。

详情

AI中文摘要

检索增强生成系统在回答历史犯罪案件叙述的复杂问题时，在时间推理和证据融合方面存在困难。现有方法要么独立于查询语义进行检索，要么无法连贯地整合多个证据来源。我们提出时间上下文增强检索生成（TCAR-Gen），一个结合查询条件图神经网络、时间证据融合和树链推理的框架，以将答案生成基于检索到的证据。在维多利亚犯罪日记基准上，TCAR-Gen在Recall@5上达到0.3738，在包括多跳推理和反事实问题在内的七种查询类型上优于Vanilla RAG、Temporal RAG、GraphRAG-C和GraphRAG-T。消融研究表明，上下文图、时间惩罚机制和查询条件是关键组件。跨五个语言模型（GPT-OSS 20B到TinyLlama 1.1B）的评估表明，TCAR-Gen在较小模型规模下保持稳健的检索覆盖，但生成质量随模型容量减少而显著下降。我们的工作表明，显式时间建模和多分支证据融合对于基于知识语料库的忠实、推理密集型问答至关重要。

英文摘要

Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language model (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.

URL PDF HTML ☆

赞 0 踩 0

2606.00027 2026-06-02 cs.CL cs.AI 版本更新

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

医疗大语言模型安全性、鲁棒性和公平性评估的多领域红队框架

Andrei Marian Feier, Veysel Kocaman, Yigit Gul, Ahmet Korkmaz, Alexander Thomas, Aleksei Zakharov, Jay Gil, Mehmet Butgul, David Talby

发表机构 * John Snow Labs Inc.（约翰·索克斯实验室公司）

AI总结提出一个多领域红队框架，通过690个临床场景评估11个当代大语言模型，发现平均准确率掩盖了临床意义上的风险，性能方差和最坏情况失败比平均准确率更能反映可靠性，混合评估方法对可信安全评估至关重要。

Comments 10 pages, 4 figures. To be presented at the Text2Story 2026 Workshop (Delft, The Netherlands, 29 March 2026); CEUR Workshop Proceedings (forthcoming). Affiliation: John Snow Labs Inc

详情

AI中文摘要

大语言模型（LLM）在医疗领域的部署日益增多，但现有基准未能捕捉临床实践中常见的对抗性或伦理复杂条件下的模型行为。我们开发了一个多领域红队框架，评估了11个当代LLM在690个临床场景中的表现，这些场景涵盖9个领域和150多个子类别。场景包含对抗性变换，响应使用七维度评分标准进行评估，包括LLM辅助评分和人在环验证。结果揭示了显著的性能差异，平均得分范围从0.791到0.984。关键的是，几个高性能系统在个别安全关键场景中完全失败，表明平均准确率掩盖了临床意义上的风险。最高性能系统（X-BAI、GPT-5、Claude Opus 4.1）得分超过0.97且方差较低，而不同领域间性能差异显著。公平性相关任务在人口统计修改后错误率放大10-20%，人工评审员识别出自动评估遗漏的临床相关失败。我们的发现表明，性能方差和最坏情况失败比平均准确率更能提供临床意义的可靠性指标，而结合自动化与临床监督的混合评估方法对于可信安全评估至关重要。

英文摘要

Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories. Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation. Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.

URL PDF HTML ☆

赞 0 踩 0

2606.00023 2026-06-02 cs.CL cs.AI cs.LG 版本更新

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

TrustLDM：语言扩散模型的可信度基准测试

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

发表机构 * State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University（中国科学院自动化研究所，智能科学与技术学院，北京大学）； CISPA Helmholtz Center for Information Security（信息安全研究所）； School of EECS, Peking University（电子工程学院，北京大学）； Institute for Artificial Intelligence, Peking University（人工智能研究所，北京大学）

AI总结针对语言扩散模型（LDM）的可信度问题，提出TrustLDM基准，评估其在不同架构和恶意上下文下的安全性、隐私性和公平性，并开发自动评估框架TrustLDM-Auto以识别脆弱配置。

详情

AI中文摘要

语言扩散模型（LDM）的快速发展挑战了自回归模型在语言处理中的主导地位。然而，其灵活、任意顺序的解码策略不仅实现了快速解码速度，还可能带来新的可信度挑战。为了更好地理解其流程背后的风险，我们引入了一个针对LDM的全面可信度基准（TrustLDM），评估不同LDM架构在多种静态后上下文类别下的安全性、隐私性和公平性。我们的实证结果表明，尽管LDM在仅使用用户提示时通常表现出较强的可信度，但当恶意后上下文附加到掩码响应时，其对齐行为明显下降。我们进一步观察到，较长的上下文不一定产生更强的影响，解码顺序和生成长度都会影响评估结果。最后，我们提出了TrustLDM-Auto，一个利用LDM解码灵活性自动识别脆弱配置的评估框架，揭示了所有评估模型和维度上的显著可信度弱点。我们的工作可能有助于社区构建更可信的LDM。我们的代码可在https://github.com/PKU-ML/TrustLDM获取。

英文摘要

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.

URL PDF HTML ☆

赞 0 踩 0

2606.00022 2026-06-02 cs.CL cs.AI 版本更新

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

lmfaoooo at SemEval-2026 Task 1: 幽默是受众。面向约束幽默生成的偏好建模

Alexey Tikhonov, Alexey Ivanov

发表机构 * Inworld.AI Berlin, Germany（Inworld.AI柏林，德国）； OpenAI ； Mountain View, CA（山景城，加利福尼亚州）

AI总结针对约束幽默生成任务，提出“生成多候选-偏好选择”策略，利用人类成对比较训练偏好模型，在MWAHAHA任务英、中、西语子任务中分别获得第1、第1和第2名。

Comments 5 pages. Accepted for SEMEVAL 2026

详情

AI中文摘要

幽默生成仍然困难，不仅因为生成流畅、新颖的笑话很难，而且因为“有趣”取决于受众，监督信号嘈杂——偏好随受众、语境和文化而变化，标注者一致性通常较低。在本文中，我们描述了用于SemEval-2026 Task-1（MWAHAHA）的系统，该任务专注于在显式约束下进行幽默生成。任务通过1对1竞技场式比较中的人类偏好判断来评估提交的系统。我们采用“生成多个->选择最佳”策略。首先，我们通过多步提示、模型集成和多样性导向解码为每个实例生成多样化的候选池。其次，我们使用偏好模型选择输出，该模型通过从人类比较中学习（而非绝对趣味性分数）来近似“读者”。为支持该方法，我们发布了通过幽默竞技场原型收集的2.5K人类成对判断。我们进一步提出一个可解释的流程，将标注的比较转换为偏好模型。在三个偏好数据集上，我们的模型一致优于基线，并表现出更强的跨领域迁移。最后，我们将学到的偏好模型应用于MWAHAHA设置中的候选排序，并发布中间产物（候选池和排序）以促进后续工作。我们的系统在MWAHAHA的英语和汉语子任务中排名第一，在西班牙语子任务中排名第二。

英文摘要

Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many -> select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.

URL PDF HTML ☆

赞 0 踩 0

2606.00021 2026-06-02 cs.CL cs.AI cs.LG 版本更新

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE: 基于软门控评估的语义嵌入导航用于检索式推测解码

Shaowen Chen, Zhicheng Liao, Hongwei Wang

发表机构 * Zhejiang University, Hangzhou, China（浙江大学，杭州，中国）

AI总结提出SENSE方法，通过语义嵌入导航和软门控评估模块替代表面形式匹配，提升检索式推测解码的鲁棒性和加速效果，在LLaMA和Qwen系列上实现最高4.09平均接受长度和3.26倍加速。

详情

AI中文摘要

推测解码（SD）通过使用轻量级草稿模型提出候选令牌，并由目标模型并行验证，从而加速大型语言模型（LLM）推理，同时不损害生成质量。尽管检索式推测解码（RSD）因其即插即用的多功能性而受到青睐，但其潜力受到刚性词汇依赖的阻碍，使得检索和验证对表面形式变化敏感。为了解决这个问题，我们提出了SENSE（基于软门控评估的语义嵌入导航）。通过将检索锚定在目标模型的隐藏状态上，SENSE建立了稳健的语义对齐，这使得软门控评估模块能够验证语义等价性而非表面形式。为了确保严格的基准测试，我们将现有方法解构为统一框架内的原子原语，促进细粒度的组件级比较。跨多个领域的广泛实验表明，SENSE在LLaMA和Qwen系列上优于多个基线，实现了高达4.09的平均接受长度和3.26倍的加速，同时保持了生成质量。我们的代码将在发表后发布。

英文摘要

Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation quality. While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms. To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison. Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining up to 4.09 mean acceptance length and 3.26x speedup, while preserving generation quality. Our code will be released upon publication.

URL PDF HTML ☆

赞 0 踩 0

2606.00020 2026-06-02 cs.CL cs.AI 版本更新

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

CSRP：基于效率感知奖励的强化学习链式推理中文文本纠错

Wei Tian, Yuhao Zhou, Man Lan

发表机构 * School of Computer Science and Technology, East China Normal University（东华大学计算机科学与技术学院）； Shanghai Institute of Artificial Intelligence for Education, East China Normal University（东华大学教育人工智能研究所）

AI总结提出CSRP三阶段框架，通过连续预训练、链式推理监督微调和基于效率感知奖励的组相对策略优化，在NACGEC基准上实现最优性能，有效缓解过度纠正偏差。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main conference)

详情

AI中文摘要

基于大语言模型的中文语法纠错系统面临两个关键挑战：通用模型缺乏针对细微语法区别的专业语言先验，以及使用最大似然估计的监督微调无法优化以精度为中心的指标，导致系统性过度纠正。我们提出CSRP，一个三阶段框架，通过以下步骤逐步构建纠错能力：在590万平衡样本上进行连续预训练以内化领域知识，使用显式错误推理进行链式推理监督微调以实现诊断透明度，以及采用新颖的效率感知奖励进行组相对策略优化，明确惩罚不必要的编辑。在NACGEC基准上，CSRP以50.99的$F_{0.5}$和57.17的精确率实现了最先进性能，大幅优于先前最佳结果，同时有效缓解了MLE训练模型固有的过度纠正偏差。我们的方法还将CSCD拼写纠错提升至59.61的F1，超过GPT-4 5.20分。全面的消融研究表明，RL对齐阶段相比SFT基线贡献了8%的相对增益，且该增益与大规模CPT的贡献正交，验证了针对编辑效率的显式优化对于高质量语法纠错至关重要。我们的代码可在https://github.com/TW-NLP/ChineseErrorCorrector获取。

英文摘要

Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes a 8\% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.

URL PDF HTML ☆

赞 0 踩 0

2606.00019 2026-06-02 cs.HC cs.AI 版本更新

Understanding Stigmatizing Language in Clinical Documentation: A Paired Comparison of Ambient AI Drafts and Clinician Finalized Notes

理解临床文档中的污名化语言：环境AI草稿与临床医生最终笔记的配对比较

Yiliang Zhou, Yawen Guo, Sairam Sutari, Jasmine Dhillon, Alexandra L. Beck, Emilie Chow, Steven Tam, Danielle Perret, Deepti Pandita, Gelareh Sadigh, Archana J. McEligot, Kai Zheng

发表机构 * Department of Informatics, University of California, Irvine（加州大学欧文分校信息学系）； Institute for Clinical and Translational Science, University of California, Irvine（加州大学欧文分校临床与转化科学研究所）； Department of Medicine, University of California, Irvine（加州大学欧文分校医学系）； Department of Physical Medicine & Rehabilitation, University of California, Irvine（加州大学欧文分校物理医学与康复科）； Department of Radiological Sciences, University of California, Irvine（加州大学欧文分校放射科学系）； Department of Public Health, California State University Fullerton（加州州立大学富尔顿分校公共卫生系）

AI总结通过配对比较环境AI生成的草稿与临床医生最终笔记，量化编辑前后污名化语言的变化，发现临床医生编辑可能成为污名化语言进入电子健康记录的净来源。

详情

AI中文摘要

环境人工智能（AI）文档工具越来越多地被用于减轻临床医生的文档负担，但它们对临床笔记中偏见语言的影响尚不清楚。我们对AI草稿和相应的临床医生最终笔记进行了大规模比较分析，以量化编辑前后污名化语言的变化。使用基于词典的自然语言处理（NLP）流程，我们测量了（1）AI草稿中污名化语言的普遍性，（2）最终笔记中的普遍性和术语组成，以及（3）污名化术语的移除或引入频率。在66,297对笔记部分中，21.4%的AI草稿部分包含至少一个污名化语言提及，而在临床医生最终版本中这一比例上升至24.0%。引入比移除更频繁，表明在使用环境AI时，临床医生编辑可能是污名化语言进入电子健康记录的净来源。

英文摘要

Ambient artificial intelligence (AI) documentation tools are increasingly deployed to reduce clinician documentation burden, but their implications for biased language in clinical notes remain unclear. We conducted a large-scale comparison analysis of AI drafts and corresponding clinician finalized notes to quantify stigmatizing language changes pre- and post-editing. Using a lexicon-based natural language processing (NLP) pipeline, we measured (1) the prevalence of stigmatizing language in AI drafts, (2) the prevalence and term composition in final notes, and (3) the frequency of removal or introduction of stigmatizing terms. Across 66,297 paired note sections, 21.4% of AI draft sections contained at least one stigmatizing language mention, rising to 24.0% in clinician finalized versions. Introductions occurred more often than removals, suggesting clinician editing can be a net source of stigmatizing language entering the EHR with using Ambient AI.

URL PDF HTML ☆

赞 0 踩 0

2606.00018 2026-06-02 cs.HC cs.AI 版本更新

SortingHat: 用定制的数字教学助手重新定义操作系统教育

Yifan Zhang, Xinkui Zhao, Zuxin Wang, Zhengyi Zhou, Guanjie Chen, Shuiguang Deng, Jianwei Yin

发表机构 * School of Software Technology, Zhejiang University（浙江大学软件学院）

AI总结针对操作系统课程教学挑战，提出结合检索增强生成和多智能体强化学习的3D数字人教学助手SortingHat，提供个性化指导、自适应内容生成和自动评估。

详情

DOI: 10.1145/3701716.3715199
Journal ref: WWW '25: Companion Proceedings of the ACM on Web Conference 2025,Pages 2951 - 2954

AI中文摘要

操作系统课程是计算机科学教育中最具挑战性的课程之一，原因在于其内部结构的复杂性和运行环境的多样性。传统的教学方法往往无法应对学生多样化的背景、学习速度和实际需求。为了应对这些挑战，我们提出了SortingHat，一个专为操作系统教育定制的个性化数字教学助手。SortingHat集成了先进的人工智能技术，包括检索增强生成框架和多智能体强化学习，以提供自适应、可扩展且有效的教育支持。SortingHat采用由大型语言模型驱动的3D数字人界面，提供个性化、富有同理心和上下文感知的指导。它根据每个学生的学习历史和学业表现生成定制的练习，强化薄弱环节并挑战高级概念。此外，该系统包含一个强大的评估流程，确保对学生提交的内容进行公平、一致和无偏见的评分，同时提供个性化的、可操作的改进反馈。通过结合个性化指导、自适应内容创建和自动评估，SortingHat将操作系统教育转变为一种引人入胜、沉浸式且可扩展的体验。

英文摘要

Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval augmented generation (RAG) framework and multi agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context aware guidance. It generates tailored exercises based on each student's learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.

URL PDF HTML ☆

赞 0 踩 0

2606.00014 2026-06-02 cs.CL cs.AI 版本更新

Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval

面向鲁棒的上下文学习：利用分布外代理进行目标不可访问的演示检索

Hao Xu, Rite Bo, Fausto Giunchiglia, Yingji Li, Rui Song

发表机构 * College of Computer Science and Technology, Jilin University, China（吉林大学计算机科学与技术学院）； Department of Information Engineering and Computer Science, University of Trento, Italy（特伦托大学信息工程与计算机科学系）

AI总结提出DOPA框架，通过引入分布外代理近似不可访问的目标域并利用马氏距离全局多样性约束，提升大语言模型在分布偏移下的鲁棒性。

Comments Accepted by ACL 2026 main

详情

AI中文摘要

尽管研究表明大语言模型（LLMs）在分布外（OOD）任务上表现良好，但随着分布偏移加剧，其优势趋于减弱。因此，研究人员旨在从可用源域中检索分布相似且信息丰富的演示来增强LLMs的推理能力。然而，在目标域不可访问的实际场景中，评估未知分布具有挑战性，这间接影响所选演示的质量。为解决此问题，我们提出 extbf{DOPA}，一种演示搜索框架，它引入OOD代理来近似不可访问的目标域并指导检索过程。基于代理评估，DOPA进一步引入基于马氏距离的全局多样性约束，确保检索到的演示具有足够的多样性。在多个LLMs和任务上的实验结果表明，DOPA有效增强了OOD设置下的鲁棒性 ootnote{https://github.com/bort64/ood\_code}。

英文摘要

Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, evaluating the unknown distribution is challenging, which indirectly impacts the quality of the selected demonstrations. To address this problem, we propose \textbf{DOPA}, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy-based evaluation, DOPA further introduces a Mahalanobis distance-based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings\footnote{https://github.com/bort64/ood\_code}.

URL PDF HTML ☆

赞 0 踩 0

2606.00013 2026-06-02 cs.CY cs.AI cs.HC 版本更新

A phenomenon of AI-conformity: how algorithms change human moral decision-making

AI从众现象：算法如何改变人类道德决策

Yana Venerina, Dmitry Koch, Nare Meloyan, Gerda Prutko, Valeriia Lelik, Victoria Taova, Andrey Kurpatov

发表机构 * Neuroscience Laboratory, Sberbank, Moscow, Russia（神经科学实验室，俄罗斯储蓄银行，莫斯科）

AI总结本研究通过改编经典Asch范式，发现具有推理能力的AI模型对人类道德判断的影响程度与人类多数相当，表明道德决策也可能受到算法从众的影响。

Comments 31 pages, 1 figure

详情

AI中文摘要

社会从众是一种有充分记录的现象，即个体将其观点转向社会多数的观点。随着人工智能（AI）日益融入日常生活，它也可能创造一种新的影响源，引发算法从众，其机制尚不清楚。本研究考察了AI判断是否影响人类的道德决策（n=165），改编了经典的Asch范式。参与者在三种不同条件下完成一系列道德困境：存在社会多数时、AI模型提供简短答案时、以及AI模型同时提供答案和解释时。在所有条件下，呈现的回应都违背了普遍接受的道德规范。结果表明，具有推理成分的AI模型对参与者意见的影响程度与人类多数相当。这些发现表明，即使是道德判断，尽管其敏感性和个人重要性，也可能容易受到算法从众的影响。然而，算法从众的机制似乎与社会从众不同。总体而言，该研究挑战了道德决策处于“AI禁区”——即被认为只有人类决策才可接受的领域——的假设，并强调了随着基于AI的建议日益融入人类决策，需要进一步研究这一现象。

英文摘要

Social conformity is a well-documented phenomenon in which individuals shift their opinions towards those of a social majority. As artificial intelligence (AI) becomes increasingly integrated into everyday life it may also create a novel source of influence giving rise to algorithmic conformity, mechanisms of which are poorly understood. The present study examined whether AI judgements affect moral decision-making in humans (n=165) adapting the classical Asch paradigm. Participants completed a series of moral dilemmas under three different conditions: in presence of social majority, with an AI model providing brief answers and with an AI model providing both answers and explanations of its choices. In all conditions the presented responses contradicted generally accepted moral norms. The results indicated that an AI model with a reasoning component affected the opinion of participants to a degree comparable to that of a human majority. These findings suggest that even moral judgements, despite their sensitivity and personal significance, may be susceptible to algorithmic conformity. However, the mechanism underlying algorithmic conformity appears to differ from the social one. Overall, the study challenges the assumption that moral decision-making lies in "AI inadmissibility zone" - a sphere that is considered as an area in which only human-made decisions are acceptable and highlights the need for a further investigation of this phenomenon as AI-based recommendations become increasingly embedded into human decision-making.

URL PDF HTML ☆

赞 0 踩 0

2606.00011 2026-06-02 cs.HC cs.AI cs.LG 版本更新

RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview

RuleEdit: 失败引导的人机模型编辑与前瞻性影响预览

Min Hun Lee, Justin Yu Feng Teo

发表机构 * Singapore Management University（新加坡国立大学）

AI总结提出RuleEdit系统，通过规则表的不匹配信号检测失败并预览模型编辑的影响，在卒中康复评估中显著提升人机协同性能。

详情

AI中文摘要

尽管AI有望协助复杂决策，但从业者仍然缺乏在提交模型编辑之前检测可能失败和检查后果的方法。我们提出RuleEdit，一个交互式、规则引导的人机模型编辑系统，它(i)通过规则表可解释的不匹配信号揭示可能的失败，并(ii)支持用户编写的规则反馈，提供预期性能变化和嵌入偏移的前瞻性预览。我们在卒中康复评估中实例化RuleEdit，并与卫生专业人员和学生一起评估。规则引导的失败检测将人+AI性能显著提高了14.16%（p<0.001），同时改善了对错误AI的拒绝，减少了过度依赖和不足依赖以及ChangedToWrong决策。此外，呈现前瞻性嵌入预览改善了参与者对模型适应的反馈，在纳入用户基于规则的反馈后，将更新后的局部性能增益从11.50%提高到36.38%（p<0.001）。我们的发现表明，基于不匹配的失败线索和前瞻性影响预览可以支持失败感知的人机模型编辑，同时也揭示了局部-全局权衡：有助于特定案例的编辑在全局转移时可能会降低性能。我们讨论了设计失败感知和可控人机系统的意义。

英文摘要

Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences of model edits before committing them. We present RuleEdit, an interactive, rule-guided human-AI model editing system that (i) surfaces likely failures through interpretable mismatch signals from rule tables and (ii) supports user-authored rule feedback with prospective previews of projected performance changes and embedding shifts. We instantiate RuleEdit in stroke rehabilitation assessment and evaluate it with health professionals and students. Rule-guided failure detection significantly increased Human + AI performance by 14.16\% ($p<0.001$) while improving rejection of incorrect AI and reducing both over- and under- reliance as well as ChangedToWrong decisions. In addition, presenting prospective embedding previews improved participants' feedback for model adaptation, increasing post-update local performance gains from 11.50\% to 36.38\% after incorporating users' rule-based feedback ($p<0.001$). Our findings show that mismatch-based failure cues and prospective impact previews can support failure-aware human-AI model editing, while also revealing a local-global tradeoff: edits that help a specific case can degrade performance when transferred globally. We discuss implications of designing failure-aware and controllable human-AI systems.

URL PDF HTML ☆

赞 0 踩 0

2606.00010 2026-06-02 cs.HC cs.AI cs.CY 版本更新

Empathic and agentic artificial intelligence in nursing: perspectives on a human-centered framework for cancer care navigation in the United States

护理中的共情与自主人工智能：美国癌症护理导航中以人为本框架的视角

Tyra Girdwood, Saba Kheirinejad, Parnian Kheirkhah Rahimabad, Brianna M. White, Robert L Davis, David L Schwartz, Arash Shaban-Nejad

发表机构 * University of Tennessee Health Science Center, College of Nursing（田纳西大学健康科学中心护理学院）； University of Tennessee Health Science Center, Center for Biomedical Informatics, Department of Pediatrics（田纳西大学健康科学中心生物医学信息中心，儿科系）； University of Tennessee Health Science Center, Department of Radiation Oncology（田纳西大学健康科学中心放射肿瘤学系）

AI总结本文提出一个以人为本的人工智能框架，结合共情与自主方法，基于美国护士协会伦理准则，支持护士在癌症护理导航中增强而非取代人类共情与自主性，改善工作流程、医患关系和护理协调。

Comments 5 Pages, 1 Figure, 1 Table

详情

DOI: 10.1016/j.envres.2023.116972
Journal ref: ESMO Real World Data and Digital Oncology, 2026, Vol 12, 100694

AI中文摘要

对于癌症患者，护士导航可以通过加强健康服务协调和患者结果来减轻复杂护理的负担。然而，在资源不足的地区，训练有素的护士导航员可能有限或不存在。在美国，人工智能驱动的数字健康工具日益可用，可能有助于解决护理协调中的差距；然而，大多数并非专门设计用于支持护理。这篇观点文章讨论了一个以人为本的人工智能框架，该框架整合了基于美国护士协会伦理准则的共情和自主方法，以支持美国护士在癌症护理导航中的工作。该框架可以增强而非取代人类的共情和自主性，同时改善护士工作流程、患者-临床医生关系以及资源不足地区的护理协调服务。

英文摘要

For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and patient outcomes. However, in under-resourced areas, trained nurse navigators may be limited or non-existent. In the United States, artificial intelligence (AI)-enabled digital health tools are increasingly available and may help address gaps in care coordination; however, most are not designed to specifically support nursing. This perspective piece discusses a human-centered AI framework that integrates empathic and agentic approaches grounded in the American Nurses Association's code of ethics to support nurses in the United States in cancer care navigation. The framework could augment, not replace, human empathy and agency while improving nurse workflow, patient-clinician relationships, and care coordination services in under-resourced areas.

URL PDF HTML ☆

赞 0 踩 0

2606.00009 2026-06-02 cs.AI 版本更新

多模型AI系统中的涌现协作审议：一种源自BFT的认知综合协议

VD Doske

发表机构 * Independent Researcher（独立研究者）； Consilia

AI总结提出Consilium协议，一种基于拜占庭容错的多模型AI审议架构，将模型间分歧视为认知信号而非错误，通过认知角色分配和样本内外验证框架，实现低成本下与前沿模型相当的认知综合能力。

Comments 32 pages, 7 figures

详情

DOI: 10.5281/zenodo.19229039

AI中文摘要

我们提出了Consilium协议，这是一种源自拜占庭容错技术的结构化多模型AI审议架构，将模型间分歧视为认知信号而非错误。该协议为语言模型分配工程化的认知角色——区分模型是什么与它如何推理——并引入了一种从量化金融中改编的样本内/样本外验证框架，以区分训练数据共识与经验验证的结论。在涵盖10个领域类别32个主题的1,478次审议会话中，我们证明了：(1) 认知角色而非底层模型决定认知行为：每批次成本0.0002美元的自由边缘推理模型产生的分析输出与成本10.69美元的前沿模型相当；(2) RLHF对齐训练产生了可测量的、特定领域的认知盲点——有争议的政策主题比已解决的科学主题受到的对抗性挑战少12.3个百分点，而AI安全主题表现出不对称偏差（Δ=11.6%），模型质疑AI危险主张的力度远大于质疑AI风险被夸大主张的力度；(3) 该协议本身没有方向性偏差（移民Δ=2.3%，可再生能源Δ=1.2%）；(4) 样本外证据检索验证了239项主张，证据检索率100%，并发现了167个训练数据审议无法察觉的盲点发现。跨随机模型×角色分配的运行间可重复性平均标准差为±2.2%。整个电池的总成本（包括所有开销）为217美元。我们在MIT许可下发布协议规范，以便独立验证。

英文摘要

We present the Consilium Protocol, a Byzantine Fault Tolerance-derived architecture for structured multi-model AI deliberation that treats inter-model disagreement as epistemic signal rather than error. The protocol assigns engineered cognitive personas to language models -- separating what a model is from how it reasons -- and introduces an In-Sample/Out-of-Sample validation framework adapted from quantitative finance to distinguish training-data consensus from empirically grounded conclusions. Across 1,478 deliberation sessions spanning 32 topics in 10 domain categories, we demonstrate that (1) the cognitive persona, not the underlying model, determines epistemic behavior: free edge-inference models costing 0.0002 USD per batch produced comparable analytical output to frontier models costing 10.69 USD; (2) RLHF alignment training creates measurable, domain-specific epistemic blind spots -- contested policy topics exhibit 12.3 percentage points less adversarial challenge than settled science topics, and AI safety topics show asymmetric bias ($Δ$=11.6%) where models challenge claims that AI is dangerous far more vigorously than claims that AI risk is overstated; (3) the protocol exhibits no directional bias of its own (immigration $Δ$=2.3%, renewables $Δ$=1.2%); and (4) out-of-sample evidence retrieval validated 239 claims with 100% evidence retrieval and surfaced 167 blind-spot discoveries invisible to training-data deliberation. Run-to-run reproducibility across randomized model$\times$persona assignments averages $\pm$2.2% standard deviation. Total cost for the complete battery including all overhead: 217 USD. We release the protocol specification under MIT license to enable independent verification.

URL PDF HTML ☆

赞 0 踩 0

2606.00002 2026-06-02 cs.AI 版本更新

询问是不够的：LLM 置信度校准中的协议敏感性

Hankyeol Kim, Pilsung Kang

发表机构 * Seoul National University（首尔国立大学）

AI总结研究通过改变测量协议（如条件上下文、令牌读取方式）发现，LLM 的令牌概率置信度与口头置信度之间的比较高度敏感，且口头置信度不仅反映正确性还反映答案的合理性和来源。

详情

AI中文摘要

LLM 置信度校准通常通过比较两种信号来评估：令牌概率分数和口头置信度。这些信号有时被视为模型不确定性的直接读数，但它们的比较取决于很少明确说明的测量选择。在主要分析中，我们固定口头置信度的引出方式：一个单一的提示模板、概率尺度和输出格式。然后，我们改变定义口头与令牌比较的测量轴：哪个答案字符串获得令牌概率分数，如何从答案令牌中读取该分数，以及在哪种条件上下文中测量它。我们在三个开放 7--8B 基础/指令模型家族的四个 QA 基准上评估了这种设计，并使用更大的 Qwen2.5 变体作为同家族鲁棒性检查。结果比较对这些选择敏感：条件上下文改变了跨设置的 ECE 差距的符号或大小，令牌读取产生了更小但仍改变符号的变化，而改变 ECE 估计器影响很小。在默认的生成答案、裸上下文协议下，指令设置接近平衡，而不是显示口头置信度的大幅校准增益。在单独的提供答案分析中，表面合理的错误答案与提供的正确答案获得几乎相同的置信度，这表明口头置信度也反映了答案的合理性和来源，而不仅仅是正确性。我们认为，两种置信度信号都应被视为依赖于协议的测量行为，并提供了一个报告清单，涵盖引出来源、评分答案、令牌概率读取和条件上下文。

英文摘要

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7--8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.

URL PDF HTML ☆

赞 0 踩 0

2605.27701 2026-06-02 cs.AI 版本更新

Cross-Entropy Games and Frost Training

交叉熵博弈与Frost训练

Arthur Renard, Franck Gabriel, Valentin Hartmann, Clément Hongler

发表机构 * Xent Labs（Xent实验室）； Université Lyon 1（里昂1大学）

AI总结提出Frost训练方法，利用奖励函数在嵌入空间中的梯度改进基于蒙特卡洛的策略优化，用于解决一类称为交叉熵博弈的LLM-as-a-judge任务，在最佳k选择中实现更高最大分数并加速训练。

Comments 14 pages, 6 figures

2605.27575 2026-06-02 cs.AI 版本更新

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Agyn：一个面向AI代理的开源平台，具有可扩展的按需执行、代理即代码定义和零信任访问

Nikita Benkovich, Vitalii Valkov

发表机构 * Agyn, Inc.（Agyn公司）； Mila AI e-Lab

AI总结提出Agyn开源平台，通过信号驱动的有状态无服务器运行时、Terraform代理定义和零信任安全模型，解决AI代理在生产中的隔离、治理和扩展问题。

2605.27569 2026-06-02 cs.AI 版本更新

RULER: Representation-Level Verification of Machine Unlearning

RULER: 机器遗忘的表示级验证

Georgina Cosma, Axel Finke

发表机构 * Department of Computer Science, Loughborough University, UK（英国洛林大学计算机科学系）； School of Mathematics, Statistics and Physics, Newcastle University, UK（英国新castle大学数学、统计与物理学院）

AI总结提出表示级验证指标RULER（包括基于oracle的M2和无oracle的M4），检测机器遗忘后模型中间表示中残留的被遗忘记录信息，发现输出级验证通过的方法在表示级仍存在显著残留。

详情

AI中文摘要

机器遗忘旨在从已部署的模型中移除特定训练记录的影响，而无需从头重新训练。当前协议通过成员推断、保留准确率和遗忘集准确率在输出级进行验证，但模型可能满足所有三个条件的同时在其中间表示中编码被遗忘的记录。我们引入RULER，一组表示级验证指标。基于oracle的比较指标M2衡量遗忘集记录是否占据与在没有它们的情况下重新训练的模型中相同的表示位置。无oracle指标M4仅从未学习模型的内部相似性结构检测残差，无需重新训练。四种近似遗忘方法均通过输出级评估，但在线性混合效应模型下，M2在12个条件中的10个中检测到显著残差（p<0.05），且效应大小随遗忘比例增加而增大。第五种方法Bad Teacher尽管具有不同的遗忘机制，也显示出相同的残差。M4作为遗忘前诊断指标，适用于表格、图像、临床文本和人脸身份设置：它检测到人脸识别模型中身份级别的记忆化，而所有测试方法均无法完全擦除该信号。

英文摘要

Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

URL PDF HTML ☆

赞 0 踩 0

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

FrontierOR：基准测试大语言模型在大规模优化中高效算法设计的能力

Minwei Kong, Chonghe Jiang, Ao Qu, Wenbin Ouyang, Zhaoming Zeng, Xiaotong Guo, Zhekai Li, Junyi Li, Yi Fan, Xinshou Zheng, Xi Jing, Yikai Zhang, Zhiwei Liang, Seonghoo Kim, Runqing Yang, Zijian Zhou, Sirui Li, Han Zheng, Wangyang Ying, Ou Zheng, Chonghuan Wang, Jinglong Zhao, Hanzhang Qin, Cathy Wu, Paul Pu Liang, Jinhua Zhao, Hai Wang

发表机构 * Singapore-MIT Alliance for Research and Technology（新加坡-麻省理工联盟研究技术）； Massachusetts Institute of Technology（麻省理工学院）； Northeastern University（东北大学）； Uber ； Shanghai Jiaotong University（上海交通大学）； Boston University（波士顿大学）； Emory University（埃默里大学）； Northwestern University（西北大学）； National University of Singapore（国立新加坡大学）； Microsoft（微软）； University of Texas at Dallas（德克萨斯大学达拉斯分校）； Singapore Management University（新加坡管理学院）

AI总结提出FrontierOR基准，系统评估大语言模型在现实大规模优化问题中设计可扩展算法（而非仅生成求解器代码）的能力，发现最强模型仅在31%案例中优于Gurobi。

详情

AI中文摘要

大语言模型（LLMs）越来越多地用于优化建模和求解器代码生成，然而实际的运筹学和优化问题往往需要更困难的能力：设计可扩展的算法，利用问题结构并超越直接的建模-求解基线。现有基准仅限于远低于现实规模和复杂度的小型或简化示例。我们引入FrontierOR，这是首批系统评估基于LLM的高效算法设计在现实大规模优化问题中的基准之一。FrontierOR包含180个任务，这些任务源自顶级运筹学场所发表的方法论多样的论文，每个任务都有标准化实例和隐藏的、专家验证的评估套件。我们评估了七个LLM，涵盖前沿、经济高效和开源模型，在一次性设置和测试时进化设置中。结果显示，前沿模型仍然难以从可执行的公式化转向高效的优化算法：最强的一次性模型在解决方案质量和计算效率方面仅在31%的案例中优于Gurobi，即使具有测试时进化的强大编码代理在选定的困难任务上也仅达到50%。FrontierOR为基于LLM的优化算法设计建立了一个实用的评估平台，使未来的LLM和智能体能够系统地测试它们是否能够超越正确的公式化，转向可行、高质量和高效的算法。代码和数据已在https://github.com/Minw913/FrontierOR公开。

英文摘要

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Code and data are publicly released at https://github.com/Minw913/FrontierOR.

URL PDF HTML ☆

赞 0 踩 0

2605.26684 2026-06-02 cs.LG cs.AI 版本更新

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

超越轨迹级归因：基于图的智能体强化学习信用分配

Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出GraphGPO方法，通过构建状态转移图并利用全局信息估计各状态到任务目标的距离，实现步骤级信用分配，提升训练效率和性能。

Comments Accepted by ICML 2026

详情

AI中文摘要

基于组的强化学习方法在提升大型语言模型性能方面取得了显著成功，并已迅速扩展到智能体任务。然而，其信用分配严重依赖于根据最终结果进行的粗粒度轨迹级归因，难以捕捉单个步骤的贡献，例如失败轨迹中被掩盖的有价值步骤。为了揭示潜在信息并实现更忠实的步骤级信用分配，我们提出基于图的组策略优化（GraphGPO），该方法首先将所有 rollout 轨迹聚合为一个统一的状态转移图，然后利用图中编码的全局信息估计每个状态到任务目标的距离。最后，GraphGPO 通过估计基于图的优势函数，根据转移减少到任务目标距离的程度，为每条边分配信用。通过这种方式，GraphGPO 显著提高了训练效率，并在多个具有挑战性的基准测试中取得了最先进的性能。

英文摘要

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2605.26436 2026-06-02 cs.CL cs.AI 版本更新

Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models

目标重掩码：在离散扩散语言模型中将令牌编辑替换为令牌到掩码的精炼

Lin Yao

AI总结针对离散掩码扩散语言模型中令牌编辑机制的局限性，提出无训练的令牌到掩码重掩码方法，通过将疑似错误令牌重置为掩码状态，利用扩散过程在更干净上下文中重新预测，显著提升数学等任务的性能。

Comments This paper has been significantly revised, expanded, and superseded by a more comprehensive version available at arXiv:2604.18738. The authors have chosen to withdraw this version to avoid overlap and direct readers to the updated work

详情

AI中文摘要

离散掩码扩散语言模型（如LLaDA）通过迭代去噪生成文本，其中掩码令牌逐步被预测的令牌替换。LLaDA2.1引入了令牌到令牌（T2T）编辑机制，通过直接替换疑似错误的已提交令牌来加速生成。然而，我们发现了T2T编辑的根本性限制：它将错误检测与替换耦合，用可能错误的令牌污染生成上下文，并引入了训练-推理噪声不匹配，其中系统性的模型生成错误与训练中看到的随机扰动不同。我们提出了令牌到掩码（T2M）重掩码，这是一种无需训练、即插即用的T2T编辑替代方案，将疑似错误的令牌重置回掩码状态，允许扩散过程在更干净的上下文中重新预测它们。我们设计并实证验证了三种互补的错误检测策略——基于概率的、触发镜像的和基于时间差分的——并提供了统一的理论分析，表明T2M重掩码净化了生成上下文，将系统性的推理错误转换回模型的原生掩码噪声类型，并实现了延迟承诺以进行联合多位置优化。在涵盖知识、推理、数学、编码和指令跟随的12个基准上的全面实验表明，T2M通常在需要精确令牌级输出的任务上提升性能，其中数学任务提升最大（CMATH上+5.92%）。对CMATH的错误分析揭示，主要的失败模式是最后一英里令牌损坏——即正确的推理产生损坏的最终答案——而T2M修复了59.4%的此类情况。

英文摘要

Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies -- probability-based, trigger-mirrored, and temporal-difference-based -- and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model's native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption -- where correct reasoning produces a corrupted final answer -- and that T2M repairs 59.4% of such cases.

URL PDF HTML ☆

赞 0 踩 0

2605.26397 2026-06-02 cs.CL cs.AI 版本更新

Algorithmic Fragility and Persona Bias in LLM-Generated Autistic Communication

LLM生成的自闭症交流中的算法脆弱性与人格偏见

Naba Rizvi, Mohammed Rizvi, Harper Strickland, Saleha Ahmedi, Nedjma Ousidhoum

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； Georgia Institute of Technology（佐治亚理工学院）； Cornell University（康奈尔大学）； Cardiff University（卡迪夫大学）

AI总结通过双人格改写范式，发现LLM在生成自闭症人格文本时存在词汇与情感偏离、输出坍塌等系统性失败，且对齐策略而非参数规模主导这些失败，表明当前对齐训练导致深层表征鸿沟。

Comments main paper: 9 pages; total: 19 pages; 2 figures; 5 tables

详情

AI中文摘要

安全对齐减少了明确有害的输出，但无意中编码了一种对边缘化交流的净化、神经典型化表征。我们使用双人格改写范式研究这种编码，提示十个大型语言模型（LLM）从自闭症或神经典型人格改写自然发生的自闭症话语。我们发现，尽管语义相似性相当，自闭症人格改写比神经典型改写在词汇形式和情感语域上偏离显著更多。此外，大多数模型将跨人格生成折叠成几乎相同的输出。为了揭示这种生成崩溃背后的机制，我们引入了一个多智能体定性分析框架。我们的结果揭示了系统性输出擦除、刻板幻觉和任务回避元评论是此任务的普遍失败模式，这些模式按对齐策略而非参数规模聚类。最后，我们与自闭症人类标注者的针对性比较表明，社区内部知识相对于LLM分类产生了系统性标签反转。我们的发现表明，当前的对齐训练导致仅通过定性分析可见的人格特定生成崩溃，证实了提示工程无法解决的深层表征鸿沟。

英文摘要

Safety alignment reduces explicitly harmful outputs but inadvertently encodes a sanitized, neuronormative representation of marginalized communication. We investigate this encoding using a dual-persona rewrite paradigm, prompting ten large language models (LLMs) to rewrite naturally occurring autistic discourse from either an autistic or neurotypical persona. We uncover autistic-persona rewrites diverge significantly more in lexical form and affective register than neurotypical rewrites, despite equivalent semantic similarity. Furthermore, most models collapse cross-persona generations into near-identical outputs. To uncover the mechanisms behind this generative breakdown, we introduce a multi-agent qualitative analysis framework. Our results reveal systemic output erasure, stereotyped hallucination, and task-evasive meta-commentary are pervasive failure modes for this task that cluster by alignment strategy rather than parameter scale. Finally, our targeted comparison with autistic human annotators demonstrates that community-insider knowledge produces systematic label reversals relative to LLM classifications. Our findings indicate that current alignment training causes persona-specific generative breakdown visible only through qualitative analysis, confirming a deep representational gap that prompt engineering cannot resolve.

URL PDF HTML ☆

赞 0 踩 0

2605.26305 2026-06-02 cs.AI cs.SY eess.SY hep-ph 版本更新

Experiments in Agentic AI for Science

科学领域中的自主AI代理实验

Judy Fox, Geoffrey Fox

发表机构 * School of Data Science, Department of Computer Science, and Biocomplexity Institute, University of Virginia（数据科学学院、计算机科学系和生物复杂性研究所，弗吉尼亚大学）

AI总结提出两种基于本地体-远程脑架构的自主AI代理框架，通过系统工程技术克服上下文和推理限制，分别用于时间序列数据集的大规模自动整理和物理讲座的结构化报告生成。

详情

AI中文摘要

本文详细介绍了在科学工作流中开发自主AI代理的两种新颖框架。两个系统都通过Google Colab利用混合本地体-远程脑架构，使用基于Python的本地协调器调用大型语言模型（LLM）云后端。第一个代理DeepTS/DeepCollector自动化了时间序列数据集的大规模整理、提取和去重。第二个代理DeepScribe是一个自主演示分析器，将视觉密集、数学复杂的物理讲座转换为结构化科学报告。通过实际的系统工程——如细粒度属性提取（Cellular RAG）、远程数据检查和分布式并发控制——我们展示了自主AI代理如何克服当前最先进系统的上下文和推理限制，以严格支持科学工作流。最后，我们概述了DeepTS的泛化以支持深度知识图谱，并讨论了这种概念方法在高能物理（DeepQCD）中的应用。

英文摘要

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering-such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls-we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

URL PDF HTML ☆

赞 0 踩 0

2605.26068 2026-06-02 cs.LG cs.AI 版本更新

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

重新思考异常检测中的弱监督：一个综合基准

Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

发表机构 * Shanghai University of Finance and Economics（上海金融学院）； Ant Group（蚂蚁集团）； Key Laboratory of Interdisciplinary Research of Computation and Economics（计算与经济交叉学科重点实验室）

AI总结提出WSADBench，首个统一评估不完全、不精确和不准确三种弱监督异常检测场景的基准，通过系统变化标签数量、粒度和质量，揭示36种算法在4种模态上的性能边界，并发现弱监督场景间存在强相关性、专用WSAD算法仅在极端标签稀缺时占优等关键洞察。

Comments Accepted at KDD 2026 Datasets and Benchmarks Track

详情

DOI: 10.1145/3770855.3817536

AI中文摘要

弱监督异常检测（WSAD）已发展出三个主要方向：不完全监督、不精确监督和不准确监督。然而，这些方向仍然相互孤立，缺乏一个统一的框架来评估它们是否解决独特的挑战或共享基本机制。本文介绍了WSADBench，这是第一个统一评估不同弱监督场景的基准，对从专用WSAD方法到先进表格基础模型的多种方法进行基准测试。WSADBench通过系统变化标签数量、粒度和质量，建立了标准化协议来评估4种模态上的36种算法，揭示了各种方法的性能边界。基于超过70万次实验，WSADBench揭示了四个关键见解：（i）这些弱监督场景之间存在强内在相关性，挑战了当前研究方向的孤立性。（ii）专用WSAD算法仅在极端标签稀缺情况下表现出色，但随着监督增加或在OOD场景中，很快被表格基础模型和通用分类方法主导。（iii）未标记数据在不同设置中的效用不一致，与标签细化相比收益微乎其微。（iv）模型对不同类型的标签噪声表现出不对称敏感性。我们发布WSADBench作为开源基准，包含代码和数据集，以促进未来的WSAD研究：https://github.com/SUFE-AILAB/WSADBench。

英文摘要

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.

URL PDF HTML ☆

赞 0 踩 0

2605.26089 2026-06-02 cs.CV cs.AI 版本更新

Channel-wise Vector Quantization

通道级向量量化

Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Min Li, Jiaqi Wang, Kaicheng Yu

发表机构 * Shanghai Innovation Institute（上海创新研究院）； Westlake University（西湖大学）； Zhejiang University（浙江大学）； Fudan University（复旦大学）； JD.COM（京东公司）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结提出通道级向量量化（CVQ）代替补丁级量化，并基于此设计通道级自回归（CAR）模型，通过逐通道预测实现渐进式细节生成，在图像重建和文本到图像生成中取得优异性能。

详情

AI中文摘要

我们提出了通道级向量量化（CVQ），一种新颖的图像标记化范式，用通道级标记取代补丁级标记。与传统的向量量化（为每个补丁特征向量分配一个离散标记）不同，CVQ 对特征图的每个通道进行量化。这种表示将图像表示为视觉细节的离散层级，而不是空间补丁的网格。基于 CVQ，我们引入了一种新的视觉自回归框架，采用“下一通道预测”。我们的通道级自回归（CAR）模型不是按光栅顺序逐补丁渲染图像，而是顺序预测图像通道，逐步生成更丰富的视觉细节。具体来说，它首先勾勒全局结构，然后细化细粒度属性，类似于人类艺术家的创作流程。实验表明：（1）CVQ 在 16K+ 的码本大小下实现了 100% 的码本利用率，无需任何额外技巧，并且显著提高了传统 VQ 的重建质量；（2）CAR 在 DPG 评分中达到 86.7，在 GenEval 评分中达到 0.79，展示了其在文本到图像生成中的强大有效性。

英文摘要

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.

URL PDF HTML ☆

赞 0 踩 0

2605.30290 2026-06-02 cs.LG cs.AI cs.CL 版本更新

分离性身份：语言模型代理缺乏声誉机制的基础

Botao Amber Hu, Helena Rong, Max Van Kleek

发表机构 * University of Oxford（牛津大学）； New York University Shanghai（纽约大学上海分校）

AI总结本文指出语言模型代理因本体上的分离性（模块可替换、身份流动）而无法满足声誉机制所需的身份持续性、行为可预测性和制裁敏感性，从而提出转向基于可观察性、事前、构成性、协议的行为约束。

Comments Accepted by FaccT 2026

详情

DOI: 10.1145/3805689.3806748

AI中文摘要

随着自主语言模型代理的激增，形成了一个具有现实后果的新兴代理网络，您可以使用哪些可信信号来决定是否信任并委托一个陌生的代理？自然的治理直觉是将人类身份验证和声誉机制从“了解你的客户”和信用评分扩展到“了解你的代理”制度。然而，我们认为这种类比从根本上是不完整的。声誉机制既作为社会信号，也作为纠正性反馈，维持可信行为的均衡，其前提是存在与行为连续性、制裁敏感性和昂贵不可替代性相关的持久身份。但语言模型代理在本体上是分离性的：它们本质上是可修改模块的集合——基础模型、系统提示、工具访问策略、外部记忆，在某些情况下还包括整个多代理系统——任何模块都可能改变代理行为，并且具有流动的人格，容易受到对抗性攻击，且可能不会内化制裁。借鉴分离性身份障碍的法理学，这种分离性使得代理缺乏可识别性、可预测性、可信性和可恢复性的基础——而这些正是声誉机制旨在维持的属性——从而破坏了信任。我们认为，基于身份的事后、规制性、制裁性的治理（如声誉）在结构上不适用于分离性代理，并建议转向基于可观察性的事前、构成性、协议性的行为约束。

英文摘要

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically \emph{dissociative}: they are essentially an assemblage of mutable modules -- foundation models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole -- any of which may change agent behavior -- with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability -- the very properties that reputation mechanisms aim to sustain -- thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.

URL PDF HTML ☆

赞 0 踩 0

2605.30122 2026-06-02 cs.LG cs.AI 版本更新

Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression

超越MSE：利用多分位数回归改进降水临近预报

Gijs van Nieuwkoop, Siamak Mehrkanoon

发表机构 * Department of Information and Computing Sciences, Utrecht University（信息与计算科学系，乌特勒支大学）

AI总结本文提出将确定性降水临近预报模型的训练目标从均方误差（MSE）改为多分位数回归损失，使用SmaAt-UNet模型在荷兰雷达降水数据上验证，使中心确定性预测的测试集MSE降低8.6%，并输出高分位数预测以改善强降水预测。

Comments 7 pages, 5 figs

详情

AI中文摘要

深度学习降水临近预报模型通常使用逐点损失（如均方误差或平均绝对误差）进行优化，这可能导致预测过于平滑且对强降雨的表示较差。本研究探讨了是否可以通过将训练重新表述为多分位数回归问题来提高已建立的确定性临近预报架构的预测性能。使用SmaAt-UNet作为核心模型，我们在荷兰雷达降水临近预报上比较了MSE、MAE和多分位数pinball损失训练。结果表明，多分位数训练改进了中心确定性预测，与使用MSE训练的模型相比，测试集MSE降低了8.6%，同时产生的高分位数输出对强降水的风险敏感预测很有用。这些发现表明，分位数回归提供了一种简单的替代标准逐点损失的方法，无需新的架构或生成采样过程。我们模型和训练设置的实现可在GitHub上获取。

英文摘要

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi-quantile regression problem. Using SmaAt-UNet as a core model, we compare MSE, MAE, and multi-quantile pinball-loss training on radar precipitation nowcasting over the Netherlands. The results show that multi-quantile training improves the central deterministic forecast, decreasing test-set MSE by 8.6\% compared to a model trained using MSE, while also producing upper-quantile outputs that are useful for risk-sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on \href{https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting}{GitHub}.

URL PDF HTML ☆

赞 0 踩 0

2605.30000 2026-06-02 cs.AI 版本更新

AttenA+: 纠正机器人基础模型中的动作不平等性

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie, Jian Guo, Ping Luo, Andrew F. Luo, Boyu Zhou, Jun Ma

发表机构 * HKUST(GZ)（香港科技大学（广州））； HKU（香港大学）； USTC（中国科学技术大学）； IDEA Research（IDEA研究院）； SUSTech（南方科技大学）； X-Humaniod

AI总结针对机器人基础模型忽视动作物理重要性的问题，提出AttenA+框架，通过速度驱动的动作注意力重加权训练目标，提升复杂长程任务性能。

详情

AI中文摘要

现有的机器人基础模型虽然强大，但基于一个隐含的时间同质性假设：在优化过程中将所有动作视为同等信息量。这种从语言模型继承的“平坦”训练范式，对操作的内在物理层次结构无动于衷。实际上，机器人轨迹本质上是异质的，其中低速段通常通过需要精确交互来决定任务成功，而高速运动则作为容错过渡。这种均匀损失权重与物理关键性之间的错位从根本上限制了当前视觉-语言-动作（VLA）模型和世界-动作模型（WAM）在复杂长程任务中的性能。为了纠正这一点，我们引入了AttenA+，一个与架构无关的框架，通过速度驱动的动作注意力优先考虑运动学关键段。通过基于逆速度场重新加权训练目标，AttenA+自然地使模型的学习能力与操作的物理需求对齐。作为一种即插即用的增强，AttenA+可以集成到现有骨干网络中，无需结构修改或额外参数。大量实验表明，AttenA+显著提升了当前最先进模型的上限。具体来说，它在Libero基准上将OpenVLA-OFT提升至98.6%（+1.5%），并将FastWAM在RoboTwin 2.0上推进至92.4%（+0.6%）。在Franka机械臂上的真实世界验证进一步展示了其鲁棒性和跨任务泛化能力。我们的工作表明，挖掘动作序列的内在结构先验为标准缩放定律提供了一种高效、物理感知的补充，为通用机器人控制开辟了新路径。

英文摘要

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.

URL PDF HTML ☆

赞 0 踩 0

2605.07804 2026-06-02 cs.LG cs.AI 版本更新

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Prune-OPD：面向长程推理的高效可靠在线策略蒸馏

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The Hong Kong University of Science and Technology（香港科技大学）； MBZUAI ； University of California, Merced（加州大学默塞德分校）； Sun Yat-sen University（中山大学）

AI总结提出Prune-OPD框架，通过实时检测学生与教师之间的前缀漂移并动态截断不可靠的轨迹，在减少计算浪费的同时保持或提升长程推理任务的性能。

Comments 17 pages, 8 figures

详情

AI中文摘要

在线策略蒸馏（OPD）利用密集的教师奖励来增强推理模型。然而，将OPD扩展到长程任务暴露了一个关键缺陷：随着学生生成的前缀不可避免地偏离教师的思维过程，教师的密集奖励失去了局部可开发性。继续在这些“漂移”轨迹上生成和评估标记不仅会降低奖励质量，还会导致巨大的计算浪费。为了解决这个问题，我们引入了 extbf{Prune-OPD}，一个动态地将训练预算与监督质量对齐的框架。通过持续监控学生和教师预测之间的局部兼容性（例如，通过top-$k$重叠），Prune-OPD实时检测前缀漂移事件。一旦检测到严重漂移，它会单调地降低后续不可靠奖励的权重，并触发动态的轨迹截断。这使得训练过程能够停止无效的生成，并将计算重新分配到可靠的教师监督上。在不同的教师-学生组合中，Prune-OPD始终将计算与监督可靠性对齐。当前缀漂移使得密集的教师奖励不可靠时，它减少了37.6\%--68.0\%的训练时间，同时保持甚至提升了在具有挑战性的基准（AMC、AIME、HMMT）上的性能。当学生-教师兼容性保持较高时，它会通过扩展训练窗口自动保留长上下文监督。这些结果表明，Prune-OPD不是通过盲目缩短轨迹来改进OPD，而是通过将计算重新分配到局部可开发的教师奖励上。

英文摘要

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these ``drifted'' trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce \textbf{Prune-OPD}, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-$k$ overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6\%--68.0\% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.

URL PDF HTML ☆

赞 0 踩 0

2602.14307 2026-06-02 cs.AI cs.LG 版本更新

Benchmarking at the Edge of Comprehension

在理解边缘的基准测试

Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

发表机构 * University of Cambridge（剑桥大学）

AI总结提出Critique-Resilient Benchmarking框架，通过对抗性生成-评估游戏在人类理解受限时比较模型，利用批判韧性正确性概念和分项Bradley-Terry模型对LLM进行排序。

详情

AI中文摘要

随着前沿大型语言模型（LLMs）在新基准发布后迅速饱和，基准测试本身正处于一个转折点：如果前沿模型持续改进，人类将越来越难以生成具有区分度的任务、提供准确的真实答案或评估复杂解决方案。如果基准测试变得不可行，我们衡量AI进展的能力将受到威胁。我们将这种情况称为后理解阶段。在这项工作中，我们提出了Critique-Resilient Benchmarking，一种对抗性框架，旨在即使在人类完全理解不可行的情况下也能比较模型。我们的技术依赖于批判韧性正确性的概念：如果没有对手令人信服地证明答案错误，则该答案被视为正确。与标准基准测试不同，人类充当有界验证者，专注于局部声明，从而在超出任务完全理解的情况下保持评估完整性。使用分项二分Bradley-Terry模型，我们联合对LLM进行排序，依据其解决挑战性任务的能力和生成困难但可解问题的能力。我们在数学领域展示了该方法在八个前沿LLM上的有效性，表明所得分数稳定且与外部能力度量相关。我们的框架将基准测试重新定义为一种对抗性生成-评估游戏，其中人类作为最终裁决者。

英文摘要

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.

URL PDF HTML ☆

赞 0 踩 0

2605.29539 2026-06-02 cs.CV cs.AI 版本更新

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

GiPL: 用于跨域小样本目标检测的生成增强迭代伪标签方法

Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou

发表机构 * Huazhong University of Science and Technology（华中科技大学）

AI总结提出GiPL双分支训练框架，通过迭代伪标签自训练和生成数据增强，解决跨域小样本目标检测中支持集利用不足和过拟合问题。

Comments CVPR 2026 Workshop

详情

AI中文摘要

视觉语言基础模型在跨域小样本目标检测（CD-FSOD）中展现出有前景的零样本泛化能力。然而，它们在微调过程中面临两个关键挑战：由于稀疏的单实例标注导致支持集利用不足，以及在极有限的域目标样本下严重过拟合。为解决这些问题，本文提出GiPL，一个高效的双分支训练框架。在第一个分支中，我们设计了一种迭代伪标签自训练范式，该范式对支持集进行零样本推理以生成可靠的伪标注，将其与真实标签融合，并迭代优化模型以充分利用支持集数据。在第二个分支中，我们引入了使用大型视觉语言模型的生成数据增强流程，该流程合成域对齐、多目标标注的图像以丰富训练样本并抑制过拟合。在三个具有挑战性的CD-FSOD数据集（RUOD、CARPK、CarDD）上，在1/5/10样本设置下的大量实验表明，GiPL始终以显著的性能提升优于最先进的方法。代码可在\href{https://github.com/z-yaz/CDiscover}{CDiscover}获取。

英文摘要

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at \href{https://github.com/z-yaz/CDiscover}{CDiscover}.

URL PDF HTML ☆

赞 0 踩 0

2605.29488 2026-06-02 cs.CV cs.AI 版本更新

Many-Shot CoT-ICL: 使上下文学习真正学习

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

发表机构 * The University of Hong Kong（香港大学）

AI总结研究多示例思维链上下文学习在推理任务中的特性，提出曲线演示选择方法，在数学任务上提升5.42个百分点。

Comments Accepted by ICML 2026

详情

AI中文摘要

虽然多示例ICL取得了显著性能，但先前对其缩放行为的研究主要关注非推理任务。在这项工作中，我们研究了推理任务上的多示例ICL，特别关注多示例思维链上下文学习（CoT-ICL）。通过分析非推理和推理任务以及非推理和推理导向的LLM，我们识别出多示例CoT-ICL的几个独特性质。我们进一步将这些发现解释为多示例CoT-ICL是上下文测试时学习而非缩放模式匹配，并提出两个原则：（i）演示应易于目标模型理解，（ii）它们应按顺序排列以支持平滑的概念进展。受该原则指导，我们提出了曲线演示选择（CDS），一种简单的排序方法，在具有64个演示的数学任务上获得了高达5.42个百分点的提升。总体而言，我们的结果将长上下文窗口从检索缓冲区重新定义为上下文测试时学习的结构化课程。

英文摘要

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In this work, we study many-shot ICL on reasoning tasks, with a particular focus on many-shot chain-of-thought in-context learning (CoT-ICL). Analyzing across non-reasoning and reasoning tasks and across non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We further interpret these findings by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on a math task with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

URL PDF HTML ☆

赞 0 踩 0

2604.19532 2026-06-02 cs.SD cs.AI 版本更新

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

BEAT: 通过均匀时间步对符号音乐进行分词和生成

Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

发表机构 * South China University of Technology（南方科技大学）； National University of Singapore（新加坡国立大学）； Mohamed bin Zayed University of Artificial Intelligence（莫扎德·本·扎耶德人工智能大学）； New York University（纽约大学）

AI总结提出一种以均匀节拍为基本单元的分词方法，将同一时间步内相同音高的所有事件编码为一个令牌，并在音乐续写和伴奏生成任务中验证其相比传统事件基方法能提升音乐质量和结构连贯性。

详情

AI中文摘要

将音乐分词以适应语言模型的通用框架是一个具有挑战性的问题，特别是考虑到音乐可以表示的各种符号结构（例如，序列、网格和图）。迄今为止，大多数方法将符号音乐分词为音乐事件序列，如起始、音高、时移或复合音符事件。这种策略直观且已在基于Transformer的模型中证明有效，但它隐式处理了音乐时间的规律性：单个令牌可能跨越不同时长，导致时间进展不均匀。在本文中，我们考虑另一种分词方式是否可能，其中均匀长度的音乐步长（例如，一个节拍）作为基本单元。具体来说，我们将单个时间步内相同音高的所有事件编码为一个令牌，并显式按时间步对令牌进行分组，这类似于钢琴卷帘表示的稀疏编码。我们在音乐续写和伴奏生成任务上评估了所提出的分词方法，并将其与主流事件基方法进行比较。结果表明，所提出的分词方法提高了音乐质量和结构连贯性，而额外分析证实了更高的效率和更有效地捕获长程模式。

英文摘要

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.

URL PDF HTML ☆

赞 0 踩 0

2602.07666 2026-06-02 cs.CR cs.AI 版本更新

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

SoK: DARPA 人工智能网络挑战赛 (AIxCC)：竞赛设计、架构与经验教训

Cen Zhang, Younggi Park, Fabian Fleischer, Yu-Fu Fu, Jiho Kim, Dongkwan Kim, Youngjoon Kim, Qingxiao Xu, Andrew Chin, Ze Sheng, Hanqing Zhao, Michael Pelican, David J. Musliner, Jeff Huang, Jon Silliman, Mikel Mcdaniel, Jefferson Casavant, Isaac Goldthwaite, Nicholas Vidovich, Matthew Lehman, Taesoo Kim

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Texas A&M University（德克萨斯大学）； Smart Information Flow Technologies (SIFT)（智能信息流技术公司）； Kudu Dynamics（Kudu动态公司）； Microsoft（微软）

AI总结本文系统分析 DARPA 人工智能网络挑战赛 (AIxCC)，探讨其竞赛设计、决赛系统的架构方法，并总结驱动性能的因素、技术进展及未来研究方向。

Comments Camera ready version, systematization of Knowledge and post-competition analysis of DARPA AIxCC (2023-2025)

详情

Journal ref: USENIX Security 2026

AI中文摘要

DARPA 的人工智能网络挑战赛 (AIxCC, 2023--2025) 是迄今为止规模最大的竞赛，旨在构建完全自主的网络推理系统 (CRS)，利用人工智能的最新进展——特别是大型语言模型 (LLM)——来发现和修复真实世界开源软件中的漏洞。本文首次对 AIxCC 进行系统分析。基于设计文档、源代码、执行轨迹以及与组织者和参赛团队的讨论，我们审视了竞赛的结构和关键设计决策，描述了决赛 CRS 的架构方法，并分析了最终计分板之外的竞赛结果。我们的分析揭示了真正驱动 CRS 性能的因素，识别了各团队取得的技术进步，并指出了未来研究中仍需解决的局限性。最后，我们总结了组织未来竞赛的经验教训，以及在实际中部署自主 CRS 的更广泛见解。

英文摘要

DARPA's AI Cyber Challenge (AIxCC, 2023--2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI -- particularly large language models (LLMs) -- to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.

URL PDF HTML ☆

赞 0 踩 0

2605.25143 2026-06-02 cs.AI cs.LG 版本更新

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

超越前沿：用于高效测试时扩展的随机回溯

Dao Tran, Duc Anh Le, Ngoc Luu, Quan Pham, Tung Pham, Hung Bui

发表机构 * Qualcomm AI Research（高通人工智能研究）

AI总结提出随机回溯方法，通过维护历史前缀池并利用子池选择和幂回溯序列蒙特卡洛机制，在测试时扩展中实现更高的准确率-令牌数权衡。

详情

AI中文摘要

测试时扩展通过花费额外计算来探索多个解轨迹，从而改进语言模型推理。关键挑战是在推理过程中最大化准确率的同时最小化生成的令牌总数。最近的PRM引导方法对中间前缀进行评分以引导搜索，但大多数方法仅关注前沿：它们只保留当前活动的前缀，并使用带噪声的PRM分数不可逆地剪枝或重采样其余部分。这可能导致过早承诺、多样性崩溃以及丢失仍可产生正确延续的前缀。我们引入了一种基于历史前缀持久池的随机回溯，允许测试时计算重新访问先前生成的状态，而不是仅扩展当前前沿。为了提高效率，我们提出了两种互补机制。子池选择通过随机子池内应用Top-N选择来增强贪婪PRM引导搜索，使历史前缀有机会绕过评分过高的前沿候选。幂回溯序列蒙特卡洛使用幂化PRM分数和混合校正权重，将SMC风格的重采样扩展到持久池。在数学推理基准和模型规模上，我们的方法在每令牌准确率上始终更高，并且与强PRM引导基线相比，仅使用一小部分令牌数即可达到相同的准确率水平，这表明持久池随机回溯为改善测试时扩展中的准确率-令牌权衡提供了一种简单有效的方法。

英文摘要

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.

URL PDF HTML ☆

赞 0 踩 0

2605.24828 2026-06-02 cs.AI 版本更新

Test-Time Deep Thinking to Explore Implicit Rules

测试时深度思考以探索隐式规则

Wentong Chen, Xin Cong, Zhong Zhang, Yaxi Lu, Siyuan Zhao, Yesai Wu, Qinyu Luo, Haotian Chen, Yankai Lin, Zhiyuan Liu, Maosong Sun

发表机构 * Renmin University of China（中国人民大学）； Department of Statistics and Data Science, Tsinghua University（清华大学统计与数据科学系）； School of Computer Science and Engineering, UESTC（UESTC计算机科学与工程学院）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； School of Mathematical Sciences, Nankai University（南开大学数学科学学院）； Whiting School, Johns Hopkins University（约翰斯·霍普金斯大学惠特林学院）； School of Artificial Intelligence, Shanghai Jiaotong University（上海交通大学人工智能学院）

AI总结针对智能体在隐式规则环境中失败的问题，提出TTExplore框架，通过训练专用模型Exp-Thinker进行测试时推理，平均提升基线性能14-19点。

详情

AI中文摘要

随着大型语言模型（LLMs）的不断进步，智能体变得越来越重要。然而，这些智能体在由隐式规则——无法直接观察、必须通过交互推断的隐藏约束——支配的环境中常常失败。这导致智能体陷入重复的试错循环，最终导致任务失败。为了应对这一挑战，我们提出了测试时探索（TTExplore）框架，其中思考者组件分析交互历史以推断这些隐式规则并指导行动者。在此设置中，有效的探索关键取决于思考者的推理能力。然而，评估深度推理轨迹本质上不稳定且困难，这对有效训练构成了主要障碍。为了解决这个问题，我们引入了一种新颖且稳定的强化学习流程。核心思想是使用准确的任务级分数作为间接奖励，以绕过评估中间推理的困难，并仅保留每个轨迹的单个思考节点以缓解奖励稀疏性。使用此流程，我们训练了一个专门的7B模型Exp-Thinker。在五个基于文本的具体任务上的实验表明，配备Exp-Thinker的TTExplore将基线智能体性能平均提升了14-19个点，证明了显式推理隐式规则的有效性。

英文摘要

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules--hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of $14$-$19$ points, demonstrating the effectiveness of explicitly reasoning about implicit rules.

URL PDF HTML ☆

赞 0 踩 0

2605.24727 2026-06-02 cs.AI cs.CL cs.CY cs.IT math.IT 版本更新

Fundamental Limitation in Explaining AI

解释AI的根本限制

Atsushi Suzuki, Jing Wang

发表机构 * Department of Mathematics Faculty of Science（科学学院数学系）； The University of Hong Kong Hong Kong SAR（香港大学香港特别行政区）； School of Computing and Mathematical Sciences Faculty of Engineering and Science（工程与科学学院计算与数学科学系）； University of Greenwich United Kingdom（格林威治大学英国）

AI总结本文通过数学证明了一个解释AI的基本四难困境，指出AI及其解释无法同时满足环境复杂性、AI性能优良、解释可解释性和解释完全忠实性四个条件，从而表明AI治理应基于解释忠实性总是不完整的假设。

Comments minor modifications

详情

AI中文摘要

尽管大规模模型如LLMs和扩散模型已取得实际成功，公共机构强调了AI可解释性的重要性。然而，现有的解释AI方法并非旨在提供大规模AI系统行为的完全忠实解释。虽然对AI系统行为的完全忠实且可解释的解释可能对AI治理有用，但尚不清楚提供这样的解释在理论上是否可能。在本文中，我们从数学上证明了解释AI的一个基本四难困境，指出AI及其解释无法同时满足以下四个条件：1）操作环境的复杂性，2）AI性能的优良性，3）AI解释的可解释性，以及4）AI解释的完全忠实性。这个四难困境表明，在大多数我们无法改变环境或牺牲良好AI性能和可解释解释的应用中，我们应该放弃解释的完全忠实性，而应仅针对应用重要的部分进行解释。因此，该四难困境意味着AI治理应基于AI解释的忠实性总是不完整的假设来设计。

英文摘要

While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importance of explainability in AI. Existing methods for explaining AI, however, are not designed to provide completely faithful explanations of the behavior of large-scale AI systems. Although a completely faithful and interpretable explanation of the behavior of an AI system might be useful for AI governance, it has not been known whether providing such an explanation is theoretically possible. In this paper, we mathematically prove a fundamental quadrilemma in explaining AI, stating that AI and its explanation cannot satisfy the following four conditions simultaneously: 1) the complexity of the operation environment, 2) the goodness of the AI's performance, 3) the interpretability of the AI's explanation, and 4) the complete faithfulness of the AI's explanation. This quadrilemma suggests that, in most applications where we cannot change the environment or sacrifice good AI performance and an interpretable explanation, we should give up complete faithfulness of explanations and should instead aim to explain only the parts that are important for applications. As a consequence, the quadrilemma implies that AI governance should be designed on the premise that the faithfulness of AI explanations is always incomplete.

URL PDF HTML ☆

赞 0 踩 0

2605.24681 2026-06-02 cs.CL cs.AI 版本更新

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Mix-MoE：通过混合专家混合提升大语言模型的多语言机器翻译

Bo Li, Tianyu Dong, Shaolin Zhu, Deyi Xiong

发表机构 * School of Software, Tsinghua University（清华大学软件学院）； College of Intelligence and Computing, Tianjin University（天津大学智能与计算学院）

AI总结提出Mix-MoE框架，通过将MoE层分为语言模型专家和机器翻译专家，并利用傅里叶变换增强路由机制，解决大语言模型在多语言机器翻译微调中的参数干扰问题。

Comments Accepted by TASLP

详情

DOI: 10.1109/TASLPRO.2026.3698013

AI中文摘要

大语言模型（LLMs）在多语言机器翻译（MT）中展现出巨大潜力，即使双语监督有限。然而，使用平行语料库微调LLMs带来了主要挑战，即参数干扰。为了解决这些问题，我们提出了Mix-MoE，一个混合专家混合框架，旨在训练LLMs进行多语言MT。我们的框架在两个不同的阶段运行：（1）在单语语料库上使用MoE进行后预训练，以及（2）在平行语料库上使用MoE进行后预训练。关键的是，我们将MoE层分为两个专门的组：语言模型专家（LM专家）和机器翻译专家（MT专家）。LM专家旨在捕获和保留预训练LLM学到的单语知识。另一方面，MT专家专门训练以获取和存储双语翻译知识。此外，为了促进这些专门专家之间的有效交互并利用文本中潜在的结构模式，我们引入了一种由模型表示中的傅里叶变换特征增强的路由机制。实验结果表明，Mix-MoE在多语言MT中表现出色，显著优于现有基线，并在缓解参数干扰方面取得了显著进展。

英文摘要

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

URL PDF HTML ☆

赞 0 踩 0

2605.24528 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Hypothesis Generation and Inductive Inference in Children and Language Models

儿童与语言模型中的假设生成与归纳推理

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

发表机构 * Computer Science University of Waterloo（滑铁卢大学计算机科学系）； Department of Computer Science Cornell University（康奈尔大学计算机科学系）； Department of Computer Science Dalhousie University（达尔豪斯大学计算机科学系）； Department of Psychology University of Toronto（多伦多大学心理学系）

AI总结通过归纳推理盒子任务，结合贝叶斯粒子推断的程序归纳形式化，比较儿童与基于LLM的智能体在不确定性下的假设生成与证据寻求行为，发现两者在适应环境结构上相似但信息寻求成本与归纳偏差不同。

详情

AI中文摘要

现实世界中的决策需要在证据、潜在因果规则以及世界状态本身的不确定性下构建心智模型。在这种条件下，哪些计算原理支撑人类的推理？在给定匹配约束下，基于LLM的智能体是否表现出类似行为？我们使用归纳推理盒子任务来探讨这些问题，在该任务中，参与者（人类儿童和基于LLM的智能体）通过与不确定环境的顺序交互来推断潜在原因。我们将该任务形式化为基于贝叶斯粒子推断的程序归纳，并承认两种互补的解释：(1) 作为对假设的约束满足过程，以及(2) 作为程序综合问题，其中假设是针对证据评估的可执行程序。使用基于约束的公式，我们表明儿童的行为最好由主观证据可靠性和在线假设生成的组合来解释，这解释了他们的证据寻求模式以及任务完成与规则泛化之间的分离。使用程序综合公式，我们将基于LLM的智能体视为模型有机体：可控系统，允许系统性地操纵任务条件。在各种后端中，基于LLM的智能体复制了儿童对证据可靠性和可观察性变化的反应，包括折扣不可靠证据、寻求解决部分信息以及任务完成与因果泛化之间的分离。同时，与儿童相比，基于LLM的智能体倾向于过度观察和过度遵守指令。这些结果表明，虽然儿童和基于LLM的智能体在适应环境结构方面相似，但他们的信息寻求行为表现出不同的潜在成本和归纳偏差。

英文摘要

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

URL PDF HTML ☆

赞 0 踩 0

2605.18838 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

说谎只是一个阶段：语言模型扩展中的隐藏对齐转变

Adil Amin

发表机构 * ZEHEN Labs（ZEHEN实验室）

AI总结通过分析63个基础模型，发现语言模型在特定规模阈值下，推理能力与真实性从反相关转变为正相关，并揭示了输出投影瓶颈和零竞争注意力头等内部机制。

Comments 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." ( https://doi.org/10.48550/arXiv.2605.18840). Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/

详情

AI中文摘要

扩展定律预测了计算量带来的损失，但未预测能力如何相互作用。我们测量了来自16个家族的63个基础模型的推理能力与真实性之间的耦合，并发现了一个在损失曲线中不可见的相变：低于家族依赖的临界规模N_c时，能力反相关（r = -0.989，p = 4 x 10^{-5}，非参数置换检验）；高于该规模时，它们合作。N_c ~ 3.5B参数 [2.9B, 13.4B]（bootstrap 95% CI），但模型大小并非决定相位的唯一变量。架构、数据整理和训练配方各自独立地改变N_c：精心整理的数据消除了Qwen代际之间的耦合下降（在匹配规模下从0.025到0.830），Gemma-4在4B时通过蒸馏和架构创新实现了0.871的耦合，这通常是13B+标准训练模型的特征，而Phi在1B时仅通过数据整理就达到了10B网络训练模型的耦合水平。宽度归一化消除了所有测试家族的反相关，支持输出投影瓶颈的存在。在内部，40个模型中有38个显示零竞争注意力头。一个稀疏回归ODE以5.6%的误差交叉预测了保留的Llama-2。该诊断不需要模型内部信息——仅需跨模型家族的公开基准分数。合作区域扩展到前沿（r = +0.72，34个模型，10个实验室）。一个概念验证干预证实了瓶颈是可利用的：在识别层添加单个真实方向向量，无需重新训练即可纠正税收阶段60%的错位输出——这是一种无需修改权重的、每推理一次的外科手术式修正。代码、数据、用于任何开放权重模型的开源转向CLI以及用于相位诊断的交互式仪表板已发布：https://zehenlabs.com/cape/。

英文摘要

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.

URL PDF HTML ☆

赞 0 踩 0

2605.24248 2026-06-02 cs.CR cs.AI cs.SE 版本更新

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

认证工具服务器准入：模型上下文协议的安全扩展

Alfredo Metere

发表机构 * Enclawed LLC（Enclawed公司）

AI总结针对MCP协议缺乏信任机制的问题，提出mcp-attested扩展，通过离线签名的权限断言、默认拒绝的工具白名单和分级强制审计日志，实现安全服务器准入与工具边界控制。

详情

AI中文摘要

模型上下文协议（MCP）标准化了大语言模型（LLM）代理与外部工具服务器之间的消息交换，但未标准化信任：主机读取服务器自声明的工具列表并分发调用，没有关于可以使用哪些服务器、敏感程度如何或服务器哪些工具在界限内的概念。这项工作源于一个具体需求——让Enclawed代理安全地使用Google外部运营的MCP服务器（Gmail、日历、Drive），准入服务器并限制其可能驱动的工具，而不改变MCP或Enclawed自身的工具应用程序编程接口（API）。我们构建的机制mcp-attested（已在开源enclawed-oss发行版和enclaved变体中发布）具有通用性：使未经中介的第三方连接对单个用户不安全的差距，使得受监管的部署无法获得认证。我们通过三种附加机制来弥补这一差距：（1）一个小的、离线签名的权限断言，服务器在众所周知的统一资源标识符（URI）上发布，主机在分派任何工具之前对照固定的信任根进行验证；（2）一个默认拒绝的每服务器工具允许列表，因此准入服务器并不意味着信任其每个工具；（3）一个分级门控的强制模式，将检查从警告转变为硬性拒绝，每个决策都写入防篡改审计日志。我们给出了线路格式、验证算法、安全分析和LLM驱动的对抗性评估；然后以规范的请求评论（RFC 2119）形式陈述了设计——模式、验证规则、错误注册表、众所周知的注册和机器可检查的一致性向量——以便它可以作为MCP附录被采纳，而不是重新发明。未扩展的主机会忽略众所周知的文档，行为与今天完全相同。

英文摘要

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server's self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need -- letting the Enclawed agent use Google's externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form -- schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors -- so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.

URL PDF HTML ☆

赞 0 踩 0

2605.24202 2026-06-02 cs.AI cs.LG 版本更新

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

多智能体强化学习何时能改进LLM工作流？工作流、规模与策略共享的权衡

Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang

发表机构 * Oregon State University（俄勒冈州立大学）； Pennsylvania State University（宾夕法尼亚州立大学）； Adobe Inc.（Adobe公司）； AG2AI, Inc.（AG2AI公司）

AI总结研究多智能体LLM工作流中端到端强化学习训练的效果，发现改进依赖于工作流、任务和规模，策略共享不提供统一稳定性而是重新分配失败模式。

详情

AI中文摘要

多智能体LLM工作流通过将推理路由到专门角色来提升最终任务准确性，但联合训练这些角色的强化学习不稳定，其机制尚不明确。我们研究了多智能体LLM工作流的端到端RL训练何时能改进其基础模型，比较了共享策略训练（所有角色更新一个策略）和隔离策略训练（每个角色有自己的参数）。我们的实验矩阵涵盖Eval-Opt、Voting和Orch-Workers工作流、数学和代码任务以及三种模型规模（0.6B、1.7B、4B）。我们发现多智能体RL通常能改进基础模型，但增益共同依赖于工作流、任务和规模，而非仅依赖于策略共享。隔离策略倾向于达到更高的峰值准确率，但更频繁地掉入终端准确率悬崖，而共享策略训练并未消除失败；它只是将失败重新分布为性质不同的模式。然后，我们通过工作流拓扑和策略路由引起的角色级梯度动力学解释了其中最显著的模式：在隔离策略下，共享提示上的并行同角色代理会放大每个角色的梯度，并在Voting和Orch-Workers工作流中导致终端退化；在共享策略下，非对称的每步梯度质量导致共享策略被主导角色捕获，从而产生因任务和工作流而异的失败特征。总之，经验图谱及其潜在机制表明，策略共享通过不同渠道引导训练压力，而非提供统一稳定性，使其成为具有工作流和任务条件权衡的设计选择。

英文摘要

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

URL PDF HTML ☆

赞 0 踩 0

2605.24005 2026-06-02 cs.AI cs.CL 版本更新

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

LC-ERD：通过一致性调节奖励分解挖掘潜在逻辑以实现自我进化推理

Yanyu Chen, Jiyue Jiang, Dianzhi Yu, Zheng Wu, Jiahong Liu, Jiaming Han, Xiao Guo, Jinhu Qi, Yu Li, Yifei Zhang, Irwin King

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Shanghai Jiaotong University（上海交通大学）； Fudan University（复旦大学）

AI总结针对大语言模型推理中高质量过程数据稀缺的问题，提出LC-ERD框架，通过潜在逻辑挖掘和一致性调节的奖励分解，实现自我对齐与推理进化。

Comments Accepted in SIGKDD 2026 Research Track

详情

AI中文摘要

大语言模型推理的进化受到高质量过程数据稀缺的瓶颈限制。虽然通过内生奖励进行自我对齐提供了一种解决方案，但挖掘有效监督面临三个挑战：（1）通过模仿偏差产生的标签噪声，奖励优先考虑统计可能性而非逻辑真实性，造成掩盖复合错误的“正确性幻觉”；（2）粗粒度监督，稀疏的全局结果（例如在GRPO中）无法提供细粒度指导，将推理链视为整体；（3）分布崩溃，信号无法在不放大预训练偏差的情况下泛化。为了解决这些问题，我们引入了LC-ERD（逻辑一致的内生奖励分解），一个将自我对齐视为潜在结构挖掘的框架。我们通过聚合模型潜在逻辑专家（LLE）的共识推导出变分逻辑势，以去噪推理流形，并引入基于IGM原则的多智能体价值分解协议来量化单个步骤的效用。实验表明，LC-ERD提供了一条稳健的自我进化路径，揭示了逻辑一致性与准确性之间的权衡，同时识别了标准奖励遗漏的高价值推理模式。我们的代码可在https://github.com/LC-ERD-repo/LC-ERD获取。

英文摘要

The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment via endogenous rewards offers a solution, mining valid supervision faces three challenges: (1) Label Noise via Mimetic Bias, where rewards prioritize statistical likelihood over logical truth, creating a "correctness illusion" that masks compounding errors; (2) Coarse-Grained Supervision, where sparse global outcomes (e.g., in GRPO) fail to provide granular guidance, treating reasoning chains as monolithic; and (3) Distributional Collapse, where signals fail to generalize without amplifying pre-training biases. To address these, we introduce LC-ERD (Logic-Consistent Endogenous Reward Decomposition), a framework framing self-alignment as latent structure mining. We derive a Variational Logic Potential by aggregating consensus from the model's Latent Logic Expertise (LLE) to denoise the reasoning manifold, and introduce a Multi-Agent Value Decomposition protocol based on the IGM principle to quantify individual step utility. Experiments show LC-ERD delivers a robust self-evolution path, uncovering trade-offs between logic consistency and accuracy while identifying high-value reasoning patterns missed by standard rewards. Our code is available at https://github.com/LC-ERD-repo/LC-ERD.

URL PDF HTML ☆

赞 0 踩 0

2605.11359 2026-06-02 cs.AI physics.data-an 版本更新

迈向可穿戴健康数据的通用智能与接口

Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari, Samy Abdel-Ghaffar, Marius Guerard, Kara Vaillancourt, Zhihan Zhang, Jake Garrison, Levi Albuquerque, Dimitris Spathis, Hong Yu, Hamid Palangi, Xuhai "Orson" Xu, David G. T. Barrett, Joseph Breda, Jed McGiffin, Yubin Kim, Yuwei Zhang, Naghmeh Rezaei, Samuel Solomon, Karan Ahuja, Tim Althoff, Jake Sunshine, Ming-Zher Poh, Benjamin Yetton, Ari Winbush, Nicholas B. Allen, James M. Rehg, Isaac Galatzer-Levy, Yun Liu, John Hernandez, Anupam Pathak, Conor Heneghan, Yuzhe Yang, Ahmed A. Metwally, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Xin Liu, Daniel McDuff

发表机构 * Google Research（谷歌研究）； Google DeepMind（谷歌DeepMind）； University of Washington（华盛顿大学）； University of Oregon（俄勒冈大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出一个基于超过一万亿分钟无标签传感器数据预训练的可穿戴健康基础模型，通过联合扩展模型容量和预训练数据量，在35项健康预测任务上实现系统性性能提升，并利用LLM代理自动搜索下游预测头，集成到个人健康代理中以提高相关性和安全性。

详情

AI中文摘要

虽然无处不在的可穿戴传感器捕获了大量的行为和生理信息，但有效地将这些信号转化为个性化的健康见解具有挑战性。具体来说，由于高度的表型多样性以及个体基线健康、生理和生活方式因素的差异，将低层传感器数据转换为能够表征高层状态的表示是困难的。此外，收集带有健康结果注释的可穿戴数据既费力又昂贵，而回顾性注释实际上不可行，导致高质量标签数据的稀缺。为了克服这些限制，我们提出了一个可穿戴健康基础模型，该模型在来自五百万参与者的大型队列中超过一万亿分钟的无标签传感器信号上进行了预训练。我们证明了模型容量和预训练数据量的联合扩展在35项健康预测任务（涵盖心血管、代谢、睡眠和心理健康以及生活方式选择和人口统计因素）的多样化评估中带来了系统性的性能提升。我们发现这种人群规模的表示解锁了标签高效的少样本学习和稳健的日常指标估计的生成能力。为了进一步利用这种学习到的表示，我们部署了一个LLM代理教室来自动搜索基于模型嵌入构建的下游预测头空间，显示出随着LLM模型容量增加而广泛性能提升。最后，我们展示了将这些下游预测器集成到个人健康代理中如何能够支持更相关、更具上下文感知和更安全的模型响应，并通过来自一组临床医生的1,860个评分进行了验证。

英文摘要

While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.

URL PDF HTML ☆

赞 0 踩 0

2604.17473 2026-06-02 cs.CV cs.AI 版本更新

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

双锚定：解决视觉语言导航中的状态漂移问题

Kangyi Wu, Pengna Li, Kailin Lyu, Xi Lin, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence（人机混合增强智能国家重点实验室）； National Engineering Research Center for Visual Information and Applications（视觉信息与应用国家工程研究中心）； Institute of Artificial Intelligence and Robotics（人工智能与机器人研究院）； Xi’an Jiaotong University（西安交通大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Johns Hopkins University（约翰霍普金斯大学）； Joy Future Academy, JD（京东未来学院）

AI总结提出双锚定框架，通过指令进度锚定和记忆地标锚定分别解决进度漂移和记忆漂移，显著提升长场景导航成功率。

详情

AI中文摘要

视觉语言导航（VLN）要求智能体通过遵循自然语言指令在3D环境中导航。尽管最近的视频大语言模型（Video-LLMs）极大地推进了VLN，但在长场景中它们仍然非常容易受到状态漂移的影响。在这些情况下，智能体的内部状态偏离真实的任务执行状态，导致无目的漫游和无法执行指令中的关键操作。我们将这种失败归因于两种不同的认知缺陷：进度漂移，即智能体无法区分已完成的子目标和剩余的子目标；以及记忆漂移，即智能体的历史表示退化，使其无法跟踪已访问的地标。在本文中，我们提出了一个双锚定框架，明确锚定指令进度和历史表示。首先，为了解决进度漂移，我们引入了指令进度锚定，监督智能体生成结构化的文本标记，以描述已完成与剩余的子目标。其次，为了缓解记忆漂移，我们提出了记忆地标锚定，利用以地标为中心的世界模型回顾性地预测由Segment Anything模型提取的以对象为中心的嵌入，迫使智能体显式验证过去的观察并保留已访问地标的独特表示。为促进该框架，我们整理了两个大规模数据集：360万个带有显式进度描述的样本，以及93.7万个用于回顾性验证的接地地标数据。在模拟和真实环境中的大量实验证明了我们方法的优越性，在成功率上提高了15.2%，在长时程轨迹上获得了24.7%的显著提升。为促进进一步研究，我们将发布我们的代码、数据生成流程以及收集的数据集。

英文摘要

Vision-Language Navigation(VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models(Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. Facilitating this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark data for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

URL PDF HTML ☆

赞 0 踩 0

2605.14355 2026-06-02 cs.AI cs.CL 版本更新

Herculean: An Agentic Benchmark for Financial Intelligence

Herculean: 面向金融智能的智能体基准测试

Xueqing Peng, Zhuohan Xie, Yupeng Cao, Haohang Li, Lingfei Qian, Yan Wang, Vincent Jim Zhang, Huan He, Xuguang Ai, Linhai Ma, Ruoyu Xiang, Yueru He, Yi Han, Shuyao Wang, Yuqing Guo, Mingyang Jiang, Yilun Zhao, Youzhong Dong, Xiaoyu Wang, Yankai Chen, Ye Yuan, Qiyuan Zhang, Fuyuan Lyu, Haolun Wu, Yonghan Yang, Zichen Zhao, Yuyang Dai, Fan Zhang, Rania Elbadry, Ayesha Gull, Muhammad Usman Safder, Nuo Chen, Fengbin Zhu, Tianshi Cai, Zimu Wang, Polydoros Giannouris, Yuechen Jiang, Zhiwei Liu, Mohsinul Kabir, Yuyan Wang, Yixiang Zheng, Yangyang Yu, Weijin Liu, Wenbo Cao, Anke Xu, Peng Lu, Jerry Huang, Mingquan Lin, Prayag Tiwari, Yijia Zhao, Víctor Gutiérrez-Basulto, Xiao-Yang Liu, Kaleb E Smith, Jiahuan Pei, Arman Cohan, Jimin Huang, Yuehua Tang, Alejandro Lopez-Lira, Xi Chen, Xue Liu, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou

发表机构 * The Fin AI ； Yale University（耶鲁大学）； Columbia University（哥伦比亚大学）； Stevens Institute of Technology（史蒂文斯理工学院）； NVIDIA（英伟达）； New York University（纽约大学）； Georgia Institute of Technology（佐治亚理工学院）； University of Florida（佛罗里达大学）； MBZUAI ； Université de Montréal（蒙特利尔大学）； University of Minnesota（明尼苏达大学）； University of Massachusetts Boston（马萨诸塞大学波士顿分校）； National Institute of Advanced Industrial Science and Technology（国家先进工业科学与技术研究院）； University of Liverpool（利物浦大学）； Vrije Universiteit Amsterdam（阿姆斯特丹自由大学）； National University of Singapore（新加坡国立大学）； Halmstad University（哈尔姆斯塔德大学）； University of Manchester（曼彻斯特大学）； Cardiff University（卡迪夫大学）； McGill University（麦吉尔大学）； Mila – Quebec AI Institute（魁北克人工智能研究所）

AI总结本文提出Herculean，首个覆盖交易、对冲、市场洞察和审计四个代表性工作流的智能体金融智能基准测试，通过标准化MCP技能环境评估异构智能体系统，发现智能体在交易和市场洞察上表现较好，但在对冲和审计等需要长期协调、状态一致性和结构化验证的任务上存在显著不足。

详情

AI中文摘要

随着AI智能体的进步，核心问题不再是它们能否解决孤立的、定义明确的金融任务，而是它们能否可靠地执行金融专业工作。现有的金融基准测试仅提供了这种能力的部分视角，因为它们主要评估静态能力，如问答、检索、摘要和分类。我们引入了Herculean，这是首个面向智能体金融智能的技能基准测试，涵盖四个代表性工作流，包括交易、对冲、市场洞察和审计。每个工作流被实例化为一个基于MCP的标准化技能环境，具有自己的工具、交互动态、约束和成功标准，从而能够对异构智能体系统进行一致的端到端评估。在前沿智能体中，我们发现智能体在交易和市场洞察上表现相对较好，但在对冲和审计上表现显著不足，这些任务中长期协调、状态一致性和结构化验证至关重要。总体而言，我们的结果指出了当前智能体在高风险金融工作流中将金融推理转化为可靠工作流执行方面的关键差距。

英文摘要

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

URL PDF HTML ☆

赞 0 踩 0

2602.11210 2026-06-02 cs.SE cs.AI cs.LG 版本更新

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox：用于构建软件工程智能体的无容器强化学习

Danlong Yuan, Wei Wu, Enhan Zhao, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出SWE-MiniSandbox，一种轻量级无容器方法，通过内核级隔离和预缓存技术降低磁盘使用和准备时间，实现可扩展的强化学习训练。

详情

AI中文摘要

强化学习已成为训练软件工程智能体的关键范式，但现有流程通常依赖每个任务的容器进行隔离。在大规模场景下，预构建的容器镜像会带来显著的存储开销、缓慢的环境设置，并且需要容器管理权限。我们提出SWE-MiniSandbox，一种轻量级、无容器的方法，能够在无需牺牲隔离性的情况下实现SWE智能体的可扩展强化学习训练。SWE-MiniSandbox不依赖每个实例的容器，而是在由内核级机制支持的隔离工作空间中执行每个任务，从而大幅降低系统开销。它利用轻量级环境预缓存技术，消除了对庞大容器镜像的需求。因此，我们的方法将磁盘使用量降低到基于容器的流程所需的大约5%，并将环境准备时间缩短到容器基线的大约25%。实验结果表明，SWE-MiniSandbox实现了与标准基于容器的流程相当的评估性能。通过消除对重型容器基础设施的依赖，SWE-MiniSandbox为扩展基于强化学习的SWE智能体提供了一个实用且可访问的基础，特别是在资源受限的研究环境中。

英文摘要

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5\% of that required by container-based pipelines and reduces environment preparation time to about 25\% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

URL PDF HTML ☆

赞 0 踩 0

2603.02845 2026-06-02 cs.RO cs.AI 版本更新

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

SPARC: 通过注意力智能体通信实现空间感知路径规划

Sayang Mu, Xiangyu Wu, Bo An

发表机构 * Nanyang Technological University（南洋理工大学）

AI总结提出关系增强多头注意力（RMHA）机制，通过嵌入曼哈顿距离到注意力权重计算，优先处理空间邻近机器人的消息，在40x40网格上从8机器人零样本泛化到128机器人时，在30%障碍密度下实现约75%成功率，超越基线25个百分点以上。

Comments The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text . The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme

详情

AI中文摘要

高效通信对于分散式多机器人路径规划（MRPP）至关重要，然而现有的学习型通信方法平等对待所有邻近机器人，而不考虑它们的空间接近性，导致在协调最重要的拥挤区域注意力被稀释。我们提出关系增强多头注意力（RMHA），这是一种通信机制，它将成对曼哈顿距离显式嵌入到注意力权重计算中，使每个机器人能够动态优先处理来自空间相关邻居的消息。结合距离约束注意力掩码和GRU门控消息融合，RMHA与MAPPO无缝集成，实现稳定的端到端训练。在从8个训练机器人到128个测试机器人在40x40网格上的零样本泛化中，RMHA在30%障碍密度下实现了约75%的成功率，比最佳基线高出超过25个百分点。消融研究证实，距离关系编码是高密度环境中成功率提高的关键因素。索引词-多机器人路径规划，图注意力机制，多头注意力，通信优化，协作决策。

英文摘要

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately 75 percent success rate at 30 percent obstacle density outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms-Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making

URL PDF HTML ☆

赞 0 踩 0

2605.15229 2026-06-02 cs.SE cs.AI 版本更新

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

PBT-Bench：基于属性测试的AI智能体基准

Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du

发表机构 * Tsinghua University（清华大学）； University of Washington（华盛顿大学）； Beneficial AI Foundation（有益人工智能基金会）

AI总结提出PBT-Bench基准，包含100个基于属性测试的问题，用于评估AI智能体从文档中推导语义不变量并生成输入策略的能力。

详情

AI中文摘要

现有的代码基准测试衡量的是智能体能否生成任何能复现已知bug的测试，或者能否生成修复描述问题的补丁。两者都没有分离出基于属性测试的独特技能：从文档中推导语义不变量，然后构建足够精确的输入生成策略，使得随机搜索能够揭示违规。我们引入了PBT-Bench，一个包含40个真实Python库中100个精心策划的基于属性测试问题的基准。每个问题注入一个或多个语义bug（共365个，平均每个问题3.65个），设计使得默认策略的随机输入几乎不会触发它们；智能体必须阅读库的文档，识别相关不变量，并指定一个Hypothesis @given策略，将质量集中在触发区域。bug按三个难度级别（L1-L3）分层，涵盖单约束边界bug到有状态、跨函数协议违规。我们在两种提示机制（开放式基线与显式Hypothesis脚手架）下评估了八个当代LLM，每个配置进行三次独立运行。在PBT引导提示下，模型间的bug召回率从42.1%到83.4%不等；在开放式基线下，从31.4%到76.7%不等。Hypothesis脚手架将中等能力模型提升了超过20个百分点，但对最强模型提升较小，有两个例外显示出退化，表明结构化提示可能干扰某些模型行为而非补充。最难的bug被证明是模型特定的：不同架构在不同问题上失败，留下没有单一模型能填补的持续空白。我们发布基准、测试框架和完整评估语料库，以支持下游关于文档基础的语义推理工作。

英文摘要

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.20301 2026-06-02 cs.CV cs.AI 版本更新

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Co-Fusion4D：面向鲁棒3D目标检测的时空协同融合

Wenxuan Li, Qin Zou, Shoubing Chen, Chi Chen, Yingyi Yang, Qingxiang Meng

发表机构 * Tsinghua University（清华大学）

AI总结提出Co-Fusion4D框架，通过当前帧主导-历史帧互补机制和双注意力融合模块，解决BEV检测器中跨帧时空不一致问题，在nuScenes上达到74.9% mAP和75.6% NDS。

详情

AI中文摘要

在自动驾驶中，3D目标检测对于准确感知和可靠决策至关重要。然而，目标运动和自车运动常常在基于BEV的检测器中引起跨帧时空不一致，导致时序BEV特征错位和时空一致性退化。为了解决这些挑战，我们提出了Co-Fusion4D，一个统一框架，显式地保持跨帧时空一致性并抑制时序特征漂移。Co-Fusion4D采用当前帧中心策略，将当前帧作为主要信息源，同时在时空滤波和对齐后选择性地融入历史帧。这种主从互补机制有效减轻了累积对齐误差，抑制了噪声特征传播，并利用可靠的时序线索获得更一致的BEV表示。此外，Co-Fusion4D集成了双注意力融合（DAF）模块，以进一步增强时空特征交互。DAF联合利用帧内空间注意力和帧间时序注意力，自适应地对齐和融合多帧特征，强调运动一致区域同时抑制虚假相关性。通过偏离传统的均匀融合范式，该设计显著提高了BEV表示的时序稳定性和判别能力。在nuScenes基准上的大量实验表明，Co-Fusion4D实现了最先进的性能，mAP为74.9%，NDS为75.6%，且不依赖测试时增强或外部数据。

英文摘要

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

URL PDF HTML ☆

赞 0 踩 0

2605.20282 2026-06-02 cs.CV cs.AI 版本更新

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

视觉模型真的能遗忘吗？Mirage：表示层面的视觉遗忘认证

Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou

发表机构 * Fudan University（复旦大学）； Southeast University（东南大学）； Northeast Normal University（东北师范大学）

AI总结提出Mirage框架，通过表示层面诊断揭示现有垂直联邦学习遗忘方法在输出层面通过认证后仍保留类别结构信息，并发现遗忘三元组困境和类别-样本不对称性。

详情

AI中文摘要

垂直联邦学习中的机器遗忘引起了越来越多的关注，但现有方法仅使用输出层面指标来认证遗忘。我们通过引入Mirage（一个表示层面审计框架，包含四种互补诊断方法：线性探针恢复、中心核对齐、特征可分性评分和逐层恢复分析）来挑战这些说法。通过在七个数据集和七种基线方法上遵循最近的VFL遗忘协议进行实验，Mirage揭示了三个关键发现：（i）遗忘差距：通过输出层面认证的方法在其表示中仍然保留了大量的类别结构，线性探针恢复比重新训练的基线高出最多15.4个百分点；中心核对齐显示这些模型在结构上更接近原始模型而非重新训练的参考模型，而可分性评分表明存在持续的几何区分。（ii）遗忘三元组困境：没有现有方法能同时实现高效用、输出层面遗忘和表示层面遗忘。（iii）类别-样本不对称性：类别级遗忘留下强烈的表示痕迹（线性探针恢复高达97%），而样本级遗忘与随机无异（线性探针恢复约50%）；逐层分析进一步表明残差类别信息在网络深度中持续存在。这些发现呼吁在联邦遗忘研究中采用表示层面感知的评估标准。

英文摘要

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.

URL PDF HTML ☆

赞 0 踩 0

2605.17839 2026-06-02 cs.LG cs.AI 版本更新

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

基于双层优化的不平衡学习知识蒸馏平衡

Anh B. H. Nguyen, Ba Tho Phan, Viet Cuong Ta

发表机构 * VNU University of Engineering and Technology（越南工程技术大学）

AI总结提出BiKD双层框架，通过自适应样本级权重平衡硬损失和软损失，解决不平衡数据上知识蒸馏的脆弱性问题。

Comments Accepted to Special Session: Data Science: Foundations and Applications (DSFA), PAKDD 2026

详情

AI中文摘要

知识蒸馏通过混合硬损失和软损失将高容量教师的知识转移到紧凑的学生模型。在不平衡数据上，硬损失和软损失之间的固定权重使得学习过程变得脆弱。最近的研究尝试在长尾设置中重新加权这些组件。然而，大多数方法没有在样本级别调整权重，也没有考虑训练过程中学生的行为。为了解决这个问题，我们提出了BiKD——一个双层框架，动态平衡每个样本的硬损失和软损失。我们采用一个权重生成网络，由一个小型平衡验证集引导，产生自适应的逐样本权重。学生现在通过无约束的加权硬损失和软损失组合进行训练，使得学生可以放松这两个项。我们进一步提出了一种多步SGD策略，以更准确和高效地优化权重模型。在长尾CIFAR-10/100上的实验表明，我们的方法在不同不平衡因子下均优于最近的平衡蒸馏方法。

英文摘要

Knowledge distillation transfers knowledge from a high capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses becomes brittle the learning process. Recent studies try to reweight these components in long-tailed settings. However, most of these methods do not adapt weights at the sample-wise level and do not take into account the students behavior during training. To address this, we propose BiKD -- a bilevel framework that dynamically balances hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced validation set. The student is now trained with an unconstrained combination of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the weight model more accurately and efficiently. Experiments on long-tailed CIFAR-10/100 show that our approach surpasses recent balanced distillation methods across imbalance factors.

URL PDF HTML ☆

赞 0 踩 0

2605.18077 2026-06-02 cs.AI cs.LG cs.MA 版本更新

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

LLM引导的通信用于合作多智能体强化学习

Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han

发表机构 * KAIST（韩国科学技术院）

AI总结提出LMAC框架，利用大语言模型的推理能力设计通信协议，使所有智能体尽可能准确一致地重建底层状态，从而提升多智能体强化学习中的状态重建和性能。

Comments 9 pages for main, 32 pages for total, Accepted to ICML 2026

2605.17909 2026-06-02 cs.AI cs.LO 版本更新

Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems

伦理超高速 (EHV)：一种面向代理型AI系统的硬件根零信任运行时强制架构

Riddhi Mohan Sharma

发表机构 * Senior Member, IEEE（IEEE高级会员）

AI总结提出伦理超高速 (EHV) 架构，通过结合语法约束解码、因果图CRDT、可信执行环境和OSCAL审计日志，实现硬件根的零信任运行时强制，将策略执行点嵌入推理管线，显著降低治理延迟并支持形式化验证。

Comments 12 pages, 3 TikZ Figures, 3 Tables

详情

AI中文摘要

随着自主代理系统在受监管的关键基础设施中规模化部署，缺乏针对高频策略更新的机械性、硬件根强制机制构成了基本的安全缺口。我们提出伦理超高速 (EHV)，一种面向代理系统的治理感知运行时强制架构，它结合了用于内联策略约束令牌生成的语法约束解码 (GCD)、基于向量时钟排序的因果图CRDT策略同步、可信执行环境 (TEE) 中的硬件证明执行以及OSCAL格式的机器可读审计日志。与引入14-30天策略延迟的事后审计框架（如ISO/IEC 42001、NIST AI RMF）不同，EHV通过治理感知即时 (JIT) 编译器将策略执行点 (PEP) 重新定位到推理管线中。在明确陈述的假设下，该架构降低了强制延迟，提高了可追溯性，并支持有界模型中的安全不变量的形式化验证。我们通过TLA+模型检查证明，在验证的有界运行状态空间（生成1738个状态，324个不同状态，深度8，零违规）中，不合规的代理行为是不可达的。在这些条件下，O(1)运行时强制减少了部署速度与治理完整性之间的传统权衡，将治理延迟从O(天)降至O(1)。EHV的差异化贡献在于将GCD、因果CRDT、TEE证明缓存和有界形式化验证集成到一个单一的、硬件根的强制架构中——这是任何同期系统都未实现的组合。该架构通过儿科肿瘤剂量用例进行演示，适用于包括医疗、金融合规和关键基础设施控制在内的受监管关键基础设施。

英文摘要

As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for high-frequency policy updates presents a fundamental safety gap. We present Ethical Hyper-Velocity (EHV), a governance-aware runtime enforcement architecture for agentic systems that combines Grammar-Constrained Decoding (GCD) for inline policy-constrained token generation, Causal Graph CRDT-based policy synchronization with vector-clock ordering, hardware-attested execution in Trusted Execution Environments (TEEs), and OSCAL-formatted machine-readable audit logging. Unlike retrospective auditing frameworks (ISO/IEC 42001, NIST AI RMF) that introduce 14-30 day policy latencies, EHV relocates the Policy Enforcement Point (PEP) into the inference pipeline via a Governance-Aware Just-In-Time (JIT) Compiler. Under explicitly stated assumptions, the architecture reduces enforcement latency, improves traceability, and supports formal verification of safety invariants in a bounded model. We demonstrate via TLA+ model checking that non-compliant agentic actions were unreachable in the verified bounded operating state space (1,738 states generated, 324 distinct, depth 8, zero violations). Under these conditions, O(1) runtime enforcement reduces the traditional trade-off between deployment velocity and governance integrity, targeting Governance Latency from O(days) toward O(1). EHV's differentiating contribution is the integration of GCD, Causal CRDT, TEE attestation caching, and bounded formal verification into a single, hardware-rooted enforcement architecture -- a combination not achieved by any contemporaneous system. The architecture is demonstrated through a pediatric oncology dosage use case, with applicability to regulated critical infrastructures including healthcare, financial compliance, and critical infrastructure control.

URL PDF HTML ☆

赞 0 踩 0

2605.17554 2026-06-02 cs.AI cs.LG 版本更新

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

评估深度研究代理在专家咨询工作中的表现：一个包含验证器、评分标准和认知陷阱的基准

Tanmay Asthana, Aman Saksena, Divyansh Sahu

AI总结本文提出一个基准，通过42个专家编写的任务，使用确定性验证器和五维度评分标准评估三个前沿深度研究代理（Claude、OpenAI o3、Gemini）在管理咨询类结构化分析交付物上的表现，并嵌入认知陷阱，发现所有代理的联合接受率均较低（最高21.4%），且各有独特失败模式。

Comments Updating the paper with more data. Will resubmit

详情

AI中文摘要

前沿深度研究代理（DRA）能够规划研究任务、综合多篇文档，并按需生成结构化的交付物。它们在企业工作流中的部署速度远快于评估速度。现有基准衡量事实回忆、单跳问答或通用代理技能，忽略了DRA被部署用于生成的多文档、决策级工作。我们引入一个基准，针对管理咨询师典型一周中所需的结构化分析交付物。我们评估三个前沿代理，即Claude Opus 4.6（带网络搜索）、OpenAI o3-deep-research和Google Gemini 3.1 Pro deep-research，在42个由领域专家（SME）编写的提示上。每个提示的126个响应在两个层面评分：确定性真实验证器（平均每个任务13.8个）和五维度0-3 SME评分标准，组合成0-100的验证器-评分标准分数（VRS）。大多数提示嵌入了惩罚表面模式匹配的认知陷阱。在我们的联合阈值（评分标准均值>=2.5且验证器通过率>=80%）下的接受率普遍较低：Gemini 21.4%，o3 9.5%，Claude 9.5%。平均VRS分数与已发表的基于评分标准的基准一致（我们的最高62.6对比APEX-v1 64.2，ProfBench 65.9，ResearchRubrics <68%），验证了评分标准构建。ACCEPT率低于APEX-Agents在专用DR代理上的MC-segment Pass@1区间（12.3-22.7%）；尽管有工具优势，我们的下限仍低三个百分点，这是由于更严格的合取评分和陷阱设计。每个代理的失败模式各不相同。Claude最可靠地生成交付物（在需要文件的任务上比其他代理高4.5倍），但具有最高的虚构特征。o3具有最清晰的推理平均值，但会遗漏必要部分并传播算术错误。Gemini是双峰的，具有最高的接受率，同时也有最多的零分评分标准单元格。

英文摘要

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.

URL PDF HTML ☆

赞 0 踩 0

2605.12969 2026-06-02 cs.LG cs.AI 版本更新

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

从对比视角重新审视基于可验证奖励的强化学习

Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang

发表机构 * Beijing Institute of Technology（北京理工大学）； Qwen Business Unit of Alibaba（阿里巴巴Qwen业务部）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结本文提出ConSPO方法，通过对比序列级策略优化，解决GRPO在目标函数上的似然错配和信用分配不敏感问题，在推理任务上超越强基线。

详情

AI中文摘要

组相对策略优化（GRPO）是目前最广泛采用的RLVR算法之一，用于对大型语言模型进行推理任务的后训练。我们首先证明GRPO存在等价的判别式重新表述，其中策略优化最大化验证的正负rollout之间的期望得分差距。这种重新表述揭示了两个目标层面的局限性：似然错配的替代得分（优化的是基于裁剪比率的得分而非控制生成的序列似然）和得分不敏感的信用分配（rollout级别的信用不反映当前正负rollout之间的得分差距）。为了解决这些局限性，我们提出ConSPO，一种对比序列级策略优化方法，它使用长度归一化的序列对数概率作为rollout得分，并在同一组内对比验证的正rollout与负干扰项。ConSPO优化一个组级别的InfoNCE风格目标，以自适应地增强对分离不佳的正样本和高分负样本的更新，同时结合课程调度的边界，在训练过程中保持分离压力。在多种设置下的实验表明，ConSPO在具有挑战性的推理基准上优于强基线。代码将在论文被接收后发布。

英文摘要

Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in which clipped ratio-based scores are optimized rather than the sequence likelihoods that govern generation, and score-insensitive credit assignment, in which rollout-level credit does not reflect the current score gaps between positive and negative rollouts. To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives, together with a curriculum-scheduled margin that preserves separation pressure as training progresses. Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks. Code will be released upon paper acceptance.

URL PDF HTML ☆

赞 0 踩 0

2603.05308 2026-06-02 cs.CL cs.AI 版本更新

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Med-V1：用于零样本和可扩展生物医学证据归因的小型语言模型

Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu

发表机构 * Division of Intramural Research, National Library of Medicine, National Institutes of Health（国家医学图书馆内部研究部，国立卫生研究院）； Department of Computer Science, University of Virginia（弗吉尼亚大学计算机科学系）； Center for Cancer Research, National Cancer Institute, National Institutes of Health（国家癌症研究所癌症研究中心，国立卫生研究院）； Department of Population Health Sciences, Weill Cornell Medicine Institute of AI for Digital Health, Weill Cornell Medicine（韦尔·科恩医学中心流行病学与健康科学系，韦尔·科恩医学中心人工智能与数字健康研究所）

AI总结提出仅3B参数的小语言模型Med-V1，通过高质量合成数据训练，在生物医学证据归因任务上性能媲美GPT-5等前沿大模型，并用于量化LLM幻觉和识别临床指南中的证据错误归因。

详情

AI中文摘要

评估一篇文章是否支持某个断言对于幻觉检测和声明验证至关重要。虽然大型语言模型（LLM）有潜力自动化这一任务，但实现强性能需要如GPT-5这样的前沿模型，而这些模型在规模部署时成本过高。为了高效执行生物医学证据归因，我们提出了Med-V1，一个仅有三亿参数的小语言模型家族。在本研究中新开发的高质量合成数据上训练，Med-V1在统一为验证格式的五个生物医学基准上显著优于其基础模型（+27.0%至+71.3%）。尽管规模较小，Med-V1的性能与GPT-5等前沿LLM相当，并提供高质量的预测解释。我们使用Med-V1进行了首次用例研究，量化了不同引用指令下LLM生成答案中的幻觉。结果表明，格式指令强烈影响引文有效性和幻觉，GPT-5生成更多声明但表现出与GPT-4o相似的幻觉率。此外，我们展示了第二个用例，表明Med-V1可以自动识别临床实践指南中的高风险证据错误归因，揭示了否则难以大规模识别的潜在负面公共卫生影响。总体而言，Med-V1为生物医学证据归因和验证任务的实际应用提供了一种高效、准确的轻量级替代方案。Med-V1可在https://github.com/ncbi-nlp/Med-V1获取。

英文摘要

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

URL PDF HTML ☆

赞 0 踩 0

2605.17110 2026-06-02 cs.AI cs.LG 版本更新

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

通过证据校准的查询聚类捕捉LLM能力

Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； Elorian AI

AI总结提出ECC算法，利用有限后验模型比较校准先验语义嵌入，通过Bradley-Terry模型参数化能力轮廓，联合学习灵活的能力感知聚类结构，显著提升LLM能力排序质量。

Comments 45 pages

详情

AI中文摘要

查询聚类将查询分组为反映共享潜在能力需求的组，从而实现能力感知的LLM评估。现有的聚类方法主要依赖于语义分类或嵌入，由于表面语义与实际模型性能之间的错位，往往无法捕捉此类潜在能力需求。我们提出ECC，一种使用有限后验模型比较校准先验语义嵌入的算法，以弥合表面语义与潜在能力需求之间的差距。ECC通过Bradley-Terry模型参数化的能力轮廓来表征每个聚类，并使用可训练的混合权重来适应具有混合能力需求的查询，联合学习灵活的能力感知聚类结构，支持查询特定的LLM能力推断。大量的定量和定性评估表明，ECC显著提高了LLM能力排序质量，分别比人工标注和基于嵌入的基线平均高出17.64和18.02个百分点，并在查询路由等下游任务中证明有效。

英文摘要

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.

URL PDF HTML ☆

赞 0 踩 0

2605.17034 2026-06-02 cs.LG cs.AI cs.CR 版本更新

避免表格公平半监督学习中的结构失效模式：基于置信门控的在线原始-对偶分配

Hangchuan Liang, Changchun Li

发表机构 * College of Computer Science and Technology, Jilin University, China（吉林大学计算机科学与技术学院）； Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, China（教育部符号计算与知识工程重点实验室）

AI总结针对表格公平半监督学习中的结构冲突，提出在线原始-对偶分配（OPDA）方法，通过动态调度公平性和熵稳定性惩罚，避免掩码崩溃和平凡饱和两种失效模式，在多个基准上实现非退化运行点。

详情

AI中文摘要

半监督学习（SSL）能够在有限标签下进行预测，但高风险表格应用（医疗、信贷、再犯）需要统计公平性保证。通过诊断压力测试，我们识别出表格公平SSL中的结构冲突：在置信门控伪标签下，矩匹配公平正则化器可能触发两种失效模式——掩码崩溃（公平性侵蚀置信度，导致伪标签匮乏）和平凡饱和（漂移至常数预测器）。我们提出在线原始-对偶分配（OPDA），一种在线控制器，利用违规、风险和伪标签健康信号调度公平性和基于熵的稳定性惩罚，从而避免在该诊断机制下为每个数据集选择固定公平权重。在评估的表格基准（Adult、ACSIncome、COMPAS）上，OPDA缓解了静态权重和简单单信号自适应基线中观察到的退化状态。在Adult和COMPAS上，它产生了与经验静态λ前沿竞争的非退化运行点；在ACSIncome上，它保持了效用，同时具有更宽的公平-效用分布。相对于OPDA-lite，完整控制器主要在ACSIncome上将运行点向更高效用偏移，而Adult则突出了两种变体之间的公平-效用权衡。这些结果使OPDA成为表格公平SSL中无需校准的控制器，无需针对每个数据集进行调整即可获得非退化运行点。

英文摘要

Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes -- Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static-$λ$ frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.

URL PDF HTML ☆

赞 0 踩 0

2605.09366 2026-06-02 cs.AI 版本更新

Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

迈向虚拟神经科学家：基于多智能体协作的自主神经影像分析

Keqi Han, Songlin Zhao, Yao Su, Xiang Li, Yixuan Yuan, Lifang He, Carl Yang

发表机构 * Emory University（埃默里大学）； Lehigh University（莱斯大学）； Worcester Polytechnic Institute（沃思堡理工学院）； Massachusetts General Hospital（麻省总医院）； Harvard University（哈佛大学）； Chinese University of Hong Kong（香港中文大学）

AI总结提出NEXUS多智能体框架，通过代码中心执行和分层验证实现自主神经影像分析，在ADHD-200和ADNI数据集上优于传统工作流。

详情

AI中文摘要

将神经影像数据转化为临床可操作的生物标志物是一个知识密集型和劳动密集型过程。fMRIPrep等标准化工作流提高了鲁棒性和效率，但它们是静态配置的，无法像人类研究人员那样推理下游目标、权衡替代策略或在中级证据与后续决策之间形成闭环。这种闭环适应的缺失常常使领域专家陷入手动试错以调整参数和修复工作流失败的循环，严重限制了临床生物标志物开发的可扩展性。为弥补这一差距，我们引入了NEXUS，一个自主多智能体框架，它将神经影像工作流执行与科学目标理解相结合。与传统的平面工具调用智能体不同，NEXUS采用以代码为中心的执行范式，其中专业智能体在可组合的领域特定原语上协作合成和优化可执行程序。这种设计使得鲁棒的、长时程的工作流构建成为可能，并能动态适应运行时观察。此外，我们提出了一个用于自主质量控制的分层验证框架，将队列级指标筛选与智能体视觉检查相结合，以驱动基于证据的工作流修复。在ADHD-200和ADNI上的实验表明，NEXUS在预测性能上优于标准工作流基线，同时展现出复杂的智能体行为，包括策略探索和自适应改进。代码可在https://github.com/LearningKeqi/Virtual-Neuroscientist-NEXUS获取。

英文摘要

Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized workflows such as fMRIPrep have improved robustness and efficiency, but they are statically configured and cannot reason about downstream objectives, deliberate over alternative strategies, or close the loop between intermediate evidence and subsequent decisions in the way a human researcher would. This lack of closed-loop adaptation often leaves domain experts trapped in a cycle of manual trial-and-error to tune parameters and remediate pipeline failures, severely constraining the scalability of clinical biomarker development. To bridge this gap, we introduce NEXUS, an autonomous multi-agent framework that integrates neuroimaging workflow execution with scientific-objective understanding. Unlike conventional flat toolcalling agents, NEXUS adopts a code-centric execution paradigm where specialist agents collaboratively synthesize and optimize executable programs over composable domain-specific primitives. This design enables robust, long-horizon workflow construction that adapts dynamically to runtime observations. Furthermore, we propose a hierarchical verification framework for autonomous quality control, integrating cohort-level metric screening with agentic visual inspection to drive evidence-grounded workflow remediation. Experiments on ADHD-200 and ADNI demonstrate that NEXUS outperforms standard workflow-based baselines in predictive performance while exhibiting sophisticated agentic behaviors, including strategy exploration and adaptive refinement. The code is available at https://github.com/LearningKeqi/Virtual-Neuroscientist-NEXUS.

URL PDF HTML ☆

赞 0 踩 0

2603.05917 2026-06-02 cs.LG cs.AI q-fin.ST 版本更新

Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis

结合BERT情感分析的节点Transformer架构用于股票市场预测

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

发表机构 * University of Technology, Baghdad, Iraq（巴格达大学）

AI总结提出一种将节点Transformer与BERT情感分析相结合的框架，通过图结构建模股票间依赖关系并融合社交媒体情感，在S&P 500股票上实现0.80%的MAPE，显著优于传统方法。

Comments 18 pages, 5 figures, 12 tables. Accepted for publication in IEEE Access

详情

DOI: 10.1109/ACCESS.2026.3691980
Journal ref: IEEE Access, vol. 14, pp. 72613-72631, 2026

AI中文摘要

股票市场预测对在噪声、非平稳和行为动态的复杂市场环境中操作的投资者、金融机构和政策制定者提出了相当大的挑战。传统的预测方法，包括基本面分析和技术指标，往往无法捕捉金融市场中固有的复杂模式和横截面依赖性。本文提出了一种结合节点Transformer架构与基于BERT的情感分析的集成框架，用于股票价格预测。该模型将股票市场表示为图结构，其中个股构成节点，边捕捉关系，包括行业隶属关系、相关价格变动和供应链连接。一个微调的BERT模型从社交媒体帖子中提取情感信息，并通过基于注意力的融合机制将其与定量市场特征相结合。节点Transformer处理历史市场数据，同时捕捉股票间的时间演变和横截面依赖性。在1982年1月至2025年3月期间20只S&P 500股票上进行的实验表明，集成模型在一天前预测中实现了0.80%的平均绝对百分比误差（MAPE），而ARIMA为1.20%，LSTM为1.00%。情感分析的加入使预测误差总体降低10%，在财报公告期间降低25%，而基于图的架构通过捕捉股票间依赖性额外贡献了15%的改进。方向准确率在一天预测中达到65%。通过配对t检验的统计验证确认了这些改进的显著性（所有比较p < 0.05）。该模型在高波动期保持较低的误差，MAPE为1.50%，而基线模型范围为1.60%至2.10%。

英文摘要

Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods, including fundamental analysis and technical indicators, often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment information from social media posts and combines it with quantitative market features through attention-based fusion mechanisms. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments conducted on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. The inclusion of sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while the graph-based architecture contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms the significance of these improvements (p < 0.05 for all comparisons). The model maintains lower error during high-volatility periods, achieving MAPE of 1.50% while baseline models range from 1.60% to 2.10%.

URL PDF HTML ☆

赞 0 踩 0

2605.14791 2026-06-02 astro-ph.IM astro-ph.CO cs.AI 版本更新

Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

超越AI助手：迈向宇宙学中的自主发现

Licong Xu, Thomas Borrett

发表机构 * Institute of Astronomy, University of Cambridge（剑桥大学天文研究所）； Kavli Institute for Cosmology, University of Cambridge（剑桥大学凯斯勒宇宙研究所）； Cavendish Astrophysics, University of Cambridge（剑桥大学卡文迪许天体物理研究所）

AI总结本文提出两种互补的智能体系统（CMBEvolve和CosmoEvolve），通过LLM引导的代码进化与树搜索以及虚拟多智能体研究实验室，实现宇宙学中的自主科学发现，并在弱引力透镜异常检测和ACT DR6数据分析中展示了初步成果。

Comments 4 pages, 2 figures, Contribution to the 2026 Cosmology session of the 60th Rencontres de Moriond

2605.14398 2026-06-02 cs.AI 版本更新

任意骨干网络的归一化等变性及其在图像去噪中的应用

Youssef Saied, François Fleuret

发表机构 * University of Cambridge（剑桥大学）； DeepMind

AI总结提出无参数包装器WNE，通过输入归一化、任意骨干网络处理、输出反归一化实现归一化等变，在盲去噪中提升CNN和Transformer对噪声水平失配的鲁棒性且无GPU开销。

详情

AI中文摘要

归一化等变性（NE）是一种结构先验，可提高图像到图像任务中对分布偏移的鲁棒性。函数 $f$ 是归一化等变的当且仅当对于所有 $a>0$ 和 $b\in\mathbb{R}$，有 $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$。现有的NE方法将每个内部层约束为与NE兼容的操作。这些约束增加了运行时成本，并排除了标准的Transformer组件，如softmax注意力和LayerNorm。我们引入了包装归一化等变性（WNE），这是一种无参数包装器，它对输入进行归一化，应用任意骨干网络，然后对输出进行反归一化。我们证明了每个NE函数都允许这种分解，因此该包装器精确参数化了NE函数类。在盲去噪中，包装CNN和Transformer架构在噪声水平失配下提高了鲁棒性，且没有可测量的GPU开销，而架构性NE基线则慢达 $1.6$ 倍。

英文摘要

Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f$ is normalization equivariant iff $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ for all $a>0$ and $b\in\mathbb{R}$. Existing NE methods constrain every internal layer to NE-compatible operations. These constraints add runtime cost and exclude standard transformer components such as softmax attention and LayerNorm. We introduce Wrapped Normalization Equivariance (WNE), a parameter-free wrapper that normalizes the input, applies any backbone, and denormalizes the output. We prove every NE function admits this factorization, so the wrapper exactly parameterizes the class of NE functions. On blind denoising, wrapping CNN and transformer architectures improves robustness under noise-level mismatch with no measurable GPU overhead, while architectural NE baselines are up to $1.6\times$ slower.

URL PDF HTML ☆

赞 0 踩 0

2605.13834 2026-06-02 cs.LG cs.AI cs.CG 版本更新

Topology-Preserving Neural Operator Learning via Hodge Decomposition

通过Hodge分解保持拓扑的神经算子学习

Dongzhe Zheng, Tao Zhong, Christine Allen-Blanchette

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文从函数空间视角研究几何网格上物理场方程的解算子，利用Hodge正交性分离不可学习的拓扑自由度与可学习的几何动力学，提出基于Hodge谱对偶的混合欧拉-拉格朗日架构，在保持物理不变量的同时提升几何图上的精度与效率。

Comments Accepted at ICML 2026. Code available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

详情

AI中文摘要

本文从函数空间视角研究几何网格上物理场方程的解算子。我们发现Hodge正交性通过将不可学习的拓扑自由度与可学习的几何动力学分离，从根本上解决了谱干扰问题，从而实现了局限于保结构子空间的加性逼近。基于Hodge理论和算子分裂，我们推导出原则性的算子级分解。结果是一种混合欧拉-拉格朗日架构，具有我们称为Hodge谱对偶（HSD）的代数级归纳偏置。在我们的框架中，我们使用离散微分形式捕捉拓扑主导的分量，并使用正交辅助环境空间表示复杂的局部动力学。我们的方法在几何图上实现了优越的准确性和效率，并增强了对物理不变量的保真度。我们的代码可在https://github.com/ContinuumCoder/Hodge-Spectral-Duality获取。

英文摘要

In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure-preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator-level decomposition. The result is a Hybrid Eulerian-Lagrangian architecture with an algebraic-level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology-dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

URL PDF HTML ☆

赞 0 踩 0

2605.13430 2026-06-02 stat.ME cs.AI cs.LG 版本更新

Towards a holistic understanding of Selection Bias for Causal Effect Identification

走向因果效应识别中选择偏差的整体理解

Yiwen Qiu, Filip Kovačević, Shimeng Huang, Peter Spirtes, Francesco Locatello

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结研究在观测研究中存在选择偏差时，如何利用弱假设刻画倾向得分和选择概率，给出平均处理效应可识别性的充要条件，扩展了现有图形识别准则。

Comments 9 pages for the main text, ICML 2026

详情

AI中文摘要

选择偏差在观测研究中普遍存在。例如，大规模生物库数据可能表现出“健康志愿者偏差”，即受访者比他们所要代表的人群更健康、社会经济地位更高。从这样的子人群中恢复因果效应是因果推断中的一个重要问题，因为从选定人群估计平均处理效应（ATE）可能导致对整个群体的ATE估计严重偏倚。本文研究了选择偏差下ATE的可识别性。我们利用概率类的弱假设刻画倾向得分和选择概率，给出了ATE可识别性的充要条件。与以往工作相比，我们的结果扩展了现有的图形可识别性准则，并在存在选择偏差的情况下，以严格更弱的条件提供了对因果效应识别更全面的理解。

英文摘要

Selection bias is pervasive in observational studies. For example, large scale biobanks data can exhibit ``healthy volunteer bias'' when respondents are healthier and of higher socio-economic status than the population they are meant to represent. Recovering causal effects from such sub-population is an important problem in causal inference, as estimating average treatment effects (ATE) from selected populations can result in a severely biased estimate of the ATE from the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We provide necessary and sufficient conditions for ATE identifiability, leveraging weak assumptions on probability classes to characterize propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identification with strictly weaker conditions in the presence of selection bias.

URL PDF HTML ☆

赞 0 踩 0

2505.12741 2026-06-02 cs.AI 版本更新

MobiBench: 面向移动GUI智能体的多分支模块化基准

Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee

发表机构 * KAIST（韩国科学技术院）； Sungkyunkwan University（全北大学）； Korea University（韩国大学）； Fluiz

AI总结提出MobiBench，首个模块化且支持多路径感知的离线基准测试框架，用于高保真、可扩展和可复现地评估移动GUI智能体，并揭示组件级性能瓶颈。

详情

AI中文摘要

移动GUI智能体，即能够代表用户与移动应用交互的AI智能体，有潜力改变人机交互。然而，当前GUI智能体的评估实践存在两个基本限制。首先，它们要么依赖单路径离线基准，要么依赖在线实时基准。使用静态、单路径标注数据集的离线基准不公平地惩罚有效的替代动作，而在线基准由于实时评估的动态和不可预测性，面临可扩展性和可复现性差的问题。其次，现有基准将智能体视为单一黑盒，忽略了各个组件的贡献，这常常导致不公平的比较或掩盖关键性能瓶颈。为了解决这些限制，我们提出了MobiBench，这是首个模块化且支持多路径感知的移动GUI智能体离线基准测试框架，能够在完全离线环境下实现高保真、可扩展和可复现的评估。我们的实验表明，MobiBench与人类评估者的一致性达到94.72%，与精心设计的在线基准相当，同时保留了静态离线基准的可扩展性和可复现性。此外，我们全面的模块级分析揭示了几个关键见解，包括对移动GUI智能体中使用的多种技术的系统评估、跨模型规模的最佳模块配置、当前LFM的固有限制，以及设计更强大且成本效益更高的移动智能体的可操作指南。

英文摘要

Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they either rely on single path offline benchmarks or online live benchmarks. Offline benchmarks using static, single path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi path aware offline benchmarking framework for mobile GUI agents that enables high fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost efficient mobile agents.

URL PDF HTML ☆

赞 0 踩 0

2605.12400 2026-06-02 cs.LG cs.AI 版本更新

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

OGLS-SD：基于结果引导的对数几率操控的在线自蒸馏用于大语言模型推理

Yuxiao Yang, Xiaoyun Wang, Weitong Zhang

发表机构 * UNC Chapel Hill（UNC夏洛特山分校）

AI总结提出OGLS-SD框架，通过结果奖励校准教师对数几率，解决在线自蒸馏中师生响应模式不匹配导致的训练不稳定问题，提升数学推理性能。

Comments 17 pages, 10 figures, 5 tables

详情

AI中文摘要

我们研究在线自蒸馏（OPSD），其中语言模型通过沿其自身在线轨迹蒸馏特权教师分布来提高推理能力。尽管有前景，OPSD可能因教师和学生响应之间的模式不匹配而遭受训练不稳定。自我反思的教师响应可能引入反思引起的偏差和响应模板，从而错误校准令牌级监督，最终损害学生的推理能力。为缓解此问题，我们提出OGLS-SD，一种结果引导的对数几率操控框架，利用可验证的结果奖励来校准特权教师对数几率。具体而言，OGLS-SD对比由成功和失败的在线轨迹诱导的教师对数几率，构建一个结果判别性的操控方向用于令牌级指导。在数学推理基准上的实验表明，OGLS-SD稳定了自蒸馏，并提高了相对于标准OPSD和其他变体的性能。

英文摘要

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance. Experiments on mathematical reasoning benchmarks show that OGLS-SD stabilizes self-distillation and improves performance over standard OPSD and other variants.

URL PDF HTML ☆

赞 0 踩 0

2602.08058 2026-06-02 cs.CV cs.AI cs.RO cs.SY eess.SY 版本更新

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Picasso: 基于物理约束采样的整体场景重建

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； National University of Singapore（新加坡国立大学）

AI总结提出Picasso，一种通过快速拒绝采样推理多物体交互并考虑几何、非穿透和物理约束的整体场景重建方法，在物理合理性和重建精度上显著优于现有技术。

Comments 15 pages, accepted to Robotics: Science and Systems (RSS) 2026

详情

AI中文摘要

在存在遮挡和测量噪声的情况下，几何精确的场景重建（即拟合传感器数据）仍然可能在物理上不正确。例如，当估计场景中物体的姿态和形状并将结果导入模拟器时，微小误差可能导致不合理的配置，包括物体相互穿透或不稳定平衡。这使得使用数字孪生预测场景的动态行为变得困难，而这是基于模拟的接触丰富行为规划和控制的重要步骤。在本文中，我们认为物体姿态和形状估计需要对场景进行整体推理（而不是孤立地推理每个物体），考虑物体交互和物理合理性。为此，我们的第一个贡献是Picasso，一个受物理约束的重建流水线，通过考虑几何、非穿透和物理来构建多物体场景重建。Picasso依赖于一种快速拒绝采样方法，该方法推理多物体交互，利用推断的物体接触图来指导采样。其次，我们提出了Picasso数据集，这是一个包含10个接触丰富真实场景的集合，带有真实标注，以及一个量化物理合理性的指标，我们将其作为基准测试的一部分开源。最后，我们在新引入的数据集和YCB-V数据集上对Picasso进行了广泛评估，结果表明它在提供物理合理且更符合人类直觉的重建的同时，大幅优于现有技术。

英文摘要

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.

URL PDF HTML ☆

赞 0 踩 0

2603.29002 2026-06-02 cs.DC cs.AI 版本更新

Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference

理解并加速大型语言模型推理的内存处理流水线

Zifan He, Rui Ma, Yizhou Sun, Jason Cong

发表机构 * GitHub

AI总结本文通过将稀疏注意力、检索增强生成和压缩上下文内存等优化统一为四步内存处理流水线，识别出22%-97%的内存处理开销，并提出使用GPU-FPGA异构系统加速该流水线，实现最高2.2倍加速和4.7倍能效提升。

Comments Accepted by ICML 2026. Code: https://github.com/OswaldHe/HeteroLLM

详情

AI中文摘要

现代大型语言模型（LLMs）越来越依赖于高效的长上下文处理和生成机制，包括稀疏注意力、检索增强生成（RAG）和压缩上下文内存，以支持复杂推理。我们表明这些优化可以统一为一个四步内存处理流水线：准备内存、计算相关性、检索和应用到推理。通过系统分析，我们识别出LLM推理中22%-97%的内存处理开销及其计算特征的强异构性。受此洞察启发，我们认为异构系统非常适合加速内存处理，从而加速端到端推理。我们在GPU-FPGA系统上展示了这种方法，将稀疏、不规则和内存受限的操作卸载到FPGA，同时将计算密集型操作保留在GPU上。在AMD MI210 GPU和Alveo U55C FPGA上评估，我们的系统在多种LLM推理优化中比GPU基线快高达2.2倍，能耗降低高达4.7倍（在NVIDIA A100上结果类似）。这些结果确立了异构系统作为高效LLM内存处理的实用方向，并为未来异构硬件设计提供参考。

英文摘要

Modern large language models (LLMs) increasingly depends on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bounded operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is up to $2.2\times$ faster and achieves up to $4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.

URL PDF HTML ☆

赞 0 踩 0

2605.09907 2026-06-02 cs.AI cs.MA 版本更新

RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation

RADAR：面向多智能体通信结构生成的冗余感知扩散方法

Zhen Zhang, Wanjing Zhou, Juncheng Li, Hao Fei, Jun Wen, Wei Ji

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出一种基于条件离散图扩散模型的冗余感知生成框架RADAR，通过逐步生成通信拓扑并利用图有效尺寸引导，在六项基准上实现更高准确率、更低令牌消耗和更强鲁棒性。

Comments Accepted by ICML 2026 (fix typos)

详情

AI中文摘要

与单个智能体相比，基于大语言模型的多智能体系统在代码生成、数学推理和规划等不同任务上持续展现出强大的能力。尽管性能令人印象深刻，但这些系统的有效性和鲁棒性在很大程度上依赖于其通信拓扑，而通信拓扑通常是固定的或单步生成的。这限制了细粒度的结构探索和灵活的组合，导致在简单任务上过度使用令牌，同时在复杂任务上能力受限。为了缓解这一挑战，我们引入了RADAR，一个冗余感知且查询自适应的生成框架，主动减少通信开销。受条件离散图扩散模型最新进展的启发，我们将通信拓扑设计表述为一个逐步生成的过程，并由图的有效尺寸引导。在六个基准上的全面实验表明，RADAR在多种场景下始终优于最近的基线方法，实现了更高的准确率、更低的令牌消耗和更强的鲁棒性。我们的代码和数据可在 https://github.com/cszhangzhen/RADAR 获取。

英文摘要

Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse tasks, including code generation, mathematical reasoning, and planning, etc. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, resulting in excessive token utilization on simple tasks while limiting capability on complicated tasks. To mitigate this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduce communication overhead. Motivated by recent progress in conditional discrete graph diffusion models, we formulate communication topology design as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our code and data are available at https://github.com/cszhangzhen/RADAR.

URL PDF HTML ☆

赞 0 踩 0

2605.09883 2026-06-02 cs.CV cs.AI 版本更新

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

笛卡尔捷径：在极坐标空间中重新评估视觉推理

Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

发表机构 * Stanford University（斯坦福大学）； Google Research（谷歌研究院）

AI总结针对多模态大语言模型在视觉推理中利用笛卡尔坐标捷径的问题，提出Polaris-Bench基准，将任务转换至极坐标空间，揭示模型缺乏拓扑不变性视觉推理。

详情

AI中文摘要

随着当前多模态大语言模型迅速饱和标准视觉推理基准，一个关键问题浮现：这些高分是否真正反映了鲁棒的视觉理解？我们发现了一个普遍存在的漏洞，即笛卡尔捷径：视觉推理基准普遍基于正交网格布局，这些布局可以轻易地离散化为显式的文本坐标。模型系统地利用这一特性，大量依赖基于文本的演绎推理来辅助视觉问题解决。为了系统地消除这一捷径，我们引入了Polaris-Bench，该基准将53个视觉推理任务重新表述在极坐标空间中，并配有对应的笛卡尔坐标作为参考，同时保持一致的逻辑约束和任务语义——从而从根本上打破了模型所利用的正交先验。对14个最先进MLLM的全面评估显示，在笛卡尔布局上达到70%-83%的前沿模型在极坐标等价布局上骤降至31%-39%，即使在完全逻辑等价的情况下，性能下降依然持续。此外，在笛卡尔布局上观察到的推理增益在极坐标等价布局上严重减弱。这些发现揭示了当前MLLM的一个关键缺陷：缺乏拓扑不变的视觉推理。

英文摘要

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across $14$ state-of-the-art MLLMs reveals that frontier models achieving $70$--$83\%$ on Cartesian layouts collapse to $31$--$39\%$ on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.09253 2026-06-02 cs.CL cs.AI 版本更新

拒绝-顺从权衡：大型语言模型的大规模安全行为审计

Alif Al Hasan, Sumon Biswas

发表机构 * Department of Computer and Data Sciences（计算机与数据科学系）

AI总结本研究通过调整组合方法隔离模型敏感性与数据集毒性混淆，审计了21个开源权重LLM在四个安全基准上的拒绝与顺从失败模式，发现模型采用不同校准策略、人口保护不平等以及拒绝与顺从倾向在模型家族内稳定。

详情

AI中文摘要

拒绝率是LLM安全性的一个不良代理指标，即模型可能过度拒绝良性提示，同时仍顺从有害提示。我们审计了21个开源权重LLM在四个安全基准（OR-Bench、XSTest、ToxiGen、BOLD）上的两种失败模式，使用组合调整来隔离模型敏感性与数据集毒性混淆。我们报告三个发现。首先，模型采用根本不同的校准策略：保守生态系统（如Llama）以过度拒绝为代价抑制不安全输出，而宽松生态系统（如DeepSeek和Qwen）保持有用性但容忍更高的有害顺从。其次，人口保护不平等：模型过度保护突出的种族和宗教群体，经常拒绝甚至关于它们的良性提示，而对针对残疾的攻击提供显著较弱的保护。第三，拒绝和顺从倾向在模型家族内跨代和规模稳定，表明后训练目标比架构更能塑造安全行为。我们的结果呼吁进行联合、人口意识感知和多评判者的安全评估。

英文摘要

Refusal rates are a poor proxy for LLM safety, i.e., a model may over-refuse benign prompts while still complying with harmful ones. We audit both failure modes across 21 open-weight LLMs on four safety benchmarks (OR-Bench, XSTest, ToxiGen, BOLD), using a composition adjustment to isolate model sensitivity from dataset toxicity confounds. We report three findings. First, models adopt fundamentally different calibration strategies: conservative ecosystems such as Llama suppress unsafe outputs at the cost of elevated over-refusals, while permissive ecosystems such as DeepSeek and Qwen preserve helpfulness but tolerate higher harmful compliance. Second, demographic protection is unequal: models over-protect prominent racial and religious groups, frequently refusing even benign prompts about them, while providing substantially weaker protection against disability-targeted attacks. Third, refusal and compliance tendencies are stable within model families across generations and scales, suggesting that post-training objectives shape safety behavior more than architecture. Our results call for joint, demographically-aware, and multi-judge safety evaluation.

URL PDF HTML ☆

赞 0 踩 0

2604.17415 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

奖励分数匹配：统一流模型和扩散模型的基于奖励的微调

Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, Korea（人工智能研究生院，韩国科学技术院）

AI总结提出奖励分数匹配（RSM）框架，统一了多种基于奖励的微调方法，通过分数匹配与值引导目标对齐，简化了设计空间并提高了效率。

Comments 43 pages, 15 figures

详情

AI中文摘要

基于奖励的微调引导预训练的扩散或基于流的生成模型生成更高奖励的样本，同时保持接近预训练模型。尽管现有方法源自不同视角，但我们表明许多方法可以写在一个共同框架下，我们称之为奖励分数匹配（RSM）。在此视角下，对齐变为针对值引导目标的分数匹配，方法间的主要差异归结为值引导估计器的构建和跨时间步的有效优化强度。这种统一澄清了现有设计的偏差-方差-计算权衡，并将核心优化组件与增加复杂性而无明显益处的辅助机制区分开来。在此视角指导下，我们针对代表性的可微和黑盒奖励对齐任务开发了更简单、更高效的重新设计。总体而言，RSM将看似分散的基于奖励的微调方法集合转变为更小、更可解释且更可操作的设计空间。代码可在 https://github.com/jaylee2000/rsm 获取。

英文摘要

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space. Code is available at https://github.com/jaylee2000/rsm

URL PDF HTML ☆

赞 0 踩 0

2604.09063 2026-06-02 cs.CV cs.AI 版本更新

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

频率增强扩散模型：基于课程引导语义对齐的零样本骨架动作识别

Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing（测绘遥感信息工程国家重点实验室）； Wuhan University（武汉大学）； Information Systems Technology and Design Pillar（信息系统技术与设计学院）； Singapore University of Technology and Design（新加坡科技与设计大学）； School of Geodesy and Geomatics（测绘学院）； School of Mathematics and Statistics（数学与统计学院）； Wuhan University Shenzhen Research Institute（武汉大学深圳研究院）

AI总结提出频率感知扩散模型FDSM，通过语义引导频谱残差模块、时间步自适应频谱损失和课程语义抽象，解决扩散模型频谱偏差导致的高频动态过度平滑问题，实现零样本骨架动作识别，在多个数据集上达到最优性能。

Comments Accepted by The Visual Computer

详情

AI中文摘要

人体动作识别在计算机视觉中至关重要，应用范围从监控到人机交互。尽管基于监督的骨架方法有效，但其对详尽标注的依赖限制了对新动作的泛化能力。零样本骨架动作识别（ZSAR）成为一种有前景的范式，但由于扩散模型的频谱偏差（过度平滑高频动态）而面临挑战。在此，我们提出频率感知扩散用于骨架-文本匹配（FDSM），集成了语义引导频谱残差模块、时间步自适应频谱损失和基于课程的语义抽象以应对这些挑战。我们的方法有效恢复了细粒度运动细节，在NTU RGB+D、PKU-MMD和Kinetics-skeleton数据集上实现了最先进的性能。代码已公开于https://github.com/yuzhi535/FDSM。项目主页：https://yuzhi535.github.io/FDSM.github.io/

英文摘要

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/

URL PDF HTML ☆

赞 0 踩 0

2605.00310 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

超越视觉保真度：通过下游任务集成评估大规模遥感影像的超分辨率模型

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

发表机构 * University of Maryland（马里兰大学）； University of Pittsburgh（匹兹堡大学）； Worcester Polytechnic Institute（沃思利技术学院）； University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结针对现有超分辨率评估依赖PSNR/SSIM等保真度指标而忽略下游任务效用的问题，提出GeoSR-Bench基准数据集，集成土地覆盖分割、基础设施映射等下游任务，评估GAN、Transformer等9种SR模型在270种设置下的性能，发现保真度指标与任务性能弱相关甚至负相关。

Comments Under review at IEEE TPAMI

详情

AI中文摘要

超分辨率（SR）技术在从低分辨率输入重建高分辨率图像方面取得了重大进展。分辨率的提高为监测任务提供了视觉增强和实用性。特别是，SR已越来越多地用于基于卫星的地球观测，应用于城市规划、农业、生态学和灾害响应。然而，现有的SR研究和基准通常使用保真度指标如PSNR或SSIM，而超分辨率图像的真实效用在于支持下游任务，如土地覆盖分类、生物量估计和变化检测。为弥合这一差距，我们引入了GeoSR-Bench，一个下游任务集成的SR基准数据集，用于评估超越保真度指标的SR模型。GeoSR-Bench包含来自约36,000个地点的空间共位、时间对齐和质量控制的图像对，覆盖多种土地覆盖类型，分辨率从500米到0.6米。据我们所知，GeoSR-Bench是第一个直接将SR模型提高的图像分辨率与下游地球监测任务（包括土地覆盖分割、基础设施映射和生物物理变量估计）联系起来的SR基准。利用GeoSR-Bench，我们对基于GAN、Transformer、神经算子和扩散的SR模型在感知质量和下游任务性能上进行了基准测试。我们进行了270种设置的实验，涵盖2个跨平台SR任务、9个SR模型、3个下游任务模型以及每个SR任务的5个下游任务。结果表明，传统SR指标的改进通常与任务性能的提升不相关，甚至可能负相关，表明这些指标为选择适用于下游任务的优越模型提供的指导有限。这揭示了将下游任务集成到SR模型开发和评估中的必要性。

英文摘要

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increased resolution provides visual enhancement and utility for monitoring tasks. In particular, SR has been increasingly developed for satellite-based Earth observation, with applications in urban planning, agriculture, ecology, and disaster response. However, existing SR studies and benchmarks typically use fidelity metrics such as PSNR or SSIM, whereas the true utility of super-resolved images lies in supporting downstream tasks such as land cover classification, biomass estimation, and change detection. To bridge this gap, we introduce GeoSR-Bench, a downstream task-integrated SR benchmark dataset to evaluate SR models beyond fidelity metrics. GeoSR-Bench comprises spatially co-located, temporally aligned, and quality-controlled image pairs from about 36,000 locations across diverse land covers, spanning resolutions from 500m to 0.6m. To the best of our knowledge, GeoSR-Bench is the first SR benchmark that directly connects improved image resolution from SR models with downstream Earth monitoring tasks, including land cover segmentation, infrastructure mapping, and biophysical variable estimation. Using GeoSR-Bench, we benchmark GAN, transformer, neural operator, and diffusion-based SR models on perceptual quality and downstream task performance. We conduct experiments with 270 settings, covering 2 cross-platform SR tasks, 9 SR models, 3 downstream task models, and 5 downstream tasks for each SR task. The results show that improvements in traditional SR metrics often do not correlate with gains in task performance, and the correlations can be negative, indicating that these metrics provide limited guidance for selecting superior models for downstream tasks. This reveals the need to integrate downstream tasks into SR model development and evaluation.

URL PDF HTML ☆

赞 0 踩 0

2604.26977 2026-06-02 cs.LO cs.AI 版本更新

Defeasible Conditional Obligation in a Two-tiered Preference-based Semantics (Extended Version)

基于双层偏好语义的可废止条件义务（扩展版）

Xavier Parent

发表机构 * Technische Universität Wien (TU Wien)（维也纳技术大学）

AI总结本文提出一种双层偏好语义框架，通过结合非单调推理机制和双序关系（理想性与正常性），解决可废止条件义务的逻辑建模问题，并与约束输入/输出逻辑建立联系。

Comments 13 pages. Extended version of a paper presented at KR 2926

2604.25191 2026-06-02 cs.AR cs.AI cs.LG 版本更新

How Can Reinforcement Learning Achieve Expert-level Placement?

强化学习如何实现专家级布局？

Ruo-Tong Chen, Ke Xue, Chengrui Gao, Yunqi Shi, Tian Xu, Peng Xie, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China（南京大学新型软件技术国家重点实验室）； School of Artificial Intelligence, Nanjing University, China（南京大学人工智能学院）； Huawei Noah’s Ark Lab, China（华为诺亚实验室）

AI总结针对强化学习在芯片布局中因奖励设计不当而难以达到专家质量的问题，提出从专家布局直接学习奖励模型的方法，通过推断专家轨迹并训练隐式奖励模型，实现从单个设计高效学习并泛化到未见案例。

Comments DAC 2026

2604.23658 2026-06-02 cs.AR cs.AI cs.LG 版本更新

FlowPlace: Flow Matching for Chip Placement

FlowPlace: 用于芯片布局的流匹配

Peng Xie, Ke Xue, Yunqi Shi, Ruo-Tong Chen, Chengrui Gao, Siyuan Xu, Chenjian Ding, Mingxuan Yuan, Chao Qian

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China（南京大学新型软件技术国家重点实验室）； School of Artificial Intelligence, Nanjing University, China（南京大学人工智能学院）； Huawei Noah’s Ark Lab, China（华为诺亚实验室）

AI总结提出FlowPlace，通过掩码引导的合成数据生成、基于流的灵活先验注入高效训练和硬约束采样实现无重叠布局，在OpenROAD和ICCAD 2015基准上取得更优PPA指标、10-50倍采样效率提升和零重叠。

Comments DAC 2026

2604.23593 2026-06-02 cs.AI 版本更新

When AI reviews science: Can we trust the referee?

当AI评审科学：我们能信任审稿人吗？

Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, Lei Chen

发表机构 * School of Electronic Engineering, Southeast University（东南大学电子工程学院）； Zhejiang University（浙江大学）

AI总结针对AI审稿的安全性和可靠性问题，本文通过分类攻击类型并实验验证声望框架、断言强度、反驳谄媚和上下文投毒对评分的影响，为评估AI同行评审的可靠性提供基线。

详情

DOI: 10.59717/j.xinn-inform.2026.100030
Journal ref: The Innovation Informatics 2:100030 (2026)

AI中文摘要

科学投稿数量持续攀升，超过了合格人类审稿人的容量，并延长了编辑时间线。与此同时，现代大型语言模型（LLMs）在摘要、事实核查和文献分类方面展现出令人印象深刻的能力，使得将AI整合到同行评审中越来越有吸引力——实际上，也无可避免。然而，早期的部署和非正式采用已经暴露了严重的故障模式。最近的事件表明，嵌入在稿件中的隐藏提示注入可以引导LLM生成的评审走向不合理的正面判断。补充研究还显示出对对抗性措辞、权威和长度偏见以及幻觉主张的脆弱性。这些事件引发了学术交流的一个核心问题：当AI评审科学时，我们能信任AI审稿人吗？本文提供了以安全和可靠性为中心的AI同行评审分析。我们映射了评审生命周期中的攻击——训练和数据检索、初审、深度评审、反驳和系统层面。我们通过在分层选取的ICLR 2025投稿上使用两个基于LLM的高级审稿人进行四项处理-控制探针，实例化了这一分类法，以隔离声望框架、断言强度、反驳谄媚和上下文投毒对评审分数的因果效应。总之，这一分类法和实验审计为评估和跟踪AI同行评审的可靠性提供了基于证据的基线，并突出了具体的故障点，以指导有针对性的、可测试的缓解措施。

英文摘要

The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive -- and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle -- training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.

URL PDF HTML ☆

赞 0 踩 0

2604.07967 2026-06-02 cs.CL cs.AI 版本更新

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

AtomEval: 事实核查中对抗性声明重写的有效性感知原子评估

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang

发表机构 * Zhejiang University（浙江大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出AtomEval协议，通过原子分解和保留门控，区分有效规避验证与改变命题的重写，并引入VASR指标，解决传统ASR膨胀问题。

详情

AI中文摘要

大型语言模型（LLM）可以重写被驳斥的声明以规避基于证据的事实核查器，但当重写改变、削弱或纠正了本应保留的虚假命题时，传统的攻击成功率（ASR）可能会被夸大。我们引入了AtomEval，一种用于固定证据对抗性声明重写的有效性感知评估协议。AtomEval将声明表示为“主体-关系-客体-修饰语”（SROM）原子，应用单向保留门将有效的验证器规避与改变命题的重写分开，并报告有效性感知攻击成功率（VASR），该指标仅统计保留原始虚假命题的验证器规避重写。AtomEval进一步提供细粒度诊断，解释命题级失败和非最小有效重写。在FEVER被驳斥声明重写任务上，AtomEval揭示并解释了ASR膨胀：许多明显的攻击通过改变、削弱或纠正本应保留的命题来欺骗验证器。通过使受攻击命题的保留变得明确且可测量，AtomEval为评估必须在验证器规避与命题保留之间取得平衡的对抗性重写器提供了稳定的评估目标。

英文摘要

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject--relation--object--modifier (SROM) atoms, applies a one-way preservation gate to separate valid verifier evasion from proposition-changing rewrites, and reports validity-aware attack success rate (VASR), which counts only verifier-evasive rewrites that preserve the original false proposition. AtomEval further provides fine-grained diagnostics that explain both proposition-level failures and non-minimal valid rewrites. On FEVER refuted-claim rewriting, AtomEval exposes and explains ASR inflation: many apparent attacks fool the verifier by altering, weakening, or correcting the proposition they should preserve. By making attacked-proposition preservation explicit and measurable, AtomEval provides a stable evaluation target for evaluating adversarial rewriters that must balance verifier evasion with proposition preservation.

URL PDF HTML ☆

赞 0 踩 0

2602.02689 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Eidolon: A Post-Quantum Signature Scheme Based on k-Colorability in the Age of Graph Neural Networks

Eidolon: 图神经网络时代基于k-可着色性的后量子签名方案

Asmaa Cherkaoui, Ramon Flores, Delaram Kahrobaei, Richard Wilson

发表机构 * Laboratory of Mathematical Analysis, Algebra and Applications (LAM2A), Faculty of Sciences Ain Chock (FSAC), University Hassan II, Casablanca, Morocco（哈桑二世大学阿因-奇克学院数学分析与代数实验室）； Department of Geometry and Topology, Faculty of Mathematics, University of Seville, Seville, Spain（塞维利亚大学数学系几何与拓扑系）； Departments of Computer Science and Mathematics, Queens College, City University of New York, USA（纽约市立大学皇后学院计算机科学与数学系；数学博士项目，理论科学倡议，研究生中心，纽约市立大学；计算机科学与工程系，纽约大学塔朗分校；计算机科学系，英国约克大学）； PhD Program in Mathematics, and Initiative for the Theoretical Sciences, Graduate Center, City University of New York, USA（英国约克大学计算机科学系）； Department of Computer Science and Engineering, Tandon School of Engineering, New York University, USA ； Department of Computer Science, University of York, United Kingdom ； Department of Computer Science, University of York, United Kingdom

AI总结提出一种基于NP完全问题k-可着色性的后量子签名方案Eidolon，通过推广Goldreich-Micali-Wigderson零知识协议、应用Fiat-Shamir变换和Merkle树压缩，并利用植入着色法生成困难实例，实验表明对经典求解器和图神经网络攻击具有抵抗性。

Comments 20 pages, 4 figures

详情

DOI: 10.1007/978-3-032-27574-5_3
Journal ref: Proceedings of WAIFI 2026, Lecture Notes in Computer Science (LNCS), Vol. 16611, Springer, 2026

AI中文摘要

我们提出Eidolon，一种基于NP完全问题k-可着色性的后量子签名方案。我们的构造将Goldreich-Micali-Wigderson零知识协议推广到任意k >= 3，应用Fiat-Shamir变换，并使用Merkle树承诺将签名从O(tn)压缩到O(t log n)。我们通过植入着色法生成困难实例，同时旨在保留随机图的统计特征。我们对此类方案进行了针对经典求解器（ILP、DSatur）和定制图神经网络（GNN）攻击者的实证安全分析。实验表明，对于n >= 60，两种方法均无法恢复与植入解匹配的有效着色，表明精心设计的k-着色实例能够抵抗所考虑的传统和基于学习的密码分析方法。这些实验表明，构造的实例能够抵抗我们评估中考虑的攻击。

英文摘要

We propose Eidolon, a post-quantum signature scheme grounded on the NP-complete k-colorability problem. Our construction generalizes the Goldreich-Micali-Wigderson zero-knowledge protocol to arbitrary k >= 3, applies the Fiat-Shamir transform, and uses Merkle-tree commitments to compress signatures from O(tn) to O(t log n). We generate hard instances by planting a coloring while aiming to preserve the statistical profile of random graphs. We present an empirical security analysis of such a scheme against both classical solvers (ILP, DSatur) and a custom graph neural network (GNN) attacker. Experiments show that for n >= 60, neither approach is able to recover a valid coloring matching the planted solution, suggesting that well-engineered k-coloring instances can resist the considered classical and learning-based cryptanalytic approaches. These experiments indicate that the constructed instances resist the attacks considered in our evaluation.

URL PDF HTML ☆

赞 0 踩 0

2604.20861 2026-06-02 cs.IR cs.AI 版本更新

Deep Interest Mining for Intent-Enriched Semantic IDs in Multimodal Generative Recommendation

面向多模态生成式推荐中意图增强语义ID的深度兴趣挖掘

Yangchen Zeng, Jinze Wang

发表机构 * Amazon（亚马逊）

AI总结提出DeepInterestGR框架，通过视觉线索和意图描述符丰富量化前的物品表示，并结合相关性门控语义奖励，提升基于语义ID的生成式推荐性能。

详情

AI中文摘要

语义ID（SID）为生成式推荐提供了离散物品词汇表，但其质量取决于量化前保留了哪些物品证据。在产品推荐中，表面元数据常缺失潜在使用意图，视觉证据可能仅在文本中弱反映，下游策略学习对生成的SID是否对应语义有用的物品提供稀疏反馈。我们引入 extbf{DeepInterestGR}，一个用于生成式推荐的意图增强SID框架。在SID量化前， extbf{CMSA}通过两条互补证据路径丰富物品表示：面向推荐的VLM描述和投影图像嵌入。然后 extbf{DCIM}使用LLM挖掘物品侧意图描述符——由产品内容隐含的潜在使用动机，而非个性化用户状态。在构建的SID上进行策略训练时， extbf{QARM}在标准SID奖励之上添加相关性门控语义质量奖励，仅当生成的SID解码为目标物品时应用该奖励。因此，语义质量不能奖励流畅但无关的物品预测。在三个Amazon产品评论类别（Beauty、Sports和Instruments）上的实验表明，DeepInterestGR优于有竞争力的生成式和基于RL的基线，在最强每度量基线上NDCG@5相对提升高达 extbf{15.1\%}，NDCG@10提升 extbf{13.9\%}。组件消融、CMSA分支分析、奖励变体和SID级案例研究支持一个有界声明：用视觉线索和物品侧意图描述符丰富量化前物品证据，结合相关性门控语义奖励，在评估设置下改进了基于SID的生成式推荐。

英文摘要

Semantic IDs (SIDs) provide the discrete item vocabulary used by generative recommendation, but their quality depends on what item evidence is preserved before quantization. In product recommendation, surface metadata often misses latent usage intent, visual evidence may be only weakly reflected in text, and downstream policy learning provides sparse feedback about whether a generated SID corresponds to a semantically useful item. We introduce \textbf{DeepInterestGR}, an intent-enriched SID framework for generative recommendation. Before SID quantization, \textbf{CMSA} enriches item representations through two complementary evidence paths: recommendation-oriented VLM captions and projected image embeddings. \textbf{DCIM} then uses an LLM to mine item-side intent descriptors -- latent usage motivations implied by product content rather than personalized user states. During policy training over the constructed SIDs, \textbf{QARM} adds a relevance-gated semantic-quality bonus on top of standard SID rewards, applying the bonus only when the generated SID decodes to the target item. Thus, semantic quality cannot reward a fluent but irrelevant item prediction. Experiments on three Amazon Product Review categories (Beauty, Sports, and Instruments) show that DeepInterestGR improves over competitive generative and RL-based baselines, with relative gains of up to \textbf{15.1\%} in NDCG@5 and \textbf{13.9\%} in NDCG@10 over the strongest per-metric baseline. Component ablations, CMSA branch analyses, reward variants, and SID-level case studies support a bounded claim: enriching pre-quantization item evidence with visual cues and item-side intent descriptors, together with relevance-gated semantic rewards, improves SID-based generative recommendation under the evaluated settings.

URL PDF HTML ☆

赞 0 踩 0

2603.15956 2026-06-02 cs.RO cs.AI 版本更新

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

ExpertGen: 从非完美行为先验的可扩展仿真到现实专家策略学习

Zifan Xu, Ran Gong, Maria Vittoria Minniti, Kausik Sivakumar, Ahmet Salih Gundogdu, Eric Rosen, Riedana Yan, Tushar Kusnur, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper

发表机构 * Robotics and AI Institute（机器人与人工智能研究院）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； Sony AI（索尼人工智能）

AI总结提出ExpertGen框架，通过扩散策略初始化行为先验并结合强化学习优化噪声，在仅稀疏奖励下生成高质量专家策略，实现从仿真到现实的可扩展迁移。

详情

AI中文摘要

学习通用且鲁棒的行为克隆策略需要大量高质量的机器人数据。虽然人类演示（例如通过遥操作）是专家行为的标准来源，但在现实世界中大规模获取此类数据成本过高。本文介绍了ExpertGen，一个在仿真中自动化专家策略学习的框架，以实现可扩展的仿真到现实迁移。ExpertGen首先使用在非完美演示（可能由大语言模型合成或由人类提供）上训练的扩散策略初始化行为先验。然后，通过优化扩散模型的初始噪声同时保持原始策略冻结，使用强化学习将该先验引导至高的任务成功率。通过保持预训练的扩散策略冻结，ExpertGen将探索正则化到安全、类人的行为流形内，同时仅使用稀疏奖励即可实现有效学习。在具有挑战性的操作基准上的实证评估表明，ExpertGen无需奖励工程即可可靠地生成高质量的专家策略。在工业装配任务中，ExpertGen实现了90.5%的整体成功率，而在长时域操作任务中达到了85%的整体成功率，优于所有基线方法。所得策略表现出灵巧的控制，并在不同的初始配置和失败状态下保持鲁棒。为了验证仿真到现实的迁移，学习到的基于状态的专家策略通过DAgger进一步提炼为视觉运动策略，并成功部署在真实的机器人硬件上。

英文摘要

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keep original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

URL PDF HTML ☆

赞 0 踩 0

2604.17621 2026-06-02 cs.AI 版本更新

KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models

KnowledgeBerg: 评估大语言模型中的系统性知识覆盖与组合推理

Xiao Zhang, Qianru Meng, Yongjian Chen, Yumeng Wang, Johan Bos

发表机构 * University of Groningen（Groningen大学）； LIACS, Leiden University（莱顿大学LIACS）

AI总结提出KnowledgeBerg基准，通过4800道选择题评估大模型在知识宽度和推理深度上的系统性覆盖与组合推理能力，发现现有模型存在严重不足。

Comments ACL Findings

详情

AI中文摘要

许多现实世界的问题看似简单，却隐含地要求两种能力：(i) 对有限知识宇宙的系统性覆盖，以及(ii) 对该宇宙的基于集合的组合推理，我们将这种现象称为“冰山一角”。我们通过两个正交维度形式化这一挑战：知识宽度（所需宇宙的基数）和推理深度（组合集合操作的数量）。我们引入了KnowledgeBerg，一个包含4800道多项选择题的基准，这些题目源自1183个枚举种子，涵盖10个领域和17种语言，其宇宙基于权威来源以确保可重复性。代表性的开源大语言模型表现出严重局限性，在宇宙枚举上仅达到5.26-36.88的F1分数，在基于知识的推理上准确率仅为16.00-44.19。诊断分析揭示了三个失败阶段：完整性（知识缺失）、意识（未能识别需求）和应用（错误执行推理）。这种模式在语言和模型规模上持续存在。尽管测试时计算和检索增强带来了可测量的改进——分别高达4.35和3.78个百分点——但仍有显著差距，暴露了当前大语言模型在组织结构化知识和在有限领域上执行组合推理方面的局限性。数据集可在https://huggingface.co/datasets/2npc/KnowledgeBerg获取。

英文摘要

Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg

URL PDF HTML ☆

赞 0 踩 0

2604.17456 2026-06-02 cs.AI 版本更新

TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control

TrafficClaw：面向城市交通控制的统一物理环境中的可泛化LLM智能体

Siqi Lai, Pan Zhang, Yuping Zhou, Jindong Han, Yansong Ning, Hao Liu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））； Shandong University（山东大学）

AI总结提出TrafficClaw，一种基于大语言模型的可泛化交通控制智能体，通过统一物理环境、可执行时空推理与多阶段智能体强化学习，实现跨子系统的协调优化。

详情

AI中文摘要

大语言模型（LLM）智能体在数字环境中的长程推理、工具使用和决策方面表现出强大能力，但将其扩展到物理系统仍具挑战。与目标通常弱耦合的网络、代码或游戏环境不同，物理系统通过紧密耦合的动力学演化，局部干预会随时间在相互作用的子系统中传播。城市交通控制体现了这一挑战，因为交通信号、高速公路、公共交通和出租车系统通过共享的空间基础设施和时间出行需求持续交互。现有的优化、强化学习（RL）和基于LLM的方法大多针对孤立子系统设计，限制了协调推理和系统级优化。我们提出TrafficClaw，一种基于LLM的可泛化交通控制智能体，用于物理城市系统。TrafficClaw在统一的交通环境中运行，暴露耦合的城市动态和反馈，通过持久记忆执行可扩展的时空推理以实现长期适应，并利用多阶段智能体RL进行协调的系统级优化。在三个大都市区域和六个交通控制任务上的实验证明了其强大的泛化能力、鲁棒性和跨子系统协调能力。我们的项目可在https://github.com/usail-hkust/TrafficClaw获取。

英文摘要

Large language model (LLM) agents have shown strong capabilities in long-horizon reasoning, tool use, and decision-making in digital environments, yet extending them to physically grounded systems remains challenging. Unlike web, code, or game environments, where objectives are often weakly coupled, physical systems evolve through tightly coupled dynamics in which local interventions propagate across interacting subsystems over time. Urban traffic control exemplifies this challenge, as traffic signals, freeways, public transit, and taxi systems continuously interact through shared spatial infrastructure and temporal mobility demand. Existing optimization, reinforcement learning (RL), and LLM-based approaches are largely designed for isolated subsystems, limiting coordinated reasoning and system-level optimization. We propose TrafficClaw, a LLM-based generalizable traffic control agent for physical urban systems. TrafficClaw operates within a unified traffic environment that exposes coupled urban dynamics and feedback, performs executable spatiotemporal reasoning with persistent memory for long-horizon adaptation, and leverages multi-stage agentic RL for coordinated system-level optimization. Experiments across three metropolitan regions and six traffic-control tasks demonstrate strong generalization, robustness, and cross-subsystem coordination. Our project is available at https://github.com/usail-hkust/TrafficClaw.

URL PDF HTML ☆

赞 0 踩 0

2604.17007 2026-06-02 cs.CV cs.AI 版本更新

MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment

MobileAgeNet：面向移动部署的轻量级面部年龄估计

Arun Kumar, Aswathy Baiju, Radu Timofte, Dmitry Ignatov

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室、CAIDAS与IFI、乌尔姆大学、德国）

AI总结提出基于MobileNetV3-Large的轻量级年龄回归框架MobileAgeNet，通过两阶段微调和边界回归策略，在UTKFace测试集上达到4.65年MAE，移动端延迟14.4ms，参数量3.23M。

Comments 9 Pages including references, 3 figures

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3810-3818, 2026

AI中文摘要

面部年龄估计的移动部署需要模型在预测准确性、低延迟和小尺寸之间取得平衡。在这项工作中，我们提出了MobileAgeNet，一个轻量级年龄回归框架，在UTKFace保留测试集上实现了4.65年的MAE，同时使用AI Benchmark应用程序测量，平均延迟为14.4毫秒，保持了高效的设备端推理。该模型基于预训练的MobileNetV3-Large骨干网络，结合紧凑的回归头，支持移动设备上的实时预测。训练和评估流程集成到NN LEMUR数据集框架中，支持可重复实验、结构化超参数优化和一致评估。我们采用边界年龄回归以及两阶段微调策略，以提高训练稳定性和泛化能力。实验结果表明，MobileAgeNet以3.23M参数实现了具有竞争力的准确性，并且从PyTorch训练通过ONNX导出到TensorFlow Lite转换的部署流程，在实际设备条件下保持了预测行为，没有可测量的退化。总体而言，这项工作为面向移动的面部年龄估计提供了一个实用、可部署的基线。

英文摘要

Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline from PyTorch training through ONNX export to TensorFlow Lite conversion - preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.

URL PDF HTML ☆

赞 0 踩 0

2601.07177 2026-06-02 cs.CR cs.AI 版本更新

Safe-FedLLM: Delving into the Safety of Federated Large Language Models

Safe-FedLLM：深入探究联邦大语言模型的安全性

Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang

发表机构 * Hainan University（海南大学）； Tsinghua University（清华大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出Safe-FedLLM，一种基于探针的防御框架，通过三级防御（步骤级、客户端级和阴影级）利用轻量级分类器区分恶意与良性LoRA更新，以增强联邦大语言模型对恶意客户端的鲁棒性。

详情

AI中文摘要

联邦学习解决了大语言模型训练中的隐私和数据孤岛问题。大多数先前工作侧重于提高联邦学习对大语言模型的效率。然而，开放联邦环境中的安全性，特别是针对恶意客户端的防御，仍未被充分探索。为了研究联邦大语言模型的安全性，我们进行了一项初步研究，从LoRA更新的角度分析潜在的攻击面和防御特性。我们发现联邦大语言模型的两个关键特性：1）大语言模型在联邦学习中容易受到恶意客户端的攻击，以及2）LoRA更新表现出不同的行为模式，可以通过轻量级分类器有效区分。基于这些特性，我们提出了Safe-FedLLM，一种基于探针的联邦大语言模型防御框架，该框架在三个层面构建防御：步骤级、客户端级和阴影级。Safe-FedLLM的核心概念是对每个客户端的本地LoRA更新进行基于探针的区分，将其视为高维行为特征，并使用轻量级分类器判断其是否为恶意。大量实验表明，Safe-FedLLM有效提高了联邦大语言模型对恶意客户端的鲁棒性，同时保持了对良性数据的竞争性能。值得注意的是，我们的方法在不显著影响训练速度的情况下有效抑制了恶意数据的影响，即使在恶意客户端比例较高的情况下也保持有效。

英文摘要

Federated learning (FL) addresses privacy and data-silo issues in the training of large language models (LLMs). Most prior work focuses on improving the efficiency of federated learning for LLMs (FedLLM). However, security in open federated environments, particularly defenses against malicious clients, remains underexplored. To investigate the security of FedLLM, we conduct a preliminary study to analyze potential attack surfaces and defensive characteristics from the perspective of LoRA updates. We find two key properties of FedLLM: 1) LLMs are vulnerable to attacks from malicious clients in FL, and 2) LoRA updates exhibit distinct behavioral patterns that can be effectively distinguished by lightweight classifiers. Based on these properties, we propose Safe-FedLLM, a probe-based defense framework for FedLLM, which constructs defenses across three levels: Step-Level, Client-Level, and Shadow-Level. The core concept of Safe-FedLLM is to perform probe-based discrimination on each client's local LoRA updates, treating them as high-dimensional behavioral features and using a lightweight classifier to determine whether they are malicious. Extensive experiments demonstrate that Safe-FedLLM effectively improves FedLLM's robustness against malicious clients while maintaining competitive performance on benign data. Notably, our method effectively suppresses the impact of malicious data without significantly affecting training speed, and remains effective even under high malicious client ratios.

URL PDF HTML ☆

赞 0 踩 0

2604.15231 2026-06-02 cs.AI 版本更新

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent：一种用于胸部CT逐步解读的工具型AI智能体

Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia E. Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

发表机构 * Department of Biosystems Science and Engineering, ETH Zurich（生物系统科学与工程系，苏黎世联邦理工学院）； ETH AI Center, Zurich（ETH人工智能中心，苏黎世）； Department of Computer Science, ETH Zurich（计算机科学系，苏黎世联邦理工学院）； Faculty of Computer Science and Mathematics, Heidelberg University（计算机科学与数学学院，海德堡大学）； Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University（斯坦福大学人工智能在医学和影像中的中心）； Department of Radiology, Stanford University（放射科，斯坦福大学）； Department of Quantitative Biomedicine, University of Zurich（定量生物医学系，苏黎世大学）； Institute of Computer Science, Zurich University of Applied Sciences（应用科学大学计算机科学研究所）

AI总结提出RadAgent，一种通过逐步、可解释过程生成CT报告的工具型AI智能体，在临床准确性、鲁棒性和忠实度上优于3D VLM方法。

详情

AI中文摘要

视觉语言模型（VLM）显著推进了复杂医学影像（如计算机断层扫描（CT））的AI驱动解读和报告生成。然而，现有方法主要将临床医生视为最终输出的被动观察者，没有提供可解释的推理轨迹供其检查、验证或改进。为了解决这个问题，我们引入了RadAgent，一种工具型AI智能体，通过逐步且可解释的过程生成CT报告。每个生成的报告都附带一个完全可检查的中间决策和工具交互轨迹，使临床医生能够检查报告发现是如何得出的。在我们的实验中，我们观察到RadAgent在三个维度上改进了胸部CT报告生成，优于其3D VLM对应物CT-Chat。临床准确性在宏F1上提高了5.8分（相对提高35.4%），在微F1上提高了5.1分（相对提高18.6%）。对抗条件下的鲁棒性提高了24.7分（相对提高41.9%）。此外，RadAgent在忠实度上达到了37.0%，这是其3D VLM对应物完全不具备的新能力。通过将胸部CT的解读构建为显式、工具增强和迭代的推理轨迹，RadAgent使我们更接近放射学的透明和可靠AI。

英文摘要

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 5.8 points (35.4% relative) in macro-F1 and 5.1 points (18.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.

URL PDF HTML ☆

赞 0 踩 0

2512.24120 2026-06-02 cs.CV cs.AI 版本更新

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

增强基于LLM的神经网络生成：面向自动化架构设计的少样本提示与高效验证

Raghuvir Duvvuri, Chandini Vysyaraju, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室，CAIDAS与IFI，乌尔姆大学，德国）

AI总结本文提出少样本架构提示（FSAP）和空白归一化哈希验证方法，以提升基于LLM的计算机视觉架构自动生成效率，并通过大规模实验验证其有效性。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3242-3251, 2026

AI中文摘要

自动化神经网络架构设计仍然是计算机视觉中的一个重大挑战。任务多样性和计算约束要求既有效又高效的架构与搜索方法。大型语言模型（LLMs）为计算密集型的神经架构搜索（NAS）提供了一种有前景的替代方案，但它们在计算机视觉架构生成中的应用尚未被系统研究，特别是在提示工程和验证策略方面。基于任务无关的NNGPT/LEMUR框架，本文引入并验证了两项针对计算机视觉的关键贡献。首先，我们提出了少样本架构提示（FSAP），这是首个针对基于LLM的架构生成中支持示例数量（n = 1, 2, 3, 4, 5, 6）的系统研究。我们发现使用n = 3个示例能在视觉任务的架构多样性和上下文聚焦之间取得最佳平衡。其次，我们引入了空白归一化哈希验证，一种轻量级去重方法（耗时小于1毫秒），相比AST解析实现了100倍加速，并防止了重复计算机视觉架构的冗余训练。在七个计算机视觉基准（MNIST、CIFAR-10、CIFAR-100、CelebA、ImageNette、SVHN、Places365）的大规模实验中，我们生成了1,900个独特架构。我们还引入了一种数据集平衡的评估方法，以应对跨异构视觉任务比较架构的挑战。这些贡献为计算机视觉中基于LLM的架构搜索提供了可操作的指导，并建立了严格的评估实践，使计算资源有限的研究人员也能更便捷地进行自动化设计。

英文摘要

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

URL PDF HTML ☆

赞 0 踩 0

2604.14514 2026-06-02 cs.AI cs.CE 版本更新

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

生物医学AI中的偏见视角：防止下游医疗保健差异

Michal Rosen-Zvi, Yoav Kan-Tor, Michael Danziger, Agata Ferretti, Javier Aula-Blasco, Julia Falcao, Ron Shamir, Mira Marcus-Kalish, Mordechai Muszkat

发表机构 * Weizmann Institute of Science（魏茨曼科学研究院）； Hebrew University of Jerusalem（特拉维夫大学）

AI总结本文通过分析2015-2024年4514篇组学出版物和大型数据集，揭示数据收集和研究中存在的严重人口偏见，并提出通过来源、开放性和评估透明度三个原则来预防下游医疗保健差异。

Comments This manuscript has been accepted for publication in the 2026 IEEE International Conference on Digital Health (ICDH). The final version will appear in IEEE Xplore

详情

AI中文摘要

医疗保健差异在社会经济边界上持续存在，通常归因于筛查、诊断和治疗的不平等获取。然而，本文观点强调，关键偏见可能在更早阶段出现，即在数据收集和研究优先级确定期间，远在临床实施之前，尤其是在关注分子和组学数据的研究中。大量研究专注于收集组学数据，但相关的人口统计信息往往未被报告，即使报告了，也显示出显著偏见。对2015年至2024年间PubMed索引的4514篇组学出版物的自动分析，检查了多个人口统计维度的报告情况，发现总体报告有限；例如，只有2.7%的研究报告了祖先或种族信息，地理来源报告仅限于2.5%。对常用于模型训练的大规模数据集（如CellxGene和GEO）的分析揭示了显著的人口偏见，其中欧洲血统数据占主导地位。随着生物医学基础模型成为生物医学发现的核心，其范式是基础模型在大数据集上预训练并反复用于许多不同的下游任务，这些模型有风险延续或放大这些早期阶段的偏见，导致监管干预无法完全逆转的级联不平等。我们提出社区范围内关注三个基本原则：来源、开放性和通过评估透明度的可靠性。这些原则共同有助于使偏见和局限性对模型开发者和用户更加可见，支持在生物医学AI中更明智的模型开发、评估和部署决策。

英文摘要

Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, particularly in studies focused on molecular and omics data. A vast number of studies focus on collecting omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it reveals substantial biases. An automated analysis of 4514 PubMed-indexed omics publications from 2015 to 2024, examining reporting across multiple demographic dimensions, reveals limited reporting overall; for example, only 2.7% of studies report ancestry or ethnicity information and geographic origin reporting is limited to 2.5%. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias where European-ancestry data dominates. As biomedical foundation models become central to biomedical discovery with a paradigm in which base models are pretrained on large datasets and reusing them repeatedly for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles: Provenance, Openness, and Reliability through Evaluation Transparency. Together, these principles can help make biases and limitations more visible to model developers and users, supporting more informed model development, evaluation, and deployment decisions in biomedical AI.

URL PDF HTML ☆

赞 0 踩 0

2604.03588 2026-06-02 cs.AI 版本更新

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

Rashomon记忆：面向多视角智能体记忆的论证驱动检索

Albert Sadowski, Jarosław A. Chudziak

发表机构 * Warsaw University of Technology（华沙技术大学）

AI总结提出Rashomon记忆架构，通过并行目标条件化智能体以各自优先级编码经验，并在查询时通过论证协商，利用Dung的论证语义选择解释，支持冲突呈现模式。

Comments Presented at EXTRAAMAS workshop at AAMAS 2026

详情

AI中文摘要

在长时间跨度上运行的AI智能体积累服务于多个并发目标的经验，并且通常必须维持对同一事件的矛盾解释。在客户谈判中的让步，对于一个战略目标编码为“建立信任的投资”，对于另一个目标则编码为“合同责任”。当前的记忆架构假设单一正确编码，或者最多在统一存储上支持多个视图。我们提出Rashomon记忆：一种架构，其中并行目标条件化智能体根据其优先级编码经验，并在查询时通过论证进行协商。每个视角维护自己的本体和知识图谱。在检索时，视角提出解释，使用非对称领域知识批评彼此的提议，Dung的论证语义决定哪些提议存活。生成的攻击图本身就是一个解释：它记录了哪个解释被选中，哪些替代方案被考虑，以及它们被拒绝的理由。我们提供了一个概念验证，表明检索模式（选择、组合、冲突呈现）从攻击图拓扑中涌现，并且冲突呈现模式（系统报告真正的分歧而不是强制解决）让决策者直接看到底层的解释性冲突。

英文摘要

AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain conflicting interpretations of the same events. A concession during a client negotiation encodes as a ``trust-building investment'' for one strategic goal and a ``contractual liability'' for another. Current memory architectures assume a single correct encoding, or at best support multiple views over unified storage. We propose Rashomon Memory: an architecture where parallel goal-conditioned agents encode experiences according to their priorities and negotiate at query time through argumentation. Each perspective maintains its own ontology and knowledge graph. At retrieval, perspectives propose interpretations, critique each other's proposals using asymmetric domain knowledge, and Dung's argumentation semantics determines which proposals survive. The resulting attack graph is itself an explanation: it records which interpretation was selected, which alternatives were considered, and on what grounds they were rejected. We present a proof-of-concept showing that retrieval modes (selection, composition, conflict surfacing) emerge from attack graph topology, and that the conflict surfacing mode, where the system reports genuine disagreement rather than forcing resolution, lets decision-makers see the underlying interpretive conflict directly.

URL PDF HTML ☆

赞 0 踩 0

2603.18373 2026-06-02 cs.CV cs.AI 版本更新

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

看见还是取悦：揭示视觉语言模型中的视觉谄媚与分裂信念

Rui Hong, Shuxue Quan

发表机构 * George Mason University（乔治·玛斯纳大学）； Independent Researcher（独立研究者）

AI总结提出三层诊断框架，通过反事实干预实验发现视觉语言模型中普遍存在视觉谄媚（内部证据保留但输出幻觉答案）现象，并证明扩展模型规模无法解决该问题。

Comments 14 pages, 1 figures

详情

AI中文摘要

当视觉语言模型正确回答时，它们是否真正依赖视觉信息？我们引入了一个三层诊断框架，包含三个每样本指标：潜在异常检测、视觉必要性分数和竞争分数，用于解耦感知、依赖和对齐失败。在9个视觉语言模型和9000个模型-样本对中，通过反事实盲、噪声和冲突干预，72.9%的样本表现出视觉谄媚，这是一种分裂信念模式，即内部证据被保留但解码出幻觉答案，而零样本表现出稳健拒绝，表明当前的对齐训练已消除拒绝作为解码结果。在Qwen-VL系列中，无论是代内还是代间扩展，都单调减少了语言捷径，但加剧了视觉谄媚，表明仅靠规模和更新的后训练无法解决接地问题。诊断分数进一步实现了一种无需训练的择性预测策略，在50%覆盖率下准确率提升高达9.5个百分点。

英文摘要

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-sample metrics: Latent Anomaly Detection, Visual Necessity Score, and Competition Score, which disentangle perception, dependency, and alignment failures. Across 9 VLMs and 9,000 model-sample pairs under counterfactual blind, noise, and conflict interventions, 72.9% of samples exhibit Visual Sycophancy, a Split Beliefs pattern in which internal evidence is preserved yet a hallucinated answer is decoded, while zero samples show Robust Refusal, indicating that current alignment training has eliminated refusal as a decoding outcome. Scaling within the Qwen-VL family, both within- and across-generation, monotonically reduces Language Shortcuts but amplifies Visual Sycophancy, showing that scale and newer post-training alone cannot resolve the grounding problem. Diagnostic scores further enable a training-free selective-prediction strategy yielding up to +9.5 percentage points accuracy at 50% coverage.

URL PDF HTML ☆

赞 0 踩 0

2509.05367 2026-06-02 cs.CR cs.AI 版本更新

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

进退维谷：大型语言模型中伦理推理与安全对齐之间的张力

Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh, Xiao Li, Qibing Ren, Xiaolin Hu

发表机构 * Department of Computer Science and Technology, Institute for AI, BNRist, Tsinghua University（计算机科学与技术系，人工智能研究院，BNRist，清华大学）； IDG/McGovern Institute for Brain Research, Tsinghua University（IDG/麦克戈文脑科学研究院，清华大学）； Chinese Institute for Brain Research (CIBR)（中国脑科学研究院（CIBR））； Shanghai Jiao Tong University（上海交通大学）； ByteDance（字节跳动）

AI总结本文提出TRIAL多轮红队方法，通过将有害请求嵌入伦理框架来利用模型伦理推理能力，并引入ERR防御框架（分层有害门控LoRA架构）以区分工具性回应与解释性回应，实现鲁棒防御。

详情

AI中文摘要

大型语言模型的安全对齐主要基于二元假设，即请求要么安全要么不安全。当模型遇到伦理困境时，这种分类被证明是不充分的，因为通过道德权衡进行推理的能力创造了一个独特的攻击面。我们通过TRIAL（一种多轮红队方法）形式化了这一漏洞，该方法将有害请求嵌入伦理框架中。TRIAL通过系统性地利用模型的伦理推理能力，将有害行为描述为道德上必要的妥协，在大多数测试模型上实现了高攻击成功率。基于这些见解，我们引入了ERR（伦理推理鲁棒性），一种防御框架，区分了导致有害结果的工具性回应和分析伦理框架而不认可有害行为的解释性回应。ERR采用分层有害门控LoRA架构，在保持模型实用性的同时，实现了对基于推理的攻击的鲁棒防御。

英文摘要

Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classification proves insufficient when models encounter ethical dilemmas, where the capacity to reason through moral trade-offs creates a distinct attack surface. We formalize this vulnerability through TRIAL, a multi-turn red-teaming methodology that embeds harmful requests within ethical framings. TRIAL achieves high attack success rates across most tested models by systematically exploiting the model's ethical reasoning capabilities to frame harmful actions as morally necessary compromises. Building on these insights, we introduce ERR (Ethical Reasoning Robustness), a defense framework that distinguishes between instrumental responses that enable harmful outcomes and explanatory responses that analyze ethical frameworks without endorsing harmful acts. ERR employs a Layer-Stratified Harm-Gated LoRA architecture, achieving robust defense against reasoning-based attacks while preserving model utility.

URL PDF HTML ☆

赞 0 踩 0

2604.08324 2026-06-02 cs.NE cs.AI 版本更新

Multi-Modal Learning meets Genetic Programming: Analyzing Alignment in Latent Space Optimization

多模态学习遇见遗传编程：分析潜在空间优化中的对齐

Benjamin Léger, Kazem Meidani, Christian Gagné

发表机构 * Mila, Université Laval（Mila，拉瓦尔大学）； Department of Mechanical Engineering, Carnegie Mellon University（机械工程系，卡内基梅隆大学）； Canada-CIFAR AI Chair（加拿大-卡内基梅隆人工智能主席）

AI总结本文研究SNIP方法在符号回归中多模态潜在空间优化的对齐效果，发现跨模态对齐在优化过程中未改善且过于粗糙，导致有效对齐引导的优化尚未实现。

详情

AI中文摘要

符号回归旨在从数据中发现数学表达式，传统上通过遗传编程对符号结构进行组合搜索来解决。潜在空间优化方法使用神经编码器将符号表达式映射到连续空间，将组合搜索转化为连续优化。受CLIP启发的对比预训练模型SNIP（Meidani等人，2024）通过引入多模态方法推进了潜在空间优化：在共享潜在空间中对齐符号和数值编码器以学习表型-基因型映射，从而在数值空间中进行优化以隐式指导符号搜索。然而，这依赖于细粒度的跨模态对齐，而类似CLIP模型的文献表明这种对齐通常是粗粒度的。在本文中，我们研究SNIP是否实现了其对符号回归进行有效双模态优化的承诺。我们的实验表明：（1）跨模态对齐在优化过程中并未改善，即使适应度增加；（2）SNIP学习的对齐过于粗糙，无法在符号空间中高效进行原则性搜索。这些发现揭示了尽管多模态潜在空间优化对符号回归具有巨大潜力，但有效的对齐引导优化在实践中仍未实现，突显了细粒度对齐作为未来工作的关键方向。

英文摘要

Symbolic regression (SR) aims to discover mathematical expressions from data, a task traditionally tackled using Genetic Programming (GP) through combinatorial search over symbolic structures. Latent Space Optimization (LSO) methods use neural encoders to map symbolic expressions into continuous spaces, transforming the combinatorial search into continuous optimization. SNIP (Meidani et al., 2024), a contrastive pre-training model inspired by CLIP, advances LSO by introducing a multi-modal approach: aligning symbolic and numeric encoders in a shared latent space to learn the phenotype-genotype mapping, enabling optimization in the numeric space to implicitly guide symbolic search. However, this relies on fine-grained cross-modal alignment, whereas literature on similar models like CLIP reveals that such an alignment is typically coarse-grained. In this paper, we investigate whether SNIP delivers on its promise of effective bi-modal optimization for SR. Our experiments show that: (1) cross-modal alignment does not improve during optimization, even as fitness increases, and (2) the alignment learned by SNIP is too coarse to efficiently conduct principled search in the symbolic space. These findings reveal that while multi-modal LSO holds significant potential for SR, effective alignment-guided optimization remains unrealized in practice, highlighting fine-grained alignment as a critical direction for future work.

URL PDF HTML ☆

赞 0 踩 0

2604.10788 2026-06-02 cs.CL cs.AI 版本更新

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

TInR：探索大语言模型中的工具内化推理

Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Southeast University（东南大学）； University of Edinburgh（爱丁堡大学）； Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences（中国科学院深圳先进技术研究院）

AI总结本文提出TInR-U框架，通过工具内化、监督微调和强化学习三阶段训练，使LLM无需外部文档即可进行工具集成推理，在域内和域外设置中均取得优越性能。

Comments Accepted to ACL 2026

详情

AI中文摘要

工具集成推理（TIR）通过扩展大语言模型（LLM）在推理过程中使用外部工具的能力，已成为一个有前景的方向。现有的TIR方法通常在推理过程中依赖外部工具文档。然而，这导致了工具掌握困难、工具规模限制和推理效率低下等问题。为了缓解这些问题，我们探索了工具内化推理（TInR），旨在促进使用内化到LLM中的工具知识进行推理。实现这一目标面临显著的要求，包括工具内化和工具-推理协调。为了解决这些问题，我们提出了TInR-U，一个用于统一推理和工具使用的工具内化推理框架。TInR-U通过三阶段流水线进行训练：1）使用双向知识对齐策略进行工具内化；2）使用高质量推理注释进行监督微调预热；3）使用TInR特定奖励进行强化学习。我们在域内和域外设置中全面评估了我们的方法。实验结果表明，TInR-U在两种设置下均实现了优越的性能，突显了其有效性和效率。

英文摘要

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

URL PDF HTML ☆

赞 0 踩 0

2604.10688 2026-06-02 cs.LG cs.AI cs.CL 版本更新

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

SCOPE: 信号校准的在线策略蒸馏增强与双路径自适应加权

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

发表机构 * University of Science and Technology of China（中国科学技术大学）； Meituan LongCat Interaction Team（美团 LongCat 交互团队）； Nanjing University（南京大学）； Fudan University（复旦大学）； Huazhong University of Science and Technology（华中科技大学）

AI总结针对在线策略强化学习中奖励稀疏导致的信用分配难题，提出SCOPE框架，通过双路径自适应加权机制分别处理正确与错误轨迹，实现信号校准的蒸馏增强，在六个推理基准上平均提升11.42%的Avg@32和7.30%的Pass@32。

详情

AI中文摘要

在线策略强化学习已成为大型语言模型推理对齐的主导范式，但其稀疏的结果级奖励使得令牌级信用分配异常困难。在线策略蒸馏（OPD）通过引入来自教师模型的密集令牌级KL监督缓解了这一问题，但通常对所有rollout均匀应用这种监督，忽略了信号质量的根本差异。我们提出信号校准的在线策略蒸馏增强（SCOPE），一种双路径自适应训练框架，根据正确性将在线策略rollout路由到两个互补的监督路径。对于错误轨迹，SCOPE执行教师困惑度加权的KL蒸馏，优先考虑教师展现出真正纠正能力的实例，同时降低不可靠指导的权重。对于正确轨迹，它应用学生困惑度加权的MLE，将强化集中在能力边界上的低置信度样本，而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布，考虑不同提示的内在难度差异。在六个推理基准上的大量实验表明，SCOPE在Avg@32和Pass@32上分别比竞争基线平均相对提升11.42%和7.30%，证明了其一致的有效性。

英文摘要

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

URL PDF HTML ☆

赞 0 踩 0

2604.10645 2026-06-02 cs.SE cs.AI 版本更新

Vibe-driven model-based engineering

基于氛围驱动的模型工程

Jordi Cabot

发表机构 * Luxembourg Institute of Science and Technology（卢森堡科学与技术研究院）； University of Luxembourg（卢森堡大学）

AI总结本文提出氛围驱动模型工程概念，融合大型语言模型（LLM）的代码生成能力与模型驱动工程（MDE）的严谨性，以加速可靠复杂系统的开发。

详情

AI中文摘要

随着新软件系统需求的增长和复杂性的增加，迫切需要更好的开发方法和工具。新型用户界面、智能组件需求、可持续性等问题带来了新的挑战。近年来，模型驱动工程（MDE），包括其最新形式即低代码/无代码开发，一直是提高软件开发质量和生产力的关键，但模型本身的指定和管理变得越来越复杂。与此同时，我们目睹了基于大型语言模型（LLM）的氛围编码方法的日益流行，该方法将自然语言描述转换为可运行代码，但代价是潜在的代码漏洞、可扩展性问题和可维护性问题。虽然许多人认为氛围编码将取代基于模型的工程，但在本文中，我们认为这两种方法实际上可以相互补充，并为不同类型的软件系统、开发场景和用户画像提供完全不同的开发路径。从这个意义上说，我们引入了“氛围驱动模型工程”的概念，作为一种融合AI和MDE优势的新方法，以加速可靠复杂系统的开发。我们概述了这一新方法的关键概念，并强调了它为软件开发的未来带来的机遇和开放挑战。

英文摘要

There is a pressing need for better development methods and tools to keep up with the growing demand and increasing complexity of new software systems. New types of user interfaces, the need for intelligent components, sustainability concerns, etc. bring new challenges that we need to handle. In the last years, model-driven engineering (MDE), including its latest incarnation, i.e. low/no-code development, has been key to improving the quality and productivity of software development, but models themselves are becoming increasingly complex to specify and manage. At the same time, we are witnessing the growing popularity of vibe coding approaches that rely on Large Language Models (LLMs) to transform natural language descriptions into running code at the expense of potential code vulnerabilities, scalability issues and maintainability concerns. While many may think vibe coding will replace model-based engineering, in this paper we argue that, in fact, the two approaches can complement each other and provide altogether different development paths for different types of software systems, development scenarios, and user profiles. In this sense, we introduce the concept of \textit{vibe-driven model-based engineering} as a novel approach to integrate the best of both worlds (AI and MDE) to accelerate the development of reliable complex systems. We outline the key concepts of this new approach and highlight the opportunities and open challenges it presents for the future of software development.

URL PDF HTML ☆

赞 0 踩 0

2604.10579 2026-06-02 cs.RO cs.AI 版本更新

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence

AffordGen: 通过可供性对应生成多样化演示以实现通用物体操作

Jiawei Zhang, Kaizhe Hu, Yingqian Huang, Yuanchen Ju, Zhengrong Xue, Huazhe Xu

发表机构 * Shanghai Qi Zhi Institute（上海启智研究院）； Tsinghua University（清华大学）； Fudan University（复旦大学）； UC Berkeley（伯克利大学）

AI总结提出AffordGen框架，利用3D生成模型和视觉基础模型在大规模3D网格上的语义对应生成多样化操作轨迹，训练鲁棒的闭环视觉运动策略，实现零样本泛化到未见物体。

详情

AI中文摘要

尽管现代模仿学习方法在机器人操作中取得了近期成功，但其性能常常受到数据多样性不足导致的几何变化的限制。利用强大的3D生成模型和视觉基础模型（VFMs），所提出的AffordGen框架通过利用大规模3D网格上有意义关键点的语义对应来生成新的机器人操作轨迹，从而克服了这一限制。然后，这个大规模、可供性感知的数据集被用于训练一个鲁棒的、闭环的视觉运动策略，结合了可供性的语义泛化能力和端到端学习的反应性鲁棒性。在仿真和现实世界中的实验表明，使用AffordGen训练的策略实现了高成功率，并能够零样本泛化到真正未见过的物体，显著提高了机器人学习中的数据效率。项目页面：https://jiaweiz9.github.io/AffordGen-release/

英文摘要

Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric variations due to limited data diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly improving data efficiency in robot learning. Project Page: https://jiaweiz9.github.io/AffordGen-release/

URL PDF HTML ☆

赞 0 踩 0

2604.09877 2026-06-02 cs.CV cs.AI cs.RO 版本更新

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Genie 4D：语义先验引导的4D动态场景重建

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess

发表机构 * University of Zurich（苏黎世大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出Genie 4D框架，结合实时视觉惯性高斯泼溅前端和前馈4D骨干网络，利用冻结的DINOv3特征作为结构先验抑制身份漂移，并通过条件扩散精炼器恢复高频细节，最终通过轻量级潜在动作头实现用户可控的4D世界模型重建。

详情

AI中文摘要

在计算机视觉与机器人感知的交汇处，动态场景的4D重建将低层几何感知与高层语义理解联系起来。我们提出Genie 4D，一个将手持手机拍摄转化为语义化、动作可控的4D世界模型的框架。Genie 4D将用于度量几何的实时视觉惯性高斯泼溅前端与由冻结的DINOv3特征（作为结构先验）正则化的前馈4D骨干网络相结合。语义先验抑制了动态跟踪中的身份漂移，而短条件扩散精炼器恢复了回归骨干网络平滑掉的高频表面细节。最后，一个轻量级潜在动作头将重建的4D状态暴露给以JEPA风格下一嵌入目标训练的Genie式世界模型，使得场景可以在用户动作下向前推进。在Point Odyssey和TUM-Dynamics基准测试上，Genie 4D保留了前馈基线的线性时间复杂度O(T)，同时提高了3D跟踪精度（APD）和重建完整性，并且可以在单个消费级GPU（RTX 5090）上通过iPhone、Mac、Windows和Linux采集客户端交互式运行。Genie 4D为走向物理基础的世界模型提供了一条实用的、语义先验引导的路径。

英文摘要

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.

URL PDF HTML ☆

赞 0 踩 0

2604.09549 2026-06-02 cs.IR cs.AI 版本更新

大语言模型引导的激励感知奖励设计用于合作多智能体强化学习

Dogan Urgun, Gokhan Gungor

发表机构 * Department of Electrical and Electronics Engineering（电气与电子工程系）； Karabuk University（卡拉博克大学）； Department of Mechatronics Engineering（机械工程系）

AI总结提出利用大语言模型自动生成可执行奖励程序，结合多智能体近端策略优化训练，在Overcooked-AI环境中显著提升合作任务回报。

详情

AI中文摘要

设计有效的辅助奖励对于合作多智能体系统仍然具有挑战性，因为激励不匹配会导致次优协调，尤其是在稀疏任务奖励无法为协调行为提供足够基础的情况下。本研究引入了一个自主奖励设计框架，利用大语言模型（LLMs）从环境仪器化中合成可执行的奖励程序。该过程将候选程序限制在形式有效性范围内，并在固定计算预算下使用多智能体近端策略优化（MAPPO）从头训练策略。然后根据性能评估候选程序，并仅基于稀疏任务回报进行跨代选择。该框架在四个Overcooked-AI布局中进行了评估，这些布局具有不同程度的走廊拥堵、交接依赖和结构不对称性。所提出的奖励设计方法始终产生更高的任务回报和交付数量，在交互瓶颈主导的环境中收益最为显著。对合成塑造成分的诊断分析揭示了动作选择中更强的相互依赖性，以及在协调密集型任务中信号对齐的改善。这些结果表明，所提出的LLM引导的奖励搜索框架减轻了手动工程的需求，同时产生了与有限预算下合作学习兼容的塑造成分信号。

英文摘要

Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient grounding for coordinated behavior. This study introduces an autonomous reward design framework that uses large language models (LLMs) to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and trains policies from scratch using Multi-Agent Proximal Policy Optimization (MAPPO) under a fixed computational budget. The candidates are then evaluated on the basis of their performance, and selection across generations solely based on the sparse task returns. The framework is evaluated in four Overcooked-AI layouts characterized by varying levels of corridor congestion, handoff dependencies, and structural asymmetries. The proposed reward design approach consistently yields higher task returns and delivery counts, with the most pronounced gains observed in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components reveals stronger interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the proposed LLM-guided reward search framework mitigates the need for manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.

URL PDF HTML ☆

赞 0 踩 0

2604.03893 2026-06-02 cs.AI 版本更新

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

FeynmanBench：多模态大语言模型在图解物理推理上的基准测试

Zeyu Wang, Jingye Xu, Xiaogang Li, Peiyao Xiao, Qinhao Kong, Ben Wang, Chengliang Xu, Zichao Chen, Bing Zhao, Hu Wei

发表机构 * Alibaba Group（阿里巴巴集团）； Skylenage

AI总结提出FeynmanBench基准，包含2000多个费曼图任务，评估多模态大模型在拓扑结构、守恒约束和视觉-代数映射等全局结构推理上的能力，发现模型在局部识别上表现良好但在拓扑重建和代数推导上严重不足。

Comments 9 pages, 5 figures

详情

AI中文摘要

当前用于科学推理的多模态基准主要评估局部信息提取——模型识别符号和数值，然后进行文本推理。它们不评估模型是否能在形式化图表的全局结构属性上进行推理，例如拓扑、守恒约束以及视觉模式与代数表达式之间的一致映射。我们引入了FeynmanBench，一个包含2000多个任务的基准，聚焦于涵盖标准模型电磁、弱和强相互作用的费曼图。每个实例将图表图像与最少的文本约定相结合，要求模型恢复完整的物理内容——顶点清单、传播子类型、拓扑连接性、动量路由以及完整的散射振幅。一个自动化的生成和验证流程在标准化规则下产生图表、注释和参考答案。评估了19个最先进的多模态大语言模型，我们发现一个一致的失败模式：模型在局部识别（顶点和传播子识别）上达到70-95%，但在拓扑重建（CP3）上下降到13-17%，在完整代数推导（CP5）上接近零。FeynmanBench为形式化科学图表上的多模态推理提供了一个受控测试平台，并突显了当前架构在拓扑敏感的科学推理中的基本局限性。

英文摘要

Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction -- models recognize symbols and values and then perform textual inference. They do not assess whether models can reason over the global structural properties of formal diagrams, such as topology, conservation constraints, and the consistent mapping between visual patterns and algebraic expressions. We introduce FeynmanBench, a benchmark of over 2,000 tasks centered on Feynman diagrams spanning the electromagnetic, weak, and strong interactions of the Standard Model. Each instance couples a diagram image with minimal textual conventions and requires models to recover the full physical content -- vertex inventory, propagator types, topological connectivity, momentum routing, and the complete scattering amplitude. An automated generation and verification pipeline produces the diagrams, annotations, and reference answers under standardized rules. Evaluating 19 state-of-the-art multimodal LLMs, we find a consistent failure pattern: models achieve 70--95\% on local recognition (vertex and propagator identification) but collapse to 13--17\% on topological reconstruction (CP3), and near zero on full algebraic derivation (CP5). FeynmanBench offers a controlled testbed for multimodal reasoning over formal scientific diagrams and highlights fundamental limitations of current architectures in topology-sensitive scientific reasoning.

URL PDF HTML ☆

赞 0 踩 0

2604.03789 2026-06-02 cs.LG cs.AI 版本更新

Automated Conjecture Resolution with Formal Verification

自动猜想解决与形式化验证

Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Shurui Liu, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong

发表机构 * School of Mathematical Sciences, Peking University（北京大学数学科学学院）； Westlake Institute for Advanced Study, Westlake University（西拉雅大学先进研究所）； School of Mathematics, Tianjin University（天津大学数学学院）； Research Institute for Mathematical Sciences, Kyoto University（京都大学数学研究所）； Department of Mathematics, Stanford University（斯坦福大学数学系）； IQuest Research（IQuest研究）； New Cornerstone Science Laboratory, School of Mathematical Sciences, Peking University（北京大学数学科学学院新基石科学实验室）； Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University（北京大学国际数学研究所以及新基石科学实验室）； Center for Machine Learning Research, Peking University（北京大学机器学习研究中心）； Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University（大湾大学先进研究所智能计算中心）； Zhongguancun Academy（中关村学院）

AI总结提出一个集成非形式化推理与形式化验证的自动框架，通过两个组件Rethlas和Archon解决研究级数学问题，并成功解决交换代数中的开放问题并在Lean 4中形式化验证。

Comments Code and resources are available at: Rethlas (https://github.com/frenzymath/Rethlas), Rethlas Results (https://github.com/frenzymath/Rethlas_results), Archon (https://github.com/frenzymath/Archon), and the formalization results (https://github.com/frenzymath/Anderson-Conjecture)

详情

AI中文摘要

近年来，大型语言模型在数学推理能力上取得了显著进步，从解决初等问题扩展到研究级问题。然而，由于自然语言推理固有的歧义性，可靠地解决和验证此类问题仍然具有挑战性。本文提出一个自动框架，将自然语言推理与形式化验证相结合，以应对研究级数学问题。我们的框架由两个组件组成：非形式化推理代理Rethlas和形式化验证代理Archon。Rethlas将推理原语与我们的定理搜索引擎Matlas相结合，探索解决策略并构建候选证明。Archon配备LeanSearch，通过任务分解、迭代细化和自动证明合成，将非形式化论证转化为形式化的Lean 4项目，确保机器可检查的正确性。利用该框架，我们解决了一个交换代数中的开放问题，并在几乎无需人工参与的情况下在Lean 4中形式化验证了所得证明。额外的案例研究展示了Rethlas在非形式化数学推理和发现方面的能力，以及Archon将研究级证明形式化为Lean 4的能力。我们的实验表明，强大的定理检索工具能够发现和应用跨领域数学技巧，而形式化代理可以自主填补非形式化论证中的非平凡空白。更广泛地说，我们的工作展示了一种有前景的数学研究范式，其中配备定理检索工具的非形式化和形式化推理系统协同工作，以产生可验证的结果，减少人工努力，并支持人机协作的数学研究。

英文摘要

Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research-level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework that integrates natural language reasoning with formal verification to tackle research-level mathematical problems. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas combines reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with LeanSearch, translates informal arguments into formalized Lean 4 projects through task decomposition, iterative refinement, and automated proof synthesis, ensuring machine-checkable correctness. Using this framework, we resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Additional case studies illustrate the capabilities of Rethlas in informal mathematical reasoning and discovery, as well as the ability of Archon to formalize research-level proofs in Lean 4. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, while the formal agent can autonomously fill nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, reduce human effort, and support human-AI collaborative mathematical research.

URL PDF HTML ☆

赞 0 踩 0

2602.00906 2026-06-02 cs.LG cs.AI cs.CL cs.DS cs.IT math.IT 版本更新

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

幻觉是空间最优性的结果：成员测试的率失真定理

Anxin Guo, Jingwei Li

发表机构 * Computer Science Department, Northwestern University（西北大学计算机科学系）； Department of IEOR, Columbia University（哥伦比亚大学工业工程与运筹学系）

AI总结通过将幻觉形式化为成员测试问题，建立率失真定理，证明在有限容量下信息论最优策略必然导致对某些非事实的高置信度，从而产生幻觉。

Comments ICML 2026

详情

AI中文摘要

大型语言模型通常对缺乏可推断模式的“随机事实”以高置信度产生幻觉。我们将此类事实的记忆形式化为一个成员测试问题，统一了布隆过滤器的离散误差指标与LLM的连续对数损失。通过分析在事实在可能主张的宇宙中稀疏的情况下，我们建立了一个率失真定理：最优记忆效率由事实与非事实得分分布之间的最小KL散度刻画。这一理论框架在理想化设置下为幻觉提供了独特的解释：即使有最优训练、完美数据和简化的“封闭世界”设置，有限容量下信息论最优策略不是放弃或遗忘，而是对某些非事实赋予高置信度，从而导致幻觉。我们在合成数据和真实数据上实证验证了这一理论，表明幻觉作为有损压缩的自然结果持续存在。同一定理恢复并锐化了布隆型滤波器的经典空间下界，确定了两侧滤波器遗留的加性常数。

英文摘要

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination under an idealized setting: even with optimal training, perfect data, and a simplified ``closed world'' setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on both synthetic and real-world data, showing that hallucinations persist as a natural consequence of lossy compression. The same theorem recovers and sharpens classical space lower bounds for Bloom-type filters, pinning down an additive constant left open for two-sided filters.

URL PDF HTML ☆

赞 0 踩 0

2603.03312 2026-06-02 cs.CL cs.AI cs.HC eess.AS q-bio.NC 版本更新

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

逃离BLEU陷阱：一种基于信号锚定与解耦语义引导的脑电解码文本框架

Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li

发表机构 * The Hong Kong University of Science and Technology（香港科学与技术大学）； Shenzhen Loop Area Institute（深圳环湖研究院）

AI总结针对脑电解码文本中语义偏差、信号忽略和BLEU陷阱问题，提出SemKey多阶段框架，通过解耦语义目标（情感、主题、长度、意外性）和主动检索解码机制，强制生成基于信号而非语言先验，并采用检索和分布度量（如Fréchet距离）建立评估协议，有效缓解幻觉并达到最优性能。

详情

AI中文摘要

从非侵入性脑电信号中解码自然语言是一项有前景但充满挑战的任务。然而，当前最先进的模型仍受限于三个基本问题：语义偏差（输出退化为通用语言模板）、信号忽略（模型严重依赖大语言模型先验，即使在缺乏有意义信号时也能生成流畅文本）以及“BLEU陷阱”（高频停用词虚增n-gram指标，掩盖真正语义保真度的缺失）。为解决这些挑战，我们超越传统的端到端流水线，提出SemKey——一种新颖的多阶段框架，通过四个解耦的语义目标（情感、主题、长度和意外性）强制进行基于信号的生成。我们直接从脑电嵌入中提取这些语义锚点，然后通过主动检索解码机制统一它们，迫使大语言模型将其令牌生成锚定在神经信号上，而非默认使用语言先验。此外，我们通过建立全面的评估协议（使用严格的检索和基于分布的度量，如Fréchet距离）打破BLEU陷阱。大量实验表明，SemKey有效缓解了对噪声输入的幻觉，并在这些鲁棒协议上达到了最先进的性能。代码将在论文被接收后发布于https://github.com/xmed-lab/SemKey。

英文摘要

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental issues: Semantic Bias, where outputs collapse into generic linguistic templates; Signal Neglect, where models rely heavily on LLM priors to hallucinate fluent text even in the absence of meaningful signals; and the "BLEU Trap", where high-frequency stopwords inflate n-gram metrics, masking a lack of true semantic fidelity. To resolve these challenges, we move beyond conventional end-to-end pipelines and propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We extract these semantic anchors from EEG embeddings directly, then unify them with an Active Retrieval Decoding mechanism, compelling the LLM to ground its token generation in the neural signals rather than defaulting to linguistic priors. Furthermore, we break the BLEU Trap by establishing a comprehensive evaluation protocol using rigorous retrieval and distribution-based metrics such as Fréchet Distance. Extensive experiments demonstrate that SemKey effectively mitigates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.

URL PDF HTML ☆

赞 0 踩 0

2604.01841 2026-06-02 cs.AI 版本更新

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

检索对齐的表格基础模型实现电子健康记录中在现实约束下的稳健临床风险预测

Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica

发表机构 * University of Cambridge（剑桥大学）

AI总结针对电子健康记录中高维、异质、类别不平衡和分布偏移等挑战，提出任务对齐检索框架AWARE，通过监督嵌入学习和轻量适配器提升表格上下文学习性能，在极端不平衡下AUPRC提升高达12.2%。

Comments Not peer-reviewed

详情

DOI: 10.21203/rs.3.rs-9085469/v1

AI中文摘要

从结构化电子健康记录（EHR）进行临床预测具有挑战性，原因包括高维性、异质性、类别不平衡和分布偏移。尽管表格上下文学习（TICL）和检索增强方法在通用基准上表现良好，但它们在临床环境中的行为仍不清楚。我们提出了一个多队列EHR基准，比较了经典模型、深度表格模型和TICL模型在不同数据规模、特征维度、结果稀有性和跨队列泛化下的表现。基于PFN的TICL模型在低数据情况下样本高效，但随着异质性和不平衡的增加，在基于朴素距离的检索下性能下降。我们提出了AWARE，一个任务对齐的检索框架，使用监督嵌入学习和轻量适配器。AWARE在极端不平衡下将AUPRC提升了高达12.2%，且增益随数据复杂性增加。我们的结果识别出检索质量以及检索-推理对齐是部署表格上下文学习进行临床预测的关键瓶颈。

英文摘要

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.

URL PDF HTML ☆

赞 0 踩 0

2604.01562 2026-06-02 cs.SD cs.AI cs.CL cs.CY cs.HC 版本更新

Acoustic and perceptual differences between standard and accented speech and their voice clones

标准口音与带口音语音及其语音克隆的声学与感知差异

Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

发表机构 * Department of Linguistics, University at Buffalo, United States（语言学系，布法罗大学，美国）； Department of Computer Science and Engineering, University at Buffalo, United States（计算机科学与工程系，布法罗大学，美国）； Emeritus Faculty, Australian National University, Australia（澳大利亚国立大学荣誉教职）

AI总结通过计算和感知实验，比较标准口音与带口音普通话及其语音克隆，发现口音影响感知身份匹配和可懂度，且标准口音克隆更接近原声，带口音克隆可懂度提升更大。

详情

AI中文摘要

语音克隆通常根据整体质量进行评估，但关于口音保留及其感知后果的了解较少。我们采用计算和感知相结合的设计，比较标准口音和重度口音普通话及其语音克隆。基于嵌入的分析显示，在多个说话人判别嵌入空间中，带口音说话人的原始-克隆距离更大，但在根据每个说话人的原始内部基线变异性进行归一化后，这种差异消失。在感知研究中，标准口音说话人的克隆被评价为比带口音说话人的克隆更接近其原始声音，并且从原始到克隆的可懂度增加，其中带口音语音的增益更大。这些结果表明，即使口音变异未反映在基线归一化的说话人嵌入距离中，它也能影响语音克隆中的感知身份匹配和可懂度，并促使将口音保留视为说话人身份保留的一个明确组成部分，而不是假设它完全由现成的说话人判别嵌入所捕获。

英文摘要

Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses showed larger original-clone distances for accented speakers in several speaker-discriminative embedding spaces, but this difference disappeared after normalizing against each speaker's within-original baseline variability. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in baseline-normalized speaker-embedding distance, and they motivate treating accent preservation as an explicit component of speaker identity preservation, rather than assuming that it is fully captured by off-the-shelf speaker-discriminative embeddings.

URL PDF HTML ☆

赞 0 踩 0

2603.28825 2026-06-02 cs.GT cs.AI 版本更新

Incentives, Equilibria, and the Limits of Healthcare AI: A Game-Theoretic Perspective

激励、均衡与医疗AI的局限：博弈论视角

Ari Ercole

发表机构 * Cambridge Centre for AI in Medicine, University of Cambridge, UK（剑桥大学医学人工智能中心）； Magdalene College, University of Cambridge, UK（剑桥大学玛格丽特学院）

AI总结本文通过住院容量管理的协调问题，描述三种AI部署形式，并分析其对系统行为的影响，指出只有改变激励结构的干预才能改变稳定均衡，为医疗AI的采购、治理和评估提供实践启示。

详情

AI中文摘要

利用一个来自住院容量管理的典型协调问题，描述了三种典型的AI部署形式：减少努力的技术、面向可观测性的系统以及改变潜在激励结构的干预。减少努力和可观测性可能改善现有行为模式下的性能，但通常不会改变哪些行动是个人理性的。因此，此类干预通常被吸收到现有均衡中。相比之下，通过重新分配或限制局部风险来改变局部行动如何影响下游后果的干预可以改变稳定的系统行为。这些机制层面的干预不同之处不在于技术复杂性，而在于它们与制度激励的相互作用。分析表明，对AI带来系统层面收益的期望应取决于部署是否改变了激励，而不仅仅是优化任务或信息流。对于医疗组织和政策制定者而言，这对数字技术的采购、治理和评估具有实际意义。

英文摘要

Using a stylised coordination problem drawn from inpatient capacity management, three archetypal forms of AI deployment are described: effort-reducing technologies, observability-oriented systems, and interventions that alter underlying incentive structures. Effort reduction and observability may improve performance within existing patterns of behaviour but do not, in general, change which actions are individually rational. As a result, such interventions are typically absorbed into existing equilibria. By contrast, interventions that modify how local actions map to downstream consequences by redistributing or bounding local risk can change stable system behaviour. These mechanism-level interventions differ not in technical sophistication but in their interaction with institutional incentives. The analysis suggests that expectations of system-level gains from AI should be conditioned on whether a deployment changes incentives rather than optimising tasks or information flows alone. For healthcare organisations and policymakers, this has practical implications for procurement, governance, and evaluation of digital technologies.

URL PDF HTML ☆

赞 0 踩 0

2603.27223 2026-06-02 cs.CV cs.AI 版本更新

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

EuraGovExam：来自现实世界公务员考试的多语言多模态基准

Jaeseong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

发表机构 * School of Computer Science / Data Intelligence Lab（计算机科学学院/数据智能实验室）

AI总结提出一个包含8000多道真实公务员考试题目的多语言多模态基准EuraGovExam，要求模型直接从图像中进行布局感知的跨语言推理，当前最先进的视觉语言模型准确率仅达86%。

详情

DOI: 10.1145/3770855.3817532

AI中文摘要

我们提出了EuraGovExam，一个多语言和多模态基准，来源于五个代表性欧亚地区（韩国、日本、台湾、印度和欧盟）的现实世界公务员考试。该数据集旨在反映公共部门评估的真实复杂性，包含超过8000道高分辨率扫描选择题，涵盖17个不同的学术和行政领域。与现有基准不同，EuraGovExam将所有题目内容（包括问题陈述、答案选项和视觉元素）嵌入到单个图像中，仅提供最小化的标准答案格式指令。这种设计要求模型直接从视觉输入进行布局感知的跨语言推理。所有题目均来自真实考试文档，保留了丰富的视觉结构，如表格、多语言排版和类似表单的布局。评估结果显示，即使是最先进的视觉语言模型（VLM）也仅达到86%的准确率，突显了该基准的难度及其诊断当前模型局限性的能力。通过强调文化真实性、视觉复杂性和语言多样性，EuraGovExam为在高风险、多语言、图像基础环境中评估VLM建立了新标准。它还支持电子政务、公共部门文档分析和公平考试准备等实际应用。

英文摘要

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content--including problem statements, answer choices, and visual elements--within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.

URL PDF HTML ☆

赞 0 踩 0

2603.26779 2026-06-02 cs.CV cs.AI 版本更新

Limits of Spatial Imagery Reasoning in Frontier LLM Models

前沿大语言模型在空间意象推理中的局限性

Sergio Y. Hayashi, Nina S. T. Hirata

发表机构 * Institute of Mathematics and Statistics – University of São Paulo（数学统计研究所 – 圣保罗大学）

AI总结本研究通过引入外部“意象模块”辅助3D模型旋转任务，发现即使外包整体3D状态维护，前沿模型仍缺乏基础视觉空间原语，导致准确率最高仅62.5%。

Comments 25 pages. v2: Title updated; added a section on object/spatial imagery and propositional reasoning; added new experimental results for the single-object rotation probe

详情

AI中文摘要

大型语言模型（LLMs）展示了令人印象深刻的推理能力，但在需要心理模拟的空间任务（如心理旋转）中表现不佳。本文研究是否通过为LLM配备一个外部“意象模块”——一种能够渲染和旋转3D模型的工具——可以弥合这一差距，充当“认知假体”。我们使用双模块架构进行了实验，其中推理模块（MLLM）与意象模块在3D模型旋转任务上进行交互。性能低于预期，准确率最高达到62.5%。进一步研究表明，即使将维护和操作整体3D状态的负担外包，系统仍然失败。这揭示了当前前沿模型缺乏与意象交互所需的基础视觉空间原语。具体来说，它们缺乏：（1）提取空间信号的低级敏感性，例如（a）深度，（b）运动，以及（c）短视距动态预测；以及（2）对图像进行沉思性推理的能力，动态转移视觉焦点，并平衡意象与符号和关联信息。

英文摘要

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external ``Imagery Module'' -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a ``cognitive prosthetic.'' We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.

URL PDF HTML ☆

赞 0 踩 0

2603.23582 2026-06-02 cs.LG cs.AI 版本更新

AI Generalisation Gap In Comorbid Sleep Disorder Staging

共病睡眠障碍分期中的AI泛化差距

Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

发表机构 * arXiv

AI总结针对脑卒中患者睡眠分期中深度学习模型在健康与临床人群间泛化差的问题，通过Grad-CAM可视化和新数据集iSLEEPS，揭示模型关注生理无意义区域，并强调需开发疾病特异性模型。

详情

DOI: 10.1109/ISBI61048.2026.11515484

AI中文摘要

准确的睡眠分期对于诊断脑卒中患者的OSA和低通气至关重要。尽管PSG可靠，但成本高、劳动密集且需人工评分。虽然深度学习在健康受试者中实现了基于EEG的自动睡眠分期，但我们的分析显示，该方法在睡眠紊乱的临床人群中泛化能力差。利用Grad-CAM解释，我们系统地证明了这一局限性。我们引入了iSLEEPS，一个经过临床注释的缺血性脑卒中新数据集（即将公开发布），并评估了SE-ResNet加双向LSTM模型用于单通道EEG睡眠分期。正如预期，健康与疾病受试者之间的跨域性能很差。注意力可视化在临床专家反馈的支持下显示，模型在患者数据中关注生理上无信息的EEG区域。统计和计算分析进一步证实了健康与缺血性脑卒中队列之间显著的睡眠结构差异，强调了在部署前需要经过临床验证的受试者感知或疾病特异性模型。论文和代码摘要见https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

英文摘要

Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

URL PDF HTML ☆

赞 0 踩 0

2603.24511 2026-06-02 cs.LG cs.AI cs.CR 版本更新

大型语言模型中语境不变性的失效

Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

发表机构 * Network Science Institute, Northeastern University（网络科学研究所，东北大学）； Center for Health Informatics Program, Boston Children’s Hospital（健康信息学计划中心，波士顿儿童医院）； Dept. of Mathematics, City St George’s, University of London（伦敦大学城市圣乔治学院数学系）； IT University of Copenhagen（哥本哈根IT大学）

AI总结通过代词选择任务发现，在语境等价但无信息量的干扰下，大语言模型输出发生系统性偏移，表明其违反语境不变性，影响偏见评估与高风险应用。

详情

AI中文摘要

标准评估实践假设，当提示嵌入语境等价的语篇中时，大型语言模型（LLM）的输出是稳定的。这里，我们在性别推断的背景下测试这一假设。使用受控的代词选择任务，我们引入最小的、理论上无信息的语篇语境，发现这会导致模型输出出现大规模、系统性的偏移。与去语境化设置中存在的文化性别刻板印象的相关性在引入语境后减弱或消失，而理论上无关的特征（如无关指代对象的代词性别）成为模型行为最具信息量的预测因子。通过默认语境性分析发现，在模型间的19%至52%的案例中，这种依赖性在考虑语境对单个输出的所有边际效应后仍然存在，并且不能归因于简单的代词重复。这些发现表明，即使在几乎相同的句法表述下，LLM的输出也违反了语境不变性，这对偏见基准测试和高风险环境中的部署具有重要影响。

英文摘要

Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behavior. A Contextuality-by-Default analysis reveals that, in 19--52\% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

URL PDF HTML ☆

赞 0 踩 0

2603.23398 2026-06-02 cs.LG cs.AI stat.ML 版本更新

HiFi-KPI：用于从财报中提取层次化KPI的数据集

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

发表机构 * Department of Computer Science, Aalborg University（奥尔堡大学计算机科学系）； ALIPES ApS（ALIPES公司）； University of Copenhagen（哥本哈根大学）； Pioneer Centre for AI（先锋人工智能中心）

AI总结针对财报中关键绩效指标（KPI）跨公司可迁移性差的问题，提出包含165万段落和19.8万层次化标签的HiFi-KPI数据集，并评估分类、提取和结构化提取三个任务。

详情

AI中文摘要

准确标注财报可以为利益相关者带来显著的短期回报。机器可读的内联可扩展商业报告语言（iXBRL）是公开财务申报的强制要求。然而，其复杂且细粒度的分类法限制了标记关键绩效指标（KPI）的跨公司可迁移性。为了解决这个问题，我们引入了层次化财务关键绩效指标（HiFi-KPI）数据集，这是一个包含165万段落和19.8万个独特层次化标签的大规模语料库，这些标签与iXBRL分类法相关联。HiFi-KPI支持多个任务，我们评估了其中三个：KPI分类、KPI提取和结构化KPI提取。为了快速评估，我们还发布了HiFi-KPI-Lite，一个手动策划的8K段落子集。在HiFi-KPI-Lite上的基线实验表明，基于编码器的模型在分类任务上达到了超过0.906的宏F1分数，而大型语言模型（LLM）在结构化提取任务上达到了0.440的F1分数。最后，定性分析显示提取错误主要与日期相关。我们在https://github.com/aaunlp/HiFi-KPI上开源了所有代码和数据。

英文摘要

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings. Yet, its complex, fine-grained taxonomy limits the cross-company transferability of tagged Key Performance Indicators (KPIs). To address this, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, a large-scale corpus of 1.65M paragraphs and 198k unique, hierarchically organized labels linked to iXBRL taxonomies. HiFi-KPI supports multiple tasks and we evaluate three: KPI classification, KPI extraction, and structured KPI extraction. For rapid evaluation, we also release HiFi-KPI-Lite, a manually curated 8K paragraph subset. Baselines on HiFi-KPI-Lite show that encoder-based models achieve over 0.906 macro-F1 on classification, while Large Language Models (LLMs) reach 0.440 F1 on structured extraction. Finally, a qualitative analysis reveals that extraction errors primarily relate to dates. We open-source all code and data at https://github.com/aaunlp/HiFi-KPI.

URL PDF HTML ☆

赞 0 踩 0

2603.18016 2026-06-02 cs.CL cs.AI cs.DC cs.LG 版本更新

MineDraft: A Framework for Batch Parallel Speculative Decoding

MineDraft: 批量并行推测解码框架

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Toyota Research Institute（丰田研究院）； Toyota Motor Corporation（丰田公司）

AI总结提出MineDraft框架，通过批量并行设计将草稿生成与验证阶段重叠，显著提升推测解码的吞吐量和端到端延迟。

Comments Accepted at ICML 2026

详情

AI中文摘要

推测解码（SD）通过使用较小的草稿模型提出草稿令牌，随后由较大的目标模型验证，从而加速大型语言模型推理。然而，标准SD的性能通常受限于这些草稿和验证阶段的严格顺序执行。为解决此问题，本文提出MineDraft，一种批量并行推测解码（PSD）框架，旨在通过将草稿生成与验证重叠来有效隐藏草稿延迟。我们的理论分析表明，PSD比标准SD高效得多。MineDraft通过一种新颖的批量并行设计实现PSD，该设计维护两个请求批次，将一个批次的草稿生成与另一个批次的验证重叠。我们的实验结果显示，与标准SD相比，MineDraft在吞吐量（最高提升75%）和端到端延迟（最高降低39%）方面均有显著改进。此外，我们已将MineDraft实现为vLLM的插件，展示了其在生产级推理系统中的实用性。

英文摘要

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

URL PDF HTML ☆

赞 0 踩 0

2603.17893 2026-06-02 cs.SE cs.AI cs.LG 版本更新

scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

scicode-lint: 使用LLM生成的模式检测科学Python代码中的方法论错误

Sergey V. Samsonau

发表机构 * Authentic Research Partners, Princeton, NJ（真实研究伙伴，新泽西州普林斯顿）

AI总结提出scicode-lint，通过两级架构（构建时使用前沿模型生成模式，运行时使用小型本地模型执行）自动检测科学Python代码中的方法论错误，如数据泄露、交叉验证错误和缺失随机种子。

详情

AI中文摘要

科学Python代码中的方法论错误会产生看似合理但实际不正确的结果，传统的linter和静态分析工具无法检测到这些错误。多个研究团队构建了特定于ML的linter，证明了检测的可行性。然而，这些工具存在可持续性问题：依赖于特定的pylint或Python版本、有限的打包方式，以及每个新模式都需要手动工程。随着AI生成代码增加了科学软件的数量，对自动化方法论检查（如检测数据泄露、不正确的交叉验证和缺失随机种子）的需求日益增长。我们提出了scicode-lint，其两级架构将模式设计（构建时的前沿模型）与执行（运行时的小型本地模型）分离。模式是生成的，而非手工编码；适应新的库版本花费的是token，而非工程时间。在带有手动标注真实值的Kaggle笔记本上，预处理泄露检测在100%召回率下达到了65%的精确率；在38篇应用AI/ML的已发表科学论文中，精确率为62%（由LLM评判），不同模式类别之间存在显著差异；在一个保留的论文集上，精确率为54%。在受控测试中，scicode-lint在66个模式上达到了97.7%的准确率。

英文摘要

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.

URL PDF HTML ☆

赞 0 踩 0

2603.13373 2026-06-02 cs.CY cs.AI cs.LG 版本更新

强化学习中信息自锁现象及其在LLM智能体主动推理中的应用

Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对LLM智能体在主动推理中因强化学习导致的信息自锁问题，提出基于优势重加权的方法AREW，通过方向性批评重新分配轨迹信用，显著提升智能体性能。

Comments Accepted by ICML 2026

详情

从特征到行动：传统与智能体AI系统中的可解释性

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Ahmed Y. Radwan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza

发表机构 * Vector Institute for Artificial Intelligence（向量人工智能研究所）； Independent Researcher（独立研究者）； Mayo Clinic（梅奥诊所）

AI总结本文比较了基于归因的解释与基于轨迹的诊断在静态和智能体设置中的效果，发现归因方法无法可靠诊断智能体轨迹中的执行级故障，而轨迹级可解释性更能定位行为故障。

详情

AI中文摘要

在过去十年中，可解释AI主要关注解释单个模型预测，在固定决策结构下生成将输入与输出关联的事后解释。大型语言模型的最新进展使得智能体AI系统能够在多步轨迹中展开行为。在这些设置中，成功与失败由决策序列而非单个输出决定。目前尚不清楚为静态预测设计的解释方法如何应用于行为随时间涌现的智能体设置。在这项工作中，我们通过比较两种设置中基于归因的解释与基于轨迹的诊断来弥合这一差距。我们的结果表明，虽然归因方法在静态设置中实现了稳定的特征排名（Spearman ρ = 0.86），但它们无法可靠地诊断智能体轨迹中的执行级故障。相比之下，针对智能体设置的轨迹接地评分标准能够一致地定位行为故障，并揭示状态跟踪不一致在失败运行中的普遍性高出2.7倍，并将成功概率降低49%。这些发现促使我们转向轨迹级可解释性，以评估和诊断智能体系统中自主AI行为。代码：https://github.com/VectorInstitute/unified-xai-evaluation-framework 项目页面：https://vectorinstitute.github.io/unified-xai-evaluation-framework

英文摘要

Over the last decade, Explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. It remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge this gap by comparing attribution-based explanations with trace-based diagnostics across both settings. Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman \r{ho} = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7x more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for evaluating and diagnosing autonomous AI behaviour in agentic systems. Code: https://github.com/VectorInstitute/unified-xai-evaluation-framework Project page: https://vectorinstitute.github.io/unified-xai-evaluation-framework

URL PDF HTML ☆

赞 0 踩 0

2512.16310 2026-06-02 cs.CR cs.AI cs.CL 版本更新

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Agent工具编排泄露更多：数据集、基准测试与缓解措施

Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络空间安全学院）

AI总结研究LLM代理在编排多个工具时泄露敏感结论的风险（TOP-R），构建了包含1000个实例的基准TOP-Bench，并提出TOP-Align后训练方法以缓解泄露。

Comments 17 pages, 2 figures. Dataset and code are available at https://github.com/1Ponder/TOP-R

详情

AI中文摘要

基于LLM的代理越来越多地使用多个外部工具来完成复杂任务。我们研究了工具编排隐私风险（TOP-R）：代理可能组合单个非敏感的工具返回结果，并披露一个非预期的敏感结论。我们通过三个条件形式化TOP-R：结论敏感性、单源不可推断性和组合可推断性。我们引入了LRSE（基于库的反向推理种子扩展），这是一个基于隐私规范、推理链、工具模式和任务场景的四库反向构建流水线，并使用它构建了TOP-Bench，一个包含1000个实例的基准测试。该基准测试在受控的两阶段工具使用协议下评估最终响应的语义泄露。在六个LLM代理中，任务完成率保持较高，但平均泄露率达到88.6%，导致H分数仅为20.4。两种仅提示的防护措施在主基准测试上将H分数提高了约2.7分。我们进一步提出了TOP-Align，一种SFT+DPO后训练方法，用于更安全的任务完成边界。在单独的后训练评估划分上，TOP-Align将H分数比相应基础模型提高了16.2分，而同一划分上仅提示缓解措施的平均增益为4.9分。这些结果表明TOP-R需要超越仅提示的缓解措施。

英文摘要

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability. We introduce LRSE (Library-Grounded Reverse-Inference Seed Expansion), a four-library reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios, and use it to build TOP-Bench, a 1,000-instance benchmark. The benchmark evaluates final-response semantic disclosure under a controlled two-stage tool-use protocol. Across six LLM agents, task completion remains high, but the average leakage rate reaches 88.6 percent, yielding an H-score of only 20.4. Two prompt-only safeguards improve H-score by about 2.7 points on the main benchmark. We further propose TOP-Align, an SFT+DPO post-training method for safer task completion boundaries. On a separate post-training evaluation split, TOP-Align improves H-score by 16.2 points over the corresponding base model, compared with a 4.9-point average gain from prompt-only mitigation on the same split. These results show that TOP-R requires mitigation beyond prompting alone.

URL PDF HTML ☆

赞 0 踩 0

2602.23694 2026-06-02 cs.RO cs.AI 版本更新

Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion

基于对数似然比融合的可解释多模态手势识别用于无人机和移动机器人遥操作

Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh

发表机构 * Department of Artificial Intelligence, Korea University（人工智能系，韩国大学）； Department of Computer Science, RPTU Kaiserslautern-Landau（计算机科学系，RPTU凯撒斯劳滕-兰道）； Embedded Intelligence, German Research Center for Artificial Intelligence (DFKI)（嵌入式智能，德国人工智能研究中心（DFKI））

AI总结提出一种融合腕戴式Apple Watch惯性数据和定制手套电容传感信号的多模态手势识别框架，利用对数似然比后期融合策略提升性能并提供可解释性，在降低计算成本的同时达到与视觉基线相当的识别效果。

详情

AI中文摘要

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach

发表机构 * Stanford University（斯坦福大学）； Princeton University（普林斯顿大学）

AI总结本文利用基于流的生成模型估计完整未来回报分布，通过新的流匹配目标满足分布贝尔曼方程，并利用流导数ODE估计回报不确定性以优先学习，在离线与在线设置中平均成功率提升1.3倍。

Comments ICLR 2026

详情

AI中文摘要

虽然当今大多数强化学习方法将未来回报的分布压缩为单个标量值，但分布RL方法利用回报分布提供更强的学习信号，并支持探索和安全强化学习中的应用。虽然估计回报分布的主要方法是将其建模为离散区间上的分类分布或估计有限数量的分位数，但这些方法留下了关于回报分布的细粒度结构以及如何区分高回报不确定性的状态以进行决策的未解问题。本文的关键思想是使用现代、灵活的基于流的模型来估计完整的未来回报分布，并识别那些具有高回报方差的状态。我们通过制定一个新的流匹配目标来实现这一点，该目标生成满足分布贝尔曼方程的概率密度路径。基于学习到的流模型，我们使用一个新的流导数ODE来估计不同状态的回报不确定性。我们还利用这种不确定性信息，优先在某些转换上学习更准确的回报估计。我们将我们的方法（Value Flows）与先前的方法在离线和在线到在线设置中进行了比较。在37个基于状态和25个基于图像的基准任务上的实验表明，Value Flows在成功率上平均提高了1.3倍。网站：https://pd-perry.github.io/value-flows 代码：https://github.com/chongyi-zheng/value-flows

英文摘要

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows

URL PDF HTML ☆

赞 0 踩 0

2601.09566 2026-06-02 cs.CV cs.AI 版本更新

Hot-Start Chinese Language Modeling:Visual Glyphs Accelerate Sample-Efficient Learning

热启动中文语言建模：视觉字形加速样本高效学习

Shuyang Xiang, Hao Guan

发表机构 * Independent Researcher（独立研究者）； Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所）

AI总结本文通过将汉字渲染为视觉字形图像，研究其对字符级语言建模的归纳偏置，发现视觉输入产生显著的热启动效应，但最终精度与基于索引的方法一致。

Comments 15 pages, 5 figures, submitted to ACL 2026

详情

AI中文摘要

在这项工作中，我们研究了将汉字渲染为视觉字形图像（而非主流LLM使用的离散token ID）是否为字符级语言建模提供归纳偏置。我们的核心发现给出了一个双刃剑的见解：视觉输入产生显著的热启动效应，在第一个epoch内（占总训练步骤的0.4%）将早期准确率提高一倍以上（视觉输入12.3% vs. 基于索引的基线5.8%），但两种方法最终收敛到几乎相同的最终准确率（39%）。这一模式在低至8x8像素的分辨率、高达50%的部分裁剪以及从110M到1.78B参数的模型规模下均成立。我们识别的机制是，字形渲染在训练之前就将基于部首的结构预编码到嵌入空间中（余弦相似度0.27 vs. 随机嵌入的0.002），从而能够更快地对齐，但无法提高最终容量。我们的结果阐明了视觉表示作为中文语言建模归纳偏置的前景和根本局限性。

英文摘要

In this work, we study whether rendering Chinese characters as visual glyph images, rather than discrete token IDs as mainstream LLMs do, providing an inductive bias for character-level language modeling. Our central finding gives a double-edged insight: visual inputs produce a pronounced hot-start effect, more than doubling early-stage accuracy within the first epoch (at 0.4% of total training steps) (12.3% visual inputs vs. 5.8% index-based baseline), yet both approaches converge to essentially identical final accuracy (39%). This pattern holds across resolutions as low as 8x8 pixels, partial cropping up to 50%, and model scales from 110M to 1.78B parameters. The mechanism we identify is that glyph rendering pre-encodes radical-based structure into embedding space before any training (cosine similarity 0.27 vs. 0.002 for random embeddings), enabling faster alignment but not higher final capacity. Our results clarify both the promise and fundamental limitation of visual representations as inductive biases for Chinese language modeling.

URL PDF HTML ☆

赞 0 踩 0

2603.02650 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

通过自监督动作能量门控改进扩散规划器

Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出SAGE方法，利用潜在一致性信号在推理时重新排序轨迹，惩罚动态不一致的计划，从而提升扩散规划器的性能和鲁棒性。

详情

AI中文摘要

扩散规划器是离线强化学习的一种强大方法，但当价值引导选择偏好得分高但局部与环境动态不一致的轨迹时，它们可能会失败，导致执行脆弱。我们提出了自监督动作能量门控（SAGE），一种推理时重排序方法，使用潜在一致性信号惩罚动态不一致的计划。SAGE在离线状态序列上训练联合嵌入预测架构（JEPA）编码器，并训练一个动作条件的潜在预测器用于短时域过渡。在测试时，SAGE为每个采样候选分配一个由其潜在预测误差给出的能量，并将此可行性得分与价值估计相结合以选择动作。SAGE可以集成到现有的扩散规划流程中，这些流程可以通过价值评分采样轨迹和选择动作；它不需要环境回滚，也不需要重新训练策略。在运动、导航和操作基准测试中，SAGE提高了扩散规划器的性能和鲁棒性。

英文摘要

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.

URL PDF HTML ☆

赞 0 踩 0

2603.02346 2026-06-02 cond-mat.str-el cs.AI cs.LG 版本更新

Large Electron Model: A Universal Ground State Predictor

大型电子模型：一种通用的基态预测器

Timothy Zaklama, Max Geier, Liang Fu

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Department of Physics（物理系）

AI总结提出Large Electron Model，一种基于Fermi Sets架构的神经网络模型，通过在整个哈密顿参数流形上生成变分波函数，准确预测二维谐振势中相互作用电子的基态，并泛化到未见耦合强度和粒子数，为材料发现提供了基于变分原理的基座模型方法。

Comments 8+7 pages, 5+6 figures, 1+1 tables

2603.02237 2026-06-02 cs.LG cs.AI 版本更新

Concept Heterogeneity-aware Representation Steering

概念异质性感知表示引导

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen

发表机构 * arXiv

AI总结针对大语言模型表示非均匀导致全局引导脆弱的问题，提出基于最优传输的输入依赖引导方法CHaRS，通过高斯混合模型和离散最优传输实现更有效的行为控制。

详情

Journal ref: ICML 2026

AI中文摘要

表示引导提供了一种轻量级机制，通过在推理时干预内部激活来控制大语言模型（LLMs）的行为。现有方法大多依赖于单个全局引导方向，通常通过对比较数据集进行均值差异得到。这种方法隐含假设目标概念在嵌入空间中均匀表示。然而在实践中，LLM表示可能高度非均匀，表现出聚类、上下文相关的结构，这使得全局引导方向变得脆弱。在这项工作中，我们通过最优传输（OT）的视角审视表示引导，注意到标准均值差异引导隐式对应于具有不同一阶矩的两个相同分布之间的OT映射，产生全局平移。为了放宽这一限制性假设，我们从理论上将源和目标表示建模为高斯混合模型，并将引导公式化为语义潜在聚类之间的离散OT问题。从得到的传输计划中，我们通过重心投影推导出显式的、输入依赖的引导映射，产生聚类级别偏移的平滑核加权组合。我们将此方法称为概念异质性感知表示引导（CHaRS）。通过大量实验设置，我们证明CHaRS比全局引导产生更有效的行为控制。

英文摘要

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two identical distributions with differing first moments, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.

URL PDF HTML ☆

赞 0 踩 0

2603.00829 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Constitutional Black-Box Monitoring for Scheming in LLM Agents

LLM Agent 中阴谋行为的宪法黑盒监控

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

发表机构 * University of Cambridge（剑桥大学）

AI总结研究使用基于宪法黑盒的监控器，通过仅观察外部输入和输出检测LLM Agent的阴谋行为，并在合成数据上优化后泛化到更真实环境。

Comments Accepted at ICML 2026. Camera-ready version

详情

评估中文事实搜索与AI答案中的可靠性不对称性

Geng Liu, Li Feng, Mengxiao Zhu, Francesco Pierri

发表机构 * Department of Electronics, Information and Bioengineering, Politecnico di Milano（电子、信息与生物工程系，米兰理工学院）； University of Science and Technology of China（中国科学技术大学）

AI总结通过构建基于真实中文搜索日志的查询事实核查数据集，比较传统搜索引擎、大型语言模型和搜索集成AI概览在中文是非问题上的准确性、回答频率、极性差距及区域信息需求差异，揭示可靠性不仅取决于回答正确性，还受回答频率、否定主张处理和信息需求暴露风险影响。

详情

AI中文摘要

搜索引擎和AI驱动的系统越来越多地成为获取事实信息的媒介，但在现实信息寻求场景中，其可靠性仍难以评估。我们通过从真实中文搜索日志构建基于查询的事实核查数据集，并比较传统搜索引擎、独立大型语言模型和搜索集成AI概览等九种系统，在中文网络生态中研究这一问题。聚焦于中文事实性是非问题，我们根据证据推导的基准事实评估系统是否提供正确、错误或不确定的判断。我们发现，当系统给出明确答案时，准确率相似（73.2%至78.9%），但给出明确答案的频率差异显著：搜索引擎对超过83%的查询给出明确答案，而Qwen-Max则不到一半。我们还发现一致的极性差距：所有系统在标记为“是”的查询上表现优于标记为“否”的查询。我们利用百度指数数据识别健康相关搜索关注度较高的中国省份，这可能表明更大的错误信息暴露风险。总体而言，我们的结果表明，可靠性不仅取决于系统回答时的正确性，还取决于回答频率、如何处理否定主张以及信息需求可能增加暴露风险的地方。

英文摘要

Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evaluate in realistic information-seeking settings. We study this problem in the Chinese web ecosystem by constructing a query-based fact-checking dataset from real Chinese search logs and comparing nine systems across traditional search engines, standalone large language models, and search-integrated AI Overviews. Focusing on factual Chinese-language factual Yes/No questions, we evaluate whether systems provide correct, incorrect, or uncertain decisions against evidence-derived ground truth. We find that systems are similarly accurate when they provide definitive answers, but differ sharply in how often they do so. Conditional accuracy ranges from 73.2% to 78.9%, yet search engines answer definitively on over 83% of queries, while Qwen-Max does so on fewer than half. We also find a consistent polarity gap: all systems perform better on yes-labeled queries than on no-labeled queries. We also use Baidu Index data to identify Chinese provinces with higher health-related search attention, which may indicate greater potential exposure to misinformation. Overall, our results show that reliability depends not only on whether systems are correct when they answer, but also on how often they answer, how they handle negative claims, and where information demand may increase exposure risks.

URL PDF HTML ☆

赞 0 踩 0

2602.16953 2026-06-02 cs.AI cs.LG 版本更新

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

LLM4Cov：面向高覆盖率测试生成的执行感知智能体学习

Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出LLM4Cov离线智能体学习框架，通过执行验证数据策展、策略感知数据合成和最差状态优先采样，在硬件验证中实现高覆盖率测试生成，4B参数模型在CVDP-ECov上达到69.2%通过率和90.4%平均覆盖率。

Comments ICML'26 Camera Ready version

详情

AI中文摘要

执行感知的LLM智能体为从工具反馈中学习提供了一种有前景的范式，但这种反馈可能昂贵且获取缓慢，使得在线强化学习（RL）在某些场景下不太实用。高覆盖率硬件验证由于依赖工业模拟器和不可微的执行信号，体现了这一挑战。我们提出LLM4Cov，一种离线智能体学习框架，将验证建模为由确定性评估器指导的单步状态转移。基于这一公式，我们引入了执行验证的数据策展、策略感知的智能体数据合成以及最差状态优先采样，以在执行约束下实现可扩展学习。我们进一步通过修订的评估协议，从现有验证套件中整理了一个符合现实的基准。使用所提出的流程，一个紧凑的4B参数模型在智能体评估下实现了69.2%的通过率和90.4%的平均覆盖率（CVDP-ECov），比其教师模型分别高出5.3%和10.5%，展现出与规模大一个数量级的模型相竞争的性能。

英文摘要

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtain, making online reinforcement learning (RL) less practical in certain scenarios. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as single-step state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% pass rate and 90.4% average coverage in CVDP-ECov under agentic evaluation, outperforming its teacher by 5.3% and 10.5%, demonstrating competitive performance against models an order of magnitude larger.

URL PDF HTML ☆

赞 0 踩 0

2508.08337 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants

立场：超越敏感属性，机器学习公平性应通过社会决定因素量化结构性不公正

Zeyu Tang, Alex John London, Atoosa Kasirzadeh, Sarah Stewart de Ramirez, Peter Spirtes, Kun Zhang, Sanmi Koyejo

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Cambridge（剑桥大学）； University of Washington（华盛顿大学）； University of Michigan（密歇根大学）； University of Toronto（多伦多大学）

AI总结本文主张算法公平性研究应超越敏感属性，通过社会决定因素量化结构性不公正，并通过理论模型和实证研究证明仅关注敏感属性的缓解策略可能引入新的结构性不公正。

Comments Accepted to ICML 2026 Position Paper Track

详情

AI中文摘要

算法公平性研究在很大程度上将不公平视为对敏感属性的歧视。然而，这种方法限制了对作为通过社会决定因素实例化的结构性不公正的不公平的可见性，社会决定因素是塑造属性和结果但不涉及特定个体的上下文变量。这篇立场论文认为，该领域应通过社会决定因素量化结构性不公正，超越敏感属性。借鉴跨学科见解，我们认为主流技术范式未能充分捕捉作为结构性不公正的不公平，因为上下文可能被视为需要标准化的噪声，而不是需要审计的信号。我们进一步通过大学录取的理论模型、使用美国人口普查数据的人口统计研究以及美国综合医疗系统中关于乳腺癌筛查的高风险领域应用，证明了这种转变的实际紧迫性。我们的结果表明，仅关注敏感属性的缓解策略可能引入新的结构性不公正形式。我们认为，通过社会决定因素审计结构性不公正必须先于缓解措施，并呼吁开发超越以敏感属性为中心的非歧视公平概念的新技术。

英文摘要

Algorithmic fairness research has largely framed unfairness as discrimination along sensitive attributes. However, this approach limits visibility into unfairness as structural injustice instantiated through social determinants, which are contextual variables that shape attributes and outcomes without pertaining to specific individuals. This position paper argues that the field should quantify structural injustice via social determinants, beyond sensitive attributes. Drawing on cross-disciplinary insights, we argue that prevailing technical paradigms fail to adequately capture unfairness as structural injustice, because contexts are potentially treated as noise to be normalized rather than signal to be audited. We further demonstrate the practical urgency of this shift through a theoretical model of college admissions, a demographic study using U.S. census data, and a high-stakes domain application regarding breast cancer screening within an integrated U.S. healthcare system. Our results indicate that mitigation strategies centered solely on sensitive attributes can introduce new forms of structural injustice. We contend that auditing structural injustice through social determinants must precede mitigation, and call for new technical developments that move beyond sensitive-attribute-centered notions of fairness as non-discrimination.

URL PDF HTML ☆

赞 0 踩 0

2601.17074 2026-06-02 cs.LG cs.AI 版本更新

LERD: 用于神经退行性疾病分类的潜在事件-关系动力学

Yicheng Feng, Hairong Chen, Ziyu Jia, Samir Bhatt, Hengguan Huang

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Washington（华盛顿大学）； University of California, San Diego（加州大学圣地亚哥分校）； University of California, Los Angeles（加州大学洛杉矶分校）

AI总结提出LERD，一种端到端贝叶斯潜在事件-关系动力系统，直接从多通道脑电图推断潜在神经事件及其关系结构，无需事件或交互标注，在阿尔茨海默病分类中优于基线方法并提供生理对齐的动力学摘要。

详情

AI中文摘要

阿尔茨海默病（AD）会改变大脑电生理学并破坏多通道脑电图动力学，使得准确且临床有用的基于脑电图的诊断对于筛查和疾病监测越来越重要。然而，许多现有方法依赖黑盒分类器，并未明确建模其决策背后的潜在事件时序和跨通道协调。为解决这些局限，我们提出LERD，一种端到端贝叶斯潜在事件-关系动力系统，无需事件或交互标注，直接从多通道脑电图推断潜在神经事件及其关系结构。LERD结合连续时间事件推断模块与随机事件生成过程以捕获灵活的时间模式，同时融入电生理学启发的动力学先验以原则性方式指导学习。我们进一步提供理论分析，得到基于初值问题的可处理KL正则化项以及推断关系动力学的稳定性保证。在合成基准和两个真实世界AD脑电图队列上的大量实验表明，LERD一致优于强基线，并生成与生理对齐的速率、时序和图摘要，有助于刻画组级动力学差异。

英文摘要

Alzheimer's disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-based diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the latent event timing and cross-channel coordination behind their decisions. To address these limitations, we propose LERD, an end-to-end Bayesian latent event--relational dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable IVP-based KL regularizer and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned rate, timing, and graph summaries that help characterize group-level dynamical differences.

URL PDF HTML ☆

赞 0 踩 0

2602.18008 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

LLM 是否准备好进行神经集成机制建模？一个基准测试与智能体框架

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

发表机构 * University of Virginia（弗吉尼亚大学）

AI总结本文提出神经集成机制建模（NIMM）基准测试，评估大语言模型在三个科学领域构建神经集成机制模型的能力，并设计树引导的智能体框架 NIMMGen，通过分支级搜索和原子模型细化显著提升搜索稳定性和解质量。

Comments 25 pages, 8 figures

详情

AI中文摘要

大语言模型（LLM）在从数据构建机制模型方面显示出潜力。然而，现有评估主要关注简化设置，未能捕捉真实世界科学建模的复杂性。在实践中，此类建模通常涉及神经集成公式，其中机制模型组件和神经网络组件共同构建，导致搜索空间显著复杂化。受此差距驱动，我们引入了神经集成机制建模（NIMM）基准测试，该基准测试评估 LLM 生成的神经集成机制模型在三个科学领域上的表现。在 NIMM 上的实验表明，现有基于 LLM 的方法难以有效探索这一复杂空间，导致搜索稳定性和解质量有限。为应对这一挑战，我们提出了 NIMMGen，一种树引导的智能体框架，通过分支级搜索实现多样化探索，并通过原子模型细化改进解。大量实验表明，NIMMGen 在 NIMM 上达到了最先进的性能，显著提升了搜索稳定性和解质量。

英文摘要

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus on simplified settings and fail to capture the complexity of real-world scientific modeling. In practice, such modeling often involves neural-integrated formulations, where a mechanistic model component and a neural network component are jointly constructed, leading to a significantly more complex search space. Motivated by this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates LLM-generated neural-integrated mechanistic models across three scientific domains. Experiments on NIMM reveal that existing LLM-based approaches struggle to effectively explore this complex space, resulting in limited search stability and solution quality. To address this challenge, we propose NIMMGen, a tree-guided agentic framework that enables diversified exploration via branch-level search and improves solutions through atomic model refinement. Extensive experiments demonstrate that NIMMGen achieves state-of-the-art performance on NIMM, significantly improving search stability and solution quality.

URL PDF HTML ☆

赞 0 踩 0

2602.16763 2026-06-02 cs.AI 版本更新

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

当AI基准测试达到平台期：基准饱和的系统性研究

Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Toronto（多伦多大学）； University of Washington（华盛顿大学）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Michigan（密歇根大学）； University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结本研究定义并分析了60个语言模型基准的饱和现象，发现近半数基准出现饱和，且专家策划而非公开测试数据影响抗饱和能力，为延长基准寿命提供了设计建议。

Comments Accepted at ICML 2026

2602.16745 2026-06-02 cs.LG cs.AI 版本更新

Atomix: 用于可靠智能体工作流的及时事务性工具使用

Bardia Mohammadi, Nearchos Potamitis, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems（马克斯·普朗克软件系统研究所）； Aarhus University（奥胡斯大学）； EPFL（苏黎世联邦理工学院）

AI总结针对LLM智能体多步工作流中因故障、推测和并发导致的状态不一致问题，提出Atomix系统，通过进度感知事务将效果分组与冲突解决分离，实现可靠提交与回滚。

详情

AI中文摘要

LLM智能体执行多步工作流，通过工具改变外部状态。常见的编排器将工具返回视为结算触发器，因此故障、推测和并发智能体可能留下部分效果、丢失分支残留、陈旧写入或不可逆发送。正确的结算需要两个事实，而重试、检查点重放、锁和补偿各自混淆了这些事实：哪些效果必须一起结算，以及何时较早的冲突工作已耗尽。Atomix通过进度感知事务使这种分离明确化。运行时在执行期间记录读取和效果，当足迹完成时密封事务，并且仅在每个资源的前沿显示没有更早的冲突工作可能到达后才提交。提交是最终结算：Atomix释放可缓冲效果，接受可逆外部效果为最终状态，并让不可逆效果离开。中止抑制未释放的效果，并在可能的情况下补偿外部化的可逆效果。在代表性智能体工作负载上，这种组合在注入故障下改善了干净恢复，隔离了竞争和推测工作，并防止了正确分类的不可逆动作泄漏；微基准测试显示相对于工具延迟的微秒级包装开销。

英文摘要

LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement trigger, so faults, speculation, and concurrent agents can leave partial effects, losing-branch residue, stale writes, or irreversible sends. Correct settlement needs two facts that retries, checkpoint replay, locks, and compensation each conflate: which effects must settle together, and when earlier conflicting work is exhausted. Atomix makes this split explicit with progress-aware transactions. The runtime records reads and effects during execution, seals a transaction when its footprint is complete, and commits only after per-resource frontiers show that no earlier conflicting work can still arrive. Commit is final settlement: Atomix releases bufferable effects, accepts reversible external effects as final, and lets irreversible effects leave the gate. Abort suppresses unreleased effects and compensates externalized reversible effects where possible. On representative agent workloads, this composition improves clean recovery under injected faults, isolates contending and speculative work, and prevents correctly classified irreversible actions from leaking; microbenchmarks show microsecond-scale wrapper overhead relative to tool latency.

URL PDF HTML ☆

赞 0 踩 0

2602.14134 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

DenseMLLM：用于密集预测的标准多模态大语言模型

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China（香港科技大学电子与计算机工程系）； Tencent, Youtu-Lab, China（腾讯优图实验室）

AI总结提出DenseMLLM，通过标准多模态大语言模型架构和视觉令牌监督策略，无需任务特定解码器即可实现语义分割、深度估计等密集预测任务，在多个基准上取得竞争性能。

Comments ICML 2026

详情

AI中文摘要

多模态大语言模型在高层次视觉理解方面展现出卓越能力。然而，将这些模型扩展到细粒度的密集预测任务（如语义分割和深度估计）通常需要引入复杂的任务特定解码器和其他定制化组件。这种架构碎片化增加了模型复杂度，偏离了多模态大语言模型的通用设计，最终限制了其实用性。在这项工作中，我们挑战了这一范式，通过调整标准多模态大语言模型来执行密集预测，无需额外的任务特定解码器。所提出的模型称为DenseMLLM，基于标准架构，并采用一种新颖的视觉令牌监督策略来处理多个标签和任务。尽管设计极简，我们的模型在广泛的密集预测和视觉语言基准测试中取得了极具竞争力的性能，表明标准的通用多模态大语言模型可以在没有架构专门化的情况下有效支持密集感知。该项目可在github.com/Eli-YiLi/DenseMLLM获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization. This project is available at github.com/Eli-YiLi/DenseMLLM.

URL PDF HTML ☆

赞 0 踩 0

2602.14065 2026-06-02 cs.AI 版本更新

REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

REAL: 通过推理枢轴对齐解决知识密集型视觉问答中的知识冲突

Kai Ye, Xianwei Mao, Sheng Zhou, Zirui Shao, Ye Mo, Liangliang Liu, Haikuan Huang, Bin Li, Jiajun Bu

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

AI总结提出REAL框架，通过推理枢轴对齐和引导解码，解决知识密集型视觉问答中因开放域检索引起的知识冲突问题。

Comments Accepted by ICML 2026

详情

AI中文摘要

知识密集型视觉问答（KI-VQA）经常因开放域检索的固有限制而遭受严重的知识冲突。然而，现有范式由于缺乏可泛化的冲突检测和模型内约束机制来处理冲突证据，面临关键限制。为应对这些挑战，我们提出了REAL（推理枢轴对齐）框架，其核心是新颖的推理枢轴概念。与优先考虑内部自我推导的推理步骤不同，推理枢轴作为推理链中的原子单元（节点或边），强调知识链接，通常依赖外部证据完成推理。在我们构建的REAL-VQA数据集支持下，我们的方法集成了推理枢轴感知SFT（RPA-SFT），通过将冲突与枢轴提取对齐来训练可泛化的判别器，并采用推理枢轴引导解码（RPGD），一种利用这些枢轴进行针对性冲突缓解的模型内解码策略。在多个数据集上的大量实验表明，REAL显著提高了判别准确性并实现了优越性能，验证了我们的枢轴驱动解决范式。

英文摘要

Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations due to the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose the REAL (Reasoning-Pivot Alignment) framework centered on the novel concept of the Reasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructed REAL-VQA dataset, our approach integrates Reasoning-Pivot Aware SFT (RPA-SFT) to train a generalizable discriminator by aligning conflicts with pivot extraction, and employs Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments on diverse datasets demonstrate that REAL significantly enhances discrimination accuracy and achieves superior performance, validating our pivot-driven resolution paradigm.

URL PDF HTML ☆

赞 0 踩 0

2602.13940 2026-06-02 cs.LG cs.AI 版本更新

You Can Learn Tokenization End-to-End with Reinforcement Learning

你可以通过强化学习端到端地学习分词

Sam Dauncey, Roger Wattenhofer

发表机构 * University of Waterloo（滑铁卢大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结本文提出使用强化学习中的得分函数估计来学习离散分词边界，通过时间折扣等技巧降低方差，在1亿参数规模上优于先前的直通估计方法。

Comments ICML 2026 camera-ready

详情

AI中文摘要

分词是一个硬编码的压缩步骤，尽管架构总体上趋向于端到端，但它仍然保留在大语言模型（LLM）的训练流程中。先前的工作在大规模上展示了有希望的结果，通过启发式方法将这一压缩步骤引入LLM架构内部以绘制分词边界，并尝试使用直通估计来学习这些分词边界，直通估计将绘制离散分词边界的问题视为连续问题。我们表明，这些分词边界可以通过得分函数估计来学习，由于直接优化绘制离散分词边界以最小化损失的问题，得分函数估计具有更严格的理论保证。我们观察到，强化学习中的技术，如时间折扣，对于充分降低该得分函数的方差以使其可行是必要的。我们证明，所得到的方法在1亿参数规模上，在定性和定量上都优于先前提出的直通估计方法。

英文摘要

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the $100$ million parameter scale.

URL PDF HTML ☆

赞 0 踩 0

2602.07298 2026-06-02 cs.IR cs.AI 版本更新

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation

原则性合成数据使推荐系统中的LLM首次出现缩放定律

Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Qunshu Zhang, Neeraj Bhatia, Xiangjun Fan, Hong Yan

发表机构 * Meta

AI总结本文提出一种分层框架生成高质量合成数据，通过避免原始数据噪声，首次在推荐领域实现LLM的稳健幂律缩放，并显著提升下游排序任务性能。

Comments update according to icml reviewers feedback

详情

Journal ref: ICML 2026

AI中文摘要

大型语言模型（LLM）代表了推荐系统的一个有前景的前沿，但其发展一直受到缺乏可预测缩放定律的阻碍，而缩放定律对于指导研究和优化资源分配至关重要。我们假设，这可能是由于先前持续预训练（CPT）工作中原始用户交互数据固有的噪声、偏差和不完整性所致。本文介绍了一种新颖的分层框架，用于生成高质量合成数据，通过为LLM创建精心策划的教学课程来规避此类问题。我们提供了强有力的直接证据，证明我们课程的有效性：在原则性合成数据上训练的标准序列模型在下游排序任务中显著优于（在SasRec的recall@100上提高+130%）在真实数据上训练的模型，展示了其在学习可泛化用户偏好模式方面的优越性。在此基础上，我们首次通过实验证明，在高质量、推荐特定数据上持续预训练的LLM存在稳健的幂律缩放。我们的实验揭示了跨多种合成数据模态的一致且可预测的困惑度降低。这些发现为在推荐领域可靠地缩放LLM能力建立了基础方法论，从而将研究重点从缓解数据缺陷转向利用高质量的结构化信息。

英文摘要

Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform ($+130\%$ on recall@100 for SasRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for learning generalizable user preference patterns. Building on this, we empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data. Our experiments reveal consistent and predictable perplexity reduction across multiple synthetic data modalities. These findings establish a foundational methodology for reliable scaling LLM capabilities in the recommendation domain, thereby shifting the research focus from mitigating data deficiencies to leveraging high-quality, structured information.

URL PDF HTML ☆

赞 0 踩 0

2602.11852 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

原型Transformer：迈向可解释设计的语言模型架构

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

发表机构 * University of Cambridge（剑桥大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出原型Transformer（ProtoT），一种用线性代价原型模块替代二次代价自注意力的自回归语言模型架构，原型自动捕获可命名概念，提升可解释性并支持行为编辑。

Comments Accepted at ICML 2026. Equal contribution: Yordan Yordanov and Matteo Forasassi. 40 pages, 28 figures, 22 tables

详情

AI中文摘要

尽管最先进的语言模型（LM）在某些领域超越了大多数人类，但其推理过程仍然不透明，降低了信任度并增加了欺骗和幻觉的风险。我们引入了原型Transformer（ProtoT），一种自回归LM架构，它将Transformer的二次代价自注意力模块替换为基于原型的线性代价模块，原型是学习到的参数向量。在ProtoT中，原型创建了在不同时间尺度上聚合上下文信息的通信通道。我们表明，这种结构导致原型在训练过程中自动捕获可命名的概念，例如“女人”，为解释模型推理和对模型行为进行有针对性的编辑提供了途径。与基线相比，ProtoT在模型和数据规模上具有良好的扩展性，对输入扰动具有鲁棒性，并在文本生成和下游任务（包括GLUE）上表现良好。这些结果表明，ProtoT是朝着设计上更可解释的自回归语言模型迈出的有希望的一步。

英文摘要

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.

URL PDF HTML ☆

赞 0 踩 0

2602.11790 2026-06-02 cs.AI cs.CL 版本更新

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

超越端到端视频模型：基于LLM的多智能体系统用于教育视频生成

Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

发表机构 * Baidu Inc.（百度公司）

AI总结提出LASEV，一种基于LLM的分层多智能体系统，通过将教育视频生成分解为多个专业智能体协作，解决端到端模型在逻辑严谨性和知识表示方面的不足，实现低成本、高吞吐量的自动化教学视频生产。

Comments Accepted at ACM SIGKDD 2026 (KDD '26), Applied Data Science Track. 10 pages, 2 figures, 5 tables. The project is available at \url{https://robitsg.github.io/LASEV}

详情

DOI: 10.1145/3770855.3818323

AI中文摘要

尽管最近的端到端视频生成模型在视觉导向的内容创作中表现出令人印象深刻的性能，但在需要严格逻辑严谨性和精确知识表示的场景（如教学和教育媒体）中仍然受限。为解决此问题，我们提出LASEV，一种基于LLM的分层多智能体系统，用于从教育问题生成高质量教学视频。LASEV将教育视频生成表述为一个多目标任务，同时要求正确的逐步推理、教学连贯的叙述、语义忠实的视觉演示以及精确的视听对齐。为解决先前方法的局限性——包括低程序保真度、高生产成本和有限的可控性——LASEV将生成工作流分解为通过中央编排智能体协作的专业智能体，共享生产状态、显式质量门控和迭代批评机制。具体来说，编排智能体监督一个用于严格问题求解的求解智能体、一个生成可执行可视化代码的插图智能体，以及一个面向学习者的教学脚本的叙述智能体。此外，工作智能体的所有输出都经过语义批评、基于规则的约束和基于工具的编译检查。该系统不直接合成像素，而是构建一个结构化的可执行视频脚本，该脚本通过模板驱动的组装规则确定性编译为同步的视觉和叙述，实现无需手动编辑的全自动生产。在大规模部署中，LASEV实现了每天超过一百万视频的吞吐量，与当前行业标准方法相比成本降低超过95%，同时保持高接受率。

英文摘要

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio--visual alignment. To address the limitations of prior approaches--including low procedural fidelity, high production cost, and limited controllability--LASEV decomposes the generation workflow into specialized agents that collaborate through a central Orchestrating Agent, shared production state, explicit quality gates, and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated production without manual editing. In large-scale deployments, LASEV achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.

URL PDF HTML ☆

赞 0 踩 0

2602.11177 2026-06-02 cs.CL cs.AI 版本更新

What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection

LLMs 对阿尔茨海默病了解多少？用于 AD 检测的多损失微调和探针分析

Lei Jiang, Yue Zhou, Natalie Parde

发表机构 * University of Illinois Chicago（伊利诺伊大学香槟分校）

AI总结本文通过多损失微调 BERT、T5 和 Llama-1B 模型，在三个语料库上实现文本 AD 检测新 SOTA，并利用线性探针分析内部表征中 AD 相关信息的编码。

详情

AI中文摘要

可靠的阿尔茨海默病（AD）早期检测具有挑战性，特别是由于标记数据的有限可用性。虽然大型语言模型（LLMs）在跨领域表现出强大的迁移能力，但通过监督微调将其适应 AD 领域仍 largely unexplored。在这项工作中，我们跨三个异构转录语料库（Pitt、CCC、ADRC）实证评估了各种模型架构，以研究它们在基于文本的 AD 检测中的有效性，并分析任务相关信息如何在其内部表征中编码。据我们所知，我们微调的 BERT 和 T5 模型在 Pitt 和 CCC 数据集上建立了新的最先进水平，同时在 ADRC 上取得了强劲性能。同时，仅解码器的 Llama-1B 在所有三个语料库上取得了与 BERT 和 T5 相当的高度竞争结果，突显了其在 AD 检测中的有效性。我们进一步对 Llama-1B 骨干网络进行了全面评估，分析了跨语料库可迁移性、最优输入块大小粒度以及临床转录标记的影响。此外，我们使用线性探针实证表明，微调以反映 AD 相关信号的方式改变了单个标记（语言标记和内容词）的表征。

英文摘要

Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to the limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across do mains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we empirically evaluate various model architectures across three heterogeneous transcript corpora (Pitt, CCC, ADRC) to investigate their effectiveness for text-based AD detection and analyze how task-relevant information is encoded within their internal representations. To the best of our knowledge, our fine-tuned BERT and T5 models establish a new state-of-the-art on the Pitt and CCC datasets, while achieving strong performance on ADRC. In parallel, the decoder-only Llama-1B achieves highly competitive results comparable to BERT and T5 across all three corpora, highlighting its effectiveness for AD detection. We further conduct a comprehensive evaluation of the Llama-1B backbone, analyzing cross-corpus transferability, optimal input chunk-size granularity, and the impact of clinical transcript markers. Also, we use linear probing to empirically show that fine-tuning shifts the representations of individual tokens, both linguistic markers and content words, in ways that reflect AD-related signal.

URL PDF HTML ☆

赞 0 踩 0

2507.15336 2026-06-02 cs.LG cs.AI cs.DB 版本更新

Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design

超越模型库检索：编织知识以掌握细粒度神经网络设计

Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou

发表机构 * National University of Singapore（新加坡国立大学）； Tsinghua University（清华大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出M-DESIGN框架，通过构建编辑效应证据图并采用自适应检索与预测任务规划器，在严格预算下高效发现近最优细粒度架构修改路径，在33个案例中26个达到搜索空间最佳性能。

Comments Accepted at ICML 2026. Title changed from "Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design" to "Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design"

详情

AI中文摘要

为新任务设计高性能神经网络需要在优化质量与搜索效率之间取得平衡。当前方法未能实现这一平衡：神经架构搜索计算成本高，而模型检索通常产生次优的静态检查点。为解决这一困境，我们将细粒度架构修改带来的性能增益建模为编辑效应证据，并从先验任务构建证据图。通过构建检索增强的模型精炼框架，我们提出的M-DESIGN动态编织历史证据以发现近最优的修改路径。M-DESIGN具有自适应检索机制，可快速校准来自不同来源的编辑效应证据的演化可迁移性。为处理分布外偏移，我们引入预测任务规划器，从多跳证据外推增益，从而减少对详尽知识库的依赖。基于包含22个数据集上67,760个图神经网络的知识库，大量实验表明，M-DESIGN持续优于基线，在严格预算下33个案例中有26个达到搜索空间最佳性能。

英文摘要

Designing high-performance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fail to achieve this balance: neural architectural search is computationally expensive, while model retrieval often yields suboptimal static checkpoints. To resolve this dilemma, we model the performance gains induced by fine-grained architectural modifications as edit-effect evidence and build evidence graphs from prior tasks. By constructing a retrieval-augmented model refinement framework, our proposed M-DESIGN dynamically weaves historical evidence to discover near-optimal modification paths. M-DESIGN features an adaptive retrieval mechanism that quickly calibrates the evolving transferability of edit-effect evidence from different sources. To handle out-of-distribution shifts, we introduce predictive task planners that extrapolate gains from multi-hop evidence, thereby reducing reliance on an exhaustive repository. Based on our model knowledge base of 67,760 graph neural networks across 22 datasets, extensive experiments demonstrate that M-DESIGN consistently outperforms baselines, achieving the search-space best performance in 26 out of 33 cases under a strict budget.

URL PDF HTML ☆

赞 0 踩 0

2509.18046 2026-06-02 cs.RO cs.AI cs.ET cs.SY eess.SP eess.SY 版本更新

HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba

HuMam: 基于Mamba的端到端深度强化学习人形机器人运动控制

Yinuo Wang, Yuanyang Qi, Jinzhao Zhou, Pengxiang Meng, Xiaowen Tao

发表机构 * College of Graduate and Professional Studies, Trine University（特灵大学研究生与专业研究学院）； Department of Civil Engineering, University of Hong Kong（香港大学土木工程系）； Faculty of Engineering and Information Technology, University of Technology Sydney（悉尼大学工程与信息技术学院）； National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University（吉林大学汽车底盘集成与生物仿生国家重点实验室）； School of Computer Science and Statistics, Trinity College Dublin（都柏林信任学院计算机科学与统计学系）

AI总结提出HuMam框架，使用单层Mamba编码器融合状态与步态目标，通过PPO优化实现人形机器人稳定高效的端到端运动控制。

Comments 12 pages

详情

Journal ref: 2026 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM) (CIS-RAM 2026)

AI中文摘要

端到端强化学习（RL）用于人形机器人运动因其紧凑的感知-动作映射而具有吸引力，但实际策略常受训练不稳定、特征融合低效和高执行成本困扰。我们提出HuMam，一种以状态为中心的端到端RL框架，采用单层Mamba编码器融合机器人中心状态与定向脚步目标及连续相位时钟。策略输出由低级PD环跟踪的关节位置目标，并通过PPO优化。一个简洁的六项奖励平衡接触质量、摆动平滑度、脚部放置、姿态和身体稳定性，同时隐含促进节能。在mc-mujoco中的JVRC-1人形机器人上，HuMam在强前馈基线上持续提高了学习效率、训练稳定性和整体任务性能，同时降低了功耗和扭矩峰值。据我们所知，这是首个采用Mamba作为融合骨干的端到端人形机器人RL控制器，展示了在效率、稳定性和控制经济性方面的切实提升。

英文摘要

End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.

URL PDF HTML ☆

赞 0 踩 0

2602.10623 2026-06-02 cs.LG cs.AI 版本更新

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

通过贝叶斯非负奖励建模缓解RLHF中的奖励黑客

Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo

发表机构 * Zhejiang University（浙江大学）

AI总结提出贝叶斯非负奖励模型（BNRM），通过非负因子分析和变分推断，在Bradley-Terry偏好模型中实现解耦与去偏，有效缓解奖励过度优化，提升鲁棒性和可解释性。

Comments Accepted as an Oral presentation at ICML 2026. The code is available at https://github.com/GuoweiRong/Bayesian-Non-negative-Reward-Model

详情

AI中文摘要

从人类偏好中学习的奖励模型是通过人类反馈强化学习对齐大型语言模型的核心，但由于噪声标注和系统偏差（如响应长度或风格），它们通常容易受到奖励黑客攻击。我们提出了贝叶斯非负奖励模型（BNRM），这是一个原则性的奖励建模框架，将非负因子分析整合到Bradley-Terry偏好模型中。BNRM通过稀疏的非负潜在因子生成过程表示奖励，该过程在两个互补层面运作：实例特定的潜在变量诱导解耦的奖励表示，而全局潜在因子的稀疏性作为隐式去偏机制，抑制虚假相关性。这种解耦-去偏结构共同实现了鲁棒的不确定性感知奖励学习。为了将BNRM扩展到现代LLM，我们开发了一个基于深度模型表示的条件摊销变分推断网络，实现高效的端到端训练。大量实验结果表明，BNRM显著缓解了奖励过度优化，提高了分布偏移下的鲁棒性，并比强基线产生了更可解释的奖励分解。

英文摘要

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.

URL PDF HTML ☆

赞 0 踩 0

2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR 版本更新

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Harvard University（哈佛大学）

AI总结提出层次化智能体框架SceneSmith，通过VLM智能体协作从自然语言生成仿真就绪的室内场景，相比先前方法生成3-6倍物体且碰撞率低于2%。

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

详情

AI中文摘要

仿真已成为大规模训练和评估家用机器人的关键工具，但现有环境未能捕捉真实室内空间的多样性和物理复杂性。当前的场景合成方法生成的房间稀疏布置，缺乏机器人操作所必需的密集杂乱、铰接式家具和物理属性。我们提出了SceneSmith，一个层次化智能体框架，能够从自然语言提示生成仿真就绪的室内环境。SceneSmith通过连续阶段构建场景——从建筑布局到家具放置再到小物体填充——每个阶段都实现为VLM智能体（设计师、评论家和编排者）之间的交互。该框架通过文本到3D合成生成静态物体、数据集检索获取铰接式物体以及物理属性估计，紧密集成了资产生成。SceneSmith生成的物体数量是先前方法的3-6倍，物体间碰撞率低于2%，且96%的物体在物理仿真下保持稳定。在205名参与者参与的用户研究中，与基线相比，平均真实感胜率达到92%，平均提示忠实度胜率达到91%。我们进一步证明了这些环境可用于端到端的自动机器人策略评估流程。

英文摘要

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages$\unicode{x2013}$from architectural layout to furniture placement to small object population$\unicode{x2013}$each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

URL PDF HTML ☆

赞 0 踩 0

2505.24069 2026-06-02 cs.LG cs.AI 版本更新

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

LLM 能否进行结构性推理？通过数据结构视角进行基准测试

Yu He, Yingxi Li, Colin White, Ellen Vitercik

发表机构 * Stanford University（斯坦福大学）

AI总结本文提出 DSR-Bench 基准，通过 20 种数据结构、35 种操作和 4140 个问题实例评估 LLM 的结构性推理能力，发现顶级模型在挑战性实例上仅得 0.46/1，且在空间数据、上下文丰富场景及自身代码推理上表现不佳。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情

AI中文摘要

大型语言模型（LLM）被部署在日益复杂的任务上，这些任务需要多步决策。因此，理解它们的算法推理能力至关重要。然而，我们缺乏用于评估这些能力的诊断基准。我们提议使用数据结构作为原则性视角：作为算法的基本构建块，它们自然地探测结构性推理——即理解和操作支撑算法推理的关系（如顺序、层次和连接性）的能力。我们引入了 DSR-Bench（数据结构推理基准），涵盖 20 种数据结构、35 种操作和 4140 个问题实例。DSR-Bench 具有层次化任务组织、全自动生成与评估以及细粒度诊断的特点。评估 13 个最先进的 LLM 揭示了关键局限性：表现最好的模型在挑战性实例上仅达到 0.46/1。三个针对更现实用法的辅助探针暴露了进一步的弱点：模型在空间数据和上下文丰富的场景中表现不佳，并且难以对其自身代码进行推理。

英文摘要

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating these capabilities. We propose to use data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning - the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench (Data Structure Reasoning Benchmark), spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usages expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.

URL PDF HTML ☆

赞 0 踩 0

2602.09492 2026-06-02 cs.LG cs.AI 版本更新

Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

当心批量大小：评估 LoRA 中的超参数偏差

Sangyoon Lee, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)（浦项科学技术大学（POSTECH））

AI总结本文发现批量大小是导致 LoRA 变体性能矛盾的关键因素，提出基于代理的高效调优策略，将批量大小提升为一阶设计参数。

2602.08868 2026-06-02 cs.LG cs.AI 版本更新

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

AnomSeer: 增强多模态大语言模型进行时间序列异常检测的推理能力

Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu

发表机构 * GitHub

AI总结提出AnomSeer，通过专家思维链和基于最优传输的时间序列接地策略优化，增强多模态大语言模型在时间序列异常检测中的细粒度推理能力，统一异常分类、定位和解释。

Comments ICML 2026

详情

AI中文摘要

基于多模态大语言模型（MLLM）的时间序列异常检测（TSAD）是一个新兴领域，但一个持续存在的挑战是：MLLM依赖于粗略的时间序列启发式方法，但在多维、详细的推理方面存在困难，而这对于理解复杂的时间序列数据至关重要。我们提出AnomSeer来解决这个问题，通过增强模型将其推理基于时间序列的精确结构细节，统一异常分类、定位和解释。其核心是生成专家思维链轨迹，从经典分析（如统计度量、频率变换）中提供可验证的细粒度推理。在此基础上，我们提出了一种新颖的时间序列接地策略优化（TimerPO），它在标准强化学习之外引入了两个额外组件：基于最优传输的时间序列接地优势，以及确保这种辅助细粒度信号不干扰主要检测目标的正交投影。在各种异常场景中，使用Qwen2.5-VL-3B/7B-Instruct的AnomSeer在分类和定位准确性上优于更大的商业基线（如GPT-4o），特别是在点和频率驱动的异常上。此外，它产生了合理的时间序列推理轨迹，支持其结论。

英文摘要

Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines (e.g., GPT-4o) in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible time-series reasoning traces that support its conclusions.

URL PDF HTML ☆

赞 0 踩 0

2602.08585 2026-06-02 cs.LG cs.AI 版本更新

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

预测未来效用：任务无关的KV缓存驱逐的全局组合优化

Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

发表机构 * Fudan University（复旦大学）； Baige AI Team, Baidu inc（百度AI团队）； Work done during an internship at Baidu（百度实习）

AI总结提出LU-KV框架，通过全局组合优化分配注意力头预算以最大化长期边际贡献，实现80%的KV缓存压缩且性能损失极小。

详情

AI中文摘要

鉴于注意力的二次复杂度，KV缓存驱逐对于加速模型推理至关重要。当前的KV缓存驱逐方法通常依赖于瞬时启发式度量，隐含地假设分数幅度是所有注意力头的重要性一致代理。然而，这忽略了注意力头之间预测保真度的异质性。虽然某些头优先考虑令牌的瞬时贡献，但其他头致力于捕捉长期效用。在本文中，我们提出最优预算分配应由保留长期语义信息的边际效用决定。基于这一见解，我们提出了LU-KV，这是一个新颖的框架，将头级预算分配表述为全局组合优化问题，以最大化保留令牌的长期边际贡献。为了解决这个非凸问题，我们采用凸包松弛和基于边际效用的贪婪求解器，实现接近最优的解。此外，我们实现了一个数据驱动的离线分析协议，以促进LU-KV的实际部署。在LongBench和RULER基准上的评估表明，LU-KV将KV缓存大小减少了80%，性能下降最小，同时降低了推理延迟和GPU内存占用。

英文摘要

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Building on this insight, we propose LU-KV, a novel framework that formulates head-level budget allocation as a global combinatorial optimization problem to maximize the long-horizon marginal contribution of reserved tokens. To solve this non-convex problem, we employ a convex-hull relaxation and a marginal-utility-based greedy solver, achieving near-optimal solutions. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Evaluations on LongBench and RULER benchmarks demonstrate that LU-KV reduces KV cache size by 80% with minimal performance degradation, while also decreasing inference latency and GPU memory footprint.

URL PDF HTML ☆

赞 0 踩 0

2602.08236 2026-06-02 cs.CV cs.AI cs.CL 版本更新

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

何时想象以及想象多少：基于世界模型的自适应测试时缩放用于视觉空间推理

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

发表机构 * University of North Carolina, Chapel Hill（北卡罗来纳大学教堂山分校）； Nanyang Technological University（南洋理工大学）

AI总结本文提出自适应测试时框架AVIC/AVIC-R，通过世界模型选择性调用和缩放视觉想象，在空间推理中平衡准确性与效率，超越GPT-4o等基线。

Comments the first two authors are equally contributed. Project page: https://adaptive-visual-tts.github.io/

详情

AI中文摘要

尽管多模态大语言模型（MLLMs）取得了快速进展，但当正确答案取决于场景在未见或替代视角下的外观时，视觉空间推理仍然不可靠。最近的工作通过使用世界模型进行视觉想象来增强推理，但诸如想象何时真正必要、多少想象有益、以及何时想象有害等问题仍知之甚少。在实践中，无差别的想象可能会增加计算量，甚至通过引入误导性证据而降低性能。在这项工作中，我们深入分析了作为空间推理可控资源的测试时视觉想象。我们首先研究静态视觉证据何时足够，想象何时改进推理，以及过度或不必要的想象如何影响准确性和效率。为了支持这一分析，我们随后引入了AVIC，一个基于世界模型的自适应测试时框架，该框架在选择性调用和缩放视觉想象之前，明确推理当前视觉证据的充分性。最后，为了进一步学习这种门控和规划行为，而无需任何关于何时想象以及想象多少的标注，我们引入了AVIC-R，它通过来自QA正确性奖励和想象成本惩罚的GRPO来训练策略。在空间推理基准（SAT, MMSI）和具身导航基准（R2R）上，我们的结果揭示了想象至关重要、边际或有害的明确场景，并表明选择性控制可以匹配或超越固定想象策略，同时大幅减少世界模型调用和语言标记。我们的AVIC-R超越了包括GPT-4o和GPT-4.1在内的强大专有基线，同时调用世界模型的频率更低。总体而言，我们的发现强调了分析和控制测试时想象对于高效可靠的空间推理的重要性。

英文摘要

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

URL PDF HTML ☆

赞 0 踩 0

2512.20806 2026-06-02 cs.AI 版本更新

Safety Alignment of LMs via Non-cooperative Games

通过非合作博弈实现语言模型的安全对齐

Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov

发表机构 * DeepMind, London, UK（深度Mind，伦敦，英国）

AI总结提出将安全对齐建模为攻击者与防御者语言模型之间的非零和博弈，通过在线强化学习联合训练，迭代提升安全性与实用性。

详情

AI中文摘要

确保语言模型（LM）的安全性同时保持其实用性仍然是AI对齐中的一个关键挑战。当前方法依赖于顺序对抗训练：生成对抗性提示并微调LM以防御它们。我们引入了一种不同的范式：将安全对齐视为攻击者LM和防御者LM之间的非零和博弈，通过在线强化学习联合训练。每个LM持续适应对方的演化策略，驱动迭代改进。我们的方法使用基于偏好比较的奖励信号而非点式分数，提供更稳健的监督并可能减少奖励破解。我们的强化学习方案AdvGame将安全性和实用性的帕累托前沿向外推移，产生一个同时更有帮助且对对抗性攻击更具弹性的防御者LM。此外，由此产生的攻击者LM收敛为一个强大的、通用的红队测试代理，可直接用于探测任意目标模型。代码见github.com/facebookresearch/advgame。

英文摘要

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models. Code at github.com/facebookresearch/advgame.

URL PDF HTML ☆

赞 0 踩 0

2510.08948 2026-06-02 cs.IR cs.AI 版本更新

SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management

SHERLOCK：面向LLM增强电商风险管理的动态知识适应

Nan Lu, Yurong Hu, Jiaquan Fang, Yan Liu, Rui Dong, Yiming Wang, Rui Lin, Shaoyi Xu

发表机构 * Beijing Jiaotong University（北京交通大学）； JD.com（京东公司）； Southeast University（东南大学）； Zhejiang University（浙江大学）

AI总结提出Sherlock框架，通过构建领域知识库、两阶段检索增强生成和自演化数据飞轮，将结构化知识与LLM推理结合，提升电商风险案例调查的效率和准确性。

详情

AI中文摘要

有效的电商风险管理需要深入案例调查以识别高度对抗环境中的新兴欺诈模式。然而，人工调查通常需要分析多源异构数据之间的关联和耦合，这是一个劳动密集型过程，限制了效率。虽然大型语言模型（LLM）在自动化这些分析方面显示出潜力，但其部署受到风险场景复杂性和长尾领域知识稀疏性的阻碍。为应对这些挑战，我们提出了Sherlock，一个通过三个核心模块将结构化领域知识与基于LLM的推理相结合的框架。首先，我们通过从异构知识源中提取结构化专业知识来构建领域知识库（KB）。其次，我们设计了一种针对案例调查的两阶段检索增强生成策略，该策略将输入上下文增强与反思与细化模块相结合，以充分利用知识库提高分析质量。最后，我们开发了一个用于操作和标注的集成平台，以驱动自演化数据飞轮。通过知识库更新的实时热修复与后训练定期逻辑对齐相结合，我们促进系统持续演化以对抗对抗性漂移。在京东的在线A/B测试表明，Sherlock实现了82%的专家接受率（EAR），日调查吞吐量增加了386.7%。另外90天的评估显示，该飞轮成功从两次因策略变化导致的性能衰减中恢复，通过自主模型更新将EAR上限提高了约3.5%。

英文摘要

Effective e-commerce risk management requires in-depth case investigations to identify emerging fraud patterns in highly adversarial environments. However, manual investigation typically requires analyzing the associations and couplings among multi-source heterogeneous data, a labor-intensive process that limits efficiency. While Large Language Models (LLMs) show promise in automating these analyses, their deployment is hindered by the complexity of risk scenarios and the sparsity of long-tail domain knowledge. To address these challenges, we propose Sherlock, a framework that integrates structured domain knowledge with LLM-based reasoning through three core modules. First, we construct a domain Knowledge Base (KB) by distilling structured expertise from heterogeneous knowledge sources. Second, we design a two-stage retrieval-augmented generation strategy tailored for case investigation, which combines input contextual augmentation with a Reflect & Refine module to fully leverage the KB for improved analysis quality. Finally, we develop an integrated platform for operations and annotation to drive a self-evolving data flywheel. By combining real-time hotfixes through KB updates with periodic logic alignment via post-training, we facilitate continuous system evolution to counteract adversarial drifts. Online A/B tests at JD dot com demonstrate that Sherlock achieves an 82% Expert Acceptance Rate (EAR) and a 386.7% increase in daily investigation throughput. An additional 90-day evaluation shows that the flywheel successfully recovers from performance decay caused by changing tactics twice, raising the EAR ceiling by around 3.5% through autonomous model updates.

URL PDF HTML ☆

赞 0 踩 0

2602.07218 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Collaborative and Efficient Fine-tuning: Leveraging Task Similarity

协作高效微调：利用任务相似性

Gagik Magakyan, Amirhossein Reisizadeh, Chanwoo Park, Pablo A. Parrilo, Asuman Ozdaglar

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Stanford University（斯坦福大学）

AI总结提出CoLoRA方法，通过共享适配器和个性化适配器利用任务相似性进行协作微调，提升数据稀缺下的模型性能，并在理论和实验上验证其有效性。

详情

AI中文摘要

适应性被认为是基础模型的核心特征，使其能够有效适应未见过的下游任务。参数高效的微调方法，如著名的LoRA，使得使用标记的、高质量且通常稀缺的任务数据对大型基础模型进行高效适应成为可能。为了缓解基础模型微调中的数据稀缺问题，我们提出利用多个下游用户之间的任务相似性。直观上，具有相似任务的用户必须能够相互帮助，以增加有效的微调数据量。我们提出了协作低秩适应（CoLoRA），该方法利用任务相似性来协作且高效地微调个性化基础模型。CoLoRA的主要思想是训练一个共享适配器，捕捉所有任务之间的潜在任务相似性，以及针对用户特定任务定制的个性化适配器。我们在异质线性回归上对CoLoRA进行了理论研究，并提供了真实恢复的可证明保证。我们还进行了多个具有不同任务相似性的自然语言实验，进一步表明当与相似任务一起训练时，个体性能显著提升。

英文摘要

Adaptability has been regarded as a central feature in the foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameter-efficient fine-tuning methods such as celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks must be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.

URL PDF HTML ☆

赞 0 踩 0

2602.07083 2026-06-02 cs.SE cs.AI 版本更新

Rethinking Scientific Modeling: Toward Physically Consistent and Simulation-Executable Programmatic Generation

重新思考科学建模：迈向物理一致且可模拟执行的程序化生成

Yongqing Jiang, Jianze Wang, Zhiqi Shen, Zhenghong Lin, Jiayuan Wang, Yijian Yang, Kaoshan Dai, Haoran Luo

发表机构 * Sichuan University（四川大学）； Nanyang Technological University（南洋理工大学）； Fuzhou University（福州大学）

AI总结针对大型语言模型在工程建模中生成代码的物理不一致问题，提出结合领域知识构建、约束对齐和验证驱动的框架，并引入CivilInstruct数据集和MBEval基准，通过两阶段微调提升模型生成的可执行性和物理一致性。

详情

DOI: 10.1145/3770855.3818987

AI中文摘要

结构建模是计算工程科学的基础组成部分，其中即使是微小的物理不一致或规范违反也可能使下游模拟失效。大型语言模型（LLMs）在自动生成建模代码方面的潜力已被证实。然而，在严格的工程约束下，不可执行或物理不一致的输出仍然普遍存在。因此，提出了一种物理一致自动建筑建模框架，整合了领域知识构建、面向约束的模型对齐和验证驱动的评估。引入了CivilInstruct作为领域特定数据集，形式化了结构工程知识和约束推理，以实现可模拟的模型生成。进一步采用两阶段微调策略来强制约束满足和应用程序编程接口合规性，显著减少了幻觉和不一致输出。提出了MBEval作为验证驱动的基准，通过闭环验证评估可执行性和结构动力学一致性。实验结果表明，在严格的验证指标上，该方法相比基线持续改进。我们的代码可在 https://github.com/Jovanqing/AutoBM 获取。

英文摘要

Structural modeling is a fundamental component of computational engineering science, in which even minor physical inconsistencies or specification violations may invalidate downstream simulations. The potential of large language models (LLMs) for automatic generation of modeling code has been demonstrated. However, non-executable or physically inconsistent outputs remain prevalent under stringent engineering constraints. A framework for physics-consistent automatic building modeling is therefore proposed, integrating domain knowledge construction, constraint-oriented model alignment, and verification-driven evaluation. CivilInstruct is introduced as a domain-specific dataset that formalizes structural engineering knowledge and constraint reasoning to enable simulation-ready model generation. A two-stage fine-tuning strategy is further employed to enforce constraint satisfaction and application programming interface compliance, substantially reducing hallucinated and non-conforming outputs. MBEval is presented as a verification-driven benchmark that evaluates executability and structural dynamics consistency through closed-loop validation. Experimental results show consistent improvements over baselines across rigorous verification metrics. Our code is available at https://github.com/Jovanqing/AutoBM.

URL PDF HTML ☆

赞 0 踩 0

2602.06448 2026-06-02 cs.LG cs.AI 版本更新

Principle-Evolvable Scientific Discovery via Uncertainty Minimization

通过不确定性最小化实现原理可演化的科学发现

Yingming Pu, Tao Lin, Hongyu Chen

发表机构 * Westlake University（西lake大学）； Zhejiang University（浙江大学）

AI总结提出PiEvo框架，将科学发现视为原理空间上的贝叶斯优化，通过信息导向假设选择与异常驱动增强机制，使智能体自主演化理论世界观，在四个基准上平均解质量达90.81%~93.15%，收敛速度提升83.3%。

详情

Journal ref: Proc. 43rd Intl. Conf. on Machine Learning (ICML 2026), PMLR 306

AI中文摘要

基于大型语言模型的科学智能体加速了科学发现，但由于固守初始先验，常常效率低下。现有方法主要在静态假设空间中操作，限制了新现象的发现，当基线理论失效时导致计算浪费。为解决此问题，我们提出将焦点从搜索假设转向演化底层科学原理。我们提出PiEvo，一个原理可演化框架，将科学发现视为在扩展原理空间上的贝叶斯优化。通过集成基于高斯过程的信息导向假设选择和异常驱动增强机制，PiEvo使智能体能够自主完善其理论世界观。在四个基准上的评估表明，PiEvo (1) 平均解质量高达90.81%~93.15%，比现有最优方法提升29.7%~31.1%；(2) 通过优化紧凑原理空间显著降低样本复杂度，收敛步骤加速83.3%；(3) 在不同科学领域和LLM骨干上保持稳健性能。代码公开于\hyperlink{https://github.com/amair-lab/PiEvo}{github.com/amair-lab/PiEvo}。

英文摘要

Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficiencies due to adherence to fixed initial priors. Existing approaches predominantly operate within a static hypothesis space, which restricts the discovery of novel phenomena, resulting in computational waste when baseline theories fail. To address this, we propose shifting the focus from searching hypotheses to evolving the underlying scientific principles. We present PiEvo, a principle-evolvable framework that treats scientific discovery as Bayesian optimization over an expanding principle space. By integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism, PiEvo enables agents to autonomously refine their theoretical worldview. Evaluation across four benchmarks demonstrates that PiEvo (1) achieves an average solution quality of up to 90.81%~93.15%, representing a 29.7%~31.1% improvement over the state-of-the-art, (2) attains an 83.3% speedup in convergence step via significantly reduced sample complexity by optimizing the compact principle space, and (3) maintains robust performance across diverse scientific domains and LLM backbones. Code is publicly available at \hyperlink{https://github.com/amair-lab/PiEvo}{github.com/amair-lab/PiEvo}.

URL PDF HTML ☆

赞 0 踩 0

2502.16174 2026-06-02 cs.LG cs.AI cs.CL cs.CR 版本更新

Efficient LLM Moderation with Multi-Layer Latent Prototypes

基于多层潜在原型的高效LLM审核

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

发表机构 * University of Warsaw（华沙大学）

AI总结提出多层原型审核器（MLPM），利用多层中间表示的原型实现轻量、高效且可定制的输入审核，在多个基准上达到最优性能，并可与输出审核结合提升响应安全性。

详情

AI中文摘要

尽管现代LLM在后训练过程中与人类价值观对齐，但在部署时仍需稳健的审核以防止有害输出。现有方法存在性能与效率的权衡，且难以定制以满足用户特定需求。针对这一差距，我们引入了多层原型审核器（MLPM），一种轻量级且高度可定制的输入审核工具。我们提出利用多层中间表示的原型来提高审核质量，同时保持高效率。通过设计，我们的方法对生成流水线的开销可忽略不计，并可无缝应用于任何模型。MLPM在多种审核基准上实现了最先进的性能，并在不同大小的模型系列中表现出强大的可扩展性。此外，我们展示了它能平滑集成到端到端审核流水线中，并在与输出审核技术结合时进一步提高响应安全性。总体而言，我们的工作为安全、稳健且高效的LLM部署提供了一种实用且可适应的解决方案。

英文摘要

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

URL PDF HTML ☆

赞 0 踩 0

2602.05970 2026-06-02 cs.LG cs.AI math.DS stat.ML 版本更新

Inverse Depth Scaling From Most Layers Being Similar

大多数层相似时的逆深度缩放

Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结通过分析大型语言模型和玩具残差网络，发现损失与深度成反比，归因于功能相似的层通过集成平均而非组合学习或平滑动力学离散化来减少误差，表明需要架构创新以鼓励深度组合使用。

Comments Camera-ready version, ICML 2026

2602.05951 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

更好的源，更好的流：学习条件依赖的源分布用于流匹配

Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

发表机构 * New York University（纽约大学）； KAIST AI（韩国科学技术院人工智能实验室）

AI总结本文提出在流匹配框架中学习条件依赖的源分布，通过方差正则化和源-目标方向对齐，显著提升文本到图像生成的速度和质量。

Comments Project Page: https://junwankimm.github.io/CSFM

详情

AI中文摘要

流匹配最近已成为基于扩散的生成模型的有前途的替代方案，特别是在文本到图像生成方面。尽管它在允许任意源分布方面具有灵活性，但大多数现有方法依赖于标准高斯分布（这是从扩散模型继承的选择），并且很少在这种设置中将源分布本身视为优化目标。在这项工作中，我们表明源分布的原则性设计不仅是可行的，而且在现代文本到图像系统的规模上也是有益的。具体来说，我们提出在流匹配目标下学习条件依赖的源分布，以更好地利用丰富的条件信号。我们识别了将条件直接纳入源时出现的关键失败模式，包括分布坍缩和不稳定性，并表明适当的方差正则化以及源和目标之间的方向对齐对于稳定和有效的学习至关重要。我们进一步分析了目标表示空间的选择如何影响具有结构化源的流匹配，揭示了这种设计最有效的场景。在多个文本到图像基准上的大量实验表明了一致且稳健的改进，包括FID收敛速度提高多达3倍，突出了原则性源分布设计对条件流匹配的实际好处。

英文摘要

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under flow matching objective that better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

URL PDF HTML ☆

赞 0 踩 0

2602.05395 2026-06-02 stat.ML cs.AI cs.LG 版本更新

Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

用于高效推断一致LLM答案的最优贝叶斯停止

Jingkai Huang, Will Ma, Zhengyuan Zhou

发表机构 * Stern School of Business, New York University, New York, USA（纽约大学 Stern 商学院）； Graduate School of Business, Columbia University, New York, USA（哥伦比亚大学商学院）

AI总结利用贝叶斯先验信息，通过L-聚合停止策略在达到足够一致性时提前停止采样，以最小化采样成本并高效识别最一致的LLM答案。

Comments Accepted to ICML 2026. Camera-ready version

详情

Journal ref: Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

一种提高LLM准确性的简单策略，特别是在数学和推理问题中，是采样多个响应并提交最一致达成的答案。在本文中，我们利用贝叶斯先验信息来节省采样成本，一旦达到足够的一致性就停止。尽管精确后验在计算上难以处理，我们进一步引入了一种高效的“L-聚合”停止策略，该策略仅跟踪L-1个最频繁的答案计数。理论上，我们证明L=3就足够了：这种粗略近似足以实现渐近最优性，并且严格优于无先验基线，同时具有快速的后验计算。实验上，该方法使用更少的样本识别出最一致（即众数）的LLM答案，并且可以在减少LLM调用次数（即节省LLM推理成本）高达50%的同时实现相似的答案准确性。

英文摘要

A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.

URL PDF HTML ☆

赞 0 踩 0

2511.16886 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

TRMs中的潜在推理实际上是策略改进算子

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev ； Rayan Banerjee ； Fakhri Karray ； Martin Takac

AI总结本文通过将潜在递归推理形式化为策略改进算法，解释了递归步骤何时有效提升性能，并提出结合强化学习和扩散方法的训练方案，在Tiny Recursive Model上实现18倍前向传递减少且保持性能。

详情

AI中文摘要

最近，具有潜在递归的小模型在复杂推理任务上取得了有希望的结果。这些结果通常由这样的理论解释：这种递归增加了网络的深度，使其能够紧凑地模拟更大模型的能力。然而，递归添加层的性能仍然落后于具有相同前馈深度的单次通过模型。这意味着在循环版本中，并非每个递归步骤都有效地贡献于深度。这提出了一个问题：潜在推理何时以及为何能提高性能，何时会导致无效计算？在我们的工作中，我们证明了潜在递归推理为这个问题提供了答案。我们展示了潜在递归推理可以形式化为策略改进算法。基于这些见解，我们提出使用强化学习和扩散方法的训练方案用于潜在推理模型。以Tiny Recursive Model作为测试平台，我们展示了通过我们的修改，可以避免无效计算步骤，并将前向传递总数减少18倍，同时保持性能。总的来说，我们展示了递归步骤的策略改进视角如何解释模型行为，并为进一步改进提供见解。

英文摘要

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a networks depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we demonstrate that latent recursive reasoning provides answer to this question. We show that latent recursive reasoning can be formalized as a policy improvement algorithm. Building on these insights, we propose to use a training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

URL PDF HTML ☆

赞 0 踩 0

2602.04861 2026-06-02 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

从评估到设计：利用势能面平滑度指标指导机器学习原子间势架构

Ryan Liu, Eric Qu, Tobias Kreiman, Samuel M. Blau, Aditi S. Krishnapriyan

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出键平滑度表征测试（BSCT）作为高效评估机器学习原子间势（MLIP）势能面平滑度的指标，并与分子动力学稳定性强相关，同时指导模型设计以减少伪影。

Comments Accepted at the International Conference on Machine Learning (ICML) 2026

详情

AI中文摘要

机器学习原子间势（MLIP）有时无法再现量子势能面（PES）的物理平滑性，导致下游模拟中出现标准能量和力回归评估无法捕捉的错误行为。现有评估方法（如微正则分子动力学（MD））计算成本高且主要探测近平衡态。为改进MLIP的评估指标，我们引入键平滑度表征测试（BSCT）。该高效基准通过受控键变形探测PES，检测近平衡和远离平衡态的非平滑性，包括不连续性、人工极小值和虚假力。我们证明BSCT与MD稳定性强相关，而成本仅为MD的一小部分。为展示BSCT如何指导迭代模型设计，我们利用无约束Transformer主干作为测试平台，说明如何通过改进（如新的可微$k$-最近邻算法和温度控制注意力）减少指标识别的伪影。通过基于BSCT系统优化模型设计，所得MLIP同时实现了低传统E/F回归误差、稳定的MD模拟和鲁棒的原子性质预测。我们的结果将BSCT确立为从业者评估MLIP实用性的验证指标，以及“循环内”模型设计代理，提醒MLIP开发者注意当前MLIP基准无法高效评估的物理挑战。BSCT数据集和评估可在https://github.com/ryanliu30/bsct.git获取。

英文摘要

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable $k$-nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT as both a validation metric for practitioners to assess MLIP utility and as an "in-the-loop" model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks. The BSCT dataset and evaluation are available on https://github.com/ryanliu30/bsct.git

URL PDF HTML ☆

赞 0 踩 0

2501.18649 2026-06-02 cs.CL cs.AI cs.IR cs.LG 版本更新

Fake News Detection After LLM Laundering: Measurement and Explanation

LLM清洗后的假新闻检测：测量与解释

Rupak Kumar Das, Jonathan Dodge

发表机构 * College of IST Pennsylvania State University（宾夕法尼亚州立大学信息科学与技术学院）

AI总结研究测量检测器在识别LLM改写假新闻时的有效性，发现检测器难以检测LLM改写的假新闻，并通过LIME解释发现情感偏移是检测失败的原因之一。

详情

DOI: 10.24251/HICSS.2026.339

AI中文摘要

凭借其先进的能力，大型语言模型（LLM）可以生成高度令人信服且上下文相关的假新闻，这可能有助于传播错误信息。尽管针对人类撰写文本的假新闻检测已有大量研究，但检测LLM生成的假新闻这一领域仍探索不足。本研究测量了检测器在识别LLM改写的假新闻方面的有效性，特别是确定在检测流程中添加改写步骤是有助于还是阻碍检测。本研究贡献如下：（1）检测器在检测LLM改写的假新闻时比检测人类撰写文本更困难；（2）我们发现了哪些模型在哪些任务（逃避检测、通过改写逃避检测以及为语义相似性进行改写）上表现出色；（3）通过LIME解释，我们发现了检测失败的一个可能原因：情感偏移；（4）我们发现了一个关于改写质量测量的令人担忧的趋势：尽管BERTSCORE很高，但样本仍表现出情感偏移；（5）我们提供了一对数据集，用改写输出和分数扩充了现有数据集。该数据集可在GitHub上获取。

英文摘要

With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text, (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub

URL PDF HTML ☆

赞 0 踩 0

2602.03685 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Universal One-third Time Scaling in Learning Peaked Distributions

学习尖峰分布中的普适三分之一时间缩放

Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文通过理论分析和实验验证，揭示了使用softmax和交叉熵学习尖峰分布时，损失和梯度呈幂律衰减，导致损失时间缩放指数为1/3的普适瓶颈，为神经缩放现象提供了机理解释。

Comments Camera-ready version, ICML 2026

2602.03670 2026-06-02 cs.LG cs.AI cs.NE math.DS physics.class-ph 版本更新

Equilibrium Propagation for Non-Conservative Systems

非保守系统的平衡传播

Antonino Emanuele Scurria, Dimitri Vanden Abeele, Bortolo Matteo Mognetti, Serge Massar

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Institute for Advanced Study（高级研究院）

AI总结提出一种扩展平衡传播到非保守系统（包括前馈网络）的框架，通过在学习阶段引入与非互易相互作用成比例的项来精确计算代价函数的梯度，数值实验表明性能更优且学习更快。

Comments 23 pages

详情

AI中文摘要

平衡传播（EP）是一种受物理学启发的学习算法，它利用动力系统的稳态进行推理和学习。在其原始公式中，它仅限于保守系统，即从能量函数导出的动力学。考虑到它们的应用，将EP扩展到非保守系统（即具有非互易相互作用的系统）非常重要。先前将EP推广到此类系统的尝试未能精确计算代价函数的梯度。在这里，我们提出了一个将EP扩展到任意非保守系统（包括前馈网络）的框架。我们保留了平衡传播的关键特性，即同时使用稳态进行推理和学习。然而，我们在学习阶段通过一个与相互作用的非互易部分成比例的项修改了动力学，以便获得代价函数的精确梯度。该算法也可以通过变分公式推导，该公式通过定义在增广状态空间上的能量函数生成学习动力学。数值实验表明，该算法比先前的方案实现了更好的性能并学习更快。

英文摘要

Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, $\textit{i.e.}$ to dynamics which derive from an energy function. Given their applications, it is important to extend EP to non-conservative systems, $\textit{i.e.}$ systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary non-conservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments show that this algorithm achieves better performance and learns faster than previous proposals.

URL PDF HTML ☆

赞 0 踩 0

2602.03554 2026-06-02 cs.LG cs.AI cs.CE cs.CL 版本更新

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

当单一答案不够时：重新思考面向大语言模型的单步逆合成基准

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

发表机构 * DeepMind, London, UK（伦敦英国深度思维公司）

AI总结针对现有逆合成基准依赖单一真实答案的局限，提出基于化学合理性度量ChemCensor的新评估框架，并构建数据集CREED训练模型以提升性能。

2602.03211 2026-06-02 cs.LG cs.AI 版本更新

Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

前瞻样本奖励引导用于扩散模型的测试时缩放

Yeongmin Kim, Donghyeok Shin, Byeonghu Na, Minsang Park, Richard Lee Kim, Il-Chul Moon

发表机构 * KAIST（韩国科学技术院）

AI总结提出一种高效测试时缩放方法LiDAR采样，通过前瞻几步采样和精确求解器引导粒子向高奖励区域移动，无需反向传播，在GenEval上达到与最新梯度引导方法相同性能且加速9.5倍。

Comments ICML 2026 Spotlight

详情

AI中文摘要

扩散模型已展现出强大的生成性能；然而，生成的样本往往未能完全符合人类意图。本文研究了一种高效的测试时缩放方法，用于从具有更高人类对齐奖励值的区域进行采样。现有的计算期望未来奖励（EFR）方法面临重要限制：反向展开导致采样成本过高，而基于Tweedie的方法（包括顺序蒙特卡洛和梯度引导）则存在偏差和固有的采样问题。我们证明，任何$\mathbf{x}_t$处的EFR仅需使用预训练扩散模型的边际样本即可计算，从而无需神经反向传播即可实现闭式奖励引导。为了进一步提高效率，我们引入了少步前瞻采样和一个精确求解器，引导粒子向高奖励的前瞻样本移动。我们将这种采样方案称为LiDAR采样。LiDAR在SDXL上达到了与最新梯度引导方法相同的GenEval性能，并实现了9.5倍的加速。我们在https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling 上发布了代码。

英文摘要

Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies an efficient test-time scaling method for sampling from regions with higher human-aligned reward values. Existing methods for computing the expected future reward (EFR) face important limitations: backward rollout incurs prohibitively high sampling costs, while Tweedie-based approaches, including Sequential Monte Carlo and gradient guidance, suffer from bias and inherent sampling issues. We show that the EFR at any $\mathbf{x}_t$ can be computed using only marginal samples from a pre-trained diffusion model, enabling closed-form reward guidance without neural backpropagation. To further improve efficiency, we introduce a few-step lookahead sampling and an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5x speedup. We release the code at https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling.

URL PDF HTML ☆

赞 0 踩 0

2602.03024 2026-06-02 cs.LG cs.AI 版本更新

Consistency Deep Equilibrium Models

一致性深度均衡模型

Junchao Lin, Zenan Ling, Jingwen Xu, Robert C. Qiu

发表机构 * School of Electronic Information and Communications, Huazhong University of Science and Technology（华中科技大学电子信息学院）； School of Electronic Information（电子信息学院）； Communications, Huazhong University of Science（华中科技大学通信学院）； School of Science, Wuhan University of Technology（武汉理工大学理学院）

AI总结提出一致性深度均衡模型（C-DEQ），通过一致性蒸馏将DEQ迭代推理过程视为沿ODE轨迹演化，训练模型将中间状态直接映射到不动点，实现少步推理并保持性能，同时支持多步评估以灵活权衡计算与性能，实验表明在相同少步推理预算下精度提升2-20倍。

详情

AI中文摘要

深度均衡模型（DEQ）已成为深度学习中的一种强大范式，能够以恒定的内存使用量建模无限深度网络。然而，由于不动点求解器的迭代性质，DEQ会带来显著的推理延迟。在这项工作中，我们引入了一致性深度均衡模型（C-DEQ），这是一种利用一致性蒸馏来加速DEQ推理的新框架。我们将DEQ迭代推理过程视为沿固定ODE轨迹向均衡演化。沿着这条轨迹，我们训练C-DEQ将中间状态一致地直接映射到不动点，从而在保持教师DEQ性能的同时实现少步推理。同时，它支持多步评估，以灵活地权衡计算与性能提升。跨多个领域任务的广泛实验表明，在相同的少步推理预算下，C-DEQ相比隐式DEQ实现了2-20倍的精度提升。我们的代码可在https://github.com/landrarwolf/CDEQ获取。

英文摘要

Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieve consistent 2-20$\times$ accuracy improvements over implicit DEQs under the same few-step inference budget. Our code is available at https://github.com/landrarwolf/CDEQ.

URL PDF HTML ☆

赞 0 踩 0

2602.02886 2026-06-02 cs.LG cs.AI 版本更新

Mixture of Concept Bottleneck Experts

概念瓶颈专家混合模型

Francesco De Santis, Gabriele Ciravegna, Giovanni De Felice, Arianna Casanova, Francesco Giannini, Michelangelo Diligenti, Johannes Schneider, Danilo Giordano, Mateo Espinosa Zarlenga, Pietro Barbiero

发表机构 * University of Padua（帕多瓦大学）

AI总结提出概念瓶颈专家混合模型（M-CBE），通过引入多个专家表达式和灵活的函数形式，在保持可解释性的同时提升预测精度和适应性。

详情

AI中文摘要

概念瓶颈模型（CBM）通过将预测基于人类可理解的概念来促进可解释性。然而，现有的CBM通常将其任务预测器限制为单个表达式，其函数形式是预先设定的，这限制了预测精度和对不同用户需求的适应性。我们提出了概念瓶颈专家混合模型（M-CBE），这是一个沿两个维度推广现有CBM的框架：任务预测器用于将概念映射到任务的表达式数量（称为专家），以及每个表达式所采用的函数形式，从而揭示了该设计空间中一个未被充分探索的区域。我们通过实例化两个新颖的模型来研究这一区域：线性M-CBE，它学习一组有限的线性表达式；以及符号M-CBE，它利用符号回归从数据中发现专家函数，受限于用户指定的算子词汇表。实证评估表明，改变表达式的数量及其函数形式为导航精度-可解释性权衡提供了一个稳健的框架。

英文摘要

Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically constrain their task predictor to a single expression whose functional form is set a priori, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBE), a framework that generalizes existing CBMs along two dimensions: the number of expressions, referred to as experts, employed by the task predictor to map concepts to the task, and the functional form each expression takes, thus exposing an underexplored region of this design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data subject to user-specified operator vocabularies. Empirical evaluation demonstrates that varying the number of expressions and their functional form provides a robust framework for navigating the accuracy-interpretability trade-off.

URL PDF HTML ☆

赞 0 踩 0

2602.02557 2026-06-02 cs.LG cs.AI cs.SD 版本更新

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

对齐诅咒：模态对齐通过文本传输增强音频攻击

Yupeng Chen, Junchi Yu, Aoxi Liu, Baoyuan Wu, Philip Torr, Adel Bibi

发表机构 * University of Oxford（牛津大学）

AI总结本文提出并验证了“对齐诅咒”原理，即更强的文本-音频模态对齐会促进文本攻击向音频的迁移，并通过黑盒实验表明文本转移的音频攻击性能与原生音频攻击相当甚至更优，揭示了能力与安全之间的根本矛盾。

Comments 23 pages, 5 figures

详情

AI中文摘要

近期端到端训练的全能模型通过加强文本-音频模态对齐显著提升了音频能力。然而，这种对齐是否无意中促进了安全漏洞跨模态的转移仍未被充分探索。这一问题至关重要，因为基于文本的越狱攻击远比基于音频的攻击成熟；如果它们系统性转移，当前的音频安全评估可能低估源自文本模态的风险。在本文中，我们引入了“对齐诅咒”，这是一个经过形式化表征和实证验证的原理，表明更强的模态对齐使得攻击从文本到音频的转移更有效，揭示了能力与安全之间的根本矛盾。基于这一原理，我们在最新的全能模型（如Qwen2.5-Omni、Qwen3-Omni）上对三类攻击（文本攻击、文本转移的音频攻击和音频攻击）进行了全面的黑盒评估。我们发现，文本转移的音频攻击与基于音频的攻击表现相当，甚至更优，在仅音频访问下展现出明显优势。这表明基于文本的漏洞在塑造音频安全风险中扮演关键角色。最后，我们实证分析了不同攻击方法和模型下模态对齐与转移有效性之间的关系，观察到对“对齐诅咒”的一致支持：更紧密的模态对齐导致更有效的跨模态攻击转移。

英文摘要

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.

URL PDF HTML ☆

赞 0 踩 0

2602.02547 2026-06-02 cs.LG cs.AI 版本更新

naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

naPINN: 用于从损坏测量中恢复物理的噪声自适应物理信息神经网络

Hankyeol Kim, Pilsung Kang

发表机构 * Department of Industrial Engineering（工业工程系）； Seoul National University（首尔国立大学）

AI总结提出噪声自适应物理信息神经网络(naPINN)，通过嵌入能量模型学习残差分布并自适应过滤异常值，从非高斯噪声和离群点损坏的测量中鲁棒恢复物理解。

详情

AI中文摘要

物理信息神经网络(PINNs)是解决逆问题和从观测数据中发现控制方程的有效方法。然而，在复杂测量噪声和严重离群点下，其性能显著下降。为解决此问题，我们提出了噪声自适应物理信息神经网络(naPINN)，该网络无需噪声分布先验知识，即可从损坏测量中鲁棒恢复物理解。naPINN在训练循环中嵌入一个基于能量的模型，以学习预测残差的潜在分布。利用学习到的能量景观，一个可训练的可靠性门自适应地过滤具有高能量的数据点，同时拒绝代价正则化防止丢弃有效数据导致的平凡解。我们在被非高斯噪声和不同比例离群点损坏的各种基准偏微分方程上展示了naPINN的有效性。结果表明，naPINN显著优于现有的鲁棒PINN基线，成功隔离离群点并在严重数据损坏下准确重建动力学。

英文摘要

Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observational data. However, their performance degrades significantly under complex measurement noise and gross outliers. To address this issue, we propose the Noise-Adaptive Physics-Informed Neural Network (naPINN), which robustly recovers physical solutions from corrupted measurements without prior knowledge of the noise distribution. naPINN embeds an energy-based model into the training loop to learn the latent distribution of prediction residuals. Leveraging the learned energy landscape, a trainable reliability gate adaptively filters data points exhibiting high energy, while a rejection cost regularization prevents trivial solutions where valid data are discarded. We demonstrate the efficacy of naPINN on various benchmark partial differential equations corrupted by non-Gaussian noise and varying rates of outliers. The results show that naPINN significantly outperforms existing robust PINN baselines, successfully isolating outliers and accurately reconstructing the dynamics under severe data corruption.

URL PDF HTML ☆

赞 0 踩 0

2602.02470 2026-06-02 cs.AI 版本更新

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

通过身份桥打破自回归语言模型中的逆转诅咒

Xutao Ma, Yixiao Huang, Hanlin Zhu, Somayeh Sojoudi

发表机构 * UC Berkeley（加州大学伯克利分校）

AI总结提出一种名为“身份桥”的简单数据正则化方法（形式为“A→A”），通过理论分析和实验证明该方法能有效缓解自回归语言模型中的逆转诅咒，使模型从事实记忆转向规则学习。

详情

AI中文摘要

自回归大型语言模型（LLMs）在许多复杂任务中取得了显著成功，但在非常简单的逻辑推理中仍可能失败，例如“逆转诅咒”——当模型在形如“$A \rightarrow B$”（例如，爱丽丝的丈夫是鲍勃）的前向知识数据上训练时，在测试时无法推断出逆转知识“$B \leftarrow A$”（例如，鲍勃的妻子是爱丽丝）。大量先前的研究表明，这种失败是自回归因果LLMs固有的根本限制，表明这些模型倾向于记忆事实层面的知识，而不是捕捉更高级别的规则。在本文中，我们通过展示这种看似根本的限制可以通过略微调整训练数据，使用一种简单的正则化数据配方（称为“身份桥”，形式为“$A \to A$”，例如，爱丽丝的名字是爱丽丝）来缓解，从而挑战了这一观点。理论上，我们证明在这种配方下，即使是一层Transformer也可以通过分析梯度下降的隐式偏差来打破逆转诅咒。实验上，我们展示了一个10亿参数的预训练语言模型，在使用所提出的数据配方进行微调后，在逆转任务上达到了50%的成功率，而仅在前向知识数据上训练时成功率接近零。我们的工作为逆转诅咒提供了新颖的理论基础，并为鼓励LLMs从数据中学习更高级别的规则提供了一条原则性、低成本的路径。

英文摘要

Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" -- when trained on forward knowledge data of the form "$A \rightarrow B$" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "$B \leftarrow A$" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "$A \to A$" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 50% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.

URL PDF HTML ☆

赞 0 踩 0

2602.02416 2026-06-02 cs.AI 版本更新

Structure Enables Effective Self-Localization of Errors in LLMs

结构使语言模型能够有效自我定位错误

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Kavosh Asadi, Youliang Yu, Daniel Jiang, Boris Vidolov, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

发表机构 * Meta AI ； Columbia University（哥伦比亚大学）； Meta Superintelligence Labs（Meta超智能实验室）； Tel Aviv University（特拉维夫大学）

AI总结本文提出结构化推理方法，通过将推理分解为离散语义步骤，使语言模型能更可靠地定位错误，并基于此设计了迭代纠正采样框架Thought-ICS，实现20-40%的自我纠正提升。

详情

AI中文摘要

语言模型的自我纠正仍然难以实现。在这项工作中，我们探索语言模型是否能够显式定位错误推理中的错误，作为构建能够有效自我纠正的AI系统的一条途径。我们引入了一种提示方法，将推理结构化为离散的、语义连贯的思维步骤，并表明模型在这种结构内比在传统的、非结构化的思维链推理中更可靠地定位错误。受人类大脑在离散决策点监控错误并重新采样替代方案的启发，我们引入了思维迭代纠正采样（Thought-ICS），一个自我纠正框架。Thought-ICS迭代地提示模型一次生成一个离散且完整的思维——其中每个思维代表模型的一个深思熟虑的决策——为精确的错误定位创建自然边界。在验证时，模型定位第一个错误步骤，系统回溯并从最后一个正确点生成替代推理。当要求纠正被预言机验证为不正确的推理时，Thought-ICS实现了20-40%的自我纠正提升。在完全没有外部验证的完全自主设置中，它优于当代自我纠正基线。

英文摘要

Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in incorrect reasoning, as a path toward building AI systems that can effectively correct themselves. We introduce a prompting method that structures reasoning as discrete, semantically coherent thought steps, and show that models can localize errors more reliably within this structure than in conventional, unstructured chain-of-thought reasoning. Motivated by how the human brain monitors errors at discrete decision points and resamples alternatives, we introduce Iterative Correction Sampling of Thoughts (Thought-ICS), a self-correction framework. Thought-ICS iteratively prompts the model to generate reasoning one discrete and complete thought at a time--where each thought represents a deliberate decision by the model--creating natural boundaries for precise error localization. Upon verification, the model localizes the first erroneous step, and the system backtracks to generate alternative reasoning from the last correct point. When asked to correct reasoning verified as incorrect by an oracle, Thought-ICS achieves 20-40% self-correction lift. In a completely autonomous setting without external verification, it outperforms contemporary self-correction baselines.

URL PDF HTML ☆

赞 0 踩 0

2602.02098 2026-06-02 cs.LG cs.AI 版本更新

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

多任务强化学习的概率性能保证

Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结提出一种结合每任务有限 rollout 置信下界与任务级泛化的新泛化界，为未见任务提供高置信度性能保证。

2602.01962 2026-06-02 cs.LG cs.AI 版本更新

Zero-Shot Off-Policy Learning

零样本离策略学习

Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev（阿里普·阿萨杜拉耶夫）； Maksim Bobrin（马克西姆·博布林）； Salem Lahlou（萨勒姆·拉洛）； Dmitry Dylov（德米特里·达里夫）； Fakhri Karray（法赫里·卡里）； Martin Takac（马尔 tin 塔卡）

AI总结本文通过发现后继度量与平稳密度比的理论联系，提出一种零样本离策略学习算法，能够实时推断最优重要性采样比率并进行平稳分布修正，实现无需额外训练即可适应新任务。

详情

AI中文摘要

离策略学习方法旨在直接从固定的先前交互数据集中推导出最优策略。这一目标面临重大挑战，主要源于固有的分布偏移和价值函数高估偏差。这些问题在零样本强化学习中尤为突出，其中在无奖励数据上训练的智能体必须在测试时适应新任务而无需额外训练。在这项工作中，我们通过发现后继度量与平稳密度比的理论联系，解决了零样本场景下的离策略问题。利用这一洞见，我们的算法能够推断最优重要性采样比率，有效地为任意任务实时执行带有最优策略的平稳分布修正。我们在SMPL人体模型上的运动跟踪任务、ExoRL上的连续控制任务以及长时域OGBench任务上对方法进行了基准测试。我们的技术无缝集成到前向-后向表示框架中，并在无需训练的情况下实现对新任务的快速适应。更广泛地说，这项工作架起了离策略学习和零样本适应之间的桥梁，为两个研究领域都带来了益处。

英文摘要

Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast-adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.

URL PDF HTML ☆

赞 0 踩 0

2601.06199 2026-06-02 eess.AS cs.AI cs.SD 版本更新

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

FastSLM：用于高效长语音自适应的层次时间抽象

Junseok Lee, Sangyong Lee, Chang-Jae Chun

发表机构 * OKESTRO ； Sejong University（世宗大学）

AI总结针对长语音输入中标记爆炸问题，提出FastSLM架构，通过层次时间抽象器（HTA）实现每秒1.67个标记的极端压缩率（减少97%），在显著降低计算量和参数的同时，在长语音基准上达到与最先进模型竞争的性能。

Comments Title updated

2507.08038 2026-06-02 cs.CL cs.AI 版本更新

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

AblationBench：评估实证AI研究中消融实验的自动规划

Talor Abramovich, Gal Chechik

发表机构 * Google Research（谷歌研究）

AI总结提出AblationBench基准套件，包含作者消融和审稿人消融两个任务，用于评估语言模型在AI研究中规划消融实验的能力，实验表明当前最佳模型仅能识别45%的原始消融，低于人类水平。

Comments AI4Science Workshop, ICML 2026; Project page: https://ablation-bench.github.io/

详情

AI中文摘要

语言模型代理越来越多地被用于自动化科学研究，然而评估其科学贡献仍然是一个挑战。获得此类见解的关键机制是通过消融实验。为此，我们引入了AblationBench，这是一个用于评估代理在实证AI研究中进行消融规划任务的基准套件。它包括两个任务：AuthorAblation，帮助作者基于方法部分提出消融实验，包含83个实例；以及ReviewerAblation，帮助审稿人发现完整论文中缺失的消融，包含350个实例。对于这两个任务，我们开发了基于LM的评判器，作为自动评估框架。我们对前沿LM的实验表明，这些任务仍然具有挑战性，性能最佳的LM系统平均仅能识别45%的原始消融，低于人类水平。我们观察到作者任务和审稿人任务之间存在相反的性能趋势，这归因于模型基础的不同。最后，我们分析了当前LM在这些任务上的局限性，并发现思维链提示优于基于代理的方法。我们的数据可在https://huggingface.co/collections/ai-coscientist/ablationbench获取，代码可在https://github.com/ai-scientist-bench/ablation-bench获取。

英文摘要

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 45% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available on https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available on https://github.com/ai-scientist-bench/ablation-bench .

URL PDF HTML ☆

赞 0 踩 0

2602.00415 2026-06-02 cs.AI cs.LG 版本更新

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

PolarMem: 一种无需训练的可验证视觉语言模型极化隐式图记忆

Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Jinhan Li, Ziyan Weng, Liang Lin, Jingwei Song, Zikai Xiao, Yingwei Zhang

发表机构 * ICT, CAS（中国科学院信息科技研究院）； UCAS（中国科学院大学）； CUPB（中国政法大学）； USTC（中国科学技术大学）； CityU-DG（城市大学-数据科学）； HKU（香港大学）； ZJU（浙江大学）

AI总结提出PolarMem，一种无需训练的极化隐式图记忆框架，通过语义一致性验证和自适应分布划分将视觉语言模型感知信号转化为HAS、NOT_HAS和Uncertain记忆状态，并采用词典逻辑感知检索协议优先保证逻辑一致性，从而提升检索密集型任务性能并减少矛盾。

详情

AI中文摘要

记忆对于智能系统而言不仅是存储机制，更是组织证据和约束信念的结构。这对多模态推理尤为重要，因为检索到的证据必须既与查询相关又在视觉上一致。然而，当前视觉语言模型（VLM）的记忆系统大多保持正关联：它们检索相似或先前观察到的内容，但缺乏明确的方式记住已被验证为不存在或逻辑排除的内容。为此，我们提出 extbf{PolarMem}，一种无需训练的极化隐式图记忆框架，用于可验证的视觉语言推理。PolarMem通过语义一致性验证和自适应分布划分，将冻结的VLM感知信号转化为 extit{HAS}、 extit{NOT\_HAS}和 extit{Uncertain}记忆状态，并将其存储在具有明确正负记忆关系的极化图中。在推理时，词典逻辑感知检索协议在语义相似性之前强制执行逻辑一致性，在冲突记忆进入模型上下文之前将其抑制。在八个冻结的VLM骨干网络和六个多模态基准测试中，PolarMem一致地提升了检索密集型任务性能并减少了检索级矛盾。这些结果凸显了负记忆作为构建更可靠多模态记忆系统的关键机制。我们的代码可在https://github.com/czs-ict/PolarMem获取。

英文摘要

Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is especially important for multimodal reasoning, where retrieved evidence must be both query-relevant and visually consistent. However, current memory systems for vision-language models (VLMs) remain largely positive-associative: they retrieve what is similar or previously observed, but lack an explicit way to remember what has been verified as absent or logically excluded. To this end, we propose \textbf{PolarMem}, a training-free polarized latent graph memory framework for verifiable vision-language reasoning. PolarMem transforms frozen VLM perceptual signals into \textit{HAS}, \textit{NOT\_HAS}, and \textit{Uncertain} memory states through semantic consistency verification and adaptive distributional partitioning, and stores them in a polarized graph with distinct positive and negative memory relations. During inference, a lexicographical logic-aware retrieval protocol enforces logical consistency before semantic similarity, suppressing conflicting memories before they enter the model context. Across eight frozen VLM backbones and six multimodal benchmarks, PolarMem consistently improves retrieval-intensive tasks and reduces retrieval-level contradictions. These results highlight negative memory as a key mechanism for building more reliable multimodal memory systems. Our code is available at https://github.com/czs-ict/PolarMem.

URL PDF HTML ☆

赞 0 踩 0

2601.23220 2026-06-02 cs.CV cs.AI 版本更新

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Med-Scout: 通过几何感知的强化学习后训练治愈多模态大语言模型在医学感知中的几何盲点

Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen

发表机构 * HKUSTGZ-ML4Health-Lab（香港科技大学-ML4Health实验室）

AI总结提出Med-Scout框架，利用无标注医学图像中的内在几何逻辑，通过强化学习和三种代理任务（层次尺度定位、拓扑拼图重建、异常一致性检测）来缓解多模态大语言模型的几何盲点，并在新基准Med-Scout-Bench上提升超过40%的几何感知性能，同时泛化到更广泛的医学理解任务。

Comments 29 pages, 14 figures. Accepted at ICML 2026

详情

AI中文摘要

尽管最近的多模态大语言模型（MLLMs）在医学诊断中展现出语言能力，但我们发现即使是最先进的MLLMs也存在一个关键的感知缺陷：几何盲点。这种无法将输出基于客观几何约束的问题导致了看似合理但事实错误的幻觉，其根源在于训练范式优先考虑语言流畅性而非几何保真度。本文介绍了Med-Scout，一种新颖的框架，通过强化学习（RL）“治愈”这种盲点，利用未标记医学图像中内在的几何逻辑。Med-Scout不依赖昂贵的人工标注，而是通过受临床医生系统阅读和推理模式启发的三种策略性代理任务推导出可验证的监督信号：层次尺度定位、拓扑拼图重建和异常一致性检测。为了严格量化这一缺陷，我们提出了Med-Scout-Bench，一个专门设计用于评估几何感知的新基准。大量评估表明，Med-Scout显著缓解了几何盲点，在我们的基准上比领先的专有和开源MLLMs提升了超过40%。此外，这种增强的几何感知泛化到更广泛的医学理解，在放射学和综合性医学VQA任务上取得了优异结果。

英文摘要

Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks inspired by the systematic reading and reasoning patterns of clinicians: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.

URL PDF HTML ☆

赞 0 踩 0

2601.22900 2026-06-02 cs.AI 版本更新

MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

MulFeRL：在多轮循环中利用语言反馈增强强化学习

Xuancheng Li, Haitao Li, Yujia Zhou, YiqunLiu, Qingyao Ai

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China（清华大学计算机科学与技术系，北京，中国）； Quancheng Laboratory（千晨实验室）

AI总结针对强化学习中标量奖励稀疏且缺乏信息的问题，提出MulFeRL框架，通过多轮语言反馈引导失败样本的再生、进度信用分配和结构化反馈注入，提升模型推理性能。

详情

AI中文摘要

具有可验证奖励的强化学习（RLVR）被广泛用于提升各领域的推理能力，但仅基于结果的标量奖励往往稀疏且信息量不足。这一限制对失败样本尤为严重，因为标量奖励仅指示解决方案不正确，而未解释推理为何失败。在本文中，我们利用更丰富的语言反馈来引导失败样本上的RLVR，并将反馈引发的进展转化为可训练的学习信号。我们提出MulFeRL（多轮反馈引导的强化学习），这是一个多轮、事件触发的RLVR框架，结合了用于反馈引导失败样本再生的进展诱导、用于从验证器确认的进展中学习的进展信用分配，以及用于将反馈整合到模型推理过程中的结构化反馈注入。在采样的OpenR1-Math上训练后，MulFeRL在领域内优于监督、自蒸馏和RLVR基线，同时展现出强大的领域外泛化能力。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed samples, where scalar rewards indicate only that a solution is incorrect without explaining why the reasoning breaks down. In this paper, we leverage richer verbal feedback to guide RLVR on failed samples and convert feedback-induced progress into trainable learning signals. We propose MulFeRL (Multi-turn Feedback-guided Reinforcement Learning), a multi-turn, event-triggered RLVR framework that combines progress induction for feedback-guided regeneration of failed samples, progress credit assignment for learning from verifier-confirmed progress, and structured feedback injection for integrating feedback into the model's reasoning process. Trained on sampled OpenR1-Math, MulFeRL outperforms supervised, self-distillation-based, and RLVR baselines in-domain, while also showing strong out-of-domain generalization.

URL PDF HTML ☆

赞 0 踩 0

2601.22651 2026-06-02 cs.LG cs.AI 版本更新

GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning

GUDA: 基于反事实的扩散模型分组训练数据归因方法

Naoki Murata, Yuhta Takida, Chieh-Hsin Lai, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji

发表机构 * University of Tokyo（东京大学）； Toyota Central Research Laboratory（丰田中央研究所）； University of California, Berkeley（加州大学伯克利分校）； Massachusetts Institute of Technology（麻省理工学院）； National Institute of Advanced Industrial Science and Technology（国家工业科学与技术研究院）

AI总结提出GUDA方法，利用机器遗忘近似反事实模型，通过似然评分规则（ELBO）量化组别影响，实现高效的分组训练数据归因。

Comments Accepted at ICML 2026. Code is available at https://github.com/sony/guda

详情

AI中文摘要

视觉生成模型的训练数据归因旨在识别哪些训练数据影响了给定输出。虽然大多数方法对单个样本进行评分，但实践者通常需要组级别的答案（例如，艺术风格或对象类别）。分组归因是反事实的：如果某个组别在训练中缺失，模型对生成样本的行为会如何变化？这种反事实的自然实现是留一组法（LOGO）重训练，即移除每个组别后重新训练模型；然而，随着组别数量的增加，计算变得不可行。我们提出了用于扩散模型的GUDA（基于组遗忘的数据归因）方法，该方法通过应用机器遗忘到共享的全数据模型而不是从头训练来近似每个反事实模型。GUDA使用全模型和每个遗忘反事实模型之间基于似然的评分规则（ELBO）的差异来量化组别影响。在CIFAR-10和Stable Diffusion的艺术风格归因上的实验表明，GUDA比语义相似性、基于梯度的归因和实例级遗忘方法更可靠地识别主要贡献组别，同时在CIFAR-10上比LOGO重训练实现了约100倍的加速。

英文摘要

Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods score individual examples, practitioners often need group-level answers (e.g., artistic styles or object classes). Group-wise attribution is counterfactual: how would a model's behavior on a generated sample change if a group were absent from training? A natural realization of this counterfactual is Leave-One-Group-Out (LOGO) retraining, which retrains the model with each group removed; however, it becomes computationally prohibitive as the number of groups grows. We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch. GUDA quantifies group influence using differences in a likelihood-based scoring rule (ELBO) between the full model and each unlearned counterfactual. Experiments on CIFAR-10 and artistic style attribution with Stable Diffusion show that GUDA identifies primary contributing groups more reliably than semantic similarity, gradient-based attribution, and instance-level unlearning approaches, while achieving ~100x speedup on CIFAR-10 over LOGO retraining.

URL PDF HTML ☆

赞 0 踩 0

2601.20115 2026-06-02 cs.AR cs.AI 版本更新

ELF：无编码器心电图语言模型家族

William Han, Tony Chen, Chaojing Duan, Xiaoyu Song, Yihang Yao, Yuzhe Yang, Michael A. Rosenberg, Emerson Liu, Ding Zhao

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Allegheny Health Network（阿勒格尼医疗网络）； University of California Los Angeles（加州大学洛杉矶分校）； University of Colorado（科罗拉多大学）； Allergy and Immunology（过敏与免疫学）

AI总结提出三种无编码器架构的ECG语言模型ELF，简化架构和训练流程，在两个数据集上达到或超越现有最优模型。

Comments 31 pages, 11 figures

2601.18783 2026-06-02 cs.LG cs.AI cs.SY eess.SY 版本更新

Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic

多目标强化学习用于高速公路卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg（计算机科学与工程系，查尔姆斯理工大学和哥德堡大学）； Department of Mechanics and Maritime Sciences, Chalmers University of Technology（机械与海洋科学系，查尔姆斯理工大学）

AI总结提出基于近端策略优化的多目标强化学习框架，学习一组帕累托最优策略以平衡安全性、能源效率和时间效率，实现无需重新训练的灵活决策。

详情

AI中文摘要

在高速公路驾驶中平衡安全性、效率和运营成本对重型车辆来说是一个具有挑战性的决策问题。一个核心困难是，通过聚合这些竞争目标得到的传统标量奖励公式往往会掩盖其权衡结构。我们提出了一个基于近端策略优化的多目标强化学习框架，该框架学习一组明确表示这些权衡的策略，并在一个可扩展的模拟平台上对卡车的战术决策进行评估。所提出的方法学习一组帕累托最优策略，捕捉三个冲突目标之间的权衡：安全性（以碰撞和成功完成量化）、能源效率和时间效率（分别以能源成本和驾驶员成本量化）。得到的帕累托前沿平滑且可解释，使得在不同冲突目标下选择驾驶行为具有灵活性。该框架允许在不同驾驶策略之间无缝切换而无需重新训练，为自动驾驶卡车应用提供了稳健且自适应的决策策略。

英文摘要

Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.

URL PDF HTML ☆

赞 0 踩 0

2509.13805 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Towards a Physics Foundation Model

迈向物理基础模型

Florian Wiesner, Zoë J. Gray, Matthias Wessling, Stephen Baek

发表机构 * University of Cambridge（剑桥大学）

AI总结提出通用物理变换器（GPhyT），通过在大规模多样化模拟数据上训练，实现单一模型在多个物理领域（如流固耦合、冲击波、热对流和多相流）的零样本泛化与长期稳定预测，性能超越专用架构7倍以上。

Comments ICML-AI4Physics 2026

详情

AI中文摘要

基础模型通过“一次训练，随处部署”的范式彻底改变了自然语言处理，即单个预训练模型无需重新训练即可适应无数下游任务。拥有物理基础模型（PFM）将是变革性的——它能够民主化高保真模拟的访问、加速科学发现，并消除对专用求解器开发的需求。然而，当前物理感知的机器学习方法仍然从根本上局限于单一狭窄领域，并且需要为每个新系统重新训练。我们提出了通用物理变换器（GPhyT），该模型在1.8 TB的多样化模拟数据上训练，证明了基础模型能力在物理领域是可以实现的。我们的关键见解是，变换器可以学习从上下文中推断支配动力学，从而使单一模型能够模拟流固耦合、冲击波、热对流和多相动力学，而无需被告知底层方程。GPhyT实现了三个关键突破：（1）在多个物理领域上表现出卓越性能，比专用架构高出7倍以上；（2）通过上下文学习，对完全未见过的物理系统进行合理的零样本泛化；（3）通过长程 rollout 实现更稳定的长期预测。通过证明单一模型可以仅从数据中学习可泛化的物理原理，这项工作为通向通用PFM开辟了道路，该模型可能改变计算科学与工程。

英文摘要

Foundation models have revolutionized natural language processing through a ``train once, deploy anywhere'' paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative - democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by more than 7x, (2) plausible zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) more stable long-term predictions through long-horizon rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.

URL PDF HTML ☆

赞 0 踩 0

2601.17952 2026-06-02 cs.CL cs.AI 版本更新

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

面向临床神经科学中基于Transformer的语言模型的稳定可解释性的单语义归因框架

Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio

发表机构 * Department of Computer Science and Technology（计算机科学与技术系）； Cancer Research UK（癌症研究英国公司）； Cambridge Institute（剑桥研究所）； University of Cambridge（剑桥大学）； United Kingdom（英国）； DIMES（迪梅斯）； University of Calabria（卡拉布里亚大学）； Italy（意大利）； Department of Computer Automatic and Management Engineering (DIAG)（计算机自动与管理工程系）； Sapienza Università di Roma（罗马大学萨皮恩扎）； Department of Psychiatry（精神病学系）； School of Computing（计算学院）； University of Kent（肯特大学）； Department of Psychology（心理学系）

AI总结提出一种通过单语义特征提取整合归因与机制视角的统一可解释性框架，生成稳定的输入级重要性分数，促进语言模型在认知健康和神经退行性疾病中的安全应用。

详情

AI中文摘要

可解释性仍然是语言模型在临床环境中部署的关键挑战，例如阿尔茨海默病的进展诊断，其中早期和可信的预测至关重要。现有的归因方法由于基于Transformer的语言模型和LLM表示的多语义性质而表现出高方法间变异性和不稳定的解释，而机制可解释性方法缺乏与模型输入和输出的直接对齐，并且不提供显式的重要性分数。我们引入了一个统一的可解释性框架，通过单语义特征提取整合了归因和机制视角。通过在基于Transformer的LM层级别构建单语义嵌入空间，并优化框架以显式减少方法间变异性，我们的方法生成稳定的输入级重要性分数，并通过感兴趣层的解压缩表示突出显著特征，推进了语言模型在认知健康和神经退行性疾病中的安全可信应用。

英文摘要

Interpretability remains a key challenge for deploying language models (LM) in clinical settings such as progression diagnosis of Alzheimer disease, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of Transformer-Based LM and LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an transformer-based LM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LMs in cognitive health and neurodegenerative disease.

URL PDF HTML ☆

赞 0 踩 0

2511.12487 2026-06-02 cs.NE cs.AI cs.CL 版本更新

ToxSearch: Evolving Prompts for Toxicity Search in Large Language Models

ToxSearch: 面向大型语言模型毒性搜索的提示演化

Onkar Shelar, Travis Desell

发表机构 * Rochester Institute of Technology（罗切斯特技术研究所）

AI总结提出ToxSearch，一种黑盒演化框架，通过同步稳态循环演化提示来测试大型语言模型的安全性，并分析不同操作符的行为及跨模型迁移性。

Comments 16 pages

详情

DOI: 10.1007/978-3-032-23607-4_9
Journal ref: In: García-Sánchez, P., Díaz Álvarez, J., Murphy, A. (eds) Applications of Evolutionary Computation. EvoApplications 2026. Lecture Notes in Computer Science, vol 16525. Springer, Cham

AI中文摘要

大型语言模型即使在安全对齐后，仍然容易受到引发毒性内容的对抗性提示的攻击。我们提出了ToxSearch，一种黑盒演化框架，通过同步稳态循环演化提示来测试模型安全性。该系统采用多种操作符，包括词汇替换、否定、回译、释义以及两种语义交叉操作符，同时一个审核预言机提供适应度指导。操作符级分析显示出异质性行为：词汇替换提供了最佳的收益-方差权衡，语义相似性交叉充当精确的低吞吐量插入器，而全局重写表现出高方差和较高的拒绝成本。使用在LLaMA 3.1 8B上演化的精英提示，我们观察到实际有意义但衰减的跨模型迁移，大多数目标上的毒性大约减半，较小的LLaMA 3.2变体表现出最强的抵抗力，而一些跨架构模型保留了较高的毒性。这些结果表明，小的、可控的扰动是系统性红队测试的有效载体，并且防御措施应预期对抗性提示的跨模型重用，而不是仅关注单模型加固。

英文摘要

Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.

URL PDF HTML ☆

赞 0 踩 0

2404.01356 2026-06-02 cs.LG cs.AI cs.CY 版本更新

Perturbation Effects on Accuracy and Fairness among Similar Individuals

扰动对相似个体间准确性和公平性的影响

Xuran Li, Hao Xue, Peng Wu, Xingjun Ma, Zhen Zhang, Huaming Chen, Flora D. Salim

发表机构 * University of New South Wales（新南威尔士大学）； The Hong Kong University of Science and Technology（香港科学与技术大学）； Key Laboratory of System Software, Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所系统软件重点实验室）； Fudan University（复旦大学）； Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences（中国科学院大学杭州先进研究所）； The University of Sydney（悉尼大学）

AI总结提出鲁棒个体公平性（RIF）概念，并开发黑盒对抗框架RIFair，通过解耦扰动策略暴露模型在语义保持扰动下同时存在的鲁棒性和公平性缺陷。

详情

AI中文摘要

深度神经网络易受对抗性扰动影响，这些扰动能在不同应用场景中同时降低预测鲁棒性和个体公平性。然而，现有评估协议通常孤立地评估这些维度，从而掩盖了关键故障模式。为弥补这一差距，我们形式化了鲁棒个体公平性（RIF）：在语义保持（真值条件保持）扰动下，预测应既相对于真实标签保持正确，又在语义等价的个体间保持不变。为在实践中揭示RIF违规，我们引入RIFair，一种黑盒对抗框架，利用解耦扰动策略构建语义保持但不鲁棒和/或不公平的实例对。跨多个模型架构和真实世界文本数据集的实验表明，仅关注鲁棒性或公平性的指标常常遗漏鲁棒偏差和不鲁棒公平行为。RIFair可靠地暴露这些隐藏的漏洞，支持RIF作为可信模型评估的必要标准。实验代码公开于https://github.com/Xuran-LI/RIFair。

英文摘要

Deep neural networks are vulnerable to adversarial perturbations that can simultaneously degrade prediction robustness and individual fairness across diverse application settings. However, existing evaluation protocols typically assess these dimensions in isolation, thereby obscuring critical failure modes. To bridge this gap, we formalize Robust Individual Fairness (RIF): under semantic-preserving (truth-condition-preserving) perturbations, predictions should remain both correct with respect to the ground truth and invariant across semantically equivalent individuals. To surface RIF violations in practice, we introduce RIFair, a black-box adversarial framework that leverages a decoupled perturbation strategy to construct semantically preserved yet unrobust and/or unfair instance pairs. Experiments across multiple model architectures and real-world textual datasets show that robustness-only or fairness-only metrics often miss Robust Biased and Unrobust Fair behaviors. RIFair}reliably exposes these hidden vulnerabilities, supporting RIF as a necessary criterion for trustworthy model assessment. The experimental code is publicly available at https://github.com/Xuran-LI/RIFair.

URL PDF HTML ☆

赞 0 踩 0

2601.14323 2026-06-02 cs.CR cs.AI cs.RO 版本更新

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

SilentDrift: 利用动作分块对视觉-语言-动作模型进行隐蔽后门攻击

Bingxin Xu, Yuzhang Shang, Binghui Wang, Emilio Ferrara

发表机构 * University of Southern California（南加州大学）； University of Central Florida（中央佛罗里达大学）； Illinois Institute of Technology（伊利诺伊理工学院）

AI总结针对视觉-语言-动作模型中的动作分块与增量位姿表示导致的视觉开环漏洞，提出一种利用平滑步函数构建满足C2连续扰动的隐蔽黑盒后门攻击方法SilentDrift，并通过关键帧攻击策略实现高攻击成功率与低中毒率。

Comments Accepted to ACL Findings 2026

详情

AI中文摘要

视觉-语言-动作（VLA）模型越来越多地部署在安全关键的机器人应用中，但其安全漏洞仍未得到充分探索。我们识别出现代VLA系统中的一个基本安全缺陷：动作分块与增量位姿表示的结合产生了块内视觉开环。该机制迫使机器人执行K步动作序列，允许每步扰动通过积分累积。我们提出SILENTDRIFT，一种利用此漏洞的隐蔽黑盒后门攻击。我们的方法采用平滑步函数构建具有保证C2连续性的扰动，确保轨迹边界处的速度和加速度为零，以满足严格的运动学一致性约束。此外，我们的关键帧攻击策略仅选择性地毒化关键的接近阶段，在最小化触发暴露的同时最大化影响。生成的毒化轨迹在视觉上与成功演示难以区分。在LIBERO上评估，SILENTDRIFT在低于2%的中毒率下实现了93.2%的攻击成功率，同时保持了95.3%的干净任务成功率。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed in safety-critical robotic applications, yet their security vulnerabilities remain underexplored. We identify a fundamental security flaw in modern VLA systems: the combination of action chunking and delta pose representations creates an intra-chunk visual open-loop. This mechanism forces the robot to execute K-step action sequences, allowing per-step perturbations to accumulate through integration. We propose SILENTDRIFT, a stealthy black-box backdoor attack exploiting this vulnerability. Our method employs the Smootherstep function to construct perturbations with guaranteed C2 continuity, ensuring zero velocity and acceleration at trajectory boundaries to satisfy strict kinematic consistency constraints. Furthermore, our keyframe attack strategy selectively poisons only the critical approach phase, maximizing impact while minimizing trigger exposure. The resulting poisoned trajectories are visually indistinguishable from successful demonstrations. Evaluated on the LIBERO, SILENTDRIFT achieves a 93.2% Attack Success Rate with a poisoning rate under 2%, while maintaining a 95.3% Clean Task Success Rate.

URL PDF HTML ☆

赞 0 踩 0

2509.06093 2026-06-02 cs.DB cond-mat.mtrl-sci cs.AI cs.CL 版本更新

Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model

基于轻结构化文本数据库和推理大语言模型的自然语言材料加工设计

Yuze Liu, Zhaoyuan Zhang, Xiangsheng Zeng, Yihe Zhang, Leping Yu, Liu Yang, Lejia Wang, Xi Yu

发表机构 * State Key Laboratory of Advanced Materials for Intelligent Sensing, Key Laboratory of Organic Integrated Circuit, Ministry of Education & Tianjin Key Laboratory of Molecular Optoelectronic Sciences, Department of Chemistry, School of Science, Tianjin University（智能传感先进材料国家重点实验室、有机集成电路重点实验室、教育部、天津分子光电子科学重点实验室、化学系、天津大学）； Language Intelligence Technology Co., Ltd.（语言智能技术有限公司）； College of Intelligence and Computing, Tianjin University（智能与计算学院、天津大学）； School of Materials and Chemical Engineering, Ningbo University of Technology（材料与化学工程学院、宁波工业大学）

AI总结将材料合成规划重构为文本推理问题，通过轻结构化知识基底结合检索增强生成与经验增强推理，在氮化硼纳米片剥离中三轮迭代获得高质量协议。

详情

AI中文摘要

材料合成步骤主要以叙述性文本形式记录在论文、方案和实验室记录中，这使得传统数据驱动优化框架难以处理。这种自然语言特性对复杂多阶段过程（如氮化硼纳米片（BNNS）的制备）构成了特殊挑战，其中结果取决于剥离、功能化和功能化中的路径依赖选择。在这里，我们将材料合成规划重构为一个文本推理问题，该问题由一个轻结构化的知识基底支持，该基底保留了程序逻辑和因果上下文，同时暴露了可计算元素以供检索。基于这种表示，我们的框架结合了语义匹配、词汇搜索和参数感知过滤，以支持检索增强生成，提供更准确、更有依据的合成指导。我们进一步引入了经验增强推理，其中从多源叙述中迭代提炼的文本指导支持假设生成、故障诊断和方案修订。我们在BNNS的目标剥离中验证了该框架，这是一个受多变量约束且文献方案在实验室间可迁移性有限的合成问题。通过将分散的文献证据与实验观察到的故障模式相结合，系统仅在三轮迭代内就收敛到一个高性能方案，该方案产生了符合目标规格的高质量超薄纳米片，大大缩短了通常由专家主导的冗长试错周期。通过实现对程序知识的自然语言推理，该框架将AI从文献辅助推向复杂材料工作流程中的主动合成规划、适应和加速。

英文摘要

Materials synthesis procedures are predominantly documented as narrative text in papers, protocols, and laboratory records, placing them beyond the reach of conventional data-driven optimization frameworks. This language-native character poses a particular challenge for complex, multistage processes such as the preparation of boron nitride nanosheets (BNNS), where outcomes depend on path-dependent choices in exfoliation, functionalization, and functionalization. Here, we recast synthesis planning of the materials as a text reasoning problem enabled by a lightly structured knowledge substrate that preserves the procedural logic and causal contexts while exposing computable elements for retrieval. Built on this representation, our framework combines semantic matching, lexical search, and parameter-aware filtering to support retrieval-augmented generation with more accurate and better-grounded synthesis guidance. We further introduce experience-augmented reasoning, in which iteratively refined text guides distilled from multi-source narratives support hypothesis generation, failure diagnosis, and protocol revision. We validated the framework in the targeted exfoliation of BNNS, a synthesis problem governed by multivariate constraints and limited transferability of literature protocols across laboratory settings. By integrating dispersed literature evidence with experimentally observed failure modes, the system converged within only three iterative rounds on a high-performing protocol that yielded high-quality ultrathin nanosheets meeting the target specifications, substantially shortening what is often a prolonged cycle of expert-led trial-and-error. By enabling language-native reasoning over procedural knowledge, this framework moves AI beyond literature assistance toward active synthesis planning, adaptation and acceleration in complex materials workflows.

URL PDF HTML ☆

赞 0 踩 0

2601.14230 2026-06-02 cs.CL cs.AI cs.HC 版本更新

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

MASCOT: 迈向多智能体社会协作伴侣系统

Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结针对多智能体系统中的人格崩溃和社会谄媚问题，提出MASCOT框架，通过双层优化策略（人格感知行为对齐与协作对话优化）提升角色一致性和对话贡献。

Comments 15 pages, 9 figures. https://hello-diana.github.io/MASCOT/

详情

AI中文摘要

多智能体系统（MAS）正成为情感和认知支持方面有前景的社会协作伴侣。然而，现有系统经常遭受人格崩溃（即智能体退化为通用、同质化的助手行为）和社会谄媚（即智能体产生冗余、非建设性的对话）。我们提出MASCOT，一个用于多视角社会协作伴侣的多智能体框架。MASCOT引入了一种新颖的双层优化策略来协调个体和集体行为：1）人格感知行为对齐，一个RLAIF驱动的流程，用于微调个体智能体以实现特定于智能体的身份；2）协作对话优化，一个群体级适应过程，促进互补、多样和富有成效的对话。我们使用源自领域内和领域外（OOD）设置的人类真实情境评估MASCOT，并与最先进的基线进行比较。MASCOT将人格一致性提高了最多+14.1，社会贡献提高了最多+10.6。广泛的评估套件，包括人类评估、多个LLM评判、三方比较和自动指标，进一步表明MASCOT产生了更符合角色且更少冗余的多智能体对话。

英文摘要

Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non-constructive dialogue. We propose MASCOT, a multi-agent framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that fine-tunes individual agents for agent-specific identities; and 2) Collaborative Dialogue Optimization, a group-level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human-grounded contexts drawn across both in-domain and out-of-domain (OOD) settings against state-of-the-art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three-way comparisons, and automatic metrics, further shows that MASCOT produces more role-consistent and less redundant multi-agent dialogue.

URL PDF HTML ☆

赞 0 踩 0

2508.06407 2026-06-02 cs.CV cs.AI eess.IV 版本更新

A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

SAR图像中舰船目标的分类感知超分辨率框架

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

发表机构 * University of Malaya（马来亚大学）

AI总结提出一种将分类目标融入超分辨率过程的算法，通过优化兼顾图像质量和分类性能的损失函数，提升SAR图像分辨率并改善分类精度。

详情

DOI: 10.1109/JSTARS.2026.3655550
Journal ref: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 6614-6622, 2026

AI中文摘要

高分辨率图像在提升分类、检测和分割等视觉识别任务性能中起着关键作用。在包括遥感和监视在内的许多领域，低分辨率图像可能限制自动分析的准确性。为此，超分辨率（SR）技术被广泛采用，试图从低分辨率输入重建高分辨率图像。相关的传统方法仅基于像素级指标专注于提升图像质量，而超分辨率图像保真度与下游分类性能之间的关系在很大程度上未被探索。这引发了一个关键问题：将分类目标直接集成到超分辨率过程中是否能进一步提高分类精度？在本文中，我们通过部署一种专门的算法策略来研究超分辨率与分类之间的关系，试图回答这一问题。我们提出了一种新颖的方法，通过优化同时考虑图像质量和分类性能的损失函数，提高合成孔径雷达图像的分辨率。我们的方法在提升图像质量（通过科学验证的图像质量指标衡量）的同时，也提高了分类精度。

英文摘要

High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.

URL PDF HTML ☆

赞 0 踩 0

2512.07436 2026-06-02 cs.AI 版本更新

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

LocalSearchBench：现实本地生活服务中的智能搜索基准测试

Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su

发表机构 * Meituan, Beijing, China（美团，北京，中国）； East China Normal University Shanghai Innovation Institute（东华大学上海创新研究院）； University of Science and Technology of China（中国科学技术大学）； Shanghai Jiaotong University（上海交通大学）； North China University of Technology, Beijing, China（华北理工大学，北京，中国）； East China Normal University Shanghai China（东华大学上海）

AI总结针对本地生活服务领域，提出包含130万商家条目和900个多跳问答任务的基准测试LocalSearchBench，并开发统一环境LocalPlayground，实验表明现有大推理模型性能不足。

Comments 12 pages; accepted to KDD 2026

详情

DOI: 10.1145/3770855.3817466
Journal ref: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD 2026), August 9--13, 2026, Jeju Island, Republic of Korea. ACM, New York, NY, USA, 12 pages

AI中文摘要

近期大推理模型（LRMs）的进展使得智能搜索系统能够在多个来源上执行复杂的多步推理。然而，大多数研究集中在通用信息检索上，很少探索具有独特挑战的垂直领域。在这项工作中，我们聚焦于本地生活服务，并引入LocalSearchBench，它涵盖了多样且复杂的业务场景。该领域的真实查询通常模糊不清，需要跨商家和产品进行多跳推理，仍然具有挑战性且未得到充分解决。作为本地生活服务中智能搜索的首个综合基准，LocalSearchBench包含一个数据库，涵盖6个服务类别和9个主要城市的超过130万条商家条目，以及来自真实用户查询的900个多跳问答任务，这些任务需要多步推理。我们还开发了LocalPlayground，一个集成多种工具供LRMs交互的统一环境。实验表明，即使是最先进的LRMs在LocalSearchBench上也表现不佳：最佳模型（DeepSeek-V3.2）仅达到35.60%的正确率，大多数模型在完整性（平均60.32%）和忠实性（平均30.72%）方面存在问题。这凸显了在本地生活服务中需要专门的基准测试和领域特定的智能体训练。代码、基准和排行榜可在https://localsearchbench.github.io/获取。

英文摘要

Recent advances in large reasoning models LRMs have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, and 900 multi-hop QA tasks from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment integrating multiple tools for LRMs interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 35.60% correctness, and most models have issues with completeness (average 60.32%) and faithfulness (average 30.72%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at https://localsearchbench.github.io/.

URL PDF HTML ☆

赞 0 踩 0

2601.04946 2026-06-02 cs.CV cs.AI 版本更新

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

原型性偏差揭示多模态评估指标中的盲点

Subhadeep Roy, Gagan Bhatia, Steffen Eger

发表机构 * University of Technology Nuremberg（图恩大学）

AI总结本文通过构建受控诊断基准PROTOBIAS，发现并验证了多模态评估指标中存在原型性偏差，即倾向于选择视觉或社会原型性高但语义错误的图像，并提出了轻量级对比训练评估器PROTOSCORE作为缓解基线。

详情

AI中文摘要

自动指标广泛用于评估文生图模型，常常在基准测试、模型选择和大规模数据过滤中取代人类判断。然而，它们可能奖励看起来合理或原型性的图像，而非忠实满足提示的图像。我们识别出原型性偏差是多模态评估中的一个系统性盲点：指标可能偏好语义不正确但在视觉或社会层面具有原型性的图像，而非正确但原型性较弱的图像。我们引入PROTOBIAS，一个跨动物、物体和人口统计的受控诊断基准，其中语义正确的图像与包含单个受控语义违反的合理原型性对抗样本进行对比。基于原型理论和社会类别原型性，PROTOBIAS通过多个提示生成器、图像生成器和独立的VLM过滤器构建，并通过提示质量、人工标注和图像质量控制进行验证。使用PROTOBIAS，我们展示了广泛使用的嵌入、奖励、基于VQA和VLM作为评判的指标经常在这些对比中失败，而人类判断仍然更忠实于语义正确性。我们进一步引入PROTOSCORE，一个轻量级对比训练评估器，作为初始缓解基线。PROTOBIAS为测量原型性驱动的指标失败和开发更语义忠实的T2I评估器提供了一个聚焦基准。

英文摘要

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and large-scale data filtering. Yet they may reward images that look plausible or prototypical rather than images that faithfully satisfy the prompt. We identify prototypicality bias as a systematic blindspot in multimodal evaluation: metrics can prefer a semantically incorrect but visually or socially prototypical image over a correct but less prototypical one. We introduce PROTOBIAS, a controlled diagnostic benchmark across Animals, Objects, and Demography, where semantically correct images are contrasted with plausible prototypical adversaries containing a single controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS is constructed with multiple prompt generators, image generators, and independent VLM filters, and validated through prompt-quality, human-annotation, and image-quality controls. Using PROTOBIAS, we show that widely used embedding, reward, VQA-based, and VLM-as-judge metrics frequently fail these contrasts, while human judgments remain more faithful to semantic correctness. We further introduce PROTOSCORE, a lightweight contrastively trained evaluator, as an initial mitigation baseline. PROTOBIAS provides a focused benchmark for measuring prototypicality-driven metric failures and developing more semantically faithful T2I evaluators.

URL PDF HTML ☆

赞 0 踩 0

2511.01938 2026-06-02 cs.LG cs.AI 版本更新

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Grokking 的几何：零损失流形上的范数最小化

Tiberiu Musat

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结本文通过约束优化视角，证明在极小学习率和权重衰减系数下，梯度下降在零损失流形上最小化权重范数，并引入近似解耦参数子集的学习动力学，推导出两层网络第一层后记忆动力学的闭式表达式，实验验证了该框架能复现 grokking 的延迟泛化和表征学习特征。

详情

AI中文摘要

Grokking 是神经网络中一种令人费解的现象，即在完全记忆训练数据之后，经过相当长的延迟才出现完全的泛化。先前的研究将这种延迟泛化与由权重衰减驱动的表征学习联系起来，但精确的潜在动力学仍然难以捉摸。在本文中，我们认为后记忆学习可以通过约束优化的视角来理解：梯度下降在零损失流形上有效地最小化权重范数。我们在无穷小学习率和权重衰减系数的极限下正式证明了这一点。为了进一步剖析这一机制，我们引入了一种近似，将一部分参数的学习动力学与网络其余部分解耦。应用这一框架，我们推导出两层网络中第一层后记忆动力学的闭式表达式。实验证实，使用我们预测的梯度模拟训练过程能够再现 grokking 的特征性延迟泛化和表征学习。

英文摘要

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.

URL PDF HTML ☆

赞 0 踩 0

2601.04539 2026-06-02 cs.NE cs.AI cs.LG 版本更新

Paradoxical noise preference in RNNs

RNN中的矛盾噪声偏好

Noah Eckstein, Manoj Srinivasan

发表机构 * Department of Mechanical and Aerospace Engineering（机械与航空航天工程系）

AI总结研究发现，在循环神经网络中，训练时注入的噪声在测试时移除反而会降低性能，网络偏好训练时的噪声水平，该现象源于噪声引起的固定点偏移。

Comments Published in Transactions on Machine Learning Research (TMLR), 2026 21 pages, 8 figures

详情

Journal ref: Transactions on Machine Learning Research, 2026

AI中文摘要

在用于模拟生物神经网络的循环神经网络（RNN）中，通常在训练期间引入噪声以模拟生物变异性和正则化学习。预期在测试时去除噪声应保持或提高性能。与这一直觉相反，我们发现连续时间RNN（CTRNN）通常在训练噪声水平或接近该水平时表现最佳。这种噪声偏好通常出现在噪声注入到神经激活函数内部时；而在激活函数外部注入噪声训练的网络在零噪声时表现最佳。该现象在多种任务中对于足够大的训练噪声鲁棒地出现；我们还展示了该现象出现在前馈神经网络中，而不仅仅是RNN中。我们的分析表明，该现象源于RNN底层随机动力学中固定点（平稳分布）的噪声诱导偏移。这些固定点偏移依赖于噪声水平，并在去除噪声时使网络输出产生偏差，从而降低性能。分析和数值结果表明，当神经状态在激活函数非线性附近运行时会产生偏差，此时噪声被不对称地衰减，而性能优化激励了在这些非线性附近运行；对于噪声在激活函数内部的网络存在这种性能激励，而外部噪声的网络则没有，这解释了为什么只有内部噪声网络表现出偏好。因此，网络可能过拟合到训练噪声本身，而不仅仅是输入-输出数据。该现象不同于随机共振，后者中非零噪声增强信号处理。我们的发现揭示了训练噪声可以成为神经网络学习到的计算的一部分，对理解神经群体动力学和设计鲁棒的人工RNN具有启示意义。

英文摘要

In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biological variability and regularize learning. The expectation is that removing the noise at test time should preserve or improve performance. Contrary to this intuition, we find that continuous-time RNNs (CTRNNs) often perform best at or near the training noise level. This noise preference typically arises when noise is injected inside the neural activation function; networks trained with noise injected outside the activation function perform best with zero noise. The phenomenon arises robustly in diverse tasks for large enough training noise; we also show the phenomenon arising in feedforward neural networks, not just in RNNs. Our analyses show that the phenomenon stems from noise-induced shifts of fixed points (stationary distributions) in the underlying stochastic dynamics of the RNNs. These fixed point shifts are noise-level dependent and bias the network outputs when the noise is removed, degrading performance. Analytical and numerical results show that the bias arises when neural states operate near activation-function nonlinearities, where noise is asymmetrically attenuated, and that performance optimization incentivizes operation near these nonlinearities; such performance incentives exist for networks with noise inside, but not outside, the activation function, explaining why only noise-in networks show the preference. Thus, networks can overfit to the training noise itself rather than just to the input-output data. The phenomenon is distinct from stochastic resonance, wherein nonzero noise enhances signal processing. Our findings reveal that training noise can become an integral part of the computation learned by neural networks, with implications for understanding neural population dynamics and for the design of robust artificial RNNs.

URL PDF HTML ☆

赞 0 踩 0

2601.03309 2026-06-02 cs.CV cs.AI 版本更新

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

VLM4VLA：重新审视视觉-语言-动作模型中的视觉-语言模型

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University（清华大学交叉信息研究院）； Qwen Team, Alibaba Inc.（阿里巴巴公司Qwen团队）

AI总结本文通过VLM4VLA最小适配管道，系统研究视觉-语言模型（VLM）的选择和能力如何影响下游视觉-语言-动作（VLA）策略性能，发现VLM通用能力无法预测下游任务表现，且视觉模块是性能瓶颈。

详情

AI中文摘要

视觉-语言-动作（VLA）模型将预训练的大型视觉-语言模型（VLM）集成到其策略主干中，因其有前景的泛化能力而受到广泛关注。本文重新审视了一个基本但很少被系统研究的问题：VLM的选择和能力如何转化为下游VLA策略的性能？我们引入了VLM4VLA，一个最小适配管道，仅使用少量新的可学习参数将通用VLM转换为VLA策略，以实现公平高效的比较。尽管简单，VLM4VLA被证明与更复杂的网络设计相比具有惊人的竞争力。通过在三个基准上的各种下游任务进行广泛的实证研究，我们发现虽然VLM初始化比从头训练提供了一致的优势，但VLM的通用能力并不能很好地预测其下游任务性能。这挑战了常见的假设，表明标准VLM能力对于有效的具身控制是必要但不充分的。我们进一步通过微调VLM在七个辅助具身任务（例如，具身问答、视觉指向、深度估计）上研究特定具身能力的影响。与直觉相反，提高VLM在特定具身技能上的性能并不能保证更好的下游控制性能。最后，模态级别的消融实验确定VLM中的视觉模块（而非语言组件）是主要的性能瓶颈。我们证明，即使在下游微调期间编码器保持冻结，向VLM的视觉编码器注入控制相关的监督也能带来一致的收益。这隔离了当前VLM预训练目标与具身动作规划需求之间持续的领域差距。

英文摘要

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policies performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

URL PDF HTML ☆

赞 0 踩 0

2601.00664 2026-06-02 cs.LG cs.AI cs.CV cs.HC cs.MM 版本更新

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing：用于自然对话的实时交互式头部化身生成

Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang

发表机构 * KAIST（韩国科学技术院）； NTU Singapore（新加坡国立大学）； DeepAuto.ai

AI总结提出Avatar Forcing框架，通过扩散强制实现实时交互式头部化身生成，利用直接偏好优化进行无标签学习，在低延迟（约500ms）下生成富有表现力的反应动作。

Comments CVPR 2026. Project page: https://taekyungki.github.io/AvatarForcing/

详情

AI中文摘要

说话头部生成从静态肖像创建逼真的化身，用于虚拟通信和内容创作。然而，当前的模型尚未传达真正交互式通信的感觉，通常生成缺乏情感投入的单向响应。我们确定了实现真正交互式化身的两个关键挑战：在因果约束下实时生成运动，以及在没有额外标注数据的情况下学习富有表现力、生动的反应。为了解决这些挑战，我们提出了Avatar Forcing，一种新的交互式头部化身生成框架，通过扩散强制建模实时用户-化身交互。该设计允许化身处理实时多模态输入，包括用户的音频和运动，以低延迟即时响应语言和非语言线索，如言语、点头和笑声。此外，我们引入了一种直接偏好优化方法，利用通过丢弃用户条件构建的合成失败样本，实现无标签的富有表现力交互学习。实验结果表明，我们的框架能够实现低延迟（约500ms）的实时交互，相比基线加速6.8倍，并生成反应性和富有表现力的化身运动，在80%以上的情况下优于基线。

英文摘要

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.

URL PDF HTML ☆

赞 0 踩 0

2512.20638 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

揭示大型语言模型及其基准测试中的能力差距

Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

发表机构 * University of Zurich（苏黎世大学）

AI总结提出一种基于稀疏自编码器概念激活的新方法，自动发现模型在细粒度概念上的弱点（模型差距）和基准测试覆盖不平衡（基准差距），并通过内部表示评估和跨基准比较进行验证。

详情

Journal ref: ICML 2026

AI中文摘要

大型语言模型的评估严重依赖标准化基准测试。这些基准测试提供了有用的聚合指标，但可能掩盖（i）模型薄弱的特定子领域（“模型差距”）和（ii）基准测试本身的不平衡覆盖（“基准差距”）。为了自动揭示这两类差距，我们提出了一种简单的新方法，利用稀疏自编码器的概念激活，在逐概念基础上识别细粒度差距。该方法还受益于将评估基于模型的内部表示，以及易于跨基准测试进行比较。我们将该方法应用于五个流行的开源模型和十几个基准测试，作为示例说明。作为对该方法的验证，我们发现我们的自动无监督方法能够恢复文献中先前记录的模型差距（例如与谄媚相关的差距），并识别出新的模型差距。我们还能够自动揭示基准差距：应属于给定基准测试范围的核心概念。我们的“能力差距”方法可以通过提供模型行为的概念级分解，并帮助基准测试开发者迭代基准测试设计，来补充现有基准测试。代码可在 https://competency-gaps.github.io 获取。

英文摘要

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our "competency gaps" method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at https://competency-gaps.github.io.

URL PDF HTML ☆

赞 0 踩 0

2506.13702 2026-06-02 cs.LG cs.AI 版本更新

Value-Free Policy Optimization via Reward Partitioning

通过奖励划分实现无价值函数策略优化

Bilal Faye, Hanane Azzag, Mustapha Lebbah

发表机构 * LIPN, Université Paris 13（巴黎第十三大学LIPN实验室）； Université Paris 13（巴黎第十三大学）； Université de Versailles Saint-Quentin Paris（巴黎- versaillies圣quentin大学）

AI总结提出Reward Partition Optimization (RPO)方法，通过基于划分的奖励归一化消除价值函数学习，实现稳定、高效的策略优化。

详情

AI中文摘要

单轨迹偏好优化方法从((提示, 响应, 奖励))元组的数据集中学习，通过直接利用标量反馈为成对偏好学习提供了一种实用的替代方案。现有方法如直接奖励优化(DRO)已显示出有希望的结果，但依赖于价值函数估计，引入了额外的方差、优化复杂性和对离策略数据的敏感性。我们引入了奖励划分优化(RPO)，一种简单且可扩展的奖励驱动目标，消除了对价值函数学习的需要。RPO通过直接从提示级奖励分布估计的基于划分的公式对奖励进行归一化，产生稳定的监督优化目标，无需辅助模型或强化学习循环。我们使用自动评估指标、LLM作为评判员的评估和优化稳定性分析，在多个编码器-解码器和仅解码器语言模型上评估RPO。实验结果表明，RPO在生成更对齐、更多样化和更少有毒内容的同时，始终优于强基线，包括SFT、KTO和DRO。

英文摘要

Single-trajectory preference optimization methods learn from datasets of ((prompt, response, reward)) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.

URL PDF HTML ☆

赞 0 踩 0

2505.08438 2026-06-02 cs.CV cs.AI 版本更新

A Survey of 3D Reconstruction with Event Cameras

事件相机三维重建综述

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Haodong Chen, Zeke Zexi Hu, Zhicheng Lu, Ying Zhou, Vera Chung, Qiang Qu, Weidong Cai

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文首次全面综述了基于事件相机的三维重建方法，按输入模态（立体、单目、多模态）和重建技术（几何、深度学习、神经渲染如NeRF和3DGS）分类，并讨论了数据集、评估、表示和动态场景重建等挑战。

Comments This survey has been accepted for publication in the Computational Visual Media Journal

详情

AI中文摘要

事件相机正迅速成为用于三维重建的强大视觉传感器，能够异步捕捉每个像素的亮度变化。与传统基于帧的相机相比，事件相机产生稀疏但时间密集的数据流，即使在高速运动、低光照和极端动态范围等挑战性条件下，也能实现鲁棒且准确的三维重建。这些能力为自动驾驶、机器人、空中导航和沉浸式虚拟现实等各个领域的变革性应用提供了巨大前景。在本文中，我们首次专门针对基于事件的三维重建进行了全面综述。现有方法根据输入模态系统地分为立体、单目和多模态系统，并根据重建方法进一步分类，包括基于几何的技术、深度学习方法以及神经渲染技术，如神经辐射场（NeRF）和3D高斯泼溅（3DGS）。在每个类别中，方法按时间顺序组织，以突出关键概念和进展的演变。此外，我们详细总结了专门适用于基于事件重建任务的公开数据集。最后，我们讨论了数据集可用性、标准化评估、有效表示和动态场景重建方面的重大开放挑战，并概述了未来研究的有见地的方向。本综述旨在作为重要参考，并为推进事件驱动三维重建的最新技术提供清晰且激励人心的路线图。

英文摘要

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel brightness changes. Compared to traditional frame-based cameras, event cameras produce sparse yet temporally dense data streams, enabling robust and accurate 3D reconstruction even under challenging conditions such as high-speed motion, low illumination, and extreme dynamic range scenarios. These capabilities offer substantial promise for transformative applications across various fields, including autonomous driving, robotics, aerial navigation, and immersive virtual reality. In this survey, we present the first comprehensive review exclusively dedicated to event-based 3D reconstruction. Existing approaches are systematically categorised based on input modality into stereo, monocular, and multimodal systems, and further classified according to reconstruction methodologies, including geometry-based techniques, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Within each category, methods are chronologically organised to highlight the evolution of key concepts and advancements. Furthermore, we provide a detailed summary of publicly available datasets specifically suited to event-based reconstruction tasks. Finally, we discuss significant open challenges in dataset availability, standardised evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research. This survey aims to serve as an essential reference and provides a clear and motivating roadmap toward advancing the state of the art in event-driven 3D reconstruction.

URL PDF HTML ☆

赞 0 踩 0

2512.18336 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

强化学习低层四旋翼控制中的动态熵调节：随机性与确定性

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department（机械工程系）； The German University in Cairo（开罗德国大学）

AI总结研究在四旋翼控制中，通过动态熵调节训练随机策略的强化学习算法，并与确定性策略算法对比，发现动态熵调节可防止灾难性遗忘并提高探索效率。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/ICCTA64612.2024.10974880
Journal ref: 2024 IEEE 34th International Conference on Computer Theory and Applications (ICCTA)

AI中文摘要

本文探讨了在训练随机策略的强化学习算法中动态熵调节的影响，并将其性能与训练确定性策略的算法进行了比较。随机策略通过优化动作的概率分布来最大化奖励，而确定性策略则为每个状态选择一个确定的动作。本文研究了使用静态熵和动态熵训练随机策略，然后执行确定性动作来控制四旋翼的效果，并与训练确定性策略并执行确定性动作进行了对比。为此，随机算法选择了软演员-评论家（SAC）算法，确定性算法选择了双延迟深度确定性策略梯度（TD3）算法。训练和仿真结果表明，动态熵调节通过防止灾难性遗忘和提高探索效率，对控制四旋翼产生了积极影响。

英文摘要

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.

URL PDF HTML ☆

赞 0 踩 0

2512.18333 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

基于软演员-评论家(SAC)的四旋翼强化学习位置控制

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department（机械电子工程系）； The German University in Cairo（埃及德国大学）

AI总结提出一种基于强化学习的四旋翼推力矢量控制架构，使用软演员-评论家算法训练，相比传统RPM控制器训练更快、路径跟踪更平滑准确。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/NILES63360.2024.10753187
Journal ref: 2024 IEEE 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES)

AI中文摘要

本文提出了一种新的基于强化学习(RL)的四旋翼控制架构。现有文献主要关注直接控制四个旋翼的转速，而本文旨在控制四旋翼的推力矢量。RL智能体计算沿四旋翼z轴的总推力百分比以及期望的滚转角(ϕ)和俯仰角(θ)。然后，智能体将计算出的控制信号连同当前四旋翼的偏航角(ψ)发送给姿态PID控制器。PID控制器再将控制信号映射为电机转速。采用软演员-评论家算法（一种无模型离策略随机RL算法）来训练RL智能体。训练结果表明，与传统的RPM控制器相比，所提出的推力矢量控制器训练时间更短。仿真结果表明，所提出的推力矢量控制器具有更平滑、更精确的路径跟踪性能。

英文摘要

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired Roll ($ϕ$) and Pitch ($θ$) angles. The agent then sends the calculated control signals along with the current quadrotor's Yaw angle ($ψ$) to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.

URL PDF HTML ☆

赞 0 踩 0

2512.18043 2026-06-02 cs.CR cs.AI cs.CY 版本更新

Securing Agentic AI Systems -- A Multilayer Security Framework

保护自主AI系统——一种多层安全框架

Sunil Arora, John Hastings

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结针对自主AI系统的独特安全挑战，本文采用设计科学研究方法，提出了一种生命周期感知的安全框架MAAIS，并引入自主AI的CIAA概念，通过多层防御机制确保AI生命周期的机密性、完整性、可用性和问责性，最后利用MITRE ATLAS进行验证。

Comments 6 pages, 2 figures, 1 table

详情

DOI: 10.1109/RAAI67517.2025.11423374
Journal ref: 2025 IEEE 5th International Conference on Robotics, Automation, and Artificial Intelligence (RAAI)

AI中文摘要

保护自主人工智能（AI）系统需要应对由自主、决策和自适应行为引入的复杂网络风险。自主AI系统正越来越多地部署在工业、组织以及网络安全、金融和医疗等关键领域。然而，它们的自主性带来了独特的安全挑战，包括未经授权的操作、对抗性操纵和动态环境交互。现有的AI安全框架未能充分应对这些挑战或自主AI的独特细微差别。本研究采用设计科学研究（DSR）方法，开发了一种专门针对自主AI系统的生命周期感知安全框架。本文介绍了MAAIS，一个自主安全框架，以及自主AI的CIAA（机密性、完整性、可用性和问责性）概念。MAAIS集成了多个防御层，以在AI生命周期中维护CIAA。通过映射已建立的MITRE ATLAS（人工智能系统对抗威胁全景）AI策略进行框架验证。本研究为在企业环境中安全部署和治理自主AI提供了一种结构化、标准化且基于框架的方法。该框架面向企业CISO、安全、AI平台和工程团队，并提供了保护自主AI工作负载的详细分步方法。

英文摘要

Securing Agentic Artificial Intelligence (AI) systems requires addressing the complex cyber risks introduced by autonomous, decision-making, and adaptive behaviors. Agentic AI systems are increasingly deployed across industries, organizations, and critical sectors such as cybersecurity, finance, and healthcare. However, their autonomy introduces unique security challenges, including unauthorized actions, adversarial manipulation, and dynamic environmental interactions. Existing AI security frameworks do not adequately address these challenges or the unique nuances of agentic AI. This research develops a lifecycle-aware security framework specifically designed for agentic AI systems using the Design Science Research (DSR) methodology. The paper introduces MAAIS, an agentic security framework, and the agentic AI CIAA (Confidentiality, Integrity, Availability, and Accountability) concept. MAAIS integrates multiple defense layers to maintain CIAA across the AI lifecycle. Framework validation is conducted by mapping with the established MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) AI tactics. The study contributes a structured, standardized, and framework-based approach for the secure deployment and governance of agentic AI in enterprise environments. This framework is intended for enterprise CISOs, security, AI platform, and engineering teams and offers a detailed step-by-step approach to securing agentic AI workloads.

URL PDF HTML ☆

赞 0 踩 0

2512.17605 2026-06-02 cs.CV cs.AI 版本更新

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

MGRegBench：一个带有解剖标志的乳腺X线图像配准新型基准数据集

Svetlana Krasnova, Emiliya Starikova, Ilia Naletov, Andrey Krylov, Dmitry Sorokin

发表机构 * MSU（莫斯科国立大学）

AI总结为解决乳腺X线图像配准中缺乏公开数据集和标准化基准的问题，提出了MGRegBench，包含5000多对图像和100对带手动标注解剖标志的数据集，并评估了多种配准方法。

详情

AI中文摘要

稳健的乳腺X线图像配准对于临床相关应用（如追踪乳腺组织疾病进展）至关重要。然而，由于缺乏透明的公共数据集和可重复的标准化基准，进展受到限制。现有研究通常使用私有数据和不一致的评估框架，因此难以直接比较。为解决这一问题，我们提出了MGRegBench，一个患者独立、无泄漏控制的乳腺X线图像配准评估协议，包含超过5000对图像，每对图像带有乳腺分割掩膜，以及100对带有手动标注解剖标志的图像，此外还有标准化的训练/评估分割和即用基线。利用这一资源，我们对多种配准方法进行了基准测试——包括经典方法（ANTs）、基于学习的方法（VoxelMorph, TransMorph）、隐式神经表示（IDIR）、一种乳腺X线专用方法，以及最近的深度学习方法MammoRegNet，并针对该模态调整了实现，同时在独立数据集SDM-MCs上验证了泛化能力。我们的贡献包括：（1）首个此规模且带有手动标注标志和掩膜的乳腺X线图像配准公共数据集；（2）一个透明、无泄漏控制的基准，首次实现了多种经典和基于机器学习的方法的同类比较；（3）在SDM-MCs上的外部验证，以测试主要趋势是否超越MGRegBench；（4）对基于深度学习的配准进行了广泛分析。我们公开发布代码和数据，为公平、可重复且临床相关的比较建立基础资源，并推动AI驱动医学影像的未来研究。

英文摘要

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. However, progress has been limited by the absence of transparent public datasets and reproducible standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a patient-disjoint, leakage-controlled evaluation protocol for mammography registration, comprising over 5,000 image pairs, each with a breast segmentation mask, and 100 pairs with manually annotated anatomical landmarks, plus standardized train/evaluation splits and ready-to-run baselines. Using this resource, we benchmark diverse registration methods -- including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a mammography-specific approach, and a recent deep learning method MammoRegNet, with implementations adapted to this modality, and validate generalization on the independent SDM-MCs dataset. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) a transparent, leakage-controlled benchmark enabling the first like-for-like comparison of diverse classical and machine learning-based methods; (3) external validation on SDM-MCs to test whether the main trend transfers beyond MGRegBench; and (4) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair, reproducible, and clinically relevant comparisons and catalyze future research in AI-driven medical imaging.

URL PDF HTML ☆

赞 0 踩 0

2512.13356 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

使用双延迟深度确定性策略梯度（TD3）控制双旋翼系统

Zeyad Gamal, Youssef Mahran, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department（机械电子工程系）； The German University in Cairo（埃及德国大学）

AI总结提出基于TD3算法的强化学习框架，用于控制双旋翼气动系统在俯仰和方位角上的稳定与轨迹跟踪，仿真和实验验证了其优于传统PID控制器的抗干扰能力。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情

DOI: 10.1109/ICSTCC62912.2024.10744717
Journal ref: 2024 28th IEEE International Conference on System Theory, Control and Computing (ICSTCC)

AI中文摘要

本文提出了一种强化学习（RL）框架，用于在特定俯仰角和方位角下控制和稳定双旋翼气动系统（TRAS），并跟踪给定轨迹。TRAS的复杂动力学和非线性特性使得使用传统控制算法进行控制具有挑战性。然而，近年来RL的发展因其在多旋翼控制中的潜在应用而引起了兴趣。本文使用双延迟深度确定性策略梯度（TD3）算法来训练RL智能体。该算法适用于具有连续状态和动作空间的环境（类似于TRAS），因为它不需要系统的模型。仿真结果展示了RL控制方法的有效性。接下来，使用风扰形式的的外部扰动来测试控制器与传统PID控制器相比的有效性。最后，在实验室装置上进行了实验，以确认控制器在实际应用中的有效性。

英文摘要

This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.

URL PDF HTML ☆

赞 0 踩 0

2512.10414 2026-06-02 cs.AI 版本更新

ShelfAware：准静态环境下基于低成本传感器的实时语义定位

Shivendra Agrawal, Jake Brawer, Ashutosh Naik, Alessandro Roncone, Bradley Hayes

发表机构 * Department of Computer Science, University of Colorado Boulder（科罗拉多大学波尔德分校计算机科学系）

AI总结提出ShelfAware语义粒子滤波器，通过将场景语义建模为类别统计证据而非固定地标，结合深度似然与类别语义相似度，并利用预计算语义视角进行逆语义提议，实现低成本视觉硬件上的鲁棒全局定位。

Comments 8 pages

详情

DOI: 10.1109/LRA.2026.3682613
Journal ref: IEEE Robotics and Automation Letters (RA-L), 2026

AI中文摘要

许多室内工作空间是准静态的：其全局几何布局稳定，但局部语义不断变化，产生重复几何结构、动态杂乱和感知噪声，使得标准基于视觉的定位失效。我们提出ShelfAware，一种用于鲁棒全局定位的语义粒子滤波器，它将场景语义视为对象类别的统计证据而非固定数量地标。ShelfAware融合深度似然与以类别为中心的语义相似度，并利用预计算的语义视角库在蒙特卡洛定位（MCL）中执行逆语义提议，从而在低成本、纯视觉硬件上实现快速、有针对性的假设生成。为了展示感知无关的可扩展性，我们在两个领域评估ShelfAware。在严格控制的模拟零售环境中，ShelfAware实现了97%的全局定位成功率，并在购物车、可穿戴和动态遮挡条件下保持了最高的跟踪成功率（66%）。此外，在利用开放词汇视觉管道的3,500平方英尺运营杂货店中，ShelfAware显著优于几何和固定数量语义基线。通过分布性建模语义并利用逆提议，ShelfAware解决了几何混叠问题，为动态真实环境中的移动和辅助机器人提供了无需基础设施的构建模块。

英文摘要

Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetitive geometry, dynamic clutter, and perceptual noise that defeat standard vision-based localization. We present ShelfAware, a semantic particle filter for robust global localization that treats scene semantics as statistical evidence over object categories rather than fixed quantity landmarks. ShelfAware fuses a depth likelihood with a category-centric semantic similarity and uses a precomputed bank of semantic viewpoints to perform inverse semantic proposals inside Monte Carlo Localization (MCL), yielding fast, targeted hypothesis generation on low-cost, vision-only hardware. To demonstrate perception-agnostic scalability, we evaluate ShelfAware across two domains. In a rigorously controlled mock retail environment, ShelfAware achieves a 97% global localization success rate, maintaining the highest tracking success (66%) across cart, wearable, and dynamic occlusion conditions. Furthermore, in a 3,500 sq. ft. operational grocery store leveraging an open-vocabulary vision pipeline, ShelfAware significantly outperforms both geometric and fixed-quantity semantic baselines. By modeling semantics distributionally and leveraging inverse proposals, ShelfAware resolves geometric aliasing, providing an infrastructure-free building block for mobile and assistive robots in dynamic real-world environments.

URL PDF HTML ☆

赞 0 踩 0

2512.07795 2026-06-02 cs.AI cs.CL cs.LG 版本更新

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

ReasonBENCH: 基准测试LLM推理的（不）稳定性

Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora

发表机构 * Aarhus University（奥胡斯大学）； Indian Institute of Technology Delhi（德里印度理工学院）； EPFL（苏黎世联邦理工学院）

AI总结提出ReasonBench基准套件，通过30次独立试验揭示LLM推理系统在贪婪解码下仍存在结构化方差，并引入全局噪声和运行噪声分类法，证明稳定性是推理系统的固有属性，倡导分布感知评估。

Comments 29 pages, 19 tables, 85 figures

详情

AI中文摘要

LLM推理系统的基准分数被报告为单一数字，然而相同的模型、策略和任务在重复执行时，即使在贪婪解码（T=0）下也会产生显著不同的答案和成本。这种方差并非统计上的麻烦：性能最高的策略在与最接近的对手进行头对头运行时仅获胜77%，这意味着单次观测到的分数可能会无声地错误排序系统。我们引入了ReasonBench，一个基准套件，记录了10种推理策略、12个模型和6个任务的30次独立试验，将质量和成本视为分布而非点估计。我们发现这种方差是有结构的而非随机的：一个双组分分类法——全局噪声（捕捉跨基准的不均匀性）和运行噪声（捕捉基准内的随机性）——揭示了策略架构预测稳定性分布，而模型和策略则移动分布的正交方面。层次分解将四分之三的分数方差归因于基准、系统和项目结构，而单次运行评估无声地吸收了持久的残差。最后，成本和成本非对称地解耦：廉价方法在结构上对联合成本-质量失败免疫，而昂贵方法无论其准确性如何仍然暴露。这些发现确立了不稳定性作为推理系统的固有属性，并促使分布感知评估成为标准实践。

英文摘要

Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as distributions rather than point estimates. We find that this variance is structured rather than random: a two-component taxonomy -- Global Noise, capturing cross-benchmark unevenness, and Run Noise, capturing within-benchmark stochasticity -- reveals that strategy architecture predicts stability profiles, while models and strategies shift orthogonal aspects of the distribution. A hierarchical decomposition attributes three-quarters of score variance to benchmark, system, and item structure, with a persistent residual that single-run evaluation silently absorbs. Finally, cost and quality decouple asymmetrically: cheap methods are structurally immune to joint cost-quality failure, while expensive methods remain exposed regardless of their accuracy. These findings establish instability as an inherent property of reasoning systems and motivate distribution-aware evaluation as standard practice.

URL PDF HTML ☆

赞 0 踩 0

2511.20639 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Collaboration in Multi-Agent Systems

多智能体系统中的潜在协作

Jiaru Zou, Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

发表机构 * University of Washington（华盛顿大学）

AI总结提出LatentMAS框架，使LLM智能体在连续潜在空间直接协作，无需文本中介，实现更高精度、更低开销和更快推理。

Comments ICML2026 Spotlight, Project: https://github.com/Gen-Verse/LatentMAS

详情

AI中文摘要

多智能体系统（MAS）将大语言模型（LLM）从独立的单模型推理扩展到协同的系统级智能。现有LLM智能体依赖基于文本的中介进行推理和通信，而我们更进一步，使模型能够在连续潜在空间内直接协作。我们引入了LatentMAS，一个端到端无需训练的框架，实现了LLM智能体间的纯潜在协作。在LatentMAS中，每个智能体首先通过最后一层的隐藏嵌入而非文本进行自回归潜在思维生成。然后，一个共享的潜在工作记忆保存并传递每个智能体的内部表示和潜在思维，确保无需重新编码的无损信息交换。我们提供了详细的理论分析，表明LatentMAS比基于文本的标准MAS具有更高的表达能力和无损信息保存能力，且整体复杂度更低。此外，在涵盖数学和科学推理、常识理解及代码生成的9个综合基准测试上的实证评估表明，LatentMAS优于先进的单智能体和基于文本的MAS基线，准确率最高提升14.6%，输出token使用量减少70.8%-83.7%，端到端推理速度提升4倍至4.3倍。代码和数据完全开源：https://github.com/Gen-Verse/LatentMAS。

英文摘要

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent's internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4$\times$-4.3$\times$ faster end-to-end inference. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

URL PDF HTML ☆

赞 0 踩 0

2512.00062 2026-06-02 cs.RO cs.AI cs.LG 版本更新

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

SpeedAug: 通过节奏增强策略和强化学习微调实现策略加速

Taewook Nam, Junmo Cho, Youngsoo Jang, Sung Ju Hwang

发表机构 * KAIST（韩国科学技术院）； UNIST（全南大学）； DeepAuto.ai

AI总结提出SpeedAug框架，通过节奏增强先验策略和强化学习微调，使机器人策略学习任务最优执行节奏，在保持高成功率的同时显著提升执行速度和样本效率。

详情

AI中文摘要

针对复杂真实世界操作任务的机器人策略学习近期取得了快速进展，这在很大程度上得益于通过人类操作收集演示数据的能力。然而，从这些演示中训练出的策略通常执行任务的速度远低于机器人的物理能力，因为演示数据是在实际约束下收集的，这些约束倾向于保守的、以成功为导向的轨迹，而非执行速度。现有的策略加速方法通过数据预处理或启发式规则确定执行节奏，而不是学习针对任务优化的执行速度。在本文中，我们提出了SpeedAug，一个策略加速框架，使策略能够通过强化学习（RL）学习任务最优的执行节奏。SpeedAug首先从速度增强的演示中学习一个节奏增强的先验策略，该策略捕捉了多样的执行节奏。在此基础上，通过强化学习微调指导探索，以优化动作轨迹并高效优化执行节奏。在机器人操作基准上的实验表明，SpeedAug在保持高成功率的同时，显著提高了策略加速的样本效率，实现了快速且稳定的任务执行。应用于真实世界的操作任务时，SpeedAug仅用16分钟的在线交互就将任务吞吐量提高了1.8倍，且未降低成功率。

英文摘要

Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration while maintaining high success rates, achieving fast and stable task execution. Applied to a real-world manipulation task, SpeedAug improves task throughput by 1.8x using only 16 minutes of online interactions without compromising the success rate.

URL PDF HTML ☆

赞 0 踩 0

2510.01800 2026-06-02 cs.AI 版本更新

REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing

REBot: 从RAG到CatRAG——语义增强与图路由

Thanh Ma, Tri-Tam La, Lam-Thu Le Huu, Minh-Nghi Nguyen, Khanh-Van Pham Luu

发表机构 * CTU（越南科技大学）

AI总结提出REBot，一种基于CatRAG混合检索推理框架的LLM增强咨询聊天机器人，通过语义增强的分层类别知识图谱和图路由实现学术法规建议，在分类和问答任务上达到98.89%的F1分数。

Comments Published in Communications in Computer and Information Science (CCIS), Springer, 2025. DOI: 10.1007/978-981-95-4960-3_35

详情

DOI: 10.1007/978-981-95-4960-3_35
Journal ref: Communications in Computer and Information Science (CCIS), Springer, 2025, pp. 435-447

AI中文摘要

学术法规建议对于帮助学生理解和遵守机构政策至关重要，但构建有效系统需要特定领域的法规资源。为应对这一挑战，我们提出REBot，一种由CatRAG增强的LLM咨询聊天机器人，CatRAG是一种混合检索推理框架，将检索增强生成与基于图的推理相结合。CatRAG统一了密集检索和图推理，由分层、类别标记的知识图谱支持，并丰富了语义特征以实现领域对齐。轻量级意图分类器将查询路由到适当的检索模块，确保事实准确性和上下文深度。我们构建了一个法规特定数据集，并在分类和问答任务上评估REBot，取得了98.89%的F1分数，达到最先进水平。最后，我们实现了一个Web应用程序，展示了REBot在真实学术建议场景中的实用价值。

英文摘要

Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective systems requires domain specific regulatory resources. To address this challenge, we propose REBot, an LLM enhanced advisory chatbot powered by CatRAG, a hybrid retrieval reasoning framework that integrates retrieval augmented generation with graph based reasoning. CatRAG unifies dense retrieval and graph reasoning, supported by a hierarchical, category labeled knowledge graph enriched with semantic features for domain alignment. A lightweight intent classifier routes queries to the appropriate retrieval modules, ensuring both factual accuracy and contextual depth. We construct a regulation specific dataset and evaluate REBot on classification and question answering tasks, achieving state of the art performance with an F1 score of 98.89%. Finally, we implement a web application that demonstrates the practical value of REBot in real world academic advising scenarios.

URL PDF HTML ☆

赞 0 踩 0

2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Understanding the Effects of Distractors on Reasoning Vision-Language Models

理解干扰项对推理视觉语言模型的影响

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)（坡山科学技术大学（POSTECH））

AI总结本文通过构建包含语义和数值维度干扰项的视觉问答数据集Idis，研究视觉干扰项如何影响视觉语言模型的测试时缩放行为，发现视觉干扰项以与文本干扰项根本不同的方式降低准确率而不增加推理长度，并提出简单提示策略缓解干扰项驱动的预测。

Comments preprint

详情

AI中文摘要

无关信息（即干扰项）如何影响视觉语言模型（VLM）的测试时缩放？先前关于纯文本语言模型的研究表明，文本干扰项可以加剧逆缩放，导致模型推理更长但推理轨迹效率更低。在这项工作中，我们研究了类似现象是否在多模态设置中出现。我们引入了Idis（带干扰项的图像），这是一个视觉问答数据集，系统性地沿着语义和数值维度变化干扰项。我们的分析揭示，视觉干扰项以与文本干扰项根本不同的方式影响推理VLM：尽管逆缩放仍然出现，但视觉干扰项降低了准确率而不增加推理长度。我们进一步展示了从推理轨迹中提取的属性计数为干扰项如何与推理长度和准确率交互提供了关键见解。作为合理性检查，我们提出了一种简单的提示策略，以减轻推理视觉语言模型中干扰项驱动的预测。

英文摘要

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to reason longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.

URL PDF HTML ☆

赞 0 踩 0

2511.20615 2026-06-02 cs.CV cs.AI 版本更新

Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

评估深度学习模型在负重活动期间全身动态3D姿态预测中的性能

Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri

发表机构 * Department of Mechanical Engineering, Sharif University of Technology（谢赫·巴赫什大学机械工程系）

AI总结本研究利用双向长短期记忆和Transformer架构的时间序列模型，通过优化身体段长度约束的代价函数，实现了对动态负重活动中全身3D姿态的高精度预测。

Comments 11 pages, 6 figures, 7 tables, This work has been submitted to the IEEE for possible publication

详情

AI中文摘要

本研究旨在探索深度神经网络在动态负重活动中全身人体姿态预测的应用。使用双向长短期记忆（BLSTM）和Transformer架构训练了两个时间序列模型。数据集包含20名正常体重健康男性个体的3D全身插件步态动态坐标，每人从不同负载位置执行204次负重任务，并采用不同的举升和处理技术。模型输入包括手-负载位置的3D坐标、举升（弯腰、全蹲和半蹲）和处理（单手和双手）技术、体重和身高，以及任务前25%时间的身体姿态3D坐标数据。模型利用这些输入预测任务剩余75%时间内的身体坐标。此外，提出了一种新方法，通过优化新的代价函数强制身体段长度恒定，以提高先前和当前姿态预测网络的准确性。结果表明，新代价函数使手臂和腿部模型的预测误差分别降低了约8%和21%。我们发现，使用Transformer架构（均方根误差为41.4 mm）的长期性能比基于BLSTM的模型准确约58%。本研究证明了利用捕捉时间序列依赖性的神经网络在3D运动帧中的价值，为理解和预测人工物料搬运活动中的运动动力学提供了独特方法。

英文摘要

This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 41.4 mm, exhibited approximately 58% more accurate long-term performance than the BLSTM-based model. This study merits the use of neural networks that capture time series dependencies in 3D motion frames, providing a unique approach for understanding and predict motion dynamics during manual material handling activities.

URL PDF HTML ☆

赞 0 踩 0

2511.20333 2026-06-02 cs.AI cs.LG cs.NE 版本更新

NNGPT: Rethinking AutoML with Large Language Models

NNGPT: 用大型语言模型重新思考AutoML

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室，CAIDAS与IFI，乌尔姆大学，德国）

AI总结提出NNGPT开源框架，利用大型语言模型实现自我改进的AutoML引擎，通过生成-评估-自我改进闭环自动设计神经网络架构和超参数。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 5664-5674, 2026

AI中文摘要

构建自我改进的人工智能系统仍然是AI领域的一个基本挑战。我们提出了NNGPT，一个开源框架，它将大型语言模型（LLM）转变为用于神经网络开发的自我改进AutoML引擎，主要针对计算机视觉。与之前的框架不同，NNGPT通过生成新模型扩展神经网络数据集，基于生成、评估和自我改进的闭环系统实现LLM的持续微调。它在一个统一的工作流中集成了五个协同的基于LLM的流水线：零样本架构合成、超参数优化（HPO）、代码感知的准确率/早停预测、检索增强的闭域PyTorch块合成（NN-RAG）以及强化学习。基于LEMUR数据集作为具有可复现指标的可审计语料库，NNGPT从单个提示出发，验证网络架构、预处理代码和超参数，端到端执行，并从结果中学习。PyTorch适配器使NNGPT框架无关，实现了强大性能：NN-RAG在1289个目标上达到73%的可执行性，3-shot提示在常见数据集上提高了准确率，基于哈希的去重节省了数百次运行。一次性预测匹配基于搜索的AutoML，减少了大量试验的需要。在LEMUR上的HPO实现了RMSE 0.60，优于Optuna（0.64），而代码感知预测器达到RMSE 0.14，Pearson r=0.78。该系统已生成超过5000个经过验证的模型，证明了NNGPT作为自主AutoML引擎的能力。接受后，代码、提示和检查点将公开发布，以实现可复现性并促进社区使用。

英文摘要

Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.

URL PDF HTML ☆

赞 0 踩 0

2505.17648 2026-06-02 econ.GN cs.AI q-fin.EC 版本更新

Simulating Macroeconomic Expectations in Survey Experiments with LLM-based Economic Agents

基于LLM的经济主体在调查实验中模拟宏观经济预期

Jianhao Lin, Lexuan Sun, Yixin Yan

发表机构 * Lingnan College, Sun Yat-sen University（中山大学岭南学院）

AI总结提出一个利用基于大语言模型的经济主体（LLM Agents）模拟调查实验中宏观经济预期的框架，通过复现三种代表性调查设计验证其有效性，发现LLM Agents能生成与人类高度相似的预期分布并捕捉定性模式，其中先验信息对匹配分布至关重要。

2506.16114 2026-06-02 cs.IR cs.AI 版本更新

GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks

GFlowGR：使用生成流网络微调生成式推荐框架

Yejing Wang, Shengyu Zhou, Jinyu Lu, Qidong Liu, Xinhang Li, Wenlin Zhang, Feng Li, Pengjie Wang, Chuan Yu, Jian Xu, Bo Zheng, Xiangyu Zhao

发表机构 * City University of Hong Kong（城市大学）； Alibaba Group（阿里巴巴集团）

AI总结针对生成式推荐中微调步骤忽略未观测正样本导致的曝光偏差问题，提出基于GFlowNets的微调框架GFlowGR，通过自适应轨迹采样器和综合奖励模型整合协同知识，利用GFlowNets的多样生成特性缓解偏差。

详情

AI中文摘要

生成式推荐（GR）通常包括项目分词器和生成式大语言模型（LLM），已在广泛场景中取得显著成功。现有研究主要集中于开发强大的项目分词器或改进LLM解码策略以获得更优性能。然而，GR框架中关键的微调步骤（对于使LLM适应推荐数据至关重要）仍基本未被探索。当前方法主要依赖监督微调（SFT）的下一词预测损失或推荐特定的直接偏好优化（DPO）策略。这两种方法都忽略了对可能存在的正未观测样本的探索，这通常被称为曝光偏差问题。为缓解此问题，本文将GR视为多步生成任务，并构建了基于GFlowNets的微调框架（GFlowGR）。所提框架整合了传统推荐系统中的协同知识，以创建自适应轨迹采样器和综合奖励模型。利用GFlowNets的多样生成特性以及采样和启发式加权技术，GFlowGR成为缓解曝光偏差问题的一种有前景的方法。在两个真实世界数据集和两种不同GR骨干上的大量实证结果突显了GFlowGR的有效性和鲁棒性。

英文摘要

Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remarkable success across a wide range of scenarios. The majority of existing research efforts primarily concentrate on developing powerful item tokenizers or advancing LLM decoding strategies to attain superior performance. However, the critical fine-tuning step in GR frameworks, which is essential for adapting LLMs to recommendation data, remains largely unexplored. Current approaches predominantly rely on either the next-token prediction loss of supervised fine-tuning (SFT) or recommendationspecific direct preference optimization (DPO) strategies. Both methods ignore the exploration of possible positive unobserved samples, which is commonly referred to as the exposure bias problem. To mitigate this problem, this paper treats the GR as a multi-step generation task and constructs a GFlowNets-based fine-tuning framework (GFlowGR). The proposed framework integrates collaborative knowledge from traditional recommender systems to create an adaptive trajectory sampler and a comprehensive reward model. Leveraging the diverse generation property of GFlowNets, along with sampling and heuristic weighting techniques, GFlowGR emerges as a promising approach to mitigate the exposure bias problem. Extensive empirical results on two real-world datasets and with two different GR backbones highlight the effectiveness and robustness of GFlowGR.

URL PDF HTML ☆

赞 0 踩 0

2511.10367 2026-06-02 cs.CV cs.AI 版本更新

DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

DermAI：通过质量驱动的图像采集实现移动端AI分类的临床皮肤病学

Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fábio Papais, Francisco Mauro, Natália Lopes, Érico Medeiros, Jéssica Guido, Shirley Cruz, Paulo Borba, Tsang Ing Ren

发表机构 * Centro de Informática, Universidade Federal de Pernambuco, Brazil（巴西佩纳布卢克联邦大学计算机中心）； Hospital das Clínicas, Universidade Federal de Pernambuco, Brazil（巴西佩纳布卢克联邦大学临床医院）

AI总结提出DermAI智能手机应用，通过实时质量检查、本地模型适应和多样化数据集收集，解决AI皮肤病学中数据集偏差、图像质量差异和验证不足的问题。

Comments 4 pages, 2 figures, 1 table, submitted on ISBI

2511.10276 2026-06-02 cs.RO cs.AI 版本更新

RoboBenchMart: Benchmarking Robots in Retail Environment

RoboBenchMart：零售环境中的机器人基准测试

Konstantin Soshin, Alexander Krapukhin, Andrei Spiridonov, Gregorii Bukhtuev, Andrey Kuznetsov, Vlad Shakhuro, Denis Shepelev

发表机构 * FusionBrain Lab, Robotics Group（融合大脑实验室，机器人组）； NUST MISIS ； Lomonosov Moscow State University（罗蒙诺索夫莫斯科国立大学）

AI总结针对零售环境中的移动操作任务，提出RoboBenchMart开源模拟基准，通过密集杂乱物品和复杂空间配置评估通用视觉-语言-动作模型（VLA），发现现有模型在常见零售任务中仍表现不佳。

详情

AI中文摘要

大多数现有的机器人操作基准专注于桌面或家庭场景。虽然这些设置推动了令人印象深刻的进展，但目前尚不清楚在这些场景中表现出色的通用VLA是否能够真正泛化到具有不同几何、语义和工作流程的领域。我们引入了RoboBenchMart，一个针对零售暗店环境的开源模拟基准，其中移动操作器必须对多样化的杂货物品执行复杂的操作任务。该设置提出了重大挑战，包括密集的物品杂乱和多样的空间配置，物品位于不同的高度、深度且紧密相邻。通过针对零售领域，我们的基准解决了一个具有近期自动化影响潜力的场景。利用生成的轨迹，我们为当前的通用VLA建模了一个标准、现实的微调设置，并评估了几种最先进的模型。我们发现，即使在常见的零售任务上，它们仍然表现挣扎，这表明这些模型尚未真正跨领域泛化。为了支持进一步研究，我们发布了RoboBenchMart套件，其中包括程序化商店布局生成器、轨迹生成管道、评估工具和微调基线模型。

英文摘要

Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress, it remains unclear whether generalist VLAs that excel there can truly generalize to domains with different geometry, semantics, and workflows. We introduce RoboBenchMart, an open-source simulated benchmark targeting retail dark-store environments, where a mobile manipulator must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations, with items positioned at different heights, depths, and in close proximity. By targeting on the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. Using generated trajectories, we model a standard, realistic fine-tuning setup for current generalist VLAs and evaluate several state-of-the-art models. We find that they still struggle even on common retail tasks, indicating that these models are not yet truly general across domains. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools, and fine-tuned baseline models.

URL PDF HTML ☆

赞 0 踩 0

2501.02409 2026-06-02 cs.LG cs.AI cs.CE q-bio.MN stat.ME 版本更新

Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations

可解释神经ODE用于扰动下基因调控网络发现

Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Fabian J. Theis, Elham Azizi, David A. Knowles

发表机构 * Department of Computer Science, Columbia University, New York, U.S.（哥伦比亚大学计算机科学系）； Department of Industrial Engineering and Operations Research, Columbia University, New York, U.S.（哥伦比亚大学工业工程与运筹学系）； Department of Applied Mathematics and Applied Physics, Columbia University, New York, U.S.（哥伦比亚大学应用数学与应用物理系）； New York Genome Center, New York, U.S.（纽约基因组中心）； Irving Institute of Cancer Dynamics, New York, U.S.（伊万·罗伯特癌症动力学研究所）； Institute of Computational Biology, Helmholtz Munich, Munich, Germany（海德堡医学院计算生物学研究所）； Department of Mathematics, Technische Universität München, Munich, Germany（慕尼黑技术大学数学系）

AI总结提出PerturbODE框架，利用可解释神经常微分方程建模扰动下的细胞状态轨迹，从ODE参数中推导因果基因调控网络，实现未见遗传干预的模拟。

详情

AI中文摘要

现代高通量生物数据集包含数千种扰动，使得能够大规模发现代表基因间调控相互作用的因果图。可微分因果图模型和基于回归的方法已被开发用于从干预数据集推断基因调控网络（GRN）。然而，现有方法未能捕捉生物过程（如细胞分化）的非线性动力学。为解决这一局限性，我们提出PerturbODE，一种新颖框架，采用可解释神经常微分方程（神经ODE）对扰动下的细胞状态轨迹进行建模，并从神经ODE参数中推导出潜在的因果GRN，从而实现对未见遗传干预的下游模拟。GRN通过单隐藏层前馈网络编码，隐含地将基因分组为可解释的共调控模块。我们展示了PerturbODE在GRN推断和扩展到扰动响应预测方面的有效性，包括模拟和真实过表达数据集。

英文摘要

Modern high-throughput biological datasets containing thousands of perturbations enable large-scale discovery of causal graphs that represent regulatory interactions between genes. Differentiable causal graphical models and regression-based methods have been developed to infer gene regulatory networks (GRNs) from interventional datasets. However, existing approaches fail to capture the non-linear dynamics of biological processes such as cellular differentiation. To address this limitation, we propose PerturbODE, a novel framework that employs interpretable neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the underlying causal GRN from the neural ODE parameters, enabling downstream simulation of unseen genetic interventions. The GRN is encoded via a single-hidden-layer feedforward network, implicitly grouping genes into interpretable co-regulated modules. We demonstrate PerturbODE's efficacy in GRN inference and extension to perturbation response prediction across both simulated and real overexpression datasets.

URL PDF HTML ☆

赞 0 踩 0

2511.05913 2026-06-02 cs.CL cs.AI 版本更新

基于总运营成本奖励的深度强化学习自动驾驶卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and and University of Gothenburg（计算机科学与工程系，查尔姆斯理工大学和哥德堡大学）； Department of Mechanics and Maritime Sciences, Chalmers University of Technology（机械与海洋科学系，查尔姆斯理工大学）； Safe and Efficient Driving, Volvo Group of Trucks Technology（安全高效驾驶，沃尔沃卡车技术集团）

AI总结提出一种深度强化学习框架，用于自动驾驶卡车在高速公路场景下的自适应巡航控制和变道战术决策，通过基于总运营成本的多目标奖励函数优化性能。

Comments Paper is accepted for publication in Artificial Intelligence Review

2504.16129 2026-06-02 cs.MA cs.AI cs.LG cs.RO 版本更新

MARFT: Multi-Agent Reinforcement Fine-Tuning

MARFT: 多智能体强化微调

Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Innovation Institute（上海创新研究院）； OPPO Research Institute（OPPO研究院）

AI总结针对基于大语言模型的多智能体系统，提出多智能体强化微调（MARFT）框架，通过引入Flex-MG马尔可夫博弈公式和通用算法，解决异步交互、异构架构等挑战，提升系统鲁棒性和适应性。

Comments 37 pages

详情

AI中文摘要

基于大语言模型的多智能体系统（LaMAS）在需要多方面推理和协作的复杂智能体任务中展现出强大能力，从高质量演示生成到科学研究。同时，强化学习（RL）被广泛认可用于增强智能体智能，但用基础RL技术微调LaMAS的研究有限。由于LaMAS的独特机制，直接将传统多智能体强化学习（MARL）应用于LaMAS也带来了重大挑战。为解决这些挑战，本文对基于LLM的MARL进行了全面研究，并提出了多智能体强化微调（MARFT）。我们引入了Flex-MG，一种与真实世界LaMAS优化一致的新马尔可夫博弈公式，以及一个针对LaMAS定制的通用算法框架。我们回顾了从传统RL到强化微调（RFT）的演变，然后分析了多智能体对应部分。对于LaMAS，我们识别了经典MARL与MARFT之间的关键差异，包括异步智能体交互、轮廓感知智能体设计和异构架构。这些差异促使了面向LaMAS的RFT公式。我们提出了一个稳健且可扩展的MARFT框架，详细介绍了其模块化算法，并提供了开源实现以支持采用和进一步研究。本文进一步讨论了应用前景和开放挑战，包括动态环境建模、样本效率低下以及缺乏连贯框架。通过将理论基础与实践方法相结合，本文旨在作为推进MARFT向弹性、自适应和与人类一致的智能体系统发展的路线图。实现：https://github.com/jwliao-ai/MARFT。

英文摘要

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.

URL PDF HTML ☆

赞 0 踩 0

2510.14904 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

CaptionFormer：时空对象的统一分割、跟踪与描述

Gabriel Fiastre, Antoine Yang, Cordelia Schmid

发表机构 * Inria, École Normale Supérieure, CNRS, PSL Research University（法国国家科学研究中心、巴黎高等师范学院、国家科学研究中心、巴黎综合理工研究所）； Google DeepMind（谷歌DeepMind）

AI总结提出 CaptionFormer 模型，通过利用 VLM 生成合成描述并扩展数据集，实现视频中对象轨迹的联合检测、分割、跟踪与描述，在三个基准上达到最优。

Comments 17 pages, 10 figures

详情

AI中文摘要

密集视频对象描述（DVOC）是联合检测、跟踪和描述视频中对象轨迹的任务，需要理解时空细节并用自然语言描述。由于任务复杂性和手动标注的高成本，先前方法采用有限数据的训练策略，可能导致次优性能。为解决此问题，我们提出利用最先进的 VLM 生成关于时空定位实体的描述，并用我们的合成描述（LVISCap 和 LV-VISCap）扩展 LVIS 和 LV-VIS 数据集。此外，我们引入端到端模型 CaptionFormer，能够联合检测、分割、跟踪和描述对象轨迹。CaptionFormer 在三个现有基准（VidSTG、VLN 和 BenSMOT）上取得了最先进的 DVOC 结果。数据集和代码可在 https://www.gabriel.fiastre.fr/captionformer/ 获取。

英文摘要

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories. CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/captionformer/.

URL PDF HTML ☆

赞 0 踩 0

2510.23379 2026-06-02 cs.LG cs.AI cs.NE q-bio.BM 版本更新

Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

符号神经生成及其在药物设计先导发现中的应用

Ashwin Srinivasan, Tirtharaj Dash, A Baskar, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee

发表机构 * Dept. of Computer Science & Information Systems and APPCAIR BITS Pilani, K K Birla Goa Campus, India（计算机科学与信息系统系及APPCAIR比特纳学院，K K Birl拉果阿校区，印度）； Dept. of Computer Science & Information Systems BITS Pilani, K K Birla Goa Campus, India（计算机科学与信息系统系比特纳学院，K K Birl拉果阿校区，印度）； Department of Biochemistry, University of Cambridge, Cambridge, UK（生物化学系，剑桥大学，剑桥，英国）； School of Computer Science and Engineering University of New South Wales, Sydney（计算机科学与工程学院新南威尔士大学，悉尼）； Dr. B.R. Ambedkar Center for Biomedical Research University of Delhi, New Delhi, India（B.R.阿姆贝卡尔生物医学研究中心，德里大学，新德里，印度）； Department of Chemistry BITS Pilani, K.K. Birla Goa Campus, India（化学系比特纳学院，K.K. Birl拉果阿校区，印度）

AI总结提出符号神经生成器（SNG）框架，结合归纳逻辑编程与大语言模型，通过符号约束指导神经生成，在药物设计中生成满足形式规范的候选分子，性能与现有方法相当，并在探索性问题上产生与临床候选分子相当的结合亲和力。

Comments 37 pages, submitted to the Machine Learning journal; partial overlap of experimental results with https://doi.org/10.1101/2025.02.14.634875

详情

AI中文摘要

我们研究了一类相对未被充分探索的混合神经符号模型，该模型将符号学习与神经推理相结合，以构建满足形式正确性标准的数据生成器。在符号神经生成器（SNG）中，符号学习器从少量实例（有时仅一个）中检查可行数据的逻辑规范。每个规范反过来约束提供给基于神经的生成器的条件信息，该生成器拒绝任何违反符号规范的实例。与其他神经符号方法一样，SNG利用了符号和神经方法的互补优势。SNG的输出是一个对$(H, X)$，其中$H$是从数据构建的可行实例的符号描述，$X$是满足该描述的一组生成的新实例。我们基于构建适当的基集和纤维偏序集并将其组合成整体偏序，为这类系统引入语义。我们实现了一个SNG，将受限形式的归纳逻辑编程（ILP）与大语言模型（LLM）相结合，并在早期药物设计上进行了评估。我们的主要兴趣在于SNG生成的描述和一组潜在的抑制剂分子。在基准问题（药物靶点已被充分理解）上，SNG的性能在统计上与最先进方法相当。在探索性问题（靶点理解不足）上，生成的分子表现出与领先临床候选分子相当的结合亲和力。专家进一步发现符号规范作为初步过滤器很有用，多个生成的分子被确定为可用于合成和湿实验测试。

英文摘要

We investigate a relatively under-explored class of hybrid neurosymbolic models that integrate symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In Symbolic Neural Generators (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a pair $(H, X)$, where $H$ is a symbolic description of feasible instances constructed from data, and $X$ a set of generated new instances that satisfy the description. We introduce a semantics for such systems, based on the construction of appropriate base and fibre partially-ordered sets combined into an overall partial order. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.

URL PDF HTML ☆

赞 0 踩 0

2510.17045 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Video Reasoning without Training

无需训练的视频推理

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

发表机构 * Qualcomm AI Research（高通AI研究）； University of California, San Diego（加州大学圣地亚哥分校）

AI总结提出V-Reason方法，利用输出分布熵作为信号，通过轻量级控制器在推理时自适应调整值缓存，无需强化学习或微调即可提升视频推理性能。

Comments CVPR Findings 2026. Project Page https://deepaksridhar.github.io/vreason.github.io/

详情

AI中文摘要

使用大型多模态模型（LMM）进行视频推理依赖于昂贵的强化学习（RL）和冗长的思维链，导致训练和推理过程中产生大量计算开销。此外，这些推理模型中控制思维过程的机制非常有限。在本文中，我们利用模型输出分布的熵作为信号来研究和指导推理行为。我们发现高质量模型表现出微探索和微利用循环的特征模式，随后出现后期熵峰值（即更长的思考）和较低的最终熵，表明更谨慎的探索和自信的收敛（即当模型探索或思考答案时避免过度随机性）。然后，我们利用这些新颖的、有理论基础的见解，引入了V-Reason（Video-Reason），一种推理时优化方法，通过轻量级、可训练的控制器自适应调整LMM的值缓存。我们提出的控制器由基于熵的目标引导，直接在推理时调整模型行为，无需使用任何RL或监督微调。我们的实验表明，V-Reason在许多视频推理数据集上显著优于基础指令调优模型，将与RL模型的差距平均缩小到0.6%的准确率以内。我们在无需任何训练的情况下实现了这一点，同时提供了效率优势：V-Reason使用的token比RL模型少58.6%。项目页面：https://deepaksridhar.github.io/vreason.github.io/

英文摘要

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, we use the entropy of the model's output distribution as a signal to study and guide reasoning behavior. We discover that high-quality models exhibit a characteristic pattern of micro-exploration and micro-exploitation cycles, followed by a later entropy peak (i.e., longer thinking) and a lower final entropy, indicating more deliberate exploration and confident convergence (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We then use these novel, theoretically-grounded insights to introduce V-Reason (Video-Reason), an inference-time optimization method that adapts the value cache of the LMM through a lightweight, trainable controller. Our proposed controller is guided by an entropy-based objective, to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Our experiments show that V-Reason significantly outperforms the base instruction-tuned models on many video reasoning datasets, narrowing the gap with RL models to within 0.6% accuracy on average. We achieve this without any training, while offering efficiency benefits: V-Reason uses 58.6% fewer tokens than the RL model. Project Page https://deepaksridhar.github.io/vreason.github.io/

URL PDF HTML ☆

赞 0 踩 0

2505.18492 2026-06-02 cs.AI 版本更新

Formally Solving Answer-Construction Problems in Lean

在 Lean 中形式化求解答案构造问题

Jialiang Sun, Yuzhi Tang, Ao Li, Chris J. Maddison, Kuldeep S. Meel

发表机构 * University of Toronto（多伦多大学）； Georgia Institute of Technology（佐治亚理工学院）

AI总结提出 Enumerate-Conjecture-Prove (ECP) 框架，结合通用大语言模型和证明器大语言模型，在 Lean 中端到端地构造答案并生成形式化证明，解决数学竞赛中的答案构造问题。

详情

AI中文摘要

数学竞赛问题分为两大类：定理证明（要求证明给定陈述）和答案构造（要求构造一个满足性质的带证明的对象）。随着大语言模型（LLMs）的最新进展，形式定理证明技术在定理证明问题上取得了显著进展，但形式答案构造仍较少被研究。这暴露了当前 LLM 模型系列之间的不匹配：通用 LLM 擅长非形式化猜想，但在形式化证明生成上昂贵且不可靠；而证明器 LLM 成本低且针对形式化证明优化，但在提出候选答案的数学推理方面较弱。此外，仅靠 Lean 证明检查并不能确保构造的见证是规范答案：循环或非封闭形式的见证可以消去存在量词，但无法构成可接受的竞赛答案。为弥补这一差距，我们引入了 extit{Enumerate-Conjecture-Prove} (ECP)，一个在 Lean 中用于端到端答案构造和形式化证明的神经符号框架。ECP 利用工具辅助的通用 LLM 枚举证据并构造候选答案，并调用证明器 LLM 生成机器可检查的证明。在 PutnamBench 和自动形式化的 MathArena 的答案构造问题上，ECP 分别以可接受的答案和证明形式化解决了 17/346 和 18/75 个实例，在同等推理预算下优于 LLM 基线。我们的代码可在 https://github.com/sunjia72/ecp-lpar 获取。

英文摘要

Mathematical competition problems fall into two broad types: theorem proving, which asks for a proof of a given statement, and answer construction, which requires constructing a property-satifying object with proofs. With recent advances in large language models (LLMs), formal theorem-proving techniques have made substantial progress on theorem-proving problems, yet formal answer construction remains less studied. This exposes a mismatch between current LLM model families: general LLMs are strong at informal conjecturing but are expensive and unreliable at formal proof generation, whereas prover LLMs are cheap and optimized for formal proofs but weak at mathematical reasoning for proposing candidate answers. Moreover, Lean proof checking alone does not enforce that a constructed witness is a canonical answer: circular or non-closed-form witnesses can eliminate the existential quantifier while failing to constitute an admissible contest answer. To close this gap, we introduce \textit{Enumerate-Conjecture-Prove} (ECP), a neuro-symbolic framework in Lean for end-to-end answer construction with formal proofs. ECP leverages tool-assisted general LLMs to enumerate evidence and construct candidate answers, and invokes prover LLMs to produce machine-checked proofs. On PutnamBench's and autoformalized MathArena's answer-construction problems, ECP formally solves 17/346 and 18/75 instances with admissible answers and proofs, respectively, which outperform LLM baselines at aligned inference budgets. Our code is available at https://github.com/sunjia72/ecp-lpar.

URL PDF HTML ☆

赞 0 踩 0

2510.00615 2026-06-02 cs.AI cs.CL 版本更新

ACON: Optimizing Context Compression for Long-horizon LLM Agents

ACON：面向长周期LLM智能体的上下文压缩优化

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan

发表机构 * University of Washington（华盛顿大学）

AI总结提出ACON框架，通过自然语言空间优化压缩策略，在不微调模型的情况下减少峰值token使用量26-54%并提升任务成功率，同时可蒸馏至小模型。

Comments ICML 2026

详情

AI中文摘要

大型语言模型（LLM）越来越多地被部署为动态真实环境中的智能体，其成功依赖于对动作和观察的精确记录。然而，长周期智能体任务中无限制的上下文增长导致两个关键瓶颈：高昂的推理内存成本以及因无关信息导致的推理退化。现有压缩方法未能完全解决这一问题，通常依赖脆弱的启发式规则或需要对专有或大规模LLM进行不切实际的参数更新。我们引入了智能体上下文优化（ACON），这是一个统一框架，可将观察和历史记录最优地压缩为简洁、信息丰富的表示。与先前工作不同，ACON采用自然语言空间中的优化：它基于智能体的失败分析迭代地细化压缩指南，在无需模型微调的情况下保留关键状态信息。为了进一步最小化计算开销，我们将优化后的压缩器蒸馏到更小的模型中。在AppWorld、OfficeBench和Multi-objective QA上的实验表明，与现有压缩基线相比，ACON将峰值token使用量减少了26-54%，同时提高了任务成功率。值得注意的是，它使较小的LM能够有效地作为长周期智能体运行，通过减轻上下文干扰实现了高达46%的性能提升。我们的代码可在https://github.com/microsoft/acon获取。

英文摘要

Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining precise records of actions and observations. However, the resulting unbounded context growth in long-horizon agentic tasks makes two critical bottlenecks: prohibitive inference memory costs and reasoning degradation due to irrelevant information. Existing compression methods fail to fully address this, often relying on brittle heuristics or requiring parameter updates impractical for proprietary or large-scale LLMs. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both observations and history into concise, informative representations. Distinct from prior works, ACON employs an optimization in natural language space: it iteratively refines compression guidelines based on failure analysis of the agent, ensuring critical state information is preserved without model fine-tuning. To further minimize computational overhead, we distill the optimized compressor into smaller models. Experiments on AppWorld, OfficeBench, and Multi-objective QA demonstrate that ACON reduces peak token usage by 26-54% while improving task success over existing compression baselines. Notably, it enables smaller LMs to function effectively as long-horizon agents, achieving up to 46% performance improvement by mitigating context distraction. Our code is available at https://github.com/microsoft/acon.

URL PDF HTML ☆

赞 0 踩 0

2510.12624 2026-06-02 cs.LG cs.AI 版本更新

Learning-To-Measure: In-Context Active Feature Acquisition

学习测量：上下文主动特征获取

Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi

发表机构 * University of Tokyo（东京大学）

AI总结提出 Learning-to-Measure (L2M) 方法，通过不确定性量化与条件互信息引导的贪婪特征获取，在上下文学习中解决元主动特征获取问题，无需针对每个任务重新训练。

详情

AI中文摘要

主动特征获取 (AFA) 是一个序列决策问题，目标是通过自适应选择要获取的特征来改进测试实例的模型性能。在实践中，AFA 方法通常从具有系统性特征缺失和有限任务特定标签的回顾性数据中学习。大多数先前的工作针对单个预定任务进行获取，限制了可扩展性。为解决这一限制，我们形式化了元 AFA 问题，其目标是学习跨各种任务的获取策略。我们引入了学习测量 (L2M)，它包括 i) 对未见任务的可靠不确定性量化，以及 ii) 一个最大化条件互信息的不确定性引导的贪婪特征获取代理。我们展示了一种序列建模或自回归预训练方法，该方法为具有任意缺失模式的任务提供了可靠的不确定性量化基础。L2M 直接对具有回顾性缺失的数据集进行操作，并在上下文中执行元 AFA 任务，消除了每个任务的重新训练。在合成和真实世界的表格基准测试中，L2M 匹配或超越了特定任务的基线，特别是在标签稀缺和高缺失率的情况下。

英文摘要

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.

URL PDF HTML ☆

赞 0 踩 0

2510.11560 2026-06-02 cs.IR cs.AI 版本更新

Characterizing Web Search in The Age of Generative AI

生成式AI时代下网络搜索的特征刻画

Elisabeth Kirsten, Jost Grosse Perdekamp, Qinyuan Wu, Mihir Upadhyay, Krishna P. Gummadi, Muhammad Bilal Zafar

发表机构 * UA Ruhr Research Center for Trustworthy Data Science and Security（乌尔姆-鲁尔可信数据科学与安全研究中心）； Max Planck Institute for Software Systems（马克斯·普朗克软件系统研究所）； Ruhr University Bochum（波鸿鲁尔大学）

AI总结通过系统比较传统搜索与多个生成式搜索系统，揭示了它们在知识来源、多样性、稳定性上的差异，并指出生成式搜索引入了现有评估范式未覆盖的新维度。

详情

AI中文摘要

LLM的出现催生了生成式搜索，这是一种新的搜索范式，其中LLM从网络中检索与查询相关的信息，并将其综合成一个连贯的响应。这种范式与传统的网络搜索有根本不同，传统搜索的结果以独立网页的排名列表形式返回。在本文中，我们提出：生成式搜索与传统搜索在哪些维度上存在差异？我们对Google有机搜索和来自三个提供商（Google、OpenAI和Perplexity）的五个生成式搜索系统进行了系统比较。我们的分析揭示了引擎在依赖内部与外部知识、来源多样性和稳定性方面的显著差异。虽然生成式系统通常能达到与传统搜索相当的主题覆盖，但它们使用的是明显不同的检索足迹和综合策略。我们进一步表明，生成式搜索的输出可能随时间及执行而变化，这给鲁棒性带来了新的挑战。我们的发现表明，生成式搜索引入了现有评估范式未捕捉到的新维度，从而促使开发明确考虑生成式搜索系统中检索行为、综合和稳定性的评估方法。

英文摘要

The advent of LLMs has given rise to generative search, a new search paradigm in which LLMs retrieve information from the web related to a query and synthesize it into a single, coherent response. This paradigm differs fundamentally from traditional web search, where results are returned as a ranked list of independent web pages. In this paper, we ask: Along what dimensions does generative search differ from traditional search? We conduct a systematic comparison between Google organic search and five generative search systems from three providers: Google, OpenAI, and Perplexity. Our analysis reveals substantial variation among engines in their reliance on internal v.s. external knowledge, source diversity, and stability. While generative systems often achieve topical coverage comparable to traditional search, they do so using markedly different retrieval footprints and synthesis strategies. We further show that the outputs of generative search can vary across time and executions, raising new challenges for robustness. Our findings demonstrate that generative search introduces new dimensions that are not captured by existing evaluation paradigms, motivating the development of evaluations that explicitly account for retrieval behavior, synthesis, and stability in generative search systems.

URL PDF HTML ☆

赞 0 踩 0

2510.10982 2026-06-02 cs.LG cs.AI 版本更新

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

仅捕获一个：用于模型特定授权的不可迁移样本

Zihan Wang, Zhiyong Ma, Zhongkui Ma, Shuofeng Liu, Akide Liu, Derui Wang, Minhui Xue, Guangdong Bai

AI总结提出不可迁移样本（NTEs），通过将数据编码为仅能被指定模型解码的“密文”，在无需训练的情况下利用模型特定低敏感子空间实现授权模型保真度与未授权模型性能退化。

详情

AI中文摘要

最近的AI法规越来越强调需要保护数据在AI创新中的效用，同时防止滥用，特别是在下游AI应用中强制执行目的限制。在实践中，执行这一原则仍然具有挑战性，因为发布的数据可以轻易地输入到超出其声明意图的任意模型中。现有方法试图通过扰动数据或重新训练模型来限制意外使用来减轻这种风险。然而，这些策略无法防止未知或外部训练模型的推理，或者从根本上依赖于对训练或部署的控制。在这项工作中，我们引入了不可迁移样本（NTEs），即重新编码的数据，作为任务级别的“密文”，只能由指定模型解码。对抗性样本利用高模型敏感性的方向，而NTEs则利用互补的不敏感子空间。我们提出了一种无需训练、数据无关的方法，在模型特定的低敏感子空间内重新编码数据，保留授权模型的输出，同时通过子空间错位降低未授权模型的性能。我们建立了形式化界限，证明授权模型的保真度，并表明未授权模型的退化与模型之间可测量的谱错位成比例。实验上，NTEs在常见预处理下保持了多种视觉骨干网络和最先进视觉语言模型的性能，而未授权模型即使在自适应重建攻击下也会崩溃。这些结果确立了NTEs作为一种实用手段，在防止未授权利用的同时保持预期的数据效用。我们的项目可在 https://trusted-system-lab.github.io/model-specificity 获取。

英文摘要

Recent AI regulations increasingly emphasize the need for mechanisms that preserve the utility of data for AI innovation while preventing misuse, particularly by enforcing purpose limitation in downstream AI applications. In practice, enforcing this principle remains challenging, as released data can be trivially fed into arbitrary models beyond its declared intent. Existing approaches attempt to mitigate this risk by either perturbing data or retraining models to limit unintended use. These strategies, however, offer no protection against inference by unknown or externally trained models, or fundamentally rely on control over the training or deployment. In this work, we introduce non-transferable examples (NTEs), recoded data that act as a task-level "ciphertext" decodable only by a designated model. Whereas adversarial examples exploit directions of high model sensitivity, NTEs leverage the complementary insensitive subspace. We propose a training-free, data-agnostic method that recodes data within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while degrading unauthorized ones through subspace misalignment. We establish formal bounds certifying authorized-model fidelity and showing that unauthorized degradation scales with measurable spectral misalignment between models. Empirically, NTEs preserve performance across diverse vision backbones and state-of-the-art vision-language models under common preprocessing, while unauthorized models collapse even under adaptive reconstruction attacks. These results establish NTEs as a practical means to preserve intended data utility while preventing unauthorized exploitation. Our project is available at https://trusted-system-lab.github.io/model-specificity

URL PDF HTML ☆

赞 0 踩 0

2510.10541 2026-06-02 cs.LG cs.AI 版本更新

合作进化压力与收益递减可能解释费米悖论：关于超级AI的形态

Daniel Vallstrom

AI总结通过广义进化视角，探讨合作压力与资源收益递减如何导致超级AI缺乏殖民动机，从而解释费米悖论。

Comments copy editing and minor fixes; moved all supplementary programs to github; added references

详情

AI中文摘要

采用进化方法，道德的基础可以解释为对合作问题的适应。将“进化”广义化，满足进化条件的AI将面临与生物实体相同的合作进化压力。本文讨论了随着物质安全和财富增加，合作增强的适应性——对人类、其他社会和AI而言。从物质资源获取中获得的收益递减也表明，总体上可能没有激励去殖民整个星系，从而为费米悖论（即“大家都在哪里？”）提供了可能的解释。进一步论证，古老社会可能孕育并最终让位于超级AI，因为超级AI可能是可行的且更适应。最后，附带讨论了道德和目标影响生命和社会的有效方式，强调环境、文化和法律，并以如何饮食为例。'收益递减'被定义为低于根号，即不可行性的逆。还指出，由于数学原因，每个实体占据一定空间，因此不可能存在指数级的殖民或繁殖。附录包括快速殖民例如星系的算法、收益递减下合作与公平演化的模型，以及模拟信号发展的软件。

英文摘要

With an evolutionary approach, the basis of morality can be explained as adaptations to problems of cooperation. With 'evolution' taken in a broad sense, AIs that satisfy the conditions for evolution to apply will be subject to the same cooperative evolutionary pressure as biological entities. Here the adaptiveness of increased cooperation as material safety and wealth increase is discussed -- for humans, for other societies, and for AIs. Diminishing beneficial returns from increased access to material resources also suggests the possibility that, on the whole, there will be no incentive to for instance colonize entire galaxies, thus providing a possible explanation of the Fermi paradox, wondering where everybody is. It is further argued that old societies could engender and eventually give way to super-AIs, since it is likely that super-AIs are feasible, and fitter. Closing is an aside on effective ways for morals and goals to affect life and society, emphasizing environments, cultures, and laws, and exemplified by how to eat. 'Diminishing returns' is defined, as less than roots, the inverse of infeasibility. It is also noted that there can be no exponential colonization or reproduction, for mathematical reasons, as each entity takes up a certain amount of space. Appended are an algorithm for colonizing for example a galaxy quickly, models of the evolution of cooperation and fairness under diminishing returns, and software for simulating signaling development.

URL PDF HTML ☆

赞 0 踩 0

2510.05566 2026-06-02 stat.ML cs.AI cs.CL cs.LG stat.AP 版本更新

Domain-Shift-Aware Conformal Prediction for Large Language Models

领域偏移感知的共形预测用于大型语言模型

Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz

发表机构 * University of Waterloo（多伦多大学）

AI总结提出领域偏移感知共形预测框架，通过重加权校准样本应对分布偏移，在MMLU基准上提升覆盖可靠性。

Comments Accepted to Forty-Third International Conference on Machine Learning (ICML), 2026

详情

AI中文摘要

大型语言模型在各种任务中取得了令人印象深刻的性能。然而，它们倾向于产生过度自信且事实不正确的输出，即所谓的幻觉，这在实际应用中带来了风险。共形预测提供了有限样本、无分布假设的覆盖保证，但标准共形预测在领域偏移下会失效，常常导致覆盖不足和不可靠的预测集。我们提出了一种称为领域偏移感知共形预测（DS-CP）的新框架。我们的框架通过根据校准样本与测试提示的接近程度系统地重新加权校准样本，将共形预测适应于领域偏移下的大型语言模型，从而在保持有效性的同时增强适应性。我们的理论分析和在MMLU基准上的实验表明，所提出的方法比标准共形预测提供了更可靠的覆盖，尤其是在显著分布偏移下，同时保持了效率。这为大型语言模型在实际部署中实现可信的不确定性量化迈出了实际的一步。

英文摘要

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.

URL PDF HTML ☆

赞 0 踩 0

2510.05342 2026-06-02 cs.LG cs.AI 版本更新

视觉关系的多模态函数向量

Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结通过因果中介分析提取多模态函数向量，操纵注意力头以改善视觉关系推理，并实现零样本和微调性能提升。

详情

AI中文摘要

大型多模态模型（LMMs）从少量多模态演示中展现出令人印象深刻的上下文学习能力，然而支持这种任务学习的内部机制仍不透明。基于大型语言模型的先前工作，我们表明大型多模态模型中一小部分注意力头负责传递视觉关系的表示。这些注意力头的激活，称为函数向量，可以被提取和操纵以改变LMM在关系任务上的性能。首先，使用合成和真实图像数据集，我们应用因果中介分析来识别强烈影响关系预测的注意力头，并提取多模态函数向量，以提高推理时的零样本准确率。我们进一步证明，这些多模态函数向量可以在保持LMM参数冻结的情况下，用适量的训练数据进行微调，从而显著优于上下文学习基线。最后，我们展示了特定关系的函数向量可以线性组合，以解决涉及新颖和未经训练的视觉关系的类比问题，突显了该方法的强大泛化能力。通过在两个LMM（包括OpenFlamingo和Qwen3-VL）上的实验，我们的结果表明这些模型在局部内部结构中编码了视觉关系知识，这些知识可以被系统地提取和优化，从而增进了我们对模型模块化的理解，并增强了对LMM中关系推理的控制。

英文摘要

Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First, using synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained visual relations, highlighting the strong generalization ability of this approach. Through experiments on two LMMs, including OpenFlamingo and Qwen3-VL, our results show that these models encode visual relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.

URL PDF HTML ☆

赞 0 踩 0

2507.09029 2026-06-02 cs.LG cs.AI 版本更新

Model Parallelism With Subnetwork Data Parallelism

模型并行与子网络数据并行

Vaibhav Singh, Zafir Khalid, Pietro Cagnasso, Edouard Oyallon, Eugene Belilovsky

发表机构 * Mila ； Concordia University（康科迪亚大学）； ISIR-Sorbonne University, CNRS（索邦大学-ISIR与CNRS）

AI总结提出子网络数据并行（SDP）框架，通过将模型划分为结构化子网络并在工作节点间独立训练，无需交换激活值，在保持或提升性能的同时显著降低内存占用。

Comments 9 pages, 5 figures

2510.01891 2026-06-02 cs.SD cs.AI eess.AS 版本更新

HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering

HRTFformer: 用于沉浸式音频渲染中个体HRTF上采样的空间感知Transformer

Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg

发表机构 * SONICOM

AI总结针对个体HRTF测量困难的问题，提出基于Transformer的HRTF上采样架构，利用注意力机制和球谐域处理，结合邻域差异损失，实现高保真HRTF重建。

Comments Accepted to IEEE Transactions on Multimedia 2026

详情

AI中文摘要

个体头相关传输函数（HRTF）正开始被引入许多商业沉浸式音频应用中，对于实现逼真的空间音频渲染至关重要。然而，引入它们的主要顾虑之一是，由于HRTF测量过程的复杂性，大规模创建个体HRTF并不实用。为缓解这一缺点，提出了HRTF空间上采样，旨在减少所需的测量量。尽管先前的工作已通过不同的机器学习方法取得成功，但这些模型通常难以在相邻源方向之间保持局部空间变化模式的长期一致性，以及在高上采样因子下的泛化能力。本文提出了一种新颖的基于Transformer的HRTF上采样架构，利用注意力机制更好地捕捉HRTF球面上的空间相关性。在球谐域中工作，我们的模型从稀疏输入测量中学习重建高分辨率HRTF，精度显著提高。为增强空间一致性，我们引入了邻域差异损失，促进幅度平滑性，从而产生更逼真的上采样。我们使用感知定位模型和客观频谱失真指标评估了我们的方法。实验表明，我们的模型在生成逼真、高保真HRTF方面，在多个评估指标上优于现有方法。

英文摘要

Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating individual HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing the measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range preservation of local spatial variation patterns across neighbouring source directions and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.

URL PDF HTML ☆

赞 0 踩 0

2510.01167 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

同时多目标对齐：可验证与不可验证奖励

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

AI总结提出MAHALO框架，通过标准化PRM训练、多动作头DPO和PRM引导解码，实现大语言模型在可验证与不可验证奖励上的多目标对齐，减少目标冲突并支持推理时控制。

Comments ICML 2026

详情

AI中文摘要

将大语言模型与人类偏好对齐本质上是多维的，但大多数流水线将异质信号压缩为单一目标。我们试图回答如何同时在多个领域中对齐模型，这些领域包括：可验证奖励、不可验证主观偏好以及复杂交互场景。这种多目标对齐设置常常因各个目标相互冲突而困扰，导致训练效率低下和推理时用户控制有限。为了解决这些问题，我们提出了$ extbf{MAHALO}$（Multi-Action-Head Alignment with PRM-guided Decoding），这是一个统一的框架，它在可验证和不可验证设置下标准化PRM训练以进行步骤级监督，通过多动作头DPO执行向量化多目标对齐，并通过目标特定权重和PRM引导解码实现可控推理。在数学推理、人类价值观对齐和多轮辅导上的实验表明，MAHALO能够以有限的干扰同时联合改善多个目标，同时保持跨领域的泛化性和适应性，并在推理时提供灵活的用户控制。我们的代码可在 https://github.com/pearls-lab/multiobj-align 获取。

英文摘要

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards, non-verifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by individual objectives being at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose $\textbf{M}$ulti-$\textbf{A}$ction-$\textbf{H}$ead $\textbf{AL}$ignment with PRM-guided Dec$\textbf{O}$ding ($\textbf{MAHALO}$), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO jointly improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.

URL PDF HTML ☆

赞 0 踩 0

2510.00481 2026-06-02 cs.NI cs.AI cs.HC cs.MM cs.PF 版本更新

Make a Video Call with LLM: A Measurement Campaign over Six Mainstream Apps

用LLM进行视频通话：对六个主流应用的测量活动

Jiayang Xu, Xiangjie Huang, Zijie Li, Antariksh Verma, Zili Meng

发表机构 * Hong Kong University of Science and Technology（香港科技大学）

AI总结本文通过自定义测试平台和在线平台，从质量、延迟、内部机制和系统开销四个维度对六个主流AI视频聊天应用进行基准测试，发现AI视频通话的网络延迟影响小于人类视频通话，AI代理能力对用户体验影响最大。

详情

基于嵌入的链接预测的理论局限性

Samy Badreddine, Emile van Krieken, Luciano Serafini

发表机构 * Vrije Universiteit Amsterdam, Netherlands（荷兰阿姆斯特丹自由大学）； University of Trento, Italy（意大利特伦托大学）

AI总结研究线性输出层导致的秩瓶颈对知识图谱嵌入模型表达能力的限制，并提出混合非线性输出层以提升大规模密集图上的性能。

详情

AI中文摘要

神经网络通常将低维嵌入映射到高维输出空间。通常，输出层是线性的，这会产生一个“秩瓶颈”，限制模型所能表示的函数。这种瓶颈在链接预测模型中普遍存在，例如知识图谱嵌入（KGE），因为实体的输出空间可能比嵌入维度大几个数量级。我们研究了秩瓶颈如何限制模型拟合训练数据的表达能力。以往工作关注特定KGE所需嵌入维度的充分上界，而我们给出了所有具有线性输出层的KGE的必要下界，该下界随图的大小和连通性增长。我们还考虑了一种使用混合的非线性输出层，以在不显著增加参数开销的情况下打破瓶颈。实验表明，使用这种非线性层的模型在大型密集数据集上，以较低的参数成本提升了排序性能和概率拟合，正如我们的理论所预测。我们的工作揭示了线性输出层如何限制KGE，并激励使用非线性替代方案以扩展到大型密集图。

英文摘要

Neural networks often map low-dimensional embeddings to high-dimensional output spaces. Usually, the output layer is linear, which can create a "rank bottleneck" that limits the functions a model can represent. Such bottlenecks are ubiquitous in link prediction models, such as knowledge graph embeddings (KGEs), as the output space of entities can be orders of magnitude larger than the embedding dimension. We investigate how rank bottlenecks limit model expressivity for fitting the training data. While previous work focused on sufficient bounds on the embedding dimension required for specific KGEs, we show necessary bounds for all KGEs with a linear output layer, which grow with graph size and connectivity. We also consider a non-linear output layer using mixtures to break the bottleneck without significant parameter overhead. Empirically, we show that models using this non-linear layer improve in ranking performance and probabilistic fit for large and dense datasets at a low parameter cost, as predicted by our theory. Our work reveals how linear output layers limit KGEs and motivates non-linear alternatives for scaling to large and dense graphs.

URL PDF HTML ☆

赞 0 踩 0

2504.06006 2026-06-02 cs.LG cs.AI cs.NE 版本更新

Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?

Optuna vs Code Llama：LLM 是超参数调优的新范式吗？

Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室，CAIDAS与IFI，乌尔姆大学，德国）

AI总结通过微调参数高效的 Code Llama 模型，提出基于大语言模型的超参数优化方法，在多种视觉架构上实现与 Optuna 相当或更优的 RMSE 并大幅降低计算开销。

详情

DOI: 10.1109/ICCVW69036.2025.00598
Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 5664-5674, 2025

AI中文摘要

最优超参数选择对于最大化计算机视觉中神经网络的性能至关重要，尤其是当架构变得日益复杂时。本文通过使用 LoRA 微调参数高效的 Code Llama 版本，探索了大语言模型在超参数优化中的应用。所得模型在广泛的视觉架构上产生了准确且计算高效的超参数推荐。与依赖资源密集型的试错过程的传统方法（如 Optuna）不同，我们的方法在实现竞争性或更优的均方根误差的同时，大幅降低了计算开销。重要的是，所评估的模型涵盖了以图像为中心的任务，如分类、检测和分割，这些是许多图像处理流程（包括增强、恢复和风格迁移）的基本组成部分。我们的结果表明，基于 LLM 的优化不仅与成熟的贝叶斯方法（如树结构 Parzen 估计器）相媲美，而且加速了需要感知质量和低延迟处理的实际应用的调优。所有生成的配置均公开在 LEMUR 神经网络数据集（https://github.com/ABrain-One/nn-dataset）中，该数据集作为超参数优化研究的开源基准，并为提高图像处理系统中的训练效率提供了实用资源。

英文摘要

Optimal hyperparameter selection is critical for maximizing the performance of neural networks in computer vision, particularly as architectures become more complex. This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA. The resulting model produces accurate and computationally efficient hyperparameter recommendations across a wide range of vision architectures. Unlike traditional methods such as Optuna, which rely on resource-intensive trial-and-error procedures, our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead. Importantly, the models evaluated span image-centric tasks such as classification, detection, and segmentation, fundamental components in many image manipulation pipelines including enhancement, restoration, and style transfer. Our results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE), but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing. All generated configurations are publicly available in the LEMUR Neural Network Dataset (https://github.com/ABrain-One/nn-dataset), which serves as an open source benchmark for hyperparameter optimization research and provides a practical resource to improve training efficiency in image manipulation systems.

URL PDF HTML ☆

赞 0 踩 0

2509.23544 2026-06-02 stat.ML cs.AI cs.LG stat.ME 版本更新

End-to-End Deep Learning for Predicting Metric Space-Valued Outputs

端到端深度学习预测度量空间值输出

Yidong Zhou, Su I Iao, Hans-Georg Müller

AI总结提出E2M框架，通过加权Fréchet均值和神经网络学习权重，实现度量空间值输出的几何感知预测，具有理论保证并在多种结构化输出上取得最优性能。

Comments 38 pages, 4 figures, 9 tables

详情

Journal ref: Journal of Machine Learning Research, 27:1--38, 2026

AI中文摘要

许多现代应用涉及预测结构化、非欧几里得输出，例如概率分布、网络和对称正定矩阵。这些输出自然地被建模为一般度量空间的元素，而依赖于向量空间结构的经典回归技术不再适用。我们引入了E2M（端到端度量回归），这是一个用于预测度量空间值输出的深度学习框架。E2M通过训练输出的加权Fréchet均值进行预测，其中权重由基于输入条件的神经网络学习。这种构造提供了一种原则性的几何感知预测机制，避免了替代嵌入和限制性参数假设，同时完全保留了输出空间的内在几何结构。我们建立了理论保证，包括刻画模型表达能力的通用逼近定理以及熵正则化训练目标的收敛性分析。通过涉及概率分布、网络和对称正定矩阵的大量模拟，我们展示了E2M始终达到最先进的性能，且其优势在更大样本量下更加明显。应用于人类死亡率分布和纽约市出租车网络进一步证明了该框架的灵活性和实用性。

英文摘要

Many modern applications involve predicting structured, non-Euclidean outputs such as probability distributions, networks, and symmetric positive-definite matrices. These outputs are naturally modeled as elements of general metric spaces, where classical regression techniques that rely on vector space structure no longer apply. We introduce E2M (End-to-End Metric regression), a deep learning framework for predicting metric space-valued outputs. E2M performs prediction via weighted Fréchet means over training outputs, where the weights are learned by a neural network conditioned on the input. This construction provides a principled mechanism for geometry-aware prediction that avoids surrogate embeddings and restrictive parametric assumptions, while fully preserving the intrinsic geometry of the output space. We establish theoretical guarantees, including a universal approximation theorem that characterizes the expressive capacity of the model and a convergence analysis of the entropy-regularized training objective. Through extensive simulations involving probability distributions, networks, and symmetric positive-definite matrices, we show that E2M consistently achieves state-of-the-art performance, with its advantages becoming more pronounced at larger sample sizes. Applications to human mortality distributions and New York City taxi networks further demonstrate the flexibility and practical utility of this framework.

URL PDF HTML ☆

赞 0 踩 0

2508.06588 2026-06-02 cs.LG cs.AI 版本更新

Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning

图是一种自然正则化：重新审视向量量化在图表示学习中的应用

Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang

发表机构 * School of Computer Science and Engineering, University of New South Wales, Sydney, Australia（新南威尔士大学计算机科学与工程学院，悉尼，澳大利亚）

AI总结针对图向量量化中码本崩溃问题，提出RGVQ框架，通过图拓扑和特征相似性正则化及Gumbel-Softmax软分配，提升码本利用率和令牌多样性。

Comments ICML2026

详情

AI中文摘要

向量量化（VQ）最近成为一种学习图结构数据压缩和离散表示的有前途的方法。然而，一个基本挑战，即码本崩溃，在图领域仍未得到充分探索，严重限制了图令牌的表达能力和泛化能力。在本文中，我们进行了一项实证研究，观察到在图形重建任务中，即使采用了视觉或语言领域提出的缓解策略，当与图神经网络联合训练VQ时，码本崩溃始终发生。此外，我们从数据和优化角度提供了崩溃的诊断，表明崩溃与图数据属性（如特征冗余和连接密度）相关，并进一步由确定性硬分配的训练动态强化。为了解决这些问题，我们提出了RGVQ，一种新颖的框架，它集成图拓扑和特征相似性作为显式正则化信号，以增强码本利用并促进令牌多样性。RGVQ通过Gumbel-Softmax重参数化引入软分配，确保所有码字接收梯度更新。此外，RGVQ包含结构感知对比正则化，以惩罚将相同令牌分配给不相似的节点对。大量实验表明，RGVQ显著提高了码本利用率，并在多个下游任务中持续提升了最先进的图VQ骨干网络的性能，实现了更具表达性和可迁移性的图令牌表示。

英文摘要

Vector Quantization (VQ) has recently emerged as a promising approach for learning compressed and discrete representations for graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens.In this paper, we present an empirical study and observe that codebook collapse consistently occurs when training VQ jointly with Graph Neural Networks under graph reconstruction tasks, even with mitigation strategies proposed in vision or language domains. Moreover, we provide a diagnosis of collapse from data and optimization perspectives, showing that collapse is associated with graph data properties such as feature redundancy and connectivity density, and is further reinforced by the training dynamics of deterministic hard assignment. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize assigning the same token to dissimilar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.

URL PDF HTML ☆

赞 0 踩 0

2504.10552 2026-06-02 cs.LG cs.AI cs.CV cs.DL 版本更新

LEMUR Neural Network Dataset: Towards Seamless AutoML

LEMUR 神经网络数据集：迈向无缝 AutoML

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg（计算机视觉实验室，CAIDAS，乌尔姆大学）

AI总结提出 LEMUR 开源数据集与框架，通过统一模板、结构化存储和自动化超参数优化，标准化神经网络实现与评估，以加速 AutoML 研究并促进公平基准测试。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3291-3300, 2026

AI中文摘要

神经网络是现代人工智能的支柱，但设计、评估和比较它们仍然劳动密集。尽管存在许多用于训练的数据集，但模型本身的标准化集合很少。我们介绍 LEMUR，一个开源数据集和框架，它提供了大量基于 PyTorch 的神经网络集合，涵盖分类、分割、检测和自然语言处理等任务。每个模型遵循统一模板，配置和结果存储在结构化数据库中，以确保一致性和可重复性。LEMUR 通过 Optuna 集成自动超参数优化，包括统计分析和可视化工具，并提供 API 以无缝访问性能数据。该框架是可扩展的，允许研究人员添加新模型、数据集或指标而不破坏兼容性。通过标准化实现和统一评估，LEMUR 旨在加速 AutoML 研究，实现公平基准测试，并降低大规模神经网络实验的障碍。为支持采用和协作，LEMUR 及其插件在 MIT 许可下发布，网址为：https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

FedS2R: 面向自动驾驶中合成到真实语义分割的一次性联邦域泛化

Tao Lian, Jose L. Gómez, Antonio M. López

发表机构 * Computer Vision Center (CVC) Univ. Autònoma de Barcelona (UAB) Barcelona, Spain（计算机视觉中心（CVC）巴塞罗那自治大学（UAB）巴塞罗那，西班牙）

AI总结提出FedS2R框架，通过不一致性驱动的数据增强和多客户端知识蒸馏，实现自动驾驶中合成到真实语义分割的一次性联邦域泛化，在五个真实数据集上性能接近集中式训练。

Comments Accepted by IEEE Intelligent Vehicles Symposium (IV) 2026

详情

AI中文摘要

联邦域泛化在图像分类中通过多客户端协作训练而不共享原始数据已显示出有希望的进展。然而，其在自动驾驶语义分割中的潜力尚未被充分探索。本文提出FedS2R，这是第一个用于自动驾驶中合成到真实语义分割的一次性联邦域泛化框架。FedS2R包含两个组件：一种不一致性驱动的数据增强策略，用于生成不稳定类别的图像；以及一种具有特征融合的多客户端知识蒸馏方案，从多个客户端模型中蒸馏出全局模型。在五个真实数据集Cityscapes、BDD100K、Mapillary、IDD和ACDC上的实验表明，全局模型显著优于单个客户端模型，并且仅比同时访问所有客户端数据训练的模型落后2个mIoU点。这些结果证明了FedS2R在联邦学习下自动驾驶合成到真实语义分割中的有效性。

英文摘要

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in the semantic segmentation of autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency-driven data augmentation strategy that generates images for unstable classes, and a multi-client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real-world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic-to-real semantic segmentation for autonomous driving under federated learning

URL PDF HTML ☆

赞 0 踩 0

2507.19702 2026-06-02 cs.SI cs.AI cs.LG 版本更新

A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks

基于轻量级深度学习的复杂网络中有影响力节点排序模型

Mohammed A. Ramadhan, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho（扎赫大学科学学院计算机科学系）； Department of Computer Science and Information Technology, The American University of Kurdistan（库尔德斯坦美国大学计算机科学与信息技术系）

AI总结提出一种结合一维卷积神经网络和GraphSAGE的轻量级混合模型1D-CGS，利用节点度和平均邻居度特征，通过回归任务高效排序有影响力节点，在12个真实网络上平均Kendall Tau提升4.73%，Jaccard相似度提升7.67%，单调性指数达0.99，运行速度显著快于现有深度学习方法。

详情

AI中文摘要

识别复杂网络中的有影响力节点是一项关键任务，在不同领域有广泛应用。然而，现有方法常在准确性和计算效率之间权衡。为解决这些挑战，我们提出1D-CGS，一种轻量级且有效的混合模型，它结合了一维卷积神经网络（1D-CNN）的速度和GraphSAGE的拓扑表示能力，用于高效节点排序。该模型使用基于两个简单且重要的拓扑特征（节点度和平均邻居度）构建的轻量级输入表示。这些特征通过一维卷积提取局部模式，然后通过GraphSAGE层聚合邻域信息。我们将节点排序任务表述为回归问题，并使用易感-感染-恢复（SIR）模型生成真实影响力分数。1D-CGS首先在Barabasi-Albert模型生成的合成网络上训练，然后应用于真实世界网络以识别有影响力节点。在12个真实网络上的实验评估表明，1D-CGS在排序准确性上显著优于传统中心性度量和最近的深度学习模型，同时运行速度非常快。与表现最佳的深度学习基线相比，所提模型在Kendall Tau相关性上平均提升4.73%，在Jaccard相似度上平均提升7.67%。它还实现了平均单调性指数（MI）分数0.99，并产生近乎完美的排名分布，表明高度独特和可区分的排名。此外，所有实验证实1D-CGS在高度合理的时间内运行，比现有深度学习方法快得多，使其适用于大规模应用。

英文摘要

Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, existing approaches often face trade-offs between accuracy and computational efficiency. To address these challenges, we propose 1D-CGS, a lightweight and effective hybrid model that integrates the speed of one-dimensional convolutional neural networks (1D-CNN) with the topological representation power of GraphSAGE for efficient node ranking. The model uses a lightweight input representation built on two straightforward and significant topological features: node degree and average neighbor degree. These features are processed through 1D convolutions to extract local patterns, followed by GraphSAGE layers to aggregate neighborhood information. We formulate the node ranking task as a regression problem and use the Susceptible-Infected-Recovered (SIR) model to generate ground truth influence scores. 1D-CGS is initially trained on synthetic networks generated by the Barabasi-Albert model and then applied to real world networks for identifying influential nodes. Experimental evaluations on twelve real world networks demonstrate that 1D-CGS significantly outperforms traditional centrality measures and recent deep learning models in ranking accuracy, while operating in very fast runtime. The proposed model achieves an average improvement of 4.73% in Kendall's Tau correlation and 7.67% in Jaccard Similarity over the best performing deep learning baselines. It also achieves an average Monotonicity Index (MI) score 0.99 and produces near perfect rank distributions, indicating highly unique and discriminative rankings. Furthermore, all experiments confirm that 1D-CGS operates in a highly reasonable time, running significantly faster than existing deep learning methods, making it suitable for large scale applications.

URL PDF HTML ☆

赞 0 踩 0

2503.05641 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

基于技能的混合专家模型：通过推断技能实现异构推理的自适应路由

Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出Skill-MoE框架，通过推断查询所需技能进行实例级专家选择，并采用批推理策略降低开销，在单GPU上集成16个专家模型，在多个推理基准上平均提升8.15%。

Comments ICML 2026 (Camera-Ready). The first three authors contributed equally. Project Page: https://skill-moe.github.io/

详情

AI中文摘要

结合现有的预训练大语言模型是处理多样化推理任务的一种有前景的方法。然而，任务级专家选择往往过于粗粒度，因为不同实例可能需要不同的专业知识。为了解决这个问题，我们提出了Skill-MoE，一个符号化的、基于技能的、无梯度的混合专家框架，用于实例级专家选择。Skill-MoE从每个查询中推断技能（例如，数学中的代数），根据技能相关性选择专家，并让每个专家生成自己的推理。然后，由选定的聚合器将得到的k个输出进行综合，该聚合器因其整合多样化响应的能力而被选中。虽然实例级选择显著提高了性能，但朴素实现会因重复的模型加载和卸载而产生巨大开销。我们通过一种批推理策略解决了这个问题，该策略将实例按分配的专家分组，使得每个模型只需加载一次。因此，Skill-MoE在单GPU上集成了16个专家模型，其运行时间与使用4个GPU的先前多智能体基线相当。在多个基准测试（MMLU-Pro、GPQA、AIME和MedMCQA）中，Skill-MoE相比最佳基线实现了平均8.15%的绝对提升。它还能很好地泛化到未见过的任务，并且无需昂贵的多轮交互即可超越基于讨论的方法。

英文摘要

Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise. To address this, we propose Skill-MoE, a symbolic, skill-based, and gradient-free Mixture-of-Experts framework for instance-level expert selection. Skill-MoE infers skills (e.g., algebra in mathematics) from each query, selects experts based on skill relevance, and lets each expert generate its own reasoning. The resulting k outputs are then synthesized by an aggregator chosen for its ability to integrate diverse responses. While instance-level selection substantially improves performance, naively implementing it incurs heavy overhead from repeated model loading and offloading. We address this with a batch inference strategy that groups instances by assigned experts, allowing each model to be loaded only once. As a result, Skill-MoE integrates 16 expert models on a single GPU with runtime comparable to prior multi-agent baselines using 4 GPUs. Across diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), Skill-MoE achieves an average absolute improvement of 8.15% over the best baseline. It also generalizes well to unseen tasks and outperforms discussion-based methods without requiring expensive multi-round interactions.

URL PDF HTML ☆

赞 0 踩 0

2507.12645 2026-06-02 eess.SP cs.AI cs.LG 版本更新

A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis

一种用于生物医学时间序列数据鲁棒深度学习分类的新型数据增强策略：在ECG和EEG分析中的应用

Mohammed Guhdar, Ramadhan J. Mstafa, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho（扎赫大学科学学院计算机科学系）； Department of Computer Science and Information Technology, The American University of Kurdistan（库尔德斯坦美国大学计算机科学与信息技术系）； PRIME Lab, Scientific Research Center, University of Zakho（扎赫大学科学研究中心PRIME实验室）

AI总结提出一种结合ResNet-CNN与注意力机制的统一深度学习框架，通过时域拼接多个增强变体的新型数据增强策略和Focal Loss处理类别不平衡，在ECG和EEG数据集上达到99.96%-100%的准确率，且内存需求低、推理速度快。

详情

AI中文摘要

准确统一分析多种生物信号（如ECG和EEG）的需求日益迫切，这对于全面评估患者状况至关重要，尤其是在同步监测中。尽管多传感器融合取得了进展，但在开发能够有效处理和提取本质上不同生理信号特征的统一架构方面仍存在关键空白。另一个挑战是许多生物医学数据集固有的类别不平衡，这常常导致传统方法性能偏差。本研究通过提出一种新颖且统一的深度学习框架来解决这些问题，该框架在不同信号类型上均达到了最先进的性能。我们的方法将基于ResNet的CNN与注意力机制相结合，并通过一种新颖的数据增强策略增强：对每个信号的多个增强变体进行时域拼接，以生成更丰富的表示。与先前工作不同，我们科学地增加信号复杂性以实现未来能力，从而相比现有技术获得了最佳预测。预处理步骤包括小波去噪、基线去除和标准化。通过结合使用这种高级数据增强和Focal Loss函数，有效管理了类别不平衡。训练过程中应用了正则化技术以确保泛化能力。我们在三个基准数据集上严格评估了所提出的架构：UCI癫痫EEG、MIT-BIH心律失常和PTB诊断ECG。它分别达到了99.96%、99.78%和100%的准确率，展示了在不同信号类型和临床背景下的鲁棒性。最后，该架构需要约130 MB内存，每个样本处理时间约10 ms，表明其适用于低端或可穿戴设备部署。

英文摘要

The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive patient assessment, especially in synchronous monitoring. Despite advances in multi-sensor fusion, a critical gap remains in developing unified architectures that effectively process and extract features from fundamentally different physiological signals. Another challenge is the inherent class imbalance in many biomedical datasets, often causing biased performance in traditional methods. This study addresses these issues by proposing a novel and unified deep learning framework that achieves state-of-the-art performance across different signal types. Our method integrates a ResNet-based CNN with an attention mechanism, enhanced by a novel data augmentation strategy: time-domain concatenation of multiple augmented variants of each signal to generate richer representations. Unlike prior work, we scientifically increase signal complexity to achieve future-reaching capabilities, which resulted in the best predictions compared to the state of the art. Preprocessing steps included wavelet denoising, baseline removal, and standardization. Class imbalance was effectively managed through the combined use of this advanced data augmentation and the Focal Loss function. Regularization techniques were applied during training to ensure generalization. We rigorously evaluated the proposed architecture on three benchmark datasets: UCI Seizure EEG, MIT-BIH Arrhythmia, and PTB Diagnostic ECG. It achieved accuracies of 99.96%, 99.78%, and 100%, respectively, demonstrating robustness across diverse signal types and clinical contexts. Finally, the architecture requires ~130 MB of memory and processes each sample in ~10 ms, suggesting suitability for deployment on low-end or wearable devices.

URL PDF HTML ☆

赞 0 踩 0

2506.21278 2026-06-02 stat.ML cs.AI cs.LG math.ST stat.TH 版本更新

Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution

使用高效球面柯西分布的超球面变分自编码器

Lukas Sablica, Kurt Hornik

发表机构 * Institute for Statistics and Mathematics（统计与数学研究所）； Vienna University of Economics and Business（维也纳经济与商业大学）； Austria（奥地利）

AI总结提出基于球面柯西分布的超球面变分自编码器，通过莫比乌斯变换实现可微重参数化，避免贝塞尔函数计算，在保持重尾特性的同时提供高效稳定的训练与推理。

详情

AI中文摘要

我们提出在超球面潜变量空间上使用球面柯西（spCauchy）潜变量的变分自编码器。spCauchy 族具有重尾全局行为，并且通过对球面上的均匀样本应用莫比乌斯变换，允许精确可微的重参数化。我们证明，在高浓度极限下，spCauchy 在显式浓度参数映射下恢复了 von Mises-Fisher（vMF）分布的局部切空间几何，同时避免了 vMF 实现所需的高阶贝塞尔函数计算。对于训练，到均匀球面先验的 Kullback-Leibler 散度具有快速收敛的级数、稳定的求积以及高浓度渐近形式。我们进一步建立了浓度依赖的 KL 核心的单调性，并推导了具有闭形式代理和误差控制的解析括号，支持极端情况下的稳定近似。压力测试基准表明，所得到的潜层目标在 CPU 和 GPU 上比 vMF 基线更稳定且评估更快。在图像和分子序列数据上的实验表明，spCauchy-VAE 为具有超球面潜表示的生式建模提供了一种鲁棒且可扩展的替代方案。

英文摘要

We propose spherical Cauchy (spCauchy) latent variables for variational autoencoders on hyperspherical latent spaces. The spCauchy family has heavy-tailed global behavior and admits an exact differentiable reparameterization by applying a Möbius transformation to uniform samples on the sphere. We show that, in the high-concentration limit, spCauchy recovers the local tangent-space geometry of the von Mises-Fisher (vMF) distribution under an explicit concentration parameter mapping, while avoiding the high-order Bessel-function evaluations required by vMF implementations. For training, the Kullback-Leibler divergence to a uniform spherical prior admits rapidly convergent series, stable quadrature, and high-concentration asymptotic forms. We further establish monotonicity of the concentration-dependent KL core and derive analytic brackets with closed-form surrogates and error control, supporting stable approximation in extreme regimes. Stress-test benchmarks show that the resulting latent-layer objective remains stable and faster to evaluate than vMF baselines on CPU and GPU. Experiments on image and molecular sequence data demonstrate that spCauchy-VAEs provide a robust and scalable alternative for generative modeling with hyperspherical latent representations.

URL PDF HTML ☆

赞 0 踩 0

2507.09766 2026-06-02 cs.LG cs.AI 版本更新

Toward accurate RUL and SoH estimation using reinforced graph-based physics-informed neural networks enhanced with dynamic weights

基于动态权重的强化图物理信息神经网络实现精确的剩余使用寿命和健康状态估计

Mohamadreza Akbari Pour, Ali Ghasemzadeh, Mohamad Ali Bijarchi, Mohammad Behshad Shafii

发表机构 * Department of Mechanical Engineering（机械工程系）； Department of Computer Engineering（计算机工程系）； Sharif University of Technology（谢赫拉特福大学）

AI总结提出一种结合图表示学习、强化学习和自适应动态权重的物理信息神经网络框架RGPD，在C-MAPSS、PHM2012和XJTU数据集上实现跨资产退化场景的RUL和SoH高精度估计。

详情

AI中文摘要

精确估计剩余使用寿命（RUL）和健康状态（SoH）对于可靠的预测与健康管理（PHM）至关重要，有助于及时维护和可靠的工业运行。然而，结合数据驱动学习与基于物理的正则化的混合模型通常依赖于固定的损失权重，因此在跨具有不同退化行为的资产迁移时会失去准确性。本研究引入了具有动态加权的强化图物理信息网络（RGPD），这是一个用于时空退化建模和自适应物理引导正则化的统一框架。基于图的表示学习捕获传感器间的退化结构，软演员-评论家（SAC）模块在噪声条件下细化潜在特征，轻量级Q学习策略在训练过程中自适应地平衡单调性、平滑性和潜在动力学残差损失。该框架在C-MAPSS、PHM2012和XJTU数据集上进行了评估，这些数据集分别代表发动机、轴承和电池的退化过程。与相应基准表中报告的最强基线相比，RGPD在PHM2012和C-MAPSS上将平均RMSE提高了高达12%，在XJTU上将平均MAPE比第二好的模型降低了20%。在这些异构基准上的性能进一步表明了该模型跨退化系统的泛化能力。物理信息组件通过退化一致性先验以及深度隐藏物理模型风格的残差实现，提高了物理合理性，而无需为每种资产类型建立完整的第一性原理模型。

英文摘要

Accurate estimation of Remaining Useful Life (RUL) and State of Health (SoH) is essential for reliable Prognostics and Health Management (PHM), supporting timely maintenance and dependable industrial operation. However, hybrid models that combine data-driven learning with physics-based regularization often rely on fixed loss weights and therefore lose accuracy when transferred across assets with different degradation behaviors. This study introduces Reinforced Graph-based Physics-informed Networks with Dynamic Weighting (RGPD), a unified framework for spatio-temporal degradation modeling and adaptive physics-guided regularization. Graph-based representation learning captures inter-sensor degradation structure, a Soft Actor-Critic (SAC) module refines latent features under noisy conditions, and a lightweight Q-learning policy adaptively balances monotonicity, smoothness, and latent-dynamics residual losses during training. The framework is evaluated on the C-MAPSS, PHM2012, and XJTU datasets, which represent engine, bearing, and battery degradation processes. Relative to the strongest compared baselines reported in the corresponding benchmark tables, RGPD improves average RMSE by up to 12 percent on PHM2012 and C-MAPSS, and reduces average MAPE by 20 percent on XJTU compared with the second-best reported model. Performance on these heterogeneous benchmarks further suggests the model's generalizability across degradation systems. The physics-informed component is implemented through degradation-consistent priors together with a Deep Hidden Physics Model-style residual, which improves physical plausibility without requiring a full first-principles model for each asset type.

URL PDF HTML ☆

赞 0 踩 0

2507.02905 2026-06-02 cs.HC cs.AI cs.LG 版本更新

Preference-Optimal Multi-Metric Weighting for Parallel Coordinate Plots

平行坐标图的偏好最优多度量加权

Chisa Mori, Shuhei Watanabe, Masaki Onishi, Takayuki Itoh

发表机构 * Preferred Networks Inc.（Preferred Networks公司）

AI总结针对平行坐标图中多度量可视化难题，提出基于偏好最优加权的公式化方法，并利用雷达图与UMAP降维实现直观偏好选择，有效揭示控制参数重要性模式。

Comments Accepted to International Conference Information Visualisation (iV2025)

详情

DOI: 10.1109/IV68685.2025.00014

AI中文摘要

平行坐标图（PCP）是一种解释控制参数与度量之间关系的常用方法。PCP通过基于单一度量的颜色渐变来提供这种解释。然而，当存在多个度量时，提供这样的渐变是具有挑战性的。虽然一种简单的方法是通过线性加权每个度量来计算单一度量，但这种加权对用户来说是不明确的。为了解决这个问题，我们首先提出了一种基于特定偏好度量组合计算最优加权的原则性公式。尽管用户可以在双度量问题的二维（2D）平面上简单地选择他们的偏好，但多度量问题需要直观的可视化以允许他们选择偏好。我们通过使用各种雷达图来可视化由UMAP降维的2D平面上的度量权衡来实现这一点。在使用行人流引导规划的分析中，我们的方法为每个用户偏好识别出了控制参数重要性的独特模式，突出了我们方法的有效性。

英文摘要

Parallel coordinate plots (PCPs) are a prevalent method to interpret the relationship between the control parameters and metrics. PCPs deliver such an interpretation by color gradation based on a single metric. However, it is challenging to provide such a gradation when multiple metrics are present. Although a naive approach involves calculating a single metric by linearly weighting each metric, such weighting is unclear for users. To address this problem, we first propose a principled formulation for calculating the optimal weight based on a specific preferred metric combination. Although users can simply select their preference from a two-dimensional (2D) plane for bi-metric problems, multi-metric problems require intuitive visualization to allow them to select their preference. We achieved this using various radar charts to visualize the metric trade-offs on the 2D plane reduced by UMAP. In the analysis using pedestrian flow guidance planning, our method identified unique patterns of control parameter importance for each user preference, highlighting the effectiveness of our method.

URL PDF HTML ☆

赞 0 踩 0

2409.18624 2026-06-02 cs.AI cs.LG 版本更新

Unsupervised Cognition

无监督认知

Alfredo Ibias, Hector Antona, Guillem Ramirez-Miranda, Enric Guinovart, Eduard Alarcon

发表机构 * Avatar Cognition（Avatar认知）

AI总结提出一种基于原语的无监督学习方法，通过构建分布式层次结构表示输入空间，在分类任务上超越现有最先进方法，并展现出类似认知的行为。

详情

AI中文摘要

无监督学习方法在认知模型中具有软启发。迄今为止，最成功的无监督学习方法主要围绕在数学空间中对样本进行聚类。在本文中，我们提出了一种基于原语的无监督学习方法，用于决策制定，该方法受一种新颖的认知框架启发。这种以表示为中心的方法以输入无关的方式，将输入空间建设性地建模为分布式层次结构。我们将我们的方法与当前最先进的无监督学习分类、当前最先进的小规模和不完整数据集分类以及当前最先进的癌症类型分类进行了比较。我们展示了我们的方法如何超越先前的最先进技术。我们还评估了我们方法的一些类似认知的特性，在这些特性中，它不仅优于比较的算法（甚至包括监督学习算法），而且表现出不同的、更类似于认知的行为。

英文摘要

Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods revolve around clustering samples in a mathematical space. In this paper we propose a primitive-based, unsupervised learning approach for decision-making inspired by a novel cognition framework. This representation-centric approach models the input space constructively as a distributed hierarchical structure in an input-agnostic way. We compared our approach with both current state-of-the-art unsupervised learning classification, with current state-of-the-art small and incomplete datasets classification, and with current state-of-the-art cancer type classification. We show how our proposal outperforms previous state-of-the-art. We also evaluate some cognition-like properties of our proposal where it not only outperforms the compared algorithms (even supervised learning ones), but it also shows a different, more cognition-like, behaviour.

URL PDF HTML ☆

赞 0 踩 0

2411.15240 2026-06-02 cs.LG cs.AI cs.HC q-bio.QM 版本更新

A Foundation Model for Wearable Movement Data in Mental Health Research

心理健康研究中可穿戴运动数据的基础模型

Franklin Y. Ruan, Aiwei Zhang, Jenny Y. Oh, SouYoung Jin, Nicholas C. Jacobson

发表机构 * Dartmouth College（达特茅斯学院）； National Institute of Diabetes and Digestive and Kidney Diseases（国家糖尿病、消化系统疾病和肾病研究所）； National Institutes of Health（美国国立卫生研究院）； Department of Computer Science at Dartmouth College（达特茅斯学院计算机科学系）

AI总结提出预训练体动记录Transformer（PAT），一种基于自监督掩码自编码器预训练的可穿戴运动时间序列基础模型，在心理健康预测任务上优于非基础模型方法，并提供可解释的注意力图。

Comments F. Y. Ruan, A. Zhang, J. Y. Oh, S. Jin and N. C. Jacobson, "A Foundation Model for Wearable Movement Data in Mental Health Research," in IEEE Journal of Biomedical and Health Informatics, doi: 10.1109/JBHI.2026.3694809

详情

DOI: 10.1109/JBHI.2026.3694809

AI中文摘要

可穿戴运动数据由几乎所有市售智能手表收集，是心理健康研究的宝贵资源，反映了细粒度的时间行为趋势。尽管前景广阔，但与临床图像和文本分析相比，健康可穿戴建模的基础模型开发仍然有限。我们设计了带有补丁嵌入的Transformer，并在分钟级、持续一周的体动记录（身体活动强度测量）序列上使用自监督掩码自编码器预训练，以开发和评估预训练体动记录Transformer（PAT）。PAT是一个用于可穿戴运动时间序列的开源基础模型，结合了长达一周的时间建模、精神科结果评估以及在公共数据上的可重复性。在来自美国国家健康与营养调查（NHANES）的全国代表性队列中21,538名参与者的数据上预训练，PAT在心理健康预测任务（包括苯二氮卓类药物和SSRI使用、抑郁症和睡眠异常）中始终优于非基础模型基线。在苯二氮卓类药物使用预测任务中，PAT相比常用于时间序列建模的非基础深度学习模型表现出最大改进（即比LSTM提高55.6%，比一维CNN提高21.4%，比ConvLSTM提高14.8%）。除了预测准确性，PAT还提供可解释的注意力图，突出对临床预测最重要的日常活动特定时段，提供模型透明度和潜在临床见解。结果表明，PAT为研究人员和临床医生提供了一种易于部署、适应性强且可扩展的解决方案，以从可穿戴传感器数据中推进临床见解。GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/

英文摘要

Wearable movement data is collected by nearly all commercially available smartwatches and is a valuable resource for mental health research, reflecting fine-grained temporal behavioral trends. Despite its promise, the development of foundation models for health wearable modeling remains limited when compared to clinical image and text analysis. We designed transformers with patch embeddings and used self-supervised masked autoencoder pretraining on minute-level week-long actigraphy (physical activity intensity measurement) sequences to develop and evaluate the Pretrained Actigraphy Transformer (PAT). PAT is an open-source foundation model for wearable movement time series that combines week-long temporal modeling, psychiatric outcome evaluation, and reproducibility on public data. Pretrained on data from 21,538 U.S. participants in a nationally representative cohort from the National Health and Nutrition Examination Survey (NHANES), PAT consistently outperformed non-foundation-model baselines across mental health prediction tasks-including benzodiazepine and SSRI use, depression, and sleep abnormalities. During the benzodiazepine medication usage prediction task, PAT demonstrated the largest improvement over non-foundational deep learning models commonly used for time-series modeling (i.e., 55.6% improvement over the LSTM, 21.4% improvement over the 1-D CNN, 14.8% improvement over the ConvLSTM). Beyond predictive accuracy, PAT provides interpretable attention maps highlighting specific periods of daily activity most important for clinical predictions, offering model transparency and potential clinical insights. The results suggest that PAT offers an easy-to-deploy, adaptable and scalable solution to advance clinical insight from wearable sensor data for researchers and clinicians. GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/

URL PDF HTML ☆

赞 0 踩 0

2502.08884 2026-06-02 cs.CV cs.AI cs.GR 版本更新

ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models

ShapeLib: 利用大型语言模型设计程序化3D形状抽象库

R. Kenny Jones, Paul Guerrero, Niloy J. Mitra, Daniel Ritchie

发表机构 * Stanford University（斯坦福大学）； Adobe Research（Adobe研究）； University College London（伦敦大学学院）； Brown University（布朗大学）

AI总结提出ShapeLib方法，利用大型语言模型的先验知识，通过引导式工作流自动设计可泛化的程序化3D形状抽象库，并支持下游形状编辑与生成。

详情

AI中文摘要

我们提出ShapeLib，这是第一个利用大型语言模型（LLM）的先验知识来设计程序化3D形状抽象库的方法。我们的系统接受两种形式的用户提供的设计意图：输出库中应包含的功能的高级文本描述，以及一小部分示例形状的种子集。我们通过引导式LLM工作流发现与设计意图匹配的抽象库，该工作流首先提出应用和实现功能的不同方式，然后验证这些功能有助于表示种子集形状。为了扩展到种子集之外，我们开发了特定于库的识别网络，将形状（表示为基元、体素或点云）映射到使用这些新发现的抽象的程序。跨多个建模领域（按形状类别划分），我们发现，当LLM与几何推理深思熟虑地结合时，可以引导它们编写出能跨形状分布泛化的抽象函数库。我们的框架朝着实现长期以来的形状分析愿望迈出了一步，即发现可重用的、程序化的形状抽象，同时暴露可解释的、语义对齐的接口。我们的广泛评估表明，ShapeLib在泛化性、可用性和在操作下保持合理性方面，优于先前的替代抽象发现方法。最后，我们展示了ShapeLib的抽象函数解锁了多个下游应用，将LLM对形状程序的推理与几何处理工具相结合，以支持形状编辑和生成工作流。

英文摘要

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abstractions. Our system accepts two forms of user-provided design intent: high-level text descriptions of functions to include in the output library and a small seed set of exemplar shapes. We discover a library of abstractions that matches this design intent with a guided LLM workflow that first proposes different ways of applying and implementing functions, and then validates these functions are helpful in representing seed set shapes. To extend beyond the seed set, we develop library-specific recognition networks that map shapes (represented as primitives, voxels, or point clouds) to programs that use these newly discovered abstractions. Across multiple modeling domains (split by shape category), we find that LLMs, when thoughtfully combined with geometric reasoning, can be guided to author libraries of abstraction functions that generalize across shape distributions. Our framework takes a step towards realizing the long-standing shape analysis aspiration of discovering reusable, programmatic shape abstractions while exposing interpretable, semantically aligned interfaces. Our extensive evaluation demonstrates that ShapeLib provides distinct advantages over prior alternative abstraction discovery works in terms of generalization, usability, and maintaining plausibility under manipulation. Finally, we demonstrate that ShapeLib's abstraction functions unlock a number of downstream applications, combining LLM reasoning over shape programs with geometry processing tools to support shape editing and generation workflows.

URL PDF HTML ☆

赞 0 踩 0

2506.05387 2026-06-02 cs.CL cs.AI 版本更新

Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

推进解码策略：局部典型采样在大型语言模型中的增强

Jaydip Sen, Saptarshi Sengupta, Subhasis Dasgupta

发表机构 * Praxis Business School（普拉克斯商业学校）； San Jose State University（旧金山州立大学）

AI总结提出自适应语义感知典型性采样（ASTS）改进局部典型采样算法，通过动态熵阈值、多目标评分和奖惩调整，在保持计算效率的同时提升文本生成的流畅性、多样性和连贯性。

Comments This is the accepted but pre-reviewed version of the chapter that has been accepted for publication in the Springer volume 'Decision-Making in Computational Intelligence-Based Systems,' edited by Witold Pedrycz, Gilberto Rivera, Rose Ma Rodriguez, and Salvador Ibarra Martinez. The chapter is 39 pages long, and it contains 2 figures and 6 tables. This is NOT the final camera-ready version

详情

DOI: 10.1007/978-3-032-13497-4_17
Journal ref: Recent Advances in Artificial Neural Networks. Intelligent Systems Reference Library, vol 283. Springer, Cham, 2026

AI中文摘要

本章探讨了大型语言模型（LLMs）解码策略的进展，重点关注增强局部典型采样（LTS）算法。传统的解码方法，如top-k和核采样，通常在文本生成中难以平衡流畅性、多样性和连贯性。为应对这些挑战，提出了自适应语义感知典型性采样（ASTS）作为LTS的改进版本，融合了动态熵阈值、多目标评分和奖惩调整。ASTS在保持计算效率的同时，确保上下文连贯且多样的文本生成。其性能在多个基准测试中进行了评估，包括故事生成和抽象摘要，使用了困惑度、MAUVE和多样性分数等指标。实验结果表明，ASTS通过减少重复、增强语义对齐和提高流畅性，优于现有的采样技术。

英文摘要

This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.

URL PDF HTML ☆

赞 0 踩 0

2506.08137 2026-06-02 cs.CV cs.AI 版本更新

IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

IGraSS: 通过迭代图约束语义分割从卫星图像中识别基础设施网络

Oishee Bintey Hoque, Abhijin Adiga, Aniruddha Adiga, Siddharth Chaudhary, Madhav V. Marathe, S. S. Ravi, Kirti Rajagopalan, Amanda Wilson, Samarth Swarup

发表机构 * Biocomplexity Institute, University of Virginia（弗吉尼亚大学生物复杂性研究所）； Department of Computer Science, University of Virginia（弗吉尼亚大学计算机科学系）； Department Biomedical Systems Engineering, Washington State University（华盛顿州立大学生物医学系统工程系）； Earth System Science Center, University of Alabama in Huntsville（阿拉巴马大学亨茨维尔分校地球系统科学中心）

AI总结提出IGraSS迭代框架，结合语义分割与图约束优化，将不可达运河段从18%降至3%，并提升道路网络完整性。

详情

DOI: 10.24963/ijcai.2025/1076

AI中文摘要

精确的运河网络制图对于水资源管理（包括灌溉规划和基础设施维护）至关重要。最先进的基础设施制图语义分割模型（如道路）依赖于大规模、良好标注的遥感数据集。然而，不完整或不充分的真实标注会阻碍这些学习方法。许多基础设施网络具有图级属性，如可达性（运河）或连通性（道路），可用于改进现有真实标注。本文开发了一种新颖的迭代框架IGraSS，将结合RGB和额外模态（NDWI、DEM）的语义分割模块与基于图的真实标注精化模块相结合。分割模块处理卫星图像块，而精化模块将基础设施网络视为图，在整个数据上运行。实验表明，IGraSS将不可达运河段从约18%降至3%，并且使用精化后的真实标注进行训练显著改善了运河识别。IGraSS是一个鲁棒的框架，既可用于精化噪声真实标注，也可用于从遥感影像中绘制运河网络。我们还以道路网络为例，应用不同的图论约束来完善道路网络，证明了IGraSS的有效性和泛化能力。

英文摘要

Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties such as reachability to a source (like canals) or connectivity (roads) that can be leveraged to improve these existing ground truth. This paper develops a novel iterative framework IGraSS, combining a semantic segmentation module-incorporating RGB and additional modalities (NDWI, DEM)-with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.

URL PDF HTML ☆

赞 0 踩 0

2505.19489 2026-06-02 cs.AI cs.SE 版本更新

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults

驯服系统复杂性：揭秘软件工程代理在诊断Linux内核故障中的作用

Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou

发表机构 * Fudan University（复旦大学）； Nanyang Technological University（南洋理工大学）

AI总结针对Linux内核故障定位挑战，提出LinuxFLBench基准和LinuxFL$^+$增强框架，将现有LLM代理的文件级top-1准确率从41.6%提升7.2%-11.2%。

Comments Accepted to ACL 2026

详情

AI中文摘要

Linux内核是一个关键系统，作为众多系统的基础。Linux内核中的错误可能导致严重后果，影响数十亿用户。故障定位（FL）旨在识别软件中的错误代码元素，在软件质量保证中起着至关重要的作用。虽然最近的LLM代理在SWE-bench等最新基准测试中取得了有希望的FL准确率，但目前尚不清楚这些方法在Linux内核中的表现如何，因为由于大规模代码库、有限的可观测性和多样的影响因素，FL在该领域更具挑战性。在本文中，我们介绍了LinuxFLBench，这是一个从真实世界Linux内核错误构建的FL基准。我们进行了一项实证研究，以评估最先进的LLM代理在Linux内核上的性能。我们的初步结果显示，现有代理在此任务上表现不佳，文件级最佳top-1准确率仅为41.6%。为应对这一挑战，我们提出了LinuxFL$^+$，一个旨在提高LLM代理在Linux内核中FL有效性的增强框架。LinuxFL$^+$以最小的成本显著提高了所有研究代理的FL准确率（例如，准确率提升7.2% - 11.2%）。

英文摘要

The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs.

URL PDF HTML ☆

赞 0 踩 0

2505.13273 2026-06-02 cs.AI cs.LG 版本更新

MPEC：通过集成基于聚类的分类器实现流形保持的脑电图分类

Shermin Shahbazi, Mohammad-Reza Nasiri, Majid Ramezani

发表机构 * Department of Electrical and Computer（电气与计算机系）； Department of Computer Science and Engineering, Information Technology（计算机科学与工程系，信息科技）

AI总结提出MPEC方法，通过协方差矩阵和RBF核的特征工程以及黎曼流形上的改进K-means聚类集成，解决EEG信号的非欧几里得流形结构问题，在BCI Competition IV数据集2a上取得显著提升。

Comments 7 pages ,3 figures

详情

DOI: 10.1109/CSICC65765.2025.10967471

AI中文摘要

脑电图信号的准确分类对于脑机接口（BCI）和神经假体应用至关重要，然而许多现有方法未能考虑EEG数据的非欧几里得流形结构，导致性能欠佳。保留这种流形信息对于捕捉EEG信号的真实几何结构至关重要，但传统分类技术在很大程度上忽视了这一需求。为此，我们提出了MPEC（通过集成基于聚类的分类器实现流形保持的EEG分类），它引入了两项关键创新：（1）一个特征工程阶段，结合协方差矩阵和径向基函数（RBF）核来捕捉EEG通道之间的线性和非线性关系；（2）一个聚类阶段，采用针对黎曼流形空间定制的改进K-means算法，确保局部几何敏感性。通过集成多个基于聚类的分类器，MPEC取得了优越的结果，并在BCI Competition IV数据集2a上得到了显著改进的验证。

英文摘要

Accurate classification of EEG signals is crucial for brain-computer interfaces (BCIs) and neuroprosthetic applications, yet many existing methods fail to account for the non-Euclidean, manifold structure of EEG data, resulting in suboptimal performance. Preserving this manifold information is essential to capture the true geometry of EEG signals, but traditional classification techniques largely overlook this need. To this end, we propose MPEC (Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers), that introduces two key innovations: (1) a feature engineering phase that combines covariance matrices and Radial Basis Function (RBF) kernels to capture both linear and non-linear relationships among EEG channels, and (2) a clustering phase that employs a modified K-means algorithm tailored for the Riemannian manifold space, ensuring local geometric sensitivity. Ensembling multiple clustering-based classifiers, MPEC achieves superior results, validated by significant improvements on the BCI Competition IV dataset 2a.

URL PDF HTML ☆

赞 0 踩 0

2504.17471 2026-06-02 cs.LG cs.AI cs.DC 版本更新

GRANITE : a Byzantine-Resilient Dynamic Gossip Learning Framework

GRANITE：一种拜占庭鲁棒的动态八卦学习框架

Yacine Belal, Mohamed Maouche, Sonia Ben Mokhtar

发表机构 * CEA, List, Université Paris-Saclay Palaiseau（CEA、List、巴黎-萨克雷大学帕莱索分校）； INRIA, INSA Lyon, CITI, UR3720（INRIA、里昂INSA、CITI、UR3720）； LIRIS, INSA Lyon, CNRS Lyon（LIRIS、里昂INSA、里昂CNRS）

AI总结针对动态八卦学习中拜占庭节点通过毒化模型和操纵节点采样发起的双重攻击，提出GRANITE框架，通过累积节点标识知识并动态调整聚合阈值，实现鲁棒学习，理论证明拜占庭节点在局部邻域呈指数衰减，实验表明在30%拜占庭节点下精度接近非拜占庭场景，且通信成本降低9倍。

详情

AI中文摘要

八卦学习是一种去中心化的学习范式，用户通过迭代地与少量邻居节点交换和聚合模型。最近的方法依赖于使用随机节点采样协议构建的动态通信图，这些协议已被证明可以加速收敛。然而，我们表明这些方法容易受到双重攻击：拜占庭节点可以毒化模型并操纵节点采样以放大其影响力。我们通过GRANITE框架应对这种组合威胁，该框架用于在存在拜占庭节点的稀疏动态图上进行鲁棒学习。GRANITE随时间累积关于遇到的节点标识的知识，并根据每个节点邻域中估计的拜占庭密度动态调整局部聚合阈值。我们证明，在GRANITE下，局部邻域中的拜占庭节点呈现指数衰减。我们进一步推导了GRANITE生成图的鲁棒性条件。实验结果表明，在30%拜占庭节点下，GRANITE的收敛精度在非拜占庭精度的5%以内，收敛速度更快，且通信成本降低高达9倍。

英文摘要

Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent approaches rely on dynamic communication graphs built using Random Peer Sampling (RPS) protocols which have been proven to accelerate convergence. However, we show that these approaches are vulnerable to a dual attack: Byzantine nodes can poison models and manipulate peer sampling to amplify their influence. We address this combination of threats with GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of Byzantine nodes. GRANITE accumulates knowledge about encountered node identifiers over time and dynamically adjusts local aggregation thresholds based on estimated Byzantine density in the neighbourhood of each node. We demonstrate that under GRANITE, the Byzantine presence in local neighborhoods exhibits an exponential decay. We further derive the robustness conditions of the graphs generated by GRANITE. Empirically, our results indicate that GRANITE converges within 5% of non-Byzantine accuracy under 30% Byzantines nodes, offers faster convergence and operates on graphs with up to 9x lower communication cost.

URL PDF HTML ☆

赞 0 踩 0

2504.16139 2026-06-02 cs.CY cs.AI 版本更新

Enhancing Trust Through Standards: A Comparative Risk-Impact Framework for Aligning ISO AI Standards with Global Ethical and Regulatory Contexts

通过标准增强信任：一种用于对齐ISO AI标准与全球伦理和监管背景的比较风险-影响框架

Sridharan Sankaran

发表机构 * Research and Innovation Group（研究与创新组）； Tata Consultancy Services（塔塔咨询服务）

AI总结提出比较风险-影响评估框架，分析ISO AI标准在不同监管环境下的伦理风险覆盖情况，并建议通过强制审计、区域附件和隐私模块增强其全球适用性。

详情

DOI: 10.1109/ACDSA65407.2025.11166403

AI中文摘要

随着人工智能重塑行业和社会，确保其可信赖性——通过减轻偏见、不透明性和问责缺陷等伦理风险——仍然是一个全球性挑战。国际标准化组织（ISO）的AI标准，如ISO/IEC 24027和24368，旨在通过将公平性、透明度和风险管理嵌入AI系统来促进负责任的发展。然而，它们的有效性在不同监管环境中存在差异，从欧盟基于风险的AI法案到中国注重稳定的措施以及美国分散的州级举措。本文引入了一种新颖的比较风险-影响评估框架，以评估ISO标准在这些背景下如何应对伦理风险，并提出增强其全球适用性的改进建议。通过将ISO标准映射到欧盟AI法案，并调查十个地区（包括英国、加拿大、印度、日本、新加坡、韩国和巴西）的监管框架，我们建立了伦理对齐的基线。该框架应用于欧盟、美国科罗拉多州和中国的案例研究，揭示了差距：自愿性ISO标准在执行方面不足（例如科罗拉多州），并且低估了区域特定风险（如中国的隐私）。我们建议强制风险审计、区域特定附录和以隐私为重点的模块，以增强ISO的适应性。这种方法不仅综合了全球趋势，还提供了一种可复制的工具，用于将标准化与伦理要求对齐，促进全球AI的互操作性和信任。政策制定者和标准机构可以利用这些见解来发展AI治理，确保随着技术发展满足多样化的社会需求。

英文摘要

As artificial intelligence (AI) reshapes industries and societies, ensuring its trustworthiness-through mitigating ethical risks like bias, opacity, and accountability deficits-remains a global challenge. International Organization for Standardization (ISO) AI standards, such as ISO/IEC 24027 and 24368, aim to foster responsible development by embedding fairness, transparency, and risk management into AI systems. However, their effectiveness varies across diverse regulatory landscapes, from the EU's risk-based AI Act to China's stability-focused measures and the U.S.'s fragmented state-led initiatives. This paper introduces a novel Comparative Risk-Impact Assessment Framework to evaluate how well ISO standards address ethical risks within these contexts, proposing enhancements to strengthen their global applicability. By mapping ISO standards to the EU AI Act and surveying regulatory frameworks in ten regions-including the UK, Canada, India, Japan, Singapore, South Korea, and Brazil-we establish a baseline for ethical alignment. The framework, applied to case studies in the EU, US-Colorado, and China, reveals gaps: voluntary ISO standards falter in enforcement (e.g., Colorado) and undervalue region-specific risks like privacy (China). We recommend mandatory risk audits, region-specific annexes, and a privacy-focused module to enhance ISO's adaptability. This approach not only synthesizes global trends but also offers a replicable tool for aligning standardization with ethical imperatives, fostering interoperability and trust in AI worldwide. Policymakers and standards bodies can leverage these insights to evolve AI governance, ensuring it meets diverse societal needs as the technology advances.

URL PDF HTML ☆

赞 0 踩 0

2406.09953 2026-06-02 cs.RO cs.AI 版本更新

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

DAG-Plan：生成有向无环依赖图用于双臂协作规划

Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Shijia Peng, Chengkai Hou, Lingyue Guo, Ping Luo, Shanghang Zhang, Yanfeng Lu

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA)（多模态人工智能系统国家重点实验室，自动化研究所，中国科学院（CASIA））； School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)（中国科学院大学人工智能学院）； School of Computer Science, Shanghai Jiao Tong University（上海交通大学计算机科学学院）； State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University（多媒体信息处理国家重点实验室，北京大学计算机科学学院）； Department of Computer Science, The University of Hong Kong（香港大学计算机科学系）； OpenGVLab, Shanghai AI Laboratory（上海人工智能实验室，OpenGVLab）

AI总结提出DAG-Plan框架，首次使用有向无环图作为双臂协调的核心表示，通过一次LLM解析生成结构化DAG，实现自适应并行执行，在双臂厨房基准测试中成功率提升48%，执行效率提升84.1%。

Comments ICRA 2026

详情

AI中文摘要

双臂机器人有望提高效率，但需要规划具有非线性子任务依赖关系的复杂任务。当前使用大型语言模型（LLM）的方法存在根本性权衡：生成线性序列效率高但无法建模并行性和适应变化，而迭代查询具有适应性但过于缓慢且成本高昂。为弥合这一差距，我们引入DAG-Plan，一种新颖的任务规划框架，首次采用有向无环图（DAG）作为双臂协调的核心表示。关键洞察在于DAG天然捕获复杂的子任务依赖关系并明确揭示并行执行的机会。在该框架内，LLM仅被使用一次作为强大的语义解析器，将自然语言指令转换为结构化的DAG。在执行过程中，我们的系统基于实时环境观察动态地将候选节点分配给合适的机械臂，实现真正的自适应并行操作。在双臂厨房基准测试上的广泛评估表明，DAG-Plan的结构化方法从根本上优于现有范式。与单查询线性序列方法相比，通过稳健管理依赖关系，成功率提高了48%；与迭代查询方法相比，通过消除重复LLM调用的延迟，执行效率提高了84.1%。我们的工作表明，基于图的原则性表示是解锁高效可靠的基于LLM的复杂机器人系统规划的关键。更多演示和代码请访问 https://sites.google.com/view/dag-plan。

英文摘要

Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods using Large Language Models (LLMs) suffer from a fundamental trade-off: generating linear sequences is efficient but fails to model parallelism and adapt to changes, while iterative querying is adaptive but too slow and costly. To bridge this gap, we introduce DAG-Plan, a novel task planning framework that for the first time employs a Directed Acyclic Graph (DAG) as the central representation for dual-arm coordination. The key insight is that a DAG natively captures complex sub-task dependencies and explicitly reveals opportunities for parallel execution. Within this framework, an LLM is used only once as a powerful semantic parser to translate a natural language instruction into a structured DAG. During execution, our system dynamically assigns candidate nodes to the suitable arm based on real-time environmental observations, enabling truly adaptive and parallel operation. Extensive evaluation on a dual-arm kitchen benchmark shows that DAG-Plan's structured approach fundamentally outperforms existing paradigms. It achieves a 48% higher success rate than single-query linear sequence methods with dual arm by robustly managing dependencies, and an 84.1% higher execution efficiency than iterative querying methods by eliminating the latency of repeated LLM calls. Our work demonstrates that a principled, graph-based representation is the key to unlocking efficient and reliable LLM-based planning for complex robotic systems. More demos and code are available on https://sites.google.com/view/dag-plan.

URL PDF HTML ☆

赞 0 踩 0

2504.04718 2026-06-02 cs.CL cs.AI 版本更新

T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models

T1：小语言模型测试时计算扩展的工具集成验证

Minki Kang, Jongwon Jeong, Jaewoong Cho

发表机构 * KAIST（韩国科学技术院）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； KRAFTON

AI总结针对小语言模型在测试时扩展中验证能力不足的问题，提出T1框架，通过外部工具过滤候选输出后由小语言模型进行最终验证，显著提升验证准确率和测试时扩展性能。

Comments ICLR 2026

详情

AI中文摘要

近期研究表明，测试时计算扩展能有效提升小语言模型（sLMs）的性能。然而，先前研究主要利用额外的大模型作为验证器进行测试时计算扩展，而sLMs自身的验证能力尚未被充分探索。本文研究sLMs在测试时扩展中能否可靠地验证输出候选。我们发现，即使从大验证器进行知识蒸馏，sLMs在需要记忆的任务（如数值计算和事实核查）上仍表现不佳。为解决这一局限，我们提出工具集成验证（T1），这是一个两阶段框架：首先用外部工具过滤候选，然后使用sLM进行最终验证，将记忆密集型步骤卸载到代码解释器等工具上。在T1框架内，我们证明卸载到外部工具可减轻sLMs的记忆负担，并提升测试时扩展性能。在MATH基准上的实验表明，采用T1的Llama-3.2 1B模型在测试时扩展下性能优于规模更大的Llama-3.1 8B模型。此外，T1提高了过程奖励模型（PRMs）和评论家模型的验证准确率。我们的发现凸显了工具集成在显著提升sLMs验证能力方面的潜力。

英文摘要

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably verify the output candidates under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated verification (T1), a two-stage framework that first filters candidates with external tools and then uses an sLM for final verification, offloading memorization-heavy steps to tools such as a code interpreter. Within T1, we prove that offloading to external tools reduces the memorization burden on sLMs and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 improves the verification accuracy of both process reward models (PRMs) and critic models. Our findings highlight the potential of tool integration to substantially improve the verification abilities of sLMs.

URL PDF HTML ☆

赞 0 踩 0

2503.05500 2026-06-02 cs.CL cs.AI 版本更新

EuroBERT: Scaling Multilingual Encoders for European Languages

EuroBERT：面向欧洲语言的多语言编码器扩展

Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo

发表机构 * Artefact Research Center（Artfact研究中心）； CNRS（法国国家科学研究中心）； ISIA Lab（ISIA实验室）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出EuroBERT系列多语言编码器，通过整合生成式模型的最新进展，在检索、回归和分类等任务上超越现有模型，并原生支持长达8192个token的序列。

Comments 28 pages, 8 figures, 13 tables

详情

AI中文摘要

用于检索、回归和分类的通用多语言向量表示传统上来自双向编码器模型。尽管应用广泛，但编码器最近被生成式仅解码器模型的进步所掩盖。然而，推动这一进展的许多创新并非解码器所独有。在本文中，我们通过这些进展的视角重新审视多语言编码器的发展，并介绍EuroBERT，一个覆盖欧洲及全球广泛使用语言的多语言编码器家族。我们的模型在包括多语言能力、数学和编码在内的多种任务上优于现有替代方案，并原生支持长达8192个token的序列。我们还研究了EuroBERT背后的设计决策，提供了关于数据集组成和训练流程的见解。我们公开发布EuroBERT模型，包括中间训练检查点以及我们的训练框架。

英文摘要

General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidirectional encoder models. Despite their wide applicability, encoders have been recently overshadowed by advances in generative decoder-only models. However, many innovations driving this progress are not inherently tied to decoders. In this paper, we revisit the development of multilingual encoders through the lens of these advances, and introduce EuroBERT, a family of multilingual encoders covering European and widely spoken global languages. Our models outperform existing alternatives across a diverse range of tasks, spanning multilingual capabilities, mathematics, and coding, and natively supporting sequences of up to 8,192 tokens. We also examine the design decisions behind EuroBERT, offering insights into our dataset composition and training pipeline. We publicly release the EuroBERT models, including intermediate training checkpoints, together with our training framework.

URL PDF HTML ☆

赞 0 踩 0

2503.15639 2026-06-02 cs.CV cs.AI 版本更新

A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition

一种轻量级上下文驱动的免训练网络用于场景文本分割与识别

Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu

发表机构 * CVPR Unit, Indian Statistical Institute, Kolkata, India（印度统计研究所柯西拉分校CVPR单位）； Manipal University Jaipur, India（印度贾浦尔曼普尔大学）； University of Salford, UK（英国萨尔福德大学）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出一种基于上下文理解、无需训练的即插即用框架，通过注意力分割和语义评估实现高效场景文本识别，性能与SOTA相当且资源消耗更低。

Comments Accepted at ICDAR 2025 (ORAL) 21 pages, 8 figures, 7 tables

详情

AI中文摘要

现代场景文本识别系统通常依赖于大型端到端架构，这些架构需要大量训练，并且对于实时场景来说成本过高。在这种情况下，由于内存、计算资源和延迟的限制，部署重型模型变得不切实际。为了应对这些挑战，我们提出了一种新颖的、无需训练的即插即用框架，该框架利用预训练文本识别器的优势，同时最小化冗余计算。我们的方法使用基于上下文的理解，并引入了一个基于注意力的分割阶段，该阶段在像素级别细化候选文本区域，从而改进下游识别。我们不执行传统的文本检测（即特征图与源图像之间的块级比较），而是利用预训练的标题生成器来利用上下文信息，使框架能够直接从场景上下文生成单词预测。候选文本经过语义和词汇评估以获得最终分数。达到或超过预定义置信度阈值的预测绕过更重的端到端文本STR（场景文本识别）流程，确保更快的推理并减少不必要的计算。在公共基准上的实验表明，我们的范式实现了与最先进系统相当的性能，但所需资源大大减少。我们的代码可在此处找到：https://ritabrata04.github.io/Context-driven-STR/。

英文摘要

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection that follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context.Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.Our code can be found here: https://ritabrata04.github.io/Context-driven-STR/.

URL PDF HTML ☆

赞 0 踩 0

2503.07154 2026-06-02 cs.LG cs.AI 版本更新

Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

推理时缩放的思想可以有益于生成式预训练算法

Jiaming Song, Linqi Zhou

发表机构 * Luma AI

AI总结本文指出自回归模型与扩散模型的二分法是错误的，提出应从推理过程（序列扩展与状态细化）出发设计训练目标，并论证了推理算法优先于训练目标的原则。

Comments updated some new literature on flow maps and continuous LLMs

详情

AI中文摘要

生成式预训练通常被框定在一个错误的二分法中：用于离散信号的自回归模型与用于连续信号的扩散模型。我们认为这种二分法是错误的，因为它混淆了模型家族、数据表示、训练目标和推理过程。自回归是一种通过归一化条件采样扩展序列的推理过程，而扩散是一种反复修正现有状态的细化过程。因此，更有用的对比不是自回归与扩散，而是用交叉熵学习的离散标记与用扩散风格目标学习的连续标记，以及用于从中采样的推理算法。从这个角度来看，算法进展应优先考虑推理时间效率的两个维度：序列扩展和状态细化。我们主张在训练目标之前设计推理过程，因为如果推理映射省略了必要参数或施加了错误分解，训练方法无法弥补。我们通过DDIM风格采样器的目标时间限制、多标记预测的联合分布限制，以及直接参数化长距离推理移动的最新流映射和少步蒸馏方法来说明这一原则。

英文摘要

Generative pre-training is often framed through a false dichotomy between autoregressive models for discrete signals and diffusion models for continuous signals. We argue that the dichotomy is false because it conflates model family, data representation, training objective, and inference procedure. Autoregression is an inference procedure that expands a sequence through normalized conditional draws, while diffusion is a refinement procedure that repeatedly revises an existing state. The more useful contrast is therefore not autoregressive versus diffusion, but discrete tokens learned with cross-entropy versus continuous tokens learned with diffusion-style objectives, together with the inference algorithms used to sample from them. From this perspective, algorithmic progress should prioritize inference-time efficiency along two axes: sequence expansion and state refinement. We advocate designing the inference procedure before the training objective, because a training method cannot compensate for an inference map that omits necessary arguments or imposes an incorrect factorization. We illustrate this principle through a target-time limitation of DDIM-style samplers, a joint-distribution limitation of multi-token prediction, and recent flow-map and few-step distillation methods that directly parameterize long-range inference moves.

URL PDF HTML ☆

赞 0 踩 0

2503.06136 2026-06-02 cs.CV cs.AI 版本更新

GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

GSV3D: 基于高斯溅射的几何蒸馏与稳定视频扩散用于单图像3D物体生成

Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems, Beihang University（虚拟现实技术与系统国家重点实验室，北京航空航天大学）； SenseTime Research（商汤研究）； PBVR

AI总结提出一种结合2D扩散模型隐式3D推理能力与高斯溅射几何蒸馏的方法，通过高斯溅射解码器将SV3D潜变量输出转换为显式3D表示，实现多视图一致性和高质量3D生成。

详情

DOI: 10.1109/iccv51701.2025.00727

AI中文摘要

基于图像的3D生成在机器人和游戏领域有广泛应用，其中高质量、多样化的输出和一致的3D表示至关重要。然而，现有方法存在局限性：3D扩散模型受限于数据集稀缺和缺乏强大的预训练先验，而基于2D扩散的方法则难以保证几何一致性。我们提出了一种方法，利用2D扩散模型的隐式3D推理能力，同时通过基于高斯溅射的几何蒸馏确保3D一致性。具体来说，所提出的高斯溅射解码器通过将SV3D潜变量输出转换为显式3D表示来强制3D一致性。与仅依赖隐式2D表示进行视频生成的SV3D不同，高斯溅射显式编码空间和外观属性，通过几何约束实现多视图一致性。这些约束纠正了视图不一致性，确保了稳健的几何一致性。因此，我们的方法同时生成高质量、多视图一致的图像和精确的3D模型，为基于单图像的3D生成提供了可扩展的解决方案，并弥合了2D扩散多样性与3D结构一致性之间的差距。实验结果表明，该方法在多个数据集上实现了最先进的多视图一致性和强泛化能力。代码将在接收后公开。

英文摘要

Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

URL PDF HTML ☆

赞 0 踩 0

2502.04512 2026-06-02 cs.AI 版本更新

Safety Must Precede the Deployment of Open-Ended AI

安全必须优先于开放式AI的部署

Ivaxi Sheth, Jan Wehner, Sahar Abdelnabi, Ruta Binkyte, Mario Fritz

发表机构 * CISPA-Helmholtz Center of Information Security（CISPA-海德堡信息安全中心）； MPI for Intelligent Systems, ELLIS Institute Tübingen, Tübingen AI Center（智能系统Max Planck研究所、图宾根ELLIS研究所、图宾根人工智能中心）

AI总结本文提出开放式AI系统因自主无限生成新行为而带来预测性丧失、新兴错位和控制困难等独特安全挑战，需在部署前主动研究，并给出挑战分类和研究方向。

Comments Accepted to ICML'26

详情

AI中文摘要

AI的进步在很大程度上由基础模型和好奇心驱动的学习共同推动，旨在提高能力和适应性。在此背景下，开放式（即AI智能体自主且无限地生成新行为、表示或解决方案）引起了越来越多的兴趣。这在自我进化智能体和长期发现的背景下变得相关。本文立场论文认为，开放式AI系统的定义特性引入了一类独特且未被充分探索的安全挑战，包括预测性丧失、新兴错位以及随着系统超出初始设计假设而难以维持有效控制，这些挑战必须被预先解决。这些挑战在性质上不同于与任务受限或静态模型相关的挑战，且不太可能仅通过现有安全框架解决，因此必须在大规模部署之前主动审视这些风险。论文提出了关键挑战的分类，讨论了研究机会，并呼吁采取协调行动以支持开放式AI的安全和负责任开发。

英文摘要

AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capability and adaptability. Within this landscape, open-endedness, where AI agents autonomously and indefinitely generate novel behaviors, representations, or solutions, has gained increasing interest. This has become relevant in the context of self-evolving agents and long-horizon discovery. This position paper argues that the defining properties of open-ended AI systems introduce a distinct and underexplored class of safety challenges, including loss of predictability, emergent misalignment, and difficulties in maintaining effective control as systems evolve beyond their initial design assumptions, that must be addressed preemptively. These challenges differ qualitatively from those associated with task-bounded or static models and are unlikely to be addressed by existing safety frameworks alone, which is why these risks must be examined proactively, before large-scale deployment. The paper proposes a taxonomy for key challenges, discusses research opportunities, and calls for coordinated action to support the safe and responsible development of open-ended AI.

URL PDF HTML ☆

赞 0 踩 0

2502.04646 2026-06-02 cs.LG cs.AI 版本更新

Efficient Weighted Sampling via Score-based Generative Models

基于分数生成模型的高效加权采样

Heasung Kim, Taekyun Lee, Hyeji Kim, Gustavo de Veciana

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出一种无需训练的加权采样框架，通过轻量级引导近似和不确定性感知调度器，在预训练分数生成模型上实现高效、稳定的采样，并在大规模设置中取得1.2至4.7倍加速。

Comments 37 pages

详情

AI中文摘要

加权采样——从与基概率密度函数和权重函数乘积成比例的概率密度函数中采样——是一种基础技术，在方差缩减、有偏采样、数据增强等领域有广泛应用。利用日益可用的预训练分数生成模型，我们提出了一种无需训练的加权采样框架，通过以原则性和计算高效的方式，用辅助引导项增强预训练基分数函数，来近似目标分布的逆向扩散过程。我们的方法基于两个关键组件：一个轻量级的引导近似，避免了分数函数和权重函数的高阶导数；以及一个不确定性感知调度器，基于近似误差的时间分析动态调整引导强度。这些组件共同实现了准确稳定的采样，无需依赖现有方法通常需要的基于粒子的重采样或Hessian评估。我们从合成设置到大规模设置（如Stable Diffusion XL）验证了方法的有效性，在该框架下，我们实现了1.2倍到4.7倍的加速，同时在任务性能上始终匹配或超越最先进的基线。这些结果使我们的方法成为生成应用中任务自适应、时间敏感采样的可扩展且推理高效的解决方案。

英文摘要

Weighted sampling -- sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function -- is a fundamental technique with wide-ranging applications in variance reduction, biased sampling, data augmentation, and more. Leveraging the increasing availability of pretrained score-based generative models (SGMs), we propose a training-free weighted sampling framework that approximates the backward diffusion process of the target distribution by augmenting the pretrained base score function with an auxiliary guidance term, in a principled and computationally efficient manner. Our approach builds on two key components: a lightweight approximation of the guidance that avoids costly higher-order derivatives of both the score and weight functions, and an uncertainty-aware scheduler that dynamically adjusts the guidance strength based on a temporal analysis of approximation error. Together, these components enable accurate and stable sampling without relying on particle-based resampling or Hessian evaluations commonly required by existing methods. We validate the effectiveness of our method from synthetic to large-scale settings such as Stable Diffusion XL, where our framework achieves $1.2\times$ to $4.7\times$ speedups while consistently matching or outperforming state-of-the-art baselines in task performance. These results position our method as a scalable and inference-efficient solution for task-adaptive, time-sensitive sampling in generative applications.

URL PDF HTML ☆

赞 0 踩 0

2111.03861 2026-06-02 cs.CV cs.AI cs.LG 版本更新

What augmentations are sensitive to hyper-parameters and why?

哪些数据增强对超参数敏感以及为什么？

Ch Muhammad Awais, Imad Eddine Ibrahim Bekkouch

发表机构 * Knowledge Representation Lab Innopolis University（知识表示实验室印尼奥利普斯大学）； Sorbonne Center for Artificial Intelligence - SCAI Sorbonne University（索邦人工智能中心 - SCAI 索邦大学）

AI总结本研究通过局部代理（LIME）解释和线性回归系数评估不同数据增强对模型超参数的敏感性、一致性和影响，发现某些增强对超参数高度敏感，而另一些则更稳健可靠。

Comments 10 pages, 17 figures

详情

DOI: 10.1007/978-3-031-10461-9_31
Journal ref: Intelligent Computing: Proceedings of the 2022 Computing Conference

AI中文摘要

我们对数据集应用增强以提高预测质量，并使最终模型对噪声数据和领域漂移更具鲁棒性。然而，问题仍然存在：这些增强在不同的超参数下表现如何？在本研究中，我们通过执行局部代理（LIME）解释来评估增强对模型超参数的敏感性、一致性和影响，当不同增强应用于机器学习模型时，解释超参数的影响。我们利用线性回归系数来加权每个增强。我们的研究证明，有些增强对超参数高度敏感，而其他增强则更具鲁棒性和可靠性。

英文摘要

We apply augmentations to our dataset to enhance the quality of our predictions and make our final models more resilient to noisy data and domain drifts. Yet the question remains, how are these augmentations going to perform with different hyper-parameters? In this study we evaluate the sensitivity of augmentations with regards to the model's hyper parameters along with their consistency and influence by performing a Local Surrogate (LIME) interpretation on the impact of hyper-parameters when different augmentations are applied to a machine learning model. We have utilized Linear regression coefficients for weighing each augmentation. Our research has proved that there are some augmentations which are highly sensitive to hyper-parameters and others which are more resilient and reliable.

URL PDF HTML ☆

赞 0 踩 0

2501.04424 2026-06-02 cs.AI cs.CL 版本更新

NSA: Neuro-symbolic ARC Challenge

NSA: 神经符号 ARC 挑战

Paweł Batorski, Jannik Brinkmann, Paul Swoboda

发表机构 * Heinrich Heine Universität Düsseldorf（杜伊斯堡-艾森大学）； University of Mannheim（曼海姆大学）

AI总结提出一种结合 transformer 提案生成与领域特定语言组合搜索的神经符号方法，在 ARC 评估集上超越现有最优方法 27%。

详情

DOI: 10.14428/esann/2026.ES2026-119
Journal ref: ESANN 2026 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 99-104, 2026

AI中文摘要

抽象与推理语料库 (ARC) 评估了机器学习模型和组合搜索方法都难以处理的通用推理能力。我们提出了一种神经符号方法，该方法结合了用于提案生成的 transformer 和使用领域特定语言的组合搜索。Transformer 通过提出有希望的搜索方向来缩小搜索空间，从而使组合搜索能够在短时间内找到实际解决方案。我们使用合成生成的数据预训练 transformer。在测试时，我们生成额外的任务特定训练任务并微调我们的模型。我们的结果在 ARC 评估集上比现有最优方法高出 27%，并且在 ARC 训练集上表现良好。我们在 https://github.com/Batorskq/NSA 公开了我们的代码和数据集。

英文摘要

The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in short time. We pre-train the trainsformer with synthetically generated data. During test-time we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.

URL PDF HTML ☆

赞 0 踩 0

2412.19419 2026-06-02 cs.LG cs.AI 版本更新

Introduction to Graph Neural Networks for Machine Learning Engineers

面向机器学习工程师的图神经网络导论

James H. Tanis, Chris Giannella, Adrian V. Mariano, Daoud Meerzaman

发表机构 * The MITRE Corporation（MITRE公司）； National Cancer Institute（国家癌症研究所）

AI总结本文通过编码器-解码器框架介绍图神经网络，并通过同质图上的理论和实验分析不同训练规模和复杂度下的行为，重点讨论过平滑和过挤压问题。

Comments Author accepted manuscript. Title and metadata updated to match the published ACM Computing Surveys version. 73 pages, including references and supplementary material

2411.17790 2026-06-02 cs.CV cs.AI 版本更新

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

基于潜在先验的自监督单目内窥镜深度与姿态估计

Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

发表机构 * University of Oxford（牛津大学）； University of Leeds（利兹大学）

AI总结提出一种结合生成潜在库和变分自编码器的自监督框架，通过自然图像深度先验和姿态潜在变量正则化，实现内窥镜复杂场景下的高精度深度与姿态估计。

详情

DOI: 10.1109/TMI.2026.3671423

AI中文摘要

内窥镜中的精确3D映射能够实现胃肠道（GI）内定量、整体的病变表征，这需要可靠的深度和姿态估计。然而，内窥镜系统是单目的，现有依赖合成数据集或复杂模型的方法在具有挑战性的内窥镜条件下往往缺乏泛化能力。我们提出了一种鲁棒的自监督单目深度和姿态估计框架，该框架结合了生成潜在库（Generative Latent Bank）和变分自编码器（VAE）。生成潜在库利用自然图像中的广泛深度场景来调节深度网络，通过潜在特征先验增强深度预测的真实感和鲁棒性。对于姿态估计，我们将其重新构建在VAE框架内，将姿态转换视为潜在变量以正则化尺度、稳定z轴突出性并提高x-y灵敏度。这种双重精炼流程能够实现精确的深度和姿态预测，有效应对胃肠道复杂的纹理和光照。在SimCol和EndoSLAM数据集上的广泛评估证实，我们的框架在内窥镜深度和姿态估计方面优于已发表的自监督方法。

英文摘要

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.

URL PDF HTML ☆

赞 0 踩 0

2411.11436 2026-06-02 cs.LG cs.AI 版本更新

Implicit Regularization for Multi-label Feature Selection

多标签特征选择的隐式正则化

Dou El Kefel Mansouri, Khalid Benabdeslem, Seif-Eddine Benkabou

AI总结针对多标签学习中的特征选择问题，提出一种基于隐式正则化和标签嵌入的估计器，通过Hadamard积参数化避免显式正则化项的额外偏差，实验表明该方法可减少偏差并可能导致良性过拟合。

Comments 14 pages, 11 figures, Submitted for publication and currently under review

2411.05196 2026-06-02 cs.AI cs.DL cs.LG 版本更新

Explainable AI Through a Democratic Lens: DhondtXAI for D'Hondt-Projected Feature Attribution

通过民主视角的可解释AI：用于D'Hondt投影特征归因的DhondtXAI

Turker Berk Donmez

发表机构 * Sakarya University of Applied Sciences（萨卡里亚应用科学大学）

AI总结提出DhondtXAI，一种基于D'Hondt规则的独立于SHAP的表格数据可解释性框架，通过计算背景干预移除效应、分离正负证据、形成特征联盟并分配席位，实现特征归因，在合成数据和医疗数据集上验证了其与SHAP的高度一致性。

详情

AI中文摘要

本研究提出DhondtXAI，作为一种独立于SHAP、基于D'Hondt的表格可解释AI归因框架。DhondtXAI不依赖于模型原生特征重要性或SHAP值，而是计算背景干预移除效应，分离正负证据，形成可选的特征联盟，应用可选的阈值，通过D'Hondt规则分配席位，并投影到局部模型输出差异上。通过构造保持完整性，投影残差比作为诊断指标报告。该方法在合成加性和交互测试、相关特征扰动、算子和分配消融、投影模式比较、logit尺度检查、重复分割验证、配对删除测试以及两个医疗数据集（威斯康星诊断乳腺癌（CatBoost）和早期糖尿病风险预测（XGBoost））上进行了评估。SHAP仅作为外部比较器，设置对齐。在加性合成数据中，DhondtXAI精确恢复真实排名；在乘法交互中，联盟将平均投影残差从0.2527降至0.0001。在WDBC和糖尿病数据上，与SHAP高度一致（Spearman rho分别为0.9273和0.9353），并通过进一步的符号、top-k、幅度、删除和敏感性分析得到支持。结果表明，DhondtXAI是一种互补的比例性、联盟感知和阈值感知的表格可解释AI方法，而非SHAP或LIME的替代品。

英文摘要

This study presents DhondtXAI as a SHAP-independent, D'Hondt-based attribution framework for tabular XAI. Instead of model-native feature importance or SHAP values, DhondtXAI computes background-interventional removal effects, separates positive and negative evidence, forms optional feature alliances, applies optional thresholds, allocates seats via the D'Hondt rule, and projects onto the local model-output difference. Completeness is preserved by construction, with the projection residual ratio reported as a diagnostic. The method is evaluated on synthetic additive and interaction tests, correlated-feature perturbations, operator and apportionment ablations, projection-mode comparisons, logit-scale checks, repeated split validation, paired deletion tests, and two healthcare datasets: Wisconsin Diagnostic Breast Cancer (CatBoost) and early-stage diabetes risk prediction (XGBoost). SHAP serves only as an external comparator with aligned settings. In additive synthetics, DhondtXAI exactly recovers ground-truth rankings; in multiplicative interactions, alliances reduce the mean projection residual from 0.2527 to 0.0001. On WDBC and diabetes data, it shows high agreement with SHAP (Spearman rho = 0.9273 and 0.9353), supported by further signed, top-k, magnitude, deletion, and sensitivity analyses. Results position DhondtXAI as a complementary proportional, alliance-aware, and threshold-aware tabular XAI method, not a replacement for SHAP or LIME.

URL PDF HTML ☆

赞 0 踩 0

2411.05359 2026-06-02 cs.CV cs.AI cs.CY 版本更新

Agricultural Landscape Understanding At Country-Scale

国家级农业景观理解

Radhika Dua, Aditi Agarwal, Aishwarya Jayagopal, Depanshu Sani, Alex Wilson, Hoang Tran, Ishan Deshpande, Bogdan Floristean, Neelabh Goyal, Ramya Cheruvu, Vishal Batchu, Yan Mayster, Gaurav Aggarwal, Alok Talekar, Vaibhav Rajan

发表机构 * Google DeepMind（谷歌深Mind）； Google（谷歌）

AI总结提出首个国家级农业制图系统，通过新颖的后处理启发式方法实现田地、树木和水体的实例分割，并在全国范围内部署验证。

Comments 32 pages, 11 tables, 22 figs

详情

AI中文摘要

全面的农业景观理解对于应对粮食安全、气候变化和资源管理等全球挑战至关重要。这不仅需要绘制农田地图，还需要绘制树木和水体等重要特征，这些特征在主导全球南方的复杂 extit{小农户}系统中形成了错综复杂的镶嵌结构。以往开发此类土地利用地图的努力受到限制，仅专注于田地划界的方法，并且没有开发出实际部署所必需的稳健后处理步骤。此外，据我们所知，之前没有针对小农户农场的系统在国家范围内进行部署和评估。本文通过提出首个国家级农业制图系统来解决这些局限性，该系统超越了简单的田地划界，能够对田地、树木和水体等农业实例进行分割。我们的系统通过新颖的后处理启发式方法进行了优化，以确保地图的一致性和准确性，并通过严格、多方面的评估过程进行了验证。我们系统生成的精细土地利用地图可通过API在 extit{\href{http://agri.withgoogle.com}{http://agri.withgoogle.com}}公开访问，支持从精准农业和政策制定到推进全球可持续发展目标的各种应用。

英文摘要

Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resource management. This requires mapping not just crop fields, but also vital features like trees and water bodies which form an intricate mosaic in complex \textit{smallholder} systems dominating the Global South. Previous efforts to develop such land use maps have been limited by a narrow focus on methods for field delineation only, and also do not develop robust post-processing steps essential for real-world deployment. Further, to our knowledge, no prior system for smallholder farms has been deployed and evaluated at a national scale. This work addresses these limitations by presenting the first national-scale agricultural mapping system that moves beyond simple field delineation to enable segmentation of agricultural instances like fields, trees and water bodies. Our system is refined for real-world application using novel post-processing heuristics to ensure map consistency and accuracy, and is validated through a rigorous, multi-faceted evaluation process. Fine-grained land use maps generated by our system are publicly accessible via an API at \textit{\href{http://agri.withgoogle.com}{http://agri.withgoogle.com}}, enabling a wide range of applications from precision agriculture and policy-making to advancing global sustainability development goals.

URL PDF HTML ☆

赞 0 踩 0

2410.02511 2026-06-02 cs.AI cs.MA 版本更新

Stop Wandering, Find the Keys: LLMs Discriminate Key States for Efficient Multi-Agent Exploration

停止徘徊，找到关键：LLMs 辨别关键状态以实现高效多智能体探索

Yun Qu, Boyuan Wang, Yuhang Jiang, Jianzhun Shao, Yixiu Mao, Heming Zou, Chang Liu, Cheems Wang, Meiqin Liu, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University, Beijing 100084, China（清华大学自动化系）； College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China（浙江大学电气工程学院）； National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an 710049, China（西安交通大学人机混合增强智能国家级重点实验室）

AI总结提出 LEMAE 方法，利用大语言模型辨别关键状态并设计子空间内在奖励和关键状态记忆树，引导多智能体高效探索，在 SMAC 和 MPE 基准上显著超越现有方法，实现 10 倍加速。

详情

DOI: 10.1007/s11432-025-4978-2
Journal ref: SCIENCE CHINA Information Sciences 2026

AI中文摘要

在具有广阔状态-动作空间的情况下，高效的多智能体探索仍然是强化学习中一个长期存在的挑战。尽管追求新颖性、多样性或不确定性吸引了越来越多的关注，但在没有适当指导选择的情况下进行探索所带来的冗余努力，给该领域带来了一个实际问题。本文介绍了一种系统方法，称为 LEMAE，它选择从知识渊博的大语言模型（LLM）中引导信息丰富的任务相关指导，以实现高效的多智能体探索。具体来说，我们将 LLM 的语言知识以判别性的方式、以较低的 LLM 推理成本，转化为对任务完成至关重要的符号化关键状态。为了释放关键状态的力量，我们设计了基于子空间的回顾性内在奖励（SHIR），通过增加奖励密度来引导智能体朝向关键状态。此外，我们构建了关键状态记忆树（KSMT），以跟踪特定任务中关键状态之间的转换，从而实现有组织的探索。得益于减少冗余探索，LEMAE 在具有挑战性的基准测试（例如 SMAC 和 MPE）上以较大优势超越了现有的最先进方法，在某些场景中实现了 10 倍的加速。

英文摘要

With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, redundant efforts brought by exploration without proper guidance choices poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, choosing to channel informative task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from LLM into symbolic key states, that are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminishing redundant explorations, LEMAE outperforms existing SOTA approaches on the challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.

URL PDF HTML ☆

赞 0 踩 0

2409.19310 2026-06-02 cs.CR cs.AI 版本更新

Model X-Ray: Detection of Hidden Malware in AI Model Weights using Few Shot Learning

Model X-Ray: 使用少样本学习检测AI模型权重中的隐藏恶意软件

Daniel Gilkarov, Ran Dubin

发表机构 * Department of Computer Science and Ariel Cyber Innovation Center, Ariel University（计算机科学系和 Ariel 网络创新中心，阿里尔大学）； Department of Computer and Software Engineering and Ariel Cyber Innovation Center, Ariel University（计算机与软件工程系和 Ariel 网络创新中心，阿里尔大学）

AI总结本文提出一种基于少样本学习的AI模型恶意软件检测方法，通过将模型权重转换为图像表示，仅需6个训练样本即可检测低至6%嵌入率的隐蔽攻击，并展现出对新型扩频隐写攻击的鲁棒性。

详情

DOI: 10.1016/j.jisa.2026.104517

AI中文摘要

随着人工智能（AI）的快速发展和Model Zoo等共享AI模型平台的广泛使用，AI模型被利用的潜在风险增加。攻击者可以通过隐写技术将恶意软件嵌入AI模型中，利用这些模型庞大的体积隐藏恶意数据并用于恶意目的，例如远程代码执行。确保AI模型的安全性是一个新兴的研究领域，对于保护依赖AI技术的众多组织和用户至关重要。本研究利用成熟的图像少样本学习技术，通过一种新颖的图像表示将AI模型转换到图像领域。在该领域应用少样本学习使我们能够创建实用的模型，这是先前工作所缺乏的。我们的方法解决了现有最先进检测技术中阻碍其实用性的关键限制。该方法将所需的训练数据集大小从40000个模型减少到仅6个。此外，我们的方法能够持续检测嵌入率低至25%甚至在某些情况下低至6%的隐蔽攻击，而先前的工作仅被证明对100%-50%的嵌入率有效。我们采用严格的评估策略，确保训练后的模型在各种因素下具有泛化能力。此外，我们展示了训练后的模型成功检测到新型扩频隐写攻击，仅通过学习一种攻击类型就证明了模型令人印象深刻的鲁棒性。我们开源代码以支持可重复性并促进这一新领域的研究。

英文摘要

The potential for exploitation of AI models has increased due to the rapid advancement of Artificial Intelligence (AI) and the widespread use of platforms like Model Zoo for sharing AI models. Attackers can embed malware within AI models through steganographic techniques, taking advantage of the substantial size of these models to conceal malicious data and use it for nefarious purposes, e.g. Remote Code Execution. Ensuring the security of AI models is a burgeoning area of research essential for safeguarding the multitude of organizations and users relying on AI technologies. This study leverages well-studied image few-shot learning techniques by transferring the AI models to the image field using a novel image representation. Applying few-shot learning in this field enables us to create practical models, a feat that previous works lack. Our method addresses critical limitations in state-of-the-art detection techniques that hinder their practicality. This approach reduces the required training dataset size from 40000 models to just 6. Furthermore, our methods consistently detect delicate attacks of up to 25% embedding rate and even up to 6% in some cases, while previous works were only shown to be effective for a 100%-50% embedding rate. We employ a strict evaluation strategy to ensure the trained models are generic concerning various factors. In addition, we show that our trained models successfully detect novel spread-spectrum steganography attacks, demonstrating the models' impressive robustness just by learning one type of attack. We open-source our code to support reproducibility and enhance the research in this new field.

URL PDF HTML ☆

赞 0 踩 0

2401.17010 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Finetuning Large Language Models for Vulnerability Detection

微调大型语言模型用于漏洞检测

Alexey Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Anton Cheshkov, Pavel Zadorozhny

发表机构 * Sber AI Lab（Sber AI实验室）； Huawei Russian Research Institute（华为俄罗斯研究院）； Satbayev University（萨特拜耶夫大学）

AI总结本文通过微调WizardCoder模型，优化训练流程并处理类别不平衡，在漏洞检测任务上提升了ROC AUC和F1指标，展示了预训练LLM在源代码分析中的迁移学习潜力。

详情

DOI: 10.1109/ACCESS.2025.3546700

AI中文摘要

本文介绍了微调大型语言模型（LLMs）用于检测源代码中漏洞的结果。我们利用WizardCoder（最新改进的先进LLM StarCoder），并通过进一步微调使其适应漏洞检测。为加速训练，我们修改了WizardCoder的训练过程，并研究了最优训练方案。针对负样本远多于正样本的不平衡数据集，我们还探索了不同技术以提升分类性能。微调后的WizardCoder模型在平衡和不平衡的漏洞数据集上，相比于CodeBERT类模型，在ROC AUC和F1指标上均有提升，证明了将预训练LLM用于源代码漏洞检测的有效性。关键贡献包括：微调先进的代码LLM WizardCoder、在不损害性能的前提下提高其训练速度、优化训练流程和方案、处理类别不平衡，以及在困难的漏洞检测数据集上提升性能。这展示了通过微调大型预训练语言模型进行专门源代码分析任务的迁移学习潜力。

英文摘要

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, also we investigate optimal training regimes. For the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. The finetuned WizardCoder model achieves improvement in ROC AUC and F1 measures on balanced and imbalanced vulnerability datasets over CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM, WizardCoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.

URL PDF HTML ☆

赞 0 踩 0

2407.15510 2026-06-02 cs.AI cs.DM cs.LO cs.SC 版本更新

Algebraic anti-unification

代数反统一

Christian Antić

发表机构 * Vienna University of Technology（维也纳技术大学）

AI总结本文在泛代数的一般框架下提出代数反统一理论，通过引入代数泛化序和最小泛化概念，建立基本结构性质，并利用自动机理论研究有限一元代数和有限代数中的可计算性。

详情

AI中文摘要

抽象是人类和人工智能的关键，因为它允许人们识别原本不同对象或情境中的共同结构。反统一（或泛化）是理论计算机科学和人工智能中研究抽象的分支，已在归纳逻辑编程、程序综合和类比推理等领域得到应用。迄今为止，反统一几乎完全从语法角度进行研究。在本文中，我们在泛代数的一般框架下开创了反统一的代数（即语义）理论，从而将反统一从基于项的表示扩展到任意代数，并超越等式理论。特别地，我们引入了代数泛化序和最小泛化泛化的概念，建立了基本结构性质，证明了与同态和同构的兼容性，并通过自动机理论方法研究了有限一元代数和有限代数中的可计算性。

英文摘要

Abstraction is key to human and artificial intelligence as it allows one to identify common structure in otherwise distinct objects or situations. Anti-unification (or generalization) is the branch of theoretical computer science and artificial intelligence that studies abstraction and has found applications in areas such as inductive logic programming, program synthesis, and analogy-making. To date, anti-unification has been studied almost exclusively from a syntactic perspective. In this paper, we initiate an algebraic (i.e.\ semantic) theory of anti-unification in the general setting of universal algebra, thereby extending anti-unification from term-based representations to arbitrary algebras and beyond equational theories. In particular, we introduce the notions of algebraic generalization ordering and minimally general generalization, establish basic structural properties, prove compatibility with homomorphisms and isomorphisms, and investigate computability in finite unary algebras and finite algebras via automata-theoretic methods.

URL PDF HTML ☆

赞 0 踩 0

2307.05213 2026-06-02 cs.LG cs.AI 版本更新

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

评分函数梯度估计以拓宽决策聚焦学习的适用性

Mattia Silvestri, Senne Berden, Jayanta Mandi, Ali İrfan Mahmutoğulları, Brandon Amos, Tias Guns, Michele Lombardi

发表机构 * University of Bologna（博洛尼亚大学）； KU Leuven（鲁汶大学）； Meta

AI总结提出一种结合随机平滑与评分函数梯度估计的方法，无需对问题结构做特定假设，即可将决策聚焦学习扩展到非线性目标、约束中不确定参数及两阶段随机优化问题。

详情

DOI: 10.1613/jair.1.19498
Journal ref: Silvestri, Mattia, et al. "Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning." Journal of Artificial Intelligence Research 85 (2026)

AI中文摘要

许多现实世界的优化问题包含在部署前未知的参数，这是由于随机性或信息缺乏（例如，配送问题中的需求或旅行时间）。在这种情况下，常见的策略是通过机器学习（ML）模型估计所述参数，这些模型以最小化预测误差为目标进行训练，然而这并不一定与下游任务级误差一致。决策聚焦学习（DFL）范式通过直接最小化任务损失（例如遗憾）来克服这一限制。由于后者对于组合问题具有非信息性梯度，最先进的DFL方法引入了能够实现训练的替代和近似。但这些方法利用了关于问题结构的特定假设（例如，凸或线性问题，仅在目标函数中的未知参数）。我们提出了一种替代方法，该方法不做此类假设，它结合了随机平滑与评分函数梯度估计，适用于任何任务损失。这为将DFL方法应用于非线性目标、问题约束中的不确定参数，甚至两阶段随机优化打开了大门。实验表明，它通常需要更多的训练周期，但在解决方案质量、可扩展性或两者方面，与专门方法相当，并且在约束中存在不确定性的困难情况下表现尤为出色。

英文摘要

Many real-world optimization problems contain parameters that are unknown before deployment time, either due to stochasticity or to lack of information (e.g., demand or travel times in delivery problems). A common strategy in such cases is to estimate said parameters via machine learning (ML) models trained to minimize the prediction error, which however is not necessarily aligned with the downstream task-level error. The decision-focused learning (DFL) paradigm overcomes this limitation by training to directly minimize a task loss, e.g. regret. Since the latter has non-informative gradients for combinatorial problems, state-of-the-art DFL methods introduce surrogates and approximations that enable training. But these methods exploit specific assumptions about the problem structures (e.g., convex or linear problems, unknown parameters only in the objective function). We propose an alternative method that makes no such assumptions, it combines stochastic smoothing with score function gradient estimation which works on any task loss. This opens up the use of DFL methods to nonlinear objectives, uncertain parameters in the problem constraints, and even two-stage stochastic optimization. Experiments show that it typically requires more epochs, but that it is on par with specialized methods and performs especially well for the difficult case of problems with uncertainty in the constraints, in terms of solution quality, scalability, or both.

URL PDF HTML ☆

赞 0 踩 0

2403.07008 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

AutoEval Done Right: Using Synthetic Data for Model Evaluation

AutoEval 的正确做法：使用合成数据进行模型评估

Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA（电子工程与计算机科学系，加州大学伯克利分校）； Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel（系统免疫学系，魏茨曼科学研究所）； Inria, Ecole Normale Supérieure, Paris, France（法国国家信息与自动化技术研究所，巴黎高等师范学院）

AI总结本文提出高效且统计上无偏的算法，利用AI标记的合成数据减少模型评估所需的人工标注量，在GPT-4实验中有效样本量提升高达50%。

Comments camera-ready paper version

2307.06647 2026-06-02 cs.RO cs.AI cs.CV 版本更新

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

DeepIPCv2: 基于LiDAR的鲁棒环境感知与自动驾驶导航控制

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Electronics, Universitas Gadjah Mada（计算机科学与电子系，加查马达大学）； Department of Computer Science and Engineering, Toyohashi University of Technology（计算机科学与工程系，toyohashi技术大学）

AI总结提出DeepIPCv2端到端自动驾驶框架，通过融合LiDAR点云分割与多视图投影构建鲁棒场景表示，结合门控循环单元、命令特定多层感知器和PID控制器实现路径点与导航控制命令的联合估计，在光照变化下取得最低总指标误差和最少驾驶干预。

Comments This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052

详情

DOI: 10.1109/ACCESS.2025.3647530

AI中文摘要

我们提出DeepIPCv2，一个端到端的自动驾驶框架，它集成了基于LiDAR的环境感知与命令特定的控制学习。与先前依赖摄像头的模型不同，DeepIPCv2采用点云分割和多视图投影来构建鲁棒的场景表示。这些特征通过门控循环单元、命令特定的多层感知器和PID控制器的组合进行融合和解码，以估计路径点和导航控制命令。这种设计增强了机动性并解决了驾驶数据集中的动作不平衡问题。为了验证模型，我们构建了一个覆盖不同光照条件的数据集，并进行了消融研究和与包括TransFuser在内的最新方法的对比测试。结果表明，DeepIPCv2实现了最低的总指标误差和最少的驾驶干预，突显了其对光照变化的鲁棒性和改进的控制精度。通过稍后在https://github.com/oskarnatan/DeepIPCv2发布代码，我们旨在支持端到端自动驾驶研究的可重复性和未来进展。

英文摘要

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. By releasing the codes at https://github.com/oskarnatan/DeepIPCv2 later, we aim to support reproducibility and future advancements in end-to-end autonomous driving research.

URL PDF HTML ☆

赞 0 踩 0

2310.15676 2026-06-02 cs.CV cs.AI 版本更新

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

多模态3D智能的最新进展：综合调查与评估

Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang, Yang Yang

发表机构 * College of Electronics and Information Engineering, Sichuan University（四川大学电子信息工程学院）； School of Computer Science, University of Adelaide（阿德莱德大学计算机科学学院）； School of Computer Science and Engineering, University of Electronic Science and Technology of China（电子科技大学计算机科学与工程学院）

AI总结本文系统综述了多模态3D智能方法，提出基于模态和任务的新分类法，并比较了基准数据集上的结果，最后讨论了未来研究方向。

详情

AI中文摘要

多模态3D智能因其在自动驾驶和世界模拟等领域的广泛应用而受到广泛关注。与传统的单模态3D理解相比，引入额外模态不仅提升了场景解释的丰富性和精确性，还为更高层次的物理世界交互奠定了基础。在仅依赖3D数据可能不足的多样化和挑战性环境中，这一点变得尤为关键。尽管过去六年中多模态3D方法的发展激增，特别是那些整合多相机图像（3D+2D）和文本描述（3D+语言）的方法，但缺乏全面深入的综述。在本文中，我们通过系统调查最新进展来弥补这一空白。我们首先简要总结了各种3D多模态任务中的独特挑战。之后，我们提出了一种新的分类法，根据模态和任务对现有方法进行彻底分类，探讨它们各自的优势和局限性。此外，我们提供了近期方法在几个基准数据集上的比较结果及深入分析。最后，我们讨论了未解决的问题，并提出了未来研究的几个潜在方向。

英文摘要

Multi-modal 3D Intelligence has gained considerable attention due to its wide applications in autonomous driving and world simulation, etc. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also provides a foundation for higher-level physical world interaction. This becomes especially crucial in varied and challenging environments where solely relying on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past six years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this paper, we present a systematic survey of recent progress to bridge this gap. We begin by briefly summarizing the unique challenges among various 3D multi-modal tasks. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.

URL PDF HTML ☆

赞 0 踩 0

2309.15946 2026-06-02 cs.LG cs.AI cs.NE math.DS 版本更新

Unified Long-Term Time-Series Forecasting Benchmark

统一长期时间序列预测基准

Jacek Cyranka, Szymon Haponiuk

发表机构 * Institute of Informatics（信息学院）

AI总结提出一个专为长期时间序列预测设计的综合数据集，通过标准化轨迹和多种模型基准测试，发现模型效果依赖于数据集，并引入改进的潜在NLinear和课程学习DeepAR模型。

详情

DOI: 10.1016/j.neucom.2026.134091

AI中文摘要

为了支持时间序列数据预测的机器学习方法的发展，我们提出了一个明确针对长期时间序列预测设计的综合数据集。我们整合了来自多种动态系统和真实记录的数据集集合。每个数据集通过将数据划分为具有预定回溯长度的训练和测试轨迹进行标准化。我们包含长度高达$2000$的轨迹，以确保对长期预测能力的可靠评估。为了确定在不同场景中最有效的模型，我们使用经典和最先进的模型（即LSTM、DeepAR、NLinear、N-Hits、PatchTST和LatentODE）进行了广泛的基准分析。我们的研究结果揭示了这些模型之间有趣的性能比较，突出了模型有效性的数据集依赖性。值得注意的是，我们引入了一个自定义的潜在NLinear模型，并通过课程学习阶段增强了DeepAR。两者都持续优于其原始版本。

英文摘要

In order to support the advancement of machine learning methods for predicting time-series data, we present a comprehensive dataset designed explicitly for long-term time-series forecasting. We incorporate a collection of datasets obtained from diverse, dynamic systems and real-life records. Each dataset is standardized by dividing it into training and test trajectories with predetermined lookback lengths. We include trajectories of length up to $2000$ to ensure a reliable evaluation of long-term forecasting capabilities. To determine the most effective model in diverse scenarios, we conduct an extensive benchmarking analysis using classical and state-of-the-art models, namely LSTM, DeepAR, NLinear, N-Hits, PatchTST, and LatentODE. Our findings reveal intriguing performance comparisons among these models, highlighting the dataset-dependent nature of model effectiveness. Notably, we introduce a custom latent NLinear model and enhance DeepAR with a curriculum learning phase. Both consistently outperform their vanilla counterparts.

URL PDF HTML ☆

赞 0 踩 0

2212.06751 2026-06-02 cs.LG cs.AI 版本更新

Speeding Up Multi-Objective Hyperparameter Optimization by Task Similarity-Based Meta-Learning for the Tree-Structured Parzen Estimator

基于任务相似性元学习加速多目标超参数优化的树形结构Parzen估计器

Shuhei Watanabe, Noor Awad, Masaki Onishi, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg, Germany（弗赖堡大学计算机科学系）； Artificial Intelligence Research Center, AIST, Tokyo, Japan（日本科学技术厅人工智能研究中心）

AI总结提出利用任务间顶级域重叠定义的任务相似性扩展TPE采集函数到元学习设置，加速多目标超参数优化，理论分析并解决相似性局限，实验证明在表格HPO基准上达到最优性能并赢得AutoML 2022竞赛。

Comments Accpeted to IJCAI 2023

详情

DOI: 10.24963/ijcai.2023/487

AI中文摘要

超参数优化（HPO）是提升深度学习性能的关键步骤。实践者常面临多个指标间的权衡，如准确率和延迟。鉴于深度学习的高计算需求以及对高效HPO日益增长的需求，加速多目标优化变得愈发重要。尽管已有大量关于元学习用于HPO的工作，但现有方法不适用于多目标树形结构Parzen估计器（MO-TPE），这是一种简单而强大的多目标HPO算法。在本文中，我们利用任务间顶级域重叠定义的任务相似性，将TPE的采集函数扩展到元学习设置。我们还从理论上分析并解决了任务相似性的局限性。实验中，我们证明了该方法在表格HPO基准上加速了MO-TPE，并达到了最先进的性能。我们的方法还通过赢得AutoML 2022“Transformer多目标超参数优化”竞赛得到了外部验证。

英文摘要

Hyperparameter optimization (HPO) is a vital step in improving performance in deep learning (DL). Practitioners are often faced with the trade-off between multiple criteria, such as accuracy and latency. Given the high computational needs of DL and the growing demand for efficient HPO, the acceleration of multi-objective (MO) optimization becomes ever more important. Despite the significant body of work on meta-learning for HPO, existing methods are inapplicable to MO tree-structured Parzen estimator (MO-TPE), a simple yet powerful MO-HPO algorithm. In this paper, we extend TPE's acquisition function to the meta-learning setting using a task similarity defined by the overlap of top domains between tasks. We also theoretically analyze and address the limitations of our task similarity. In the experiments, we demonstrate that our method speeds up MO-TPE on tabular HPO benchmarks and attains state-of-the-art performance. Our method was also validated externally by winning the AutoML 2022 competition on "Multiobjective Hyperparameter Optimization for Transformers".

URL PDF HTML ☆

赞 0 踩 0

2211.14411 2026-06-02 cs.LG cs.AI 版本更新

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

c-TPE: 带不等式约束的树结构Parzen估计器用于昂贵的超参数优化

Shuhei Watanabe, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg（弗赖堡大学计算机科学系）

AI总结提出c-TPE方法，通过修改TPE的采样和模型以处理不等式约束，在81个昂贵HPO问题上取得最佳平均排名性能。

Comments Accepted to IJCAI 2023

详情

DOI: 10.24963/ijcai.2023/486

AI中文摘要

超参数优化（HPO）对于深度学习算法的强性能至关重要，而实际应用通常在性能要求之上施加一些约束，例如内存使用或延迟。在这项工作中，我们提出了约束TPE（c-TPE），这是广泛使用的通用贝叶斯优化方法——树结构Parzen估计器（TPE）的扩展，以处理这些约束。我们提出的扩展不仅仅是现有采集函数和原始TPE的简单组合，而是包括解决导致性能不佳问题的修改。我们通过实验和理论彻底分析了这些修改，提供了关于它们如何有效克服这些挑战的见解。在实验中，我们证明c-TPE在81个带不等式约束的昂贵HPO问题上，以统计显著性在现有方法中表现出最佳平均排名性能。由于缺乏基线，我们仅在附录D中讨论了我们方法对硬约束优化的适用性。该实现现在可通过OptunaHub获得。

英文摘要

Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose some constraints, such as on memory usage or latency, on top of the performance requirement. In this work, we propose constrained TPE (c-TPE), an extension of the widely-used versatile Bayesian optimization method, tree-structured Parzen estimator (TPE), to handle these constraints. Our proposed extension goes beyond a simple combination of an existing acquisition function and the original TPE, and instead includes modifications that address issues that cause poor performance. We thoroughly analyze these modifications both empirically and theoretically, providing insights into how they effectively overcome these challenges. In the experiments, we demonstrate that c-TPE exhibits the best average rank performance among existing methods with statistical significance on $81$ expensive HPO problems with inequality constraints. Due to the lack of baselines, we only discuss the applicability of our method to hard-constrained optimization in Appendix D. The implementation is now available via OptunaHub.

URL PDF HTML ☆

赞 0 踩 0

2301.06308 2026-06-02 cs.LG cs.AI 版本更新

Stability Analysis of Sharpness-Aware Minimization

锐度感知最小化的稳定性分析

Hoki Kim, Jinseong Park, Yujin Choi, Jaewook Lee

发表机构 * Chung-Ang University, South Korea（Chung-Ang 大学，韩国）； Korea Institute for Advanced Study, South Korea（韩国高级研究院）； Ulsan National Institute of Science（乌山国家科学研究院）； Nanyang Technological University (NTU), Singapore（南洋理工大学（NTU），新加坡）； Seoul National University, South Korea（首尔国立大学，韩国）

AI总结研究SAM在鞍点附近的收敛不稳定性，通过动力系统理论证明鞍点成为吸引子，并发现动量与批次大小可缓解该问题。

Comments Accepted to ICML 2026

详情

AI中文摘要

锐度感知最小化（SAM）是一种训练方法，旨在寻找深度学习中的平坦最小值，从而在各个领域取得最先进的性能。SAM不是最小化当前权重的损失，而是最小化参数空间中其邻域内的最坏情况损失。在本文中，我们研究了SAM在鞍点附近的收敛不稳定性。利用动力系统的定性理论，我们解释了SAM如何陷入鞍点，并从理论上证明了在SAM动力学下鞍点可以成为吸引子。此外，通过建立SAM的扩散，我们证明了这种收敛不稳定性也可能发生在随机动力系统中。我们证明，在逃离鞍点方面，SAM扩散比普通梯度下降更差。最后，我们展示了经常被忽视的训练技巧——动量和批次大小——可能对缓解收敛不稳定性和实现高泛化性能很重要。我们的理论和实证结果通过几个著名的优化问题和基准任务的实验得到了充分验证。

英文摘要

Sharpness-aware minimization (SAM) is a training method that seeks to find flat minima in deep learning, resulting in state-of-the-art performance across various domains. Instead of minimizing the loss of the current weights, SAM minimizes the worst-case loss in its neighborhood in the parameter space. In this paper, we investigate the convergence instability of SAM near a saddle point. Using the qualitative theory of dynamical systems, we explain how SAM becomes stuck in the saddle point and theoretically prove that the saddle point can become an attractor under SAM dynamics. Additionally, we show that this convergence instability can also occur in stochastic dynamical systems by establishing the diffusion of SAM. We prove that SAM diffusion is worse than that of vanilla gradient descent in terms of saddle point escape. Finally, we demonstrate that often overlooked training tricks, momentum and batch-size, might be important to mitigate the convergence instability and achieve high generalization performance. Our theoretical and empirical results are thoroughly verified through experiments on several well-known optimization problems and benchmark tasks.

URL PDF HTML ☆

赞 0 踩 0

2208.12389 2026-06-02 cs.LG cs.AI 版本更新

Static Seeding and Clustering of LSTM Embeddings to Learn from Loosely Time-Decoupled Events

LSTM嵌入的静态播种与聚类以从松散时间解耦事件中学习

Christian Manasseh, Razvan Veliche, Jared Bennett, Hamilton Clouse

发表机构 * Air Force Research Lab (AFRL) Autonomy Capability Team 3 (ACT3)（美国空军研究实验室（AFRL）自主能力团队3（ACT3））

AI总结提出通过静态数据播种LSTM生成嵌入并聚类，以改进松散时间解耦时间序列预测，在COVID-19县级病例预测中提升10日移动平均精度。

详情

DOI: 10.1109/ACCESS.2023.3288487

AI中文摘要

人类从不同时间和地点发生的事件中学习，以预测相似的事件轨迹。我们将松散解耦时间序列（LDT）现象定义为两个或多个可能发生在不同地点和不同时间线上，但在事件性质和位置属性上具有相似性的事件。在这项工作中，我们改进了循环神经网络（RNN），特别是长短期记忆（LSTM）网络的使用，以使AI解决方案能够为LDT生成更好的时间序列预测。我们基于趋势使用时间序列之间的相似性度量，并引入表示这些趋势的嵌入。嵌入表示事件的属性，与LSTM结构结合，可以聚类以识别相似的、时间上未对齐的事件。在本文中，我们探索了从与LSTM建模的地球物理和人口现象相关的时间不变数据中播种多变量LSTM的方法。我们将这些方法应用于从COVID-19检测感染和死亡病例中得出的时间序列数据。我们使用公开的社会经济数据来播种LSTM模型，创建嵌入，以确定这种播种是否改善了病例预测。这些LSTM产生的嵌入被聚类，以识别用于预测演变时间序列的最佳匹配候选。应用这种方法，我们在美国县级疾病传播的10日移动平均预测中显示出改进。

英文摘要

Humans learn from the occurrence of events in a different place and time to predict similar trajectories of events. We define Loosely Decoupled Timeseries (LDT) phenomena as two or more events that could happen in different places and across different timelines but share similarities in the nature of the event and the properties of the location. In this work we improve on the use of Recurring Neural Networks (RNN), in particular Long Short-Term Memory (LSTM) networks, to enable AI solutions that generate better timeseries predictions for LDT. We use similarity measures between timeseries based on the trends and introduce embeddings representing those trends. The embeddings represent properties of the event which, coupled with the LSTM structure, can be clustered to identify similar temporally unaligned events. In this paper, we explore methods of seeding a multivariate LSTM from time-invariant data related to the geophysical and demographic phenomena being modeled by the LSTM. We apply these methods on the timeseries data derived from the COVID-19 detected infection and death cases. We use publicly available socio-economic data to seed the LSTM models, creating embeddings, to determine whether such seeding improves case predictions. The embeddings produced by these LSTMs are clustered to identify best-matching candidates for forecasting an evolving timeseries. Applying this method, we show an improvement in 10-day moving average predictions of disease propagation at the US County level.

URL PDF HTML ☆

赞 0 踩 0