arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24247 2026-05-26 cs.CL cs.AI

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

Konstantin Berlin, Adam Swanda

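AI summary Proposes an AI-driven workflow in which a detailed constitutional definition is written for each category and interpreted by a frontier LLM, producing golden labels more consistently and accurately than humans and reducing cross-model inconsistency by up to 57x on three content moderation categories.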

Comments Under review at ACL Rolling Review (ARR), May 2026 cycle. Also available at https://doi.org/10.5281/zenodo.20125267

Abstract

Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency. We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document. We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions, with cross-model disagreement diagnosing specification gaps and the human responsible for high-level decisions about what each category should mean rather than individual labeling calls. For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.
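
The abstract above reports cross-model inconsistency without defining the metric. As a minimal sketch, assuming inconsistency is measured as the pairwise disagreement rate between models on the same inputs (an assumption, not the authors' stated definition), it could be computed as follows. The model names and labels are hypothetical.

```python
from itertools import combinations


def pairwise_disagreement(labels_by_model):
    """Fraction of (model pair, item) comparisons in which the two models disagree."""
    models = list(labels_by_model)
    n_items = len(labels_by_model[models[0]])
    disagreements = 0
    comparisons = 0
    for a, b in combinations(models, 2):
        for i in range(n_items):
            comparisons += 1
            if labels_by_model[a][i] != labels_by_model[b][i]:
                disagreements += 1
    return disagreements / comparisons


def disputed_items(labels_by_model):
    """Indices of items on which the models do not all agree."""
    columns = zip(*labels_by_model.values())
    return [i for i, column in enumerate(columns) if len(set(column)) > 1]


# Hypothetical labels (1 = violating, 0 = benign) from three models on five inputs.
paragraph_labels = {
    "model_a": [1, 0, 1, 1, 0],
    "model_b": [1, 1, 0, 1, 0],
    "model_c": [0, 0, 1, 1, 1],
}
constitution_labels = {
    "model_a": [1, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 1, 0],
    "model_c": [1, 0, 1, 1, 1],
}

print(pairwise_disagreement(paragraph_labels))
print(pairwise_disagreement(constitution_labels))
print(disputed_items(constitution_labels))
```

Under this reading, the items returned by `disputed_items` are the ones a human would review as candidate specification gaps.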

2605.24243 2026-05-26 cs.CV cs.AI stat.ML

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

Diogo Lavado, Alessandra Micheletti, Clàudia Soares

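AI summary Proposes GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors to improve 3D segmentation, yielding consistent gains on multiple benchmarks while adding only a small number of parameters.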

Abstract

In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data, which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures -- whether MLP-based, convolution-based, or transformer-based -- by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.

2605.24238 2026-05-26 cs.AI

Toward Enactive Artificial Intelligence

Banafsheh Rafiee, Richard Sutton

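AI summary Argues for incorporating enactive approaches to cognition into AI, emphasizing the inseparability of perception and action, embodiment, and autonomy, and notes that reinforcement learning resonates structurally with enactive principles although gaps remain.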

Abstract

In this paper, we advocate for incorporating enactive approaches to perception and cognition into artificial intelligence (AI). Enactive approaches view perception as an active, skillful engagement with the world, where agents perceive by acting and by understanding how their actions shape their experience. This contrasts with classical views that treat perception as a passive internal process in which the brain receives sensory input, processes it, and issues commands for action. Enactive views emphasize the dynamic, embodied, and interactive character of perception, grounded in the lived experience of agents embedded in their environments. We identify and develop four key enactive concepts that we find most relevant to AI: experience, action-perception inseparability, autonomy, and embodiment. Much of mainstream AI, from classical rule-based systems to large language models, has largely neglected these insights, treating cognition as internal processing detached from embodied interaction and intrinsic normativity. Reinforcement learning (RL), however, exhibits structural resonance with enactive principles through its emphasis on action, agent-environment interaction, feedback-driven adaptation, and agent-centered evaluation. Still, this resonance should not be taken as theoretical equivalence: RL approximates some enactive insights, but key elements remain absent or weakly developed. Building on this analysis, we suggest a broader incorporation of enactive ideas into both mainstream AI and RL.

2605.24229 2026-05-26 cs.AI

How Well Do Models Follow Their Constitutions?

Arya Jakkli, Senthooran Rajamanoharan, Neel Nanda

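AI summary Proposes a multi-method audit pipeline that evaluates how well frontier AI models follow their written behavioral specifications (such as Anthropic's constitution and OpenAI's Model Spec) under adversarial multi-turn interaction, finding that violation rates drop substantially in newer model generations.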

Comments 37 pages including appendix. Code, tenet lists, and full transcripts: https://github.com/ajobi-uhc/constitution-audits. Companion blog post on LessWrong/AI Alignment Forum: https://www.lesswrong.com/posts/Tk4SF8qFdMrzGJGGw/how-well-do-models-follow-their-constitutions

Abstract

Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.

2605.24225 2026-05-26 cs.RO

ECo-MoE: Embodiment-Conditioned Mixture of Experts Increases the Evolvability of Robots

Yibin Wang, Muhan Li, Zihan Guo, Sam Kriegman

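AI summary Proposes a model that jointly optimizes robot evolution and learning, using a mixture of control experts and a distribution of latent design vectors to strike a balance between per-robot policies and a universal controller within a unified modular framework, enabling adaptive behavior across body plans and supporting knowledge reuse and guided evolution.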

Abstract

In this paper, we introduce a model of evolution and learning in robots that co-optimizes a distribution of latent design vectors (genotypes) and a mixture of control experts (neural modules), which are gated by the latent coordinates of each decoded design (phenotype). This provides a scalable alternative to co-design algorithms that either train an individual policy for every robot, which is inefficient, or a monolithic universal controller for all robots, which results in overly conservative structures and behaviors. Our approach lies somewhere between these two extremes, preserving ancestral knowledge in a unified yet modular framework in which different body plans activate and deactivate different combinations of learned sensorimotor circuits for goal-directed behavior. This allows one part of the controller to be overhauled to better suit new species of designs as they emerge without disrupting the hard-earned knowledge contained within other expert modules. It also allows pretrained expert policies to be directly plugged into the mixture, which can steer evolution into otherwise unexplored areas of latent space containing desired morphological traits. We refer to this process as "evo by demo" and explore how it may be used to guide freeform evolution toward canonical structures defined by the pretrained model. Videos and code can be found at: https://eco-moe.github.io.
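
The abstract above describes control experts gated by the latent coordinates of each design. A minimal sketch of that idea, assuming dense softmax gating over a linear score per expert (the abstract does not specify the gating function, and the experts and gate matrix here are hypothetical), might look like this:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(latent, gate_matrix):
    """One gating score per expert: dot product of the latent with that expert's gate row."""
    scores = [sum(w * z for w, z in zip(row, latent)) for row in gate_matrix]
    return softmax(scores)


def mixture_action(latent, observation, experts, gate_matrix):
    """Blend the experts' actions using weights that depend only on the design latent."""
    weights = gate_weights(latent, gate_matrix)
    actions = [expert(observation) for expert in experts]
    dim = len(actions[0])
    return [sum(w * a[d] for w, a in zip(weights, actions)) for d in range(dim)]


# Hypothetical experts: each maps an observation to a 2-D action.
experts = [
    lambda obs: [obs[0], 0.0],
    lambda obs: [0.0, obs[1]],
]
# Hypothetical gate: each row favours one region of a 2-D latent design space.
gate_matrix = [[4.0, 0.0], [0.0, 4.0]]

print(gate_weights([1.0, 0.0], gate_matrix))
print(mixture_action([0.0, 1.0], [1.0, 1.0], experts, gate_matrix))
```

Because the weights depend on the design latent rather than on the observation, designs in different regions of latent space rely on different experts, which is the property the abstract attributes to the approach.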

2605.24218 2026-05-26 cs.CL

QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks

Jian Xie, Tianhe Lin, Zilu Wang, Yuting Ning, Yuekun Yao, Tianci Xue, Zhehao Zhang, Zhongyang Li, Kai Zhang, Yufan Wu, Shijie Chen, Boyu Gou, Mingzhe Han, Yifei Wang, Vint Lee, Xinpeng Wei, Xiangjun Wang, Yu Su, Huan Sun

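AI summary Presents the QUEST model family, which synthesizes training data from unified rubric trees and combines mid-training, supervised fine-tuning, and reinforcement learning so that open models approach or surpass closed-source frontier agents on deep research tasks.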

Comments Work in Progress

Abstract

Deep research agents extend the role of search engines from retrieving keyword-matched pages to synthesizing knowledge, fundamentally changing how humans interact with information. However, frontier systems remain proprietary, while existing open agents often generalize poorly across different task types, leaving it unclear how to train a broadly capable deep research agent. We release QUEST, a family of open models (ranging from 2B to 35B) that serve as general-purpose deep research agents designed to handle a wide range of long-horizon search tasks, with strong capabilities in fact seeking, citation grounding, and report synthesis. To build QUEST, we propose an effective training recipe combining mid-training, supervised fine-tuning, and reinforcement learning. Central to this recipe is a curated data synthesis pipeline based on unified rubric trees, which applies to different task types and enables synthesizing training data with verifiable rewards without human annotation. In addition, QUEST incorporates a built-in context management mechanism that enables effective long-horizon reasoning and knowledge synthesis. Using only 8K synthesized tasks, QUEST approaches or even surpasses frontier closed-source agents across eight deep research benchmarks spanning diverse task types, and achieves the best overall performance among recent open-weight agents. We release everything: models, data, and training scripts.

2605.24216 2026-05-26 cs.LG cs.AI cs.CL cs.CR

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Nesreen K. Ahmed, Nima Nafisi

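AI summary To address the difficulty of monitoring autonomous LLM agents for covert malicious behavior, proposes Agent-ToM, a framework based on Theory-of-Mind reasoning that performs structured trajectory analysis through belief inference, intent hypotheses, and verification, outperforming ensemble methods on monitoring benchmarks.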

Comments 23 pages, 9 figures

Abstract

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

2605.24211 2026-05-26 cs.CL cs.AI

Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

Mariam Barakat, Ekaterina Kochmar

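AI summary Proposes a modular pipeline, grounded in Structure Mapping Theory, that decomposes educational analogy generation into four stages; evaluates 12 LLMs on two datasets, finds that sub-concepts substantially improve explanation quality and closed-setting retrieval precision, and introduces an LLM-as-a-judge evaluation method.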

Comments 36 pages, 25 figures. To appear in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Abstract

Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (LLMs) continue to struggle to generate analogies of comparable quality to those produced by humans. We present a modular pipeline for educational analogy generation, decomposing the task into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Grounded in Structure Mapping Theory, the pipeline enables systematic, stage-by-stage analysis of how model choice and input configuration affect analogy quality. We evaluate 12 state-of-the-art LLMs across six model families on two datasets with structured sub-concept annotations (SCAR and ParallelPARC), alongside seven embedding models for closed-setting retrieval. Our results show that sub-concepts substantially improve explanation quality and closed-setting retrieval precision but provide limited benefit in open-ended source generation. We further introduce an LLM-as-a-judge evaluation methodology and validate its scoring against human annotations from seven annotators, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores. Taken together, our findings reveal cross-stage interactions that isolated studies cannot capture, and highlight sub-concept grounding as a key driver of high-quality analogy generation.

2605.24210 2026-05-26 cs.LG stat.ML

Characterizing the Representational Capacity of Neural Processes

Robin Young

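AI summary Characterizes the representational capacity of Conditional, Attentive, and Transformer Neural Processes and their latent variants through a strict hierarchy analysis, revealing the inclusion relations and limitations of the different architectures in function representation.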

Comments To appear at ProbML/AABI 2026

Abstract

What functions can Neural Processes represent? We analyze the representational capacity of popular NP architectures: Conditional Neural Processes (CNPs), Attentive Neural Processes (ANPs), Transformer Neural Processes (TNPs), and their latent variants. We prove these architectures form a strict hierarchy. CNP-representable functions are exactly those depending on finitely many expected features of the context distribution. ANPs strictly generalize CNPs via query-dependent reweighting, enabling kernel smoothers. ConvCNPs and ANPs are incomparable; each contains functions outside the other, separated by stationarity versus translation equivariance. TNPs with $L$ self-attention layers capture $L$-hop context interactions. For latent NPs, we show finite-dimensional latents provide coherent sampling but do not circumvent encoder limitations; matching GP posterior distributions requires latent dimension scaling with context size. These results provide a theoretical foundation for architecture selection based on task structure.
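
To illustrate the contrast the abstract draws between CNPs and ANPs, the sketch below compares a mean-pooled context representation, which is the same for every query, with a Nadaraya-Watson kernel smoother, whose weights depend on the query. It is a toy under stated assumptions: the feature map, the Gaussian kernel, and the data are hypothetical, and no learned network is involved.

```python
import math


def cnp_representation(context, feature_fn):
    """Mean of per-point features: one fixed-size summary, invariant to context order."""
    feats = [feature_fn(x, y) for x, y in context]
    n = len(feats)
    return [sum(f[k] for f in feats) / n for k in range(len(feats[0]))]


def kernel_smoother(context, query, lengthscale=1.0):
    """Nadaraya-Watson estimate: context outputs reweighted by closeness to the query."""
    weights = [
        math.exp(-((query - x) ** 2) / (2 * lengthscale ** 2)) for x, _ in context
    ]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, context)) / total


# Hypothetical 1-D context set of (x, y) pairs and a hand-written feature map.
context = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
features = lambda x, y: [x, y, x * y]

print(cnp_representation(context, features))
print(kernel_smoother(context, query=0.0, lengthscale=0.1))
print(kernel_smoother(context, query=1.0, lengthscale=1.0))
```

The mean-pooled summary is computed before any query is seen, whereas the smoother recomputes its weights for each query, which is the query-dependent reweighting the abstract refers to.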

2605.24203 2026-05-26 cs.RO

Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance

Runze Wang, Yuqian Fu, Yu Li, Tao Lin, Tianwen Qian, Mohamed Elhoseiny, Bo Zhao, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue

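AI summary Proposes the Afford-VLA framework, which internalizes task-conditioned affordance as an explicit visual planning interface, using learnable <AFF> tokens to query interaction regions and decoding them into compact embeddings that directly condition action generation, achieving state-of-the-art performance on multiple simulation benchmarks.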

Comments 20 pages

Abstract

Vision-language-action (VLA) models have shown strong potential for generalist robot manipulation, yet they remain limited by insufficient spatial reasoning, particularly in determining where to interact in complex visual scenes. While recent efforts introduce various forms of visual planning to address this issue, existing approaches rely on global geometric cues, symbolic intermediate representations, or externally generated visual signals, which are often weakly coupled with downstream action prediction. In this work, we revisit visual planning in VLA systems and argue that effective planning should be local, visually grounded, internally generated, and directly aligned with action. Based on this insight, we propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models. Concretely, we introduce learnable <AFF> tokens to query task-relevant interaction regions, decode affordance masks from multimodal features, and convert them into compact embeddings that directly condition action generation. This design enables affordance to be both generated and utilized within the VLA, forming a tightly coupled perception-action pathway. To further support this integration, we adopt a training strategy that allows the affordance pathway to be jointly optimized with action prediction, improving its effectiveness for downstream control. We evaluate our method on multiple simulation benchmarks, including LIBERO, LIBERO-Plus, and SimplerEnv, achieving consistent state-of-the-art performance, along with strong real-world results. These findings demonstrate that internalizing affordance as action-aligned visual planning provides a powerful paradigm for improving VLA systems.

2605.24201 2026-05-26 cs.CV physics.med-ph

Radiuma: A Unified Zero-Code Executable Graphical Workflow Generator for Reproducible and Shareable Medical Image Analysis and Machine Learning

Mohammad Salmanpour, Mehrdad Oveisi, Isaac Shiri, Arman Rahmim

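AI summary Presents Radiuma, a modular platform whose visual workflow system integrates image processing, radiomics feature extraction, and machine learning modules, allowing reproducible multi-step analysis pipelines to be built without programming.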

Abstract

Medical image computing software is essential for identifying imaging biomarkers that can support diagnosis, prognosis, treatment planning, and clinical research. However, the lack of standardized, user-friendly, and reproducible software environments has limited the broader adoption of advanced medical image analysis workflows. We present Radiuma, a freely available modular platform designed to support reliable and reproducible medical image analysis across multiple modalities and file formats. Radiuma integrates image reading, visualization, registration, fusion, processing, segmentation, radiomics feature extraction, and machine learning modules for classification, regression, and clustering. Its modular design allows users to execute each component independently or connect modules through a visual workflow system, where the output of one step can be graphically passed to the next. This enables the creation of custom, executable, and reproducible multi-step pipelines without requiring extensive programming expertise. Results from each module can be inspected directly in the visualization window, providing immediate feedback on processing quality and workflow accuracy. Radiuma also supports saving and sharing customized workflows, promoting transparency, reusability, and consistency across collaborative studies. By combining flexibility, usability, and standardized analysis tools, Radiuma provides a practical environment for radiomics and machine learning research in clinical and translational settings. The platform is designed to be accessible to users with diverse expertise, including radiologists, physicists, clinicians, and data scientists.

2605.24195 2026-05-26 cs.CV cs.LG

Single View Seafloor Recovery from Imaging Sonar via Differentiable Rendering

Sevan Brodjian, Michael Hobley, Pietro Perona

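AI summary Proposes a training-free method that recovers seafloor terrain from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt; described as the first differentiable rendering approach for single-view height recovery in sonar.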

Abstract

Sonar is often the only modality suitable for high-resolution imaging underwater due to light attenuation and turbidity. Forward-looking imaging sonar provides measurements over range and horizontal angle but collapses vertical structure into a flat image, creating ambiguities that make 3D recovery challenging. A common use case for imaging sonar is underwater terrain mapping (bathymetry), yet current methods require many views, expensive multi-sensor setups, or significant training data, which limits use and adaptability to new environments. We present a training-free method that recovers bathymetry from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt. To our knowledge, this is the first differentiable rendering approach for single-view height recovery in sonar. Our method implements differentiable sonar ray tracing and optimizes an explicit height field to reproduce the target image. On synthetic datasets, our approach outperforms a supervised CNN under distribution shift and remains close on rough terrain, while the CNN wins in-distribution. By modeling physically grounded priors of the sonar process, our method adapts across sensor configurations and environments without training data.

2605.24193 2026-05-26 cs.SD cs.LG

Music Transcription with (Almost) No Supervision

Saebyeol Shin, Chao Wan, Zhenzhen Liu, Justin Lovelace, Daniel C. Lin, Kilian Q. Weinberger, John Thickstun

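AI summary Adopts a cycle-consistent translation framework that uses a small amount of paired data as an anchor to exploit unpaired audio and score data, achieving high-quality music transcription.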

Abstract

Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but have gone unused. We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, unlocking the full potential of the unpaired pool. We find that unpaired data yields surprisingly large gains, especially under limited supervision; that unpaired audio contributes more than unpaired scores; and that incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.

2605.24192 2026-05-26 cs.LG cs.AI cs.CV

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

Matthew Niedoba, Berend Zwartsenberg, Frank Wood

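AI summary Proposes Filtered Posterior Mean Collections (FPMCs), a unified framework that models the generalization behavior of diffusion-model denoising functions via query precision vectors, response weights, and source distributions, and improves existing methods through soft relaxations and source-distribution augmentations.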

Comments 27 Pages, 7 figures

Abstract

The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization behaviour across a wide variety of network architectures and training procedure hyperparameters. A recent line of research has sought to model the outputs of these networks by aggregating posterior weighted averages of training dataset patches. In this work, we consolidate these approaches into a unified model class which we call Filtered Posterior Mean Collections (FPMCs). We define this model class using query precision vectors, response weights, and source distributions, and illustrate that existing methods are recoverable with specific choices of these design axes. Investigating each axis in turn, we find that FPMC performance can be improved with soft relaxations of prior patch-based methods, and through augmentations of source distributions. Applying these findings to an existing FPMC, we demonstrate consistent sample improvement across three natural image datasets.
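
The abstract above refers to posterior-weighted averages of training data. The sketch below shows only the simplest whole-sample case: the posterior mean of a clean sample given a noisy one, when clean samples are drawn uniformly from a finite dataset and the noisy sample is the clean one plus Gaussian noise of standard deviation `sigma`. This is not the FPMC definition, which the abstract does not spell out; patches, query precision vectors, and response weights are omitted, and the data are hypothetical.

```python
import math


def posterior_mean(noisy, dataset, sigma):
    """E[x0 | noisy] for x0 uniform over `dataset` and noisy = x0 + sigma * gaussian_noise."""
    # Log-weights are proportional to the Gaussian log-likelihood of each training point.
    log_w = [
        -sum((a - b) ** 2 for a, b in zip(noisy, x)) / (2 * sigma ** 2)
        for x in dataset
    ]
    peak = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - peak) for lw in log_w]
    total = sum(w)
    return [
        sum(wi * x[d] for wi, x in zip(w, dataset)) / total
        for d in range(len(noisy))
    ]


# Hypothetical two-point "training set" in two dimensions.
dataset = [[0.0, 0.0], [2.0, 2.0]]

print(posterior_mean([0.1, 0.1], dataset, sigma=0.1))    # low noise: near the closest point
print(posterior_mean([0.1, 0.1], dataset, sigma=100.0))  # high noise: near the dataset mean
```

At low noise the average collapses onto the nearest training point, and at high noise it approaches the dataset mean.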

2605.24176 2026-05-26 cs.CV

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

Pouyan Navard, Sernam Lim

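AI summary Proposes Loki, which uses a parametric face model to decouple identity from expression and pose and preserves identity through lightweight key-value injection, improving driver-following in diffusion-based portrait animation while reducing parameters and training data.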

Abstract

Portrait animation transfers a driver clip's facial expression and head pose onto a single reference image while preserving the reference's identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice -- learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are inseparable without being learned apart. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone's own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-ID reenactment reduces to a coefficient substitution at inference, requiring no cross-ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression follow the driver's -- the questions portrait animation actually asks; Loki leads or co-leads on both.

2605.24173 2026-05-26 cs.CL cs.AI cs.CR cs.LG

Extracting Training Data from Diffusion Language Models via Infilling

通过填充从扩散语言模型中提取训练数据

Yihan Wang, N. Asokan

AI总结 提出填充提取协议,利用扩散语言模型的双向去噪能力,通过任意二进制掩码参数化,揭示掩码几何形状控制可提取性,边缘条件掩码比前缀条件掩码最多可多提取三倍逐字序列,且双向访问打开了自回归模型无法利用的通道。

详情
AI中文摘要

大型语言模型中的记忆化几乎完全通过前缀条件提取进行研究,这是自回归模型的自然选择。然而,扩散语言模型(DLM)可以在任意位置去噪掩码标记。因此,仅前缀探测只揭示了DLM中记忆化的一个方面,并显著低估了训练数据提取的风险。为了真实地建模DLM中训练数据的可提取性,我们引入了填充提取,这是一种由任意二进制掩码参数化的数据提取协议,它包含了仅前缀探测并考虑了DLM的双向归纳偏差。在LLaDA-8B和Dream-7B上,跨五种提取模式、三种训练流水线和三个涵盖逐字和部分泄漏的语料库进行实例化,我们发现掩码几何形状控制着可提取性:边缘条件掩码比前缀条件掩码最多可多提取三倍的逐字序列,并且双向访问打开了自回归模型中无法利用的通道。特别是,我们表明,一个能够访问已删除个人身份信息的训练数据的现实对手,从DLM中提取被删除的电子邮件地址时,其召回率甚至高于规模匹配的自回归模型。解码的可调参数可测量地影响提取性能,而后续的监督微调阶段并未消除先前的记忆化。

英文摘要

Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. In order to realistically model extractability of training data in DLMs, we introduce infilling extraction, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data where personally identifiable information has been redacted can even achieve higher recall on extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable parameters for decoding measurably affect extraction performance, while a follow-up supervised finetuning stage does not eliminate the prior memorization.
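The difference between prefix-only probing and a more general binary mask can be sketched in a few lines. This is only an illustration, not the authors' protocol: the helper names are hypothetical, and the reading of "edge-conditioned" as "reveal tokens at both ends and mask the middle" is an assumption made here, since the abstract does not define the mask shapes.

```python
from typing import List

def prefix_mask(length: int, n_visible: int) -> List[int]:
    """1 = token revealed to the model, 0 = token masked (to be infilled)."""
    if not 0 <= n_visible <= length:
        raise ValueError("n_visible must be between 0 and length")
    return [1] * n_visible + [0] * (length - n_visible)

def edge_mask(length: int, n_visible: int) -> List[int]:
    """Assumed edge-conditioned shape: split the visible budget across both ends."""
    if not 0 <= n_visible <= length:
        raise ValueError("n_visible must be between 0 and length")
    left = (n_visible + 1) // 2   # the left edge gets the extra token if odd
    right = n_visible - left
    return [1] * left + [0] * (length - n_visible) + [1] * right

def apply_mask(tokens: List[str], mask: List[int], mask_token: str = "[MASK]") -> List[str]:
    """Replace every position whose mask bit is 0 with the mask token."""
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    return [tok if bit else mask_token for tok, bit in zip(tokens, mask)]

tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(apply_mask(tokens, prefix_mask(8, 4)))
print(apply_mask(tokens, edge_mask(8, 4)))
```

Both masks reveal the same number of tokens; only the geometry differs, which is the variable the abstract reports as governing extractability.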

2605.24172 2026-05-26 cs.AI

EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

EPPC-OASIS:面向安全消息中电子患者-提供者通信挖掘的本体感知适应与结构化推理精炼

Samah Fodeh, Sreeraj Ramachandran, Elyas Irankhah, Muhammad Arif, Afshan Khan, Ganesh Puthiaraju, Linhai Ma, Srivani Talakokkul, Jordan Alpert, Sarah Schellhorn

AI总结 提出EPPC-OASIS框架,通过本体感知的Wasserstein对齐目标增强微调,并结合推理精炼步骤,从安全消息中自动提取结构化EPPC编码,在多个语言模型上取得一致改进。

详情
AI中文摘要

安全的患者-提供者消息包含临床上重要的通信行为,这些行为难以大规模手动表征。电子患者-提供者通信(EPPC)框架为编码这些行为提供了本体,但自动提取仍然具有挑战性,因为预测必须保留细粒度的代码/子代码结构,同时将注释锚定在消息文本中。我们开发了EPPC-OASIS,一种用于结构化EPPC提取的本体感知适应方法,并将其与可部署的推理精炼程序相结合,旨在提高最终注释的一致性。EPPC-OASIS通过Wasserstein对齐目标增强监督微调,该目标鼓励模型表示邻域与EPPC本体派生邻域之间的对齐,而推理精炼则使用验证、自一致性、混合校正以及选择或集成来解决残差预测错误。我们在一个去标识化的安全患者-提供者消息语料库上,针对多个开放权重语言模型,将框架与提示、监督微调、基于偏好和鲁棒性导向的基线进行了比较。跨模型家族,最佳可部署流水线实现了77.13%的代码+子代码F1和63.83%的三元组F1,相比最强的监督微调基线,分别获得了+1.39和+2.12 F1点的适度但一致的绝对提升。这些结果表明,结合结构化推理精炼的本体感知适应可以支持可扩展的回顾性EPPC挖掘,但在操作使用前需要外部验证。

英文摘要

Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale. The Electronic Patient-Provider Communication (EPPC) framework provides an ontology for coding these behaviors, but automated extraction remains challenging because predictions must preserve fine-grained code/sub-code structure while grounding annotations in message text. We developed EPPC-OASIS, an ontology-aware adaptation approach for structured EPPC extraction, and combined it with deployable inference-refinement procedures designed to improve the coherence of final annotations. EPPC-OASIS augments supervised fine-tuning with a Wasserstein alignment objective that encourages alignment between model representation neighborhoods and EPPC ontology-derived neighborhoods, while inference refinement uses verification, self-consistency, hybrid correction, and selection or ensembling to address residual prediction errors. We evaluated the framework on a de-identified corpus of secure patient-provider messages against prompting, supervised fine-tuning, preference-based, and robustness-oriented baselines across multiple open-weight language models. Across model families, the best deployable pipeline achieved 77.13% Code+Sub-code F1 and 63.83% Triplet F1, corresponding to modest but consistent absolute gains of +1.39 and +2.12 F1 points over the strongest supervised fine-tuning baseline. These results suggest that ontology-aware adaptation with structured inference refinement can support scalable retrospective EPPC mining, although external validation is needed before operational use.

2605.24171 2026-05-26 cs.LG cs.AI

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

PromptAudit: 审计基于LLM的漏洞检测中的提示敏感性

Steffen J. Camarato, Yahya Hmaiti, Mandana Ghadamian, David Mohaisen

AI总结 提出PromptAudit框架,通过固定数据集、解码和解析仅变化提示策略,评估五种提示策略在五个开源模型上对1000个CVE(6074个代码样本,16种编程语言)的漏洞检测性能,发现标准思维链提示整体性能最佳,而提示敏感性是系统的一级属性。

详情
AI中文摘要

大型语言模型越来越多地用于漏洞检测,但它们在各种提示表述下的可靠性仍未得到表征。我们提出了PromptAudit,一个受控评估框架,通过固定数据集、解码和解析,仅变化提示策略来隔离提示效应。我们在1000个CVE(涵盖16种编程语言的6074个代码样本)上,使用五种提示策略对五个开源模型进行评估,计算准确率、召回率、弃权率、覆盖率和有效F1。我们发现,标准思维链提示实现了最强的整体操作性能,而少样本提示提供了模型相关的益处,对于提示敏感的模型最为显著。相比之下,自适应思维链经常抑制召回率,而自一致性导致过度弃权,急剧降低了有效性能。这些结果表明,漏洞检测行为由模型和提示共同决定,并且提示敏感性是一个一级系统属性,必须在评估和部署中明确表征。

英文摘要

Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.
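To make the interaction between abstention, coverage, and F1 concrete, here is a minimal sketch. The abstract does not define "effective F1"; the convention below, where an abstention on a vulnerable sample is counted as a missed detection, is an assumption, and the function name is hypothetical.

```python
from typing import Dict, List, Optional

def audit_metrics(labels: List[bool], predictions: List[Optional[bool]]) -> Dict[str, float]:
    """labels: True = vulnerable. predictions: True/False, or None for an abstention."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    n = len(labels)
    abstained = sum(1 for p in predictions if p is None)
    tp = sum(1 for y, p in zip(labels, predictions) if y and p is True)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p is True)
    # Assumption: abstaining on a vulnerable sample is a false negative.
    fn = sum(1 for y, p in zip(labels, predictions) if y and p is not True)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "coverage": (n - abstained) / n,
        "abstention": abstained / n,
        "recall": recall,
        "effective_f1": f1,
    }

print(audit_metrics([True, True, False, False], [True, None, False, True]))
```

Under this convention a strategy that abstains heavily loses recall even when its answered cases are correct, which is one way the "excessive abstention" effect described above could show up in a single number.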

2605.24168 2026-05-26 cs.AI cs.LG

Inference Time Context Sparsity: Illusion or Opportunity?

推理时上下文稀疏性:幻觉还是机遇?

Sahil Joshi, Prithvi Dixit, Agniva Chowdhury, Anshumali Shrivastava, Joseph E. Gonzalez, Ion Stoica, Kumar Krishna Agrawal, Aditya Desai

AI总结 本文通过实证和理论证据论证,在长上下文LLM推理中采用极端但原则性的上下文维度稀疏性不仅是可行的,而且能显著加速处理(如H100上实现10倍加速),从而挑战了密集注意力机制的必要性。

Comments 19 pages, 8 figures

详情
AI中文摘要

稀疏性长期以来一直是LLM效率的核心主题,但其在上下文处理中的作用仍未解决。随着LLM工作负载转向更长的上下文和智能体交互,注意力的计算和内存瓶颈变得日益关键,这引发了这些约束是否根本性的问题。我们的立场是,这些约束是人为且不必要的,LLM推理的未来在于沿上下文维度的极端但原则性的稀疏性。这一立场得到了多方面的经验和理论证据支持。首先,我们发现坚持密集注意力是不合理的,因为在长上下文中,查询实际上将O(N)个注意力信息投影到维度d << N的隐藏空间中,使得该过程固有地有损。其次,我们对跨越五个模型家族的20个LLM进行了广泛的稀疏性研究,变化上下文长度和不同稀疏水平。我们经验性地展示了一个强烈趋势:当前的LLM,尽管未针对上下文稀疏性进行训练,但在不同复杂度的任务(包括检索、多跳QA、数学推理和智能体编码)中对推理时解码稀疏性表现出显著的鲁棒性。重要的是,我们还表明当前的硬件已经足以从这种稀疏性中实现实质性收益。例如,我们的稀疏解码内核在H100等硬件上以50倍稀疏性水平将大上下文处理加速高达10倍,相比FlashInfer。总体而言,这些结果将极端上下文稀疏性定位为不仅是启发式的,而是LLM推理、训练和架构设计的原则性基础:既可行又有益,是未来系统的一个有吸引力的方向。

英文摘要

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d << N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.
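As a rough illustration of decode-time context sparsity, the sketch below lets a single query attend only to its highest-scoring keys. It is not the paper's kernel or selection rule: exact top-k selection is an assumption chosen for simplicity, and because it still scores every key it saves no compute. It only shows what "50x sparsity" means for the attention output.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - np.max(x)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dense_decode_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-query attention over all N cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def topk_decode_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray, keep: int) -> np.ndarray:
    """Single-query attention restricted to the `keep` highest-scoring keys."""
    if not 1 <= keep <= K.shape[0]:
        raise ValueError("keep must be between 1 and the number of keys")
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argpartition(scores, -keep)[-keep:]   # indices of the top `keep` scores
    return softmax(scores[idx]) @ V[idx]

rng = np.random.default_rng(0)
N, d = 200, 16
q, K, V = rng.normal(size=d), rng.normal(size=(N, d)), rng.normal(size=(N, d))
dense = dense_decode_attention(q, K, V)
sparse = topk_decode_attention(q, K, V, keep=N // 50)   # keep 1 in 50 keys
print(np.linalg.norm(dense - sparse))
```

A practical system would need a selection step that avoids touching every key; how the paper does that is not stated in the abstract.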

2605.24164 2026-05-26 cs.CL

CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Changes

CUNY at CLPsych 2026: 一种用于心理健康变化分类和总结的流水线方法

Amirmohammad Ziaei Bideh, Shameed Charlomar Job, Ava Yahyapour, Alla Rozovskaya

AI总结 提出一种流水线方法,通过集成三个开源大语言模型的上下文学习和监督分类,对社交媒体时间线中的心理健康状态进行推断、变化点预测和模式总结,在CLPsych 2026共享任务中取得领先排名。

详情
AI中文摘要

我们描述了在CLPsych 2026共享任务中的提交,该任务旨在通过社交媒体时间线动态捕捉和表征心理健康变化。为了推断帖子中的主导自我状态(任务1.1和1.2),我们使用多数投票集成了三个开源大语言模型的上下文学习。为了预测时间线中的变化时刻(任务2),我们在从任务1.1预测中提取的特征上训练监督分类器。为了总结时间线内情绪动态的模式及其随时间的变化(任务3.1),我们用上游系统(任务1.1、1.2和2)预测的标签增强上下文示例,相比零样本和未增强的上下文学习基线获得了性能提升。我们的提交在任务1.1上排名第一,任务1.2上排名第四,任务2上排名第四,任务3.1上排名第三。实验源代码可在https://github.com/amirzia/clpsych26-cuny获取。

英文摘要

We describe our submission to the CLPsych 2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self-states in posts (Tasks 1.1 and 1.2), we ensemble in-context learning of three open-weight large language models using majority voting. For predicting moments of change in a timeline (Task 2), we train supervised classifiers on features derived from Task 1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment the in-context examples with labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero-shot and unaugmented in-context learning baselines. Our submission ranked first on Task 1.1, fourth on Task 1.2, fourth on Task 2, and third on Task 3.1. The source code for the experiments is available at https://github.com/amirzia/clpsych26-cuny
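The majority-voting step over three model outputs can be sketched as follows. The label strings are placeholders and the tie-breaking rule (fall back to the earliest-listed model's label) is an assumption; the abstract does not say how ties are resolved.

```python
from collections import Counter
from typing import List

def majority_vote(predictions: List[str]) -> str:
    """Return the most frequent label; on a tie, the earliest-listed one wins."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:      # scan in model order to break ties
        if counts[label] == best:
            return label
    raise AssertionError("unreachable")

# One post, three hypothetical model outputs.
print(majority_vote(["label_a", "label_b", "label_a"]))
```

With three voters and more than two possible labels a three-way tie is possible, so some explicit fallback of this kind is needed in any such ensemble.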

2605.24162 2026-05-26 cs.LG cs.AI

Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

知识图谱调制的深度学习用于有限样本临床数据分析

Yuwei Xue, Sakib Mostafa, James Zou, Joseph Liao, Maximilian Diehn, Ash A. Alizadeh, Lei Xing, Md. Tauhidul Islam

AI总结 提出Graph-in-Graph (GiG)框架,通过将患者表示为模块化图并整合生物知识图谱,在有限样本临床任务中显著提升预测性能。

Comments 17 pages, 4 figures, 12 supplementary figures

详情
AI中文摘要

生物系统受结构化分子相互作用支配,其中通路、调控回路和功能基因关系塑造细胞行为和疾病进展。这些知识大多自然表示为图。然而,大多数生物医学AI模型无法直接使用图编码的生物知识,而是需要压缩的低维表示,这可能会丢失重要结构并降低性能,尤其是在有限样本的临床研究中。这里,我们引入Graph-in-Graph (GiG),一个知识图谱调制的深度学习框架,用于数据高效的临床预测。GiG将每个患者表示为一个独立的模块化图,其中精选的生物知识图谱定义边,患者特定的测量(如基因表达)定义节点特征。这种设计允许整合多个生物知识图谱,同时在患者级表示学习中保留基因-基因相互作用和通路拓扑。在涵盖近9700名患者和五个临床任务(包括液体活检癌症检测、前列腺癌诊断和32类泛癌分类)的队列中,GiG持续优于传统和最先进的方法,在有限样本设置中增益最大。在具有挑战性的前列腺癌诊断任务中,GiG相对于竞争方法将macro-F1提高了最多49个百分点。用随机拓扑替换真实通路图的对照实验证实,这些增益源于生物学基础的知识图谱结构,而非仅图建模。这些发现表明,知识图谱调制的深度学习可以提高临床数据分析的鲁棒性、可解释性和样本效率,并为将生物知识图谱整合到预测建模中提供了一个原则性框架。

英文摘要

Biological systems are governed by structured molecular interactions, where pathways, regulatory circuits, and functional gene relationships shape cellular behavior and disease progression. Much of this knowledge is naturally represented as graphs. However, most biomedical AI models cannot directly use graph-encoded biological knowledge and instead require compressed low-dimensional representations, which can lose important structure and reduce performance, especially in limited-sample clinical studies. Here, we introduce Graph-in-Graph (GiG), a knowledge graph-modulated deep learning framework for data-efficient clinical prediction. GiG represents each patient as a standalone modular graph, in which curated biological knowledge graphs define edges and patient-specific measurements, such as gene expression, define node features. This design allows multiple biological knowledge graphs to be integrated while preserving gene-gene interactions and pathway topology during patient-level representation learning. Across cohorts comprising nearly 9,700 patients and five clinical tasks, including liquid biopsy cancer detection, prostate cancer diagnosis, and 32-class pan-cancer classification, GiG consistently outperforms traditional and state-of-the-art methods, with the largest gains in limited-sample settings. On the challenging prostate cancer diagnosis task, GiG improves macro-F1 by up to 49 percentage points relative to competing methods. Control experiments replacing real pathway graphs with random topologies confirm that these gains arise from biologically grounded knowledge graph structure rather than graph modeling alone. These findings show that knowledge graph-modulated deep learning can improve robustness, interpretability, and sample efficiency in clinical data analysis, and provide a principled framework for integrating biological knowledge graphs into predictive modeling.

2605.24159 2026-05-26 cs.CV

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

EchoVQA:为床旁心脏超声提供对话式辅助

Filippos Bellos, Yutong Li, Jessie N Dong, Zaiyang Guo, Emily Mackay, Yayuan Li, Yannis Avrithis, Alison Pouch, Jason J. Corso

AI总结 针对床旁心脏超声图像获取与解读依赖专业知识的问题,提出首个大规模超声心动图VQA数据集EchoVQA(含14,299张图像和74,819个问答对),并开发基于多模态可学习提示的参数高效方法,在多数基准上达到最优性能。

详情
AI中文摘要

床旁经胸超声心动图(TTE)几乎可在任何临床环境中进行心脏评估,但其诊断效用仍受限于图像获取和解读所需的专业知识。视觉问答(VQA)为通过交互式临床辅助弥合这一专业知识差距提供了有前景的范式,但现有的超声心动图VQA数据集规模有限、仅限于高质量图像,且仅覆盖少数视图。我们提出了EchoVQA,首个大规模超声心动图VQA数据集,包含14,299张图像和74,819个问答对。该数据集整合了公共来源(EchoNet-Dynamic、CAMUS)和我们使用两种手持探头(Lumify、Clarius)自行采集的床旁图像,涵盖多种视图,包括高质量和次优图像。独特的是,EchoVQA包含采集指导问题,帮助用户优化探头位置以获得用于左心室射血分数评估的诊断性心尖四腔心视图——这对床旁环境中的新手操作者来说是一项具有挑战性的任务。我们进一步开发了一种基于多模态可学习提示的参数高效方法,在包括EchoVQA在内的大多数基准上取得了最先进的性能,且可训练参数显著少于现有最先进方法。

英文摘要

Point-of-care transthoracic echocardiography (TTE) enables cardiac assessment in virtually any clinical setting, yet its diagnostic utility remains constrained by the expertise required for image acquisition and interpretation. Visual question answering (VQA) offers a promising paradigm for bridging this expertise gap through interactive clinical assistance, but existing echocardiography VQA datasets are limited in scale, restricted to high-quality images, and only cover a few views. We introduce EchoVQA, the first large-scale VQA dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs. The dataset integrates public sources (EchoNet-Dynamic, CAMUS) with our own point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. Uniquely, EchoVQA includes acquisition guidance questions to help users optimize transducer positioning toward a diagnostic apical 4-chamber view for left ventricular ejection fraction estimation -- a challenging task for novice operators in point-of-care settings. We further develop a parameter-efficient method based on multimodal learnable prompts that achieves state-of-the-art performance on most benchmarks, including EchoVQA, with significantly fewer trainable parameters than existing state-of-the-art approaches.

2605.24154 2026-05-26 cs.AI cs.SE

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Palette: 一种模块化、可控且高效的框架,用于按需授权安全对齐放松的LLMs

Qitao Tan, Xiaoying Song, Arman Akbari, Arash Akbari, Yanzhi Wang, Xiaoming Zhai, Lingzi Hong, Zhen Xiang, Jin Lu, Geng Yuan

AI总结 提出Palette框架,通过多目标搜索识别拒绝方向并轻量级适配模型,实现按需放松授权领域的安全拒绝行为,同时保持其他区域的标准安全,支持模块化组合多领域授权。

详情
AI中文摘要

当前基础模型的安全对齐大多遵循“一刀切”范式,跨用户和上下文应用相同的拒绝策略。因此,模型可能拒绝对于一般用户不安全但授权专业人员合法的请求,限制了专业环境中的有用性。现有方法要么需要昂贵的重新对齐,要么依赖推理时控制,但存在控制不精确和延迟增加的问题。为此,我们提出Palette,一个模块化、可控且高效的框架,选择性地放松授权目标领域的拒绝行为,同时在其他地方保持标准安全。我们的方法通过多目标搜索识别拒绝方向,并通过轻量级适配将其内化到模型中。Palette进一步支持模块化组合:它独立学习领域特定的安全控制,并通过参数合并进行组合,无需重新训练即可实现按需多领域授权。在四个安全基准、多个模型变体以及LLMs和VLMs上的实验表明,Palette在不牺牲通用实用性的情况下提供精确的安全控制,为基础模型适应多样化专业需求提供了一条实用路径。

英文摘要

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

2605.24139 2026-05-26 cs.AI cs.LG

MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

MAPLE:不完全信息游戏中AlphaZero的多状态聚合策略评估

Qian-Rong Li, Hung Guei, I-Chen Wu, Ti-Rong Wu

AI总结 提出MAPLE方法,通过单搜索树聚合多个采样世界状态的策略和价值评估,结合PIMC和IS-MCTS优势,在Phantom Go和Dark Hex上分别提升Elo 291和136。

Comments Accepted by the IEEE Conference on Games (IEEE CoG 2026)

详情
AI中文摘要

不完全信息游戏(IIGs)具有挑战性,因为玩家必须在未完全观察真实游戏状态的情况下做出决策。虽然AlphaZero在完美信息游戏中取得了显著成功,但将其扩展到IIGs仍然困难。现有的基于搜索的方法,如完美信息蒙特卡洛(PIMC),存在策略融合问题,而信息集蒙特卡洛树搜索(IS-MCTS)在与神经网络结合时计算成本高昂。在本文中,我们提出了多状态聚合策略评估(MAPLE),一种树搜索方法,它在单个搜索树内聚合来自多个采样世界状态的策略和价值评估,结合了PIMC和IS-MCTS的优点,同时保持可控的计算成本。我们进一步引入基于孪生网络的采样策略,从信息集中选择信息丰富的世界状态。在Phantom Go和Dark Hex上的实验表明,MAPLE显著优于基于PIMC的AlphaZero基线,分别实现了291和136的Elo提升。这些结果表明,MAPLE是一种在不完全信息游戏中进行AlphaZero式学习的有效方法。

英文摘要

Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While AlphaZero has achieved remarkable success in perfect-information games, extending it to IIGs remains difficult. Existing search-based approaches, such as Perfect Information Monte Carlo (PIMC), suffer from strategy fusion, while Information Set Monte Carlo Tree Search (IS-MCTS) incurs high computational cost when combined with neural networks. In this paper, we propose Multi-State Aggregated PoLicy Evaluation (MAPLE), a tree search method that aggregates policy and value evaluations from multiple sampled world states within a single search tree, combining the advantages of PIMC and IS-MCTS while maintaining a controllable computational cost. We further incorporate a Siamese-based sampling strategy to select informative world states from the information set. Experiments on Phantom Go and Dark Hex show that MAPLE significantly outperforms the PIMC-based AlphaZero baseline, achieving Elo improvements of 291 and 136, respectively. These results demonstrate that MAPLE is an effective approach for AlphaZero-style learning in imperfect-information games.

2605.24128 2026-05-26 cs.CV

ImPartial: Multi-channel Whole-Cell Segmentation using Partial Annotations

ImPartial: 使用部分注释的多通道全细胞分割

Gunjan Shrivastava, Saad Nadeem

AI总结 提出ImPartial框架,通过自监督多通道量化插值,在稀疏标注和有限监督下实现与全监督模型相当的全细胞分割性能。

Comments MICCAI'26 Early Accept

详情
AI中文摘要

病理图像中准确的细胞分割通常需要密集的像素级注释,这既昂贵又耗时。这一挑战对于新兴的生物成像模态和具有可变通道配置的多重数据集尤其重要,因为这些数据中专家标注的数据很少。在这项工作中,我们引入了ImPartial,这是一个深度学习框架,旨在使用稀疏涂鸦和有限监督在低标注条件下实现最先进的分割性能。ImPartial通过自监督多通道量化插值增强了分割目标。该方法利用了以下观察结果:精确的像素级重建或图像去噪对于准确分割并非必需,因此引入了一个与整体分割目标更一致的自监督分类目标。我们证明,ImPartial在需要显著更少注释的情况下实现了与全监督模型相当的性能。在基准多重细胞成像和单重临床明场免疫组化数据集上的大量实验表明,仅使用部分注释,ImPartial相对于强基线有一致的改进。所有基准数据集和代码均可通过我们的GitHub获取:https://github.com/nadeemlab/ImPartial。

英文摘要

Accurate cell segmentation in pathology images typically requires dense pixel-wise annotations, which are costly and time-consuming to obtain. This challenge is especially important for emerging biological imaging modalities and multiplexed datasets with variable channel configurations, where expert-labeled data are scarce. In this work, we introduce ImPartial, a deep learning framework designed to achieve state-of-the-art segmentation performance in low-annotation regimes using sparse scribbles and limited supervision. ImPartial augments the segmentation objective via self-supervised multi-channel quantized imputation. This approach leverages the observation that perfect pixel-wise reconstruction or denoising of the image is not needed for accurate segmentation, and thus introduces a self-supervised classification objective that better aligns with the overall segmentation goal. We demonstrate that ImPartial achieves performance on par with fully supervised models while requiring substantially fewer annotations. Extensive experiments on benchmark multiplexed cellular imaging and single-plex clinical brightfield immunohistochemistry datasets show consistent improvements over strong baselines with only partial annotations. All benchmark datasets and code are available via our GitHub: https://github.com/nadeemlab/ImPartial.

2605.24127 2026-05-26 cs.RO

Investigating the Effect of a Series Elastic Actuation Retrofit to Black-Box Actuators

研究串联弹性驱动改造对黑箱执行器的影响

Ivan Tregear, Ayhan Aktas, Ferdinando Rodriguez y Baena

AI总结 通过为黑箱执行器加装串联弹性元件,利用有限元分析设计扭转弹性元件,实现了高保真力测量,将开环力控制带宽从10.32 Hz提升至30.32 Hz,提升2.93倍,且性能优于成本更高的商用传感器。

Comments Related GitHub repo available here: https://github.com/ITregear/SeriesElasticActuation-FYP

详情
AI中文摘要

在机器人应用中,执行器通常设计为具有最小间隙的刚性结构,以确保精度和可重复性。然而,这限制了柔顺性,导致在不确定环境中可能造成损坏和较差的力控制。串联弹性驱动(SEA)引入柔顺性以增强扰动抑制,并能够通过胡克定律进行力测量,但会降低系统带宽。 一个定制的串联弹性(SE)元件被加装到一个黑箱执行器上,以减轻间隙和静摩擦等非线性。集成SE元件实现了高保真力测量,提高了力控制带宽和性能。 通过有限元(FE)分析设计了一个扭转SE元件,其刚度为2155.4 Nm/rad。测量了原始电机和SEA集成配置的开环力控制带宽,同时利用SEA和商用力传感器的反馈评估了闭环带宽。SEA模块将带宽从10.32 Hz提高到30.32 Hz,提升了2.93倍。此外,尽管成本仅为25英镑,其性能仍比商用传感器高出7.63%。

英文摘要

In robotic applications, actuators are typically designed to be stiff with minimal backlash to ensure precision and repeatability. However, this limits compliance, leading to potential damage and poor force control in uncertain environments. Series Elastic Actuation (SEA) introduces compliance to enhance disturbance rejection and enable force measurement via Hooke's Law but reduces system bandwidth. A custom Series Elastic (SE) element was retrofitted to a black-box actuator to mitigate non-linearities like backlash and static friction. Integrating the SE element enabled high-fidelity force measurements, improving force control bandwidth and performance. A torsional SE element was designed through Finite Element (FE) analysis, yielding a stiffness of 2155.4 Nm/rad. Open-loop force control bandwidth was measured for the original motor and the SEA-integrated configuration, while closed-loop bandwidth was assessed using feedback from the SEA and a commercial force sensor. The SEA module increased bandwidth from 10.32 Hz to 30.32 Hz, a 2.93X improvement. Additionally, it outperformed the commercial sensor by 7.63% despite costing 25 GBP, a fraction of the price.
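The force-measurement principle and the reported bandwidth ratio can be checked with a few lines of arithmetic. The stiffness and bandwidth figures come from the abstract; the function name, the encoder-angle inputs, and the sign convention are illustrative assumptions, not the authors' implementation.

```python
STIFFNESS_NM_PER_RAD = 2155.4   # torsional stiffness reported in the abstract

def torque_from_deflection(motor_angle_rad: float,
                           output_angle_rad: float,
                           stiffness: float = STIFFNESS_NM_PER_RAD) -> float:
    """Torsional Hooke's law: torque = stiffness * angular deflection.

    Assumed sign convention: positive when the motor side leads the output side.
    """
    return stiffness * (motor_angle_rad - output_angle_rad)

# A 1 mrad twist of the elastic element corresponds to about 2.16 Nm.
print(torque_from_deflection(0.001, 0.0))

# Ratio of the two reported open-loop bandwidths (Hz).
bandwidth_ratio = 30.32 / 10.32
print(bandwidth_ratio)
```

The ratio evaluates to roughly 2.938, so the "2.93X" in the abstract corresponds to truncating rather than rounding that value.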

2605.24125 2026-05-26 cs.RO

Anisotropic Diffusion-Driven Ergodic Coverage in Multi-Robot Systems

多机器人系统中各向异性扩散驱动的遍历覆盖

Thales C. Silva, Anoop Kiran, Nora Ayanian

AI总结 提出一种基于Perona-Malik各向异性扩散的遍历搜索方法,通过非均匀误差传播生成势场,引导多机器人系统实现更灵活的遍历覆盖。

详情
AI中文摘要

我们考虑在多机器人系统中结合势场和遍历搜索的问题。传统的遍历搜索算法使用遍历性度量来考虑不同尺度下的期望分布。最近,提出了一种热方程驱动的遍历方法,增加了遍历度量平滑的灵活性。然而,这种方法作为各向同性扩散,无论期望分布如何变化,都会在所有方向上均匀传播误差。我们引入了遍历性问题的一类通用各向异性扩散公式,为遍历搜索生成势场。我们证明该方法推广了先前考虑径向基函数和热方程解来表示目标密度分布与覆盖轨迹之间差异的结果。在我们的解决方案中,智能体运动使用Perona-Malik扩散解的梯度进行引导,并且我们的公式将热方程作为特例。我们通过一系列不同场景的仿真来展示该方法。

英文摘要

We consider the problem of combining potential fields and ergodic search in multi-robot systems. Traditional ergodic search algorithms use metrics for ergodicity that account for the desired distribution at different scales. Recently, a heat equation-driven ergodic approach was proposed, which adds flexibility to the smoothing of the ergodic metric. However, because such an approach is an isotropic diffusion, it propagates the error uniformly in all directions, regardless of changes in the desired distribution. We introduce a general class of anisotropic diffusion formulations of the ergodicity problem, which generate a potential field for the ergodic search. We demonstrate that this approach generalizes previous results, which consider radial basis functions and the solution of the heat equation to represent the difference between the goal density distribution and the covered trajectories. In our solution, the agent movement is directed using the gradient of the solution of the Perona-Malik diffusion, and our formulation includes the heat equation as a special case. We demonstrate the methodology with a series of simulations in different scenarios.
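The claim that the heat equation is a special case of Perona-Malik diffusion can be seen in a small numerical sketch. This is a generic explicit discretization on a periodic grid, not the authors' solver: the conductance function is one of the two standard Perona-Malik choices, and the grid, step size, and periodic boundary (via `np.roll`) are assumptions made for brevity.

```python
import numpy as np

def conductance(diff: np.ndarray, kappa: float) -> np.ndarray:
    """Edge-stopping function g(s) = 1 / (1 + (s / kappa)^2)."""
    return 1.0 / (1.0 + (diff / kappa) ** 2)

def perona_malik_step(u: np.ndarray, kappa: float, step: float = 0.2) -> np.ndarray:
    """One explicit Perona-Malik update using the four nearest neighbours."""
    flux = np.zeros_like(u, dtype=float)
    for axis in (0, 1):
        for shift in (1, -1):
            d = np.roll(u, shift, axis=axis) - u   # difference to one neighbour
            flux += conductance(d, kappa) * d      # small weight across sharp changes
    return u + step * flux

def heat_step(u: np.ndarray, step: float = 0.2) -> np.ndarray:
    """One explicit isotropic (heat equation) update on the same grid."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
           + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)
    return u + step * lap

rng = np.random.default_rng(0)
error_field = rng.random((16, 16))   # stand-in for the goal-minus-coverage difference
smoothed = perona_malik_step(error_field, kappa=0.1)
grad_rows, grad_cols = np.gradient(smoothed)   # a gradient an agent could follow
```

As `kappa` grows, the conductance tends to 1 everywhere and the update coincides with the heat step; a small `kappa` suppresses diffusion across large differences, which is the direction-dependent behaviour the abstract contrasts with isotropic diffusion.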

2605.24117 2026-05-26 cs.AI

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

SkillEvolBench:从情景经验到程序性技能的演化基准测试

Yingtie Lei, Zhongwei Wan, Jiankun Zhang, Samiul Alam, Zixuan Zhong, Peizhou Huang, Xin Wang, Jingxuan Zhang, Donghao Zhou, Yunta Hsieh, Zhihao Dou, Hui Shen, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang

AI总结 提出SkillEvolBench基准,通过六个真实环境中的180个任务,评估大语言模型代理能否将情景经验提炼为可复用的程序性技能,发现当前代理难以形成稳健的技能,且原始轨迹复用优于蒸馏技能。

详情
AI中文摘要

大语言模型(LLM)代理在解决现实任务时会积累丰富的情景轨迹,但目前尚不清楚这种经验能否被提炼为可复用的程序性技能。我们引入了SkillEvolBench,一个用于评估从经验复用到技能形成这一步骤的诊断基准。它包含跨越六个真实代理环境的180个任务,组织成具有共享潜在程序的角色条件任务族。代理从获取任务中学习,使用压缩轨迹和验证器反馈更新外部技能库,然后面对冻结部署任务,测试上下文偏移、对抗性捷径和组合。通过将自生成和策划起始的技能演化与无技能和原始轨迹控制进行比较,SkillEvolBench将程序抽象与基础能力、策划的先验知识和情景轨迹的直接复用分离开来。在十种模型配置和三种代理框架下,我们发现当前代理通常能局部适应,但很少形成稳健的可复用技能。基于技能的条件可以改善获取或重放,个别模型有时在特定部署轴上有所提升,但这些提升在冻结部署下不稳定。原始轨迹复用经常优于蒸馏技能,表明当前的抽象过程丢弃了未来任务仍有用的上下文和程序线索。容量和成本分析进一步表明,编写更多技能或更大的三级资源库并不足够:额外的更新可以改善覆盖范围,同时引入特定于情节的漂移和程序混乱。这些发现将SkillEvolBench定位为一个测试平台,用于衡量一次性经验何时成为持久的程序性知识而非任务局部记忆。

英文摘要

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

2605.24114 2026-05-26 cs.CV

COSY: Compositional 3DGS Synthesis for Disentangled Human Head Editing

COSY: 用于解耦人像编辑的组合式3DGS合成

Florian Barthel, Shalini De Mello, Koki Nagano, Wieland Morgenstern, Anna Hilsmann, Peter Eisert

AI总结 提出一种组合式3DGS生成器架构,通过独立合成头发、皮肤、眼镜和躯干等组件实现语义属性的精确解耦编辑,无需分割掩码或几何先验。

详情
AI中文摘要

近期用于人像的3D高斯泼溅(3DGS) GAN能够实时合成和渲染逼真的3D模型,并在身份和外观上提供丰富的多样性。然而,控制特定语义属性(如发色或眼镜)仍然具有挑战性,因为纠缠的潜在空间中的编辑常常导致身份或外观的意外变化。尽管有几种方法旨在通过估计仅修改特定特征的方向来解耦训练后的潜在空间,但这些方法无法保证完全解耦,并且通常需要预训练的分类器。在我们的方法中,我们提出了一种新的生成器架构,该架构完全独立地合成组件,例如头发、皮肤、眼镜和躯干。这使得可以更改一个区域的潜在向量,同时保持其余部分固定。此外,我们仅使用稀疏信息(如头发或皮肤颜色)实现这种分离,消除了先前工作中常见的分割掩码或几何先验的需求。为了确保编辑过程中形状和光照条件的匹配,我们允许独立生成器之间通过上下文令牌共享最少的信息。这些令牌甚至使我们能够控制形状和光照,而无需任何先验注释。与现有的基于GAN的生成和编辑工作相比,我们的方法显示出更好的解耦性、更精确的编辑控制和有竞争力的视觉质量。

英文摘要

Recent 3D Gaussian Splatting (3DGS) GANs for human heads synthesize and render photorealistic 3D models in real time and offer a vast variety in identity and appearance. However, controlling specific semantic attributes such as hair color or glasses remains challenging, as edits in the entangled latent space often induce unintended changes in identity or appearance. Although several methods aim to disentangle the latent space post-training by estimating directions that only modify certain features, these methods cannot guarantee complete disentanglement and often require pre-trained classifiers. In our approach, we propose a new generator architecture that synthesizes components, such as hair, skin, glasses, and torso, completely independently. This allows for changing the latent vector for one region while keeping the remaining parts fixed. Further, we achieve this separation using only sparse information such as the hair or skin color, eliminating the need for segmentation masks or geometric priors, often seen in prior work. To ensure matching shape and lighting conditions during editing, we allow minimal shared information via context tokens between the independent generators. These tokens even allow us to control the shape and light, without any prior annotation. Compared to existing works on GAN-based generation and editing, our method shows better disentanglement, more precise editing control, and competitive visual quality.

2605.24113 2026-05-26 cs.LG math.DG math.OC math.ST stat.TH

Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

黎曼原型分析:变形星形分布上的可解释非线性数据分析

Willem Diepeveen, Deanna Needell

AI总结 提出黎曼原型分析(RAM),通过数据驱动的拉回几何将经典原型分析扩展到非线性流形,结合可解释性与非线性表达能力,并基于凸松弛与非凸细化的优化方案实现。

详情
AI中文摘要

经典原型分析因其可解释性而具有吸引力,但其线性几何在处理强非线性结构的数据时可能限制性能;同时,现有的神经扩展提高了灵活性,但往往削弱了原型和插值的几何意义。在这项工作中,我们基于数据驱动的拉回几何,针对实值数据开发了黎曼版本的原型分析,旨在结合经典原型分析的可解释性与现代非线性模型的表达能力。我们引入了一类变形星形分布及其相关的拉回黎曼几何,以提供所得流形映射的统计解释,将黎曼原型映射(RAM)定义为投影到原型的测地凸组合流形上,并提出了基于凸松弛后接非凸细化的实用优化方案。我们进一步提出了一种学习方案,从数据中产生合理但通常次优的变形星形分布。在合成示例和MNIST上的实验表明,所提出的框架产生了有意义的测地线、有用的去噪投影和几何感知分类,同时也明确了当前优化限制所在。

英文摘要

Classical archetypal analysis is appealing for its interpretability, but its linear geometry can limit performance on data with strongly non-linear structure; at the same time, existing neural extensions improve flexibility while often weakening the geometric meaning of archetypes and interpolations. In this work, we develop a Riemannian version of archetypal analysis based on data-driven pullback geometry for real-valued data, with the goal of combining the interpretability of classical archetypal analysis with the expressive power of modern non-linear models. We introduce a class of deformed star distributions together with associated pullback Riemannian geometry to provide a statistical interpretation of the resulting manifold mappings, define the Riemannian archetypal mapping (RAM) as a projection onto the manifold of geodesically convex combinations of archetypes, and propose a practical optimization scheme based on convex relaxation followed by non-convex refinement. We further propose a learning scheme that yields reasonable, albeit generally suboptimal, deformed star distributions from data. Experiments on synthetic examples and MNIST show that the resulting framework produces meaningful geodesics, useful denoising projections, and geometry-aware classifications, while also clarifying where current optimization limitations remain.