arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2606.19804 2026-06-19 cs.CV New submission

HypOProto: Hyperbolic Ordinal Prototypes for Left Ventricular Filling Pressure Classification

Victoria Wu, Nima Hashemi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang

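Affiliations The University of British Columbia; Vancouver General Hospital

AI summary Proposes the HypOProto framework, which classifies left ventricular filling pressure using ordinal prototypes in hyperbolic space and achieves high accuracy together with clinical interpretability on top of a frozen, explainable foundation model backbone.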

Abstract

Echocardiography (echo) is a widely used imaging modality for assessing cardiac function, with Left Ventricular Filling Pressure (LVFP) serving as a critical physiological marker for conditions such as heart failure. Standard LVFP classification into normal vs. elevated categories relies on the Doppler-derived $E/e'$ ratio, which is operator-dependent and often unavailable in resource-limited settings, motivating methods that infer LVFP directly from B-mode echo. Existing deep learning approaches achieve high performance but remain largely black-box, limiting clinical interpretability. We propose HypOProto, a hyperbolic, ordinal prototype-based framework for interpretable LVFP classification using a frozen, explainable foundation model backbone. HypOProto arranges prototypes along the physiological $E/e'$ scale, placing borderline cases near the hyperboloid root where small angular differences separate similar cases, while normal and elevated cases occupy outward positions reflecting increasing diagnostic certainty. This hyperbolic geometry encodes clinically meaningful ordinal relationships and improves interpretability. We also introduce a novel Hyperbolic Prototype Angular Separation (HyperPAS) loss, enforcing inter-class prototype separation in hyperbolic space. HypOProto achieves SOTA performance while maintaining transparency, and highlights clinically relevant regions in visualizations. This work represents the first prototype-based framework for LVFP classification in echo. Our code can be found at https://github.com/DeepRCL/HypOProto.
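
The geometric intuition in the abstract (the same angular gap separates points only slightly near the hyperboloid root and strongly farther out) can be checked numerically. The sketch below is not the authors' code and does not implement the HyperPAS loss; it assumes the standard two-dimensional hyperboloid (Lorentz) model with curvature -1, and the function names and the radii 0.3 and 3.0 are illustrative choices.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + x1*y1 + x2*y2."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def from_polar(radius, angle):
    """Point on the hyperboloid at hyperbolic distance `radius` from the root (1, 0, 0)."""
    return (
        math.cosh(radius),
        math.sinh(radius) * math.cos(angle),
        math.sinh(radius) * math.sin(angle),
    )

def hyperbolic_distance(x, y):
    """Geodesic distance on the hyperboloid; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

ROOT = (1.0, 0.0, 0.0)
ANGLE_GAP = 0.2  # illustrative angular separation between two prototypes

# Same angular gap, two different distances from the root.
near_root = hyperbolic_distance(from_polar(0.3, 0.0), from_polar(0.3, ANGLE_GAP))
far_out = hyperbolic_distance(from_polar(3.0, 0.0), from_polar(3.0, ANGLE_GAP))
```

Under these assumptions `near_root` is much smaller than `far_out`, which matches the description of borderline cases sitting close together near the root while confidently normal or elevated cases spread apart farther out.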

2606.19803 2026-06-19 cs.DB cs.AI cs.LG New submission

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

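Affiliations Portland State University

AI summary Presents a vision for policy-aware vector search: formalizes the fine-grained access control (FGAC) policy model and the enforcement problem in vector databases, compares enforcement strategies, and identifies future challenges.

Comments Accepted at SeQureDB 26, SIGMOD 2026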

Abstract

Vector databases are increasingly used in security-sensitive contexts such as Retrieval-Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-Grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high approximate nearest neighbor (ANN) search recall, and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
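
The tension between policy enforcement and recall can be illustrated with two textbook enforcement orders, filtering after the search versus filtering before it. The abstract does not say which strategies the paper compares, so this is only a sketch: it uses exact brute-force cosine search instead of an ANN index, and the documents, the `allowed` set, and the function names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def post_filter(query, docs, allowed, k):
    """Search first, then drop results the user may not see (may return fewer than k)."""
    ranked = sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in ranked if d["id"] in allowed]

def pre_filter(query, docs, allowed, k):
    """Restrict to permitted documents first, then search among them."""
    visible = [d for d in docs if d["id"] in allowed]
    ranked = sorted(visible, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]
    return [d["id"] for d in ranked]

# Hypothetical corpus: the two nearest documents are outside the user's policy.
docs = [
    {"id": "a", "vec": (1.0, 0.0)},
    {"id": "b", "vec": (0.9, 0.1)},
    {"id": "c", "vec": (0.5, 0.5)},
    {"id": "d", "vec": (0.0, 1.0)},
]
allowed = {"c", "d"}
query = (1.0, 0.0)

post = post_filter(query, docs, allowed, k=2)
pre = pre_filter(query, docs, allowed, k=2)
```

Both functions return only permitted documents, but post-filtering comes back empty here because the top-2 overall were both forbidden, while pre-filtering returns the best permitted matches at the cost of evaluating the policy before the search.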

2606.19802 2026-06-19 cs.LG cs.CV New submission

Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

Nicolas Zilberstein, Morteza Mardani, Santiago Segarra

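Affiliations Rice University; NVIDIA Inc.

AI summary Shows that flow map models can be tuned continuously between MMSE and perceptual quality through a single parameter t, giving a distortion-perception tradeoff for inverse problems without extra supervision or hyperparameter tuning.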

Abstract

Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter t acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying t exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA ($128\times 128$) and AFHQ ($256\times 256$) across several linear and nonlinear inverse tasks validate our findings.
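
To make the two extremes of the tradeoff concrete, consider the simplest Gaussian case: a scalar signal x ~ N(0, 1) observed as y = x + sigma * n with n ~ N(0, 1). The sketch below evaluates the standard closed-form results for the MMSE estimator and for posterior sampling. It does not implement a flow map and says nothing about how the parameter t moves between the two points; the function names are illustrative.

```python
def mmse_operating_point(sigma):
    """MMSE estimate E[x | y] = y / (1 + sigma^2): lowest error, shrunken outputs."""
    posterior_var = sigma**2 / (1.0 + sigma**2)
    return {
        "mse": posterior_var,
        # Variance of the estimate itself; below the prior variance of 1 ("blurry").
        "output_variance": 1.0 / (1.0 + sigma**2),
    }

def posterior_sampling_operating_point(sigma):
    """A sample from p(x | y): outputs follow the prior exactly, at twice the error."""
    posterior_var = sigma**2 / (1.0 + sigma**2)
    return {
        "mse": 2.0 * posterior_var,
        "output_variance": 1.0,
    }

blurry = mmse_operating_point(sigma=1.0)
sharp = posterior_sampling_operating_point(sigma=1.0)
```

With sigma = 1 the MMSE point has half the error of the posterior sample, but its outputs have only half the variance of the true signal, which is the scalar analogue of a blurry reconstruction.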

2606.19799 2026-06-19 cs.SE cs.LG New submission

The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions

Bashar Abdallah, Gustavo Santos, Rola Al Bataineh, Alain Abran, Mohammad Hamdaqa

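AI summary Studies how two resource-leak smells in TensorFlow/Keras (IMR and UTR) affect energy consumption and carbon emissions; experiments show that they increase electricity consumption by about 32% and 46%, respectively, indicating that resource leaks measurably reduce ML energy efficiency and add to the environmental burden.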

Abstract

Efficiency and sustainability are critical considerations in the development and deployment of machine learning (ML) applications. Among the factors influencing sustainability, resource leaks in ML code can introduce hidden inefficiencies that elevate energy consumption and CO2 emissions. Despite this, empirical evidence quantifying their environmental impact remains limited. This emerging results paper presents an initial empirical investigation of two common resource-leak smells, namely Improper Model Reuse (IMR) and Unreleased Tensor References (UTR), and their impact on energy consumption and CO2 emissions in TensorFlow and Keras workloads. Controlled experiments were conducted for each smell by executing identical training tasks while comparing against a smell-free baseline. Our preliminary results show that both smells consistently increase estimated electricity usage and carbon emissions. IMR and UTR increased electricity consumption by approximately 32% and 46%, respectively, with proportional increases in CO2 emissions. Paired statistical tests indicate that these differences are systematic and statistically significant, providing initial empirical evidence that resource-leak smells may degrade ML energy efficiency and environmental sustainability. These findings suggest that resource-leak smells pose measurable risks to both software quality and sustainability, emphasizing the importance of integrating resource-lifecycle management and energy-efficiency considerations into ML development.
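
The "proportional increases in CO2 emissions" follow directly if emissions are estimated as electricity use multiplied by a fixed grid carbon intensity. The short calculation below illustrates that relationship only. The baseline energy and the carbon intensity are hypothetical placeholders, not values from the paper; only the 32% and 46% increases come from the abstract.

```python
def co2_kg(energy_kwh, intensity_kg_per_kwh):
    """Estimated emissions: electricity used times grid carbon intensity."""
    return energy_kwh * intensity_kg_per_kwh

BASELINE_KWH = 1.0          # hypothetical energy of the smell-free training run
INTENSITY_KG_PER_KWH = 0.4  # hypothetical grid carbon intensity

# Relative electricity increases reported in the abstract.
ELECTRICITY_INCREASE = {"IMR": 0.32, "UTR": 0.46}

baseline_co2 = co2_kg(BASELINE_KWH, INTENSITY_KG_PER_KWH)
co2_increase = {
    smell: co2_kg(BASELINE_KWH * (1.0 + rise), INTENSITY_KG_PER_KWH) / baseline_co2 - 1.0
    for smell, rise in ELECTRICITY_INCREASE.items()
}
```

Because the intensity cancels in the ratio, the relative CO2 increase equals the relative electricity increase regardless of the placeholder values chosen.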

2606.19795 2026-06-19 cs.SE cs.AI New submission

Agentic Electronic Design Automation: A Handoff Perspective

Jiawei Liu, Peiyi Han, Yuntao Lu, Su Zheng, Fengyu Yan, Bei Yu

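Affiliations The Chinese University of Hong Kong; Primarius Technologies

AI summary Taking handoff validity as its starting point, groups agentic systems in EDA flows into three classes and proposes a five-layer agent communication protocol to address state transfer and validation across stages and tools.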

Abstract

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.
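
The definition of a valid handoff (the consumer's acceptance conditions hold, and the object carries context, evidence, and provenance) maps naturally onto a small data structure. The sketch below is one possible reading, not the survey's formalism or the EACP protocol; the `Handoff` class, its field names, and the timing example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Handoff:
    """Object passed from one stage, tool, or agent to the next consumer."""
    payload: Dict[str, float]
    context: str = ""
    evidence: List[str] = field(default_factory=list)
    provenance: str = ""

def is_valid(handoff: Handoff, acceptance_conditions: List[Callable[[Dict[str, float]], bool]]) -> bool:
    """Valid when every consumer condition holds and the supporting metadata is present."""
    meets_conditions = all(condition(handoff.payload) for condition in acceptance_conditions)
    documented = bool(handoff.context and handoff.evidence and handoff.provenance)
    return meets_conditions and documented

# Hypothetical consumer rule: timing must be met before the next stage accepts the design.
def timing_met(payload: Dict[str, float]) -> bool:
    return payload.get("worst_negative_slack_ns", -1.0) >= 0.0

documented_report = Handoff(
    payload={"worst_negative_slack_ns": 0.12},
    context="post-route timing, typical corner",
    evidence=["timing_report.txt"],
    provenance="run 7 of a hypothetical timing tool",
)
bare_report = Handoff(payload={"worst_negative_slack_ns": 0.12})
```

Here `documented_report` passes while `bare_report` fails even though its numbers are identical, which reflects the point that a stage-local check on the payload alone does not make a handoff valid.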

2606.19792 2026-06-19 cs.SD New submission

Exploring Pre-training Benefits on Phoneme Addition through Fine-tuning in Speech Synthesis

Masato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda

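Affiliations CyberAgent, Japan; Nagoya University, Japan

AI summary Studies how pre-trained models behave when new phonemes are added during fine-tuning, finding that pre-training mainly improves naturalness and offers limited benefit for adding new phonemes.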

Comments Accepted by INTERSPEECH 2026

Abstract

Transfer learning is widely used for low-resource text-to-speech. When the target corpus contains phonemes unseen in pre-training, the model must expand its phoneme inventory during fine-tuning; we call the process "phoneme addition." However, it remains unclear whether the pre-trained ability to generate seen phonemes contributes to this process. This study investigates phoneme addition in two settings: (1) a simulation setup using LLM-generated phoneme-controlled corpora that enables investigation without considering confounding factors, and (2) a real-speech cross-lingual transfer setup (English to Japanese) to validate whether the findings hold in practice. Experiments in both settings showed that while fine-tuning achieved higher naturalness than training from scratch, it required as much or more data to achieve comparable PER for new phonemes. These results indicate that pre-training mainly contributes to naturalness improvement, but offers limited benefit for phoneme addition.
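
The abstract reports results in terms of PER without expanding the acronym. Assuming it stands for phoneme error rate in the usual sense (edit distance between the reference and recognized phoneme sequences, divided by the reference length), it can be computed as in the sketch below. The phoneme sequences are made-up examples, and how the paper obtains the recognized sequence is not stated.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                               # deletion
                current[j - 1] + 1,                            # insertion
                previous[j - 1] + (ref_symbol != hyp_symbol),  # substitution or match
            ))
        previous = current
    return previous[-1]

def phoneme_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference phonemes."""
    return edit_distance(reference, hypothesis) / len(reference)

# Illustrative sequences: one phoneme is substituted in the synthesized output.
reference = ["a", "r", "i", "g", "a", "t", "o"]
hypothesis = ["a", "l", "i", "g", "a", "t", "o"]
per = phoneme_error_rate(reference, hypothesis)
```

Restricting the reference to positions that contain newly added phonemes would give a PER for new phonemes only, which is the quantity the abstract compares between fine-tuning and training from scratch.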

2606.19790 2026-06-19 cs.CE New submission

The Orchestration Gap: Why Process Automation Stalls in Operationally Complex Industries

Jiechao Gao, Yuandong Pan, Yuangang Li, Jie Wang, Kincho Law, Michael Lepech

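AI summary Introduces the concept of the "orchestration gap", analyzes why multi-agent systems fall short when automating complex industries such as logistics and healthcare, and outlines a staged automation path built on constraint enforcement and explainability.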

Abstract

Agentic systems have advanced quickly on digitally native tasks, yet they have barely touched the industries where coordinated automation could matter most: logistics, healthcare operations, construction, and the many sectors whose work is spread across incompatible tools and many hands. We argue that the reason is a missing abstraction. The value in these settings does not come from a single capable model invocation; it comes from orchestration, the runtime that coordinates multi-step workflows, enforces hard domain constraints, manages human approval, and bridges legacy systems. We develop this idea into a usable conceptual frame. We give an operational test for which workflows are orchestration-bound, a decomposition that separates how tangled a workflow is from how much of its effort is coordination and what that coordination is worth, and a feature-level account of why today's multi-agent frameworks leave a specific gap. We then advance our central claim: the right automation path is staged, and which architectural guarantee carries the most weight depends on a sector's dominant source of friction. Constraint enforcement is load-bearing under regulatory friction; explainability is load-bearing under liability friction. We close with the research program this view implies.

2606.19788 2026-06-19 cs.AI cs.CL New submission

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

Affiliations School of Artificial Intelligence, Jilin University; Czech Technical University in Prague; CRRC Zhuzhou Institute; Tengen Intelligence Institute; International Center of Future Science, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE

AI summary Proposes CombEval, a dynamic benchmark that generates combinatorial counting problems from typed Cofola specifications, and evaluates 11 large language models in direct and code-augmented settings, finding them brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies.

Comments Under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

Abstract

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.
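
A small counting problem that combines two of the brittle categories (indistinguishable elements and a relative positional constraint) shows what an exactly verified answer looks like. The sketch below uses brute-force enumeration as a stand-in verifier; it does not use the Cofola language or the paper's solver, and the problem itself is made up: count the distinct arrangements of the letters A, A, B, C in which B appears to the left of C.

```python
from itertools import permutations
from math import factorial

def count_arrangements(items, constraint):
    """Count distinct orderings of a multiset that satisfy `constraint`."""
    distinct_orderings = set(permutations(items))  # the set merges swaps of identical items
    return sum(1 for ordering in distinct_orderings if constraint(ordering))

def b_left_of_c(ordering):
    """Relative positional constraint: B must appear before C."""
    return ordering.index("B") < ordering.index("C")

items = ["A", "A", "B", "C"]
total = count_arrangements(items, lambda ordering: True)  # 12 distinct arrangements
constrained = count_arrangements(items, b_left_of_c)      # 6 of them place B before C

# Closed-form cross-check: 4! / 2! orderings, half of which place B before C.
closed_form_total = factorial(4) // factorial(2)
```

Treating the two A's as distinguishable would give 24 and 12 instead, which is the kind of counting-principle mistake the error analysis in the abstract refers to.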

2606.19787 2026-06-19 cs.AI New submission

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

Jiajun Li, Mingshu Cai, Yixuan Li, Yu Ding, Ran Hou, Guanyu Nie, Xiongwei Han, Wanyuan Wang

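Affiliations Southeast University; Waseda University; Nanyang Technological University

AI summary Proposes the ORAgentBench benchmark for evaluating LLM agents on end-to-end operations research tasks, finding that the best current agent passes only 35.51% of tasks and that failures are dominated by strategic weaknesses.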

Comments 31 pages, preprint, v1

Abstract

Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.
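
The three checks applied to each submission (schema validity, hard-constraint feasibility, and normalized objective quality) can be pictured as a staged validator. The sketch below is a guess at the general shape, not the benchmark's hidden validators: the knapsack-style task, the item data, the 0.9 quality threshold, and the status strings are all hypothetical.

```python
# Hypothetical task: choose items to maximize value within a weight capacity.
ITEMS = {"x": (4, 10), "y": (3, 7), "z": (2, 4)}  # id -> (weight, value)
CAPACITY = 6
REFERENCE_VALUE = 14  # best known objective (items x and z)

def validate(submission, quality_threshold=0.9):
    """Return the first failing stage, or "pass" if all three checks succeed."""
    # Stage 1: schema validity.
    if not isinstance(submission, dict) or set(submission) != {"selected"}:
        return "schema_error"
    selected = submission["selected"]
    if not isinstance(selected, list) or len(set(selected)) != len(selected):
        return "schema_error"
    if any(item not in ITEMS for item in selected):
        return "schema_error"
    # Stage 2: hard-constraint feasibility.
    if sum(ITEMS[item][0] for item in selected) > CAPACITY:
        return "infeasible"
    # Stage 3: objective quality normalized by the reference value.
    value = sum(ITEMS[item][1] for item in selected)
    if value / REFERENCE_VALUE < quality_threshold:
        return "feasible_low_quality"
    return "pass"

results = {
    "wrong_key": validate({"chosen": ["x"]}),
    "overweight": validate({"selected": ["x", "y"]}),
    "mediocre": validate({"selected": ["y", "z"]}),
    "best": validate({"selected": ["x", "z"]}),
}
```

The `mediocre` case mirrors the abstract's observation that many feasible submissions still fall below the required quality threshold.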

2606.19784 2026-06-19 cs.RO New submission

EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models

Thien-Loc Ha, Quang-Tan Nguyen, Trong-Bao Ho, Long Dinh, Minh Duc Nguyen, Gia-Binh Nguyen, Pham Tri Quang, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien

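Affiliations VinRobotics; VinUniversity; DFKI (German Research Center for Artificial Intelligence); University of Stuttgart; IMPRS-IS (International Max Planck Research School for Intelligent Systems)

AI summary Proposes EquiVLA, the first end-to-end SO(2)-equivariant VLA framework, which uses EquiPerceptor and EquiActor to build an approximately equivariant chain from vision to action and improves performance on LIBERO, CALVIN, and real-robot tasks.

Comments First version, 22 pages, project site: https://equivla.github.io/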

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end $\mathrm{SO}(2)$-equivariant VLA models, applicable to any architecture coupling a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. EquiVLA introduces EquiPerceptor, which produces approximately $\mathrm{SO}(2)$-equivariant visual representations from frozen ViT features; and EquiActor, an exactly $\mathrm{SO}(2)$-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate $\mathrm{SO}(2)$ equivariance chain from camera observations to predicted action sequences. Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD$\to$D, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves $92.6\%$ average success on LIBERO (vs. $78.1\%$ baseline), an average sequence length of $4.03$ on CALVIN (vs. $3.45$), and improves real-robot success from $54\%$ to $72\%$.
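
$\mathrm{SO}(2)$ equivariance means that rotating the input and then applying the policy gives the same result as applying the policy and then rotating the output, i.e. f(R x) = R f(x). The toy check below illustrates the property on 2D vectors only. It is unrelated to the EquiPerceptor and EquiActor architectures, and both policies are made-up functions chosen so that one satisfies the property and the other does not.

```python
import math

def rotate(vector, theta):
    """Rotate a 2D vector counterclockwise by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vector[0] - s * vector[1], s * vector[0] + c * vector[1])

def equivariant_policy(observation):
    """Toy action: step toward the origin with a gain that depends only on the norm."""
    gain = -0.5 / (1.0 + math.hypot(*observation))
    return (gain * observation[0], gain * observation[1])

def non_equivariant_policy(observation):
    """Toy action with a fixed y-component, which breaks rotational symmetry."""
    return (-0.5 * observation[0], 0.1)

def equivariance_error(policy, observation, theta):
    """Distance between policy(rotate(x)) and rotate(policy(x)); zero if equivariant."""
    a = policy(rotate(observation, theta))
    b = rotate(policy(observation), theta)
    return math.hypot(a[0] - b[0], a[1] - b[1])

observation = (1.0, 2.0)
theta = 0.7
good_error = equivariance_error(equivariant_policy, observation, theta)
bad_error = equivariance_error(non_equivariant_policy, observation, theta)
```

An "approximately equivariant" component in the abstract's sense would correspond to a small but nonzero error in this kind of check, while an "exactly equivariant" one would give zero up to rounding.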

2606.19782 2026-06-19 cs.AI cs.CL New submission

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Aravind Narayanan, Shaina Raza

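Affiliations Vector Institute

AI summary Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records them in a traceable Model Evaluation Packet (MEP), enabling auditability and on-premise deployment for financial chart QA and improving accuracy on FinMME by 7.68 percentage points.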

Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
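
The McNemar test cited for the 71.24% vs. 63.56% comparison looks only at the questions on which the two systems disagree. The sketch below implements the exact (binomial) two-sided form of the test. The abstract does not say which variant was used or report the disagreement counts, so the counts in the example are hypothetical.

```python
from math import comb

def mcnemar_exact_p(only_pipeline_correct, only_baseline_correct):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_pipeline_correct + only_baseline_correct
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence of a difference
    smaller = min(only_pipeline_correct, only_baseline_correct)
    # Probability of a split at least this lopsided under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2**n
    return min(1.0, 2.0 * tail)

# Hypothetical disagreement counts on a small evaluation set.
p_value = mcnemar_exact_p(only_pipeline_correct=15, only_baseline_correct=5)
```

Questions that both systems answer correctly, or both answer incorrectly, do not enter the statistic, which is why a paired test is appropriate when the same benchmark items are scored under both configurations.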

2606.19776 2026-06-19 cs.CV New submission

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

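Affiliations School of Electronic Science and Engineering, Nanjing University

AI summary Proposes Occ-VLM, which uses only posed RGB images and a single 2D vision encoder and reconstructs 3D occupancy as a geometric prior for unified 3D scene understanding; it reaches state-of-the-art multi-view occupancy prediction and performs on par with 3D-input VLMs on 3D VQA and dense captioning.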

Abstract

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2606.19775 2026-06-19 cs.SI stat.AP stat.OT New submission

Rethinking Sampling Strategy in Link Prediction

Yilin Bi, Zhenyu Deng, Xinshan Jiao, Tao Zhou

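AI summary Proposes the β-sampling scheme to study how two-stage sampling affects link prediction performance, finding that the structural characteristics of missing links substantially affect prediction accuracy and that the choice of second-stage sampling strategy matters.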

Comments 19 pages, 5 figures, 3 tables

Abstract

Many real-world networks are incomplete, making link prediction a fundamental challenge in network science. To train parameters and evaluate algorithms, observed links are usually divided into three subsets, namely training, validation, and probe sets. This division implicitly involves two sampling processes: first-stage sampling yields the probe set and second-stage sampling obtains the validation set. To date, our understanding of how these two sampling processes affect algorithm performance remains quite limited. To address this issue, we propose a sampling scheme called $\beta$-sampling, where the sampling probability of a link is proportional to the product of the degrees of its two endpoints raised to the power of $\beta$. Experiments on 45 real-world networks reveal that the structural characteristics of missing links, as simulated via varying probe sets, substantially impact prediction accuracy. When missing links tend to connect high-degree nodes, such links can be predicted accurately with ease. Furthermore, even with a fixed probe set, second-stage sampling still exerts a significant influence on prediction accuracy. Notably, the optimal second-stage sampling strategy differs from random sampling (which randomly selects links to form the validation set) and consistent sampling (which guarantees that links in the validation and probe sets share identical structural characteristics).
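
The sampling rule stated in the abstract, with the probability of a link (u, v) proportional to (k_u * k_v)^β, can be sketched as follows. The abstract does not describe the exact drawing procedure, so the sequential draw without replacement, the use of degrees from the full observed graph, and the function names are assumptions; the five-node graph is a made-up example.

```python
import random
from collections import Counter

def beta_sampling_weights(edges, beta):
    """Normalized sampling probability of each link: (k_u * k_v) ** beta."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    raw = [(degree[u] * degree[v]) ** beta for u, v in edges]
    total = sum(raw)
    return [weight / total for weight in raw]

def sample_probe_set(edges, beta, size, seed=0):
    """Draw `size` distinct links one at a time, weighted by the beta-sampling rule."""
    rng = random.Random(seed)
    remaining = list(edges)
    weights = beta_sampling_weights(edges, beta)
    chosen = []
    for _ in range(size):
        index = rng.choices(range(len(remaining)), weights=weights)[0]
        chosen.append(remaining.pop(index))
        weights.pop(index)
    return chosen

# Hypothetical graph: a hub with three neighbors, plus one peripheral link.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")]
uniform = beta_sampling_weights(edges, beta=0)         # every link equally likely
favor_hubs = beta_sampling_weights(edges, beta=1)      # high-degree endpoints favored
favor_periphery = beta_sampling_weights(edges, beta=-1)
probe = sample_probe_set(edges, beta=1, size=2)
```

With β = 0 the scheme reduces to uniform random sampling, positive β shifts the probe set toward links between high-degree nodes, and negative β shifts it toward links between low-degree nodes.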

2606.19774 2026-06-19 cs.RO New submission

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection

Trong-Bao Ho, Quang-Tan Nguyen, Thien-Loc Ha, Gia-Binh Nguyen, Viet-Thanh Nguyen, Long Dinh, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien

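Affiliations VinRobotics; VinUniversity; DFKI (German Research Center for Artificial Intelligence); University of Stuttgart; IMPRS-IS (International Max Planck Research School for Intelligent Systems)

AI summary To address inconsistencies at action-chunk boundaries when flow-based policies are executed asynchronously, proposes PAINT, a training-free method that achieves prefix consistency through initial noise selection rather than trajectory steering, improving execution consistency and task performance on 12 simulated and 6 real-world manipulation tasks.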

Comments First version 19 pages, project site: https://paint-action-chunking.github.io

Abstract

Action chunking enables robot policies to produce temporally coherent behavior, but generating multi-step action sequences with flow-based policies incurs latency that is incompatible with real-time control. Under asynchronous execution, the robot continues executing the current chunk while the next one is generated, causing even minor delays to create inconsistencies at chunk boundaries. Existing methods address this problem by steering generation toward the already executed action prefix. We instead show that prefix consistency can be achieved by selecting an appropriate initial noise before generation begins, allowing the unmodified flow ODE to produce a coherent next chunk. This reframes asynchronous inference as a noise selection problem rather than a trajectory steering problem. We introduce PAINT, a training-free method that finds this noise via backward Euler inversion and constructs the final chunk through a repainting rule. In summary, PAINT requires no gradients, retraining, or policy modification; yet it improves execution consistency and task performance across 12 simulated benchmarks and 6 real-world manipulation tasks spanning single-arm, bimanual, and humanoid embodiments. Website: https://paint-action-chunking.github.io

2606.19771 2026-06-19 cs.AI New submission

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

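Affiliations The Hong Kong University of Science and Technology; Sichuan University; The Hong Kong Polytechnic University

AI summary To address the entropy collapse or explosion caused by token updates in RLVR, proposes the ICT framework, which uses JS divergence to identify critical tokens and balances policy concentration through selective updates, improving reasoning performance.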

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
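
The three quantities named in this abstract (Jensen-Shannon divergence, Shannon entropy, and second-order Rényi entropy) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation. In particular, `select_distinctive` and its choice of comparing each token position against the mean distribution are assumptions, since the abstract does not say which distributions are compared.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi2_entropy(p):
    """Second-order Renyi entropy H_2(p) = -log sum_i p_i^2, in nats."""
    return -math.log(sum(x * x for x in p))

def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log 2."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

def select_distinctive(dists, fraction=0.1):
    """ASSUMPTION (hypothetical helper): rank token positions by their JS
    divergence from the mean distribution and keep the top fraction."""
    n = len(dists)
    mean = [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]
    k = max(1, math.ceil(fraction * n))
    ranked = sorted(range(n), key=lambda i: js_divergence(dists[i], mean), reverse=True)
    return sorted(ranked[:k])
```

For any distribution the second-order Rényi entropy is at most the Shannon entropy, with equality for the uniform distribution, which is why the two can be tracked as separate measures of concentration.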

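2606.19770 2026-06-19 cs.LG New submission

An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling

基于潜在混合建模的图新颖性生成的信息论框架

Itsuki Nakagawa, Kenji Yamanishi

Affiliations * Graduate School of Information Science and Technology, The University of Tokyo

AI summary Proposes an information-theoretic framework that uses latent mixture modeling and description-length constraints to generate novel graph data that differ from existing patterns while preserving global structural consistency.

Details
AI Chinese abstract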

我们提出了一个用于图新颖性生成的信息论框架,旨在生成与现有模式不同且保持全局结构一致性的数据。我们的方法将数据嵌入潜在空间,使用有限混合模型对潜在分布进行建模,并通过基于描述长度制定的显式新颖性和可靠性条件生成新颖样本。具体来说,新颖性通过要求生成样本难以被所有现有混合成分解释来强制执行,而可靠性则根据最小描述长度(MDL)原则约束其对整体混合结构的影响。我们提供了理论分析,表明在适当的阈值选择下,将非新颖或不可靠样本错误分类的概率以显式速率收敛到零。在合成和基准图数据集上的实验表明,所提出的方法能够以可量化的风险实现原则性的新颖性生成。

English abstract

We propose an information-theoretic framework for graph novelty generation, which aims to generate data that are distinct from existing patterns while preserving global structural consistency. Our approach embeds data into a latent space, models the latent distribution using finite mixture models, and generates novel samples by imposing explicit novelty and reliability conditions formulated in terms of description length. Specifically, novelty is enforced by requiring generated samples to be poorly explained by all existing mixture components, while reliability constrains their impact on the overall mixture structure under the Minimum Description Length (MDL) principle. We provide a theoretical analysis showing that, with appropriate threshold choices, the probabilities of misclassifying non-novel or unreliable samples converge to zero with explicit rates. Experiments on synthetic and benchmark graph datasets demonstrate that the proposed method enables principled novelty generation with quantifiable risk.
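
The novelty condition described here, that a sample be poorly explained by every existing mixture component, can be illustrated with code lengths. The sketch below assumes a one-dimensional Gaussian mixture and a hand-picked threshold. Both are assumptions for illustration, and the reliability condition is not modeled because the abstract does not specify it.

```python
import math

def gaussian_code_length(x, mean, std):
    """Code length -log N(x | mean, std^2) in nats (density-based, so it can be negative)."""
    return 0.5 * math.log(2 * math.pi * std * std) + (x - mean) ** 2 / (2 * std * std)

def min_description_length(x, components):
    """Shortest code length over components given as (weight, mean, std):
    -log(weight) pays for naming the component, the rest encodes x under it."""
    return min(-math.log(w) + gaussian_code_length(x, m, s) for w, m, s in components)

def is_novel(x, components, threshold):
    """A sample counts as novel if even its best component needs more than
    `threshold` nats to describe it (threshold is a hypothetical parameter)."""
    return min_description_length(x, components) > threshold
```

With components `[(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]` and a threshold of 6 nats, a sample at 0 is cheap to encode and is rejected, while a sample at 12 is far from both components and passes the novelty check.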

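2606.19769 2026-06-19 cs.RO cs.AI New submission

Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI

人形机器人数据标准:物理AI缺失的基础设施

Shaoshan Liu, Xiugong Qin, Xuan Wu, Xuan Xia, Ning Ding, Jialu Liu, Jie Tang

AI summary Argues that data standards are key infrastructure for the scalability of humanoid robots and, drawing on the authors' work on ISO/WD 26264-1, addresses the problem of non-cumulative data by making embodied experience interpretable, shareable, traceable, and reusable.

Details
AI Chinese abstract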

人形机器人的可扩展性不仅取决于模型和硬件,还取决于物理经验能否在机器人、任务、组织及时间维度上积累。基于作者在ISO/TC 299/WG 16内制定ISO/WD 26264-1《人形机器人数据集——第1部分:通用要求》的工作,本文论证数据标准正成为物理AI的基础设施。我们提出三个见解:第一,人形机器人数据是具身交互数据,而非孤立数字样本的集合;有用的数据集必须保留机器人本体、动作、任务、场景、执行轨迹和结果之间的关系。第二,其价值取决于物理一致性:多模态流仅在时序、坐标系、标定、运动学、单位和同步假设可检查时才可复用。第三,主要瓶颈不仅是数据稀缺,更是由高采集成本、数据孤岛和不一致评估导致的非累积性数据。我们认为人形机器人数据标准通过使具身经验可解释、可共享、可追溯和可复用来解决这些瓶颈。通用标准应为生命周期管理、元数据、来源、质量、版本控制和可追溯性提供横向基础设施,而能力特定部分应定义操作、移动、人机交互、认知及未来人形能力的领域语法。随着AI从屏幕进入实体,数据标准必须从组织数字信息演变为结构化物理交互。

English abstract

The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.

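2606.19759 2026-06-19 cs.AI cs.SI New submission

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

知识工作者问答论坛中的最优调度

Rohit Negi, Mustafa Yilmaz

Affiliations * Carnegie Mellon University

AI summary For question-answering forums staffed by knowledge workers, proposes a request-scheduling model based on experts' expertise levels, computes the system capacity, designs capacity-achieving schedulers, and examines how collaboration among experts can increase capacity.

Comments 14 pages, 4 figures

Details
AI Chinese abstract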

随着个人转向互联网寻找他们可能遇到的问题的答案,一些问答论坛已经发展起来,在这些论坛中,某些主题知识渊博的用户可以贡献他们的专业知识来回答这些信息请求。虽然目前这些是志愿性质的,但我们考虑未来版本雇佣在特定主题上是专家的知识工作者。在这样的系统中,形成排队系统的请求-回答过程可以利用调度器,将不同主题的请求分配给论坛中的专家,这些专家可以根据他们在不同主题上的专业水平来回答。通过这个模型,我们计算了系统在处理请求时的容量,同时保持系统稳定,并设计了达到容量的调度器。我们还研究了专家之间在回答请求时的协作如何可能增加容量。

English abstract

As individuals turn to the Internet to find answers to questions they may have, several Question Answering (QA) forums have evolved, where users knowledgeable in certain topics can contribute their expertise to answering these requests for information. While these are currently volunteer based, we consider a future version employing knowledge workers who are experts in certain topics. In such a system, the request-answer processes forming the queuing system may utilize schedulers that assign requests in different topics to the experts in the forum, who may be able to answer them according to their expertise levels in different topics. With this model, we calculate the capacity of the system for handling the requests while keeping the system stable, and design schedulers that achieve capacity. We also investigate how collaboration between experts in answering requests can potentially increase capacity.

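2606.19758 2026-06-19 cs.MA New submission

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

SIGMA: 用于组合式多智能体设计的技能-关联图

Kun Zeng, Yu Huo, Siyu Zhang, Yuecheng Zhuo, Yuquan Lu, Haoyue Liu, Siyue Chen, Xiaoying Tang

AI summary Proposes the SIGMA framework, which uses a skill-agent incidence graph to construct agents as task-conditioned bundles of reusable skills and decodes a communication topology; it outperforms baselines on six benchmarks and is robust to unseen skill libraries.

Comments EMNLP2026

Details
AI Chinese abstract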

现有的基于图的多智能体系统(MAS)设计者主要通过优化预定义智能体、角色或组上的通信拓扑来改善协作。然而,由于每个节点仍然是一个封闭集实体,这些方法难以泛化到需要未见能力组合的任务。我们提出SIGMA,一个技能-关联图框架,将智能体构建为可复用技能的任务条件组合。给定一个任务和一个技能库,SIGMA预测一个技能-智能体关联矩阵,从选定的技能中组合智能体节点嵌入,并在构建的智能体上解码通信拓扑。在执行过程中,特定技能的邮箱将消息路由到相关分配的能力,使关联结构直接可操作。在六个推理和编码基准测试中,使用三个基础LLM,SIGMA实现了最佳平均性能,并分别比最强的非组合式拓扑基线CARD提高了2.06、2.36和1.75分。它还对未见技能库表现出更强的鲁棒性,平均性能下降仅为0.96分。这些结果表明,组合式节点构建是多智能体设计中除了通信拓扑优化之外的一个互补且重要的方向。代码可在以下网址获取:https://anonymous.4open.science/r/SIGMA-2338/。

English abstract

Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.
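
A skill-agent incidence matrix can be read as a table with one row per skill and one column per agent, where a nonzero entry assigns the skill to the agent. The sketch below shows one way agent embeddings might be composed from such a matrix. Mean pooling over the selected skills is an assumption made for illustration, since the abstract does not state the composition rule, and `compose_agents` is a hypothetical name.

```python
def compose_agents(incidence, skill_embeddings):
    """Build one embedding per agent from its assigned skills.

    incidence[s][a] is truthy when skill s is assigned to agent a.
    ASSUMPTION: an agent embedding is the mean of its skills' embeddings.
    """
    n_skills = len(incidence)
    n_agents = len(incidence[0])
    dim = len(skill_embeddings[0])
    agents = []
    for a in range(n_agents):
        chosen = [s for s in range(n_skills) if incidence[s][a]]
        if not chosen:
            raise ValueError(f"agent {a} has no skills assigned")
        agents.append(
            [sum(skill_embeddings[s][d] for s in chosen) / len(chosen) for d in range(dim)]
        )
    return agents
```

Because agents are derived from the matrix rather than fixed in advance, changing a column changes the agent without retraining a per-agent representation, which is the compositional property the abstract emphasizes.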

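2606.19755 2026-06-19 cs.CR cs.AI New submission

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

SafeSpec: 通过动态反射采样实现快速且安全的LLM

Haotian Xu, Zeyang Zhang, Linbao Li, Huadi Zheng, Yu Li, Cheng Zhuo

Affiliations * Zhejiang University, Hangzhou, China; Huawei; Harbin Institute of Technology, Shenzhen, China

AI summary Proposes the SafeSpec framework, which integrates a lightweight safety head into the verification step of speculative decoding and recovers safe generations through risk estimation and reflective sampling, significantly reducing attack success rates while preserving the speedup.

Details
AI Chinese abstract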

推测推理加速了大语言模型(LLM)的解码过程,但本身不提供任何安全保障。现有的安全防御措施与推测推理大多不兼容:它们要么引入额外的计算,要么破坏草稿-验证机制,抵消加速优势。这揭示了当前安全方法与推测解码之间的根本性不兼容。我们提出SafeSpec,一个安全感知的推测推理框架,将风险估计直接集成到验证过程中。SafeSpec在目标模型上附加一个轻量级的潜在安全头,以在单次前向传递中联合评估语义有效性和安全性。当检测到不安全生成时,SafeSpec应用回滚和安全引导的反射多次采样来恢复安全延续,而不是终止生成。我们将越狱攻击建模为生成轨迹上的分布偏移,其中对抗性提示增加了有害延续的概率,但并未消除安全延续。在此模型下,SafeSpec在推测解码过程中执行风险感知的轨迹恢复。在多个模型和对抗基准测试中,SafeSpec实现了显著改进的安全-效率权衡。在Qwen3-32B上,SafeSpec将攻击成功率降低了15%,同时在良性工作负载上保持了2.06倍的推理加速,表明推测加速和推理时安全性可以联合优化。

English abstract

Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.

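2606.19753 2026-06-19 cs.AI cs.SE New submission

Grounded Inference: Principles for Deterministically Encapsulated Generative Models

基于推理:确定性封装生成模型的原则

Marty O'Neill

Affiliations * Odenton, MD, USA

AI summary Defines four primitives of AI blended architecture that enable deterministic encapsulation of probabilistic models, identifies two industry anti-patterns, and provides a foundational framework for integrating AI into traditional systems.

Comments 12 pages, 3 figures

Details
AI Chinese abstract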

将生成模型整合到传统计算系统中既带来了巨大的机遇,也带来了巨大的风险。尽管许多早期采用者已付出巨大代价认识到这些风险,但该领域仍需基础框架来降低AI融入传统系统的风险。本文通过定义四种AI混合架构的具体原语,旨在实现概率模型的确定性封装,从而奠定这一基础。此外,本文还确立了行业中广泛存在的两个总体反模式,作为该领域工程师的警示。该框架旨在实现AI与传统系统的成功集成,同时为生成模型提供商构建下一代生成模型接口提供基础。

English abstract

The incorporation of generative models into traditional computational systems presents both enormous opportunity and tremendous peril. Although many early adopters have realized these perils at great expense, the field still requires foundational frameworks to de-risk incorporation of AI into traditional systems. This manuscript establishes this foundation through the definition of four specific primitives of AI blended architecture, designed to enable deterministic encapsulation of probabilistic models. It further establishes two overarching anti-patterns broadly represented across industry to serve as warnings for engineers in this field. This framework was designed to enable successful integration of AI into traditional systems while providing a foundation upon which generative model providers could build the next generation of generative model interfaces.

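2606.19752 2026-06-19 cs.RO cs.AI New submission

Temporal Self-Imitation Learning

时间自我模仿学习

Yinsen Jia, Boyuan Chen

Affiliations * Duke University

AI summary Proposes the Temporal Self-Imitation Learning framework, which mines efficient successful trajectories and converts them into reusable supervision, improving learning efficiency and robustness on long-horizon robot manipulation tasks.

Details
AI Chinese abstract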

基于奖励塑形训练的长时域机器人操作策略仍可能通过低效交互利用密集奖励,而训练过程中稀有高效行为可能被遗忘。我们认为时间效率本身为强化学习提供了强大且未充分利用的自我监督源。我们引入时间自我模仿学习(TSIL),一种强化学习框架,挖掘学习过程中产生的时间高效成功轨迹,并将其转化为可重用的监督信号以改进未来策略。TSIL通过从快速成功轨迹中提取配置条件自适应时间目标逐步优化学习,并通过效率加权自我模仿学习保留和重放高效行为。在15个不同的长时域操作任务中,TSIL持续提升了学习效率、任务完成效率、快速成功行为的重访率以及对不稳定训练条件的鲁棒性。更广泛地,我们的结果表明,成功行为的时间结构本身为强化学习提供了超越人工奖励塑形的可扩展自我监督信号。

English abstract

Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.

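2606.19750 2026-06-19 cs.LG cs.AI cs.CL New submission

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

流形赌博机:大语言模型潜在几何上的贝叶斯课程学习

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

Affiliations * University of California, San Diego

AI summary Proposes the Bayesian Manifold Curriculum (BMC) framework, which models problem sampling as a manifold-structured bandit problem and guides sampling with a hierarchical task tree and Bayesian learning, balancing learning signal, diversity, and utility.

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

Details
AI Chinese abstract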

强化学习(RL)是提高大语言模型(LLMs)推理能力的关键方法,其中训练效率关键取决于优化过程中问题的采样方式。现有的自适应课程学习方法通常优先考虑中等难度的提示,将问题选择视为具有独立臂的标准赌博机问题,忽略了任务空间的结构化和异质性。在这项工作中,我们将问题采样框架化为具有内生非平稳性的流形结构赌博机问题:问题通过模型的潜在表示空间相关联,采样决策可以影响学习信号在该空间中的演变方式。为了实现这一视角,我们引入了贝叶斯流形课程(BMC),这是一个结构感知框架,将问题组织成层次任务树,并应用贝叶斯学习来指导采样。实验发现,不同的采样策略在生产性(学习信号)、多样性(任务流形覆盖)和实用性(评估相关性)之间引入了非平凡的权衡。这些结果表明,仅优先考虑难度不足以获得强大的下游性能,突出了将结构和类型感知纳入问题采样中的重要性。

English abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.
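
To make the contrast with independent-arm bandits concrete, the sketch below applies Thompson sampling over a two-level tree: a cluster is chosen first, then a problem inside it, and a reward updates both levels so that evidence is shared within a cluster. This is a generic hierarchical Thompson sampler and not the BMC algorithm. The class names, the Beta(1, 1) priors, and the reward in [0, 1] are all assumptions for illustration.

```python
import random

class BetaArm:
    """Beta posterior over a success probability, starting from Beta(1, 1)."""

    def __init__(self):
        self.alpha = 1.0
        self.beta = 1.0

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # reward is assumed to lie in [0, 1]
        self.alpha += reward
        self.beta += 1.0 - reward

class TwoLevelSampler:
    """Hypothetical two-level curriculum sampler: clusters, then problems."""

    def __init__(self, clusters, seed=0):
        # clusters maps a cluster name to the list of problem ids it contains
        self.rng = random.Random(seed)
        self.clusters = clusters
        self.cluster_arms = {c: BetaArm() for c in clusters}
        self.problem_arms = {p: BetaArm() for ps in clusters.values() for p in ps}

    def pick(self):
        """Thompson sampling at each level: draw from every posterior, take the max."""
        c = max(self.clusters, key=lambda name: self.cluster_arms[name].sample(self.rng))
        p = max(self.clusters[c], key=lambda pid: self.problem_arms[pid].sample(self.rng))
        return c, p

    def update(self, cluster, problem, reward):
        """Credit the problem and its parent cluster, sharing signal across siblings."""
        self.cluster_arms[cluster].update(reward)
        self.problem_arms[problem].update(reward)
```

A typical loop would call `pick()`, train on the returned problem, measure some learning signal, and pass it to `update()`. How that signal is defined is left open here.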

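2606.19749 2026-06-19 cs.AI cs.CL New submission

Benchmarking Agentic Review Systems

基准测试智能审稿系统

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

Affiliations * University of Chicago; Bar-Ilan University

AI summary Agentic review systems are emerging in response to the pressure that AI-assisted research places on peer review, but standards for evaluating them are lacking. This paper evaluates several systems and finds that the best configuration (OpenAIReview + GPT-5.5) reaches 83.0% pairwise accuracy, catches 71.6% of injected errors, and receives positive user feedback.

Comments 11 pages, 7 tables, 4 figures

Details
AI Chinese abstract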

一类新的智能审稿系统正在兴起,以缓解AI辅助研究给同行评审系统带来的压力,但如何评估它们尚不明确。我们评估了两个开源系统(OpenAIReview和coarse)、一个专有系统(Reviewer3)以及一个零样本基线,跨越六个涵盖前沿和高效模型的LLM。首先,我们研究ICLR/NeurIPS论文上的AI评审是否与论文质量(通过引用和接受决定等外部信号近似)相关。每个系统在成对准确性上均高于随机水平,最佳为OpenAIReview + GPT-5.5,达到83.0%。其次,为测试系统能否捕获已知真实错误的错误,我们构建了一个扰动基准,向八个arXiv学科类别的论文中注入四类错误,并测量检测召回率。最强配置(OpenAIReview + GPT-5.5)捕获了71.6%的注入错误,仍有很大改进空间。六个模型的检测并集达到83.3%的召回率,表明不同模型检测不同错误,更好的利用设计可能提高性能。除这些基准外,我们研究了OpenAIReview在真实用户中的公开部署。对其评论的投票偏向正面,比例为1.44:1,最常见的抱怨是误报和琐碎挑剔。总之,通过评估基于最先进模型的全审稿系统在真实研究论文上的表现,我们表明虽然AI评审仍有改进空间,但它们已经能够很好地跟踪人类质量判断、捕获重要错误,并获得真实用户的正面反馈。

English abstract

A new class of agentic review systems is emerging as a remedy for the pressure placed on peer review by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews of ICLR/NeurIPS papers track paper quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting that different models detect different errors and that better harness design could increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
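
The two headline metrics, pairwise accuracy and detection recall, are simple to state in code. The sketch below is a plausible reading of both rather than the paper's evaluation harness. Skipping pairs with equal quality and counting tied review scores as misses are assumptions, as are the function names.

```python
from itertools import combinations

def pairwise_accuracy(scores, quality):
    """Fraction of paper pairs whose review scores are ordered the same way
    as the external quality signal. Pairs with equal quality are skipped;
    tied review scores count as incorrect (both are assumptions)."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        if quality[i] == quality[j]:
            continue
        total += 1
        if (scores[i] - scores[j]) * (quality[i] - quality[j]) > 0:
            correct += 1
    if total == 0:
        raise ValueError("no pairs with differing quality")
    return correct / total

def recall(detected, injected):
    """Share of injected errors that appear among the detections."""
    injected = set(injected)
    return len(set(detected) & injected) / len(injected)

def union_recall(detections_by_model, injected):
    """Recall of the pooled detections from several models."""
    pooled = set().union(*detections_by_model)
    return recall(pooled, injected)
```

Union recall can exceed every individual model's recall only when the models miss different errors, which is the observation behind the gap between 71.6% and 83.3% reported above.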

2606.19747 2026-06-19 cs.AI New submission

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

预训练Transformer模型用于古兰经语音识别的比较研究:语音表示、标签格式和数据集组成

Nabil Mosharraf Hossain, Riasat Islam, Unaizah Obaidellah

Affiliations * Greentech Apps Foundation; Queen Mary University of London; University of Malaya

AI summary Systematically compares fine-tuning of pretrained Transformer models such as Wav2Vec2.0, HuBERT, and XLS-R for Quranic speech recognition; in experiments on a dataset of more than 870 hours, the best configuration achieves a word error rate of 0.08 on the EveryAyah subset, and training time falls from 140 hours to 40 hours.

Comments 30 pages, 9 figures, 5 tables, Submitted to International Journal of Speech Technology

Details
AI Chinese abstract

古兰经自动语音识别(ASR)旨在将古兰经诵读转换为文本,从而支持辅助记忆工具和古兰经搜索引擎等应用。然而,现有的ASR模型在用户诵读的经文上通常表现出较高的词错误率(WER),并且缺乏对古兰经语料库的完整覆盖。本文对基于预训练Transformer模型的领域特定微调进行了系统的实证研究,使用了先进的语音特征提取方法:Wav2Vec2.0、HuBERT和XLS-R。这些模型通过掩码输入音频的部分内容并利用Transformer架构学习上下文感知的语音特征,应用自监督学习。预训练模型在超过870小时的专业和用户诵读过滤后的古兰经数据集上进行微调。通过跨特征提取器、输出标签格式、训练策略和剪辑时长的全面消融研究,我们确定了影响该领域转录准确性的关键因素。我们的最佳配置在EveryAyah子集上实现了0.08的WER,在EveryAyah+Tarteel组合设置上实现了0.11的WER,相比Citrinet基线(WER=0.163)提高了约五个百分点,同时将组合模型训练时间从140小时减少到40小时。不带变音符号的阿拉伯文本产生了最佳的微调结果,而Wav2Vec2-XLSR-53提供了最强的整体表示。未来的工作包括改进数据集质量和开发音素感知模型,以提取更深的语音特征表示,用于对Tajweed敏感的应用。

English abstract

Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rates (WER) on user-recited verses and lack full coverage of the Quranic corpus. This paper presents a systematic empirical study of domain-specific fine-tuning of pretrained Transformer-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2.0, HuBERT, and XLS-R. These models apply self-supervised learning by masking portions of input audio and using Transformer architectures to learn context-aware speech features. The pretrained models are fine-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain. Our best-performing configuration achieves a WER of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting, representing roughly a five-percentage-point gain over the Citrinet baseline (WER = 0.163) while reducing combined-model training time from 140 hours to 40 hours. Arabic text without diacritics yields the best fine-tuning results, and Wav2Vec2-XLSR-53 provides the strongest overall representation. Future work includes improving dataset quality and developing phoneme-aware models to extract deeper speech feature representations for Tajweed-sensitive applications.
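
Word error rate, the metric reported throughout this abstract, is the word-level edit distance between the reference and the hypothesis divided by the number of reference words. A minimal sketch follows. It splits on whitespace, which is an assumption, and real evaluations usually normalize the text first (for Arabic, for example, by deciding how diacritics are handled).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

On this scale the reported WER of 0.11 on the combined setting against the Citrinet baseline's 0.163 is a difference of 0.053, which matches the "roughly five-percentage-point" gain quoted above. WER can exceed 1.0 when the hypothesis contains many insertions.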

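2606.19746 2026-06-19 cs.DC New submission

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

SAC: 面向稀疏注意力LLM的基于CXL的解耦KV缓存系统

Ruiyang Ma, Teng Ma, Junru Li, Hantian Zha, Xuchun Shang, Qingda Hu, Zheng Liu, Xinjun Yang, Tao Ma, Guojie Luo

AI summary To remove the bottleneck caused by transferring the full KV cache during long-context inference with sparse attention models, proposes SAC, a disaggregated cache system that fetches top-k KV entries on demand over CXL; compared with RDMA-based baselines it delivers 2.1x higher throughput and 9.7x lower TTFT.

Details
AI Chinese abstract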

LLM向长上下文推理的扩展将主要服务系统瓶颈从计算转移到内存容量。传统针对密集注意力模型的解决方案依赖基于RDMA的解耦内存池,在解码前从远程存储粗粒度地获取整个前缀KV缓存到本地内存。然而,这种方法对于新兴的稀疏注意力模型本质上是低效的。尽管解码过程中只有一小部分KV条目是活跃的,这些系统仍然将完整的KV缓存获取到本地,导致严重的传输瓶颈和本地内存浪费。为了解决这个问题,我们提出了SAC,第一个针对稀疏注意力模型优化的高效解耦KV缓存系统。通过利用Compute Express Link (CXL)的低延迟、缓存行粒度的加载/存储语义,SAC在推理过程中按需仅获取所需的top-k KV条目。在使用SGLang对DeepSeek-V3.2的评估中,与基于RDMA的基线相比,SAC实现了2.1倍的吞吐量提升、9.7倍的TTFT降低和1.8倍的TBT降低,确立了基于CXL的解耦作为新兴稀疏注意力模型的优越基础设施。

English abstract

The scaling of LLMs toward long-context inference has shifted the primary serving system bottleneck from computation to memory capacity. Traditional solutions for dense attention models rely on RDMA-based disaggregated memory pools, which perform coarse-grained fetching of the entire prefix KV cache from remote storage to local memory before decoding. However, this approach is fundamentally inefficient for emerging sparse attention models. While only a small fraction of KV entries are active during decoding, these systems still fetch the full KV cache locally, leading to severe transmission bottlenecks and local memory wastage. To address this, we propose SAC, the first efficient disaggregated KV cache system optimized for sparse attention models. By leveraging the low-latency, cache-line granularity load/store semantics of Compute Express Link (CXL), SAC fetches only the required top-k KV entries on demand during inference. Evaluations on DeepSeek-V3.2 using SGLang show that SAC achieves 2.1x higher throughput, 9.7x lower TTFT, and 1.8x lower TBT compared to RDMA-based baselines, establishing CXL-based disaggregation as the superior infrastructure for emerging sparse attention models.
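
The difference between fetching the whole prefix and fetching only the active entries can be shown with a toy selection step. In the sketch below a Python list stands in for the remote memory pool and `scores` stands in for whatever relevance scores the sparse attention mechanism produces. Both are assumptions, and nothing here models CXL load/store semantics or SAC's actual design.

```python
import heapq

def top_k_indices(scores, k):
    """Indices of the k highest-scoring KV entries, returned in position order."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

def fetch_sparse(remote_kv, scores, k):
    """Read only the selected entries from the (simulated) remote pool."""
    return {i: remote_kv[i] for i in top_k_indices(scores, k)}

def fetch_dense(remote_kv):
    """Baseline: copy every entry, as a full-prefix transfer would."""
    return dict(enumerate(remote_kv))
```

With a prefix of n entries and k active entries per step, the sparse path touches k/n of the data that the dense path copies, which is the saving the abstract attributes to on-demand top-k fetching.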

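2606.19745 2026-06-19 cs.HC New submission

Designing for Interconnected Islamic Learning: A Qualitative Study of Muslim Women's Experiences with Qur'an, Hadith, and Seerah Apps

设计互联的伊斯兰学习:穆斯林女性使用古兰经、圣训和先知传记应用的质性研究

Ishrat Jahan Easha, Nabil Mosharraf Hossain, Araf Mohammad Mahbub, Fairoze Bint Abu Hassan, Zunaid Aslam, Yemin Sajid, Riasat Islam

AI summary Through interviews with Muslim women, finds that they experience a tension from separated context when reading the Qur'an, Hadith, and Seerah in digital tools, and introduces the concept of layered contextuality, which stresses providing cross-text context only when it is trustworthy, optional, and does not interrupt reading.

Comments 27 pages, 1 figure, 3 tables. Submitted to the International Journal of Human-Computer Interaction

Details
AI Chinese abstract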

伊斯兰学习通常依赖于同时阅读古兰经、圣训和先知传记,然而数字工具通常将这些资源分散在不同的应用、屏幕和搜索路径中。我们通过从在线伊斯兰学习社区招募的五名穆斯林女性的半结构化访谈,将此视为人机交互问题。参与者描述了一个反复出现的张力:她们希望在阅读时获得古兰经-圣训-先知传记的上下文,但仅当上下文扩展是可信的、可选的且不打断阅读时。通过性别化数字宗教、认知信任和无缝学习的视角解读访谈,我们识别出关于上下文理解、真实性、界面杂乱、学习模式和指导特征的五个主题。我们引入分层语境性作为该领域的HCI解释:上下文扩展必须与解释责任、虔诚流动以及跨设备和学习强度的连续性相平衡。

English abstract

Islamic learning often depends on reading the Qur'an, Hadith, and Seerah together, yet digital tools typically separate these sources across apps, screens, and search pathways. We examine this as a human-computer interaction problem through five semi-structured interviews with Muslim women recruited from an online Islamic learning community. Participants described a recurring tension: they wanted Qur'an-Hadith-Seerah context at the point of reading, but only when contextual expansion remained trustworthy, optional, and did not interrupt reading. Interpreting the interviews through gendered digital religion, epistemic trust, and seamless learning, we identify five themes concerning contextual understanding, authenticity, interface clutter, study modes, and guidance features. We introduce layered contextuality as an HCI account of this domain: contextual expansion must be balanced with interpretive accountability, devotional flow, and continuity across devices and study intensities.

2606.19744 2026-06-19 cs.CL cs.AI cs.HC New submission

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

超越统一遗忘:不同偏好设置下顺序直接偏好优化的研究

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Affiliations * Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

AI summary Studies the effects of sequential DPO across preference settings and finds that forgetting is not uniform but depends on objective relationship, signal strength, and training order; suggests that future alignment pipelines should account for objective compatibility.

Comments Submitted to EMNLP 2026

Details
AI Chinese abstract

将语言模型与人类偏好对齐通常需要优化多个行为目标。一种实用方法是使用直接偏好优化(DPO)等偏好优化方法顺序应用这些目标,但目前尚不清楚后续训练是否会统一降低先前学习的偏好,或者这种影响是否取决于目标之间的关系。我们研究了跨越四种偏好设置(包括分布冲突、多属性交互、强安全信号和兼容的响应质量目标)的顺序DPO。使用带有LoRA适配器的Llama-3.1-8B-Instruct,我们在每个阶段后使用固定的基础模型参考评估所有目标。我们发现顺序DPO不会产生单一的遗忘模式;偏好变化从部分退化到稳定、成对重新分配或正迁移,具体取决于目标关系、信号强度和训练顺序。使用长度归一化策略边界的成对分析表明,聚合指标可能掩盖偏好对之间的异质性变化,而四分位数分解显示,高置信度对可能根据设置而退化或改进。机制诊断表明,在所有设置中,阶段2的梯度和适配器更新与先前目标接近正交,几乎没有证据表明直接梯度对立是主要驱动因素。这些发现表明,未来的顺序对齐流程应考虑目标兼容性和信号强度,而不是假设后续目标会统一影响先前的偏好。

English abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
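
Three of the quantities this abstract relies on can be written down directly: the standard DPO loss for one preference pair, a length-normalised policy margin, and the cosine similarity used to judge whether two update directions are near-orthogonal. The DPO loss follows its usual published form. The margin definition (mean per-token log-probability of the chosen response minus that of the rejected one) is an assumption, since the abstract does not spell it out, and the default `beta` is illustrative.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, taking sequence log-probabilities:
    -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])."""
    z = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # numerically stable softplus(-z), which equals -log(sigmoid(z))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def length_normalised_margin(logp_chosen, len_chosen, logp_rejected, len_rejected):
    """ASSUMED definition: difference of mean per-token log-probabilities."""
    return logp_chosen / len_chosen - logp_rejected / len_rejected

def cosine_similarity(u, v):
    """Near 0 means the two directions are close to orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A cosine near -1 between Stage 2 gradients and the earlier objective's gradients would indicate direct opposition; the abstract reports values near 0 instead, which is why it looks elsewhere for the cause of preference change.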

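2606.19741 2026-06-19 cs.AI cs.LG New submission

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

通过演化程序瓶颈解释神经组合优化

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

Affiliations * Carnegie Mellon University; Nanyang Technological University; Microsoft Research; Massachusetts Institute of Technology

AI summary Proposes the Evolving Programmatic Bottlenecks (EPB) framework, which distills black-box neural combinatorial optimization models into human-readable program portfolios, using an LLM and hybrid textual-numerical gradient descent to achieve interpretability, and reveals how model behavior relates to variants of classic heuristics.

Comments Under Review

Details
AI Chinese abstract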

神经组合优化(NCO)取得了强劲性能,但其黑盒性质仍然是部署和科学诊断的关键障碍。标准可解释性工具(如概念瓶颈模型)不适用于NCO,因为其决策是动态的、状态依赖的,且缺乏适当的概念词汇定义。为弥合这一差距,我们引入了演化程序瓶颈(EPB),据我们所知,这是首个通过将黑盒NCO模型蒸馏为人类可读程序组合来解释NCO策略的框架。EPB利用LLM自主演化一组程序,其中每个程序的每步动作分布作为瓶颈。EPB通过迭代框架工作:模块I固定程序库容量,并引入混合文本-数值梯度下降方案,该方案将学生路由器更新的数值梯度和基于LLM程序修订的文本梯度相结合;模块II通过故障目标扩展和冗余剪枝动态调整库容量。大量实验证明了EPB的有效性和广泛适用性,蒸馏后的程序组合在很大程度上保持了原始性能。EPB还揭示了NCO行为在优化阶段的变化,并且可以近似为经典启发式变体的组合。我们的工作推进了可解释NCO,并将EPB建立为解释序列决策模型的有前途工具。

English abstract

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic and state-dependent and which lacks a well-defined concept vocabulary. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), which is, to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability: the distilled program portfolios largely match the original models' performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19736 2026-06-19 cs.CV New submission

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

VFACamou: 视图融合的对抗性伪装用于环境自适应物理规避

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

Affiliations * State Key Laboratory of Intelligent Vehicle Safety Technology; School of Cyber Science and Engineering, Huazhong University of Science and Technology; School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; Hebei Energy College of Vocation And Technology

AI summary Proposes an end-to-end framework that combines UV-volume rendering with a diffusion-based texture generator and adds an illumination color consistency estimator and a multi-scale dynamic training strategy to generate wearable adversarial patterns that sustain stable physical attacks under the changing viewpoints and lighting of settings such as UAV reconnaissance.

Comments Accepted by ICME 2026

Details
AI Chinese abstract

物理世界中的对抗性伪装仍然极具挑战性,尤其是在无人机侦察场景下,目标会经历连续的几何变化和极端光照变化。现有方法要么优化无法泛化到动态视角的2D数字扰动,要么产生视觉上不自然的纹理而无法在实际场景中部署。因此,我们提出一个端到端的对抗性伪装生成框架,能够自动生成可穿戴的对抗图案,并在视角、姿态和光照条件变化的真实物理环境中保持稳定的攻击性能。我们的方法将UV体积渲染与基于扩散的纹理生成器相结合,使得在不同尺度、姿态和光照条件下外观保持一致。为了确保环境真实性,我们提出一个照明颜色一致性估计器,提取主导背景属性并引导自然纹理损失,使生成的UV纹理与周围环境对齐。多尺度动态训练策略进一步增强了对抗视角变化和身体变形的鲁棒性。在多个主流检测器上的大量实验表明,我们的方法在保持高感知自然性的同时实现了强大且稳定的物理攻击性能,在不引入不自然伪影的情况下降低了人类检测率。

English abstract

Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.