arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13441 2026-06-12 cs.AI cs.CL New submission

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

Joseph Keshet

AI summary: The paper argues that large language models lack the commitment-bearing agency that moral responsibility requires: their outputs arise from probabilistic mappings rather than intrinsic intentionality, and stochastic sampling is not choice.

Abstract

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

2606.13436 2026-06-12 cs.AI New submission

Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems

Raymond Vasquez

AI summary: To address how label authority affects evaluation validity in weakly supervised metadata systems, the paper introduces the concept of evaluation sovereignty and a multi-track evaluation framework. Experiments show a marked gap between model performance under silver-label and gold-label evaluation, and evaluation validity is recast as a system-level property.

Abstract

Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluation outcomes are often conditioned by the processes used to generate labels. This paper does not seek to improve classification performance. Instead, it examines the validity of performance measurement under differing label-authority regimes. This issue is particularly relevant in large-scale metadata-driven systems, where labels are often incomplete, inconsistent, or weakly supervised. We introduce evaluation sovereignty, defined as the degree to which performance metrics are independent of label authority and supervision regime, and propose a multi-track evaluation framework that systematically varies training and evaluation label sources. Using hierarchical multi-label classification on large-scale scientific metadata, we demonstrate that models exhibiting strong performance under operational ("silver") evaluation degrade substantially under independent ("gold") evaluation, particularly for fine-grained classification. For example, Micro-F1 decreases from approximately 0.54 to 0.03. Notably, ranking-based metrics remain above baseline, revealing a divergence between latent model signal and classification validity. These findings suggest that commonly reported performance metrics may reflect alignment with labeling processes rather than true predictive capability. We therefore reconceptualize evaluation validity as a system-level property shaped by label governance and provide a practical methodology for auditing intelligent systems operating under weak supervision.
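
For readers unfamiliar with the metric, the following is a minimal sketch of how Micro-F1 is computed for multi-label predictions and how the same predictions can score very differently against two label sources. The function name `micro_f1` and the toy label sets are hypothetical illustrations, not the paper's code or data.

```python
from typing import Iterable, Set


def micro_f1(reference: Iterable[Set[str]], predicted: Iterable[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all items."""
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        tp += len(ref & pred)   # labels present in both sets
        fp += len(pred - ref)   # predicted but not in the reference
        fn += len(ref - pred)   # in the reference but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: one set of predictions, two label authorities.
predictions = [{"cs.AI", "cs.LG"}, {"cs.CV"}, {"cs.CL"}]
silver = [{"cs.AI", "cs.LG"}, {"cs.CV"}, {"cs.CL", "cs.IR"}]  # operational labels
gold = [{"stat.ML"}, {"cs.CV", "cs.RO"}, {"cs.IR"}]           # independent labels

silver_f1 = micro_f1(silver, predictions)
gold_f1 = micro_f1(gold, predictions)
```

In this toy case the predictions agree closely with the silver labels and poorly with the gold labels, which is the kind of divergence the abstract describes; the magnitudes here carry no empirical meaning.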

2606.13426 2026-06-12 cs.LG stat.ML New submission

Accelerating Speculative Diffusions via Block Verification

Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

AI summary: Proposes a speculative sampling scheme for diffusion models that raises the draft acceptance rate through block verification; the training-free Free Drafter achieves up to a 6.3% speedup.

Abstract

Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
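
As background for the residual distribution mentioned above, here is a minimal sketch of one standard speculative sampling step in the discrete case only. It does not reproduce the paper's continuous-space scheme or block verification; the function names and the example distributions are hypothetical. A draft sample x drawn from q is accepted with probability min(1, p(x)/q(x)); on rejection, a sample is drawn from the residual, the normalized positive part of p - q. The code computes the exact output law to show that it matches the target p.

```python
from typing import List


def residual(p: List[float], q: List[float]) -> List[float]:
    """Residual distribution: the positive part of (p - q), renormalized."""
    raw = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(raw)
    # total == 0 means p == q, so rejection never happens; return p as a placeholder.
    return [r / total for r in raw] if total > 0 else list(p)


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact law of the token emitted by one accept/reject step."""
    # Probability of drafting x from q and accepting it.
    accepted = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    res = residual(p, q)
    # On rejection, the emitted token comes from the residual distribution.
    return [a + reject_mass * r for a, r in zip(accepted, res)]


target = [0.5, 0.3, 0.2]  # hypothetical target distribution p
draft = [0.2, 0.2, 0.6]   # hypothetical draft distribution q
out = output_distribution(target, draft)
```

In a discrete space the residual is a finite vector that can be normalized directly, which is why this step is straightforward there; in continuous space the normalizing constant is an integral, which is the difficulty the abstract points to.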

2606.13400 2026-06-12 cs.LG cs.AI cs.RO New submission

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

AI summary: Proposes PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. A discrete-time flow formulation and a projection-free architecture eliminate discretization error and strictly satisfy arbitrary polyhedral constraints, achieving zero constraint violation and lower inference latency on planning and control tasks.

Comments
30 pages, 12 figures, Accepted to ICML 2026

Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.

2606.13381 2026-06-12 cs.LG New submission

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Huyen Vo, María Martínez-García, Isabel Valera

AI summary: To address the trade-off between generative quality and semantic coherence in multimodal VAEs, proposes Hölder++, which improves coherence while preserving generative quality through exact Hölder pooling, an extended architecture, and hierarchical inference.

Comments
Accepted at ICML 2026. Camera-ready version

Abstract

Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that are, at the same time, semantically consistent across modalities. Recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

2606.13376 2026-06-12 cs.CV New submission

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

AI summary: Proposes MoVerse, which builds an interactively navigable 360-degree panoramic world in real time from a single narrow-field-of-view image. It completes the field of view with topology-aware diffusion, generates a 3D Gaussian scaffold through panoramic geometry-aware residual prediction, and distills a bidirectional diffusion teacher into a causal autoregressive student for low-latency video rendering.

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

2606.13368 2026-06-12 cs.AI cs.CV New submission

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

AI summary: Proposes IterCAD, a multimodal agent framework for closed-loop, interactive CAD generation and editing. Optimized with progressive SFT and geometry-aware reinforcement learning, it significantly outperforms existing methods in code executability and geometric precision.

Abstract

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
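
The abstract does not define the CD-TR curve, so the following is only a rough sketch of the general idea under stated assumptions: a symmetric Chamfer distance between point clouds (conventions vary; this version averages unsquared nearest-neighbor distances in both directions), and a tolerance-recall value that counts non-executable generations as misses rather than dropping them. All names and the toy data are hypothetical and are not taken from IterCAD-Bench.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance, summed over both directions."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def tolerance_recall(distances: Sequence[Optional[float]], tol: float) -> float:
    """Fraction of ALL samples whose distance is within tol.

    None marks a sample whose generated code failed to execute (assumption);
    it stays in the denominator, so failures lower the score instead of vanishing.
    """
    hits = sum(1 for d in distances if d is not None and d <= tol)
    return hits / len(distances)


# Hypothetical toy shapes and per-sample results.
shape_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shape_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(shape_a, shape_b)

results = [0.01, 0.2, None, 0.05]  # None: code did not run
recall_at_tol = tolerance_recall(results, tol=0.1)
```

Sweeping `tol` over a range and plotting `tolerance_recall` against it would trace a curve of this kind; keeping failed samples in the denominator is one way to avoid scoring only the generations that happened to execute.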

2606.13347 2026-06-12 cs.LG New submission

Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

Jagriti Singh, Shekhar Verma, Muneendra Ojha

AI summary: Proposes a sampling-time, density-aware method that needs no additional training. It modifies the classifier gradient to steer trajectories toward low-confidence regions and guides sampling toward the predicted real image, strengthening a diffusion model's exploration of low-density regions.

Abstract

Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. We apply guidance to the noisy images, not to the predicted noise as most diffusion models do. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via the modified classifier gradient, and at each time step, we also guide the sampling process toward the predicted real image. The first guidance term helps explore low-probability samples, and the second keeps the generated samples close to the real data manifold. The proposed sampler consistently improves ADM model recall at 64x64 resolution while maintaining a comparable FID, and with a 256x256 ADM model, we show visual results for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted real image guidance, helps generate samples of high perceptual quality with a 256x256 ADM model on ImageNet.

2606.13300 2026-06-12 cs.LG New submission

Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score

Mariya Pavlova, Harrison Bo Hua Zhu, Elizsveta Semenova, Yingzhen Li

AI summary: Proposes the Trajectory-based Quantization Sensitivity Score (TQS), which analyzes the propagation of quantization error from the standpoint of dynamical-systems stability and enables mixed-precision quantization without calibration data.

Comments
ICML 2026, Workshop on Forecasting as a New Frontier of Intelligence

Abstract

We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.

2606.13282 2026-06-12 cs.AI New submission

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

Pratyush Chaudhari

AI summary: Proposes the Ethical Robustness Testing System (ERTS), which tests the adversarial robustness of AI ethical reasoning through a bounded ethical consequence space, semantic perturbation, and domain-adaptive assessment. Experiments show that only 33% of models pass.

Comments
8 pages, 10 tables

Abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

2606.13279 2026-06-12 cs.RO New submission

See Selectively, Act Adaptively: Dual-Level Structural Decomposition for Bimanual Robot Manipulation

Yoon-Ji Choi, Young-Chae Son, Soo-Chul Lim

AI summary: Proposes a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. A view-selective visual router and an action mixture-of-experts handle visual relevance and bimanual interaction modes respectively, raising success rates by 27.7% in simulation and 43.3% in real-world tasks.

Abstract

In bimanual robotic manipulation, task-relevant visual information varies with the task stage and context, while the interaction of the two arms shifts between independent and coordinated modes, making policy learning challenging. However, existing monolithic Vision-Language-Action (VLA) policies process diverse visual inputs and interaction patterns through a single shared representation and action generation pathway, often failing to separately account for visual relevance and bimanual interaction structure. To address this issue, we propose a bimanual manipulation VLA framework based on Dual-Level Structural Decomposition. The View-Selective Visual Router dynamically adjusts wrist-view contributions to emphasize relevant visual cues, while the Interaction-Aware Action Mixture-of-Experts (MoE) decomposes action generation into coordinated and arm-wise pathways to adapt to varying bimanual interaction modes. We evaluate the proposed method on six simulated bimanual manipulation tasks in RoboTwin 2.0 and three long-horizon real-world tasks. Our model improves the overall average success rate over a monolithic baseline by 27.7% in simulation and 43.3% in real-world evaluation, while consistently outperforming single-module variants across both settings. These results demonstrate that jointly considering selective visual processing and explicit decomposition of bimanual interaction structures provides an effective inductive bias for robust bimanual manipulation.

2606.13275 2026-06-12 cs.CV New submission

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan

AI summary: Proposes Custom ZeroCLIP, a framework that uses a retrieval-augmented vision-language model to caption traditional Indonesian clothing in a zero-shot setting, outperforming baselines on 8 unseen provinces.

Comments
accepted to ICME workshop on AIART 2026

Abstract

This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.

2606.13267 2026-06-12 cs.CV cs.CL cs.IR New submission

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih

AI summary: To address fine-grained visual similarity, the gap between training data and handheld cameras, and AI hallucination in museum settings, proposes an on-device artifact detector and a bilingual retrieval-augmented generation (RAG) question-answering system for real-time recognition and reliable answers.

Comments
6 pages, 4 figures, 5 tables. Submitted to AIVRCH 2026

Abstract

TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study -- from foundation-model auto-annotation (YOLO-World), through spatial label-cleaning rules, to a fully hand-annotated dataset -- isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4 K M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

2606.13256 2026-06-12 cs.RO cs.AI New submission

Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes

Anna-Maria Velentza, Anne-Gwenn Bosser

AI summary: Using a mixed factorial design, the study evaluates how humor type (affiliative, self-enhancing, aggressive, self-defeating) and content (personal vs. political) affect the funniness and appropriateness of AI-generated jokes told bilingually by a robot. Humor type significantly affects funniness, content affects appropriateness, and language preference depends on content and participant fluency.

Comments
Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

Abstract

Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.

2606.13254 2026-06-12 cs.CL New submission

Evaluating Pluralism in LLMs through Latent Perspectives

Laura Majer, Jan Šnajder, Martin Tutek

AI summary: Proposes a domain-agnostic, multi-layered unsupervised framework that extracts latent perspectives from LLM-generated text to assess the pluralistic gap, finding that rare perspectives remain disproportionately underrepresented.

Comments
Pluralistic Alignment Workshop @ ICML 2026

Abstract

The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralistic gap in LLM generation. While models have been shown to reduce the diversity of training data and generate homogeneously, this has been demonstrated primarily on multiple-choice questionnaires or using high-level characteristics of free-form text. In this paper, we introduce and implement a domain-agnostic multi-layered framework for unsupervised extraction of perspectives suitable for identifying the pluralistic gap in LLM-generated text. We evaluate our framework on book reviews, a highly opinionated dataset representing diverse perspectives, and compare various prompts and models. Our results show that while some models and prompting techniques come close to covering a broad spectrum of perspectives, rarer perspectives remain disproportionately underrepresented, resulting in distributions that diverge from human text.

2606.13232 2026-06-12 cs.RO New submission

WT-UMI: Tactile-based Whole-Body Manipulation via Force-Supervised Contact-Aware Planning

Jaehwi Jang, Zhaoyuan Gu, Alfred Cueva, Zimeng Chai, Junjie Sheng, Thong Nguyen, Himank Galundia, Yifan Wu, Huishu Xue, Isaac Legene, Ojas Mediratta, Davin Doan, Andrew Collins, Sarah Sadegh, KyoungMok Kim, Rishita Dhalbisoi, Zun Chen, Ye Zhao

AI总结 提出WT-UMI系统,结合人体演示与遥操作数据,通过力监督规划器预测末端执行器位姿和接触力轨迹,并利用触觉导纳控制器提升全身操控性能。

详情
Comments
18 pages, 8 figures
English abstract

Whole-body humanoid manipulation of bulky, deformable, and shared-load objects requires distributed contact sensing and explicit force regulation, yet most imitation policies treat contact force only implicitly. On the other hand, different demonstration sources provide complementary modalities with inherent trade-offs: human demonstrations capture natural contact forces but not robot-executable actions, while teleoperation directly records robot actions but with less natural force regulation. This paper presents WT-UMI, a wearable whole-body tactile interface worn by human operators or mounted on humanoids, providing accurate observations of tactile images, contact forces, and end-effector poses across both human demonstration and humanoid teleoperation modes. We introduce a force-conditioned target-pose correction module that converts measured human poses into contact-aware robot targets by learning corrections from teleoperation data. To leverage the natural force interaction in human data, we propose a force-supervised planner that predicts end-effector pose chunks and contact-force trajectories. The predicted contact force serves as the reference for a tactile-based admittance controller. Across five contact-rich tasks spanning deformable objects, bulky rigid objects, and human-humanoid collaboration, WT-UMI improves success rate and reduces contact-position tracking error over four policy baselines. Our project page is available at https://wt-umi.github.io/WTUMI/.

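2606.13223 2026-06-12 cs.LG cs.CV New submission

Distributional Loss for Robust Classification

Kathleen Anderson, Thomas Martinetz

AI summary: Proposes a distributional loss concept based on a bimodal Gaussian distribution; the softened target implicitly captures class ambiguity, mitigates overfitting, and makes decision boundaries more robust, with especially pronounced gains in low-data regimes.

Details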
Comments
ICANN 2026
English abstract

This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

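2606.13218 2026-06-12 cs.CL New submission

When Similar Means Different: Evaluating LLMs on Arabic-Hebrew Cognates

Junhong Liang, Noor Abo Mokh, Bashar Alhafni

AI summary: Builds the SemCog Bench benchmark (1,858 word pairs) of Arabic-Hebrew cognates, false friends, and loanwords to evaluate cross-lingual semantic understanding in LLMs; finds that models rely on surface-form similarity, perform poorly on false friends and loanwords, and gain little from context.

Details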
English abstract

Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic-Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form-meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

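2606.13190 2026-06-12 cs.RO cs.HC New submission

Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

Nataliya Kosmyna, Liz Jenkins, Anoop K. Sinha

AI summary: Proposes a framework that uses a consumer-grade brain-computer interface to monitor EEG signals and defers agent communications during high cognitive load, achieving cognitively aligned multi-agent interaction; preliminary results show that combining real-time signal processing, LLMs, and robots is feasible.

Details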
Comments
19 pages, 9 figures, for associated video, see https://youtu.be/0Tav-G87XGs
English abstract

While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human's internal cognitive state. Frequently, proactive multi-agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating "cognitively aligned" multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Using a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement-inducing task. We propose an engagement-driven pipeline where an HTTP-based signaling mechanism places a primary agent's sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human's cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively-aware, non-intrusive multi-agent systems.
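The hold-and-release behavior described in this abstract can be sketched as a small gate. This is a minimal illustration under stated assumptions, not the authors' implementation: the engagement index `beta / (alpha + theta)` is a commonly used EEG ratio chosen here for the example, the `threshold` value is hypothetical, band powers are assumed positive, and the HTTP signaling layer is left out entirely.

```python
from collections import deque
from dataclasses import dataclass, field


def engagement_index(theta: float, alpha: float, beta: float) -> float:
    """Assumed index: beta / (alpha + theta); higher means more engaged."""
    return beta / (alpha + theta)


@dataclass
class EngagementGate:
    """Holds agent messages while the user is engaged, releases them afterwards."""
    threshold: float                      # hypothetical cut-off, tuned per user
    queue: deque = field(default_factory=deque)
    holding: bool = False

    def update(self, theta: float, alpha: float, beta: float) -> list:
        """Feed one EEG band-power sample; return any messages released by it."""
        self.holding = engagement_index(theta, alpha, beta) >= self.threshold
        if self.holding:
            return []
        released = list(self.queue)
        self.queue.clear()
        return released

    def submit(self, message: str) -> list:
        """Queue the message while holding, otherwise deliver it immediately."""
        if self.holding:
            self.queue.append(message)
            return []
        return [message]


gate = EngagementGate(threshold=1.0)
gate.update(theta=4.0, alpha=4.0, beta=12.0)            # index 1.5: user is engaged
held = gate.submit("Report from secondary agent is ready")
released = gate.update(theta=6.0, alpha=8.0, beta=7.0)  # index 0.5: load has dropped
```

While the gate is holding, `submit` returns nothing and the message waits in the queue; the first low-engagement sample releases everything that accumulated.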

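2606.13177 2026-06-12 cs.CL cs.AI cs.LG New submission

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

AI summary: Proposes the MemRefine framework, which uses LLM judgments of factual content to compress a memory store to a fixed budget through delete, merge, and preserve operations, preserving downstream performance on multiple benchmarks and outperforming rule-based baselines.

Details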
English abstract

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. This is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management: the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework. Because surface similarity poorly reflects factual value, MemRefine uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
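The propose-then-judge loop in this abstract can be sketched roughly as follows. Everything named here is an assumption for illustration: `jaccard` stands in for whatever similarity measure is used, `stub_judge` is a hypothetical rule-based placeholder for the LLM judge, and the budget is counted in entries rather than bytes or tokens.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used only to propose candidate pairs."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def stub_judge(a: str, b: str) -> tuple:
    """Hypothetical stand-in for the LLM judge: returns (action, merged_text)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if wb <= wa:
        return ("delete_second", None)    # b adds no words beyond a
    if wa <= wb:
        return ("delete_first", None)     # a adds no words beyond b
    extra = " ".join(w for w in b.split() if w.lower() not in wa)
    return ("merge", a + " " + extra)


def compress(memory: list, budget: int, judge=stub_judge) -> list:
    """Shrink `memory` to at most `budget` entries, most similar pairs first."""
    memory = list(memory)
    preserved = set()                     # pairs the judge chose to keep apart
    while len(memory) > budget:
        pairs = [p for p in combinations(range(len(memory)), 2)
                 if (memory[p[0]], memory[p[1]]) not in preserved]
        if not pairs:
            break                         # judge preserved everything; budget unmet
        i, j = max(pairs, key=lambda p: jaccard(memory[p[0]], memory[p[1]]))
        action, merged = judge(memory[i], memory[j])
        if action == "preserve":
            preserved.add((memory[i], memory[j]))
        elif action == "delete_second":
            del memory[j]
        elif action == "delete_first":
            del memory[i]
        else:                             # "merge": replace the pair with one entry
            memory[i] = merged
            del memory[j]
    return memory


store = [
    "Alice lives in Paris",
    "Alice lives in Paris",
    "Bob likes tea",
    "Alice lives in Paris and works at a bank",
]
compact = compress(store, budget=2)
```

Similarity only ranks which pair to look at next; whether anything is removed is left to the judge, so a pair that looks alike but carries distinct facts can still be preserved.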

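2606.13104 2026-06-12 cs.LG New submission

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Aryan Khurana, Aravind Ramana RN, Dhruv Kumar

AI summary: Proposes the AuthorityBench benchmark, which uses a 2x2 factorial design to isolate the effect of citation authority signals on LLM epistemic behavior; finds that citation presence, whether real or fabricated, raises hallucination rates, with rates rising by 3 to 22 percentage points when true claims are paired with fabricated citations.

Details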
Comments
10 pages, 5 figures. Accepted to AI4GOOD and EIML at ICML 2026
English abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench
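The 2x2 factorial layout and the percentage-point comparison against a no-citation baseline can be made concrete with a short sketch. The condition names and the boolean flags below are invented for illustration; they are not the benchmark's schema or its results.

```python
from itertools import product

CLAIMS = ("true_claim", "false_claim")
CITATIONS = ("real_citation", "fabricated_citation")

# Every cell of the 2x2 design, plus the no-citation baseline for each claim type.
cells = list(product(CLAIMS, CITATIONS))
baselines = [(claim, "no_citation") for claim in CLAIMS]


def hallucination_rate(flags: list) -> float:
    """Percentage of responses flagged as hallucinated (flags are booleans)."""
    return 100.0 * sum(flags) / len(flags)


def citation_effect(with_citation: list, baseline: list) -> float:
    """Effect in percentage points: rate with a citation minus the baseline rate."""
    return hallucination_rate(with_citation) - hallucination_rate(baseline)


# Made-up flags for illustration only; these are not results from the paper.
baseline_flags = [True] * 2 + [False] * 8        # 20% hallucinated
fabricated_flags = [True] * 3 + [False] * 7      # 30% hallucinated
effect = citation_effect(fabricated_flags, baseline_flags)
```

Reporting the effect in percentage points (here 30% minus 20%) keeps it distinct from a relative increase, which for the same numbers would be 50%.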

2606.13054 2026-06-12 cs.LG cs.AI New submission

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

AI summary: Proposes the TWLA framework, which achieves 1.58-bit weights and 4-bit activations via post-training quantization, addresses heavy-tailed activation distributions, and accelerates inference.

Details
Comments
Accepted by ICML 2026
English abstract

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at <https://github.com/Kishon-zzx/TWLA>.
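For readers unfamiliar with the "1.58-bit" figure: a ternary weight takes one of three values, so it carries log2(3), about 1.585 bits. The sketch below shows that arithmetic alongside a plain absmean-scaled rounding ternarizer. This is a deliberately simple baseline chosen for illustration; it is not E2M-ATQ or any other TWLA component, and the example weights are made up.

```python
import math

BITS_PER_TERNARY_WEIGHT = math.log2(3)   # about 1.585 bits for three states


def ternarize(weights: list) -> tuple:
    """Map weights to {-1, 0, +1} with one shared scale (mean absolute value)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list, scale: float) -> list:
    """Reconstruct approximate weights from ternary codes."""
    return [c * scale for c in codes]


codes, scale = ternarize([0.9, -0.05, -1.1, 0.4, 0.05])
approx = dequantize(codes, scale)
```

Small weights collapse to 0 and large ones saturate at plus or minus one scale unit, which is the per-weight error that more careful quantizers try to reduce at the layer-output level.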

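2606.13028 2026-06-12 cs.RO cs.CV New submission

Comparing Commercial Depth Sensor Accuracy for Medical Applications

Pit Henrich, Maximilian Weiherer, Franziska Hansen, Bernhard Egger, Franziska Mathis-Ullrich

AI summary: Using stylus-sampled references on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom, compares the accuracy of four depth sensors spanning stereo, structured-light, and time-of-flight technologies at a distance of about 50 cm, finding that the Zivid 2M+ 60 performs best across all objects and metrics.

Details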
Comments
4 Pages
English abstract

Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

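2606.13020 2026-06-12 cs.AI New submission

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

Pierre Beckmann, Marco Valentino, Andre Freitas

AI summary: Proposes the SciR benchmark, which generates verifiable multi-paradigm scientific reasoning tasks from formal objects and controls two dimensions, information extraction difficulty and inference difficulty, revealing weaknesses of LLMs in scientific reasoning.

Details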
English abstract

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

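2606.12988 2026-06-12 cs.CV cs.AI New submission

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

Manex Atxa, Bruno Simoes, Julen Balzategui

AI summary: Proposes a method that uses 3D volumetric video data to predict ergonomic and non-ergonomic poses in real time, combining multi-angle analysis of 3D point clouds with a personalized deep learning classifier to overcome the occlusion problems of fixed viewpoints.

Details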
Comments
13 pages, 7 figures, conference 24CMH
English abstract

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

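2606.12969 2026-06-12 cs.AI New submission

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

Quan Quan

AI summary: Proposes a multi-modal agent framework and systematically evaluates foundation models on perception, reasoning, and tool usage for closed-loop automation of power distribution defect detection.

Details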
English abstract

The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions, such as querying knowledge bases or generating work orders, to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

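2606.12954 2026-06-12 cs.RO New submission

Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025

Wei Yu, Xidan Zhang, Ziyi Zheng, Weijie Kong, Huixu Dong

AI summary: For sequential object picking in clutter, proposes an integrated hardware-software pipeline that combines a multifunctional gripper design with novel representations of object distribution and occlusion relationships, enabling efficient recognition, search, and sequential grasping; the solution took second place at RGMC 2025.

Details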
Comments
First, Second and Third Coauthor contributed equally to this work
English abstract

As a long-standing challenge in robotic manipulation, stable and efficient grasping in cluttered environments is of great importance in industrial settings. While recent studies have achieved relatively high success rates in grasping from clutter, there remain few mature solutions for more demanding tasks such as sequential object search and sorting. This work addresses sequential object picking in cluttered environments based on the Cluttered Environment Picking Benchmark (CEPB) and presents our solution to the Pick-in-Clutter track of the 10th Robotic Grasping and Manipulation Competition (RGMC) at ICRA 2025. The task poses several key challenges. First, it requires robust and collision-aware grasping with high success rates across a diverse set of objects, including both rigid and deformable ones. Second, it demands efficient search for target objects, which places stringent requirements on the decluttering and searching strategies of the solution. To address the above challenges, we design an integrated hardware-software pipeline that combines object recognition, decluttering, and multi-modal grasping. The main contributions include the hardware design of a multifunctional gripper and novel representations for object distribution and occlusion relationships in cluttered space. This pipeline enables efficient recognition, search, and sequential grasping of objects in clutter, demonstrating strong performance in both laboratory tests and competition scenarios, and ultimately achieving second place in the Pick-in-Clutter track of the RGMC 2025.

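2606.12940 2026-06-12 cs.SD cs.LG New submission

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang

AI summary: Proposes self-guidance, which aligns the decoder's internal manifolds with a lightweight feature-mapping loss and improves the reconstruction quality of VQ-VAE neural speech codecs without changing inference, achieving state-of-the-art low-bitrate performance and enabling a 4x codebook reduction.

Details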
Comments
20 pages, 9 figures, accepted to ICML 2026, demo website available at https://sgvqvae.github.io/sgvqvae-demo
English abstract

Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

2606.12936 2026-06-12 cs.RO cs.AI New submission

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang, He Xu, Peijia Li, Jiaming Gu, Quan Lu, Qi Wang, Bin Ji, Ting Xiao

AI summary: Proposes the Pipette platform, comprising editable assets, a simulation-based data augmentation pipeline, and an 11-task benchmark; with 30 demonstrations per task, augmentation raises the SmolVLA success rate from 44.1% to 74.7%.

Details
Comments
25 pages, 17 figures
English abstract

Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters the generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves a 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and π0 from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.
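The replay, perturb, and filter pattern behind this kind of augmentation can be sketched with a toy one-dimensional simulator. All of it is assumed for illustration: the `rollout` simulator, the noise level, the speed choices, and the tolerance are hypothetical, and lighting and camera perturbations are omitted because this toy has no rendered observations.

```python
import random


def rollout(actions: list) -> float:
    """Toy simulator: the final position is the sum of 1-D displacement actions."""
    return sum(actions)


def perturb(actions: list, rng: random.Random, noise: float, speed: int) -> list:
    """Merge `speed` consecutive steps into one, then add Gaussian action noise."""
    merged = [sum(actions[i:i + speed]) for i in range(0, len(actions), speed)]
    return [a + rng.gauss(0.0, noise) for a in merged]


def augment(demo: list, target: float, tol: float, copies: int, seed: int = 0) -> list:
    """Generate perturbed replays and keep only those that still reach the target."""
    rng = random.Random(seed)
    kept = []
    for _ in range(copies):
        candidate = perturb(demo, rng, noise=0.02, speed=rng.choice([1, 2]))
        if abs(rollout(candidate) - target) <= tol:   # automatic success check
            kept.append(candidate)
    return kept


demo = [0.1] * 10                        # one hypothetical demonstration reaching 1.0
episodes = augment(demo, target=1.0, tol=0.05, copies=50)
```

The success check is what keeps perturbation from polluting the training set: a replay that drifts outside the tolerance is discarded rather than kept as a noisy positive example.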

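2606.12930 2026-06-12 cs.LG New submission

Is Spurious Correlation Removal Always Learnable?

Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang, Xiaokang Zhang, Ruifan Zhang

AI summary: Studies computational barriers to invariant learning when the invariant structure is statistically identifiable; shows a conditional barrier in which samplable multi-environment instances with a one-dimensional invariant subspace cannot be recovered to constant accuracy in polynomial time, and quantifies how environment diversity affects identifiability and risk.

Details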
Comments
Poster paper at ICML 2026
English abstract

Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\operatorname{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.
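To get a feel for the scaling in these two expressions, the short sketch below evaluates them for example values. The constant `c` and all numeric inputs are assumptions: the abstract gives the rates only up to constant factors, so the absolute numbers mean nothing and only the ratios are informative.

```python
def risk_rate(k: int, d: int, n: int, num_envs: int) -> float:
    """Rate k(d-k)/(n|E|) from the minimax bound, ignoring constant factors."""
    return k * (d - k) / (n * num_envs)


def critical_samples(k: int, d: int, num_envs: int, gamma: float, c: float = 1.0) -> float:
    """Transition point n* = c * k(d-k) / (|E| gamma^2); c is an unknown constant."""
    return c * k * (d - k) / (num_envs * gamma ** 2)


# Same problem size, two levels of environment separation.
n_wide = critical_samples(k=1, d=101, num_envs=4, gamma=0.5)
n_narrow = critical_samples(k=1, d=101, num_envs=4, gamma=0.25)
ratio = n_narrow / n_wide
```

Halving the separation parameter quadruples the sample size at which the transition occurs, while doubling the number of environments halves both the rate and the transition point.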