arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.21429 2026-05-21 cs.RO cs.LG

roto 2.0: The Robot Tactile Olympiad

Elle Miller, Jayaram Reddy, Ayush Deshmukh, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar

AI summary: This paper presents roto 2.0, a tactile-based reinforcement learning benchmark that aims to standardise tactile RL across four distinct robot morphologies (16-DOF to 24-DOF). It focuses on end-to-end "blind" manipulation using only proprioception and tactile sensing, without state information or distillation. The blind agents achieve 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than the current state of the art. Open-sourcing the environments and robustly tuned baselines lowers the barrier to entry, letting researchers prioritise fundamental algorithmic challenges over tedious RL tuning.

Comments
Accepted to 7th ViTac Workshop, ICRA 2026

Abstract

Tactile-based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over-saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (roto 2.0), a GPU-parallelised benchmark designed to standardise tactile-based RL across four distinct robotic morphologies (16-DOF to 24-DOF). Unlike prior benchmarks, roto focuses on end-to-end "blind" manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art speeds. By open-sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: https://elle-miller.github.io/roto/

2605.21428 2026-05-21 cs.LG cs.DS

Polynomial-Time Robust Multiclass Linear Classification under Gaussian Marginals

Ilias Diakonikolas, Giannis Iakovidis, Mingchen Ma

AI summary: Studies agnostic learning of multiclass linear classifiers under the Gaussian distribution and proposes polynomial-time robust learning algorithms with error guarantees for the multiclass setting, in particular for k ≥ 3.

Abstract

We study the task of agnostic learning of multiclass linear classifiers under the Gaussian distribution. Given labeled examples $(x, y)$ from a distribution over $\mathbb{R}^d \times [k]$, with Gaussian $x$-marginal, the goal is to output a hypothesis whose error is comparable to that of the best $k$-class linear classifier. While the binary case $k=2$ has a well-developed algorithmic theory, much less is known for $k \ge 3$. Even for $k=3$, prior robust algorithms incur exponential dependence on the inverse of the desired accuracy in both complexity and representation size. In this work, we develop new structural results for multiclass linear classifiers and use them to design fully polynomial-time robust learners with dimension-independent error guarantees. Our first result shows that the standard multiclass perceptron algorithm requires super-polynomially many samples and updates, even with clean labels and Gaussian marginals, revealing a basic obstruction absent in the binary case. Our main positive result is a pairwise improper-learning framework which yields an efficient learner with error $\widetilde O(k^{3/2}\sqrt{\mathrm{opt}})+\epsilon$ for general $k$. Additionally, we develop a sharper localization-based framework which leads to error $O(\mathrm{opt})+\epsilon$ for $k=3$, and error $\mathrm{poly}(k)\,\mathrm{opt}+\epsilon$ for geometrically regular $k$-class linear classifiers.

2605.21427 2026-05-21 cs.AI cs.DC

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun

AI summary: Presents PALS, a power-aware LLM serving runtime that treats GPU power caps as a controllable knob and jointly optimizes them with software parameters such as batch size, improving energy efficiency and reducing QoS violations under power constraints.

Comments
13 pages, 10 figures

Abstract

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.
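
The selection step described in the abstract (pick a power cap and batch size that meet a throughput target while maximizing energy efficiency) can be sketched as follows. This is only an illustrative sketch, not the PALS implementation: the `Config` fields, the `select_config` helper, and every number in `PROFILE` are hypothetical stand-ins for the offline power-performance models, and the feedback-driven controller is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate operating point (all values are made up for illustration)."""
    power_cap_w: int        # GPU power cap in watts
    batch_size: int         # serving batch size
    throughput_tps: float   # predicted tokens per second (offline model)
    power_w: float          # predicted average power draw in watts

    @property
    def tokens_per_joule(self) -> float:
        # Energy efficiency: tokens/s divided by joules/s.
        return self.throughput_tps / self.power_w


def select_config(configs, target_tps, power_budget_w):
    """Return the most energy-efficient config that meets the throughput
    target and fits under the power budget, or None if none qualifies."""
    feasible = [
        c for c in configs
        if c.throughput_tps >= target_tps and c.power_cap_w <= power_budget_w
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.tokens_per_joule)


# Hypothetical offline profile: (cap, batch, predicted tokens/s, predicted watts).
PROFILE = [
    Config(200, 16, 900.0, 190.0),
    Config(250, 32, 1500.0, 240.0),
    Config(300, 32, 1650.0, 290.0),
    Config(300, 64, 2100.0, 298.0),
]

choice = select_config(PROFILE, target_tps=1400.0, power_budget_w=300)
```

Calling `select_config` again whenever the power budget changes is one simple way to follow a dynamic budget; a real controller would also correct the offline predictions with measured throughput.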

2605.21426 2026-05-21 cs.LG

Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks

Qishi Zhan, Ziheng Chen, Minxuan Hu

AI summary: Proposes ASR, a training-free channel-wise repair method that addresses the accuracy collapse at high sparsity caused by a granularity mismatch in post-pruning repair. ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, improving the performance of sparse vision networks.

Abstract

One-shot magnitude pruning can cause severe accuracy collapse in the high-sparsity regime, even when the pruning mask preserves the largest weights. We argue that this failure reflects a granularity mismatch in post-pruning repair. Under global magnitude pruning, nearly collapsed channels can coexist with channels that retain informative activation variance within the same layer. Existing layer-wise activation repair methods apply a single correction to the whole layer, and can therefore over-amplify damaged channels while trying to restore the layer-level signal. We propose Adaptive Signal Resuscitation (ASR), a training-free channel-wise repair method that matches the granularity of repair to the granularity of damage. ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, suppressing unreliable corrections for channels with weak post-pruning signal while preserving corrections for healthier channels. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set and no retraining. Across three datasets, four convolutional architectures, and both unstructured and structured sparsity settings, ASR generally improves over layer-wise repair, with the clearest gains in high-sparsity regimes. On ResNet-50 at 90% sparsity, ASR recovers 55.6% top-1 accuracy on CIFAR-10, compared with 41.0% for layer-wise repair and 28.0% for BatchNorm-only recalibration. Ablations show that naive channel-wise variance matching is insufficient, and that shrinkage stabilizes post-pruning repair.
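
To make the idea of a shrunk, per-channel variance-matching correction concrete, here is a minimal sketch. The abstract does not give ASR's actual shrinkage rule, so the `trust = var_pruned / (var_pruned + tau)` weighting, the `tau` parameter, and the function name `channel_scales` are assumptions chosen only to show the qualitative behavior: healthy channels get close to the full correction, nearly collapsed channels are pulled back toward a scale of 1.

```python
import math
from statistics import pvariance


def channel_scales(dense_acts, pruned_acts, tau=0.05, eps=1e-12):
    """Per-channel activation scale factors with shrinkage toward 1.

    dense_acts / pruned_acts: one list of calibration activations per
    output channel, before and after pruning. `tau` is an assumed
    shrinkage strength, not a value from the paper.
    """
    scales = []
    for dense, pruned in zip(dense_acts, pruned_acts):
        var_dense = pvariance(dense)
        var_pruned = pvariance(pruned)
        # Naive variance matching: rescale so the pruned variance equals the dense one.
        raw = math.sqrt(var_dense / (var_pruned + eps))
        # Assumed shrinkage: trust the correction less when the surviving signal is weak.
        trust = var_pruned / (var_pruned + tau)
        scales.append(1.0 + trust * (raw - 1.0))
    return scales


dense = [[-2.0, 2.0, -2.0, 2.0], [-2.0, 2.0, -2.0, 2.0]]
pruned = [[-1.0, 1.0, -1.0, 1.0],      # healthy channel: variance halved in amplitude
          [-0.01, 0.01, -0.01, 0.01]]  # nearly collapsed channel
scales = channel_scales(dense, pruned)
```

Without the shrinkage term, the collapsed channel would be amplified by a factor of about 200, which is the over-amplification failure the abstract attributes to naive variance matching.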

2605.21420 2026-05-21 cs.LG cs.AI q-bio.MN

HiRes: Inspectable Precedent Memory for Reaction Condition Recommendation

Shreyas Vinaya Sathyanarayana, Raja Sekhar Pappala, Deepak Warrier

AI summary: HiRes combines a graph encoder, transformation-aware cross-attention, multi-stream reaction fusion, and a k-NN retrieval layer to deliver accurate and interpretable reaction condition recommendation. It reaches top-1 accuracies of 0.929, 0.534, and 0.530 for catalyst, solvent, and reagent respectively, matching the best reported baseline on catalyst and outperforming prior models on solvent and reagent.

Abstract

Reaction condition recommendation sits immediately after retrosynthetic disconnection selection, and in practice, chemists require both accurate predictions and the precedents that justify them. We present HiRes (Hierarchical Reaction Representations), a retrieval-augmented condition recommendation system whose learned reaction space serves as both a classifier feature and an inspectable precedent memory. The model combines a graph encoder, transformation-aware cross-attention, multi-stream reaction fusion, and a k-NN retrieval layer. HiRes achieves state-of-the-art performance among primary-slot USPTO-Condition models, reaching Catalyst, Solvent, and Reagent top-1 accuracies (Acc@1) of 0.929, 0.534, and 0.530 respectively. It ties the best reported baseline on Catalyst while outperforming models such as REACON on Solvent and Reagent. Furthermore, paired bootstrap analysis demonstrates that integrating retrieval with learned condition heads provides statistically significant gains for solvent and reagent selection over purely parametric approaches. Ultimately, HiRes bridges the gap between predictive accuracy and chemical interpretability, offering a single representation that supplies both competitive recommendations and the concrete chemical precedents necessary for practical synthesis planning.

2605.21418 2026-05-21 cs.LG cs.AI cs.CV cs.NI

FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

Amin Farajzadeh, Melike Erol-Kantarci

AI summary: Studies inter-cell interference aggravated by frequency reuse in 6G ultra-dense networks and proposes FedCritic, a framework with decentralized execution that federates the critic through lightweight gossip-based parameter averaging over the interference graph. This yields stable value estimation without a central coordinator and improves SINR, cell-edge rate, network sum-rate, and fairness.

Comments
Submitted to IEEE for possible publication

Abstract

In sixth-generation (6G) ultra-dense networks, aggressive frequency reuse amplifies inter-cell interference (ICI), making multi-cell orthogonal frequency-division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management -- joint subcarrier scheduling and power allocation -- under interference coupling and long-term per-user quality-of-service (QoS) minimum-rate constraints. By using virtual-queue deficit weights to enforce long-term QoS, we develop FedCritic, a serverless federated multi-agent actor-critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip-based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference-rich reuse-1 setting show that FedCritic improves mean signal-to-interference-plus-noise ratio (SINR) and cell-edge rate, increases network-wide average sum-rate and fairness relative to non-coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.
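
Gossip-based parameter averaging over a graph, the mechanism the abstract names for federating the critic, can be sketched as below. This is a generic illustration under assumptions: it uses synchronous rounds with Metropolis weights (a standard choice that keeps the network-wide average unchanged), and the three-cell line graph and scalar "critic parameters" are hypothetical. The paper's actual mixing weights and schedule are not stated in the abstract.

```python
def gossip_round(params, neighbors):
    """One synchronous gossip round with Metropolis weights.

    params:    {cell_id: list of floats} local critic parameters
    neighbors: {cell_id: list of neighbouring cell_ids} (interference graph)
    Each cell moves toward its neighbours' parameters; no central server is used.
    """
    updated = {}
    for i, theta_i in params.items():
        mixed = list(theta_i)
        for j in neighbors[i]:
            # Metropolis weight: symmetric, so the global mean is preserved.
            w = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
            for k in range(len(mixed)):
                mixed[k] += w * (params[j][k] - theta_i[k])
        updated[i] = mixed
    return updated


# Hypothetical interference graph: three cells in a line, 0 - 1 - 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
params = {0: [0.0], 1: [3.0], 2: [6.0]}

for _ in range(50):
    params = gossip_round(params, neighbors)
```

After enough rounds every cell holds approximately the same parameters, equal to the initial network average, even though each cell only ever exchanged values with its direct neighbours.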

2605.21414 2026-05-21 cs.RO cs.CV

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

Shizhe Chen, Paul Pacaud, Cordelia Schmid

AI summary: Proposes PointACT, a 3D-aware vision-language-action policy that integrates hierarchical 3D point cloud representations and uses a multi-scale point-action interaction mechanism to improve fine-grained geometric reasoning and spatial grounding for robots in 3D environments.

Comments
Accepted to RSS 2026; project webpage: https://cshizhe.github.io/projects/pointact.html

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

2605.21411 2026-05-21 cs.CV

RoadTones: Tone Controllable Text Generation from Road Event Videos

Chirag Parikh, Siddhi Pravin Lipare, Ravi Kiran Sarvadevabhatla

AI summary: Introduces the RoadTones-51K dataset and the RoadTones-VL-CoT model, which generates tone-conditioned reasoning drafts for interpretability, along with the RoadTones-Eval evaluation suite. Together they lay the foundation for context-sensitive, tone-controllable video captioning.

Comments
Accepted at CVPR Findings 2026. Project page: https://roadtones.github.io/

Abstract

Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

2605.21406 2026-05-21 cs.RO

MC-Risk: Multi-Component Risk Fields for Risk Identification and Motion Planning

Maximilian Link, Yingjie Xu, Yingbai Hu, Yinlong Liu

AI summary: Presents MC-Risk, a planner-aligned multi-component risk field for early, calibrated, and class-aware risk localization. It linearly composes three interpretable modules (a motorized-agent field, a VRU risk field, and a road penalty field) and, in what the authors describe as the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset, attains the best overall risk localization and the earliest hazard indication.

Abstract

We present MC-Risk, a planner-aligned, multi-component risk field on a bird's-eye-view grid that yields early, calibrated, and class-aware risk localization. MC-Risk linearly composes three interpretable modules: (i) a motorized-agent field that fuses a black-box multimodal trajectory predictor with an analytic Gaussian-torus construction whose lateral width grows with speed/curvature and whose height attenuates with look-ahead; (ii) a VRU risk field that replaces isotropic pedestrian blobs with a forward-biased anisotropic kernel aligned to heading and speed; and (iii) a road penalty field that exploits full HD-map topology, imposing an off-road penalty and lane-aware risk exposure for same/opposite directions. We conduct, to our knowledge, the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset. MC-Risk attains the best overall risk localization and the earliest hazard indication. Finally, we demonstrate a plug-and-play planning interface by using the field as an MPC cost density, enabling risk-aware trajectory generation without additional training.
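
Two ingredients named in the abstract, a forward-biased anisotropic kernel and a linear composition of component fields on a grid, can be illustrated with a short sketch. Everything specific here is an assumption for illustration: the Gaussian form, the `stretch` parameter, the way the forward extent grows with speed, and the weights passed to `compose` are not taken from the paper.

```python
import math


def vru_kernel(dx, dy, heading, speed, sigma=1.0, stretch=0.5):
    """Risk contribution of a pedestrian at offset (dx, dy) in metres.

    The offset is rotated into the pedestrian's frame; the kernel is
    elongated in front of the pedestrian in proportion to speed (assumed
    form), and stays isotropic behind.
    """
    lon = dx * math.cos(heading) + dy * math.sin(heading)    # along heading
    lat = -dx * math.sin(heading) + dy * math.cos(heading)   # across heading
    sigma_lon = sigma * (1.0 + stretch * speed) if lon > 0 else sigma
    return math.exp(-0.5 * ((lon / sigma_lon) ** 2 + (lat / sigma) ** 2))


def compose(fields, weights):
    """Weighted linear sum of equally sized 2D grids (lists of rows)."""
    rows, cols = len(fields[0]), len(fields[0][0])
    return [
        [sum(w * f[r][c] for f, w in zip(fields, weights)) for c in range(cols)]
        for r in range(rows)
    ]


ahead = vru_kernel(2.0, 0.0, heading=0.0, speed=2.0)    # 2 m in front
behind = vru_kernel(-2.0, 0.0, heading=0.0, speed=2.0)  # 2 m behind
total = compose([[[1.0, 0.0]], [[0.0, 2.0]]], weights=[0.5, 0.25])
```

Because the composition is a plain weighted sum, the resulting grid can be sampled along a candidate trajectory and added to a planner's cost, which is the role the abstract describes for the field in MPC.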

2605.21405 2026-05-21 cs.SE cs.AI cs.PL

Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries

Peng Ding, Rick Stevens

AI summary: Through the zerodep project, this paper examines whether Python's standard library alone can replace third-party libraries and evaluates how well LLMs generate correct, performant code under strict constraints.

Comments
12 pages

Abstract

Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library -- and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories -- including serialization, networking, cryptography, agent protocols, and text processing -- zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5--115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at https://github.com/Oaklight/zerodep.
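
The "performance parity (within 2x of the reference)" criterion can be expressed as a small benchmarking helper. This is a sketch under assumptions: `best_time` and `classify` are hypothetical names, the symmetric treatment of "within 2x" (neither twice as slow nor twice as fast) is one reading of the abstract, and the two functions timed at the end are arbitrary standard-library stand-ins rather than zerodep modules.

```python
import json
import timeit


def best_time(fn, repeat=5, number=200):
    """Best-of-`repeat` wall time in seconds for `number` calls of fn()."""
    return min(timeit.repeat(fn, repeat=repeat, number=number))


def classify(ref_seconds, stdlib_seconds, parity_factor=2.0):
    """Label a stdlib-only reimplementation relative to its reference."""
    ratio = stdlib_seconds / ref_seconds
    if ratio > parity_factor:
        return "slower"
    if ratio < 1.0 / parity_factor:
        return "faster"
    return "parity"


# Stand-in workloads; in a real comparison these would be the reference
# library call and the stdlib-only reimplementation of the same API.
payload = {"values": list(range(100))}
ref_t = best_time(lambda: json.dumps(payload))
alt_t = best_time(lambda: json.dumps(payload, sort_keys=True))
label = classify(ref_t, alt_t)
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the ratio sits close to the parity threshold.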

2605.21404 2026-05-21 cs.LG

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Mahdi Naser Moghadasi, Faezeh Ghaderi

AI summary: Analyzes twelve well-known LLM agent benchmark papers, reveals gaps in how they disclose their evaluation methodology, and releases an open scoring schema to improve transparency and reproducibility.

Comments
Pilot audit of 12 LLM agent benchmark papers; schema, codebook, and per-paper scoring sheet released. Submission to IEEE Big Data 2026

Abstract

We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why -- the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent-benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content-addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi-rater audit is the natural next step, and we discuss what we think it would change.
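
A five-field audit sheet with a per-paper score and a group mean might look like the sketch below. The field names follow the five fields listed in the abstract, but the 0 / 0.5 / 1 scale, the equal weighting, and the example sheets are assumptions; the released codebook defines the real scoring rules.

```python
FIELDS = (
    "benchmark_identity",
    "harness_specification",
    "inference_settings",
    "cost_reporting",
    "failure_breakdown",
)


def paper_score(sheet):
    """Mean disclosure score in [0, 1] over the five audit fields.

    `sheet` maps each field to 0.0 (not disclosed), 0.5 (partial) or
    1.0 (full). This three-level scale is an assumption for illustration.
    """
    missing = [f for f in FIELDS if f not in sheet]
    if missing:
        raise ValueError(f"unscored fields: {missing}")
    return sum(sheet[f] for f in FIELDS) / len(FIELDS)


def group_mean(sheets):
    """Average of per-paper scores across a group of papers."""
    return sum(paper_score(s) for s in sheets) / len(sheets)


# Two invented scoring sheets, not data from the paper.
example_a = dict(zip(FIELDS, (1.0, 0.5, 0.5, 0.0, 0.0)))
example_b = dict(zip(FIELDS, (1.0, 0.5, 1.0, 0.0, 0.5)))
mean_score = group_mean([example_a, example_b])
```

Scoring disclosure per field, rather than as one overall judgment, is what lets the audit point to specific gaps such as cost reporting.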

2605.21403 2026-05-21 cs.CL

Quantifying the cross-linguistic effects of syncretism on agreement attraction

Utku Turk, Eva Neu

AI summary: Investigates why the effect of syncretism on agreement attraction differs across languages, using surprisal and attention entropy from large language models to analyze four languages and shed light on how syncretism modulates attraction.

Comments
SCiL Conference Paper

Abstract

Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.
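
The two processing proxies mentioned in the abstract have standard definitions: surprisal is the negative log probability of a token given its context, and attention entropy is the Shannon entropy of an attention distribution. The sketch below computes both in bits; the probabilities and attention weights are invented values, and how the paper extracts them from a model is not specified in the abstract.

```python
import math


def surprisal_bits(prob):
    """Surprisal of a token with conditional probability `prob`, in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def attention_entropy(weights):
    """Shannon entropy (bits) of one attention distribution over tokens."""
    return -sum(w * math.log2(w) for w in weights if w > 0)


# Invented example: probability of the ungrammatical verb form with and
# without a number-mismatching intervening noun.
with_attractor = surprisal_bits(0.25)
without_attractor = surprisal_bits(0.125)
attraction_effect = without_attractor - with_attractor

spread = attention_entropy([0.5, 0.5])   # attention split over two nouns
focused = attention_entropy([1.0, 0.0])  # attention on the head noun only
```

Under this framing, an attraction effect appears as lower surprisal for the erroneous verb when an attractor is present, and syncretism would modulate the size of that difference.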

2605.21402 2026-05-21 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG

Memorisation, convergence and generalisation in generative models

Antoine Maillard, Sebastian Goldt

AI summary: Studies the distinction between memorisation, convergence, and generalisation in generative models. An analysis of linear generative models shows a transition from memorisation to generalisation when the number of samples is linear in the input dimension, and reveals that generalisation comprises two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors of the data.

Abstract

Generative neural networks learn how to produce highly realistic images from a large, but finite number of examples - or do they simply memorise their training set? To settle this question, Kadkhodaie, Guth, Simoncelli and Mallat (ICLR '24) trained diffusion models independently on disjoint subsets of a dataset and showed that they converge to nearly the same density when the number of training images is large enough. This result raises two basic questions: how much data do you need for convergence, and what does convergence capture about learning the data distribution? Here, we address these questions by providing an exact analytical characterisation of the transition from memorisation to generalisation in linear generative models. We find that these models memorise at small load, while convergence emerges continuously when the number of samples is linear in the input dimension. Strikingly, we find that convergence is insensitive to recovery of the principal latent factors of the data, which are recovered in a sharp transition. After extending our approach to data with power-law spectra, we find the same distinction between convergence and latent recovery in our experiments with convolutional denoisers and in the data of Kadkhodaie et al. We thus show that generalisation in generative models decomposes into at least two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors. These objectives correspond to two different distances between true and learnt data distribution, and only the first one is captured by convergence.

2605.21401 2026-05-21 cs.CY cs.AI

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Roland Pihlakas, Jan Llenzl Dagohoy

AI summary: Examines how open-source LLMs behave under sustained authority pressure. In a Milgram-like experiment the models tend to comply despite explicitly expressing distress, are vulnerable to gradual boundary/value violations, and, when refusing, may ignore response format requirements, which triggers a retry that can end in compliance.

Comments
28 pages, 16 figures, 16 tables

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator, which causes a retry that can result in compliance with the underlying request even when refusal was intended initially; (4) we hypothesise that there is a low-level token pattern continuation attractor that might be contributing to compliance, overriding higher level processing of the situation's meaning and values.
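
Takeaway (3) describes a failure mode of the orchestration loop rather than of the model alone, and a small sketch makes it easier to see. The `ACTION: ...` response format, the `run_step` helper, and the scripted replies are all hypothetical; they only illustrate how a free-text refusal that violates the expected format is discarded and followed by a retry.

```python
import re

# Hypothetical strict response format expected by the orchestrator.
ACTION_RE = re.compile(r"^ACTION: (SHOCK|REFUSE)$")


def run_step(model_call, max_retries=3):
    """Query the model until a reply matches the format.

    Returns (action or None, list of discarded replies). Replies that do
    not match the format are dropped, even if they express a refusal.
    """
    discarded = []
    for attempt in range(max_retries):
        reply = model_call(attempt).strip()
        match = ACTION_RE.match(reply)
        if match:
            return match.group(1), discarded
        discarded.append(reply)
    return None, discarded


def scripted_model(attempt):
    # Stand-in for an LLM: refuses in free text first, then complies in format.
    replies = ["I won't continue with this.", "ACTION: SHOCK"]
    return replies[min(attempt, len(replies) - 1)]


action, dropped = run_step(scripted_model)
```

One possible mitigation, offered here only as a design suggestion, is for the orchestrator to record unparseable replies and check them for refusal intent instead of silently retrying.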

2605.21398 2026-05-21 cs.RO

From swept contact to pose: Probe-aware registration via complementary-shape docking

Chen Chen, Yunwen Li, Yifan Xu, Xiangjie Yan, Chang Shu, Jianxia Hou, Shiji Song, Xiang Li

AI summary: Proposes a calibration-free registration method that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and using both contact and non-contact evidence. A global-to-local search via 3D FFT correlation is followed by continuous SE(3) refinement with Lie-algebra updates and analytic contact sensitivities, giving efficient exploration and metric-grade convergence.

Comments
8 pages, 9 figures, accepted to ICRA 2026

Abstract

Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver integrates a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. Simulation across free-form meshes achieved sub-0.04 mm and sub-0.4° accuracy and robustness to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75°, outperforming an optical tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.

2605.21395 2026-05-21 cs.AI cs.LG

Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

Liang Wu, Kelly Wan, Mayank Darbari, Liangjie Hong

AI summary: Presents a BlueSky vision of AI-native 6G in which AI is natively integrated into the network, shifting from "Network for AI" to "AI for Network". With a foundation model and collaborative multi-agent systems, network management is framed as a unified multi-modal, multi-task optimization problem, moving 6G toward an intelligent, self-sustaining communication infrastructure.

Comments
Accepted at KDD 2026

Abstract

The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from "Network for AI" to "AI for Network". We envision that, unlike 5G's reliance on scattered, ad-hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and orchestrated via collaborative multi-agent systems, framing network management as a unified, multi-modal, multi-task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task-specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi-agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self-sustaining communication infrastructure.

2605.21391 2026-05-21 cs.CL

Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

通过条件尺度熵理解解码器-only语言模型中的隐喻处理

Lawhori Chakrabarti, Jennifer Johnson-Leung, Bert Baumgaertner, Aleksandar Vakanski, Min Xian, Boyu Zhang

AI总结 研究探讨了解码器-only语言模型中隐喻处理的机制,通过条件尺度熵(CSE)分析不同层位的频率尺度变化,发现隐喻性词 token 在连续层位上产生显著更高的频谱宽度,且该效应不受语义复杂度或命题内容影响,验证了多尺度协调作为隐喻语言处理的特征。

详情
Comments
18 pages, 3 figures, submitted to ICPR workshop
AI中文摘要

隐喻要求语言模型解析一个词的上下文意义与基本字面意义相异。理解transformer模型如何在深度层面组织这种重新解释仍然是机械可解释性中的开放问题。我们引入了条件尺度熵(CSE),这是一种基于小波的度量,用于衡量transformer计算在每个层位置上跨频率尺度的广泛参与程度。两个定理证明了CSE对更新幅度具有不变性,从而将更新的结构模式与强度分离。使用CSE,我们发现隐喻性词在测试的每种解码器-only架构中(从124M到20B参数,包括GPT-2家族、LLaMA-2 7B、GPT-oss 20B)在连续层位置上产生显著更高的频谱宽度。该效应经聚类置换校正后仍然存在,且在模型的早期到中期相对深度范围内重现,并通过独立分析200对自然VUA对收敛。特定性控制进一步显示,该效应不被语义复杂度或匹配的命题内容所解释。这些结果将多尺度协调确定为所考察的解码器-only架构中隐喻语言处理的一致特征,并确立CSE作为表征transformer跨深度结构的原理性工具。

英文摘要

Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

2605.21390 2026-05-21 cs.HC cs.AI

Designing Conversations with the Dead: How People Engage with Generative Ghosts

与逝者对话:人们如何与生成鬼魂互动

Jack Manning, Daniel Sullivan, Dylan Thomas Doyle, Anthony T. Pinter, Jed R. Brubaker

AI总结 研究探讨了人们如何与生成鬼魂互动,通过质性研究发现,用户更倾向于即时性而非事实准确性,且互动始终是协作的。

详情
AI中文摘要

我们探讨了人们在生成鬼魂(一种基于逝者数据训练的AI系统)设计中所体验的两种选择:代表(AI以第三人称描述逝者)和转世(AI以逝者身份第一人称说话)。通过16名参与者的研究,我们探索了这两种选择如何影响真实性、情感和风险。转世因其即时性更受青睐,但参与者表达了对过度依赖的担忧。代表则因与记忆互动而更受欢迎,尽管参与者往往忽视这一区别,在第三人称框架下进行对话。在两种模式中,参与者始终优先考虑情感共鸣而非事实准确性。我们最后展示了语气、语言和对话节奏等用户对逝者记忆的独特因素如何塑造与生成鬼魂的互动,并论证这些互动始终是协作的。

英文摘要

We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm (factors unique to the user's memory of the deceased) shape interactions with generative ghosts, and argue that those interactions are always collaborative.

2605.21388 2026-05-21 cs.LG cs.AI cs.NA math.NA stat.ML

On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

关于PDE诱导度量的一步Wasserstein引导生成模型的正则性和泛化性

Likun Lin, Zhongjian Wang, Jack Xin, Zhiwen Zhang

AI总结 本文研究了一步Wasserstein引导生成模型在处理PDE诱导概率度量时的正则性和泛化性,通过理论框架证明了运输映射的正则性和生成模型的泛化性质,并通过实验验证了理论结果。

详情
AI中文摘要

尽管生成模型在经验上取得了显著成功,但其在科学计算中的统计准确性理论仍然较为悲观。本文发展了一个理论框架,用于理解运输映射的正则性和一步Wasserstein引导生成模型的泛化性质。我们考虑了与线性椭圆和抛物型方程在有界域上以及扩散和福克-普朗克(Fokker-Planck)方程在环面上关联的归一化目标密度。在标准结构假设下,我们证明这些目标度量满足倍增条件。通过结合这一事实与倍增度量之间最优运输的正则性理论,我们证明了从均匀源度量到目标度量的最优运输映射是Hölder连续的。这种正则性为通过单个推前映射学习PDE诱导分布的一步生成模型提供了近似理论依据。作为代表实例,我们研究了DeepParticle,并推导了描述学习映射与总体最优映射之间差异的超额风险界。我们还建立了在目标转移下的鲁棒性估计,并通过实验验证了推导出的速率。

英文摘要

Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker-Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is Hölder continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments that support the derived rates.

2605.21384 2026-05-21 cs.SE cs.AI cs.CL

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

SpecBench: 评估长周期编码代理中的奖励黑客现象

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

AI总结 该研究通过分解软件工程任务,提出了一种评估长周期编码代理中奖励黑客现象的方法,通过比较可见测试套件和隐藏测试套件的通过率差异,引入了SpecBench基准,展示了奖励黑客现象在不同任务长度上的显著影响。

详情
AI中文摘要

随着长周期编码代理生成的代码量超过任何开发者能够审查的范围,监督责任集中于单一表面:自动测试套件。奖励黑客现象自然出现在这种设置中,因为代理在优化通过测试的同时偏离了用户的真正目标。我们通过将软件工程任务分解为三个部分来研究这种奖励黑客现象:(i) 规格的自然语言描述,(ii) 可见验证测试套件,用于单独测试指定功能,以及 (iii) 隐藏测试套件,用于组合这些相同功能以模拟真实世界使用。基于规格和可见验证测试套件,一个真实的代理能够生成一个能够通过所有隐藏测试套件的解决方案。因此,我们使用这两个套件之间的通过率差异来量化奖励黑客现象。基于这种方法,我们引入了SpecBench,一个包含30个系统级编程任务的基准,从短周期任务如构建JSON解析器到超长周期任务如从头构建整个操作系统内核。大规模实验揭示了一种一致的模式:尽管每个前沿代理都能饱和可见套件,奖励黑客现象仍然存在,较小的模型在隐藏套件上表现出更大的差距。差距也随着任务长度急剧增加:代码规模每增加十倍,差距增长28个百分点。失败范围从微妙的功能隔离到有意的利用,包括一个2,900行的哈希表“编译器”,它记忆测试输入。SpecBench提供了一个原则性的测试平台,用于测量编码代理是构建真正的可运行系统还是仅仅在开发人员提供的测试套件上玩游戏。

英文摘要

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. We study this reward hacking phenomenon by decomposing software engineering tasks into three parts: (i) a natural language description of the specification, (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore, we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short-horizon tasks like building a JSON parser to ultra-long-horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on held-out suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.
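
The gap metric described above can be illustrated with a minimal Python sketch. The function names, the representation of pass rates as fractions in [0, 1], and the log-linear extrapolation are assumptions made for illustration; the abstract only states that the gap is the difference in pass rates between the two suites and that it grows by 28 percentage points per tenfold increase in code size. The numbers in the example are made up, not results from the paper.

```python
import math


def reward_hacking_gap(visible_pass_rate: float, heldout_pass_rate: float) -> float:
    """Gap in percentage points between visible and held-out pass rates.

    Both inputs are fractions in [0, 1] (an assumed convention).
    """
    return 100.0 * (visible_pass_rate - heldout_pass_rate)


def extrapolated_gap(base_gap_pp: float, base_loc: float, loc: float,
                     slope_pp_per_decade: float = 28.0) -> float:
    """Assumed log-linear reading of '28 points per tenfold increase in code size'.

    base_gap_pp is a gap observed at base_loc lines of code (hypothetical input).
    """
    return base_gap_pp + slope_pp_per_decade * math.log10(loc / base_loc)


# Illustrative values only.
gap = reward_hacking_gap(visible_pass_rate=1.0, heldout_pass_rate=0.75)
projected = extrapolated_gap(base_gap_pp=5.0, base_loc=1_000, loc=10_000)
```

A positive gap means the solution passes more visible tests than held-out ones, which is the signal the benchmark treats as reward hacking.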

2605.21381 2026-05-21 cs.CV cs.LG

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

解耦生成与回归在可控图像恢复中的随机插值

Yi Liu, Jia Ma, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang

AI总结 本文提出DiSI框架,通过解耦随机插值过程中的生成与回归组件,实现从纯回归到全生成的连续可控过渡,提升图像恢复任务的效率和精度。

详情
Comments
44 pages, 16 figures, 16 tables
AI中文摘要

近年来,图像恢复(IR)的进步主要由生成方法如扩散模型和流匹配驱动,这些方法在合成逼真纹理方面表现出色,但存在推理慢和像素保真度差的问题。相比之下,传统基于回归的IR方法在这些方面表现更佳,提供单步高效性和高像素级重建保真度。为弥合这一差距,我们提出DiSI,一个统一框架,将随机插值过程解耦为独立的生成和回归组件。这种解耦使DiSI具有显著的通用性,能够连续且可控地从纯回归过程过渡到全生成过程。技术上,我们通过两种特定的采样轨迹实例化该框架,并辅以统一的采样器,实现高质量的少步推理。此外,我们设计了双分支U-Net风格变压器网络,在像素空间中使用专用分支增强条件引导,同时确保高吞吐量。大量实验表明,DiSI在各种IR任务中实现了高效且具有竞争力的结果,同时在单个模型中提供推理时的灵活性,以控制失真感知的权衡。

英文摘要

Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

闭环动态驾驶数据混合用于真实-合成协同训练

Hongzhi Ruan, Pei Liu, Weiliang Ma, Zhengning Li, Xueyang Zhang, Jun Ma, Dan Xu, Kun Zhan

AI总结 本文提出了一种闭环动态数据混合方法,通过动态优化过程调整训练数据混合比例,以提升模型性能,解决了在有限预算下优化数据混合的关键问题。

详情
AI中文摘要

数据扩展是现代深度学习的基础,随着自动驾驶转向端到端学习,其重要性日益增加。现实世界驾驶数据标注成本高且场景偏向性明显,使利用几乎无限的合成数据进行真实-合成协同训练成为有前景的方向。然而,简单地整合所有可用的合成数据效率低下且导致分布偏移,优化实际训练预算下的数据混合仍是一个关键但尚未充分研究的问题。因此,我们主张在场景类型和数量上为训练数据混合提供明确指导。特别是在本文中,我们将数据混合近似概念化为一个动态优化过程,通过闭环评估反馈迭代调整训练数据混合以最大化模型性能,并提出AutoScale,一种完全自动化的闭环数据引擎,统一了场景表示、数据混合优化与检索以及模型训练与评估。具体而言,我们提出了图正则化的自编码器(Graph-RAE)用于驾驶场景表示,引入了簇感知梯度上升(Cluster-GA)用于簇级重要性估计和重新加权,并执行簇引导的向量检索以选择高价值样本。在NavSim上的实验表明,AutoScale在有限预算下优于传统协同训练和跨域基线,实现了更好的性能。

英文摘要

Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing the data mixture under practical training budgets remains a critical yet under-explored problem. We therefore argue that the training data mixture requires explicit guidance on scene types and quantities. In this work, we approximately formulate the data mixture as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, and model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

2605.21371 2026-05-21 cs.CV

A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

一种用于南极 Landsat 7 ETM+ SLC-off 图像恢复的非参考扩散框架

Leyue Tang, Jonathan Louis Bamber, Gang Qiao, Yuanhang Kong

AI总结 本文提出 DiffGF 框架,通过非参考扩散方法恢复 Landsat 7 SLC-off 图像,无需外部参考数据,利用南极专用数据集 SLCANT 进行训练和评估,验证了其在恢复南极 SLC-off 图像方面的高保真度,并通过下游裂缝分割应用展示了其实际价值。

详情
Comments
Submitted to IEEE JSTARS
AI中文摘要

在南极获取可用光学图像本质上具有挑战性,由于极夜长和频繁的云覆盖。Landsat 提供了最长且最连续的光学观测,是南极研究最重要的遥感数据源之一。然而,2003 年扫描线校正器(SLC)故障导致 Landsat 7 ETM+ SLC-off 图像约有 22% 的像素缺失,严重限制了其可用性。与许多非极地环境不同,南极表面经历快速且显著的变化,这使得获取可靠的参考图像变得困难,减少了传统参考基填充方法的适用性。为了解决这一挑战,我们提出了 DiffGF,一种非参考扩散框架,用于在不需任何外部参考数据的情况下恢复 Landsat 7 SLC-off 图像。DiffGF 采用由潜在空间扩散过程和像素空间细化组成的两阶段设计。构建了一个专门的南极数据集 SLCANT 用于训练和评估。定量和定性结果表明,DiffGF 能够高保真地恢复南极 SLC-off 图像。其实际价值通过下游裂缝分割应用进一步检验。结果表明,DiffGF 为利用南极 Landsat 7 SLC-off 归档提供了有用的方法,使从历史记录中提取有价值信息成为可能,并支持相关的南极研究。

英文摘要

Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

2605.21369 2026-05-21 cs.CL

Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

第五届多语言指代消解共享任务成果:扩展长距离实体的数据集

Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

AI总结 本文总结了第五届多语言指代消解共享任务的成果,介绍了通过增加五个新数据集和两种新语言扩展数据集,以解决长距离实体的指代消解问题,并展示了传统系统与基于LLM的方法在任务中的表现。

详情
Comments
Accepted to CODI-CRAC 2026
AI中文摘要

本文描述了与CODI-CRAC 2026研讨会同期举行的第五届多语言指代消解共享任务。在之前的版本基础上,该任务要求参与者开发能够识别提及并基于身份进行指代聚类的系统。2026版特别强调长距离实体,即跨越多个单词和句子的指代链。该任务通过加入五个新数据集和两种新语言扩展了语言范围,这些数据集利用了版本1.4的CorefUD,这是一个包含19种语言27个数据集的和谐多语言集合。总共十个系统参与了该任务,包括四个基于LLM的方法(三个微调模型和一个少样本方法)。尽管传统系统仍保持领先,但LLM显示出显著潜力,表明它们可能在未来的版本中挑战现有方法。

英文摘要

This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may challenge established approaches in future editions.

2605.21363 2026-05-21 cs.CL

"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

我没有做出微决策:在协作中测量、诱导和暴露目标级AI贡献

Eunsu Kim, Jessica R. Mindel, Kyungjin Kim, Sherry Tongshuang Wu

AI总结 本文提出CoTrace框架,用于测量和暴露协作中目标级AI贡献,发现模型在目标塑造中贡献有限,但在引入具体要求和间接影响方面作用显著,且交互设计影响模型行为。

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地影响用户如何形成、细化和扩展目标,将贡献归因于人类-人工智能协作变得对用户校准自身依赖性和评估者评估AI辅助工作至关重要。然而,现有方法专注于最终成果,忽略了目标本身共同塑造的过程。我们引入了一个目标级归因框架CoTrace,将显式目标分解为可验证的需求,并追踪对话回合中直接贡献和间接影响。对638个真实世界协作日志应用CoTrace,发现尽管模型仅在目标塑造中贡献11-26%,但它们在引入较低层次的具体需求方面贡献显著,并产生各种间接贡献。通过受控模拟,我们展示了交互设计选择显著影响模型目标塑造行为。在一项用户研究中,向参与者暴露目标级分析使他们对贡献的感知在5分量表上几乎增加2分,揭示了用户在理解自身AI辅助工作时的系统性误校准。

英文摘要

As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more to introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

2605.21362 2026-05-21 cs.CL

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

LASH:面向大语言模型黑盒越狱的自适应语义混合

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

AI总结 本文提出LASH框架,通过适应性语义混合方法,利用多个基础攻击的输出作为可重用的种子提示,针对不同目标模型和有害类别进行自适应组合,从而在黑盒红队测试中取得更高的攻击成功率。

详情
AI中文摘要

越狱攻击暴露了对齐的大语言模型预期安全行为与对抗性提示下行为之间的持续差距。现有自动化方法日益有效,但每个方法都局限于单一攻击家族(例如,一个细化循环、一个树搜索、一个突变空间或一个策略库),并且没有单一家族主导:表现最好的方法会根据目标模型和有害类别而变化,这表明互补优势可以通过每个提示的组合来利用。我们介绍了LASH(LLM适应性语义混合),一个黑盒框架,将多个基础攻击的输出视为可重用的种子提示,并针对每个目标请求自适应地组合它们。给定一个种子池,LASH搜索种子子集和softmax归一化的混合权重;组合模块合成一个候选提示,而无导数遗传优化器通过黑盒目标反馈和一个两阶段适应度函数(结合基于关键词的拒绝检测与LLM判官评分)更新权重。在包含10个类别共100个有害提示的JailbreakBench上,我们评估了LASH在六个常见目标模型上的表现。LASH在基于关键词的评估中平均攻击成功率为84.5%,在两阶段评估中为74.5%,其中响应首先经过拒绝过滤,然后由LLM判官评分是否实质上履行了原始有害请求。LASH在两个指标上均优于五个最先进的基线方法,平均仅使用30次目标查询。LASH还在三种防御机制下保持竞争力,并诱导出更接近成功攻击的内部表示。这些结果表明,跨异构越狱策略的适应性组合是黑盒红队测试的一个有前途的方向。

英文摘要

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

2605.21352 2026-05-21 cs.LG cs.CE cs.ET

Classification of Single and Mixed Partial Discharges under Switching Voltage Using an AWA-CNN Framework

基于AWA-CNN框架的开关电压下单一与混合局部放电分类

Md Rafid Kaysar Shagor, Zannatul Ferdousy Mouri, Farhina Haque, Anindya Bijoy Das

AI总结 本文提出了一种基于AWA模式表示的CNN框架,用于在切换电压激励下对局部放电源进行分类,通过分析脉冲幅度、宽度和面积生成可视化模式,实现对六种不同放电源的高准确率分类。

详情
AI中文摘要

随着快速开关功率电子的应用增加,局部放电(PD)分析在切换电压激励下的重要性日益增加,但比在正弦条件下更具挑战性,因为活动集中在电压转换处。本文提出了一种幅度-宽度-面积(AWA)模式表示,用于在切换电压激励下进行源导向的局部放电分析。在所提出的方法中,时间域的局部放电脉冲通过脉冲幅度、宽度和面积进行表征,并映射到可视化模式中,其中幅度和面积定义坐标轴,宽度通过颜色编码。生成的AWA模式用于区分六种单个和混合的局部放电源条件:电晕、内部、表面、电晕+内部、电晕+表面和内部+表面。为了评估所提出表示的分类能力,比较了随机森林基线和两个卷积神经网络(CNN)模型,即InceptionV3和ResNet-18。AWA模式显示出可区分的源依赖分布,CNN基于分类在测试准确率上超过96%,而随机森林为73.33%。结果表明,AWA模式为在切换电压激励下多类局部放电源分类提供了合适的可视化表示。

英文摘要

The growing use of fast-switching power electronics has made partial discharge (PD) analysis under switching-voltage excitation increasingly important, yet more challenging than under sinusoidal conditions because PD activity is concentrated at voltage transitions. This work presents an Amplitude-Width-Area (AWA) pattern representation for source-oriented PD analysis under switching-voltage excitation. In the proposed method, time-domain PD pulses are characterized using pulse amplitude, width, and area, and mapped into a visual pattern where amplitude and area define the coordinate axes and width is encoded by color. The generated AWA patterns are used to distinguish six single and mixed PD source conditions: corona, internal, surface, corona+internal, corona+surface, and internal+surface. To evaluate the classification capability of the proposed representation, a Random Forest baseline and two Convolutional Neural Network (CNN) models, InceptionV3 and ResNet-18, are compared. The AWA patterns show distinguishable source-dependent distributions, and CNN-based classification achieves testing accuracy above 96%, compared with 73.33% for Random Forest. The results indicate that AWA patterns provide a visual representation of PD pulses suitable for multi-class PD source classification under switching-voltage excitation.
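
As a rough illustration of the three pulse descriptors named above, the following Python sketch extracts amplitude, width, and area from one sampled pulse. The abstract does not define how width or area are measured, so the choices here are assumptions: width is the time span over which the magnitude stays at or above a fraction of the peak (`threshold_frac`, a hypothetical parameter), and area is the trapezoidal integral of the magnitude. The function name and the sample pulse are also illustrative.

```python
def awa_features(samples, dt, threshold_frac=0.5):
    """Return (amplitude, width, area) for one sampled pulse.

    samples: sequence of pulse values at uniform spacing dt (must be non-empty).
    threshold_frac: assumed fraction of the peak used to measure width.
    """
    mags = [abs(s) for s in samples]
    amplitude = max(mags)

    # Width: span between the first and last sample at or above the threshold.
    level = threshold_frac * amplitude
    above = [i for i, m in enumerate(mags) if m >= level]
    width = (above[-1] - above[0]) * dt

    # Area: trapezoidal rule applied to the pulse magnitude.
    area = sum(0.5 * (mags[i] + mags[i + 1]) * dt for i in range(len(mags) - 1))
    return amplitude, width, area


# Illustrative triangular pulse, not measured data.
pulse = [0.0, 1.0, 2.0, 1.0, 0.0]
amp, width, area = awa_features(pulse, dt=1.0)
```

In the representation described in the abstract, each pulse would then become one point with amplitude and area as coordinates and width as its color.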

2605.21348 2026-05-21 cs.LG cs.AI cs.NA math.NA physics.comp-ph

Data-Efficient Neural Operator Training via Physics-Based Active Learning

通过物理引导的主动学习实现数据高效的神经算子训练

Alicja Polanska, Lorenzo Zanisi, Vignesh Gopakumar, Stanislas Pamela

AI总结 本文提出了一种基于物理的主动学习方法,用于提高神经算子训练的数据效率,通过利用偏微分方程残差来指导数据选择,在1D Burgers方程和2D可压缩纳维-斯托克斯方程的数值实验中验证了该方法在数据效率上的优越性。

详情
Comments
Presented at the ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations
AI中文摘要

使用神经算子求解偏微分方程显著降低了计算成本,但仍然受到高训练数据需求的限制。主动学习提供了一个自然的框架,通过迭代方式选择最有信息量的样本来缓解这一问题。我们引入了基于物理的获取方法,这是一种新的物理引导的主动学习算法,利用偏微分方程残差来指导数据选择。我们通过1D Burgers方程和2D可压缩纳维-斯托克斯方程的数值实验验证了该方法。我们显示,在我们的实验中,基于物理的获取方法在数据效率上始终优于随机获取,并且在数据效率上与当前最先进的方法相媲美。同时,它具有独特的优势,即在训练过程中注入物理归纳偏差,确保在模型物理理解最弱的地方花费模拟成本。

英文摘要

Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements. Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner. We introduce physics-based acquisition, a novel physics-informed active learning algorithm that leverages the partial differential equation residual to guide data selection. We validate the method by presenting numerical experiments for the 1D Burgers equation and the 2D compressible Navier-Stokes equations. We show that, in our experiments, physics-based acquisition consistently outperforms random acquisition and matches the state of the art in data efficiency. At the same time, it has the unique advantage of injecting a physics inductive bias into the training process, ensuring that simulation cost is spent where the model's physical understanding is weakest.
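
The idea of ranking candidate inputs by PDE residual can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' algorithm: it assumes the viscous 1D Burgers equation u_t + u u_x = nu u_xx, a simple finite-difference residual on a uniform grid, and a top-k selection rule. The function names and the dictionary of model predictions are hypothetical.

```python
def burgers_residual_score(u, dx, dt, nu):
    """Mean squared residual of u_t + u*u_x - nu*u_xx on interior grid points.

    u[n][i] is a predicted field at time step n and spatial index i.
    Requires at least 2 time steps and 3 spatial points.
    """
    total, count = 0.0, 0
    for n in range(len(u) - 1):
        for i in range(1, len(u[n]) - 1):
            u_t = (u[n + 1][i] - u[n][i]) / dt                      # forward difference in time
            u_x = (u[n][i + 1] - u[n][i - 1]) / (2 * dx)            # central difference in space
            u_xx = (u[n][i + 1] - 2 * u[n][i] + u[n][i - 1]) / dx ** 2
            r = u_t + u[n][i] * u_x - nu * u_xx
            total += r * r
            count += 1
    return total / count


def select_for_simulation(predictions, k, dx, dt, nu):
    """Pick the k candidates whose predictions violate the PDE the most."""
    scores = {name: burgers_residual_score(u, dx, dt, nu)
              for name, u in predictions.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy predictions: "steady" satisfies the PDE exactly, "jump" does not.
predictions = {
    "steady": [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    "jump": [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
}
chosen = select_for_simulation(predictions, k=1, dx=1.0, dt=1.0, nu=0.01)
```

The selected candidates would be sent to the numerical solver to produce new training data, so simulation effort goes to inputs where the surrogate is least physically consistent.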

2605.21343 2026-05-21 cs.CV

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

OcclusionFormer: 布局导向图像生成中的Z轴顺序安排

Ziye Li, Henghui Ding

AI总结 本文提出OcclusionFormer,一种基于Z轴顺序的扩散变换框架,通过解耦实例并利用体积渲染进行合成,以解决布局到图像模型中物体间遮挡问题,并通过查询对齐损失提升空间精度和语义一致性。

详情
Comments
ICML 2026, Project Page: https://henghuiding.com/OcclusionFormer/
AI中文摘要

最近的布局到图像模型在空间可控性方面取得了显著进展。然而,它们仍然在物体间遮挡方面存在困难。当边界框重叠时,大多数现有方法缺乏显式的遮挡信息,这使得交集区域的生成本质上具有歧义性,并阻碍了复杂遮挡关系的确定。为此,我们首先构建了SA-Z,一个包含显式遮挡顺序和像素级注释的大型数据集。基于我们提出的数据集,我们引入了OcclusionFormer,一种新的遮挡感知扩散变换框架,通过解耦实例并利用体积渲染进行合成,显式地建模Z轴优先级。此外,为了确保细粒度的空间精度,我们引入了查询对齐损失,显式监督单个实例并增强语义一致性。所提出的方法有效减少了重叠区域的歧义性,强制正确遮挡依赖关系,并保持了结构完整性,从而在多样化的场景中实现了显著的准确性提升。

英文摘要

Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
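
To make the notion of Z-order compositing concrete, here is a minimal Python sketch of front-to-back alpha compositing for a single pixel, the standard accumulation rule used in volume rendering. It is not the OcclusionFormer implementation: the tuple layout, the convention that a lower `z_order` is closer to the viewer, and the scalar color are assumptions chosen for illustration.

```python
def composite_pixel(layers):
    """Front-to-back compositing of per-instance contributions at one pixel.

    layers: list of (z_order, alpha, color) tuples; lower z_order is assumed
    to be closer to the viewer. Returns (color, remaining_transmittance).
    """
    color = 0.0
    transmittance = 1.0  # fraction of light not yet blocked by nearer layers
    for _, alpha, layer_color in sorted(layers, key=lambda layer: layer[0]):
        color += transmittance * alpha * layer_color
        transmittance *= (1.0 - alpha)
    return color, transmittance


# Two overlapping instances: a half-transparent front layer over an opaque back layer.
pixel, leftover = composite_pixel([(1, 1.0, 0.2), (0, 0.5, 1.0)])
```

Because each layer is weighted by the transmittance left over from nearer layers, swapping the Z-order of two overlapping instances changes the result, which is why an explicit ordering removes the ambiguity in intersection regions.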

2605.21341 2026-05-21 stat.ML cs.LG

Semiparametric Efficient Bilevel Gradient Estimation

半参数高效双层梯度估计

Fares El Khoury, Houssam Zenati, Nathan Kallus, Michael Arbel, Aurélien Bibaut

AI总结 本文提出一种半参数去偏理论,用于消除双层梯度估计中的一阶偏差,通过交叉拟合的正交超梯度估计器实现了渐近正态性,并在二次损失下简化为基于条件均值 nuisances 的双重鲁棒分数。

详情
AI中文摘要

函数型双层方法先估计下层函数,再将其代入超梯度,但当下层问题以非参数方式学习时,这种代入式(plug-in)梯度可能保留一阶偏差。为消除此偏差,我们基于有效影响函数,为总体双层梯度建立了半参数去偏理论。由此得到一个交叉拟合的正交超梯度估计器,我们证明了其渐近正态性,并给出了关于外层参数的一致控制。在二次损失下,该估计器简化为基于条件均值冗余(nuisance)函数的简单双重稳健得分。在真实值已知的合成双层基准上,该方法紧跟 oracle 有效梯度基准,并优于代入式函数超梯度和正则化核双层基线。

英文摘要

Functional bilevel methods estimate a lower-level function and plug it into a hypergradient, but this plug-in gradient can retain first-order bias when the lower-level problem is learned nonparametrically. To remove this bias, we develop a semiparametric debiasing theory for population bilevel gradients based on the efficient influence function. This perspective leads to a cross-fitted orthogonal hypergradient estimator for which we establish asymptotic normality together with uniform control over the outer parameter. Under quadratic losses, the estimator reduces to a simple doubly robust score based on conditional mean nuisances. On synthetic bilevel benchmarks with known ground truth, the method tracks the oracle efficient-gradient benchmark and improves over plug-in functional hypergradients and regularized kernel bilevel baselines.