arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13035 2026-06-12 cs.CV cs.AI New submission

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

Affiliations: Tsinghua University; D-INFK, ETH Zürich

AI summary: Proposes TetherCache, a training-free, plug-and-play cache management strategy that mitigates context drift in autoregressive video diffusion models through gated recall (GRAB) and trusted-alignment editing (TAME), enabling stable long video generation.

Comments: 17 pages, 8 figures

Abstract

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
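
The two mechanisms can be illustrated with a minimal sketch. The abstract does not give the gated score or the alignment formula, so everything here is an assumption: the score is taken to be a convex mix of relevance and distance to the nearest already-selected frame (weight `alpha`), and the alignment is taken to be a simple mean and standard-deviation match. The function names `select_memory_frames` and `align_statistics` are hypothetical.

```python
from statistics import fmean, pstdev


def select_memory_frames(relevance, budget, alpha=0.5):
    """Greedily pick `budget` frame indices, trading relevance against temporal spread.

    relevance: per-frame attention relevance scores (hypothetical input).
    alpha: assumed mixing weight; the paper's actual gate is not given in the abstract.
    """
    n = len(relevance)
    selected = []
    while len(selected) < min(budget, n):
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            # Temporal diversity: normalized distance to the nearest selected frame.
            if selected:
                diversity = min(abs(i - j) for j in selected) / max(n - 1, 1)
            else:
                diversity = 1.0
            score = alpha * relevance[i] + (1.0 - alpha) * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)


def align_statistics(recalled, trusted, eps=1e-6):
    """Shift and rescale recalled feature values to the trusted mean and std."""
    mu_r, sd_r = fmean(recalled), pstdev(recalled)
    mu_t, sd_t = fmean(trusted), pstdev(trusted)
    return [(x - mu_r) / (sd_r + eps) * sd_t + mu_t for x in recalled]
```

With `alpha=1.0` the selection reduces to plain top-k by relevance, which makes it easy to see what the diversity term changes: lower values of `alpha` push the picks apart in time.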

2606.13033 2026-06-12 cs.CV New submission

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

Alexander Holmberg

Affiliations: KTH Royal Institute of Technology

AI summary: Proposes a selective mask propagation algorithm that invokes a video object segmentation model only on frames with high uncertainty while relying mainly on a lightweight base tracker. It improves performance on DanceTrack and SportsMOT, reaching 86.8 HOTA on SportsMOT.

Abstract

Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
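
The dispatch-and-override rule described above can be sketched as follows. This is a simplification under stated assumptions: it works per frame rather than per window, and the thresholds `trigger` and `confident`, the callable `vos_model`, and the function name `selective_override` are all hypothetical.

```python
def selective_override(base_ids, uncertainty, vos_model, trigger=0.5, confident=0.8):
    """Keep the base tracker's identities unless a confident VOS prediction disagrees.

    base_ids: identity assigned by the base tracker at each frame.
    uncertainty: assignment-uncertainty signal per frame (hypothetical scale 0..1).
    vos_model: callable frame_index -> (identity, confidence), treated as a black box.
    """
    output = list(base_ids)
    vos_calls = 0
    for t, (base_id, u) in enumerate(zip(base_ids, uncertainty)):
        if u < trigger:
            continue  # easy frame: the expensive model is never called
        vos_calls += 1
        vos_id, confidence = vos_model(t)
        # Override only on a confident contradiction; weak predictions keep the base output.
        if confidence >= confident and vos_id != base_id:
            output[t] = vos_id
    return output, vos_calls
```

Returning the number of VOS calls alongside the identities makes the compute saving visible: on a sequence where few frames trigger, the count stays far below the sequence length.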

2606.13032 2026-06-12 cs.CV New submission

GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

Rui Tang, Guankun Wang, Long Bai, Haochen Yin, Huxin Gao, Jiewen Lai, Jiazheng Wang, Hongliang Ren

Affiliations: Department of Electronic Engineering, The Chinese University of Hong Kong; Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co. Ltd.

AI summary: Proposes GeoCFNet, which addresses dissection guidance in dynamic endoscopic scenes through geometry-aware confidence field estimation, integrating Token-Differentiated Fusion and Geometry-Aware Spatial Regularization for accurate and stable confidence field prediction.

Comments: IEEE ICIA 2026

Abstract

Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.

2606.13030 2026-06-12 cs.CV New submission

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology (HFUT); School of Computer Science, University of Auckland (UOA)

AI summary: Targets the low signal-to-noise ratio, long-tailed distribution, and cross-subject domain shift in micro-gesture recognition with a multi-modal framework that combines saliency-guided extraction, square-root smoothed weighting, an Orthogonal Semantic Embedding Loss, and a cross-modal pseudo-labeling strategy, reaching an F1-score of 68.13%.

Comments: 14 pages, 2 figures

Abstract

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing 4th place.
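
Two of the ingredients named above, square-root smoothed class weighting and temperature-scaled soft voting, have common textbook forms that can be sketched briefly. The abstract does not give the exact formulas, so this assumes inverse square-root frequency weights normalized to mean 1 and a plain average of temperature-softened softmax outputs. All function names are hypothetical.

```python
import math


def sqrt_class_weights(counts):
    """Inverse square-root frequency weights, normalized so their mean is 1."""
    raw = [1.0 / math.sqrt(c) for c in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_model_logits, temperature=2.0):
    """Average the temperature-scaled class probabilities of several models."""
    probs = [softmax(logits, temperature) for logits in per_model_logits]
    num_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(num_classes)]
```

For class counts of 100 and 4, plain inverse-frequency weighting gives a 25:1 ratio between tail and head weights, while the square-root form gives 5:1, which is the "gentle" behaviour the weighting is meant to have.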

2606.13028 2026-06-12 cs.RO cs.CV New submission

Comparing Commercial Depth Sensor Accuracy for Medical Applications

Pit Henrich, Maximilian Weiherer, Franziska Hansen, Bernhard Egger, Franziska Mathis-Ullrich

AI summary: Using stylus-sampled references on a porcine bone, a porcine belly, and a silicone kidney phantom, this paper compares the accuracy of four stereo, structured-light, and time-of-flight depth sensors at a distance of 50 cm and finds that the Zivid 2M+ 60 performs best across all objects and metrics.

Comments: 4 pages

Abstract

Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

2606.13026 2026-06-12 cs.CY cs.AI New submission

Democracy in the Era of Artificial Intelligence

Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing

AI summary: Explores how artificial intelligence can be used to upgrade democratic institutions and strengthen collective intelligence, deliberative democracy, and self-governance systems while addressing risks such as privacy intrusion, bias, and misinformation.

Abstract

Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.

2606.13024 2026-06-12 cs.LG cs.AI New submission

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

Affiliations: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University

AI summary: Proposes CausalMoE, a billion-scale multimodal Granger causal foundation model that decouples dynamic regimes with a pattern-routed mixture of heterogeneous experts and combines causality-aware self-attention with LLM/VLM priors to recover sparse causal graphs, achieving state-of-the-art results in supervised and few-shot settings.

Abstract

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.
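
The abstract says sparse Granger causal graphs are obtained "via proximal optimization" without naming the penalty. A minimal sketch of the most common instance, assuming an L1 penalty whose proximal operator is soft-thresholding, looks like this. The convention that `weights[i][j]` measures the influence of variable `j` on variable `i`, and all function names, are assumptions made for illustration.

```python
def soft_threshold(value, lam):
    """Proximal operator of lam * |value|: shrink toward zero, clamp small values to 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def proximal_step(weights, grads, lr, lam):
    """One proximal gradient step: a gradient update followed by soft-thresholding."""
    return [
        [soft_threshold(w - lr * g, lr * lam) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


def causal_edges(weights):
    """Read off edges (cause j, effect i) from the entries that are exactly nonzero."""
    return [
        (j, i)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if w != 0.0
    ]
```

The point of the proximal step, as opposed to adding an L1 term to the loss and using plain gradient descent, is that small entries become exactly zero, so the graph can be read directly from the weight matrix without a post-hoc threshold.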

2606.13022 2026-06-12 cs.CV cs.LG New submission

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

Affiliations: Durham University; Tsinghua University; Beihang University; Zhongguancun Laboratory

AI summary: Adversarial attacks on skeleton-based action recognition often introduce noise-like perturbations that degrade motion quality. This paper proposes a distribution-based attack that preserves motion quality by minimizing the gap between empirical and true risk, designs a new metric for evaluating naturalness, and reports experiments in which the method outperforms existing approaches in both attack success rate and motion quality.

Abstract

Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

2606.13020 2026-06-12 cs.AI New submission

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

Pierre Beckmann, Marco Valentino, Andre Freitas

AI summary: Proposes the SciR benchmark, which generates verifiable multi-paradigm scientific reasoning tasks from formal objects and controls two dimensions, information extraction difficulty and inference difficulty, revealing weaknesses of LLMs in scientific reasoning.

Abstract

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

2606.13016 2026-06-12 cs.AI New submission

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Qianhui Liu, Weng-Fai Wong

Affiliations: National University of Singapore; Westlake University; Shandong University; Zhejiang University; Agency for Science, Technology and Research

AI summary: Proposes Otters++, which uses the natural signal decay of optoelectronic devices to implement TTFS computation and, through layer-wise equivalence and a hybrid training method, reaches an average score of 84.17% on GLUE with lower energy consumption.

Abstract

Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue with Otters++, which turns a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS. Specifically, we use the measured decay of a custom In$_2$O$_3$ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE dataset, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.
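
The "straight-through gradients" mentioned above refer to a standard trick for training through a non-differentiable quantizer. The sketch below shows the generic clipped variant only. It is not the Otters++ training code, and the value range [0, 1], the number of levels, and the function names are assumptions.

```python
def quantize(x, levels=4):
    """Forward pass: clip to [0, 1], then round to the nearest of `levels` uniform steps."""
    clipped = min(max(x, 0.0), 1.0)
    step = 1.0 / (levels - 1)
    return round(clipped / step) * step


def ste_backward(x, upstream_grad):
    """Backward pass: treat rounding as the identity, and block gradients where x was clipped."""
    return upstream_grad if 0.0 <= x <= 1.0 else 0.0
```

The forward value is discrete, yet the backward pass hands the upstream gradient through unchanged inside the clipping range, which is what lets a discrete forward computation be trained with ordinary backpropagation.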

2606.13007 2026-06-12 cs.LG cs.AI New submission

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

Affiliations: Computer Network Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; School of Computing and Information Technology, Great Bay University; School of Engineering, Westlake University

AI summary: Proposes the scLLM-DSC framework, which uses cross-modal contrastive alignment between a knowledge-driven semantic view and a structure-aware topological view so that LLM knowledge improves clustering of single-cell RNA sequencing data, significantly outperforming existing methods.

Abstract

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.
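
A cross-modal contrastive alignment of this kind is often implemented with an InfoNCE-style loss, in which the two views of the same cell form the positive pair and the other cells in the batch act as negatives. The abstract does not state the exact loss, so the sketch below is an assumption: it uses cosine similarity, a hypothetical temperature of 0.1, and only the semantic-to-topological direction.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(semantic, topological, temperature=0.1):
    """Mean InfoNCE loss; row i of each view is assumed to describe the same cell."""
    a = [_normalize(v) for v in semantic]
    b = [_normalize(v) for v in topological]
    n = len(a)
    loss = 0.0
    for i in range(n):
        logits = [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]  # negative log-probability of the true pair
    return loss / n
```

A symmetric version would average this with the same loss computed from the topological view to the semantic view.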

2606.13006 2026-06-12 cs.SD New submission

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

Yihang Lin, Li Zhou, Congwei Cao, Dongchu Xie, Xiaoxue Gao, Chen Zhang, Haizhou Li

Affiliations: The Chinese University of Hong Kong, Shenzhen; Agency for Science, Technology and Research; National University of Singapore; Shenzhen Research Institute of Big Data; Shenzhen Loop Area Institute

AI summary: Proposes the Emo-LiPO framework, which models emotion intensity control as a learning-to-rank problem and uses listwise preference optimization to align speech with the emotion intensity expressed in text, yielding more faithful and continuous emotional expression and significantly improving emotion accuracy and intensity controllability on the ESD-plus dataset.

Comments: Accepted by IJCAI 2026. Emotional TTS, Preference Optimization, Emotion Intensity Control

Abstract

Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
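
To make the difference between a listwise and a pairwise (DPO-style) objective concrete, here is a minimal sketch of one standard listwise loss, the ListMLE (Plackett-Luce) negative log-likelihood. It is used here as a stand-in: the abstract does not say which listwise objective Emo-LiPO optimizes, and the function name and the idea of one scalar score per utterance are assumptions.

```python
import math


def listmle_loss(scores_in_target_order):
    """Negative Plackett-Luce log-likelihood of the target ordering.

    scores_in_target_order[0] is the model score of the item that should rank first,
    e.g. the utterance with the highest intended emotion intensity.
    """
    loss = 0.0
    for k in range(len(scores_in_target_order)):
        tail = scores_in_target_order[k:]
        peak = max(tail)
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in tail))
        loss += log_sum - scores_in_target_order[k]
    return loss
```

The loss scores the whole ordering at once, so a model is penalized for any misplaced intensity level in the list, not only for a single preferred-versus-rejected pair.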

2606.13003 2026-06-12 cs.AI cs.CL cs.MA New submission

The Illusion of Multi-Agent Advantage

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

Affiliations: Salesforce Research; HKUST (Guangzhou); University of British Columbia; Nanyang Technological University

AI summary: Through systematic evaluation, finds that automatically generated multi-agent systems fall short of single-agent baselines such as Chain-of-Thought with Self-Consistency in both performance and cost-efficiency, exposing flaws in existing evaluation frameworks and architectural bloat.

Abstract

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
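
The single-agent baseline used in this comparison, Chain-of-Thought with Self-Consistency, amounts to sampling several reasoning chains and taking a majority vote over their final answers. A minimal sketch follows. The callable `sample_answer` stands in for an LLM call that returns one final answer per sampled chain and is hypothetical, as is the default of five samples.

```python
from collections import Counter


def self_consistency(sample_answer, question, num_samples=5):
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / num_samples
```

The cost of this baseline is easy to account for, since it is exactly `num_samples` model calls per question, which is what makes it a useful reference point when a multi-agent pipeline costs many times more.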

2606.13001 2026-06-12 cs.IR cs.MM New submission

CFALR: Collaborative Filtering-Augmented Large Language Model for Personalized Fashion Outfit Recommendation

Yujuan Ding, Junrong Liao, Yunshan Ma, Yi Bin, Wenqi Fan, Tat-Seng Chua, Qing Li

AI summary: Proposes the CFALR framework, which combines collaborative filtering with large language models by describing user-outfit interactions in natural language and using CF-enhanced embeddings to address data sparsity and the combinatorial space in personalized outfit recommendation, outperforming traditional methods on the Polyvore and IQON datasets.

Abstract

Personalized outfit recommendation poses a significant challenge in e-commerce and social media platforms, requiring systems that balance user preferences with aesthetic compatibility. Collaborative filtering (CF) provides a traditional solution for this, but it struggles with data-sparse scenarios and complex user-item-outfit relationships. Meanwhile, existing template-based approaches are constrained by rigid pre-designed structures. To bridge these research gaps, we introduce CFALR (Collaborative Filtering-Augmented Large Language Model for Recommendation), a novel framework that synergizes collaborative filtering with large language models for personalized outfit recommendation. Specifically, CFALR describes user-outfit interactions in natural language and leverages LLMs to capture fashion semantics while employing CF-enhanced embeddings to bridge the semantic space and the collaborative interaction spaces. Our technical contributions include: (1) the first LLM-based architecture specifically designed for personalized outfit recommendation, (2) a CF-augmented generative mechanism that efficiently navigates the extensive combination space of outfit items, and (3) trainable projection layers that optimally integrate relational and content features. Experiments on Polyvore and IQON benchmarks demonstrate CFALR's superior performance over both traditional CF-based and LLM-based methods in personalized fill-in-the-blank and personalized outfit generation tasks.

2606.13000 2026-06-12 cs.CR New submission

SoK: The Constant Time Model

Billy Bob Brumley

AI summary: Systematizes constant time models and their evolution, reveals the gap between what models protect and what specifications assume, and proposes an offensive methodology for discovering timing vulnerabilities outside the cryptographic primitive boundary, confirming a private key loading leak in OpenSSL and BoringSSL.

Comments: WOOT 2026

Abstract

Constant time programming patterns are the primary defense against timing attacks on cryptographic implementations, yet what "constant time" means varies across academia and industry. This work systematizes constant time models and their evolution, identifies a recurring gap between what models protect and what specifications assume, and distills an offensive methodology for discovering timing vulnerabilities that originate outside the cryptographic primitive boundary. Applying this methodology, we locate a specification-level vulnerability related to private key loading, and confirm the leak in both OpenSSL and BoringSSL. Counterintuitively, BoringSSL's per-observation signal is several orders of magnitude stronger than OpenSSL's, despite an explicitly stricter threat model.
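As a reader's aid, here is a minimal sketch of the kind of constant-time pattern the abstract refers to, contrasted with an early-exit comparison. It is not taken from the paper, whose reported leak concerns private key loading outside the primitive boundary. The helper names `leaky_equal` and `ct_equal` are hypothetical, and pure Python gives no real timing guarantees, so the code only illustrates the control-flow idea. `hmac.compare_digest` is the standard-library routine intended for this purpose.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time depends on where the first mismatch is."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner when the mismatch is early
    return True


def ct_equal(a: bytes, b: bytes) -> bool:
    """Branch-free accumulation: every byte is inspected regardless of content."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # any differing bit sets a bit in diff
    return diff == 0


if __name__ == "__main__":
    tag = b"correct-tag-0123"
    print(ct_equal(tag, b"correct-tag-0123"))
    # Standard-library comparison designed to resist timing analysis.
    print(hmac.compare_digest(tag, b"wrong---tag-0123"))
```

Both functions return the same answers; they differ only in whether the amount of work depends on the secret data.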

2606.12997 2026-06-12 cs.LG stat.ML 新提交

Reliability of Probabilistic Emulation of Physical Systems

物理系统概率仿真的可靠性

Sam F. Greenbury (1), Radka Jersakova (1), Paolo Conti (1 and 2), Marjan Famili (1 and 3), Christopher Iliffe Sprague (1 and 4), Edwin Brown (1 and 5), Jason D. McEwen (1 and 6) ((1) The Alan Turing Institute, (2) Autodesk Research, (3) PhysicsX, (4) Orbital, (5) University of Sheffield, (6) University College London)

发表机构 * The Alan Turing Institute(艾伦·图灵研究所) Autodesk Research(欧特克研究院) PhysicsX Orbital University of Sheffield(谢菲尔德大学) University College London(伦敦大学学院)

AI总结 比较生成模型与CRPS训练集成在物理系统概率仿真中的可靠性,发现CRPS集成在覆盖率和推理速度上更优。

详情
AI中文摘要

目前,生成物理系统概率预测的两种主要方法已经出现:生成模型(如扩散或流匹配)以及注入随机性的确定性模型集成(使用连续排序概率评分(CRPS)损失训练)。虽然这两种方法都表现出强大的预测准确性,但其不确定性的可靠性尚未得到系统评估。我们通过开发一个框架来填补这一空白,该框架在匹配模型大小和计算预算的情况下,评估这两种方法在多种二维时空物理系统中的表现。我们通过检查预测区间的经验覆盖率来评估概率仿真的可靠性,同时考虑准确性和计算效率指标。CRPS训练的集成在单步预测和自回归展开中通常能实现更可靠的不确定性,显示出比在潜在空间中训练生成模型的标准替代方案更好的覆盖率。此外,CRPS方法提供了显著更快的推理速度。当生成模型在环境空间而非压缩潜在空间中训练时(这在高维问题中通常不可行),它们表现出与CRPS训练集成相当的覆盖率,但推理延迟显著更大。相比之下,当CRPS训练的集成在潜在空间中训练时,其覆盖率相对于环境空间没有明显下降。生成模型和CRPS训练的集成都表现出良好的预测准确性。为促进未来的研究和应用,我们发布了AutoCast,一个实现生成模型和CRPS训练集成的模块化框架,以及AutoSim,一个用于快速原型的灵活数据集生成包。

英文摘要

Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient space rather than in a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
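To make the two quantities in this abstract concrete, the sketch below computes a sample-based CRPS for a scalar ensemble and the empirical coverage of a central predictive interval. It assumes scalar targets and the common estimator E|X - y| - 0.5 E|X - X'|. The function names are hypothetical and are not the AutoCast API, which the abstract does not describe.

```python
def ensemble_crps(members, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    m = len(members)
    accuracy = sum(abs(x - y) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def empirical_coverage(ensembles, observations, level=0.9):
    """Fraction of observations inside each ensemble's central `level` interval."""
    tail = (1 - level) / 2
    hits = 0
    for members, y in zip(ensembles, observations):
        s = sorted(members)
        if quantile(s, tail) <= y <= quantile(s, 1 - tail):
            hits += 1
    return hits / len(observations)


if __name__ == "__main__":
    print(ensemble_crps([0.0, 2.0], 1.0))
    print(empirical_coverage([[0, 1, 2, 3, 4]] * 2, [2, 10], level=0.5))
```

For a reliable forecaster, the empirical coverage should be close to the nominal `level`. Values well below it indicate overconfident intervals.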

2606.12995 2026-06-12 cs.RO 新提交

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

GenHOI: 通过模仿生成视频实现接触感知的人形机器人-物体交互,无需任务特定训练

Zhihai Bi, Qiang Zhang, Guoyang Zhao, Jiahang Cao, Xueyin Luo, Yushan Zhang, Jinglan Xu, Ruoyu Geng, Yulin Li, Andrew F. Luo, Jun Ma

发表机构 * The University of Tokyo(东京大学) National University of Singapore(新加坡国立大学) University of California, Los Angeles(加州大学洛杉矶分校) Tsinghua University(清华大学)

AI总结 提出GenHOI框架,通过模仿单个生成视频实现人形机器人零样本执行多种物体交互任务,无需任务特定训练或物理演示数据,利用接触事件和手-物接触区域编码为几何约束优化轨迹。

详情
AI中文摘要

人形机器人-物体交互(HOI)是人形机器人的基本能力,但由于动态平衡和与多样物体的稳定交互之间紧密耦合,它仍然具有挑战性。现有方法通常需要耗时的任务特定策略训练或依赖于刚性轨迹回放,这限制了它们适应新颖交互场景的能力。在这项工作中,我们提出了GenHOI,一个简单而有效的框架,通过直接模仿单个生成视频,使人形机器人能够以零样本方式执行多样化的物体交互任务,无需任务特定训练或物理演示数据。GenHOI首先在仿真中重建机器人-物体场景并渲染第一帧图像,该图像与语言命令一起条件化任务导向交互视频的合成。然后分析生成的视频以识别交互相关的接触事件并估计手-物体接触区域,这些被编码为以物体为中心的几何约束,将视觉交互线索转化为物理基础的优化先验。在这些先验的指导下,从视频中恢复的参考运动被细化和平滑,以解决2D视频生成中固有的尺度模糊性,同时将单个参考轨迹适应于未见过的机器人-物体相对姿态。优化后的轨迹最终由闭环跟踪控制器执行。我们在包括箱子抓取、非对称双臂椅子搬运、从下方抬桌子和圆柱物体包裹在内的多样化物体交互任务中,通过大量仿真和真实世界实验验证了所提出的框架。

英文摘要

Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present GenHOI, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.

2606.12994 2026-06-12 cs.LG cs.CE 新提交

DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation

DeepJEB++: 基于基础模型驱动的二维潜空间增强的大规模三维工程数据集

Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee, Sunwoong Yang, Hyogu Jeong, Namwoo Kang

发表机构 * Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology(韩国科学技术院赵春植移动研究生院) Department of Mechanical Engineering, Hanyang University(汉阳大学机械工程系) Narnia Labs(纳尼亚实验室)

AI总结 提出DeepJEB++框架,通过二维潜空间增强和基础模型,将少量喷气发动机支架种子设计扩展为大规模带仿真标签的三维数据集,实现40倍扩展。

详情
Comments
16 pages, 14 figures. Submitted to ASME Journal of Mechanical Design
AI中文摘要

数据驱动的工程设计受到缺乏大规模三维数据集的限制,这些数据集需要将几何形状与基于物理的性能标签配对。特别是,现有的三维数据增强技术在保留微妙且多样的几何变化方面存在局限性,并且自动化后续的仿真标注过程仍然困难,因为边界条件取决于生成的几何形状。我们提出了DeepJEB++,一个基础模型驱动的数据增强框架,在资源受限的情况下将少量喷气发动机支架种子设计扩展为大规模、带仿真标签的三维数据集。我们的关键思想是在数据丰富的二维潜空间中进行增强,然后转移到三维。在第一阶段,我们在多视图渲染上微调预训练的二维潜扩散模型,并通过潜插值合成新视图,通过视觉语言模型(VLM)质量过滤器保留可制造的设计。在第二阶段,经过验证的图像通过领域适应的生成基础模型提升为三维网格。在第三阶段,一个自动化流水线识别每个网格上的载荷和螺栓接口,并分配有限元标签——质量、应力和位移——无需人工干预。我们沿着三个内在轴评估增强质量:可制造性、相对于SimJEB真实值的标签保真度以及分布一致性。从少于400个种子设计开始,DeepJEB++在每阶段使用单个GPU的情况下,生成了15,360个带仿真标签的三维支架——实现了40倍的扩展。该数据集将公开提供,以支持可复现的工程AI研究。

英文摘要

Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

2606.12993 2026-06-12 cs.IR 新提交

Charge as a Construct-Validity Factor in Chinese Legal Case Retrieval: A Cross-Benchmark Audit

罪名作为中国法律案例检索中的构念效度因素:跨基准审计

Yao Liu, Tien-Ping Tan, Zhilan Liu

AI总结 研究发现中国法律案例检索基准中,仅按主要罪名排序即可恢复BM25与最佳模型间99.2%的性能差距,揭示罪名是基准设计中的高杠杆构念效度因素,而非模型依赖。

详情
AI中文摘要

中国法律案例检索(LCR)基准将法律定性匹配查询的参考判决评为相关,强系统现达到NDCG@10为0.85-0.88。BM25到最佳训练模型之间的大部分差距无需检索模型即可恢复:仅按共享主要罪名对候选排序,平局时以BM25分数决胜,在LeCaRDv2上关闭了99.2%的差距——与最佳训练系统无显著差异。这反映了基准设计:LeCaRDv2通过犯罪的关键构成要件定义顶级相关性,这些要件编码了罪名,因此同罪名案例在构念上相关(相关性提升4.49;罪名到相关性的宏AUC为0.871)。固定罪名后,训练的重排序器相对于BM25的优势缩小为小的罪名内残差(+0.026 NDCG@10,聚类自助法置信区间排除零,约四分之一),这是唯一的非定义性正向效应。该效应并不均匀:同一规则在LeCaRDv1上恢复84.3%,在CAIL2022上不符合规范,罪名到相关性的信号逐步减弱(宏AUC 0.871/0.759/0.728);预测罪名级联在LeCaRDv2上重现76.6%但不可迁移。该构念在第一阶段也可兑现:一个探索性的零训练罪名池通道提升了LeCaRDv2召回率(R@100 +0.025,错误罪名控制有损),作为混淆因素的正向对照报告,而非检索方法或新颖性声明。因此,罪名是基准层面的高杠杆构念效度因素——不是NDCG@10的统一解释,也不表明任何系统依赖罪名。我们将已建立的构念效度和部分输入检查打包为可重用的罪名控制协议(CCE);在所有三个基准上,其触发返回为空或描述性,行为符合设计。我们发布脚本、模式及协议,以便未来基准在将其NDCG@10解读为法律推理能力之前进行筛查。

英文摘要

Chinese Legal Case Retrieval (LCR) benchmarks grade a reference judgment relevant when its legal characterization matches the query, and strong systems now reach NDCG@10 of 0.85-0.88. Most of the BM25-to-best-trained gap is recoverable with no retrieval model: ranking candidates only by shared primary charge, with ties broken by BM25, closes 99.2% of it on LeCaRDv2 -- with no detectable difference from the best-trained system. This reflects benchmark design: LeCaRDv2 defines top relevance via the crime's key constitutive elements, which encode the charge, so same-charge cases are relevant by construction (relevance lift 4.49; charge-to-relevance macro-AUC 0.871). Holding charge fixed, the trained reranker's advantage over BM25 collapses to a small within-charge residual (+0.026 NDCG@10, cluster-bootstrap CI excluding zero, about a quarter), the only non-definitional positive. The effect is not uniform: the same rule recovers 84.3% on LeCaRDv1 and is out of spec on CAIL2022, with the charge-to-relevance signal weakening in step (macro-AUC 0.871/0.759/0.728); a predicted-charge cascade reproduces 76.6% on LeCaRDv2 but does not transfer. The construct is also cashable at first stage: an exploratory zero-training charge-pool channel lifts LeCaRDv2 recall (R@100 +0.025, wrong-charge controls hurt), reported as a positive control for the confound, not a retrieval method or novelty claim. Charge is thus a high-leverage construct-validity factor at the benchmark level -- not a uniform explanation of NDCG@10, and not evidence that any system relies on charge. We package established construct-validity and partial-input checks as a reusable charge-controlled protocol (CCE); on all three benchmarks its triggers come back null or descriptive, behaving as designed. We release the scripts, schema, and protocol so future benchmarks can be screened before their NDCG@10 is read as legal-reasoning ability.
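The charge-first rule and the "gap closed" figure in this abstract can be sketched as follows. This is an illustrative reading, not the authors' released script: the candidate fields (`id`, `charge`, `bm25`) are hypothetical, and the NDCG variant with gain 2^rel - 1 is one common choice that the benchmark may not use.

```python
import math


def charge_first_rank(query_charge, candidates):
    """Same-charge candidates first; BM25 score breaks ties within each group."""
    return sorted(
        candidates,
        key=lambda c: (c["charge"] != query_charge, -c["bm25"]),
    )


def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k with exponential gain, computed over the given candidate list."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0


def gap_closed(baseline, method, best):
    """Share of the baseline-to-best gap that `method` recovers."""
    return (method - baseline) / (best - baseline)


if __name__ == "__main__":
    pool = [
        {"id": "a", "charge": "theft", "bm25": 1.0},
        {"id": "b", "charge": "fraud", "bm25": 9.0},
        {"id": "c", "charge": "theft", "bm25": 3.0},
    ]
    print([c["id"] for c in charge_first_rank("theft", pool)])
```

Under this reading, a reported 99.2% means `gap_closed(bm25_ndcg, charge_rule_ndcg, best_model_ndcg)` is about 0.992 on LeCaRDv2.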

2606.12991 2026-06-12 cs.AI 新提交

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

APCyc:通过自动环化实现环肽的性质导向设计

Yifan Zhao, Lang Qin, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) AI-Peptide Drug Design Joint Laboratory(AI-多肽药物设计联合实验室)

AI总结 提出APCyc框架,通过扩展残基词汇和显式编码环化位点与连接类型,结合贝叶斯后验引导,实现目标感知的环肽从头设计并联合优化多种理化性质。

详情
Comments
Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

环肽是现代药物发现中一类有前景的治疗化合物,通常具有更好的稳定性和结合亲和力。然而,环肽的从头设计仍然具有挑战性,因为方法必须识别口袋适应的环化模式和连接位点,同时控制药物相关性质。这一挑战对于主要在线性肽数据上训练的生成模型尤为突出,这些模型可能无法捕捉环化特异性约束。为解决这一局限性,我们引入了APCyc,一个目标感知的从头环肽生成框架,该框架显式建模环化并联合优化多种基本理化性质。通过使用扩展的残基词汇表并显式编码环化位点和连接类型信息,APCyc学习环化感知表示,并利用贝叶斯后验引导将采样导向满足多个性质目标的环肽。实验结果表明,我们的模型学习了目标依赖的环化偏好,并实现了环肽设计的有效且可控的多性质优化。本文源代码可在此 https URL 获取。

英文摘要

Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address the limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences, and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at this https URL.

2606.12990 2026-06-12 cs.LG 新提交

Exposure Bias as Epistemic Underidentification in Recursive Forecasting

递归预测中的曝光偏差作为认知欠识别问题

Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho

发表机构 * University of Bristol(布里斯托大学)

AI总结 本文证明递归多步预测中的曝光偏差不仅是分布偏移,更是部分可观测性下的认知欠识别问题,并提出基于来源变量的误差分解与校正方法。

详情
Comments
Accepted for ICML 2026 EIML workshop
AI中文摘要

递归多步预测通常被表述为分布偏移:模型在观测历史数据上训练,但部署于自身预测结果上。我们通过证明在部分可观测性或状态截断下,递归展开也是一个认知欠识别问题,表明这种表述是不完整的。即使具有确定性潜在动力学,一步贝叶斯监督仅在观测上下文中识别行为,一旦展开查询自生成诱导状态(其正确的局部目标不能仅由数值状态确定),则无需识别部署的递归预测器。我们通过诱导状态 $Z$ 和来源变量 $P$ 形式化这一点,并推导出诱导状态误差分解为教师强制/展开不匹配、表示-类别逼近和来源信息差距。实验表明,展开进入一个不同的诱导状态区域,固定诱导状态定义了一个不同的局部校正任务,闭环增益不仅来自局部适应,还来自改变展开期间访问的诱导状态。使用简单的二进制来源编码,来源感知校正可以进一步提高性能,尽管增益是有条件的而非均匀的。这些结果将曝光偏差重新定义为自诱导认知不确定性下的推理。

英文摘要

Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states $Z$ and provenance variables $P$, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation--class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.

2606.12988 2026-06-12 cs.CV cs.AI 新提交

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

一种用于实时个性化人体工学姿态分析的机器学习框架

Manex Atxa, Bruno Simoes, Julen Balzategui

AI总结 提出利用三维体积视频数据实时预测人体工学/非工学姿态的方法,结合3D点云多角度分析与个性化深度学习分类器,克服固定视角遮挡问题,实现实时评估。

详情
Comments
13 pages, 7 figures, conference 24CMH
AI中文摘要

本文介绍了一种利用三维体积视频数据实时预测人体工学和非工学姿态的新方法。尽管该方法是为人体工学评估设计的,但它可以适应其他需要实时分析人体姿态的应用。该系统的一个突出特点是能够在评估过程中分析3D点云,从而实现多角度计算。这克服了相机通常提供固定视角的关键限制,从而限制了全面姿态评估可用的数据,尤其是在发生遮挡时。系统持续自动地对实时流数据使用选定的视角进行姿态推断;然而,只有用户手动选择和标记的姿态用于训练个性化深度学习分类器。该方法通过一个案例研究进行了优化,其中RGB-D相机捕捉了执行负重任务的受试者,实现了实时骨骼标记。模型在此数据上训练,并在训练阶段后对新流数据实时进行推断。本研究通过结合最先进的3D数据技术和传统的2D姿态估计算法,为实时人体工学评估提供了一种可扩展且实用的方法。它解决了工作场所环境中日益增长的安全与健康监测需求,标志着对该领域的显著贡献。

英文摘要

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 新提交

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University(斯坦福大学)

AI总结 提出紧凑潜世界模型,结合扩散Transformer(DiT)预测未来场景,在nuScenes上实现4.8倍更好的KID,并实现动作可控性(转向ρ=0.81)。

详情
Comments
10 pages, 9 figures, 2 tables
AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景,从而无需真实世界部署即可进行规划和仿真,但在紧凑、可训练的规模下,未来具有模糊性,且该领域的标准失真度量具有误导性:它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题,该模型给定当前前摄像头潜变量和一系列自我动作,预测未来场景潜变量,由冻结解码器渲染为$256 \times 256$帧,最多提前8秒,在150个保留的nuScenes场景上评估。我们首先基准测试预测位置:在跨越四个表示族的六个冻结编码器中,具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer(DiT),并通过受控诊断识别其所需的四个要素:空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中,我们揭示了核心矛盾:失真度量(余弦相似度、SSIM)倾向于模糊均值,掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界:扩散模型达到KID 0.078,而回归为0.375(好$4.8\times$),且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性(转向驱动场景位移,Spearman $\rho = 0.81$,而回归为$-0.18$)。我们将有限的单次运动归因于共享当前锚点,并设计了一个紧凑的170万参数“跳跃”模型,恢复完整的真实运动幅度($1.02\times$ GT),而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents that a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.

2606.12986 2026-06-12 cs.SE 新提交

The Rise of AI-Native Software Engineering: Implications for Practice, Education, and the Future Workforce

AI原生软件工程的崛起:对实践、教育和未来劳动力的影响

Mamdouh Alenezi

AI总结 本文系统综述了生成式AI和LLM对软件工程的变革,提出AI原生SE的概念框架、九维能力模型、四阶段课程路线图及研究议程,强调培养工程师的判断、验证与编排能力。

详情
AI中文摘要

生成式人工智能(GenAI)、大型语言模型(LLM)和新兴的智能体AI构成了软件工程(SE)历史上最具颠覆性的变革,重塑了开发流程、所需能力、专业角色以及大学必须交付的教育成果。本文对来自软件工程、机器学习、计算教育、人机协作和软件生产力等领先领域的48篇经过验证且有影响力的同行评审出版物(2016–2026年)进行了系统综述。研究通过一个四智能体工作流程(文献发现、科学计量分析、课程转型和劳动力影响)进行发现、筛选和分析,并与原始来源进行了验证。我们沿着九个主题和三个轨迹——实践、教育和劳动力——综合了证据,并报告了一个科学计量拐点:2022年底之后,年度LLM用于SE的产出增长了约五倍。基于此综合,我们贡献了:(i)一个围绕意图、协作和验证组织的AI原生软件工程概念框架;(ii)一个涵盖规范、批判性评估、智能体编排和元认知的九维能力模型;(iii)一个包含AI韧性评估的四阶段大学课程路线图;(iv)教师发展和劳动力转型策略;以及(v)一个包含11个研究空白的优先议程。证据基础在生产力效应的幅度和方向上存在内部矛盾,强调收益高度依赖于上下文,并且培养工程师的判断、验证和编排能力——而不仅仅是代码生产——是AI原生时代的核心挑战。

英文摘要

Generative Artificial Intelligence (GenAI), Large Language Models (LLMs), and emerging Agentic AI constitute the most disruptive transformation in the history of software engineering (SE), reshaping development processes, required competencies, professional roles, and the educational outcomes that universities must deliver. This paper presents a systematic review of 48 verified, influential peer-reviewed publications (2016--2026) drawn from leading venues in software engineering, machine learning, computing education, human--AI collaboration, and software productivity. Studies were discovered, screened, and analyzed through a four-agent research workflow (Literature Discovery, Scientometric Analysis, Curriculum Transformation, and Workforce Impact) and were verified against primary sources. We synthesize the evidence along nine themes and three trajectories -- practice, education, and workforce -- and report a scientometric inflection in which annual LLM-for-SE output grew roughly five-fold after late 2022. From this synthesis we contribute: (i) a conceptual framework for AI-native software engineering organized around intent, collaboration, and verification; (ii) a nine-dimension competency model spanning specification, critical evaluation, agent orchestration, and metacognition; (iii) a four-phase university curriculum roadmap with AI-resilient assessment; (iv) faculty-development and workforce-transformation strategies; and (v) a prioritized agenda of eleven research gaps. The evidence base is internally contradictory on the magnitude and direction of productivity effects, underscoring that benefits are strongly context-dependent and that educating engineers for judgment, verification, and orchestration -- rather than code production alone -- is the central challenge of the AI-native era.

2606.12985 2026-06-12 cs.CV 新提交

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

物体先于词汇:用于从儿童视角视频中语言接地学习的物体优先归纳偏置

Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Weizmann Institute of Science(魏茨曼科学研究所)

AI总结 针对婴儿视角视频中命名参照物出现时间和位置的双重歧义,提出BabyMind方法,通过物体优先的归纳偏置、掩码区域接口和原型空间多实例对比学习,在稀疏弱监督下提升语言接地性能。

详情
AI中文摘要

从自然经验中学习接地词汇含义需要解决婴儿视角记录中的两个歧义:命名参照物何时出现以及在杂乱画面中的位置。在SAYCam风格的数据中,看护者的语言稀疏且与自我中心视频弱同步,因此单帧对比配对会产生噪声正样本,其中目标物体缺失或被干扰物纠缠。我们提出BabyMind,一种在稀疏、噪声监督下用于儿童视角对比学习的物体优先偏置。BabyMind使用离线掩码区域接口提取候选物体嵌入,通过跟踪将短话语中心窗口内的候选物体链接成轻量级物体文件,并使用原型空间多实例对比目标将话语与物体文件袋对齐。轨迹一致性和全局物体一致性正则化器稳定学习,并将物体文件结构转移到评估时使用的全局帧嵌入中。在SAYCam-S上,BabyMind将Labeled-S 15强制选择准确率比CVCL提高了+2.6个点,并在词汇内分布外基准测试中取得一致提升。代码可在该网址获取。

英文摘要

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at this https URL.
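The abstract names a multiple-instance contrastive objective but does not give its form. As a rough illustration only, the sketch below uses a MIL-NCE-style loss in which any member of a bag of candidate object files may explain the utterance. The function name, the use of raw similarity scores, and the temperature value are all assumptions, and BabyMind's prototype-space formulation may differ.

```python
import math


def mil_contrastive_loss(bag_sims, negative_sims, temperature=0.07):
    """-log( sum_pos exp(s/t) / (sum_pos exp(s/t) + sum_neg exp(s/t)) ).

    bag_sims: similarities between one utterance and each object file in its bag.
    negative_sims: similarities between the utterance and object files from other bags.
    """
    pos = [s / temperature for s in bag_sims]
    neg = [s / temperature for s in negative_sims]
    shift = max(pos + neg)  # subtract the max for numerical stability
    pos_mass = sum(math.exp(p - shift) for p in pos)
    all_mass = pos_mass + sum(math.exp(n - shift) for n in neg)
    return -math.log(pos_mass / all_mass)


if __name__ == "__main__":
    # One well-matched object file in the bag is enough to drive the loss down.
    print(mil_contrastive_loss([0.9, 0.1], [0.0]))
    print(mil_contrastive_loss([0.1, 0.1], [0.0]))
```

Summing over the bag inside the logarithm is what tolerates noisy positives: frames where the named object is absent contribute little, as long as one candidate matches.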

2606.12984 2026-06-12 cs.CL 新提交

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

SkillChain: 为基于图像的电商AI助手闭环技能演化

Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出SkillChain框架,通过技能创建、路由优化和主体精炼三阶段自动化技能生命周期,解决电商图像助手多意图混淆问题,显著提升响应质量和用户参与度。

详情
AI中文摘要

基于图像的AI助手现已大规模部署在电商平台上,其中单张上传图像可能触发根本不同的用户意图:产品搜索、风格推荐、视觉百科或实用工具调用,每种意图都需要自己的响应格式、工具调用和领域知识。如果没有按意图的行为约束,基于LLM的系统会混淆这些异构模式,达不到领域质量标准,而意图空间的广度和动态性使得手动工程不可行。为解决这一问题,我们提出了SkillChain,它闭环了技能演化的生产反馈循环,通过三个阶段自动化技能生命周期:用于从任务规范和轨迹中引导启动的技能创建器、用于路由对齐的路由优化器,以及通过双路径LLM-Judge评估进行迭代技能主体精炼的主体精炼器。部署在生产规模的电商图像助手上,SkillChain显著提高了聚合响应质量,在结构合规性和内容质量上提升最大;为期一周的在线A/B实验进一步证实了用户参与度、内容消费和长期留存率的显著提升。

英文摘要

Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

2606.12983 2026-06-12 cs.AI 新提交

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

面向LLM驱动的硬件描述语言设计与验证数据整理的结构化测试台生成

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin, Yao-Ting Hsieh, Cheng Liang, Hsiang-Yu Tsou, Mu-Chi Chen, Yu-Kai Hung, Shao-Chun Ho, Po-Hsuang Huang, Shih-Hao Hung, H.T. Kung

发表机构 * National Taiwan University(国立台湾大学) Academia Sinica(中央研究院) Harvard University(哈佛大学)

AI总结 提出STG框架,利用硬件设计固有结构生成确定性测试台,比迭代LLM方法快720倍,编译成功率更高,覆盖率更高,误判更少,并用于数据整理和测试时扩展。

详情
Comments
9 pages, 10 figures
AI中文摘要

自动化测试台生成已成为大型语言模型(LLM)驱动的寄存器传输级(RTL)工作流中的关键瓶颈,其中大量候选设计必须快速可靠地验证。现有的基于提示的方法将测试台生成视为无约束的代码合成,产生随机输出,具有高令牌成本、低可重复性和不足的覆盖率。为了解决这一差距,我们提出了STG,一个结构化测试台生成框架,利用硬件设计的固有结构生成确定性测试台。作为直接验证工具,STG比基于迭代LLM的测试台生成流程快720倍,具有更高的编译成功率,实现更高的覆盖率,并减少对不正确DUT的错误通过判定。STG还通过暴露有缺陷的基准测试台帮助识别RTL生成基准中的错误。作为数据整理引擎,它在单个CPU核心上比基于LLM的过滤快11倍,能耗低127倍,由此得到的蒸馏模型在我们的多基准评估中提供了最先进的性能。作为测试时扩展预言,它减少了14-47%的节点数。我们的模型可在此 https URL 获取。

英文摘要

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at this https URL.

2606.12981 2026-06-12 cs.CV 新提交

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

用于TUMTraf V2X协同3D目标检测的相机与LiDAR BEV融合

Muhammad Shahbaz, Shaurya Agarwal

发表机构 * Department of Civil, Environmental and Construction Engineering, University of Central Florida(中佛罗里达大学土木、环境与建筑工程系)

AI总结 提出一种融合路边相机与基础设施-车辆点云的BEV空间检测器,采用CenterPoint风格头部和IoU重排序,在DriveX 2026挑战赛公开测试集上达到0.85 mAP,并分析了训练/验证与测试集重叠对分数的影响。

详情
AI中文摘要

我们描述了一种为DriveX 2026挑战赛的TUMTraf V2X协同3D目标检测赛道开发的相机与LiDAR融合检测器。该检测器在共享的鸟瞰视图空间中融合三个路边相机与一个融合的基础设施-车辆点云,并通过带有广义IoU回归损失和IoU质量重排序头的CenterPoint风格头部预测边界框。在提供的训练和验证分割上训练后,模型在公开Codabench测试分割上达到了0.85的3D mAP。在迭代系统时,我们观察到50个测试帧中有44个也出现在已发布的训练(40个)和验证(4个)分割中并带有标签。因此,我们进行了两项额外研究来量化这种重叠对最终分数的影响:(1)一个微调运行,对44个重叠帧进行过采样,达到0.89 mAP;(2)一个后处理运行,将这些帧上的预测替换为已发布的真实值,达到0.99 mAP(上传到我们的Codabench账户进行测试,但未在排行榜上发布)。报告了所有三种配置及其每类结果。

英文摘要

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
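For readers unfamiliar with the generalized IoU term mentioned in this abstract, here is a minimal sketch for axis-aligned 2D boxes. This is a simplification chosen for brevity: the detector regresses 3D boxes, and the abstract does not state how GIoU is computed for them. The function name and the `(x1, y1, x2, y2)` box layout are assumptions.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2): IoU - (|C| - |U|) / |C|."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter

    # Smallest axis-aligned box C enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return inter / union - (enclose - union) / enclose


def giou_loss(pred, target):
    """Regression loss in [0, 2]; 0 when the boxes coincide."""
    return 1.0 - giou(pred, target)


if __name__ == "__main__":
    print(giou((0, 0, 1, 1), (0, 0, 1, 1)))
    print(giou((0, 0, 1, 1), (2, 0, 3, 1)))
```

Unlike plain IoU, the enclosing-box term keeps the loss informative for non-overlapping boxes, since it still decreases as the prediction moves toward the target.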

2606.12979 2026-06-12 cs.LG New submission

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

Vedant Pandya

Affiliations: School of Artificial Intelligence and Data Engineering (SAIDE), Indian Institute of Technology Jodhpur

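AI summary: Proposes EPM-JEPA, which modulates the predictor at the weight level via LoRA to handle test-time dynamics shift. Experiments show it outperforms a no-memory baseline, although the effect is weaker than expected, and reveal three independent dynamical processes.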

Comments
16 pages, 5 figures, 9 tables, 5 code listings. Pre-registered experimental study with mechanism analysis
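AI abstract (translated from Chinese)

JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection (EI-JEPA), which adds a compressed experience representation as a residual to the predictor's hidden state, and operator-side modulation (EPM-JEPA), which generates low-rank weight deltas via LoRA applied to the predictor's weights. In a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74%; by our stated criterion this is Outcome C, a null result, which is a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across all seeds, while EI-JEPA falls below the baseline, indicating that the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes (buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021) rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor model that addresses this dynamical-peak limitation.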

英文摘要

JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74% - Outcome C: a null result - by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes - buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021 - rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.
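The contrast between the two mechanisms in the abstract can be sketched for a single linear layer in plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the function names, the choice to add the residual before the layer, and the hand-written low-rank factors `A` and `B` are all hypothetical. In the abstract the experience representation itself generates the LoRA deltas, and that generator is not modeled here.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def operand_side(W, h, e):
    """Operand-side injection: weights stay fixed, experience e is added to h."""
    h_mod = [hi + ei for hi, ei in zip(h, e)]
    return matvec(W, h_mod)


def operator_side(W, h, A, B, scale=1.0):
    """Operator-side modulation: effective weights are W + scale * (B @ A)."""
    base = matvec(W, h)
    # (B @ A) @ h is computed as B @ (A @ h), so the full delta matrix
    # is never materialized.
    delta = matvec(B, matvec(A, h))
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen predictor weights (toy identity layer)
h = [1.0, 2.0]                # hidden state
e = [0.5, -0.5]               # hypothetical compressed experience vector
A = [[1.0, 1.0]]              # hypothetical rank-1 down-projection (1 x 2)
B = [[0.1], [0.2]]            # hypothetical rank-1 up-projection (2 x 1)

print(operand_side(W, h, e))
print(operator_side(W, h, A, B))
```

In this toy, the operand-side path shifts the output by an amount that does not depend on `h`, whereas the operator-side path changes how the layer responds to `h`, so its contribution vanishes for a zero hidden state.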

2606.12978 2026-06-12 cs.RO cs.CV eess.SY New submission

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

Affiliations: University of Illinois Urbana-Champaign

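AI summary: Identifies a trajectory-level vulnerability in VLA models: adversarial prompts that appear to preserve the original instruction can redirect the robot's final physical outcome. Proposes a command-preserving trajectory redirection threat model and an on-policy prompt search method.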

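AI abstract (translated from Chinese)

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control, because the prompt is reused at every replanning step and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode begins, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks the target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: this https URL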

英文摘要

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: this https URL