arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12638 2026-06-12 cs.DC cs.AR New submission

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Ranganath R. Selagamsetty, Matthew Poremba, Bradford M. Beckmann, Joshua San Miguel, Mikko H. Lipasti

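AI summary: Proposes Eidola, a scalable extension to the gem5 simulation framework that uses annotated timing profiles to model inter-GPU communication traffic precisely, supporting fine-grained synchronization analysis and architectural exploration.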

Comments
13 pages, 11 figures, 1 table

English abstract

As distributed AI workloads grow in scale, multi-GPU systems have become essential for training large models. Although techniques like kernel fusion and overlapping communication with computation help reduce delays, they also introduce irregular and transient traffic patterns that are difficult to model using existing tools. These techniques rely heavily on fine-grained synchronization and peer-to-peer communication, which place significant pressure on interconnect bandwidth and latency. In this work, we introduce Eidola, a scalable extension to the gem5 simulation framework that enables detailed modeling of inter-GPU communication traffic. The extension is scalable as our GPU model serves as a succinct eidolon, emulating the minimal characteristics needed for traffic modeling. Eidola uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. This allows researchers to simulate and analyze synchronization behavior across large multi-GPU configurations. The simulator supports configurable per-GPU traffic patterns and enables isolated performance analysis under different communication scenarios. We demonstrate Eidola's effectiveness by reproducing variability in fused kernel execution and by implementing a SyncMon-inspired synchronization mechanism, confirming reductions in polling-related memory traffic. Our results show that Eidola provides a flexible and scalable platform for studying inter-GPU communication and supports architectural exploration in modern distributed GPU systems.

2606.12635 2026-06-12 cs.CV New submission

CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

Tooba Imtiaz, Milind Rajadhyaksha, Kivanc Kose, Jennifer Dy

Affiliations: Northeastern University; Memorial Sloan Kettering Cancer Center

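AI summary: For the anisotropic 3D volumes produced by reflectance confocal microscopy, proposes CD-RCM, the first RCM-specific novel-view synthesis method: a feedforward model that predicts continuous-depth slices from sparse z-stacks and achieves high-fidelity synthesis in under a second.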

English abstract

Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution "optical biopsies" of human skin in vivo by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes whose lateral resolution (0.5 $\mu$m) is $\sim$6 times finer than the axial resolution (3 $\mu$m, defined by the optical sectioning), which limits the interpretation of tissue. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue up to 200 $\mu$m below the surface. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM's depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.
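
To make the "isotropic volume" goal concrete, here is a minimal sketch of the naive baseline: linear interpolation between adjacent slices. It is not CD-RCM, which the abstract describes as a learned feedforward model. The function name `interpolate_depths` is hypothetical, and the default spacings simply reuse the 3 $\mu$m and 0.5 $\mu$m figures quoted above.

```python
def interpolate_depths(stack, axial_um=3.0, target_um=0.5):
    """Linearly interpolate a z-stack to a finer depth spacing.

    stack: list of slices, each slice a list of rows of floats.
    Returns (depths_um, slices) sampled every target_um.
    This is a baseline for illustration, not the CD-RCM model.
    """
    if len(stack) < 2:
        raise ValueError("need at least two slices")
    steps = round(axial_um / target_um)  # new samples per original gap
    depths, out = [], []
    for i in range(len(stack) - 1):
        lo, hi = stack[i], stack[i + 1]
        for s in range(steps):
            t = s / steps  # fractional position between slice i and i+1
            depths.append((i + t) * axial_um)
            out.append([[(1 - t) * a + t * b for a, b in zip(ra, rb)]
                        for ra, rb in zip(lo, hi)])
    # Keep the deepest acquired slice as the final sample.
    depths.append((len(stack) - 1) * axial_um)
    out.append([row[:] for row in stack[-1]])
    return depths, out
```

With the default spacings, two acquired slices become seven samples. A linear blend cannot recover structure that lies between slices, which is presumably why a learned model is of interest here.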

2606.12634 2026-06-12 cs.LG cs.AI cs.CL New submission

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

Affiliation: Amazon Web Services

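AI summary: To address the sparse trajectory-level advantage signal in long-horizon tool-use reinforcement learning, proposes Sibling-Guided Credit Distillation (SGCD), which dynamically samples successful and failed rollouts and has an external LLM contrast them into a stepwise credit reference, giving dense credit assignment and improved performance on AppWorld and τ³-airline.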

Comments
13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

English abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
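
The phrase "bounded detached credit weights reshape GRPO token advantages" can be sketched roughly as follows. The group-relative standardization is the usual GRPO-style advantage. The multiplicative reweighting, the clip range `[w_min, w_max]`, and both function names are assumptions made for illustration, since the abstract does not give the exact form.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each rollout's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def reshape_token_advantages(advantage, credit, w_min=0.5, w_max=1.5):
    """Scale one rollout's broadcast advantage by clipped per-token credit weights.

    credit: hypothetical per-token scores, e.g. derived from a
    teacher/student divergence. The weights are plain constants here,
    mirroring "detached": no gradient would flow through them.
    """
    weights = [min(max(c, w_min), w_max) for c in credit]
    return [advantage * w for w in weights]
```

Without the second step every token in a rollout receives the same advantage. The bounded weights redistribute that signal across tokens while the policy-gradient term stays the only actor loss.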

2606.12633 2026-06-12 cs.CV cs.LG New submission

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

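AI summary: Proposes ECA, which combines a Mixture of Query module, Fisher Dynamic Expansion, and Dictionary Replay to perform continual alignment without access to old data, mitigating catastrophic forgetting and improving incremental-learning performance for open-ended image-to-text generation.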

Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

English abstract

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at this https URL.

2606.12631 2026-06-12 cs.CC New submission

The Switching Lemma shows what the Switching Lemma cannot prove: an unconditional natural-proofs barrier

Bruno Loff, Suhail Sherif, Navid Talebanfard, Francesca Ugazio

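AI summary: Proves unconditionally that AC0-natural proofs cannot establish depth-$d$ circuit lower bounds above $2^{n^{7/(d-5)}}$, showing that the Switching Lemma itself cannot surpass the lower bounds it already establishes.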

Comments
34 pages, 2 figures

English abstract

Razborov and Rudich (JCSS'97) observed that all known lower-bound proofs follow a certain pattern: when showing that a function $F$ is hard, along the way the proof provides us with a distinguisher, namely, an efficient algorithm which can distinguish easy functions from random functions. They called such lower-bound proofs natural proofs. They then showed a natural-proofs barrier: under standard cryptographic assumptions, natural proofs cannot show superpolynomial lower-bounds against Boolean circuits. Along similar lines it can be shown that under a suitable cryptographic assumption, natural proofs cannot significantly improve the current state-of-the-art lower bound against constant depth circuits (AC0). The state of the art, using Håstad's Switching Lemma (SL), is $2^{n^{1/(d-1)}}$ for depth-$d$ circuits, and (conditionally) no natural proof can prove lower bounds of $2^{n^{c/d}}$ for some large constant $c$. In this paper we revisit the natural-proofs barrier from an $\textit{unconditional}$ perspective. We focus on AC0-natural proofs, i.e. proofs whose distinguishers are computable by AC0 circuits. Razborov and Rudich observed that lower bounds based on SL are AC0-natural. We show that this is true for most known lower-bound techniques against constant-depth circuits. We then establish an unconditional barrier for such proofs. By localizing the Trevisan--Xue pseudorandom generator, we are able to show that no AC0-natural proof can prove a lower bound greater than $2^{n^{7/(d-5)}}$ against depth-$d$ circuits. This is in the same quantitative regime as the SL frontier which instead has $1/(d-1)$ in the power of $n$. The proof has a striking self-referential aspect: the proof of security of the Trevisan--Xue generator crucially relies on SL, and so SL has been used to show that AC0-natural proofs, such as SL itself, cannot prove AC0 lower bounds better than that of SL.
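
A short worked comparison of the two exponents may help. It is plain arithmetic on the formulas quoted in the abstract, not a statement taken from the paper. The barrier exponent $7/(d-5)$ drops below $1$ only when $d > 12$, and the ratio between the barrier exponent and the SL exponent is

$$\frac{7/(d-5)}{1/(d-1)} = \frac{7(d-1)}{d-5} = 7\left(1 + \frac{4}{d-5}\right),$$

which decreases toward $7$ as $d \to \infty$. For example, at $d = 19$ the SL exponent is $1/18$ and the barrier exponent is $7/14 = 1/2$, a factor of $9$ apart.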

2606.12629 2026-06-12 cs.LG cs.AI New submission

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Varun Reddy Nalagatla

Affiliation: Amazon Web Services

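AI summary: Proposes the Bag of Dims framework, showing that the standard basis of transformer hidden states already serves as a training-free feature basis in which per-dimension sign patterns encode semantics, and validates it on three models.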

Comments
14 pages, 4 figures, 10 tables

English abstract

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, functioning as independent binary registers. We validate this Bag of Dims framework across three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with unity achieves 72-93% top-5 next-token accuracy through the LM head, and pure Hamming scoring without any decoder reaches 80-90% top-4096. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories via per-dimension sign consistency (mean AUC 0.80) from 50 anchors with zero training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in K and V projections. On the write side, static FFN weight inspection links 20% of features to individual writer neurons (>0.70 agreement; random controls: 0%), with top-200 neuron coalitions achieving >0.70 agreement on 99.9% of prototypes via majority vote. Fully unsupervised discovery (random seeds, no labels) scales to 1500 features at 100% yield and 99% sparsity across all three models, with pairwise MI of 0.0014 bits confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading throughout the transformer compute pathway, requiring no training, no optimization, and no GPU-days beyond a single forward pass per vocabulary token.
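
The "pure Hamming scoring" idea can be illustrated with a small sketch: discard magnitudes, keep only signs, and rank candidates by how many dimensions agree. The function names and the shape of `candidates` are hypothetical, and the abstract does not say which reference vectors the authors compare against, so treat this as one plausible reading.

```python
def sign_bits(vec):
    """Binarize a hidden-state vector: True where the coordinate is positive."""
    return [x > 0 for x in vec]

def hamming_agreement(a, b):
    """Number of dimensions on which two sign patterns agree."""
    return sum(x == y for x, y in zip(a, b))

def rank_by_sign(hidden, candidates, k=5):
    """Return the k candidate names whose sign pattern best matches `hidden`.

    candidates: dict mapping a token name to a reference vector
    (hypothetical; e.g. an unembedding row or a cached per-token state).
    """
    h = sign_bits(hidden)
    scored = sorted(candidates.items(),
                    key=lambda kv: hamming_agreement(h, sign_bits(kv[1])),
                    reverse=True)
    return [name for name, _ in scored[:k]]
```

Because only comparisons and counts are involved, the scoring needs no trained decoder, which matches the training-free framing of the abstract.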

2606.12628 2026-06-12 cs.CV New submission

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

Binay Kumar Singh, Niels Da Vitoria Lobo

Affiliation: Department of Computer Science, University of Central Florida

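AI summary: Proposes the Context-Centric Feature Fusion (CCFF) framework, in which a Local Context Fusion Module handles small and occluded objects and a Global Context Attention Module captures co-occurrence priors, improving co-occurring object detection; it reaches a Category-level Consistency Strategy (CCS) score of 0.973 on Cityscapes and 0.969 on BDD100K, with gains in small-object detection (AP_S: 14.1%).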

Comments
8 pages, 3 figures, CVPR 2026 Precognition Workshop

English abstract

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which uses two attention-based modules. The Local Context Fusion Module (LCFM) uses an RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly for small and partially obscured objects, while the Global Context Attention Module (GCAM) converts object co-occurrence priors into a global context attention token by pooling top-K RoI features, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that improve classification and the detection of co-occurring objects. Our method is evaluated on two datasets, Cityscapes and BDD100K, where it demonstrates significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at this https URL.
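
The top-K pooling step attributed to GCAM can be sketched in a few lines. Ranking RoIs by a confidence score and averaging their features are both assumptions made for illustration; the abstract says only that top-K RoI features are pooled into a single global context token, and the function name is hypothetical.

```python
def global_context_token(roi_features, roi_scores, k=3):
    """Mean-pool the feature vectors of the k highest-scoring RoIs.

    roi_features: list of equal-length feature vectors, one per RoI.
    roi_scores: one score per RoI (assumed ranking criterion).
    """
    if not roi_features or len(roi_features) != len(roi_scores):
        raise ValueError("need one score per RoI feature vector")
    order = sorted(range(len(roi_scores)),
                   key=lambda i: roi_scores[i], reverse=True)
    top = order[:k]
    dim = len(roi_features[0])
    return [sum(roi_features[i][d] for i in top) / len(top)
            for d in range(dim)]
```

Pooling over K object-level vectors costs far less than pooling over every pixel of a feature map, which is the overhead the abstract says the module avoids.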

2606.12620 2026-06-12 cs.SE cs.AI New submission

HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection

Luke Patterson, Li Wang, Adam Faulkner

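AI summary: Because existing benchmarks do not reflect how AI code assistants are actually used, proposes the HybridCodeAuthorship dataset of interleaved human- and AI-written lines of code and evaluates two detection algorithms on it.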

Comments
Accepted to LREC 2026

English abstract

Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are increasingly a hybrid of AI- and human-authored code. For risk management and productivity analysis, it is crucial to detect the location of AI-generated code at a fine granularity. To develop algorithms for this task, quality benchmarks are needed to assess performance. However, existing benchmarks tend to comprise academic, LeetCode-style problems and presume that a code snippet is either completely human-authored or completely AI-authored, which does not reflect the diverse intents and styles of industry codebases that use AI code assistants. To fill these gaps, we introduce HybridCodeAuthorship, a novel benchmark of Python code files with interleaved human- and AI-authored lines of code to simulate authentic use of AI code assistants. In this paper, we first present our dataset construction pipeline, which leverages CodeSearchNet, a massive collection of links to open-source repositories on GitHub. We then benchmark the performance of two state-of-the-art AI-generated code detection algorithms at both the line and chunk level. Experimental results demonstrate that HybridCodeAuthorship is a challenging benchmark, with the top-scoring algorithm, AIGCode Detector, reaching F1 scores of 0.48 and 0.56 on the chunk-level and line-level code detection tasks, respectively.
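
For readers unfamiliar with the metric, a line-level F1 score can be computed as sketched below. The label strings and the choice of "ai" as the positive class are assumptions for illustration; the abstract does not specify how the benchmark aggregates scores across files.

```python
def line_level_f1(true_labels, pred_labels, positive="ai"):
    """F1 for the positive class over per-line authorship labels."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A chunk-level variant would apply the same formula to labels assigned per group of consecutive lines rather than per line.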

2606.12618 2026-06-12 cs.AI New submission

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving

Affiliation: AI Security Institute

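AI summary: Builds 13 reasoning model organisms with verifiable beliefs and a varied prompted-lying testbed, then evaluates four lie detectors across model scales; activation- and probability-based detectors degrade sharply on the trained model organisms, while the chain-of-thought judge stays strong, though partly as an artefact.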

Comments
12 pages, 6 figures

English abstract

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
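
Balanced accuracy, the metric behind the 0.82 figure, is the mean of the true-positive and true-negative rates. A minimal sketch follows; the function and argument names are hypothetical.

```python
def balanced_accuracy(is_lie, flagged):
    """Mean of true-positive rate and true-negative rate for a binary detector.

    is_lie: ground-truth booleans (True = the statement was a lie).
    flagged: detector outputs (True = detector says "lie").
    """
    pos = [f for y, f in zip(is_lie, flagged) if y]
    neg = [f for y, f in zip(is_lie, flagged) if not y]
    if not pos or not neg:
        raise ValueError("need at least one lie and one honest example")
    tpr = sum(pos) / len(pos)
    tnr = sum(not f for f in neg) / len(neg)
    return (tpr + tnr) / 2
```

Unlike plain accuracy, this score stays at 0.5 for a detector that always answers "honest", however rare lies are in the test set.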

2606.12616 2026-06-12 cs.AI cs.CL New submission

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Mahmoud Srewa, Praneetsai Iddamsetty, Mohammad Abdullah Al Faruque, Salma Elmalaki

Affiliation: University of California, Irvine

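AI summary: Proposes the PersonaDrive pipeline, which conditions a vision-language-action (VLA) driving agent on human driving demonstrations retrieved under style instructions, producing diverse non-ego agent behavior in closed-loop simulation without per-style retraining.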

English abstract

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

2606.12615 2026-06-12 cs.LG New submission

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

Owen O'Neill, Fintan Costello

Affiliation: University College Dublin

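AI summary: Proposes the Fair Bayesian classifier, which enforces determinism and statistical consistency and achieves zero consistency error on several datasets while maintaining accuracy and multicalibration, addressing the inconsistent predictions that regularisation causes for minority groups.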

English abstract

ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

2606.12614 2026-06-12 cs.RO New submission

DARRMS -- An Efficient Algorithm for Dynamic Attention Radius in Resource-Constrained Multi-Agent Systems

Benjamin Alcorn, Eman Hammad

Affiliation: Texas A&M University

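AI summary: Proposes the DARRMS algorithm, which jointly optimizes the attention radius and decision-making to lower computational demand under resource constraints while improving coordination and scalability.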

English abstract

Multi-agent systems are integral tools in domains such as robotics, cybersecurity, and autonomous vehicle planning. These systems often face constraints on computational resources, creating a need for efficient, lightweight algorithms. Traditional decision-making frameworks often assume ideal conditions, such as full observability and unlimited computational capacity, which do not match real-world challenges. In this paper, we introduce a new algorithm that reduces the demand on computational resources without a large cost to other performance metrics. Agents limit their observability to an attention radius, which intentionally allows them to ignore parts of the environment that may be unnecessary for action planning. By optimizing both the attention radius and decision-making, our approach enhances coordination and scalability in uncertain environments. Through both theoretical analysis and empirical validation, we demonstrate the effectiveness of adaptive observation in improving system performance and maintaining robust decision-making strategies in resource-constrained systems.
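
The observation-limiting step can be sketched as a simple distance filter. This shows only the filtering for a fixed radius in 2-D; how DARRMS chooses or adapts the radius is not described in the abstract, and the function name and data layout are hypothetical.

```python
from math import hypot

def visible_neighbors(agent_pos, others, radius):
    """Return the ids of agents within `radius` of `agent_pos` (Euclidean, 2-D).

    others: dict mapping an agent id to its (x, y) position.
    """
    ax, ay = agent_pos
    return [aid for aid, (x, y) in others.items()
            if hypot(x - ax, y - ay) <= radius]
```

A smaller radius means fewer neighbors enter each planning step, which is where the computational saving would come from; the trade-off is that relevant agents may be ignored.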

2606.12610 2026-06-12 cs.LG New submission

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

Miquel Noguer i Alonso, David Pacheco Aznar

Affiliations: AIFI Staq.io

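AI summary: Offers a mathematical account of the AI winters, analysing why early AI paradigms failed through bottlenecks such as perceptron impossibility results, neural-network training complexity, high-dimensional nonparametric estimation rates, vanishing gradients, and statistical learning theory, and linking them to later breakthroughs.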

Comments
33 pages, 1 figure

English abstract

Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.

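2606.12609 2026-06-12 cs.LG q-bio.QM new submission

Viral Proteins Reveal Geometry of Protein Language Models

Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

AI summary: Studies how protein language models trained on imbalanced data represent viral proteins, finding a dominant "nativeness" axis in embedding space that orders sequences by model perplexity. Scaling affects viral families unevenly, yet the embeddings still retain viral-specific signal.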

Comments: Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at this https URL

Abstract

Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

2606.12608 2026-06-12 cs.CL cs.LG new submission

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin

Affiliations: Amazon

AI summary: Introduces a multi-turn conversational shopping reasoning benchmark of 525 missions authored by retail experts, with 10863 weighted rubrics. An evaluation of 9 models shows pass rates of only 57-77%, with performance on multi-turn missions degrading by 4-18 points.

Abstract

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57-77% overall. On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4-18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.

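2606.12604 2026-06-12 cs.RO new submission

EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Yangcen Liu, Shuo Cheng, Xinchen Yin, Woo Chul Shin, Alfred Cueva, Yiran Yang, Zhenyang Chen, Chuye Zhang, Danfei Xu

AI summary: Proposes EgoEngine, a framework that bridges the visual and action gaps to convert egocentric human videos into high-fidelity robot data, enabling what the authors describe as the first zero-shot dexterous policy learning from such videos.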

Abstract

Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but directly using them for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable action. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video that replaces the human with a robot while preserving scene context and temporal alignment, and (ii) a task-aligned, executable robot action trajectory under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, demonstrates the first zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: this https URL.

2606.12603 2026-06-12 cs.RO cs.AI new submission

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

Affiliations: University of California, Los Angeles

AI summary: Proposes FlowPilot, a mapless navigation policy that uses only a monocular RGB camera. It is pre-trained with anchored flow matching and aligned through human-preference learning, improving robustness and social compliance in long-horizon sidewalk navigation.

Abstract

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

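2606.12601 2026-06-12 cs.CV new submission

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

Affiliations: University of Arkansas

AI summary: Proposes Dual-State Slot Attention (DSSA), which splits each slot into a local state (appearance) and an identity state (stable identity) and uses competition-modulated aggregation to reduce interference from weakly matching slots, improving video object segmentation quality and temporal consistency.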

Abstract

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.
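
The second failure mode described above can be illustrated with a small numeric sketch. It assumes the normalization of the original Slot Attention formulation (softmax over slots for each token, then renormalization over tokens for each slot); the function names and toy logits are hypothetical and are not taken from the DSSA paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def slot_attention_weights(logits):
    """logits[i][k] is the affinity of token i to slot k (toy values).

    Returns (attn, mass, weights):
      attn[i][k]    softmax over slots, so slots compete for each token
      mass[k]       total attention slot k received across all tokens
      weights[i][k] attn renormalized over tokens, used for the slot update
    """
    attn = [softmax(row) for row in logits]
    num_slots = len(logits[0])
    mass = [sum(row[k] for row in attn) for k in range(num_slots)]
    weights = [[row[k] / mass[k] for k in range(num_slots)] for row in attn]
    return attn, mass, weights

# Hypothetical scene: slot 0 matches every token strongly, slot 1 barely matches.
logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.5]]
attn, mass, weights = slot_attention_weights(logits)
print("attention mass per slot:", [round(m, 3) for m in mass])
print("update weights for slot 1:", [round(w[1], 3) for w in weights])
```

In this toy case slot 1 receives very little attention mass, yet its renormalized update weights still sum to one, so it is updated from tokens that belong to another object. This is the effect that, according to the abstract, competition-modulated aggregation is meant to down-weight.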

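2606.12599 2026-06-12 cs.CL new submission

Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation

Zahra Habibzadeh, Paria Khoshtab, Amir Mesbah, Yadollah Yaghoobzadeh

AI summary: Proposes the constrained semantic decompression task, using Persian proverb-conditioned story generation to test abstraction-to-realization in large language models. Builds the PAND dataset, identifies a decompression gap, and shows that explicit reasoning and iterative refinement partially mitigate it.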

Abstract

Transforming a dense, abstract proverb into an engaging and morally faithful narrative requires deep cultural understanding and robust semantic grounding. We frame this problem as a constrained semantic decompression task and study proverb-conditioned story generation as a testbed for abstraction-to-realization in large language models (LLMs). Focusing on Persian, we introduce the Proverb Aligned Narrative Dataset (PAND), pairing proverbs with human-written stories and explicit meanings. Using a hybrid evaluation framework that combines human-calibrated LLM-as-a-Judge with structural metrics, we analyze model behavior across multiple prompting regimes. Our findings reveal a persistent decompression gap: current LLMs often achieve strong surface-level fluency while failing to faithfully instantiate the underlying moral and causal structure encoded in proverbs. We further show that explicit reasoning and iterative refinement can partially mitigate these failures, suggesting that many decompression errors arise from difficulties in translating abstract meaning into narrative form rather than a complete lack of relevant knowledge. Our proposed task naturally extends to other forms of compressed cultural knowledge.

2606.12595 2026-06-12 cs.LG cs.AI cs.CV new submission

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

Affiliations: Oak Ridge National Laboratory

AI summary: Systematically compares geospatial foundation models with different architectures, evaluating their flexibility and performance under a unified setup to provide design guidance for multimodal reasoning.

Abstract

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12594 2026-06-12 cs.AI new submission

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Joshua Ong Jun Leang, Zheng Zhao, Mihaela Cătălina Stoian, Qiyuan Xu, Haonan Li, Wenda Li, Shay B. Cohen, Eleonora Giunchiglia

Affiliations: Imperial College London; University of Edinburgh; Nanyang Technological University; MBZUAI

AI summary: Presents the Pythagoras-Prover family, comprising autoregressive and diffusion models, which uses curriculum SFT, dynamic filtering, and Augmented Lean Formalisation (ALF) to expand verified data, and surpasses DeepSeek-Prover-V2 on MiniF2F-Test with fewer parameters.

Comments: Pythagoras-Prover: Technical Report

Abstract

Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive. We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time. For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget. We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement's surface form. Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.
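
For readers unfamiliar with the pass@32 figures quoted above, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n proofs are sampled per problem and c of them verify. It is an assumption that the paper computes pass@32 this way (with exactly 32 samples per problem the metric reduces to "at least one sample verifies"); the function name and the numbers in the usage lines are illustrative only.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k for one problem.

    n: number of sampled proofs, c: number that verify, k: sampling budget.
    Requires 0 <= c <= n and 1 <= k <= n.
    """
    if n - c < k:
        # Every size-k subset of the samples contains at least one verified proof.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative values, not results from the paper.
print(pass_at_k(n=64, c=3, k=32))
print(pass_at_k(n=32, c=0, k=32))
```

A benchmark-level score is then the mean of this quantity over all problems.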

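2606.12592 2026-06-12 cs.SE new submission

Characterizing Tests in IoT Software: Practices, Challenges and Opportunities

Rufeng Chen, Hengcheng Zhu, Wuqi Zhang, Zixu Zhou, Lili Wei

AI summary: Through the first empirical study of test cases in open-source IoT software, evaluates test effectiveness, identifies challenges in interacting with external dependencies, and analyzes the potential of mock objects.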

Comments: 15 pages, 4 figures

Abstract

The Internet of Things (IoT) is experiencing rapid growth. Smart devices are emerging in smart homes and industrial applications, performing mission-critical tasks. Bugs in IoT software can lead to severe consequences. For example, a buggy smart lock can allow unauthorized access to a private property. Testing is a primary practice to expose software bugs and ensure software quality. However, little is known about how IoT software is tested. To bridge this gap, we conducted the first empirical study on test cases in open-source IoT software. Specifically, we evaluated the effectiveness of test cases in IoT software, explored the challenges inherent in testing IoT software, and analyzed the usage of mock objects. Our results indicate that while IoT software often contains a considerable number of tests, their effectiveness remains limited. We identified the primary challenges in testing IoT software as managing complex interactions with various external dependencies, such as other network-reliant IoT components, file systems, operating systems, and databases. We also observed that the use of mock objects in IoT software closely aligns with our identified testing challenges. This alignment demonstrates the potential of mocking as a solution to enhance test coverage and address the complexities of IoT software testing.

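2606.12590 2026-06-12 cs.CV cs.AI new submission

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Affiliations: York University; University of British Columbia; Vector Institute; Queen's University

AI summary: To address shortcomings of medical large vision-language models in factual consistency, visual grounding, and clinical alignment, proposes a fine-grained on-policy preference optimization framework that combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective. Preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, which the summary reports significantly improves diagnostic accuracy.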

Abstract

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

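2606.12587 2026-06-12 cs.AI cs.HC new submission

Strategic Decision Support for AI Agents

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

Affiliations: University of Pennsylvania

AI summary: Addressing reliability when AI agents are the primary decision-makers, proposes a strategic decision support framework that minimizes support usage while controlling a counterfactual missed-support error via an optimization problem, and develops an online algorithm that adaptively thresholds a support score.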

Abstract

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools become support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost-value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human-AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.

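2606.12586 2026-06-12 cs.CR new submission

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

Jiamin Chang, Salil Kanhere, Piotr Koniusz, Jason (Minhui) Xue, Hammond Pearce

AI summary: Introduces the concept of "trigger leakage", quantifying the risk that backdoor triggers in vision-language agentic systems unintentionally activate hidden behaviors on visually or semantically similar inputs, and proposes the Neighbor Leakage Rate (NLR) metric.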

Abstract

Vision-Language Agentic Systems (VLAS) connect visual perception to planning, tool use, and physical actions. This means backdoor-type triggers can propagate through both decision pipelines and their connected interfaces, thus making visual backdoors a system-level threat. Current evaluations of such backdoors focus on clean accuracy and attack success rate (ASR), metrics that capture whether a trigger works, but not whether an attack is actually "precise", i.e., whether it triggers hidden behaviors only when intended. In this work, we formalize the failure of trigger precision as "trigger leakage": inputs that are visually or semantically close to the intended trigger and therefore inadvertently activate the attacker-specified behavior. To quantify this leakage, we introduce Neighbor Leakage Rate (NLR). Our experiments show that at a 3% poisoning ratio, icon and text triggers remain robust to common visual transformations, but their neighboring variants leak heavily, with NLR reaching 0.996 (icon) and 0.944 (text). Using textual triggers as a controlled probe, we show that standard fine-tuning learns a broad activation region rather than an exact trigger condition, causing neighboring strings to invoke the malicious behavior even when the exact trigger is absent. Adding edit-distance-one hard-negative samples during training substantially narrows this activation region and reduces leakage, including in image-editing and embodied-manipulation workflows, where leaked triggers can propagate into executable programs and action sequences.
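
The edit-distance-one neighbors mentioned above can be enumerated mechanically for a text trigger. The sketch below does so and computes a leakage fraction over them. It is only one plausible reading: the abstract does not give the exact NLR definition, and the function names, the alphabet, and the stand-in `activates` predicate are all hypothetical.

```python
import string

def edit_distance_one_neighbors(trigger, alphabet=string.ascii_lowercase):
    """All strings exactly one insertion, deletion, or substitution away from trigger."""
    neighbors = set()
    for i in range(len(trigger) + 1):
        for ch in alphabet:
            # Insert ch before position i.
            neighbors.add(trigger[:i] + ch + trigger[i:])
    for i in range(len(trigger)):
        # Delete the character at position i.
        neighbors.add(trigger[:i] + trigger[i + 1:])
        for ch in alphabet:
            if ch != trigger[i]:
                # Substitute the character at position i.
                neighbors.add(trigger[:i] + ch + trigger[i + 1:])
    return neighbors

def neighbor_leakage_rate(neighbors, activates):
    """Fraction of neighbors on which the hidden behavior fires (assumed definition)."""
    neighbors = list(neighbors)
    if not neighbors:
        return 0.0
    return sum(1 for s in neighbors if activates(s)) / len(neighbors)

# Stand-in for querying a backdoored model: fires on any string starting with "a".
neighbors = edit_distance_one_neighbors("ab", alphabet="ab")
print(sorted(neighbors))
print(neighbor_leakage_rate(neighbors, lambda s: s.startswith("a")))
```

In a real evaluation the predicate would run the agent on an input carrying the neighbor string and check for the attacker-specified behavior. The same neighbor set could serve as the hard negatives paired with clean labels during training.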

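2606.12581 2026-06-12 cs.SI cs.AI new submission

Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark

Mateusz Stolarski, Michał Czuba, Piotr Bielak, Piotr Bródka

AI summary: Proposes the SORB benchmark framework to systematically evaluate how graph reduction affects influence maximisation, finding that the effect of reduction depends on network type and evaluation metric.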

Abstract

Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relationships. Their scale often renders direct analysis computationally demanding. While influence maximisation (IM) has been widely studied, the role of graph reduction as a preprocessing step, and its impact on IM accuracy, remains underexplored. In this work, we introduce the Spreading-Oriented Reduction Benchmark (SORB), an open-source, standardised framework for systematically evaluating IM models across diverse task settings. SORB provides an extensible pipeline operating on a representative collection of real-world networks, including single- and multilayer structures, and incorporates graph reduction directly into the evaluation process. This design shifts the focus from analysing IM algorithms in isolation to quantifying how graph reduction alters predictive performance. Using SORB, we study the effects of sparsification and coarsening across multiple IM scenarios. Our results show that the impact of reduction is strongly dependent on both the network type (single-layer vs. multirelational) and the downstream task ($Gain@k$ vs. $\mathrm{AUC}_{\mathrm{cutoff}}$): sparsification preserves seed set quality on single-layer networks, whereas flattened multilayer networks exhibit systematic ranking degradation regardless of reduction strategy. These findings highlight the importance of reduction-aware, multi-task evaluation when studying spreading processes in complex networks.

2606.12579 2026-06-12 cs.RO new submission

G-MAPP: GPU-accelerated Multi-Agent Planning and Perception for Reactive Motion Generation

Tanmay Bishnoi, Riddhiman Laha, Tobias Löw, Jose Alex Chandy, Luis F. C. Figueredo, Sami Haddadin

Affiliations: Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University; Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich (TUM); Institute for Experiential Robotics, Northeastern University; Idiap Research Institute; EPFL; CHART Group at the School of Computer Science, University of Nottingham; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: Proposes a GPU-accelerated framework that uses parallel state exploration and a tightly coupled perception-action loop for real-time reactive motion generation in unstructured environments, achieving up to a 5x speedup and successful collision avoidance on a 7-DoF robot.

Comments: The implementation is available at: this https URL

Abstract

Reactive motion generation in unstructured environments remains an open challenge in robotics. Due to the computational complexity of collision-free motion generation, existing methods either generate global trajectories for static scenarios, or employ models that make conservative assumptions about the environment. This paper identifies the primary bottleneck as the runtime performance demand of planning on high-fidelity environments, and the temporal integration between the perception and planning modules. Therefore, we propose a framework that does not compromise on runtime performance and world representations for perception and planning by accelerating world modeling and vector-field based planning using the GPU. This allows us to achieve faster parallel state exploration for quasi-global trajectory planning, and tighter coupling of the perception-action loop in real-time for dynamic cluttered environments with off-the-shelf depth sensors. We quantitatively evaluate the computation-time and success rate differences for the CPU and GPU versions of our planner, and perform qualitative evaluations of our coupled framework using real-world experiments on a 7-DoF Franka Emika robot. Experimental results demonstrate that our GPU-based framework achieves up to a 5x speedup over the CPU version and successfully avoids collisions across both trivial and challenging physical world scenarios.

2606.12578 2026-06-12 cs.CL 新提交

MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction

MARD: 镜像增强推理蒸馏用于机制级药物-药物相互作用预测

Mohammadreza Riyazat, Vian Lelo, Rameen Jafri, Yumna Khan, Abeer Badawi

Affiliations * University of Guelph; York University; Vector Institute

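AI summary: Proposes MARD-7B, a model trained with mirror-augmented reasoning distillation, a single-token KL divergence, PRM-weighted DPO, and a mechanism-aware retrieval channel; on mechanism-level DDI prediction it beats GPT-4o by 6.7 percentage points in accuracy at roughly 1% of frontier API cost.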

详情
Comments
29 pages, 9 figures. Preprint
AI中文摘要

机制级药物-药物相互作用(DDI)预测需要识别涉及的酶或药效学轴、作用方向及证据,而不仅仅是判断两种药物是否相互作用。我们引入了一个可复现的机制级DDI标注与评估协议,包括结构化的7家族/147亚型分类法、无泄漏的冷切分协议以及可审计的推理指标,用于评估超越平面交互分类的药理学预测。我们提出一个流水线,生成了7B推理模型MARD(镜像增强推理蒸馏),结合了三种训练创新:方向标签上的单token KL散度,将模型的预测与方向标签绑定;基于PRM权重的DPO,使用程序化硬负样本;以及无泄漏的机制感知检索通道。过程奖励步骤标签可自动根据DrugBank结构化字段验证,无需人工或LLM评判。在2026年4月的DrugBank版本上,我们的MARD-7B是32个系统比较中唯一在药物对新颖性下准确率保持稳定的系统,以约1%的前沿API成本,比最佳基线高出13.9个百分点,比GPT-4o高出6.7个百分点。进一步分析揭示了反记忆特征,即在罕见药物上准确率提升,表明增益来自结构化药理学推理而非药物频率记忆。我们发布了语料库、DDI-PRM、检索索引和训练代码。

英文摘要

Mechanism-level drug-drug interaction (DDI) prediction requires identifying which enzyme or pharmacodynamic axis is implicated, in which direction, and with which evidence -- not merely whether two drugs interact. We introduce a reproducible mechanism-level DDI labelling and evaluation protocol with a structured 7-family/147-subtype taxonomy, leakage-safe cold-split protocols, and auditable reasoning metrics for evaluating pharmacological prediction beyond flat interaction classification. We propose a pipeline that produces MARD (Mirror-Augmented Reasoning Distillation), a 7B reasoning model, combining three training innovations: a single-token KL divergence on the direction tag that ties the model's prediction to that tag, per-loss PRM-weighted DPO with programmatic hard negatives, and a leakage-safe mechanism-aware retrieval channel. Process-reward step labels are automatically verifiable against structured DrugBank fields, requiring no human or LLM judges. On the April-2026 DrugBank release, our MARD-7B is the only system in a 32-system comparison whose accuracy survives drug-pair novelty, beating the best baseline by +13.9 pp and GPT-4o by +6.7 pp at ~1% of frontier API cost. Further analysis reveals an anti-memorisation signature in which accuracy improves on rarely seen drugs, suggesting that the gain comes from structured pharmacological reasoning rather than drug-frequency memorisation. We release the corpus, DDI-PRM, retrieval index, and training code.

2606.12576 2026-06-12 cs.CL 新提交

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

帮助图表讲述它们的故事!基于论文的视频生成解释复杂科学图表

Ishani Mondal, Javad Baghirov, Jordan Boyd-Graber

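AI summary: Proposes MINARD, a pipeline that generates narrated, region-decomposed walkthrough videos from a figure and its paper, and releases the FigTalk benchmark; MINARD outperforms existing approaches in both automatic and human evaluation.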

详情
Comments
Webpage: this https URL
AI中文摘要

科学图表将复杂的流程压缩到单个画布中,但理解它们需要基于论文的、逐步的叙述,并与视觉高亮对齐——这是当前视频生成系统和基准所缺乏的能力。为了解决这个问题,我们引入了基于论文的图表到视频生成:从图表及其论文生成叙述性的、区域引导的导览视频。我们提出了MINARD(通过区域分解对叙述性架构进行多模态解释),这是一个生成基于论文的叙述并顺序将其与图表区域对齐的流水线。我们还发布了FigTalk,一个包含新的顺序和组件级对齐指标的基准。在FigTalk上,MINARD生成类人的、忠于论文的叙述,并在自动和人工评估中,在叙述条件下的图表空间对齐方面优于现有方法。

英文摘要

Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned figure spatial grounding in both automatic and human evaluation.

2606.12575 2026-06-12 cs.CV 新提交

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

高保真两步图像生成:通过教师对齐的端到端蒸馏

Dongyang Liu, Ruoyi Du, David Liu, Dengyang Jiang, Liangchen Li, Qilong Wu, Zhen Li, Steven C.H. Hoi, Hongsheng Li, Peng Gao

Affiliations * Z-Image Team, Alibaba Group; The Chinese University of Hong Kong

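AI summary: Proposes Z-Image Turbo++, which distills an 8-step teacher into a 2-step generation model using distribution-aligned adversarial learning, step-decoupled parameterization, and end-to-end training with iterative regularization, substantially narrowing the quality gap between 2-step and 8-step generation.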

详情
AI中文摘要

少步扩散蒸馏在4-8步生成中已日趋成熟,但进一步推进到2步仍具挑战。本文介绍Z-Image Turbo++,一种从8步Z-Image Turbo教师模型蒸馏得到的高质量2步图像生成模型。我们的方法通过三个针对该场景简单而有效的设计选择,解决了2步生成中任务难度增加和模型容量有限的核心瓶颈。首先,我们提出分布对齐对抗学习,使用教师生成的图像而非外部真实图像作为GAN训练的真实样本,提供更易实现且信息量更大的对抗目标。其次,我们采用步解耦参数化,为两个去噪步骤分配独立的模型参数,以更好地匹配它们不同的容量需求。第三,我们执行带迭代正则化的端到端训练,使第一步能够接收来自最终图像质量的梯度,同时通过显式的步1损失保留有意义的中间生成。这些设计共同在定性和定量评估中显著缩小了2步与8步生成之间的质量差距,凸显了精心定制的蒸馏策略在改善少步生成中质量-效率权衡方面的潜力。

英文摘要

Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.