arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.04893 2026-06-11 cs.LG cs.CL stat.ML version update

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

Dominik Dahlem, Diego Maniloff, Mac Misiura

AI summary: Studies two failure shapes of attention routing in language models (over-concentration or over-diffusion), proves that symmetric spectral diagnostics are insensitive to direction, establishes a theoretical floor on transport capacity in causal attention, and proposes a two-axis diagnostic based on capacity and direction.

Comments
48 pages, 6 figures, 7 tables; 81-page online supplement (proofs, additional experiments, dataset statistics) as an ancillary file

Abstract

When a language model processes a hallucinated response, its attention routing tends to fail in one of two shapes: over-concentrating on a narrow set of positions, or spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. We study these shapes as a diagnostic characterization, computed from attention matrices under \emph{forced scoring} of benchmark-labeled responses rather than during live generation. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport \emph{capacity}; we prove that every transpose-invariant spectral diagnostic of this operator is structurally \emph{orientation-blind} (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a converse to the blindness theorem bounding any Lipschitz diagnostic's transpose sensitivity by the asymmetry coefficient $G$. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $\phi \ge 1/5$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. This floor is an idealized-architecture benchmark, not an empirical attractor: the fraction of real attention heads that pierce it is itself an architectural signature. The resulting two-axis diagnostic ($\phi$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (0.62-0.84 LC-AUROC) across the tested decoder-only, encoder-only, and encoder-decoder models, with polarity reversing as predicted between HaluEval and MedHallu.
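The orientation-blindness claim can be checked on a toy example. The sketch below, in plain Python, builds a small uniform causal attention matrix and shows that its symmetric part is identical to that of its transpose, so a diagnostic computed from the symmetric part alone cannot tell the two apart. The `asymmetry` function is a hypothetical stand-in (a normalized Frobenius norm of the antisymmetric part), not the paper's definition of $G$, and the degree normalization mentioned in the abstract is omitted for brevity.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def symmetric_part(m):
    """Return (A + A^T) / 2, the component that symmetric diagnostics see."""
    t = transpose(m)
    n = len(m)
    return [[(m[i][j] + t[i][j]) / 2 for j in range(n)] for i in range(n)]

def frobenius(m):
    return sum(x * x for row in m for x in row) ** 0.5

def asymmetry(m):
    """Hypothetical asymmetry score in [0, 1]: ||A - A^T||_F / (2 ||A||_F)."""
    t = transpose(m)
    n = len(m)
    diff = [[m[i][j] - t[i][j] for j in range(n)] for i in range(n)]
    return frobenius(diff) / (2 * frobenius(m))

def uniform_causal_attention(n):
    # Row i attends uniformly to positions 0..i (lower-triangular, row-stochastic).
    return [[1 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]

A = uniform_causal_attention(4)
At = transpose(A)

# A and its transpose differ, yet their symmetric parts coincide.
same_symmetric_part = symmetric_part(A) == symmetric_part(At)
```

Here `same_symmetric_part` is `True` even though `A != At`, while `asymmetry(A)` is strictly positive: the information lost by symmetrizing is exactly what a direction-sensitive axis would need to measure.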

2605.04853 2026-06-11 cs.LG version update

Hybrid Iterative Neural Low-Regularity Integrator for Nonlinear Dispersive Equations

Zhangyong Liang, Huanhuan Gao

AI summary: Proposes HIN-LRI, a hybrid framework in which a lightweight neural network learns to correct the structured truncation error of a classical low-regularity integrator; explicit time-step scaling guarantees stability, and the method improves accuracy on dispersive equations with rough data while retaining generalization.

Abstract

We propose HIN-LRI, a hybrid framework that augments a classical numerical solver with a neural operator trained to correct the solver's structured truncation error. A base low-regularity integrator provides a consistent first-order approximation to nonlinear dispersive PDEs, while a lightweight neural network, operating on a low-dimensional latent manifold, learns the residual defect that analytical methods cannot close. An explicit time-step scaling on the neural correction ensures that its Lipschitz contribution remains $\mathcal{O}(\tau)$, yielding a Gronwall stability factor bounded uniformly in the step size and independent of the spatial resolution. The network is trained end-to-end through a solver-in-the-loop objective that unrolls the full iteration and penalises trajectory error in a Bourgain-type norm, aligning learning with multi-step solver dynamics rather than isolated one-step targets. Under stated assumptions, the global error satisfies $C(\varepsilon_{net}+\delta)\,\tau^\gamma\ln(1/\tau)$, where $\varepsilon_{net}$ measures the network approximation quality and $\delta$ the training shortfall. Experiments on three dispersive benchmarks with rough data show that HIN-LRI improves accuracy over analytical integrators, splitting methods, and neural PDE surrogates, with stable spatial refinement, effective out-of-distribution transfer, and modest online overhead.
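The role of the explicit time-step scaling can be illustrated with a standard discrete Gronwall argument. The derivation below is a sketch under assumed notation, not the paper's proof: $\Phi_\tau$ denotes the base integrator, assumed to have Lipschitz constant $1 + L\tau$ in the working norm; $\mathcal{N}_\theta$ is the neural correction with Lipschitz constant $L_\theta$; and $T = N\tau$ is the final time.

```latex
\begin{align*}
u^{n+1} &= \Phi_\tau(u^n) + \tau\,\mathcal{N}_\theta(u^n), \\
\|u^{n+1} - v^{n+1}\| &\le \bigl(1 + (L + L_\theta)\tau\bigr)\,\|u^n - v^n\|, \\
\|u^{N} - v^{N}\| &\le \bigl(1 + (L + L_\theta)\tau\bigr)^{N}\,\|u^0 - v^0\|
  \le e^{(L + L_\theta)T}\,\|u^0 - v^0\|.
\end{align*}
```

Without the factor $\tau$ on the correction, the per-step constant would be $1 + L\tau + L_\theta$, and its product over $N = T/\tau$ steps would grow without bound as $\tau \to 0$.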

2605.04221 2026-06-11 cs.CL cs.AI version update

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh, Shirindokht Shiraz, Chun-Teh Lee, Ryan Brandon, Muhammad F Walji, Xiaoqian Jiang, Bunmi Tokede

AI summary: To address named entity recognition in dental notes, which are unstructured, domain-specific, and privacy-sensitive, proposes a locally deployable self-prompting framework that uses multi-prompt ensemble inference, QLoRA-based fine-tuning, and direct preference optimization; Qwen2.5-14B-Instruct reaches micro/macro F1 scores of 0.864/0.837.

Abstract

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.

2605.03065 2026-06-11 cs.LG cs.RO version update

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

Sarvesh Patil, Mitsuhiko Nakamoto, Manan Agarwal, Shashwat Saxena, Jesse Zhang, Giri Anantharaman, Cleah Winston, Chaoyi Pan, Douglas Chen, Nai-Chieh Huang, Zeynep Temel, Oliver Kroemer, Sergey Levine, Abhishek Gupta, Hongkai Dai, Paarth Shah, Max Simchowitz

AI summary: Proposes OGPO, which uses off-policy critic networks and a modified PPO objective for sample-efficient finetuning of generative control policies; it achieves state-of-the-art performance on a range of manipulation tasks and can finetune poorly initialized behavior cloning policies without expert data.

Abstract

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can finetune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and it does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.

2605.02849 2026-06-11 cs.CV version update

Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

Amirhosein Javadi, Shirin Saeedi Bidokhti, Tara Javidi

AI summary: Proposes ActDiff-VC, a framework that uses a conditional diffusion model and active sampling strategies (adaptive keyframe selection and budget-aware sparse trajectory selection) to achieve high perceptual quality video compression at ultra-low bitrates.

Comments
21 pages, 11 figures, 3 tables

Abstract

Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6\% bitrate reduction at matched NIQE, improves KID by up to 64.6\% and FID by up to 37.7\% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate--distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.

2605.02411 2026-06-11 cs.AI cs.IR cs.LG cs.MA version update

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang

AI summary: To bridge the semantic gap between user task descriptions and tool documentation, proposes FitText, a framework that embeds retrieval in the reasoning loop and, through iterative refinement of natural-language pseudo-tool descriptions and memetic evolutionary selection, substantially improves tool retrieval performance.

Abstract

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
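For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of NDCG@k in plain Python. It uses the linear-gain variant (relevance grade divided by a log2 rank discount); the abstract does not state which gain convention ToolRet uses, so that choice is an assumption, as are the two illustrative rankings.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Illustrative relevance grades of retrieved tools, in ranked order (1 = relevant).
static_ranking = [0, 1, 0, 0, 1, 0]        # relevant tools at ranks 2 and 5
reformulated_ranking = [1, 1, 0, 0, 0, 0]  # relevant tools at ranks 1 and 2

static_score = ndcg_at_k(static_ranking, 5)
reformulated_score = ndcg_at_k(reformulated_ranking, 5)
```

Moving the same two relevant tools to the top of the list raises the score, which is the kind of gain a reformulated query is meant to produce. A "point" in the abstract presumably refers to 0.01 on this 0-to-1 scale.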

2606.11107 2026-06-11 eess.IV cs.CV cs.LG version update

Multimodal Brain Tumour Classification Using Feature Fusion

Wajih ul Islam, Muhammad Yaqoob, Javed Ali Khan, Volker Steuber

AI summary: Proposes a two-branch multimodal network that fuses MRI images with 91 radiomic features; with gated fusion, brain tumor classification reaches 96.13% accuracy.

Abstract

Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate clinicians' multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. The two streams are fused via concatenation, gating, or bidirectional cross-modal attention. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.
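Gated fusion is commonly implemented as a learned, per-dimension blend of the two feature streams. The sketch below shows one such form in plain Python; it is an illustration under assumptions, not the paper's architecture. In particular, it assumes both streams have already been projected to the same dimension, and `gate_weights` and `gate_bias` are hypothetical parameter names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(image_feat, radiomic_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    gate_weights[d] is a weight row over the concatenated [image; radiomic]
    vector; the gate g_d in (0, 1) decides how much of the image feature to keep.
    """
    concat = image_feat + radiomic_feat
    fused = []
    for d in range(len(image_feat)):
        pre_activation = sum(w * x for w, x in zip(gate_weights[d], concat)) + gate_bias[d]
        g = sigmoid(pre_activation)
        fused.append(g * image_feat[d] + (1.0 - g) * radiomic_feat[d])
    return fused

image_feat = [1.0, 2.0]
radiomic_feat = [3.0, 0.0]
zero_weights = [[0.0] * 4 for _ in range(2)]

# With zero weights and zero bias every gate is 0.5, so the result is the average.
balanced = gated_fusion(image_feat, radiomic_feat, zero_weights, [0.0, 0.0])
# A large positive bias pushes every gate towards 1, favouring the image stream.
image_dominant = gated_fusion(image_feat, radiomic_feat, zero_weights, [50.0, 50.0])
```

Compared with plain concatenation, a gate of this kind lets the network down-weight one modality on a per-sample basis, which is one plausible reason gating can outperform the other fusion strategies.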

2606.11152 2026-06-11 cs.CV version update

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou, Jingxi Xu, Feihu Zhang, Jiaheng Liu, Yao Yao

Affiliations: Nanjing University; Envision

AI summary: Proposes the P3D-Bench benchmark, which uses parametric 3D programs to evaluate multimodal large language models on geometric precision, semantic alignment, and assembly consistency across three task families: Text-to-3D, Image-to-3D, and Assembly-3D.

Comments
Project page: this https URL

Abstract

Multimodal large language models can write code to produce complex programs as well as use programs to do 3D modeling, which opens up a new avenue for 3D generation powered by their priors, world knowledge and reasoning. Yet existing benchmarks rarely evaluate 3D modeling through code. Such modeling demands more than runnable code: from a text or visual specification, a model must generate a parametric 3D program that is geometrically precise, semantically aligned and assembly-consistent. We introduce P3D-Bench, a benchmark for parametric 3D generation. Unlike a 3D mesh, a parametric 3D program exposes explicit dimensions, construction operations and part relations, revealing whether a model recovers a design's structure, not just its appearance. Under a unified protocol, P3D-Bench covers three task families (Text-to-3D, Image-to-3D and Assembly-3D) and scores each output for executability, geometric fidelity, topology, text-grounded constraints, multiview semantic alignment and part-level structure. We evaluate frontier MLLMs and text-only LLMs on 400 text cases, 400 image cases and 203 annotated assemblies, with domain-specific models as reference points. Our extensive evaluation yields three findings. First, assemblies are the hardest setting, where models still fail to compose multiple parts into a coherent structure. Second, models can often recover the global shape and semantic identity of the target object, yet fail to reproduce the precise parametric geometry specified by the input. Third, part-level modeling remains weak on assemblies, where models recover neither the geometry of each part nor the right number of parts. These results position P3D-Bench as a benchmark for evaluating precise parametric geometry and part-level structure in parametric 3D generation.

2606.11092 2026-06-11 cs.RO cs.AI version update

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

Affiliations: The University of Hong Kong; The Chinese University of Hong Kong; Archon Robotics

AI summary: Proposes RoboNaldo, a three-stage motion-guided curriculum reinforcement learning framework that starts from a single human-kick reference and progressively optimizes shooting performance; in simulation, shot error falls by 48.6% and shot velocity is 2.96x that of the baselines, and on a real robot the average shooting error from 3 m is 0.73-0.86 m with a post-contact ball velocity of 13.10 m/s.

Abstract

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. The framework uses a single human-kick reference as a scaffold and progressively shifts optimization towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower than prior-work baselines and a shot velocity 2.96x that of those baselines. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains average target shooting errors of 0.73 m and 0.86 m from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: this https URL.

2606.11074 2026-06-11 cs.CL cs.AI version update

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du, Yuntao Wang, Zhou Su

Affiliations: Xi'an Jiaotong University; Beihang University

AI summary: Introduces explicit personality conditioning in vision-language models and builds a systematic evaluation framework covering single-personality, multi-personality, and personality-switching settings; finds that personality prompts improve image captioning but harm precise reasoning tasks, and observes balancing and residual effects in multi-trait composition and dynamic switching.

Comments
16 pages, 4 figures, 10 tables

Abstract

With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

2606.11042 2026-06-11 cs.AI 版本更新

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Workflow-GYM:面向真实世界专业领域的长周期计算机使用代理任务评估

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

Affiliations: ByteDance Seed; M-A-P Humanlaya

AI summary: Proposes the Workflow-GYM benchmark, which evaluates AI agents on long-horizon, high-value workflows in professional software; even the strongest model achieves a success rate only slightly above 30%, exposing serious weaknesses of current agents in maintaining long-horizon workflow consistency.

Abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

2606.11021 2026-06-11 cs.DL cs.CY version update

Making a Name for Myself: On Academic Naming Policies and their Impact

A Pranav, Vagrant Gautam, Martin Mundt, Jordan Taylor, Arjun Subramonian, Franziska Sofia Hafner, Daniel Chechelnitsky, William Agnew, Anne Lauscher

AI summary: Using mixed methods (surveys, interviews, and large-scale citation analysis of eight major computer science venues from 2019-2025), studies how name change policies affect citation accuracy and scholars' mental health; finds that visible name change policies significantly reduce citation errors and that deadnaming of transgender researchers decreased by 92% between 2019 and 2024.

Comments
Accepted at FAccT 2026. This version has corrected some typos

Abstract

In academic publishing, names connect scholars to their work. When scholars change their names, including for marriage, academic recognition, or gender transition, they may lose credit for past publications. However, despite significant impacts on citation accuracy and researcher well-being, no existing studies examine how naming policies in computer science serve researchers who change their names. We use a mixed-methods approach combining surveys, interviews, and large-scale citation analysis of papers from eight major computer science venues from 2019-2025. We document the multi-year advocacy effort that established the first name change policies and identify implementation barriers, including incomplete publisher updates and months-long processing delays. Researchers continue to be cited with misparsed and incorrect names despite publisher updates. When these citation errors happen, interviewees report significant mental health impacts, including stress, anxiety, and safety risks. Empirically, we find that venues with accessible and visible name change policies have significantly fewer citation errors than venues with inaccessible policies (899 vs. 996 errors per 1,000 papers). Our annotation analysis shows that deadnaming of transgender researchers in citations decreased by 92% from 2019 to 2024. Our findings demonstrate the importance of inclusive publishing policies, for which name change policy advocacy led by trans researchers has been a significant driver. We recommend that venues adopt proactive, visible name change policies, support queer advocacy groups, and improve publication infrastructure to build an inclusive publishing landscape. The accompanying toolkit for checking errors in bibliographic LaTeX files is available here: this https URL.

2606.10982 2026-06-11 cs.DC version update

FairWave: A Fairness-Aware Asynchronous DAG-BFT Consensus

Syariful Mujaddiq

AI summary: Proposes the FairWave protocol, whose dual-channel design separates anchor selection from reward distribution to address the fairness trilemma that arises when asynchronous BFT is combined with PoS, achieving a low Gini coefficient and resistance to rich-get-richer dynamics.

Comments
20 pages, 36 figures, preprint version

Abstract

Combining asynchronous Byzantine Fault Tolerant (BFT) consensus with Proof-of-Stake (PoS) creates a trilemma between Sybil resistance, reward distribution fairness, and protection against persistent plutocracy. Existing DAG-BFT approaches (Narwhal+Tusk, Bullshark, and Mysticeti) prioritize liveness over the fairness implications of stake-based selection, resulting in persistent longitudinal centralization. FairWave is a dual-channel DAG BFT protocol that separates anchor selection from reward distribution. The selection channel is super-linear in stake, guaranteeing Sybil gain < 1 for all split factors K > 1. The reward channel is sub-linear, using square-root stake normalization to mitigate rich-get-richer dynamics. The finalized DAG structure provides deterministic uptime and latency factors, allowing honest validators to agree on operational quality without any external oracle. To avoid circular dependency between selection outcomes and selection weights, reputation is used in a lagged form: the active value at epoch e equals the prior epoch's final value. We derive closed-form constraints for both channels and validate them through nine empirical analyses (approximately 550,000 Monte Carlo rounds) against eight baselines. FairWave achieves a Gini coefficient of 0.149 (vs. Pure-PoS's 0.488), a monotone HHI reduction from 0.039 to 0.021 over 50,000 epochs, an optimal-adversary Sybil split of K* = 1, and a success-rate coefficient of variation of 5.2% under +/-25% input perturbation. Safety (agreement and validity) is a formal consequence of the 2f+1 strong-support commit rule, holding unconditionally for f < n/3; the empirical differential is the monotone-continuous liveness-degradation curve, which decreases from 99.6% commit rate at b=0.20 to 71.1% at the theoretical bound b=1/3 without the discontinuous cliff characteristic of view-change-driven leader-BFT.
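Two quantities in this abstract lend themselves to a short numerical illustration: the Sybil gain under a super-linear selection weight, and the Gini coefficient used to report reward inequality. The sketch below is plain Python under stated assumptions. The power-law form `stake ** alpha` with `alpha = 1.5` is a hypothetical example of a super-linear weight (the abstract does not give FairWave's actual function), and the stake values are made up.

```python
def selection_weight(stake, alpha=1.5):
    """Hypothetical super-linear selection weight: stake ** alpha with alpha > 1."""
    return stake ** alpha

def sybil_gain(stake, k, alpha=1.5):
    """Total selection weight after splitting a stake into k equal identities,
    relative to the weight of the unsplit stake. Equals k ** (1 - alpha)."""
    return k * selection_weight(stake / k, alpha) / selection_weight(stake, alpha)

def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    rank_weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * rank_weighted) / (n * total) - (n + 1) / n

stakes = [1, 4, 9, 16, 100]
linear_rewards = [float(s) for s in stakes]  # reward proportional to stake
sqrt_rewards = [s ** 0.5 for s in stakes]    # square-root stake normalization

gain_when_split_in_four = sybil_gain(100.0, 4)
gini_linear = gini(linear_rewards)
gini_sqrt = gini(sqrt_rewards)
```

With this power-law weight, splitting a stake into four identities halves its total selection weight, so the best split is no split at all, in line with the reported $K^* = 1$. On the toy stakes, square-root rewards lower the Gini coefficient from about 0.65 to 0.40.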

2606.10968 2026-06-11 cs.LG cs.AI version update

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu

Affiliations: Tencent Hunyuan

AI summary: To address the position-agnostic nature of PPO-style trust regions in autoregressive generation, proposes CPPO, which dynamically adjusts token-level constraints through position-weighted thresholds and a cumulative prefix budget, improving training stability and reasoning accuracy.

Comments
Project Page: this https URL

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
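The two coupled mechanisms can be sketched as a token-level masking rule. The Python below is a hypothetical illustration only: the linear threshold schedule, the additive budget, and the names `base_threshold` and `prefix_budget` are assumptions made for the example, not CPPO's actual formulas, which the abstract does not specify.

```python
def cppo_style_mask(token_divergences, base_threshold=0.2, prefix_budget=1.0):
    """Return a 0/1 mask over tokens; 1 means the token's update is kept.

    Hypothetical rule: the per-token threshold grows linearly with position
    (strict early, loose late), and a token is also masked once the divergence
    accumulated along its prefix, including itself, exceeds prefix_budget.
    """
    n = len(token_divergences)
    mask = []
    spent = 0.0
    for t, divergence in enumerate(token_divergences):
        threshold = base_threshold * (t + 1) / n
        within_threshold = divergence <= threshold
        within_budget = spent + divergence <= prefix_budget
        mask.append(1 if within_threshold and within_budget else 0)
        # Assumed design choice: prefix drift accumulates even for masked tokens.
        spent += divergence
    return mask

divergences = [0.1, 0.04, 0.1, 0.1]
loose_budget_mask = cppo_style_mask(divergences)
tight_budget_mask = cppo_style_mask(divergences, prefix_budget=0.3)
```

With the default budget the mask is `[0, 1, 1, 1]`: the same divergence of 0.1 is rejected at the first position but accepted later. Tightening the budget to 0.3 gives `[0, 1, 1, 0]`, because the accumulated prefix drift leaves no allowance for the last token.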

2606.10820 2026-06-11 cs.LG cs.AI cs.CL 版本更新

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

K-Forcing:通过前推语言建模进行联合下一K词解码

Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang

发表机构 * DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出K-Forcing范式,通过前推映射将自回归模型蒸馏为单次前向传播生成多个未来词,实现2.4-3.5倍加速,质量损失小。

详情
Comments
Code: this https URL
AI中文摘要

自回归语言建模是文本生成的主导范式,但其逐词顺序解码使得推理受限于内存且效率低下。现有的加速方法(如推测解码和扩散语言模型)在特定条件下可提升速度,但并未直接解决高负载批量服务——这一对工业级部署最为关键的场景。我们提出K-Forcing,一种用于联合下一k词解码的前推语言建模范式。K-Forcing将现有自回归模型蒸馏为条件前推映射——该映射在单次前向传播中将独立均匀噪声变量转换为多个未来词的联合样本。该设计保留了固定长度输出,复用了自回归教师模型的主干,并与标准自回归服务基础设施兼容。我们通过渐进式自强迫蒸馏训练该映射,逐步扩展预测窗口,同时使学生模型紧密匹配自回归教师模型的序列分布。我们在LM1B和OpenWebText上使用标准因果Transformer主干评估K-Forcing。当激进配置为每次前向传播生成k=4个词时,K-Forcing在不同批量大小下实现约2.4-3.5倍加速,同时相对于自回归教师模型仅带来轻微的质量下降。随着推理在现代LLM的生命周期计算成本中占据主导地位,K-Forcing为在现实高负载部署下加速自回归生成提供了一条有前景的途径。

英文摘要

Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.

2606.10813 2026-06-11 cs.CR cs.CL 版本更新

RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

RedAct: 为程序技能保护而编辑智能体能力痕迹

Shuwen Xu, Zhitao He, Yi R. Fung

AI总结 提出RedAct框架,通过定位保护关键信息、重写痕迹并嵌入行为水印,将技能转移率降至无技能基线以下,同时保留审计证据。

详情
AI中文摘要

用户依赖执行痕迹来观察智能体行为、诊断故障并确保问责。这些痕迹包含丰富的程序细节,包括工具调用、中间决策和错误恢复逻辑。然而,这些细节可能暴露私有的程序技能,使下游方法能够在没有模型权重或技能文件的情况下恢复关键公式、阈值和策略。为了量化这种风险并评估保护措施,我们构建了\textsc{CapTraceBench},一个包含75个专业长时任务和154个跨七个领域精选技能的基准。我们还引入了\textsc{RedAct}(https://github.com/...),一个受保护的痕迹发布框架,该框架定位受保护的关键信息,重写痕迹同时保留验证者关键证据,并嵌入行为水印用于下游溯源分析。在代表性的痕迹重用方法中,\textsc{RedAct}将归一化技能转移(NST)从原始痕迹的44.7--67.1%降至低于无技能基线,同时保留审计证据。其独立的行为水印实现了93.6--100.0%的真实检测率,误报率最多为1.9%。这些结果将公共智能体痕迹视为安全接口,并表明选择性编辑可以在不删除审计证据的情况下减少程序能力泄露。

英文摘要

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct \textsc{CapTraceBench}, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce \textsc{RedAct} this https URL, a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, \textsc{RedAct} reduces normalized skill transfer (NST) from 44.7--67.1\% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6--100.0\% true detection with a false alarm rate of at most 1.9\%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

2606.10804 2026-06-11 cs.CV 版本更新

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

SCAIL-2:通过端到端上下文条件统一受控角色动画

Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

发表机构 * Z.ai Tsinghua University(清华大学)

AI总结 提出SCAIL-2框架,通过端到端上下文条件统一受控角色动画,绕过中间表示直接利用驱动视频,并合成MotionPair-60K数据集,采用上下文掩码和模式RoPE实现统一,结合Bias-Aware DPO减少误差,显著优于现有方法。

详情
AI中文摘要

受控角色动画需要将运动从驱动序列转移到参考角色。先前的工作严重依赖中间表示,包括用于表示运动的姿态骨架或用于表示环境的掩码背景,这不可避免地导致信息损失。为了解决这个问题,我们提出了SCAIL-2,一个绕过这些中间表示并实现\textbf{端到端}角色动画的框架。通过将驱动视频直接连接到序列,模型可以从输入视频中获得所有所需的视觉信息。为了解决缺乏端到端数据的问题,我们通过解耦条件统一角色动画的子任务,然后策划一个流程来合成MotionPair-60K,一个包含角色动画异构任务的端到端运动转移数据集。为了实现统一,我们利用上下文掩码条件和模式特定的RoPE作为文本指令和原始视觉信息之外的软引导。为了解决详细区域的合成差异,我们提出了Bias-Aware DPO来构建偏好项目以减轻误差。大量实验表明,我们的方法在各种角色动画任务中显著优于现有的最先进方法。合成数据的一个大子集以及模型权重将在我们的项目页面发布:this https URL。

英文摘要

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves \textbf{end-to-end} character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: this https URL.

2606.10794 2026-06-11 cs.AI 版本更新

READER: Robust Evidence-based Authorship Decoding via Extracted Representations

READER: 基于提取表示的鲁棒证据作者身份解码

Jiaxu Liu, Sunnan Mu, Dong Huang, Liuyin Wang, Jing Shao, Jie Zhang

发表机构 * National University of Singapore(新加坡国立大学) Xidian University(西安电子科技大学) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 针对黑盒LLM来源识别问题,提出READER框架,通过冻结代理LLM读取隐藏作者证据,利用贝叶斯证据累积实现多查询归因,在Agent500数据集上显著优于基线方法。

详情
AI中文摘要

随着智能体应用越来越多地通过官方和第三方LLM API路由用户任务,来源成为一个操作性问题:哪个模型生成了给定的黑盒响应?我们研究动态黑盒LLM来源识别:从由查询变化、非预定义提示(而非固定输入集或基准套件)引发的生成中识别源LLM。这种设置很困难,因为提示语义主导文本,而模型特定的作者痕迹在表面层面是微弱且不一致的。我们引入READER(基于提取表示的鲁棒证据作者身份解码),一种轻量级来源框架,将冻结的代理LLM视为隐藏作者证据的读取器。READER将黑盒输出映射到代理激活空间,在时间上过滤每个响应中的令牌状态,并通过跨独立采样提示求和单响应对数后验证据来执行贝叶斯证据累积。这避免了提示特定表示的脆弱平均池化,同时保留了校准置信度所需的查询级证据。在Agent500(一个基于智能体风格提示构建的50目标数据集)上,READER从单个响应达到31.0%-42.4%的top-1准确率,从50个响应达到70.0%-84.0%的准确率,显著优于句子编码器指纹。跨九个代理读取器的扩展进一步表明,更强的LLM暴露更多线性可解码的作者身份结构,表明作者身份感知已经存在于冻结的LLM表示中,并且可以转化为可靠的多查询归因。

英文摘要

As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches $31.0$-$42.4\%$ top-1 accuracy from a single response and $70.0$-$84.0\%$ from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.
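The accumulation step described above (summing single-response log-posterior evidence across prompts) can be sketched in a few lines. This is an illustration under stated assumptions, not READER's code: the function names are hypothetical, the per-response log-posteriors are assumed to come from some upstream classifier over candidate source models, and a uniform prior with independently sampled responses is assumed so that summing log-posteriors is a valid way to combine them.

```python
import math


def accumulate_log_evidence(log_posteriors):
    """Combine per-response log-posteriors into one normalized posterior.

    log_posteriors: list of rows, one row per response, each row holding
    log p(model_k | response) for every candidate model k.
    """
    n_models = len(log_posteriors[0])
    totals = [0.0] * n_models
    for row in log_posteriors:
        for k, lp in enumerate(row):
            totals[k] += lp  # evidence adds up in log space
    # Normalize with log-sum-exp for numerical stability.
    peak = max(totals)
    log_z = peak + math.log(sum(math.exp(t - peak) for t in totals))
    return [math.exp(t - log_z) for t in totals]


def attribute(log_posteriors):
    """Return (index of most likely source model, its posterior probability)."""
    posterior = accumulate_log_evidence(log_posteriors)
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior[best]


# Three responses, each only weakly favouring candidate 1.
weak = [math.log(0.3), math.log(0.4), math.log(0.3)]
best, confidence = attribute([weak, weak, weak])
```

Each response alone gives candidate 1 a probability of 0.4; after three responses the combined posterior is higher, which is the qualitative effect the abstract reports when moving from a single response to 50 responses.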

2606.10775 2026-06-11 cs.CV 版本更新

Spatially Selective Self-Training for Unsupervised Building Change Detection

空间选择性自训练用于无监督建筑变化检测

Wafaa I. M. Hussin, Zhi Lu, Anas M. I. Mohammed, Xiang Zhou, Ratiba A. H. Abubaker, Zhenming Peng

发表机构 * School of Information and Communication Engineering, University of Electronic Science and Technology of China(电子科技大学信息与通信工程学院) Chengdu Yaguang Electronic Co., Ltd.(成都亚光电子股份有限公司) Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China(电子科技大学智能协同计算实验室) School of Civil Engineering, University of Khartoum(喀土穆大学土木工程学院) National Energy Research Center, Ministry of Higher Education and Scientific Research(高等教育与科学研究部国家能源研究中心)

AI总结 提出SST-CD框架,利用空间选择性自训练和局部一致性准则,从无标签双时相遥感图像中学习建筑变化检测器,在三个数据集上超越现有无监督方法。

详情
Comments
Under Review
AI中文摘要

无监督建筑变化检测旨在从未标记的双时相遥感图像中学习建筑变化掩膜。现有的无标签方法通常遵循差异到掩膜范式,直接使用时相差异、冻结的基础模型响应、基于提示的输出或后处理结果作为最终变化图。尽管这些策略提供了无标注线索,但它们并未学习任务特定的建筑变化检测器,并且仍然容易受到通用时相差异与建筑定义的结构变化之间的差距的影响。在实践中,这种差异通常是嘈杂且与任务无关的,因为外观变化、配准误差和非建筑修改可能产生强烈但误导性的响应。为了解决这个问题,我们提出了SST-CD,一种空间选择性自训练框架,将完全无标签的建筑变化检测重新表述为在嘈杂伪监督下的端到端检测器学习。SST-CD使用时相差异作为候选伪标签,并仅在空间可靠像素上训练检测器,其可靠性通过局部一致性准则估计,该准则从监督中过滤不一致区域。为了进一步稳定嘈杂的自训练,一个轻量级特征适配器重新校准双时相特征,而基于原型的解码器产生紧凑的变化和无变化表示。在LEVIR-CD、WHU-CD和DSIFN-CD上的实验表明,SST-CD分别达到了83.08%、91.69%和86.60%的F1分数,优于现有的无监督和无标签基线。代码将公开提供。

英文摘要

Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines.

2606.10725 2026-06-11 cs.LG cs.CL 版本更新

Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Pre-AF 13:从出院报告中挖掘的可解释房颤风险评分

Olga Shakhmatova, Dmitrii Kriukov, Daniil Larionov, Nikita Khromov, Iaroslav Bespalov, Alexander Zolotarev, Kirill Grishchenkov, Ekaterina Ivanova, Miron Kuznetsov, Ilya Sochenkov, Elizaveta Panchenko, Artem Shelmanov, Dmitry V. Dylov

发表机构 * National Medical Research Center of Cardiology named after Academician E.I. Chazov(以E.I. Chazov院士命名的国家心脏病学医学研究中心) Skolkovo Institute of Science and Technology (Skoltech)(斯科尔科沃科学技术研究所) Artificial Intelligence Research Institute (AIRI)(人工智能研究所) University of Mannheim(曼海姆大学) Russian Center for Scientific Information (RCSI)(俄罗斯科学信息中心) Institute of Cyber Intelligence Systems, National Research Nuclear University MEPhI(网络智能系统研究所,国家研究核大学MEPhI) M.V. Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学) Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)(俄罗斯科学院信息传输问题研究所(Kharkevich研究所)) Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)(俄罗斯科学院伊万尼科夫系统编程研究所) Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences (FRC CSC RAS)(俄罗斯科学院联邦研究中心“计算机科学与控制”) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 利用NLP从出院报告中提取特征,构建可解释ML模型预测心血管病患者房颤风险,Pre-AF 13模型优于现有临床评分。

详情
Comments
O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Correspondence to: Olga Shakhmatova < this http URL [at] this http URL > and Dmitry V. Dylov < this http URL [at] this http URL >
AI中文摘要

背景:房颤(AF)是最常见的心律失常,也是预后的主要决定因素。现有的AF风险评分依赖于在心血管疾病(CVD)患者中几乎普遍存在的因素(如高龄、高血压),因此在该高风险群体中提供的分层有限。大多数评分针对长期(5-10年)而非中期预测。我们开发了可解释的ML模型,利用常规收集的医院数据预测CVD患者在24个月和整个随访期间内的AF风险。方法:对俄罗斯国家心脏病学研究中心电子健康记录进行单中心回顾性研究,纳入2012年1月至2019年5月期间多次住院、年龄≥18岁、患有CVD但无既往AF的患者。自定义NLP流水线将非结构化出院报告转化为73个结构化特征,结合基于规则的解析器和基于Transformer的命名实体识别。使用LightAutoML构建了完整模型(73个特征)、简单模型(简化子集)以及用于床旁风险评分的线性模型。性能通过ROC AUC评估,并与CHARGE-AF、C2HEST、MHS和HAVOC进行比较,并通过SHAP进行解释。结果:在来自45,000名患者的80,576份记录中,17,562份符合纳入标准;其中1,438名(8.19%)发生AF。完整模型在24个月和整个随访期间的ROC AUC分别为0.735和0.696;简单模型几乎相同(0.725和0.696)。所有非线性模型均优于四个临床风险评分(ROC AUC 0.53-0.64)。简单模型使用13个特征,命名为Pre-AF 13。SHAP识别出年龄和左心房容积为主要预测因子。线性风险评分(Pre-AF 9)将观察到的24个月AF发生率从约7%分层至36%。结论:基于常规收集的EHR数据构建的可解释ML模型能够识别高AF风险的CVD患者,优于现有的临床风险评分。

英文摘要

Background. Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and a major determinant of prognosis. Established AF risk scores rely on factors (older age, hypertension) nearly ubiquitous among patients with cardiovascular disease (CVD), offering limited stratification in this high-risk group. Most target long-term (5-10 year) rather than medium-term prediction. We developed interpretable ML models predicting AF risk over a 24-month and entire follow-up horizon in CVD patients using routinely collected hospital data. Methods. Single-center retrospective study of electronic health records from the National Research Cardiology Center (Russia) for patients aged >=18 with CVD but without pre-existing AF, hospitalized more than once between January 2012 and May 2019. A custom NLP pipeline transformed unstructured discharge reports into 73 structured features, combining a rule-based parser with transformer-based NER. Using LightAutoML we built a full model (73 features), a simple model (reduced subset), and a linear model for a bedside risk score. Performance was assessed by ROC AUC, compared with CHARGE-AF, C2HEST, MHS, and HAVOC, and interpreted via SHAP. Results. Of 80,576 records from 45,000 patients, 17,562 met inclusion criteria; 1,438 (8.19%) developed AF. The full model reached ROC AUC 0.735 (24-month) and 0.696 (entire follow-up); the simple model was nearly identical (0.725, 0.696). All non-linear models outperformed the four clinical risk scores (ROC AUC 0.53-0.64). The simple model uses 13 features and is named Pre-AF 13. SHAP identified age and left atrial volume as dominant predictors. A linear risk score (Pre-AF 9) stratified observed 24-month AF incidence from ~7% to 36%. Conclusion. Interpretable ML models built from routinely collected EHR data identify high-AF-risk CVD patients, outperforming established clinical risk scores.

2606.10639 2026-06-11 cs.RO 版本更新

Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters

面向敏捷目标拦截的升力翼四旋翼平面扇形视线制导

Linkai Liu, Kun Yang, Han Zou, Chen Min, Shuli Lv, Shuai Wang, Quan Quan

发表机构 * School of Automation Science and Electrical Engineering, Beihang University(北京航空航天大学自动化科学与电气工程学院) Research and Development Department, China Academy of Launch Vehicle Technology(中国运载火箭技术研究院研发部)

AI总结 提出平面扇形视线(PS-LOS)制导框架,通过非对称约束释放机动性,使升力翼四旋翼在仅用单目相机的情况下实现远程自主拦截敏捷目标,实验验证了高达138米距离的成功拦截。

详情
Comments
Accepted to the IEEE International Conference on Robotics and Automation (ICRA 2026). Recipient of the ICRA 2026 Best Paper Award in Field and Service Robotics
AI中文摘要

由于目标运动不可预测、感知受限以及目标可见性与拦截器机动性之间的强耦合,对敏捷空中目标的自主视觉拦截具有挑战性。大多数现有的捷联相机拦截方法使用锥形视线(LOS)约束来保持目标靠近图像中心,从而保证可见性。虽然安全,但这种对称约束不必要地限制了机动性,并可能显著减少可用于追击的推力。受激进FPV飞行员不在所有图像方向上保持相等可见性裕度的观察启发,本文提出了一种平面扇形视线(PS-LOS)制导框架,用于仅配备捷联单目相机的升力翼四旋翼的自主拦截。PS-LOS严格约束横向图像误差,同时放松纵向图像误差在安全的视场裕度内,在保持可见性的同时释放机动性以进行加速密集型追击。在升力翼四旋翼模型下,PS-LOS在LOS方向附近提供的可用推力比传统锥形LOS约束多近50%。为了实现无需直接深度测量的仅视线拦截,为升力翼四旋翼开发了延迟补偿状态估计框架和非线性制导与控制架构。广泛的外场飞行实验证明了在真实风扰动下,对具有大幅、高频和不可预测运动的敏捷目标的自主拦截。所提出的系统在高达138米的距离上实现了成功拦截,并在整个交战过程中保持连续视觉跟踪。结果验证了PS-LOS作为一种保持可见性、感知机动性的制导框架,用于远程视觉拦截敏捷空中目标。

英文摘要

Autonomous visual interception of agile aerial targets is challenging due to unpredictable target motion, limited sensing, and the strong coupling between target visibility and interceptor maneuverability. Most existing strapdown-camera interception methods preserve visibility using conic line-of-sight (LOS) constraints that keep the target near the image center. While safe, such symmetric constraints unnecessarily restrict maneuverability and can significantly reduce the usable thrust for pursuit. Motivated by the observation that aggressive FPV pilots do not maintain equal visibility margins in all image directions, this paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance framework for autonomous interception using a lifting-wing quadcopter equipped with only a strapdown monocular camera. PS-LOS tightly constrains lateral image error while relaxing longitudinal image error within a safe field-of-view margin, preserving visibility while releasing maneuverability for acceleration-intensive pursuit. Under the lifting-wing quadcopter model, PS-LOS provides nearly 50% more available thrust near the LOS direction than conventional conic LOS constraints. To realize LOS-only interception without direct depth measurements, a delay-compensated state-estimation framework and a nonlinear guidance-and-control architecture are developed for lifting-wing quadcopters. Extensive outdoor flight experiments demonstrate autonomous interception of agile targets exhibiting large-amplitude, high-frequency, and unpredictable motion under real wind disturbances. The proposed system achieves successful interceptions at ranges up to 138 m while maintaining continuous visual tracking throughout the engagement. The results validate PS-LOS as a visibility-preserving, maneuverability-aware guidance framework for long-range visual interception of agile aerial targets.

2606.10615 2026-06-11 cs.CR 版本更新

Two-Way Confidential VMs (2cVM): Collaborative Confidential Computing for Mutually Distrustful Parties

双向机密虚拟机 (2cVM):面向互不信任方的协作机密计算

Jordi Thijsman, Merlijn Sebrechts, Stefan Lefever, Filip De Turck, Bruno Volckaert

AI总结 提出2cVM双层架构,结合硬件可信执行环境与工作负载内隔离,通过承诺清单实现策略不可变,提供可验证的隐私保护协作计算,性能开销取决于内存访问模式。

详情
Comments
Accepted for publication in IEEE Access
AI中文摘要

跨组织的协作计算通常受限于需要处理敏感数据和专有代码,同时避免将其暴露给不可信的基础设施或参与者。全同态加密和安全多方计算等密码学方法提供了强机密性,但由于其极高的计算成本,对于通用工作负载仍不实用。我们提出了双向机密虚拟机(2cVM),一种双层架构,将硬件可信执行环境与工作负载内隔离层配对。与常规机密虚拟机不同,2cVM 强制共驻工作负载之间的相互隔离,确保参与者对其数据和代码保持控制。2cVM 中的所有计算由一份承诺清单管理,该清单列举了参与者、组件组成、允许的数据通道和授权输出;清单被锁定到虚拟机并纳入证明证据,使得策略在虚拟机整个生命周期内不可变且可独立验证。一个概念验证实现结合了用于硬件保护的 AMD SEV-SNP 和用于参与者代码细粒度沙箱化的 WebAssembly 组件模型。在四个基准类别的商用硬件上的评估表明,两个隔离层不会线性累积:一旦工作负载在 WebAssembly 沙箱内执行,启用硬件内存保护的边际成本很小。开销取决于工作负载,主要由内存访问模式决定,范围从顺序工作负载的可忽略不计到不规则、指针追逐访问模式的约 2 倍。这些结果表明,2cVM 为隐私保护协作计算提供了实用且可验证的基础。

英文摘要

Collaborative computation across organizations is often constrained by the need to process sensitive data and proprietary code without exposing them to untrusted infrastructure or participants. Cryptographic approaches such as fully homomorphic encryption and secure multi-party computation provide strong confidentiality but remain impractical for general workloads due to their extreme computational cost. We present the Two-Way Confidential Virtual Machine (2cVM), a two-layer architecture that pairs a hardware trusted execution environment with an intra-workload isolation layer. Unlike regular Confidential Virtual Machines, 2cVM enforces mutual isolation between co-resident workloads, ensuring that participants retain control over their data and code. All computation in 2cVM is governed by a Commitment Manifest that enumerates participants, component composition, permitted data channels, and authorized outputs; the manifest is locked to the VM and incorporated into attestation evidence, making the policy immutable and independently verifiable throughout the VM's lifetime. A proof-of-concept realization combines AMD SEV-SNP for hardware protection with the WebAssembly Component Model for fine-grained sandboxing of participant code. Evaluation on commodity hardware across four benchmark classes shows that the two isolation layers do not accumulate linearly: once a workload executes inside the WebAssembly sandbox, the marginal cost of enabling hardware memory protection is small. Overhead is workload-dependent, governed primarily by memory access pattern, ranging from negligible for sequential workloads to approximately 2x for irregular, pointer-chasing access patterns. These results indicate that 2cVM provides a practical and verifiable foundation for privacy-preserving collaborative computation.

2606.10546 2026-06-11 cs.MA 版本更新

SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement

SkillAxe: 通过评估引导的自我精炼提升LLM编写的智能体技能

Srishti Gautam, Arjun Radhakrishna, Sumit Gulwani

AI总结 提出SkillAxe框架,通过无监督评估引导LLM自我诊断和精炼技能,在SkillsBench上提升通过率28%,缩小与人类技能的差距47-67%。

详情
Comments
9 pages, under review
AI中文摘要

技能文档是指导大型语言模型(LLM)智能体的结构化自然语言指令,对现代智能体框架至关重要,但LLM难以编写实际可用的技能。在SkillsBench上,人类编写的技能将通过率提高了16.2个百分点,而LLM编写的技能没有带来可衡量的提升。我们引入了SkillAxe,一个完全无监督的框架,使LLM能够迭代地诊断和精炼自己的技能。SkillAxe将技能质量分解为四个可解释的维度(质量影响、触发精度、指令合规性与故障归因、解决方案路径覆盖),生成结构化的改进简报,无需真实标签、测试套件或环境奖励。在SkillsBench上,SkillAxe相对于未改进的LLM技能将通过率提高了28%,并缩小了与人类技能差距的47-67%。我们在SpreadsheetBench上验证了该方法作为持续改进引擎的效果,其中SkillAxe构建的技能库从过去的智能体轨迹中学习,仅使用22个技能就将通过率从16.0%提高到52.0%。

英文摘要

Skill documents, structured natural-language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human-authored skills improve pass rates by 16.2 percentage points, while LLM-authored skills provide no measurable gain. We introduce SkillAxe, a fully unsupervised framework that enables LLMs to iteratively diagnose and refine their own skills. SkillAxe decomposes skill quality into four interpretable dimensions (quality impact, trigger precision, instruction compliance with fault attribution, and solution-path coverage), producing structured improvement briefs that require no ground-truth labels, test suites, or environment rewards. On SkillsBench, SkillAxe improves pass rates by 28\% relative over unimproved LLM skills and closes 47--67\% of the gap to human-authored skills. We validate the approach as a continuous improvement engine in the wild on SpreadsheetBench, where a SkillAxe-built skill library learns from past agent trajectories and raises pass rate from 16.0\% to 52.0\% using only 22 skills.

2606.10508 2026-06-11 cs.CR cs.NI 版本更新

A Deployment-Oriented Framework for Explainable AI-Assisted eBPF/XDP Mitigation at the IoT Edge

面向部署的可解释AI辅助eBPF/XDP缓解框架在物联网边缘的应用

Abdurrahman Tolay

AI总结 提出一种基于Linux的物联网边缘网关框架,结合资源感知的AI风险评分、事件级可解释性和eBPF/XDP限界缓解,实现可部署的异常流量管控。

详情
Comments
59 pages, 2 figures, 12 tables. Conceptual framework and research agenda for explainable AI-assisted eBPF/XDP mitigation at the IoT edge. Corrected truncated abstract metadata
AI中文摘要

物联网部署结合了异构、资源受限的设备,这些设备具有弱安全配置、暴露的服务、有限的日志记录、补丁约束和长生命周期。基于签名和阈值的控制仍然是有用的基线,但在动态物联网网络中作为独立机制是不够的。同样,离线人工智能基准性能本身并不能建立操作可部署性。本文提出了一个概念框架和研究议程,用于基于Linux的物联网边缘网关,该网关结合了资源感知的流级AI辅助风险评分、事件级可解释性以及通过eBPF/XDP的限界缓解。控制器应用可逆的、时间受限的操作,受关键设备保护措施约束,更新数据包级执行状态,并记录结构化日志。该架构将用户空间中的复杂推理和策略控制与内核中简洁的数据包处理决策分离。它还定义了一条未来的硬件感知评估路径,涵盖检测质量、资源成本、响应时间、回滚行为和合法流量保留。本文不报告新的实验测量结果,也不声称实测优越性或已完成的实时性能。

英文摘要

Internet of Things (IoT) deployments combine heterogeneous, resource-constrained devices with weak security configurations, exposed services, limited logging, patching constraints, and long lifecycles. Signature- and threshold-based controls remain useful baselines, but they are insufficient as standalone mechanisms in dynamic IoT networks. Likewise, offline artificial intelligence (AI) benchmark performance alone does not establish operational deployability. This article presents a conceptual framework and research agenda for a Linux-based IoT edge gateway that combines resource-aware flow-level AI-assisted risk scoring, event-level explainability, and bounded mitigation through eBPF/XDP. The controller applies reversible, time-limited actions subject to critical-device safeguards, updates packet-level enforcement state, and records structured logs. The architecture separates complex reasoning and policy control in user space from concise packet-handling decisions in the kernel. It also defines a future hardware-aware evaluation pathway covering detection quality, resource cost, response timing, rollback behaviour, and legitimate-traffic preservation. The paper does not report new experimental measurements or claim measured superiority or completed real-time performance.

2606.10401 2026-06-11 cs.CV 版本更新

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

CoCoSI: 面向空间智能的协作认知地图构建

Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

发表机构 * Shanghai Jiao Tong University(上海交通大学) Cornell University(康奈尔大学)

AI总结 提出一种即插即用的多智能体框架,通过协作构建结构化认知地图作为空间记忆,无需修改架构或额外训练即可增强预训练多模态大模型的空间理解能力。

详情
AI中文摘要

空间智能是多模态大语言模型(MLLMs)的一个关键前沿,使其能够从视觉体验中推理物理世界。受人类空间认知启发,最近的方法从多帧视觉输入构建基于网格的认知地图,以随时间维持连贯的空间表示。然而,有限的上下文长度仍然挑战空间理解,而现有方法如长上下文建模和外部记忆通常需要架构更改、记忆模块或微调,限制了其对现成预训练MLLMs的适用性。这促使我们提出一种轻量级、模型无关的方法,以在原生上下文窗口之外保留空间信息。为此,我们提出一个即插即用的多智能体框架,协作构建认知地图作为结构化空间记忆,无需架构修改或额外训练即可增强任意预训练MLLMs的空间理解。我们的框架具有局部-全局智能体协调、原子提交的认知地图构建以及跨智能体验证的特点。大量实验表明,我们的方法在空间理解任务上取得了优越性能,同时完全无需训练。代码将发布。

英文摘要

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.

2606.10360 2026-06-11 cs.SD 版本更新

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

ViP-VL:基于向量量化学习的越南语自监督语音预训练模型

Khanh Le, Kiet Anh Hoang, Bao Nguyen, Duy Vo, Dung Vo, Thai Tran, Linh Pham, Khoa D Doan

AI总结 提出ViP-VL模型,通过声学堆叠、感受野对齐和掩码选择策略,在BEST-RQ框架上实现高效自监督预训练,在越南语ASR、情感识别、方言分类和说话人验证四项任务上取得最优结果。

详情
Comments
Accepted to INTERSPEECH 2026
AI中文摘要

我们提出了ViP-VL,一种高效的越南语自监督语音预训练模型,利用向量量化学习。为了弥合高分辨率音频与高效处理之间的差距,ViP-VL在ChunkFormer架构中引入了声学堆叠和感受野对齐,实现了同步的8倍下采样率,同时通过在BEST-RQ框架上的预训练中采用专门的掩码选择策略,进一步增强了表示的鲁棒性。在17,000小时未标注的越南语语音上预训练后,我们的模型在自动语音识别、语音情感识别、方言分类和说话人验证四个主要下游任务上建立了新的最优结果。为了促进未来研究和高性能越南语语音技术的发展,我们公开发布了预训练权重和实现:this http URL。

英文摘要

We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment to enable a synchronized 8x subsampling rate within the ChunkFormer architecture, while further enhancing representation robustness through a specialized Mask Selection Strategy during pretraining on the BEST-RQ framework. Pretrained on 17,000 hours of unlabeled Vietnamese speech, our model establishes new state-of-the-art results across four major downstream tasks: Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at this http URL.

2606.10198 2026-06-11 cs.LG cs.AI cs.CV 版本更新

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

密度脊选择性预测:校准标签稀缺下的大语言模型与视觉语言模型幻觉检测

Nina I. Shamsi

AI总结 针对校准标签稀缺时大语言模型和视觉语言模型的幻觉检测问题,提出基于核密度估计的密度脊方法,利用隐藏状态生成轨迹的六维运动特征图构建响应流形,通过到最近脊顶点的欧氏距离评分,在标签稀缺协议下AUROC提升5-20点。

详情
AI中文摘要

大语言模型和视觉语言模型中的幻觉检测日益被框架化为选择性预测,其中检测器分配置信度分数并在置信度低时弃权。无监督采样检测器(Semantic Entropy, EigenScore)避免标签但质量停滞,而有监督探针(SAPLMA)获得更强的分布内分数,但在校准标签稀缺时性能急剧下降。我们将大语言模型的响应流形恢复为基于隐藏状态生成轨迹的六维运动特征图的核密度估计的密度脊。测试生成通过其投影特征点到最近脊顶点的欧氏距离的负值进行评分,从而得到随机输出分布的低维几何骨架。我们在七个问答基准(HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA)上,使用九个文本和视觉大语言模型,在刻意标签稀缺协议($n_{\text{cal}}{=}200$ 查询,$N{=}5$ 生成)下,与Semantic Entropy、SAR、EigenScore、SAPLMA和对数概率进行评估。我们的基于脊的分数在AUROC上以5-20个百分点的优势获胜,同时在校准标签稀缺下表现出温和的性能下降。

英文摘要

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while degrading only moderately under calibration-label scarcity.
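The scoring rule stated in the abstract (negated Euclidean distance from a projected feature point to the nearest ridge vertex) is simple enough to sketch directly. This sketch assumes the ridge vertices have already been estimated and are given as six-dimensional points. The names `ridge_score` and `should_abstain`, the abstention threshold, and the toy vertices are hypothetical and are not taken from the paper.

```python
import math


def ridge_score(point, ridge_vertices):
    """Negated Euclidean distance from `point` to its nearest ridge vertex.

    Scores closer to zero mean the generation lies nearer the ridge,
    which is read here as higher confidence.
    """
    return -min(math.dist(point, vertex) for vertex in ridge_vertices)


def should_abstain(score, threshold):
    """Selective prediction: abstain when the score falls below `threshold`.

    The threshold is a hypothetical tuning parameter, e.g. chosen on the
    small calibration set.
    """
    return score < threshold


# Toy ridge with two vertices in the six-dimensional feature space.
ridge = [(0.0,) * 6, (10.0, 0.0, 0.0, 0.0, 0.0, 0.0)]
score = ridge_score((3.0, 4.0, 0.0, 0.0, 0.0, 0.0), ridge)
```

Here the test point is 5 units from the first vertex and farther from the second, so the nearest-vertex distance determines the score; with a threshold of -1.0 this generation would be rejected.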

2606.10135 2026-06-11 cs.CV cs.AI Version update

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

BiWM:利用双向自回归推进开源交互式视频世界模型

Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

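AI summary: Proposes the BiWM framework, which converts a pretrained video backbone into an interactive world model under a bidirectional autoregressive paradigm. It needs only two training stages (fine-tuning plus distribution matching distillation), supports multiple model scales and long-horizon generation, and outperforms existing causal pipelines.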

Details
Comments
After the paper was posted, we discovered that several visualization results were produced using wrong configuration settings during runtime. This error affects the reliability of the presented visual comparisons. Additionally, further optimization of the design is needed. We therefore request to withdraw this version and will submit a corrected and improved version later
AI Chinese abstract

将双向视频扩散模型过渡到自回归范式提高了视频世界模型的交互性,但现有的因果流水线需要多个阶段(控制微调、自回归训练、因果初始化、少步蒸馏),并且由于误差累积,质量仍落后于双向模型。最近的世界模型如Yume-1.5和Matrix-Game-3.0采用双向自回归方法,通过自我纠正误差传播获得保真度和稳定的长程展开,但开源框架(如minWM)仅支持因果模型。我们提出BiWM,这是首个在双向自回归范式下用于交互式视频世界模型的全栈框架,联合优化生成质量和推理速度。从预训练视频骨干开始,BiWM通过微调注入相机控制,然后运行几步分布匹配蒸馏(DMD)阶段,将骨干转化为动作/相机可控的世界模型:仅需两个训练阶段(而非minWM的四个),在8xH200 GPU上几百步内收敛。单一方案覆盖Wan2.1-1.3B、Wan2.2-5B、HunyuanVideo-1.5-8B和LTX-2.3-22B,并支持现有双向模型的二次微调。BiWM实现了minWM失去可控性的真实相机控制,集成了可插拔历史压缩(FramePack风格和PackForcing风格)用于长程展开,并提供可选的NVFP4 4位训练/推理流水线。为对抗DMD的模式寻求退化,我们添加了GAN和覆盖前向KL目标,以保留场景动态。我们开源BiWM,用于资源受限的研究和高保真环境模拟。

English abstract

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
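The contrast between mode-seeking and mass-covering objectives mentioned in this abstract can be illustrated with a toy discrete example. The sketch below assumes the standard reading that reverse KL, KL(q || p), tolerates dropped modes while forward KL, KL(p || q), penalizes them heavily. The three distributions are invented for illustration and have nothing to do with BiWM's actual training losses.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy bimodal target p and two candidate models q (all values made up).
p = [0.49, 0.02, 0.49]
q_collapsed = [0.96, 0.02, 0.02]   # sits on one mode, drops the other
q_covering = [0.34, 0.32, 0.34]    # spreads mass over both modes

# Forward KL(p || q): mass-covering, punishes the dropped mode.
forward_collapsed = kl(p, q_collapsed)
forward_covering = kl(p, q_covering)

# Reverse KL(q || p): mode-seeking, prefers the collapsed model here.
reverse_collapsed = kl(q_collapsed, p)
reverse_covering = kl(q_covering, p)

print("forward KL prefers covering:", forward_covering < forward_collapsed)
print("reverse KL prefers collapsed:", reverse_collapsed < reverse_covering)
```

On these toy numbers the two divergences rank the candidates in opposite order, which is the intuition behind adding a forward-KL term to a reverse-KL style distillation objective.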

2606.10120 2026-06-11 cs.IR cs.AI cs.HC Version update

MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention

MetaPlate: 反事实引导的RAG-LLM工具用于个性化食物推荐和高血糖预防

Asiful Arefeen, Carol Johnston, Hassan Ghasemzadeh

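AI summary: Proposes the MetaPlate framework, which combines counterfactual explanations, machine-learning prediction, and a RAG-LLM to generate personalized meal recommendations that prevent postprandial hyperglycemia; an evaluation by registered dietitians demonstrates its feasibility and effectiveness.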

Details
AI Chinese abstract

餐后高血糖是代谢紊乱的关键风险因素;然而,现有的饮食指导通常是静态的、不切实际的且个性化不足,提供的建议难以遵循或效果不佳。尽管最近的进展利用连续血糖监测(CGM)和机器学习来预测血糖反应,但这些方法主要是预测性的,缺乏可操作的指导。此外,推荐系统常常与用户目标不一致,且需要大量输入。我们提出了MetaPlate,一个反事实解释(CF)引导的、上下文感知的决策支持框架,用于生成个性化膳食建议,以减轻健康成年人的餐后血糖波动。MetaPlate整合了多模态数据,包括来自25名个体的CGM读数、可穿戴设备衍生的生理信号以及用户提供的膳食输入,以建模餐前上下文。一个机器学习模型预测血糖反应,而CF优化模块通过调整膳食组成(修改宏量营养素数量)来维持血糖水平在目标范围内(≤140 mg/dL)。基于LLM的检索增强生成(RAG)层通过使用USDA食品数据库的约束搜索生成人类可读的建议,增强了可解释性。我们通过结构化的专家在环评估,与注册营养师(RDs)一起评估MetaPlate,比较提示优化前后的性能。结果显示,在膳食真实性、份量适宜性和推荐可能性方面有所改进,专家反馈表明从临床不可行的输出转向了可操作、上下文适宜的建议。我们的发现强调了领域知识和结构化约束在LLM驱动系统中的重要性,并突出了MetaPlate作为实时个性化膳食决策支持工具的潜力。

English abstract

Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from $25$ individuals, to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition by modifying macronutrient amounts to maintain glucose levels within a target range ($\leq 140$ mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.
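The counterfactual step described in this abstract (adjust macronutrient amounts until the predicted glucose response falls within the target range) can be sketched as a simple search loop. Everything below is a hypothetical stand-in: the linear predictor and its coefficients are invented and have no clinical meaning, the greedy carbohydrate reduction is one assumed search strategy among many, and only the 140 mg/dL ceiling comes from the abstract.

```python
TARGET_MG_DL = 140.0  # target ceiling quoted in the abstract


def predict_peak_glucose(meal, baseline=95.0):
    """Toy linear stand-in for the learned predictor (coefficients invented)."""
    return (baseline
            + 0.9 * meal["carbs_g"]
            - 0.2 * meal["protein_g"]
            - 0.1 * meal["fat_g"])


def counterfactual_meal(meal, step_g=5.0, max_steps=100):
    """Greedily lower carbohydrates until the prediction meets the target.

    Returns the adjusted meal, or None if no feasible meal is found.
    """
    candidate = dict(meal)  # copy so the original meal is not mutated
    for _ in range(max_steps):
        if predict_peak_glucose(candidate) <= TARGET_MG_DL:
            return candidate
        if candidate["carbs_g"] <= 0.0:
            break  # nothing left to reduce
        candidate["carbs_g"] = max(0.0, candidate["carbs_g"] - step_g)
    return candidate if predict_peak_glucose(candidate) <= TARGET_MG_DL else None


# Hypothetical meal, in grams of each macronutrient.
original = {"carbs_g": 80.0, "protein_g": 20.0, "fat_g": 10.0}
adjusted = counterfactual_meal(original)
print(original, predict_peak_glucose(original))
print(adjusted, predict_peak_glucose(adjusted))
```

A real counterfactual module would typically search over all macronutrients at once and prefer the smallest change to the original meal; the greedy loop here only shows the shape of the constraint.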

2606.10046 2026-06-11 cs.SD cs.AI Version update

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

潜流内部:音频分离基础模型中注意力动力学的因果解读

Yuxuan Chen, Haoyuan Yu, Peize He

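AI summary: Uses a causal-intervention protocol to reveal the dual-pathway attention mechanism of flow-matching Transformers in audio separation, and proposes LSAC, a training-free acceleration method that cuts self-attention computation by about 25% while preserving quality.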

Details
AI Chinese abstract

流匹配变压器实现了强大的音频分离,但其注意力动力学是不透明的。我们将已建立的因果干预原则适应为SAM Audio的确定性推理时探测协议。正交探测揭示了一种双路径文本条件机制:加法注入控制语义身份,而交叉注意力细化声学结构。我们观察到异步逐层收敛:稳定层早期构建时间支架,而快速层在采样过程中继续解决伪影。该模型还减弱时间分割线索以维持连续流稳定性。利用这些见解,我们提出了层选择性注意力缓存(LSAC),一种无训练加速方法,在稳定层中缓存注意力。在各种声学复杂度下,LSAC将自注意力计算减少约25%,质量损失可忽略,并且与朴素步长减少相比,质量保持率高达6.7倍。

English abstract

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
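The idea of caching attention only in stable layers can be sketched with a toy sampling loop that counts attention evaluations. This is an illustration under stated assumptions, not the LSAC implementation: the class `CachedAttentionStack`, the scalar stand-in for attention, the choice of which layer is stable, and the policy of caching after the first step without refreshing are all hypothetical.

```python
class CachedAttentionStack:
    """Toy layer stack that reuses attention outputs in designated stable layers."""

    def __init__(self, num_layers, stable_layers, attention_fn):
        self.num_layers = num_layers
        self.stable_layers = set(stable_layers)
        self.attention_fn = attention_fn
        self.cache = {}
        self.calls = 0  # number of real attention evaluations

    def attention(self, layer, step, x):
        # Stable layers reuse the output cached at an earlier sampling step.
        if layer in self.stable_layers and layer in self.cache:
            return self.cache[layer]
        out = self.attention_fn(layer, step, x)
        self.calls += 1
        if layer in self.stable_layers:
            self.cache[layer] = out
        return out

    def run(self, x, num_steps):
        self.cache.clear()
        self.calls = 0
        for step in range(num_steps):
            for layer in range(self.num_layers):
                x = x + self.attention(layer, step, x)  # toy residual update
        return x


def toy_attention(layer, step, x):
    """Scalar placeholder for a self-attention call."""
    return 0.01 * (layer + 1)


baseline = CachedAttentionStack(4, stable_layers=[], attention_fn=toy_attention)
cached = CachedAttentionStack(4, stable_layers=[0], attention_fn=toy_attention)
baseline.run(0.0, num_steps=4)
cached.run(0.0, num_steps=4)
print(baseline.calls, cached.calls)
```

The saving in this toy depends entirely on how many layers are marked stable and how many sampling steps are run; it is not meant to reproduce the figure reported in the abstract.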