arXivDaily: arXiv daily academic digest, updated Monday through Friday
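2606.11934 2026-06-11 cs.NI New submission

Exploratory Analysis of Wi-Fi 6 Dynamic Resource Unit Sharing in Small-Scale Network Scenarios

Sai Mada, Anna Baron, Luigi Martino, Rute C. Sofia

AI summary: To address the limitations of static RU scheduling under dynamic traffic, the paper proposes a dynamic RU allocation algorithm that maps TSN traffic classes to Wi-Fi 6 QoS mechanisms; simulations show lower latency, jitter, and packet loss than static schemes.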

Abstract

This paper investigates dynamic Resource Unit (RU) allocation strategies for Wi-Fi 6 (IEEE 802.11ax) networks integrated with Time-Sensitive Networking (TSN), targeting the limitations of static RU scheduling under dynamic traffic conditions. We propose a dynamic RU allocation algorithm that maps TSN traffic classes to Wi-Fi 6 Quality of Service (QoS) mechanisms, including Enhanced Distributed Channel Access (EDCA), and aligns TSN control with Ethernet-based TSN domains. The proposed solution is evaluated using the ns-3 DetNetWiFi framework developed by fortiss, focusing on time-sensitive traffic. Simulation results demonstrate improved network efficiency with reductions in latency, jitter, and packet loss compared to static RU allocation schemes. These findings highlight the potential of dynamic RU allocation to support deterministic communication requirements in Wi-Fi 6-based TSN deployments and to enhance the reliability of hybrid industrial networks.

2606.11931 2026-06-11 cs.CL New submission

Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model

Meherun Farzana, Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan

Affiliations: Computer Science and Engineering, University of Dhaka

AI summary: For the low-resource language Bangla, the paper proposes a bilingual evaluation system built on a fine-tuned lightweight language model that grades answers by semantic correctness rather than lexical overlap, achieving the best performance in both synthetic and human evaluations.

Comments
10 pages, 5 figures, 2 tables. Preprint
Abstract

Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.
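The two agreement figures quoted in the abstract above (rho and MAE) can be illustrated with a minimal sketch. It assumes that rho denotes Spearman's rank correlation, which the abstract does not state explicitly. The helper names and the two score lists are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    """Average absolute difference between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical human and model scores for five answers (illustrative only).
human = [2.0, 4.0, 5.0, 7.0, 9.0]
model = [2.5, 3.5, 6.0, 6.5, 9.5]
print(spearman_rho(human, model), mean_absolute_error(human, model))
```

The two measures capture different things: rho rewards getting the ordering of answers right, while MAE penalizes the size of each scoring error. A grader can therefore rank answers perfectly and still be off by a fraction of a point on every item, as in the toy data here.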

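2606.11930 2026-06-11 cs.HC cs.AI cs.CV New submission

Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

AI summary: To tackle high-dimensional multimodal learning with limited labeled data in asynchronous video interviews, the paper uses frozen multimodal encoders (CLIP, Whisper, RoBERTa, and others) with low-capacity downstream models, reducing MSE by 19.1% on the personality prediction task and finding dataset shortcuts in cognitive ability prediction.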

Comments
9 pages, 1 figure, 4 tables
Abstract

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
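The 19.1% figure in the abstract above follows directly from the reported MSE values. The short sketch below reproduces the arithmetic for each ablation step. The function name `relative_reduction` is an illustrative choice, and only the four MSE values come from the abstract.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` improves on `baseline` (lower is better)."""
    return (baseline - value) / baseline


OFFICIAL_BASELINE_MSE = 0.3334  # reported in the abstract

# Validation MSE of each ablation step, as reported in the abstract.
ablation_steps = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

for name, mse in ablation_steps.items():
    pct = 100 * relative_reduction(OFFICIAL_BASELINE_MSE, mse)
    print(f"{name}: {pct:.1f}% below the official baseline")
# global model: 4.3% below the official baseline
# per-trait modeling: 13.9% below the official baseline
# per-trait late fusion: 19.1% below the official baseline
```

Read this way, most of the gain comes from switching to per-trait models (about 9.5 percentage points), with late fusion adding roughly 5 more.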

2606.11926 2026-06-11 cs.CL cs.AI New submission

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Microsoft Research

AI summary: Proposes the Arbor framework, which uses Hypothesis Tree Refinement (HTR) to run a long-horizon autonomous research loop; across six real tasks its average relative held-out gain is more than 2.5x that of Codex and Claude Code.

Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.

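2606.11925 2026-06-11 cs.CV cs.LG New submission

Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey

Affiliations: Peter Pazmany Catholic University, Faculty of Information Technology and Bionics; DeepSign Technologies Ltd.

AI summary: Proposes a corpus augmentation method for sign language translation that needs no extra annotation or generative models: it extracts sign clips via CTC forced alignment, generates sentences with an LLM, and stitches the clips into videos, improving BLEU-4 by 2.92 over the GFSLT-VLP baseline. It also finds that synthetic data harms vision-language pretraining while still improving the downstream task.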

Abstract

Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at this https URL.

2606.11922 2026-06-11 cs.SD cs.AI New submission

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

Hemansh Shridhar, Miika Toikkanen, June-Woo Kim

Affiliations: RSC LAB, MODULABS; Department of Electronic Engineering, Wonkwang University; AI Convergence Research Institute, Wonkwang University

AI summary: To address the insensitivity of AST models to localized abnormal patterns in respiratory sound classification, the paper proposes spectral-aware layer regularization on a state space model together with Dual-Axis Patch-Mix contrastive learning, reaching a score of 64.48% on the ICBHI benchmark, 5% above the AST baseline.

Comments
Accepted to Interspeech 2026
Abstract

Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves 64.48% score, outperforming the AST baseline by 5%. Code is available at this https URL.

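2606.11918 2026-06-11 cs.AI New submission

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas

AI summary: Proposes a self-supervised reinforcement learning framework that aligns the intrinsic spatial reasoning ability of pre-trained models through geometric and semantic consistency verifiers (for example, flipping the image or swapping the order of objects in the text), approaching the accuracy of supervised methods without labeled data.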

Abstract

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
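The idea of a consistency verifier described in the abstract above can be sketched for one simple case: a left/right question asked on an image and on its horizontally flipped copy. The names `FLIP_MAP`, `expected_after_flip`, and `flip_consistency_reward` are hypothetical. This is an illustration of the general idea under that assumption, not the paper's verifier and not OT-GRPO.

```python
# Hypothetical mapping: how a correct answer should change when the image
# is mirrored horizontally. Answers not listed here are left unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


def expected_after_flip(answer):
    """Answer a consistent model should give on the horizontally flipped image."""
    return FLIP_MAP.get(answer, answer)


def flip_consistency_reward(answer_original, answer_flipped):
    """Return 1.0 if the two answers agree under the flip, else 0.0.

    No ground-truth label is used: the reward only checks that the pair
    of answers is mutually consistent with the transformation.
    """
    if expected_after_flip(answer_original) == answer_flipped:
        return 1.0
    return 0.0


print(flip_consistency_reward("left", "right"))   # consistent pair
print(flip_consistency_reward("left", "left"))    # inconsistent pair
print(flip_consistency_reward("above", "above"))  # flip-invariant answer
```

A reward of this kind cannot tell whether either answer is correct, only whether the pair is coherent, which is why the abstract frames it as alignment of existing capabilities rather than as injecting new knowledge.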

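2606.11916 2026-06-11 cs.SE cs.AI New submission

Characterizing Software Aging in GPU-Based LLM Serving Systems

Domenico Cotroneo, Bojan Cukic

AI summary: Proposes an empirical methodology for studying software aging in GPU-based LLM serving systems; a 216-hour campaign finds significant memory aging in every deployment, with leak rates strongly tied to the serving runtime and configuration, and provides a reproducible framework.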

Comments
7 pages
Abstract

This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.

2606.11915 2026-06-11 cs.SD cs.AI New submission

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

Yoon Tae Kim, Heejoon Koo, Miika Toikkanen, June-Woo Kim

Affiliations: RSC LAB, MODULABS, Republic of Korea; Department of Electronic Engineering, Wonkwang University, Republic of Korea; AI Convergence Research Institute, Wonkwang University, Republic of Korea

AI summary: Proposes QLung, a quality-adaptive angular-margin learning framework that derives a no-reference audio quality margin from spectral entropy and root-mean-square energy and adaptively scales the angular margins to improve feature generalization; it improves ICBHI performance by 2.46% and achieves the best out-of-distribution performance on SPRSound.

Comments
Accepted to Interspeech 2026
Abstract

We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at this https URL.
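The two signal descriptors named in the abstract above, spectral entropy and root-mean-square energy, can be computed as in the minimal sketch below. It uses a naive discrete Fourier transform from the standard library and normalizes the entropy to the range 0 to 1, which is an assumption made here. How QLung combines these quantities into a margin is not specified in the abstract and is not reproduced.

```python
import cmath
import math


def rms_energy(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def spectral_entropy(signal):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1].

    Values near 0 mean the energy sits in a few frequency bins (tonal);
    values near 1 mean it is spread evenly across bins (noise-like).
    """
    n = len(signal)
    power = []
    # Naive O(n^2) DFT over the non-negative frequency bins.
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(power))


# Illustrative signals: a pure tone and a single-sample impulse.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
impulse = [1.0] + [0.0] * 63
print(rms_energy(tone), spectral_entropy(tone), spectral_entropy(impulse))
```

The pure tone concentrates its power in one bin and scores close to 0, while the impulse has a flat spectrum and scores 1. A real pipeline would normally use an FFT over short frames rather than a whole-signal DFT.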

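2606.11914 2026-06-11 eess.SP cs.LG New submission

NARRAS: Edge-Triggered Distributed Inference for CSI-Based Localization in Vehicular IoT Networks

Rodrigo Oliver, Ricardo Vazquez Alvarez, Alejandro Lancho, Stefano Rini

AI summary: To address resource constraints in CSI-based localization with distributed antenna arrays, the paper proposes NARRAS, an edge-triggered distributed inference policy in which each array decides locally whether to report its observation; differentiable activity penalties and channel-chart regularization enforce the budget, and localization accuracy improves at low activity rates.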

Comments
10 pages, 5 figures, 5 tables. Under review at the IEEE Internet of Things Journal
Abstract

CSI-based localization with spatially distributed antenna arrays exposes a basic resource trade-off. Each array can provide a rich view of the channel, but forwarding observations from all arrays to a fusion center is wasteful when only a few carry useful information, and the shared uplink supports only a limited number of simultaneous transmissions. We let each array decide locally whether its current observation is worth reporting, subject to a budget on the average number of active transmitters. We refer to this abstraction as Edge-Triggered Distributed Inference (ETDI). It captures a broader class of task-oriented communication problems where resource-constrained devices share an access channel for a common inference task. We instantiate ETDI for CSI-based localization, a common scenario in vehicular IoT networks. Spatially distributed remote antenna arrays (RAAs) encode local channel state information (CSI) from user equipment (UE) transmissions into latent features, and the fusion center estimates the UE position from the subset of reported features. We propose NARRAS, a decentralized reporting policy in which each RAA combines a recurrent summary of its recent observations with a memory of the last latent it transmitted. Training controls an explicit activity budget through differentiable activity penalties and validation-calibrated deterministic thresholds, and uses channel-chart regularization to shape the latent geometry. Experiments show that, at comparable uplink activity, NARRAS improves localization accuracy over learned and heuristic sparse-reporting strategies, while dense full-report models remain useful budget-free references. In low-activity regimes, chart regularization further reduces high-percentile localization errors, suggesting that geometry-aware latent representations are more robust under sparse reporting.

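2606.11913 2026-06-11 cs.CV New submission

From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu

AI summary: Proposes encoding a long video as a Neural Knowledge Representation (NKR): Agentic Knowledge Distillation (AKD) automatically synthesizes descriptions and question-answer pairs and embeds the video's knowledge in a small set of weights attached to the VLM backbone, giving lightweight, reusable video understanding that needs no video reloading at inference and sharply reduces latency.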

Abstract

We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individual small portion of network weights attached to the VLM backbone. The NKR weights are optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.

2606.11911 2026-06-11 stat.ML cs.LG math.AT New submission

From Persistence to Survival: Hypothesis Testing, Effect Sizes and Vectorisation for Topological Features

Juliette Murris, Bernadette Stolz, Karsten Borgwardt

AI summary: Proposes STRAND, which treats persistence diagrams as survival data and uses the persistence survival function to unify hypothesis testing, effect-size computation, and vectorisation, validated on synthetic data and real benchmarks.

Abstract

Persistence diagrams are common representations in topological data analysis, but they do not naturally live in a vector space, and the statistical tools developed for comparing them have largely evolved separately from those used for downstream prediction. We introduce STRAND (Survival Topological Representation ANalysis of Diagrams), which treats (collections of) PDs as survival data: each topological feature with persistence value $p = d - b$ is a fully observed time-to-event, and the persistence survival function $S(t) = \mathbb{P}(p > t)$ is the central object for comparing diagrams. From this single representation we derive (i) a non-parametric two-sample test with calibrated Type I error and high power from a small number of diagrams; (ii) interpretable effect sizes; and (iii) a 1-Wasserstein-stable feature vector for downstream machine learning. We validate calibration and power on synthetic manifolds with controlled topology, demonstrate competitive vectorisation across 14 graph and 3D point cloud benchmarks, and apply the method to study functional brain connectivity in fMRI/neuroscience data. To our knowledge, STRAND is the first method to provide hypothesis testing and vectorisation for persistence diagrams from a single coherent and interpretable representation.
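The persistence survival function $S(t) = \mathbb{P}(p > t)$ defined in the abstract above has a straightforward empirical estimate: the fraction of features whose persistence exceeds $t$. The sketch below shows that estimate on a toy diagram. The function names and the birth/death pairs are hypothetical, and the paper's two-sample test, effect sizes, and feature vector are not reproduced.

```python
def persistence_values(diagram):
    """Persistence p = d - b for each (birth, death) pair in the diagram."""
    return [death - birth for birth, death in diagram]


def survival_function(diagram, t):
    """Empirical S(t): fraction of features with persistence strictly above t."""
    values = persistence_values(diagram)
    return sum(1 for p in values if p > t) / len(values)


def survival_curve(diagram, thresholds):
    """Evaluate the empirical survival function on a grid of thresholds."""
    return [survival_function(diagram, t) for t in thresholds]


# Toy persistence diagram as (birth, death) pairs (illustrative values only).
diagram = [(0.0, 0.5), (0.25, 0.5), (0.5, 1.5), (0.5, 0.625)]

print(persistence_values(diagram))                     # [0.5, 0.25, 1.0, 0.125]
print(survival_curve(diagram, [0.0, 0.2, 0.5, 2.0]))   # [1.0, 0.75, 0.25, 0.0]
```

Evaluating the curve on a fixed grid, as in `survival_curve`, turns a diagram of any size into a fixed-length vector, which is one simple way to see why a survival-function view lends itself to vectorisation.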

2606.11910 2026-06-11 cs.CL New submission

An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination

Xu Li, Shuqi Tian, Xun Han, Kuncheng Zhao, Xinyi Li

Affiliations: Southwest Petroleum University; Sichuan Police College

AI summary: Proposes OMAGR, an ontology-guided framework that decomposes queries into anchors and runs parallel graph retrieval to resolve the multi-dimensional retrieval bottleneck, improving Context Precision and Faithfulness on the TrafficLaw-QA dataset.

Comments
Submitted to ICONIP. 15 pages, 3 figures
Abstract

Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented generation methods suffer from a multi-dimensional retrieval bottleneck: single axis architectures compress complex legal queries into a single pathway, causing interdependent statutory dimensions to be overlooked. To address this, we propose OMAGR, an ontology-guided framework that decomposes queries into ontology-aligned anchors and executes parallel graph retrieval across each dimension, ensuring independent retrieval across dimensions before fusion. To evaluate the proposed method, we created the TrafficLaw-QA dataset, an expert-validated benchmark dataset containing 200 questions and 527 legal provisions. Results show that TrafficOmni-RAG outperforms baselines on Context Precision and Faithfulness metrics. The findings demonstrate that parallel multi-anchor retrieval effectively resolves the multi-dimensional retrieval bottleneck, offering a promising direction for traffic law liability determination research.

2606.11909 2026-06-11 cs.AI New submission

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma

Affiliations: QiYuan Lab; School of Information and Software Engineering, University of Electronic Science and Technology of China; Beijing University of Posts and Telecommunications; School of Computer Science and Engineering, Northeastern University; School of Computer Science and Engineering, Beihang University

AI summary: Proposes Embodied-BenchClaw, an autonomous system coordinated by a five-stage pipeline and three agents that automatically builds verifiable, executable, maintainable, and diagnostically useful embodied spatial intelligence benchmarks with less manual effort.

Abstract

Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

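2606.11907 2026-06-11 cs.IR New submission

Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation

Ziyu Song, Jiaming Fang, Kuangyu Li, Tuo Xia, Chuanpeng Wang

AI summary: To address the failure of fixed Top-K retrieval under query-dependent, heavy-tailed similarity distributions, the paper proposes the TAA-k framework, whose localized extreme-value-theory validation strategy yields efficient, stable query-adaptive truncation, reaching near-oracle retrieval quality on three datasets with large efficiency gains.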

Comments
First two authors contributed equally. Accepted at ECML PKDD 2026
英文摘要

Adaptive context selection is critical for retrieval-augmented generation (RAG) systems, as fixed Top-K retrieval fails under query-dependent and heavy-tailed similarity distributions. While Extreme Value Theory (EVT) offers a principled framework for adaptive truncation, existing approaches apply EVT globally across the entire ranked list, incurring prohibitive computational costs and statistical instability. We propose Tail-Aware Adaptive-k (TAA-k), a training-free framework that operationalizes EVT through a localized validation strategy. The key insight is that ranked similarity curves exhibit a characteristic steep-flat-steep pattern reflecting a transition from relevance-dominated to noise-dominated regimes. TAA-k exploits this geometric structure via knee detection to identify a compact candidate region, then applies EVT-based goodness-of-fit testing within this window to validate the onset of tail behavior. This coarse-to-fine design reduces computational complexity from $O(N^2 M)$ to $O(\sqrt{N \log N} \cdot M)$ while maintaining statistical rigor. Under mild monotone likelihood ratio assumptions, TAA-k yields a stable, query-adaptive cutoff corresponding to the earliest noise-dominated position. Experiments on WebQuestions, 2WikiMultiHopQA, and MuSiQue demonstrate that TAA-k achieves near-oracle retrieval quality (F1 within 2-3% of oracle) with orders-of-magnitude efficiency gains over global EVT methods, while maintaining robustness across embedding models and compression dimensions.
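
The coarse stage described in this abstract (locate a knee in the ranked similarity curve, then restrict later testing to a small window around it) can be illustrated with a minimal sketch. This is not the authors' detector: the chord-distance heuristic, the function names `knee_index` and `candidate_window`, the `half_width` parameter, and the toy scores are all assumptions made for illustration, and the EVT goodness-of-fit stage is omitted entirely.

```python
def knee_index(scores):
    """Index of the point that lies farthest below the chord joining
    the first and last scores. `scores` must be sorted descending.
    (Chord-distance heuristic: an assumption, not the paper's method.)"""
    n = len(scores)
    if n < 3:
        return max(n - 1, 0)
    first, last = scores[0], scores[-1]
    best_i, best_gap = 0, 0.0
    for i, s in enumerate(scores):
        chord = first + (last - first) * i / (n - 1)  # straight-line value at rank i
        gap = chord - s                               # how far the curve dips below it
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def candidate_window(scores, half_width=2):
    """Half-open rank range [lo, hi) around the knee; a finer test
    (EVT-based in the paper) would only need to examine these ranks."""
    k = knee_index(scores)
    lo = max(0, k - half_width)
    hi = min(len(scores), k + half_width + 1)
    return lo, hi


# Toy ranked similarities: a few relevant hits, then a flat noise floor.
scores = [0.92, 0.90, 0.88, 0.55, 0.31, 0.30, 0.29, 0.28, 0.27, 0.26]
print(knee_index(scores), candidate_window(scores))
```

On the toy list the knee lands at rank 4, where the curve flattens into the noise floor, so a validation step would only need to inspect ranks 2 to 6 rather than the whole list.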

2606.11906 2026-06-11 cs.CL 新提交

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

语言何时重要?多语言指令揭示视觉-语言-动作模型中的逐步语言敏感性

Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本研究通过将LIBERO基准翻译成十种语言,首次系统评估了VLA模型的多语言鲁棒性,发现非英语指令下成功率下降30-50%,并基于步骤级语言敏感性提出推理时对齐干预,显著提升性能。

详情
Comments
Accepted to ACL 2026 Main Conference
AI中文摘要

视觉-语言-动作(VLA)模型在语言条件机器人操作中表现出强大性能,但其对语言变化的鲁棒性仍知之甚少。在这项工作中,我们通过将LIBERO基准翻译成十种语言,首次对VLA模型进行了系统的多语言评估,揭示了在非英语指令下性能严重下降,成功率下降30-50%。通过对任务执行的细粒度分析,我们发现语言影响在步骤间高度不均匀:某些步骤表现出强烈的语言依赖性并主导整体任务失败,而其他步骤则基本与语言无关。基于这一见解,我们提出了一种逐步推理时干预方法,根据步骤语言敏感性对齐表示,显著提高了语言变化下的性能。我们的结果表明,VLA模型中的语言鲁棒性本质上是一个逐步控制问题,突出了时间结构化分析对于可靠具身智能体的重要性。

英文摘要

Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.

2606.11903 2026-06-11 cs.SD 新提交

Snapping Matters: Context-Aware Onset Refinement for Automatic Music Transcription

Snapping Matters: 上下文感知的起始点细化用于自动音乐转录

Abhirup Saha, Hans-Ulrich Berendes, Meinard Müller, Ben Maman

AI总结 针对弱对齐的乐谱-音频数据,提出基于二分图匹配的上下文感知起始点细化方法,显著提升自动音乐转录的起始点对齐和转录精度。

详情
Comments
Published in International Computer Music Conference (ICMC) 2026
AI中文摘要

精确的音符级标注对于训练自动音乐转录(AMT)系统至关重要,尤其是音符起始点标签,它是许多现代AMT系统的核心组成部分。然而,真实世界录音的高质量标注非常稀缺。序列级乐谱-音频对齐方法(如动态时间规整)仅提供粗略对应,因此需要局部细化步骤。这个细化步骤称为snapping,它使用神经起始点后验图的峰值来调整对齐的乐谱起始点,并且通常决定了弱对齐的乐谱-音频对是否能够成为可用的训练数据。尽管具有实际重要性,snapping通常被视为简单的后处理启发式方法,并通过贪婪的局部决策实现。我们提出了用于训练乐器无关转录器的snapping策略的系统分析,证明了snapping对于从弱对齐数据学习至关重要。在此基础上,我们将snapping形式化为每个音高的分配问题,并通过二分图匹配解决,从而在重叠的细化窗口和不确定的初始对齐下做出上下文感知的起始点决策。在钢琴、室内乐和管弦乐录音上的广泛跨数据集实验表明,与贪婪snapping相比,起始点对齐和转录精度有所提高,并且随着snapping窗口变宽和初始对齐变粗糙,增益增加。定性示例见我们的项目页面:this https URL

英文摘要

Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level score-audio alignment methods such as dynamic time warping provide only coarse correspondence, making a local refinement step necessary. This refinement step, known as snapping, adjusts aligned score onsets using peaks in a neural onset posteriorgram and often determines whether weakly aligned score-audio pairs become usable training data at all. Despite its practical importance, snapping is typically treated as a simple post-processing heuristic and implemented with greedy local decisions. We present a systematic analysis of snapping strategies for training instrument-agnostic transcribers, demonstrating that snapping is essential for learning from weakly aligned data. Building on this, we formulate snapping as a per-pitch assignment problem and solve it via bipartite graph matching, yielding context-aware onset decisions under overlapping refinement windows and uncertain initial alignments. Extensive cross-dataset experiments across piano, chamber, and orchestral recordings show improved onset alignment and transcription accuracy over greedy snapping, with gains increasing for wider snapping windows and coarser initial alignments. Qualitative examples are provided on our project page: this https URL
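
To see why an assignment formulation can beat greedy snapping when refinement windows overlap, consider the toy sketch below for a single pitch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the absolute-time cost, the window value, and the example times are hypothetical, and the exact matching is found by brute-force enumeration (adequate only for tiny inputs) rather than a dedicated bipartite matching solver.

```python
from itertools import permutations


def greedy_snap(onsets, peaks, window):
    """Each score onset, in order, grabs the nearest unused peak within the window."""
    free = list(peaks)
    snapped = []
    for t in onsets:
        in_range = [p for p in free if abs(p - t) <= window]
        if in_range:
            best = min(in_range, key=lambda p: abs(p - t))
            free.remove(best)
            snapped.append(best)
        else:
            snapped.append(None)  # onset stays unrefined
    return snapped


def matched_snap(onsets, peaks, window):
    """Exact assignment by enumeration: match as many onsets as possible,
    then minimise the total time shift. Brute force, for illustration only."""
    slots = list(range(len(peaks))) + [None] * len(onsets)
    best_key, best = None, None
    for combo in permutations(slots, len(onsets)):
        pairs = [(t, peaks[j]) for t, j in zip(onsets, combo) if j is not None]
        if any(abs(p - t) > window for t, p in pairs):
            continue  # violates a refinement window
        key = (-len(pairs), sum(abs(p - t) for t, p in pairs))
        if best_key is None or key < best_key:
            best_key = key
            best = [None if j is None else peaks[j] for j in combo]
    return best


onsets = [1.00, 1.10]   # aligned score onsets (seconds), hypothetical
peaks = [0.80, 1.06]    # posteriorgram peak times (seconds), hypothetical
print(greedy_snap(onsets, peaks, window=0.25))
print(matched_snap(onsets, peaks, window=0.25))
```

In this example the greedy pass lets the first onset take the peak at 1.06 s, which leaves the second onset with no peak inside its window. The joint assignment instead gives 0.80 s to the first onset and 1.06 s to the second, so both onsets are refined.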

2606.11901 2026-06-11 cs.RO cs.AI 新提交

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

DuoBench: 一个可复现的双手操作基准,涵盖仿真与现实世界

Tobias Jülg, Seongjin Bien, Simon Hilber, Yannik Blei, Pierre Krack, Maximilian Li, Sven Parusel, Rudolf Lioutikov, Florian Walter, Wolfram Burgard

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Franka Robotics Technical University of Munich(慕尼黑工业大学)

AI总结 提出DuoBench,一个基于FR3 Duo平台的双手操作基准框架,包含11个任务和阶段式评估方案,用于诊断当前策略在双手协调、仿真到现实迁移等方面的失败模式。

详情
AI中文摘要

双手机器人系统极大地扩展了操作能力,但协调两只手臂引入了额外的控制复杂性和故障模式,现有基准未能很好地捕捉这些。我们介绍了DuoBench,一个针对FR3 Duo平台上的双手操作策略的可扩展基准框架。DuoBench包含跨越四个协调类别的十一个任务,在仿真中实现,并通过可复现的任务配方和3D打印资产部分地在现实世界中复现。此外,我们提出了一种基于阶段的评估方案,支持超出二元成功之外的细粒度语义故障分析,并为所有基准任务提供人类遥操作数据集。我们在仿真和真实硬件上对几种双臂模仿学习和视觉-语言-动作策略进行了基准测试。我们的结果表明,当前策略在双手操作中仍然面临挑战,特别是在早期交互阶段、并行手臂执行以及仿真与现实环境之间的迁移方面。DuoBench为诊断这些故障模式和研究未来的双臂策略学习方法提供了一个可复现的测试平台。代码、数据集和视频可在该https URL获取。

英文摘要

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at this https URL

2606.11898 2026-06-11 cs.CL cs.LG 新提交

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

GraspLLM: 面向文本属性图与LLM的零样本泛化

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang

发表机构 * Peking University(北京大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GraspLLM框架,通过融合图结构理解与LLM语义能力,利用基序感知对比学习和最优上下文子图对齐,实现跨数据集和跨任务的零样本泛化。

详情
AI中文摘要

近年来,对文本属性图(TAGs)的研究因其在引文网络、电子商务平台、社交媒体和网页等各类真实数据场景中的广泛应用而备受关注。受大语言模型(LLMs)卓越语义理解能力的启发,已有许多尝试将LLMs集成到TAGs中。然而,现有方法仍难以在不同图和任务间泛化,且其捕获可迁移图结构模式的能力有限。为此,我们提出了GraspLLM框架,该框架将图结构理解与LLM的语义理解能力相结合,以增强跨数据集和跨任务的泛化能力。具体而言,我们使用冻结的通用嵌入模型将不同图的节点文本表示在统一语义空间中,在此基础上,我们在多个基序诱导的邻接矩阵上进行基序感知对比学习,以提取与数据集无关的结构信息。然后,通过我们提出的最优上下文子图,为每个目标节点提取最相关的上下文子图,并通过对齐投影器将这些子图对齐到LLM的令牌空间。在涵盖不同领域的TAG基准数据集上的大量实验表明,GraspLLM始终优于先前基于LLM的TAG方法,尤其是在零样本场景下,突显了其在不同数据集和任务上的强泛化能力。我们的代码可在以下网址获取:此 https URL。

英文摘要

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines Graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.

2606.11897 2026-06-11 cs.CL 新提交

Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Notes2Skills: 从实验室笔记本到具有确定性意识的科学智能体技能

Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang

发表机构 * Southern University of Science and Technology(南方科技大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University College Dublin(都柏林大学学院)

AI总结 提出Notes2Skills框架,将实验室笔记转化为保留作者确定性的可验证科学智能体技能,解决不确定判断与确认结论混淆问题。

详情
Comments
28 pages, preprint
AI中文摘要

科学发现工作流程通常包含并严重依赖实验室笔记,研究人员在其中记录观察结果、解释不确定的结果并规划后续实验。这些信息丰富的实验室笔记保留了不断演变的科学推理和作者的不确定性,而不是出版物中展示的经过修饰的最终结果,为人工智能在更全面和更深层次上参与科学探索提供了宝贵机会。然而,大多数先前关于科学文本的工作集中在论文、协议或结构化数据库上,使得非正式的实验室笔记作为科学AI智能体的输入未被充分探索。这一差距很重要,因为实验室笔记通常在同一段落中混合了经过验证的观察结果、初步判断和可能的实验下一步。如果这些信号被混淆,AI智能体可能会将不确定的科学判断误认为是已确认的结论或可执行的行动。为此,我们提出了Notes2Skills,一个两阶段框架,用于将实验室笔记本转化为可验证的科学AI智能体技能,同时保留作者的不确定性。在七个条件和三个湿实验环节中,Notes2Skills是唯一既不会将不确定的笔记误认为是明确的指令,也不会丢弃明确指令的配置。我们表明,确定性保留是实验室笔记本与可靠智能体技能之间缺失的一环,为更安全的AI共同科学家系统开辟了一条道路。

英文摘要

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

2606.11896 2026-06-11 cs.HC 新提交

PAPEL: A Collaborative System for Parental Guidance during Preschool Play-Based English Learning

PAPEL:一种面向学前游戏英语学习的家长协作系统

Xutong Wang, Yu Mei, Qinwei Li, Muyu Liu, Xiwen Yao, Chang Liu, Zhoutong Ye, Jie Cai, Chun Yu, Yuanchun Shi

AI总结 针对家长在游戏式英语学习中面临的挑战,提出PAPEL系统,通过场景感知建议和四个核心模块(内容生成、语言适配、平衡评估、扩展回应),提升亲子互动质量。

详情
Comments
38 pages, 9 figures, 5 tables. Accepted to CSCW 2026 / To appear in Proceedings of the ACM on Human-Computer Interaction (CSCW 2026)
AI中文摘要

基于游戏的亲子互动为学龄前儿童提供了丰富的日常外语学习机会,但许多家长难以将开放式游戏转化为有效的家庭英语作为外语(EFL)学习体验。为探索AI如何支持这一过程,我们通过访谈和Wizard-of-Oz研究进行了形成性研究,确定了四个关键挑战:内容选择、语言表达、教学与游戏的平衡以及问题解决。为应对这些挑战,我们提出了PAPEL,一个家长-AI协作系统,它将建议扎根于当前游戏场景,并将支持组织为四个核心模块:内容生成、语言适配、平衡评估和扩展回应。在一项包含16对亲子对的平衡受试者内研究中,与研究中使用的轻量级聊天机器人基线相比,PAPEL与更多整合了游戏和教学内容的家长话语以及更多的亲子对话轮次相关。

英文摘要

Play-based parent-child interaction offers preschoolers rich opportunities for everyday foreign language learning, yet many parents struggle to turn open-ended play into effective English-as-a-Foreign-Language (EFL) learning experiences at home. To explore how AI might support this process, we conducted formative studies through interviews and a Wizard-of-Oz study. We identified four key challenges: content selection, language expression, balancing instruction and play, and problem solving. To address these challenges, we present PAPEL, a parent-AI collaborative system that grounds suggestions in the ongoing play scene and organizes support into four core modules: content generation, language adaptation, balance assessment, and extended response. In a counterbalanced within-subjects study with 16 parent-child dyads, PAPEL was associated with more integrated parent utterances that combined playful and instructional content, as well as more parent-child conversational turns, than the lightweight chatbot baseline used in our study.

2606.11894 2026-06-11 cs.CV 新提交

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Wild3R: 从无约束稀疏照片集合进行前馈式3D高斯泼溅

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

发表机构 * The University of Tokyo(东京大学)

AI总结 提出Wild3R,一种针对无约束稀疏照片集合的前馈式3D高斯泼溅方法,通过引入包含多样光照和瞬态物体的WildCity数据集,学习跨视角外观一致性并移除瞬态内容,性能优于现有前馈方法,与基于逐场景优化的方法相当。

详情
AI中文摘要

前馈式3D高斯泼溅(3DGS)消除了传统3DGS所需的耗时逐场景优化。然而,现有的前馈方法难以处理包含多样光照条件和瞬态物体的真实世界照片集合。在本文中,我们提出了Wild3R,一种针对无约束稀疏照片集合的前馈方法。主要瓶颈在于缺乏提供多视角、多种光照和瞬态变化的训练数据,而这些是学习鲁棒场景表示所必需的。为解决这一问题,我们引入了WildCity数据集,该数据集包含200个场景、170种光照条件和瞬态物体,总计337,500张图像。通过利用该数据集,我们的模型在参考视图条件下学习跨视角的外观一致性,同时移除瞬态内容。大量实验表明,我们的方法优于现有的前馈方法,并取得了与先前基于逐场景优化的方法相竞争的结果。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC 新提交

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐:基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

发表机构 * State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学通用人工智能国家重点实验室、智能科学与技术学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理与认知科学系) Microsoft Research Asia(微软亚洲研究院)

AI总结 研究通过fMRI信号增强大型语言模型推理能力,提出脑引导框架,在10个模型上实现最高13%的准确率提升。

详情
AI中文摘要

大型语言模型(LLMs)与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的,一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐,以及这些信号是否能够改进它们。在此,我们聚焦于演绎推理,表明LLM内部表征不仅与任务fMRI活动部分对齐,而且可以直接通过这些信号增强。使用神经预测性度量,我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分,而在特定推理类型内的预测性较低,表明对齐和分歧并存。基于此,我们提出一个脑引导框架:我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征,在推理时进行干预,在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理,在10个LLM(1.5B-72B)上产生与仅语言监督正交的增益,具有跨推理类型的迁移,以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导,建立了一条由脑信号驱动的路径,通向更稳健和认知对齐的AI。

英文摘要

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

2606.11891 2026-06-11 cs.RO cs.LG 新提交

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

评论家架构的重要性:双评论家与统一评论家在人形机器人移动操作中的对比

Mehmet Turan Yardımcı

AI总结 针对人形机器人多目标强化学习,对比统一评论家与双评论家架构,实验表明双评论家策略在到达速度、吞吐量和成功率上显著优于统一评论家,且架构选择比奖励工程影响更大。

详情
Comments
Accepted at the ICRA 2026 Workshop on Reinforcement Learning for Imitation Learning (RL4IL), Vienna, Austria. 4 pages, 2 figures
AI中文摘要

人形机器人的多目标强化学习必须在单一策略中协调移动和操作。一个自然的设计选择是使用单一(统一)评论家来估计所有目标的组合价值,还是使用具有不相交奖励信号的单独(双)评论家。我们在NVIDIA Isaac Lab中对Unitree G1人形机器人(23个主动自由度)进行了受控比较,通过一个从静态到达延伸到具有可变方向目标的行走的13级顺序课程训练移动操作策略。在标准化评估中,与统一评论家策略相比,双评论家策略到达目标的速度快3.5倍(6.5 vs. 22.6模拟步),吞吐量高2倍(每1000步验证到达次数14.3 vs. 7.0),并且验证到达率更高(65.2% vs. 53.8%)。值得注意的是,额外的反博弈奖励机制在架构改变之外没有提供进一步改进(60.9% vs. 65.2%)。这些结果对新兴的强化学习微调模仿学习策略范式有直接影响:当使用强化学习优化预训练的操作策略时,统一评论家可能通过竞争性的移动梯度抑制已学习的行为。这些发现表明,评论家架构是多目标人形机器人强化学习中一个首要且常被忽视的设计选择,其对到达效率的影响大于奖励工程。

英文摘要

Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary, and often overlooked, design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.
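
As a quick arithmetic check, the headline ratios can be recomputed from the raw figures quoted in the abstract. The variable names are ours; the numbers are taken directly from the text above.

```python
# Figures quoted in the abstract (dual-critic vs. unified-critic policy).
unified_steps, dual_steps = 22.6, 6.5            # simulation steps to reach a target
unified_throughput, dual_throughput = 7.0, 14.3  # validated reaches per 1,000 steps
unified_rate, dual_rate = 53.8, 65.2             # validated reach rate, percent

speedup = unified_steps / dual_steps
throughput_gain = dual_throughput / unified_throughput
rate_gap_points = dual_rate - unified_rate

print(f"{speedup:.2f}x faster, {throughput_gain:.2f}x throughput, "
      f"+{rate_gap_points:.1f} percentage points")
```

The quoted 3.5× and 2× are therefore rounded values (about 3.48 and 2.04), and the reach-rate difference is 11.4 percentage points.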

2606.11889 2026-06-11 cs.CV cs.AI cs.RO 新提交

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

面向自动驾驶危险检测的视觉-语言模型任务对齐稳定性分析

Everett Richards

AI总结 研究视觉-语言模型在自动驾驶危险检测中,嵌入漂移与任务对齐危险分数变化的关系,发现不同损坏类型导致不同的失效模式,建议基准测试包含任务对齐稳定性指标。

详情
Comments
8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)
AI中文摘要

视觉-语言模型(VLM)越来越多地用于自动驾驶中的场景理解,但鲁棒性分析通常仅依赖于任务无关的嵌入稳定性。我们研究图像损坏引起的嵌入漂移是否能预测基于CLIP图像-文本相似性的任务对齐危险分数的变化。通过在BDD100K道路场景上施加受控损坏,我们将嵌入漂移与边际漂移(定义为扰动下危险分数的变化)进行比较。这种关系高度依赖于损坏类型:某些损坏类别表现出表示漂移与决策漂移之间的强耦合,而其他类别则在嵌入变化相对较小的情况下引发危险的决策不稳定性。此外,不同损坏类别的失效方向也不同:大多数通过假阴性抑制危险检测,而遮挡则触发假警报,这表明基准设计应考虑不对称的失效模式,而不仅仅是整体不稳定性率。这些结果表明,鲁棒性基准除嵌入级别的扰动统计外,还应包含任务对齐的稳定性指标。

英文摘要

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
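
The distinction between embedding drift and margin drift can be made concrete with a small sketch. The abstract only says the hazard score is derived from CLIP image-text similarities, so the specific form used here (similarity to a hazard prompt minus similarity to a safe prompt), the use of one minus cosine similarity as embedding drift, and the two-dimensional toy vectors are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hazard_score(image_emb, hazard_text_emb, safe_text_emb):
    """Assumed score: positive means the image sits closer to the hazard prompt."""
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)


def drifts(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    """Return (embedding drift, margin drift) for one clean/corrupted image pair."""
    embedding_drift = 1.0 - cosine(clean_emb, corrupted_emb)
    margin_drift = (hazard_score(corrupted_emb, hazard_text_emb, safe_text_emb)
                    - hazard_score(clean_emb, hazard_text_emb, safe_text_emb))
    return embedding_drift, margin_drift


# Toy embeddings standing in for real CLIP outputs.
hazard_text, safe_text = [1.0, 0.0], [0.0, 1.0]
clean, corrupted = [0.8, 0.6], [0.6, 0.8]
print(drifts(clean, corrupted, hazard_text, safe_text))
```

In this toy case the embedding moves very little (drift of about 0.04), yet the hazard score falls from +0.2 to -0.2 and changes sign. That is the kind of decoupling the abstract describes: a modest representation change that still flips the task-level decision.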

2606.11886 2026-06-11 cs.SD cs.OS 新提交

Real-Time Language Model Jamming: A Case Study for Live Music Accompaniment Generation

实时语言模型即兴合奏:现场音乐伴奏生成的案例研究

Bowen Zheng, Andrew H. Yang, Jiaqi Ruan, Jia He, Xinyue Li, Yuan-Hsin Chen, Ziyu Wang, Xiaosong Ma

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出StreamMUSE系统,在客户端-服务器架构中实现帧同步流式推理,通过现场音乐伴奏任务验证了不同延迟环境下实时同步的有效性。

详情
Comments
Accepted to RTAS 2026. 14 pages, 5 figures, 3 tables
AI中文摘要

语言模型(LMs)已成为现代生成建模中最突出的范式之一。虽然提高速度是实时部署的主要焦点,但仅靠速度是不够的。许多实际应用,如同步翻译和语音合成,还需要生成内容与外部信号在生成内容和时序上精确对齐。我们将此问题称为“帧同步流式推理”。为了解决这个问题,我们提出了StreamMUSE,一个在客户端-服务器架构中响应外部信号流执行LM生成的推理系统。客户端基于最新输入持续发送高频推理请求,并接收与外部时钟同步的输出,而服务器执行模型推理。我们通过现场音乐伴奏任务演示了该框架,展示了在不同往返延迟的部署环境中如何实现实时同步。我们进一步建模了系统超参数与往返延迟之间的关系,并评估了不同环境如何影响实现实时性能的最佳配置。实验结果表明,系统实时性能与音乐质量之间存在一致对应关系,证明了所提出框架的有效性。该项目是开源的。相关代码和最新更新可在此https URL获取。

英文摘要

Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synthesis, also require precise alignment between generation and external signals, both in terms of generation content and timing. We refer to this problem as frame-synchronous streaming inference. To address it, we present StreamMUSE, an inference system that performs LM generation in response to an external signal stream within a client-server architecture. The client continuously sends high-frequency inference requests based on the most recent inputs and receives outputs synchronized to the external clock, while the server executes model inference. We demonstrate the framework through a live music accompaniment task, showing how real-time synchronization can be achieved across different deployment environments with varying round-trip latencies. We further model the relationship between system hyperparameters and round-trip latency, and evaluate how different environments affect optimal configurations to achieve real-time performance. Experimental results show a consistent correspondence between system real-time performance and music quality, demonstrating the effectiveness of the proposed framework. The project is open source. Relevant code and the latest updates are available at this https URL.

2606.11884 2026-06-11 cs.CV cs.CR 新提交

Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality

使用开放人脸图像质量度量对身份证进行图像质量评估

Gregor Grote, Juan E. Tapia, Christian Rathgeb

发表机构 * da/sec - Biometrics and Internet Security Research Group, Hochschule Darmstadt(达姆施塔特应用科学大学生物识别与互联网安全研究组)

AI总结 本文通过将OFIQ标准中的捕获相关质量度量应用于身份证图像,提出一种预处理流程,并分析这些度量与三种呈现攻击检测算法性能的相关性,表明基于某些OFIQ度量的质量评估可显著提升PAD性能。

详情
Comments
Presented at IWBF 2026 (14th International Workshop on Biometrics and Forensics)
AI中文摘要

本文通过将开放人脸图像质量(OFIQ)标准中的捕获相关质量度量应用于身份证图像,解决了远程验证系统中身份证图像质量评估的挑战。我们的预处理流程包括角点检测、透视归一化和全面的前景掩码,以确保准确且无偏的质量度量计算。我们通过分析这些度量与三种呈现攻击检测(PAD)算法在四个不同身份证数据集上的性能相关性来评估其有效性,其中两个数据集包含真实(即原始)图像,两个包含打印的模拟身份证。我们的结果表明,基于某些OFIQ度量的质量评估可以显著提升PAD性能。

英文摘要

This paper addresses the challenge of assessing image quality in ID cards in remote verification systems by applying capture-related quality measures from the Open Face Image Quality (OFIQ) standard to ID card images. Our preprocessing pipeline includes corner detection, perspective normalization, and comprehensive foreground masking to ensure accurate and unbiased quality measure computation. We evaluate the effectiveness of these measures by analyzing their correlation with the performance of three presentation attack detection (PAD) algorithms across four diverse ID card datasets, where two datasets contain bona fide, i.e. pristine, images and two contain printed mock ID cards. Our results suggest that quality assessment based on some OFIQ measures can significantly improve PAD performance.

2606.11880 2026-06-11 cs.CV 新提交

SG2Loc: Sequential Visual Localization on 3D Scene Graphs

SG2Loc: 基于3D场景图的顺序视觉定位

Nicole Damblon, Olga Vysotska, Federico Tombari, Marc Pollefeys, Daniel Barath

发表机构 * ETH Zurich(苏黎世联邦理工学院) Google(谷歌) TU Munich(慕尼黑工业大学) Microsoft(微软)

AI总结 提出一种轻量级顺序视觉定位方法,利用紧凑的3D场景图表示环境,通过粒子滤波和语义匹配实现高效定位,显著降低存储需求。

详情
Comments
The code will be available at this https URL
AI中文摘要

复杂室内环境中的视觉定位仍然是机器人和AR应用的关键挑战。顺序定位,即随时间细化位姿估计,对自主智能体至关重要。然而,传统方法通常需要存储大量图像数据库或点云,导致显著开销。本文提出一种新颖的轻量级顺序视觉定位方法,使用3D场景图。我们的方法用紧凑的场景图表示环境,其中节点表示对象(带有粗略网格),边编码空间关系。在定位阶段,对于每张图像,我们提取逐块语义特征,预测对象身份。定位在粒子滤波框架内进行。每个粒子代表一个相机位姿,将场景图中的粗略对象网格投影到图像中,根据可见性为块分配对象身份。输入图像中逐块特征与场景图对象特征的相似度决定粒子的权重。后续图像顺序融合,细化位姿估计。通过利用紧凑的场景图和高效的语义匹配,我们的方法在保持真实世界数据集性能的同时显著减少存储。代码将在该网址提供。

英文摘要

Visual localization in complex indoor environments remains a critical challenge for robotics and AR applications. Sequential localization, where pose estimates are refined over time, is important for autonomous agents. However, traditional methods often require storing extensive image databases or point clouds, leading to significant overhead. This paper introduces a novel, lightweight approach to sequential visual localization using 3D scene graphs. Our method represents the environment with a compact scene graph, where nodes represent objects (with coarse meshes) and edges encode spatial relationships. For each image in the localization phase, we extract per-patch semantic features, predicting object identities. Localization is performed within a particle filter framework. Each particle, representing a camera pose, projects the coarse object meshes from the scene graph into the image, assigning object identities to patches based on visibility. The similarity between the per-patch features of the input image and the object features from the scene graph determines the weight of each particle. Subsequent images are incorporated sequentially, refining the pose estimate. By leveraging a compact scene graph and efficient semantic matching, our method significantly reduces storage while maintaining performance on real-world datasets. The code will be available at this https URL.
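
The sequential weighting step can be sketched as a toy particle filter update in which each particle's weight is rescaled by how well its predicted patch labels match the image. Everything specific here is an assumption for illustration: the exponential likelihood with a `temperature` parameter, the reduction of a camera pose to an (x, y) position, and the hand-picked similarity values. Mesh projection, feature extraction, and resampling are not modelled.

```python
import math


def reweight(weights, similarities, temperature=0.1):
    """Multiply each particle weight by an assumed likelihood exp(sim / T),
    then renormalise so the weights sum to one."""
    raw = [w * math.exp(s / temperature) for w, s in zip(weights, similarities)]
    total = sum(raw)
    return [r / total for r in raw]


def estimate(particles, weights):
    """Weighted mean position of the particle set."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    return x, y


particles = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # candidate camera positions
weights = [1 / 3] * 3                             # uniform prior
similarities = [0.1, 0.9, 0.2]                    # hypothetical patch-to-object agreement

after_one = reweight(weights, similarities)       # first image
after_two = reweight(after_one, similarities)     # second image, same evidence
print(estimate(particles, after_two))
```

Each additional image that agrees with the same particle concentrates more weight on it, which is the sense in which later frames refine the pose estimate.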

2606.11878 2026-06-11 cs.CR 新提交

Gerrymandering the Warp: Non-Control-Data Attacks on CUDA Collective Decision

操纵 Warp 的“选区”:针对 CUDA 集体决策的非控制数据攻击

Igor Santos-Grueiro

AI总结 本文提出集体语义破坏(CSC)攻击,利用 CUDA 集体操作中的参与元数据(如掩码、谓词等)绕过安全决策,并引入集体完整性契约(CIC)防御机制。

详情
Comments
17 pages
AI中文摘要

CUDA 集体操作通常位于安全决策路径上:投票接受批次、归约聚合证据、洗牌选择代表、屏障在使用前检查状态。这些决策不仅依赖于计算值,还依赖于哪些通道被代表、它们贡献了什么证据、哪个通道代表群体、以及哪个检查过的状态到达提交。我们将这些参与元数据识别为决策性的非控制数据。我们定义了集体语义破坏(CSC),一种非控制数据攻击家族,其中范围有效的掩码、谓词、源通道、描述符、组标签或时期导致符合 CUDA 规范的集体在错误的成员、贡献、角色或验证使用状态上授权决策。内核到达预期的集体站点并执行预期的原语;原语代表了错误的授权集合。我们使用站点本地的参与-授权契约对 CSC 进行建模。受保护的集体在授权前派生、重新计算、检查或冻结成员、贡献、角色和时间状态。我们在 NVIDIA CUDA 集体原语、触发通道、紧凑工作负载风格内核、简化习语桥和准入保护框架上评估 CSC。在涵盖四个授权维度的 CUDA 定义的契约一致性套件中,损坏的参与元数据导致 102/102 实例中的可信参考不匹配,而强化变体在 102/102 中保留了该参考。我们单独报告了 13 个同步敏感实例。然后,我们引入了集体完整性契约(CIC),一种包装规范,在集体使用前绑定参与元数据。对于 CUDA 集体决策,安全性既依赖于计算的值,也依赖于代表的参与者。

英文摘要

CUDA collective operations often sit on security decision paths: votes accept batches, reductions aggregate evidence, shuffles select representatives, and barriers order checked state before use. Such decisions depend not only on computed values, but also on which lanes are represented, what evidence they contribute, which lane speaks for the group, and which checked state reaches commit. We identify this participation metadata as decision-making non-control data. We define Collective Semantic Corruption (CSC), a non-control-data attack family in which range-valid masks, predicates, source lanes, descriptors, group labels, or epochs cause a CUDA-conforming collective to authorize a decision over the wrong membership, contribution, role, or validation-to-use state. The kernel reaches the intended collective site and executes the expected primitive; the primitive represents the wrong authority set. We model CSC with a site-local participation-authority contract. A protected collective derives, recomputes, checks, or freezes membership, contribution, role, and temporal state before authorization. We evaluate CSC across NVIDIA CUDA collective primitives, trigger channels, compact workload-style kernels, reduced idiom bridges, and admission-guard harnesses. In a CUDA-defined contract-conformance suite spanning the four authority dimensions, corrupted participation metadata causes a trusted-reference mismatch in 102/102 instances, while hardened variants preserve that reference in 102/102. We report 13 synchronization-sensitive instances separately. We then introduce Collective Integrity Contracts (CIC), a wrapper discipline that binds participation metadata before collective use. For CUDA collective decisions, security depends on both the values computed and the participants represented.
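
The core idea, that a range-valid but wrong participation mask can change the outcome of a collective vote without touching any computed value, can be modelled in a few lines. The sketch below is a plain Python model of a warp-wide "all" vote, not CUDA code and not the paper's CIC wrapper: `all_sync` only mimics the documented meaning of the CUDA `__all_sync` intrinsic, and `guarded_all_sync` with its `expected_mask` argument is a hypothetical, simplified stand-in for binding participation metadata before use.

```python
WARP_SIZE = 32
FULL_MASK = 0xFFFFFFFF


def all_sync(mask, predicates):
    """Model of a warp 'all' vote: true iff every lane named in `mask`
    has a true predicate. Lanes outside the mask are simply not consulted."""
    return all(predicates[lane] for lane in range(WARP_SIZE) if (mask >> lane) & 1)


def guarded_all_sync(mask, predicates, expected_mask):
    """Hypothetical guard: fail closed unless the mask equals the
    independently derived set of lanes that are supposed to vote."""
    if mask != expected_mask:
        return False
    return all_sync(mask, predicates)


# Lane 7 holds evidence that the batch should be rejected.
predicates = [True] * WARP_SIZE
predicates[7] = False

honest = all_sync(FULL_MASK, predicates)          # lane 7 is heard
corrupted_mask = FULL_MASK & ~(1 << 7)            # still a valid 32-bit mask
corrupted = all_sync(corrupted_mask, predicates)  # lane 7 is silently excluded
guarded = guarded_all_sync(corrupted_mask, predicates, FULL_MASK)
print(honest, corrupted, guarded)
```

The vote primitive behaves as specified in both calls. Only the set of represented lanes differs, which is why the second call accepts a batch that the full warp would have rejected.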

2606.11877 2026-06-11 cs.NI 新提交

LLM-Enabled NWDAF: A Step Toward AI-Native 6G Network Intelligence

LLM赋能的NWDAF:迈向AI原生的6G网络智能

Henok Daniel, Omar Alhussein, Cheng Li, Jie Liang, Ernesto Damiani

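AI summary Develops an open-source NWDAF compatible with Free5GC that integrates a large language model interface, enabling natural-language interaction through intent recognition, simplifying network analytics management, and laying a foundation for AI-native 6G networks.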

Details
Comments
20 pages
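AI abstract (translated from Chinese)

The Network Data Analytics Function (NWDAF) plays a central role in enabling zero-touch network management in fifth-generation (5G) networks by supporting real-time analytics and closed-loop automation. Despite this critical role, open-source NWDAF implementations remain limited in scope and accessibility. In this paper, we develop an open-source NWDAF compatible with the open-source core network Free5GC. It collects network data by subscribing to Network Functions (NFs) and includes an integrated Large Language Model (LLM) interface that supports natural-language interaction with human operators. The interface processes user intents, encodes them with a semantic embedding model, and maps them to one of seven predefined intent categories to trigger analytics queries or event subscription commands. This architecture abstracts away the complexity of traditional interfaces, letting non-expert users manage network analytics and subscriptions easily. The system supports Access and Mobility Management Function (AMF) and Session Management Function (SMF) event subscriptions, real-time monitoring, and analytics retrieval via Prometheus, all accessible through a conversational interface. By combining AI-driven intent recognition with standardized network analytics, our implementation improves operator usability and lays a foundation for AI-native 6G networks. The source code and datasets generated in this study are available in a GitHub repository at: this https URL.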

English abstract

The Network Data Analytics Function (NWDAF) is central to enabling zero-touch network management in fifth-generation (5G) networks by supporting real-time analytics and closed-loop automation. Despite its critical role, open-source NWDAF implementations remain limited in scope and accessibility. In this paper, we develop an open-source NWDAF, compatible with the open-source core network Free5GC, that collects network data via subscriptions to Network Functions (NFs) and includes an integrated Large Language Model (LLM) interface that enables natural-language interaction with human operators. The interface processes user intents, encodes them using a semantic embedding model, and maps them to one of seven predefined intent categories to trigger analytics queries or event subscription commands. This architecture abstracts the complexity of traditional interfaces, allowing non-expert users to manage network analytics and subscriptions with ease. The system supports Access and Mobility Management Function (AMF) and Session Management Function (SMF) event subscriptions, real-time monitoring, and analytics retrieval via Prometheus, all accessible through a conversational interface. By bridging AI-driven intent recognition with standardized network analytics, our implementation enhances operator usability and provides a foundation for AI-native 6G networks. The source code and datasets generated during the current study are available in the GitHub repository, this https URL.
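
The intent-mapping step described above (encode the utterance, then pick the closest predefined category) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. A bag-of-words vector stands in for the semantic embedding model, and the three category names and example phrases are hypothetical. The abstract says the system uses seven categories but does not list them.

```python
import math
from collections import Counter

# Hypothetical intent categories with one example phrase each (not from the paper).
INTENT_EXAMPLES = {
    "subscribe_amf_events": "subscribe to amf registration events",
    "subscribe_smf_events": "subscribe to smf session events",
    "query_analytics": "show analytics metrics for the network load",
}


def embed(text):
    """Toy stand-in for a semantic embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(utterance):
    """Return the category whose example phrase is most similar to the utterance."""
    query = embed(utterance)
    return max(
        INTENT_EXAMPLES,
        key=lambda name: cosine(query, embed(INTENT_EXAMPLES[name])),
    )


print(classify("please subscribe me to smf session updates"))  # subscribe_smf_events
print(classify("show network load analytics"))                 # query_analytics
```

In a real deployment the chosen category would then trigger the matching analytics query or event subscription. A learned sentence embedding would replace `embed` so that paraphrases with no shared words still match.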