arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13277 2026-06-12 stat.ML cs.LG New submission

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

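AI summary: Proposes ProtoX-AD, a framework that uses prototype learning to make self-supervised time series anomaly detection explainable, providing semantically consistent characterizations of anomalies while maintaining detection performance.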

Comments 26 pages, 8 figures

Abstract

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at https://github.com/Aitorzan3/ProtoX-AD.
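The scoring idea behind self-supervised classification-based detection can be sketched as follows. This is a minimal illustration, assuming the anomaly score is the mean cross-entropy of a transformation classifier over all transformed copies of a window; the function name `anomaly_score` and the probability tables are hypothetical and are not taken from the ProtoX-AD code.

```python
import math


def anomaly_score(probs):
    """Mean negative log-probability assigned to the true transformation.

    probs[k][j] is a classifier's probability that the window transformed
    with transformation k was transformed with transformation j
    (hypothetical inputs, not outputs of the paper's model).
    """
    losses = [-math.log(probs[k][k]) for k in range(len(probs))]
    return sum(losses) / len(losses)


# A normal window: the classifier recognizes each transformation confidently.
normal = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.05, 0.05, 0.90]]
# An anomalous window: transformation-specific patterns are no longer recognizable.
anomalous = [[0.40, 0.30, 0.30], [0.35, 0.30, 0.35], [0.30, 0.30, 0.40]]

print(anomaly_score(normal) < anomaly_score(anomalous))
```

A higher score means the classifier makes more transformation errors on the window, which is the signal the abstract describes. The prototype-based explanation layer is not reproduced here.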

2606.13193 2026-06-12 eess.AS cs.PL cs.SD New submission

A Dual-Mode Faust-to-CLAP Compilation System

Facundo Franchino, Stéphane Letz, Jatin Chowdhury

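AI summary: Presents faust2clap, a framework with a static compilation mode and a dynamic interpretation mode. An address-based identity matching algorithm and a stable slot allocation scheme preserve DSP parameter identity, enabling efficient compilation and hot reloading.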

Comments 4 pages, 4 figures, 1 algorithm. Presented at the International Faust Conference (IFC-26), Lyon, France, June 2026

Abstract

We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.
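The parameter identity problem can be illustrated with a small sketch. It assumes that each parameter is identified by an address path such as `/synth/cutoff` and that the host refers to parameters by an integer slot. The class `SlotTable`, its methods, and the matching policy are hypothetical and are not the faust2clap implementation.

```python
class SlotTable:
    """Hypothetical stable slot allocator keyed by parameter address."""

    def __init__(self):
        self.slot_of = {}   # address -> slot id, never reassigned
        self.values = {}    # slot id -> last known value
        self.next_slot = 0

    def reload(self, params):
        """Apply a new parameter list {address: default_value}.

        Returns {address: (slot, value)} for the parameters now active.
        """
        active = {}
        for address, default in params.items():
            if address not in self.slot_of:
                # Unknown address: allocate a fresh slot, never reuse an old one.
                self.slot_of[address] = self.next_slot
                self.values[self.next_slot] = default
                self.next_slot += 1
            slot = self.slot_of[address]
            # Known address: keep its slot and its current value.
            active[address] = (slot, self.values[slot])
        return active

    def set_value(self, address, value):
        self.values[self.slot_of[address]] = value


table = SlotTable()
table.reload({"/synth/cutoff": 1000.0, "/synth/gain": 0.5})
table.set_value("/synth/cutoff", 440.0)
# Structural edit: gain removed, resonance added, cutoff kept.
after = table.reload({"/synth/resonance": 0.1, "/synth/cutoff": 1000.0})
print(after)
```

Because slots are never reused, host automation bound to slot 0 still points at the cutoff after the reload, and the edited value survives instead of snapping back to the default.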

2606.13146 2026-06-12 stat.ML cs.LG stat.ME New submission

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Federico P. Cortese, Alessio Farcomeni

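AI summary: Proposes a robust feature-weighted jump model that achieves robustness through Tukey's biweight loss and introduces state-specific feature weights. It outperforms competing methods in simulations, particularly with outliers, and is illustrated in two empirical applications.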

Abstract

We propose a robust feature-weighted jump model for time-dependent clustering. A penalty is used to encourage smoothness of transitions over time, while robustness is achieved through Tukey's biweight loss function. An additional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on macroeconomic performance of twelve European countries in the period 1949-2024.
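For reference, Tukey's biweight loss in its standard textbook form is sketched below. The tuning constant `c = 4.685` is the conventional default for Gaussian data and is an assumption here; the abstract does not state which value the authors use or how the loss enters their objective.

```python
def tukey_biweight(r, c=4.685):
    """Standard Tukey biweight loss for a residual r with cutoff c.

    Behaves like r**2 / 2 near zero and is constant (c**2 / 6) beyond the
    cutoff, so a single large outlier has bounded influence.
    """
    if abs(r) <= c:
        return (c * c / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c * c / 6.0


for r in (0.0, 1.0, 4.685, 100.0):
    print(r, tukey_biweight(r))
```

The flat region beyond `c` is what distinguishes it from a squared loss: residuals of 10 and 100 cost the same, so extreme observations cannot dominate the fit.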

2606.13109 2026-06-12 eess.AS cs.SD New submission

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki

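AI summary: Proposes Close-to-Distant microphone projection (C2D projection), which generates paired training data from real recordings and is realized with a variant of the parametric multichannel Wiener filter. A neural network trained on the projected data outperforms the existing GSS method on distant speech enhancement when the GSS output is used as an auxiliary input.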

Journal ref Proceedings of IEEE ICASSP 2026

Abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.

2606.13095 2026-06-12 eess.AS cs.SD New submission

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

Naijun Zheng, Yuke Lin, Sanli Tian, Mengtian Li, Zhiwei Lin, Longshuai Xiao, Dandan Tu

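AI summary: Proposes a dual-encoder architecture, a feature interleaving format, a length-aware speaker ID loss, and an adaptive threshold strategy for the ASR loss to train an LLM-based system efficiently with limited real data while balancing the ASR and diarization tasks. Relative improvements are 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.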

Comments Accepted in Interspeech 2026

Abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.

2606.13017 2026-06-12 q-bio.NC cs.LG New submission

Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

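AI summary: Using criticality features extracted with Detrended Fluctuation Analysis (DFA), a Naive Bayes classifier identifies deep sleep (N3) with a balanced accuracy of 87.17%, providing an efficient sensing mechanism for state-dependent neurofeedback in passive brain-computer interfaces.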

Comments 7 pages, 3 figures, accepted for publication in the Proceedings of the 10th Graz Brain-Computer Interface Conference 2026, Graz, Austria, September 14-17, 2026

Abstract

Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed 347,232 EEG epochs from 290 older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal "state-sensing" engine for neurofeedback. Naive Bayes achieved the highest mean balanced accuracy (87.17% ± 0.24%), significantly outperforming a fully connected deep neural network (FNN: 81.58%) and Random Forest (80.97%). Linear models (LDA: 57.21%; SVM: 51.01%) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
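Balanced accuracy matters here because deep sleep is typically a minority class among sleep epochs. The sketch below shows the usual binary definition, the mean of sensitivity and specificity; the toy labels are invented for illustration and are not data from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = N3, 0 = other)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Toy example: 2 of 10 epochs are N3, and the classifier never predicts N3.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

plain = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(plain, balanced_accuracy(y_true, y_pred))
```

A classifier that never detects N3 still scores 0.8 in plain accuracy on this toy set but only 0.5 in balanced accuracy, which is chance level for two classes.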

2606.12838 2026-06-12 q-bio.QM cs.AI cs.LG q-bio.GN New submission

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai

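AI summary: Proposes OCOO-T, a minimalist flow-matching-based virtual cell model that uses continuous-time denoising and adaptive layer normalization to achieve state-of-the-art transcriptional perturbation prediction on several benchmarks.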

Comments 22 pages, 6 figures

Abstract

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.
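The patching and depatching step mentioned above can be pictured with a short sketch. It assumes fixed-size, non-overlapping patches with zero padding at the end of the profile; the function names `patch` and `depatch` and the padding choice are hypothetical, since the abstract does not describe the scheme in detail.

```python
def patch(profile, patch_size):
    """Split a 1-D expression profile into equal-length patches (tokens)."""
    pad = (-len(profile)) % patch_size
    padded = list(profile) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def depatch(patches, length):
    """Concatenate patches and drop the padding to recover the profile."""
    flat = [value for p in patches for value in p]
    return flat[:length]


profile = [float(i) for i in range(10)]   # stand-in for 10 gene expression values
tokens = patch(profile, patch_size=4)     # 3 tokens instead of 10
restored = depatch(tokens, len(profile))
print(len(tokens), restored == profile)
```

Grouping genes into patches shortens the token sequence a Transformer has to attend over by roughly the patch size, which is one plausible reading of how long transcriptional profiles become tractable.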

2606.12654 2026-06-12 stat.ME cs.LG stat.ML New submission

Computationally tractable robust differentially private mean estimation

Kelly Ramsay

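AI summary: Proposes a new differentially private mean estimator, the balloon mean, which uses iterative clipping over expanding Mahalanobis balls to obtain computational tractability, robustness, and zero-concentrated differential privacy, with theoretical guarantees on statistical performance and robustness under heavy-tailed and contaminated elliptical models.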

Comments 40 pages, 17 figures

Abstract

We develop a new, differentially private mean estimator called the balloon mean. The main features of the balloon mean are that it is computationally tractable and enjoys robustness to outlying observations. It is based on an iterative clipping procedure over expanding Mahalanobis balls, or "balloons." The method satisfies zero-concentrated differential privacy and depends on a small number of interpretable tuning parameters. We provide theoretical guarantees under heavy-tailed and contaminated elliptical models, characterizing its statistical performance and robustness to outliers. Extensive simulations demonstrate that the balloon mean is robust to heavy-tailed and contaminated data, and outperforms existing differentially private mean estimators in contaminated settings.
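The basic building block that clipping-based private estimators rely on can be sketched in one dimension. This is not the balloon mean: it is a single clip-and-noise round using the standard Gaussian mechanism, for which noise of scale sigma = sensitivity / sqrt(2 * rho) gives rho-zCDP. The function names, the fixed center and radius, and the replace-one neighboring relation are assumptions made for illustration.

```python
import math
import random


def clipped_mean(xs, center, radius):
    """Mean after clipping every point to [center - radius, center + radius]."""
    lo, hi = center - radius, center + radius
    return sum(min(max(x, lo), hi) for x in xs) / len(xs)


def noise_scale(radius, n, rho):
    """Gaussian noise scale giving rho-zCDP for the clipped mean.

    Replacing one of n points moves the clipped mean by at most 2 * radius / n.
    """
    sensitivity = 2.0 * radius / n
    return sensitivity / math.sqrt(2.0 * rho)


def private_clipped_mean(xs, center, radius, rho, rng):
    sigma = noise_scale(radius, len(xs), rho)
    return clipped_mean(xs, center, radius) + rng.gauss(0.0, sigma)


data = [0.1, -0.2, 0.05, 1000.0]   # one gross outlier
estimate = private_clipped_mean(data, center=0.0, radius=1.0, rho=0.5,
                                rng=random.Random(0))
print(clipped_mean(data, 0.0, 1.0), estimate)
```

Clipping bounds the outlier's contribution (the clipped mean is 0.2375 rather than about 250), and a smaller radius also means less noise. The abstract's iterative scheme over expanding Mahalanobis balls addresses how to choose the center and radius, which this sketch simply takes as given.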

2606.12471 2026-06-12 stat.ML cs.CL cs.ET cs.LG New submission

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

Seth Dobrin, Łukasz Chmiel

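AI summary: Introduces the Physics-Grounded Symbolic Architecture (PGSA) and proves that it achieves exact linear identifiability and near-infinite temporal consistency in non-Gaussian dynamical systems, overcoming the Gaussian boundary that limits statistical world models.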

Comments Pre-print

Abstract

Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

2606.13633 2026-06-12 eess.SY cs.LG cs.SY New submission

Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model

Ion Matei, Maksym Zhenirovskyy, Takuya Kurihana, Rohit Vupala, Anthony Wong

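AI summary: Proposes a framework that combines a hybrid neural-cellular automaton wildfire model with gradient-based optimization of aerial drops, quantifies uncertainty through Monte Carlo sampling and spatially correlated perturbations, and shows in a case study that it can generate effective suppression plans.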

Abstract

Aerial wildfire suppression requires not only predicting fire spread, but also designing effective intervention strategies under operational and environmental uncertainty. We present a modeling and optimization framework for aerial wildfire suppression that combines a hybrid neural-cellular automaton wildfire model with gradient-based design of targeted aerial drops. The wildfire model predicts spatially varying spread behavior from terrain, fuel, and wind data, while the intervention module determines binary drop actions with continuous-valued location and orientation parameters mapped to the simulation grid. Water and retardant are represented with distinct suppression effects, corresponding to immediate reduction of active burning and persistent reduction of future spread. To evaluate the robustness of the resulting suppression plans, we quantify both aleatoric uncertainty through Monte Carlo sampling of daily fire-state realizations and epistemic uncertainty through spatially correlated prediction-error perturbations. A case study based on the 2020 Bear Fire shows that the framework can generate coherent aerial suppression schedules for reducing total fire-affected area and can support uncertainty-aware analysis of wildfire intervention strategies.

2606.13543 2026-06-12 cs.NI cs.LG New submission

NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks

Fabien Chraim, Jian Zhang, Dominik Janzing, Xiang Song, Christos Faloutsos, John Evans

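AI summary: Proposes NetCause, a framework that models network incidents as graph-temporal processes and ranks candidate root causes through counterfactual simulation, improving accuracy by 16.1% on 31 expert-labeled incidents.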

Comments 9 pages, 6 figures

Abstract

Can a learned model capture how faults propagate through a large-scale network and use this knowledge to causally attribute customer impact to its underlying root cause? Existing root cause analysis techniques often rely on static rules, correlation heuristics, or topology-local reasoning, which struggle to generalize in dynamic environments where faults propagate across complex physical and logical dependencies. We present NetCause, a self-supervised learning-based framework that models network incidents as graph-temporal processes and uses counterfactual simulation to rank candidate root causes. This approach produces an interpretable ranking of root cause hypotheses and integrates naturally with operator-defined mitigation and remediation actions. We train the model on over 1,500 incidents collected over six months from a leading cloud provider's production network and evaluate it on 31 expert-labeled incidents. NetCause consistently improves root cause ranking quality in the regime most relevant to operational decision-making, achieving a 16.1% accuracy improvement over a rule-based heuristic baseline. While training is computationally intensive, inference is lightweight, requiring only seconds of GPU runtime per incident (well below typical telemetry collection latencies).
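Counterfactual ranking can be illustrated with a toy example. The sketch below replaces the learned graph-temporal model with a fixed reachability rule over a dependency graph, which is a deliberate simplification; the node names, the `impact` definition, and the scoring rule are all hypothetical and are not NetCause's method.

```python
def reachable(graph, sources):
    """All nodes reachable from the given source nodes (depth-first search)."""
    seen, stack = set(sources), list(sources)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def impact(graph, faults, customers):
    """Number of customer nodes affected by the given set of faults."""
    return len(reachable(graph, faults) & customers)


def rank_root_causes(graph, faults, customers):
    """Score each fault by how much impact disappears if it had not occurred."""
    base = impact(graph, faults, customers)
    scores = {c: base - impact(graph, faults - {c}, customers) for c in faults}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


graph = {
    "router_a": ["switch_1", "switch_2"],
    "switch_1": ["cust_x"],
    "switch_2": ["cust_y"],
    "link_b": ["cust_z"],
    "fan_c": [],
}
ranking = rank_root_causes(graph, {"router_a", "link_b", "fan_c"},
                           {"cust_x", "cust_y", "cust_z"})
print(ranking)
```

Each candidate is scored by asking "what would customer impact have been without this fault?", so a noisy but harmless alarm such as `fan_c` ranks last even though it fired during the incident.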

2606.13532 2026-06-12 cs.NI cs.LG New submission

Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

Fabien Chraim, Dominik Janzing, John Evans

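AI summary: Proposes a root cause analysis method for cloud network incidents based on graph-based causal discovery. Spatiotemporal grouping and an automation ontology reduce dimensionality, bivariate Granger causality and conditional independence tests build the causal graph, and a probabilistic method provides time-aware root cause scoring. On 35 production incidents, the correct root cause is recalled in 85.7% of cases, with an exact match in 74.3%.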

Comments 6 pages, 4 figures

Abstract

Cloud computing relies on large-scale networks, which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.
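Time-aware scoring over a causal graph might look like the sketch below. It assumes an exponential decay of edge probability with time lag and scores each parentless node by its most probable path to the observed symptom. The decay form, the constant `tau_s`, the event names, and the max-product rule are all hypothetical, as the abstract does not specify them.

```python
import math


def edge_prob(lag_s, tau_s=60.0):
    """Hypothetical conditional probability of an edge given its time lag."""
    return math.exp(-lag_s / tau_s)


def path_score(edges, node, symptom, seen=frozenset()):
    """Probability of the most likely causal path from node to symptom."""
    if node == symptom:
        return 1.0
    best = 0.0
    for effect, lag_s in edges.get(node, ()):
        if effect not in seen:
            score = edge_prob(lag_s) * path_score(edges, effect, symptom,
                                                  seen | {node})
            best = max(best, score)
    return best


# Causal graph: cause -> [(effect, observed lag in seconds)]
edges = {
    "power_event": [("link_down", 30.0)],
    "link_down": [("packet_loss", 0.0)],
    "config_push": [("packet_loss", 120.0)],
}
effects = {effect for outs in edges.values() for effect, _ in outs}
roots = [n for n in edges if n not in effects]   # nodes with no known cause
ranked = sorted(((r, path_score(edges, r, "packet_loss")) for r in roots),
                key=lambda kv: -kv[1])
print(ranked)
```

Under these assumptions the power event outranks the configuration push because its effects followed more closely in time, and each score can be traced back to a concrete path through the graph.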

2606.13501 2026-06-12 cs.DC cs.LG cs.PF New submission

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo

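AI summary: Proposes GF-DiT, a policy-programmable runtime that optimizes Diffusion Transformer serving by dynamically adjusting the parallelism of requests, using group-free collectives for low-overhead online reconfiguration to raise throughput and reduce latency substantially.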

Abstract

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 μs.

2606.13468 2026-06-12 cs.SE cs.AI New submission

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh

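AI summary: An analysis of the AIDev dataset finds that 46.41% of the code fixes proposed by AI agents (Copilot, Devin, Cursor, Claude) are rejected. A qualitative study of 306 non-merged PRs identifies 14 rejection reasons in four categories and offers recommendations for better guiding the model.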

Comments 5 pages, 2 figures, MSR '26: Proceedings of the 23rd International Conference on Mining Software Repositories, April 2026, Rio de Janeiro, Brazil

Abstract

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are ultimately discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons, divided into four high-level categories, for rejecting AI-agent fixes. We observe that developers can reject fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

2606.13449 2026-06-12 cs.SE cs.AI New submission

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

Ali Arabat, Mohammed Sayagh

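AI summary: An analysis of 15,549 agentic PRs from 148 projects finds that instruction files have no consistent positive effect on merge rate, amount of code change, or merge effort, although projects that improved have longer and more clearly structured instruction files. The paper proposes "Instructions-as-Code" as a research direction.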

Comments 5 pages, 8 figures, 23rd International Conference on Mining Software Repositories, April 13--14, 2026

Abstract

AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (aka Instructions-as-Code).
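The before/after comparison for a single project can be sketched as follows. The record layout (a `merged` flag per PR) and the toy numbers are hypothetical. The abstract does not say whether "at least 20%" refers to percentage points or to relative change, so the sketch reports both.

```python
def merge_rate(prs):
    """Fraction of pull requests that were merged."""
    return sum(1 for pr in prs if pr["merged"]) / len(prs)


def compare(before, after):
    """Merge rates before/after, with absolute and relative change."""
    rate_before, rate_after = merge_rate(before), merge_rate(after)
    return {
        "before": rate_before,
        "after": rate_after,
        "point_change": rate_after - rate_before,
        "relative_change": (rate_after - rate_before) / rate_before,
    }


# Toy project: PRs opened before and after an instruction file was added.
before = [{"merged": True}, {"merged": False}, {"merged": True}, {"merged": False}]
after = [{"merged": True}] * 4 + [{"merged": False}]

result = compare(before, after)
print(result)
```

Here the merge rate moves from 0.5 to 0.8, a gain of 30 percentage points or 60% in relative terms, which shows how much the choice of definition can affect whether a project counts as improved.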

2606.13397 2026-06-12 cs.HC cs.AI cs.CY New submission

Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities

Dipto Das, Achhiya Sultana, Ankit Singh Chauhan, Saadia Binte Alam, Mohammad Shidujaman, Shion Guha, Sunandan Chakraborty, Syed Ishtiaque Ahmed

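AI summary: Studies the epistemic limits of LLM-based moderation systems with respect to insensitive speech toward Bangladesh's Hindu and Chakma communities, and develops the Mod-Guide tool from a co-created cultural corpus and retrieval augmented generation (RAG) to improve model sensitivity to minority viewpoints.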

英文摘要

Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech: language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing, rather than overt hostility. Focusing on Bangladesh's Hindu and Chakma communities -- the country's largest religious and Indigenous ethnic minorities, respectively -- this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.

2606.13385 2026-06-12 cs.CR cs.AI cs.CY cs.HC cs.MM 新提交

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

谁买单?面向真实世界网络代理的以利益相关者为中心的提示注入基准测试

Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang

AI总结 提出以利益相关者为中心的基准测试框架,系统分类和归因真实世界网络代理系统中的提示注入危害,揭示当前代理无法可靠抵抗任何攻击目标,且失败模式多样。

Comments 32 pages

详情
AI中文摘要

由大型语言模型驱动的网络代理越来越多地部署在真实环境中,它们在不受信任的网络内容上操作并执行具有直接后果的动作。这使得它们容易受到提示注入攻击,其中看似良性的内容嵌入了操纵代理行为的对抗性指令。现有的安全基准采用以攻击为中心的视角,关注注入的技术可行性,而忽略了由此产生的危害的细微分布。然而,在实践中,提示注入风险是受害者依赖的:单一漏洞可能对不同利益相关者产生不对称后果,同一攻击模式可能因目标不同而表现出显著不同的有效性。为了捕捉这些特性,我们引入了\sysname,一个以利益相关者为中心的基准,用于系统分类和归因真实世界网络代理系统中的危害。它区分受影响的实体(如用户、卖家、平台),将攻击分解为具体目标,并使用互补的结果和过程级指标评估每个案例。我们的结果揭示了显著且异质的漏洞:当前代理无法可靠抵抗任何单一攻击目标,失败分布在从“隐蔽寄生”(攻击成功而不干扰用户委托任务)到“错位破坏”(任务被破坏而攻击未成功)以及“复合失败”(对抗目标和任务完整性同时被违反)等不同模式。这些模式被传统评估所忽略,突显了在真实部署中对基于LLM的代理进行利益相关者感知评估的必要性。基准可在该https URL获取。

英文摘要

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.

2606.13298 2026-06-12 cs.SE cs.AI 新提交

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

在智能体AI采用下的架构质量挖掘:Java仓库的因果研究

Oliver Aleksander Larsen, Mahyar T. Moghaddam

AI总结 通过双重差分设计和Borusyak插补估计量,研究智能体AI工具采用对Java仓库架构气味密度(ASD)的因果影响,发现ASD下降6.7%源于代码量增长,而非架构改进。

Comments 16 pages. Accepted for presentation at the 52nd Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2026, Krakow, Poland, 2-4 September 2026, and for publication in the Springer LNCS proceedings. This is the author's accepted manuscript

详情
AI中文摘要

AI编码工具现已被大多数开发者使用,这些工具的智能体化使用普及了俗称“氛围编码”的实践。然而,关于其对软件架构影响的因果证据却很少。先前的因果工作衡量了代码层面的结果(复杂度、静态分析警告);这种退化是否会传播到架构层面仍未知。我们挖掘了151个开源Java仓库,其中74个检测到智能体AI采用(通过配置文件和Co-Authored-By提交尾注识别),以及77个倾向得分匹配的对照仓库,每个仓库跨越13个月,生成1,811个月度Arcan快照。我们采用交错双重差分设计和Borusyak插补估计量,估计采用对架构气味密度(ASD)的因果效应,将近期用于代码层面指标的因果设计应用于架构层面。总气味计数基本不变(+1.1%,p=0.82),而代码行数增长12.8%(p=0.003);因此,ASD下降6.7%(p=0.004)是分母效应而非架构改进。按类型估计和稳健性检验(wild cluster bootstrap、Lee bounds、陈旧观测敏感性)证实了这一模式;事前趋势平坦(Wald p=0.90),与平行趋势一致。当处理影响系统规模时,密度归一化结果可能产生误导:对AI工具采用的因果挖掘研究需要原始计数和显式分解。完整的复现包,包括精心整理的151个仓库月度面板,已公开提供。

英文摘要

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.
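The denominator effect described in this abstract can be illustrated with a small sketch. The repository below is hypothetical (only the 12.8% growth figure echoes the abstract), and the toy numbers are not meant to reproduce the paper's regression estimates; the function names are illustrative.

```python
def density(smells: int, kloc: float) -> float:
    """Architectural smells per thousand lines of code."""
    return smells / kloc


def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after` (0.10 means +10%)."""
    return (after - before) / before


# Hypothetical repository, not data from the paper.
smells_before, smells_after = 200, 200   # raw smell count unchanged
kloc_before, kloc_after = 100.0, 112.8   # code base grows by 12.8%

asd_before = density(smells_before, kloc_before)
asd_after = density(smells_after, kloc_after)
asd_change = relative_change(asd_before, asd_after)

# Density falls although no smell was removed: only the denominator moved.
print(f"smell count change: {relative_change(smells_before, smells_after):+.1%}")
print(f"ASD change:         {asd_change:+.1%}")
```

Reporting the raw count next to the density, as the last two lines do, is the kind of explicit decomposition the abstract argues for.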

2606.13239 2026-06-12 cs.SE cs.AI cs.CL cs.CV 新提交

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

ComAct: 通过COM即行动范式重构专业软件操作

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

AI总结 提出COM即行动范式,将专业软件交互转化为确定性程序合成,解决GUI代理的脆弱性和API代理的异构性问题;构建ComCADBench基准和ComActor自校正代理,在工业CAD软件上实现SOTA性能。

详情
AI中文摘要

现有的计算机使用代理在专业软件操作上仍然存在根本性限制:基于GUI的代理受困于脆弱的视觉基础和长程错误累积,而基于API的方法则难以应对异构协议和不可访问的商业接口。在这项工作中,我们将组件对象模型(COM)识别为统一的、可执行的抽象,提出了COM即行动:一种新的范式,将专业软件交互重新定义为确定性程序合成,而非顺序视觉控制。为了在最苛刻的环境中验证这一范式,我们引入了ComCADBench,这是首个针对操作真实工业CAD软件的代理的基准测试。我们的实验揭示了显著的范式差距:前沿的专有模型在基于GUI的交互下几乎无法成功,而基于COM的执行则带来了实质性的即时收益。为了弥合语法正确性与几何精度之间的剩余差距,我们开发了ComActor,一个通过渐进式三阶段框架训练的自校正代理,以及ComForge,一个用于在Windows容器中进行大规模训练的可扩展平台。大量实验表明,ComActor在ComCADBench上达到了最先进的性能,在基线崩溃的长程任务中表现出强大的韧性,并泛化到外部CAD基准测试。

英文摘要

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmark.

2606.13133 2026-06-12 cs.DS cs.LG 新提交

Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling

学习增强的无关联机器调度近似算法

Kaito Baba, Evripidis Bampis, Giorgos Mitropoulos

AI总结 针对无关联机器调度问题,提出学习增强算法,利用重作业分配预测实现精确预测时(1+ε)-近似,误差增大时退化为2-近似。

Comments 22 pages, 3 figures

详情
AI中文摘要

最近,Antoniadis等人(ICLR 2025)提出了一个框架,通过引入预测来近似NP-hard选择问题。尽管该方法简单,但它紧密匹配理论下界,因此其推广极具吸引力。我们解决了Antoniadis等人工作中提出的一个开放问题,即如何将该方法扩展到选择问题类之外的其他重要问题,例如调度问题。我们为无关联机器上的最小化完工时间问题(记为$R\|C_{\max}$)开发了一种学习增强算法。通过使用重作业分配的预测,我们在预测准确时实现了多项式时间的$(1+\varepsilon)$-近似,并且随着误差增加,该近似平滑地退化为最坏情况下的2-近似。我们通过实证分析总结了我们的工作。

英文摘要

Recently, Antoniadis et al. (ICLR 2025) proposed a framework for incorporating predictions to approximate NP-hard selection problems. Despite its simplicity, this approach tightly matches theoretical lower bounds, making its generalization highly compelling. We address an open question raised in the work of Antoniadis et al., concerning the extension of this approach to other important problems outside the class of selection problems, such as scheduling. We develop a learning-augmented algorithm for the makespan minimization problem on unrelated machines, denoted by $R\|C_{\max}$. By using predictions of heavy job assignments, we achieve a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions that smoothly degrades to a worst-case 2-approximation as the error increases. We conclude our work with an empirical analysis of our method.
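To make the $R\|C_{\max}$ setting concrete, here is a minimal sketch of the problem and of how a predicted assignment of heavy jobs could seed a schedule. The greedy completion rule is a stand-in chosen for brevity, not the paper's algorithm, and it carries none of the stated approximation guarantees; the instance, `complete_greedily`, and `predicted_heavy` are all hypothetical.

```python
from typing import Dict, List


def makespan(times: List[List[float]], assignment: Dict[int, int]) -> float:
    """Maximum machine load; times[i][j] is the time of job j on machine i."""
    loads = [0.0] * len(times)
    for job, machine in assignment.items():
        loads[machine] += times[machine][job]
    return max(loads)


def complete_greedily(times: List[List[float]],
                      predicted: Dict[int, int]) -> Dict[int, int]:
    """Keep the predicted assignments, then place each remaining job on the
    machine where it would finish earliest (illustrative heuristic only)."""
    m, n = len(times), len(times[0])
    assignment = dict(predicted)
    loads = [0.0] * m
    for job, machine in assignment.items():
        loads[machine] += times[machine][job]
    for job in range(n):
        if job in assignment:
            continue
        best = min(range(m), key=lambda i: loads[i] + times[i][job])
        assignment[job] = best
        loads[best] += times[best][job]
    return assignment


# Hypothetical instance: 2 unrelated machines, 4 jobs.
times = [
    [4.0, 9.0, 1.0, 2.0],  # processing times on machine 0
    [8.0, 3.0, 2.0, 1.0],  # processing times on machine 1
]
predicted_heavy = {0: 0, 1: 1}  # assumed prediction: job 0 -> machine 0, job 1 -> machine 1

schedule = complete_greedily(times, predicted_heavy)
print(schedule, makespan(times, schedule))
```

On unrelated machines a job's processing time depends on the machine it runs on, which is why a wrong prediction for a heavy job can dominate the makespan.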

2606.13113 2026-06-12 eess.SY cs.RO cs.SY 新提交

MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer

基于李雅普诺夫监督的物理信息神经网络校正层的欠驱动航天器MPC控制

Amirhossein Ayanmanesh Motlaghmofrad, Carlo Cena, Mauro Martini, Marcello Chiaberge

AI总结 针对欠驱动航天器姿态控制,提出一种分层架构,结合非线性模型预测控制、物理信息神经网络和李雅普诺夫监督机制,在不确定性下降低稳态误差并保持鲁棒性。

Comments Accepted at SPAICE (AI in and for Space) 2026

详情
AI中文摘要

欠驱动航天器面临可控性限制和对环境干扰的高度敏感性,使得姿态机动和稳定复杂化。由于沿欠驱动轴缺乏控制能力,传统控制器无法直接稳定所有姿态分量,因此需要参考规划策略。此外,MPC方法对惯性不确定性和未建模动态耦合仍然敏感,导致在失配下跟踪性能下降。为解决这些问题,我们考虑一种集成三层的分层架构:(i) 非线性模型预测控制器(NMPC),用于约束和欠驱动感知的机动规划以及在执行器限制下的标称闭环稳定性;(ii) 物理信息神经网络(PINN),在仿真数据上离线训练以估计残余干扰力矩,其损失项强制执行与刚体旋转动力学的一致性;(iii) 基于李雅普诺夫的监督安全机制,在线评估学习到的校正并限制或抑制其影响,以保持基线控制器的稳定性特性。该架构在模拟反作用轮动力学、执行器饱和及环境干扰的高保真仿真环境中进行评估。蒙特卡洛研究表明,与独立NMPC相比,稳态姿态误差有统计显著的降低,同时在不确定性下保持鲁棒行为。监督层确保当基于学习的增强不可靠时,能够优雅地退化到纯模型控制。

英文摘要

Underactuated spacecraft face controllability limitations and heightened sensitivity to environmental disturbances, complicating attitude maneuvering and stabilization. Due to the lack of control authority along the underactuated axis, conventional controllers cannot directly stabilize all attitude components and therefore require reference planning strategies. Furthermore, MPC approaches remain sensitive to inertia uncertainty and unmodeled dynamic couplings, resulting in degraded tracking performance under mismatch. To address these issues, we consider a hierarchical architecture integrating three layers: (i) a nonlinear model predictive controller (NMPC) for constraint- and underactuation-aware maneuver planning and nominal closed-loop stability under actuator limits; (ii) a physics-informed neural network (PINN) trained offline on simulation data to estimate residual disturbance torques, with loss terms that enforce consistency with rigid-body rotational dynamics; (iii) a Lyapunov-based supervisory safety mechanism that evaluates the learned correction online and bounds or suppresses its influence to preserve the stability properties of the baseline controller. The architecture is evaluated in a high-fidelity simulation environment modelling reaction wheel dynamics, actuator saturation, and environmental disturbances. Monte Carlo studies show statistically significant reductions in steady-state attitude error relative to standalone NMPC while maintaining robust behavior under uncertainty. The supervisory layer ensures graceful degradation to purely model-based control when the learning-based augmentation is unreliable.

2606.13097 2026-06-12 cs.PL cs.AI 新提交

Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents

功能缓存嫁接:具身智能体的鲁棒且快速代码策略合成

Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn, Honguk Woo

AI总结 提出FCGraft框架,通过维护函数级验证代码骨架及其键值缓存,对新任务进行缓存嫁接(拼接和修补),减少预填充计算并复用验证结构,实现更鲁棒和快速的策略合成。

Comments Accepted at ICML 2026

详情
AI中文摘要

编写代码的大型语言模型(CodeLLMs)通过将自然语言目标和环境约束转化为结构化控制程序,为具身智能体生成可执行的代码策略。然而,在开放域具身环境中,策略生成存在两个基本限制:(i) 由于长提示上的重复预填充计算导致的延迟解码,以及(ii) 由于完全生成式解码导致的鲁棒性有限,这常常产生API不匹配、缺少安全防护和不稳定的控制逻辑。为了解决这些限制,我们提出了FCGraft,一种功能缓存嫁接框架。FCGraft维护一个函数级验证代码骨架库及其相关的提示级Transformer键值(KV)缓存,并在提供新任务时通过检索相关函数并嫁接其KV缓存来合成新策略。给定检索到的函数缓存,FCGraft通过拼接(将缓存的函数片段组合成复合策略)和修补(仅局部调整必要的代码区域以满足任务特定参数和约束,且只需最少的额外解码)进行缓存嫁接。通过消除冗余的预填充计算,该方法减少了生成延迟,同时重用经过验证的控制结构提高了鲁棒性,相比提示级缓存方法RAGCache,任务成功率提高了18.31%,策略合成速度提高了2.3倍。

英文摘要

Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness. Compared with the prompt-level caching method RAGCache, it achieves an 18.31% higher task success rate and 2.3x faster policy synthesis.

2606.13071 2026-06-12 cs.CY cs.AI cs.HC 新提交

"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System

“这还不够吗?”:加拿大算法签证分类系统中的机构问责与集体意义建构的不对称性

Dipto Das, Matthew Tamura, Syed Ishtiaque Ahmed, Shion Guha

AI总结 研究加拿大签证系统中算法问责的机构表述与申请者体验,发现机构强调透明度与程序保障,而申请者通过集体意义建构应对不透明决策,揭示认知、管辖和时空关系三方面不对称。

详情
AI中文摘要

本文研究了加拿大签证系统中算法问责如何在机构层面被表述,以及跨境申请者如何体验这种问责。我们使用为公共部门调整的算法决策(ADMAPS)框架,分析了加拿大移民、难民和公民部(IRCC)针对临时居民签证(TRV)分类系统的算法影响评估(AIA),并采用混合方法分析了Reddit上申请者之间的讨论。我们表明,虽然机构工件强调透明度、程序保障和有限影响,但申请者进行集体意义建构以解读不透明决策,常常在不确定性中依赖同行知识。我们识别了机构问责结构与人们感知过程之间的三种不对称:获取决策逻辑的认知不对称、由地缘政治定位塑造的管辖不对称,以及等待和不确定性体验中的时间-关系不对称。我们强调了将注意力从机构设计转向公共部门算法治理中体验的不均匀分布的重要性。这些贡献共同展示了跨国移民背景下的算法治理系统如何产生机构披露框架未能捕捉的结构性不对称,以及扩展ADMAPS如何能够解释这些不平等的问责转化。

英文摘要

This paper examines how algorithmic accountability in Canada's visa system is articulated institutionally and experienced by applicants across borders. We analyzed Immigration, Refugees and Citizenship Canada (IRCC)'s Algorithmic Impact Assessment (AIA) for the temporary resident visa (TRV) triage system using the algorithmic decision-making adapted for the public sector (ADMAPS) framework and analyzed Reddit discussions among applicants using a mixed-methods approach. We show that while institutional artifacts emphasize transparency, procedural safeguards, and bounded impacts, applicants engage in collective sensemaking to interpret opaque decisions, often relying on peer knowledge amid uncertainty. We identify three asymmetries between how institutional accountability is structured and how people perceive the process: epistemic asymmetry in access to decision logic, jurisdictional asymmetry in exposure shaped by geopolitical positioning, and temporal--relational asymmetry in how waiting and uncertainty are experienced. We emphasize why it is important to shift attention from institutional design to the uneven distribution of experiences with public-sector algorithmic governance. Together, these contributions demonstrate how algorithmic governance systems in the context of transnational migration produce structured asymmetries not captured by institutional disclosure frameworks, and how extending ADMAPS can account for those uneven translations of accountability.

2606.13068 2026-06-12 cs.MA cs.RO 新提交

Effects of Social Interactions in Self-Organising Railway Traffic Management

自组织铁路交通管理中社交互动的影响

Fabio Oddi, Federico Naldini, Leo D'Amato, Grégory Marlière, Paola Pellegrini, Vito Trianni

AI总结 研究自组织铁路交通管理中预测邻域范围(horizon)对分布式协调过程的影响,发现短时间范围足够,长范围会损害局部可解性和计算响应性而无全局收益。

详情
AI中文摘要

最近的研究正在探索自组织交通管理作为扩展到复杂现实网络的一种解决方案。在这样的系统中,列车预测其邻域,生成交通计划假设,并通过与邻居的共识达成未来要实施的交通计划。本文研究了该流程中的一个结构参数:预测邻域范围。列车使用该范围来识别与邻居的未来潜在冲突,并建立局部交互拓扑,即要与之协商的列车子集。作为主要设计变量,范围直接决定了社交互动图的大小和密度,而其对局部子问题复杂性和分布式共识动态的影响则代表了需要探索的权衡。通过闭环仿真框架,研究评估了范围变化如何影响整个分散协调过程,从初始冲突检测到分布式调度共识。分析重点在于研究范围选择引入的潜在权衡:平衡局部可解性和计算响应性与安全关键环境中全局调度一致性和可行性的需求。与直觉相反,我们的实证结果表明,短时间范围就足够了,而长时间范围会损害局部可解性和计算响应性,且不会带来全局调度最优性的提升。

英文摘要

Recent research is exploring self-organised traffic management as a solution for scaling to complex real-world networks. In such a system, trains predict their neighbourhood, produce traffic plan hypotheses, and agree via consensus with neighbours on a future traffic plan to be implemented. This paper investigates a structural parameter within this pipeline: the predictive neighbourhood horizon. The horizon is used by trains to identify future potential conflicts with neighbours, and to establish the local interaction topology, that is, the subset of trains to negotiate with. As the primary design variable, the horizon directly determines the size and density of the social interaction graph, whereas its impact on the complexity of local sub-problems and the distributed consensus dynamics represents a trade-off to be explored. Through a closed-loop simulation framework, the study evaluates how variations of the horizon impact the overall decentralised coordination process, from initial conflict detection to distributed schedule consensus. The analysis focuses on investigating the potential trade-off introduced by the horizon choice: balancing local tractability and computational responsiveness with the need for global schedule coherence and feasibility in safety-critical environments. Contrary to intuition, our empirical results indicate that short time horizons suffice, while long horizons compromise local tractability and computational responsiveness with no gain in global schedule optimality.
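How the horizon shapes the interaction graph can be sketched as follows, assuming each potential conflict is known as a pair of trains plus the predicted time until it occurs. That representation, the train names, and the numbers are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Conflict = Tuple[str, str, float]  # (train_a, train_b, minutes until predicted conflict)


def interaction_graph(conflicts: List[Conflict], horizon: float) -> Dict[str, Set[str]]:
    """Link two trains when their predicted conflict falls within the horizon."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b, minutes in conflicts:
        if minutes <= horizon:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def edge_count(graph: Dict[str, Set[str]]) -> int:
    """Number of undirected edges (each edge is stored in both directions)."""
    return sum(len(neighbours) for neighbours in graph.values()) // 2


# Hypothetical predicted conflicts.
conflicts = [
    ("T1", "T2", 5.0),
    ("T2", "T3", 12.0),
    ("T1", "T4", 30.0),
    ("T3", "T4", 45.0),
]

short_graph = interaction_graph(conflicts, horizon=15.0)
long_graph = interaction_graph(conflicts, horizon=60.0)
print(edge_count(short_graph), edge_count(long_graph))
```

A longer horizon can only add edges, so each train negotiates with at least as many neighbours; this is the growth in graph size and density that the abstract treats as the cost side of the trade-off.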

2606.13026 2026-06-12 cs.CY cs.AI 新提交

Democracy in the Era of Artificial Intelligence

人工智能时代的民主

Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing

AI总结 本文探讨如何利用人工智能升级民主制度,增强集体智慧、审议民主和自治系统,同时应对隐私、偏见和虚假信息等风险。

详情
AI中文摘要

将人工智能(AI)与民主相结合是我们时代最深刻的挑战之一。一方面,AI 为克服民主中长期存在的挑战提供了机会,例如在代表权不足的审议和投票过程中参与度低的问题。另一方面,AI 算法带来了新的风险,这些算法侵犯隐私、存在偏见、具有操纵性、传播虚假信息并影响选举结果。超越“AI 对民主是好是坏”这一过于简单的问题,《人工智能时代的民主手册》转而提出:如何利用 AI 升级民主及其所基于的原则?如何与 AI 互动以及以何种条件互动?需要哪些新的价值观和设计原则来建立民主韧性?来自世界各地不同学科的 59 位作者在 34 章中探讨了 AI 如何增强民主的集体智慧(第 1 部分),以及使用大型语言模型和社交媒体的审议民主的未来(第 2 部分)。我们还阐述了 AI 在构建有韧性的自治系统中的作用(第 3 部分),以及 AI 时代民主转型的挑战(第 4 部分)。最后,我们以更广阔的视角(第 5 部分)重新构想民主与 AI 的相互作用。

英文摘要

Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.

2606.12949 2026-06-12 cs.CR cs.CV 新提交

ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection

ViPER:用于鲁棒恶意软件检测的基于视觉的加壳感知编码器

Fatima Qaiser, Bisma Tahir, Muhammad Abid Mughal, Nauman Shamim

AI总结 提出ViPER,一种基于LoRA适配ViT-B/14的双头架构,联合学习恶意软件分类和加壳检测,通过加壳感知门控机制和频率加权损失处理加壳标签偏斜,在20万Windows PE图像上达到0.8521平衡准确率、0.9260 ROC-AUC和0.9279 AUPR。

详情
AI中文摘要

基于可视化的恶意软件检测将原始二进制字节映射为灰度图像,并应用学习的视觉分类器,为传统分析流程提供了一种抗规避且无需反汇编的替代方案。然而,可执行文件加壳仍然是一个关键的失效模式:加壳后的二进制文件产生高熵图像,掩盖了这些模型所依赖的结构模式。由于加壳在良性软件中也很常见(例如用于压缩或复制保护),仅凭加壳状态并不能可靠地指示恶意性,且现有方法未在统一的监督框架内解决这一挑战。我们提出了ViPER,一种基于视觉的加壳感知编码器,用于鲁棒的恶意软件检测。ViPER构建在LoRA适配的ViT-B/14骨干网络上,采用双头架构,联合学习恶意软件分类和加壳检测。加壳感知门控机制根据推断的加壳状态调节恶意软件预测,从而为加壳和未加壳输入实现不同的决策边界。为了解决训练期间加壳标签偏斜的问题,我们采用了频率加权损失,并在联合类别-加壳层上进行分层采样。在20万张Windows PE字节图图像上的评估中,ViPER达到了0.8521的平衡准确率、0.9260的ROC-AUC和0.9279的AUPR,在所有主要指标上均优于代表性的最先进基线,同时加壳检测AUC达到0.9949。

英文摘要

Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.
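Balanced accuracy, the first metric reported above, is the mean of per-class recall, which keeps a majority class from dominating the score. The sketch below computes it from scratch on hypothetical labels; it is not the paper's evaluation code.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall, so every class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 0 = benign (8 samples), 1 = malware (2 samples).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two malware samples is missed

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain_accuracy:.2f} balanced_accuracy={score:.2f}")
```

On this skewed toy set plain accuracy looks strong while balanced accuracy exposes the missed malware sample, which is the usual reason to report it alongside ROC-AUC and AUPR.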

2606.12904 2026-06-12 cs.IR cs.CL cs.HC cs.SI 新提交

Trait, Not State: The Durability of Reading Identity in Social Highlighting

特质而非状态:社交高亮中阅读身份的持久性

Kazuki Nakayashiki, Keisuke Watanabe

AI总结 通过分析读者前六个月的高亮行为作为个人档案,追踪其后续选择,发现阅读选择特征在长达24个月以上保持稳定,表明这是一种特质而非状态。

Comments 12 pages, 3 figures, 3 tables

详情
AI中文摘要

先前关于社交网络高亮工具的研究将个体性定位于选择——即一个人选择高亮哪些文档——但仅从横截面角度进行测量。我们提出时间性问题:读者的选择特征是特质还是状态?我们将每位读者前六个月的高亮行为冻结为个人档案,并追踪其在后续选择中(间隔逐渐增大至24个月以上)的自身优势,负样本来自同一日历时期——因此供给漂移不能伪装成个人漂移——在粗粒度全局层面和细粒度层面(其负样本和对照来自读者自身的兴趣领域)进行测量;锚定单元重现了先前的横截面水平(+0.188 vs +0.169),验证了该框架。四个结果:在同一用户内,细粒度优势在任何时间跨度上均未显示统计上可检测的配对下降(6-12个月保留率 R = 1.00 [0.85, 1.18],n = 212;最远的区间与适度下降兼容;唯一区间排除零的对比是12-24个月的粗粒度层,约下降13%)。该信号不可简化为重复域名(排除所有档案来源后约90%信号保留)。个体内漂移缓慢(最近半年的档案比旧半年档案高出+0.042)。前瞻性地,个人档案——即使仅由读者最早期的文档构建(评估前中位数20个月)——其下一阅读的AP值约为所有测试过的简单非个人先验的3倍。我们将“特质”操作性地定义为在持续参与下的稳定特征;研究范围限于一个平台上的重度、长期读者,且曝光与选择不可分离。

英文摘要

Prior work on a social web highlighter located individuality in selection -- which documents a person chooses to highlight -- but measured it cross-sectionally. We ask the temporal question: is a reader's selection signature a trait or a state? We freeze each reader's first six months of highlighting as a profile and track its own-vs-other advantage on their later selections at growing gaps (to 24+ months), with negatives drawn from the same calendar era -- so supply drift cannot masquerade as personal drift -- at a coarse global level and at a fine level whose negatives and controls come from the reader's own interest neighborhood; the anchor cell reproduces the prior cross-sectional level (+0.188 vs +0.169), validating the harness. Four results. Within the same users, the fine-layer advantage shows no statistically detectable paired decline at any horizon (6-12 month retention R = 1.00 [0.85, 1.18], n = 212; the farthest bin is compatible with a modest decline; the only contrast whose interval excludes zero is the coarse layer at 12-24 months, about 13%). The signal is not reducible to repeated domains (~90% survives excluding all profile sources). Within-person drift is slow (a recent-half profile beats the old half by +0.042). Prospectively, personal profiles -- even one built from a reader's earliest documents, median 20 months before evaluation -- rank their next reads at roughly 3x the AP of every simple non-personal prior tested. We use "trait" operationally (a stable signature under continued engagement); the scope is heavy, long-tenured readers of one platform, and exposure is not separable from choice.

2606.12864 2026-06-12 cs.SE cs.AI 新提交

Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

超越问题求解:用于评估竞赛编程中代码生成、攻击和修复的UOJ-Bench基准

Tingqiang Xu, Hangrui Zhou, Tianle Cai, Alex Gu, Kaifeng Lyu

AI总结 提出UOJ-Bench基准,通过代码生成、攻击和修复三项任务评估LLM在竞赛编程中的问题求解与人类代码错误识别能力,发现最强模型在一次性评估中无法识别超过50%的错误提交,但测试时扩展可提升至90%以上,且能发现约5%的满分提交中的错误。

详情
AI中文摘要

尽管大型语言模型(LLM)在竞赛编程中表现出色,但其在相同环境下支持人类学习的作用仍基本未被探索。本文介绍UOJ-Bench,一个旨在评估LLM不仅解决问题能力,还能识别人类编写代码中错误的基准,这是传统上通过在线评测系统运行测试用例支持的关键教育活动。UOJ-Bench包含三个不同任务:代码生成、代码攻击和代码修复,所有任务均基于Universal Online Judge(UOJ)上的真实代码提交构建,并通过UOJ的原生评测基础设施进行评估。我们的结果表明,在一次性评估下,即使最强的模型也无法识别超过50%的被UOJ用户发现错误的提交。虽然测试时扩展将成功率提升至90%以上,但模型推理带来的巨大计算成本限制了其大规模部署的实用性。尽管存在这些限制,我们发现,在测试时扩展下,最佳性能模型可以在大约30个问题中识别超过5%的满分提交中的错误,这表明前沿LLM已经能够提供超越标准评测系统的补充信号。

英文摘要

Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.

2606.12849 2026-06-12 cs.DC cs.CV cs.RO 新提交

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

SemanticXR: 低功耗实时可查询语义建图与对象级设备-云架构

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve

AI总结 提出首个设备-云协同系统SemanticXR,通过对象级通信、执行和内存管理,在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询,服务器端建图延迟改善2.2倍,设备功耗仅增加2%。

详情
AI中文摘要

语义建图是新兴扩展现实(XR)应用(如AI助手和空间对象搜索)中实现具身交互的核心服务。在移动XR设备上部署此功能需要系统具备开放词汇、实时和低功耗特性。现有方法计算密集且假设服务器级资源。云卸载提供了一条实用路径,但现有系统未在设备-云边界拆分语义建图或管理其通信、执行和内存占用。我们提出SemanticXR,首个在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询的设备-云系统。我们的关键洞察是将语义可识别对象提升为跨设备和服务器的通信、执行和内存的一级单元。在服务器端,对象级并行和几何下采样改善了建图延迟,而对象级深度建图协同设计降低了上行带宽。在设备端,具有增量更新和更新优先级的对象级稀疏局部地图实现了网络鲁棒的查询,并限制了内存和下行带宽。对象级可配置的资源使用与质量权衡让应用和系统分别根据应用需求和运行条件调整建图。与使用相同感知模型的设备-云基线相比,对象级组织在同等语义质量下将服务器端建图延迟提升了2.2倍。深度建图协同设计将上行带宽维持在2.5 Mbps以下。在设备端,SemanticXR即使在网络中断时也能为多达10,000个对象维持低于100 ms的查询延迟,在500 MB内支持数万个对象,并将下行带宽随地图变化而非总场景大小缩放。系统在正常运行时仅增加2%的设备功耗。

英文摘要

Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.

2606.12845 2026-06-12 cs.CR cs.LG New submission

A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction

John Fields, K M Sajjadul Islam, Ruchitha Thota, Victor Chen, Praveen Madiraju

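AI summary Proposes a remote data science framework based on PySyft and a semi-air-gapped architecture that lets three universities collaboratively predict student retention without direct access to sensitive data, demonstrating the feasibility of privacy-preserving machine learning in educational settings.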

Comments 7 pages, 2 figures. Accepted at the 2026 IEEE International Conference on Information Reuse and Integration (IEEE IRI 2026)

AI Chinese abstract

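This study explores privacy-preserving machine learning (PPML) techniques using the PySyft platform to enable collaborative prediction of student retention across institutions. We developed a remote data science (RDS) framework with a semi-air-gapped architecture comprising high-side and low-side servers, which lets researchers from three universities build predictive models on sensitive student data without direct access to the data. Using historical data from a small private university (N=720), we evaluated three synthetic data generation approaches and validated the framework through inter-institutional collaboration. The results show consistent classification performance across institutions (Macro F1: 0.690–0.695) while maintaining strict compliance with the Family Educational Rights and Privacy Act (FERPA). We also propose Data-Type-Aware Templates, a novel synthetic data method that prioritizes privacy over distributional fidelity. Our findings confirm that RDS-based PPML is technically feasible in educational settings and offers a practical alternative to federated learning for small-scale inter-institutional collaboration. The code is available at https://github.com/jtfields/NAIRR240195-Privacy-Preserving-Machine-Learning.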

English abstract

This study explores privacy-preserving machine learning (PPML) techniques using the PySyft platform to enable collaborative prediction of student retention between institutions. We developed a remote data science (RDS) framework with a semi-air-gapped architecture consisting of high-side and low-side servers, allowing researchers from three universities to build predictive models on sensitive student data without direct data access. Using historical data from a small private university (N=720), we evaluated three synthetic data generation approaches and validated the framework through inter-institutional collaboration. The results demonstrate consistent classification performance across institutions (Macro F1: 0.690–0.695) while maintaining strict Family Educational Rights and Privacy Act (FERPA) compliance. We also propose Data-Type-Aware Templates, a novel synthetic data method that prioritizes privacy over distributional fidelity. Our findings confirm that RDS-based PPML is technically feasible for educational settings and offers a practical alternative to federated learning for small-scale inter-institutional collaborations. The code is available at https://github.com/jtfields/NAIRR240195-Privacy-Preserving-Machine-Learning.