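arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.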

2606.05568 2026-06-05 cs.IR cs.CL

ColBERTSaR: Sparsified ColBERT Index via Product Quantization

Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, Saron Samuel, Rohan Jha

Affiliation: Johns Hopkins University

AI summary: Proposes using product quantization to turn a ColBERT index into a true inverted index, substantially reducing index size (50-70% smaller than a one-bit PLAID index) while retaining retrieval effectiveness.

Comments 6 pages, 1 figure, accepted at SIGIR 2026 as a short paper

DOI: 10.1145/3805712.3809920

Abstract

While ColBERT is an effective neural retrieval architecture, it requires a heavy index structure to support candidate set retrieval based on approximated token embeddings, gathering and decompressing document token embeddings, and applying the MaxSim operation. Indexes in PLAID and similar ColBERT implementations require five to ten times the disk storage of the original raw text, which limits their scalability. Furthermore, prior work has identified that the gathering and decompression stages are the primary inefficiencies at query time. Limiting the number of document tokens that must be gathered by thresholding and score approximation does not eliminate the need for the entire index to support ad hoc queries. In this work, we propose an embedding quantization approach that turns a ColBERT index into a true inverted index. We show that, theoretically, ColBERT with embedding quantization is equivalent to learned-sparse retrieval except for the scoring mechanism. Empirically, we demonstrate that our index is 50-70% smaller than a one-bit PLAID index while retaining retrieval effectiveness.
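
For readers unfamiliar with the MaxSim operation named in the abstract, here is a minimal sketch of ColBERT-style late-interaction scoring in plain Python. The function name `maxsim_score` and the toy two-dimensional vectors are illustrative assumptions; nothing here is taken from the paper's code or from PLAID.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, not real model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]
doc_b = [[0.5, 0.5]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

In this toy case `doc_a` scores higher than `doc_b` because each query token finds a closely aligned document token. Every document token embedding takes part in the inner maximum, which is why the gathering and decompression stages the abstract describes are needed before scoring.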

2606.05548 2026-06-05 cs.SE cs.AI

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

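Jintao Huang, Xiaomin Li, Gaurav Mittal, Yu Hu

Affiliations: The Ohio State University; Microsoft

AI summary: Proposes the LLM-as-a-Developer methodology and evaluates 51 Python ADK frameworks with the automated ADK Arena pipeline, finding that generation cost varies 5.6x across frameworks, that no single framework dominates, and that documentation, source code, and parametric knowledge are largely substitutable.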

Comments Work in Progress

Abstract

The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in ADK Arena, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, τ²-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent-benchmark pairs), we find that: (1) generation succeeds for 57% of runs, and its cost varies 5.6x across frameworks ($0.6 to $3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2) no single framework dominates: the best single-benchmark ADK agents resolve up to 80% of tasks and can even beat general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32%; (3) across information-source ablations, genuine framework usage stays within a narrow 28-40% band (highest with raw source access and still 33% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.

2606.05509 2026-06-05 cs.HC cs.AI

The Role of Instructional Guidance in Generative AI-Assisted Learning: Empirical Evidence from Construction Engineering Education

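Xiaoyu Hou, Bo Xiao, Hexu Liu, Shane Mueller

Affiliations: Dept. of Civil, Environmental, and Geospatial Engineering, Michigan Technological Univ.; Dept. of Civil and Construction Engineering, Western Michigan Univ.; Dept. of Psychology and Human Factors, Michigan Technological Univ.

AI summary: Introduces a five-step prompting framework grounded in Generative Learning Theory and compares unprompted AI-supported, prompted AI-supported, and slide-based learning in construction engineering education. The prompting framework significantly improves performance on tasks requiring explanation and reasoning (open-ended scores rise by about 2-3 points, p < 0.01), indicating that the effectiveness of AI-supported learning depends on how interaction is structured.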

Abstract

Generative artificial intelligence (AI) is increasingly used to support self-directed learning, yet student interaction with such systems often remains unstructured, limiting engagement in deeper cognitive processes. This study examines how instructional guidance shapes student and AI interaction in construction education. A five-step prompting framework grounded in Generative Learning Theory (GLT) is introduced to guide learner interaction during review activities. A controlled experiment compares three learning conditions: slide-based learning, unprompted AI-supported learning, and prompted AI-supported learning. Learning performance is assessed using multiple-choice and open-ended tasks, and user experience is measured using the User Experience Questionnaire (UEQ). Performance differences are concentrated on tasks requiring explanation and reasoning. The prompted condition achieves higher open-ended scores, with an improvement of approximately 2 or 3 points on a scale of 18 (p < 0.01), while no significant differences are observed in multiple-choice performance. The unprompted condition remains comparable to slide-based learning. These findings indicate that the effectiveness of AI-supported learning depends on how interaction is structured. The proposed framework provides a basis for integrating learning science principles into generative AI systems for construction education.

2606.05488 2026-06-05 stat.ML cs.LG stat.ME

Sparse Functional Singular Value Decomposition for Biclustering and Triclustering Longitudinal Data

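Yue Zhao, Thierry Chekouo, Sandra Safo

Affiliation: Division of Biostatistics and Health Data Science, University of Minnesota

AI summary: Proposes the Tri-SfSVD framework, which uses sparse penalties to combine continuous trajectory estimation with simultaneous subject, feature, and temporal selection, enabling biclustering and triclustering of longitudinal data and outperforming existing methods.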

Abstract

Identifying subtypes of complex conditions, such as Inflammatory Bowel Disease (IBD), often requires capturing latent patterns in longitudinal omics data. However, these data are typically high-dimensional, sparsely sampled, and irregularly observed over time, posing substantial challenges for conventional (bi)clustering and functional data analysis methods. We propose Tri-SfSVD, a unified sparse functional Singular Value Decomposition framework for discovering biclusters and triclusters in longitudinal data. Unlike existing functional biclustering methods that rely on ad hoc imputation or enforce restrictive shape-homogeneity assumptions, Tri-SfSVD integrates continuous trajectory estimation with simultaneous subject, feature, and temporal selection within a single optimization framework. By imposing sparse penalties across subjects, variables, and temporal subregions, the proposed method works directly on observed data to uncover localized structures at the subject, subject-feature, and subject-feature-time levels. Extensive simulations demonstrate that Tri-SfSVD outperforms existing approaches in high-dimensional settings. Applied to IBD multi-omics data, the method identified three biclusters linking sample clusters with distinct IBD-related clinical characteristics to microbial pathway groups associated with specific bacterial taxa, providing interpretable subject-pathway associations for characterizing disease heterogeneity. Applied to multi-channel EEG data, the method identified three triclusters linking sample clusters with distinct alcohol-related phenotypes to localized brain activity patterns, including subgroup differences separated by temporal subregions within the same spatial region.

2606.05474 2026-06-05 q-bio.BM cs.LG

AlloGen: Conformation-Selective Binder Generation with Differential State Scoring

Hanqun Cao, Zachary Quinn, Aastha Pal, Sumi Kimura, Jingjie Zhang, Pheng Ann Heng, Pranam Chatterjee

Affiliations: Department of Computer Science and Engineering; The Chinese University of Hong Kong; Department of Bioengineering; University of Pennsylvania; Department of Computer and Information Science

AI summary: Proposes the AlloGen framework, which pairs backbone generation with a learnable conformation-selectivity scorer $Q_θ$ to design binders that are selective for distinct conformational states of a protein.

Abstract

Protein binder design has largely optimized for affinity alone, leaving conformational selectivity unaddressed: for allosteric targets such as kinases, nuclear receptors, and GPCRs, a binder that engages both active and inactive states provides no functional specificity regardless of how tightly it binds. We introduce AlloGen, a modular framework that decouples backbone generation from a learned state-selectivity scorer $Q_θ$, an SE(3)-invariant interface graph transformer trained via a two-phase curriculum that first learns interface geometry before imposing conformational discrimination. Because $Q_θ$ is fully differentiable and generator-agnostic, it integrates with any backbone generator as a passive reranker or an active gradient-based guide without retraining. Across a diverse benchmark of proteins spanning multiple families and conformational mechanisms, AlloGen consistently identifies binders that preferentially recognize desired structural states while rejecting alternative conformations. Experimental validation on calmodulin further demonstrates that these computational selectivity signals translate to physical molecules, yielding de novo peptides that bind the desired holo conformation while exhibiting no detectable binding to the apo state. Together, these results establish conformational selectivity as a learnable property and provide a general framework for state-selective protein binder design.

2606.05443 2026-06-05 cs.DL cs.CL

MIRAI: Prediction and Generation of High-Impact Academic Research

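Alex Li, Joseph Jacobson

Affiliation: MIT Media Lab

AI summary: Proposes MIRAI, a deep learning framework that predicts a paper's 5-year PageRank and citation count from its title, abstract, and publication date, and builds a research-ideation pipeline on top of it to generate high-impact research ideas.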

2606.05396 2026-06-05 cs.CR cs.AI cs.SE

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

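Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

Affiliations: University of Naples Federico II; University of Coimbra; University of North Carolina at Charlotte

AI summary: Applies abliteration, a low-rank weight edit, to code LLMs to remove their refusal of vulnerability-injection prompts, separating willingness to comply from code-generation capability. After abliteration the refusal rate drops to zero or near zero while syntactic validity stays above 93%, but the injection rate remains bounded by model capacity.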

Abstract

Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it transforms vulnerable seeds rather than synthesising vulnerabilities from a specification. A complementary route is to start from safe code and ask an instruction-tuned LLM to inject a specified CWE (which would shift the labeling burden from open-ended detection to bounded binary confirmation), but safety-aligned code LLMs systematically refuse such prompts. This paper is a preliminary feasibility study of abliteration, a low-rank weight edit that orthogonally projects out the refusal direction in the residual stream, as a tool to remove this barrier. We use Python and CWE-89 (SQL injection) as a case study, evaluating the Qwen2.5-Coder-Instruct family at 3B, 7B, and 14B parameters on safe samples drawn from PromSec and SafeCoder, replicated three times per condition. We find that (i) refusal on injection prompts is strongly size- and prompt-context-dependent: the 14B refuses 100% of prompts, the 7B refuses 73% of PromSec but only 5% of SafeCoder, whereas the 3B is essentially never blocked; (ii) abliteration reduces refusal to zero or near-zero across all sizes while leaving syntactic validity above 93%, supporting the view that, in this setting, refusal can be detached from measured code-generation capability; and (iii) the post-abliteration injection rate remains capacity-bound (88-97% on the 14B, 89-90% on the 7B, and 25-48% on the 3B), separating willingness, which abliteration unlocks, from capability, which scales with parameters. Vulnerability verdicts are produced by a three-tool detector ensemble (CodeQL, Semgrep, Bandit) followed by manual adjudication by two authors on detector-positive outputs.
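
The phrase "orthogonally projects out the refusal direction" can be made concrete with a small sketch. Assuming a weight matrix W that writes into the residual stream and a given direction r, the edit replaces W with (I - r rᵀ) W for unit-norm r, which is a rank-one change. The function names below are hypothetical, the matrix is a toy, and how the refusal direction is estimated is outside the scope of this sketch and not described in the abstract.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def project_out(weight: Matrix, direction: List[float]) -> Matrix:
    """Return (I - r r^T) W for unit r, so that W x has no component along r."""
    r = normalize(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: one coefficient per column of W.
    coeffs = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(n_cols)]
            for i in range(n_rows)]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


# Toy weight matrix and a hypothetical "refusal direction".
W = [[1.0, 2.0], [3.0, 4.0]]
refusal_direction = [1.0, 1.0]

W_edited = project_out(W, refusal_direction)
output = matvec(W_edited, [0.5, -1.5])
# Component of the edited layer's output along the removed direction.
component = sum(o * r for o, r in zip(output, normalize(refusal_direction)))
```

Whatever input is fed to the edited matrix, its output has (up to rounding) no component along the removed direction, while components orthogonal to it are left unchanged.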

2605.04135 2026-06-05 cs.CY cs.AI cs.CL

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

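David Gringras, Misha Salahshoor

Affiliations: Harvard University; AISST

AI summary: An audit of 112,303 LLM-keyword-matched candidate records finds that the median paper evaluates a model 10.85 ECI behind the contemporaneous frontier (about 1.4x the gap between Claude Sonnet 3.7 and Claude Opus 4.5), with the gap widening by 5.53 ECI per year. Only 3.2% of abstracts disclose reasoning-mode status, and 52.5% generalize their conclusions to "AI". Proposed remedies include the VERSIO-AI checklist.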

Comments v2. 65 pp, 9 figs, 8 tables, 8 appendices. Pre-registered on OSF: doi.org/10.17605/OSF.IO/7XM3D. Code+data: doi.org/10.5281/zenodo.20060457. VERSIO-AI v1.2 reporting checklist (Appendix A): doi.org/10.5281/zenodo.20060459. frontierlag package + per-DOI audit tool: frontierlag.org

Abstract

Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about "AI" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of "AI" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.

2606.05391 2026-06-05 cs.SE cs.AI

Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents

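Shipi Dhanorkar, Samir Passi, Mihaela Vorvoreanu

Affiliation: Microsoft, Redmond, USA

AI summary: Drawing on interviews with 17 experienced developers, explores how humans oversee autonomous software agents in practice, identifying four forms of oversight work (a priori control, co-planning, real-time monitoring, and post hoc review) and summarizing oversight challenges and coping strategies.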

Abstract

Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent oversight is largely conceptual; normative frameworks exist, but how users actually oversee agents is less known. In this paper, we bridge this gap by providing early empirical anchors for the theoretical discourse on agent oversight. Drawing on interviews with 17 experienced developers, we conduct an exploratory inquiry examining what forms of emergent oversight work developers perform, when, and how. We also document the oversight challenges developers face and the strategies they have started using to address them. We found at least four forms of emergent oversight work: a priori control, co-planning, real-time monitoring, and post hoc review. We show that oversight work is not only reactive and retrospective, as portrayed in existing research, but also preventative and proactive. We describe situated oversight challenges (e.g., difficulty reviewing agent-generated code) and outline heuristics developers adopt to address such challenges (e.g., using test results as guarantees for code correctness). We conclude with high-level takeaways, future research directions, implications for the human-centered design of software agents and for software engineering practice, and limitations of our research.

2606.05383 2026-06-05 econ.GN cs.AI econ.TH q-fin.EC

Can AI Refute Economic Theory? Evidence from Beyond the Knowledge Cutoff

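Alexis Akira Toda

Affiliation: Department of Economics, Emory University

AI summary: Tests several AI models (Gemini, Refine, Claude, and ChatGPT) on checking four economic theory papers that contain errors, finding that ChatGPT Pro performs best but cannot discover the errors independently, which suggests that AI cannot yet refute economic theory autonomously.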

2606.05380 2026-06-05 cs.DS cs.LG

Learning-Augmented Online Minimization with Dual Predictions

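Christian Coester, Alexa Tudose, Alexander Turoczy

Affiliation: University of California, Berkeley

AI summary: For two classes of online minimization problems, metrical task systems and laminar set cover, proposes learning-augmented algorithms that use machine-learned predictions of the optimal solution to the dual linear program to improve theoretical guarantees.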

2606.05365 2026-06-05 stat.ML cs.LG

Environment-Robust Representation Learning with Empirical Bayes

Yuli Slavutsky, Matthew Shen, Bohan Wu, David M. Blei

Affiliations: Department of Statistics, Columbia University; Columbia University; Departments of Statistics and Computer Science, Columbia University

AI summary: Proposes an empirical Bayes variational method that learns invariant latent variables through a cross-environment balancing term, yielding robust predictions in new environments and outperforming existing methods on astronomical, microbiome, and ICU data.

Abstract

We consider multi-environment prediction problems. We assume the environments change the distribution of a latent variable, while the mechanisms generating observed covariates and targets remain stable conditional on that variable. For example, hospitals or clinical cohorts may differ in the prevalence of latent patient states, even though the relationships between those states, physiological measurements, and outcomes remain unchanged. Given a dataset from multiple environments, we formulate a Bayesian model for such problems and derive the corresponding variational objective. We show that this objective decomposes into per-environment terms and an additional cross-environment balancing term induced by the model's structure. We use an empirical Bayes method to set the prior and incorporate it into the objective. Based on this objective, we develop an amortized variational algorithm for posterior approximation, and use the resulting learned latent variables to form predictions in new environments. We study our approach through simulations and real-world studies of astronomical source identification, microbiome-based disease detection, and ICU sepsis prediction. Across these settings, our method outperforms previous approaches for prediction in new environments.

2606.05361 2026-06-05 stat.ML cs.LG

TabSODA: Tabular Diffusion based Imputation with Skip Pattern Detection and Ordinal Awareness

Yuyu Chen, Taehyo Kim, Hai Shu, Yang Feng

Affiliation: Department of Biostatistics, NYU School of Global Public Health

AI summary: Proposes TabSODA, an EM-based diffusion imputer that handles structural skips and ordinal variables in large-scale surveys, substantially improving imputation accuracy on the PATH and NSDUH datasets.

Abstract

Missing data imputation in large-scale surveys faces two challenges that are not well handled by current tabular diffusion methods. First, structural skips, cells made inapplicable by questionnaire design, should not be imputed but are often conflated with item nonresponse. Second, ordinal responses encode ordered categories, yet most pipelines treat them as nominal levels through one-hot or analog-bit encodings. We introduce TabSODA (Tabular diffusion with Skip pattern detection and Ordinal Awareness), an Expectation-Maximization (EM)-based diffusion imputer built on the Elucidated Diffusion Model (EDM) framework. TabSODA propagates structural skips through the denoising loss and reverse-time sampler, and represents ordinal variables with cumulative-probit scalar latents while retaining analog-bit encodings for nominal variables. When a codebook skip mask is available, TabSODA uses it directly; otherwise, the TabSODA+SKIP variant estimates the mask from raw responses and questionnaire order using a CART-based skip-pattern miner. On the Population Assessment of Tobacco and Health (PATH) study and the National Survey on Drug Use and Health (NSDUH), two nationally representative U.S. surveys, TabSODA reduces ordinal MACE by up to 23.7% and improves categorical accuracy by up to 9% over the strongest baseline across MCAR, MAR, and MNAR masking. The skip miner achieves near-perfect precision on both datasets, allowing TabSODA+SKIP to closely track the codebook-mask variant.

2606.05339 2026-06-05 cs.SE cs.AI

A Taxonomy of Runtime Faults in Model Context Protocol Servers

Joshua Owotogbe, Indika Kumara, Willem-Jan van den Heuvel, Damian Andrew Tamburri, Antonio Ken Iannillo, Roberto Natella

Affiliations: Jheronimus Academy of Data Science and Tilburg University; Jheronimus Academy of Data Science; University of Sannio; University of Luxembourg; University of Naples Federico II

AI summary: By manually analyzing 837 fault threads from 473 MCP server repositories with a bottom-up open coding procedure, establishes the first empirical taxonomy of runtime faults in MCP servers, comprising 11 top-level categories and 27 subcategories (73 leaf fault types), and assesses its external validity through a developer survey.

Comments 14 pages

Abstract

MCP (Model Context Protocol) enables LLMs (Large Language Models) to interact with external tools and data sources via a standardized protocol. Its rapid adoption in tool-augmented Artificial Intelligence (AI) workflows has introduced new reliability challenges, such as configuration parameters that are accepted but not enforced at runtime, leading to unintended default behavior, whose runtime fault characteristics remain empirically unexamined. We present the first empirical taxonomy of runtime faults in MCP servers. We manually analyzed 837 MCP-specific runtime fault threads from 473 actively maintained MCP server GitHub repositories and derived a taxonomy using a bottom-up open coding procedure. The taxonomy comprises 11 top-level categories and 27 subcategories (73 leaf fault types), covering recurrent failures across protocol interactions, tool invocations, schema enforcement, state management, model-provider integration, security validation, and timeouts or explicit cancellations of in-progress operations. To assess the taxonomy's external validity, we surveyed 55 MCP server developers. Respondents reported experiencing an average of 20 of the 27 fault subcategories, and no category remained unobserved. These results indicate that the taxonomy reflects widely observed runtime failures in MCP-based systems and shall assist AI software maintenance and evolution in the future.

2606.05328 2026-06-05 cs.GR cs.AI cs.CV cs.LG

The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

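Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Majid Mirmehdi

Affiliations: University of Bristol; McGill University; Mila–Quebec AI Institute; Microsoft Research; University of Calgary

AI summary: Probes the latent trajectories of video diffusion models by approximately inverting the sampling process and finds that physical plausibility is linearly decodable from diffusion transformer states with about 81.27% average accuracy, suggesting that physically meaningful representations arise as a byproduct of generative denoising.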

Abstract

Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model's intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.

2606.05326 2026-06-05 math.OC cs.AI cs.LG math-ph math.AP math.MP

Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

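Antonin Chodron de Courcel

Affiliation: Ecole Normale Supérieure, CNRS, 45 rue d’Ulm, 75005 Paris, France

AI summary: For the Edge of Stability dynamics of gradient descent at large learning rates, proposes a continuous-time effective model that tracks the average trajectory and the covariance of its fast oscillations, identifies an effective free energy as the key quantity to monitor, and derives a kinetic equation for wide two-layer networks in the mean-field limit.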

Comments Comments are welcome!

Abstract

We study the dynamics of gradient descent in the Edge of Stability regime, where the learning rate is large enough to induce persistent oscillations in the loss and the sharpness. We propose a continuous-time effective model that tracks the evolution of the average trajectory coupled with the time-averaged covariance of its fast oscillations. Our analysis reveals that the natural quantity to monitor in such unstable regimes is an effective free energy, which combines the original risk functional with a curvature-related "entropic" term. Our model allows us to track the envelope of the oscillations even in situations where its dynamics evolve on similar timescales as the averaged weights. Otherwise stated, we can track the spikes that occur during the training of some neural network architectures. For wide two-layer neural networks optimized under stable non-vanishing oscillations, we derive a mean-field limit that results in a novel kinetic equation describing the joint distribution of weights and their fluctuations. We show that this equation can be interpreted as a Wasserstein-2 gradient flow of a macroscopic free energy. Finally, we provide numerical evidence on matrix factorization and deep learning tasks (CIFAR-10) to demonstrate the model's accuracy in capturing the envelope of the oscillations and the predictive power of the effective free energy.
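To make the instability threshold concrete, here is a minimal sketch on a one-dimensional quadratic, which is an illustrative toy and not the paper's effective model. For f(x) = 0.5 * sharpness * x^2, gradient descent multiplies the iterate by (1 - lr * sharpness) at each step, so it converges for lr < 2 / sharpness, oscillates with constant amplitude at lr = 2 / sharpness, and diverges above it. The function name and the numeric values are assumptions chosen for illustration.

```python
def gradient_descent_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = sharpness * xs[-1]      # f'(x) = sharpness * x
        xs.append(xs[-1] - lr * grad)  # x <- (1 - lr * sharpness) * x
    return xs


sharpness = 4.0               # curvature (largest Hessian eigenvalue) of the toy loss
threshold = 2.0 / sharpness   # classical stability threshold for the step size

stable = gradient_descent_quadratic(sharpness, lr=0.9 * threshold)
edge = gradient_descent_quadratic(sharpness, lr=threshold)
unstable = gradient_descent_quadratic(sharpness, lr=1.1 * threshold)

print(abs(stable[-1]), abs(edge[-1]), abs(unstable[-1]))
```

On a pure quadratic the oscillation either dies out or blows up. The persistent, bounded oscillations described in the abstract require a non-quadratic loss, which this toy deliberately leaves out.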

2606.05268 2026-06-05 cs.GR cs.LG

Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation

Sharon Zhang, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala

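Affiliations * Stanford University; Roblox

AI summary: Proposes a pipeline that aggregates LLM-generated weak verifiers into a strong verifier for spatial layout domains, raising F1 scores by up to 7x on 3D room layout and 2D poster design tasks.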

Abstract

We present a pipeline for building and aggregating task-specific, LLM-generated weak (imperfect) verifiers into a strong verifier for spatial layout domains. Given a task description, our pipeline asks an LLM to synthesize a collection of verifier programs using a layout verification DSL. Each individual LLM-generated verifier usually provides an imperfect check for a match between the layout and the corresponding task description. We show that by aggregating the responses of many such verifiers we can produce a stronger verifier. Moreover, by applying techniques from weak learning, our pipeline can learn how to aggregate the weak verifiers from a very sparse set of human labeled example layouts (about 10). We find that the strong verifiers produced by our pipeline outperform the status-quo approach of using a set of LLM judges to directly check whether a layout matches a task description, raising F1-scores by up to 7X across a variety of 3D room layout and 2D poster design tasks. We also demonstrate that verifier-guided layout generation using natural language feedback from our strong verifiers improves layout quality of a base layout generator by up to 66.2% according to a human evaluator.
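The aggregation idea can be sketched as a weighted vote whose weights are fit on a handful of labeled layouts. This is only an illustration under stated assumptions: the abstract does not specify the aggregation rule or the verification DSL, so the log-odds weighting, the smoothing constant, and the names `fit_weights` and `aggregate` are hypothetical and are not the authors' method.

```python
import math


def fit_weights(votes, labels, smoothing=1.0):
    """Estimate one log-odds weight per verifier from a few labeled layouts.

    votes[i][j] is True if verifier j accepts layout i;
    labels[i] is True if a human judged layout i to match the task.
    """
    n_verifiers = len(votes[0])
    weights = []
    for j in range(n_verifiers):
        correct = sum(1 for row, y in zip(votes, labels) if row[j] == y)
        # Smoothed accuracy keeps the weight finite with very few labels.
        acc = (correct + smoothing) / (len(labels) + 2 * smoothing)
        weights.append(math.log(acc / (1 - acc)))
    return weights


def aggregate(vote_row, weights):
    """Weighted vote: accept the layout if the signed weight sum is positive."""
    score = sum(w if v else -w for v, w in zip(vote_row, weights))
    return score > 0


# Six labeled layouts, three weak verifiers (hypothetical data).
labels = [True, True, True, False, False, False]
votes = [
    [True, True, False],
    [True, False, True],
    [False, True, True],
    [False, False, True],
    [False, True, False],
    [False, False, True],
]
weights = fit_weights(votes, labels)
print(weights)
print(aggregate([True, False, True], weights))
```

In this toy data the third verifier agrees with the labels only half the time, so its weight is zero and it has no influence on the aggregated decision.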

2606.05262 2026-06-05 cs.IT cs.AI math.IT

X-Band UAV-enabled Integrated Sensing and Communications for Vehicular Networks

Remon Polus, Soumaya Cherkaoui

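Affiliations * Department of Computer and Software Engineering

AI summary: For an X-band UAV-enabled integrated sensing and communications system, proposes an optimal time-allocation method based on a double-shadowing channel model that balances sensing accuracy against communication performance.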

2606.05258 2026-06-05 stat.ML cs.LG stat.AP

HyFAD: Hybrid Time-Frequency Diffusion with Frequency-Aware Embedding for Time Series Imputation

Hongfan Gao, Wangmeng Shen, Bin Yang, Jilin Hu

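Affiliations * School of Data Science and Engineering; East China Normal University

AI summary: Proposes HyFAD, which couples a time-frequency diffusion framework with a frequency-aware step embedding so that denoising proceeds progressively from the time domain to the frequency domain, addressing frequency-sensitive denoising and high-frequency reconstruction and reaching state-of-the-art performance on multiple benchmark datasets.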

Abstract

Diffusion models have demonstrated strong performance in time series modeling due to their ability to progressively capture complex data distributions through iterative denoising. However, existing approaches struggle with frequency-sensitive denoising, high-frequency reconstruction, and balancing global trends with local dynamics. To address these limitations, we propose HyFAD, a Hybrid time-frequency Diffusion model with Frequency-Aware embedding for time series imputation. Built upon the DDPM paradigm, HyFAD adopts a coupled time-frequency diffusion framework, in which the reverse denoising proceeds sequentially from the time domain to the frequency domain, enabling coarse-to-fine generation. Specifically, the time-domain diffusion process captures low-frequency global trends, while the frequency-domain diffusion process refines high-frequency spectral components. We further introduce a frequency-aware step embedding that exploits the relationship between diffusion steps and spectral components, providing step-dependent spectral guidance and facilitating more accurate band-wise reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that HyFAD achieves state-of-the-art performance. Our source code is available at https://github.com/hongfangao/HyFAD.

2606.05233 2026-06-05 cs.CR cs.AI cs.CL

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Nicholas Saban

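Affiliations * Patronus AI; University of California, Berkeley

AI summary: Builds a benchmark of 793 browser episodes (24 multi-step web tasks, 56 attack templates) to evaluate how well frontier computer-using agents resist prompt injection. The resistance (0/140 attack successes) resides in the model weights but is domain-conditioned and fails for coding agents (attack success up to 100%); the high attack success rates in the literature are attributed mainly to RL-optimized injection text rather than to the attack categories.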

Abstract

Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature's high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.
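The reported upper bound can be checked by hand. With zero successes in n trials, the Clopper-Pearson upper limit has a closed form: it is the p that solves (1 - p)^n = alpha / 2, so no beta quantile routine is needed. The sketch below assumes the two-sided 95% interval (alpha = 0.05), which is the reading that reproduces the abstract's 2.60%; the function name is hypothetical.

```python
def clopper_pearson_upper_zero_successes(n, confidence=0.95):
    """Upper limit of the two-sided Clopper-Pearson interval when 0 of n trials succeed.

    For x = 0 the exact bound solves (1 - p)**n = alpha / 2.
    """
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


upper = clopper_pearson_upper_zero_successes(140)
print(f"{100 * upper:.2f}%")  # prints 2.60%
```

A one-sided 95% bound would use alpha instead of alpha / 2 and give a smaller value, so the two conventions should not be mixed when comparing papers.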

2606.05230 2026-06-05 stat.ML cs.LG eess.SP

Central Description Length (CDL) Clustering Validation Index

Mahdi Shamsi, Soosan Beheshti

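Affiliations * Toronto Metropolitan University

AI summary: Proposes the Central Description Length (CDL) clustering validation index, which scores a clustering by a probabilistic upper bound on the description length of the unobservable true cluster centers; it needs no labels and handles non-convex and irregularly shaped data.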

Abstract

Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering machine learning pipelines that work with unsupervised analysis of sensor, image, or process data. Clustering validation indices (CVIs) provide internal scores for ranking candidate clusterings, but most popular CVIs are built from Euclidean compactness and separation terms and so tend to favour compact, convex partitions. Their performance is known to degrade on non-convex, irregular, or variable-density data, where kernel transformations or alternative distance measures are typically used at the cost of additional tuning and computation. This paper introduces the Central Description Length (CDL) clustering validation index. CDL uses the observed within-cluster compactness, the estimated cluster centers, and the estimated cluster covariances to compute a probabilistic upper bound on the description length associated with the unobservable true cluster centers. The bound condenses intra-cluster compactness and centroid displacement into a single computable quantity and is evaluated on the partition produced by any clustering algorithm. The implementation uses only observable quantities (the data, the partition, the estimated centers, and the estimated covariances) and does not use ground-truth labels. On synthetic benchmarks with non-convex and arbitrarily shaped clusters, CDL-CVI selected the reference number of clusters more often and reached higher Adjusted Rand Index (ARI) values than the conventional CVIs we tested, without an additional kernel preprocessing stage. On image benchmarks (MNIST, CIFAR-10, STL-10) clustered from frozen unsupervised embeddings, CDL-CVI returned cluster numbers close to the reference class counts across K-means, DBSCAN, and spectral clustering in the reported trials.

2606.05227 2026-06-05 q-bio.CB cs.LG math-ph math.MP q-bio.BM

Quantifying the biophysical properties of stomatocytes in health and disease

Zhaojie Chai, Jianlu Zheng, He Li, Ming Dao, George Em Karniadakis

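Affiliations * Division of Applied Mathematics, Brown University; Department of Materials Science and Engineering, Massachusetts Institute of Technology; College of Engineering, University of Georgia

AI summary: Combines dissipative particle dynamics simulations with microfluidic imaging to build three stomatocyte models, showing geometry-dominated traversal of splenic interendothelial slits, suppressed membrane tank-treading, and increased low-shear whole-blood viscosity, and offering a unified explanation of the splenectomy paradox in hereditary stomatocytosis.

Comments 26 pages, 9 figures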

Abstract

Hereditary stomatocytosis (HS) comprises red blood cell (RBC) disorders characterized by cup-shaped erythrocytes that respond oppositely to splenectomy: curative in overhydrated HS (OHS) but potentially thrombogenic in dehydrated HS (DHS/xerocytosis). This paradox persists because RBC biomechanics is governed by partly independent parameters--shear modulus, bending rigidity, surface-to-volume ratio (S/V), and cytoplasmic viscosity--that existing assays capture only piecemeal. Here we combine dissipative particle dynamics (DPD) simulations with microfluidic imaging to construct a control discocyte and three stomatocyte models (ST-RBC1-3) at fixed membrane area and decreasing volume (109.7, 101.5, 89.8 fL), spanning the OHS-to-DHS range. Tracing this parameter set through five mechanically orthogonal assays, we find that interendothelial-slit (IES) traversal is geometry-dominated: overhydrated ST-RBC1 requires an order of magnitude higher critical pressure than healthy RBCs, whereas dehydrated ST-RBC3 passes freely. ST-RBC3 nonetheless suppresses membrane tank-treading and raises low-shear whole-blood viscosity by ~29% at physiological haematocrit, comparable to Gaucher-disease hyperviscosity. A funnel-obstacle chip amplifies these differences into a label-free centerline-offset signal predicted to separate all four RBC types (~4.5 standard deviations between extreme phenotypes). These results unite single-cell mechanics, splenic filtration, and hemorheology in one framework, resolve the splenectomy paradox, and point toward microfluidic pre-operative risk stratification in HS.

2606.05225 2026-06-05 q-bio.QM cs.LG

The Language of Elution: Autoregressive Prediction of the Next Feature in Untargeted LC-HRMS Lipidomics

Dayanjan S. Wijesinghe

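Affiliations * Virginia Commonwealth University School of Pharmacy

AI summary: Models chromatographic elution as an autoregressive sequence prediction task, training LSTM and Transformer models to predict the next eluting m/z bin from annotation-free features; reaches 98.4% top-1 accuracy on clinical lipidomics data and finds that the sequential pattern, rather than molecular properties, drives prediction.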

Abstract

Untargeted liquid chromatography-high-resolution mass spectrometry (LC-HRMS) detects thousands of molecular features per sample, yet only 2-20% receive confident structural annotations. A root cause of this "dark metabolome" is that tandem MS/MS acquisition is reactive: instruments select precursors only after ions appear, blind to what elutes next. We reframe chromatographic elution as an autoregressive sequence prediction task. Because reversed-phase elution order is governed by hydrophobicity, successive features form a physically constrained sequence, like tokens in language. We discretize the mass-to-charge (m/z) axis into 110 bins and train long short-term memory (LSTM) and Transformer models to predict the next eluting m/z bin from five annotation-free per-token features: m/z bin, mass defect, retention-time gap, polarity, and intensity rank. Trained on 15,242 features from four clinical lipidomics cohorts (342 plasma samples; SCIEX TripleTOF 6600+, Waters CSH C18), the LSTM reaches 98.4% top-1 accuracy (99.99% top-5; mean absolute error 3.6 Da) and the Transformer 98.0%. Ablation shows autoregressive context accounts for 55.5 percentage points while no single feature contributes more than 0.2 pp: the sequential pattern, not molecular properties, drives prediction. Models transfer across instruments sharing the method (r=0.999 on an independent Agilent 6530 dataset) but fail under a different column chemistry (5.1% top-1) or polarity mode (2.6%), confirming method- and mode-specificity. Fine-tuning on as few as two to five quality-control injections recovers held-out accuracy from 2.6% to nearly 50%, so cross-condition deployment needs minimal calibration. These results establish that elution sequences are highly predictable and lay the groundwork for predictive MS/MS acquisition to improve annotation coverage in untargeted metabolomics.

2606.05222 2026-06-05 cs.CY cs.AI cs.HC

Where's the Structure? A Systematic Literature Review of Empirical Research on Human-AI Collaboration and Hybrid Intelligence for Learning

Luis P. Prieto, Juan I. Asensio-Pérez, María Jesús Rodríguez-Triana, Mohamed Saban, Yannis Dimitriadis

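Affiliations * GSIC-EMIC research group, Universidad de Valladolid (Spain); GICAP research group, Department of Digitization, Universidad de Burgos (Spain)

AI summary: A systematic literature review (N=62) of human-AI collaboration and hybrid intelligence for learning support, analyzing collaboration processes, structures, and application contexts, and extracting design knowledge and research gaps.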

Comments 59 pages, 4 figures, submitted to a journal

2606.05217 2026-06-05 math-ph cs.AI cs.LG math.MP physics.data-an

The Score Hamiltonian: Mapping Diffusion Models to Adiabatic Transport

Peter Halmos, Boris Hanin

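Affiliations * Computer Science Department, Princeton University; ORFE Department, Princeton University

AI summary: By constructing a score Hamiltonian, the paper establishes an exact correspondence between sampling in score-based diffusion models and adiabatic transport of the ground state of a Schrödinger operator, and uses the adiabatic theorem to derive density-reconstruction error bounds and annealing schedules.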