arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.14808 2026-06-16 eess.IV cs.CV cs.IT math.IT New submission

Explainable Task-Oriented Token Communication for AI-Native 6G Networks

Feibo Jiang, Lei Mao, Li Dong, Kezhi Wang, Cunhua Pan, Jiangzhou Wang

Affiliations: IEEE

AI summary: Proposes the ET-TokenCom framework, which fuses Visual Tokens and Task Tokens through cross-modal attention to achieve explainable task-oriented image communication, addressing insufficient Token representation, poor collaboration, and low interpretability.

Abstract

The integration of Foundation Models (FMs) and wireless communications is driving the evolution of image communication from bit-accurate transmission toward task-oriented transmission. However, existing task-oriented image communication methods still face three major challenges: insufficient task-oriented Token representation, inadequate collaboration between Visual Tokens and Task Tokens, and limited interpretability of task decisions. To address these challenges, we propose an Explainable Task-Oriented Token Communication (ET-TokenCom) framework. By treating Tokens as unified units for information representation and transmission, the proposed framework constructs an end-to-end communication link that spans visual perception, wireless transmission, and task reasoning. At the transmitter, the ET-TokenCom framework extracts Visual Tokens from images to preserve low-level visual information. Meanwhile, Task Tokens generated by the FM are introduced to represent the target information and decision intent required by the current task. A Cross-Modal Attention (CMA) fusion mechanism is further designed, enabling Task Tokens to explicitly guide the selection, weighting, and transmission of Visual Tokens. At the receiver, the framework integrates Token decoding with an explainable output mechanism, where attention heatmaps are generated to highlight critical perceptual regions under different task objectives and reveal the influence of Task Tokens on the outputs. Finally, simulation results validate the effectiveness and robustness of the proposed ET-TokenCom framework.
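
The idea of Task Tokens guiding the selection and weighting of Visual Tokens can be illustrated with plain scaled dot-product attention. This is a minimal sketch under assumptions, not the paper's CMA mechanism: the function name `task_guided_selection`, the single task token, the 2-dimensional toy vectors, and the top-`keep` selection rule are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def task_guided_selection(task_token, visual_tokens, keep):
    """Weight each visual token by its attention to one task token,
    then keep the indices of the `keep` highest-weighted tokens.
    (Hypothetical helper: a generic attention sketch, not the paper's CMA.)"""
    scale = math.sqrt(len(task_token))
    scores = [dot(task_token, v) / scale for v in visual_tokens]
    weights = softmax(scores)
    ranked = sorted(range(len(visual_tokens)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:keep]), weights


# Toy example: four visual tokens, one task token (all values are made up).
task = [1.0, 0.0]
visual = [[0.9, 0.1], [0.0, 1.0], [0.8, -0.2], [-1.0, 0.0]]
kept, weights = task_guided_selection(task, visual, keep=2)
print(kept, [round(w, 3) for w in weights])
```

In a sketch like this, the same per-token weights could be reshaped onto the image grid to form an attention heatmap, which is one plausible reading of how the weighting and the explainable output relate.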

2606.14805 2026-06-16 cs.SE cs.AI New submission

Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

Dong Ho Kang, Hyeonjeong Cha, Daein Weon

Affiliations: ustechlab.com (ustechlab)

AI summary: Proposes a knowledge-graph-driven zero-replay prediction method that uses a structured event knowledge graph and a lightweight predictor to localize high-effect events without executing any replay, raising trace localization recall from 0.73 to 0.93.

Comments: 21 pages, 1 figure, 6 tables. Submitted to Knowledge-Based Systems

Abstract

Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re-run the trajectory to measure each event's effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge-based decision-support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool-use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero-replay counterfactual-effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high-effect before any replay is performed. BranchPoint-Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning-to-rank gradient-boosted predictor raises per-trace localization (Branch Recall@5) from 0.73 to 0.93 on held-out families at zero oracle-replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. The result is an auditable, cost-efficient decision-support system for AI-reliability debugging, positioned explicitly on the cost-accuracy frontier with reproducible artifacts.
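
A recall-at-k localization metric of the kind reported above can be sketched as follows. The abstract does not define Branch Recall@5 precisely, so this assumes the common definition (the fraction of oracle-marked high-effect events that appear among the k top-scored events); the event identifiers and scores are invented for illustration.

```python
def recall_at_k(scores, high_effect, k=5):
    """Fraction of oracle-marked events found among the k top-scored events.

    scores:      dict mapping event id -> predicted effect score
    high_effect: set of event ids the replay oracle would mark high-effect
    (Assumed definition; the paper's exact metric may differ.)
    """
    if not high_effect:
        raise ValueError("recall is undefined without any high-effect events")
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & set(high_effect)) / len(high_effect)


# Hypothetical trace with seven candidate events.
predicted = {"e1": 0.9, "e2": 0.1, "e3": 0.8, "e4": 0.4,
             "e5": 0.3, "e6": 0.7, "e7": 0.2}
oracle = {"e1", "e2", "e6"}
recall = recall_at_k(predicted, oracle, k=5)
print(recall)
```

Averaging this quantity over traces in held-out families would give a per-trace localization score comparable in spirit to the one quoted in the abstract.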

2606.14800 2026-06-16 stat.ME cs.LG eess.IV stat.ML New submission

Bridging data-driven priors via the score function for posterior sampling -- Comparative review and experimental study

Elhadji Cisse Faye, Mame Diarra Fall, Sylvain Delchini, Nicolas Dobigeon

Affiliations: IDP, Univ Orléans; LITIS, Univ Rouen Normandie; Bureau de Recherches Géologiques et Minières Orléans, France; IRIT, Univ Toulouse

AI summary: Reviews how several data-driven priors used in Bayesian inverse problems can be unified through their score functions and integrated effectively into a sampling algorithm, with image inpainting and super-resolution experiments demonstrating the method's efficiency and generality.

Abstract

This paper reviews how a diverse set of popular data-driven priors commonly used in Bayesian inverse problems can be unified through their respective score functions. By framing these priors under this common perspective, we show that they can benefit from their straightforward and effective integration into a recently proposed sampling algorithm. The applicability of this common framework is illustrated by considering several data-driven priors, namely regularization-by-denoising, normalizing flow-based priors, score-based generative models, and convex-ridge regularizers. For these four particular priors, the performance of the method is evaluated when conducting image inpainting and single image super-resolution. These results, as well as those obtained when restoring real images acquired in a geological context, demonstrate the efficiency of the method. This unified framework proves versatile enough to handle any posterior distribution defined by a broad class of score function-based priors, beyond the specific cases considered in this paper.
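
To see why a prior known only through its score function is enough for posterior sampling, the sketch below runs unadjusted Langevin dynamics on a one-dimensional toy problem. Note the substitution: the abstract refers to "a recently proposed sampling algorithm" without naming it, so plain Langevin is used here only as a familiar stand-in. The Gaussian prior, the denoising likelihood, and every function name are assumptions made for illustration.

```python
import random


def posterior_score(x, y, noise_var, prior_score):
    """Score of the posterior = likelihood score + prior score.
    The likelihood is y = x + Gaussian noise with variance noise_var."""
    likelihood_score = (y - x) / noise_var
    return likelihood_score + prior_score(x)


def langevin_samples(y, noise_var, prior_score, step=0.01, count=50000, seed=0):
    """Unadjusted Langevin iterations: x <- x + step * score + sqrt(2 * step) * z."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(count):
        drift = step * posterior_score(x, y, noise_var, prior_score)
        x = x + drift + (2.0 * step) ** 0.5 * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def gaussian_prior_score(x):
    """Score of a standard normal prior; a learned score would be plugged in here."""
    return -x


samples = langevin_samples(y=2.0, noise_var=1.0, prior_score=gaussian_prior_score)
posterior_mean = sum(samples) / len(samples)
print(posterior_mean)
```

For this toy setup the exact posterior is Gaussian with mean 1, so the empirical mean should land close to 1. Swapping `gaussian_prior_score` for another score function changes the prior without touching the sampler, which is the point of the unifying perspective.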

2606.14791 2026-06-16 eess.AS cs.LG cs.SD New submission

From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation

Fengrui Liu, Ruiyang Huang, Qijian Zheng, Yuanfang Wang, Feng Liu

Affiliations: East China Normal University; Southeast University; Fudan University; Shanghai Jiao Tong University

AI summary: Proposes AudioPG, a framework that pre-trains a masked autoencoder on procedurally synthesized waveforms with no real audio data, reaching high accuracy on several benchmarks with under 20 minutes of training on a single GPU.

Comments: Accepted to ACM ICMR 2026

Abstract

Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework eliminating real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on-the-fly from basic acoustic primitives and composition rules. The encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making representations linearly decodable. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is available at: https://github.com/Freyliu0516/audioPG.

2606.14786 2026-06-16 cs.MM cs.AI cs.CV New submission

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Zirui Zhu, Kanchan Sarkar, Kun Xu

Affiliations: Tiktok (ByteDance); National University of Singapore School of Computing

AI summary: Proposes MatchLM2Lite, which distills a multimodal large language model into a lightweight model for real-time reproduced content identification that jointly models video, audio, and text, cutting computational cost by 35x while retaining high accuracy, and is deployed in a large-scale production environment.

Abstract

Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

2606.14769 2026-06-16 econ.EM cs.AI cs.GT New submission

Agentomics: Economic Foundations for the Valuation, Attribution, and Pricing of AI Agents in Human-AI Workflows

Quanyan Zhu

Affiliations: Department of Electrical and Computer Engineering, NYU Tandon School of Engineering

AI summary: Proposes the Agentomics framework, which uses a workflow model to treat AI deployment as a coalition-formation problem and attributes economic surplus with the Shapley value, supporting the valuation, attribution, and pricing of AI agents.

Abstract

Agentic AI systems are increasingly being deployed as productive resources in organizational workflows, yet existing evaluation methods primarily measure isolated technical performance rather than economic contribution. This paper introduces Agentomics, a workflow-based framework for valuing, attributing, and pricing human and artificial agents. The framework models a workflow as a configuration of heterogeneous agents whose collective performance determines gross value, deployment cost, reliability, and expected failure loss. Workflow value is treated as a team-level quantity that may include complementarities, substitution effects, bottlenecks, and nonlinear production; additive stage-level value is only a special case. Building on this workflow model, the paper formulates AI deployment as a coalition-formation problem and defines coalition value as the incremental net surplus generated relative to a benchmark human workflow. The Shapley value is then used to attribute economic surplus among participating AI agents, yielding a principled connection among valuation, accountability, and market pricing. The resulting Shapley pricing equilibrium provides a normative benchmark for assessing whether agent prices reflect expected marginal contribution. A security-operations case study illustrates how the framework accounts for productivity gains, deployment costs, reliability losses, and coalition-level complementarities in hybrid human-AI workflows.
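
Shapley attribution of a coalition surplus can be worked through on a two-agent toy case. The sketch below computes exact Shapley values by averaging marginal contributions over all orderings; the agent names `triage` and `enrich` and the surplus figures are hypothetical and are not taken from the paper's case study.

```python
from itertools import permutations


def shapley_values(agents, coalition_value):
    """Exact Shapley values: average each agent's marginal contribution
    over every order in which the coalition could be assembled."""
    totals = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        members = frozenset()
        for agent in order:
            with_agent = members | {agent}
            totals[agent] += coalition_value(with_agent) - coalition_value(members)
            members = with_agent
    return {agent: total / len(orderings) for agent, total in totals.items()}


# Hypothetical incremental net surplus relative to a human-only baseline.
# The pair is worth more than the sum of its parts (a complementarity).
SURPLUS = {
    frozenset(): 0.0,
    frozenset({"triage"}): 10.0,
    frozenset({"enrich"}): 4.0,
    frozenset({"triage", "enrich"}): 20.0,
}

values = shapley_values(["triage", "enrich"], SURPLUS.__getitem__)
print(values)
```

Here the attributions sum to the full coalition surplus of 20, and the 6 units of complementarity are split evenly between the two agents. Exact enumeration grows factorially with the number of agents, so larger workflows would need sampling-based approximations.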

2606.14750 2026-06-16 eess.AS cs.AI cs.CV cs.SD New submission

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva

Affiliations: SPRING Lab, Indian Institute of Technology, Madras, India; MBZUAI, UAE

AI summary: Proposes Pixel-TTS, which renders text as images and generates embeddings through a 2D convolution, eliminating embedding matrix expansion, improving robustness to unseen characters and orthographic variations, and enabling zero-shot generalization.

Comments: 5 pages, 4 figures, 4 tables

Abstract

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show that Pixel-TTS achieves performance competitive with strong baselines, along with faster convergence and robust zero-shot generalization.

2606.14739 2026-06-16 cs.ET cs.LG cs.SY eess.SY New submission

An RRAM-based Hardware Implementation of a Radial Basis Function Neuron for Edge Classifiers

Georgios Papandroulidakis, Shady Agwa, Themis Prodromakis

Affiliations: Centre for Electronics Frontiers, Institute of Micro and Nano Systems

AI summary: Proposes a hardware design based on metal-oxide RRAM analogue content addressable memory (ACAM) whose configurable receptive-field neurons perform metric-based classification and online adaptation on edge devices, reaching 89.1% accuracy on MNIST at 185 fJ per cell per operation.

Abstract

The deployment of modern machine learning (ML) solutions on resource-constrained edge devices highlights implementation challenges. This is especially true for extreme edge applications that include safety-critical components, such as autonomous navigation tasks. This paper demonstrates an artificial neural network (ANN) design leveraging Metal-Oxide Resistive RAM (RRAM)-based Analogue Content Addressable Memory (ACAM) as an efficient hardware substrate for performing metric-based classification and online adaptation on the edge. The proposed design is based on a custom Template piXeL (TXL) cell used for building the ACAM module, where each TXL cell acts as a configurable receptive field neuron. These cells employ a Radial Basis activation function to calculate the distance of an input from the programmed receptive field. The TXL can be organised into dense arrays for calculating the distance of a high-dimensional input against all stored prototypes, effectively performing fast and energy-efficient similarity search. This hardware engine enables on-the-fly learning, where the receptive field parameters can be tuned to track domain shift. Through simulation of the proposed TXL-RBF classifier we can achieve 89.1% accuracy on the MNIST dataset while consuming 185 fJ per cell per operation when operating at 100 MHz.
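
In software terms, a receptive-field neuron with a radial basis activation responds most strongly when the input sits at its stored prototype, and classification picks the prototype with the strongest response. The sketch below assumes a Gaussian radial basis function and a winner-take-all readout; the abstract specifies neither the exact activation shape nor the readout, and the prototypes are toy values.

```python
import math


def rbf_activation(x, center, width):
    """Gaussian radial basis response: 1.0 at the center, decaying with distance.
    (The Gaussian form is an assumption; the hardware cell may differ.)"""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def classify(x, prototypes, width=1.0):
    """Compare the input against every stored prototype and return the
    label and response of the best match (winner-take-all)."""
    best_label, best_activation = None, -1.0
    for label, center in prototypes:
        activation = rbf_activation(x, center, width)
        if activation > best_activation:
            best_label, best_activation = label, activation
    return best_label, best_activation


# Two hypothetical stored prototypes in a 2-dimensional feature space.
prototypes = [("zero", [0.0, 0.0]), ("one", [1.0, 1.0])]
label, activation = classify([0.9, 0.8], prototypes)
print(label, round(activation, 4))
```

Online adaptation in this picture amounts to nudging a prototype's `center` or `width` as the input distribution drifts, which mirrors the tunable receptive-field parameters described above.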

2606.14737 2026-06-16 q-bio.BM cs.LG stat.ML New submission

Learning Topological Representations for Molecular Dynamics

Dominik Geng, Florian Graf, Martin Uray, Roland Kwitt

Affiliations: University of Salzburg; Centre for Intelligent and Secure Industrial Automation; University of Applied Sciences

AI summary: Proposes the masked Flood complex for persistent homology analysis, giving geometry-aware representations of protein conformations in a shared representation space, with competitive performance on classification, regression, and Markov state model estimation.

Comments: 20 pages, 4 figures

Abstract

Molecular dynamics (MD) simulations generate trajectories in a high-dimensional configuration space whose analysis critically depends on molecular descriptors, typically handcrafted observables or learned kinetic embeddings. Designing descriptors that are both expressive and broadly applicable, however, remains challenging. We study persistent homology (PH) as a general-purpose representation for MD and introduce the masked Flood complex, a protein-tailored modification of a recently introduced simplicial complex construction that emphasizes inter-residue structure at low computational cost. Vectorized persistence diagrams then provide information-rich, geometry-aware summaries of protein conformations, which we evaluate on protein class prediction, frame-level observable regression, and Markov state model (MSM) estimation from learned low-dimensional coordinates in a single shared representation space. Results on the mdCATH dataset show that PH-based descriptors are competitive across tasks, with masked Flood PH yielding the most consistent overall performance. Further, when using topologically-informed MSMs as a drop-in replacement within the recent MarS-FM framework for generative modeling of protein conformations, we obtain consistently better ensemble statistics than MSMs based on physical observables. Finally, we explore the transferability of the generative model to qualitatively different, fast-folding proteins.

2606.14729 2026-06-16 physics.comp-ph cs.LG physics.flu-dyn New submission

Machine Learning-Driven Chemical Reactor Network Modeling of the Sandia-D Flame

Nicolas J. Tricard, Benjamin C. Koenig, Sili Deng

Affiliations: Massachusetts Institute of Technology

AI summary: To address the high cost of turbulent combustion simulation, proposes a machine-learning-assisted method for automatically constructing equivalent reactor networks (ERNs) that combines principal component analysis, k-means clustering, and gradient-descent optimization, achieving a roughly 6000x speedup on the Sandia-D flame with a maximum-temperature R² of 0.7945.

Comments: 12 pages, 11 figures

Abstract

Turbulent combustion simulations are crucial for many scientific and engineering systems. However, the high cost to fully resolve the complex multiscale and multiphysics behavior makes direct simulation typically infeasible. The equivalent reactor network (ERN) approach attempts to improve computational efficiency by replacing a multidimensional turbulent simulation with a series of much cheaper 0-D and 1-D chemical reactors, providing a surrogate model that retains detailed chemistry at the cost of simplified flow physics. However, their development remains a challenge, often requiring either expert analysis, or automated approaches that sacrifice accuracy. In this work, we develop an automated machine-learning-assisted framework for constructing ERNs of the Sandia-D turbulent methane/air flame. Principal component analysis is first used to reduce high-dimensional thermochemical computational fluid dynamics (CFD) data to a low-dimensional latent space, where k-means clustering identifies physically interpretable flame regions used to initialize a reactor-network graph. This initialization is then refined using finite-difference gradient descent wrapped around non-differentiable Cantera reactor simulations. Across 30 RANS simulations spanning a range of pilot temperatures and inlet methane compositions, the optimized 7-reactor ERN achieves a maximum-temperature $R^2$ score of 0.7945 while preserving a $\sim6000\times$ speedup over the CFD solver. Outlet CO prediction remains more challenging, with a final $R^2$ score of $-0.4183$, but improves substantially from the unoptimized clustering initialization. These results show that unsupervised thermochemical feature extraction can provide effective physics-informed initializations for ERN construction, while gradient-based refinement can significantly improve predictive accuracy without manual reactor-network design.
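
Finite-difference gradient descent treats the simulator as a black box: each parameter is perturbed up and down, the loss is re-evaluated, and the slope estimate drives an ordinary descent step. The sketch below shows the pattern on a toy quadratic loss that stands in for a reactor-network simulation error; `toy_loss`, the step sizes, and the iteration count are hypothetical, and no Cantera calls are made.

```python
def finite_difference_gradient(loss, params, step=1e-4):
    """Central-difference estimate of d(loss)/d(param) for each parameter.
    Needs two loss evaluations per parameter and no analytic derivatives."""
    gradient = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += step
        down[i] -= step
        gradient.append((loss(up) - loss(down)) / (2.0 * step))
    return gradient


def descend(loss, params, learning_rate=0.1, iterations=200):
    """Plain gradient descent driven by the finite-difference estimate."""
    params = list(params)
    for _ in range(iterations):
        gradient = finite_difference_gradient(loss, params)
        params = [p - learning_rate * g for p, g in zip(params, gradient)]
    return params


def toy_loss(p):
    """Hypothetical stand-in for the mismatch between a reactor network
    and reference CFD data; its minimum is at p = [3.0, -1.0]."""
    return (p[0] - 3.0) ** 2 + 2.0 * (p[1] + 1.0) ** 2


fitted = descend(toy_loss, [0.0, 0.0])
print([round(v, 4) for v in fitted])
```

The cost per descent step scales with the number of parameters times the cost of one simulation, so this approach only stays practical when each black-box evaluation is cheap, as a small reactor network is relative to CFD.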

2606.14721 2026-06-16 cs.GR cs.CV cs.RO New submission

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

Hequan Wang, Jiaxu Zhang, Zhengbo Zhang, Zhigang Tu

Affiliations: Wuhan University

AI summary: Proposes the DC-Motion framework, in which a discrete-continuous VAE decomposes motion into discrete semantic tokens and continuous detail residuals, combined with a masked autoregressive model and a residual diffusion model, to generate high-quality motion from complex text instructions.

Abstract

Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To address this, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion effectively improves the capability to follow complex instructions. By effectively balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and R-precision for text alignment.

2606.14718 2026-06-16 cs.CY cs.AI New submission

Gender Differences in AI Literacy Workshop Outcomes and Deepfake Engagement

Jake Renzella, Christian Bergh, Natasha Banks, Alexandra Vassar

Affiliations: University of New South Wales

AI summary: Using regression analysis of data collected before and after an AI literacy workshop for Australian secondary students, finds that male students report significantly higher STEM career interest, female students use AI tools more often, and female students show larger post-workshop gains in AI knowledge and career interest, partially narrowing the gender gap.

Abstract

As Artificial Intelligence (AI) literacy initiatives expand in K-12 settings, understanding how gender shapes student baseline perceptions, tool-use, and responsiveness to interventions is essential for equitable curriculum design. This study examines gender differences in AI literacy, safety awareness, and STEM career aspirations among Australian secondary students (Years 7, 8, and 10; N(pre) = 199, n(post) = 136) from two co-educational government schools who participated in a one-day AI literacy workshop. Using statistical regression methods controlling for year level and school, we found that pre-workshop, male students reported significantly higher STEM career interest across all three domains (AI, computer science, and engineering), while female students were significantly more likely to use AI for schoolwork and to seek advice from AI tools. Gender-differentiated patterns also emerged in deepfake behaviours: males were significantly more likely to have created or shared deepfake content. Both genders improved in AI knowledge post-intervention, yet females showed a richer profile of gains: wider conceptual understanding, greater confidence, and meaningful increases in AI and CS career interest that partially narrowed the gender STEM gap. These findings highlight the need for gender-responsive AI curricula, particularly deepfake safety education for male students, and demonstrate that even single-day workshops can narrow gender gaps in STEM aspirations and AI confidence.

2606.14715 2026-06-16 cs.MA cs.AI cs.SI New submission

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

Yaoning Yu, Ye Yu, Haojing Luo, Haohan Wang

Affiliations: University of Illinois at Urbana-Champaign; Starc.Institute

AI summary: Proposes MiroBench, a benchmark built from 4,292 real Reddit threads that uses statistical tests to evaluate how well LLM agent simulations match real distributions across four aspects (repetition, narrative content, toxicity and aggression, and structural complexity), finding that current simulators remain distributionally mismatched with real discussions.

Abstract

LLM agents are increasingly used to simulate real world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real-world social simulation. Reddit threads provide public, topic-grounded, multi-party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt-based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

2606.14710 2026-06-16 cs.DC cs.AI New submission

Poster: EdgeCitadel -- Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

Zhonghao Zhan, Yefan Zhang, Hamed Haddadi

Affiliations: Imperial College London; Independent Researcher

AI summary: To address edge AI agent coordination that depends on cloud-style transports or a central relay, proposes EdgeCitadel, a hybrid orchestration platform built on a NATS 2.10 server with the built-in MQTT adapter, providing heterogeneous agent connectivity, persistent storage, direct delegation, and passive aggregation, and validated on ARM64, x64, and Android devices.

Abstract

Edge-resident AI agents increasingly span home servers, IoT hubs, laptops, and phones, yet their coordination stacks still assume cloud-style transports or a central relay. We present EdgeCitadel, an edge multi-agent orchestration platform built around a single NATS 2.10 server with the built-in MQTT adapter. The design combines MQTT connectivity for heterogeneous agents, JetStream-backed persistence and replay for backend services, direct peer delegation over a shared subject namespace, and a passive aggregator that visualizes and stores traffic without sitting on the delivery path. Our poster highlights the migration from MQTT relay prototypes (common in IoT communication) to the current hybrid architecture and demonstrates a working cross-device testbed spanning ARM64, x64, and Android clients.

2606.14707 2026-06-16 cs.PF cs.AI cs.CY New submission

Green AI Carbon Optimizer: Carbon-Efficient Training Location Recommendation and Global AI Energy Demand Forecasting

Yuxin Chen, Hao Gao, Chujie Zou

Affiliations: arXiv.org

AI summary: Proposes Green AI Carbon Optimizer, comprising a cloud region recommendation method based on grid carbon intensity, renewable share, and PUE (a 97.2% emissions reduction between the best and worst regions) and a power-law forecasting model for global AI energy demand (7 to 1,436 TWh projected for 2030).

Comments: Short workshop of 5 pages. 2 figures

Abstract

AI training and deployment consume substantial electricity, but carbon outcomes remain weakly integrated into routine model development decisions. This paper presents Green AI Carbon Optimizer with two primary contributions: (i) a carbon-aware cloud region recommendation method for training workloads, and (ii) a power-law forecasting pipeline for global AI energy demand. For location recommendation, we combine regional grid carbon intensity, renewable share, and data center Power Usage Effectiveness (PUE) into a unified scoring model across 100+ regions from major cloud providers. For a reference workload (8×A100, 100 h), estimated emissions in our sampled regions range from 7.74 kg to 272.00 kg CO2. Selecting the best region instead of the worst corresponds to a 97.2% reduction relative to the worst case. Ablation shows that ranking by renewable share alone can select regions with higher CO2 emissions than rankings that include grid carbon intensity. For forecasting, we fit a power-law relation between parameter count and training energy using 26 anchor models. We combine this fit with scenario assumptions on model growth, hardware efficiency, and training frequency, and evaluate sensitivity to inference ratio and ecosystem scaling. Across scenarios, projected 2030 demand ranges from 7 TWh to 1,436 TWh under the stated assumptions, highlighting the importance of deployment choices, model scaling discipline, and transparent energy reporting.
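
The 97.2% figure follows from the two endpoint emissions quoted in the abstract. The sketch below checks that arithmetic and shows one common way such an estimate is formed (IT energy times PUE times grid carbon intensity). The per-GPU power draw, the PUE, and the two intensity values are illustrative assumptions, not numbers from the paper, and the paper's unified scoring model is not reproduced here.

```python
def training_emissions_kg(gpus, gpu_kw, hours, pue, grid_kg_per_kwh):
    """Estimate CO2 (kg) as IT energy x PUE x grid carbon intensity."""
    it_energy_kwh = gpus * gpu_kw * hours
    return it_energy_kwh * pue * grid_kg_per_kwh


def relative_reduction(best_kg, worst_kg):
    """Fraction of the worst-case emissions avoided by choosing the best case."""
    return (worst_kg - best_kg) / worst_kg


# Endpoints quoted in the abstract for the 8 x A100, 100 h reference workload.
reduction = relative_reduction(7.74, 272.00)
print(f"{reduction:.1%}")  # prints 97.2%

# Hypothetical inputs: 0.4 kW per GPU, PUE 1.1, 0.02 vs 0.7 kg CO2 per kWh.
clean_region_kg = training_emissions_kg(8, 0.4, 100, 1.1, 0.02)
dirty_region_kg = training_emissions_kg(8, 0.4, 100, 1.1, 0.7)
```

Because the estimate is linear in grid carbon intensity, two regions with the same PUE differ in emissions by exactly the ratio of their intensities, which is why a ranking that ignores intensity can pick a worse region.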

2606.14202 2026-06-16 cs.NE cs.AI New submission

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

Affiliations: School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham

AI summary: Proposes MeEvo, a framework that cyclically couples natural evolution (exploring heuristic code) with metacognitive evolution (reflecting on the recorded history to generate improved heuristics), addressing the weak knowledge inheritance and insufficient exploration of existing methods; it performs better on five optimization problems.

Abstract

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

2606.13693 2026-06-16 cs.CY cs.AI New submission

Limited Marginal Benefit of Reasoning-Heavy LLM Deployment in ESG Narrative Scoring: A 4-Model Consensus Study on Japanese Listed Firms

Hiroyuki Kokubu

Affiliations: Kansai University

AI summary: Using a four-model consensus design, examines whether reasoning-heavy models deliver significant gains over non-reasoning models in ESG narrative scoring, and finds that their marginal benefit is limited while their cost is high.

Comments: 12 pages. Earlier version available on SSRN, Abstract ID 6683303

Abstract

Automated scoring of ESG narrative disclosures with large language models (LLMs) is gaining traction, yet whether reasoning-heavy frontier models add value commensurate with their cost remains empirically unsettled. We evaluate this question on a corpus of ten Japanese listed firms across three rubric axes -- quantitative targets, progress-tracking infrastructure, and external-standard alignment -- using a four-model consensus design that combines a reasoning-on frontier model with three reasoning-off contemporaries. Across 120 firm x axis x model scores, the pooled mean absolute deviation between the reasoning-on model and each reasoning-off counterpart is 0.38 on a 5-point scale; only 2% of pairwise comparisons reach a two-point deviation, and none exceeds two points. Per-firm cost accounting shows the reasoning-on arm alone costs roughly 5.6x as much as the three-provider reasoning-off ensemble, for outcomes that differ only within small margins. We conclude that in span-based ESG narrative scoring, reasoning-heavy deployment does not materially improve outcomes relative to reasoning-off consensus, while substantially increasing operational cost. We discuss implications for cost-effective ESG auto-scoring pipelines and LLM deployment governance in applied accountability settings. An earlier version of this work is available on SSRN (Abstract ID 6683303).

2606.13691 2026-06-16 cs.CY cs.CL New submission

Incentives Of EdTech: A Systematic Review Of EduNLP Research

Gabrielle Gaudeau, Aoife O'Driscoll, Jasper Degraeuwe, Andrew Caines, Donya Rooein, Zeerak Talat

Affiliations: ALTA Institute, Computer Laboratory, University of Cambridge; Ghent University; Bocconi University; University of Edinburgh

AI summary: A systematic review of 204 ACL educational-application papers reveals a tension in educational NLP research between private-sector incentives and the needs of educational infrastructure: teachers are systematically under-represented as beneficiaries (33.3%), real-world deployment is rare (9.8%), and ethical engagement tends toward acknowledgement rather than action.

Comments: 10 main pages (13 appendix pages), 20 figures, accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications @ ACL 2026

Abstract

While the Natural Language Processing community has dedicated significant resources to developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics' Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.

2606.13485 2026-06-16 eess.SY cs.HC cs.NE cs.RO cs.SY physics.med-ph New submission

Impedance MPC with Patient-Torque Estimation for Knee Rehabilitation Exoskeletons

Yongyan Cao, Jinshan Tang

Affiliations: Department of Biomedical Engineering and Engineering Science

AI summary: Proposes an impedance model predictive control framework that uses a Kalman disturbance state to estimate patient torque, achieving offset-free tracking and assist-as-needed behavior while meeting the clinical accuracy criterion at 500 Hz.

Abstract

Knee rehabilitation exoskeletons must enforce a prescribed joint trajectory while remaining safely compliant with involuntary spasm and voluntary patient effort, objectives that are in tension for any fixed-gain impedance controller. We present an Impedance Model Predictive Control framework for knee rehabilitation exoskeletons, demonstrated on a series-elastic-actuator (SEA) platform: an algebraic feedforward reduces the knee dynamics to a constant-coefficient scalar double integrator, and a receding-horizon quadratic program (QP) computes corrective torques while enforcing hard range-of-motion, torque, and velocity limits (ISO 13482). A Kalman disturbance state driven by direct SEA-based torque sensing (the series-elastic spring deflection measured through the elastic element, which is an intrinsic, EMG-free patient-torque estimate rather than a separate load cell) gives a nominal offset-free guarantee and, via its sign and the desired-motion direction, sensorless Assist-as-Needed. The constant state matrix permits offline precomputation of the QP cost inverse, enabling 500 Hz operation with a multi-step horizon. Across seven-controller benchmarks (sinusoidal tracking, isometric hold), the 500 Hz Kalman MPC is offset-free (0.1 mrad RMS, 0.1 mrad steady-state, 0.2 mrad peak) under a 15 Nm spasm, versus a 515 mrad steady-state offset for classical impedance at the same stiffness, with the direct-measurement channel converging the estimate near-immediately (within a few sampling periods). Without the estimator, it realizes a classical impedance (4.8 mrad RMS, 8.3 mrad steady-state). All MPC variants meet the 87 mrad clinical criterion; no classical controller does. The architecture is formulated for the 20-DOF MyoSuite myoLeg via coupling-aware per-joint QPs.
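
The abstract's reduction to a constant-coefficient double integrator means the prediction model is the same at every step. As a minimal sketch of that model only, the code below applies the exact zero-order-hold update at the quoted 500 Hz rate. The function names and the unit test input are assumptions for illustration; the paper's feedforward, QP, constraints, and Kalman estimator are not reproduced.

```python
DT = 1.0 / 500.0  # 500 Hz control rate quoted in the abstract


def step(position, velocity, accel, dt=DT):
    """Exact zero-order-hold update of a double integrator (accel held constant over dt)."""
    new_position = position + velocity * dt + 0.5 * accel * dt * dt
    new_velocity = velocity + accel * dt
    return new_position, new_velocity


def simulate(accels, position=0.0, velocity=0.0):
    """Roll the model forward over a sequence of commanded accelerations."""
    for accel in accels:
        position, velocity = step(position, velocity, accel)
    return position, velocity


# One second of unit commanded acceleration, starting from rest.
final_position, final_velocity = simulate([1.0] * 500)
```

Because the update coefficients (1, dt, dt^2/2) never change, any matrices built from them over a fixed horizon can be computed once in advance, which is the property the abstract credits for reaching 500 Hz.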

2606.11692 2026-06-16 cs.CY cs.AI cs.MA cs.SI New submission

Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator

Rwaida Alssadi, Khulud Alawaji, Balaji Kasula, Muntaser Syed, Badria Alfurhood, Markus Zanker, Marius Silaghi

Affiliations: Florida Institute of Technology; Princess Nourah Bint Abdulrahman; Free University of Bozen-Bolzano

AI summary: Proposes an LLM-based Agentic Bipolar Argumentation Simulator (ABAS) that evaluates the effectiveness of recommendation mechanisms in deliberative polling through coverage and corpus diversity, and tests robustness under adversarial voting attacks.

Abstract

Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator (ABAS), grounded in a framework that formalises a poll as a six-tuple <Jend, Jopp, Ratt, Renh, VA, VR> of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements a recommendation mechanism that ranks existing justifications by their observable endorsement mass. It evaluates the mechanism's success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.
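
The coverage metric defined above is simple enough to state in code. The sketch below computes it for one voter and pairs it with a plain top-K ranking by endorsement count. The record layout, tag names, and endorsement numbers are hypothetical, and the ranking is a stand-in for the simulator's recommendation mechanism, not its implementation.

```python
def coverage(recommended_tags, corpus_tags):
    """Fraction of the corpus reason-tag set present in one voter's K recommendations."""
    corpus = set(corpus_tags)
    if not corpus:
        return 0.0
    seen = set()
    for tags in recommended_tags:
        seen.update(tags)
    return len(seen & corpus) / len(corpus)


def top_k_by_endorsement(justifications, k):
    """Rank justifications by endorsement mass (descending) and keep the first k."""
    ranked = sorted(justifications, key=lambda j: j["endorsements"], reverse=True)
    return ranked[:k]


# Hypothetical corpus: two popular justifications share the same reason tag.
corpus = [
    {"id": "j1", "tags": {"cost"}, "endorsements": 9},
    {"id": "j2", "tags": {"cost"}, "endorsements": 7},
    {"id": "j3", "tags": {"safety"}, "endorsements": 4},
    {"id": "j4", "tags": {"fairness", "cost"}, "endorsements": 1},
]
all_tags = set().union(*(j["tags"] for j in corpus))
picked = top_k_by_endorsement(corpus, k=2)
score = coverage([j["tags"] for j in picked], all_tags)
```

In this toy corpus the two most endorsed justifications carry the same tag, so K = 2 covers only one of three tags. This is the kind of concentration a tag-flood attack would amplify.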

2606.10456 2026-06-16 cs.CR cs.AI New submission

The Distributed Detectability Band Against Marginal-Preserving Attacks

Zhang Qinqin, Gao Yuze

Affiliations: University of Science and Technology of China

AI summary: For marginal-preserving attacks on AI monitoring, uses a Gaussian-copula AR(1) construction to encode harm in temporal correlation, showing that distribution-shape monitors fail while temporal-correlation monitors remain effective, which yields a non-empty detectability band.

Comments: 10 pages, 11 figures

Abstract

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.
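
A Gaussian-copula AR(1) sequence keeps each step's marginal fixed while changing only how consecutive steps correlate. The sketch below illustrates that idea with the standard library: a stationary AR(1) latent is pushed through the normal CDF and then through a benign quantile function. The uniform benign distribution, the value rho = 0.8, and all function names are assumptions for illustration; this is not the paper's attack or its monitors.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def copula_ar1_scores(n, rho, benign_quantile, seed=0):
    """Draw n scores with the benign marginal and an AR(1) Gaussian-copula dependence."""
    rng = random.Random(seed)
    z = rng.gauss(0.0, 1.0)  # start in the stationary N(0, 1) distribution
    innovation_sd = (1.0 - rho * rho) ** 0.5  # keeps Var(z) = 1 at every step
    scores = []
    for _ in range(n):
        u = STD_NORMAL.cdf(z)  # uniform on (0, 1) because z is standard normal
        scores.append(benign_quantile(u))
        z = rho * z + innovation_sd * rng.gauss(0.0, 1.0)
    return scores


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation, a minimal temporal-correlation statistic."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


# Hypothetical benign score distribution: uniform on [0, 1].
benign = copula_ar1_scores(20000, rho=0.0, benign_quantile=lambda u: u)
attack = copula_ar1_scores(20000, rho=0.8, benign_quantile=lambda u: u, seed=1)
```

Summary statistics of the marginal (for example the mean) look alike for `benign` and `attack`, while `lag1_autocorrelation` separates them clearly, mirroring the contrast between the two monitor families in the abstract.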

2606.08270 2026-06-16 cs.CR cs.AI cs.ET New submission

An AI Security Agent for University ACMIS: Multi-Vector Threat Detection and Automated Response

Joseph Walusimbi, Joshua Benjamin Ssentongo

Affiliations: University ACMIS

AI summary: Proposes an AI security agent combining supervised anomaly detection, behavioural analytics, and an NLP chatbot; it monitors five operational layers of ACMIS and responds automatically through a four-tier risk escalation framework, reaching a macro-average F1 of 0.966 on a simulated dataset.

Comments: 6 pages, 1 figure, 5 tables

Abstract

University Academic Management Information Systems (ACMIS) are high-value targets for a wide spectrum of security threats, including brute-force login attacks, payment fraud, privilege escalation, insider data theft, and academic integrity violations. Traditional rule-based intrusion detection systems are inadequate because many malicious activities are structurally indistinguishable from normal operations. This paper presents an AI-based security agent for ACMIS that combines supervised anomaly detection, behavioural analytics, and a natural language processing chatbot for secure password recovery. The agent monitors five operational layers (authentication, authorisation, financial transactions, user behaviour, and system health) and responds through a four-tier risk escalation framework. A modular architecture allows the core engine to be extended to other institutional systems. Experiments on a simulated ACMIS event log dataset of 147,922 sessions demonstrate a threat detection macro-average F1 of 0.966, compared to 0.156 for a rule-based baseline and 0.836 for a sequence-only (LSTM) baseline, with end-to-end critical-tier automated response latency under 1 ms on a single-node prototype. The integrated recovery chatbot achieves 97.1% identity verification accuracy and an 87.3% mass-reset attack detection rate with zero false positives on legitimate high-volume recovery periods.

2606.08090 2026-06-16 cs.DB cs.AI New submission

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

Kyoungmin Kim, Martin Catheland, Anastasia Ailamaki

Affiliations: EPFL

AI summary: Proposes an adaptive two-phase semantic filtering framework that combines model-free clustering with an online proxy, trains the proxy on the LLM's confidence as a soft label, and reduces cascade cost through sparsity-aware calibration, giving a 1.6-2.0x speedup at a 90% accuracy target.

Abstract

Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitive, so cascades pair the oracle with a fast proxy. As deployed today, they leave four limitations on the table. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins on only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme on a bi-encoder over dense embeddings, missing the token-level evidence richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence at the boundary documents it most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method per corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.
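
Two ideas in this abstract lend themselves to a short sketch: a cascade that calls the oracle only where the proxy is unsure, and a proxy loss that uses the oracle's confidence as a soft label. The code below is a generic illustration under stated assumptions: the fixed thresholds, the lookup-table proxy and oracle, and all names are hypothetical, and the paper's clustering phase and sparsity-aware calibration are not reproduced.

```python
from math import log


def cascade_filter(docs, proxy_score, oracle, low, high):
    """Trust the proxy outside [low, high]; call the oracle only inside that band."""
    decisions, oracle_calls = {}, 0
    for doc in docs:
        score = proxy_score(doc)
        if score >= high:
            decisions[doc] = True
        elif score <= low:
            decisions[doc] = False
        else:
            decisions[doc] = oracle(doc)  # expensive LLM call in a real system
            oracle_calls += 1
    return decisions, oracle_calls


def soft_label_loss(proxy_prob, oracle_confidence, eps=1e-12):
    """Cross-entropy of the proxy against the oracle's confidence as a soft target."""
    p = min(max(proxy_prob, eps), 1.0 - eps)
    return -(oracle_confidence * log(p) + (1.0 - oracle_confidence) * log(1.0 - p))


# Hypothetical documents, proxy scores, and oracle answers.
docs = ["refund request", "refund policy update", "weather report", "meeting notes"]
proxy = {"refund request": 0.95, "refund policy update": 0.55,
         "weather report": 0.05, "meeting notes": 0.40}
truth = {"refund request": True, "refund policy update": True,
         "weather report": False, "meeting notes": False}
decisions, calls = cascade_filter(docs, proxy.get, truth.get, low=0.2, high=0.8)
```

Here only the two documents with proxy scores inside the band reach the oracle. Widening the band raises accuracy at the price of more oracle calls, which is the cost that a uniform safety margin inflates.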

2606.07150 2026-06-16 cs.CR cs.AI cs.MA cs.NI New submission

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

Bijaya Dangol

Affiliations: Independent Researcher

AI summary: Addresses metadata leakage from agent communication graphs by proposing a workflow-integrity threat model and defining transport- and bootstrap-layer privacy properties; an A2A case study indicates that metadata protection can effectively suppress task inference.

Comments: 18 pages, 6 figures, 6 tables

Abstract

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships: it can infer the pending workflow and, at machine speed, act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone. We formalize a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting, which we measure to be comparable to other machine traffic, but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, evaluate candidate transports, and give an A2A case study where a metadata-protecting binding surfaces the protocol's implicit identity assumptions. On a generative model anchored to a real capture and over a live A2A binding, a label-blind classifier recovers a task's class from passive metadata well above chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. The leverage of acting on the leak is distinct from recoverability: under a fixed budget an adversary realizes most of a clairvoyant attacker's advantage from a workflow's opening, governed by precision over the top-ranked workflows rather than overall accuracy, so a defense suppresses it even while recovery stays above chance.

2606.06510 2026-06-16 cs.AR cs.AI cs.DC cs.PF New submission

FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail (June 13th version)

Satoshi Matsuoka

Affiliations: RIKEN Center for Computational Science (R-CCS)

AI summary: Using the Chinese Remainder Theorem-based Ozaki Scheme II, the paper argues that FP8 tensor throughput on AI-optimized GPUs can deliver memory-ceiling performance at full FP64 accuracy, challenging the conventional view that native FP64 hardware is the foundation of scientific computing.

Comments: This is the revised version of the previous submission (May 28th version). There is a companion Part (2) paper focusing on Ozaki-style FFT

Abstract

Conventional HPC holds that native hardware FP64 is the irreducible foundation of scientific computing. On AI-optimized GPUs of the NVIDIA B300 generation and beyond, native FP64 throughput has collapsed to ~1.3 TFLOPS even as FP8 tensor throughput has grown to multiple PFLOPS. We argue something stronger than that this is survivable: the FP8 tensor-core matrix-multiply is the sole computational primitive on which double-precision scientific computing needs to be built. Every canonical kernel -- dense and sparse linear algebra, spectral transforms, stencils -- and every application composing them reduces, via the Chinese Remainder Theorem-based Ozaki Scheme II, to sequences of FP8 matrix operations; the only non-FP8 arithmetic is a bounded, fixed-width integer accumulation at reconstruction. Native FP64 is thereby demoted from a hardware requirement to a derived accuracy guarantee obtained by composition over the FP8 primitive. We organize the claim as a five-layer hierarchy -- the FP8 op, Ozaki II, the basic kernels or Berkeley "dwarfs", composite solvers, and full applications -- and, because the dwarf taxonomy already spans scientific computing, establish it by exhibiting the reduction for every dwarf rather than a sample. The claim is falsifiable, and we build the instrument that tests it: a Tensor-Memory Equilibrium (TME) model extending the Roofline with emulation parameters (alpha, beta, gamma). We identify register-level fusion as the mechanism that keeps emulation memory-bound, project recovered FP64 performance across B300 and Rubin against an H100 baseline, and close the kernel coverage with a companion FFT analysis and compensated reductions. The model could have returned a negative verdict; instead it passes across the dwarfs and their compositions. This is the analytical half of a two-part program, with a follow-on implementation to validate the thesis on real silicon.
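
The role of the Chinese Remainder Theorem in this argument can be shown on integers alone: an exact product is recovered from several small modular products plus one integer reconstruction. The sketch below does this for a dot product. The moduli, the input vectors, and the function names are illustrative assumptions; the splitting and scaling of FP64 operands into FP8-representable pieces that the Ozaki Scheme II requires is not reproduced.

```python
from math import prod


def crt_reconstruct(residues, moduli):
    """Recover the unique x in the symmetric range around zero with x = r_i (mod m_i)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m) is the modular inverse of Mi
    x %= M
    return x - M if x > M // 2 else x


def modular_dot(a, b, moduli):
    """Exact integer dot product computed only from small per-modulus residues."""
    residues = []
    for m in moduli:
        acc = 0
        for x, y in zip(a, b):
            acc = (acc + (x % m) * (y % m)) % m  # every intermediate stays below m * m
        residues.append(acc)
    return crt_reconstruct(residues, moduli)


MODULI = [251, 241, 239, 233]  # pairwise-coprime primes below 256 (illustrative choice)
a = [12345, -6789, 4242]
b = [-321, 654, 987]
result = modular_dot(a, b, MODULI)
```

The reconstruction is exact only while the true result lies within half the product of the moduli, so the number of moduli sets the attainable precision. That is the sense in which accuracy becomes a property of the composition and not of the arithmetic unit.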

2604.13301 2026-06-16 cs.CR cs.AI Cross-list

Honeypot Protocol

Najmul Hasan

Affiliations: University of North Carolina at Pembroke

AI summary: To address the vulnerability of passive monitoring to adaptive attacks in AI control, proposes the honeypot protocol, which tests a model's context-dependent behavior by varying only the system prompt; in the experiments, the model behaved uniformly across all conditions.

Comments: 8 pages, 1 figure, 1 table. Research conducted at the AI Control Hackathon, March 2026. Code: https://github.com/najmulhasan-code/honeypot-protocol

Abstract

Trusted monitoring, the standard defense in AI control, is vulnerable to adaptive attacks, collusion, and strategic attack selection. All of these exploit the fact that monitoring is passive: it observes model behavior but never probes whether the model would behave differently under different perceived conditions. We introduce the honeypot protocol, which tests for context-dependent behavior by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding the task, environment, and scoring identical. We evaluate Claude Opus 4.6 in BashArena across all three conditions in both honest and attack modes. The model achieved 100% main task success and triggered zero side tasks uniformly across conditions, providing a baseline for future comparisons with stronger attack policies and additional models.

2601.23018 2026-06-16 cs.HC cs.AI Cross-list

Integrating Multi-Label Classification and Generative AI for Scalable Analysis of User Feedback

Sandra Loop, Erik Bertram, Sebastian Juhl, Martin Schrepp

Affiliations: SAP SE; Hochschule Fresenius Heidelberg; University of Missouri

AI summary: Proposes combining supervised multi-label classification with generative AI to process large volumes of user comments efficiently, automatically assigning topic labels and generating summaries; also finds that sentiment analysis does not reliably reflect product satisfaction.

Comments: 8 pages, 2 figures, submitted to Springer Nature

Abstract

In highly competitive software markets, user experience (UX) evaluation is crucial for ensuring software quality and fostering long-term product success. Such UX evaluations typically combine quantitative metrics from standardized questionnaires with qualitative feedback collected through open-ended questions. While open-ended feedback offers valuable insights for improvement and helps explain quantitative results, analyzing large volumes of user comments is challenging and time-consuming. In this paper, we present techniques developed during a long-term UX measurement project at a major software company to efficiently process and interpret extensive volumes of user comments. To provide a high-level overview of the collected comments, we employ a supervised machine learning approach that assigns meaningful, pre-defined topic labels to each comment. Additionally, we demonstrate how generative AI (GenAI) can be leveraged to create concise and informative summaries of user feedback, facilitating effective communication of findings to the organization and especially upper management. Finally, we investigate whether the sentiment expressed in user comments can serve as an indicator for overall product satisfaction. Our results show that sentiment analysis alone does not reliably reflect user satisfaction. Instead, product satisfaction needs to be assessed explicitly in surveys to measure the user's perception of the product.

2512.10104 2026-06-16 cs.CR cs.AI cs.IR 交叉投稿

Phishing Email Detection Using Large Language Models

使用大型语言模型检测钓鱼邮件

Najmul Hasan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang

Affiliations * University of Science and Technology of China

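AI summary: Proposes the LLMPEA framework, which uses three frontier LLMs including GPT-4o to detect phishing emails with over 90% accuracy, and exposes vulnerabilities to adversarial attacks, prompt injection, and multilingual attacks.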

Comments 7 pages

AI Chinese abstract

电子邮件钓鱼是最普遍且具有全球影响的网络入侵载体之一。随着系统越来越多地部署大型语言模型(LLM)应用,这些系统面临利用其基本架构的不断演变的钓鱼邮件威胁。当前的LLM在部署到电子邮件安全系统之前需要大量加固,特别是针对利用架构漏洞的协调多向量攻击。本文提出了LLMPEA,一个基于LLM的框架,用于检测跨多个攻击向量的钓鱼邮件攻击,包括提示注入、文本精炼和多语言攻击。我们评估了三种前沿LLM(例如GPT-4o、Claude Sonnet 4和Grok-3)以及全面的提示设计,以评估它们针对钓鱼邮件攻击的可行性、鲁棒性和局限性。我们的实证分析表明,LLM可以以超过90%的准确率检测钓鱼邮件,同时我们也强调,基于LLM的钓鱼邮件检测系统可能受到对抗攻击、提示注入和多语言攻击的利用。我们的发现为现实环境中攻击者结合利用多个漏洞的基于LLM的钓鱼检测提供了关键见解。

English abstract

Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Model (LLM) applications, they face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (GPT-4o, Claude Sonnet 4, and Grok-3) and a comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect phishing emails with over 90% accuracy, while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attacks, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

2605.03289 2026-06-16 stat.ML cs.LG math.ST stat.TH updated version

Imbalanced Classification under Capacity Constraints

容量约束下的不平衡分类

Daniel Fraiman, Ricardo Fraiman

Affiliations * Departamento de Matemática y Ciencias, Universidad de San Andrés; CONICET, Argentina; PEDECIBA Matemática, Uruguay

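AI summary: Addresses capacity constraints in minority-class detection with a formal classification framework whose optimal classifier is equivalent to the Bayes classifier under reweighted prior probabilities, introduces a capacity-adjusted performance metric, and reports experiments in which it outperforms classical approaches and SMOTE.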

AI Chinese abstract

在欺诈检测、医学筛查和工业质量控制等应用中,从严重类别不平衡中检测少数类观测是一个核心挑战。在这些场景中,每个阳性预测都会触发昂贵的后续行动(如MRI扫描、交易审计),其执行受到实际运营约束。本文提出了一个容量约束下的形式化分类框架:给定用户定义的界限$b$(可标记为少数类的观测比例上限),目标是找到在该类上最大化灵敏度的分类器。我们刻画了该约束下的最优分类器,并建立了其与重加权先验概率下的经典贝叶斯分类器的等价性。我们还引入了一个容量调整的性能指标$M$,用于衡量容量约束生效时的有效检测率。该框架在标准学习方法(k-NN、SVM、随机森林和神经网络)上实现,并为每种方法建立了统计一致性。我们进一步证明,当没有超参数面向容量约束目标时,这些方法退化为事后阈值调整,并引入了一种容量感知支持向量机,在训练过程中利用约束,实现了最强的经验性能。在台湾信用卡违约数据集上的实验证实,在高不平衡情况下,容量约束分类器显著优于经典方法和SMOTE。该框架自然地扩展到多类别设置和在线环境。

English abstract

Detecting observations from a minority class under severe class imbalance is a central challenge in applications such as fraud detection, medical screening, and industrial quality control. In these settings, each positive prediction triggers a costly follow-up action (for example, an MRI scan or a transaction audit) whose execution is subject to real operational constraints. This paper proposes a formal classification framework under capacity constraints: given a user-defined bound $b$ on the proportion of observations that can be labeled as belonging to the minority class, the goal is to find the classifier that maximizes sensitivity on that class. We characterize the optimal classifier under this constraint and establish its equivalence with the classical Bayes classifier under a reweighting of the prior probabilities. We also introduce a capacity-adjusted performance metric $M$ that accounts for the effective detection rate when the capacity constraint is binding. The framework is implemented on top of standard learning methods (k-NN, SVM, random forests, and neural networks), and statistical consistency is established for each. We further show that these methods reduce to post-hoc thresholding when no hyperparameters are oriented toward the capacity-constrained objective, and introduce a capacity-aware support vector machine that exploits the constraint during training and achieves the strongest empirical performance. Experiments on the Taiwanese credit card default dataset confirm that capacity-constrained classifiers substantially outperform both classical approaches and SMOTE under high imbalance regimes. The framework extends naturally to multiclass settings and online environments.
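
The post-hoc thresholding idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `capacity_constrained_labels` and `sensitivity`, the toy scores, and the rule of flagging the `floor(b * n)` highest-scoring observations are all assumptions made for illustration, and the paper's metric $M$ is not reproduced here.

```python
import math


def capacity_constrained_labels(scores, b):
    """Flag at most a fraction b of observations, picking the highest scores.

    Hypothetical helper: `scores` are classifier scores for the minority
    class (higher means more likely minority), `b` is the capacity bound.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    n = len(scores)
    k = math.floor(b * n)  # maximum number of positive predictions allowed
    # Rank indices by descending score; sorted() is stable, so ties keep
    # their original order.
    ranked = sorted(range(n), key=lambda i: -scores[i])
    flagged = set(ranked[:k])
    return [1 if i in flagged else 0 for i in range(n)]


def sensitivity(y_true, y_pred):
    """Fraction of true minority-class observations that were flagged."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("y_true contains no positive observations")
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives


# Toy data (assumed): 10 observations, 3 of them in the minority class.
scores = [0.91, 0.15, 0.62, 0.08, 0.77, 0.30, 0.05, 0.55, 0.12, 0.20]
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

y_pred = capacity_constrained_labels(scores, b=0.2)
print(y_pred)
print(sensitivity(y_true, y_pred))
```

With `b=0.2` and ten observations, only two can be flagged, so sensitivity is capped by the capacity bound regardless of how good the scores are. That cap is the situation the abstract calls a binding constraint.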

2606.04990 2026-06-16 cs.CR cs.AI updated version

From Agent Traces to Trust: A Survey of Evidence Tracing and Execution Provenance in LLM Agents

从智能体痕迹到信任:LLM智能体中的证据追踪与执行溯源

Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Manqing Dong, Mingkai Zhang, Xuefei Yin, Yanming Zhu

Affiliations * Griffith University; Jiangsu University; University of Southern Queensland; Peking University; Great Bay University; Nanjing University; Macquarie University; Southern University of Science and Technology

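AI summary: Systematically surveys evidence tracing and execution provenance methods for LLM agents, connects retrieval, tool use, memory, and related stages through a unified provenance perspective, proposes a taxonomy, and discusses open challenges.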

AI Chinese abstract

基于大语言模型(LLM)的智能体通过与外部工具、检索系统、记忆模块、环境及其他智能体交互,日益解决复杂任务。这些能力增强了智能体的自主性,但也使其行为更难以验证、调试和审计。仅凭最终答案的准确性无法解释输出是如何产生的、每个主张由哪些证据支持、工具调用是否合理、记忆如何影响后续决策或执行失败的根源。证据追踪和执行溯源通过建模检索到的证据、工具输出、记忆项、环境观察、中间主张、动作和最终答案在智能体执行过程中的连接方式,弥补了这一空白。本综述对LLM智能体中的证据追踪和执行溯源进行了系统回顾和概念框架构建。我们围绕统一的溯源视角组织相关工作,该视角连接了检索依据、主张支持、工具使用安全、记忆谱系、可观测性、调试、审计和恢复。我们引入了一个分类体系,涵盖追踪来源、证据和执行单元、溯源关系、追踪粒度和时机、表示形式以及信任功能。我们回顾了关键方法论方向,包括溯源表示、证据归因、工具使用溯源、运行时护栏、携带溯源的记忆、基于痕迹的可观测性和故障诊断。我们还绘制了现有基准、数据集和评估指标与溯源相关能力的映射,并讨论了评估如何从最终答案正确性转向过程级问责。最后,我们概述了开放挑战,包括统一痕迹模式、主张级和语义溯源、溯源感知的安全机制、现实执行痕迹基准、面向恢复的评估以及隐私感知的审计基础设施。

English abstract

Large language model (LLM)-based agents are evolving from passive text generators into autonomous systems capable of planning, tool use, retrieval, memory access, environmental interaction, and multi-agent collaboration. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where failures originated. This survey examines evidence tracing and execution provenance as foundations for process-level accountability in trustworthy LLM agents. We define execution provenance as the typed graph of an agent execution and evidence tracing as its projection onto evidence-support relations. This perspective connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery within a unified framework. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We then review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, observability, and failure diagnosis. Finally, we discuss benchmarks, datasets, metrics, and open challenges for building provenance-aware, auditable, and recoverable agent systems.
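
The abstract's definition of execution provenance as a typed graph, with evidence tracing as its projection onto evidence-support relations, can be sketched as below. The survey does not prescribe a schema, so the `Node` and `Edge` classes, the node kinds, the relation names (`"supports"`, `"precedes"`), and both helper functions are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str  # assumed kinds: "retrieved_evidence", "tool_output", "claim", ...


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    relation: str  # assumed relations: "supports", "precedes"


def project(edges, relation):
    """Keep only the edges of one relation type (the projection step)."""
    return [e for e in edges if e.relation == relation]


def supporting_evidence(nodes, edges, claim_id):
    """Return (node_id, kind) pairs for nodes that directly support a claim."""
    kinds = {n.node_id: n.kind for n in nodes}
    sources = sorted(e.src for e in project(edges, "supports") if e.dst == claim_id)
    return [(s, kinds[s]) for s in sources]


# Toy execution graph (assumed): one retrieval, one tool call, one claim.
nodes = [
    Node("doc1", "retrieved_evidence"),
    Node("tool1", "tool_output"),
    Node("claim1", "claim"),
    Node("answer", "final_answer"),
]
edges = [
    Edge("doc1", "claim1", "supports"),
    Edge("tool1", "claim1", "supports"),
    Edge("claim1", "answer", "supports"),
    Edge("tool1", "doc1", "precedes"),  # execution order, not evidence
]

evidence_trace = project(edges, "supports")
print(supporting_evidence(nodes, edges, "claim1"))
```

Keeping execution-order edges and evidence edges in one graph, then filtering by relation type, lets the same record serve both debugging (what happened when) and auditing (what supports which claim).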