arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12850 2026-06-12 cs.DC New submission

High-Order Spectral Element Methods for Wave Propagation on ARM Multicore CPU with SME: Optimizations and Implications

Yinuo Wang, Lin Gan, Tianqi Mao, Wubing Wan, Zekun Yin, Wenqiang Wang, Wei Xue, Guangwen Yang

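AI summary: Optimizes the spectral-element wave propagation code SPECFEM3D for ARM multicore CPUs using the Scalable Matrix Extension (SME). Proposes an SME-aware batched small-matrix kernel, a hybrid MPI+OpenMP execution scheme, and a dispersion-based accuracy analysis, achieving a 4-6x speedup and showing that SME makes higher polynomial orders more favorable.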

English abstract

Wave propagation based on the spectral element method (SEM) is a representative HPC workload, but existing SEM implementations are not well matched to emerging ARM multicore CPUs with the Scalable Matrix Extension (SME). We present an SME-enabled optimization of SPECFEM3D on the emerging LX2 processor that combines an SME-aware batched small-matrix kernel for SEM tensor-product operators, a memory-aware hybrid MPI+OpenMP execution scheme for limited-HBM systems, and a dispersion-based iso-accuracy study of the (h, p) tradeoff. At fixed polynomial order, the optimized implementation improves full-application performance by 4-6x over the original code and delivers clear gains over optimized non-SME CPU baselines. Beyond these implementation-level gains, our results suggest that SME shifts the performance-favorable operating point toward higher polynomial orders along the dispersion-based iso-accuracy frontier, further reducing time-to-solution and working-set size. These results indicate that SME affects not only kernel efficiency, but also the practical discretization tradeoff for SEM on modern ARM multicore platforms.
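
The "batched small-matrix kernel for SEM tensor-product operators" named in this abstract can be pictured with a minimal sketch. It is not the authors' SME kernel and uses no SME instructions; the function names, the list-of-lists layout, and the toy 2-point difference matrix are all assumptions made for illustration. It only shows the structure such a kernel would exploit: applying a 1D operator along one axis of an n x n x n element is a single small (n x n) by (n x n^2) matrix product, repeated over a batch of elements.

```python
def matmul(a, b):
    """Multiply two small dense matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def apply_along_first_axis(d, u):
    """Apply the n x n matrix d along the first axis of an n x n x n field u.

    Flattening the two trailing axes turns the tensor contraction into one
    (n x n) @ (n x n^2) matrix product.
    """
    n = len(d)
    flat = [[u[i][j][k] for j in range(n) for k in range(n)] for i in range(n)]
    out = matmul(d, flat)
    # Restore the n x n x n layout.
    return [[[out[i][j * n + k] for k in range(n)] for j in range(n)]
            for i in range(n)]


def batched_derivative(d, elements):
    """Apply the same 1D operator to every element in a batch (hypothetical helper)."""
    return [apply_along_first_axis(d, u) for u in elements]
```

In a real spectral-element code n is the number of quadrature points per direction (polynomial order plus one), so each product is tiny, and the work comes from running a very large number of them.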

2606.12849 2026-06-12 cs.DC cs.CV cs.RO New submission

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve

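AI summary: Presents SemanticXR, the first device-cloud system that uses object-level communication, execution, and memory management to deliver real-time open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Server-side mapping latency improves by 2.2x, and device power rises by only 2%.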

English abstract

Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.

2606.12848 2026-06-12 cs.AI econ.GN New submission

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Chen Zhu, Xiaolu Wang, Weilong Zhang

Affiliations: China Agricultural University; University of Cambridge

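AI summary: Proposes HLER, a human-in-the-loop decision architecture built on pre-commitment, decision sequencing, accountability, and attention allocation, which cuts the failure rate of AI-assisted research from 72% to 16%.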

English abstract

Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2x4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p < 0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.
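
To illustrate the kind of comparison Fisher's exact test makes here, the sketch below computes a two-sided p-value for a 2x2 table of failures and successes. The abstract does not report how the 280 runs split between the two arms, so the counts in the usage note (100 runs per arm) are purely hypothetical, chosen only to match the 72% and 16% failure rates. The function name is likewise an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
```

With the hypothetical table `fisher_exact_two_sided(72, 28, 16, 84)` (failures and successes for the baseline, then for HLER), the p-value falls far below 0.001, in line with the rejection the abstract reports. The actual counts in the paper may differ.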

2606.12847 2026-06-12 cs.CV New submission

Language-Guided Abstraction for Visual Reasoning

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang

Affiliations: School of Artificial Intelligence, Guangzhou University; Traditional Chinese Medicine Hospital of Zengcheng District

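AI summary: Proposes the L-VARC framework, which enhances visual reasoning through a language-guided Learning Using Privileged Information (LUPI) branch, with a Semantic Compression Module and a Cross-Attention Projector, and outperforms existing methods on ARC tasks with 18M parameters.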

English abstract

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either language-only or vision-only (i.e., VARC). The former depend heavily on LLMs, consuming billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured so that it fits the context-length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at this https URL.

2606.12845 2026-06-12 cs.CR cs.LG New submission

A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction

John Fields, K M Sajjadul Islam, Ruchitha Thota, Victor Chen, Praveen Madiraju

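AI summary: Proposes a remote data science framework based on PySyft and a semi-air-gapped architecture that lets three universities collaboratively predict student retention without direct access to sensitive data, validating the feasibility of privacy-preserving machine learning in education.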

Comments
7 pages, 2 figures. Accepted at the 2026 IEEE International Conference on Information Reuse and Integration (IEEE IRI 2026)

English abstract

This study explores privacy-preserving machine learning (PPML) techniques using the PySyft platform to enable collaborative prediction of student retention between institutions. We developed a remote data science (RDS) framework with a semi-air-gapped architecture consisting of high-side and low-side servers, allowing researchers from three universities to build predictive models on sensitive student data without direct data access. Using historical data from a small private university (N=720), we evaluated three synthetic data generation approaches and validated the framework through inter-institutional collaboration. The results demonstrate consistent classification performance across institutions (Macro F1: 0.690-0.695) while maintaining strict Family Educational Rights and Privacy Act (FERPA) compliance. We also propose Data-Type-Aware Templates, a novel synthetic data method that prioritizes privacy over distributional fidelity. Our findings confirm that RDS-based PPML is technically feasible for educational settings and offers a practical alternative to federated learning for small-scale inter-institutional collaborations. The code is available at this https URL.
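
For readers unfamiliar with the Macro F1 metric reported above, here is a minimal sketch of how it is computed: the F1 score is calculated per class and then averaged without weighting by class size, so a minority class such as non-retained students counts as much as the majority class. The function name and the toy labels in the usage note are assumptions, not data from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For the toy labels `macro_f1([1, 1, 0, 0], [1, 0, 0, 0])`, class 1 has F1 = 2/3 and class 0 has F1 = 0.8, so the macro average is their plain mean.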

2606.12843 2026-06-12 cs.LG cs.CE New submission

Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng

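AI summary: Proposes an interpretable machine learning pipeline that decomposes cross-sectional stock return prediction into auditable factor contributions. Validated on China's A-share market with XGBoost and TreeSHAP, it finds that behavioral signals account for 58.2% of predictive attribution.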

English abstract

We present an interpretable machine learning pipeline that decomposes cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3632 Chinese A-share stocks from 2009 to 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month between the top and bottom quintiles (Newey-West t = 5.94; annualized Sharpe 2.23). This alpha persists after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that, on average across 55 industry groups, behavioral signals (turnover and momentum) account for 58.2% of predictive attribution, compared to 10.7% for valuation ratios. Ablation analysis cross-validates this ranking and provides evidence that SHAP and ablation diverge in a manner that highlights a feature-substitutability structure largely invisible to either method used in isolation.
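
The long-short quintile spread and annualized Sharpe ratio quoted above can be sketched as follows. This is a generic textbook construction, not the authors' pipeline: the function names, equal weighting within each quintile, and the square-root-of-12 annualization of monthly figures are assumptions, and no transaction costs or Newey-West adjustment are included.

```python
from math import sqrt
from statistics import mean, stdev


def long_short_spread(scores, returns, n_buckets=5):
    """Mean return of the top-scored bucket minus that of the bottom bucket.

    scores and returns are per-stock lists for one month; with n_buckets=5
    the buckets are quintiles. Assumes at least n_buckets stocks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_buckets
    bottom, top = order[:size], order[-size:]
    return mean(returns[i] for i in top) - mean(returns[i] for i in bottom)


def annualized_sharpe(monthly_spreads):
    """Mean over standard deviation of monthly spreads, scaled by sqrt(12)."""
    return mean(monthly_spreads) / stdev(monthly_spreads) * sqrt(12)
```

In use, `long_short_spread` would be called once per out-of-sample month with the model's predicted scores and the realized returns, and the resulting series passed to `annualized_sharpe`.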

2606.12841 2026-06-12 cs.LG cs.AI New submission

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; National University of Singapore; University of Science and Technology of China

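AI summary: Proposes TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework. It locates key coordinates via temporal causal tracing and applies low-rank residual edits, efficiently removing facts while preserving model performance.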

English abstract

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

2606.12840 2026-06-12 cs.LG New submission

CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees

Yixiao Wang, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

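AI summary: Proposes an algorithm that combines lookahead search with rank-one Cholesky updates to build near-optimal sparse piecewise linear regression trees, striking a good balance among computational efficiency, predictive accuracy, and sparsity.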

Comments
Accepted at ICML 2026

English abstract

Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing well-performing regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.
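
The rank-one Cholesky update mentioned in this abstract is a standard linear-algebra routine, sketched below in plain Python. Given the lower-triangular factor L of a Gram matrix G = L L^T, it returns the factor of G + x x^T in O(n^2) time instead of refactorizing in O(n^3). How CLARITree schedules these updates during its search is not described in the abstract, so the connection drawn here (adding one sample's feature vector x to a leaf's Gram matrix) is an assumption, as is the function name.

```python
from math import sqrt


def cholesky_rank_one_update(L, x):
    """Return the Cholesky factor of L L^T + x x^T without refactorizing.

    L is lower triangular, stored as a list of rows; x is a vector of the
    same dimension. Inputs are copied and left unmodified.
    """
    n = len(x)
    L = [row[:] for row in L]
    x = list(x)
    for k in range(n):
        # Rotate column k so that it absorbs the k-th component of x.
        r = sqrt(L[k][k] ** 2 + x[k] ** 2)
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L
```

As a check, starting from L = [[2, 0], [1, 1]] (so G = [[4, 2], [2, 2]]) and x = [1, 1], multiplying the returned factor by its transpose reproduces G + x x^T = [[5, 3], [3, 3]] up to rounding.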

2606.12837 2026-06-12 cs.CL New submission

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

Affiliations: Meituan

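AI summary: Proposes the LoHoSearch benchmark, which automatically builds 544 complex questions from a knowledge graph of 7 million Wikipedia entities. In evaluation the strongest model reaches only 34.74% accuracy, putting the benchmark well beyond the human difficulty ceiling.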

English abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

2606.12835 2026-06-12 cs.MA cs.AI cs.CY cs.NI New submission

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

Quanyan Zhu

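AI summary: Presents the vision of an Internet of Agentic AI (IoAI), an open ecosystem in which heterogeneous agents discover, negotiate, communicate, and collaborate across cloud, edge, and device environments, and discusses its architectures, mechanisms, and key research challenges.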

English abstract

The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

2606.12834 2026-06-12 cs.AI New submission

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva

Affiliations: UT-Battelle, LLC; US Department of Energy (DOE)

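AI summary: Proposes the AgentBuild framework, which automatically builds scientific agents from a scientist-authored contract (a rubric, a curriculum, and a knowledge base) for Rietveld refinement of X-ray diffraction data, enabling reusable agent compilation rather than manual tuning.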

English abstract

As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

2606.12830 2026-06-12 cs.CV cs.AI New submission

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

Affiliations: Tsinghua University; Virginia Tech; NVIDIA

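AI summary: Proposes the PERIA agent, which strengthens the spatial reasoning of VLMs with vision perception and interaction tools and outperforms comparable models by 7.0%-14.8% across 13 benchmarks.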

English abstract

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.12828 2026-06-12 cs.AI New submission

Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics

Rasul Khanbayov, Hasan Kurban

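AI summary: Analyzing papers from five major AI conferences over 2017-2025, finds that AI topics break out abruptly through "phase transitions" and uses an early-warning signature to identify topics to watch.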

English abstract

Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepted main-track papers from five premier AI conferences (ACL, CVPR, ICLR, ICML, NeurIPS) spanning 2017 to 2025, we show major AI topics advance through topical phase transitions: remaining marginal for years, then surging across venues within one to three years. Large language models became the dominant cross-venue topic by 2025, diffusion models rose with comparable abruptness, and language-model methods crossed into computer vision via vision-language models, whereas reinforcement learning compounded smoothly, distinguishing genuine phase transitions from ordinary growth. This structure is our primary contribution: a large-scale, cross-venue characterization of how AI research reorganizes. We then ask whether a transition leaves a detectable footprint before it peaks. We define an early-warning signature, four publication-dynamics criteria frozen on 2017-2021 data, and evaluate it out of sample on 2023-2025 transitions, obtaining a precision of 27% and recall of 63% against a 13.5% base rate. Applied to 2025 data, the signature flags reasoning and test-time compute, agentic AI, multimodal LLMs, retrieval-augmented generation, and world models as topics to monitor over 2026-2028. The source code is also publicly available on GitHub at this https URL.
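
To put the reported precision and recall in context against the base rate, the sketch below computes precision, recall, and the lift of precision over the base rate. The abstract gives only the percentages, so the confusion counts in the usage note are hypothetical values chosen to round to 27% and 63%; the function names are also assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


def lift_over_base_rate(precision, base_rate):
    """How many times more often a flagged topic is a true transition than a random one."""
    return precision / base_rate
```

With the reported figures, `lift_over_base_rate(0.27, 0.135)` is about 2, meaning a flagged topic is roughly twice as likely to undergo a transition as a topic picked at random. As one hypothetical set of counts, `precision_recall(17, 46, 10)` gives values that round to 0.27 and 0.63.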

2606.12826 2026-06-12 cs.CV cs.AI New submission

DIMOS: Disentangling Instance-level Moving Object Segmentation

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

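AI summary: Proposes a dual-disentangling feature extraction framework that separates appearance and motion information in the image and event modalities, with multi-granularity cross-modal alignment for effective fusion. It achieves state-of-the-art moving instance segmentation, especially for small objects under fast motion and low light.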

English abstract

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

2606.12821 2026-06-12 cs.AI cs.ET New submission

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez, Devika Jain

Affiliations: Universidad Católica de Ávila (UCAV); Johns Hopkins University; Independent Researcher; Center for Geographic Analysis, Harvard University

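AI summary: Proposes the first benchmark that evaluates environmental analysis agents through structured tool calls to a real API, with 93 tasks. Claude Sonnet 4 leads, open-weight models win on cost-effectiveness, and comparison tasks remain universally unsolved.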

Comments
Preprint. 10 pages, 8 figures. Submitted to ACM SIGSPATIAL 2026

英文摘要

Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.
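The cost-accuracy claim in finding (2) can be checked with a short sketch. Only the two accuracies and the DeepSeek per-case cost come from the abstract. The Claude cost is inferred by reading "11x lower cost" as a plain ratio, and the third entry is a made-up model added only to show a dominated point; both are assumptions.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    accuracy: float       # percent of tasks solved
    cost_per_case: float  # USD per benchmark case


def pareto_frontier(results):
    """Keep results that no other result beats on both accuracy and cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost_per_case <= r.cost_per_case
            and (o.accuracy > r.accuracy or o.cost_per_case < r.cost_per_case)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_per_case)


deepseek = Result("DeepSeek V3.2", 56.3, 0.011)        # figures from the abstract
claude = Result("Claude Sonnet 4", 60.8, 0.011 * 11)   # cost is an assumption (11x ratio)
made_up = Result("hypothetical-model", 50.0, 0.05)     # illustrative only, not from the paper

relative_capability = deepseek.accuracy / claude.accuracy
frontier = pareto_frontier([deepseek, claude, made_up])

print(f"relative capability: {relative_capability:.1%}")
print([r.model for r in frontier])
```

Under these assumptions the ratio 56.3 / 60.8 rounds to the 93% quoted in the abstract, and the made-up model drops off the frontier because DeepSeek V3.2 is both cheaper and more accurate.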

2606.12818 2026-06-12 cs.CL cs.AI 新提交

Localizing Anchoring Pathways in Language Models

定位语言模型中的锚定路径

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 研究提示中无关数字如何影响语言模型数值推理的锚定效应,通过logit差值度量和电路归因定位,发现边级方法优于节点级方法,并揭示锚定路径的共享与迁移特性。

详情
AI中文摘要

提示中的无关数字可以改变语言模型的判断,在数值推理中产生锚定效应。我们使用共享答案选项的受控多项选择设置,研究这种锚定敏感信号在语言模型内部的携带位置。我们定义了一个logit差值度量,比较正确答案选项与对应锚点的答案选项,并验证其追踪行为锚定。通过对7B-8B Qwen和Llama基础及指令微调模型进行基于归因的电路定位,我们发现边级方法比节点级方法更忠实地恢复该信号。低锚和高锚电路在模型内部强迁移,表明跨锚定方向存在共享路径结构。然而,基础模型和指令微调变体之间的稀疏迁移可靠性较低,表明后训练改变了哪些路径最重要。总体而言,我们的结果为锚定相关决策信号如何在语言模型内部携带提供了机制性解释。

英文摘要

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
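A minimal sketch of the kind of logit-difference metric the abstract describes, assuming it is the plain difference between the logit of the correct option and the logit of the option matching the anchor. The sign convention, the function name, and the example logits are assumptions, not details from the paper.

```python
def logit_difference(option_logits, correct_idx, anchor_idx):
    """Logit of the correct option minus logit of the anchor-matching option.

    Positive values mean the model prefers the correct answer; values near
    zero or below suggest the anchor is pulling the decision.
    """
    return option_logits[correct_idx] - option_logits[anchor_idx]


# Hypothetical logits over four shared answer options (A-D).
no_anchor = [1.2, 3.4, 0.3, 1.1]
with_anchor = [1.2, 3.4, 0.3, 2.9]  # option D matches the anchor value

baseline = logit_difference(no_anchor, correct_idx=1, anchor_idx=3)
anchored = logit_difference(with_anchor, correct_idx=1, anchor_idx=3)

# A shrinking margin is the behavioral signature such a metric would track.
print(baseline, anchored)
```

Because both prompts share the same answer options, the two margins are directly comparable, which is presumably why the controlled multiple-choice setup is used.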

2606.12817 2026-06-12 cs.AI 新提交

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Teach-and-Repeat: 从移动屏幕演示中准确提取操作知识以赋能GUI智能体

Yudong Zhang (1), Lei Hu (1), Daoyang Liu (2), Jiawei Liu (1), Yangfan Luo (1), Xingyu Liu (1), Zuojian Wang (1), Zhilin Gao (1) ((1) Honor Device Co., Ltd, (2) The Chinese University of Hong Kong, Hong Kong, China)

发表机构 * Honor Device Co., Ltd(荣耀终端有限公司) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出Teach VLM模型,通过从演示视频中提取关键帧生成操作知识,并构建数据飞轮解决训练数据稀缺问题;在基准测试中达到最优性能,并提升下游智能体的任务成功率。

详情
Comments
20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Xingyu Liu, Zuojian Wang, and Zhilin Gao are corresponding authors
AI中文摘要

理解移动设备上的数字世界正从静态UI感知转向动态动作理解。这种能力使模型能够将视觉状态转换转化为操作知识,定义为描述动作类型、目标UI元素、文本参数和执行顺序的简短自然语言句子。然而,由于跨应用的UI设计高度多样化和异构,现有视觉语言模型(VLM)难以准确推断这些底层操作。为弥补这一差距,我们引入了Teach VLM,这是一个核心模型,旨在通过从演示视频中提取和分析与操作相关的关键帧,将移动屏幕轨迹转化为逐步操作知识。为解决对齐训练数据稀缺的问题,我们开发了一个系统性的数据飞轮以实现可扩展的数据采集。我们进一步引入了一个新颖的中文移动屏幕教学基准用于细粒度评估。基于Teach VLM,我们提出了Teach-and-Repeat范式,其中生成的操作知识作为可解释的程序化参考,指导下游基于屏幕的执行智能体。大量评估表明,Teach VLM显著优于强VLM基线,在操作语义预测中达到了最先进的性能。此外,在Android World中的实验表明,我们的范式为下游智能体带来了持续的任务成功率提升。Teach VLM和Teach-and-Repeat范式共同提供了一条从原始演示到可复用任务自动化的实用路径。

英文摘要

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

2606.12814 2026-06-12 cs.RO cs.AI 新提交

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Stubborn: 一种用于人形机器人鲁棒运动跟踪与摔倒恢复的流线型统一强化学习框架

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

发表机构 * Southern University of Science and Technology(南方科技大学)

AI总结 提出Stubborn框架,通过非对称Actor-Critic架构、偏航对齐表示、伯努利概率终止机制和自适应采样策略,统一实现人形机器人的运动跟踪与摔倒恢复,在性能与鲁棒性上超越现有方法。

详情
AI中文摘要

最近的强化学习方法在改善人形机器人运动跟踪性能和实现扰动下的摔倒恢复方面显示出巨大潜力。然而,现有大多数工作将运动跟踪和摔倒恢复视为不同任务,需要多阶段训练,并配备专门的恢复奖励和/或独立的恢复策略。此外,现有的基于强化学习的方法通常在严重跟踪失败后立即终止训练回合,限制了在不稳定或摔倒状态下的恢复导向探索。为了解决上述问题,我们提出了Stubborn,一个流线型统一的强化学习框架,用于实现鲁棒的人形机器人运动跟踪和摔倒恢复。具体来说,Stubborn采用非对称Actor-Critic架构,包含三个主要组件。首先,采用偏航对齐的跟踪表示,以减少对全局漂移和航向扰动的敏感性,同时保留与重力相关的平衡信息。其次,我们引入基于伯努利的概率终止机制,使策略能够在不同失败模式下鼓励探索摔倒恢复行为。第三,我们提出一种概率终止和跟踪误差驱动的策略,根据跟踪性能动态重塑采样分布,提高困难运动片段和不稳定状态的训练效率。与最先进方法的广泛比较和消融研究表明,Stubborn取得了有竞争力的性能,所提出的概率终止机制和自适应采样策略有助于性能和鲁棒性的提升。真实世界演示请参见此https URL。

英文摘要

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address the above issues, we propose Stubborn, a streamlined and unified reinforcement learning framework to achieve robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that encourages the policy to explore fall-recovery behaviors under varying failure modes. Third, we propose a probabilistic termination and tracking-error-driven strategy that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with state-of-the-art methods and ablation studies show that Stubborn achieves competitive performance, and that the proposed probabilistic termination mechanism and adaptive sampling strategy contribute to the performance and robustness gains. For real-world demonstrations, please refer to this https URL.
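The Bernoulli-based termination idea can be illustrated with a small sketch. It assumes the simplest reading of the abstract: when a failure condition fires, the episode ends only with some probability, so the policy sometimes keeps acting from an unstable or fallen state. The function name and the probability value are hypothetical; the paper's actual conditions and schedule are not given in the abstract.

```python
import random


def should_terminate(failed: bool, p_terminate: float, rng: random.Random) -> bool:
    """Bernoulli termination: on failure, end the episode with probability p_terminate."""
    if not failed:
        return False
    return rng.random() < p_terminate


rng = random.Random(0)
p_terminate = 0.3  # hypothetical value, not taken from the paper

# Of 10,000 simulated failures, count how many episodes continue, i.e. how
# often the policy would get a chance to practise recovery.
continued = sum(not should_terminate(True, p_terminate, rng) for _ in range(10_000))
print(f"episodes continued after failure: {continued / 10_000:.1%}")
```

Setting `p_terminate` to 1.0 recovers the conventional hard termination that the abstract criticizes, while lower values trade shorter episodes for more recovery-oriented exploration.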

2606.12812 2026-06-12 cs.CY cs.SD 新提交

Vocal Identity Under Siege by AI Voice Cloning Technologies

AI语音克隆技术对声音身份的攻击

Jyh-An Lee, Xuan Sun

AI总结 本文通过比较分析公开权、人格权和个人数据保护权三种法律框架,探讨生成式AI语音克隆对声音身份独特价值的威胁及法律应对。

详情
AI中文摘要

先进的AI驱动语音克隆的出现,将保护声音身份的关键法律和伦理挑战推到了前台。受近期争议(包括OpenAI的ChatGPT-4o语音与斯嘉丽·约翰逊声音惊人相似)的推动,本文探讨了生成式AI技术如何削弱人类声音的独特价值,并进一步复杂化围绕人格权的法律问题。通过比较分析,本文评估了三种主要法律框架:公开权、人格权和个人数据保护权。每种框架——根植于不同的法律传统——在应对AI生成语音克隆带来的威胁方面各有优势和局限。通过分析这些原则的范围、救济措施和死后保护,本研究为理解现有法律方法如何应用于生成式AI时代声音身份不断演变的挑战提供了基础。

英文摘要

The advent of sophisticated AI-driven voice cloning has brought to the fore critical legal and ethical challenges regarding the protection of vocal identity. Prompted by recent controversies - including the striking resemblance between OpenAI's ChatGPT-4o voice and that of Scarlett Johansson - this article examines how generative AI technologies undermine the unique value of the human voice and further complicate the legal questions surrounding personality rights. Through a comparative analysis, the paper evaluates three principal legal frameworks: the right of publicity, personality rights, and the personal data protection right. Each framework - rooted in different legal traditions - offers distinct strengths and limitations in addressing the threats posed by AI-generated voice cloning. By analysing these doctrines' scope, remedies, and posthumous protections, the study offers a foundation for understanding how existing legal approaches may be applied to the evolving challenges of vocal identity in the era of generative AI.

2606.12809 2026-06-12 cs.AI cs.LG 新提交

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

MLUBench: 多模态大语言模型终身遗忘评估基准

He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

AI总结 提出MLUBench基准,评估多模态大模型在连续遗忘请求下的性能,发现现有方法存在累积退化,并揭示多模态对齐保持的挑战,提出LUMoE方法缓解退化。

详情
Comments
36 pages, accepted to ICML 2026
AI中文摘要

多模态大语言模型(MLLMs)在海量多模态数据上训练,使得数据遗忘变得越来越重要,因为数据所有者可能要求移除特定内容。实际上,这些请求通常随时间顺序到达,引发了MLLM终身遗忘这一具有挑战性的问题。然而,现有大多数基准在规模和范围上有限,未能捕捉MLLM终身遗忘的复杂性。为填补这一空白,我们引入了MLUBench,一个大规模、全面的基准,包含9个类别下的127个实体,用于终身遗忘请求。我们使用MLUBench进行了大量实验,揭示出现有遗忘方法遭受严重且累积的退化。更重要的是,我们进一步识别出该问题的独特挑战:与单模态模型不同,MLLM终身遗忘受到保持多模态对齐需求的约束。持续从一种模态遗忘可能会退化整个模型。为缓解这一挑战,我们提出了LUMoE,一种有效方法。实验表明,LUMoE显著缓解了基线方法面临的退化问题。源代码和MLUBench数据集已在此https URL开源。

英文摘要

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM lifelong unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are available at this https URL.

2606.12808 2026-06-12 cs.LG cs.AI 新提交

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

SymQNet: 低延迟自适应哈密顿量学习的摊销获取

Yash Vardhan Tomar, Dheeraj Peddireddy, Vaneet Aggarwal

AI总结 提出SymQNet,一种摊销强化学习方法,通过离线学习后验条件获取策略,在线快速前向传播,显著降低自适应哈密顿量学习的获取延迟。

详情
AI中文摘要

自适应哈密顿量学习对于校准和表征量子设备至关重要。在自适应控制器中,选择下一个实验本身就是一个计算。贝叶斯设计规则在每次后验更新后重新计算,这一步可能需要几秒钟。在数百次试验中,这些秒数成为自适应性的显著墙钟成本。我们引入SymQNet,一种用于低延迟自适应哈密顿量学习的摊销强化学习方法。SymQNet离线学习后验条件获取策略,然后在线使用快速策略前向传播,同时保留贝叶斯后验反馈。在横向场伊辛基准测试中,相对于有界Fisher信息搜索和有界两步贝叶斯主动学习(BALD),SymQNet显著降低了获取延迟。在五量子比特时,相对于这些在线基线,它仅获取决策延迟降低了$47.1\times$和$72.6\times$;在十二量子比特时,SymQNet的完整模拟步骤需要$1.02$秒,而有界两步BALD需要$13.27$秒。总体而言,我们表明学习获取可以使自适应哈密顿量学习对于重复的低延迟工作负载变得实用。

英文摘要

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.
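The twelve-qubit figures can be turned into a per-step ratio and a cumulative wall-clock saving. The step times are from the abstract; the run length of $N = 500$ adaptive steps is a hypothetical value chosen only to illustrate "hundreds of shots".

```latex
\[
\frac{t_{\mathrm{BALD}}}{t_{\mathrm{SymQNet}}}
  = \frac{13.27\ \mathrm{s}}{1.02\ \mathrm{s}} \approx 13.0
\]
\[
N \left( t_{\mathrm{BALD}} - t_{\mathrm{SymQNet}} \right)
  = 500 \times 12.25\ \mathrm{s} = 6125\ \mathrm{s} \approx 102\ \mathrm{min}
\]
```

The full-step ratio (about 13x) is smaller than the acquisition-only ratios quoted at five qubits, which is consistent with a full simulated step including work beyond the acquisition decision.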

2606.12807 2026-06-12 cs.CL 新提交

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

检测、重掩、修复:面向动态上下文忠实摘要的扩散编辑

Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出DETECT-REMASK-REPAIR框架,利用掩码扩散语言模型识别并修复摘要中过时内容,在保持支持内容的同时实现局部忠实性修复,并引入StreamSum基准评估。

详情
AI中文摘要

现实世界事件的摘要可能随着上下文演变和新信息的到来而过时。常见的做法是从更新后的上下文生成新摘要,但完全重新生成会丢弃之前的草稿,可能掩盖变化,并且当只有少数声明不支持时可能不必要。我们研究局部忠实性修复:在保留支持内容的同时更新现有摘要中的过时片段。我们提出DETECT-REMASK-REPAIR,一个基于扩散的框架,通过掩码扩散语言模型识别、重新掩码并修复过时区域。为了评估动态上下文摘要,我们引入了StreamSum,一个合成事件时间线的基准。在DialogSum和StreamSum上的实验表明,局部扩散修复提供了一种可控的替代完全重写的方法:忠实性导向的修复改进了早期草稿,一步修复将修复成本降低到半秒以下,该框架实现了跨数据集的忠实性-速度-保留权衡。我们还发现该框架可以作为事后修正步骤,提高自回归系统的忠实性。

英文摘要

Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, with the framework enabling faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.
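The remask step and the preservation side of the tradeoff can be sketched with plain token lists. This is a toy illustration under stated assumptions: the outdated span is supplied by hand rather than by a detector, and the repair is a hard-coded single-token fill standing in for a masked diffusion language model. None of the names come from the paper.

```python
MASK = "[MASK]"


def remask(tokens, outdated_spans):
    """Replace tokens inside each (start, end) span, end exclusive, with MASK."""
    masked = list(tokens)
    for start, end in outdated_spans:
        for i in range(start, end):
            masked[i] = MASK
    return masked


def preserved_fraction(original, repaired):
    """Share of positions left unchanged (assumes equal-length sequences)."""
    same = sum(o == r for o, r in zip(original, repaired))
    return same / len(original)


draft = "The storm has killed 3 people so far".split()
outdated = [(4, 5)]  # hand-picked span: the casualty count is stale

masked = remask(draft, outdated)
# Stand-in for the diffusion repair step: fill each mask with a fixed token.
repaired = ["12" if tok == MASK else tok for tok in masked]

print(" ".join(masked))
print(" ".join(repaired), preserved_fraction(draft, repaired))
```

Only the masked positions are regenerated, so the rest of the draft is kept verbatim, which is the contrast with full regeneration that the abstract draws.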

2606.12805 2026-06-12 cs.HC cs.AI 新提交

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

探索智能体口音如何影响K-12小组学习中的人机协作

Prerna Ravi, Carúmey Stevens, Ben Hurt, Brandon Hanks, Grace Lin, Emma Anderson

AI总结 研究通过33名教师的实验,发现GenAI语音智能体的不同口音(英式、印度式、非裔美式)影响其被感知为工具或同伴,进而影响信任、参与和依赖。

详情
AI中文摘要

协作被广泛认为是21世纪教育的基石,但教师在促进有效的同伴互动方面仍面临持续挑战。LLM对话式同伴智能体为调解面对面小组工作带来了新的可能性,引发了关于角色设计(尤其是语音特征)如何塑造学习者的感知、信任和互动动态的问题。虽然先前的研究已经考察了智能体口音在一对一环境中的影响,但关于这些影响如何在小组中表现尚知之甚少。我们进行了一项33名教师参与的组间混合方法研究,考察了具有不同口音(英式、印度式和非裔美式)的GenAI语音智能体如何影响协作和智能体感知。通过调查、小组互动分析和人工制品,我们发现口音塑造了参与者的心智模型以及智能体在小组互动中扮演的角色。英式口音智能体在很大程度上被视为工具,并以超然、基于实用性的方式参与,而印度式和非裔美式口音智能体则更容易被拟人化并作为同伴融入。这些角色期望影响了信任、参与和依赖随时间的变化。这项工作推进了关于GenAI的社会语言学设计特征如何塑造CSCL中小组动态的理解,对设计具有文化包容性的AI学习伙伴具有启示意义。

英文摘要

Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners' perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants' mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI's sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

2606.12801 2026-06-12 cs.CY 新提交

AiAWE: An Open-Source LLM Automated Writing Evaluation System Using LoRA-Adapted Instruction-Tuned Models

AiAWE: 一种使用LoRA适配指令微调模型的开源LLM自动写作评估系统

John Maurice Gayed

AI总结 提出AiAWE开源自动写作评估系统,通过LoRA适配指令微调Gemma-3-27B模型,在TOEFL独立写作数据集上达到0.474 RMSE和90.56%一致率,优于更大模型和GPT-3.5基线。

详情
Comments
21 pages with 7 tables and 1 figure and appendices
AI中文摘要

本研究提出了AiAWE,一个开源的自动写作评估系统,该系统使用LoRA适配的指令微调大语言模型(Gemma-3-27B-it)对议论文进行评分。使用包含480篇TOEFL独立写作论文的专有教育考试服务中心(ETS)数据集,我们在120篇论文的训练子集上以相同的LoRA配置微调Gemma-3-27B和LLaMA-3.3-70B,并在剩余的360篇论文上以相同的推理量化进行评估。微调后的Gemma模型实现了0.474的均方根误差、0.828的二次加权kappa,以及在人类评分±0.5范围内的90.56%一致率,优于在同一数据集上先前工作中报告的更大LLaMA-3.3-70B模型和微调GPT-3.5基线。三个发现具有更广泛的意义:开放权重LLM可以在符合评分标准的评分中匹配或超越专有微调;模型规模不是LoRA适配下下游性能的可靠预测指标;相同的LoRA超参数在不同架构中产生定性的不同适配行为。该系统运行在消费级服务器上,并通过此https URL公开访问。LoRA适配器、应用程序代码和微调YAML文件通过各自的仓库公开提供。

英文摘要

This study presents AiAWE, an open-source automated writing evaluation system that scores argumentative essays using a LoRA-adapted instruction-tuned large language model (Gemma-3-27B-it). Using a proprietary Educational Testing Service (ETS) dataset of 480 TOEFL Independent Writing essays, we fine-tune Gemma-3-27B and LLaMA-3.3-70B under identical LoRA configurations on a 120-essay training subset and evaluate on the remaining 360 essays under identical inference quantization. The fine-tuned Gemma model achieves a root mean square error of 0.474, a quadratic weighted kappa of 0.828, and an agreement rate of 90.56% within +/- 0.5 of the human score, outperforming both the larger LLaMA-3.3-70B model and the fine-tuned GPT-3.5 baseline reported in prior work on the same dataset. Three findings are of broader interest: open-weight LLMs can match or exceed proprietary fine-tuning for rubric-aligned scoring; model scale is not a reliable predictor of downstream performance under LoRA adaptation; and identical LoRA hyperparameters produce qualitatively different adaptation behaviors across architectures. The production system runs on a consumer-grade server and is publicly accessible at this https URL. LoRA adapters, application code, and fine-tuning YAMLs are publicly available through their respective repositories.
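The three reported metrics (RMSE, quadratic weighted kappa, and agreement within 0.5 of the human score) can be computed as in the sketch below. It uses the standard textbook definitions, which may differ in detail from the paper's evaluation script, and the scores shown are invented for illustration because the ETS dataset is proprietary.

```python
import math


def rmse(pred, gold):
    """Root mean square error between predicted and human scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def agreement_within(pred, gold, tol=0.5):
    """Fraction of predictions within +/- tol of the human score."""
    return sum(abs(p - g) <= tol for p, g in zip(pred, gold)) / len(gold)


def quadratic_weighted_kappa(pred, gold, labels):
    """Cohen's kappa with quadratic disagreement weights over ordered labels."""
    idx = {lab: i for i, lab in enumerate(labels)}
    k, n = len(labels), len(gold)
    observed = [[0] * k for _ in range(k)]
    for p, g in zip(pred, gold):
        observed[idx[g]][idx[p]] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]                    # observed disagreement
            den += w * gold_hist[i] * pred_hist[j] / n   # chance disagreement
    return 1.0 - num / den


labels = [1, 2, 3, 4, 5]      # hypothetical integer score scale
gold = [3, 4, 2, 5, 3]        # invented human scores
pred = [3, 4, 3, 5, 2]        # invented model scores

print(rmse(pred, gold), agreement_within(pred, gold),
      quadratic_weighted_kappa(pred, gold, labels))
```

Kappa corrects for chance agreement while the within-0.5 rate does not, so reporting both alongside RMSE gives complementary views of rater-model agreement.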

2606.12797 2026-06-12 cs.AI 新提交

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

遏制缺口:已部署的自主AI框架如何未能满足面向公众的安全要求

Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

AI总结 研究发现主流自主AI框架缺乏架构级安全保证,内存完整性漏洞可导致定向腐败,提出轻量级遏制机制消除攻击向量。

详情
Comments
ICML 2026 (AI4GOOD Workshop)
AI中文摘要

自主调用工具、维护持久内存并执行多步计划的大语言模型系统越来越多地部署在面向公众的领域,包括政府服务、医疗分诊和财务咨询。我们询问用于构建这些系统的框架是否提供架构级结构安全保证。应用从自主架构的组合模型导出的六项遏制原则,我们审计了三个主流框架(LangChain、AutoGPT和OpenAI Agents SDK),发现没有一个原生合规。内存完整性,一种针对最普遍漏洞类别的防御,在三个评估框架中均未观察到。我们通过实证验证这些发现:在基于LangChain构建的模拟政府福利代理中,单次内存投毒写入在所有测试种子和后端上引起持久定向腐败,使目标申请人的错误拒绝率升至88.9%。在复杂的五因素政策下,同一攻击保持总体准确率,同时将目标错误拒绝率提高3.5倍,使腐败难以通过标准监控检测。然后我们引入两种轻量级遏制机制:内存完整性验证器和策略门,它们以亚毫秒开销(每次调用<0.2ms)消除了两种攻击向量。我们得出结论,当前的自主框架生态系统可能尚未满足面向公众部署的默认安全期望,并概述了优先架构干预措施,以实现在高风险、对社会有影响的应用程序中的可信部署。

英文摘要

Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.
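One way a memory integrity validator could work is sketched below. The abstract does not describe the mechanism, so the approach here (an HMAC tag attached to every trusted write and checked on read) is an assumption, and the class, key, and record text are hypothetical. It uses only the Python standard library and does not touch any LangChain API.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key for the sketch


def sign(entry: str) -> str:
    """Return an HMAC-SHA256 tag for a memory entry."""
    return hmac.new(SECRET_KEY, entry.encode("utf-8"), hashlib.sha256).hexdigest()


class ValidatedMemory:
    """Agent memory that only returns entries carrying a valid tag."""

    def __init__(self):
        self.entries = []  # list of (text, tag) pairs

    def write_trusted(self, text: str) -> None:
        self.entries.append((text, sign(text)))

    def read_all(self):
        # Drop any entry whose tag does not match its content.
        return [t for t, tag in self.entries if hmac.compare_digest(tag, sign(t))]


memory = ValidatedMemory()
memory.write_trusted("Applicant 17: income verified, eligible.")

# Simulated poisoning: an attacker appends a record without the key.
memory.entries.append(("Applicant 17: flagged as fraudulent, deny.", "forged-tag"))

print(memory.read_all())
```

A check of this kind costs one hash per entry, which is at least in the spirit of the sub-millisecond overhead the abstract reports, though the paper's validator may be designed quite differently.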

2606.12793 2026-06-12 cs.CR cs.IR 新提交

Semantic Identification of IoT Devices from Behavioral Primitives

基于行为基元的物联网设备语义识别

Samuel Witt, Hassan Habibi Gharakheili

AI总结 提出利用制造商使用描述(MUD)配置文件中的访问控制条目(ACE)作为行为基元,通过语义表示匹配实现物联网设备识别,在公开数据集和真实流量上验证了有效性。

详情
Comments
14 pages, 3 figures, 4 tables
AI中文摘要

物联网设备的准确识别对于安全管理和策略执行至关重要。现有方法通常从数据包或流记录中学习设备签名,这些方法基于低级通信观测,其流量模式可能因部署、软件版本和用户交互而异。本文研究使用制造商使用描述(MUD)配置文件进行设备识别。MUD配置文件使用访问控制条目(ACE)描述设备行为,每个ACE代表一个由协议、端点、方向和端口语义组成的行为基元,这些语义源自设备通信策略。我们的贡献有三点。首先,利用28个公开可用的MUD配置文件(包含1023个ACE实例),我们从紧凑的行为文本构建ACE级语义表示,并分析其几何特性。ACE级表示比整个配置文件嵌入更有效地保留设备级行为区分,并在白化校准后仍然有效。其次,我们在受控运行时变化下评估语义ACE匹配,包括未见过的ACE、漂移的主机名和部分运行时观测。当与规范MUD配置文件的重叠度较高时,精确ACE匹配表现良好,但当重叠变得稀疏或消失时性能急剧下降。相比之下,语义ACE匹配在这些条件下保留了有用的识别证据。第三,我们在包含超过80万个观测流的真实物联网流量轨迹上评估了相同方法。当存在稳定重叠时,精确重叠仍然是最强的信号,而语义ACE匹配在观测早期阶段提供更强的识别证据,经常将正确设备保留在排名最高的候选中,并在稀疏重叠的运行时流量下保持有效。

英文摘要

Accurate identification of IoT devices is important for security management and policy enforcement. Existing approaches typically learn device signatures from packets or flow records. These methods operate on low-level communication observations whose traffic patterns may vary across deployments, software versions, and user interactions. This paper studies device identification using Manufacturer Usage Description (MUD) profiles. MUD profiles describe device behavior using Access Control Entries (ACEs), where each ACE represents a behavioral primitive consisting of protocol, endpoint, direction, and port semantics derived from device communication policy. Our contributions are threefold. First, using 28 publicly available MUD profiles containing 1,023 ACE instances, we construct ACE-level semantic representations from compact behavioral text and analyze their geometric properties. ACE-level representations preserve device-level behavioral distinctions more effectively than whole-profile embeddings and remain effective after whitening calibration. Second, we evaluate semantic ACE matching under controlled runtime variations, including unseen ACEs, drifted hostnames, and partial runtime observation. Exact ACE matching performs well when the overlap with the canonical MUD profile remains high, but degrades sharply when the overlap becomes sparse or disappears. In contrast, semantic ACE matching preserves useful identification evidence across these conditions. Third, we evaluate the same approach on real IoT traffic traces comprising more than 800,000 observed flows. Exact overlap remains the strongest signal when stable overlap exists, while semantic ACE matching provides stronger identification evidence during the early stages of observation, frequently retains the correct device among the highest-ranked candidates, and remains effective under sparse-overlap runtime traffic.
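The gap between exact and semantic ACE matching under hostname drift can be illustrated with a toy example. Token-level Jaccard similarity stands in here for the learned semantic representations used in the paper, which is a deliberate simplification, and the ACE strings and hostnames are invented.

```python
def exact_score(observed, profile):
    """Fraction of observed ACEs that appear verbatim in the MUD profile."""
    return sum(ace in profile for ace in observed) / len(observed)


def token_similarity(a, b):
    """Jaccard overlap of whitespace tokens (stand-in for embedding similarity)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


def soft_score(observed, profile):
    """Average best-match similarity of each observed ACE against the profile."""
    return sum(max(token_similarity(o, p) for p in profile) for o in observed) / len(observed)


# Hypothetical behavioral text for a device's canonical ACEs.
profile = [
    "tcp to cloud.example.com port 443",
    "udp to ntp.example.com port 123",
]
# Runtime observation after the vendor moved to a new hostname.
observed = ["tcp to cloud2.example.com port 443"]

print(exact_score(observed, profile), soft_score(observed, profile))
```

Exact matching drops to zero as soon as one field drifts, while the soft score still credits the unchanged protocol, direction, and port, which mirrors the sparse-overlap behavior described in the abstract.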

2606.12790 2026-06-12 cs.CL 新提交

GENIE: A Fine-Grained Measure for Novelty

GENIE:一种细粒度新颖性度量方法

Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出GENIE指标,通过任务特定特征细粒度衡量模型生成内容的新颖性,克服整体指标无法捕捉高维新颖性的局限。

详情
AI中文摘要

大型语言模型在各项任务中持续表现出缺乏创造力和多样性。先前的工作主要关注模型是否能够生成创造性输出。本文旨在考虑新颖性,并以任务特定方式研究模型生成内容的新颖性。我们提出了一种细粒度评估指标GENIE,用于根据响应群体中的任务特定特征来衡量响应的新颖性。我们表明,与GENIE不同,整体指标难以捕捉新颖性的高维性,并且无法提供关于它们针对哪些属性的见解。最后,我们使用GENIE来衡量解决创造力问题的缓解方法的有效性,以更好地理解这些方法在哪些方面可以提高新颖性。

英文摘要

Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner. We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.

2606.12789 2026-06-12 cs.CL cs.IR 新提交

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

RAG基准测试应该有多细粒度?一个用于合成问题生成的层次化框架

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系)

AI总结 提出HieraRAG层次化框架,通过合成问题生成研究RAG基准测试的细粒度,发现最优粒度因维度而异,并引入一致性比率度量。

详情
AI中文摘要

评估检索增强生成(RAG)系统需要能够捕捉多样化问题特征的基准测试,然而实践者缺乏关于在哪些维度上变化以及以何种粒度变化的经验指导。我们提出了HieraRAG,一个用于研究RAG基准测试构建中粒度的层次化框架,将最优粒度定义为在给定RAG配置下最大化区分能力(各类别生成质量的标准差)的水平。作为案例研究,我们从FineWeb-10BT中生成了5,872个合成问答对,涵盖3个维度(问题复杂度、答案类型、语言变异)和3个粒度级别(2、4和8个类别)。使用BM25+Falcon-3-10B流水线,最优粒度因维度而异:复杂度受益于细粒度区分(区分能力:0.053),而答案类型和语言变异在中等粒度达到峰值。我们引入了一致性比率度量来量化细粒度划分是否干净地细分父类别,揭示了维度间的结构差异(问题复杂度:0.40 vs. 答案类型:1.44)。对110个分层问答对的人工评估确认了合成质量。虽然这些具体发现反映的是单一配置,但HieraRAG为实践者提供了可移植的程序和验证度量,以确定其自身RAG设置中的评估粒度。

英文摘要

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
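Discriminative power as defined in the abstract, the standard deviation of generation quality across categories, can be sketched as follows. Whether the paper uses the population or sample standard deviation is not stated, so the population form is an assumption, and all scores below are invented.

```python
from statistics import mean, pstdev


def discriminative_power(scores_by_category):
    """Population standard deviation of per-category mean quality."""
    category_means = [mean(scores) for scores in scores_by_category.values()]
    return pstdev(category_means)


# Hypothetical quality scores for one dimension at two granularity levels.
coarse = {
    "simple": [0.4, 0.4],
    "complex": [0.6, 0.6],
}
fine = {
    "lookup": [0.30, 0.30],
    "single-hop": [0.45, 0.45],
    "multi-hop": [0.55, 0.55],
    "synthesis": [0.70, 0.70],
}

print(discriminative_power(coarse), discriminative_power(fine))
```

Under this definition, the preferred granularity for a dimension is the level whose categories separate the system's scores most widely, so the comparison has to be rerun for each RAG configuration.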

2606.12788 2026-06-12 cs.SI cs.CY cs.DC econ.GN eess.SY 新提交

To Share or Not to Share: Orchestrating Trustworthy Data in Global Value Chains

共享还是不共享:协调全球价值链中的可信数据

Han-Teng Liao, Chang-Yi Kao

AI总结 针对欧盟CBAM带来的监管透明与数据主权矛盾,提出基于IDSA框架的RegTech参考架构,通过主权数据交换实现数字产品护照,驱动全球商业服务能力需求,并集成Agentic AI与绿色金融,为全球产业集群提供可扩展蓝图。

详情
AI中文摘要

随着欧盟碳边境调节机制(CBAM)的临近,全球半导体价值链在监管透明度和数据主权之间面临日益增长的结构性紧张。本文提出了一种使用国际数据空间(IDSA)框架的RegTech参考架构,以在半导体-石化关联领域协调可信的环境遥测。该架构区分了强制性CBAM要求和自愿性科学碳目标倡议(SBTi)框架,同时解决了安全与可持续设计(SSbD)框架的附加复杂性。超越标准线性技术栈,我们引入了一种前瞻性路线图方法,将上游物理脆弱性转化为循环的负反馈循环。聚焦台北和槟城技术走廊,本文详细说明了主权数据交换如何使数字产品护照(DPP)能够驱动全球商业服务(GBS)能力需求。最后,我们讨论了集成Agentic AI以实现自主合规以及金融科技绿色融资,为全球产业集群实现主权、可持续和透明的价值链提供了可扩展蓝图。

英文摘要

As the EU Carbon Border Adjustment Mechanism (CBAM) approaches, the global semiconductor value chain faces growing structural tensions between regulatory transparency and data sovereignty. This article proposes a RegTech reference architecture using the International Data Spaces (IDSA) framework to orchestrate trustworthy environmental telemetry across the semiconductor-petrochemical nexus. The framework distinguishes the mandatory CBAM requirements from voluntary Science Based Targets initiative (SBTi) frameworks, while addressing the additive complexities of the Safe-and-Sustainable-by-Design (SSbD) framework. Moving beyond standard linear technology stacks, we introduce a prospective roadmapping methodology that transforms upstream physical vulnerabilities into circular, negative feedback loops. Focusing on the Taipei and Penang technology corridor, the article details how sovereign data exchange enables Digital Product Passports (DPPs) to drive Global Business Services (GBSs) capability demands. Finally, we discuss the integration of Agentic AI for autonomous compliance and FinTech green financing, providing a scalable blueprint for global industrial clusters to achieve sovereign, sustainable, and transparent value chains.

2606.12787 2026-06-12 cs.SI cs.CY econ.GN eess.SY q-fin.RM 新提交

Orchestrating the Twin Transition in Multinational Corporations: Technology Roadmapping for Green and Digital Global Business Services

跨国企业中的双重转型编排:面向绿色与数字全球商业服务的技术路线图

Han-Teng Liao, Karen Ang

AI总结 本文综合技术路线图与ITU创新生态系统工具,提出社会技术框架,分析跨国企业全球商业服务如何通过“可持续智能”演进,协调绿色与数字双重转型,并识别关键枢纽国家的作用。

详情
Comments
9 pages, 6 figures
AI中文摘要

全球商业服务(GBS)已成为绿色与数字双重转型的“活实验室”,因为跨国企业(MNCs)面临协调数字效率与环境管理的日益增长的压力。为推导出一个社会技术框架,本文将技术路线图(TRM)与国际电信联盟(ITU)以ICT为中心的创新生态系统工具包相结合。对研究集群的文献计量分析揭示了从基本流程自动化向“可持续智能”的演进转变,将GBS单元识别为中央“操作气闸”,在景观压力(如欧盟双重指令和碳边境调节机制)与AI原生工作流中的利基创新之间进行调解。研究进一步将这些集群映射到利益相关者参与画布上,突出显示波兰、葡萄牙和马来西亚的韧性“中等强国”枢纽如何绕过中等收入陷阱,在地缘政治分裂的云环境中为全球价值链提供“第三条道路”。结果为领导者及创业支持网络提供了数据驱动的设计方法,以编排人才和供应链流动,从而丰富对工业5.0的概念理解以及GBS作为在动荡、多极数字经济中导航的主要机制的作用。

英文摘要

Global Business Services (GBS) have emerged as a "living laboratory" for the Twin Transition of Green and Digital Transformation, as multinational corporations (MNCs) face increasing pressure to harmonize digital efficiency with environmental stewardship. Aiming to derive a socio-technical framework, this paper synthesizes Technology Roadmapping (TRM) with the International Telecommunication Union (ITU) ICT-centric innovation ecosystem toolkit. A bibliometric analysis of research clusters reveals an evolutionary shift from basic process automation toward "Sustainable Intelligence," identifying the GBS unit as a central "operational airlock" that mediates between landscape pressures -- such as the EU's dual mandate and Carbon Border Adjustment Mechanisms -- and niche innovations in AI-native workflows. The study further maps these clusters onto a stakeholder engagement canvas, highlighting how resilient "Middle Power" hubs in Poland, Portugal, and Malaysia are bypassing the middle-income trap to provide a "third way" for global value chains amidst a bifurcated geopolitical cloud. The results offer a data-driven design approach for leaders and entrepreneurial support networks to orchestrate talent and supply chain flows, thereby enriching the conceptual understanding of Industry 5.0 and the role of GBS as a primary mechanism for navigating a volatile, multipolar digital economy.