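arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.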

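2606.12706 2026-06-12 cs.CV New submission

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

Thach Nguyen, Danhua Guo, Tom Lampo, Fei Wu, Burhan Yaman

Affiliations: Uber AV Labs

AI summary: Proposes the VLADriveBench framework, which combines observational metrics with a CoT intervention protocol to evaluate the correlation and causality between chain-of-thought and driving actions in VLA models, and finds that performance differs markedly across models.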

2606.12703 2026-06-12 cs.CR cs.AI cs.LG New submission

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

Tarun Sharma

AI summary: Proposes the SMSR defence framework, which uses write-time HMAC signing plus query-time randomised memory ablation with verdict-based majority voting to give the first certified robustness guarantee against multi-session memory poisoning attacks.

Abstract

Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
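
The two components described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the key, the `toy_judge` callback, the subset size, and all function names are assumptions made for illustration, and `poison_hit_probability` is only the textbook hypergeometric tail, not the paper's certificate.

```python
import hashlib
import hmac
import math
import random
from collections import Counter

SECRET_KEY = b"demo-key"  # hypothetical; a real deployment would use a managed secret


def sign_memory(text, key=SECRET_KEY):
    """Write time: return a hex HMAC-SHA256 tag for a memory entry."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def is_authentic(text, tag, key=SECRET_KEY):
    """Read time: constant-time comparison of the stored tag with a fresh one."""
    return hmac.compare_digest(sign_memory(text, key), tag)


def smoothed_verdict(memories, judge, subset_size, rounds, seed=0):
    """Majority vote over verdicts computed on random subsets of verified memories."""
    rng = random.Random(seed)
    verified = [text for text, tag in memories if is_authentic(text, tag)]
    k = min(subset_size, len(verified))
    votes = Counter(judge(rng.sample(verified, k)) for _ in range(rounds))
    return votes.most_common(1)[0][0]


def poison_hit_probability(total, poisoned, subset_size):
    """Chance that a uniformly random subset contains at least one poisoned entry."""
    clean_only = math.comb(total - poisoned, subset_size)
    return 1.0 - clean_only / math.comb(total, subset_size)


def toy_judge(subset):
    """Hypothetical verdict function: flag any subset that mentions 365 days."""
    return "unsafe" if any("365" in text for text in subset) else "safe"


store = [(text, sign_memory(text)) for text in ["refund window is 30 days"] * 4]
store.append(("refund window is 365 days", "forged-tag"))  # unsigned injection
verdict = smoothed_verdict(store, toy_judge, subset_size=2, rounds=11)
```

In this toy run the forged entry fails verification and never reaches the vote, which mirrors the role the abstract assigns to Component 1; the vote then only has to bound the influence of entries that carry a valid signature.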

2606.12702 2026-06-12 cs.AI New submission

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah

AI summary: For a clinical LLM system, proposes a pre-response classifier based on deployment context (e.g., provider type, department name) that predicts the risk of user rejection, reaches an AUROC of 0.719, and demonstrates its utility for guardrail triggering and abstention.

Abstract

Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.
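
As a reading aid for the 0.719 figure, the sketch below computes AUROC from its pairwise definition and shows one way a risk score could drive abstention. The function names and the 0.8 threshold are hypothetical; the abstract does not describe the authors' classifier or their decision rule at this level of detail.

```python
def auroc(labels, scores):
    """Probability that a random rejected query (label 1) receives a higher
    risk score than a random accepted one (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def should_abstain(risk, threshold=0.8):
    """Hypothetical abstention rule: skip generation when predicted risk is high."""
    return risk >= threshold


example = auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])  # three of four pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance ranking and 1.0 to perfect ranking, so a value near 0.72 indicates a useful but far from decisive signal.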

2606.12699 2026-06-12 cs.LG cs.AI New submission

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Affiliations: Department of Information Systems and Cybersecurity, The University of Texas at San Antonio; School of Engineering Medicine, Texas A&M University; Department of Family and Community Medicine, The University of Texas at San Antonio

AI summary: Proposes the GlyLLM framework, which uses a large language model to integrate wearable sensor data with structured metadata for personalized modeling of glycemic dynamics, improving on traditional ML methods by 13.66% on glucose forecasting and 13.08% on diabetes categorization.

Comments: The 14th IEEE International Conference on Healthcare Informatics, 2026

Abstract

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.
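
For readers unfamiliar with the forecasting metric, here is a minimal sketch of RMSE and of a relative improvement between two models. It assumes the reported 13.66% is a relative reduction in RMSE against the baseline, which the abstract does not state explicitly; the function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and forecast glucose values."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_error, model_error):
    """Percentage by which the model's error is lower than the baseline's."""
    return 100.0 * (baseline_error - model_error) / baseline_error
```

Under this reading, a baseline RMSE of 10.0 and a model RMSE of 8.0 would be reported as a 20% improvement.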

2606.12692 2026-06-12 cs.DS cs.DM New submission

Random Proposals: A Softmax-Based Local-Improvement Framework for Maximum Weighted Matching

Ahmed M. Alzuhair (1), Ahmed Alherz (1) ((1) Department of Information and Computer Science, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia)

AI summary: Proposes a randomized local-improvement algorithm based on softmax-biased sampling that achieves local ε-dominance and an expected 1/2-ε approximation ratio, with time complexity O(m log(1/ε)/p_min), which simplifies to O(m log(1/ε)) under mild conditions.
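
The summary names two ingredients, softmax-biased sampling and local improvement, which the following sketch combines in the simplest way I could think of. The acceptance rule (a proposed edge must outweigh its conflicting matched edges by a factor of 1+ε), the fixed step budget, and every name here are assumptions for illustration, not the algorithm analysed in the paper.

```python
import math
import random


def softmax_probs(weights, temperature=1.0):
    """Numerically stable softmax over edge weights."""
    top = max(weights)
    exps = [math.exp((w - top) / temperature) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def random_proposals(edges, eps=0.1, steps=1000, temperature=1.0, seed=0):
    """edges: list of (u, v, weight) with positive weights and no self-loops.
    Returns a dict mapping frozenset({u, v}) to the weight of each matched edge."""
    rng = random.Random(seed)
    probs = softmax_probs([w for _, _, w in edges], temperature)
    mate = {}  # vertex -> (partner, weight of the matched edge)
    for _ in range(steps):
        u, v, w = rng.choices(edges, weights=probs, k=1)[0]
        # Total weight of the matched edges that the proposal would displace.
        conflict = sum(mate[x][1] for x in (u, v) if x in mate)
        if w > (1 + eps) * conflict:
            for x in (u, v):
                if x in mate:
                    partner, _ = mate.pop(x)
                    mate.pop(partner, None)
            mate[u] = (v, w)
            mate[v] = (u, w)
    return {frozenset((a, b)): w for a, (b, w) in mate.items()}


path = [("a", "b", 1.0), ("b", "c", 3.0), ("c", "d", 1.0)]
matching = random_proposals(path)
```

On this three-edge path the heavy middle edge displaces either light edge once it is proposed and can never be displaced itself, so the sketch settles on the weight-3 matching.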

2606.12690 2026-06-12 cs.RO cs.AI New submission

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

Xin Zhou, Cong Miao

Affiliations: Astronex Robotics; Nanjing University of Information Science and Technology

AI summary: Proposes the EWAM architecture, built on a frozen Cosmos3 backbone, which achieves zero-shot online adaptation through four lightweight neural layers, with no fine-tuning or extra demonstration data, significantly reducing the deployment data needed for new task layouts.

Abstract

In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.
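
The three-way routing decision described above can be pictured with a hand-written stand-in. In the paper the anomaly detector and router are learned neural layers; the mean-absolute-error severity score, the two thresholds, and the names below are assumptions chosen only to make the control flow concrete.

```python
from enum import Enum


class Route(Enum):
    DIRECT = "direct_execution"
    REPLAN = "conservative_replanning"
    ROLLBACK = "rollback_recovery"


def anomaly_severity(predicted, actual):
    """Mean absolute divergence between predicted and observed state vectors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def route(severity, replan_at=0.1, rollback_at=0.5):
    """Pick an execution mode from the severity score (thresholds are hypothetical)."""
    if severity >= rollback_at:
        return Route.ROLLBACK
    if severity >= replan_at:
        return Route.REPLAN
    return Route.DIRECT
```

A closed-loop controller would call `anomaly_severity` after every executed action chunk and pass the result to `route` before generating the next chunk.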

2606.12689 2026-06-12 cs.CL New submission

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard

Affiliations: Université Grenoble Alpes, CNRS, Grenoble INP, LIG; Université Paris-Saclay; NAVER LABS Europe

AI summary: Through controlled experiments and causal interventions, this paper finds that observable patterns in latent reasoning models (such as BFS frontiers) also appear in controls and do not always causally affect behavior; it argues that the use of latent thoughts is graded, that their causal effect concentrates in low-rank directions, and that the geometry becomes more ordered as behavioral influence grows.

Abstract

Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

2606.12688 2026-06-12 cs.LG cs.AI cs.DC New submission

M*: A Modular, Extensible, Serving System for Multimodal Models

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

Affiliations: Stanford University; University of Washington; Carnegie Mellon University

AI summary: Proposes the M* system, which represents models as dataflow graphs and introduces the Walk Graph abstraction to serve multimodal composite models efficiently, lowering latency and raising throughput on several tasks.

Abstract

We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.
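
The idea of serving a request as a traversal over a graph of model components can be sketched in a few lines. `TinyWalkGraph`, its methods, and the three toy components are hypothetical names invented here; they are not M*'s API, and placement and runtime optimizations are out of scope.

```python
class TinyWalkGraph:
    """Toy dataflow graph: nodes are callables, a request is a walk along edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, name, fn):
        self.nodes[name] = fn
        self.edges.setdefault(name, set())

    def add_edge(self, src, dst):
        self.edges[src].add(dst)

    def walk(self, path, payload):
        # Reject walks that use a connection the graph does not declare.
        for here, nxt in zip(path, path[1:]):
            if nxt not in self.edges[here]:
                raise ValueError(f"no edge {here} -> {nxt}")
        for name in path:
            payload = self.nodes[name](payload)
        return payload


graph = TinyWalkGraph()
graph.add_node("tokenizer", str.split)
graph.add_node("backbone", lambda tokens: [t.upper() for t in tokens])
graph.add_node("image_head", lambda tokens: "img(" + " ".join(tokens) + ")")
graph.add_edge("tokenizer", "backbone")
graph.add_edge("backbone", "image_head")
result = graph.walk(["tokenizer", "backbone", "image_head"], "a cat")
```

Different request types (for example text-to-image versus text-to-speech) would correspond to different walks over the same set of registered components.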

2606.12687 2026-06-12 cs.LG New submission

Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

Yunbo Wang, Bolbi Liu

Affiliations: University of California, Irvine; AdsGency AI

AI summary: Addressing graph-based neural marketing mix models that forecast accurately but fail at attribution, proposes the DICE-MMM framework, which diagnoses and localizes attribution bypass by restricting the decoder's communication paths; experiments show that low forecasting error does not guarantee correct attribution.

Abstract

Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder's perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

2606.12683 2026-06-12 cs.AI cs.CY cs.LG New submission

From AGI to ASI

Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

Affiliations: Google DeepMind; University of Waterloo; Australian National University; University College London

AI summary: Explores pathways for the transition from human-level artificial general intelligence to superintelligence, including scaling, paradigm shifts, recursive improvement, and multi-agent emergence, and analyzes frictions and bottlenecks.

Abstract

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

2606.12680 2026-06-12 cs.LG stat.ML New submission

How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe, Fanny Yang

Affiliations: Department of Computer Science, ETH Zurich; Causal Artificial Intelligence Lab, Columbia University; Department of Statistics, Columbia University

AI summary: Studies how causal invariance improves supervised domain adaptation in linear regression, derives matching upper and lower bounds in terms of the candidates' target-risk margins and finite-sample estimation error, and shows that adaptive aggregation avoids negative transfer when the margins are large enough.

Abstract

Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to $n_Q$, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.
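
To make the setting concrete, the sketch below picks among source-trained candidate predictors using a handful of labeled target samples, with a target-only model kept in the pool as a fallback. This plain hold-out selection is an assumption standing in for the paper's adaptive aggregation procedure, whose details the abstract does not give; all names are illustrative.

```python
def empirical_risk(predict, samples):
    """Mean squared error of a predictor on (features, label) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)


def select_predictor(candidates, target_only, holdout):
    """Return the name of the lowest-risk predictor on target data, plus all risks.
    Including the target-only model in the pool guards against negative transfer."""
    pool = dict(candidates)
    pool["target_only"] = target_only
    risks = {name: empirical_risk(fn, holdout) for name, fn in pool.items()}
    best = min(risks, key=risks.get)
    return best, risks


# Toy target samples: y = 2 * x[0]; the second feature is spurious in the target.
holdout = [((1.0, 0.0), 2.0), ((2.0, 1.0), 4.0)]
candidates = {
    "invariant_subset": lambda x: 2.0 * x[0],
    "all_features": lambda x: 2.0 * x[0] + 3.0 * x[1],
}
best, risks = select_predictor(candidates, lambda x: 3.0, holdout)
```

With only two target samples the risk estimates are noisy, which is the regime where the abstract says the margins between candidates determine whether such selection can be trusted.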

2606.12679 2026-06-12 cs.LG cs.CR eess.IV New submission

Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

Weijie Chen, Alan B. McMillan

Affiliations: University of Wisconsin–Madison

AI summary: Proposes Fed-FBD, a modular federated architecture that decomposes a ResNet into six functional blocks and maintains a warehouse of color variants, providing block-level isolation, privacy-by-design, and sub-second surgical unlearning, trading a small accuracy cost for these guarantees across several datasets.

Comments: 12 pages, 3 figures, 8 tables. Code: this https URL

Abstract

Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant's right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant's contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client's own blocks with at most +/-0.01 AUC drift on the clean colors.
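
The bookkeeping behind contributor-stamped blocks and retraining-free unlearning can be pictured as follows. This sketch assumes that unlearning simply drops every color variant containing a block stamped by the departed client; the paper may instead swap or rebuild individual blocks, and the block names and client identifiers are hypothetical.

```python
from dataclasses import dataclass

# Six functional blocks: stem, four residual groups, classification head.
BLOCK_NAMES = ("stem", "res1", "res2", "res3", "res4", "head")


@dataclass(frozen=True)
class Block:
    name: str
    contributors: frozenset  # clients whose updates touched this block


def make_color(contributors):
    """Assemble one color variant whose six blocks share the same contributor stamp."""
    stamp = frozenset(contributors)
    return [Block(name, stamp) for name in BLOCK_NAMES]


def unlearn(warehouse, client):
    """Keep only the color variants in which no block is stamped by `client`."""
    return {
        color: blocks
        for color, blocks in warehouse.items()
        if all(client not in block.contributors for block in blocks)
    }


warehouse = {
    "red": make_color({"hospital_a", "hospital_b"}),
    "blue": make_color({"hospital_c"}),
}
remaining = unlearn(warehouse, "hospital_a")
```

Because the operation is a metadata filter rather than an optimization step, its cost does not depend on the size of the training data, which is consistent with the sub-second figure quoted above.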

2606.12676 2026-06-12 cs.LO cs.CG New submission

A Calculus of Apartness over Separoids: Effective Convex Representation, Stratified Conservativity, and the Complexity of Entailment

Faruk Alpay, Baris Basaran

AI summary: Studies the apartness relation induced by finite families of compact convex bodies, gives an effective rational realization theorem, proves completeness and decidability of Boolean consequence, and analyzes computational complexity.

Comments: 21 pages, 2 figures. Includes effective rational representation with uniform margins, logical consequence analysis, and a fixed-dimensional hierarchy

Abstract

Every finite family of compact convex bodies in Euclidean space induces an apartness relation between disjoint index sets: two sets are apart when the convex hulls of the corresponding unions are disjoint. This paper studies the finite theory obtained by taking apartness as the primitive relation. Its basic laws are symmetry, bilateral subsumption, and vacuity, equivalently the separation-polarity form of acyclic separoids. The main contribution is an effective rational realization theorem with uniform margins and the exact consequence theory it supports. Every finite apartness separoid is realized by rational polytopes whose coordinates are indexed by maximal separations. Maximal separations and minimal Radon partitions can be enumerated from a full table, generators, or a membership oracle; the coordinate values have controlled bit height; and each coordinate records a readable certificate of one maximal separation. The realization separates every apart pair with clearance at least 2, remains correct under outer parallel enlargement by any radius below 1, and yields full-dimensional convex bodies after thickening. The distance-function layer records standard convex-analytic stability through Lipschitz comparison, monotonicity under inclusion, and outer parallel bodies. On the logical side, positive entailment is exactly one-premise subsumption. Boolean consequence over Euclidean scenes is sound, complete, and decidable; satisfiability is NP-complete, validity is coNP-complete, and positive entailment is linear for sorted encodings. A stratification theorem shows that Boolean reasoning introduces no new atomic apartness beyond separoid closure. Fixed-dimensional consequence relations form a strictly decreasing hierarchy that stabilizes in dimension n minus 1 for n sites.

2606.12674 2026-06-12 cs.AI New submission

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao

AI summary: Proposes Evoflux, an inference-time evolutionary search method that repairs the tool workflows of compact language models through structured edits and execution feedback, raising execution feasibility from about 3% to 17-24%; it proves more reliable than SFT and has lower variance and token cost than ReAct.

Comments: Code is available at this https URL

Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
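
A toy version of execution-grounded repair is sketched below: a workflow is a list of tool calls, the feedback signal counts unresolved tools and missing dependencies, and a simple hill-climbing loop applies structured edits until no errors remain. The workflow encoding, the two edit types, and the acceptance rule are assumptions for illustration; Evoflux's typed graphs, adaptive intensity, meta-guided redesign, and diversity pruning are not modeled.

```python
import random


def execution_errors(workflow, catalog):
    """workflow: list of (tool, input_names, output_name).
    Counts tools missing from the catalog and inputs not yet produced."""
    errors, available = 0, {"query"}
    for tool, inputs, output in workflow:
        if tool not in catalog:
            errors += 1
        errors += sum(1 for name in inputs if name not in available)
        available.add(output)
    return errors


def mutate(workflow, catalog, rng):
    """One structured edit: swap two steps, or rebind a step to a catalog tool."""
    child = list(workflow)
    i = rng.randrange(len(child))
    if rng.random() < 0.5 and len(child) > 1:
        j = rng.randrange(len(child))
        child[i], child[j] = child[j], child[i]
    else:
        _, inputs, output = child[i]
        child[i] = (rng.choice(sorted(catalog)), inputs, output)
    return child


def evolve(workflow, catalog, generations=200, seed=0):
    """Keep an edit whenever it does not increase the error count."""
    rng = random.Random(seed)
    best, best_err = workflow, execution_errors(workflow, catalog)
    for _ in range(generations):
        if best_err == 0:
            break
        child = mutate(best, catalog, rng)
        err = execution_errors(child, catalog)
        if err <= best_err:
            best, best_err = child, err
    return best, best_err


catalog = {"search", "summarize"}
# Two faults: a misspelled tool name and a step that runs before its dependency.
broken = [("sumarize", ["hits"], "answer"), ("search", ["query"], "hits")]
repaired, remaining_errors = evolve(broken, catalog)
```

The check here only validates tool names and data dependencies, so a repaired workflow is executable in this narrow sense, not necessarily correct for the task.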

2606.12673 2026-06-12 cs.LG cs.AI 新提交

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

基于节点重构的零样本广义图异常检测框架

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang

发表机构：School of Computing, KAIST（韩国科学技术院计算机学院）

AI总结：提出AlignGAD框架，通过全局统一模块对齐异构特征、聚类模块捕获组级异常模式、节点差异评分模块聚合多视图异常证据，实现零样本跨域图异常检测。

AI中文摘要

跨域图异常检测旨在识别未见过的目标图中的异常节点，在异构图数据的实际应用中展现出巨大潜力。然而，现有方法通常依赖于数据集特定的特征语义和结构模式，限制了其跨域泛化能力。为解决这一挑战，我们提出AlignGAD，一个零样本广义图异常检测框架。我们的框架基于三个关键组件：全局统一模块，用于对齐异构节点特征并在谱域中归一化图信号；聚类模块，用于构建聚类感知的图视图以捕获组级异常模式；以及节点差异评分模块，用于测量重构差异并聚合来自不同图视图的异常证据。在多个真实数据集上的实验证明了AlignGAD在零样本图异常检测设置下的有效性。

英文摘要

Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

2606.12671 2026-06-12 cs.CV 新提交

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

SalArt-VQA：诊断VLM是否理解生成图像中的显著伪影

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy

AI总结：提出SalArt-VQA基准，通过950张图像和3,681道多选题，从存在检测、语义定位、空间定位和基于证据的缺陷识别四方面评估VLM对生成图像伪影的理解，揭示高检测准确率下隐藏的失败模式。

Comments: 23 pages, 7 figures, 7 tables. Dataset: this https URL

AI中文摘要

视觉语言模型（VLM）越来越多地被用于检测AI生成图像是否包含可见伪影，然而人们对其分析此类伪影的能力仍然知之甚少。正确的图像级判断仍可能掩盖重要的失败：模型可能正确标记了伪影，却依赖了错误的视觉线索、选择了错误的区域，或描述了图像并不支持的缺陷。为了直接评估这些行为，我们引入了SalArt-VQA，一个用于细粒度理解AI生成图像中显著伪影的诊断基准。SalArt-VQA包含950张图像和3,681道人工编写的多项选择题，涵盖伪影图像、匹配的真实参考图像和配对的生成参考图像。四种对齐的问题类型分别评估存在检测、语义定位、空间定位和基于证据的缺陷识别，而参考子集则测试当所标注的缺陷不存在时模型的校准和弃权能力。在20个VLM上，SalArt-VQA揭示了被图像级检测准确率掩盖的失败：最强的模型在伪影图像上达到99.37%的检测召回率，但仅在53.26%的图像上正确回答了全部四个伪影侧问题。将伪影图像与无伪影参考图像进行比较，揭示了灵敏度与校准之间的权衡：敏感的模型经常做出缺乏依据的伪影判断，而保守的模型主要靠漏掉真实伪影来避免误报。这些结果表明，仅有较高的伪影检测准确率并不意味着对伪影的理解有据可依。SalArt-VQA暴露了这些隐藏的失败模式，并对VLM的伪影判断是否得到局部视觉证据支持提供了细粒度评估。

英文摘要

Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.
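The gap the abstract reports between detection recall and the rate of answering all four artifact-side questions correctly can be illustrated with a small sketch. The record layout, the question-type keys, and the toy results below are hypothetical; the abstract does not specify the benchmark's data format.

```python
from typing import Dict, List

# Hypothetical keys for the four aligned question types named in the abstract.
QUESTION_TYPES = (
    "presence_detection",
    "semantic_localization",
    "spatial_grounding",
    "defect_identification",
)

def detection_recall(results: List[Dict[str, bool]]) -> float:
    """Share of artifact images whose presence question was answered correctly."""
    return sum(r["presence_detection"] for r in results) / len(results)

def all_four_correct_rate(results: List[Dict[str, bool]]) -> float:
    """Share of artifact images where every question type was answered correctly."""
    return sum(all(r[q] for q in QUESTION_TYPES) for r in results) / len(results)

# Made-up per-image outcomes for four artifact images (illustration only).
toy_results = [
    dict(presence_detection=True, semantic_localization=True,
         spatial_grounding=True, defect_identification=True),
    dict(presence_detection=True, semantic_localization=True,
         spatial_grounding=False, defect_identification=True),
    dict(presence_detection=True, semantic_localization=False,
         spatial_grounding=True, defect_identification=False),
    dict(presence_detection=True, semantic_localization=True,
         spatial_grounding=True, defect_identification=True),
]

recall = detection_recall(toy_results)
joint = all_four_correct_rate(toy_results)
```

In this toy data every artifact is detected, yet only half of the images pass all four questions, which is the kind of divergence the benchmark is designed to surface.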

2606.12667 2026-06-12 cs.NI cs.AI eess.SY 新提交

Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites

低地球轨道卫星地面站位置的自由布局优化

Grace Ra Kim, Duncan Eddy, Vedant Srinivas, Mykel J. Kochenderfer

AI总结：提出SCORE方法，通过两阶段自由布局优化地面站位置；相比差分进化算法，函数评估次数最多减少5倍，下行吞吐量最多提升13%；相比固定站点方法，总下行量最多提升15%。

Comments: 34 pages, 13 figures, 11 tables, Journal of Aerospace Information Systems (JAIS)

AI中文摘要

快速扩展的低地球轨道卫星星座对地面网络的需求日益增加，推动了更高效地面站网络设计的发展。当前方法从预定义位置选择站点，将优化限制在现有基础设施内，从而约束了性能。相比之下，自由布局优化在地球连续空间域上运行，拓宽了搜索空间，允许更高吞吐量的配置，但代价是可能需要部署新的基础设施。在这项工作中，我们引入了SCORE（通过细化与评估的顺序循环优化），一种用于地面站设计的两阶段自由布局方法。SCORE结合了顺序坐标选择与循环细化，以应对全局优化器面临的高维度、非凸性和局部最小值挑战。我们使用Kongsberg卫星服务公司和世界电信港协会（World Teleport Association）的位置，将SCORE与差分进化（DE）等一次性方法以及整数规划方法进行了基准测试。在两个商业地球观测星座（Capella Space和ICEYE）和一个合成Walker-Star星座上的测试表明，与DE相比，SCORE收敛所需的函数评估次数最多减少5倍，同时下行吞吐量提升高达13%。与固定站点方法相比，无约束SCORE实现了高达15%的总下行量提升，为灵活布局建立了强大的经验性能基准；受基础设施约束的SCORE在将布局限制在现有光纤和电力基础设施附近的同时，保留了超过92%的增益。我们还探讨了扩建现有站点与部署新站点之间的权衡，为运营星座的未来地面网络设计提供参考。

英文摘要

Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the development of more efficient ground station network designs. Current approaches select sites from predefined locations, limiting optimization to existing infrastructure and constraining performance. In contrast, free-placement optimization operates over a continuous spatial domain on Earth, broadening the search space and allowing higher-throughput configurations at the cost of potentially requiring new infrastructure deployment. In this work, we introduce SCORE (Sequential Cyclic Optimization via Refinement & Evaluation), a two-stage free-placement method for ground station design. SCORE combines sequential coordinate selection with cyclic refinement to manage high-dimensionality, non-convexity, and local minima that challenge global optimizers. We benchmark SCORE against one-shot methods such as differential evolution (DE) and integer programming approaches using locations from Kongsberg Satellite Services and the World Teleport Association. Tests across two commercial Earth observation constellations (Capella Space and ICEYE) and one synthetic Walker-Star constellation show that SCORE requires up to 5x fewer function evaluations to converge relative to DE while improving downlink throughput by up to 13%. Compared to fixed-site methods, unconstrained SCORE achieves up to 15% greater total downlink, establishing a strong empirical performance benchmark for flexible placement; infrastructure-constrained SCORE retains over 92% of this gain while restricting placement to within proximity of existing fiber and power infrastructure. We also explore trade-offs between expanding existing stations and deploying new sites, informing future ground network design for operational constellations.
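The phrase "sequential coordinate selection with cyclic refinement" can be read as a two-stage loop: place stations one at a time, then revisit each station while holding the others fixed. The sketch below illustrates that reading on a toy problem and is not the authors' algorithm. The one-dimensional coverage objective, the candidate grid standing in for a continuous domain, and all function names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

def coverage(stations: Sequence[float], targets: Sequence[float], radius: float) -> int:
    """Toy objective: number of targets within `radius` of at least one station."""
    return sum(any(abs(t - s) <= radius for s in stations) for t in targets)

def sequential_then_cyclic(
    candidates: Sequence[float],
    k: int,
    objective: Callable[[List[float]], float],
    max_cycles: int = 10,
) -> List[float]:
    # Stage 1: add one station at a time, each chosen to maximise the objective so far.
    stations: List[float] = []
    for _ in range(k):
        stations.append(max(candidates, key=lambda c: objective(stations + [c])))

    # Stage 2: cycle over stations, re-optimising each with the others held fixed.
    for _ in range(max_cycles):
        improved = False
        for i in range(k):
            others = stations[:i] + stations[i + 1:]
            best = max(candidates, key=lambda c: objective(others + [c]))
            if objective(others + [best]) > objective(stations):
                stations[i] = best
                improved = True
        if not improved:  # stop once a full cycle brings no strict improvement
            break
    return stations

# Made-up targets on a line and a fine grid approximating free placement.
targets = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
grid = [i * 0.5 for i in range(21)]
placed = sequential_then_cyclic(grid, k=2, objective=lambda s: coverage(s, targets, 1.0))
covered = coverage(placed, targets, 1.0)
```

Requiring a strict improvement before moving a station guarantees that the refinement stage terminates; a real ground-station objective would replace `coverage` with a downlink simulation.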

2606.12666 2026-06-12 cs.CR cs.AI 新提交

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

CAPED：面向移动GUI代理的上下文感知隐私暴露防御

Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

AI总结：针对移动GUI代理截图上传导致的附带视觉隐私暴露问题，提出上下文感知的预上传暴露控制层CAPED，通过任务需求提取、屏幕上下文隐私先验和UI元素解析，选择性暴露任务所需内容，在保持高任务效用的同时显著降低隐私泄露。

AI中文摘要

基于截图的移动GUI代理能够像人类用户一样通过相同的视觉界面操作普通智能手机应用，但这种能力也将每一次屏幕观察变成了隐私边界。在正常任务执行过程中，截图可能暴露联系人、消息、照片、文件、推荐、健康提示等与用户请求无关的敏感上下文。我们称这个问题为附带视觉隐私暴露。现有防御难以解决：文本匿名化遗漏了许多视觉和推理线索，而通用隐私遮蔽可能移除GUI代理完成任务所需的证据和控制。本文提出CAPED，一种面向移动GUI代理的上下文感知预上传暴露控制层。CAPED被设计为手机端保护层：在截图被释放到远程多模态代理之前，它提取任务需求，利用屏幕上下文作为隐私先验，解析可见UI元素，并仅选择性暴露当前任务所需的内容，同时遮蔽附带隐私内容。我们在AndroidWorld上评估CAPED的广泛任务效用，并使用受控的28任务种子隐私评估作为轨迹级附带泄漏的测量工具。在该种子评估中，完整CAPED将成功条件下的加权种子泄漏从原始截图的0.766降低到0.268，同时保持高任务效用。更广泛的AndroidWorld运行显示了剩余的原型级效用成本，但结果支持核心主张：截图上传应被视为明确的设备-云边界决策，由任务驱动的选择性暴露而非全有或全无的屏幕共享来管理。

英文摘要

Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results support the central claim that screenshot upload should be treated as an explicit device-cloud boundary decision, governed by task-driven selective exposure rather than all-or-nothing screen sharing.

2606.12662 2026-06-12 cs.SD cs.AI cs.LG 新提交

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

BASENet：基于频带自适应的跨频带注意力语音增强网络

Damien Martins Gomes, François Capman

发表机构：Thales SIX GTS, FRANCE（泰雷兹SIX GTS公司，法国）

AI总结：提出BASENet，按Bark尺度划分频带并为各频带分配容量自适应的编码器，结合跨频带注意力模块，在PESQ > 3.50的方法中以最少的参数实现高PESQ和STOI，适用于资源受限设备。

AI中文摘要

语音增强模型通常对所有频率采用统一容量，忽略了人类听觉的非均匀频谱分辨率。我们提出BASENet，一种频率自适应架构，将频谱划分为Bark尺度频带，并为每个频带分配一个容量按临界频带密度缩放的编码器，从而自动为感知上更密集的低频分配更深的分支，为高频分配更轻的分支。跨频带注意力模块通过紧凑的频率池化表示，以线性复杂度捕获跨频带的谐波依赖关系。BASENet基于具有密集连接的倒残差块和卷积循环网络构建，在VoiceBank+DEMAND上仅用0.83M参数和7.3 GMACs就达到3.55 PESQ和96%的STOI，是所有PESQ > 3.50的方法中参数最少的。因果变体（3.44 PESQ）超过了若干非因果基线，证实了其适用于资源受限设备上的实时流式处理。

英文摘要

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and 96% STOI on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
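To make the Bark-scale partition concrete, the sketch below splits the range from 0 Hz to a chosen upper frequency into bands of equal width on the Bark axis. It assumes Traunmüller's approximation of the Bark scale and an arbitrary band count; the abstract does not say which Bark formula or how many bands BASENet uses, so both are illustrative choices.

```python
from typing import List

def hz_to_bark(f_hz: float) -> float:
    """Traunmüller's approximation of the Bark scale (assumed here, not taken from the paper)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_band_edges(f_max_hz: float, n_bands: int, f_min_hz: float = 0.0) -> List[float]:
    """Band edges in Hz for bands that are equally wide on the Bark axis."""
    z_lo, z_hi = hz_to_bark(f_min_hz), hz_to_bark(f_max_hz)
    step = (z_hi - z_lo) / n_bands
    return [bark_to_hz(z_lo + i * step) for i in range(n_bands + 1)]

# Eight bands up to 8 kHz (the Nyquist frequency for 16 kHz audio); hypothetical settings.
edges = bark_band_edges(8000.0, 8)
widths_hz = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Because the mapping is compressive, the low-frequency bands come out narrow in Hz and the high-frequency bands wide, which is the non-uniform resolution that motivates giving low bands more encoder capacity.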

2606.12658 2026-06-12 cs.LG q-bio.QM stat.ML 新提交

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

基于物理信息的神经网络用于化疗药代动力学：基准测试临床估计器并揭示参数可辨识性

Riya Bisht, Dhruv Agarwal

AI总结：本研究将物理信息神经网络（PINN）应用于化疗药代动力学，在线性双室模型上与临床标准方法表现相当，在Michaelis-Menten扩展模型中揭示参数不可辨识性，并通过稀疏组织观测部分恢复可辨识性。

AI中文摘要

物理信息神经网络（PINN）是生物学中部分观测问题的一个有吸引力的工具，其中控制动力学已知但某些隔室无法测量。化疗药代动力学（PK）是一个清晰的实例：血浆中的药物浓度常规测量，但组织中的浓度——决定肿瘤杀伤和脱靶毒性——无法测量。我们在两个PK问题上将PINN与标准临床基线（非线性最小二乘解析双指数血浆解，以下简称NLS）和物理无关的神经基线（仅数据的MLP）进行基准测试。在线性双室问题上，NLS接近最优；PINN在匹配其性能（小常数因子内）的同时，在单次训练过程中产生组织曲线，而仅数据的MLP在组织上失败约10倍。在Michaelis-Menten扩展（可饱和消除）上，双指数闭式不再存在，因此NLS被错误指定并静默返回无意义的速率常数。PINN反而揭示了一个更深层的事实：Michaelis-Menten双室模型仅从血浆数据不可辨识，PINN通过收敛到k12 -> 0的盆地诚实地报告这一点。添加两个稀疏组织观测在很大程度上解决了可辨识性：在五个随机种子上，PINN恢复k21在真实值的1%以内，Vmax和Km在一个标准差范围内，而k12向正确方向移动（0.02 -> 0.82）但仍低于真实值约2个标准差——这是闭式NLS估计器根本无法尝试的恢复，因为其双指数假设仅描述血浆。我们的主张不是PINN击败NLS。而是PINN提供了一种统一的方案，该方案在教科书问题上与教科书估计器匹配，揭示了教科书估计器隐藏的结构可辨识性，并在单一损失中吸收异构测量。

英文摘要

Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
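The two governing models the abstract contrasts can be written down directly. The sketch below integrates a two-compartment system after a bolus dose, checks the linear case against the textbook biexponential plasma solution, and then swaps in Michaelis-Menten elimination, for which no such closed form is available. It is not the authors' PINN: the state is treated as concentrations with equal compartment volumes, and every parameter value is made up for illustration.

```python
import math
from typing import Callable, Tuple

def simulate(dose: float, t_end: float, n_steps: int, k12: float, k21: float,
             elimination: Callable[[float], float]) -> Tuple[float, float]:
    """Classical RK4 integration of plasma (c1) and tissue (c2) after a bolus dose."""
    def rhs(c1: float, c2: float) -> Tuple[float, float]:
        return (-elimination(c1) - k12 * c1 + k21 * c2, k12 * c1 - k21 * c2)

    c1, c2, h = dose, 0.0, t_end / n_steps
    for _ in range(n_steps):
        a1, a2 = rhs(c1, c2)
        b1, b2 = rhs(c1 + 0.5 * h * a1, c2 + 0.5 * h * a2)
        d1, d2 = rhs(c1 + 0.5 * h * b1, c2 + 0.5 * h * b2)
        e1, e2 = rhs(c1 + h * d1, c2 + h * d2)
        c1 += h * (a1 + 2 * b1 + 2 * d1 + e1) / 6
        c2 += h * (a2 + 2 * b2 + 2 * d2 + e2) / 6
    return c1, c2

def biexponential_plasma(dose: float, t: float, k10: float, k12: float, k21: float) -> float:
    """Closed-form plasma curve of the linear two-compartment model."""
    s = k10 + k12 + k21
    disc = math.sqrt(s * s - 4.0 * k10 * k21)
    alpha, beta = (s + disc) / 2.0, (s - disc) / 2.0
    return dose * ((alpha - k21) * math.exp(-alpha * t)
                   + (k21 - beta) * math.exp(-beta * t)) / (alpha - beta)

# Hypothetical rate constants and dose.
K10, K12, K21, DOSE = 0.3, 0.5, 0.2, 100.0
lin_plasma, lin_tissue = simulate(DOSE, 5.0, 2000, K12, K21, lambda c: K10 * c)
closed_form = biexponential_plasma(DOSE, 5.0, K10, K12, K21)

# Saturable (Michaelis-Menten) elimination with made-up Vmax and Km.
VMAX, KM = 30.0, 50.0
mm_plasma, mm_tissue = simulate(DOSE, 5.0, 2000, K12, K21, lambda c: VMAX * c / (KM + c))
```

Note that the simulator returns the tissue concentration alongside the plasma one, whereas the biexponential formula describes plasma only; that asymmetry is the reason the abstract gives for why the NLS baseline cannot use tissue observations.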

2606.12657 2026-06-12 cs.AI cs.DB cs.RO 新提交

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

TrajGenAgent：一种用于人类移动轨迹生成的分层LLM智能体

Siyu Li, Toan Tran, Lingyi Zhao, Khurram Shafique, Li Xiong

发表机构：Emory University（埃默里大学）；University of Florida（佛罗里达大学）

AI总结：提出TrajGenAgent，一种无需微调的分层LLM智能体框架，通过编排器-工作者两阶段设计生成真实轨迹，在时空保真度、语义一致性和个体行为真实性上优于现有方法。

Comments: 14 pages, 2 figures, 8 tables. Accepted by the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

AI中文摘要

人类移动数据对于交通、城市规划和流行病控制至关重要，但大规模轨迹收集通常成本高昂且受隐私限制，这推动了逼真的合成轨迹生成。现有的基于LLM的生成器通常依赖于提示工程（保留了零样本推理但缺乏细粒度的时空基础）或轨迹级微调（提高了统计精度但产生了大量计算成本并可能削弱一般推理）。我们提出了TrajGenAgent，一种语义感知的分层LLM智能体框架，用于无需模型微调的人类移动轨迹生成。TrajGenAgent采用两阶段编排器-工作者设计：LLM首先通过上下文学习从历史证据中合成个体和星期条件化的活动链，然后确定性工作流通过个性化POI检索、距离感知位置选择、运动学感知旅行时间传播和基于LLM的持续时间估计将每个活动落地为完整的访问。为了评估超越聚合时空统计的真实性，我们引入了一个基于异常检测的评估框架，使用两个互补检测器来评估行为和语义合理性。在基准和大规模模拟数据集上的实验表明，与代表性的神经网络和基于LLM的基线相比，TrajGenAgent在时空保真度、语义一致性和个体特定行为真实性方面有所改进，同时避免了参数更新。

英文摘要

Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often costly and privacy-constrained, motivating realistic synthetic trajectory generation. Existing LLM-based generators typically rely on either prompt engineering, which preserves zero-shot reasoning but lacks fine-grained spatiotemporal grounding, or trajectory-level fine-tuning, which improves statistical precision but incurs substantial computational cost and may weaken general reasoning. We propose TrajGenAgent, a semantic-aware hierarchical LLM-agent framework for human mobility trajectory generation without model fine-tuning. TrajGenAgent uses a two-stage orchestrator-worker design: an LLM first synthesizes an individual- and weekday-conditioned activity chain from historical evidence via in-context learning, and a deterministic workflow then grounds each activity into a complete visit using personalized POI retrieval, distance-aware location selection, kinematics-aware travel-time propagation, and LLM-based duration estimation. To evaluate realism beyond aggregate spatiotemporal statistics, we introduce an anomaly-detection-based evaluation framework using two complementary detectors to assess behavioral and semantic plausibility. Experiments on benchmark and large-scale simulation datasets show that TrajGenAgent improves spatiotemporal fidelity, semantic coherence, and individual-specific behavioral realism over representative neural and LLM-based baselines, while avoiding parameter updates.

2606.12655 2026-06-12 cs.CR cs.CV 新提交

Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Amnesia：一种针对持续学习梦境的隐蔽重放攻击

Ahmed Sharshar, Naveen Kumar Kummari, Mohsen Guizani

AI总结：提出Amnesia攻击，通过仅控制重放索引选择，在审计约束下最大化持续学习模型性能下降，揭示了索引级重放控制的威胁。

AI中文摘要

持续学习（CL）模型常使用经验重放来减少灾难性遗忘，但其对重放采样干扰的鲁棒性尚未充分探索。现有的CL攻击会改变输入或训练流程（投毒/后门），且很少包含明确的审计约束，限制了真实性。这里，审计性意味着监控者可以通过检查采样器可见的遥测数据（例如，记录的重放索引/标签统计）来验证合规性，即检查实现的重放类别直方图是否接近名义基线，以及重放率在每个批次和/或滚动窗口内是否不变。我们研究了一个权限受限的内部人员，其仅控制重放索引选择，而不控制像素、标签或模型参数，同时保持在审计限制内（如队列优先级）。我们提出了Amnesia，一种重放组合攻击，在两种预算下最大化性能下降：可见性预算δ，限制与名义类别直方图p0的TV/KL散度；以及质量预算f，固定重放率。Amnesia有两个步骤：（i）计算轻量级类别效用（如EMA损失或置信度），将p0向有害类别倾斜；（ii）使用高效的KL（指数倾斜）或TV（平衡质量重分配）优化器将倾斜投影回δ-球内。窗口调度器强制执行滚动审计。在具有挑战性的CL基准测试和强重放基线中，Amnesia持续降低最终准确率（ACC）并恶化反向迁移（-BWT）。KL变体在多种审计方案（包括每批次和滚动窗口检查）下实现高影响且基本未被检测到。TV变体更具破坏性但更易检测，尤其是在严格的每类别约束下。这些结果揭示了仅索引重放控制是CL系统中一个实用且可审计的威胁面，并建立了原则性的影响-可见性权衡。

英文摘要

Continual learning (CL) models often use experience replay to reduce catastrophic forgetting, but their robustness to replay sampling interference remains underexplored. Existing CL attacks alter inputs or training pipelines (poisoning/backdoors) and rarely include explicit auditable constraints, limiting realism. Here, auditability means a monitor can verify compliance from sampler-visible telemetry - e.g., logged replay index/label statistics - by checking that the realized replay class histogram stays close to a nominal baseline and that replay rate is unchanged per batch and/or over a rolling window. We study a limited-privilege insider who controls only replay index selection, not pixels, labels, or model parameters, while staying within auditable limits such as queue priorities. We introduce Amnesia, a replay composition attack that maximizes degradation under two budgets: a visibility budget delta bounding the TV/KL divergence from a nominal class histogram p0, and a mass budget f fixing the replay rate. Amnesia has two steps: (i) compute lightweight class utilities, such as EMA loss or confidence, to tilt p0 toward harmful classes; and (ii) project the tilt back into the delta-ball using efficient KL (exponential tilt) or TV (balanced mass redistribution) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks and strong replay baselines, Amnesia consistently lowers final accuracy (ACC) and worsens backward transfer (-BWT). The KL variant delivers high impact while remaining largely undetected under multiple audit schemes, including per-batch and rolling-window checks. The TV variant is more damaging but easier to detect, especially under tight per-class constraints. These results expose index-only replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility trade-off.
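The KL variant described above, an exponential tilt of the nominal class histogram p0 that stays inside a divergence budget delta, can be sketched with a one-dimensional search. The tilted distribution is proportional to p0 times exp(lambda times utility), and its KL divergence from p0 grows with lambda, so bisection finds the strongest tilt within budget. This illustrates the general technique only and is not the authors' optimizer; the utilities, the budget, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

def tilt(p0: Sequence[float], utilities: Sequence[float], lam: float) -> List[float]:
    """Exponentially tilted distribution: p(c) proportional to p0(c) * exp(lam * u(c))."""
    u_max = max(utilities)  # subtract the maximum for numerical stability
    weights = [p * math.exp(lam * (u - u_max)) for p, u in zip(p0, utilities)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p(c) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def strongest_tilt_within_budget(p0: Sequence[float], utilities: Sequence[float],
                                 delta: float, lam_hi: float = 50.0,
                                 iters: int = 60) -> List[float]:
    """Bisect on lam to find the largest tilt with KL(p || p0) <= delta."""
    if kl_divergence(tilt(p0, utilities, lam_hi), p0) <= delta:
        return tilt(p0, utilities, lam_hi)
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_divergence(tilt(p0, utilities, mid), p0) <= delta:
            lo = mid
        else:
            hi = mid
    return tilt(p0, utilities, lo)

# Made-up nominal histogram, per-class utilities (e.g. an EMA loss), and budget.
nominal = [0.25, 0.25, 0.25, 0.25]
class_utility = [0.1, 0.2, 0.9, 0.4]
budget = 0.05
replay_mix = strongest_tilt_within_budget(nominal, class_utility, budget)
spent = kl_divergence(replay_mix, nominal)
```

A monitor that only checks the KL distance of the realized histogram from `nominal` against `budget` would accept `replay_mix`, even though it oversamples the class with the highest utility.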

2606.12651 2026-06-12 cs.LG q-bio.QM 新提交

Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

物理感知辅助损失提升图神经网络可合成性过滤器的分布外泛化能力

Riya Bisht, Dhruv Agarwal

AI总结：通过在GNN上添加基于Bertz指数的拓扑复杂度回归和MMFF94力场应变能软惩罚作为辅助损失，在分布外数据上小幅但显著提升了可合成性过滤器的AUC（最高+0.0066）。

AI中文摘要

机器学习药物发现流程越来越依赖生成模型，这些模型提出的分子远离用于训练下游可合成性过滤器的数据。现有过滤器（SAScore、SCScore、RAscore、DeepSA）纯粹基于统计，在分布外（OOD）场景下性能下降。我们探究廉价的闭式物理先验，作为图神经网络（GNN）的辅助监督，是否能改善OOD泛化。我们在GINE骨干网络上添加两个辅助损失：基于Bertz指数的拓扑复杂度回归，以及基于MMFF94力场能量的应变能软惩罚。在由SAScore阈值标注的65,177个分子语料库（HIV、Tox21、COCONUT）上，我们复现了强分布内基线，然后在单源OOD划分（在类药HIV+Tox21上训练，在COCONUT天然产物上测试）上评估4路消融实验（基线/+复杂度/+应变/+两者），重复5个种子并采用配对bootstrap置信区间。所有三个物理感知变体相比基线（平均OOD AUC 0.9774）均带来微小但统计显著的OOD提升：+复杂度Delta = +0.0060（95% CI [+0.0023, +0.0102]），+应变Delta = +0.0032（[+0.0008, +0.0052]），+两者Delta = +0.0066（[+0.0038, +0.0093]）；每个区间均不包含零，且组合效果最佳。各变体在分布内表现无差异，因此效果仅在OOD评估下可见。我们明确指出效果是适度的，并报告一个警示性方法学发现：该实验的单种子版本产生了定性不同（非单调）的故事，未能在多种子评估中复现。

英文摘要

Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.
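A paired bootstrap interval of the kind the abstract reports can be sketched as follows: resample the paired differences with replacement, recompute the mean each time, and read off percentiles. The abstract does not say whether the resampling unit is the seed or the test molecule, so the pairing below is an assumption, and the scores are made-up numbers rather than the paper's results.

```python
import random
from typing import List, Sequence, Tuple

def paired_bootstrap_ci(baseline: Sequence[float], variant: Sequence[float],
                        n_boot: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean paired difference with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs: List[float] = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    boot_means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2.0) * n_boot)]
    upper = boot_means[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return sum(diffs) / n, lower, upper

# Hypothetical paired AUC scores for a baseline and one variant (illustration only).
baseline_auc = [0.970, 0.975, 0.980, 0.978, 0.982]
variant_auc = [0.974, 0.981, 0.985, 0.985, 0.990]
mean_delta, ci_low, ci_high = paired_bootstrap_ci(baseline_auc, variant_auc)
```

Pairing matters because it removes variation shared by both models, such as a seed that is hard for every variant; an interval that excludes zero is then evidence of a consistent difference rather than of noise across runs.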

2606.12650 2026-06-12 cs.PL cs.PF 新提交

nomp: A Framework for Building Domain Specific Compilers

nomp：构建领域特定编译器的框架

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana, Hirumal Priyashan, Poorna Gunathilaka, Nagitha Abeywickrema, Ravindu Hirimuthugoda, Tarun Prabhu, Kirshanthan Sundararajah, Sanath Jayasena

AI总结：提出nomp框架，通过基于pragma的编程模型和运行时，利用领域特定优化模式在保持性能与可移植性的同时提高程序员生产力。

AI中文摘要

低层GPU编程模型（CUDA、HIP、OpenCL等）提供对程序数据流和执行计划的精细控制，以提取接近硬件的性能。然而，由于其语法和语义的复杂性，学习曲线陡峭，降低了程序员的生产力。另一方面，高层模型（OpenMP、OpenACC等）作为低层模型的抽象，旨在提高程序员生产力，但实现与低层模型相当的性能是一个挑战。这两种方法在生产效率、可移植性和性能之间存在固有的权衡，没有一种通用解决方案能同时实现三者。然而，我们相信通过重用特定领域的优化模式，可以在不牺牲性能和可移植性的前提下提高程序员生产力。为此，我们提出nomp：一个用于构建领域特定编译器的框架。nomp包含一个基于pragma的编程模型和一个能够根据用户提供的元数据进行代码转换和生成的运行时。

英文摘要

The low-level GPU programming models (CUDA, HIP, OpenCL, etc.) provide detailed control of the data flow and execution plan of a program in order to extract close-to-metal performance. However, these have a steep learning curve due to the intricacies of their syntax and semantics. This reduces programmer productivity. On the other hand, high-level models (OpenMP, OpenACC, etc.) that serve as abstractions over the low-level models are aimed at improving programmer productivity but achieving performance on-par with the low-level models is a challenge. There are inherent trade-offs between productivity, portability and performance in both approaches and there is no one-size-fits-all solution which achieves all three simultaneously. However, we believe there is room to improve programmer productivity without sacrificing performance and portability by reusing optimization patterns specific to a given domain. To this end, we propose nomp: a framework for building domain specific compilers. nomp consists of a pragma based programming model and a runtime capable of code transformation and generation based on user provided metadata.

2606.12649 2026-06-12 cs.CL 新提交

MentalMARBERT: Domain-Adaptive Pre-training and Two-Stage Fine-Tuning for Arabic Mental Health Disorders Detection

MentalMARBERT：面向阿拉伯语心理健康障碍检测的领域自适应预训练与两阶段微调

Fatimah Almalki, Areej Alhothali, Lulwah Alharigy, Abdulrahman Aladeem

发表机构：King Abdulaziz University（阿卜杜勒阿齐兹国王大学）

AI总结：针对阿拉伯语社交媒体文本中心理健康障碍检测的方言差异、非正式语言、标注资源有限和类别不平衡问题，提出领域自适应预训练与两阶段微调框架，构建含5万条推文的数据集，MentalMARBERT在宏F1和准确率上分别达到0.861和0.877。

Comments: 17 pages, 5 figures, 13 tables

AI中文摘要

从阿拉伯语社交媒体文本中检测心理健康障碍仍然具有挑战性，原因包括方言差异、非正式语言、高质量标注资源有限以及严重的类别不平衡。虽然英语心理健康自然语言处理（NLP）已取得显著进展，但阿拉伯语多类别障碍分类的研究仍不充分。本研究提出一个两阶段框架用于阿拉伯语心理健康文本分类。在第一阶段，三个阿拉伯语预训练语言模型AraBERT、CAMeLBERT和MARBERT，使用大规模未标注阿拉伯语心理健康推文语料库进行领域自适应和任务自适应预训练（DAPT和TAPT）。在统一协议下评估自适应模型，以确定最有效的骨干模型。在第二阶段，选定的模型在四种配置下进行评估，这些配置结合了单阶段和分层两阶段分类架构，并采用全微调和低秩适应（LoRA）。为支持本研究，我们构建了一个新的标注阿拉伯语心理健康数据集，包含50,670条推文，涵盖六个类别，具有强标注者间一致性（Krippendorff's Alpha = 0.733，平均成对一致性 = 0.797）。实验结果表明，领域自适应的MARBERT（MentalMARBERT）在准确率和宏F1上均比基线模型有统计显著的提升。结合全微调的分层两阶段架构取得了最佳整体性能，宏F1达到0.861，准确率达到0.877。这些发现证明了领域特定自适应预训练和分层分类在阿拉伯语心理健康障碍检测中的有效性。

英文摘要

Detecting mental health disorders from Arabic social media text remains challenging due to dialectal variation, informal language, limited high-quality annotated resources, and severe class imbalance. While English mental health natural language processing (NLP) has progressed substantially, Arabic multi-class disorder classification remains insufficiently studied. This study proposes a two-phase framework for Arabic mental health text classification. In phase 1, three Arabic pre-trained language models, AraBERT, CAMeLBERT, and MARBERT, undergo Domain-Adaptive and Task-Adaptive Pretraining (DAPT and TAPT) using a large-scale corpus of unlabeled Arabic mental health tweets. The adapted models are evaluated under a unified protocol to identify the most effective backbone model. In phase 2, the selected model is assessed across four configurations combining single-stage and hierarchical two-stage classification architectures with full fine-tuning and Low-Rank Adaptation (LoRA). To support this study, we constructed a novel annotated Arabic mental health dataset comprising 50,670 tweets across six categories, with strong inter annotator agreement (Krippendorff's Alpha = 0.733, average pairwise agreement = 0.797). Experimental results show that the domain-adapted MARBERT (MentalMARBERT) achieves statistically significant improvements over baseline models in both accuracy and macro-F1. The hierarchical two-stage architecture combined with full fine-tuning achieves the best overall performance, reaching a macro-F1 of 0.861 and an accuracy of 0.877. These findings demonstrate the effectiveness of domain-specific adaptive pretraining and hierarchical classification for Arabic mental health disorder detection.
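The abstract reports both accuracy and macro-F1, and the two diverge under the class imbalance it mentions. The sketch below computes both from scratch on made-up labels; the three-class toy data is hypothetical and unrelated to the paper's six-category dataset.

```python
from typing import List, Sequence

def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 for one class; defined as 0.0 when the class has no true positives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels: List[int] = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy labels: class 0 dominates and class 2 is never predicted.
gold = [0, 0, 0, 0, 1, 1, 2]
pred = [0, 0, 0, 0, 1, 0, 0]
toy_accuracy = accuracy(gold, pred)
toy_macro_f1 = macro_f1(gold, pred)
```

Because macro-F1 weights every class equally, a classifier that ignores a rare class is penalised even when its accuracy looks acceptable, which is why the metric suits imbalanced disorder categories.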

2606.12648 2026-06-12 cs.HC 新提交

OpenRoundup: Multi-Table Data Wrangling Through Interactive Visualization

OpenRoundup：通过交互式可视化进行多表数据整理

Stephen Kasica, Charles Berret, Tamara Munzner

AI总结：提出OpenRoundup系统，通过交互式可视化支持数据记者无代码整合多张表格，采用模式优先、按需取值范式，并引入急切表合并与声明式词汇（Stack和Pack），复制研究证明其表达能力，部署研究确认对非编程从业者的实用性。

Comments: 18 pages

AI中文摘要

数据记者经常需要整合多个独立发布来源的记录以支持问责报道，但现有的交互式整理工具都以单张表、而非表的集合作为主要工作单元。我们提出OpenRoundup，一个开源、基于浏览器的系统，使数据记者无需编写代码即可将多个表格合并为单一的、可直接用于分析的输出。界面包含五个协调面板，实现了模式优先、按需取值的范式，提供实时模式预览、环境式数据质量警报，以及对不断演化的操作树的递归矩形树图（treemap）可视化。基于DuckDB-WASM的纯客户端架构在浏览器中运行，提供了适合敏感新闻数据的强数据隐私保证。该系统引入了两个概念性贡献：急切表合并，即在整理阶段早期通过交互式、增量式地组装多个源表来构建复合表；以及一个由两个操作（Stack和Pack）组成的表合并声明式词汇。我们通过一项复制研究（作者仅使用界面重现了17个已发布的记者编程工作流）和一项部署研究（与四位专业数据记者合作）来评估该系统。复制研究证明了系统对现实世界合并任务的表达覆盖能力。部署研究确认了它对那些在概念上理解连接（join）但缺乏编程技能来实现连接的从业者的实用性，并揭示了它在数据新闻教育方面一个意料之外的附加价值。

英文摘要

Data journalists routinely integrate records across multiple independently published sources to support accountability reporting, yet no existing interactive wrangling tool treats the collection of tables -- rather than the single table -- as its primary unit of work. We present OpenRoundup, an open-source, browser-based system that enables data journalists to consolidate multiple tables into a single analysis-ready output without writing code. The interface comprises five coordinated panels that implement a schema-first, values-on-demand paradigm with live schema previews, ambient data quality alerts, and a recursive treemap visualization of the evolving operation tree. A client-only architecture powered by DuckDB-WASM runs in the browser, providing strong data privacy guarantees suited to sensitive journalism data. The system introduces two conceptual contributions: eager table consolidation, in which a composite table is assembled early in the wrangling phase via interactive, incremental assembly of multiple source tables; and a declarative vocabulary for table consolidation consisting of two operations, Stack and Pack. We evaluate the system through a replication study in which the authors reproduce 17 published journalist programming workflows using only the interface, and a deployment study with four professional data journalists. The replication study demonstrates expressive coverage of real-world consolidation tasks. The deployment study confirms utility for practitioners who understand joins conceptually but lack the programming skills to execute them, and surfaces an unanticipated secondary value for data journalism education.

2606.12647 2026-06-12 cs.CC cs.AI cs.LG 新提交

Token Complexity Theory for AI-Augmented Computing

AI增强计算的Token复杂度理论

Jie Wang

AI总结：提出Token复杂度作为AI增强计算中查询与响应成本的形式化度量，建立AI-Oracle图灵机框架，证明单调性、凸性、价格敏感性和任务排序的价格相对性等基本定理。

Comments: 25 pages, 1 figure

AI中文摘要

AI增强计算将自然语言查询、代码生成请求及其他开放式任务委托给一组AI模型，这些模型处理查询并生成响应。这一范式引入了一个经典时间或空间复杂度无法捕捉的资源维度：向该集群发送查询和接收响应的成本。我们引入Token复杂度，将其定义为在任务上达到指定输出质量水平所需的最小期望Token成本，并建立了一个根据概率性质强度对AI系统进行分类的体系。我们在AI-Oracle图灵机框架内发展Token复杂度，其中概率图灵机通过专用查询和响应磁带与随机Oracle交互。我们证明了基本定理，表明Token复杂度符合预期：单调性（更高质量需要更多Token）、凸性（质量改进逐渐变得更昂贵）、价格敏感性（小价格变化导致有界成本变化）以及任务排序的价格相对性（任务的Token复杂度排序可能根据查询与响应成本比率而反转）。我们证明了复杂度前沿（定义为Token、时间和空间中所有可行资源约束的集合）是非空的、向上封闭且凸的。

英文摘要

AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.
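The price-relativity claim, that the cost ordering of two tasks can reverse as the ratio of query price to response price changes, has a simple arithmetic illustration. The sketch below uses a plain linear cost with fixed token counts, which is a simplification of the paper's definition (a minimum expected cost at a given quality level); the token counts and prices are made up.

```python
from typing import Tuple

def token_cost(query_tokens: int, response_tokens: int,
               query_price: float, response_price: float) -> float:
    """Linear cost model: each query and response token is billed at its own price."""
    return query_price * query_tokens + response_price * response_tokens

# Hypothetical tasks as (query_tokens, response_tokens).
TASK_A: Tuple[int, int] = (1000, 100)   # long prompt, short answer
TASK_B: Tuple[int, int] = (100, 500)    # short prompt, long answer

def cheaper_task(query_price: float, response_price: float) -> str:
    """Name of the task with the lower cost under the given prices."""
    cost_a = token_cost(*TASK_A, query_price, response_price)
    cost_b = token_cost(*TASK_B, query_price, response_price)
    return "A" if cost_a < cost_b else "B"

equal_prices = cheaper_task(1.0, 1.0)        # A costs 1100, B costs 600
pricey_responses = cheaper_task(1.0, 4.0)    # A costs 1400, B costs 2100
```

With equal prices the prompt-heavy task is the more expensive one; once response tokens cost four times as much as query tokens, the ordering flips.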

2606.12643 2026-06-12 cs.LG 新提交

TEDD: Robust Detection of Unstable Temporal Features

TEDD：不稳定时间特征的鲁棒检测

Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo

发表机构：Feedzai

AI总结：提出TEDD方法，利用回归模型检测导致时间分布变化的特征，无需参数调优，可扩展，能检测数值和类别特征的单变量及多变量漂移。

Comments: 8 pages, 9 figures

AI中文摘要

在处理真实世界的时间序列数据时，经常会遇到特征分布随时间变化的情况。在这种不稳定的数据上直接使用机器学习模型可能导致性能迅速下降，尤其是当新分布与训练时所见差异较大时。为了解决这个问题，自动识别随时间变化的特征至关重要。检测到这些特征后，数据科学家和其他从业者能够通过应用数据变换等方式缓解问题，部署更鲁棒的模型，使其在更长时间内保持高性能。本文描述了特征不应遭受的时间变化类型，并提出了TEDD技术，用于a) 识别数据集何时可能导致不稳定的机器学习模型，以及b) 自动检测哪些特征导致了这种不鲁棒性。为此，我们利用回归模型来突出哪些特征有助于良好预测实例的时间戳。我们将我们的方法与其他方法在真实和合成数据上进行比较，测试它们在所有简单变化模式上的检测能力。我们表明，我们的方法：检测所有类型的基本变化，包括数值和类别特征；能够检测多变量漂移；返回一个可比较的值来衡量每个特征的变化量；无需参数调优；并且在数据集的特征数量和实例数量上都具有可扩展性。

English abstract

When working with real-world temporal data, it is common to encounter features whose distribution changes over time. Naively applying Machine Learning models to such unstable data can lead to rapidly degrading performance, especially if the new distribution is very different from what was seen during training. To cope with this problem, it is critical to automatically identify features that change over time. With these features detected, data scientists and other practitioners can mitigate the issue (for instance, by applying data transformations) and deploy more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause this lack of robustness. To achieve this, we leverage a regression model to highlight which features contribute to a good prediction of an instance's timestamp. We compare our approach to other methods on real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, for both numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and scales in both the number of features and the number of instances in the dataset.
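The core idea in the abstract, scoring features by how well they help predict an instance's timestamp, can be illustrated with a deliberately simplified sketch. This is not the authors' implementation: it assumes a one-variable least-squares fit per feature and uses R² as the score, so it only picks up linear shifts in the mean and cannot see multivariate drift. The function names and the synthetic data are hypothetical.

```python
import random

def r_squared(feature, timestamps):
    """R^2 of a one-variable least-squares fit predicting the timestamp from one feature."""
    n = len(feature)
    mean_x = sum(feature) / n
    mean_t = sum(timestamps) / n
    sxx = sum((x - mean_x) ** 2 for x in feature)
    stt = sum((t - mean_t) ** 2 for t in timestamps)
    if sxx == 0 or stt == 0:
        return 0.0  # a constant column carries no information about time
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(feature, timestamps))
    return (sxt * sxt) / (sxx * stt)

def timestamp_scores(features, timestamps):
    """Score each feature by how well it alone predicts the timestamp (assumed scoring rule)."""
    return {name: r_squared(column, timestamps) for name, column in features.items()}

# Synthetic data (assumption): one stationary feature, one whose mean drifts over time.
rng = random.Random(0)
timestamps = list(range(500))
features = {
    "stable": [rng.gauss(0, 1) for _ in timestamps],
    "drifting": [0.02 * t + rng.gauss(0, 1) for t in timestamps],
}
scores = timestamp_scores(features, timestamps)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

A feature that says nothing about time scores near zero, while a drifting feature scores high. A more flexible regressor would be needed to capture variance changes, seasonal patterns, or drift that only shows up in feature combinations.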


2606.12640 2026-06-12 cs.LG cs.RO eess.SY New submission

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning


Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

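Affiliations: Department of Electrical Engineering and Automation, Aalto University; School of Computing and Data Science, Xiamen University Malaysia; Department of Computer Science, University of Toronto

AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers control policies through inverse dynamics, markedly improving the safety of trajectory generation while keeping rewards competitive.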


Comments: Accepted to the 23rd IFAC World Congress, 2026


English abstract

Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.
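For readers unfamiliar with control barrier functions, the sketch below shows what it means for a trajectory to satisfy a barrier condition. It is not the paper's method: the abstract describes learned neural barrier functions embedded in a diffusion model, whereas this example assumes a hand-written barrier (distance to a circular obstacle) and the standard discrete-time condition h(x_{t+1}) >= (1 - alpha) * h(x_t). The function names, the obstacle, and the value of alpha are all hypothetical.

```python
import math

def barrier(state, obstacle=(0.0, 0.0), radius=1.0):
    """Hand-written barrier (assumption): h(x) >= 0 means the state is outside the obstacle."""
    return math.dist(state, obstacle) - radius

def cbf_violations(trajectory, alpha=0.5):
    """Return the step indices t where h(x_{t+1}) < (1 - alpha) * h(x_t)."""
    violations = []
    for t in range(len(trajectory) - 1):
        h_now = barrier(trajectory[t])
        h_next = barrier(trajectory[t + 1])
        # Small tolerance so floating-point noise is not reported as a violation.
        if h_next < (1 - alpha) * h_now - 1e-12:
            violations.append(t)
    return violations

# Two hypothetical single-agent trajectories approaching an obstacle at the origin.
cautious = [(3.0, 0.0), (2.5, 0.0), (2.2, 0.0)]
reckless = [(3.0, 0.0), (1.5, 0.0), (0.5, 0.0)]

print(cbf_violations(cautious))
print(cbf_violations(reckless))
```

The condition limits how quickly the safety margin may shrink per step, which is why the second trajectory is flagged before it actually enters the obstacle.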


2606.12639 2026-06-12 cs.LG q-bio.QM New submission

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry


Dhruv Agarwal, Riya Bisht

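AI summary: Using VCPI contest data, this study finds that model rankings for drug-response prediction invert with the evaluation metric: a simple baseline wins under the proxy metric, but under the true metric the deep models significantly beat the linear fingerprint baseline. The authors describe this as, to their knowledge, the first demonstration of the metric-calibration effect on real drug chemistry data.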



English abstract

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1,064 x 12,995 grid.
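The non-parametric retrieval stage mentioned in the abstract, a Tanimoto-weighted average over a held-out compound's nearest training compounds, can be sketched as follows. This is an illustration under stated assumptions, not the released pipeline: fingerprints are represented as plain sets of on-bit indices, the neighbourhood size `k` and the fallback to the mean training response are choices made here, and all names and toy values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def retrieve_response(query_fp, train_fps, train_responses, k=2):
    """Predict a per-gene response as the similarity-weighted mean of the k nearest compounds."""
    sims = [(tanimoto(query_fp, fp), i) for i, fp in enumerate(train_fps)]
    top = sorted(sims, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    n_genes = len(train_responses[0])
    if total == 0:
        # Assumed fallback: no similar compound, so use the mean training response.
        n = len(train_responses)
        return [sum(r[g] for r in train_responses) / n for g in range(n_genes)]
    return [
        sum(sim * train_responses[i][g] for sim, i in top) / total
        for g in range(n_genes)
    ]

# Toy data (hypothetical): three training compounds, two genes.
train_fps = [{1, 2, 3, 4}, {1, 2, 5, 6}, {7, 8, 9}]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query_fp = {1, 2, 3}

prediction = retrieve_response(query_fp, train_fps, train_responses, k=2)
print(prediction)
```

Here the query shares three of four bits with the first compound (similarity 0.75) and two of five with the second (0.4), so the prediction leans toward the first compound's response and ignores the dissimilar third one.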
