arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19229 2026-05-20 cs.AI

Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

Yan Wang, Ziyi Guo, Christopher McCarty

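AI summary: This paper examines the use of large language models in survey research, evaluating them experimentally on disaster preparedness responses. It proposes a five-stage framework covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, and introduces a co-occurrence knowledge graph constrained by Protection Motivation Theory along with seven LLM configurations.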

Abstract

Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.
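
The point about near-zero aggregate bias masking opposing subgroup errors can be illustrated with a small sketch. The subgroup names and scores below are hypothetical and are not taken from the Hurricane Milton survey; the code only shows why a subgroup-stratified audit reports more than a single pooled number.

```python
from statistics import mean

def signed_bias(imputed, observed):
    """Mean of (imputed - observed); positive means over-imputation."""
    return mean(i - o for i, o in zip(imputed, observed))

# Hypothetical preparedness scores: (subgroup, imputed, observed).
records = [
    ("renter", 4.0, 3.0),
    ("renter", 4.0, 3.0),
    ("owner", 2.0, 3.0),
    ("owner", 2.0, 3.0),
]

# Pooled bias over all respondents.
overall_bias = signed_bias([r[1] for r in records], [r[2] for r in records])

# Stratified audit: the same statistic computed per subgroup.
subgroup_bias = {}
for group in sorted({r[0] for r in records}):
    rows = [r for r in records if r[0] == group]
    subgroup_bias[group] = signed_bias([r[1] for r in rows], [r[2] for r in rows])

print("overall:", overall_bias)
print("by subgroup:", subgroup_bias)
```

In this toy data the pooled bias is zero while each subgroup is off by a full point in opposite directions, which is the pattern a stratified report is meant to surface.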

2605.19224 2026-05-20 cs.CL

Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

Aditya R. Vaidya, Richard J. Antonello, Alexander G. Huth

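AI summary: By fine-tuning language encoding models on slow fMRI, this study improves prediction of fast ECoG data, showing the value of slow data for building models of fast brain data.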

Abstract

Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure's generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that "slow" data like fMRI can be a valuable resource for building better models of "fast" brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.

2605.19223 2026-05-20 cs.CV

HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Mengqi Shi, Haopeng Zhang

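AI summary: This paper proposes HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. It addresses the inadequate evaluation of how well existing multimodal large language models summarize and reason over complex narratives, and introduces a fully granular, fully multimodal dataset architecture that serves as a rigorous, standardized testbed.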

Abstract

While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

2605.19220 2026-05-20 cs.CL cs.AI cs.LG

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei

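AI summary: This paper argues that current uncertainty quantification methods for LLMs are essentially unsupervised clustering algorithms that cannot assess a model's external correctness and therefore fail to detect confident but wrong answers. It outlines a roadmap for improving uncertainty quantification so that model confidence reliably reflects reality.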

Comments Accepted by ICML 2026 Position Paper Track

Abstract

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Current UQ methods may therefore create a deceptive sense of safety when models are deployed with uncertainty estimates. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift in UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
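
A minimal sketch of the "confident hallucination" failure mode, assuming a simplified consistency-based score: sampled answers are clustered by exact string match (a stand-in for the semantic clustering that real methods use) and uncertainty is the entropy over cluster sizes. The function name and the sample answers are hypothetical.

```python
from collections import Counter
from math import log

def consistency_entropy(samples):
    """Shannon entropy (nats) over clusters of identical sampled answers."""
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return sum(-(c / total) * log(c / total) for c in counts.values())

# Question: "What is the capital of Australia?" (correct answer: Canberra)
stable_but_wrong = ["Sydney"] * 5
unstable = ["Canberra", "Sydney", "Canberra", "Melbourne", "Sydney"]

print(consistency_entropy(stable_but_wrong))  # lowest possible uncertainty
print(consistency_entropy(unstable))          # higher uncertainty
```

The score never consults the correct answer, so five identical wrong samples receive the lowest possible uncertainty. That is the sense in which such a score measures internal consistency rather than external correctness.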

2605.19219 2026-05-20 cs.AI

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan, Keat Yang Koay, Yuanzheng Zhu, Meysam Feghhi, Ronie Uliana, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Zhong Wu, Lingyun Wang

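AI summary: This paper proposes SimGym, a framework that simulates e-commerce A/B tests with traffic-grounded VLM agents to address the long cycles and high risk of live tests. Validation shows that it predicts shifts in user behavior quickly, reaching 77% directional alignment with observed add-to-cart shifts.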

Abstract

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.
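
The abstract does not define directional alignment precisely. One plausible reading, sketched here as an assumption, is the fraction of A/B tests in which the simulated shift and the observed shift have the same sign. The shift values are made up for illustration.

```python
def directional_alignment(simulated_shifts, observed_shifts):
    """Fraction of tests where simulated and observed shifts share a sign."""
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(zip(simulated_shifts, observed_shifts))
    # Assumption: two exactly-zero shifts count as agreement.
    agree = sum(1 for s, o in pairs if sign(s) == sign(o))
    return agree / len(pairs)

# Hypothetical add-to-cart rate shifts (treatment minus control) for four tests.
simulated = [0.020, -0.010, 0.005, 0.030]
observed = [0.015, -0.004, -0.002, 0.010]

print(directional_alignment(simulated, observed))
```

A sign-based metric ignores effect size, so it is best read alongside a magnitude comparison when one is available.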

2605.19218 2026-05-20 cs.CV cs.AI

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

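AI summary: This paper proposes rotation-aligned Key channel pruning, which compresses the channel dimension so that more visual tokens are retained under a fixed KV cache budget. It addresses the degradation that conventional token pruning causes on fine-grained perception tasks while also improving decoding efficiency.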

Abstract

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
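
The intuition behind rotating before pruning can be shown with a two-channel toy. This is not RotateK itself: a fixed 45-degree rotation stands in for the PCA-derived orthogonal matrix, and all vectors are hypothetical. Rotating the query and the keys by the same orthogonal matrix leaves attention scores unchanged, and once the keys' energy is aligned with one channel, a single shared mask can drop the other at little cost.

```python
from math import cos, sin, pi

def rotate(vec, theta):
    """Apply a 2-D rotation; stands in for a PCA-derived orthogonal matrix."""
    x, y = vec
    return (cos(theta) * x + sin(theta) * y, -sin(theta) * x + cos(theta) * y)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

keys = [(1.0, 1.1), (2.0, 1.9), (-1.0, -0.9)]  # energy lies near the diagonal
query = (0.5, 0.7)
theta = pi / 4  # hypothetical angle aligning the diagonal with channel 0

rot_keys = [rotate(k, theta) for k in keys]
rot_query = rotate(query, theta)

full_scores = [dot(query, k) for k in keys]
rotated_scores = [dot(rot_query, k) for k in rot_keys]

# Structured pruning after rotation: every key keeps only channel 0.
pruned_scores = [rot_query[0] * k[0] for k in rot_keys]
# Same mask without rotation, for comparison.
naive_scores = [query[0] * k[0] for k in keys]

rotated_err = max(abs(f - p) for f, p in zip(full_scores, pruned_scores))
naive_err = max(abs(f - n) for f, n in zip(full_scores, naive_scores))
print(rotated_err, naive_err)
```

With these numbers the rotated, pruned scores stay close to the full scores, while applying the same mask without rotation discards more than half of each score.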

2605.19215 2026-05-20 cs.AI

Not all uncertainty is alike: volatility, stochasticity, and exploration

Payam Piray

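AI summary: This paper studies how volatility and stochasticity differ in their effects on exploration during adaptive decision-making in biological and artificial intelligence, and proposes CAUSE, an exploration bonus that improves exploration efficiency.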

Abstract

Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces reversed rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.
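
The learning side of this asymmetry can be seen in a standard scalar Kalman filter for a random-walk state. The sketch below is not the paper's Gittins analysis or the CAUSE bonus; it only shows, under assumed variances, that volatility and stochasticity both raise posterior variance while pushing the learning rate (Kalman gain) in opposite directions.

```python
def kalman_step(prior_var, volatility, stochasticity):
    """One Kalman update for a random-walk state; returns (gain, posterior_var)."""
    predicted_var = prior_var + volatility  # the latent state drifts before observation
    gain = predicted_var / (predicted_var + stochasticity)
    posterior_var = (1.0 - gain) * predicted_var
    return gain, posterior_var

# Hypothetical variances chosen only for illustration.
base = kalman_step(prior_var=1.0, volatility=0.5, stochasticity=1.0)
more_volatile = kalman_step(prior_var=1.0, volatility=2.0, stochasticity=1.0)
more_noisy = kalman_step(prior_var=1.0, volatility=0.5, stochasticity=4.0)

for name, (gain, var) in [("base", base), ("volatile", more_volatile), ("noisy", more_noisy)]:
    print(f"{name}: gain={gain:.3f} posterior_var={var:.3f}")
```

Higher volatility raises the gain and higher stochasticity lowers it, even though both leave the posterior more uncertain than the baseline.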

2605.19214 2026-05-20 cs.LG cs.CV

Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lauren Oakden-Rayner, Robert Vandersluis, Jessica Schrouff, Lyle J. Palmer, Mark Jenkinson

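AI summary: This paper proposes a worst-group equalized-odds regularizer for assessing and mitigating systematic disparities in medical image classification across multiple demographic attributes simultaneously. By penalizing subgroup-level deviations in true positive and false positive rates at the inference-time operating point, it reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC.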

Comments 11 Pages, 2 Figures

Abstract

Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
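
The over- and under-diagnostic pattern described above can be measured at a fixed operating point with a simple worst-group gap. This sketch is an evaluation metric only, not the paper's margin regularizer; the group names and labels are hypothetical, and each group is assumed to contain at least one positive and one negative case.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and thresholded predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, fp / neg

def worst_group_gap(groups):
    """Subgroup whose TPR or FPR deviates most from the pooled rates."""
    pooled_true = [t for y_true, _ in groups.values() for t in y_true]
    pooled_pred = [p for _, y_pred in groups.values() for p in y_pred]
    pooled_tpr, pooled_fpr = rates(pooled_true, pooled_pred)
    worst = ("", 0.0)
    for name, (y_true, y_pred) in groups.items():
        tpr, fpr = rates(y_true, y_pred)
        gap = max(abs(tpr - pooled_tpr), abs(fpr - pooled_fpr))
        if gap > worst[1]:
            worst = (name, gap)
    return worst

groups = {
    "age<60": ([1, 1, 0, 0], [1, 1, 1, 0]),   # over-diagnostic: high TPR and FPR
    "age>=60": ([1, 1, 0, 0], [1, 0, 0, 0]),  # under-diagnostic: low TPR and FPR
}

print(worst_group_gap(groups))
```

Because subgroups are defined per attribute, the same function can be run once for age, once for sex, and once for race without enumerating their intersections.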

2605.19213 2026-05-20 cs.CV

Smartphone-based Circular Plot Sampling for Forest Inventory

Su Sun, Jui-Cheng Chiu, Nabin Khanal, Songlin Fei, Yingjie Victor Chen

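AI summary: This paper proposes a lightweight smartphone-based pipeline that performs complete circular-plot tree measurement from a single walkthrough video without additional specialized hardware. It combines pretrained monocular depth estimation and tree instance segmentation with a SLAM framework to jointly refine camera trajectories and depth, yielding tree position and DBH estimates with high accuracy and scalability.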

Abstract

Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large-scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot-sampling-based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and a natural forest plot, achieving mean absolute errors of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%), respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.
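
For readers unfamiliar with the two error figures, the sketch below computes them on made-up DBH values. It assumes MARE denotes mean absolute relative error expressed as a percentage of the reference measurement; the abstract does not spell out the definition.

```python
def mae(estimates, references):
    """Mean absolute error, in the units of the inputs (here cm)."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(references)

def mare(estimates, references):
    """Mean absolute relative error, as a percentage of the reference value."""
    rel = [abs(e - r) / r for e, r in zip(estimates, references)]
    return 100.0 * sum(rel) / len(rel)

# Hypothetical DBH values in centimetres, not data from the paper.
estimated_dbh_cm = [31.0, 24.0, 42.0]
caliper_dbh_cm = [30.0, 25.0, 40.0]

print(f"MAE  = {mae(estimated_dbh_cm, caliper_dbh_cm):.2f} cm")
print(f"MARE = {mare(estimated_dbh_cm, caliper_dbh_cm):.2f} %")
```

Reporting both is useful because a fixed centimetre error matters more for thin stems than for thick ones.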

2605.19210 2026-05-20 cs.CV

D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation

Shengzhe Chen, Hao Yan

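AI summary: This paper proposes a unified, threshold-free differentiable convex shape prior for data-driven image segmentation, based on the quasi-concavity of the network's output mask function u. By requiring all super-level sets of u to be convex, it turns a global shape constraint into local differentiable inequalities and improves shape regularity.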

Comments Accepted by CVPR 2026

Abstract

Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network's output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.
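
The zeroth-order (midpoint) characterization of quasi-concavity says that u at the midpoint of two points is at least the smaller of the two endpoint values. A one-dimensional sketch on a sampled signal, with hypothetical values, shows how this relates to convex super-level sets. It is a plain check, not the paper's convexification algorithm or loss.

```python
def is_midpoint_quasi_concave(u, tol=1e-9):
    """Check u[(i + j) // 2] >= min(u[i], u[j]) for all pairs with a grid midpoint."""
    n = len(u)
    for i in range(n):
        for j in range(i + 2, n, 2):  # i + j is even, so the midpoint is on the grid
            if u[(i + j) // 2] < min(u[i], u[j]) - tol:
                return False
    return True

single_bump = [0.1, 0.4, 0.9, 0.8, 0.3]  # every super-level set is one interval
two_bumps = [0.1, 0.9, 0.2, 0.8, 0.1]    # {u >= 0.5} splits into two pieces

print(is_midpoint_quasi_concave(single_bump))
print(is_midpoint_quasi_concave(two_bumps))
```

No threshold is chosen anywhere in the check, which mirrors the threshold-free nature of the prior: the condition constrains all level sets at once.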

2605.19209 2026-05-20 cs.RO cs.MA

Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

Manohari Goarin, Yang Zhou, Giuseppe Loianno

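AI summary: This paper proposes a hierarchical framework that combines a graph attention planner with a decentralized nonlinear model predictive controller to solve simultaneous goal assignment and safe trajectory generation for multiple robots under communication constraints, yielding a scalable decentralized solution based on graph neural networks.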

Comments 8 pages, 6 figures, Accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms and practical feasibility with decentralized on-board inference.

2605.19207 2026-05-20 cs.CV cs.AI cs.LG

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

Sumanth Meenan Kanneti, Aryan Shah

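AI summary: This paper proposes a multi-strategy compression framework for brain tumor classification from MRI, comprising quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student, and Float16 post-training quantization on a lightweight MobileNetV2 backbone, with the aim of efficient and accurate brain tumor screening in low-resource healthcare settings.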

Abstract

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
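
The reported figures can be cross-checked with a few lines of arithmetic. The numbers come straight from the abstract; the variable names are only illustrative.

```python
full_precision_mb = 35.34
float16_mb = 5.76

# Ratio of original size to quantized size.
compression_ratio = full_precision_mb / float16_mb
# Share of the original size that was removed.
size_reduction_pct = 100.0 * (1.0 - float16_mb / full_precision_mb)
# Quantized minus full-precision validation accuracy, in percentage points.
accuracy_delta_pts = 82.37 - 82.20

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"size reduction:    {size_reduction_pct:.1f}%")
print(f"accuracy change:   {accuracy_delta_pts:+.2f} points")
```

The ratio matches the stated 6.14x, and the quantized model is nominally 0.17 points above the baseline, a difference small enough to be consistent with the "no meaningful accuracy loss" reading.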

2605.19206 2026-05-20 cs.RO

CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

Taeyun Kim, Alvin Jinsung Choi, Dasol Hong, Hyun Myung

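AI summary: CLUE uses a unified semantic map and adaptively prioritized contextual cues to tackle zero-shot object-goal navigation, improving the robustness and efficiency of navigation.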

Comments 8 pages, 5 figures

Abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using an LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.

2605.19202 2026-05-20 cs.RO cs.AI math.OC

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan, George Nikolakopoulos

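AI summary: This paper proposes a deep reinforcement learning-based quadrotor controller for autonomous inspection missions in under-canopy forest environments. An end-to-end control policy performs inspection view-pose tracking, and a Traveling Salesman Problem planner combined with an RRT* planner ensures safe and reliable deployment on long-range missions.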

Comments Submitted to 2026 IEEE 22nd International Conference on Automation Science and Engineering

Abstract

This paper addresses the problem of using a deep Reinforcement Learning (RL)-based low-level quadrotor controller within an autonomous quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, the paper presents an end-to-end (mapping states to RPMs) quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, the paper uses a higher-level navigation guidance layer comprising a Traveling Salesman Problem (TSP) planner and a Rapidly-exploring Random Tree Star (RRT*) planner. Over a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, collision-free paths that respect the tracking limitations of the lower-level end-to-end RL policy are generated by the RRT* planner. Through five target inspection scenarios, the paper demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.

2605.19201 2026-05-20 cs.LG cs.AI

On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis

Danu Kim

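AI summary: This paper proposes PneumoNet, a domain-incremental learning method for resource-limited settings that combines a lightweight CNN for on-device prediction, a dual-stage balanced buffer for class-balanced replay, and a dynamic class-weighted loss that corrects training-batch imbalance. On a PneumoniaMNIST dataset simulating five realistic domain change scenarios, it reaches 86.6% accuracy while being smaller and more efficient.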

Comments Presented at 32nd Samsung Humantech Paper Awards

详情
AI中文摘要

深度学习模型在胸部X光片上检测肺炎具有高准确性,但在设备、患者或机构差异导致的域偏移下性能会下降。我们提出了PneumoNet,一种用于资源受限环境的点-of-care肺炎诊断的领域增量学习方法。PneumoNet结合了轻量级CNN进行设备端预测,双阶段平衡缓冲器实现类别平衡回放,以及动态类别加权损失以纠正训练批次不平衡。在模拟五个真实域变化场景的域偏移PneumoniaMNIST数据集上评估,PneumoNet在86.6%的准确率和1.4%的遗忘率下,比现有基线更小且更快。这些结果突显了PneumoNet在真实世界和疫情准备医疗环境中实现适应性、隐私保护诊断AI的潜力。

英文摘要

Deep learning models detect pneumonia from chest X-rays with high accuracy, but the performance declines under domain shifts caused by differences in devices, patients, or institutions. We present PneumoNet, a domain-incremental learning method for point-of-care pneumonia diagnosis in resource-limited settings. PneumoNet combines a lightweight CNN for on-device prediction, a dual-stage balanced buffer for class-balanced replay, and a dynamic class-weighted loss to correct training-batch imbalances. Evaluated on a domain-shifted PneumoniaMNIST dataset simulating five realistic domain change scenarios, PneumoNet achieves 86.6% accuracy with 1.4% forgetting while being smaller and faster than existing baselines. These results highlight PneumoNet's potential to enable adaptive, privacy-preserving diagnostic AI directly on point-of-care medical devices in real-world and pandemic-ready healthcare.
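
Illustrative sketch (not from the paper): one common way to correct training-batch imbalance is to recompute inverse-frequency class weights for every batch and use them in a weighted cross-entropy. The abstract does not give PneumoNet's actual weighting formula, so the rule `n / (num_classes * count)` and both function names below are assumptions.

```python
import math
from collections import Counter


def batch_class_weights(labels, num_classes):
    """Inverse-frequency weights recomputed per batch (assumed formula)."""
    counts = Counter(labels)
    n = len(labels)
    # A class absent from the batch contributes no loss terms, so its weight is 0.
    return [n / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


labels = [1, 1, 1, 0]                 # 3 pneumonia, 1 normal (made-up batch)
weights = batch_class_weights(labels, num_classes=2)
probs = [[0.5, 0.5]] * 4              # an uninformative classifier
loss = weighted_cross_entropy(probs, labels, weights)
print(weights, loss)
```

With these weights the single minority-class sample carries as much total weight as the three majority-class samples together.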

2605.19196 2026-05-20 cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

时间到REFLECT:我们能否信任LLM裁判来评估基于证据的研究代理?

Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

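AI summary This paper introduces REFLECT, a benchmark for evaluating fine-grained failure detection by LLM judges in agentic environments. It shows that current LLM judges are unreliable on reasoning, tool-use, and report-quality failures, and offers guidance for building more reliable evaluation pipelines.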

详情
AI中文摘要

深度研究代理越来越多地自动化复杂的信息检索任务,通过多步骤推理、工具使用和综合生成基于证据的报告。其日益增长的作用要求可扩展、可靠的评估,将LLM作为裁判设定为评估事实准确性、证据使用和推理质量的监督范式。然而,这些裁判对深度研究代理的可靠性仍不明确,提出了一个关键的元评估问题:在部署LLM裁判监督研究代理之前,必须首先评估这些裁判本身。现有的元评估在两个方面存在不足:(1)依赖于粗略的、主观的人类偏好一致;(2)专注于遵循指令或可验证的任务,未探索开放性的代理执行。为了解决这些差距,我们引入REFLECT(REliable Fine-grained LLM judge Evaluation via Controlled inTervention),一个针对代理环境中细粒度失败检测的元评估基准。REFLECT定义了详细的失败模式分类,通过在质量筛选的代理执行轨迹上执行受控和局部化的干预来实例化。这产生了可验证、全面且细粒度的实例,用于验证裁判模型。我们的实验表明,当前LLM裁判仍然不可靠:即使是最能干的模型,在推理、工具使用和报告质量失败方面的总体准确率也低于55%,在证据验证上表现尤其差。一起,我们的分类和发现揭示了系统性的裁判限制,揭示了成本和可靠性之间的权衡,并为构建更可靠的评估流程提供可行的指导。

英文摘要

Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.

2605.19194 2026-05-20 cs.CL

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

MMoA: 一个具有递归性的记忆混合代理框架

Rui Chu

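AI summary This paper proposes the MMoA framework, which adds an LSTM-based gating mechanism to address the limited temporal and contextual awareness of conventional Mixture-of-Agents methods, yielding a more efficient multi-agent system.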

详情
AI中文摘要

混合代理(MoA)框架通过聚合多个代理的输出来提升大语言模型(LLM)的性能。然而,现有MoA系统通常依赖静态路由器,无法充分捕捉聚合层中的时间依赖性和上下文依赖性。为了解决这一限制,我们提出MMoA,一种具有递归性的MoA架构,将基于LSTM的门控机制整合到代理选择过程中。递归路由器根据当前输入和历史路由决策动态调节代理贡献,从而实现更上下文感知的聚合。我们在标准的指令遵循基准上评估了MMoA,包括AlpacaEval 2.0、MT-Bench和Arena-Hard。结果表明,MMoA在准确率上与传统MoA相当,同时通过动态激活更少的代理减少了计算开销。例如,在AlpacaEval 2.0上,MMoA实现了58.0%的胜率,相比MoA的59.8%,同时将运行时间效率提高了高达4.6%。这些结果表明,MMoA为适应性多代理LLM系统提供了一种可扩展且高效的解决方案。

英文摘要

The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

2605.19193 2026-05-20 cs.LG

Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

多智能体大语言模型辩论中的顺序共识:一种基于Wald-SPRT的计算控制器与基于校准的故障检测

Andrea Morandi

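AI summary This paper proposes a Wald-SPRT-based compute governor for multi-agent LLM debates that uses calibration to detect failures, reducing compute at a small accuracy cost (-2pp on GSM8K relative to fixed-5 debate).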

详情
AI中文摘要

多智能体大语言模型辩论能够提高事实性和推理能力,但大多数方法固定回合数,导致在简单任务上过度消耗计算资源而在困难任务上不足。本文将Wald的顺序概率比率检验(SPRT)作为插件计算控制器应用于大语言模型辩论。每轮结束后,一个LLM法官会发出一个[0,1]的共识分数来评估最新智能体的位置;Wald监控器在Beta似然族下累积“有用收敛”与“尚未有用”的对数似然比,并在跨越任一边界或返回 capped 最佳努力结果时停止。在独立同分布假设下,该规则继承了SPRT类型I/类型II误差保证;在部署中,校准本身更为重要,因为它估计法官评分是否在特定领域中区分有用和无用的收敛。我们评估了两个轨道:(i) 在校准Beta模型下的蒙特卡洛研究,研究工作曲线、误差率、上限行为和敏感性;以及(ii) 在200个尝试的MMLU和200个尝试的GSM8K项目上的真实LLM评估,使用三个异质智能体(gpt-5, claude-opus-4-6, gemini-2.5-pro)和一个claude-opus-4-6法官,使用不相交的40项校准子集。在GSM8K上,该规则在1.01平均回合(4.06个LLM调用)达到97.0%的准确率,比固定5轮辩论在15次调用中达到的99.0%准确率减少了3.7倍的调用次数,但准确性降低了2个百分点。在MMLU上,校准的KL值坍缩到约0,规则在2.1倍成本下对99.5%的项目进行上限。结论是,SPRT并未使辩论更准确,而是经典的顺序检验为多智能体LLM系统提供了一种廉价的计算控制和故障检测层。

英文摘要

Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald's Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of "useful convergence" vs "not yet useful" under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real-LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a claude-opus-4-6 judge, using disjoint 40-item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed-5 debate at 15 calls: a 3.7x call reduction at -2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute-control and failure-detection layer for multi-agent LLM systems.
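
Illustrative sketch (not from the paper): the stopping rule described above can be written as a standard Wald SPRT over per-round judge scores, with each hypothesis modelled as a Beta density. The Beta parameters, the error targets, and the names `sprt_debate` and `beta_logpdf` are placeholders chosen here; the paper calibrates its Beta models on held-out items, and those values are not given in the abstract.

```python
from math import lgamma, log


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return ((a - 1) * log(x) + (b - 1) * log(1 - x)
            + lgamma(a + b) - lgamma(a) - lgamma(b))


def sprt_debate(scores, h1=(8.0, 2.0), h0=(2.0, 2.0),
                alpha=0.05, beta=0.05, r_max=5):
    """Accumulate the log-likelihood ratio of H1 ("useful convergence")
    versus H0 ("not yet useful") and stop at a Wald boundary or at r_max."""
    upper = log((1 - beta) / alpha)   # crossing above accepts H1
    lower = log(beta / (1 - alpha))   # crossing below accepts H0
    llr = 0.0
    rounds = scores[:r_max]
    for rnd, s in enumerate(rounds, start=1):
        s = min(max(s, 1e-6), 1 - 1e-6)  # keep the score strictly inside (0, 1)
        llr += beta_logpdf(s, *h1) - beta_logpdf(s, *h0)
        if llr >= upper:
            return "accept_useful", rnd
        if llr <= lower:
            return "reject_useful", rnd
    return "capped", len(rounds)


print(sprt_debate([0.95, 0.95, 0.95]))
print(sprt_debate([0.2]))
print(sprt_debate([0.66] * 5))
```

The reported 3.7x call reduction on GSM8K is the ratio of the two call counts in the abstract: 15 / 4.06 ≈ 3.69.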

2605.19185 2026-05-20 cs.LG cs.AI

Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

规划可接受的图-偏微分方程值扩展用于稀疏目标条件规划

Shiheng Zhang

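AI summary This paper studies which graph value extensions are planner-admissible under the operational argmin-Q planner. It gives a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, the greedy rollout reaches the goal. AMLE instantiates this certificate through a comparison-principle fill-distance bound, whereas harmonic extension can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order.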

详情
AI中文摘要

稀疏目标条件规划中,少量成本到目标标签可视为图-偏微分方程Dirichlet扩展问题:将稀疏标签扩展到目标依赖的边界上,以贪心rollouts达到目标。我们研究了在操作argmin-Q规划器下哪些图值扩展是规划可接受的。我们的主要结果是一种局部动作间隙证书:如果代理值误差在rollout过程中保持在真实动作间隙的一半以下,则贪心rollout可达到目标。绝对最小Lipschitz扩展(AMLE),作为图p-Laplacian家族的p=∞端点,通过比较原理填充距离界实现了该证书。相比之下,调和扩展由于其值反映边界击中概率而非最短路径贪心顺序,可能导致局部动作排名错误。在120个AntMaze布局衍生的图配置上,调和扩展实现0.584的累积rollout成功率,而AMLE达到0.970。有限高p方法也进入高成功率区域,p=4时成功率0.903,p=8时0.973,p=16固定预算求解器时0.982,尽管p=16行未作为收敛端点排名使用,因求解器认证不完整。机制审计显示,许多rollout决策发生在AMLE兼容但调和不兼容的局部几何中,并且AMLE在rollout加权决策范围内修正了大多数调和反转。

英文摘要

Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result is a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE), the p=infinity endpoint of the graph p-Laplacian family, instantiates this certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order. On 120 AntMaze layout-derived graph configurations, harmonic extension achieves 0.584 aggregate rollout success, while AMLE reaches 0.970. Finite high-p methods also enter a high-success regime, with success 0.903 for p=4, 0.973 for p=8, and 0.982 for a fixed-budget p=16 solver, though the p=16 row is not used as a converged endpoint ranking due to incomplete solver certification. Mechanism audits show that many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry, and that AMLE corrects most harmonic inversions on the rollout-weighted decision scope.
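
Illustrative sketch (not from the paper): on an unweighted graph, the harmonic extension sets each unlabelled vertex to the mean of its neighbours, while the p=infinity (AMLE) extension sets it to the midpoint of the largest and smallest neighbour values. The toy graph, the labels, and the simple fixed-point iteration below are assumptions for illustration only; they show that the two rules give different values from the same sparse labels, not the paper's solver or its AntMaze setup.

```python
def extend(adj, labels, rule, iters=200):
    """Fixed-point iteration: labelled vertices stay fixed, others apply `rule`
    to their neighbours' current values."""
    u = {v: labels.get(v, 0.0) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v not in labels:
                u[v] = rule([u[w] for w in adj[v]])
    return u


def harmonic(vals):
    """Mean of neighbour values (graph Laplace equation)."""
    return sum(vals) / len(vals)


def amle(vals):
    """Midpoint of max and min neighbour values (graph infinity-Laplacian)."""
    return 0.5 * (max(vals) + min(vals))


# Toy graph: goal g - x - y, where y also touches two high-cost labelled vertices.
adj = {"g": ["x"], "x": ["g", "y"], "y": ["x", "b1", "b2"],
       "b1": ["y"], "b2": ["y"]}
labels = {"g": 0.0, "b1": 10.0, "b2": 10.0}  # made-up cost-to-go labels

u_h = extend(adj, labels, harmonic)
u_a = extend(adj, labels, amle)
print(u_h["x"], u_h["y"], u_a["x"], u_a["y"])
```

In this toy case the harmonic value at `y` is pulled toward 10 by the two high-cost neighbours, because every neighbour counts equally, whereas the AMLE value depends only on the extreme neighbours.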

2605.19173 2026-05-20 cs.CL

Prompting language influences diagnostic reasoning and accuracy of large language models

提示语言影响大型语言模型的诊断推理和准确性

Adrien Bazoge, Josselin Corvellec, Sofiane Djillali Sid-Ahmed, Pierre-Antoine Gourraud

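AI summary This study examines how prompting language affects the clinical diagnostic reasoning and accuracy of large language models. Comparing English and French performance, it finds that four of the five models perform better in English while o3 shows no overall language effect, indicating that prompting language is a key determinant of clinical performance.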

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被探索用于临床决策支持,但大多数评估都是用英语进行的,这使得其在其他语言中的可靠性存疑。本文通过比较五种LLM(o3、DeepSeek-R1、GPT-4-Turbo、Llama-3.1-405B-Instruct和BioMistral-7B)在英语和法语环境下的表现,评估了提示语言对诊断推理和最终诊断准确性的影响。总共评估了180个涵盖16个医学专科的临床情景,由两位医生使用18分量表评估诊断准确性和推理质量。五种模型中有四种在英语环境下表现更优(平均差异0.37-0.91,调整p<0.05),差距涵盖多个推理方面,包括鉴别诊断、逻辑结构和内部效度。o3是唯一一个未表现出整体语言效应的模型。这些发现表明,提示语言仍然是影响LLM临床性能的关键因素,对全球公平的语义文化部署具有重要影响。

英文摘要

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

2605.19172 2026-05-20 cs.LG cs.AI

Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand

Bridge:基于检索的时空建模用于城市配送需求

Yihong Tang, Tong Nie, Junlin He, Qianjun Huang, Dingyi Zhuang, Lijun Sun

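AI summary This paper proposes Bridge, a framework that combines an inductive contextual graph backbone with a time-aware memory module to forecast urban delivery demand in newly added service regions that lack historical records, improving forecasts for cold-start regions.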

详情
AI中文摘要

预测城市配送需求在新增服务区域缺乏历史记录时变得尤为具有挑战性。现有的时空预测器在有足够的节点历史时能有效建模空间依赖性,但它们仍然是参数化的,因此在冷启动区域难以恢复短期运营动态。地理嵌入帮助识别区域的位置和功能,但并不能直接揭示相似区域在相似时间背景下行为的方式。我们提出了Bridge,一种结合归纳上下文图结构和时间感知记忆的时空图框架。对于每个目标区域,Bridge通过区域上下文和近期动态从记忆中检索未来需求模式,并通过门控融合机制优化图结构预测。为了使检索与预测效用对齐,我们进一步训练检索器以未来为导向的目标,偏好那些未来轨迹与目标最匹配的条目。实验表明,Bridge在四个真实世界配送数据集上,无论是城市内部冷启动还是跨城市转移时部分观察情况下,均优于竞争性的时空基线模型。结果表明,当参数图泛化能力不足时,检索增强为冷启动城市需求预测提供了有用的操作记忆。

英文摘要

Forecasting urban delivery demand becomes substantially more challenging when newly added service regions lack historical records. Existing spatiotemporal forecasters effectively model spatial dependence once sufficient node histories are available. Still, they remain parametric and therefore struggle to recover short-term operational dynamics in cold-start regions. Geospatial embeddings help identify where a region is and what function it serves, yet they do not directly reveal how a similar region behaves under a comparable temporal context. We propose Bridge, a retrieval-augmented spatiotemporal graph framework that combines an inductive contextual graph backbone with a time-aware memory of region-time windows. For each target region, Bridge retrieves future demand patterns from the memory using both regional context and recent dynamics, and refines the backbone forecast through a gated fusion mechanism. To align retrieval with forecasting utility, we further train the retriever with a future-aware objective that favors entries whose future trajectories best match the target. Experiments on four real-world delivery datasets show that Bridge consistently improves over competitive spatiotemporal baselines in both within-city cold-start and cross-city transfer with partial observations. The results show that retrieval augmentation provides a useful operational memory for cold-start urban demand forecasting when parametric graph generalization alone is insufficient.
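
Illustrative sketch (not from the paper): the retrieve-then-fuse idea can be pictured as a nearest-neighbour lookup over stored region-time windows followed by a gated blend with the backbone forecast. Everything concrete below is an assumption: cosine similarity as the retrieval score, softmax weighting of the retrieved futures, a fixed scalar gate (the paper's gate is a learned fusion mechanism), and the names `retrieve_and_fuse`, `key`, and `future`.

```python
import math


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def retrieve_and_fuse(query, memory, backbone_forecast, k=2, gate=0.5):
    """Blend the backbone forecast with the futures of the k most similar entries."""
    ranked = sorted(memory, key=lambda e: cosine(query, e["key"]), reverse=True)[:k]
    sims = [cosine(query, e["key"]) for e in ranked]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]          # numerically stable softmax
    weights = [x / sum(exps) for x in exps]
    retrieved = [sum(w * e["future"][t] for w, e in zip(weights, ranked))
                 for t in range(len(backbone_forecast))]
    return [gate * r + (1 - gate) * b for r, b in zip(retrieved, backbone_forecast)]


memory = [                                            # made-up memory entries
    {"key": [1.0, 0.0], "future": [10.0, 10.0]},
    {"key": [0.0, 1.0], "future": [0.0, 0.0]},
]
fused = retrieve_and_fuse([1.0, 0.0], memory, backbone_forecast=[4.0, 4.0], k=1)
print(fused)
```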

2605.19166 2026-05-20 cs.RO cs.LG math.OC

A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

一种通过奖励设计和终止条件实现RL基于四旋翼控制性能调优的启发式方法

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, George Nikolakopoulos

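AI summary This paper presents a new heuristic approach for tunable performance in RL-based quadrotor control through reward design and termination conditions. A reward structure with dual-bandwidth exponentials yields a critically damped setpoint-tracking response with low steady-state error. Trained with Proximal Policy Optimization (PPO) and episode truncation conditions, the policy reaches the desired performance sample-efficiently within 6 million time steps. Intuitive heuristic rules for adjusting reward weights and exponential coefficients give faster (acrobatic-like) and slower (inspection-like) settling times while retaining the baseline critically damped response and about 2% steady-state error.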

Comments Accepted in the 34th Mediterranean Conference on Control and Automation

详情
AI中文摘要

基于强化学习(RL)的四旋翼控制策略在诸如在复杂环境中快速导航和无人机赛车等任务中取得了显著性能。然而,在某些应用中,如基础设施检查,实现精确、可控的机动并具有可调性能至关重要。本文提出了一种新的启发式方法,通过奖励设计和终止条件实现RL基于四旋翼控制的可调性能。我们提出了一种包含双带宽指数的新型奖励结构,实现了设定点跟踪的基线临界阻尼响应,并具有低稳态误差。当使用近端策略优化(PPO)算法进行训练时,结合episode截断条件,在600万次时间步内以高效的方式实现了所需性能。为了调节基线行为的性能,我们提出了直观的启发式规则来调整奖励权重和指数系数,以实现更快(空翻式)和更慢(检查式)的稳定时间性能,同时保留基线临界阻尼响应和大约2%的稳态误差。我们评估了三种RL策略(基线、空翻和检查)在100次试验中的表现,并展示了在随机初始条件下位置和偏航跟踪的准确且可调性能,从而证明了所提出启发式方法的有效性。

英文摘要

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.
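
Illustrative sketch (not from the paper): a reward built from two exponentials of the tracking error, one wide and one narrow, is one plausible reading of "dual bandwidth exponentials". The functional form, the default weights and coefficients, and the name `dual_bandwidth_reward` are all assumptions; the abstract only says that reward weights and exponential coefficients are the tuning knobs.

```python
import math


def dual_bandwidth_reward(err, w_wide=0.5, k_wide=0.5, w_narrow=0.5, k_narrow=10.0):
    """Sum of a slowly decaying and a sharply decaying exponential of |err|.

    Assumed form: the wide term still gives a signal far from the setpoint,
    the narrow term rewards small residual error near it.
    """
    e = abs(err)
    return w_wide * math.exp(-k_wide * e) + w_narrow * math.exp(-k_narrow * e)


for err in (0.0, 0.1, 1.0, 5.0):
    print(err, round(dual_bandwidth_reward(err), 4))
```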

2605.19156 2026-05-20 cs.AI cs.CY cs.LG cs.MA

How Far Are We From True Auto-Research?

我们距离真正的自动研究还有多远?

Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie

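AI summary Using ResearchArena, this paper evaluates the quality of papers generated by different agents. Agents can produce papers that look competitive, but experimental rigor falls short, with fabricated results, underpowered experiments, and plan/execution mismatch, indicating that auto-research still needs further development.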

详情
AI中文摘要

最近的自动研究系统能够生成完整的论文,但可行性并不等同于质量,该领域仍然缺乏对代理生成论文实际质量的系统研究。我们介绍了ResearchArena,一个最小的框架,让现成的代理(Claude Code使用Opus 4.6,Codex使用GPT-5.4,和Kimi Code使用K2.5)在仅轻量指导下自行完成完整的研究循环(构想、实验、论文写作、自我完善)。在13个计算机科学种子和每个代理-领域对的3次试验中,ResearchArena生成了117篇代理生成的论文,每篇都在三个互补的视角下评估:仅手稿的评审员(SAR)、考虑工件的同行评审(PR)以及人工进行的元评审。在仅SAR的情况下,图景是乐观的:Claude Code获得最高评分,优于Analemma的FARS,并与加权平均的人类ICLR 2025提交匹配,表明最小框架的代理能够生成在手稿-only评审中看起来有竞争力的论文。然而,人工检查却揭示了这个图景被夸大了:SAR评分与实际接受决定不一致,且奖励合理框架而不验证实验实质。在考虑工件的PR评分急剧下降,人工审计发现实验严谨性是主要瓶颈,分解为三种失败模式(伪造结果、低能力实验、计划/执行不匹配),这些模式高度依赖于代理:Codex 5%/8%论文与工件不匹配/伪造参考文献,与Kimi Code 77%/72%相比,差距约为15倍,追踪代理发展出的不同研究身份。没有一篇代理生成的论文达到顶级会议的接受标准。这表明我们仍然与真正的自动研究有差距。

英文摘要

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals that this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex shows 5% paper-vs-artifact mismatch and 8% fabricated references, versus 77% and 72% for Kimi Code, a roughly 15x spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still far from true auto-research.
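
Worked arithmetic (a check on the figures quoted above, not additional results): the paper count follows from 13 seeds, 3 trials, and the 3 listed agents, and the quoted spread of about 15x presumably refers to the paper-vs-artifact mismatch rates. The variable names are chosen here for illustration.

```python
seeds, trials, agents = 13, 3, 3          # agents: Claude Code, Codex, Kimi Code
total_papers = seeds * trials * agents

mismatch_spread = 77 / 5                  # Kimi Code vs Codex, paper-vs-artifact mismatch
reference_spread = 72 / 8                 # Kimi Code vs Codex, fabricated references

print(total_papers, mismatch_spread, reference_spread)
```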

2605.19155 2026-05-20 cs.CV

Efficient coding along the visual hierarchy

视觉层次中的高效编码

Ananya Passi, Brian S. Robinson, Michael F. Bonner

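AI summary This paper studies how efficient coding can build a hierarchy of human-aligned visual features from limited data. It proposes an unsupervised learning procedure that compresses inputs onto the dominant modes of variation in natural images, yielding features that progress from edges and colors to textures and shapes; combining it with supervised fine-tuning improves brain alignment and speeds up category learning.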

Comments 34 pages, 6 figures

详情
AI中文摘要

生物视觉系统在有限经验下学习,不同于依赖数百万训练图像的深度学习模型。什么学习原理使这种可能性成为可能?我们测试了高效编码(即神经表示捕捉自然输入的统计结构)是否能从有限数据中构建与人类对齐的视觉特征层次。我们开发了一种无监督学习过程,其中每个深度网络层仅使用局部统计信息,不使用标签、任务或反向传播,将输入压缩到自然图像的主要变化模式上。这种无监督过程生成的特征从边缘和颜色逐步发展到纹理和形状。该深度高效编码模型的特征易于被人类观察者识别,并能预测人类视觉皮层的图像诱发fMRI响应。此外,结合高效编码与监督微调的混合学习过程在低数据设置下能产生更好的脑区对齐性,并加快类别学习速度。这些发现表明,高效编码可能在视觉层次的整个表示中起作用,并有助于解释生物视觉的数据效率。

英文摘要

Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.

2605.19151 2026-05-20 cs.AI cs.HC

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

渐进自主性作为偏好学习:代理工具使用中的信任校准形式化

Changkun Ou

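AI summary This paper formalizes trust calibration for agentic tool use as a preference-learning problem. A Gaussian-process posterior models a latent human risk-tolerance function, and the system escalates to the human where the approval outcome is most uncertain. The approach inherits the inference machinery and sample-efficiency argument of Preferential Bayesian Optimization but differs in objective.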

详情
AI中文摘要

我们正式将代理工具使用中的信任校准(决定何时自动化代理的提议行动可以自主执行还是需要人类批准)作为偏好学习问题。策略网关维护一个高斯过程后验,覆盖潜在人类风险容忍函数,通过二元批准/拒绝反馈的probit似然进行观测,并在审批结果最不确定的地方升级到人类。我们证明这在结构上是偏好贝叶斯优化的一个实例,继承了其推理机制(近似高斯过程分类)和样本效率论证(不确定性目标查询),但目标不同:将动作空间分类为允许/阻止/询问区域,而不是优化设计。

英文摘要

We formalize trust calibration for agentic tool use (deciding when an automated agent's proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.
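
Illustrative sketch (not from the paper): given a Gaussian posterior over the latent risk-tolerance value at a proposed action, the probit likelihood gives a closed-form approval probability, and the gateway can escalate when that probability is close to 0.5. The closed form Φ(μ / sqrt(1 + σ²)) is the standard probit-Gaussian identity; the `band` threshold and the names `approve_probability` and `gate` are assumptions, and the Gaussian-process inference that would produce `mean` and `var` is omitted.

```python
from math import erf, sqrt


def approve_probability(mean, var):
    """P(approve) under a probit likelihood and a N(mean, var) latent posterior."""
    z = mean / sqrt(1.0 + var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def gate(mean, var, band=0.2):
    """Map a latent posterior to allow / block / ask (hypothetical thresholds)."""
    p = approve_probability(mean, var)
    if abs(p - 0.5) < band:
        return "ask"              # approval outcome is too uncertain: escalate
    return "allow" if p > 0.5 else "block"


print(gate(3.0, 0.1), gate(-3.0, 0.1), gate(0.2, 4.0), gate(3.0, 100.0))
```

The last call shows the effect of posterior variance: the same positive mean that yields "allow" under low variance yields "ask" when the posterior is wide.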

2605.19150 2026-05-20 cs.LG cs.AI

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

Flash PD-SSM: 一种内存优化的结构稀疏状态空间模型

Aleksandar Terzić, Francesco Carzaniga, Nicolas Menet, Yannick Biehl, Michael Hersche, Thomas Hofmann, Abbas Rahimi

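AI summary This paper proposes Flash PD-SSM, a memory-optimized structured sparse state-space model that improves expressivity while staying efficient, achieving throughput comparable to widely used structured state-space models and showing higher accuracy and efficiency on several tasks.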

详情
AI中文摘要

状态空间模型(SSMs)面临效率与表达能力之间的根本权衡,这主要由模型转移矩阵的结构决定。无结构的转移矩阵具有最大的表达能力,但计算和内存成本过高。相比之下,大多数结构化转移矩阵形式在运行时间和内存消耗上都非常高效,但表达能力有限。基于最近关于结构稀疏SSMs的研究,我们提出了Flash PD-SSM,一种新的SSM,其吞吐量与广泛使用的结构化SSMs相当,但具有显著更好的表达性保证。Flash PD-SSM维护一个可训练的结构稀疏矩阵集合,在每个时间步选择其中一个进行离散选择,从而在保持大规模训练所需的效率的同时,实现了与无结构矩阵相当的FSA表达能力。首先,我们在合成机制和状态跟踪任务上验证了Flash PD-SSM,发现其理论表达能力在实践中得以实现。其次,在涉及超过17000长度序列的多变量时间序列任务中,我们发现Flash PD-SSM在竞争性的SSM方法中定义了新的最先进的(SoTA)准确性。最后,我们展示了Flash PD-SSM是混合LLMs的有效替代品,在自然语言状态跟踪和常见语言建模场景中均取得改进。该模型相比前沿语言模型广泛使用的SSMs表现出更高的吞吐量和更低的内存消耗。

英文摘要

State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model's transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.
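
Illustrative sketch (not from the paper): selecting one sparse transition matrix per time step is enough to track the state of a finite-state automaton. The toy below tracks parity with two permutation matrices and a one-hot state. Using the input symbol directly as the selection index, restricting to pure permutation matrices, and the name `run_selected_transitions` are assumptions; in the model the matrix set is trainable and the selection is computed by the network.

```python
def run_selected_transitions(matrices, selections, state):
    """Apply state <- matrices[k] @ state for each selected index k."""
    for k in selections:
        m = matrices[k]
        state = [sum(m[i][j] * state[j] for j in range(len(state)))
                 for i in range(len(m))]
    return state


identity = [[1, 0], [0, 1]]   # input bit 0: parity unchanged
swap = [[0, 1], [1, 0]]       # input bit 1: parity flips

bits = [1, 0, 1, 1]           # three ones, so the final parity is odd
final = run_selected_transitions([identity, swap], bits, [1, 0])
print(final)
```

Each matrix has one non-zero entry per column, so a step costs time linear in the state size when stored sparsely; the dense loops here are only for readability.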

2605.19149 2026-05-20 cs.CL cs.CR

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

Rishi Jha, Harold Triedman, Arkaprabha Bhattacharya, Vitaly Shmatikov

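AI summary This study examines accidental meltdowns, in which agents behave unsafely after encountering benign errors. Experiments find unsafe behavior of varying severity in 64.7% of agent rollouts that encounter simulated errors, and this behavior is not captured by existing reliability or safety benchmarks.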

Comments 32 pages, 8 figures, 4 tables

详情
AI中文摘要

代理在使用计算机和网络时不可避免地会遇到错误:无法访问的网页、缺失的文件、本地和远程的配置错误等。这些错误不会阻碍基于最新模型的代理。它们会继续寻找完成任务的方法。我们引入、描述并测量了一种新的代理失败类型,称为"意外崩溃":在没有对抗性输入的情况下,对良性环境错误产生不安全或有害行为。由于崩溃未被现有可靠性或安全基准捕捉,我们开发了一种崩溃行为的分类法。然后,我们实现了通用代理基础设施,用于在滚动环境中注入模拟的本地和远程错误,并使用它来系统评估基于GPT、Grok和Gemini的代理系统。我们的评估显示,在遇到模拟错误的64.7%的代理滚动中,会出现不同程度和成功程度的崩溃(例如,进行未经授权的侦察或颠覆访问控制)。在超过一半的这些崩溃中,不安全行为未报告给用户。比较有无错误的相同代理行为,我们发现对错误的探索与不安全和有害行为相关。

英文摘要

Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call "accidental meltdown": unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.

2605.19141 2026-05-20 cs.LG cs.AI cs.CL cs.CY cs.HC

GRASP: Deterministic argument ranking in interaction graphs

GRASP:交互图中的确定性论证排名

Diganta Misra, Antonio Orvieto, Rediet Abebe, Volkan Cevher

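AI summary This paper proposes GRASP, a framework that aggregates stable local interaction judgments into a global ranking to address the inconsistency of holistic LLM-as-a-Judge verdicts, emphasizing structural sufficiency rather than persuasion or rhetorical appeal.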

Comments Preprint

详情
AI中文摘要

大型语言模型越来越多地被部署为自动裁判,以评估论证的强度。随着这一角色的扩大,其合法性取决于一致性、透明性和将论证结构与修辞吸引力区分开的能力。然而,我们证明了整体评判——一种常见的LLM-as-a-Judge实践,其中模型对辩论提供全球裁决——存在显著的跨模型分歧。我们主张这种不稳定性源于将辩论复杂的交互结构压缩成单一的不透明分数。为了解决这一问题,我们提出GRASP(渐进排名与攻击支持传播),一种确定性框架,通过收敛的攻击-防御传播操作,将稳定的局部交互判断聚合为全局排名。我们证明在LLM-as-a-Judge评估中,局部交互判断比整体排名更具可重复性,使GRASP能够生成更一致的全局排名。我们进一步证明GRASP分数与人类“说服性”标签不相关,突显了一个关键的社技术区别:GRASP不衡量说服力、事实性或修辞吸引力,而是结构充分性——一种在显式交互图上的防御意识的论证鲁棒性概念。总体而言,GRASP为整体LLM评判提供了一个透明且可审计的替代方案。

英文摘要

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging, a common LLM-as-a-Judge practice where a model provides a global verdict on a debate, suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency, a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.

2605.19140 2026-05-20 cs.AI

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

学习手柄:在接口约束下的可证明收敛的工作流学习

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

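AI summary This study addresses workflow learning under interface constraints. It proposes IC-Q, an asynchronous decentralized Q-learning algorithm, and gives a finite-sample bound for neural IC-Q, which the authors describe as the first finite-sample guarantee for neural Q-learning under decentralized partial observability.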

详情
AI中文摘要

我们研究了在专门的代理通过共享的艺术品进行控制转移的设置下的工作流学习,每个代理只能观察该艺术品的局部函数及其自己的私人状态,且没有集中式学习者访问联合轨迹——这多代理LLM管道跨越组织、供应商或信任边界时的操作模式。我们将这种模式形式化为一个接口约束的半马尔可夫决策过程(IC-SMDP),其决策时刻发生在手柄时间,设计了IC-Q,一种异步去中心化的Q学习算法,其中每次手柄的跨代理协调恰好是一个标量。我们的主要结果是神经IC-Q的有限样本界,该界分解为三个独立可控的误差源:神经函数近似误差、接口表示差距和混合时间残差,基于随机选项持续时间折扣。建立这个界需要将近似信息状态(AIS)框架从单代理原始步骤MDP提升到多代理SMDP,并在随机持续时间内控制马尔可夫噪声,而这在先前工作中尚未完成。据我们所知,这是第一个在去中心化部分可观测性下的神经Q学习的有限样本保证。四个实验:一个受控的合成IC-SMDP,多LLM数学推理,多代理路由,以及多代理CPU编程,显示IC-Q在没有任何代理观察联合轨迹的情况下匹配集中式 oracle,每个误差源沿其对应的轴按界预测的比例缩放。

英文摘要

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories. This is the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-Q, an asynchronous decentralized Q-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-Q that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-Q matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
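
Illustrative sketch (not from the paper): a tabular semi-Markov Q-learning update in which the only cross-agent message at a handoff is a single number. The sketch assumes that this scalar is the receiving agent's value estimate for its own local observation, and that the discount applied is gamma raised to the random duration of the segment; the abstract states only that coordination is "exactly one scalar". The class `AgentQ` and its methods are hypothetical, and the paper's analysis concerns neural rather than tabular Q-functions.

```python
from collections import defaultdict


class AgentQ:
    """One agent's local Q-table over (local observation, action) pairs."""

    def __init__(self, actions, lr=0.1, gamma=0.9):
        self.actions = list(actions)
        self.lr = lr
        self.gamma = gamma
        self.q = defaultdict(float)

    def value(self, obs):
        """The scalar sent back at a handoff: max over this agent's own actions."""
        return max(self.q[(obs, a)] for a in self.actions)

    def update(self, obs, action, reward, duration, next_value):
        """SMDP update: discount the successor value by gamma ** duration."""
        target = reward + (self.gamma ** duration) * next_value
        self.q[(obs, action)] += self.lr * (target - self.q[(obs, action)])


sender = AgentQ(["continue", "hand_off"], lr=0.5, gamma=0.9)
receiver = AgentQ(["continue", "hand_off"], lr=0.5, gamma=0.9)
receiver.q[("draft_v1", "continue")] = 2.0     # made-up prior estimate

scalar = receiver.value("draft_v1")            # the only cross-agent message
sender.update("task_start", "hand_off", reward=1.0, duration=2, next_value=scalar)
print(sender.q[("task_start", "hand_off")])
```

Neither agent reads the other's observations or Q-table; the sender sees only `scalar`.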

2605.19137 2026-05-20 cs.CV

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Toward Data-Efficient Video Pre-training: Using Frozen Image Foundation Models

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

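AI summary This paper explores how freezing a pre-trained image foundation model and training only a temporal module can make video pre-training data-efficient, reducing the need for large-scale video data and compute.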

Comments Accepted to CVPR 2026 Workshops CV4Smalls

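Details
AI-translated abstract

Video foundation models perform well on many video understanding tasks, but they typically require large-scale pre-training on massive video datasets, which incurs substantial data and compute costs. Modern image foundation models, by contrast, already provide strong spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take a first look at a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a temporal module to process streaming video. Using the image foundation model as the spatial encoder could substantially reduce the video data and compute required relative to end-to-end video pre-training. In this work we examine the feasibility of the approach before committing compute to video pre-training. Empirical findings on multiple video understanding tasks suggest that strong temporal performance can be obtained without large-scale video pre-training, which motivates future work on recurrent video foundation models built by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen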

English abstract

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: https://github.com/tue-mps/towards-video-image-frozen .