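arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.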

2307.11925 2026-05-21 cs.LG math.CA

Mercer Large-Scale Kernel Machines from Ridge Function Perspective

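Karol Dziedziul, Sergey Kryzhevich, Paweł Wieczyński

AI summary: This paper studies the Mercer property of large-scale kernel machines from the ridge function perspective, examines whether kernels can be approximated by sums of products of cosine functions, analyzes the obstacles to this approach, and applies it to the 'one-to-one' method in image processing.

Comments: 17 pages, 3 figures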

2304.12906 2026-05-21 cs.LG stat.ML

The Score-Difference Flow for Implicit Generative Modeling

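Romann M. Weber

AI summary: This paper proposes the score-difference (SD) flow as a new approach to implicit generative modeling. The flow optimally reduces the KL divergence between two distributions; the paper shows its equivalence to denoising diffusion models and reveals a connection between the SD flow and the data-optimization sub-problem hidden in the training of generative adversarial networks.

Comments: 25 pages, 5 figures, 4 tables. Updated final version of a paper originally published in Transactions on Machine Learning Research (TMLR), including minor typographical corrections and post-publication commentary connecting the SD flow to drifting models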

Journal ref: Transactions on Machine Learning Research (7/2023)

Abstract

Implicit generative modeling (IGM) aims to produce samples of synthetic data matching the characteristics of a target data distribution. Recent work (e.g. score-matching networks, diffusion models) has approached the IGM problem from the perspective of pushing synthetic source data toward the target distribution via dynamical perturbations or flows in the ambient space. In this direction, we present the score difference (SD) between arbitrary target and source distributions as a flow that optimally reduces the Kullback-Leibler divergence between them. We apply the SD flow to convenient proxy distributions, which are aligned if and only if the original distributions are aligned. We demonstrate the formal equivalence of this formulation to denoising diffusion models under certain conditions. We also show that the training of generative adversarial networks includes a hidden data-optimization sub-problem, which induces the SD flow under certain choices of loss function when the discriminator is optimal. As a result, the SD flow provides a theoretical link between model classes that individually address the three challenges of the "generative modeling trilemma" -- high sample quality, mode coverage, and fast sampling -- thereby setting the stage for a unified approach.
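
As a rough illustration of the idea in this abstract, the sketch below moves source samples along the difference between the target score and the source score and tracks the KL divergence. It is a toy under strong assumptions that are not in the paper: both distributions are treated as 1D Gaussians, the source is refitted from its samples at every step, and the names `sd_flow`, `step`, and `n_steps` are hypothetical.

```python
import math
import random


def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of a 1D Gaussian at x."""
    return -(x - mean) / var


def kl_gaussians(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) between two 1D Gaussians."""
    return 0.5 * (
        math.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0
    )


def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of the samples."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var


def sd_flow(samples, target_mean, target_var, step=0.05, n_steps=50):
    """Move samples along (target score - source score); record KL per step."""
    history = []
    for _ in range(n_steps):
        mean_p, var_p = fit_gaussian(samples)
        history.append(kl_gaussians(mean_p, var_p, target_mean, target_var))
        samples = [
            x + step * (
                gaussian_score(x, target_mean, target_var)
                - gaussian_score(x, mean_p, var_p)
            )
            for x in samples
        ]
    mean_p, var_p = fit_gaussian(samples)
    history.append(kl_gaussians(mean_p, var_p, target_mean, target_var))
    return samples, history


random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(500)]
moved, kl_history = sd_flow(source, target_mean=3.0, target_var=0.25)
print(f"KL before: {kl_history[0]:.3f}, KL after: {kl_history[-1]:.6f}")
```

In this Gaussian toy the score difference is linear in `x`, so the source mean and variance contract toward the target values; the paper's proxy distributions and general setting are not reproduced here.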

2212.08989 2026-05-21 cs.LG

Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics

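Loc Vu-Quoc, Alexander Humer

AI summary: This paper reviews applications of deep learning in computational mechanics, including solid mechanics, fluid mechanics, and finite-element technology, discusses the role of hybrid and pure machine learning methods in solving nonlinear partial differential equations, and covers LSTM, attention mechanisms, and kernel methods.

Comments: 275 pages, 158 figures. Appeared online on 2023.03.01 at CMES-Computer Modeling in Engineering & Sciences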

DOI: 10.32604/cmes.2023.028130
Journal ref: CMES-Computer Modeling in Engineering & Sciences, Vol. 137, No. 2, pp.1069-1343, 2023

Abstract

Three recent breakthroughs due to AI in arts and science serve as motivation: An award winning digital image, protein folding, fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solid, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) relied on Long-Short-Term Memory (LSTM) architecture, with method (3) relying on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural network (PINN) methods, which could be combined with attention mechanism to address discontinuous solutions. Both LSTM and attention architectures, together with modern and generalized classic optimizers to include stochasticity for DL networks, are extensively reviewed. Kernel machines, including Gaussian processes, are provided to sufficient depth for more advanced works such as shallow networks with infinite width. Not only addressing experts, readers are assumed familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming at bringing first-time learners quickly to the forefront of research. History and limitations of AI are recounted and discussed, with particular attention at pointing out misstatements or misconceptions of the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.

2205.13524 2026-05-21 cs.CV cs.GR

PREF: Phasorial Embedding Fields for Compact Neural Representations

Binbin Huang, Xinhao Yan, Anpei Chen, Shenghua Gao, Jingyi Yu

AI summary: This paper proposes PREF, an efficient frequency-based neural representation. It introduces a phasor volume that covers a notably broader spectrum and accelerates Fourier mapping by combining the fast Fourier transform with local interpolation, which reduces the costly MLP in frequency-based representations and improves efficiency and interpretability.

Abstract

We present an efficient frequency-based neural representation termed PREF: a shallow MLP augmented with a phasor volume that covers significantly broader spectra than previous Fourier feature mapping or Positional Encoding. At the core is our compact 3D phasor volume where frequencies distribute uniformly along a 2D plane and dilate along a 1D axis. To this end, we develop a tailored and efficient Fourier transform that combines both Fast Fourier transform and local interpolation to accelerate naïve Fourier mapping. We also introduce a Parseval regularizer that stabilizes frequency-based learning. In these ways, our PREF reduces the costly MLP in the frequency-based representation, thereby significantly closing the efficiency gap between it and other hybrid representations, and improving its interpretability. Comprehensive experiments demonstrate that our PREF is able to capture high-frequency details while remaining compact and robust, including 2D image generalization, 3D signed distance function regression and 5D neural radiance field reconstruction.
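
For context, the Positional Encoding that this abstract contrasts PREF against is commonly written as a fixed bank of sines and cosines at doubling frequencies. The sketch below shows that baseline only, for a scalar input; it is not PREF's phasor volume, and the function name and `num_bands` parameter are illustrative assumptions.

```python
import math


def positional_encoding(x, num_bands):
    """Map a scalar x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k < num_bands."""
    features = []
    for k in range(num_bands):
        angle = (2.0 ** k) * math.pi * x
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Each band contributes one sine and one cosine feature.
print(positional_encoding(0.25, 3))
```

Because the frequencies are fixed in advance, the downstream MLP has to do most of the fitting, which is the cost the abstract says PREF aims to reduce.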

2110.06123 2026-05-21 cs.SD eess.AS

COVID-19 Diagnosis from Cough Acoustics using ConvNets and Data Augmentation

Saranga Kingkor Mahanta, Darsh Kaushik, Shubham Jain, Hoang Van Truong, Koushik Guha

AI summary: This paper applies convolutional neural networks and data augmentation to the cough-sound dataset from Track 1 of the DiCOVA 2021 Challenge to diagnose COVID-19. The improved model reaches an AUC of 87.07% on the blind test set and outperforms the challenge baseline.

Comments: DiCOVA, top 1st. This work has been submitted to the IEEE for possible publication

DOI: 10.1109/ICACFCT53978.2021.9837350
Journal ref: IEEE Advances in Computing and Future Communication Technologies (ICACFCT), Meerut, India, 2021, pp. 33-38

Abstract

With the periodic rise and fall of COVID-19 and countries being hit by its successive waves, an efficient, economical, and effortless diagnostic procedure for the virus has become an urgent need. COVID-19 positive individuals may even be asymptomatic, making the diagnosis difficult, but among the infected subjects, the asymptomatic ones need not be entirely free of symptoms caused by the virus. They might not show any observable symptoms like the symptomatic subjects, but they may differ from uninfected ones in the way they cough. These differences in the coughing sounds are minute and indiscernible to the human ear; however, they can be captured using machine learning-based statistical models. In this paper, we present a deep learning approach to analyze the acoustic dataset provided in Track 1 of the DiCOVA 2021 Challenge, which contains cough sound recordings from both COVID-19 positive and negative examples. To classify the sound recordings as COVID-19 positive or negative, we propose a ConvNet model. Our model achieved an AUC score of 72.23% on the blind test set provided by the challenge for an unbiased evaluation of the models. Incorporating data augmentation into the ConvNet model further increased the AUC-ROC from 72.23% to 87.07%. It also outperformed the DiCOVA 2021 Challenge's baseline model by 23%, thus claiming the top position on the DiCOVA 2021 Challenge leaderboard. This paper proposes the use of Mel frequency cepstral coefficients as the feature input for the proposed model.
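
The headline metric here is the area under the ROC curve. A minimal sketch of how that number can be computed from labels and classifier scores is given below, using the pairwise-ranking definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half). The function name and the toy data are assumptions for illustration and have nothing to do with the DiCOVA data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking, so the reported move from 72.23% to 87.07% is a change in how often a positive recording is scored above a negative one.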

1908.05972 2026-05-21 cs.LG stat.ML

AI-based Prediction of Independent Construction Safety Outcomes from Universal Attributes

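Henrietta Baker, Matthew R. Hallowell, Antoine J.-P. Tixier

AI summary: This paper improves and validates an approach from earlier research that predicts safety outcomes from attributes with machine learning. NLP is used to extract attributes, and models are trained to predict injury severity, injury type, body part impacted, and incident type. Independent human annotations remove potential artificial correlation, and the attributes remain highly predictive. The study also adds a larger dataset, new models, model stacking, and more appropriate evaluation metrics, and it predicts injury severity well, which the original study did not; the authors describe this as a significant advancement.

Comments: Added author contributions and journal reference, updated corresponding author, fixed a few typos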

Journal ref: Automation in Construction 118 (2020): 103146

Abstract

This paper significantly improves on, and completes the validation of, an approach proposed in previous research in which safety outcomes were predicted from attributes with machine learning. As in the original study, we use Natural Language Processing (NLP) to extract fundamental attributes from raw incident reports, and machine learning models are trained to predict safety outcomes. The outcomes predicted here are injury severity, injury type, body part impacted, and incident type. However, unlike in the original study, safety outcomes were not extracted via NLP but were provided by independent human annotations, eliminating any potential source of artificial correlation between predictors and predictands. Results show that attributes are still highly predictive, confirming the validity of the original approach. Other improvements brought by the current study include the use of (1) a much larger dataset featuring more than 90,000 reports, (2) two new models, XGBoost and linear SVM (Support Vector Machines), (3) model stacking, (4) a more straightforward experimental setup with more appropriate performance metrics, and (5) an analysis of per-category attribute importance scores. Finally, the injury severity outcome is well predicted, which was not the case in the original study. This is a significant advancement.

1907.11769 2026-05-21 cs.CL

Automatically Learning Construction Injury Precursors from Text

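Henrietta Baker, Matthew R. Hallowell, Antoine J.-P. Tixier

AI summary: This paper studies how digitally recorded safety reports from the construction industry can be used to learn injury precursors automatically, comparing several deep learning approaches to improve the understanding of safety incidents and the ability to learn from them.

Comments: Added author contributions and journal reference, updated corresponding author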

Journal ref: Automation in Construction 118 (2020): 103145

Abstract

In light of the increasing availability of digitally recorded safety reports in the construction industry, it is important to develop methods to exploit these data to improve our understanding of safety incidents and ability to learn from them. In this study, we compare several approaches to automatically learn injury precursors from raw construction accident reports. More precisely, we experiment with two state-of-the-art deep learning architectures for Natural Language Processing (NLP), Convolutional Neural Networks (CNN) and Hierarchical Attention Networks (HAN), and with the established Term Frequency - Inverse Document Frequency representation (TF-IDF) + Support Vector Machine (SVM) approach. For each model, we provide a method to identify (after training) the textual patterns that are, on average, the most predictive of each safety outcome. We show that among those pieces of text, valid injury precursors can be found. The proposed methods can also be used by the user to visualize and understand the models' predictions.

2605.20539 2026-05-21 cs.LG

OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI

Ipsita Bhar, Huseyin Tuna Erdinc, Thales Souza, Charles Jones, Felix J. Herrmann

AI summary: This paper presents OpenSeisML, an open large-scale dataset of real seismic and well-log data that supports generative AI for seismic inversion. An automated data curation pipeline provides reproducible seismic data preparation for training generative models that capture the statistical distribution of subsurface properties and synthesize multiple statistically consistent realizations for uncertainty quantification.

Comments: 5 pages, 8 figures

Abstract

The advent of machine learning (ML) and computer vision has significantly accelerated seismic inversion workflows by reducing the computational cost of traditionally expensive iterative methods. However, the development and evaluation of ML methods remain limited by the scarcity of realistic velocity models, as most high-quality data are privately owned by oil and gas companies. To address this gap, we present OpenSeisML, a collection of real seismic datasets designed to support generative AI (Gen-AI) workflows for seismic inversion. The datasets are curated from publicly available surveys in the UK National Data Repository (NDR). When seismic volumes are in the time domain and wells are in depth, a time-to-depth conversion is required. We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data. Here, we present an automated data curation pipeline that enables seismic data preparation while ensuring reproducibility. The objective is to train a generative model that captures the statistical distribution of subsurface properties, enabling the synthesis of multiple statistically consistent realizations for uncertainty quantification which can act as a prior for seismic inversion.

2605.20538 2026-05-21 cs.CV

Continual Segmentation under Joint Nonstationarity

Prashant Pandey, Himanshu Kumar, Devineni Sri Venkatraya Chowdary, Brejesh Lall

AI summary: This paper studies continual semantic segmentation under joint nonstationarity and proposes a method based on gradient-adaptive stabilization and semi-supervised learning to counter the instability and overfitting caused by distribution drift, validating it across several regimes.

Abstract

Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.

2605.20537 2026-05-21 cs.CL

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Robert Leaman, Rezarta Islamaj, Zhiyong Lu

AI summary: This paper proposes a corpus-centric diagnostic framework for analyzing benchmark-relevant properties of biomedical NER and entity linking corpora, showing how corpus characteristics affect the evaluation signal, the generalization demands, and the scope of benchmarking conclusions.

Comments: Accepted to the ACL 25th Workshop on Biomedical Language Processing

Abstract

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

2605.20536 2026-05-21 cs.CV

HADS-Net: A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification

Chinedu Emmanuel Mbonu, Blessing Nwamaka Iduh, Joseph Ikechukwu Odo, Doris Chinedu Asogwa

AI summary: This paper proposes HADS-Net, a hybrid attention-augmented dual-stream network that exploits global texture and local boundary cues through two parallel pathways, combined with physics-informed augmentation, to improve the accuracy of breast ultrasound image classification.

Comments: 7 pages, 4 figures

Abstract

Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.
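
Stream 2 in this abstract starts from Sobel edge maps. A minimal sketch of that single preprocessing step is shown below, assuming a grayscale image stored as a list of rows and keeping interior pixels only; the function name is hypothetical, and the lightweight CNN, the projection, and the cross-attention fusion are not modelled.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical intensity gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_edge_map(image):
    """Gradient magnitude at each interior pixel of a 2D grayscale image."""
    height, width = len(image), len(image[0])
    edges = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            row.append(math.hypot(gx, gy))
        edges.append(row)
    return edges


# A vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1] for _ in range(3)]
print(sobel_edge_map(image))  # [[0.0, 4.0, 4.0]]
```

The response is zero in the flat region and non-zero where the intensity changes, which is why an edge map of this kind emphasizes lesion boundaries rather than texture.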

2605.20534 2026-05-21 cs.LG cs.AI stat.ML

Axiomatizing Neural Networks via Pursuit of Subspaces

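Mehmet Yamac, Mert Duman, Ugur Akpinar, Felix Rojas Casadiego, Serkan Kiranyaz, Marcel van Gerven, Moncef Gabbouj

AI summary: This paper proposes a framework based on geometric axioms for explaining the behavior of neural networks; through a pursuit-of-subspaces hypothesis, it unifies perspectives on representation, computation, and generalization across shallow and deep architectures.

Comments: 43 pages, 25 figures. Code and additional materials will be released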

2605.20533 2026-05-21 cs.LG

Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates

Meng Zhu, Quan Xiao, Weidong Min

AI summary: This paper proposes the Ada2MS algorithm, which balances the strengths and weaknesses of AdamW and momentum SGD through continuous exponential interpolation between elementwise and global second-moment estimates, and achieves competitive results on vision tasks.

Abstract

Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usually has strong robustness across training scenarios, but its generalization performance is sometimes weaker than that of momentum methods. Momentum SGD can often obtain better generalization after careful tuning, but it is more sensitive to gradient-scale variation and hyperparameter settings. To balance the strengths and weaknesses of the two paradigms, this paper proposes Ada2MS, an optimization algorithm that achieves a smooth transition between AdamW-like behavior and momentum-SGD-like behavior through continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates. On the visual tasks evaluated in this study, Ada2MS obtains competitive results under a unified optimizer-comparison protocol. The code will be released at https://github.com/mengzhu0308/Ada2MS
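
The abstract does not give the interpolation formula, so the sketch below is only one possible reading of "exponential interpolation": a geometric mix `v_i^(1 - lam) * v_global^lam` of each elementwise second moment with their average. The formula, the function names, and the parameter `lam` are all assumptions made for illustration and should not be taken as the Ada2MS update rule.

```python
import math


def mixed_second_moment(v, lam):
    """Geometric interpolation between elementwise and global second moments.

    lam = 0 keeps the elementwise values (Adam-like scaling);
    lam = 1 replaces every entry by the global mean (a single shared scale).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    global_v = sum(v) / len(v)
    return [(vi ** (1.0 - lam)) * (global_v ** lam) for vi in v]


def scaled_update(momentum, v, lam, lr=1e-3, eps=1e-8):
    """Parameter update: momentum divided by the square root of the mixed moment."""
    mixed = mixed_second_moment(v, lam)
    return [-lr * m / (math.sqrt(s) + eps) for m, s in zip(momentum, mixed)]


v = [1.0, 4.0, 16.0]
print(mixed_second_moment(v, 0.0))  # elementwise values are kept
print(mixed_second_moment(v, 1.0))  # every entry becomes the global mean
```

Under this reading, intermediate values of `lam` shrink the spread of per-parameter step sizes continuously, which matches the abstract's description of a smooth transition between the two regimes.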

2605.20529 2026-05-21 cs.CL cs.AI

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

Claire Hobbs, R. Thomas McCoy

AI summary: This paper explores how statistical signals in linguistic input can aid the acquisition of syntax. It proposes the collocational bootstrapping hypothesis, in which regularities in word co-occurrence provide cues to syntactic dependencies, and tests whether this mechanism supports the acquisition of English subject-verb agreement.

Comments: Accepted to CoNLL

Abstract

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

AI summary: This paper presents NeuroQA, a large-scale visual question answering benchmark for 3D brain MRI with 56,953 QA pairs from 12,977 subjects, spanning ages 5-104 and five clinical domains. It evaluates 11 clinical reasoning skills on 3D volumes and provides reproducible generation scripts and an online leaderboard.

Comments: 30 pages, dataset and benchmark release

Abstract

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

2605.20523 2026-05-21 cs.LG cs.AI q-bio.QM

Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models

Athanasios Angelakis, Gabriele De Vito, Eleni-Myrto Trifylli, Filomena Ferrucci

AI summary: This paper studies machine-learning-enhanced non-invasive testing for detecting fibrosis in MASLD, comparing shallow-deep neural networks, FIB-4, tabular foundation models, and large language models across cohorts. The shallow-deep neural network gives a more balanced external operating profile while preserving the FIB-4 variable space.

Comments: 26 pages, 4 figures, 3 tables. Preprint

Abstract

Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.
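
The abstract refers to FIB-4's "fixed formula" and to Brier scores without stating either. The sketch below writes out the commonly published FIB-4 index, age × AST / (platelets × √ALT), and the Brier score as a mean squared error between predicted probabilities and binary outcomes. The formula is taken from general clinical literature rather than from this abstract, and the function and parameter names are illustrative assumptions.

```python
import math


def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """FIB-4 index: (age * AST) / (platelet count * sqrt(ALT))."""
    return (age_years * ast_u_per_l) / (platelets_1e9_per_l * math.sqrt(alt_u_per_l))


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical patient: 50 years, AST 40 U/L, ALT 36 U/L, platelets 200 x 10^9/L.
print(round(fib4(50, 40, 36, 200), 3))  # 1.667
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```

The index combines the same four measurements the study feeds to its models, which is why the abstract describes the learned models as staying within the FIB-4 variable space.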

2605.20521 2026-05-21 cs.LG cs.CR

An exponential mechanism based on quadratic approximations for fine-tuning machine learning models with privacy guarantees

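Hoang Tran, Jorge Ramirez, Jiayi Wang, Alberto Bocchinfuso, Christopher Stanley, M. Paul Laiu

AI summary: This paper proposes a randomized algorithm based on the exponential mechanism for fine-tuning pretrained models under differential privacy. It builds a utility function from a local quadratic approximation combined with information from the new dataset, and introduces a random-projection strategy to scale to high-dimensional models.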

English abstract

Fine-tuning adapts a pretrained machine learning model to a small, sensitive dataset, but this process risks memorizing individual new data points, making the model vulnerable to adversaries who seek to extract sensitive information. In this work, we develop a randomized algorithm based on the exponential mechanism for fine-tuning while ensuring differential privacy. Our key idea is to construct a simple utility function that combines a local quadratic approximation of the pretrained model with information from the new dataset. The resulting exponential mechanism admits exact sampling from a multivariate normal distribution in closed form. We establish theoretical privacy guarantees, sensitivity bounds, and accuracy estimations for our method. We further introduce a random-projection strategy that makes the approach scalable to high-dimensional models. Numerical experiments on the MNIST benchmark and the MIMIC clinical dataset demonstrate competitive performance against existing differentially private fine-tuning techniques.
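
To illustrate why a quadratic utility makes the exponential mechanism easy to sample, here is a minimal sketch under simplifying assumptions that are not taken from the paper: the utility is $u(\theta) = -\tfrac{1}{2}\sum_i a_i(\theta_i - m_i)^2$ with positive diagonal curvature $a_i$, and the mechanism samples with density proportional to $\exp(\varepsilon\, u(\theta) / (2\Delta))$, where $\Delta$ is the utility sensitivity. Under those assumptions the target is a Gaussian with mean $m$ and per-coordinate variance $2\Delta / (\varepsilon a_i)$. All names are hypothetical.

```python
import math
import random

def sample_quadratic_exponential_mechanism(mean, curvature_diag, epsilon, sensitivity, rng):
    """Draw one sample from the density exp(epsilon * u(theta) / (2 * sensitivity)),
    with the assumed utility u(theta) = -0.5 * sum_i a_i * (theta_i - m_i)^2.

    Completing the square shows each coordinate is Gaussian with
    variance 2 * sensitivity / (epsilon * a_i).
    """
    scale = 2.0 * sensitivity / epsilon
    return [m + math.sqrt(scale / a) * rng.gauss(0.0, 1.0)
            for m, a in zip(mean, curvature_diag)]

rng = random.Random(0)
theta = sample_quadratic_exponential_mechanism(
    mean=[0.5, -1.0],          # hypothetical utility maximizer
    curvature_diag=[4.0, 1.0], # hypothetical diagonal curvature
    epsilon=1.0,
    sensitivity=0.1,
    rng=rng,
)
```

In this toy setting a smaller `epsilon` widens the noise, so stronger privacy costs accuracy, and flatter directions of the utility (small `a_i`) receive more noise.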

2605.20520 2026-05-21 cs.AI

Open-World Evaluations for Measuring Frontier AI Capabilities

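Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

AI summary: This paper proposes open-world evaluations as a complementary approach that assesses long-horizon, messy, real-world tasks through small-sample qualitative analysis in order to measure AI capabilities more accurately, and introduces the CRUX project as an effort to run such evaluations regularly.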

English abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

2605.20515 2026-05-21 cs.LG eess.SP

Online Conformal Prediction with Corrupted Feedback

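Bowen Wang, Matteo Zecchin, Osvaldo Simeone

AI summary: This paper studies the robustness of online conformal prediction when feedback is corrupted, proposes two robust schemes, and validates experimentally that they improve performance under corrupted feedback.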

English abstract

Modern artificial intelligence systems require calibrated uncertainty estimates that remain reliable in sequential and non-stationary environments. Online conformal prediction (OCP) addresses this challenge through adaptively updated prediction sets that provide deterministic long-run miscoverage guarantees. These guarantees, however, hinge on the assumption of perfect feedback about the coverage of past prediction sets. In practice, the observed miscoverage indicator may be corrupted by noise, communication failures, or adversarial manipulation, which can severely degrade OCP's calibration guarantees. In this paper, we study OCP under corrupted feedback. We first model feedback corruption as an arbitrary binary flip sequence, and analyze how feedback corruption affects and degrades the miscoverage performance of standard OCP. We then propose two robust schemes: robust OCP via filtering, which leverages the structural properties of the predicted threshold to filter corrupted feedback, and robust OCP via active compensation, which incorporates an active compensation mechanism to mitigate the effect of corrupted feedback. For both methods, we establish explicit miscoverage guarantees, which are further specialized for an independent stochastic flip model and for an arbitrary error model with memory bounds. Experiments on real-world datasets validate the proposed approach, showing markedly improved calibration and significantly smaller prediction sets compared with baseline OCP methods under corrupted feedback.
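
A small simulation can make the failure mode concrete. The sketch below uses the threshold update $\lambda_{t+1} = \lambda_t + \eta(\mathrm{err}_t - \alpha)$ that is common in the online conformal prediction literature, and flips the observed miscoverage bit independently with a fixed probability. The synthetic uniform scores, the flip rate, and all names are assumptions for illustration; the paper's robust schemes (filtering and active compensation) are not reproduced here.

```python
import random

def run_ocp(scores, flips, alpha=0.1, step=0.05, threshold=0.5):
    """Return the true long-run miscoverage rate of a threshold-based OCP loop.

    The prediction set at round t covers the outcome iff score <= threshold.
    The learner only observes the miscoverage bit XOR the corruption flip.
    """
    misses = 0
    for score, flip in zip(scores, flips):
        miss = 1 if score > threshold else 0      # true miscoverage indicator
        misses += miss
        observed = miss ^ flip                    # possibly corrupted feedback
        threshold += step * (observed - alpha)    # standard OCP update
    return misses / len(scores)

rng = random.Random(0)
T = 5000
scores = [rng.random() for _ in range(T)]                      # synthetic scores in [0, 1)
no_flips = [0] * T
random_flips = [1 if rng.random() < 0.05 else 0 for _ in range(T)]

clean_rate = run_ocp(scores, no_flips)          # stays close to alpha
corrupted_rate = run_ocp(scores, random_flips)  # typically drifts away from alpha
```

With clean feedback the update keeps the empirical miscoverage near `alpha`. With independent flips of probability $p$, the observed rate is still driven to `alpha`, so in this toy model the true rate settles near $(\alpha - p)/(1 - 2p)$ instead, which is the kind of degradation the abstract describes.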

2605.20510 2026-05-21 cs.CV cs.AI cs.CY

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

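Longchao Da, Mithun Shivakoti, Xiangrui Liu, T Pranav Kutralingam, Yezhou Yang, Hua Wei

AI summary: This paper presents ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. Its multimodal data supports shade generation, shade segmentation, and 3D building reconstruction, and it provides standardized evaluation protocols and baseline methods as a foundation for data-driven urban climate research and heat-resilient urban planning.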

Comments 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track

English abstract

Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.

2605.20506 2026-05-21 cs.LG cs.CL

Reinforcing Human Behavior Simulation via Verbal Feedback

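Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, Maarten Sap

AI summary: This paper presents DITTO, a model that improves LLM simulation of human behavior by treating verbal feedback as a first-class signal in reinforcement learning, and introduces the SOUL benchmark suite, reporting marked performance gains across multiple tasks.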

English abstract

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.

2605.20502 2026-05-21 cs.LG cs.AI cs.CV stat.AP stat.ML

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

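Neelkamal Bhuyan

AI summary: This paper proposes a multi-encoder fusion of representation-space diffusion models. It statistically analyzes each encoder's sensitivity to specific distribution-shift types and introduces the EncMin2L gating mechanism, which improves out-of-distribution detection at lower parameter cost without using OOD labels and reaches an AUROC of at least 0.94 on all four shift types.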

Comments 14 pages

English abstract

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L -- an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics -- $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions) -- quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.
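
The Tippett combination itself is simple to state. As a hedged sketch, assuming $k$ independent per-encoder p-values, the combined p-value under the global null is $1 - (1 - p_{\min})^k$. The empirical p-value helper and every name below are illustrative assumptions; the paper's two-level gate and its calibration procedure are not reproduced.

```python
def empirical_p_value(score, id_scores):
    """p-value of a log-likelihood score against in-distribution calibration scores.

    Assumption: low likelihood is evidence of OOD, so the p-value is the
    (smoothed) fraction of calibration scores that are at most `score`.
    """
    at_most = sum(1 for s in id_scores if s <= score)
    return (1 + at_most) / (1 + len(id_scores))

def tippett_combine(p_values):
    """Tippett's minimum-p combination for k independent p-values."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k

# Three hypothetical encoders; only the first one flags the input.
combined = tippett_combine([0.01, 0.80, 0.90])
```

Because the statistic depends only on the smallest p-value, a single encoder that is sensitive to the shift at hand can drive the decision, which matches the specialization argument in the abstract.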

2605.20495 2026-05-21 cs.CV

A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

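Abhiram Kandiyana, Ankur Mali, Lawrence O. Hall, Peter R. Mouton, Dmitry Goldgof

AI summary: This paper proposes a human-in-the-loop framework that treats prompt-set construction for microscopy vision-language models as a target-driven active learning problem, reaching the target classification performance with fewer expert-verified images.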

Comments Accepted to CVPR workshops, 2026

English abstract

Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.

2605.20494 2026-05-21 cs.LG physics.ao-ph stat.AP

A 10,000-Year Global Stochastic Tropical Cyclone Catalog with Wind-Dependent Track Transitions (WHITS)

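Jennifer Nakamura, Upmanu Lall

AI summary: This paper presents WHITS, a non-parametric semi-Markov track generator that produces a global 10,000-year synthetic tropical cyclone catalog, intended to make insured-loss assessment more reliable.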

English abstract

Reliable assessment of tropical cyclone (TC) risk is limited by the brevity and spatial sparsity of the historical record, particularly for the rare, high-intensity landfalls that dominate insured loss. We present WHITS (Wind-focused Hurricane Interactive Track Simulator), a non-parametric semi-Markov track generator that extends the HITS framework of Nakamura et al. (2015) in three ways: transitions between historical track segments are conditioned on local wind speed in addition to position, age, and forward vector; the kernel selection on the comparative-vector term is sharpened to suppress dynamically inconsistent jumps; and a short smoothing window is applied across each transition to remove the position and wind discontinuities reported by downstream surge users. WHITS is fit to the full available best-track record in each of six basins in IBTrACS, extending in the North Atlantic to 1851 and in other basins to the earliest year of reliable best-track data. The resulting 10,000-yr global synthetic catalog reproduces observed track density and the annual hurricane/typhoon-force wind-hit probability across all basins. The catalog is intended for catastrophe-risk applications where a large, low-bias sample of physically plausible tracks is more useful than a small, statistically corrected one.

2605.20485 2026-05-21 cs.LG

ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

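May Hamri, Inbal Talgam-Cohen

AI summary: This study proposes ZEBRA, a framework that casts multi-phase budget allocation as a continuous nonlinear knapsack problem in order to split a budget across the phases of a multi-agent pipeline. Experiments show that it outperforms direct allocation by an LLM on multiple tasks.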

English abstract

As autonomous agents increasingly execute end-to-end tasks under fixed monetary budgets, the pressing open question shifts from whether the budget is respected, to how to spend it effectively. Existing budget-aware methods typically control reasoning step-by-step within a single agent, or learn resource allocation policies via RL. None address how to split a budget across the composing phases of a multi-agent pipeline at inference time. We propose ZEBRA, a zero-shot framework that reduces multi-phase budget allocation to a continuous nonlinear knapsack problem: an LLM controller estimates per-phase utility curves, and a water-filling search on the Lagrange multiplier returns the per-phase split. Additive and multiplicative aggregations are unified under the same solver. On a $150$-task APPS coding benchmark, both ZEBRA variants outperform LLM-direct (budget allocation directly by an LLM) on every aggregate metric. At a budget of $α= 0.5$ of the unconstrained spend, ZEBRA recovers $94.4\%$ of unconstrained quality, versus $88.1\%$ for LLM-direct. The advantage is statistically significant and transfers beyond coding: on a $3$-phase HotpotQA pipeline, ZEBRA beats LLM-direct by $14.3$pp, with allocations empirically robust to curve-estimation noise. On HotpotQA, ZEBRA arrives at a different budget split (near-balanced) compared to the APPS one (skewed towards a refinement phase), showing adaptation to the pipeline structure. More broadly, we show that lightweight algorithmic guidance at inference time can improve the economic behavior of autonomous multi-agent systems.
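
The water-filling step can be sketched for one concrete family of utility curves. The example below assumes additive aggregation and concave curves of the form $u_i(b) = a_i \log(1 + b / c_i)$, which is an illustrative choice and not the paper's; in ZEBRA the curves are estimated by an LLM controller. For this family the optimality condition $u_i'(b_i) = \lambda$ gives $b_i(\lambda) = \max(0,\, a_i/\lambda - c_i)$, and bisection on the multiplier $\lambda$ finds the split that spends the budget. All names are hypothetical.

```python
def water_filling_split(weights, offsets, budget, iterations=100):
    """Split `budget` across phases to maximize sum_i a_i * log(1 + b_i / c_i).

    weights: the a_i (assumed utility scale per phase)
    offsets: the c_i (assumed saturation scale per phase)
    """
    def split(multiplier):
        # Per-phase spend at which marginal utility equals the multiplier.
        return [max(0.0, a / multiplier - c) for a, c in zip(weights, offsets)]

    low = sum(weights) / (budget + sum(offsets))          # spends at least the budget
    high = max(a / c for a, c in zip(weights, offsets))   # spends nothing
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if sum(split(mid)) > budget:
            low = mid     # overspending: raise the multiplier
        else:
            high = mid    # within budget: lower the multiplier
    return split(high)    # feasible by construction

# Three hypothetical phases sharing a budget of 6 units.
allocation = water_filling_split([3.0, 1.0, 2.0], [1.0, 1.0, 1.0], 6.0)
# approximately [3.5, 0.5, 2.0]
```

Phases whose marginal utility never reaches the multiplier receive nothing, which is how a water-filling solution can produce skewed splits for one pipeline and near-balanced ones for another.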

2605.20484 2026-05-21 cs.RO

Enhancing Graph-Based SLAM in GNSS-Denied environments by leveraging leg odometry

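Léon Perruchot-Triboulet, Luc Jaulin, Kai Xiao

AI summary: This paper proposes a factor-graph-based architecture that combines proprioceptive leg odometry with LiDAR-inertial odometry, effectively reducing visual drift in GNSS-denied environments and improving the robustness of SLAM.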

Comments 4 pages, 3 figures, 2 tables, for ICRA workshop on Robot Meets GNSS and Ranging for Seamless Autonomy

2605.20482 2026-05-21 cs.LG cs.SY eess.SY

Quadratic Characterizations for Reachability Analysis of Neural Networks

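Elias Khalife, Mazen Farhood, Pierre-Loic Garoche

AI summary: This paper proposes a framework for constructing verified quadratic characterizations of scalar relations in the two-dimensional real plane. Candidate quadratic inequalities are generated locally and then verified globally, with the aim of making reachability analysis of neural networks less conservative.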

English abstract

Quadratic constraints (QCs) are widely used to characterize nonlinearities and uncertainties, but generic analytical characterizations can be conservative on bounded domains. This paper develops a framework for constructing verified quadratic characterizations of scalar relations in the two-dimensional real plane. Candidate quadratic inequalities are locally generated by solving convex quadratic programs using samples from the relation and exterior sample points. They are then verified globally using sum-of-squares certificates over an exact semialgebraic description or, in the case of nonpolynomial relations, over relaxed polynomial descriptions. The resulting verified constraints define a sound overapproximation of the scalar relations over the considered domains. These constraints are directly compatible with existing analysis frameworks based on QCs and pointwise integral quadratic constraints (IQCs) for static nonlinearities and uncertainties, and they can also be embedded in QC-based semidefinite programs for reachability and safety analysis of feedforward neural networks. For smooth activations such as $\tanh$, the method yields domain-dependent quadratic characterizations that constitute an alternative to generic sector- or slope-based descriptions. For ReLU networks, we give methods to reduce conservatism in QC-based reachability analysis of feedforward networks by exploiting dependencies between neurons and tighter local bounds. Numerical examples demonstrate improved reachability results for smooth activations, reduced conservatism for ReLU networks, and applicability beyond neural networks through an example involving saturation.
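
To see why generic characterizations can be conservative on a bounded domain, consider $\tanh$. Globally, $y = \tanh(x)$ satisfies the sector $[0, 1]$ constraint $y(x - y) \ge 0$, while on $|x| \le R$ it satisfies the tighter sector $[\tanh(R)/R,\, 1]$. The sketch below only checks these two textbook quadratic constraints on sample points; it does not implement the paper's QP-based candidate generation or its sum-of-squares verification, and the function names are hypothetical.

```python
import math

def sector_qc(x, y, lower, upper):
    """Sector quadratic constraint (y - lower*x) * (upper*x - y).

    The value is nonnegative exactly when y lies between lower*x and upper*x.
    """
    return (y - lower * x) * (upper * x - y)

def min_qc_on_grid(radius, lower, upper, points=2001):
    """Smallest constraint value for y = tanh(x) on a uniform grid over [-radius, radius]."""
    xs = [-radius + 2.0 * radius * i / (points - 1) for i in range(points)]
    return min(sector_qc(x, math.tanh(x), lower, upper) for x in xs)

R = 1.0
local_lower = math.tanh(R) / R                    # domain-dependent lower slope

global_min = min_qc_on_grid(R, 0.0, 1.0)          # generic sector [0, 1]
local_min = min_qc_on_grid(R, local_lower, 1.0)   # tighter sector, valid on |x| <= R
outside = sector_qc(2 * R, math.tanh(2 * R), local_lower, 1.0)  # x outside the domain
```

Both minima are nonnegative on the domain, while `outside` is negative: the tighter constraint is only sound where it was derived, which is why domain-dependent characterizations need a verification step over the stated domain.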

2605.20479 2026-05-21 cs.CV cs.LG

Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

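Jianmin Liao, Lixin Shen, Yuesheng Xu

AI summary: This study proposes HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations to predict heterogeneous hyperparameters for new denoiser-noise configurations. In a cross-paradigm experiment, it transfers from relatively cheap TV/TGV variational sources to the more expensive diffusion-based DiffPIR and reaches near-oracle performance with few or no target oracle labels.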

English abstract

Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser--noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only $2$ target oracle labels, it reaches $30.23$\,dB, within $0.90$\,dB of the oracle, and outperforms the $64$-label per-configuration predictor trained from scratch, using $1/32$ as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

2605.20478 2026-05-21 cs.CL

Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

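Chen Shen

AI summary: This paper studies how LLM-curated tables can contain rows that their cited sources do not support, and proposes Stage-Audit, which combines disjoint curator and auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy to raise source-frontier precision and F1 while maintaining per-row source traceability.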

Comments 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild

English abstract

LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.
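
The relative gains quoted above follow directly from the reported scores. The short sketch below reproduces that arithmetic and shows one plausible way to score a discovered page set; the set-based precision, recall, and F1 definition and the page names are assumptions for illustration, since the abstract does not specify the exact scoring procedure.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before

def precision_recall_f1(predicted_pages, gold_pages):
    """Set-based scores for a discovered page frontier (assumed definition)."""
    hits = len(predicted_pages & gold_pages)
    precision = hits / len(predicted_pages) if predicted_pages else 0.0
    recall = hits / len(gold_pages) if gold_pages else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

precision_gain = relative_gain(0.356, 0.505)  # about 0.42, matching "+42% relative"
f1_gain = relative_gain(0.334, 0.451)         # about 0.35, matching "+35%"

# Hypothetical frontier: four pages proposed, two of three gold pages found.
scores = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
```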

2605.20477 2026-05-21 cs.LG cs.AI cs.CL

Training Language Agents to Learn from Experience

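Yuval Shalev, Zifeng Ding, Mateja Jamnik

AI summary: This paper introduces In-context Training (ICT), a task framework for evaluating cross-task self-improvement in language agents, together with an RL-based training pipeline that learns reflections directly from experience. The trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned.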

English abstract

Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor's performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.
