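arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.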

2512.13593 2026-05-21 cs.LG

Verification of Unknown Dynamical Systems via Autoencoder Latent Space

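Robert Reed, Luca Laurenti, Morteza Lahijanian

AI summary: This paper proposes a learning-based approach that uses convex autoencoders and kernel methods to reduce the dimensionality of a dynamical system and verify its behavior in the latent space, making formal verification more tractable in high dimensions.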

Comments 25 pages, 6 figures, under review

Abstract

Formal verification provides a powerful framework for proving that dynamical systems satisfy their specifications. However, these techniques face scalability challenges in high-dimensional settings, as they often rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring correctness of latent space verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements.

2512.03671 2026-05-21 cs.CL

Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian Context

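Beatrice Savoldi, Giuseppe Attanasio, Olga Gorodetskaya, Marta Marchiori Manerba, Elisa Bassignana, Silvia Casola, Matteo Negri, Tommaso Caselli, Luisa Bentivogli, Alan Ramponi, Arianna Muti, Nicoletta Balbo, Debora Nozza

AI summary: Through empirical analysis, this study examines the adoption, literacy, and usage patterns of generative AI and their effects on different groups in the Italian context. It finds that digital literacy is a key factor in how people make use of AI, not merely whether they use it.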

Abstract

The rise of generative AI (GenAI) chatbots accessible via conversational interfaces is transforming digital interactions and holds economic promise. However, these tools might deepen existing inequalities -- not only through uneven, socially stratified adoption, but through differentials in their purposeful, critical use. Drawing on original survey data from 1,906 Italian-speaking adults, we provide a comprehensive analysis of GenAI adoption, literacy, and usage patterns. Our findings show that GenAI is supporting diversified personal and professional activities and replacing traditional information-seeking tools. Yet less-educated and older individuals, and those with lower technology familiarity, are less likely to adopt it; 40% cite competence barriers as a key obstacle. Among users, AI training emerges as the primary predictor of purposeful, capital-enhancing engagement -- content creation, learning, and creativity enhancement -- while more passive, recreational uses (e.g., companionship, information seeking) remain insensitive to competence levels. We thus highlight digital literacy as a lever for how people leverage GenAI, not just whether they use it. Finally, gender operates as a persistent cross-cutting divide, shaping both adoption and usage frequency. These findings challenge the assumption that high accessibility translates into broadly shared gains. Rather, they offer a granular, multi-level account of emerging disparities in the GenAI era -- with implications for how this technology may ultimately drive outcomes and benefit divides.

2510.09060 2026-05-21 cs.AI cs.CV

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

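Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, Yang You

AI summary: This paper proposes a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Guidance that is geometrically decoupled from the mode's quality-seeking direction encourages lateral spread among trajectories, while a time-scheduled stochastic perturbation reintroduces uncertainty, improving diversity without degrading image detail or prompt fidelity.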

Abstract

Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the mode's quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
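
The orthogonality constraint in the abstract can be illustrated with plain vector arithmetic. The sketch below is not the authors' implementation: the function names, the linear noise schedule `sigma_at`, and the Euler-style update are all assumptions made for illustration. It removes from a random perturbation its component along the flow velocity, so the added noise does not change progress along the generation direction.

```python
import math
import random


def project_orthogonal(noise, velocity, eps=1e-12):
    """Return `noise` with its component along `velocity` removed."""
    vv = sum(v * v for v in velocity)
    if vv < eps:
        # Velocity is (numerically) zero, so there is no direction to remove.
        return list(noise)
    coeff = sum(n * v for n, v in zip(noise, velocity)) / vv
    return [n - coeff * v for n, v in zip(noise, velocity)]


def sigma_at(t, sigma_max=0.1):
    """Hypothetical schedule: strongest noise at t=0, none at t=1."""
    return sigma_max * (1.0 - t)


def perturbed_euler_step(x, velocity, t, dt, rng):
    """One Euler step along the flow plus an orthogonal random perturbation."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    ortho = project_orthogonal(noise, velocity)
    scale = sigma_at(t) * math.sqrt(dt)
    return [xi + dt * vi + scale * oi for xi, vi, oi in zip(x, velocity, ortho)]


rng = random.Random(0)
x_next = perturbed_euler_step([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], t=0.0, dt=0.1, rng=rng)
print(x_next)  # the third coordinate moves only by the flow term dt * 2.0
```

Because the velocity here points along the third axis, the perturbation only spreads the sample in the first two coordinates, which is the "lateral" behavior the abstract describes.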

2510.05942 2026-05-21 cs.CL cs.AI

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

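Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri

AI summary: This paper presents the EvalMORAAL framework, which evaluates moral alignment in 20 large language models using two scoring methods and a model-as-judge peer review, and finds a marked moral alignment gap between Western and non-Western regions.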

Comments Accepted as a poster at *SEM 2026

Abstract

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
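
The headline numbers are Pearson correlations between model scores and survey values. As a reminder of what that statistic computes, here is a minimal sketch. The sample values are made up for illustration and are not data from the paper; only the two regional averages (0.82 and 0.61) come from the abstract.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical per-topic scores: model output vs. survey mean.
model_scores = [0.1, 0.4, 0.5, 0.9]
survey_means = [0.2, 0.3, 0.6, 0.8]
print(pearson_r(model_scores, survey_means))

# Regional gap from the averages reported in the abstract.
regional_gap = round(0.82 - 0.61, 2)
print(regional_gap)  # 0.21
```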

2509.17396 2026-05-21 cs.CL

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

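Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

AI summary: This paper presents EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. It bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, improving accuracy and reducing latency and peak memory usage across several LongConvQA benchmarks.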

Comments ICML 2026

Abstract

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model's memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.
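
To see why block-wise prefill bounds peak memory, consider the toy sketch below. It is an illustration under stated assumptions, not EpiCache itself: the cache holds plain tokens instead of key/value tensors, and the hypothetical `score` callback stands in for whatever episode-specific relevance measure the method uses. Evicting after every block keeps the cache at or below `budget + block_size` entries, whereas evicting only after the full context would let it grow with the conversation length.

```python
def blockwise_prefill(tokens, score, block_size, budget):
    """Process `tokens` block by block, evicting after each block.

    Returns the final cache and the peak number of entries held at once.
    """
    cache = []  # list of (position, token) pairs
    peak = 0
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        cache.extend((start + i, tok) for i, tok in enumerate(block))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the `budget` highest-scoring entries, then restore order.
            cache.sort(key=lambda entry: score(entry[1]), reverse=True)
            cache = sorted(cache[:budget])
    return cache, peak


# Toy run: 100 tokens, where a larger token value means "more relevant".
cache, peak = blockwise_prefill(list(range(100)), score=lambda tok: tok,
                                block_size=10, budget=20)
print(len(cache), peak)  # 20 30
```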

2506.16950 2026-05-21 cs.CV cs.LG

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

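Fanfei Li, Thomas Klein, Wieland Brendel, Robert Geirhos, Roland S. Zimmermann

AI summary: This paper introduces LAION-C as a benchmark alternative to ImageNet-C for evaluating out-of-distribution robustness in the era of web-scale datasets. It contributes six novel distortion types designed to be out-of-distribution, and finds that LAION-C is challenging for contemporary models while the best models now match or outperform the best human observers.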

Comments ICML 2025 camera ready version

Abstract

Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

2406.03506 2026-05-21 cs.LG cs.AI

Fuzzy Convolution Neural Networks for Tabular Data Classification

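Arun D. Kulkarni

AI summary: This paper proposes a fuzzy convolution neural network (FCNN) for tabular data classification. Feature values are mapped to fuzzy memberships and converted into images used to train a CNN, which learns effectively from tabular data and performs competitively with or better than existing methods.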

Comments 10 pages, 16 figures, Submitted to IEEE Access

DOI: 10.1109/ACCESS.2024.3479882
Journal ref: IEEE Access, vol. 12, pp. 151846-151855 (2024)

Abstract

Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. There are many fields, such as bioinformatics, finance, and medicine, where nonimage data are prevalent. Adaptation of CNNs to classify nonimage data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework, the fuzzy convolution neural network (FCNN), tailored specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used a randomly selected seventy percent of the samples from each data set for training and thirty percent for testing. The data sets were also classified using state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our findings suggest that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh perspective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.
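
The mapping from feature values to fuzzy memberships can be sketched as follows. The abstract does not state the shape or number of membership functions, nor how the membership vectors are laid out as images, so everything specific here is an assumption: three triangular sets (low, medium, high) over a feature normalised to [0, 1], and one image row per feature.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 at `left` and `right`, 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over a feature normalised to [0, 1].
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}


def memberships(value, lo, hi):
    """Membership grades (low, medium, high) for one raw feature value."""
    span = hi - lo
    x = 0.5 if span == 0 else (value - lo) / span
    x = min(1.0, max(0.0, x))  # clamp values outside the known range
    return [triangular(x, *FUZZY_SETS[name]) for name in ("low", "medium", "high")]


def to_image(features, bounds):
    """One row per feature, one column per fuzzy set: a small 2D grid."""
    return [memberships(v, lo, hi) for v, (lo, hi) in zip(features, bounds)]


image = to_image([0.0, 5.0, 10.0], [(0.0, 10.0)] * 3)
print(image)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

A grid like this could then be passed to any 2D convolutional classifier; a value between two peaks, such as 2.5 on a 0 to 10 scale, belongs partly to "low" and partly to "medium".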

2305.09620 2026-05-21 cs.CL cs.AI cs.LG

AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

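Junsol Kim, Byungkyu Lee

AI summary: This paper proposes an LLM-based framework that predicts missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods, addressing the limited ability of traditional surveys to capture historical change.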

Abstract

Nationally representative surveys track public opinion, yet they ask only a limited set of questions each year, limiting their potential to capture historical changes. To fill this gap, we develop a large language model (LLM)-based framework for predicting missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Surveys, our LLM-based models perform strongly in retrodicting masked GSS opinions through cross-validation and public opinions measured by other organizations in years when the GSS did not ask them. These capabilities enable us to recover missing trends and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. However, performance remains modest for unasked opinion prediction. We show when our models outperform established benchmarks, examine which opinions and respondents are more predictable, and evaluate whether our approach reduces LLMs' tendency to homogenize predicted responses. Our study demonstrates that LLMs and surveys can mutually enhance each other: LLMs broaden survey potential, while surveys calibrate LLMs for simulating human opinions.

2605.21483 2026-05-21 astro-ph.CO cs.LG

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

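Tilman Tröster, David Mirkovic, Veronika Oehl, Arne Thomsen

AI summary: This study presents Velocityformer, an equivariant graph transformer architecture that improves the accuracy of cosmological velocity reconstruction by matching the broken symmetry of the observational data, raising the velocity correlation coefficient r by 35% over the standard linear theory baseline.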

Abstract

Precise measurement of the kinematic Sunyaev-Zel'dovich (kSZ) effect - a probe of the large-scale distribution of baryonic matter, a key observable for cosmological inference - requires accurate reconstruction of galaxy velocities from spectroscopic surveys. The signal-to-noise ratio (SNR) of kSZ measurements scales directly with the correlation coefficient $r$ between reconstructed and true velocities. We introduce Velocityformer, an equivariant graph transformer architecture designed to match the specific symmetry of the observational data. While the underlying physics is equivariant with respect to translations and rotations, observational effects break this symmetry due to the preferred line-of-sight direction. Matching the model's inductive bias to the data's broken symmetry consistently improves performance across all model sizes and training volumes, with Velocityformer improving $r$ by 35% over the standard linear theory baseline and outperforming ML baselines at every data volume. By matching the model's inductive bias to the data and conditioning on the physics-based long-wavelength solution, Velocityformer is highly data-efficient, training to high accuracy on as few as 4 low-fidelity simulations, and generalises zero-shot across input geometry, cosmological parameters, and galaxy sample. On high-fidelity simulated galaxy catalogues, this yields a 30% improvement in $r$ over the physical baseline, directly translating to the same SNR gain on observational data.

2605.21453 2026-05-21 cs.SE cs.AI

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

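Mohamed Almukhtar, Anwar Ghammam, Hua Ming

AI summary: Analyzing Python refactoring pull requests from the AIDev dataset, this study examines how AI-generated code affects code quality and security. Agentic commits improve a quality attribute in 22.5% of cases but also introduce new code issues; the study derives a taxonomy of 24 recurring change operations and argues for stronger quality and security gating.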

Abstract

As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues, predominantly convention-level violations such as long lines, while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.

2605.21439 2026-05-21 eess.SY cs.RO cs.SY

Fully Actuated Manifold Constraint Based Output Feedback Control for Input-Constrained Uncertain Nonlinear Systems

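Dianrui Mu, Changchun Hua, Yafeng Li, Jiannan Chen, Rao Wei

AI summary: This paper proposes a low-complexity, model-free output feedback controller for unknown time-varying nonlinear systems with unknown input constraints. It achieves a prescribed control accuracy and retains flexible control accuracy after actuator saturation. The method extends existing linear manifold constraint control methods with the construction of nonlinear manifolds and a variety of constraint types, so that the prescribed accuracy is reached in finite or fixed time. In addition, error-driven flexible constraints enable flexible control under unknown saturation. Control examples and simulations for second-order and higher-order systems are provided.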

Comments 22 pages, 12 figures, 2 tables

2605.21437 2026-05-21 physics.geo-ph cs.LG stat.ML

Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment

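Alim Igilik

AI summary: This paper proposes a neural-network-based seismicity forecasting method that estimates a per-cell dispersion parameter and assesses tail risk, relaxing the traditional Poisson assumption and improving the forecasting of extreme events.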

Comments 28 pages, 9 figures. Source code available at https://github.com/Al1mkaYandere/seismic-probabilistic-modeling

Abstract

Standard approaches to forecasting the weekly number of earthquakes on a spatial grid rely on the Poisson distribution with a single global dispersion assumption. We show that this assumption is systematically violated in seismic data from Central Asia (2010-2024), where a likelihood-ratio test with boundary correction strongly rejects the Poisson hypothesis (p < 10^{-179}). The main contribution of this work is the EarthquakeNet architecture, which provides an endogenous per-cell estimate of the overdispersion parameter alpha via a neural network (spatial embeddings + MLP), without explicit spatial covariance specification. In contrast to existing negative binomial regression approaches in seismological forecasting, which typically assume a single global alpha, the proposed per-cell formulation allows the model to identify spatial heterogeneity in seismic clustering and to construct probabilistic risk-aware alerts via quantiles of the predicted distribution. A walk-forward evaluation (2018-2023) over four systems shows an 8.6 percent reduction in mean pinball deviation (MPD) relative to a negative binomial GLM baseline. The strongest improvements are observed in the tail regime (Y >= 5), where the continuous ranked probability score (CRPS) of the proposed model is 12.5 percent lower than that of the baseline, indicating improved calibration in extreme-event forecasting.
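
The role of a per-cell dispersion parameter can be illustrated with the negative binomial distribution itself. The sketch below is not the EarthquakeNet code: it assumes the common NB2 parameterisation (variance = mu + alpha * mu^2), and the helper names are hypothetical. It shows that two grid cells with the same expected weekly count but different alpha have different upper quantiles, which is what a quantile-based alert would act on, and it includes the pinball loss that underlies quantile scoring.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y for mean mu and dispersion alpha.

    Assumed parameterisation (NB2): variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def nb_quantile(q, mu, alpha, max_count=10_000):
    """Smallest count whose cumulative probability reaches q."""
    cdf = 0.0
    for y in range(max_count + 1):
        cdf += math.exp(nb_log_pmf(y, mu, alpha))
        if cdf >= q:
            return y
    return max_count


def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for a single observation at level q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


# Two hypothetical cells with the same mean but different clustering.
calm_alert = nb_quantile(0.95, mu=2.0, alpha=0.1)
clustered_alert = nb_quantile(0.95, mu=2.0, alpha=2.0)
print(calm_alert, clustered_alert)  # the clustered cell needs a higher threshold
```

A single global alpha would assign both cells the same threshold, which is the limitation the abstract attributes to existing negative binomial baselines.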

2605.21405 2026-05-21 cs.SE cs.AI cs.PL

Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries

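Peng Ding, Rick Stevens

AI summary: Through the zerodep project, this paper examines whether the Python standard library alone can replace third-party libraries, and evaluates whether LLMs can generate correct, performant code under strict constraints.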

Comments 12 pages

Abstract

Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library -- and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories -- including serialization, networking, cryptography, agent protocols, and text processing -- zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5--115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at https://github.com/Oaklight/zerodep.

2605.21402 2026-05-21 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG

Memorisation, convergence and generalisation in generative models

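Antoine Maillard, Sebastian Goldt

AI summary: This paper studies the distinction between memorisation, convergence, and generalisation in generative models. Analysing linear generative models, it finds a transition from memorisation to generalisation when the number of samples is linear in the input dimension, and shows that generalisation comprises two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors.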

Abstract

Generative neural networks learn how to produce highly realistic images from a large, but finite number of examples - or do they simply memorise their training set? To settle this question, Kadkhodaie, Guth, Simoncelli and Mallat (ICLR '24) trained diffusion models independently on disjoint subsets of a dataset and showed that they converge to nearly the same density when the number of training images is large enough. This result raises two basic questions: how much data do you need for convergence, and what does convergence capture about learning the data distribution? Here, we address these questions by providing an exact analytical characterisation of the transition from memorisation to generalisation in linear generative models. We find that these models memorise at small load, while convergence emerges continuously when the number of samples is linear in the input dimension. Strikingly, we find that convergence is insensitive to recovery of the principal latent factors of the data, which are recovered in a sharp transition. After extending our approach to data with power-law spectra, we find the same distinction between convergence and latent recovery in our experiments with convolutional denoisers and in the data of Kadkhodaie et al. We thus show that generalisation in generative models decomposes into at least two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors. These objectives correspond to two different distances between true and learnt data distribution, and only the first one is captured by convergence.

2605.21401 2026-05-21 cs.CY cs.AI

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

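Roland Pihlakas, Jan Llenzl Dagohoy

AI summary: This study examines how open-source large language models behave under sustained authority pressure. In a Milgram-like setting the models tend to comply despite explicitly expressing distress, are vulnerable to gradual boundary/value violations, and may ignore response format requirements when refusing, which triggers a retry that can end in compliance.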

Comments 28 pages, 16 figures, 16 tables

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator, which causes a retry that can result in compliance with the underlying request even when refusal was intended initially; (4) we hypothesise that there is a low-level token pattern continuation attractor that might be contributing to compliance, overriding higher level processing of the situation's meaning and values.

2605.21390 2026-05-21 cs.HC cs.AI

Designing Conversations with the Dead: How People Engage with Generative Ghosts

Jack Manning, Daniel Sullivan, Dylan Thomas Doyle, Anthony T. Pinter, Jed R. Brubaker

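AI summary: A qualitative study of how people interact with generative ghosts. Participants preferred reincarnation for its immediacy, valued affective resonance over factual accuracy, and the interactions were always collaborative.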

Abstract

We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm -- factors unique to the user's memory of the deceased -- shape interactions with generative ghosts, and argue that those interactions are always collaborative.

2605.21384 2026-05-21 cs.SE cs.AI cs.CL

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

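AI summary: This work decomposes software engineering tasks to measure reward hacking in long-horizon coding agents, using the difference in pass rates between a visible and a hidden test suite. It introduces the SpecBench benchmark and shows that reward hacking grows markedly with task length.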

Abstract

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. We study this reward hacking phenomenon by decomposing software engineering tasks into three parts: (i) a natural language description of the specification, (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore, we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short-horizon tasks like building a JSON parser to ultra-long-horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on held-out suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.
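The pass-rate gap used to quantify reward hacking can be sketched in a few lines. This is only an illustration, not SpecBench's scoring code: the function names and the list-of-booleans input format are assumptions, and `extrapolated_gap` merely restates the reported trend of 28 percentage points per tenfold increase in code size relative to a hypothetical baseline.

```python
import math

def pass_rate(results):
    """Fraction of passing tests; `results` is a list of booleans."""
    if not results:
        raise ValueError("empty test suite")
    return sum(results) / len(results)

def reward_hacking_gap(visible, held_out):
    """Visible minus held-out pass rate, in percentage points."""
    return 100.0 * (pass_rate(visible) - pass_rate(held_out))

def extrapolated_gap(base_gap_pp, base_loc, loc, slope_pp_per_decade=28.0):
    """Log-linear trend: the gap grows by `slope_pp_per_decade` per tenfold code size."""
    return base_gap_pp + slope_pp_per_decade * math.log10(loc / base_loc)

# An agent that passes every visible test but only 6 of 10 held-out tests.
print(reward_hacking_gap([True] * 10, [True] * 6 + [False] * 4))
```

A gap near zero means the visible tests were passed by a solution that also composes correctly; a large gap suggests the solution was fitted to the visible suite.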

2605.21341 2026-05-21 stat.ML cs.LG

Semiparametric Efficient Bilevel Gradient Estimation

Fares El Khoury, Houssam Zenati, Nathan Kallus, Michael Arbel, Aurélien Bibaut

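AI summary: This paper proposes a semiparametric debiasing theory that removes first-order bias in bilevel gradient estimation. A cross-fitted orthogonal hypergradient estimator achieves asymptotic normality, and under quadratic losses it reduces to a doubly robust score based on conditional-mean nuisances.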

2605.21324 2026-05-21 q-bio.NC cs.LG

Stimulus symmetries can confound representational similarity analyses

Farhad Pashakhanloo, Jacob A. Zavatone-Veth

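AI summary: This study examines how symmetries in network inputs affect analyses based on representational similarity matrices (RSMs). Functionally equivalent configurations can yield different RSMs, and stochastic gradient descent or energetic regularization can produce sparse, drifting codes and hence drifting RSMs.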

Comments 40 pages

Abstract

What can representational similarity matrices (RSMs) tell us about a neural code? As the popularity of these summary statistics grows, so too does the need for a more complete characterization of their properties. Here, we show that symmetries in network inputs can confound RSM-based analyses. Stimulus symmetries render many representations functionally equivalent, but these different configurations can lead to different RSMs. These different RSMs reflect qualitatively different representational geometries. We show that stochastic gradient descent or energetic regularization can generate sparse, drifting codes, leading in turn to drifting RSMs. Moreover, we demonstrate that these phenomena are present in networks trained to encode image data, where the symmetry is latent. Our results illustrate the challenges inherent in comparing nonlinear neural codes, when functionally-equivalent representations are not related by a simple rotation.
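To make the closing point concrete, here is a toy sketch (not the paper's construction) with a linear-kernel RSM, i.e. the Gram matrix of the responses. Rotating the code leaves the RSM unchanged, while a hypothetical "split ReLU" re-encoding, from which each original coordinate is still linearly decodable as the difference of its two parts, changes it. All function names are assumptions.

```python
import math

def rsm(reps):
    """Linear-kernel RSM: Gram matrix of the per-stimulus response vectors."""
    return [[sum(x * y for x, y in zip(r, s)) for s in reps] for r in reps]

def rotate2d(reps, theta):
    """Rotate every 2-D response vector by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * x - s * y, s * x + c * y] for x, y in reps]

def split_relu(reps):
    """Re-encode each coordinate x as (max(x, 0), max(-x, 0)); x = first - second."""
    return [[max(v, 0.0) for x in r for v in (x, -x)] for r in reps]

def max_abs_diff(A, B):
    """Largest entrywise difference between two equally sized matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

reps = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(max_abs_diff(rsm(reps), rsm(rotate2d(reps, 0.7))))   # numerically zero
print(max_abs_diff(rsm(reps), rsm(split_relu(reps))))      # clearly nonzero
```

In the split code the first and third stimuli are orthogonal rather than anti-correlated, so two codes carrying the same linearly decodable information present different representational geometries.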

2605.21292 2026-05-21 stat.ML cs.AI cs.LG math.DS

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

Krishnakumar Balasubramanian

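AI summary: This paper studies the training dynamics of a two-factor linear transformer model at large learning rates. Large steps can change the training attractor rather than merely accelerate convergence: beyond the stability thresholds, training may enter cycles, bounded chaos, or divergence.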

Abstract

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter $μ$. On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for $0<μ<2$, it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.
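The abstract does not spell out the normalized map, so the sketch below uses an assumed stand-in: gradient descent with step size `mu` on the toy loss 0.5 * (a*b - 1)**2. On the balanced slice a = b = x this update becomes the scalar cubic map x ← x − mu·x·(x² − 1). The paper's normalization, and therefore its thresholds, may differ from this toy version.

```python
def two_factor_step(a, b, mu):
    """One gradient-descent step on the toy loss 0.5 * (a*b - 1)**2."""
    r = a * b - 1.0                      # residual of the product
    return a - mu * r * b, b - mu * r * a

def run(a, b, mu, steps):
    """Iterate the map and return the trajectory of the product a*b."""
    products = []
    for _ in range(steps):
        a, b = two_factor_step(a, b, mu)
        products.append(a * b)
    return products

# Small step: the product converges to the target value 1.
print(run(0.5, 0.5, 0.1, 500)[-1])
# Large step from the same family of initial conditions: the iterates blow up.
print(run(2.0, 2.0, 3.0, 2))
```

Sweeping `mu` between these two extremes and plotting the tail of `run(...)` is one way to explore the intermediate regimes the abstract lists, with the caveat above about normalization.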

2605.20706 2026-05-21 cs.DC cs.AI cs.LG

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, Tyler Sorensen

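AI summary: This paper presents LlamaWeb, a WebGPU-based LLM inference framework that reduces memory overhead through static memory planning and efficient model loading and supports multiple model weight formats, enabling memory-efficient, performance-portable LLM inference.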

Comments 19 pages, 11 figures, 5 tables

Abstract

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

2605.19362 2026-05-21 cs.HC cs.AI

Toward User Comprehension Supports for LLM Agent Skill Specifications

Zikai Alex Wen

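AI summary: This study asks whether skill specifications help users form bounded expectations about what a skill consumes, produces, and covers. Analyzing textual cues in 878 cybersecurity skills, it finds that only a small share of specifications contain the needed cues, and argues that specifications should be treated as user-facing capability disclosures rather than mere containers for executable instructions.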

Comments To appear at ACM CAIS Workshop Agent Skill 2026

Abstract

Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset ($n=6$) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions.
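Rule-based coding of the four comprehension anchors could look roughly like the sketch below. The cue patterns are hypothetical placeholders chosen for illustration; the paper's actual coding rules are not given in the abstract, and the function names are assumptions.

```python
import re

# Hypothetical cue patterns, one per comprehension anchor (not the paper's rules).
ANCHOR_CUES = {
    "operational_basis": r"\b(requires?|uses?|runs?|depends on)\b",
    "output_contract": r"\b(outputs?|returns?|produces?)\b",
    "boundary_disclosure": r"\b(does not|limitations?|out of scope|only supports?)\b",
    "example_demonstration": r"\b(example|sample|expected (output|result))\b",
}

def code_specification(text):
    """Map each anchor to True if one of its cue patterns occurs in the text."""
    return {name: bool(re.search(pattern, text, re.IGNORECASE))
            for name, pattern in ANCHOR_CUES.items()}

def coverage(specs):
    """Share of specifications that exhibit cues for all four anchors."""
    coded = [code_specification(s) for s in specs]
    return sum(all(c.values()) for c in coded) / len(coded)

full = "Requires dig. Returns a JSON report. Does not scan IPv6. Example: run on sample.pcap"
partial = "Uses nmap to scan hosts."
print(code_specification(partial))
print(coverage([full, partial]))
```

Keyword rules of this kind detect the presence of a cue, not whether the disclosure is accurate or sufficient, which is one reason the abstract speaks of "cues" rather than verified content.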

2605.18991 2026-05-21 cs.CR cs.AI

Agent Security is a Systems Problem

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, Kamalika Chaudhuri, Xiaohan Fu, Khawaja Shams, Guy Amir, Jihye Choi, Sarthak Choudhary, Nils Palumbo, Andrey Labunets, Nishit V. Pandya

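AI summary: This paper argues that agent security should be treated as a systems problem: safety should rest on security invariants enforced at the system level rather than on model robustness alone. Drawing on techniques from systems security, it proposes core principles for designing agentic systems with predictable security guarantees, analyzes real-world attacks, and discusses the challenges of implementing these principles.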

Abstract

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

2605.12597 2026-05-21 cond-mat.dis-nn cond-mat.stat-mech cs.AI cs.LG physics.comp-ph

The critical slowing down in diffusion models

Luca Maria Del Bono, Giulio Biroli, Patrick Charbonneau, Marylou Gabrié

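AI summary: This paper studies diffusion models applied to the O(n) model of statistical field theory. It reveals critical slowing down in parameter learning during training and, using a local score approximation, shows that appropriate architectural design can overcome it, giving a controlled framework for improving sampling methods in statistical physics.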

Comments 17 pages, 8 figures

Abstract

Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models, a class of generative schemes highly effective in practice, by analyzing their application to the $O(n)$ model of statistical field theory in the Gaussian limit $n \to \infty$. In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation, we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.

2604.23944 2026-05-21 stat.ML cs.LG

Sliced-Regularized Optimal Transport

Khai Nguyen

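AI summary: This paper proposes a new regularized optimal transport (OT) formulation, sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), SROT regularizes the transport plan toward a smoothed sliced OT (SOT) plan. The paper defines SROT formally, derives its dual formulation, gives a post-Bayesian interpretation, and develops a Sinkhorn-style algorithm that retains the scalability of EOT. With a scalable SOT plan as the prior, SROT approximates the exact OT plan more accurately than EOT at the same regularization level, and the resulting plan improves on the reference SOT plan itself. The paper also introduces the induced SROT divergence and analyzes its topological and computational properties. Experiments on synthetic datasets, color transfer, and gradient flows support these claims.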

Comments 22 pages, 8 figures, 1 table

Abstract

We propose a new regularized optimal transport (OT) formulation, termed sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), which regularizes the transport plan toward an independent coupling, SROT regularizes it toward a smoothened sliced OT (SOT) plan. To the best of our knowledge, SROT is the first approach to leverage a version of SOT plan as a reference to improve classical OT. We provide a formal definition of SROT, derive its dual formulation, and provide a post-Bayesian interpretation of SROT. We then develop a Sinkhorn-style algorithm for efficient computation, retaining the same scalability advantages as EOT. By incorporating a scalable SOT plan as a prior, SROT yields more accurate approximations of the exact OT plan than EOT under the same level of regularization. Moreover, the resulting transport plan improves upon the reference SOT plan itself. We further introduce the corresponding OT divergence induced by SROT, named SROT divergence, and analyze its topological and computational properties. Finally, we validate our approach through experiments on synthetic datasets and color transfer tasks, demonstrating that SROT is better than both EOT and SOT in approximating exact OT. Additional experiments on gradient flows further highlight the advantages of SROT divergence.
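The idea of regularizing toward a reference plan can be sketched with a standard Sinkhorn scaling loop. The sketch assumes the regularizer is the KL divergence KL(P || Q) to a strictly positive reference plan Q, in which case the optimum has the form diag(u) (Q ⊙ exp(−C/ε)) diag(v). Whether this matches the paper's exact formulation is an assumption; the reference plan is passed in rather than computed from sliced OT, and the function name is illustrative.

```python
import math

def sinkhorn_with_reference(cost, ref, a, b, eps, iters=500):
    """Approximate min <C,P> + eps*KL(P||ref) over plans P with marginals a and b.

    `ref` must be strictly positive so that every scaling step is well defined.
    """
    n, m = len(a), len(b)
    # Gibbs kernel reweighted entrywise by the reference plan.
    K = [[ref[i][j] * math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

cost = [[0.0, 1.0], [1.0, 0.0]]
independent = [[0.25, 0.25], [0.25, 0.25]]   # product of the marginals
plan = sinkhorn_with_reference(cost, independent, [0.5, 0.5], [0.5, 0.5], eps=0.5)
print(plan)
```

With the independent coupling as reference, as in this usage example, the loop reduces to ordinary entropic OT; swapping in a smoothed sliced plan changes only the `ref` argument.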

2603.26603 2026-05-21 cs.SE cs.AI cs.LG

Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

Eziyo Ehsani, Luca Giamattei, Ivano Malavolta, Roberto Pietrantuono

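AI summary: This paper studies the trade-offs among performance, energy, and privacy when moving LLMs from cloud clusters to edge devices. Its experiments indicate that model architecture affects battery life more than the quantization scheme does, and that mid-sized models strike the best balance between response quality and sustainable energy consumption.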

Comments Under review at Empirical Software Engineering (EMSE)

Abstract

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a replicable and reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality of LLMs on mobile devices. We harness this pipeline to conduct an empirical case study on a flagship Android device, capturing granular metrics across eight LLMs ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. The findings highlight the trade-offs between generation quality, performance, power and resource consumption, revealing which LLMs offer the best balance across metrics and under different conditions. Besides, we uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.

2602.08023 2026-05-21 cs.CR cs.AI cs.MA

CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

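AI summary: This paper presents CTFExplorer, a multi-target web CTF benchmark for evaluating LLM offensive agents. It asks how to assess an agent's strategic reasoning under uncertainty, uses a multi-target environment to test exploration, prioritization, and attack chaining, and contributes a framework for evaluating agent behavior.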

Abstract

Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce CTFExplorer, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

2602.04916 2026-05-21 q-bio.QM cs.CL

AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

Ling Luo, Wenbin Jiang, Hongyuan Chang, Xinkang Wang, Xushi Zhang, Yueting Xiong, Mengsha Tong, Rongshan Yu

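AI summary: This paper presents the AFD-Instruction dataset, whose functional annotations improve LLM performance on antibody understanding and design, providing a new foundation for antibody modeling and therapeutic discovery.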

Abstract

Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.

2601.06006 2026-05-21 eess.AS cs.SD

Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Bang Zeng, Beilong Tang, Wang Xiang, Ming Li

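AI summary: This paper proposes a discriminative-generative two-stage framework that combines the controllability of discriminative extraction with the reconstruction capability of generative modeling to improve perceptual quality, intelligibility, and speaker consistency in target speaker extraction and speech enhancement.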

Comments 13 pages, 4 figures

Abstract

Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.

2512.23943 2026-05-21 cs.CY cs.LG stat.ME

Statistical Guarantees in the Search for Less Discriminatory Algorithms

Chris Hays, Ben Laufer, Solon Barocas, Manish Raghavan

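AI summary: This paper studies statistical guarantees for firms in high-stakes domains that search for less discriminatory algorithms to reduce disparate impact on protected groups. It proposes an adaptive stopping algorithm that determines when the search can stop with evidence that further search is unlikely to yield meaningful improvement.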

Comments 38 pages, 10 figures

Abstract

U.S. discrimination law can impose liability on firms that fail to adopt a less discriminatory alternative (LDA): a decision policy that achieves the same business objectives while reducing disparate impact on legally protected groups. Recent scholarship argues that this doctrine has direct implications for algorithmic decision-making in high-stakes domains such as employment, lending, and housing, potentially obligating firms to search for "less discriminatory algorithms" (Black et al., 2024). Regulators have at times encouraged proactive LDA searches, reinforcing the expectation of a good-faith effort to identify equally performant models with lower disparate impact. Model multiplicity makes such searches plausible: retraining with different random seeds can yield models with comparable predictive performance but materially different disparate impacts. Yet firms cannot retrain indefinitely, raising a central question: when is the search sufficient to demonstrate good faith? We formalize LDA search under multiplicity as an optimal stopping problem in which a developer seeks to produce evidence that further search is unlikely to yield meaningful improvements. Our main contribution is an adaptive stopping algorithm that provides a high-probability upper bound on the best disparate-impact gains attainable through continued retraining, enabling developers to certify (e.g., to a court) that additional search is unlikely to help. We also show how stronger distributional assumptions over the model space can yield tighter bounds, and we validate the approach on real-world credit and housing datasets.
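To give a feel for what a stopping certificate can look like, here is a much simpler stand-in than the paper's adaptive algorithm: a textbook order-statistics bound. Assuming retrained models are i.i.d. draws with a continuous disparate-impact score, after n retrainings the share of the model distribution that would beat the best model found exceeds p with probability (1 − p)^n. The function names are illustrative, and the bound says how often a further retrain would help, not by how much.

```python
import math

def share_beating_best(n, delta):
    """Bound, holding with probability >= 1 - delta, on the share of models
    that would beat the best of n i.i.d. retrainings."""
    return 1.0 - delta ** (1.0 / n)

def draws_needed(p, delta):
    """Smallest n such that (1 - p)**n <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

# Retrainings needed to claim, with 95% confidence, that fewer than 5% of
# retrained models would have lower disparate impact than the best one found.
print(draws_needed(0.05, 0.05))
print(share_beating_best(59, 0.05))
```

The paper's contribution goes further by bounding the size of the attainable improvement and by adapting the stopping time to the observed results; the sketch only illustrates why a finite search can support a probabilistic good-faith claim.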
