arXivDaily: daily arXiv digest, updated Monday through Friday
2602.23958 2026-06-09 eess.AS cs.SD (version update)

An Empirical Analysis of Task-Induced Encoder Bias in Fréchet Audio Distance

Wonwoo Jeong

Affiliation: Dept. of Computer Science and Engineering, Sogang University, South Korea

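AI summary: Decomposes evaluation into recall, precision, and alignment (semantic and structural dimensions) to analyze task-induced bias across six encoders used in FAD. Reconstruction-, ASR-, and classification-trained encoders each show distinct strengths and weaknesses, which motivates developing evaluation-native encoders.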

Comments Accepted to Interspeech 2026. Source code and evaluation pipeline are available at: https://github.com/wonwoo-jeong/fad-encoder-bias

Abstract

Fréchet Audio Distance (FAD) is the de facto standard for evaluating text-to-audio generation, yet its scores depend on the underlying encoder's embedding space. An encoder's training task dictates which acoustic features are preserved or discarded, causing FAD to inherit systematic task-induced biases. We decompose evaluation into Recall, Precision, and Alignment (split into semantic and structural dimensions), using log-scale normalization for fair cross-encoder comparison. Controlled experiments on six encoders across two datasets reveal a four-axis trade-off: reconstruction-based AudioMAE leads precision sensitivity; ASR-trained Whisper dominates structural detection but is blind to signal degradation; classification-trained VGGish maximizes semantic detection but penalizes legitimate intra-class variation. Since no single encoder is a universal evaluator, future metrics must shift toward evaluation-native encoders intrinsically aligned with human perception.
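
For orientation, FAD fits a Gaussian to the embeddings of a reference set and of a generated set and reports the Fréchet distance between the two. The sketch below is a simplified illustration only: it assumes diagonal covariances (standard FAD uses full covariance matrices and a matrix square root), and the two-dimensional "embeddings" are made-up stand-ins for real encoder outputs. The function names are hypothetical.

```python
import math

def mean_and_var(embeddings):
    """Per-dimension mean and population variance of a list of vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n for d in range(dim)]
    return mean, var

def frechet_distance_diag(ref, gen):
    """Frechet distance between two Gaussians with diagonal covariances.

    Per dimension: (mu1 - mu2)^2 + var1 + var2 - 2*sqrt(var1*var2),
    which equals (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
    """
    mu_r, var_r = mean_and_var(ref)
    mu_g, var_g = mean_and_var(gen)
    return sum(
        (mr - mg) ** 2 + vr + vg - 2.0 * math.sqrt(vr * vg)
        for mr, mg, vr, vg in zip(mu_r, mu_g, var_r, var_g)
    )

# Toy embeddings (assumed values, not from the paper).
reference = [[0.0, 1.0], [2.0, 3.0]]
shifted = [[1.0, 1.0], [3.0, 3.0]]  # same spread, mean moved by 1 in dimension 0

same_set_distance = frechet_distance_diag(reference, reference)
shifted_distance = frechet_distance_diag(reference, shifted)
```

Because the statistics are computed in the encoder's embedding space, any acoustic property the encoder discards cannot change either the means or the variances, which is one way to see why the score inherits the encoder's training bias.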

2601.16510 2026-06-09 cs.MS cs.LG math.OC (version update)

Learning to Optimize by Differentiable Programming

Liping Tao, Xindi Tong, Chee Wei Tan

Affiliation: Nanyang Technological University

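AI summary: This tutorial introduces the use of differentiable programming to learn the design of first-order optimization algorithms, improving convergence and solution quality through end-to-end training, and shows how algorithms such as ADMM and PDHG can be learned and adapted on the basis of Fenchel-Rockafellar duality.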

Abstract

Solving massive-scale optimization problems requires scalable first-order methods with low per-iteration cost. This tutorial highlights a shift in optimization: using differentiable programming not only to execute algorithms but to learn how to design them. Modern frameworks such as PyTorch, TensorFlow, and JAX enable this paradigm through efficient automatic differentiation. Embedding first-order methods within these systems allows end-to-end training that improves convergence and solution quality. Guided by Fenchel-Rockafellar duality, the tutorial demonstrates how duality-informed iterative schemes such as ADMM and PDHG can be learned and adapted. Case studies across LP, NNV, Sum-Rate maximization, OPF, and LRMP illustrate these gains.

2602.21788 2026-06-09 cs.DC cs.LG (version update)

Efficient Scaling of LLM Training with Flexible Context Parallelism

Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Huawei Technologies Co., Ltd.

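AI summary: To address the load imbalance and redundant communication caused by data heterogeneity, proposes FCP, a strategy that adaptively reconfigures communication groups and context parallelism degrees, achieving near-linear scaling and up to 1.46x higher throughput.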

Abstract

Scaling long-context capabilities is crucial for Large Language Models (LLMs). However, real-world data contain a large number of sequences with heterogeneous lengths. Existing training libraries for LLMs rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Flexible Context Parallelism (FCP), an efficient parallelism strategy that adaptively reconfigures communication groups and context parallelism degrees during LLM training. We generalize context parallelism to more flexible, non-power-of-two degrees and develop a polynomial-time algorithm that generates near-optimal parallelism strategies with only millisecond-level overhead per training batch. FCP maintains high hardware efficiency even under extreme data heterogeneity. Experimental results demonstrate that FCP significantly outperforms Megatron-LM and DeepSpeed in both LLM and multimodal LLM (MLLM) training, achieving up to 1.46x speedup in average throughput while maintaining near-linear scaling efficiency across large-scale clusters. For extremely unbalanced batches, FCP even achieves 2.24x speedup.

2602.20967 2026-06-09 eess.AS cs.AI cs.SD (version update)

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng

Affiliations: Nanyang Technological University; Nara Institute of Science and Technology

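AI summary: Proposes a training-free, intelligibility-guided observation addition method that derives fusion weights from intelligibility estimates of the backend ASR, improving ASR robustness in noisy conditions without modifying the parameters of the SE or ASR models.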

Comments Accepted to Interspeech 2026

Abstract

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
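
The fusion step in observation addition can be pictured as a weighted sum of the noisy and the enhanced waveform. The sketch below is illustrative only: the abstract does not say how intelligibility estimates are turned into weights, so `fusion_weight` is a hypothetical mapping, and the scores and sample values are invented.

```python
def fusion_weight(score_noisy, score_enhanced):
    """Hypothetical mapping from two intelligibility scores to a weight in [0, 1].

    Returns the share given to the enhanced signal and falls back to 0.5
    when both scores are zero.
    """
    total = score_noisy + score_enhanced
    return 0.5 if total == 0 else score_enhanced / total

def observation_addition(noisy, enhanced, weight_enhanced):
    """Sample-wise convex combination of the noisy and enhanced waveforms."""
    if len(noisy) != len(enhanced):
        raise ValueError("waveforms must have the same length")
    return [
        weight_enhanced * e + (1.0 - weight_enhanced) * n
        for n, e in zip(noisy, enhanced)
    ]

# Invented utterance-level example: the enhanced signal scores higher,
# so it receives the larger share of the mix.
noisy = [0.2, -0.4, 0.6]
enhanced = [0.1, -0.2, 0.2]
w = fusion_weight(score_noisy=0.25, score_enhanced=0.75)
fused = observation_addition(noisy, enhanced, w)
```

A frame-level variant would apply the same combination with one weight per frame instead of a single weight per utterance.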

2602.15519 2026-06-09 eess.AS cs.SD (version update)

Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

Yiming Yang, Guangyong Wang, Haixin Guan, Yanhua Long

Affiliations: Shanghai Normal University; Unisound AI Technology Co., Ltd.

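AI summary: Proposes the Enroll-on-Wakeup framework, which uses the wake-word segment as the enrollment reference so that no pre-recorded speech is needed, enabling seamless interaction. It presents the first systematic comparison of discriminative and generative models under real noisy conditions and explores enrollment augmentation with LLM-based TTS.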

Comments Accepted to Interspeech 2026

Abstract

Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.

2602.07774 2026-06-09 cs.IR cs.AI (version update)

Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, Luke Simon

Affiliation: Meta AI

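AI summary: Proposes the GR2 framework, which uses the reasoning ability of large language models for recommendation reranking. With semantic-ID encoding, supervised fine-tuning on reasoning traces, and reinforcement learning, it outperforms existing methods on Recall@5 and NDCG@5.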

Comments 31 pages

Abstract

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
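
The two reported metrics can be computed as follows. This is a generic sketch of Recall@k and binary-relevance NDCG@k, not the paper's evaluation code; the item IDs and the relevance set are invented. It also shows why a reranker that simply preserves the input order gains nothing on these metrics.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(k, len(relevant_items))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Invented example: six candidates from an upstream ranker, two relevant.
candidates = ["a", "b", "c", "d", "e", "f"]
relevant = {"c", "f"}
reranked = ["c", "f", "a", "b", "d", "e"]  # a reranker that moves both hits up

recall_before = recall_at_k(candidates, relevant, 5)
recall_after = recall_at_k(reranked, relevant, 5)
ndcg_before = ndcg_at_k(candidates, relevant, 5)
ndcg_after = ndcg_at_k(reranked, relevant, 5)
```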

2602.18777 2026-06-09 eess.AS cs.SD (version update)

Mind the Gap: Detecting Cluster Exits for Robust Local Density-Based Score Normalization in Anomalous Sound Detection

Kevin Wilkinghoff, Gordon Wichern, Jonathan Le Roux, Zheng-Hua Tan

Affiliations: Department of Electronic Systems, Aalborg University; Pioneer Centre for Artificial Intelligence; Mitsubishi Electric Research Laboratories (MERL)

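AI summary: To address the sensitivity of local density-based score normalization to neighborhood size in anomalous sound detection, proposes a cluster exit detection mechanism that adaptively selects the neighborhood size by identifying distance discontinuities, improving robustness and performance.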

Abstract

Local density-based score normalization is an effective component of distance-based embedding methods for anomalous sound detection, particularly when data densities vary across conditions or domains. In practice, however, performance depends strongly on neighborhood size. Increasing it can degrade detection accuracy when neighborhood expansion crosses cluster boundaries, violating the locality assumption of local density estimation. This observation motivates adapting the neighborhood size based on locality preservation rather than fixing it in advance. We realize this by proposing cluster exit detection, a lightweight mechanism that identifies distance discontinuities and selects neighborhood sizes accordingly. Experiments across multiple embedding models and datasets show improved robustness to neighborhood-size selection and consistent performance gains.
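
One way to picture "identifying distance discontinuities" is to scan a query's sorted neighbor distances and stop at the first large jump. The sketch below is only one plausible reading of that idea, not the authors' algorithm: the ratio test, the `jump_ratio` threshold, and the example distances are all assumptions made for illustration.

```python
def select_neighborhood_size(sorted_distances, max_k, jump_ratio=2.0):
    """Pick k as the number of neighbors before the first large distance jump.

    sorted_distances: ascending distances from a query to its neighbors.
    jump_ratio: hypothetical threshold; a neighbor more than jump_ratio
    times farther than the previous one is treated as a cluster exit.
    """
    limit = min(max_k, len(sorted_distances))
    for i in range(1, limit):
        previous, current = sorted_distances[i - 1], sorted_distances[i]
        if previous > 0 and current / previous > jump_ratio:
            return i  # keep only the i neighbors before the jump
    return limit

# Invented distances: four close neighbors, then a jump into another cluster.
distances = [0.10, 0.11, 0.12, 0.13, 0.55, 0.57, 0.60]
k = select_neighborhood_size(distances, max_k=6)

# Without a jump, the fixed upper limit is used (capped by the list length).
k_no_jump = select_neighborhood_size([0.10, 0.11, 0.12], max_k=6)
```

With a fixed neighborhood size of 6, the last two neighbors would come from the other cluster and inflate the local density estimate; stopping at the jump keeps the estimate local.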

2602.18364 2026-06-09 cs.IT cs.LG math.IT quant-ph stat.ML (version update)

Quantum Maximum Likelihood Prediction via Hilbert Space Embeddings

Sreejith Sreekumar, Nir Weinberger

Affiliation: L2S, CNRS, CentraleSupélec, University of Paris-Saclay, France

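AI summary: Studies the quantum maximum likelihood prediction task. By embedding empirical probability distributions into quantum states and minimizing the quantum relative entropy, it proposes a unified framework and gives non-asymptotic performance guarantees.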

Comments 31+3 pages, 1 figure

Abstract

Maximum likelihood prediction (MLP) is a core task at the heart of modern large language models. Here, as a first step, we study a quantum version of this task for a simplified data model consisting of independent and identically distributed samples. The quantum maximum likelihood predictor is obtained by embedding empirical probability distributions into quantum states and minimizing the quantum relative entropy over a given class of states. We provide an interpretation of this predictor in terms of a quantum reverse information projection and a quantum Pythagorean theorem when the class of quantum models is sufficiently expressive. We further derive non-asymptotic performance guarantees in terms of convergence rates and concentration inequalities, both in trace norm and quantum relative entropy. Our approach provides a unified framework to handle MLP within both classical and quantum LLMs.

2602.16061 2026-06-09 stat.ML cs.LG econ.EM stat.ME (version update)

Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models

Hongyu Chen, David Simchi-Levi, Ruoxuan Xiong

Affiliations: Massachusetts Institute of Technology, Cambridge, MA 02139; Emory University, Atlanta, GA 30322

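AI summary: To address estimation bias caused by data missing not at random (MNAR), proposes a partial identification framework that uses linear programming, with predictions from pretrained models (such as LLMs) serving as weak shadow variables that tighten the bounds, and designs a set-expansion estimator that guarantees coverage. Experiments show identification intervals shrinking by 75-83%.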

Abstract

Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. In finite samples, to provide valid coverage of the identified set, we propose a set-expansion estimator that achieves a slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75-83% while maintaining valid coverage under realistic MNAR mechanisms.
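
To make "partial identification" concrete, the sketch below computes the classical worst-case bounds on a mean when some outcomes are missing and nothing is assumed about why. It does not reproduce the paper's linear programs; it only shows the unconstrained starting interval that additional valid constraints, such as the weak shadow variables described above, could only leave unchanged or tighten. The ratings and counts are invented.

```python
def worst_case_mean_bounds(observed_outcomes, n_missing, lower=0.0, upper=1.0):
    """Worst-case bounds on the population mean when some outcomes are missing.

    Missing outcomes are only assumed to lie in [lower, upper]; no assumption
    is made about the missingness mechanism.
    """
    n_observed = len(observed_outcomes)
    n_total = n_observed + n_missing
    observed_sum = sum(observed_outcomes)
    low = (observed_sum + n_missing * lower) / n_total
    high = (observed_sum + n_missing * upper) / n_total
    return low, high

# Invented survey: four users responded (outcomes in [0, 1]), six did not.
ratings = [1.0, 1.0, 0.0, 1.0]
low, high = worst_case_mean_bounds(ratings, n_missing=6)
naive_mean = sum(ratings) / len(ratings)  # ignores the non-respondents
```

The interval width equals the missing fraction times the outcome range, so with 60% nonresponse the bounds are wide; the naive respondent-only mean is a single point inside an interval that the data alone cannot narrow.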

2602.10016 2026-06-09 cs.IR cs.AI (version update)

Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Ellie Wen, Jiyan Yang, Huayu Li

Affiliations: Meta Platforms, Inc.; OpenAI

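AI summary: To address the lack of predictable scaling laws for massive-scale recommendation systems, proposes the Kunlun architecture, which improves model efficiency through low-level optimizations (GDPA, HSP, sliding window attention) and high-level innovations (CompSkip, event-level personalization). MFU rises from 17% to 37%, scaling efficiency doubles, and the architecture is deployed in Meta Ads models.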

Comments 10 pages, 4 figures

Abstract

Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.

2602.10234 2026-06-09 physics.soc-ph cs.AI cs.RO (version update)

Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy

Zhengbing He

Affiliation: Faculty of Science and Engineering, University of Nottingham Ningbo China

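AI summary: Proposes a practical jam-absorption driving (JAD) strategy inspired by police-car swerving behavior. By defining the JAD Triangle, it suppresses an isolated stop-and-go wave with a single vehicle and two detectors, systematically analyzes five key parameters, and validates the strategy in simulation.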

Abstract

Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic efficiency, increased safety risks, and elevated vehicle emissions. Among various freeway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a promising approach to suppressing the propagation of such waves. However, most existing JAD strategies remain impractical, primarily due to the lack of consideration of implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces the Single-Vehicle Double-Detector Jam-Absorption Driving (SD-JAD) problem and then proposes a practical JAD strategy based on a definition of the JAD Triangle, transforming such behavior into a traffic control strategy capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice using only two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave without triggering secondary waves. This paper is expected to take a significant step toward the practical implementation of JAD, advancing it from a theoretical concept to a feasible and deployable traffic management strategy.

2602.12246 2026-06-09 cs.NI cs.RO (version update)

6G Empowering Future Robotics: A Vision for Next-Generation Autonomous Systems

Mona Ghassemian, Andrés Meseguer Valenzuela, Ana Garcia Armada, Dejan Vukobratovic, Periklis Chatzimisios, Kaspar Althoefer, Ranga Rao Venkatesha Prasad

Affiliations: ITI; UC3M; International Hellenic University and University of New Mexico; QMUL; TUDelft

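AI summary: Discusses how 6G can support robotics by mapping IMT-2030 key performance indicators to robotic functional blocks, proposes an architecture integrating robotic, intelligent, and network service planes, and presents a real-time dynamic safety framework to support human-robot collaboration.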

Comments IEEE Communications Magazine

Abstract

The convergence of robotics and next-generation communication is a critical driver of technological advancement. As the world transitions from 5G to 6G, the foundational capabilities of wireless networks are evolving to support increasingly complex and autonomous systems. We examine the transformative impact of 6G on enhancing key robotics functionalities. We provide a systematic mapping of IMT-2030 key performance indicators to robotic functional blocks, including sensing, perception, cognition, actuation, and self-learning. Building upon this mapping, we propose a high-level architectural framework integrating robotic, intelligent, and network service planes, underscoring the need for a holistic approach. As an example use case, we present a real-time, dynamic safety framework enabled by IMT-2030 capabilities for safe and efficient human-robot collaboration in shared spaces.

2602.12129 2026-06-09 cs.IR cs.LG (version update)

Towards Personalized Bangla Book Recommendation: A Large-Scale Heterogeneous Book Graph Dataset

Rahin Arefin Ahmed, Md. Anik Chowdhury, Sakil Ahmed Sheikh Reza, Devnil Bhattacharjee, Muhammad Abdullah Adnan, Julian McAuley, Nafis Sadeq

Affiliations: East West University; Bangladesh University of Engineering and Technology; University of California San Diego

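AI summary: To address the lack of structured, large-scale, public datasets for Bangla literature, builds RokomariBG, a heterogeneous book graph dataset with about 127,000 books, about 63,700 users, and other entities connected by multiple relation types. Benchmarks show that heterogeneous relations and code-mixed textual metadata strongly affect recommendation performance.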

Comments Added new experiment results on sequential recommendation; top-N recommendation results have been updated using a per-user temporal leave-last-one-out split instead of a random split

Abstract

Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through several relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we present a systematic benchmarking study on the top-N recommendation and sequential recommendation tasks, evaluating a diverse set of representative recommendation models. Through comprehensive benchmarking, we demonstrate that recommendation performance in this domain is strongly influenced by both heterogeneous relational information and code-mixed textual metadata. These findings reveal unique challenges of Bangladeshi e-commerce ecosystems that are largely absent from existing recommendation benchmarks. Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset

2602.10172 2026-06-09 astro-ph.IM cs.AI (version update)

Cosmo3DFlow: Wavelet Flow Matching for Spatial-to-Spectral Compression in Reconstructing the Early Universe

Md. Khairul Islam, Zeyu Xia, Ryan Goudjil, Jialu Wang, Arya Farahi, Judy Fox

Affiliations: Department of Computer Science, University of Virginia; Department of Statistics and Data Sciences, The University of Texas at Austin; School of Data Science

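AI summary: Proposes the Cosmo3DFlow framework, which combines a 3D discrete wavelet transform with flow matching and uses spatial-to-spectral compression to address the dimensionality and sparsity bottlenecks in reconstructing high-dimensional cosmic structure, achieving up to 46x faster sampling than diffusion models.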

Abstract

Reconstructing the early universe from the evolved present-day universe is a challenging and computationally demanding problem in modern astrophysics. We devise a novel generative framework, Cosmo3DFlow, designed to address dimensionality and sparsity, the critical bottlenecks inherent in current state-of-the-art methods for cosmological inference. By integrating 3D Discrete Wavelet Transform (DWT) with flow matching, we effectively represent high-dimensional cosmological structures. The Wavelet Transform addresses the "void problem" by translating spatial emptiness into spectral sparsity. It decouples high-frequency details from low-frequency structures, and wavelet-space velocity fields facilitate stable ordinary differential equation (ODE) solvers with large step sizes. Using large-scale cosmological $N$-body simulations at $128^3$ resolution, we achieve up to $46\times$ faster sampling than diffusion models. Our results enable initial conditions to be sampled in seconds, compared to minutes for previous methods.
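
The phrase "translating spatial emptiness into spectral sparsity" can be illustrated with a single level of the Haar wavelet transform. This is a one-dimensional toy, assumed here purely for illustration: the paper works with a 3D DWT on simulation volumes, and the wavelet family it uses is not stated in the abstract. The density values are invented.

```python
import math

def haar_step(signal):
    """One level of the orthonormal 1-D Haar transform (even-length input)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail

# Toy density profile: mostly empty ("void") with one flat overdense region.
density = [0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 0.0, 0.0]
approx, detail = haar_step(density)

nonzero_approx = sum(1 for c in approx if abs(c) > 1e-12)
nonzero_detail = sum(1 for c in detail if abs(c) > 1e-12)
energy_in = sum(v * v for v in density)
energy_out = sum(c * c for c in approx) + sum(c * c for c in detail)
```

Empty regions map to exactly zero coefficients in both bands, and the transform preserves total energy, so the eight samples are summarized by a single nonzero low-frequency coefficient while the detail band stays empty.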

2602.04402 2026-06-09 stat.ML cs.AI cs.CY cs.LG math.ST stat.TH (version update)

Performative Learning Theory

Julian Rodemann, Unai Fischer-Abaigar, James Bailie, Krikamol Muandet

Affiliation: University of Cambridge

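AI summary: Embeds performative prediction into statistical learning theory, proves generalization bounds under performative effects on the sample and on the population, reveals a trade-off in which the more a model affects the data the less it can learn from it, and shows that retraining can improve generalization guarantees.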

Comments ICML 2026. v2: corrected typo in author list; v3: added explanation of condition 3.2, modified condition 3.3 and fixed lemma 3.4, added examples and explanations in sections 2, 5, and 6

Abstract

Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app's predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such self-negating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals a fundamental trade-off between performatively changing the world and learning from it: the more a model affects data, the less it can learn from it. Moreover, our analysis results in a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds in a case study on prediction-informed assignments of unemployed German residents to job trainings, drawing upon administrative labor market records from 1975 to 2017 in Germany.

2602.05869 2026-06-09 stat.ML cs.LG cs.NA math.NA math.PR math.ST stat.TH (version update)

Wedge Sampling: Efficient Tensor Completion with Nearly-Linear Sample Complexity

Hengrui Luo, Anna Ma, Ludovic Stephan, Yizhe Zhu

Affiliations: Rice University; University of California, Irvine; Univ Rennes, Ensai, CNRS, CREST-UMR 9194; University of Southern California

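AI summary: Proposes wedge sampling, a non-adaptive scheme that allocates observations to structured length-two patterns (wedges), strengthening the spectral signal where uniform sampling is too sparse and enabling tensor completion with nearly linear sample complexity.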

Comments COLT 2026 arXiv version. 65 pages, 3 figures

英文摘要

We introduce Wedge Sampling, a new non-adaptive sampling scheme for low-rank tensor completion. We study recovery of an order-$k$ low-rank tensor of dimension $n \times \cdots \times n$ from a subset of its entries. Unlike the standard uniform entry model (i.e., i.i.d. samples from $[n]^k$), wedge sampling allocates observations to structured length-two patterns (wedges) in an associated bipartite sampling graph. By directly promoting these length-two connections, the sampling design strengthens the spectral signal that underlies efficient initialization, in regimes where uniform sampling is too sparse to generate enough informative correlations. Our main result shows that this change in sampling paradigm enables polynomial-time algorithms to achieve both weak and exact recovery with nearly linear sample complexity in $n$. The approach is also plug-and-play: wedge-sampling-based spectral initialization can be combined with existing refinement procedures (e.g., spectral or gradient-based methods) using only an additional $\tilde{O}(n)$ uniformly sampled entries, substantially improving over the $\tilde{O}(n^{k/2})$ sample complexity typically required under uniform entry sampling for efficient methods. Overall, our results suggest that the statistical-to-computational gap highlighted in Barak and Moitra (2022) is, to a large extent, a consequence of the uniform entry sampling model for tensor completion, and that alternative non-adaptive measurement designs that guarantee a strong initialization can overcome this barrier.

2602.03682 2026-06-09 stat.ML cs.DC cs.LG cs.NA math.NA 版本更新

Improved Analysis of the Accelerated Noisy Power Method with Applications to Decentralized PCA

加速噪声幂方法的改进分析及其在去中心化PCA中的应用

Pierre Aguié, Mathieu Even, Laurent Massoulié

发表机构 * École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院)

AI总结 本文改进了加速噪声幂方法的分析,在更宽松的扰动条件下保持加速收敛速率,并首次提出具有可证明加速收敛的去中心化PCA算法。

详情
AI中文摘要

我们分析了加速噪声幂方法,这是一种在仅有不精确矩阵-向量乘积可用的情况下进行主成分分析的算法,例如在去中心化PCA中可能出现的情况。虽然先前的工作已经证明,与标准噪声幂方法相比,加速可以改善收敛速度,但这些保证需要对扰动幅度进行过度严格的上界限制,限制了其实用性。我们提供了该算法的改进分析,在更温和的扰动条件下保持了加速收敛速率。我们证明我们的新分析在最坏情况下是最优的,即收敛速率无法进一步提高,并且我们推导的噪声条件在不牺牲收敛保证的情况下无法放宽。我们通过推导一种用于去中心化PCA的加速算法来展示我们结果的实际相关性,该算法具有与非加速方法相似的通信成本。据我们所知,这是第一个具有可证明加速收敛的去中心化PCA算法。

英文摘要

We analyze the Accelerated Noisy Power Method, an algorithm for Principal Component Analysis in the setting where only inexact matrix-vector products are available, which can arise for instance in decentralized PCA. While previous works have established that acceleration can improve convergence rates compared to the standard Noisy Power Method, these guarantees require overly restrictive upper bounds on the magnitude of the perturbations, limiting their practical applicability. We provide an improved analysis of this algorithm, which preserves the accelerated convergence rate under much milder conditions on the perturbations. We show that our new analysis is worst-case optimal, in the sense that the convergence rate cannot be improved, and that the noise conditions we derive cannot be relaxed without sacrificing convergence guarantees. We demonstrate the practical relevance of our results by deriving an accelerated algorithm for decentralized PCA, which has similar communication costs to non-accelerated methods. To our knowledge, this is the first decentralized algorithm for PCA with provably accelerated convergence.
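The setting in this abstract can be illustrated with a small sketch. The code below is an assumption-laden toy, not the authors' algorithm or analysis: it uses the textbook momentum recurrence x_{t+1} = A x_t - beta * x_{t-1}, adds Gaussian noise to each matrix-vector product to mimic inexact products, and picks beta from the (normally unknown) second eigenvalue. The function name, noise model, and all parameter values are hypothetical.

```python
import numpy as np

def noisy_momentum_power_method(A, beta, noise_std, steps, rng):
    """Momentum power iteration where each product A @ x is perturbed by noise."""
    n = A.shape[0]
    x_prev = np.zeros(n)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        noise = noise_std * rng.standard_normal(n)   # inexact matrix-vector product
        x_next = A @ x + noise - beta * x_prev       # momentum (acceleration) term
        scale = np.linalg.norm(x_next)
        # Rescale both iterates by the same factor so the recurrence is preserved.
        x_prev, x = x / scale, x_next / scale
    return x

rng = np.random.default_rng(0)
eigvals = np.array([1.0, 0.9, 0.5, 0.2, 0.1])        # toy spectrum (assumption)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))     # random orthonormal eigenvectors
A = Q @ np.diag(eigvals) @ Q.T
beta = eigvals[1] ** 2 / 4                           # common momentum choice: lambda_2^2 / 4
x = noisy_momentum_power_method(A, beta, noise_std=1e-6, steps=200, rng=rng)
alignment = abs(x @ Q[:, 0])                         # |cos| of angle to the top eigenvector
print(f"alignment with top eigenvector: {alignment:.6f}")
```

In this toy run the noise level is deliberately tiny. How large the perturbations may be before acceleration is lost is exactly what the abstract says the paper characterizes, and the sketch makes no claim about that threshold.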

2602.02431 2026-06-09 stat.ML cs.LG 版本更新

Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

全批量梯度下降优于单次SGD:单索引学习中的样本复杂度分离

Filip Kovačević, Hong Chang Ji, Denny Wu, Mahdi Soltanolkotabi, Marco Mondelli

发表机构 * Institute of Science and Technology Austria(奥地利科学技术研究所) Sung Kyun Kwan University(成均馆大学) New York University and Flatiron Institute(纽约大学和Flatiron研究所) University of Southern California(南加州大学)

AI总结 研究单索引学习中全批量GD与单次SGD的样本复杂度差异,发现通过截断激活函数,全批量GD在n≃d样本时实现弱恢复,优于单次SGD的n≳d log d样本需求。

Comments Accepted to ICML 2026

详情
AI中文摘要

传统观点认为,多次重用训练数据可以提高基于梯度的学习的统计效率。虽然这一现象在线性回归中已被广泛研究,但在非线性和非凸设置中,除了前两次数据传递实现的损失修改机制外,多遍梯度下降(GD,重用所有数据)相对于单遍随机梯度下降(在线SGD,每个数据点仅使用一次)的优势尚未得到充分理解。在这项工作中,我们考虑学习一个具有二次激活函数的$d$维单索引模型,已知单次SGD需要$n\gtrsim d\log d$个样本才能实现弱恢复。我们首先证明,对于相关损失上的全批量球面GD,样本复杂度中的$\log d$因子仍然存在;然而,通过简单地截断激活函数,全批量GD在$n \simeq d$个样本时展现出有利的优化景观,从而在统计效率上优于单次SGD(使用相同的激活函数)。我们通过从微小初始化开始的平方损失上全批量GD的轨迹分析补充了这一结果,表明$n \gtrsim d$个样本和$T \gtrsim\log d$个梯度步足以实现强(精确)恢复。

英文摘要

It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. While this phenomenon has been extensively studied in linear regression, the benefit of multi-pass gradient descent (GD, which reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) is not well-understood in nonlinear and non-convex settings, except for a loss modification mechanism achieved by the first two passes on the data. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.
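To make the objects in this abstract concrete, here is a minimal sketch of full-batch spherical gradient descent on the correlation loss for a single-index model with quadratic activation. It is an illustration under stated assumptions, not the paper's experiment: the dimensions, step size, and step count are made up, the sample size is far above the n ≃ d regime the abstract studies (chosen so the toy run is stable), and the truncated activation is not implemented.

```python
import numpy as np

def spherical_gd_correlation_loss(X, y, w0, lr, steps):
    """Full-batch spherical GD on L(w) = -(1/n) * sum_i y_i * (w . x_i)^2."""
    n = X.shape[0]
    w = w0 / np.linalg.norm(w0)
    for _ in range(steps):
        z = X @ w
        grad = -(2.0 / n) * (X.T @ (y * z))   # Euclidean gradient of L at w
        grad -= (grad @ w) * w                # project onto the sphere's tangent space
        w = w - lr * grad                     # gradient step using all n samples
        w /= np.linalg.norm(w)                # retract back to the unit sphere
    return w

rng = np.random.default_rng(0)
d, n = 20, 4000                               # toy sizes (assumption)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)              # hidden direction of the single-index model
X = rng.standard_normal((n, d))               # Gaussian inputs
y = (X @ w_star) ** 2                         # quadratic activation, noiseless labels
w_hat = spherical_gd_correlation_loss(X, y, rng.standard_normal(d), lr=0.05, steps=500)
overlap = abs(w_hat @ w_star)                 # sign of w_star is not identifiable
print(f"overlap |<w_hat, w_star>| = {overlap:.4f}")
```

Because the activation is even, the direction is only identifiable up to sign, which is why the sketch reports the absolute overlap.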

2601.22859 2026-06-09 cs.SE cs.AI 版本更新

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

MEnvAgent:可扩展的多语言环境构建用于可验证软件工程

Chuanzhe Guo, Jingjing Wu, Sijun He, Yang Chen, Zhaoqi Kuang, Shilong Fan, Bingjin Chen, Siqi Bao, Jing Liu, Hua Wu, Qingfu Zhu, Wanxiang Che, Haifeng Wang

发表机构 * Tsinghua University(清华大学)

AI总结 提出MEnvAgent框架,通过多智能体规划-执行-验证架构和环境复用机制,自动构建多语言可执行环境,生成可验证任务实例,在10种语言1000个任务上提升F2P率8.6%并降低时间成本43%。

Comments Accepted as a Spotlight Paper at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)智能体在软件工程(SWE)领域的发展受到可验证数据集稀缺的制约,这一瓶颈源于跨不同语言构建可执行环境的复杂性。为解决此问题,我们提出MEnvAgent,一个用于自动环境构建的多语言框架,支持可验证任务实例的可扩展生成。MEnvAgent采用多智能体规划-执行-验证架构自主解决构建失败问题,并集成了一种新颖的环境复用机制,通过增量修补历史环境来减少计算开销。在MEnvBench(一个包含10种语言1000个任务的新基准)上的评估表明,MEnvAgent优于基线方法,将失败到通过(F2P)率提高了8.6%,同时将时间成本降低了43%。此外,我们通过构建MEnvData-SWE(迄今为止最大的开源多语言真实可验证Docker环境数据集)以及解决方案轨迹,展示了MEnvAgent的实用性,这些轨迹使得各种模型在SWE任务上能够获得一致的性能提升。我们的代码、基准和数据集可在以下网址获取:https://github.com/ernie-research/MEnvAgent。

英文摘要

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.

2602.00797 2026-06-09 stat.ML cs.LG 版本更新

Zero-Flow Encoders

零流编码器

Yakun Wang, Leyang Wang, Song Liu, Taiji Suzuki

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出了一种基于流的表示学习框架,通过零流准则验证条件独立性,从而从数据中提取充分信息,并在图模型和自监督学习任务中学习摊销马尔可夫毯和潜在表示。

Comments Yakun Wang and Leyang Wang contributed equally to this work; As published at ICML 2026

详情
AI中文摘要

基于流的方法在各种生成建模任务中取得了显著成功,能够捕捉复杂数据分布中的细微细节。然而,现有研究很少利用这一独特能力来解决超出生成任务的细粒度结构细节。本文提出了一种流启发式的表示学习框架。首先,我们证明了使用独立耦合训练的修正流在 $t=0.5$ 时处处为零,当且仅当源分布和目标分布相同。我们称这一性质为零流准则。其次,我们展示该准则可以验证条件独立性,从而从数据中提取充分信息。第三,我们将这一准则转化为可计算且无需模拟的损失函数,从而在图模型中学习摊销马尔可夫毯和自监督学习任务中的潜在表示。在模拟和真实世界数据集上的实验验证了本文方法的有效性。复现实验的代码可在 https://github.com/probabilityFLOW/zfe 获取。

英文摘要

Flow-based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine-grained structural details beyond generation tasks. This paper presents a flow-inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at $t=0.5$ if and only if the source and target distributions are identical. We term this property the zero-flow criterion. Second, we show that this criterion can certify conditional independence, thereby extracting sufficient information from the data. Third, we translate this criterion into a tractable, simulation-free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self-supervised learning tasks. Experiments on both simulated and real-world datasets demonstrate the effectiveness of our approach. The code reproducing our experiments can be found at: https://github.com/probabilityFLOW/zfe.
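The easy direction of the zero-flow criterion can be checked by a short symmetry argument. The derivation below is a sketch under the standard rectified-flow setup (linear interpolation between independent endpoints, velocity defined as a conditional expectation); the notation $X_0$, $X_1$, $v$ is assumed here and is not taken from the paper. The converse direction is the paper's claim and is not derived.

```latex
% Rectified flow with independent coupling (standard setup, notation assumed):
\begin{align*}
X_t &= (1-t)\,X_0 + t\,X_1, \qquad X_0 \perp X_1, \\
v(x,t) &= \mathbb{E}\left[\,X_1 - X_0 \mid X_t = x\,\right].
\end{align*}
% If X_0 and X_1 have the same law, then (X_0, X_1) and (X_1, X_0) are equal in
% distribution, and X_{1/2} = (X_0 + X_1)/2 is unchanged by swapping them. Hence
\begin{align*}
v(x,\tfrac12)
  &= \mathbb{E}\left[\,X_1 - X_0 \mid X_{1/2} = x\,\right]
   = \mathbb{E}\left[\,X_0 - X_1 \mid X_{1/2} = x\,\right]
   = -\,v(x,\tfrac12),
\end{align*}
% so v(x, 1/2) = 0 for every x.
```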

2601.23231 2026-06-09 eess.IV cs.LG 版本更新

Solving Inverse Problems with Flow-based Models via Model Predictive Control

基于模型预测控制的流模型逆问题求解

George Webber, Alexander Denker, Riccardo Barbano, Andrew J Reader

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出MPC-Flow框架,将流模型逆问题求解转化为序列控制子问题,实现无需训练的推理时引导,理论联系最优控制,在图像修复任务中表现优异。

Comments Accepted for publication at ICML 2026

详情
AI中文摘要

基于流的生成模型为逆问题提供了强大的无条件先验,但引导其动态进行条件生成仍然具有挑战性。最近的工作将流模型中的无训练条件生成视为最优控制问题;然而,求解由此产生的轨迹优化在计算和内存上都很密集,需要对流动力学进行微分或伴随求解。我们提出了MPC-Flow,一个模型预测控制框架,将基于流的生成模型的逆问题求解公式化为一系列控制子问题,从而在推理时实现实用的基于最优控制的引导。我们提供了将MPC-Flow与底层最优控制目标联系起来的理论分析,并展示了不同的算法选择如何产生一系列引导算法,包括避免通过生成模型轨迹进行反向传播的机制。我们在基准图像恢复任务上评估了MPC-Flow,涵盖线性和非线性设置,如修复、去模糊和超分辨率,并通过在消费级硬件上对FLUX.2(32B)进行量化设置下的无训练引导,展示了强大的性能和可扩展性到大规模最先进架构。

英文摘要

Flow-based generative models provide strong unconditional priors for inverse problems, but guiding their dynamics for conditional generation remains challenging. Recent work casts training-free conditional generation in flow models as an optimal control problem; however, solving the resulting trajectory optimisation is computationally and memory intensive, requiring differentiation through the flow dynamics or adjoint solves. We propose MPC-Flow, a model predictive control framework that formulates inverse problem solving with flow-based generative models as a sequence of control sub-problems, enabling practical optimal control-based guidance at inference time. We provide theoretical analysis linking MPC-Flow to the underlying optimal control objective and show how different algorithmic choices yield a spectrum of guidance algorithms, including regimes that avoid backpropagation through the generative model trajectory. We evaluate MPC-Flow on benchmark image restoration tasks, spanning linear and non-linear settings such as in-painting, deblurring, and super-resolution, and demonstrate strong performance and scalability to massive state-of-the-art architectures via training-free guidance of FLUX.2 (32B) in a quantised setting on consumer hardware.

2601.20408 2026-06-09 cs.DC cs.AI 版本更新

Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT

满足SLO,节省时间:使用OptiKIT实现企业级LLM自动化优化

Nicholas Santavas, Kareem Eissa, Patrycja Cieplicka, Piotr Florek, Matteo Nulli, Stefan Vasilev, Seyyed Hadi Hashemi, Antonios Gasteratos, Shahram Khadivi

发表机构 * Anonymous Authors(匿名作者)

AI总结 提出OptiKIT分布式LLM优化框架,通过自动化复杂优化流程,为非专家团队提供动态资源分配和流水线执行,实现GPU吞吐量提升2倍以上,降低优化门槛。

Comments Accepted in MLSys 2026

详情
AI中文摘要

企业级LLM部署面临关键的可扩展性挑战:组织必须在有限的计算预算内系统性地优化模型以扩展AI计划,然而手动优化所需的专业知识仍然稀缺。这一挑战在管理异构基础设施上的GPU利用率,同时使具有不同工作负载且LLM优化经验有限的团队能够高效部署模型时尤为明显。我们提出了OPTIKIT,一个分布式LLM优化框架,通过自动化非专家团队的复杂优化工作流程,使模型压缩和调优民主化。OPTIKIT提供动态资源分配、带自动清理的分阶段流水线执行以及无缝的企业集成。在生产中,它实现了超过2倍的GPU吞吐量提升,同时使应用团队无需深厚的LLM优化专业知识即可获得一致的性能改进。我们分享了平台设计以及资源管理、流水线编排和集成模式的关键工程见解,这些实现了大规模、生产级模型优化的民主化。最后,我们开源该系统以促进外部贡献和更广泛的可重复性。

英文摘要

Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2x GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource management, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.

2601.11541 2026-06-09 cs.HC cs.AI cs.CY 版本更新

A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics

学生视角下技术写作反馈质量比较研究:评估计算机科学主题中的LLM、SLM和人类

Suqing Liu, Runlong Ye, Christopher Eaton, Bogdan Simion, Michael Liut

发表机构 * McMaster University(麦克马斯特大学) Department of Computer Science, University of Toronto(多伦多大学计算机科学系) Research Institute for the Study of University Pedagogy, University of Toronto Mississauga(多伦多大学密西沙加分校大学教学研究所) Department of Mathematical and Computational Sciences, University of Toronto Mississauga(多伦多大学密西沙加分校数学与计算科学系)

AI总结 本研究比较了本地部署的小语言模型(SLM)、商业大语言模型(LLM)和人类导师在计算机科学课程中提供写作反馈的质量,发现SLM在可读性和可操作性上获得学生更高评价,而人类反馈在专业写作任务中更受青睐。

Comments Accepted at AIED 26

详情
AI中文摘要

为了解决计算机科学中反馈的可扩展性问题,同时减轻商业大语言模型(LLM)的隐私和成本限制,本研究评估了一个本地托管的小语言模型(SLM)。我们在入门编程(N=176)、操作系统(N=80)和写作研讨会(N=7)中部署了量化后的Llama-3.1、GPT-4和人类导师。对学生感知的混合方法分析显示,虽然本地SLM与商业LLM相当,并且在技术课程中学生在可读性和可操作性方面给予其更高评价,但人类反馈在高度专业化的写作任务中仍然更受青睐。我们证明,本地SLM为基础反馈提供了一种保护隐私、零边际成本的替代方案,支持分层教学框架,其中AI处理结构指导,而教师专注于高层次的脚手架概念。

英文摘要

To address the scalability of feedback in computer science while mitigating the privacy and cost limitations of commercial Large Language Models (LLMs), this study evaluates a locally hosted Small Language Model (SLM). We deployed a quantized Llama-3.1, GPT-4, and human instructors across introductory programming (N=176), operating systems (N=80), and a writing seminar (N=7). Mixed-methods analysis of student perceptions reveals that while the local SLM matched commercial LLMs and was rated higher by students for readability and actionability in technical courses, human feedback remained more favoured for highly specialized writing tasks. We demonstrate that local SLMs offer a privacy-preserving, zero-marginal-cost alternative for foundational feedback, supporting a tiered pedagogical framework where AI handles structural guidance while instructors focus on high-level conceptual scaffolding.

2601.07013 2026-06-09 stat.ML cs.LG 版本更新

Conditional Normalizing Flows for Forward and Backward Joint State and Parameter Estimation

条件归一化流用于前向和后向联合状态与参数估计

Luke S. Lagunowich, Guoxiang Grayson Tong, Daniele E. Schiavazzi

发表机构 * Department of Computer Science and Engineering, University of Notre Dame(圣母大学计算机科学与工程系) Department of Pediatrics, Stanford University(斯坦福大学儿科系) Department of Applied and Computational Mathematics and Statistics, University of Notre Dame(圣母大学应用与计算数学与统计系)

AI总结 针对非线性非高斯系统,回顾基于条件归一化流的状态滤波方法,其条件嵌入由MLP、Transformer或Mamba-SSM生成,并测试最优传输启发的动力学损失对缓解过参数化的作用,在自动驾驶和COVID-19联合SIR预测与参数估计中评估性能。

详情
AI中文摘要

传统的状态估计滤波算法——如经典卡尔曼滤波、无迹卡尔曼滤波和粒子滤波——在应用于不确定性遵循任意非高斯且可能多峰分布的非线性系统时,性能会下降。本研究回顾了基于条件归一化流进行非线性滤波的状态估计最新方法,其中条件嵌入由标准MLP架构、Transformer或选择性状态空间模型(如Mamba-SSM)生成。此外,我们测试了最优传输启发的动力学损失项在缓解由大量变换组成的流中过参数化问题的有效性。我们研究了这些方法在自动驾驶和患者群体动力学相关应用中的性能,特别关注它们如何处理时间反转和链式预测。最后,我们评估了各种条件策略在真实世界COVID-19联合SIR系统预测和参数估计应用中的性能。

英文摘要

Traditional filtering algorithms for state estimation -- such as classical Kalman filtering, unscented Kalman filtering, and particle filters -- show performance degradation when applied to nonlinear systems whose uncertainty follows arbitrary non-Gaussian, and potentially multi-modal distributions. This study reviews recent approaches to state estimation via nonlinear filtering based on conditional normalizing flows, where the conditional embedding is generated by standard MLP architectures, transformers or selective state-space models (like Mamba-SSM). In addition, we test the effectiveness of an optimal-transport-inspired kinetic loss term in mitigating overparameterization in flows consisting of a large collection of transformations. We investigate the performance of these approaches on applications relevant to autonomous driving and patient population dynamics, paying special attention to how they handle time inversion and chained predictions. Finally, we assess the performance of various conditioning strategies for an application to real-world COVID-19 joint SIR system forecasting and parameter estimation.
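For readers unfamiliar with the SIR system mentioned in this abstract, the sketch below simulates the classical SIR compartment model with a forward Euler step. It only illustrates the kind of forward model whose state (S, I, R) and parameters (beta, gamma) would be jointly estimated. The discretization, parameter values, and function name are assumptions for illustration, and no normalizing flow or filtering is implemented.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt, steps):
    """Forward Euler simulation of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I."""
    s, i, r = s0, i0, r0
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt    # mass moving from S to I this step
        new_recoveries = gamma * i * dt       # mass moving from I to R this step
        s, i, r = s - new_infections, i + new_infections - new_recoveries, r + new_recoveries
        trajectory.append((s, i, r))
    return trajectory

# Hypothetical parameters: population fractions, beta/gamma = 3.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, dt=0.1, steps=1000)
final_s, final_i, final_r = trajectory[-1]
print(f"final state: S={final_s:.4f}, I={final_i:.4f}, R={final_r:.4f}")
```

Each Euler step moves mass between compartments without creating or destroying it, so S + I + R stays constant up to floating-point error, which is a quick sanity check on any estimated trajectory.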

2601.06077 2026-06-09 cs.IT cs.AI cs.LG math.IT math.OC 版本更新

One if by Land, Two if by Sea, Three if by Four Seas, and More to Come -- Values of Perception, Prediction, Communication, and Common Sense in Decision Making

陆路一盏,海路两盏,四海三盏,更多将至:感知、预测、通信与常识在决策中的价值

Aolin Xu

发表机构 * Aolin Xu(徐傲林)

AI总结 本文严格定义决策中感知、预测、通信和常识的价值,发现无预测的感知价值可能为负,而预测价值非负,并应用于自主决策系统设计。

详情
AI中文摘要

本文旨在严格定义决策中感知、预测、通信和常识的价值。所定义的量是决策论意义上的,但具有信息论上的类比,例如,它们与香农熵和互信息共享一些简单但关键的数学性质,并且在特定设置中可以简化为这些量。一个有趣的观察是,没有预测的感知价值可能为负,而感知与预测一起的价值以及单独预测的价值总是非负的。这些定义为自主决策系统设计中出现的实际问题提供了答案。示例问题包括:我们是否需要观察和预测特定代理的行为?其重要性如何?观察和预测代理的最佳顺序是什么?这些定义也可能为认知科学和神经科学提供见解,有助于理解自然决策者如何利用从不同来源和操作中获得的信息。

英文摘要

This work aims to rigorously define the values of perception, prediction, communication, and common sense in decision making. The defined quantities are decision-theoretic, but have information-theoretic analogues, e.g., they share some simple but key mathematical properties with Shannon entropy and mutual information, and can reduce to these quantities in particular settings. One interesting observation is that the value of perception without prediction can be negative, while the value of perception together with prediction and the value of prediction alone are always nonnegative. The defined quantities suggest answers to practical questions arising in the design of autonomous decision-making systems. Example questions include: Do we need to observe and predict the behavior of a particular agent? How important is it? What is the best order in which to observe and predict the agents? The defined quantities may also provide insights into cognitive science and neuroscience, toward an understanding of how natural decision makers make use of information gained from different sources and operations.

2601.05261 2026-06-09 cs.IR cs.LG 版本更新

Improving User Experience with Personalized Review Ranking and Summarization

通过个性化评论排名和摘要提升用户体验

Muhammad Jawad Mufti, Omar Hammad, MD. Mahfuzur Rahman

发表机构 * Information and Computer Science Dept., King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学信息与计算机科学系) Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS), King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学智能安全系统交叉研究中心(IRC-ISS))

AI总结 提出融合用户偏好建模、混合情感估计、方面级评论匹配和LLM摘要的个性化评论排名与摘要框架,在亚马逊数据集和用户研究中优于现有方法。

详情
AI中文摘要

在线消费者评论是电子商务中重要的决策支持资源,然而日益增长的评论量常常造成信息过载,使用户难以识别符合个人偏好的内容。现有的评论排名方法通常依赖星级评分、有用性投票或时效性等聚合信号,这些可能无法反映用户特定兴趣。本文提出了一种个性化评论排名和摘要框架,融合了用户偏好建模、混合情感估计、方面级评论匹配和基于大语言模型(LLM)的摘要。该框架首先从历史评论中提取方面级偏好和情感信号,然后结合用户选择的产品方面和书面评论输入来构建个性化用户画像。通过比较该画像与评论级别的方面和情感表示,对候选评论进行排名。随后对排名靠前的评论进行摘要,以提供简洁且符合偏好的信息。该方法使用亚马逊移动电子产品评论数据集和一项涉及70名参与者的结构化用户研究(涵盖常见消费电子产品类别)进行评估。结果表明,所提出的排名方法优于随机排序、基于星级评分、有用性投票、时效性和语义相似度的排名。用户研究结果进一步表明,该方法在满意度、感知相关性、决策信心、信息查找便捷性和阅读效率方面均有提升。研究结果表明,结合方面级个性化、情感感知排名和基于LLM的摘要可以减少评论过载,支持更高效的用户中心决策。

英文摘要

Online consumer reviews are important decision-support resources in e-commerce, yet the increasing volume of reviews often creates information overload and makes it difficult for users to identify content that matches their individual preferences. Existing review-ranking approaches commonly rely on aggregate signals such as star ratings, helpfulness votes, or recency, which may not reflect user-specific interests. This paper proposes a personalized review ranking and summarization framework that integrates user preference modeling, hybrid sentiment estimation, aspect-level review matching, and Large Language Model (LLM)-based summarization. The framework first extracts aspect-level preferences and sentiment signals from historical reviews. It then incorporates user-selected product aspects and written review input to build a personalized user profile. Candidate reviews are ranked by comparing this profile with review-level aspect and sentiment representations. The top-ranked reviews are then summarized to provide concise, preference-aligned information. The proposed method was evaluated using an Amazon Mobile Electronics review dataset and a structured user study involving 70 participants across common consumer electronics categories. Results show that the proposed ranking method outperformed random ordering, star-rating-based ranking, helpfulness-vote ranking, recency-based ranking, and semantic-similarity-based ranking. User-study results further indicate improvements in satisfaction, perceived relevance, decision-making confidence, ease of finding information, and reading efficiency. The findings suggest that combining aspect-level personalization, sentiment-aware ranking, and LLM-based summarization can reduce review overload and support more efficient user-centered decision-making.

2601.04266 2026-06-09 cs.CR cs.LG 版本更新

State Backdoor: Towards Stealthy Real-world Poisoning Attack on Vision-Language-Action Model in State Space

状态后门:针对状态空间中视觉-语言-动作模型的隐蔽现实世界投毒攻击

Ji Guo, Wenbo Jiang, Yansong Lin, Yijing Liu, Ruichen Zhang, Guomin Lu, Aiguo Chen, Xinshuo Han, Hongwei Li

发表机构 * Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China(电子科技大学智能协同计算实验室) National Key Laboratory of Wireless Communications, University of Electronic Science and Technology of China(电子科技大学无线通信国家重点实验室) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出状态后门攻击,利用机器人手臂初始状态作为触发器,通过偏好引导遗传算法优化触发器的隐蔽性和有效性,在五个VLA模型和五个真实任务中实现超过90%的攻击成功率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型广泛部署于机器人等安全关键的具身AI应用中。然而,其复杂的多模态交互也暴露了新的安全漏洞。本文研究了VLA模型中的后门威胁,即恶意输入导致目标错误行为,同时保持对干净数据的性能。现有后门方法主要依赖在视觉模态中插入可见触发器,由于环境变化,这类方法在现实场景中鲁棒性差且隐蔽性低。为克服这些限制,我们引入状态后门,一种新颖且实用的后门攻击,利用机器人手臂的初始状态作为触发器。为优化触发器的隐蔽性和有效性,我们设计了偏好引导遗传算法(PGA),高效搜索状态空间以找到最小但有效的触发器。在五个代表性VLA模型和五个真实任务上的大量实验表明,我们的方法在不影响良性任务性能的情况下实现了超过90%的攻击成功率,揭示了具身AI系统中一个未被充分探索的漏洞。

英文摘要

Vision-Language-Action (VLA) models are widely deployed in safety-critical embodied AI applications such as robotics. However, their complex multimodal interactions also expose new security vulnerabilities. In this paper, we investigate a backdoor threat in VLA models, where malicious inputs cause targeted misbehavior while preserving performance on clean data. Existing backdoor methods predominantly rely on inserting visible triggers into the visual modality, and they suffer from poor robustness and low insusceptibility in real-world settings due to environmental variability. To overcome these limitations, we introduce the State Backdoor, a novel and practical backdoor attack that leverages the robot arm's initial state as the trigger. To optimize the trigger for insusceptibility and effectiveness, we design a Preference-guided Genetic Algorithm (PGA) that efficiently searches the state space for minimal yet potent triggers. Extensive experiments on five representative VLA models and five real-world tasks show that our method achieves over 90% attack success rate without affecting benign task performance, revealing an underexplored vulnerability in embodied AI systems.

2601.04178 2026-06-09 eess.AS cs.SD 版本更新

Sound Event Detection with Boundary-Aware Optimization and Inference

基于边界感知优化与推理的声音事件检测

Florian Schmid, Chi Ian Tang, Sanjeel Parekh, Vamsi Krishna Ithapu, Juan Azcarreta Ortiz, Giacomo Ferroni, Yijun Qian, Arnoldas Jasonas, Cosmin Frateanu, Camilla Clark, Gerhard Widmer, Çağdaş Bilen

发表机构 * Meta Institute of Computational Perception(计算感知研究所) Linz Institute of Technology (LIT)(林茨技术研究所) Meta Reality Labs Research(Meta现实实验室研究)

AI总结 提出边界感知优化与推理策略,通过显式建模事件的起始点和结束点,结合循环事件检测与事件提议网络,在AudioSet强标注子集上实现无需后处理调参的SOTA性能。

Comments Accepted for publication in IEEE Signal Processing Letters, 2026

详情
AI中文摘要

时间检测问题出现在许多领域,包括时间序列估计、活动识别和声音事件检测(SED)。在这项工作中,我们提出了一种新的时间事件建模方法,通过显式建模事件的起始点和结束点,并引入边界感知优化和推理策略,显著增强了时间事件检测。所提出的方法包含了新的时间建模层,即循环事件检测(RED)和事件提议网络(EPN),它们与定制的损失函数一起,实现了更有效和精确的时间事件检测。我们在SED领域使用AudioSet中时间强标注部分的一个子集评估了所提出的方法。实验结果表明,我们的方法不仅优于具有最先进后处理的传统逐帧SED模型,而且消除了后处理超参数调优的需要,并可扩展到所有AudioSet强标注类别上,取得新的最先进性能。

英文摘要

Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.

2601.02424 2026-06-09 cond-mat.mtrl-sci cs.AI 版本更新

A large-scale nanocrystal database with aligned synthesis and properties enabling generative inverse design

大规模纳米晶体数据库:对齐合成与性质实现生成式逆向设计

Kai Gu, Yingping Liang, Senliang Peng, Aotian Guo, Haizheng Zhong, Ying Fu

发表机构 * MIIT Key Laboratory for Low-Dimensional Quantum Structure and Devices, School of Materials Sciences & Engineering, Beijing Institute of Technology(北京理工大学材料科学与工程学院,工业和信息化部低维量子结构与器件重点实验室) School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院)

AI总结 构建大规模对齐的纳米晶体合成-性质数据库,开发基于大语言模型的NanoExtractor提取文献数据,并利用NanoDesigner实现生成式逆向设计,成功设计PbSe和MgF2纳米晶体的合成路线。

详情
AI中文摘要

由于合成参数与物理化学性质之间的复杂相关性,纳米晶体的合成高度依赖于试错法。尽管深度学习为生成式逆向设计提供了潜在方法,但缺乏对齐纳米晶体合成路线与其性质的高质量数据集仍阻碍其发展。本文介绍了一个大规模、对齐的纳米晶体合成-性质(NSP)数据库的构建,并展示了其用于生成式逆向设计的能力。为了从文献中提取结构化的合成路线及其对应的产物性质,我们开发了NanoExtractor,这是一个通过精心设计的增强策略增强的大语言模型(LLM)。NanoExtractor经过人类专家验证,在测试集上达到88%的加权平均分,显著优于化学专用(3%)和通用LLM(38%)。生成的NSP数据库包含近16万条对齐条目,并作为我们的NanoDesigner(一个用于逆向合成设计的LLM)的训练数据。NanoDesigner的生成能力通过成功设计成熟的PbSe纳米晶体和罕见的MgF2纳米晶体的可行合成路线得到验证。值得注意的是,模型为MgF2纳米晶体推荐了反直觉的非化学计量前驱体比例(1:1),实验证实该比例对抑制副产物至关重要。我们的工作弥合了非结构化文献与数据驱动合成之间的差距,并建立了一个强大的人机协作范式,以加速纳米晶体的发现。

英文摘要

The synthesis of nanocrystals has been highly dependent on trial-and-error, due to the complex correlation between synthesis parameters and physicochemical properties. Although deep learning offers a potential methodology to achieve generative inverse design, it is still hindered by the scarcity of high-quality datasets that align nanocrystal synthesis routes with their properties. Here, we present the construction of a large-scale, aligned Nanocrystal Synthesis-Property (NSP) database and demonstrate its capability for generative inverse design. To extract structured synthesis routes and their corresponding product properties from literature, we develop NanoExtractor, a large language model (LLM) enhanced by well-designed augmentation strategies. NanoExtractor is validated against human experts, achieving a weighted average score of 88% on the test set, significantly outperforming chemistry-specialized (3%) and general-purpose LLMs (38%). The resulting NSP database contains nearly 160,000 aligned entries and serves as training data for our NanoDesigner, an LLM for inverse synthesis design. The generative capability of NanoDesigner is validated through the successful design of viable synthesis routes for both well-established PbSe nanocrystals and rarely reported MgF2 nanocrystals. Notably, the model recommends a counter-intuitive, non-stoichiometric precursor ratio (1:1) for MgF2 nanocrystals, which is experimentally confirmed as critical for suppressing byproducts. Our work bridges the gap between unstructured literature and data-driven synthesis, and also establishes a powerful human-AI collaborative paradigm for accelerating nanocrystal discovery.

2512.20978 2026-06-09 eess.AS cs.AI cs.LG 版本更新

GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

GenTSE: 通过粗到细的生成语言模型增强目标说话人提取

Haoyang Li, Xuyi Zhuang, Azmat Adnan, Ye Ni, Wei Rao, Shreyas Gopal, Eng Siong Chng, Boon Siew Han, Yuanjin Zheng

发表机构 * Nanyang Technological University, Singapore(新加坡南洋理工大学) Southeast University, China(中国东南大学) Schaeffler Hub for Advanced REsearch (SHARE) at Nanyang Technological University, Singapore(新加坡南洋理工大学舍弗勒先进研究中心(SHARE))

AI总结 提出GenTSE,一种两阶段仅解码器生成式语言模型,先预测粗粒度语义标记再生成细粒度声学标记,结合冻结语言模型条件训练和直接偏好优化,在Libri2Mix上超越先前基于语言模型的系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

基于语言模型(LM)的生成建模已成为目标说话人提取(TSE)的一个有前景的方向,具有改善泛化能力和高保真语音的潜力。我们提出GenTSE,一种用于TSE的两阶段仅解码器生成式语言模型:第一阶段预测粗粒度语义标记,第二阶段生成细粒度声学标记。分离语义和声学稳定了解码过程,并产生更准确的目标语音。两个阶段均使用连续的SSL或编解码器嵌入,相比离散提示方法提供更丰富的上下文。为减少曝光偏差,我们采用冻结语言模型条件训练策略,使语言模型以早期检查点预测的标记为条件,从而缩小教师强制训练与自回归推理之间的差距。我们进一步应用直接偏好优化(DPO)以更好地将输出与感知偏好对齐。在Libri2Mix上的实验表明,GenTSE在语音质量、可懂度和说话人一致性方面超越了先前基于语言模型的系统。

英文摘要

Language Model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further apply Direct Preference Optimization (DPO) to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.