arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

检索范围排序方式

检索时间范围

重置

HOT 人工智能、机器人等 9

cs.AI 人工智能 cs.CV 计算机视觉 cs.CL 自然语言处理 cs.RO 机器人 cs.LG 机器学习 cs.SD 声音 cs.ET 新兴技术 eess.AS 音频语音 eess.IV 图像视频

CS 计算机 41

cs 计算机 cs.AI 人工智能 cs.AR 硬件架构 cs.CC 计算复杂性 cs.CE 计算工程 cs.CG 计算几何 cs.CL 自然语言处理 cs.CR 密码安全 cs.CV 计算机视觉 cs.CY 计算机与社会 cs.DB 数据库 cs.DC 分布式计算 cs.DL 数字图书馆 cs.DM 离散数学 cs.DS 数据结构 cs.ET 新兴技术 cs.FL 形式语言 cs.GL 综述文献 cs.GR 图形学 cs.GT 博弈论 cs.HC 人机交互 cs.IR 信息检索 cs.IT 信息论 cs.LG 机器学习 cs.LO 计算机逻辑 cs.MA 多智能体 cs.MM 多媒体 cs.MS 数学软件 cs.NA 数值分析 cs.NE 神经进化 cs.NI 网络架构 cs.OH 其他计算机 cs.OS 操作系统 cs.PF 性能 cs.PL 编程语言 cs.RO 机器人 cs.SC 符号计算 cs.SD 声音 cs.SE 软件工程 cs.SI 社会信息网络 cs.SY 系统控制

ECON 经济学 4

econ 经济学 econ.EM 计量经济 econ.GN 一般经济 econ.TH 理论经济

EESS 电气与系统 5

eess 电气与系统 eess.AS 音频语音 eess.IV 图像视频 eess.SP 信号处理 eess.SY 系统控制

MATH 数学 33

math 数学 math.AC 交换代数 math.AG 代数几何 math.AP 偏微分方程 math.AT 代数拓扑 math.CA 经典分析 math.CO 组合数学 math.CT 范畴论 math.CV 复变函数 math.DG 微分几何 math.DS 动力系统 math.FA 泛函分析 math.GM 一般数学 math.GN 一般拓扑 math.GR 群论 math.GT 几何拓扑 math.HO 历史综述 math.IT 信息论 math.KT K理论 math.LO 逻辑 math.MG 度量几何 math.MP 数学物理 math.NA 数值分析 math.NT 数论 math.OA 算子代数 math.OC 优化控制 math.PR 概率 math.QA 量子代数 math.RA 环与代数 math.RT 表示论 math.SG 辛几何 math.SP 谱理论 math.ST 统计理论

PHYSICS 物理 55

astro-ph 天体物理 astro-ph.CO 宇宙学 astro-ph.EP 地球行星 astro-ph.GA 星系物理 astro-ph.HE 高能天体 astro-ph.IM 天文仪器 astro-ph.SR 太阳恒星 cond-mat 凝聚态 cond-mat.dis-nn 无序神经 cond-mat.mes-hall 介观纳米 cond-mat.mtrl-sci 材料科学 cond-mat.other 其他凝聚态 cond-mat.quant-gas 量子气体 cond-mat.soft 软凝聚态 cond-mat.stat-mech 统计力学 cond-mat.str-el 强关联电子 cond-mat.supr-con 超导 gr-qc 广义相对论 hep-ex 高能实验 hep-lat 格点高能 hep-ph 高能唯象 hep-th 高能理论 math-ph 数学物理 nlin 非线性科学 nlin.AO 自适应系统 nlin.CD 混沌动力学 nlin.CG 胞自动机 nlin.PS 斑图孤子 nlin.SI 可积系统 nucl-ex 核物理实验 nucl-th 核物理理论 physics 物理 physics.acc-ph 加速器物理 physics.ao-ph 大气海洋 physics.app-ph 应用物理 physics.atm-clus 原子分子团簇 physics.atom-ph 原子物理 physics.bio-ph 生物物理 physics.chem-ph 化学物理 physics.class-ph 经典物理 physics.comp-ph 计算物理 physics.data-an 数据分析 physics.ed-ph 物理教育 physics.flu-dyn 流体动力学 physics.gen-ph 普通物理 physics.geo-ph 地球物理 physics.hist-ph 物理史哲 physics.ins-det 仪器探测 physics.med-ph 医学物理 physics.optics 光学 physics.plasm-ph 等离子体 physics.pop-ph 科普物理 physics.soc-ph 物理与社会 physics.space-ph 空间物理 quant-ph 量子物理

Q-BIO 定量生物 11

q-bio 定量生物 q-bio.BM 生物分子 q-bio.CB 细胞行为 q-bio.GN 基因组学 q-bio.MN 分子网络 q-bio.NC 神经认知 q-bio.OT 其他定量生物 q-bio.PE 种群进化 q-bio.QM 定量方法 q-bio.SC 亚细胞过程 q-bio.TO 组织器官

Q-FIN 定量金融 10

q-fin 定量金融 q-fin.CP 计算金融 q-fin.EC 经济学 q-fin.GN 一般金融 q-fin.MF 数学金融 q-fin.PM 投资组合 q-fin.PR 证券定价 q-fin.RM 风险管理 q-fin.ST 统计金融 q-fin.TR 交易微观结构

STAT 统计 7

stat 统计 stat.AP 统计应用 stat.CO 统计计算 stat.ME 统计方法 stat.ML 机器学习 stat.OT 其他统计 stat.TH 统计理论

2606.20101 2026-06-19 cs.SD cs.AI cs.MM 交叉投稿

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

基于整流流的混合扩散变压器用于指令引导音频编辑

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

发表机构 * Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey（萨里大学视觉、语音与信号处理中心）； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）； Fisheries College, Ocean University of China（中国海洋大学水产学院）； College of Information and Electrical Engineering, China Agricultural University（中国农业大学信息与电气工程学院）

AI总结提出混合两阶段扩散变压器架构，通过粗到细策略平衡全局语义对齐与局部细节编辑，在重叠音频事件和复杂指令任务上提升性能与效率。

详情

AI中文摘要

音频编辑旨在根据自然语言指令修改现有音频剪辑中的特定内容，同时保留其余声学内容。尽管扩散模型取得了显著进展，但现有的基于训练的编辑方法主要依赖于卷积U-Net骨干中的局部归纳偏差和交叉注意力交互，这通常阻碍了长程语义对齐以及对指令的精确理解和定位。相比之下，扩散变压器提供了更强的全局建模和多模态融合，但现有的编辑架构通常采用MMDiT和DiT块的简单堆叠。在所有块中对拼接的音频和文本标记应用联合注意力会导致相对于标记长度的二次复杂度。为了平衡编辑性能和效率，我们提出了一种基于整流流匹配的混合两阶段扩散变压器架构，用于指令引导音频编辑。它在低分辨率阶段对音频和文本标记进行联合注意力以建立粗略的语义对齐，然后在高分辨率阶段切换到交替的联合注意力和交叉注意力块以细化编辑细节。这种从粗到细的策略实现了高效且准确的指令引导音频编辑。实验表明，所提出的框架在涉及重叠音频事件和复杂指令的具有挑战性的编辑任务上取得了显著的性能提升，同时通过紧凑模型大幅提高了编辑效率。

英文摘要

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.

URL PDF HTML ☆

赞 0 踩 0

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 交叉投稿

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror：在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon（亚马逊）

AI总结提出MakeupMirror扩散模型，通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器，在保持面部特征和肤色的同时实现高质量化妆迁移，相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

详情

AI中文摘要

化妆迁移模型能够实现有趣的增强现实（AR）体验以及在线化妆购物的虚拟试妆（VTO）。尽管最近最先进的基于扩散的解决方案（如Stable-Makeup）显著提高了化妆迁移的准确性和逼真度，但在身份和肤色保持方面仍存在局限性，使得用于化妆购物的生产级VTO不切实际。在这项工作中，我们提出了MakeupMirror，一种基于扩散的化妆迁移方法，在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新：（1）将面部几何条件与ControlNets集成以保持面部保真度；（2）区域特定的化妆迁移控制，以便在面部区域（如皮肤、眼睛和嘴唇）实现精确的化妆应用；（3）基于肤色的化妆迁移调制，防止跨主体迁移场景中的肤色改变；（4）集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及（本文新收集的、更多样化的）MakeupSelfies数据集上的实验表明，与Stable-Makeup相比，MakeupMirror将相对面部识别相似度提高了+60%，将相对肤色差异降低了-50%，延迟为0.7秒，同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevent skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by +60%, reduces relative skin tone difference by -50% over Stable-Makeup, with a latency of 0.7s, while achieving expert acceptance rate of 94% across core facial identity preservation criteria.

URL PDF HTML ☆

赞 0 踩 0

2606.19936 2026-06-19 cs.LO cs.MM 交叉投稿

Prismriver: Formalization of Music Theory and Algorithmic Composition in Lean 4

Prismriver：Lean 4 中音乐理论与算法作曲的形式化

Leni Aniva, Claire Wang

AI总结使用 Lean 4 形式化音乐理论，实现可验证的算法作曲与伴奏生成，并支持音乐结构的单子分析。

2606.19658 2026-06-19 cs.AI cs.IR cs.MM 交叉投稿

Denoising Implicit Feedback for Cold-start Recommendation

去噪隐式反馈用于冷启动推荐

Gaode Chen, Shicheng Wang, Shikun Li, Rui Huang, Xinghua Zhang, Yunze Luo, Shipeng Li, Shiming Ge, Ruina Sun, Yinjie Jiang, Jun Zhang

发表机构 * Hong Kong Baptist University（香港浸会大学）； Independent Researcher（独立研究员）； Peking University（北京大学）； Nanjing University（南京大学）； Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）

AI总结针对冷启动推荐中隐式反馈噪声问题，提出模型无关的去噪方法DIF，通过内容相似性推断伪标签并建模置信度与不确定性，在快手应用中显著提升冷启动场景商业指标。

Comments Accepted by KDD 2026 ADS Track

详情

AI中文摘要

隐式反馈因其可获取性和通用性被广泛用于推荐系统，但通常包含噪声样本（如点击诱饵、位置偏差）。同时，由于新物品的持续涌入，推荐器不可避免地面临物品冷启动问题。我们识别出冷物品因上述因素更容易受到噪声样本的影响，而研究者往往忽视了为冷物品去噪隐式反馈的重要性。先前的去噪研究通常基于启发式模式（如高损失值）识别噪声样本，并通过样本选择或重加权来减轻噪声。然而，这些方法适应性有限，在冷启动场景中效果不佳。为了实现冷启动推荐中的隐式反馈去噪，我们提出了一种模型无关的去噪方法DIF。首先，用户对内容的偏好是稳定的，这使我们能够通过内容相似的热物品推断出指示用户是否对冷物品感兴趣的伪标签。其次，为了提高伪标签准确性，我们基于冷物品与热物品的内容相似性对伪标签的置信度进行建模，然后为每个样本聚合多个伪标签。最后，我们通过考虑噪声样本标签的相对熵和物品的冷启动状态，显式估计其不确定性，从而自适应地指导伪标签在样本级别纠正噪声标签。DIF的优越性得到了理论证明和真实数据集上大量实验的支持。该方法已部署在十亿用户规模的短视频应用快手上，并在冷启动场景中显著提升了各项商业指标。

英文摘要

Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually presents noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To achieve denoising implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, user preferences for content remain stable, which allows us to infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF's superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on a billion-user scale short video application Kuaishou and has significantly improved various commercial metrics within cold-start scenarios.

URL PDF HTML ☆

赞 0 踩 0

2602.15707 2026-06-19 cs.MM cs.CL cs.LG 版本更新

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

基于音频和IMU的主动式程序性任务对话助手

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

发表机构 * Qualcomm Technologies, Inc.（高通技术公司）

AI总结提出首个仅使用音频和IMU模态的实时对话助手，通过微调语言模型减少不必要对话并提升问答准确性，在边缘设备上实现无云依赖。

Comments 5 figures. 5 more in appendix

详情

AI中文摘要

实时对话助手用于程序性手工任务通常依赖视频输入，这会导致计算成本高且侵犯用户隐私。我们首次提出一种实时对话助手，仅使用来自用户可穿戴设备的轻量级隐私保护模态（如音频和IMU输入）来理解上下文，为程序性手工任务提供全面指导。通过家具组装任务和烹饪任务，我们展示了该助手如何主动向执行程序性任务的用户提供逐步指令，并回答用户问题。我们阐述了实现该助手的数据生成方法和系统设计。观察到现成的语言模型健谈但并非总能正确回答问题，我们展示了微调模型如何将其减少不必要对话的能力提升50%（精确度），同时将正确回答问题的能力提升150%（召回率）。我们进一步描述了如何在边缘设备上实现该助手，无需依赖云端。

英文摘要

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. On observing that an off-the-shelf language model is a talkative assistant but is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogues with a 50% increase in the precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

URL PDF HTML ☆

赞 0 踩 0