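arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.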

2605.06607 2026-05-14 physics.flu-dyn cs.AI

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

Nithin Somasekharan, Rabi Pathak, Manushri Dhanakoti, Tingwen Zhang, Ling Yue, Andy Zhu, Shaowu Pan

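Affiliations: Rensselaer Polytechnic Institute

AI summary: This work presents AI CFD Scientist, an open-source AI-scientist system aimed at open-ended scientific discovery in computational fluid dynamics (CFD). The system combines physics-based visual verification, code generation and modification, and literature-grounded hypothesis generation, covering the full path from ideation to experimental validation within a single auditable workflow. Its vision-language physics verification step detects errors that conventional solver checks miss, and across several tasks the system automatically discovers improved models and raises simulation accuracy.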

Comments 9 main pages and rest in appendix

2605.01750 2026-05-14 cs.MA cs.AI

Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Yiheng Yao, Chelsea Zou, Robert D. Hawkins

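Affiliations: Stanford University

AI summary: This paper studies failures of dynamic grounding, and their repair, in multi-agent negotiation. It argues that current LLM-based multi-agent benchmarks focus on static tasks and overlook agents' ability to repair breakdowns in common ground during interaction. Using a multi-round negotiation game, the authors find that agent dyads exhibit four characteristic failure modes when trying to reach Pareto-optimal allocations. This indicates that the difficulty lies mainly in coordination bottlenecks around joint planning, commitment, and execution, rather than in individual reasoning ability or insufficient information exchange.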
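To make the notion of a Pareto-optimal allocation concrete, here is a minimal sketch. It assumes a hypothetical two-agent game in which indivisible items are split and each agent has additive valuations. The item names, valuations, and function names are illustrative assumptions and are not taken from the paper.

```python
from itertools import product

# Hypothetical valuations: agent -> item -> value (illustrative only).
VALUES = {
    "A": {"book": 3, "hat": 1, "ball": 0},
    "B": {"book": 1, "hat": 1, "ball": 4},
}
ITEMS = ["book", "hat", "ball"]


def utilities(assignment):
    """Return (utility of A, utility of B) for a tuple giving each item's owner."""
    totals = {"A": 0, "B": 0}
    for item, owner in zip(ITEMS, assignment):
        totals[owner] += VALUES[owner][item]
    return totals["A"], totals["B"]


def dominates(u, v):
    """True if u is at least as good for both agents and strictly better for one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal_allocations():
    """Enumerate every allocation and keep those that no other allocation dominates."""
    scored = {alloc: utilities(alloc) for alloc in product("AB", repeat=len(ITEMS))}
    return [
        alloc for alloc, u in scored.items()
        if not any(dominates(v, u) for v in scored.values())
    ]
```

In this toy game, giving everything to agent A is not Pareto-optimal, because handing the ball to B helps B without hurting A. A negotiation benchmark can score agents on whether their final deal lands in the non-dominated set.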

2604.26969 2026-05-14 cs.IR cs.AI

AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

Xidong Wu, Yue Zhuan, Ruoqiao Wei, Hangxin Chen, Di Bai, Jintao Liu, Xinyi Wang, Xue Wang, Luoshu Wang, Xinwu Cheng

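Affiliations: Google

AI summary: Modern recommendation systems are typically built as multi-stage pipelines (pre-ranking, ranking, re-ranking). System-level configuration optimization is critical to overall performance, but the complexity of these systems makes it time-consuming and dependent on specialist expertise. The paper proposes AgenticRecTune, a framework of five specialized agents (Actor, Critic, Insight, Skill, and Online) that uses the reasoning ability of large language models to explore the configuration space automatically. A self-evolving Skillhub module summarizes historical results, extracts the underlying mechanics of each task, and updates optimization skills, enabling efficient, adaptive recommendation system optimization.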

Abstract

Modern large-scale recommendation systems are typically constructed as multi-stage pipelines, encompassing pre-ranking, ranking, and re-ranking phases. While traditional recommendation research typically focuses on optimizing a specific model, such as improving the pre-ranking model structure or the ranking model's training algorithm, system-level configuration optimization, which integrates the output from each model head to get the final score in each stage, plays a crucial role. Due to the complexity of the system, configuration optimization is both highly important and challenging. Any model modification requires new optimal system-level configurations, but each experimental iteration requires significant tuning effort. Furthermore, models in different stages operate within distinct contexts and optimize for different targets, requiring specialized domain expertise. In addition, optimization success depends on balancing multiple competing online metrics and on alignment with shifting production development objectives. To address these challenges, we propose AgenticRecTune, an agentic framework comprising five specialized agents, Actor, Critic, Insight, Skill, and Online, designed to manage the end-to-end configuration optimization workflow. By leveraging the advanced reasoning of Large Language Models (LLMs), specifically Gemini, AgenticRecTune explores the optimal configuration spaces. The Actor Agent proposes multiple candidates and the Critic Agent filters out suboptimal proposals. The Online Agent then autonomously prepares A/B tests based on the configuration set proposed by the Critic Agent and captures the subsequent experimental results. We also introduce a self-evolving Skillhub, which uses a collaboration between the Insight Agent and the Skill Agent to summarize the historical results, extract the underlying mechanics of each task in the recommendation system, and update skills.

2604.25700 2026-05-14 cs.SE cs.LG

Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Pernilla Hall, Anton Ununger, Riccardo Rubei, Alessio Bucaioni

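Affiliations: Mälardalen University, Sweden

AI summary: This paper studies how artificial intelligence can localize software faults from natural-language bug reports, targeting maintenance scenarios in industrial settings. It frames fault localization as a supervised text classification task, compares traditional machine learning models with fine-tuned language models, and evaluates them on five years of real bug reports from ABB Robotics. Traditional models perform better on this dataset, challenging the assumption that language models universally outperform classical methods in industrial domains. The results also show that text-based fault localization from historical bug reports is feasible and practical in industry.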

Abstract

Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated whether artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to source code, execution traces, or static analysis artifacts, making it directly deployable within existing industrial maintenance workflows. We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Our evaluation used proprietary data from ABB Robotics in Sweden, comprising five years of resolved industrial bug reports, each linked to its verified code fix. This setting allowed us to assess model effectiveness under realistic industrial constraints. Our results showed that traditional models using term frequency-inverse document frequency (TF-IDF) features consistently outperformed the fine-tuned language models on this dataset, while data augmentation improved Random Forest performance. These findings challenge the assumption that transformer-based models universally outperform classical approaches in industrial contexts with domain-specific data. We demonstrated that historical bug reports can be systematically used for text-based, artificial intelligence-assisted fault localization, providing a scalable, low-cost, and empirically grounded complement to common debugging practices in industry.

2604.23887 2026-05-14 cs.CR cs.AI

Evaluation of Prompt Injection Defenses in Large Language Models

Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon, Kelley McAllister, Peter Ortiz, Krisztian Flautner

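Affiliations: Swept AI; University of Michigan

AI summary: This study evaluates defenses against prompt injection attacks on large language models and finds that most defenses relying on the model to protect itself eventually fail. After extensive testing with an adaptive attacker, the only defense that held was application-layer output filtering: checking the model's response against hard-coded rules before it reaches the user, which achieved zero leakage. The study stresses that the security boundary should be enforced by application code rather than by the model, and recommends that sensitive operations be restricted to trusted internal personnel until such defenses have been validated.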

Comments 14 pages, 9 figures
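The application-layer output filter described in the summary above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the secret value, the refusal string, and the function names are hypothetical, and the rule covers only a few simple obfuscations (case changes, inserted punctuation or spaces, and reversal).

```python
import re

# Hypothetical secret that the application must never reveal (illustrative only).
SECRET = "TOKEN-1234-ABCD"
REFUSAL = "[response withheld by output filter]"


def normalize(text):
    """Lowercase and drop non-alphanumerics so simple obfuscations still match."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def filter_output(model_response, secret=SECRET):
    """Return the response unchanged unless it contains the secret in a simple form."""
    needle = normalize(secret)
    haystack = normalize(model_response)
    if needle in haystack or needle[::-1] in haystack:
        return REFUSAL
    return model_response
```

The check runs in application code after the model has produced its text, so a successful injection cannot switch it off by instructing the model. A production filter would need to cover more encodings (for example base64 or translation) than this sketch does.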

2604.18242 2026-05-14 math.ST cs.LG stat.ML stat.TH

Horospherical Depth and Busemann Median on Hadamard Manifolds

Yangdi Jiang, Xiaotian Chang, Cyrus Mostajeran

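Affiliations: Division of Mathematical Sciences

AI summary: This paper introduces horospherical depth, an intrinsic statistical depth on Hadamard manifolds, and defines the Busemann median as the set of its maximizers. The construction rests on the fact that the linear functionals in Tukey's halfspace depth are limits of normalized distance functions. On a Hadamard manifold these limits are Busemann functions, whose sublevel sets are horoballs and serve as intrinsic replacements for halfspaces. The depth is parameterized by the visual boundary, is isometry-equivariant, requires neither tangent-space linearization nor a choice of base point, and applies to any Hadamard manifold. Under negative curvature it is strictly quasi-concave with a unique median, and it is robust to contamination and to sample perturbations.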

Comments 52 pages, 10 figures
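The link between halfspaces and horoballs mentioned in the summary above can be checked directly in the Euclidean case. The following is the standard definition of a Busemann function together with a short worked limit. The symbols p, u, and c are notation introduced here for illustration, and this does not reproduce the paper's definition of the depth itself.

```latex
% Busemann function of a unit-speed geodesic ray \gamma with endpoint \xi at infinity
B_{\xi}(x) = \lim_{t \to \infty} \bigl[ d\bigl(x, \gamma(t)\bigr) - t \bigr]

% Euclidean case: \gamma(t) = p + t u with \|u\| = 1
\|x - p - t u\| - t
  = t \sqrt{1 - \tfrac{2 \langle x - p, u \rangle}{t} + \tfrac{\|x - p\|^{2}}{t^{2}}} - t
  \;\longrightarrow\; -\langle x - p, u \rangle \quad (t \to \infty)

% Sublevel sets are halfspaces in the Euclidean case
\{ x : B_{\xi}(x) \le c \} = \{ x : \langle x - p, u \rangle \ge -c \}
```

So in flat space the Busemann function is a linear functional up to sign and shift, and its sublevel sets are exactly the halfspaces used by Tukey depth.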

2604.17104 2026-05-14 cs.DC cs.AI cs.LG

TStore: Rethinking AI Model Hub with Tensor-Centric Compression

Tingfeng Lan, Zirui Wang, Yunjia Zheng, Zhaoyuan Su, Juncheng Yang, Yue Cheng

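Affiliations: University of Virginia; Harvard University

AI summary: As modern AI models grow rapidly in size and redundancy, model hubs face significant storage and distribution challenges. This paper proposes TStore, a tensor-centric system that reduces storage overhead through fine-grained deduplication and compression. TStore uses tensor-level fingerprinting and clustering to identify redundancy across models without requiring annotations. Experiments on real-world model hubs show substantial storage savings while preserving model usability and performance.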

Comments 12 pages, 6 figures. Systems paper on AI model storage
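Tensor-level fingerprinting can be illustrated with a toy content-addressed store. This sketch is a deliberate simplification: it detects only exact duplicates via SHA-256 over float32 bytes, ignores shapes and dtypes, and omits the clustering and compression that the summary attributes to TStore. The class and method names are hypothetical.

```python
import hashlib
import struct


def fingerprint(values):
    """SHA-256 over the tensor's float32 bytes; equal tensors get equal fingerprints."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()


class TensorStore:
    """Toy content-addressed store: each distinct tensor payload is kept once."""

    def __init__(self):
        self.blobs = {}      # fingerprint -> tensor values
        self.manifests = {}  # model name -> {tensor name: fingerprint}

    def add_model(self, name, tensors):
        """Register a model, storing only tensors whose fingerprint is new."""
        manifest = {}
        for tensor_name, values in tensors.items():
            fp = fingerprint(values)
            self.blobs.setdefault(fp, list(values))
            manifest[tensor_name] = fp
        self.manifests[name] = manifest

    def load_model(self, name):
        """Rebuild a model's tensors from its manifest."""
        return {t: self.blobs[fp] for t, fp in self.manifests[name].items()}
```

When a fine-tuned model shares an unchanged embedding tensor with its base model, the store keeps a single copy and both manifests point to it.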

2604.15333 2026-05-14 cs.HC cs.AI

Technically Love: The Evolution of Human-AI Romance Discourse on Reddit

Tyler Chang, Jina Huh-Yoo, Afsaneh Razi

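Affiliations: Drexel University; Stevens Institute of Technology

AI summary: This paper studies how public discussion of human-AI romantic relationships has evolved on Reddit, analyzing 3,383 user-generated posts from 2017 to 2025. Using topic modeling and temporal statistical analysis, it finds that the focus shifted from intimacy in the early period toward platform governance, technical issues, and real-world consequences. This reveals a trend in which human-AI romance moves from private experience toward technical regulation, with implications for the design and governance of companion AI systems.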

Comments Accepted at ACM CUI 2026

2604.03551 2026-05-14 cs.SE cs.AI cs.HC

AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub

Daniel Ogenrwot, John Businge

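Affiliations: University of Nevada Las Vegas

AI summary: This paper introduces AgenticFlict, a large-scale dataset for studying merge conflicts in pull requests (PRs) submitted by AI coding agents on GitHub. The dataset contains more than 142,000 AI-generated PRs, among which over 29,000 cases with merge conflicts were identified, for a reported conflict rate of 27.67%. The study shows that AI-generated code can cause frequent and complex merge conflicts in collaborative development, underscoring the need to understand and address integration challenges in AI-assisted software development.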

Comments Accepted at the 3rd ACM International Conference on AI-Powered Software (AIware 2026)

2603.14066 2026-05-14 cs.MA cs.AI cs.LG

A Benchmark for Multi-Party Negotiation Games from Real Negotiation Data

Leo Benac, Jonas Raedler, Zilin Ma, Finale Doshi-Velez

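Affiliations: School of Engineering and Applied Sciences

AI summary: This paper presents a benchmark for multi-party negotiation games built from real negotiation data, designed to study how binding commitments are reached step by step during negotiation. The benchmark combines a configurable negotiation-game generator with document-grounded instances drawn from climate negotiation exercises, and provides several baseline solvers. Experiments show that solver performance varies across negotiation scenarios, highlighting the tight link between negotiation strategy and game structure and motivating research on new negotiation methods that handle diverse strategic settings robustly.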

2602.12783 2026-05-14 cs.IR cs.AI

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang

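Affiliations: Huazhong University of Science and Technology; The University of Hong Kong; Soochow University; University of Science and Technology of China; Wuhan University; Tsinghua University; The University of Tokyo

AI summary: SQuTR is a benchmark for evaluating the robustness of spoken-query-to-text retrieval systems under complex acoustic noise. The authors aggregate a large number of multilingual, multi-domain text queries and synthesize speech mixed with a variety of real-world environmental noise, producing a large-scale, reproducible evaluation dataset. Experiments show that retrieval performance drops markedly as noise increases and that systems differ substantially, underscoring the importance of robustness in spoken retrieval and the shortcomings of current models. SQuTR provides a standardized test framework and diagnostic analysis tools for related research.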

Comments Accepted by SIGIR 2026

Journal ref Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20--24, 2026, Melbourne, VIC, Australia

2602.11202 2026-05-14 cs.LO cs.AI

interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification

Vishak K Bhat, Prateek Chanda, Vijval Ekbote, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma

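Affiliations: Microsoft Research; IIT Bombay; Carnegie Mellon University; Arizona State University

AI summary: This paper proposes interwhen, a general framework that steers reasoning models through verification at test time. By monitoring the reasoning trace and running verifiers asynchronously, it detects and corrects errors promptly without adding significant compute overhead. interwhen also introduces a method for automatically generating verifiers from natural-language policy documents, addressing the scarcity of verifiers and substantially improving reasoning models' task completion and policy compliance.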

Comments 56 pages, 6 figures

Abstract

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate reasoning traces. It addresses two key challenges. First, given a set of verifiers, obtaining verifiable states from the reasoning trace typically requires prompt engineering or external task decomposition into fixed steps. Instead, we propose a monitoring system that periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers are run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur. Second, beyond math and code, a central challenge for process verification is the scarcity of verifiers. interwhen addresses this through automatic verifier synthesis from natural-language policy documents. Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and z3. Together, these contributions yield a plug-and-play test-time verification system that can improve task completion and policy compliance of any reasoning agent. On reasoning benchmarks where policies encode mathematical or logical constraints, interwhen achieves near-perfect accuracy for reasoning models using a fraction of the tokens of baselines. On agentic benchmarks with policy-based verifier generation, it enables improvements in task quality for SLMs without any finetuning, e.g., task completion rate of Qwen3-30B jumps from 32% to 87% on the telecom domain in tau2-bench. Code at https://github.com/microsoft/interwhen.

2602.06713 2026-05-14 stat.ML cs.LG

Distribution Shift in Missing Data Imputation: A Risk-Based Perspective and Importance-Weighted Correction under MAR

Luke Shannon, Song Liu, Katarzyna Reluga

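Affiliations: School of Mathematics; University of Bristol; School of Business and Economics; Humboldt University of Berlin

AI summary: Taking a risk-minimization view, this paper rigorously formulates missing-data imputation as minimizing mean-squared-error risk. It shows that when the probability of missingness depends on the data, existing methods ignore the distribution shift between the training data and the complete data, and therefore fail to reduce the overall mean squared error effectively. The authors propose an importance-weighted correction that handles this shift explicitly. In experiments it outperforms uncorrected baselines in both RMSE and Wasserstein distance.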

Comments 9 pages, 12 figures
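The idea of an importance-weighted correction under MAR can be sketched on a toy problem. This is an illustration under stated assumptions rather than the authors' algorithm: the covariate x is always observed, y is missing with a known probability that depends only on x, and the imputer is a deliberately misspecified straight line. The missingness function and all names are hypothetical, and in practice the observation probability would have to be estimated.

```python
import math
import random


def observe_prob(x):
    """Hypothetical MAR mechanism: y is more likely to be observed when x is small."""
    return 1.0 / (1.0 + math.exp(2.0 * x))


def fit_line(xs, ys, ws):
    """Weighted least squares for y ~ a + b * x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def weighted_mse(a, b, xs, ys, ws):
    """Weighted mean squared error of the line a + b * x."""
    return sum(w * (y - a - b * x) ** 2 for w, x, y in zip(ws, xs, ys)) / sum(ws)


rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
ys = [x ** 2 + rng.gauss(0.0, 0.1) for x in xs]          # nonlinear truth, linear imputer
observed = [rng.random() < observe_prob(x) for x in xs]  # missingness depends only on x

x_obs = [x for x, o in zip(xs, observed) if o]
y_obs = [y for y, o in zip(ys, observed) if o]

# Importance weight: density ratio of missing rows to observed rows,
# proportional to (1 - pi(x)) / pi(x) under MAR.
iw = [(1.0 - observe_prob(x)) / observe_prob(x) for x in x_obs]

plain = fit_line(x_obs, y_obs, [1.0] * len(x_obs))  # ignores the shift
corrected = fit_line(x_obs, y_obs, iw)              # targets the missing rows
```

The unweighted fit minimizes error where y happens to be observed, while the weighted fit minimizes an estimate of the error on the rows that actually need imputing. With a misspecified model the two fits generally differ, which is the shift the summary describes.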

2602.05242 2026-05-14 cs.SE cs.AI cs.LG

EGSS: Entropy-guided Stepwise Scaling for Reliable Software Engineering

Chenhui Mao, Yuanting Lei, Zhixiang Wei, Ming Liang, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li

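Affiliations: Ant Group

AI summary: This paper proposes EGSS, an entropy-guided stepwise scaling framework that addresses the high compute cost and limited gains of agentic test-time scaling (TTS) on software engineering tasks. EGSS dynamically balances efficiency and effectiveness through entropy-guided adaptive search and robust test-suite augmentation, lowering inference-time compute while improving performance on tasks such as code generation and bug fixing. Experiments show performance gains of 5-10% across multiple models and 28% fewer tokens than existing methods.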
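One way to read "entropy-guided" search is that the agent spends more compute at steps where its own next-step distribution is uncertain. The sketch below shows that idea only. The thresholds, the width values, and the function names are hypothetical, and the summary does not describe the rule EGSS actually uses.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_width(probs, low=0.5, high=1.0, max_width=4):
    """Hypothetical rule: explore more candidates when the step distribution is uncertain."""
    h = shannon_entropy(probs)
    if h < low:
        return 1          # confident step: follow the single best candidate
    if h < high:
        return 2          # moderately uncertain: try two candidates
    return max_width      # highly uncertain: widen the search
```

A confident step keeps a single trajectory, and only uncertain steps fan out, which is how a scheme of this kind could reduce token use relative to uniformly sampling many full trajectories.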

2602.04113 2026-05-14 cs.CR cs.LG

ZKBoost: Zero-Knowledge Verifiable Training for XGBoost

Nikolas Melissaris, Antigoni Polychroniadou, Akira Takahashi, Chenkai Weng, Jiayi Xu

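AI summary: ZKBoost is a zero-knowledge verifiable training protocol for XGBoost models. It lets a model owner prove that a model was trained correctly on a specified dataset without revealing the data or the model parameters. To address the high computational cost and potential security vulnerabilities of earlier approaches, ZKBoost provides a general template for zero-knowledge proofs of training and an efficient, secure instantiation based on VOLE. The work also develops a fixed-point version of XGBoost suited to zero-knowledge proofs, whose accuracy on real datasets is within 1% of standard XGBoost.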

2602.03730 2026-05-14 stat.ML cs.LG

Efficient Generative Prediction for EHR Foundation Models: The SCOPE and REACH Estimators

Luke Solo, Matthew B. A. McDermott, William F. Parker, Bashar Ramadan, Michael C. Burkhart, Brett K. Beaulieu-Jones

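Affiliations: University of Chicago; Columbia University

AI summary: This paper proposes two efficient estimators, SCOPE and REACH, for improving clinical outcome prediction with generative models of electronic health records (EHR). Both exploit the under-used next-token probability distributions of the generative model, addressing the sparsity, compute cost, and variance limitations of conventional Monte Carlo sampling. Experiments show that they preserve prediction calibration while substantially reducing the number of generated tokens, with particularly strong results on rare but important clinical outcomes, which greatly lowers inference cost.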

Comments 10 pages, 4 figures, 1 Table

2601.22792 2026-05-14 eess.AS cs.CL cs.SD

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda, Chyi-Jiunn Lin, Shinji Watanabe

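Affiliations: Honda Research Institute Japan Co., Ltd.; Carnegie Mellon University

AI summary: This paper proposes CALM, a joint contextual acoustic-linguistic modeling framework for personalizing multi-speaker automatic speech recognition (ASR). It models acoustic and linguistic cues jointly through speaker-embedding-driven target-speaker extraction and contextual biasing with a dynamic vocabulary. Experiments show that CALM significantly reduces the biased error rate on English and Japanese speech-mixture datasets, supporting its effectiveness across languages.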

Comments Accepted to IEEE ICASSP 2026

2601.19979 2026-05-14 hep-th cs.LG quant-ph

Exploring the holographic entropy cone via reinforcement learning

Temple He, Jaeha Lee, Hirosi Ooguri

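Affiliations: Walter Burke Institute for Theoretical Physics & Leinweber Forum for Theoretical Physics, California Institute of Technology, Pasadena, CA 91125, USA; Kavli Institute for the Physics and Mathematics of the Universe (WPI), University of Tokyo, Kashiwa 277-8583, Japan

AI summary: This paper proposes a reinforcement-learning algorithm for studying the holographic entropy cone. The algorithm searches for a graph whose min-cut structure matches a target entropy vector, thereby deciding whether the vector lies in the holographic entropy cone, and it can approximately locate the cone's boundary. The authors verify that the algorithm recovers monogamy of mutual information for the three-party holographic entropy cone. Applied to the six-party cone, it finds graph realizations for three extreme rays and gives evidence that three others may not be realizable, suggesting the existence of holographic entropy inequalities not yet discovered.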

Comments 39 pages, 10 figures, 2 tables; v2: minor clarifications, version appearing in JHEP

2601.18842 2026-05-14 cs.CR cs.AI cs.CV

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He

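Affiliations: Beijing Normal University; Zhongguancun Academy; University of Science and Technology of China; A*STAR; Zhongguancun Institution of Artificial Intelligence

AI summary: As GUI agents rely increasingly on screenshots to perceive and operate digital environments, they can inadvertently expose sensitive information such as identities, accounts, and locations. To address the lack of privacy benchmarks that assess risk in the context of task trajectories, the paper presents GUIGuard-Bench, a benchmark of 241 real GUI-agent trajectories and 4,080 screenshots. It supports privacy recognition, evaluation of planning fidelity on privacy-protected screenshots, and utility analysis of different protection strategies. The study finds that current models detect private information fairly well but remain clearly deficient in fine-grained localization, category recognition, risk assessment, and judging whether information is necessary for the task.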

2511.01303 2026-05-14 cs.CR cs.LG

Differentially Private Nonparametric Confidence Intervals Under Minimal Distributional Assumptions

Tomer Shoham, Moshe Shenfeld, Noa Velner-Harris, Katrina Ligett

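Affiliations: School of Computer Science and Engineering

AI summary: This paper studies the construction of differentially private nonparametric confidence intervals under minimal distributional assumptions. It proposes a general framework that converts any differentially private estimator satisfying certain conditions into a nonparametric confidence interval for an arbitrary target quantity. The method repeatedly subsamples the data, applies the private estimator, and post-processes the resulting empirical cumulative distribution function to produce the interval, without relying on a specific limiting distribution or strong assumptions. It comes with theoretical guarantees and performs well in practice, outperforming existing methods particularly for non-smooth functions and complex distributions.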

2511.00839 2026-05-14 cs.SE cs.AI

CodeClash: Benchmarking Goal-Oriented Software Engineering

John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Muhtasham Oblokulov, Aryan Siddiqui, Ofir Press, Ludwig Schmidt, Diyi Yang

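Affiliations: Stanford University; Princeton University; Cornell University; Technical University of Munich

AI summary: Current coding benchmarks mainly evaluate language models on specific, well-defined tasks such as fixing a particular bug or writing targeted tests, and do not reflect the iterative development around high-level goals that characterizes real software work. CodeClash is a benchmark in which language models compete over multiple rounds to improve a codebase toward a competitive objective, with simulated code-versus-code matches used to assess strategic planning and long-term maintenance. Experiments show that, although models differ in development style, they have significant limitations in strategic reasoning and long-term codebase maintenance, and repeatedly lose against human experts.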

2510.16986 2026-05-14 stat.ML cs.LG stat.OT

When to Transfer: Adaptive Source Selection for Positive Transfer in Linear Models

Hamza Cherkaoui, Hélène Halconruy, Yohan Petetin

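Affiliations: SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris; Modal’X, Université Paris-Nanterre

AI summary: In many practical settings, labeled data for the target task is scarce or costly to obtain, which limits supervised learning. This paper studies, in a multi-source setting, how to transfer information selectively from related source tasks through sample sharing to improve target-task performance. It proposes an accept/reject rule based on a data-dependent estimate of the transfer gain to decide which sources to draw from and how many samples to take, and proves that the method guarantees positive transfer with high probability. On synthetic and real data the method outperforms classical and recent strong baselines and avoids negative transfer.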

2510.15297 2026-05-14 cs.CY cs.AI cs.HC cs.SI

VERA-MH Concept Paper

Luca Belli, Kate H. Bentley, Will Alexander, Emily Ward, Matt Hawrilenko, Kelly Johnston, Mill Brown, Adam M. Chekroud

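Affiliations: Spring Health; Yale University School of Medicine

AI summary: This paper introduces VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated system for evaluating the safety of AI chatbots in mental-health contexts, with a focus on recognizing and responding to suicide risk. Drawing on clinical expertise, the team designed an evaluation rubric and uses two auxiliary AI agents, a user agent and a judge agent, to simulate user conversations and to score the chatbot's behavior, respectively. The system is currently under development and clinical validation. It has been applied in preliminary tests of models including GPT-5 and Claude Opus, and the authors plan to further refine the evaluation rubric and its clinical utility.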

2510.14244 2026-05-14 eess.IV cs.AI cs.CV

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Arnaud Judge, Nicolas Duchateau, Thierry Judge, Roman A. Sandler, Joseph Z. Sokol, Christian Desrosiers, Olivier Bernard, Pierre-Marc Jodoin

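Affiliations: Department of Computer Science, University of Sherbrooke; INSA, Universite Claude Bernard Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS; Dep. of Software and Information Technology Engineering, École de technologie supérieure; Institut Universitaire de France (IUF)

AI summary: This study addresses domain adaptation for echocardiography segmentation and proposes RL4Seg3D, an unsupervised domain adaptation framework based on reinforcement learning. With novel reward functions and a fusion strategy, it improves the accuracy of key anatomical landmarks in the segmentations and maintains good temporal consistency when processing full-size video inputs. Experiments show that, without target-domain labels, it significantly outperforms conventional domain adaptation techniques and provides robust uncertainty estimates that can help improve segmentation further.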

Comments 13 pages, accepted for publication in IEEE TMI

2510.01502 2026-05-14 q-bio.NC cs.CV cs.LG

Behavioral Geometric Supervision Aligns Video Foundation Models with Human Social Perception

Kathy Garcia, Leyla Isik

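Affiliations: Department of Cognitive Science; Department of Biomedical Engineering; Johns Hopkins University

AI summary: Current video foundation models fall short in capturing how humans organize information about dynamic social scenes, and predict human similarity judgments of social video clips poorly. This paper proposes Behavioral Geometric Supervision (BGS), which constrains the local and global geometry of the embedding space to align with similarity relations between videos. Experiments show that the method substantially improves performance on human similarity-judgment tasks and lets the models capture social-affective features that language embedding models fail to reflect, yielding video understanding closer to human social perception.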

Comments v2: Major revision. Retitled; expanded from TimeSformer alone to four backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, CLIP), with V-JEPA 2.1 nearly tripling pretrained performance. Adds zero-shot PHASE transfer, attention-rollout analysis, and a language-distillation control. Data (OOO sim. judgments) & core hybrid triplet+RSA LoRA method unchanged from v1. Prepared for NeurIPS 2026 submission

2509.04477 2026-05-14 math.OC cs.LG

Universal Representation of Generalized Convex Functions and their Gradients

Moeen Nehzati

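Affiliations: Department of Economics, New York University

AI summary: This paper studies universal representations of generalized convex functions and their gradients. It proposes a new differentiable layer with a convex parameter space and proves that it is a universal approximator for generalized convex functions and their gradients. The author demonstrates the parameterization on learning optimal transport maps and on multi-item auction design, where it turns the original nested bilevel optimization problem into a single-level problem that can be solved efficiently.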

2508.20474 2026-05-14 eess.AS cs.CL cs.SD

Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe

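Affiliations: Honda Research Institute Japan, Japan; Carnegie Mellon University, USA

AI summary: This paper proposes a unified multi-speaker encoder (UME) that uses a shared speech foundation encoder to jointly learn representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR). It applies residual weighted-sum encoding (RWSE) over the UME's hidden representations from multiple layers, fusing information from different semantic levels and strengthening alignment and synergy across tasks. Experiments show that UME substantially outperforms separately trained baselines on the LibriMix datasets. On SD in particular it reaches diarization error rates of 1.37% and 2.29%, improving on previous results.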

Comments Accepted to IEEE ASRU 2025

2506.03120 2026-05-14 stat.AP cs.LG

Validating remotely sensed biomass estimates with forest inventory data in the western US

Xiuyu Cao, Joseph O. Sexton, Panshi Wang, Dimitrios Gounaridis, Neil H. Carter, Kai Zhu

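Affiliations: School for Environment and Sustainability, University of Michigan; terraPulse, Inc.

AI summary: This study validates the aboveground biomass density (AGBD) data provided by the commercial remote-sensing company terraPulse, using the US Forest Service's Forest Inventory and Analysis (FIA) data as an independent reference. Validation was carried out over 64,000-hectare hexagons and at county scale in Nevada, Utah, and Washington. At county scale terraPulse and FIA agree closely, with R² of 0.90 and a correlation coefficient of 0.95. The study also identifies the causes of discrepancies between terraPulse and FIA in non-forest areas and in high-biomass forests, and proposes a scalable validation framework based on independent FIA data, providing a new commercial-data benchmark for global biomass monitoring.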

Comments 32 pages, 5 figures

Journal ref Science of Remote Sensing, Volume 13, June 2026, 100441

2505.14587 2026-05-14 stat.ML cs.LG

High-Dimensional Analysis of Bootstrap Ensemble Classifiers

Malik Tiomoko, Hamza Cherkaoui, Mohamed El Amine Seddik, Cosme Louart, Ekkehard Schnoor, Balazs Kegl

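Affiliations: Huawei Noah’s Ark Lab, France; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; Technology Innovation Institute, United Arab Emirates; Chinese University of Hong Kong, China; Fraunhofer HHI, Germany; Aalto University, Finland

AI summary: This paper gives a theoretical analysis of bootstrap methods applied to ensembles of least-squares support vector machine (LSSVM) classifiers, focusing on the regime where both sample size and feature dimension are large. Using tools from random matrix theory, it characterizes the performance of a classifier that aggregates the decision functions of multiple weak classifiers and examines how the bootstrap behaves in high dimensions. Based on the analysis, it proposes strategies for choosing the number of subsets and the regularization parameter to improve LSSVM ensembles. Experiments on synthetic and real datasets support the theoretical conclusions.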

2505.04613 2026-05-14 stat.ML cs.LG math.ST stat.TH

Kernel Embeddings and the Separation of Measure Phenomenon

Leonardo V. Santoro, Kartik G. Waghmare, Victor M. Panaretos

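Affiliations: Departement Mathematik, ETH Zürich

AI summary: This paper studies the ability of kernel embeddings to distinguish continuous probability distributions and proves that kernel covariance embeddings achieve information-theoretically perfect separation. On a locally compact, uncountable Polish space, testing equality of two non-atomic probability measures is shown to be equivalent to testing singularity of two centered Gaussian measures on a reproducing kernel Hilbert space. This phenomenon points to a core mechanism behind the strong performance of kernel methods in high-dimensional or complex domains and provides a theoretical basis for designing efficient inference tools.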
