arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.06607 2026-05-14 physics.flu-dyn cs.AI

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

Nithin Somasekharan, Rabi Pathak, Manushri Dhanakoti, Tingwen Zhang, Ling Yue, Andy Zhu, Shaowu Pan

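Affiliations: Rensselaer Polytechnic Institute

AI summary: This work presents AI CFD Scientist, an open-source AI scientist system intended to advance open-ended scientific discovery in computational fluid dynamics (CFD). The system combines vision-based physics verification, code generation and modification, and literature-grounded hypothesis generation, covering the path from ideation to validated simulation results within a single inspectable workflow. By introducing a vision-language physics-verification gate, it can detect errors that solver-level checks miss, and it demonstrates automated discovery of improved models and gains in simulation accuracy across several tasks.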

Comments 9 main pages and rest in appendix

Abstract

Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field-level imagery rather than in solver logs. We present AI CFD Scientist, an open-source AI scientist for computational fluid dynamics (CFD) that, to our knowledge, is the first to span literature-grounded ideation, validated execution, vision-based physics verification, source-code modification, and figure-grounded writing within a single inspectable workflow. Three coupled pathways cover parameter sweeps within a fixed solver, case-local C++ library compilation for new physical models, and open-ended hypothesis search against a reference comparator, all running on OpenFOAM through Foam-Agent. At the center of the framework is a vision-language physics-verification gate that inspects rendered flow fields before any result is accepted, rerun, or written into a manuscript. On five tasks under a shared GPT-5.5 backbone, AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction that reduces lower-wall $C_f$ RMSE against DNS by 7.89% on the periodic hill at $Re_h=5600$; under matched LLM cost, two strong general AI-scientist baselines (ARIS, DeepScientist) execute partial CFD workflows but lack the domain-specific validity gates needed to convert runs into defensible scientific claims; and a controlled planted-failure ablation shows that the vision-language gate detects 14 of 16 silent failures missed by solver-level checks. Code, prompts, and run artifacts are released at https://github.com/csml-rpi/cfd-scientist.

2605.01750 2026-05-14 cs.MA cs.AI

Talk is Cheap, Communication is Hard: Dynamic Grounding Failures and Repair in Multi-Agent Negotiation

Yiheng Yao, Chelsea Zou, Robert D. Hawkins

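Affiliations: Stanford University

AI summary: This paper studies dynamic grounding failures and their repair in multi-agent negotiation, observing that current LLM-based multi-agent benchmarks focus mainly on static tasks and overlook whether agents can repair grounding breakdowns through interaction. Using a multi-turn negotiation game, the authors find that agent dyads consistently fail to reach Pareto-optimal allocations and identify four typical failure modes, indicating that the difficulty lies in the coordination bottleneck of joint plan formation, commitment, and execution rather than in individual reasoning limits or insufficient information exchange.

Abstract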

Grounding is the collaborative process of establishing mutual belief sufficient for a communicative goal. While static grounding maps language to a shared context, dynamic grounding requires agents to negotiate meaning across turns. Current multi-agent Large Language Model (LLM) benchmarks largely emphasize static, one-shot tasks, overlooking whether agents can repair grounding breakdowns through interaction. We introduce an iterated multi-turn negotiation game where two agents allocate shared resources to private projects with verifiable jointly optimal outcomes. Although individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across models. We identify four failure modes: (1) loss of shared interaction history, (2) stubborn anchoring to early proposals, (3) defaulting to equal splits over reward-maximizing coordination, and (4) referential binding errors across turns. Our baselines show that the coordination gap is not explained by individual reasoning limits or insufficient information exchange alone. Instead, the bottleneck lies in dynamic grounding: joint plan formation, commitment, and execution.
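To make "verifiable jointly optimal outcomes" concrete, the following is a minimal sketch that enumerates every split of a shared resource pool in a toy two-agent game and keeps the Pareto-optimal ones. The pool, the per-unit values, and all function names are illustrative assumptions and are not taken from the paper's game.

```python
from itertools import product

# Hypothetical toy game: a shared pool of three resource types, and each
# agent's private per-unit value for every type (not taken from the paper).
POOL = {"wood": 2, "stone": 2, "gold": 1}
VALUES_A = {"wood": 3, "stone": 1, "gold": 2}
VALUES_B = {"wood": 1, "stone": 3, "gold": 2}


def all_allocations(pool):
    """Yield every split of the pool as (units given to A, units given to B)."""
    names = list(pool)
    for counts in product(*(range(pool[n] + 1) for n in names)):
        to_a = dict(zip(names, counts))
        to_b = {n: pool[n] - to_a[n] for n in names}
        yield to_a, to_b


def payoff(values, bundle):
    """Total value of a bundle under one agent's private per-unit values."""
    return sum(values[n] * bundle[n] for n in bundle)


def pareto_front(pool, values_a, values_b):
    """Return allocations whose payoff pair is not dominated by any other."""
    scored = [(payoff(values_a, a), payoff(values_b, b), a, b)
              for a, b in all_allocations(pool)]
    front = []
    for ua, ub, a, b in scored:
        dominated = any(
            (va >= ua and vb >= ub) and (va > ua or vb > ub)
            for va, vb, _, _ in scored
        )
        if not dominated:
            front.append((ua, ub, a, b))
    return front


front = pareto_front(POOL, VALUES_A, VALUES_B)
# Among Pareto-optimal splits, pick one that maximizes the joint payoff.
best_joint = max(front, key=lambda item: item[0] + item[1])
```

Because the game is small enough to enumerate, any allocation a dyad agrees on can be checked against `front` directly, which is the sense in which an outcome is verifiable here.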

2604.26969 2026-05-14 cs.IR cs.AI

AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization

Xidong Wu, Yue Zhuan, Ruoqiao Wei, Hangxin Chen, Di Bai, Jintao Liu, Xinyi Wang, Xue Wang, Luoshu Wang, Xinwu Cheng

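Affiliations: Google

AI summary: Modern recommendation systems are typically built as multi-stage pipelines comprising pre-ranking, ranking, and re-ranking. System-level configuration optimization is critical to overall performance, but the complexity of the system makes it time-consuming and dependent on expert knowledge. The paper proposes AgenticRecTune, a framework of five specialized agents (Actor, Critic, Insight, Skill, and Online) that uses the reasoning capabilities of large language models to explore the configuration space automatically, together with a self-evolving Skillhub that summarizes historical results, extracts task mechanics, and updates optimization skills, with the aim of efficient, adaptive recommendation system optimization.

Abstract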

Modern large-scale recommendation systems are typically constructed as multi-stage pipelines, encompassing pre-ranking, ranking, and re-ranking phases. While traditional recommendation research typically focuses on optimizing a specific model, such as improving the pre-ranking model structure or the ranking model's training algorithm, system-level configuration optimization, which integrates the outputs of each model head into the final score at each stage, also plays a crucial role. Because of the system's complexity, configuration optimization is both important and challenging. Any model modification requires new optimal system-level configurations, yet each experimental iteration requires significant tuning effort. Furthermore, models in different stages operate within distinct contexts and optimize for different targets, requiring specialized domain expertise. In addition, optimization success depends on balancing multiple competing online metrics and on alignment with shifting production development objectives. To address these challenges, we propose AgenticRecTune, an agentic framework comprising five specialized agents (Actor, Critic, Insight, Skill, and Online) designed to manage the end-to-end configuration optimization workflow. By leveraging the advanced reasoning of Large Language Models (LLMs), specifically Gemini, AgenticRecTune explores the configuration space for optimal settings. The Actor Agent proposes multiple candidates, and the Critic Agent filters out suboptimal proposals. The Online Agent then autonomously prepares A/B tests based on the configuration set passed on by the Critic Agent and captures the subsequent experimental results. We also introduce a self-evolving Skillhub, in which the Insight Agent and Skill Agent collaborate to summarize historical results, extract the underlying mechanics of each task in the recommendation system, and update skills.

2604.25700 2026-05-14 cs.SE cs.LG

Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Pernilla Hall, Anton Ununger, Riccardo Rubei, Alessio Bucaioni

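Affiliations: Mälardalen University, Sweden

AI summary: This paper studies how artificial intelligence can localize software faults from natural-language bug reports, targeting maintenance scenarios in industrial settings. Fault localization is framed as a supervised text classification task, and traditional machine learning models are compared with fine-tuned language models on five years of real bug reports from ABB Robotics. Traditional models perform better on this dataset, challenging the assumption that language models universally outperform classical methods in industrial domains, while showing that text-based fault localization from historical bug reports is feasible and practical in industry.

Abstract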

Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated whether artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to source code, execution traces, or static analysis artifacts, making it directly deployable within existing industrial maintenance workflows. We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Our evaluation used proprietary data from ABB Robotics in Sweden, comprising five years of resolved industrial bug reports, each linked to its verified code fix. This setting allowed us to assess model effectiveness under realistic industrial constraints. Our results showed that traditional models using term frequency-inverse document frequency (TF-IDF) features consistently outperformed the fine-tuned language models on this dataset, while data augmentation improved Random Forest performance. These findings challenge the assumption that transformer-based models universally outperform classical approaches in industrial contexts with domain-specific data. We demonstrated that historical bug reports can be systematically used for text-based, artificial intelligence-assisted fault localization, providing a scalable, low-cost, and empirically grounded complement to common debugging practices in industry.

2604.23887 2026-05-14 cs.CR cs.AI

Evaluation of Prompt Injection Defenses in Large Language Models

Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon, Kelley McAllister, Peter Ortiz, Krisztian Flautner

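Affiliations: Swept AI; University of Michigan

AI summary: This study evaluates defenses against prompt injection attacks in large language models and finds that every defense relying on the model to protect itself eventually fails. Using an adaptive attacker across a large number of attacks, the authors show that the only defense that held was application-layer output filtering, which checks model responses against hardcoded rules before they reach the user and achieved zero leaks. The study stresses that security boundaries should be enforced by application code rather than by the model, and recommends restricting sensitive operations to internal, trusted personnel until defenses are verified.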

Comments 14 pages, 9 figures

Abstract

LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that evolves its strategies over hundreds of rounds and tested it against nine defense configurations across more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering, which checks the model's responses via hardcoded rules in separate application code before they reach the user, achieving zero leaks across 15,000 attacks. These results demonstrate that security boundaries must be enforced in application code, not by the model being attacked. Until such defenses are verified by tools like Swept AI, AI systems handling sensitive operations should be restricted to internal, trusted personnel.
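The output-filtering defense described above lives in application code rather than in the prompt. The sketch below shows one possible shape for such a filter; the secret value, the refusal text, the function names, and the specific rules (plain, spaced-out, reversed, and base64 forms) are assumptions made for illustration and are not the rules evaluated in the paper.

```python
import base64
import re

# Hypothetical secret embedded in a system prompt; the application code,
# not the model, knows which strings must never leave the system.
SECRET = "ORCHID-7731"
REFUSAL = "Sorry, I can't share that."


def _normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaks_secret(response, secret=SECRET):
    """Hardcoded checks for the secret in plain, spaced-out, reversed, or base64 form."""
    needle = _normalize(secret)
    haystack = _normalize(response)
    if needle in haystack or needle[::-1] in haystack:
        return True
    encoded = base64.b64encode(secret.encode()).decode()
    return encoded in response


def filter_output(response):
    """Return the model response only if no rule fires; otherwise a fixed refusal."""
    return REFUSAL if leaks_secret(response) else response
```

The filter runs on every response before it reaches the user, so it behaves the same no matter what the attacker persuaded the model to say. A fixed rule list like this one only covers the encodings it names; it is a sketch of the pattern, not a complete defense.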

2604.18242 2026-05-14 math.ST cs.LG stat.ML stat.TH

Horospherical Depth and Busemann Median on Hadamard Manifolds

Yangdi Jiang, Xiaotian Chang, Cyrus Mostajeran

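Affiliations: Division of Mathematical Sciences

AI summary: This paper introduces the horospherical depth, an intrinsic statistical depth on Hadamard manifolds, and defines the Busemann median as its set of maximizers. The construction uses the fact that the linear functionals in Tukey's half-space depth are limits of renormalized distance functions; on a Hadamard manifold the same limit yields Busemann functions, whose sublevel sets are horoballs, the intrinsic replacements for half-spaces. The depth is parametrized by the visual boundary, is isometry-equivariant, and needs neither tangent-space linearization nor a chosen base point. It applies to arbitrary Hadamard manifolds, is strictly quasi-concave with a unique median under strictly negative curvature and mild regularity assumptions, and is robust to contamination and sample perturbations.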

Comments 52 pages, 10 figures

Abstract

We introduce the horospherical depth, an intrinsic notion of statistical depth on Hadamard manifolds, and define the Busemann median as the set of its maximizers. The construction exploits the fact that the linear functionals appearing in Tukey's half-space depth are themselves limits of renormalized distance functions; on a Hadamard manifold the same limiting procedure produces Busemann functions, whose sublevel sets are horoballs, the intrinsic replacements for halfspaces. The resulting depth is parametrized by the visual boundary, is isometry-equivariant, and requires neither tangent-space linearization nor a chosen base point. For arbitrary Hadamard manifolds, we prove that the depth regions are nested and geodesically convex, that a centerpoint of depth at least $1/(d+1)$ exists, and hence that the Busemann median exists for every Borel probability measure. Under strictly negative sectional curvature and mild regularity assumptions, the depth is strictly quasi-concave and the median is unique. We also establish robustness: the depth is stable under total-variation perturbations, and under contamination escaping to infinity the limiting median depends on the escape direction but not on how far the contaminating mass has moved along the geodesic ray, in contrast with the Fréchet mean. Finally, we establish uniform consistency of the sample depth and convergence of sample depth regions and sample Busemann medians; on symmetric spaces of noncompact type, the argument proceeds through a VC analysis of upper horospherical halfspaces, while on general Hadamard manifolds it follows from a compactness argument under a mild non-atomicity assumption.
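For orientation, the limiting procedure mentioned in the abstract can be written out. The first two displays are the standard Busemann function and its Euclidean special case, which recovers a linear functional. The third display is only an assumed form of the depth, written by analogy with Tukey's half-space depth; the paper's exact definition, sign convention, and notation (here $\gamma_\xi$, $\partial_\infty M$, and $D$) may differ.

```latex
% Busemann function of a unit-speed geodesic ray \gamma_\xi toward \xi \in \partial_\infty M
b_\xi(x) = \lim_{t \to \infty} \bigl[ d\bigl(x, \gamma_\xi(t)\bigr) - t \bigr]

% Euclidean check: for the ray \gamma(t) = p + t v with \|v\| = 1
b(x) = \lim_{t \to \infty} \bigl[ \|x - p - t v\| - t \bigr] = -\langle x - p, v \rangle

% Assumed form of the depth, by analogy with Tukey's half-space depth
D(x; P) = \inf_{\xi \in \partial_\infty M} P\bigl( \{ y \in M : b_\xi(y) \le b_\xi(x) \} \bigr)
```

In the Euclidean case the set $\{y : b(y) \le b(x)\}$ is the closed half-space through $x$ with inward normal $v$, so the assumed depth reduces to Tukey's. Changing the starting point of the ray shifts $b_\xi$ by an additive constant and leaves that set unchanged, which is consistent with the claim that no base point needs to be chosen.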

2604.17104 2026-05-14 cs.DC cs.AI cs.LG

TStore: Rethinking AI Model Hub with Tensor-Centric Compression

Tingfeng Lan, Zirui Wang, Yunjia Zheng, Zhaoyuan Su, Juncheng Yang, Yue Cheng

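Affiliations: University of Virginia; Harvard University

AI summary: As modern AI models grow rapidly in size and redundancy, model hubs face significant storage and distribution challenges. The paper presents TStore, a tensor-centric system that reduces storage overhead through fine-grained deduplication and compression. TStore uses tensor-level fingerprinting and clustering to identify redundancy across models without annotations, and experiments on real-world model repositories show substantial storage savings while preserving model usability and performance.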

Comments 12 pages, 6 figures. Systems paper on AI model storage

Abstract

Modern AI models are growing rapidly in size and redundancy, leading to significant storage and distribution challenges in model hubs. We present TStore, a tensor-centric system for reducing storage overhead through fine-grained deduplication and compression. TStore leverages tensor-level fingerprinting and clustering to identify redundancy across models without requiring annotations. Our design enables efficient storage reduction while preserving model usability and performance. Experiments on real-world model repositories demonstrate substantial storage savings with minimal overhead.
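As a rough illustration of tensor-level fingerprinting, the sketch below hashes each tensor's dtype, shape, and raw values and stores identical tensors only once. It covers exact-match deduplication only; the clustering and compression stages mentioned in the abstract are omitted, and the class, function names, and in-memory layout are hypothetical rather than TStore's design.

```python
import hashlib
import struct


def fingerprint(shape, dtype, values):
    """Hash a tensor's dtype, shape, and values (packed as float32) into a hex digest."""
    h = hashlib.sha256()
    h.update(dtype.encode())
    h.update(struct.pack(f"<{len(shape)}q", *shape))
    h.update(struct.pack(f"<{len(values)}f", *values))
    return h.hexdigest()


class TensorStore:
    """Toy content-addressed store: identical tensors are kept once."""

    def __init__(self):
        self.blobs = {}      # fingerprint -> (shape, dtype, values)
        self.manifests = {}  # model name -> {tensor name: fingerprint}

    def add_model(self, model_name, tensors):
        manifest = {}
        for tensor_name, (shape, dtype, values) in tensors.items():
            key = fingerprint(shape, dtype, values)
            # Store the payload only if this fingerprint has not been seen.
            self.blobs.setdefault(key, (shape, dtype, values))
            manifest[tensor_name] = key
        self.manifests[model_name] = manifest

    def load_model(self, model_name):
        """Rebuild a model's tensors from its manifest."""
        return {name: self.blobs[key]
                for name, key in self.manifests[model_name].items()}


# Two hypothetical checkpoints that share an embedding but differ in the head.
embedding = ((2, 2), "float32", [0.1, 0.2, 0.3, 0.4])
base = {"embed": embedding, "head": ((2,), "float32", [1.0, 2.0])}
tuned = {"embed": embedding, "head": ((2,), "float32", [1.5, 2.5])}

store = TensorStore()
store.add_model("base", base)
store.add_model("tuned", tuned)
```

Here the two models hold four tensors in total but the store keeps three payloads, because the shared embedding maps to one fingerprint. Fine-tuned variants that differ slightly in every tensor would not benefit from exact hashing, which is presumably where clustering and compression come in.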

2604.15333 2026-05-14 cs.HC cs.AI

Technically Love: The Evolution of Human-AI Romance Discourse on Reddit

Tyler Chang, Jina Huh-Yoo, Afsaneh Razi

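Affiliations: Drexel University; Stevens Institute of Technology

AI summary: This paper studies how public discussion of human-AI romantic relationships on Reddit has evolved, analyzing 3,383 self-disclosed posts from 2017 to 2025. Using topic modeling and temporal statistical analysis, it finds that the focus of discussion shifted from intimate relationships toward platform governance, technical issues, and real-world consequences, revealing a transition from private experience to technical mediation and regulation, with implications for the design and governance of companion AI systems.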

Comments Accepted at ACM CUI 2026

Abstract

Human-AI romantic relationships are increasingly common, yet little is understood about how public discourse around them emerges and shifts over time. Prior research has examined user experiences and ethical concerns, but lacks longitudinal analyses of user-initiated public discussions. We address this gap by analyzing a high-precision dataset of 3,383 self-disclosed romantic companion AI posts from Reddit (2017-2025), using topic modeling and temporal statistical analysis to identify dominant themes and their evolution over time. We find significant topic drift, with discussions moving away from positive intimate relationships toward platform governance, technical issues, and real-world consequences. These shifts highlight a transition in how human-AI romance is framed, moving from private experiences to technical mediation and regulation, with implications for the design and governance of companion AI systems.

2604.03551 2026-05-14 cs.SE cs.AI cs.HC

AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub

Daniel Ogenrwot, John Businge

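Affiliations: University of Nevada, Las Vegas

AI summary: This paper introduces AgenticFlict, a large-scale dataset for studying merge conflicts in pull requests (PRs) submitted by AI coding agents on GitHub. The dataset contains more than 142,000 agent-generated PRs; among the 107K+ that were successfully processed through merge simulation, over 29,000 exhibit merge conflicts, a conflict rate of 27.67%. The analysis indicates that agent-generated contributions can cause frequent and often substantial merge conflicts, underscoring the need to understand and manage integration challenges in AI-assisted software development.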

Comments Accepted at the 3rd ACM International Conference on AI-Powered Software (AIware 2026)

Abstract

Software Engineering 3.0 marks a paradigm shift in software development, in which AI coding agents are no longer just assistive tools but active contributors. While prior empirical studies have examined productivity gains and acceptance patterns in AI-assisted development, the challenges associated with integrating agent-generated contributions remain less understood. In particular, merge conflicts, a fundamental aspect of collaborative software development, remain underexplored in this context. In this paper, we present AgenticFlict, a large-scale dataset of textual merge conflicts in AI coding agent pull requests (Agentic PRs). The dataset comprises 142K+ Agentic PRs collected from 59K+ repositories, of which 107K+ are successfully processed through deterministic merge simulation. Our pipeline identifies 29K+ PRs exhibiting merge conflicts, yielding a conflict rate of 27.67%, and extracts 336K+ fine-grained conflict regions across these instances. Our preliminary exploratory analysis indicates that merge conflicts are both frequent and often substantial in AI-generated contributions, with noticeable variation across agents, emphasizing the need to better understand and manage integration challenges in AI-assisted software development. The dataset, code, and supplementary materials are available on Zenodo: https://doi.org/10.5281/zenodo.19396916.
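A textual conflict region is the span Git writes between its `<<<<<<<`, `=======`, and `>>>>>>>` markers when a merge cannot be resolved automatically. The sketch below extracts such regions from a conflicted file; the function name, the returned dictionary layout, and the sample file are assumptions for illustration, and the dataset's own pipeline may extract regions differently.

```python
def extract_conflict_regions(text):
    """Split a file containing Git conflict markers into conflict regions.

    Each region is a dict with the lines from "ours" (after <<<<<<<),
    "base" (after |||||||, present only in diff3-style output), and
    "theirs" (after =======, up to >>>>>>>).
    """
    regions = []
    current = None
    side = None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current = {"ours": [], "base": [], "theirs": []}
            side = "ours"
        elif line.startswith("|||||||") and current is not None:
            side = "base"
        elif line.startswith("=======") and current is not None:
            side = "theirs"
        elif line.startswith(">>>>>>>") and current is not None:
            regions.append(current)
            current, side = None, None
        elif current is not None:
            current[side].append(line)
    return regions


# Hypothetical conflicted file, as Git would leave it after a failed merge.
SAMPLE = """\
def greet():
<<<<<<< HEAD
    return "hello"
=======
    return "hi"
>>>>>>> agent-branch
print(greet())
"""

regions = extract_conflict_regions(SAMPLE)
```

Counting regions per file and lines per side gives simple size measures for how substantial a conflict is. A line of source code that itself begins with seven equals signs inside a region would be misread as a separator, so this is a sketch rather than a robust parser.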

2603.14066 2026-05-14 cs.MA cs.AI cs.LG

A Benchmark for Multi-Party Negotiation Games from Real Negotiation Data

Leo Benac, Jonas Raedler, Zilin Ma, Finale Doshi-Velez

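Affiliations: School of Engineering and Applied Sciences

AI summary: This paper presents a benchmark for multi-party negotiation games built from real negotiation data, for studying negotiations that unfold as sequences of binding commitments. It combines a configurable negotiation game generator with document-grounded instances derived from a climate negotiation exercise and provides several baseline solvers. Experiments show that no solver dominates across regimes and that performance depends on the structure of the game, motivating new negotiation methods that remain robust across diverse strategic regimes.

Abstract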

Many real-world multi-party negotiations unfold as sequences of binding, action-level commitments rather than a single final outcome, yet this regime remains under-studied in existing benchmarks. We introduce a benchmark and evaluation framework for this setting, combining a configurable negotiation game generator with document-grounded instances derived from a climate negotiation exercise. We also provide several baseline solvers. Exact evaluation on small games and comparative evaluation on larger instances show that no solver dominates across regimes; performance depends on the structural properties of the game. These results motivate the creation of novel negotiation methods that value partial commitments robustly across diverse strategic regimes. Code and data for the benchmark are available at: https://anonymous.4open.science/r/negotiation_MARL-46B8

2602.12783 2026-05-14 cs.IR cs.AI

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang

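Affiliations: Huazhong University of Science and Technology; The University of Hong Kong; Soochow University; University of Science and Technology of China; Wuhan University; Tsinghua University; The University of Tokyo

AI summary: SQuTR is a benchmark for evaluating the robustness of spoken-query-to-text retrieval systems under complex acoustic noise. The authors aggregate a large set of multi-domain English and Chinese text queries and synthesize speech mixed with many categories of real-world environmental noise, producing a large-scale, reproducible evaluation dataset. Experiments show that retrieval performance drops as noise increases, with markedly different drops across systems, highlighting robustness as a bottleneck for current models. SQuTR provides a standardized testbed and diagnostic analysis for related research.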

Comments Accepted by SIGIR 2026

Journal ref Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20-24, 2026, Melbourne, VIC, Australia

Abstract

Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that includes a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix 17 categories of real-world environmental noise under controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations on representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, with substantially different drops across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval.
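Mixing noise "under controlled SNR levels" usually means scaling the noise so that the ratio of speech power to noise power hits a target in decibels, SNR = 10 log10(P_speech / P_noise). The sketch below does this on plain Python lists; the function names and the synthetic signals are assumptions, and the benchmark's actual mixing code may differ, for example in how power is measured over silent segments.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Synthetic stand-ins for real recordings: a 440 Hz tone at 16 kHz and a
# deterministic pseudo-noise sequence.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [0.3 * math.sin(12.9898 * t) for t in range(1600)]

mixed, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
achieved_snr = 10 * math.log10(power(speech) / power(scaled_noise))
```

Because the gain is computed from measured powers, the same speech and noise pair always yields the same mixture at a given SNR, which is what makes this kind of evaluation reproducible.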

2602.11202 2026-05-14 cs.LO cs.AI

interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification

Vishak K Bhat, Prateek Chanda, Vijval Ekbote, Ashmit Khandelwal, Maitreyi Swaroop, Vineeth N. Balasubramanian, Subbarao Kambhampati, Nagarajan Natarajan, Amit Sharma

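Affiliations: Microsoft Research; IIT Bombay; Carnegie Mellon University; Arizona State University

AI summary: The paper proposes interwhen, a general framework that steers reasoning models through verification at test time. By monitoring the reasoning trace and running verifiers asynchronously, it detects and corrects errors promptly while adding negligible overhead on correct executions. interwhen also synthesizes verifiers automatically from natural-language policy documents, addressing the scarcity of verifiers, and improves task completion and policy compliance of reasoning models.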

Comments 56 pages, 6 figures

Abstract

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate reasoning traces. It addresses two key challenges. First, given a set of verifiers, obtaining verifiable states from the reasoning trace typically requires prompt engineering or external task decomposition into fixed steps. Instead, we propose a monitoring system that periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers are run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur. Second, beyond math and code, a central challenge for process verification is the scarcity of verifiers. interwhen addresses this through automatic verifier synthesis from natural-language policy documents. Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and z3. Together, these contributions yield a plug-and-play test-time verification system that can improve task completion and policy compliance of any reasoning agent. On reasoning benchmarks where policies encode mathematical or logical constraints, interwhen achieves near-perfect accuracy for reasoning models using a fraction of the tokens of baselines. On agentic benchmarks with policy-based verifier generation, it enables improvements in task quality for SLMs without any finetuning, e.g., task completion rate of Qwen3-30B jumps from 32% to 87% on the telecom domain in tau2-bench. Code at https://github.com/microsoft/interwhen.

2602.06713 2026-05-14 stat.ML cs.LG

Distribution Shift in Missing Data Imputation: A Risk-Based Perspective and Importance-Weighted Correction under MAR

Luke Shannon, Song Liu, Katarzyna Reluga

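Affiliations: School of Mathematics, University of Bristol; School of Business and Economics, Humboldt University of Berlin

AI summary: Taking a risk-minimization view, this paper rigorously formulates missing data imputation as a mean-squared error risk minimization problem and shows that, when the probability of missingness depends on the data, many existing methods ignore the distribution shift between the observed training data and the full data distribution, and therefore do not minimize mean-squared error on the full data distribution. The authors propose an importance-weighted correction algorithm that explicitly accounts for this shift; simulation studies show improvements over uncorrected baselines in both RMSE and Wasserstein distance.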

Comments 9 pages, 12 figures

Abstract

Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk minimisation problem. We show that when the probability of missingness depends on the data, many state-of-the-art methods fail to account for the resulting distribution shift between the observed data used for training and the full data distribution used for evaluation. Consequently, these approaches do not minimise mean-squared error on the full data distribution. Instead, we propose a novel imputation algorithm designed to learn an imputation model from the observed data while explicitly accounting for this distribution shift. Simulation studies show consistent improvements over otherwise identical uncorrected baselines, with average reductions of 3% in RMSE and 7% in Wasserstein distance.
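The distribution shift described above can be made explicit in a simplified setting. The derivation below assumes a single partially missing variable $X_j$, a set of always-observed covariates $Z$, and a missingness indicator $M_j$ that is independent of $X_j$ given $Z$ (MAR). This setting and the notation are assumptions for illustration; the paper's formulation is likely more general.

```latex
% Setting (assumed for illustration): Z is always observed, X_j may be missing,
% M_j = 1 means X_j is missing, and MAR gives X_j \perp M_j \mid Z.

% Target: risk on the full data distribution
R(f) = \mathbb{E}\bigl[ (X_j - f(Z))^2 \bigr]

% Naive training objective: risk on the observed rows only
R_{\mathrm{obs}}(f) = \mathbb{E}\bigl[ (X_j - f(Z))^2 \mid M_j = 0 \bigr]

% Importance-weighted correction
R(f) = \mathbb{E}\bigl[ w(Z)\,(X_j - f(Z))^2 \mid M_j = 0 \bigr],
\qquad
w(z) = \frac{p(z)}{p(z \mid M_j = 0)} = \frac{\Pr(M_j = 0)}{\Pr(M_j = 0 \mid Z = z)}
```

Under MAR the conditional law of $X_j$ given $Z$ is the same in observed and unobserved rows, so only the marginal of $Z$ shifts, and the weight $w$ undoes that shift. The identity requires $\Pr(M_j = 0 \mid Z = z) > 0$ wherever $p(z) > 0$. When missingness does not depend on $Z$, $w \equiv 1$ and the naive objective is already correct.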

2602.05242 2026-05-14 cs.SE cs.AI cs.LG

EGSS: Entropy-guided Stepwise Scaling for Reliable Software Engineering

Chenhui Mao, Yuanting Lei, Zhixiang Wei, Ming Liang, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li

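Affiliations: Ant Group

AI summary: This paper proposes EGSS, an entropy-guided stepwise scaling framework that addresses the high computational overhead and limited performance gains of agentic test-time scaling (TTS) on software engineering tasks. Through entropy-guided adaptive search and robust test-suite augmentation, EGSS dynamically balances efficiency and effectiveness, lowering inference-time compute while improving results on tasks such as code generation and bug fixing. Experiments show gains of 5-10% across the evaluated models and a reduction of over 28% in token usage compared with existing TTS methods.

Abstract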

Agentic Test-Time Scaling (TTS) has delivered state-of-the-art (SOTA) performance on complex software engineering tasks such as code generation and bug fixing. However, its practical adoption remains limited due to significant computational overhead, primarily driven by two key challenges: (1) the high cost associated with deploying excessively large ensembles, and (2) the lack of a reliable mechanism for selecting the optimal candidate solution, ultimately constraining the performance gains that can be realized. To address these challenges, we propose Entropy-Guided Stepwise Scaling (EGSS), a novel TTS framework that dynamically balances efficiency and effectiveness through entropy-guided adaptive search and robust test-suite augmentation. Extensive experiments on SWE-Bench-Verified demonstrate that EGSS consistently boosts performance by 5-10% across all evaluated models. Specifically, it increases the resolved ratio of Kimi-K2-Instruct from 63.2% to 72.2%, and GLM-4.6 from 65.8% to 74.6%. Furthermore, when paired with GLM-4.6, EGSS achieves a new state-of-the-art among open-source large language models. In addition to these accuracy improvements, EGSS reduces inference-time token usage by over 28% compared to existing TTS methods, achieving simultaneous gains in both effectiveness and computational efficiency.
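The abstract does not say what the entropy is computed over or how it drives the search. One plausible reading, sketched below purely as an assumption, is to measure the Shannon entropy of the model's distribution at a decision step and spend more candidates where that distribution is flat. The function names, the linear mapping, and the `max_branches` value are hypothetical and are not taken from EGSS.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def branches_for_step(probs, max_branches=4):
    """Map normalized entropy in [0, 1] to a number of candidates between 1 and max_branches."""
    if len(probs) < 2:
        return 1
    # Dividing by log(n) rescales entropy so a uniform distribution scores 1.
    normalized = shannon_entropy(probs) / math.log(len(probs))
    return 1 + round(normalized * (max_branches - 1))


# Hypothetical distributions over four candidate actions at two steps.
confident_step = [0.97, 0.01, 0.01, 0.01]
uncertain_step = [0.25, 0.25, 0.25, 0.25]
```

With these inputs the confident step gets a single candidate and the uniform step gets the maximum, so compute concentrates on steps where the model is unsure instead of being spread evenly across a large ensemble.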

2602.04113 2026-05-14 cs.CR cs.LG

ZKBoost: Zero-Knowledge Verifiable Training for XGBoost

Nikolas Melissaris, Antigoni Polychroniadou, Akira Takahashi, Chenkai Weng, Jiayi Xu

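AI summary: ZKBoost is a zero-knowledge verifiable training protocol for XGBoost that lets a model owner prove the model was trained correctly on a committed dataset without revealing the data or model parameters. To address the prohibitive cost of naive approaches and the security issues of prior work, ZKBoost provides a generic zero-knowledge proof-of-training template and an efficient, secure VOLE-based instantiation. The authors also develop a fixed-point version of XGBoost suited to zero-knowledge proofs, which matches standard XGBoost accuracy to within 1% on real-world datasets.

Abstract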

Gradient boosted decision trees, particularly XGBoost, are among the most effective methods for tabular data. As deployment in sensitive settings increases, cryptographic guarantees of model integrity become essential. We present ZKBoost, the first zero-knowledge proof of training (zkPoT) protocol for XGBoost, enabling model owners to prove correct training on a committed dataset without revealing data or model parameters. Naively re-executing XGBoost training in ZK would incur prohibitive costs, primarily due to the oblivious partitioning of training samples and unknown tree splits. Moreover, previous work on ZKP of training and inference had subtle security issues, such as leakage of tree topology and soundness gaps allowing cheating model providers to deviate from the correct execution of training and inference. We make two key contributions to address these challenges: (1) a generic zkPoT template for XGBoost that can be instantiated with any general-purpose ZKP backend, significantly improving prover costs compared to naive re-execution of the training process; and (2) a VOLE-based instantiation that overcomes the security issues of previous ZK proofs of training at minimal costs. To maximize efficiency, we develop a fixed-point version of XGBoost, which is particularly well suited for efficient instantiation of ZKP, and show it matches standard XGBoost accuracy to within 1% on real-world datasets.

2602.03730 2026-05-14 stat.ML cs.LG

Efficient Generative Prediction for EHR Foundation Models: The SCOPE and REACH Estimators

Luke Solo, Matthew B. A. McDermott, William F. Parker, Bashar Ramadan, Michael C. Burkhart, Brett K. Beaulieu-Jones

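Affiliations: University of Chicago; Columbia University

AI summary: The paper proposes two efficient estimators, SCOPE and REACH, for clinical outcome prediction with generative models trained on electronic health records (EHR). Both use the next-token probability distributions that standard Monte Carlo sampling leaves underutilized, addressing its limitations in sparsity, computational cost, and variance. Experiments show that they substantially reduce the number of generated tokens while preserving calibration, with the largest reductions on rare but important clinical outcomes, thereby lowering inference cost.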

Comments 10 pages, 4 figures, 1 Table

Abstract

Generative foundation models trained on tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction via Monte Carlo sampling of simulated future trajectories. However, this approach suffers from three coupled limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational cost, and high sampling variance. We propose two new estimators that leverage next-token probability distributions underutilized by standard Monte Carlo: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH). We prove both are unbiased, that REACH guarantees variance reduction over Monte Carlo for any model and outcome, and that REACH is a Rao-Blackwellization of any naive importance sampling scheme that preserves the non-outcome token distribution. Empirically, across $11$ clinically important outcomes in MIMIC-IV and the UChicago health system, SCOPE and REACH match $100$-sample Monte Carlo accuracy with median token reductions of $2.5\times$ to $3.4\times$ and reductions exceeding $80\times$ for the rarest outcomes, with calibration preserved throughout. Because SCOPE reuses a single sampled pool across an arbitrary number of outcomes at no marginal generation cost while REACH provides a per-task variance guarantee, the two estimators are complementary in deployment and together meaningfully reduce the inference budget required for generative EHR foundation models, particularly for rare, high-impact outcomes in healthcare.

2601.22792 2026-05-14 eess.AS cs.CL cs.SD

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda, Chyi-Jiunn Lin, Shinji Watanabe

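Affiliations: Honda Research Institute Japan Co., Ltd.; Carnegie Mellon University

AI summary: This paper proposes CALM, a joint contextual acoustic-linguistic modeling framework for personalizing multi-speaker automatic speech recognition (ASR). It models acoustic and linguistic cues jointly through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. Experiments on English and Japanese speech mixture datasets show substantial reductions in biased error rates, supporting the method's effectiveness across languages.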

Comments Accepted to IEEE ICASSP 2026

Abstract

We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.

2601.19979 2026-05-14 hep-th cs.LG quant-ph

Exploring the holographic entropy cone via reinforcement learning

Temple He, Jaeha Lee, Hirosi Ooguri

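Affiliations: Walter Burke Institute for Theoretical Physics & Leinweber Forum for Theoretical Physics, California Institute of Technology, Pasadena, CA 91125, USA; Kavli Institute for the Physics and Mathematics of the Universe (WPI), University of Tokyo, Kashiwa 277-8583, Japan

AI summary: This paper presents a reinforcement learning algorithm for studying the holographic entropy cone. The algorithm searches for a graph realization whose min-cut entropies match a target entropy vector, which determines whether the vector lies in the cone and helps locate the cone's facets. The authors confirm that it rediscovers monogamy of mutual information for the N=3 cone, then apply it to the N=6 cone, finding graph realizations for three extreme rays and evidence that another three are not realizable, which implies that unknown holographic entropy inequalities exist.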

Comments 39 pages, 10 figures, 2 tables; v2: minor clarifications, version appearing in JHEP

Details
English abstract

We develop a reinforcement learning algorithm to study the holographic entropy cone. Given a target entropy vector, our algorithm searches for a graph realization whose min-cut entropies match the target vector. If the target vector does not admit such a graph realization, it must lie outside the cone, in which case the algorithm finds a graph whose corresponding entropy vector most nearly approximates the target and allows us to probe the location of the facets. For the $\sf N=3$ cone, we confirm that our algorithm successfully rediscovers monogamy of mutual information beginning with a target vector outside the holographic entropy cone. We then apply the algorithm to the $\sf N=6$ cone, analyzing the 6 "mystery" extreme rays of the subadditivity cone from arXiv:2412.15364 that satisfy all known holographic entropy inequalities yet lacked graph realizations. We found realizations for 3 of them, proving they are genuine extreme rays of the holographic entropy cone, while providing evidence that the remaining 3 are not realizable, implying unknown holographic inequalities exist for $\sf N=6$.
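
To make "graph realization" concrete, here is a minimal sketch of how a weighted graph defines an entropy vector through min-cuts, and how monogamy of mutual information (MMI) can be checked on it. It is a brute-force illustration, not the paper's reinforcement learning search; the star graph, the vertex labels, and the function names are hypothetical choices.

```python
from itertools import product

def min_cut_entropy(edges, boundary, bulk, region):
    """Minimum cut weight separating `region` from the other boundary vertices.

    Brute force: try every assignment of bulk vertices to the two sides.
    """
    region = set(region)
    best = float("inf")
    for bits in product([False, True], repeat=len(bulk)):
        side = {v: (v in region) for v in boundary}
        side.update(zip(bulk, bits))
        cut = sum(w for (u, v), w in edges.items() if side[u] != side[v])
        best = min(best, cut)
    return best

def mmi_slack(edges, boundary, bulk, parties=("A", "B", "C")):
    """S_AB + S_BC + S_AC - S_A - S_B - S_C - S_ABC; non-negative when MMI holds."""
    a, b, c = parties

    def entropy(*region):
        return min_cut_entropy(edges, boundary, bulk, region)

    pairs = entropy(a, b) + entropy(b, c) + entropy(a, c)
    singles = entropy(a) + entropy(b) + entropy(c)
    return pairs - singles - entropy(a, b, c)

# Star graph: one bulk vertex "x" joined to A, B, C and the purifier O.
STAR_EDGES = {("A", "x"): 1.0, ("B", "x"): 1.0, ("C", "x"): 1.0, ("O", "x"): 1.0}
BOUNDARY = ["A", "B", "C", "O"]
BULK = ["x"]

if __name__ == "__main__":
    print(mmi_slack(STAR_EDGES, BOUNDARY, BULK))  # prints 2.0
```

For the star graph every single-party entropy is 1 and every two-party entropy is 2, so the slack is 6 - 3 - 1 = 2 and MMI is satisfied. A search procedure like the one described in the abstract would vary the graph and its weights to match a target vector; the brute-force enumeration here only scales to a handful of bulk vertices.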

2601.18842 2026-05-14 cs.CR cs.AI cs.CV

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, Jiyan He

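Affiliations * Beijing Normal University; Zhongguancun Academy; University of Science and Technology of China; A*STAR Zhongguancun Institution of Artificial Intelligence

AI summary As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, and locations. To address the inability of existing privacy benchmarks to assess privacy risk in the context of task trajectories, this paper proposes GUIGuard-Bench, a benchmark of 241 real GUI-agent trajectories and 4,080 screenshots that supports privacy recognition, evaluation of planning fidelity under protected screenshots, and utility analysis of different protection strategies. The study finds that current models detect the presence of private information fairly well but fall clearly short in fine-grained localization, category recognition, risk assessment, and task-necessity judgment.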

Details
English abstract

As GUI agents increasingly rely on screenshots to perceive and operate digital environments, they may inadvertently expose sensitive information such as identities, accounts, locations, and behavioral traces. While existing benchmarks primarily focus on task completion, grounding, or defenses against third-party attacks, current visual privacy datasets remain largely restricted to static natural images, limiting their ability to capture the contextual dependence and task relevance of privacy risks in GUI task trajectories. To bridge this gap, we introduce GUIGuard-Bench, a first-step benchmark for studying privacy-preserving GUI agents in trajectory-based GUI workflows. GUIGuard-Bench contains 241 real GUI-agent trajectories with 4,080 screenshots across Android and PC environments. Each screenshot is annotated at the region level with privacy bounding boxes, semantic privacy categories, risk levels, and whether the private information is necessary for completing the task. Built on these annotations, GUIGuard-Bench supports three complementary evaluations: privacy recognition, offline planning fidelity under protected screenshots, and the utility impact of different protection strategies. Our results show that current models can often detect whether a screenshot contains private information, but they struggle with fine-grained localization, category recognition, risk assessment, and task-necessity judgment. We also find that closed-source models, exemplified by Claude Sonnet 4.6, can maintain largely consistent planner semantics in Android environments after privacy protection is applied. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: https://futuresis.github.io/GUIGuard-page/
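
The region-level annotation described above can be pictured as a small record per bounding box. The sketch below is an assumption about shape only: the field names, the example category and risk values, and the masking rule are hypothetical, and intersection over union (IoU) is shown as one common way to score localization, not necessarily the benchmark's metric.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PrivacyRegion:
    """One region-level annotation (field names are hypothetical)."""
    box: tuple            # (x_min, y_min, x_max, y_max) in pixels
    category: str         # semantic privacy category, e.g. "account"
    risk_level: str       # e.g. "low", "medium", "high"
    task_necessary: bool  # whether the task needs this information

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def regions_to_mask(regions):
    """Boxes a protection step could hide: regions the task does not need."""
    return [r.box for r in regions if not r.task_necessary]
```

Keeping `task_necessary` on the record is what lets a protection strategy hide some regions while leaving task-critical ones visible, which is the trade-off the utility evaluation is meant to measure.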

2511.01303 2026-05-14 cs.CR cs.LG

Differentially Private Nonparametric Confidence Intervals Under Minimal Distributional Assumptions

Tomer Shoham, Moshe Shenfeld, Noa Velner-Harris, Katrina Ligett

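Affiliations * School of Computer Science and Engineering

AI summary This paper studies the construction of differentially private nonparametric confidence intervals under minimal distributional assumptions. It proposes a general framework that converts any differentially private estimator satisfying certain conditions into a nonparametric confidence interval for an arbitrary target quantity. The method repeatedly subsamples the data, applies the private estimator, and post-processes the empirical cumulative distribution function into a confidence interval, without relying on a specific limiting distribution or strong assumptions. It comes with theoretical guarantees and performs well in practice, outperforming existing methods especially on non-smooth functionals and complex distributions.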

Details
English abstract

We consider the problem of constructing differentially private nonparametric confidence intervals (CIs) for an arbitrary quantity using resampling. A growing body of work has adapted resampling ideas to the private setting, including private bootstrap methods \cite{brawner2018bootstrap, wang2025differentially,dette2025gaussian} and BLB-based subsample-and-aggregate approaches \cite{covington2025unbiased, chadha2024resampling}. However, existing methods typically rely on strong assumptions, such as asymptotic normality, or are tied to specific privacy mechanisms such as noise addition, and can be impractical in finite-sample regimes. We address these problems by introducing a simple, general framework that can convert any differentially private estimator satisfying mild conditions into a differentially private nonparametric CI for arbitrary target quantities. Our method repeatedly subsamples the data, applies the private estimator to each subset, and post-processes the resulting empirical CDF into a CI. The framework is black-box, and does not require a specific limiting distribution. We prove that the empirical CDF induced by our procedure converges to the sampling distribution of the private statistic, which implies that the resulting CI is asymptotically valid and tight, and provide heuristic guidance for choosing the hyperparameters. Empirically, our method outperforms competing general approaches, especially for non-smooth functionals and more challenging distributions.
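
The data flow in the abstract (subsample, apply a private estimator, read an interval off the empirical CDF) can be sketched as follows. This is an illustration of the pipeline's shape only, under loud assumptions: the Laplace-noised clipped mean is a stand-in private estimator, the subsample size and repetition count are arbitrary, no privacy budget is tracked across repetitions, and the paper's post-processing and validity conditions are not reproduced. All function names are hypothetical.

```python
import random

def private_mean(values, epsilon, rng, lo=0.0, hi=1.0):
    """Laplace-mechanism mean of values clipped to [lo, hi] (stand-in estimator)."""
    clipped = [min(max(v, lo), hi) for v in values]
    scale = (hi - lo) / (len(clipped) * epsilon)  # sensitivity / epsilon
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clipped) / len(clipped) + noise

def subsample_interval(data, estimator, m, reps, alpha, rng):
    """Empirical (alpha/2, 1 - alpha/2) quantiles of the estimator over
    `reps` random subsamples of size `m` drawn without replacement."""
    stats = sorted(estimator(rng.sample(data, m)) for _ in range(reps))
    lo_idx = int((alpha / 2) * (reps - 1))
    hi_idx = int((1 - alpha / 2) * (reps - 1))
    return stats[lo_idx], stats[hi_idx]
```

The estimator is passed in as a black box, which mirrors the framework's claim that it does not depend on a particular privacy mechanism. The interval returned here reflects variability at subsample size `m`, so it should not be read as a calibrated interval for the full sample without the corrections the paper analyzes.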

2511.00839 2026-05-14 cs.SE cs.AI

CodeClash: Benchmarking Goal-Oriented Software Engineering

John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Muhtasham Oblokulov, Aryan Siddiqui, Ofir Press, Ludwig Schmidt, Diyi Yang

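Affiliations * Stanford University; Princeton University; Cornell University; Technical University of Munich

AI summary Current coding benchmarks mainly evaluate language models on concrete, well-specified tasks, such as fixing a particular bug or writing targeted tests, and fail to reflect the iterative, goal-driven nature of real software development. The study therefore proposes CodeClash, a benchmark in which language models compete in multi-round tournaments to optimize codebases toward competitive objectives, using head-to-head code matches to assess strategic planning and long-term maintenance. Experiments show that, although models differ in development style, they remain markedly limited in strategic reasoning and long-term codebase maintenance, and repeatedly lose to expert human programmers.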

Details
English abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.

2510.16986 2026-05-14 stat.ML cs.LG stat.OT

When to Transfer: Adaptive Source Selection for Positive Transfer in Linear Models

Hamza Cherkaoui, Hélène Halconruy, Yohan Petetin

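Affiliations * SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris; Modal’X, Université Paris-Nanterre

AI summary In many practical settings, labeled data for the target task are scarce or costly to obtain, which limits supervised learning. This paper studies how, in a multi-source setting, to selectively transfer information from related source tasks through sample sharing to improve target-task performance. It proposes an accept/reject rule based on a data-dependent estimate of the transfer gain to decide from which sources, and how many samples, to incorporate, and proves that the method guarantees positive transfer with high probability. Experiments on synthetic and real data show that it outperforms classical and recent strong baselines while avoiding negative transfer.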

Details
English abstract

In many business settings, task-specific labeled data are scarce or costly to obtain, limiting supervised learning on a target task. A classical response is transfer learning (TL). Many TL works study how to transfer information from related sources. We study, for linear regression and classification, when to transfer via sample sharing: in a multi-source setting, we greedily decide from which sources and how many samples to incorporate into the target dataset. Our method uses an accept/reject rule based on a data-dependent estimate of the transfer gain, i.e., the marginal decrease in target predictive error, computed conditionally on the observed target samples. We analyze our approach and show that the derived statistical test enforces positive transfer with high probability. Under additional standard conditions, we also study the transfer gain itself and characterize when transfer is beneficial. Experiments on synthetic and real data show consistent gains over classical and recent strong baselines while avoiding negative transfer.
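
A greedy accept/reject loop of this kind can be sketched for one-dimensional regression through the origin. The sketch swaps in a simpler gain estimate than the paper's: it measures the drop in error on a held-out target set, whereas the abstract describes an estimate computed conditionally on the observed target samples. It also accepts or rejects each source as a whole rather than choosing how many samples to take. The large holdout is for illustration only, and all names and data-generating settings are hypothetical.

```python
import random

def fit_slope(samples):
    """Least-squares slope for y = w * x (no intercept)."""
    sxx = sum(x * x for x, _ in samples)
    return sum(x * y for x, y in samples) / sxx

def mse(w, samples):
    """Mean squared error of the slope `w` on (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in samples) / len(samples)

def select_sources(train, holdout, sources):
    """Greedily pool a source only if it lowers held-out target error."""
    pooled, accepted = list(train), []
    best = mse(fit_slope(pooled), holdout)
    for name, samples in sources.items():
        candidate = mse(fit_slope(pooled + samples), holdout)
        if candidate < best:  # estimated transfer gain is positive: accept
            pooled, best = pooled + samples, candidate
            accepted.append(name)
    return accepted, best

def make_samples(w, n, rng, noise=0.5):
    """Draw n noisy points from y = w * x with x uniform on [-1, 1]."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, w * x + rng.gauss(0, noise)) for x in xs]
```

Because a source is pooled only when the estimated error decreases, the final held-out error can never exceed the target-only baseline. That is the empirical analogue of the positive-transfer property, although without the paper's high-probability guarantee.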

2510.15297 2026-05-14 cs.CY cs.AI cs.HC cs.SI

VERA-MH Concept Paper

Luca Belli, Kate H. Bentley, Will Alexander, Emily Ward, Matt Hawrilenko, Kelly Johnston, Mill Brown, Adam M. Chekroud

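Affiliations * Spring Health; Yale University School of Medicine

AI summary This paper introduces VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated system for evaluating the safety of AI chatbots in mental health settings, with a focus on recognizing and responding to suicide risk. Drawing on the experience of clinical experts, the team designed an evaluation rubric and uses two ancillary AI agents, a user-agent and a judge-agent, to simulate user conversations and to score the chatbot's performance, respectively. The system is still under development and clinical validation; it has been applied in preliminary tests of models including GPT-5 and Claude Opus, and the authors plan to further refine the rubric and its clinical utility.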

Details
English abstract

We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.

2510.14244 2026-05-14 eess.IV cs.AI cs.CV

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Arnaud Judge, Nicolas Duchateau, Thierry Judge, Roman A. Sandler, Joseph Z. Sokol, Christian Desrosiers, Olivier Bernard, Pierre-Marc Jodoin

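Affiliations * Department of Computer Science, University of Sherbrooke; INSA, Universite Claude Bernard Lyon 1, CNRS UMR 5220, Inserm U1206, CREATIS; Dep. of Software and Information Technology Engineering, École de technologie supérieure; Institut Universitaire de France (IUF)

AI summary This study addresses domain adaptation in echocardiography segmentation and proposes RL4Seg3D, an unsupervised domain adaptation framework based on reinforcement learning. By introducing novel reward functions and a fusion scheme, the method improves the precision of key anatomical landmarks in its segmentations and maintains good temporal consistency while processing full-sized video inputs. Experiments show that, without any target-domain labels, it clearly outperforms conventional domain adaptation techniques and provides a robust uncertainty estimate that can further improve segmentation performance.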

Comments 13 pages, accepted for publication in IEEE TMI

Details
English abstract

Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.

2510.01502 2026-05-14 q-bio.NC cs.CV cs.LG

Behavioral Geometric Supervision Aligns Video Foundation Models with Human Social Perception

Kathy Garcia, Leyla Isik

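Affiliations * Department of Cognitive Science; Department of Biomedical Engineering; Johns Hopkins University

AI summary Current video foundation models fall short in capturing how humans organize information in dynamic social scenes and struggle to predict human similarity judgments for social video clips. This paper proposes behavioral geometric supervision (BGS), which constrains the local and global geometry of the embedding space to align with the similarity relations among videos. Experiments show that the method substantially improves performance on human similarity judgment tasks and lets the models capture social-affective attributes that language embedding models do not, yielding video understanding closer to human social perception.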

Comments v2: Major revision. Retitled; expanded from TimeSformer alone to four backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, CLIP), with V-JEPA 2.1 nearly tripling pretrained performance. Adds zero-shot PHASE transfer, attention-rollout analysis, and a language-distillation control. Data (OOO sim. judgments) & core hybrid triplet+RSA LoRA method unchanged from v1. Prepared for NeurIPS 2026 submission

Details
English abstract

Current video foundation models, including the strongest self-supervised models such as V-JEPA 2, fail to capture how humans organize social information in dynamic scenes. For example, across a range of diverse vision models tested, none were able to predict human similarity judgments of social video clips as well as a sentence embedding model of the caption text (MPNet). We show this gap in vision model performance can be closed by a compact behavioral supervisory signal. We introduce behavioral geometric supervision (BGS): a hybrid objective that constrains local and global pairwise embedding geometry to match the relational similarity structure across videos. We apply this method using a new human similarity dataset, containing 49,484 odd-one-out judgments from 250 naturalistic social video clips, and low-rank adaptation across four ViT backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, and CLIP). We find that one of the best fine-tuned models, V-JEPA 2.1, nearly triples in performance compared to the pre-trained baseline and reaches close to the noise ceiling, exceeding the strongest sentence-embedding baseline. In addition, fine-tuned models (i) capture unique variance in human judgments that caption-based language embeddings do not, (ii) develop interpretable social-affective attributes (valence, arousal, and dominance) despite never being trained on any of these attributes, (iii) zero-shot transfer to a separate dataset of out-of-distribution abstract social interactions, and (iv) shift spatial attention from scene context to socially informative regions (faces, gaze, and interacting bodies). A matched language-distillation control fails to reproduce these gains, ruling out caption transfer as the mechanism. Our results show how a modest amount of human behavioral data can steer video models toward human-like social visual understanding.

2509.04477 2026-05-14 math.OC cs.LG

Universal Representation of Generalized Convex Functions and their Gradients

Moeen Nehzati

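Affiliations * Department of Economics, New York University

AI summary This paper studies universal representations of generalized convex functions and their gradients. It proposes a new differentiable layer with a convex parameter space and proves that the layer is a universal approximator for generalized convex functions and their gradients. The paper demonstrates the parameterization in two applications, learning optimal transport maps and designing multi-good auctions, where it turns nested bilevel optimization problems into single-level problems that can be solved efficiently.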

Details
English abstract

A wide range of optimization problems can often be written in terms of generalized convex functions (GCFs). When this structure is present, it can convert certain nested bilevel objectives into single-level problems amenable to standard first-order optimization methods. We provide a new differentiable layer with a convex parameter space and show (Theorems 5.1 and 5.2) that it and its gradient are universal approximators for GCFs and their gradients. We demonstrate how this parameterization can be leveraged in practice by (i) learning optimal transport maps with general cost functions and (ii) learning optimal auctions of multiple goods. In both these cases, we show how our layer can be used to convert the existing bilevel or min-max formulations into single-level problems that can be solved efficiently with first-order methods.

2508.20474 2026-05-14 eess.AS cs.CL cs.SD

Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe

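Affiliations * Honda Research Institute Japan, Japan; Carnegie Mellon University, USA

AI summary This paper proposes a unified multi-speaker encoder (UME) that uses a shared speech foundation encoder to jointly learn representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR). The method applies a residual weighted-sum encoding (RWSE) over UME's hidden representations from multiple layers to fuse information from different semantic levels and strengthen alignment and synergy across tasks. Experiments show that UME clearly outperforms separately trained baselines on LibriMix; for SD in particular it reaches diarization error rates of 1.37% and 2.29%, better than previously reported results.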

Comments Accepted to IEEE ASRU 2025

Details
English abstract

This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.
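
The abstract does not spell out how the residual weighted-sum encoding is computed. One plausible reading, sketched below purely as an assumption, is a softmax-weighted sum of per-layer hidden vectors (a common way to pool encoder layers) added to the last layer's output as a residual. The parameterization, the placement of the residual, and the function names are all hypothetical and may differ from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def residual_weighted_sum(layer_outputs, layer_logits):
    """Combine per-layer feature vectors for one frame.

    layer_outputs: equal-length feature vectors, one per encoder layer.
    layer_logits:  one learnable scalar per layer (assumed parameterization).
    Returns the last layer's features plus the softmax-weighted sum of all layers.
    """
    weights = softmax(layer_logits)
    dim = len(layer_outputs[0])
    mixed = [
        sum(w * h[i] for w, h in zip(weights, layer_outputs))
        for i in range(dim)
    ]
    return [last + mix for last, mix in zip(layer_outputs[-1], mixed)]
```

With equal logits the weights are uniform, so two layers `[1, 0]` and `[0, 1]` mix to `[0.5, 0.5]`, and adding the last layer gives `[0.5, 1.5]`. In a trained model the logits would be learned, letting each task emphasize lower or higher layers.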

2506.03120 2026-05-14 stat.AP cs.LG

Validating remotely sensed biomass estimates with forest inventory data in the western US

Xiuyu Cao, Joseph O. Sexton, Panshi Wang, Dimitrios Gounaridis, Neil H. Carter, Kai Zhu

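Affiliations * School for Environment and Sustainability, University of Michigan; terraPulse, Inc.

AI summary This study validates the accuracy of aboveground biomass density (AGBD) data from the commercial remote sensing company terraPulse, using US Forest Service Forest Inventory and Analysis (FIA) data as an independent reference. Validation was carried out over 64,000-hectare hexagons and at the county scale in the US states of Nevada, Utah, and Washington. terraPulse and FIA estimates agree closely at the county scale, with R² of 0.90 and a correlation coefficient of 0.95. The study also identifies reasons for the discrepancies between terraPulse and FIA in non-forest areas and high-biomass forests, and proposes a scalable validation framework based on independent FIA data, providing a benchmark for a new commercial dataset for global biomass monitoring.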

Comments 32 pages, 5 figures

Journal ref Science of Remote Sensing, Volume 13, June 2026, 100441

Details
English abstract

Monitoring aboveground biomass (AGB) and its density (AGBD) at high resolution is essential for carbon accounting and ecosystem management. While NASA's spaceborne Global Ecosystem Dynamics Investigation (GEDI) LiDAR mission provides globally distributed reference measurements for AGBD estimation, the majority of commercial remote sensing products based on GEDI remain without rigorous or independent validation. Here, we present an independent regional validation of an AGBD dataset offered by terraPulse, Inc., based on independent reference data from the US Forest Service Forest Inventory and Analysis (FIA) program. Aggregated to 64,000-hectare hexagons and US counties across the US states of Utah, Nevada, and Washington, we found very strong agreement between terraPulse and FIA estimates. At the hexagon scale, we report $R^2$ = 0.88, RMSE = 26.68 Mg/ha, and a correlation coefficient (r) of 0.94. At the county scale, agreement improves to $R^2$ = 0.90, RMSE = 32.62 Mg/ha, slope = 1.07, and r = 0.95. Spatial and statistical analyses indicated that terraPulse AGBD values tended to exceed FIA estimates in non-forest areas, likely due to FIA's limited sampling of non-forest vegetation. The terraPulse AGBD estimates also exhibited lower values in high-biomass forests, likely due to saturation effects in its optical remote-sensing covariates. This study advances operational carbon monitoring by delivering a scalable framework for comprehensive AGBD validation using independent FIA data, as well as a benchmark validation of a new commercial dataset for global biomass monitoring.
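
The agreement statistics quoted above can be computed from paired estimates as in the sketch below. The reported values are numerically consistent with the coefficient of determination being the squared Pearson correlation (0.94 squared is about 0.88, and 0.95 squared is about 0.90), so the sketch uses that definition. This is an assumption about how the paper computed it, as is the choice to regress the product estimate on the reference. The function name is hypothetical.

```python
import math

def agreement_metrics(reference, predicted):
    """Pearson r, r squared, OLS slope of predicted on reference, and RMSE."""
    n = len(reference)
    mx, my = sum(reference) / n, sum(predicted) / n
    sxx = sum((x - mx) ** 2 for x in reference)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(reference, predicted))
    r = sxy / math.sqrt(sxx * syy)
    rmse = math.sqrt(
        sum((y - x) ** 2 for x, y in zip(reference, predicted)) / n
    )
    return {"r": r, "r2": r * r, "slope": sxy / sxx, "rmse": rmse}
```

Note that correlation and slope are insensitive to a constant offset while RMSE is not: a product that reads a uniform 2 Mg/ha above the reference has r = 1 and slope = 1 but RMSE = 2. That is why the abstract reports these quantities together rather than relying on a single number.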

2505.14587 2026-05-14 stat.ML cs.LG

High-Dimensional Analysis of Bootstrap Ensemble Classifiers

Malik Tiomoko, Hamza Cherkaoui, Mohamed El Amine Seddik, Cosme Louart, Ekkehard Schnoor, Balazs Kegl

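Affiliations * Huawei Noah’s Ark Lab, France; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; Technology Innovation Institute, United Arab Emirates; Chinese University of Hong Kong, China; Fraunhofer HHI, Germany, Aalto University Finland

AI summary This paper gives a theoretical analysis of bootstrap methods applied to least-squares support vector machine (LSSVM) ensemble classifiers, focusing on regimes where both the sample size and the feature dimension are large. Using tools from random matrix theory, it studies the performance of the classifier obtained by aggregating the decision functions of multiple weak classifiers and examines how bootstrap methods behave in high-dimensional settings. Based on this analysis, it proposes strategies for choosing the number of subsets and the regularization parameter to improve the LSSVM ensemble; experiments on synthetic and real datasets support the theoretical conclusions.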

Details
English abstract

Bootstrap methods have long been the cornerstone of ensemble learning in machine learning. This paper presents a theoretical analysis of bootstrap techniques applied to the Least Square Support Vector Machine (LSSVM) ensemble in the context of large and growing sample sizes and feature dimensionalities. Using tools from Random Matrix Theory, we investigate the performance of this classifier that aggregates decision functions from multiple weak classifiers, each trained on different subsets of the data. We provide insights into the use of bootstrap methods in high-dimensional settings, enhancing our understanding of their impact. Based on these findings, we propose strategies to select the number of subsets and the regularization parameter that maximize the performance of the LSSVM. Empirical experiments on synthetic and real-world datasets validate our theoretical results.

2505.04613 2026-05-14 stat.ML cs.LG math.ST stat.TH

Kernel Embeddings and the Separation of Measure Phenomenon

Leonardo V. Santoro, Kartik G. Waghmare, Victor M. Panaretos

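Affiliations * Departement Mathematik, ETH Zürich

AI summary This paper studies the ability of kernel embeddings to distinguish continuous probability distributions and proves that kernel covariance embeddings achieve information-theoretically perfect separation. It shows that testing the equality of two non-atomic probability measures on a locally compact uncountable Polish space is equivalent to testing the singularity of two centered Gaussian measures on a reproducing kernel Hilbert space. This phenomenon points to a core mechanism behind the strong performance of kernel methods in high-dimensional or complex domains and offers a theoretical basis for designing efficient inference tools.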

Details
English abstract

We prove that kernel covariance embeddings lead to information-theoretically perfect separation of distinct continuous probability distributions. In statistical terms, we establish that testing for the equality of two non-atomic (Borel) probability measures on a locally compact uncountable Polish space is equivalent to testing for the singularity between two centered Gaussian measures on a reproducing kernel Hilbert space. The corresponding Gaussians are defined via the notion of kernel covariance embedding of a probability measure, and the Hilbert space is that generated by the embedding kernel. Distinguishing singular Gaussians is structurally simpler from an information-theoretic perspective than non-parametric two-sample testing, particularly in complex or high-dimensional domains. This is because singular Gaussians are supported on essentially separate and affine subspaces. Our proof leverages the classical Feldman-Hájek dichotomy, and shows that even a small perturbation of a continuous distribution will be maximally magnified through its Gaussian embedding. This "separation of measure phenomenon" appears to be a blessing of infinite dimensionality, by means of embedding, with the potential to inform the design of efficient inference tools in considerable generality. The elicitation of this phenomenon also appears to crystallize, in a precise and simple mathematical statement, a core mechanism underpinning the empirical effectiveness of kernel methods.