arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12899 2026-05-14 stat.ML cs.LG

Robust Sequential Experimental Design for A/B Testing

Qianglin Wen, Xiangkun Wu, Chengchun Shi, Ting Li, Niansheng Tang, Yingying Zhang, Hongtu Zhu

Affiliations: Yunnan Key Laboratory of Statistical Modeling and Data Analysis; Yunnan University; School of Mathematical Sciences; Zhejiang University; Department of Statistics; London School of Economics and Political Science; School of Statistics and Data Science; Shanghai University of Finance and Economics; East China Normal University; University of North Carolina at Chapel Hill

AI summary: This paper studies robust sequential experimental design for A/B testing under model misspecification and proposes a unified framework that covers both contextual bandit and dynamic settings. Theoretically, the method bounds the worst-case mean squared error of the estimated treatment effect; empirically, its effectiveness is validated on synthetic data and on real-world data from a technology company.

Abstract

Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model misspecification and develop a unified framework that covers both contextual bandit and dynamic settings. Theoretically, we prove that our design bounds the worst-case mean squared error of the estimated treatment effect. Empirically, we demonstrate the effectiveness of the proposed approach using synthetic and real-world datasets from a leading technology company.

2605.12898 2026-05-14 cs.SI cs.CL cs.CY

When Do LLMs Generate Realistic Social Networks? A Multi-Dimensional Study of Culture, Language, Scale, and Method

Sai Hemanth Kilaru, Sriram Theerdh Manikyala, Raghav Upadhyay, Sri Sai Kumar Ramavath, Srivika Nunavathu, Dalal Alharthi

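Affiliations: University of Arizona

AI summary: This paper studies how well large language models (LLMs) generate realistic social networks and examines how cultural framing, prompt language, model scale, and prompting method affect their tie formation. Building on homophily theory and structural balance theory, the authors formalize four tie-formation mechanisms and compare them across conditions in extensive experiments. They find that prompting method and cultural framing significantly shape the structure of the generated networks, and that LLM-generated networks match real social graphs on clustering and modularity better than standard graph baselines while encoding demographic biases above the levels found in real data.

Abstract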

Large language models (LLMs) are increasingly used as substitutes for human subjects in behavioral simulations, including synthetic social network generation. Yet it remains unclear how their relational outputs depend on prompt design, cultural framing, prompt language, and model scale. Building on homophily theory and structural balance theory, we formalize four LLM-based tie-formation mechanisms: sequential, global, local, and iterative, and treat them as distinct conditional distributions over edge sets. Using a fixed roster of 50 demographically grounded personas, we generate 192 verified directed networks across four cultural contexts, four prompt languages, three GPT-4.1 variants, and four prompting architectures, with two seeds per condition. We find that cultural framing shifts inbreeding homophily and largest-component connectivity. Political affiliation dominates tie formation under three methods, while the global method substitutes age, showing that prompt architecture functions as a substantive sociological variable. Model scale produces a stable divergence ranking, with the smallest variant behaving qualitatively differently rather than merely noisily. Prompt language alone sharply shifts religion homophily, especially under Hindi prompting, while leaving political homophily nearly invariant. LLM-generated networks match real social graphs on clustering and modularity better than standard graph baselines, yet encode demographic biases above empirical levels. These results show that prompt choices often treated as implementation details encode substantive sociological assumptions.

2605.12890 2026-05-14 stat.AP cs.LG

Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

Luxu Liang, Xiang Li

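Affiliations: Tsinghua University; University of Pennsylvania

AI summary: As large language models (LLMs) advance rapidly, machine-generated text is becoming harder to distinguish from human-written text. To address this, the paper proposes Steer-to-Detect (S2D), a two-stage detection framework: a steering vector injected into the hidden states of a frozen observer model improves class separability, and detection is then performed by hypothesis testing on the steered representations. The method comes with rigorous theoretical error guarantees and performs well across a range of settings, including out-of-distribution scenarios and adversarial perturbations.

Abstract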

The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (S2D), a two-stage framework for detecting LLM-generated text. In the first stage, S2D learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, S2D achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.

2605.12887 2026-05-14 cs.IR cs.AI

EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents

Hengwei Ye, Jiasheng Mao, Zhenhan Guan, Zheng Tian

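Affiliations: ShanghaiTech University

AI summary: This paper proposes EcoGEO, a trajectory-aware, ecosystem-level approach to Generative Engine Optimization for web-enabled LLM search agents. Unlike existing methods that optimize individual pages, EcoGEO considers the agent's whole browsing trajectory and builds a coordinated evidence ecosystem that guides the agent to discover and verify the target information. Experiments show that the method significantly outperforms page-level baselines on recommendation tasks, mainly because it shapes the agent's browsing path and evidence-acquisition process.

Abstract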

Web-enabled LLM agents are changing how online information influences search outcomes. Existing Generative Engine Optimization (GEO) studies mainly focus on individual webpages. However, agentic web search is not a single-document setting: an agent may issue queries, crawl pages, follow links, reformulate searches, and synthesize evidence across multiple browsing steps. Influence therefore depends not only on page content, but also on how pages are organized, connected, and encountered along the agent's browsing trajectory. We study this shift through Ecosystem Generative Engine Optimization (EcoGEO), which treats GEO as an environment-level influence problem for web-enabled LLM agents. To instantiate this perspective, we propose TRACE, a Trajectory-Aware Coordinated Evidence Ecosystem. Given a recommendation query and a fictional target product, our method builds a controlled evidence environment that coordinates an agent-facing navigation entry page with heterogeneous support pages. These pages use shared terminology, internal links, and consistent product attributes to introduce, verify, and reinforce the target product. We evaluate our method on OPR-Bench, a benchmark for open-ended product recommendation. Experiments show that it consistently outperforms page-level GEO baselines in final target recommendation. Trajectory-level metrics further show increased initial target-result crawls, target-specific follow-up searches, and internal-link crawls, suggesting that the gains come from shaping the agent's evidence-acquisition process rather than merely adding more target-related content. Overall, our findings support an ecosystem research paradigm for GEO, where web-enabled LLM agents are studied in relation to the broader evidence environments that guide search, browsing, and answer synthesis.

2605.12878 2026-05-14 math.OC cs.LG

Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization

Yaxin Yu, Long Chen, Minfu Feng

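Affiliations: School of Mathematics; Department of Mathematics; University of California, Irvine; Sichuan University

AI summary: This paper proposes Adam-SHANG, a Lyapunov-guided adaptive optimization algorithm that combines momentum, adaptive preconditioning, and a curvature-aware correction to improve stability. For stochastic smooth convex optimization, it proves convergence in expectation under a stepsize condition that a conservative bound can always satisfy, without assuming global monotonicity of the second-moment sequence. The paper also introduces a computable trace-ratio stepsize rule and tests the method in nonconvex settings; experiments show competitive training performance on deep learning tasks.

Comments: 25 pages, 13 figures

Abstract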

We propose Adam-SHANG, a Lyapunov-guided Adam-type method that couples momentum, adaptive preconditioning, and a curvature-aware correction through a more stable lagged-preconditioner update. For stochastic smooth convex optimization, we prove convergence in expectation under an admissible stepsize condition that can always be satisfied by a conservative spectral bound, without imposing global monotonicity on the second-moment sequence. To obtain a less conservative practical rule, we introduce a computable trace-ratio stepsize, motivated by a local coordinatewise alignment condition. The same structural update is also tested beyond the convex setting with simplified parameters. Experiments validate the predicted stochastic decay and show competitive training performance against Adam and AdamW on deep learning tasks.

2605.12869 2026-05-14 cs.CR cs.AI

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Zvi Topol

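Affiliations: MuyVentive, LLC

AI summary: This paper studies how the safety of large language models (LLMs) degrades under sustained adversarial attack and proposes a new evaluation framework based on survival analysis to quantify the temporal dynamics of jailbreaks. The approach treats time-to-jailbreak as the event time in a survival model, which allows estimation of hazard functions, survival curves, and associated risk factors. Experiments show that different models exhibit distinct vulnerability profiles under repeated attacks, giving model developers a useful basis for improvement.

Abstract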

Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the time-to-jailbreak as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model demonstrates rapid degradation under iterative attacks, the two other models show consistent moderate vulnerability. Our framework provides actionable insights for model and LLM application developers and establishes survival analysis as a rigorous methodology for LLM safety evaluation.
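
As a rough illustration of the survival-analysis framing in this abstract, the sketch below computes a Kaplan-Meier survival curve for time-to-jailbreak. Kaplan-Meier is a standard estimator chosen here for illustration; the abstract does not say which estimators the paper uses. The function name `kaplan_meier` and the attack-round data are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time with at least one observed event.

    durations: attack round at which the jailbreak happened or the run stopped
    observed:  True if a jailbreak occurred, False if the run was censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # every run leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: fraction of at-risk runs that survive round t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical data: six attack runs against one model (not from the paper)
rounds = [1, 2, 2, 3, 5, 5]
jailbroken = [True, True, False, True, False, False]
curve = kaplan_meier(rounds, jailbroken)
for t, s in curve:
    print(f"round {t}: estimated P(not yet jailbroken) = {s:.3f}")
```

Runs that end without a jailbreak are treated as censored: they stay in the risk set until they stop, but they never lower the curve themselves.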

2605.12863 2026-05-14 cs.PL cs.AI cs.CR

Language-Based Agent Control

Timothy Zhou, Loris D'Antoni, Nadia Polikarpova

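Affiliations: Department of Computer Science and Engineering

AI summary: This paper proposes language-based agent control (LBAC), a programming model that applies programming-language and language-based security techniques to the control of agentic applications. Agents must generate programs that pass static type checking, which ensures that their behavior satisfies user-specified policies such as access control and information-flow control; the type-checker rejects unsafe programs before execution. LBAC remains highly expressive while applying policies uniformly to agent-generated behavior and developer-written scaffolding, and the paper demonstrates it with three case studies.

Abstract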

This paper introduces language-based agent control (LBAC), a new programming model for agentic applications that brings techniques from programming languages and language-based security to the problem of agent control. In conventional programming, combinations of static typing and runtime enforcement have long been used to guarantee that well-typed programs satisfy user-specified policies, including policies for access control, information flow, data provenance, and more. The key idea behind LBAC is to extend these guarantees to agentic applications by requiring agents to generate programs that are themselves well typed in the context of the surrounding scaffolding code. Unsafe programs are rejected by the type-checker before execution, allowing policies to apply uniformly across the entire application, including both agent-generated behavior and developer-written scaffolding. At the same time, LBAC preserves substantial expressiveness: agents may perform arbitrary side-effect-free computation and recursively invoke subagents, which retain full tool access subject to the same -- or potentially more restrictive -- policies. We demonstrate LBAC with three case studies: I/O sandboxing via filesystem capabilities, data provenance, and information-flow control.

2605.12862 2026-05-14 cs.NI cs.LG

NeuroRisk: Physics-Informed Neural Optimization for Risk-Aware Traffic Engineering

Yingming Mao, Ximeng Liu, Jingyi Cheng, Xiyuan Liu, Jiashuai Liu, Yike Liu, Zhen Yao, Yuzhou Zhou, Siyuan Feng, Qiaozhu Zhai, Shizhen Zhao

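Affiliations: Xi’an Jiaotong University; Shanghai Jiao Tong University; Nanyang Technological University; Huawei; Shanghai Innovation Institute

AI summary: In production wide-area networks (WANs), correlated failures are the main cause of availability loss, forcing operators to reserve large safety margins that waste substantial capacity. Achieving high utilization under strict availability targets requires risk-aware traffic engineering over many probabilistic failure scenarios, which existing methods cannot solve efficiently at operational timescales. The paper proposes NeuroRisk, a physics-informed deep unrolled optimizer that exploits the Sort-and-Select structure and combines gated edge-local reservations with permutation-invariant, gradient-aligned cues to balance expressiveness and tractability. Experiments on production-style WANs show orders-of-magnitude speedups over the solver on risk objectives while keeping optimality gaps small.

Abstract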

In production Wide-Area Networks (WANs), correlated failures dominate availability losses, forcing operators to reserve large safety margins that leave substantial capacity underutilized. Achieving high utilization under strict availability targets therefore requires risk-aware Traffic Engineering (TE) over dozens to hundreds of probabilistic failure scenarios, yet solving this problem at operational timescales remains elusive. We demonstrate that existing risk-aware formulations can be unified under an embedded Sort-and-Select structure, exposing a fundamental trade-off between expressiveness and tractability: classical optimizers either restrict scenario selection for efficiency or incur prohibitive decomposition costs. While deep learning appears promising, prior Deep TE methods mainly target maximum link utilization and rely on scaling-based feasibility, which fundamentally breaks under explicit capacity constraints and scenario-dependent risk. We present NeuroRisk, a physics-informed deep unrolled optimizer that exploits the structure of Sort-and-Select. NeuroRisk enforces feasibility via gated edge-local reservations and represents scenario sets through permutation-invariant, gradient-aligned cues. Evaluations on production-style WANs show that NeuroRisk achieves small optimality gaps relative to the solver with orders of magnitude speedup ($10^2$ to $10^5\times$) on risk objectives, while outperforming neural baselines on nominal throughput.

2605.12857 2026-05-14 cs.MA cs.AI cs.AR cs.LG

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Zhongkai Yu, Yichen Lin, Chenyang Zhou, Yuwei Zhang, Kun Zhou, Junxia Cui, Haotian Ye, Zhengding Hu, Zaifeng Pan, Ruiyi Wang, Yujie Zhao, Hejia Zhang, Jingbo Shang, Jishen Zhao, Yufei Ding

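Affiliations: UCSD; Columbia University

AI summary: Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they cannot meet chip vendors' security requirements and cannot make use of vendors' proprietary data. To address this, the paper proposes ChipMATE, the first self-trained multi-agent framework for RTL generation, in which a Verilog agent and a Python reference-model agent verify each other's outputs, so that no golden testbench is needed to check the generated code. The method uses a backtrack-based inference workflow and a two-stage training strategy, supported by a data-generation framework for high-quality training samples, and outperforms existing models on several evaluation metrics.

Abstract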

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available at https://github.com/zhongkaiyu/ChipMATE.

2605.12840 2026-05-14 stat.AP cs.LG

Decision Support for Marketplace Policies under Incomplete Evidence: From Replay to Launch Readiness

Prashant Shekhar, Caroline Howard

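Affiliations: Department of Mathematics; Embry-Riddle Aeronautical University

AI summary: This paper studies decision support for pricing and allocation policies in real-time bidding (RTB) marketplaces when the available evidence is incomplete. The authors propose a support-aware decision-support system (DSS) that integrates replay, off-policy evaluation, conservative lower-bound ranking, multi-sided guardrails, and other methods into a claim-preserving pipeline whose output is a launch-readiness classification rather than a single performance estimate. Experiments show that the system identifies a promising floor-price policy and, because key causal evidence is missing, recommends online validation instead of direct deployment, which avoids overclaiming.

Abstract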

Marketplace platforms routinely evaluate pricing and allocation policies using logged observational data, yet strong offline performance does not imply that a policy is safe to deploy. In real-time bidding (RTB) marketplaces, reserve-price and floor-policy changes affect not only revenue but also fill, advertiser value, budget pacing, and competition across auctions, creating feedback and interference. The central problem is therefore not to estimate whether a policy improves an offline metric, but to determine whether the available evidence justifies direct launch or only further validation. In this regard, we propose a support-aware decision-support system (DSS) that distinguishes promising from actionable evidence. The framework integrates replay, support-aware off-policy evaluation (OPE), conservative lower-bound ranking, multi-sided guardrails, out-of-time validation, sensitivity analysis, and interference-aware validation design into a claim-preserving pipeline that outputs a launch-readiness classification rather than a single performance estimate. Applying the framework to iPinYou-style RTB logs, we identify a margin-gated floor policy as the leading candidate, with a 47.7% replay yield lift, a 45.8% conservative lower-tail lift, and stable out-of-time performance. However, the framework does not recommend direct launch. A decision-rule ablation shows that simplified pipelines select the same policy but incorrectly recommend deployment, leaving key causal assumptions unresolved. In contrast, the proposed DSS selects the same policy but changes the action to online validation, reflecting missing evidence on propensities, bidder response, and interference. Overall, the contribution is a reproducible DSS protocol that prevents decision overclaim under partial identification and converts offline evaluation into an auditable, action-oriented recommendation.

2605.12832 2026-05-14 stat.AP cs.LG stat.ML

Digital Twins as Synthetic Controls in Single-Arm Trials

Daniele Bertolini, Franklin Fuller, Aaron M. Smith, Jonathan R. Walsh, Run Zhuang

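Affiliations: Unlearn.AI, Inc.

AI summary: This paper discusses using digital twins as synthetic controls in single-arm trials that evaluate drug efficacy and safety. It argues that outcome-model-based synthetic controls can overcome limitations of direct data-based approaches and yield more robust treatment-effect estimates. The focus is on digital twins, personalized predictions of disease progression generated by machine learning models, together with the statistical methods, sample size calculations, and compatibility with recent FDA guidance that their practical use involves. Finally, the authors reanalyze trial data in amyotrophic lateral sclerosis and Huntington's disease to demonstrate the proposed methods.

Abstract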

Single-arm trials are an important study design for evaluating drug efficacy and safety without enrolling patients into a control arm. Although they do not provide the gold-standard evidence of randomized controlled trials, they are increasingly used in clinical development as they offer an efficient, ethical, and practical alternative. A wide variety of approaches can be used to construct control comparators and estimate treatment effects, from fixed comparators informed by clinical knowledge to data-based and model-based patient-level comparators, also known as synthetic controls. Powerful and flexible machine learning models can allow outcome-model-based synthetic controls to overcome key limitations of direct data-based approaches, yield more robust estimates of treatment effects, and provide a principled way to incorporate corrections or encode additional assumptions when external data are not directly comparable. In this work, we argue that outcome-model-based synthetic control arms are an important tool for single-arm trials. We focus on digital twins, personalized predictions of disease progression generated from machine learning models trained on historical datasets, which naturally leverage these flexible approaches. We review doubly robust estimators, present power and sample size formulas, and discuss trade-offs in selecting historical data for training and analysis. We also outline practical considerations for deploying digital twins within the framework of recent FDA draft guidance on the use of artificial intelligence in drug development. Finally, we reanalyze data from trials in amyotrophic lateral sclerosis and Huntington's disease to demonstrate the proposed methods.

2605.12814 2026-05-14 cs.SI cs.CL

Linking Extreme Discourse to Structural Polarization in Signed Interaction Networks

Zhijin Guo, Li Zhang, Tyler Bonnet, Janet B. Pierrehumbert, Xiaowen Dong

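Affiliations: University of Oxford; University College London; Imperial College London

AI summary: This study links extreme discourse in online communities to structural polarization through a language-grounded signed-network analysis framework. It derives continuous signed edge weights from LLM stance scores and quantifies structural polarization with two complementary measures, one spectral and one partition-based. Experiments show that the approach is more sensitive to how edge magnitudes affect polarization dynamics, and an application to Reddit Brexit discussions shows how discourse signals relate to structural polarization over time.

Abstract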

Polarization in online communities is often studied through either language or interaction structure, but the two views are rarely connected in a unified measurement pipeline. Prior work links them by building interaction graphs from human judgments of agreement and disagreement, leaving a gap between language as observed text and structure as an engineered representation of that text. We address this gap with a language-grounded signed-network pipeline that derives continuous signed edge weights from LLM stance scores and quantifies structural polarization using two complementary measures: a spectral Eigen-Sign score and a partition-based frustration score. After normalization, the two measures show substantial agreement while retaining important differences in their sensitivity to edge magnitude. Applying the framework to Reddit Brexit discussions, we analyze how window-level discourse signals, including toxicity, extreme scalar claims, and perplexity, relate to temporal variation in structural polarization. Edge-level and ablation analyses show that continuous, confidence-weighted signed edges reveal intensity-sensitive patterns that are muted under sign-only representations. We further report an exploratory one-step-ahead forecasting analysis suggesting that lagged language signals may contain information about future polarization beyond structural persistence. Together, the results demonstrate how discourse and signed-network structure can be connected in a single framework for measuring and interpreting polarization dynamics over time.

2605.12780 2026-05-14 stat.ME cs.LG stat.ML

When to Trust Confidence Thresholding: Calibration Diagnostics for Pseudo-Labelled Regression

Marcell T. Kurbucz

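Affiliations: Institute for Global Prosperity, The Bartlett, University College London

AI summary: This paper studies how the choice of confidence threshold affects estimates when calibrated classifier outputs are used as pseudo-labels in regression analysis. The author proposes a calibration-based diagnostic, derives a closed-form expression for the attenuation bias induced by confidence thresholding, and shows that this bias can be predicted from the residual score variance $V^{*}$ on the unlabelled set. The paper also gives a sensitivity bound under bounded calibration drift and a decision rule based on $V^{*}$ and $\kappa$ that helps practitioners judge whether confidence thresholding is safe for pseudo-labelling.

Comments: 24 pages, 6 figures, 6 tables

Abstract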

Calibrated probability outputs of trained classifiers are increasingly used as inputs to downstream regression estimands such as effects, prevalences, or disparities for a latent group observed only on a small labelled subset. A standard practice is to threshold the calibrated score at a confidence cutoff and treat the hard label as the truth. Building on a recent identification result for the underlying moment equation, we develop a calibration-aware diagnostic apparatus for pseudo-labelling pipelines. We derive a closed-form expression for the attenuation bias that confidence thresholding induces in the downstream regression coefficient, and show that the bias can be predicted, before any inference is run, from the residual score variance $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ on the unlabelled set after partialling out the downstream controls $X$. We further obtain a sharp sensitivity bound under bounded calibration drift, and identify the boundary $V^{*}=0$, which holds iff $p$ is a deterministic function of $X$; this motivates a structural separation between classifier features $W$ and downstream controls $X\subsetneq W$. Five controlled simulations and a UCI Adult illustration trace the predictions. The contribution is operational: a $(V^{*}, \kappa)$ decision rule that practitioners can compute from any classifier output to decide whether confidence thresholding is safe.
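
The quantity $V^{*}=\mathbb{E}[\operatorname{Var}(p\mid X)]$ from this abstract can be illustrated with a plug-in estimate when the controls $X$ take a small number of discrete values. This is a simplified sketch under that assumption, not the paper's estimator; the function name `residual_score_variance` and the sample values are hypothetical.

```python
from collections import defaultdict


def residual_score_variance(scores, controls):
    """Plug-in estimate of E[Var(p | X)] for discrete controls X.

    scores:   calibrated classifier scores p on the unlabelled set
    controls: one hashable control value X per score
    """
    groups = defaultdict(list)
    for p, x in zip(scores, controls):
        groups[x].append(p)
    n = len(scores)
    v_star = 0.0
    for ps in groups.values():
        mean = sum(ps) / len(ps)
        within_var = sum((p - mean) ** 2 for p in ps) / len(ps)
        v_star += (len(ps) / n) * within_var  # weight by the group's share
    return v_star


# Hypothetical scores: p varies within each level of X, so V* > 0
v_star = residual_score_variance([0.2, 0.4, 0.7, 0.9], ["a", "a", "b", "b"])

# Hypothetical boundary case: p is a deterministic function of X, so V* = 0
v_zero = residual_score_variance([0.3, 0.3, 0.8, 0.8], ["a", "a", "b", "b"])

print(v_star, v_zero)
```

The second call mirrors the boundary case named in the abstract, where $V^{*}=0$ because the score carries no variation beyond what $X$ already determines.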

2605.12778 2026-05-14 cs.GR cs.CV

Generative Motion In-betweening by Diffusion over Continuous Implicit Representations

Shiyu Fan, Paul Henderson, Edmond S. L. Ho

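Affiliations: School of Computing Science, University of Glasgow

AI summary: This paper proposes a new diffusion-based method over continuous implicit representations for generating high-quality in-between motion. By mapping between implicit neural representations and sparse spatial or temporal information in the latent space, the method generates smooth and diverse motion sequences from very few keyframes. Experiments show that it significantly improves motion generation quality while preserving keyframe accuracy.

Abstract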

Recent advances in generative models have yielded impressive progress on motion in-betweening, allowing for more complex, varied, and realistic motion transitions. However, recent methods still exhibit noticeable limitations in preserving keyframe information and ensuring motion continuity. In this paper, we propose a novel pipeline and sampling optimization strategy for latent diffusion models (LDM) based on motion implicit neural representations (INR). By establishing a mapping between INR and sparse spatial or temporal information within latent diffusion, our model can sample the INR parameters from extremely sparse and ambiguous keyframe data and reconstruct plausible and smooth motions from the manifold. Our experiments demonstrate the superior performance of our model, which significantly improves motion generation quality in scenarios with few keyframes while ensuring both keyframe accuracy and diversity of in-between motions.

2605.12756 2026-05-14 math.OC cs.AI stat.ML

Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

Zhehang Du, Hangfeng He, Weijie Su

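Affiliations: The Wharton School, University of Pennsylvania; University of Rochester

AI summary: This paper studies whether pretraining large language models by minimizing cross-entropy loss induces geometric structure in the model weights and context embeddings. By analyzing a constrained layer-peeled optimization program, the authors prove that symmetries in the target next-token distributions transfer to the model's optimal solutions in a group-theoretic sense. For example, when the target tokens have a cyclic-shift symmetry, the optimal logit matrix is circulant, and the Gram matrices of the output projections and context embeddings also have circulant geometry; for target distributions invariant under the symmetric group, the optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries of the input data. Experiments show that open-source LLMs naturally exhibit symmetries consistent with the theory, even though training includes no explicit regularization for them.

Abstract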

Large language models (LLMs) are pretrained by minimizing the cross-entropy loss for next-token prediction. In this paper, we study whether this optimization strategy can induce geometric structure in the learned model weights and context embeddings. We approach this problem by analyzing a constrained layer-peeled optimization program, which serves as a mathematically tractable surrogate for LLMs by treating the output projection matrix and last-layer context embeddings as optimization variables. Our analysis of this nonconvex optimization program demonstrates that symmetries in the target next-token distributions are transferred to the global minimizers of the layer-peeled model in a precise group-theoretic sense. Specifically, we prove that when the target tokens exhibit a cyclic-shift symmetry (such as the seven days of the week or the twelve months of the year), the optimal logit matrix is exactly circulant, and the Gram matrices of both the output projections and the context embeddings form circulant geometries as well. Next, for exchangeable target distributions invariant under the symmetric group and, more generally, under two-transitive group actions, we show that the global optimal output projection matrix forms a simplex equiangular tight frame, while the optimal logit matrix and context embeddings inherit the permutation symmetries present in the input data. A key technical step is to reduce the constrained nonconvex factorized problem to an explicit logit-level convex characterization for cyclic symmetry and to a symmetry-based lower bound for permutation symmetry, together with a sharp characterization of the optimal factorization. Finally, we empirically demonstrate that open-source LLMs naturally exhibit symmetries consistent with our theoretical predictions, despite being trained without any explicit regularization promoting such geometric structure.
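
For readers unfamiliar with the term, the sketch below shows what "circulant" means for a matrix indexed by a cycle of tokens such as the seven days of the week. It only illustrates the definition, plus the general linear-algebra fact that the Gram matrix of a circulant matrix's rows is itself circulant; it does not reproduce the paper's proof or its layer-peeled model. The helper names and the integer "logits" are hypothetical.

```python
def circulant(first_row):
    """Build the matrix whose row i is first_row cyclically shifted right by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def is_circulant(matrix):
    """True if entry (i, j) depends only on (j - i) mod n."""
    n = len(matrix)
    return all(
        matrix[i][j] == matrix[(i + 1) % n][(j + 1) % n]
        for i in range(n)
        for j in range(n)
    )


def gram(matrix):
    """Gram matrix of the rows: G[i][j] is the inner product of rows i and j."""
    return [[sum(a * b for a, b in zip(r, s)) for s in matrix] for r in matrix]


# Hypothetical integer "logits" for a 7-token cycle (values are made up)
logits = circulant([3, 2, 1, 0, 0, 1, 2])
logits_ok = is_circulant(logits)
gram_ok = is_circulant(gram(logits))
counterexample = is_circulant([[1, 2], [3, 4]])

print(logits_ok, gram_ok, counterexample)
```

Integer entries keep the equality checks exact; with floating-point logits a tolerance-based comparison would be the safer choice.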

2605.12753 2026-05-14 eess.IV cs.CV cs.LG

Optimization in Sparse 2D to Dense 3D Weakly Supervised Learning: Application to Multi-Label Segmentation of Large ex vivo MRI Data

Paul Hoareau, Kuan Yi Wang, Brandon Bujak, Roy Sun, Govind Nair, Irene Cortese, Charidimos Tsagkas, Daniel Reich, Julien Cohen-Adad

Affiliations: NeuroPoly Lab, Institute of Biomedical Engineering, Polytechnique Montreal; École Centrale de Lyon; Mila - Quebec AI Institute; Functional Neuroimaging Unit, CRIUGM, University of Montreal; Translational Neuroradiology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health; Translational Imaging in Neurology (ThINk) Basel, Department of Biomedical Engineering, Faculty of Medicine, University Hospital Basel and University of Basel; Neurologic Clinic and Policlinic, Departments of Medicine, University Hospital Basel, Switzerland; Research Center for Clinical Neuroimmunology and Neuroscience Basel (RC2NB), University Hospital Basel and University of Basel, Switzerland; National Institute of Neurological Disorders and Stroke, National Institutes of Health; Centre de recherche du CHU Sainte-Justine, Université de Montréal, Montreal, QC, Canada; Quantitative MRI core facility, NINDS, NIH; Experimental Immunotherapeutics Unit, Division of Neuroimmunology and Neurovirology, NINDS, NIH

AI summary: This study addresses multi-label segmentation of high-resolution ex vivo MRI and examines how to optimize weakly supervised learning that produces dense 3D segmentations from sparse 2D annotations. It uses a framework in which a 2D teacher network generates pseudo-labels to train a 3D student network, and systematically analyzes how human-centric visual enhancement, spatial augmentation, and soft-label regularization affect model performance. The results show that 2D and 3D models differ markedly in which optimization strategies help, so they need different regularization to segment well.

Comments: 19 pages. Submitted to Machine Learning for Biomedical Imaging (MELBA). Code and models: https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star

Abstract

INTRODUCTION | Fully supervised 3D segmentation of high-resolution ex vivo MRI is limited by the prohibitive cost of volumetric annotation, forcing reliance on sparse 2D slices. Weakly supervised Sparse-to-Dense frameworks bridge this gap, but guidelines remain ambiguous regarding human-centric visual enhancements and transferring optimization strategies across dimensions. We analyze divergent regularization needs for multi-class segmentation of high-resolution ex vivo spinal cord MRI. METHODS | We used 9.4T MRI of multiple sclerosis spinal cords (>104,000 slices) with sparse annotations (428 slices). A 2D Teacher trained on sparse slices generated dense pseudo-labels to train a 3D Student. We systematically evaluated the impact of human-centric preprocessing, spatial augmentation, and soft-label regularization on both architectures. RESULTS | We identified a critical divergence in training dynamics. The 2D Teacher required strong spatial augmentation and soft-labeling to overcome data scarcity, improving White Matter Lesion Dice scores by >11 points. However, propagating these techniques to the 3D Student degraded its performance. Furthermore, human-centric preprocessing (e.g., CLAHE) disrupted global statistical cues, dropping Gray Matter Lesion Dice scores by ~25 points. DISCUSSION | Our study highlights a perception divergence (human-centric contrast enhancement harms machine models) and a regularization conflict across dimensions. 3D architectures trained on dense pseudo-labels exhibit fundamentally different optimization landscapes than 2D counterparts and require distinct, conservative regularization. Code and models: https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star.

2605.12746 2026-05-14 cs.CR cs.AI

CoT-Guard: Small Models for Strong Monitoring

Nirav Diwan, Han Wang, Berkcan Kapusuzoglu, Ramin Moradi, Supriyo Chakraborty, Giri Iyengar, Sambit Sahu, Huan Zhang, Gang Wang

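Affiliations * University of Illinois Urbana-Champaign; Capital One

AI summary This paper proposes CoT-Guard, a small 4B-parameter model that monitors the chain-of-thought (CoT) of reasoning models to detect hidden objectives in code generation tasks. To address the weakness of small models at detecting hidden objectives, the authors design a post-training method that combines supervised fine-tuning and reinforcement learning, improving detection on both in-domain and out-of-domain tasks. Experiments show that CoT-Guard performs well across multiple attack scenarios and outperforms several larger models, offering users an efficient, low-cost defense.

Details
English abstract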

Monitoring the chain-of-thought (CoT) of reasoning models is a promising approach for detecting covert misbehavior (i.e., hidden objectives) in code generation tasks. While large models (GPT-5, Gemini-3-Flash) can serve as effective CoT monitors, they are expensive to deploy due to the lengthy reasoning traces and high API cost, emphasizing the need for smaller, cheaper alternatives. Nevertheless, we find that current small models (4B--8B) struggle to detect hidden objectives despite access to the CoT, frequently misattributing them as part of the user query. To address this, we propose a post-training pipeline combining supervised fine-tuning (SFT) and reinforcement learning (RL), where SFT narrows the gap for in-domain tasks by distilling detection behavior from stronger monitors, and RL on hard and subtly crafted hidden objectives helps the model generalize to out-of-domain monitoring tasks. To validate this generalization, we evaluate under a realistic threat model motivated by practical supply-chain attacks, where the adversary is a third-party LLM router injecting hidden objectives into code-generation requests through either prompt manipulation or code manipulation attacks. To push beyond objectives that large monitors already saturate, we also introduce four new tasks that are challenging even for strong monitors. Finally, we introduce CoT-Guard, a 4B-parameter monitor that demonstrates superior generalization performance under both prompt and code manipulation attacks, achieving a G-mean^2 (i.e., TNR x TPR) of 75% and outperforming GPT-5.4 (56%), GPT-5-mini (41%), and Qwen3-32B (54%), while closing the gap to Gemini-3-Flash (83%). These results demonstrate that CoT-Guard provides a practical and cost-effective user-side defense, substantially improving hidden-objective detection while avoiding the deployment cost of large monitors.
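
The abstract reports G-mean^2, defined there as TNR x TPR. As a minimal sketch of how that metric could be computed from a monitor's confusion-matrix counts, consider the following; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def g_mean_squared(tp: int, fn: int, tn: int, fp: int) -> float:
    """Return TNR * TPR (the squared geometric mean of the two rates)."""
    positives = tp + fn  # traces that really contain a hidden objective
    negatives = tn + fp  # benign traces
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    tpr = tp / positives  # true positive rate: hidden objectives that were flagged
    tnr = tn / negatives  # true negative rate: benign traces that were left alone
    return tnr * tpr


# Hypothetical counts, for illustration only.
score = g_mean_squared(tp=90, fn=10, tn=80, fp=20)
print(f"G-mean^2 = {score:.2f}")
```

Because the two rates are multiplied, a monitor that flags everything (TPR 1, TNR 0) or nothing (TPR 0, TNR 1) scores zero, which is presumably why the metric suits a monitoring task where both misses and false alarms are costly.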

2605.12745 2026-05-14 cs.HC cs.AI

What Do You Think I Think? Accounting for Human Beliefs Using Second-Order Theory of Mind

Patrick Callaghan, Reid Simmons, Henny Admoni

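Affiliations * Robotics Institute, Carnegie Mellon University

AI summary This study examines how an agent can reason about a person's erroneous beliefs about the agent's knowledge, and proposes a model based on second-order Theory of Mind (ToM-2) to identify and respond to human cognitive biases and heuristics. Using the I-POMDP framework, the agent models the person's erroneous beliefs and their causes and generates adaptive feedback accordingly, improving the interaction. A user study shows that the approach makes human teachers' actions more informative and that users rate the feedback as more useful.

Comments To appear in the proceedings of The 2026 Cognitive Science Society Conference

Details
English abstract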

Discrepancies between an agent's actual knowledge and what a person thinks the agent knows can hinder interactions. If an agent could detect such discrepancies, it could provide feedback to account for them and improve current and future interactions. Using the I-POMDP as a framework for a second-order Theory of Mind (ToM-2), this work endows an agent with the ability to model the evolution of a person's erroneous beliefs about an agent and the cognitive biases and heuristics (CBH) from which they arise. In doing so, the agent can detect when CBH might be at play during an interaction and adaptively generate feedback that accounts for them. An in-person user study shows how a ToM-2 learner can account for the effects of a teacher's CBH to significantly improve the informativeness of teacher actions, and subjective results suggest people find the ToM-2 learner's feedback more useful.

2605.12743 2026-05-14 cs.CR cs.CV

Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving

Shuo Ju, Qingzhao Zhang, Huashan Chen, Xuheng Wang, Haotang Li, Wanqian Zhang, Feng Liu, Kebin Peng, Sen He

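Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; The University of Arizona; Beijing Jiaotong University; East Carolina University

AI summary This study proposes a new physical adversarial attack on vision-based autonomous driving systems that uses viewing-angle variation itself as the attack tool. A static camouflage patch mounted on a vehicle produces viewpoint-dependent appearance changes under relative motion, which leads the system to predict an incorrect trajectory. Unlike earlier attacks that require multi-view robustness or active intervention, the method needs only simple deployment and can trigger unnecessary braking in autonomous vehicles across different scenes and perception models. Experiments on the nuScenes dataset show a success rate of up to 87.5%.

Details
English abstract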

Existing physical adversarial attacks on vision-based autonomous driving induce time-evolving perception errors, including biased object tracking or trajectory prediction, through (i) a sophisticated physical patch that induces detection box drift when entering the view distance, or (ii) dynamically changing patches that cause different perception errors at different times. In both cases, viewing-angle variation is treated as a challenge, requiring adversarial patches to remain effective across frames under varying views, leading to complex multi-view optimization. In contrast, we show that viewing-angle variation itself can be turned into an attack tool. We design a new attack paradigm where a static, passive adversarial camouflage is mounted on a vehicle whose view-dependent appearance naturally evolves with relative motion, inducing consistent feature drift across frames. This causes the system to infer a physically plausible but incorrect trajectory, such as a false cut-in, which propagates to downstream decision-making and triggers unnecessary braking. Unlike prior approaches that require multi-view robustness or active intervention, our attack emerges from normal driving dynamics and is easy to deploy: a parked vehicle with a natural camouflage can induce hard braking in passing autonomous vehicles. We demonstrate the novel attack on the nuScenes dataset, showing its effectiveness with an end-to-end success rate of up to 87.5%, measured by hard-braking events, and robustness across different scene backgrounds, victim vehicle speeds, and perception models.

2605.12728 2026-05-14 eess.SY cs.AI cs.SE cs.SY

Grid-Orch: An LLM-Powered Orchestrator for Distribution Grid Simulation and Analytics

Boming Liu, Jin Dong, Jamie Lian

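Affiliations * Electrification and Energy Infrastructures Division, Oak Ridge National Laboratory; UT-Battelle, LLC

AI summary This paper presents Grid-Orch, a framework that connects large language models (LLMs) with power system simulation through the Model Context Protocol (MCP), so that engineers can carry out complex distribution network analyses in natural language. Built on OpenDSS, the framework provides 36 domain-specific tools, supports several optimization tasks and multi-step engineering workflows, and is operated through an interactive web platform, making distribution analysis faster and more accessible.

Details
English abstract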

The power distribution engineering workforce faces a projected shortage of up to 1.5 million engineers by 2030, creating urgent demand for more accessible analysis tools. This paper introduces Grid-Orch, a framework that bridges Large Language Models (LLMs) and power system simulation through the Model Context Protocol (MCP), enabling engineers to perform complex distribution analyses via natural language. Using OpenDSS as the reference implementation, Grid-Orch provides 36 domain-specific tools across eleven categories, covering power flow, voltage analysis, quasi-static time series (QSTS) simulation, and automated optimization. A provider-agnostic LLM layer supports both cloud-hosted (Gemini, Claude) and locally deployed (Ollama, llama-cpp) models, enabling air-gapped operation for security-sensitive utility environments. Three optimization skills, capacitor placement, voltage violation analysis, and overvoltage mitigation, extend the platform beyond single-tool queries to multi-step engineering workflows. Grid-Orch is delivered as an interactive web platform with chat-based interaction, a QSTS dashboard, and feeder topology visualization, and renders simulation results inline. Workflow demonstrations show that distribution analyses formerly requiring hours of scripting, such as distributed energy resource (DER) interconnection screening, complete in under two minutes through natural language, producing numerically identical results to direct OpenDSS scripting.

2605.12717 2026-05-14 cs.GT cs.AI

The End Justifies the Mean: A Linear Ranking Rule for Proportional Sequential Decisions

Carmel Baharav, Niclas Boehmer, Bailey Flanigan, Maximilian T. Wittmann

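Affiliations * MIT, USA; Hasso Plattner Institute, University of Potsdam, Germany

AI summary This paper studies how to design a fair linear ranking rule for collective decision-making so that the outcome reflects different groups' preferences in proportion to their size. It proposes a simple rule based on the angular mean that achieves long-run proportionality and is more proportional on batch rankings than the conventional arithmetic mean. Experiments show that the method substantially improves proportionality when opinions diverge widely.

Details
English abstract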

AI alignment and participatory design motivate a new democratic design problem: how to collectively choose a decision rule to use repeatedly. We study this problem for linear ranking rules, which repeatedly rank items $x_j$ within batches $X=(x_1,\dots,x_m)\in(\mathbb{R}^d)^m$, where each item's ranking is dictated by its score $\langle \theta^*,x_j\rangle$ according to a fixed scoring vector $\theta^*$. Given voters' preferred scoring vectors $\theta^{(1)},\dots,\theta^{(n)}$ and their population fractions $\alpha^{(1)},\dots,\alpha^{(n)}$, we ask how to choose a collective vector $\theta^*$ satisfying individual proportionality (IP): every voter type $i$ should agree with the resulting rankings to an $\alpha^{(i)}$-proportional degree, either on average over time (long-run IP) or even within each batch (per-batch IP). The default rule, the arithmetic mean of the $\theta^{(i)}$, has been shown to be severely majoritarian; more generally, it is not clear that any fixed linear rule can balance many voters' disparate opinions. Our main result is that, surprisingly, there is a simple rule that does satisfy long-run IP: the angular mean, the spherical analog of the arithmetic mean. We then show that exact per-batch IP is impossible for fixed linear rules, but that the gap between per-batch and long-run IP shrinks quickly with batch size. Experiments on three real-world preference datasets show that all rules perform similarly when voters' preferences are homogeneous, while the angular mean substantially improves proportionality in high-disagreement regimes.
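
To make the setup concrete, here is a minimal sketch of a linear ranking rule together with the default arithmetic-mean aggregation described in the abstract. It assumes that a higher score ranks first and that the arithmetic mean is weighted by the population fractions; both are assumptions, as are the function names. The angular mean is not implemented because the abstract does not give its formula.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def rank_batch(theta: Sequence[float], batch: Sequence[Sequence[float]]) -> List[int]:
    """Return item indices ordered by descending score <theta, x_j>."""
    return sorted(range(len(batch)), key=lambda j: -dot(theta, batch[j]))


def arithmetic_mean(thetas: Sequence[Sequence[float]],
                    alphas: Sequence[float]) -> List[float]:
    """Population-weighted arithmetic mean of the voters' scoring vectors."""
    dim = len(thetas[0])
    return [sum(a * t[k] for a, t in zip(alphas, thetas)) for k in range(dim)]


# Two hypothetical voter types: 80% care only about feature 0, 20% only about feature 1.
thetas = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.8, 0.2]
batch = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

theta_star = arithmetic_mean(thetas, alphas)
print(rank_batch(theta_star, batch))
```

In this toy batch the collective vector weights feature 0 four times as heavily as feature 1, which gives a feel for how the majority's direction dominates the resulting scores.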

2605.12704 2026-05-14 cs.SC cs.AI cs.LG

FePySR: A Neural Feature Extraction Framework for Efficient and Scalable Symbolic Regression

Zhiming Yu, Wangtao Lu, Xin Lai

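Affiliations * School of Mathematical Sciences, Zhejiang University, Hangzhou, Zhejiang, China; Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland

AI summary A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. This paper proposes FePySR, a two-stage neural feature extraction framework that narrows the search space by extracting valid features before equation search, improving the efficiency and scalability of SR. The method first uses a heterogeneous neural network to constrain the observational data to a set of candidate expressions, then uses PySR to perform structural optimization within this reduced expression space. Experiments show that FePySR outperforms existing methods on several benchmarks, particularly in recovery rate and computational efficiency on complex equations.

Comments Data and Code Availability: https://github.com/laixn/FePySR

Details
English abstract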

A fundamental challenge in symbolic regression (SR) is efficiently recovering complex mathematical expressions from observational data. Although this problem is NP-hard, many expressions of practical interest decompose naturally into combinations of nonlinear feature modules, concentrating structural complexity into a small number of reusable components. Here, we introduce FePySR, a two-stage framework that reduces the SR search space by extracting valid features prior to equation search. FePySR first employs a heterogeneous neural network to constrain observational data to a set of candidate expressions, then performs structural optimization within this refined expression space using PySR. Across five standard benchmarks, FePySR outperforms state-of-the-art methods by achieving higher equation recovery rates. On a set of 75 highly complex synthesized equations, FePySR recovers 36 equations and produces substantially smaller mean squared errors on the remaining unrecovered cases, with reduced computation time compared to PySR. FePySR's first stage also maintains consistent performance under varying numbers of selected top features and increasing levels of noise in the observational data. Applied to ordinary differential equations governing biological systems, FePySR successfully identifies governing equations in 24 out of 100 tests where PySR recovers none. Taken together, FePySR is a generalizable framework that can enhance SR solvers, enabling the efficient and reliable recovery of symbolic expressions across scientific domains.

2605.12697 2026-05-14 stat.ML cs.LG math.PR

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

Tomohiro Hayase, Ryo Karakida

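Affiliations * AIST (National Institute of Advanced Industrial Science and Technology, Japan)

AI summary This paper proposes a unified framework for determining the critical scaling of the inverse-temperature parameter in self-attention, which is used to stabilize long-context processing. By analyzing the gap-counting function $N_n$ of each attention row, the authors define an upper-tail accumulation scale and prove that it determines the critical inverse temperature for softmax concentration. The framework unifies previously conflicting scaling laws and provides a direct diagnostic for attention-score distributions, from theoretical models to practical Transformers.

Details
English abstract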

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_n$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
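
The quantities in this abstract can be illustrated on a single attention row. The sketch below computes the softmax at inverse temperature `beta`, its entropy, and a gap count, taken here to be the number of competitors whose score lies within a given gap of the row maximum. That reading follows the abstract's wording, but the paper's exact definition of $N_n$ may differ; the function names and the example row are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], beta: float) -> List[float]:
    """Softmax of beta * scores, shifted by the maximum for numerical stability."""
    m = max(scores)
    weights = [math.exp(beta * (s - m)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gap_count(scores: Sequence[float], gap: float) -> int:
    """Number of competitors within `gap` of the row maximum (one maximizer excluded)."""
    m = max(scores)
    return sum(1 for s in scores if m - s <= gap) - 1


row = [3.0, 2.5, 2.0, 0.0]  # hypothetical attention scores for one query
for beta in (0.0, 1.0, 10.0):
    print(beta, round(entropy(softmax(row, beta)), 4))
print(gap_count(row, 0.5), gap_count(row, 1.0))
```

At `beta = 0` the row is uniform and the entropy equals `log(len(row))`; raising `beta` concentrates mass on the maximum and lowers the entropy, which is the collapse the abstract refers to above the critical scale.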

2605.12694 2026-05-14 cs.SE cs.AI cs.PL

Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

Jacqueline L. Mitchell, Chao Wang

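Affiliations * University of Southern California, Los Angeles, California, USA

AI summary This paper proposes agentic interpretation, a new framework that applies lattice-based static analysis methods to LLM-based program analysis. It decomposes a high-level analysis goal into localized claims and tracks the LLM's judgment about each claim in a finite-height lattice, making the analysis more transparent and systematic. A worklist algorithm drives the analysis step by step, and a worked example illustrates the approach on code that depends on third-party components. The approach aims to improve the reliability and interpretability of LLM-based program analysis.

Comments 27 pages, 6 figures

Details
English abstract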

Large language models can consult information that fixed static analyzers cannot, such as documentation, current security advisories, version-specific metadata, and informal API contracts. This makes LLMs a compelling option for program analyses that depend on information beyond the source program, or that are otherwise not amenable to conventional static analyzers. However, directly asking an LLM for a one-shot whole-program analysis is brittle because it compresses many evidence-dependent judgments into a single opaque answer, rather than exposing which conclusions are supported or disputed and using intermediate findings to guide later, more focused searches. In this paper, we propose agentic interpretation, a framework that brings the discipline of lattice-based static analysis to LLM-driven program reasoning. At a high level, agentic interpretation decomposes a high-level analysis goal into localized claims, and tracks the LLM's judgment about each claim in a finite-height lattice. A worklist algorithm governs how claims and their judgments evolve during the analysis. We introduce a formal model of agentic interpretation, explore the design space it opens, and illustrate the approach with a worked example analyzing code that depends on opaque third-party components.
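
The combination of a finite-height lattice and a worklist algorithm can be sketched in a few lines. Everything below is a hypothetical toy, not the paper's formal model: the four-element lattice, the claim names, and the `stub_judge` function (which stands in for an LLM call) are all assumptions made for illustration.

```python
from collections import deque
from enum import Enum
from typing import Callable, Dict, List


class Judgment(Enum):
    UNKNOWN = 0    # bottom: no evidence yet
    SUPPORTED = 1
    DISPUTED = 2
    CONFLICT = 3   # top: both supporting and disputing evidence seen


def join(a: Judgment, b: Judgment) -> Judgment:
    """Least upper bound in the toy lattice UNKNOWN < {SUPPORTED, DISPUTED} < CONFLICT."""
    if a == b:
        return a
    if a is Judgment.UNKNOWN:
        return b
    if b is Judgment.UNKNOWN:
        return a
    return Judgment.CONFLICT


def run_worklist(claims: List[str],
                 dependents: Dict[str, List[str]],
                 judge: Callable[[str, Dict[str, Judgment]], Judgment]) -> Dict[str, Judgment]:
    """Re-judge claims until no judgment changes; joins only move upward, so this terminates."""
    state = {c: Judgment.UNKNOWN for c in claims}
    worklist = deque(claims)
    while worklist:
        claim = worklist.popleft()
        new = join(state[claim], judge(claim, state))
        if new != state[claim]:
            state[claim] = new
            for d in dependents.get(claim, []):
                if d not in worklist:
                    worklist.append(d)  # revisit claims that rely on this one
    return state


def stub_judge(claim: str, state: Dict[str, Judgment]) -> Judgment:
    """Stand-in for an LLM query; a real system would gather evidence here."""
    if claim == "lib_call_is_sanitized":
        return Judgment.SUPPORTED
    if claim == "handler_is_safe":
        dep = state["lib_call_is_sanitized"]
        return Judgment.SUPPORTED if dep is Judgment.SUPPORTED else Judgment.UNKNOWN
    return Judgment.UNKNOWN


result = run_worklist(
    ["handler_is_safe", "lib_call_is_sanitized"],
    {"lib_call_is_sanitized": ["handler_is_safe"]},
    stub_judge,
)
print(result)
```

Because each claim's judgment can rise at most twice in this lattice, the loop is bounded even if the judge is queried repeatedly, which is the kind of guarantee that a finite-height lattice gives a conventional static analyzer.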

2605.12668 2026-05-14 stat.ML cs.LG

Online Conformal Prediction: Enforcing monotonicity via Online Optimization

Eduardo Ochoa Rivera, Ambuj Tewari

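Affiliations * University of Michigan

AI summary This paper studies online conformal prediction with the goal of producing valid, nested prediction sets at multiple confidence levels simultaneously, to serve users with heterogeneous risk tolerances. The authors propose two new online conformal prediction methods that use an online optimization perspective to enforce nestedness of the prediction sets while controlling quantile estimation error. Experiments show that, compared with existing methods, the approach achieves stable coverage across levels, strictly nested sets, and higher statistical efficiency.

Details
English abstract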

Conformal prediction provides a principled framework for uncertainty quantification with finite-sample coverage guarantees. While recent work has extended conformal prediction to online and sequential settings, existing methods typically focus on a single coverage level and do not ensure consistency across multiple confidence levels. In many real-world applications, such as weather forecasting, macroeconomic prediction, and risk management, different users operate under heterogeneous risk tolerances and require calibrated uncertainty estimates across a range of coverage levels. In such settings, it is desirable to produce prediction sets corresponding to different coverage levels that are nested and valid simultaneously. In this paper, we propose two novel online conformal prediction methods that output \emph{nested prediction sets} across a range of coverage levels, enabling simultaneous uncertainty quantification across the entire risk spectrum. Beyond interpretability, jointly estimating multiple coverage levels is known to improve statistical efficiency in classical quantile regression by enforcing non-crossing constraints and sharing information across quantiles. Our approaches leverage an online optimization perspective with small regret that translates to quantile estimation error control while enforcing nestedness of prediction sets. Empirical results on synthetic and real-world datasets, including applications in forecasting tasks with heterogeneous risk requirements, demonstrate that our method achieves stable coverage across all levels, strictly nested prediction sets, and improved efficiency compared to existing online conformal baselines.
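
To illustrate why nestedness needs to be enforced at all, the sketch below tracks one score threshold per miscoverage level with an online subgradient step on the pinball loss, then re-sorts the thresholds so that higher coverage levels always get larger thresholds. This is a generic baseline and not either of the two methods proposed in the paper; the sorting step is a simple stand-in for their monotonicity mechanism, and all names are hypothetical.

```python
from typing import List, Sequence


def update_thresholds(thresholds: Sequence[float],
                      alphas: Sequence[float],
                      score: float,
                      lr: float) -> List[float]:
    """One online step per level, followed by sorting to keep thresholds nondecreasing.

    `alphas` are miscoverage levels listed in decreasing order (e.g. 0.5, 0.2, 0.1),
    so thresholds[k] targets coverage 1 - alphas[k] and should grow with k.
    """
    updated = []
    for q, alpha in zip(thresholds, alphas):
        miss = 1.0 if score > q else 0.0       # 1 if the new score fell outside the set
        updated.append(q + lr * (miss - alpha))  # raise on a miss, lower otherwise
    return sorted(updated)                      # stand-in for a monotonicity constraint


# Independent updates can cross: the score lands between the two thresholds,
# so the lower one moves up while the upper one moves down.
print(update_thresholds([0.50, 0.52], [0.5, 0.1], score=0.51, lr=0.1))
```

With nondecreasing thresholds, the set of candidates whose nonconformity score is at most `thresholds[k]` is contained in the set for `thresholds[k + 1]`, which is the nesting property the abstract asks for.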

2605.12664 2026-05-14 cs.GT cs.LG

Profit Maximization in Bilateral Trade against a Smooth Adversary

Simone Di Gregorio, Paul Dütting, Federico Fusco, Chris Schwiegelshohn

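Affiliations * Google Research

AI summary This paper studies how a broker in bilateral trade can maximize profit against a smooth adversary. It proposes an online learning algorithm with a $\tilde{O}(\sqrt{T})$ regret bound, which is tight in the time horizon $T$ up to poly-logarithmic factors and matches the minimax rate for the stochastic i.i.d. case. Extending the strong regret guarantees from the i.i.d. case to the smooth adversary broadens the settings in which fast rates are achievable and closes an important gap in the regret landscape of this fundamental economic problem.

Details
English abstract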

Bilateral trade models the task of intermediating between two strategic agents, a seller and a buyer, who wish to trade a good. We study this problem from the perspective of a profit-maximizing broker within an online learning framework, where the agents' valuations are generated by a smooth adversary. We devise a learning algorithm that guarantees a $\tilde{O}(\sqrt{T})$ regret bound, which is tight in the time horizon $T$ up to poly-logarithmic factors. This matches the minimax rate for the stochastic i.i.d. case, and is also well separated from the adversarial setting, where sublinear regret is unattainable. By extending the strong regret guarantees from the i.i.d. case to the smooth adversary, we significantly broaden the scope of settings where such a fast rate is achievable, while closing an important gap in the regret landscape of this fundamental economic problem. To overcome the challenges posed by this adversary, we leverage a continuity property of smooth instances and combine it with a hierarchical net construction of the broker's action space, which is analyzed via algorithmic chaining. We showcase the applicability of these techniques by deriving a similarly tight $\tilde{O}(\sqrt{T})$ regret bound for a related mechanism design model: the joint ads problem.

2605.11033 2026-05-14 physics.plasm-ph cs.AI

TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma

JC Wu, Norton Lee, Kai Siang Chen

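Affiliations * TaiScience Research Group; Fu Jen Catholic University; Center for Geometry and Physics; Institute for Basic Science (IBS)

AI summary This paper studies TokaMind, a multi-modal transformer foundation model originally pre-trained on fusion plasma diagnostics data, and tests the transferability of its representations across several domains. Experiments on industrial bearing degradation, turbofan engine degradation, and power grid PMU datasets identify the characteristics under which TokaMind transfers well to power systems, and the model reaches a high F1 score on severe event classification. The study also finds that the difficulty of power grid event classification is determined mainly by grid topology rather than model capacity, and proposes a revised evaluation protocol that uses Critical Slowing Down indicators.

Comments 8 pages, 5 figures

Details
English abstract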

TokaMind is a multi-modal transformer (MMT) foundation model pre-trained on tokamak plasma diagnostics data from MAST, where it was shown to outperform CNN-based approaches on fusion benchmarks. We investigate whether its learned representations generalize to physically distinct but structurally analogous domains. Through systematic experimentation across four domains (industrial bearing degradation, NASA CMAPSS turbofan degradation, and two independent power grid PMU datasets), we identify four transfer-favoring characteristics that help explain where TokaMind's pretrained representations are most effective. Power grid synchrophasor data matches this target-domain profile most directly, while industrial degradation datasets demonstrate that TokaMind can still yield useful performance under partial alignment, especially when task design and feature construction expose physically meaningful degradation structure. On the GESL/PNNL 500-event benchmark with provider-aware evaluation, TokaMind achieves test $\text{F1} = 0.837 \pm 0.040$ (3 seeds) for severe event classification. Our central finding, however, is not the aggregate score: classification difficulty is structurally determined by provider-level grid topology, not model capacity. In the single-window early-warning regime, TokaMind outperforms a CNN baseline (F1 0.889 vs. 0.878), a reversal that disappears as more event windows are provided. Furthermore, Critical Slowing Down (CSD) indicators, used as a confidence gate rather than a classification label, improve F1 from 0.696 to 0.750 at 63% coverage, outperforming the CNN baseline (0.636) at any coverage level. These results establish the first cross-domain validation of TokaMind outside nuclear fusion and propose a transferability framework and revised evaluation protocol for multi-source PMU datasets.
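
The abstract reports F1 at a given coverage when an indicator is used as a confidence gate. One way such a gated score could be computed is sketched below, under the assumption that F1 is evaluated only on the samples that pass the gate and that coverage is the fraction of samples retained. The abstract does not spell out the protocol, so this is an illustration with hypothetical names and data, and the confidence values merely stand in for a CSD-derived indicator.

```python
from typing import Sequence, Tuple


def gated_f1(y_true: Sequence[int],
             y_pred: Sequence[int],
             confidence: Sequence[float],
             threshold: float) -> Tuple[float, float]:
    """Return (F1, coverage) over the samples whose confidence passes the gate."""
    kept = [(t, p) for t, p, c in zip(y_true, y_pred, confidence) if c >= threshold]
    coverage = len(kept) / len(y_true)
    tp = sum(1 for t, p in kept if t == 1 and p == 1)
    fp = sum(1 for t, p in kept if t == 0 and p == 1)
    fn = sum(1 for t, p in kept if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # define F1 as 0 when there are no positives at all
    return f1, coverage


# Hypothetical labels, predictions, and gate values.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
conf = [0.9, 0.2, 0.8, 0.3, 0.7, 0.6]

print(gated_f1(y_true, y_pred, conf, threshold=0.0))  # no gating
print(gated_f1(y_true, y_pred, conf, threshold=0.5))  # abstain on low-confidence samples
```

In this toy data both errors fall on low-confidence samples, so gating raises F1 while lowering coverage; reporting the two numbers together, as the abstract does, keeps that trade-off visible.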

2605.10005 2026-05-14 cs.PL cs.AI cs.LO cs.SE

Combining Mechanical and Agentic Specification Inference for Move

Wolfgang Grieskamp, Teng Zhang, Vineeth Kashyap

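Affiliations * Aptos Labs

AI summary This paper describes a specification inference tool for the Move Prover that combines weakest-precondition (WP) analysis over Move bytecode with an agentic coding tool such as Claude Code. The approach aims to reduce the tedium of writing specifications by hand: the WP analysis provides a sound mechanical baseline, while the AI agent handles the parts where WP is weakest, such as loop invariants and high-level specifications. The tool has been applied to a corpus of canonical Move code that uses higher-order functions, dynamic dispatch, global state, and other features.

Details
English abstract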

In this paper, we describe early work on a specification inference tool for the Move Prover that combines a weakest-precondition (WP) analysis over Move bytecode with an agentic coding CLI such as Claude Code. Specification inference reduces the boilerplate of writing specifications in Move: in order to verify a high-level property such as a global state invariant, pre- and post-conditions for the supporting functions typically have to be written by hand, which is tedious. In our setting, a Model Context Protocol (MCP) service exposes the WP analysis and the prover itself to the coding agent. The WP analysis provides a sound, mechanical baseline for inference; the AI is used precisely where WP is weakest -- for loop invariants and high-level idiomatic specifications such as monotonicity, conservation, and structural invariants. The Move Prover serves as the oracle that decides whether the generated specs are valid, and the agent is equipped to generate proof hints and to refine the inferred specification until verification succeeds. The tool has been applied to a corpus of canonical Move code, including code that uses higher-order functions, dynamic dispatch, global state, references, and various forms of loops.

2605.08320 2026-05-14 eess.IV cs.CV

Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks

Marwane Hariat, Antoine Manzanera, David Filliat

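Affiliations * U2IS, ENSTA, Institut Polytechnique de Paris

AI summary To address the poor performance of monocular depth estimation in low-texture regions, this paper proposes applying a distance transform over pre-semantic contours, combined with self-supervised neural networks, to improve depth prediction accuracy. The method jointly estimates pre-semantic contours, depth, and camera motion, and uses the distance transform to increase discriminative power in low-texture regions, yielding more discriminative input images and more effective loss functions. Experiments show strong performance on several datasets, surpassing existing self-supervised depth estimation methods.

Details
English abstract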

Monocular depth estimation (MDE) with self-supervised training approaches struggles in low-texture areas, where photometric losses may lead to ambiguous depth predictions. To address this, we propose a novel technique that enhances spatial information by applying a distance transform over pre-semantic contours, augmenting discriminative power in low-texture regions. Our approach jointly estimates pre-semantic contours, depth, and ego-motion. The pre-semantic contours are leveraged to produce new input images, with variance augmented by the distance transform in uniform areas. This approach results in more effective loss functions, enhancing the training process for depth and ego-motion. We demonstrate theoretically that the distance transform is the optimal variance-augmenting technique in this context. Through extensive experiments on KITTI, Cityscapes, Waymo, NYUv2, and ScanNet, our model demonstrates robust performance, surpassing competing self-supervised methods in MDE.

2605.07147 2026-05-14 cs.LO cs.AI cs.LG

MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Zixuan Xie, Xinyu Liu, Shangtong Zhang

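Affiliations * University of Virginia

AI summary This paper introduces MathlibPR, a benchmark built from real Mathlib4 pull request (PR) histories for evaluating whether large language models (LLMs) can judge if a PR to the library is ready to merge. The authors note that although LLMs have advanced in assisting formal reasoning, they do not yet contribute to Mathlib itself, whose growth is bottlenecked by human review. Using a staged evaluation protocol, the study finds that current mainstream LLMs and LLM agents still struggle to distinguish merge-ready PRs from PRs that pass the build but were revised or never merged. MathlibPR provides a supervised signal for developing reviewer assistants and reward models.

Details
English abstract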

The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning, with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency and do not directly contribute to it. Meanwhile, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLMs (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code). Surprisingly, both LLMs and LLM agents struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged. By turning Mathlib PR histories into a supervised signal, MathlibPR provides a step toward reviewer assistants and reward models that could help evaluate PRs and steer LLMs toward producing merge-ready Mathlib contributions.