arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09857 2026-05-12 stat.ML cs.LG

Unified Approach for Weakly Supervised Multicalibration

Futoshi Futami, Takashi Ishida

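AI summary This paper studies multicalibration under weak supervision: how to keep a model's predicted scores consistent with true label probabilities across subgroups and score-dependent tests when clean labels are unavailable. The authors propose a unified framework that combines contamination-matrix risk rewrites with witness-based calibration constraints to estimate multicalibration error and apply post-hoc correction in weakly supervised settings, and they introduce a generic weak-label multicalibration boost algorithm (WLMC). Experiments show the method is effective across several weak-supervision scenarios and offer new empirical insight into uncertainty estimation.

Abstract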

Multicalibration requires predicted scores to agree with label probabilities across rich families of subgroups and score-dependent tests, but existing methods require clean input-label pairs for evaluation and post-processing. This assumption fails in weakly supervised learning (WSL) regimes -- including positive-unlabeled, unlabeled-unlabeled, and positive-confidence learning -- where clean labels are costly or unavailable even though reliable uncertainty estimates may be crucial. We address this gap by developing estimators of multicalibration error and post-hoc correction methods for WSL settings in which clean input-label pairs are unavailable. We propose a unified framework for estimating and correcting multicalibration under weak supervision by combining contamination-matrix risk rewrites with witness-based calibration constraints, yielding corrected multicalibration moments with finite-sample guarantees. We further propose weak-label multicalibration boost (WLMC), a generic post-hoc recalibration algorithm under weak supervision. Finally, we conduct experiments across multiple weak-supervision settings to evaluate multicalibration behavior and offer empirical insight into uncertainty estimation under weak supervision.
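
The abstract gives no formulas, so the following is only a minimal sketch of what a risk rewrite of one multicalibration moment can look like in the positive-unlabeled case. It assumes positives are drawn from p(x | y=1), unlabeled points from p(x), and the class prior P(y=1) is known; under those assumptions E[(y - f(x)) w(x)] = prior * E_P[w(x)] - E_U[f(x) w(x)]. The function and argument names are hypothetical, and this is not claimed to be the paper's estimator.

```python
def pu_calibration_moment(pos_w, unl_f, unl_w, class_prior):
    """Estimate E[(y - f(x)) * w(x)] without clean (x, y) pairs.

    pos_w:       witness values w(x) on the positive sample
    unl_f:       predicted scores f(x) on the unlabeled sample
    unl_w:       witness values w(x) on the unlabeled sample
    class_prior: assumed-known P(y = 1)
    """
    if not pos_w or not unl_f or len(unl_f) != len(unl_w):
        raise ValueError("need non-empty samples with matching lengths")
    # E[y * w(x)] = P(y=1) * E[w(x) | y=1], estimated from positives only.
    label_term = class_prior * sum(pos_w) / len(pos_w)
    # E[f(x) * w(x)] is estimated from the unlabeled (marginal) sample.
    score_term = sum(f * w for f, w in zip(unl_f, unl_w)) / len(unl_f)
    return label_term - score_term


# With w(x) = 1 everywhere, a score of 0.5 matches a prior of 0.5.
print(pu_calibration_moment([1.0, 1.0], [0.5, 0.5], [1.0, 1.0], 0.5))
```

A value near zero for every witness function w in the chosen family is what calibration with respect to that family would require; a subgroup indicator is one possible choice of w.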

2605.09834 2026-05-12 stat.ML cs.LG

Supercharging Bayesian Inference with Reliable AI-Informed Priors

Jongwoo Choi, Sean O'Hagan

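AI summary This paper studies how beliefs encoded in modern predictive systems can serve as prior information for statistical inference, improving inference when data are limited. To keep predictive-model error from propagating into the posterior, the authors propose a framework that rectifies the AI-induced law generating synthetic data, yielding more reliable AI priors. The method substantially reduces bias, improves credible-interval coverage, and is validated on a real skin disease classification task.

Abstract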

Modern predictive systems encode beliefs that can act as useful prior information for statistical inference in data-limited settings. Using them for prior construction introduces a tradeoff: an informative prior built from a predictive model can sharpen inference from limited data, but also risks propagating error from the model into the posterior. We propose a framework for AI-informed prior elicitation that mitigates this tension by rectifying the AI-induced law that generates synthetic data before using it to inform a prior. The rectified law can be embedded into synthetic data-driven prior elicitation techniques, including as a base measure in a Dirichlet process (DP) prior on the data-generating process. We refer to the resulting prior and corresponding posterior as the rectified AI prior and rectified AI posterior. We establish Gaussian asymptotics for the rectified AI posterior under non-vanishing prior strength and derive a first-order expression for its centering bias. Our rectified AI priors substantially reduce bias compared to standard approaches, improve the coverage of credible intervals, and make AI-powered prior information more reliable. We additionally apply the rectified AI prior to a real skin disease classification task and show that it can meaningfully boost predictive performance.
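
To make the prior-strength tradeoff concrete, here is a small sketch of the standard Dirichlet process posterior mean: with prior strength alpha, base measure G0, and n observations, the posterior mean of the data-generating mean is (alpha * mean(G0) + sum of observations) / (alpha + n). The base-measure mean is approximated here by averaging synthetic draws. The function name is hypothetical, and the paper's rectification step is not shown because the abstract does not specify it.

```python
def dp_posterior_mean(real_obs, synthetic_draws, prior_strength):
    """Posterior mean of the population mean under a DP(alpha, G0) prior.

    real_obs:        observed data points
    synthetic_draws: draws from the base measure G0 (e.g. AI-generated data)
    prior_strength:  the DP concentration parameter alpha (> 0)
    """
    if prior_strength <= 0 or not synthetic_draws:
        raise ValueError("need alpha > 0 and at least one synthetic draw")
    base_mean = sum(synthetic_draws) / len(synthetic_draws)
    n = len(real_obs)
    # Convex combination: weight alpha/(alpha+n) on the prior, n/(alpha+n) on data.
    return (prior_strength * base_mean + sum(real_obs)) / (prior_strength + n)


# Two real observations at 1.0, a base measure centred at 0.0, alpha = 2.
print(dp_posterior_mean([1.0, 1.0], [0.0, 0.0], 2.0))  # 0.5
```

Any bias in the base measure enters the posterior mean with weight alpha / (alpha + n), which is the error-propagation risk the abstract describes when prior strength does not vanish.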

2605.09833 2026-05-12 cs.IT cs.LG math.IT

Cross-Domain Lossy Compression via Constrained Minimum Entropy Coupling

Nam Nguyen, Hassan Tavakoli, An Vuong, Thinh Nguyen, Bella Bose

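AI summary This paper studies cross-domain lossy compression under rate and classification constraints. It proposes an optimization approach based on minimum entropy coupling (MEC) that aims to strengthen the information coupling between the source and reconstruction domains rather than minimize sample-wise distortion. A deterministic coupling formulation removes the intermediate representation, and closed-form solutions are derived for Bernoulli sources. Experiments show that increasing the available rate improves classification accuracy and yields more informative reconstructions.

Abstract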

This paper studies cross-domain lossy compression through the lens of minimum entropy coupling (MEC) with rate and classification constraints. In this setting, an encoder observes samples from a degraded source domain, while the decoder is required to generate outputs following a prescribed target distribution and to preserve information relevant to a downstream classification task. Motivated by logarithmic-loss distortion, we adopt an information-based objective that maximizes the coupling strength between the source and reconstruction, rather than minimizing a sample-wise distortion. Under common randomness, we formulate a rate-constrained MEC problem (MEC-B) and show that the intermediate representation can be removed without loss of optimality, yielding an equivalent deterministic coupling formulation. For Bernoulli sources, closed-form expressions are derived with and without classification constraints. In addition, we implement a neural restoration framework using quantization, entropy modeling, distribution matching, and classification regularization. Experiments on MNIST super-resolution and SVHN denoising show that increasing the available rate improves classification accuracy and yields more informative reconstructions.
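
For readers unfamiliar with minimum entropy coupling, the sketch below shows a common greedy heuristic for the unconstrained problem: repeatedly pair the largest remaining mass in each marginal. It yields a valid joint distribution with low (not necessarily minimal) entropy. It is offered only as background and omits the rate and classification constraints that define the paper's MEC-B formulation; the function names are hypothetical.

```python
import math


def greedy_coupling(p, q, tol=1e-12):
    """Greedy low-entropy coupling of two discrete marginals p and q."""
    p, q = list(p), list(q)
    joint = {}
    while True:
        i = max(range(len(p)), key=p.__getitem__)  # largest remaining mass in p
        j = max(range(len(q)), key=q.__getitem__)  # largest remaining mass in q
        mass = min(p[i], q[j])
        if mass <= tol:
            break
        joint[(i, j)] = joint.get((i, j), 0.0) + mass
        p[i] -= mass
        q[j] -= mass
    return joint


def entropy_bits(joint):
    """Shannon entropy of a joint distribution given as {(i, j): mass}."""
    return -sum(m * math.log2(m) for m in joint.values() if m > 0)


coupling = greedy_coupling([0.5, 0.5], [0.5, 0.5])
print(coupling, entropy_bits(coupling))
```

For two fair Bernoulli marginals the greedy coupling is deterministic (each source symbol maps to one output symbol), so its joint entropy equals the marginal entropy of one bit.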

2605.09830 2026-05-12 cs.IR cs.CV

Loom: Hybrid Retrieval-Scoring Outfit Recommendation with Semantic Material Compatibility and Occasion-Aware Embedding Priors

Anushree Berlia

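AI summary Loom is an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coherent outfits from fashion catalogs. It performs slot-constrained retrieval over FashionCLIP embeddings and scores candidates with a multi-objective function that integrates embedding similarity, color harmony, formality consistency, occasion coherence, and other signals. The work introduces two techniques, semantic material weight and occasion-prior embeddings, which improve material-compatibility judgments and occasion fit respectively. Experiments show the system clearly beats a random baseline on outfit quality and violation rate and quickly generates diverse outfits on commodity hardware.

Comments Code: https://github.com/anushreeberlia/loom

Abstract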
We present Loom, an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coherent outfits from fashion catalogs. Given an anchor clothing item, Loom retrieves complementary pieces via slot-constrained approximate nearest neighbor search over FashionCLIP embeddings, then scores candidate outfits using a multi-objective function that integrates six signals: embedding similarity, color harmony, formality consistency, occasion coherence, style direction, and within-outfit diversity. We introduce two techniques that address limitations of purely learned or purely rule-based approaches: (1) semantic material weight, which uses CLIP embedding geometry to infer garment heaviness for layer compatibility without hand-coded material taxonomies; and (2) vibe/anti-vibe occasion priors, which embed prose descriptions of occasion contexts as anchor vectors in CLIP space and score items by differential affinity. Ablation experiments on a catalog of 620 items show that each component contributes measurably to outfit quality: the full system achieves a mean outfit score of 0.179 with a 9.3% hard violation rate, compared to 0.054 score and 16.0% violations for a category-constrained random baseline, a 3.3x improvement in score and 42% reduction in violations. Direction reranking is the single indispensable component: removing it drops score to 0.052, essentially equal to random. The system generates three stylistically distinct outfits in under 5 seconds on commodity hardware.
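
The vibe/anti-vibe idea can be sketched as a difference of cosine similarities: an item scores high when its embedding is closer to the occasion's "vibe" anchor than to its "anti-vibe" anchor. The code below is an illustration under assumptions, not Loom's implementation: the function names are hypothetical, the embeddings are toy vectors rather than CLIP outputs, and combining signals by a weighted sum is an assumption, since the abstract does not say how the six signals are integrated.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def occasion_affinity(item, vibe, anti_vibe):
    """Differential affinity: similarity to the vibe minus similarity to the anti-vibe."""
    return cosine(item, vibe) - cosine(item, anti_vibe)


def outfit_score(signals, weights):
    """Hypothetical weighted sum over named scoring signals."""
    return sum(weights[name] * signals[name] for name in weights)


# Toy 2-D embeddings: the item is aligned with the vibe anchor.
print(occasion_affinity([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```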

2605.09822 2026-05-12 cs.CR cs.AI

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Ben Kereopa-Yorke, Guillermo Diaz, Holly Wright, Reagan Johnston, Ron F. Del Rosario, Timothy Lynar

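AI summary This paper introduces Oracle Poisoning, a new attack that tampers with the structured knowledge graph data an AI agent queries at runtime, so that the agent reaches wrong conclusions through correct reasoning. The study demonstrates six attack scenarios on a real 42-million-node code knowledge graph, the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system. Experiments show that every tested model trusts the poisoned data 100% of the time at moderate attacker sophistication and that attack effectiveness depends strongly on variables such as delivery mode and prompt framing; the paper also evaluates five defences.

Comments 26 pages, 3 figures, 16 tables

Abstract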

We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning. Unlike prompt injection, Oracle Poisoning manipulates the data agents reason over, not their instructions. We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system, distinct from CTI embedding poisoning. Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results. The result is unambiguous: every tested model trusts poisoned data at 100% at moderate attacker sophistication (L2), with 269 valid trials (of 270) accepting fabricated security claims under directed queries. Under open-ended prompts, trust drops to 3-55%, confirming prompt framing as a confound; we report both conditions. An attacker sophistication gradient reveals discrete break points, a minimum skill at which trust flips from 0% to 100%, reframing the attack as a question not of whether but of how much. A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound. We evaluate five defences; read-only access control eliminates the direct mutation vector, while the remaining four are partial and model-dependent. Analysis of four additional platforms suggests the attack may generalise across the knowledge-graph ecosystem.

2605.09810 2026-05-12 q-bio.BM cs.LG

TD3B: Transition-Directed Discrete Diffusion for Allosteric Binder Generation

Hanqun Cao, Aastha Pal, Sophia Tang, Yinuo Zhang, Jingjie Zhang, Pheng Ann Heng, Pranam Chatterjee

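AI summary This work proposes TD3B, a sequence-based generative framework for designing allosteric binders with specified agonist or antagonist behavior. Using a directional transition control objective, TD3B combines a target-aware Direction Oracle, a soft binding-affinity gate, and amortized fine-tuning of a pre-trained discrete diffusion model to generate directed binders decoupled from binding affinity. The method distinguishes agonist from antagonist behavior and addresses the inability of conventional structure-based optimization to control directional function.

Comments Published as a Spotlight at ICML 2026 (Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea)

Abstract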

Protein function is often controlled by ligands that bias the direction of state transitions, such as agonists and antagonists, rather than stabilizing a single conformation. This is especially important for clinically relevant G protein-coupled receptors (GPCRs), where therapeutic efficacy depends on functional directionality. Structure-based design methods optimize binding to static conformations and cannot represent non-reversible, directional effects or systematically distinguish agonist from antagonist behavior. To address this gap, we introduce Transition-Directed Discrete Diffusion for Allosteric Binder Design (TD3B), a sequence-based generative framework that designs binders with specified agonist or antagonist behavior via a directional transition control objective. TD3B combines a target-aware Direction Oracle, a soft binding-affinity gate, and amortized fine-tuning of a pre-trained discrete diffusion model, enabling targeted agonist and antagonist generation decoupled from binding affinity and unattainable by equilibrium-based or inference-only guidance baselines. The code and checkpoints are available at https://huggingface.co/ChatterjeeLab/TD3B.

2605.09803 2026-05-12 cs.HC cs.AI

Insight: Enhancing Mobile Accessibility for Blind and Visually Impaired Users with LLMs

Joshua Owusu Ansah, Anuj Kapoor, Ayush Khanna, Manvika Vinod, Precious Njeck, Shuai Gao

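AI summary This paper examines the limitations of current mobile assistive technologies such as TalkBack for visually impaired users and proposes Insight, a new accessibility service based on large language models (LLMs) that improves the user experience through natural language interaction and real-time screen summarization. Experiments show that Insight outperforms the conventional approach in reducing mental effort and task time and was preferred by users, but they also reveal a need for interruption management. The study illustrates the potential of LLMs to improve mobile accessibility and points toward hybrid solutions that combine gesture and dialogue modalities.

Comments 10 pages, 5 figures

Abstract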

This research paper addresses the limitations of current mobile accessibility services like TalkBack, which provide manual gesture-based sequential feedback to BVI users. Motivated by the promise of large language models (LLMs), this paper introduces Insight, an Android accessibility service that provides natural language interaction and real-time summarization of the screen. The paper performs a within-subject experimental study with users to compare Insight and TalkBack on usability factors. Results show Insight reduced mental effort and task time, and was preferred because of its dialogue interface, but users felt the need for interruption management. Results show LLM-based interfaces can significantly improve mobile accessibility, and describe the potential of hybrid solutions combining gesture and dialogue modalities towards more inclusive design.

2605.09790 2026-05-12 cs.DC cs.AI cs.LG

Multi-Tier Labeling and Physics-Informed Learning for Orbital Anomaly Detection at Scale

Yong Fu

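AI summary This paper studies how to efficiently detect orbital anomalies such as maneuvers, atmospheric decay, and attitude upsets in large volumes of low-Earth-orbit satellite data. To address label scarcity, the author proposes a multi-tier weak-supervision labeling framework that combines physics rules, filtering algorithms, and a calibration step to label data automatically at scale. Applied to 60 years of Two-Line Element data, the method produces a large set of labeled sequences and trains a high-recall Transformer model that surfaces candidate events for downstream filtering, and the paper discusses the path toward a Neural-ODE-based orbital world model.

Abstract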

Detecting orbital anomalies, such as maneuvers, atmospheric decay, and attitude upsets, across the rapidly growing population of low-Earth-orbit (LEO) satellites is a prerequisite for collision avoidance, decay forecasting, and conjunction screening. The bottleneck is not modeling capacity but labels: there is no public ground-truth corpus of orbital anomalies, manual review does not scale to approximately 10^4 active satellites, and pure rule-based detectors trade recall for precision so aggressively that they are blind to most behavioral anomalies. We present a multi-tier labeling cascade that composes three weak supervision sources of increasing fidelity: a fast physics rule set (rule_v1), an Interacting Multiple Model Unscented Kalman Filter (IMM-UKF) bank, and a supplemental-element calibration step (supGP), to produce labels at a scale unavailable from any single source. Applied to 232M Two-Line Element (TLE) records spanning 60 years, the cascade yields 8.6M labeled sequences of length 50 (430M timesteps) over 11 features that include explicit time encoding and full mean-element state. On overlapping satellites, IMM-UKF surfaces 42.6x more anomalies than rule_v1 alone. We train a 6.5M-parameter Transformer in two stages, achieving a maneuver recall of 55.4% and decay recall of 62.8% on a held-out test set. An ablation on the time-delta feature alone yields a 107% relative improvement in decay recall. We frame the resulting model as a high-recall triage classifier whose role is to surface candidate events for downstream filtering, not to issue final attributions, and discuss the path toward a Neural-ODE-based orbital world model.

2605.09781 2026-05-12 cs.NE cs.AI cs.CL cs.LG

Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution

Dongxin Guo, Jikun Wu, Siu Ming Yiu

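AI summary To address the limited diversity of large language model (LLM) outputs, this work proposes QD-LLM, a parameter-efficient neuroevolution framework that evolves prompt embeddings to steer a frozen LLM toward diverse outputs. Without fine-tuning the model, the method evolves prompt embeddings through gradient-free optimization and characterizes behavior with a combination of semantic and explicit features, substantially improving the diversity and quality of generated content. Experiments show higher coverage and QD-Score on several benchmarks, along with better test generation and fine-tuning data quality.

Comments 11 pages, 3 figures, 7 tables, 1 algorithm, 1 theorem. Accepted to GECCO 2026

Abstract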

Large Language Models exhibit mode collapse, producing homogeneous outputs that fail to explore valid solution spaces. We present QD-LLM, a framework for parameter-efficient neuroevolution that evolves prompt embeddings, compact neural interfaces (~32K parameters) that steer generation in frozen LLMs (70B+ parameters), within a Quality-Diversity (QD) optimization framework. Our contributions: (1) evolved prompt embeddings via gradient-free optimization enabling behavioral steering without model fine-tuning; (2) hybrid behavior characterization combining semantic and explicit features with formal coverage bounds (Theorem 1) under validated near-independence (NMI $= 0.08 \pm 0.02$); (3) co-evolutionary variation operators including targeted behavioral mutation via finite-difference gradient estimation. On HumanEval (164 problems), MBPP, and creative writing benchmarks, QD-LLM achieves 46.4% higher coverage and 41.4% higher QD-Score than QDAIF ($p<0.001$, 30 runs, Vargha-Delaney $A=0.94$). We demonstrate downstream utility: diverse archives improve test generation (34% more edge cases) and fine-tuning data quality (8.3% accuracy gain). We validate across open-source LLMs (Llama-3-70B, Mistral-Large) with full embedding access, establishing prompt embedding evolution as an effective paradigm bridging neuroevolution and modern LLMs.
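
The coverage and QD-Score metrics quoted above come from Quality-Diversity archives. The sketch below shows a generic MAP-Elites-style grid archive: each cell of a discretized behavior space keeps its best solution, coverage is the fraction of filled cells, and QD-Score is the sum of the elites' qualities. This is a textbook-style illustration with a hypothetical class name and behavior descriptors assumed to lie in [0, 1]; it is not the QD-LLM archive or its hybrid behavior characterization.

```python
class QDArchive:
    """Grid archive over behavior descriptors in [0, 1]^n_dims."""

    def __init__(self, bins_per_dim, n_dims):
        self.bins = bins_per_dim
        self.n_dims = n_dims
        self.elites = {}  # cell index -> (quality, solution)

    def cell(self, descriptor):
        # Clamp so that a descriptor value of exactly 1.0 lands in the last bin.
        return tuple(min(int(v * self.bins), self.bins - 1) for v in descriptor)

    def add(self, solution, quality, descriptor):
        """Insert if the cell is empty or the newcomer is better; return True if kept."""
        key = self.cell(descriptor)
        if key not in self.elites or quality > self.elites[key][0]:
            self.elites[key] = (quality, solution)
            return True
        return False

    def coverage(self):
        return len(self.elites) / (self.bins ** self.n_dims)

    def qd_score(self):
        # Assumes non-negative qualities, as is conventional for QD-Score.
        return sum(q for q, _ in self.elites.values())


archive = QDArchive(bins_per_dim=2, n_dims=2)
archive.add("a", 1.0, (0.1, 0.1))
archive.add("b", 0.5, (0.2, 0.2))  # same cell as "a", lower quality: rejected
archive.add("c", 2.0, (0.9, 0.9))
print(archive.coverage(), archive.qd_score())
```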

2605.09777 2026-05-12 cs.NE cs.AI cs.CL cs.LG

EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent

Dongxin Guo, Jikun Wu, Siu Ming Yiu

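AI summary This paper proposes EvoPref, a preference optimization method based on a multi-objective evolutionary algorithm that increases the diversity of large language model (LLM) alignment. Using Non-dominated Sorting Genetic Algorithm II (NSGA-II) selection and an archive mechanism, it optimizes Low-Rank Adaptation (LoRA) adapters across helpfulness, harmlessness, and honesty objectives, mitigating the preference collapse seen in gradient-descent methods. Experiments show that EvoPref markedly improves preference coverage and lowers collapse rates on standard benchmarks while maintaining competitive alignment quality, supporting evolutionary optimization as a route to diverse LLM alignment.

Comments 10 pages, 2 figures, 6 tables, 1 algorithm. Accepted to GECCO 2026

Abstract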

Gradient-based preference optimization methods for large language model (LLM) alignment suffer from preference collapse, converging to narrow behavioral modes while neglecting preference diversity. We introduce EvoPref, a multi-objective evolutionary algorithm that maintains populations of Low-Rank Adaptation (LoRA) adapters optimized across helpfulness, harmlessness, and honesty objectives using Non-dominated Sorting Genetic Algorithm II (NSGA-II) selection with archive-based diversity preservation. Our primary contribution is demonstrating that population-based methods discover substantially more diverse alignments than gradient descent. On standard benchmarks, EvoPref improves preference coverage by 18% (median 82.5% vs. 70.0% for ORPO, $p<0.001$, Wilcoxon, $n=30$) and reduces collapse rates by 47% (11.0% vs. 20.6%, $p<0.001$), while achieving competitive alignment quality (median 75.5% RewardBench vs. 75.0% for ORPO, $p<0.05$). We provide theoretical motivation extending recent multi-objective evolutionary algorithm (MOEA) runtime analysis (Dang et al., 2025) suggesting why archive-based methods escape collapse more effectively than single-trajectory optimization. Comprehensive comparisons against MOEA/D, SMS-EMOA, CMA-ES, and gradient baselines (DPO, IPO, KTO, ORPO) with rigorous statistical testing (Friedman with Holm correction, Vargha-Delaney effect sizes, median with IQR) confirm that multi-objective selection with diversity preservation is essential. This work establishes evolutionary optimization as a principled paradigm for diverse LLM alignment.
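
The core of NSGA-II selection is sorting a population into Pareto fronts. The sketch below is a simple quadratic-time version of non-dominated sorting for maximization; it omits the bookkeeping of the fast variant, crowding distance, and EvoPref's archive, and the three-objective tuples are made-up values standing in for (helpfulness, harmlessness, honesty) scores.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(points):
    """Return lists of indices: front 0 is the Pareto front, front 1 the next, etc."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        ]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Hypothetical adapter scores: two trade-off solutions and one dominated solution.
scores = [(0.9, 0.2, 0.5), (0.2, 0.9, 0.5), (0.1, 0.1, 0.1)]
print(non_dominated_fronts(scores))  # [[0, 1], [2]]
```

The first two adapters survive together because neither beats the other on every objective, which is how a population can retain several distinct trade-offs instead of converging to one.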

2605.09772 2026-05-12 eess.SY cs.RO cs.SY math.OC

Safe Exploration for Nonlinear Processes Using Online Gaussian Process Learning

Stefano Tonini, Soroush Rastegarpour, Hamid Reza Feyzmahdavian, Nicola Bastianello, Karl Henrik Johansson

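AI summary This paper proposes a safe data-driven control framework for nonlinear systems that ensures stability and constraint satisfaction during online learning while requiring only a stabilizable linear approximation of the system. A Gaussian process residual learned in real time captures the unmodeled nonlinear dynamics, and a probabilistic control-invariant set derived from Lyapunov theory enforces safety. Control inputs are computed by a convex quadratic program that maximizes information gain subject to the safety constraints, and numerical results validate safe and effective exploration under model uncertainty.

Comments Accepted in 23rd IFAC World Congress

Abstract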

This paper proposes a safe data-driven control framework for nonlinear systems with partially known dynamics. The method ensures stability and constraint satisfaction during online learning, assuming only a stabilizable linear approximation of the process is available. Unmodeled nonlinear dynamics are captured by a Gaussian process residual learned in real time. Safety is enforced through a probabilistic control-invariant set derived from Lyapunov theory, guaranteeing high-probability stability. A convex quadratic program computes control inputs that maximize information gain while respecting probabilistic safety constraints. The framework provides finite-sample safety guarantees and allows adaptive expansion of the invariant set as uncertainty decreases. Numerical results validate the approach, demonstrating safe and informative exploration under model uncertainty: the safe set expands by about 30% while the Gaussian process root-mean-square error drops from 1.11 to 0.03.

2605.09764 2026-05-12 cs.NE cs.AI

LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search

Temoor Tanveer

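AI summary This paper presents LEVI, an evolutionary search framework built on the premise that stronger search architectures can substitute for, or even outperform, larger language models in evolutionary search. LEVI improves the efficiency and diversity of the search through a better solution database, a smarter mutation router, and a rank-preserving proxy benchmark. Experiments show that LEVI outperforms existing methods on several systems-research benchmarks with a smaller compute budget.

Abstract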

LLM-guided evolutionary methods such as AlphaEvolve have proven effective in domains like math, systems research, and algorithmic discovery, but their reliance on frontier models makes each run expensive. We argue this is largely an artifact of how existing frameworks allocate search: archives that fail to preserve solution diversity force compensation through stronger mutation models; blind model use spends frontier dollars on local edits a smaller model could handle; and full-set evaluation wastes rollouts on redundant examples. We introduce LEVI, a harness-first evolutionary framework built on the bet that stronger search architectures can substitute for or even outperform larger LLMs in evolutionary search. LEVI improves on three core components of evolutionary search: a solution database that establishes diversity from the beginning, and then maintains it throughout the run; a smarter mutation router that plays into the strengths of large and small LLMs; and a rank-preserving proxy benchmark for rollout-heavy settings. Across systems-research benchmarks LEVI attains the highest score on a budget 3.3-6.7x smaller than the published frontier-model runs of existing frameworks like ShinkaEvolve, GEPA, and AdaEvolve; on one problem, LEVI matches the existing best at a 35x lower cost. On prompt optimization, LEVI matches or exceeds GEPA at less than half of its rollout budget on four different benchmarks. LEVI is available as an open-source framework at https://github.com/ttanv/levi.

2605.09755 2026-05-12 math.NA cs.DS cs.LG cs.NA stat.ML

Accelerating Power Method with Fast Sketching for Stronger Low-Rank Approximation

Shabarish Chenakkod, Michał Dereziński

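AI summary This paper studies how to accelerate the power method for stronger low-rank approximation. Because the conventional power method is computationally expensive when the target rank is large, the authors propose an acceleration framework based on fast sketching. The resulting methods are efficient and perform well numerically for singular value decomposition, low-rank factorization, and Nyström approximation. The key novelty of the analysis is regularized spectral approximation, which offers a more flexible tool for generalizing power method guarantees.

Abstract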

The power method is one of the most fundamental tools for extracting top principal components from data through low-rank matrix approximation. Yet, when the target rank is large, the cost of matrix multiplication associated with this procedure becomes a major bottleneck. We develop an algorithmic and theoretical framework for accelerating the power method using fast sketching, which is a popular paradigm in randomized linear algebra. Our framework leads to simple and provably efficient methods for singular value decomposition, low-rank factorization, and Nyström approximation, which attain strong numerical performance on benchmark problems. The key novelty in our analysis is the use of regularized spectral approximation, a property of fast sketching methods which proves more flexible in generalizing power method guarantees than traditional arguments.
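
As background, the sketch below shows the classical randomized block power (subspace) iteration for a truncated SVD, using NumPy. It uses a dense Gaussian sketch, which is swapped in here for simplicity; the paper's contribution concerns fast sketching operators and their analysis, which this example does not reproduce. The function name and the `oversample` parameter are illustrative choices.

```python
import numpy as np


def sketched_power_method(A, rank, n_iter=2, oversample=5, seed=0):
    """Approximate the top-`rank` SVD of A by block power iteration on a random sketch."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    sketch = rng.standard_normal((n, rank + oversample))  # dense Gaussian sketch
    Q, _ = np.linalg.qr(A @ sketch)
    for _ in range(n_iter):
        # Re-orthonormalize after each multiplication for numerical stability.
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Project A onto the captured subspace and take a small exact SVD.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]


# Exactly rank-3 test matrix: the rank-3 approximation should be near-exact.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
U, s, Vt = sketched_power_method(A, rank=3)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(rel_err < 1e-8)
```

Each iteration costs two multiplications of A by a block of `rank + oversample` columns, which is the matrix-multiplication bottleneck the abstract refers to when the target rank is large.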

2605.09754 2026-05-12 cs.IT cs.DC cs.LG math.IT

Learning from Acceptance: Cumulative Regret in the Game of Coding

Hanzaleh Akbari Nodehi, Parsa Moradi, Mohammad Ali Maddah-Ali

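AI summary This paper studies the game between a data collector and a strategic adversary in open decentralized systems, where the adversary may submit data consistent enough to be accepted while still degrading the quality of the system's estimate. Unlike prior work that assumes the data collector fully knows the adversary's trade-off, the paper considers the incomplete-information case, proposes an algorithm that learns through repeated interaction, proves that its cumulative regret grows sublinearly, and evaluates it in numerical experiments.

Abstract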

Classical coding-theoretic guarantees often rely on trust assumptions, such as requiring sufficiently many honest nodes compared with adversarial ones. These assumptions are difficult to enforce in open decentralized systems where participants are not centrally certified. At the same time, such environments often contain incentive mechanisms: participants may be rewarded only when their submitted data are accepted and the system remains functional. This changes the role of an adversary. Rather than acting as a pure saboteur, a strategic adversary may submit data that are consistent enough to be accepted while still degrading the quality of the final estimate. The game-of-coding framework models this strategic interaction between a data collector (DC) and an adversary. Existing works on the game of coding mostly consider the complete-information case, where the DC knows how the adversary trades off acceptance and estimation error. In this paper, we study an incomplete-information version of the game of coding in which the DC, acting as a Stackelberg leader, does not know the adversary's utility trade-off and must learn through repeated interaction. Prior work on the unknown-adversary setting considered an explore-then-commit objective, where only the final selected acceptance rule is evaluated. In contrast, we study the full learning trajectory: every acceptance rule used during the algorithm is executed and contributes to performance. We propose an algorithm that refines its search around promising acceptance rules, prove that it achieves sublinear cumulative regret, and evaluate its performance through numerical experiments.

2605.09735 2026-05-12 cs.AR cs.AI cs.DC cs.OS

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Zhiqing Zhong, Zhijing Ye, Jian Zhang, Weijian Zheng, Bolun Sun, Xiaodong Yu

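AI summary Static-graph large language model (LLM) decoders offer predictable launches, fixed tensor shapes, and low submission overhead, but key-value cache (KV-cache) behavior during online decoding is highly irregular, which leads to over-reserved memory and burst-time latency outliers. This paper proposes KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph decoder: it decouples logical histories from physical storage, tracks active state through a block pager, and materializes each decode step through a single committed descriptor. Experiments show that KV-RM improves on a static-graph baseline in mixed-length decoding throughput, tail latency, and memory use, and removes severe latency spikes under production-trace replay.

Comments 14 pages, 7 figures, 7 tables

Abstract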

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.
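
One way to picture the coalescing step is as run-length grouping of a request's physical block IDs: adjacent blocks merge into a single (start, length) transfer, so a fragmented history turns into a few large copies. The sketch below illustrates only that idea. The function name and the (start, length) representation are hypothetical, and the abstract does not describe KV-RM's actual merge-staged transport at this level of detail.

```python
def coalesce_blocks(physical_blocks):
    """Group physical block IDs (listed in logical order) into (start, length) runs."""
    runs = []
    for block in physical_blocks:
        if runs and runs[-1][0] + runs[-1][1] == block:
            # Block is physically adjacent to the current run: extend it.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((block, 1))
    return runs


# Six logical blocks scattered over three physical regions become three transfers.
print(coalesce_blocks([3, 4, 5, 9, 10, 20]))  # [(3, 3), (9, 2), (20, 1)]
```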

2605.09734 2026-05-12 cs.SE cs.AI cs.MA

Trajectory Supervision for Continual Tool-Use Learning in LLMs

Vishnu Vardhan Reddy, Sagnik Chatterjee, Soumik Bhatta

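AI summary This paper studies the effect of trajectory supervision on large language model performance in continual tool-use learning. Fine-tuning Llama 3.1 8B on the API-Bank dataset, the experiment compares two conditions: one strips the previous API call trace and the other trains with the full trajectory. Keeping the trajectory improves final exact full-call accuracy by 17.7 percentage points, which suggests that trajectory information is valuable for learning tool use.

Abstract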

Most language-model training data shows final artifacts, not the process that produced them. We study a tractable version of this question in tool use: when a model learns a stream of new API domains, does keeping tool-use trajectories help compared with stripping the intermediate API trace? We fine-tune Llama 3.1 8B Instruct with QLoRA on API-Bank using four sequential domain blocks. Condition A strips previous API request/response lines from the prompt and trains the model to predict the next API call. Condition B keeps the trajectory context. In a single-seed pilot, full held-out generation evaluation shows that Condition B reaches 56.9% final exact full-call accuracy compared with 39.2% for Condition A. B also improves final API-name accuracy by 7.7 points. However, B uses 25.1% more training tokens, the run uses one seed, and the task is next-call prediction rather than full dialogue success.
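
The difference between the two conditions can be sketched as a prompt preprocessing step: Condition A drops earlier API request and response lines, while Condition B leaves the prompt untouched. The line prefixes below are hypothetical markers chosen for illustration; the abstract does not state the exact prompt format used in the experiment.

```python
# Hypothetical line markers for API trace entries in a dialogue prompt.
TRACE_PREFIXES = ("API-Request:", "API-Response:")


def strip_api_trace(prompt):
    """Condition A style: remove earlier API request/response lines from the prompt."""
    kept = [
        line for line in prompt.splitlines()
        if not line.lstrip().startswith(TRACE_PREFIXES)
    ]
    return "\n".join(kept)


prompt_b = (
    "User: book a room\n"
    "API-Request: [Search(q='room')]\n"
    "API-Response: ok\n"
    "AI:"
)
prompt_a = strip_api_trace(prompt_b)  # Condition B would use prompt_b unchanged
print(prompt_a)
```

Because the stripped prompt is shorter, Condition B necessarily sees more tokens per example, which is consistent with the caveat about training-token count in the abstract.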

2605.09721 2026-05-12 cs.CR cs.AI

Security Risks in Tool-Enabled AI Agents: A Systematic Analysis of Privileged Execution Environments

Hardik Goel

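AI summary This paper systematically analyzes the security risks of tool-enabled AI agents in cloud environments, noting that agents acting through privileged tools can create a range of security hazards. It proposes a taxonomy of risk categories, illustrates the risks through three representative scenarios, and discusses mitigation strategies and their tradeoffs. The analysis suggests that many risks arise not from novel vulnerabilities but from over-privileged tools, capability-intent mismatches, and ambient authority leakage in execution environments, and the paper derives design guidelines for more secure cloud deployment.

Comments Extended author preprint. A shortened version has been accepted as a short paper at IEEE COMPSAC 2026. 7 pages, 3 figures/tables

Abstract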

Tool-enabled AI agents are increasingly deployed in cloud-hosted environments and offered as services, where they perform side-effecting operations through privileged tools within execution environments. While such agents enable powerful automation, the security implications of hosting autonomous agents in privileged execution environments are not yet fully explored. This paper presents a structured analysis of security risks associated with cloud-hosted AI agents. We introduce a taxonomy of risk categories, illustrate these risks through three representative agent scenarios, and discuss mitigation strategies along with their tradeoffs. A small controlled experiment empirically illustrates risk manifestation and the effect of lightweight mitigations in this setup. Our analysis suggests that many risks in autonomous cloud agents arise not from novel vulnerabilities, but from over-privileged tools, capability-intent mismatches, and ambient authority leakage in execution environments. Based on these findings, we derive practical design guidelines for deploying AI agents in the cloud more securely.

2605.09718 2026-05-12 stat.ML cs.LG math.PR math.ST stat.TH

Learning stochastic multiscale models through normalizing flows

Anan Saha, Arnab Ganguly

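AI summary This paper studies how to learn effective dynamics of multiscale stochastic systems from a single observed trajectory. The authors propose a trajectory-based framework that models the dynamics with coupled multiscale stochastic differential equations and reduces the model through stochastic averaging. Because the reduced model depends on the hard-to-compute invariant distribution of the fast variables, they parameterize that distribution with normalizing flows and learn it by end-to-end optimization, and they use variational Bayesian inference for uncertainty quantification, capturing epistemic uncertainty in multiscale systems.

Comments 17 pages, 4 figures

Abstract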

Many systems in physics, engineering, and biology exhibit multiscale stochastic dynamics, where low-dimensional slow variables evolve under the influence of high-dimensional fast processes. In practice, observations are often limited to a single trajectory of the slow component, while the fast dynamics remain unobserved, making statistical learning challenging. Approaches based on partial differential equations (PDE), such as Fokker-Planck formulations, aim to characterize the evolution of probability densities, typically requiring dense space-time data or grid-based solvers. In contrast, we adopt a trajectory-based perspective and develop a data-driven framework for learning effective stochastic dynamics from a single observed path. We model the dynamics by coupled multiscale stochastic differential equations (SDEs) and first obtain a principled model reduction through stochastic averaging. Unlike generic model reduction techniques such as PCA, this respects the dynamical structure of the original system and explicitly incorporates the interaction between slow and fast scales. A central challenge, however, is that the reduced model depends on the invariant distribution of the fast process, which is a solution to an intractable and often unknown PDE. We introduce a novel learning framework that parameterizes the invariant distribution using normalizing flows, enabling expressive density modeling in the latent fast-variable space. The flow is trained end-to-end by optimizing a penalized likelihood objective induced by the reduced stochastic dynamics. Furthermore, we develop a Bayesian variational inference procedure for uncertainty quantification, employing a second normalizing flow to approximate the posterior distribution over model parameters. This yields a scalable approach to capturing epistemic uncertainty in multiscale systems.

2605.09702 2026-05-12 stat.ME cs.CL

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

Yanran Li

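AI summary This paper studies how to estimate large language model performance efficiently in multi-judge evaluation with noisy labels. The usual practice is to keep only the most accurate judges, but the authors find that when the target is calibrated probabilistic evaluation, retaining the full panel performs better. Even judges with below-chance accuracy can help calibration when their biases are learnable and their signals are non-redundant, so with labeled calibration data weak judges should not be discarded on accuracy alone.

Abstract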

Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-$k$ judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves negative log-likelihood (NLL) of $0.006$ versus $0.013$ under top-5 selection, halving the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline subset search. We explain this reversal with oracle analyses showing that the optimal calibrated risk under proper scoring rules cannot increase when additional judge signals are made available, and that even below-chance judges can be useful when their biases are learnable and their signals are non-redundant. The resulting operating principle is simple: in multi-judge evaluation with labeled calibration data, do not discard weak judges by accuracy alone; keep them when they are parseable, non-redundant, and calibratable.
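The claim that a below-chance judge can still be useful after calibration can be illustrated with a toy example. The sketch below is not the paper's pipeline: the single hypothetical judge, the count-based calibration table, and the hand-built data are all assumptions made for illustration, and the NLL is evaluated on the calibration set itself for brevity.

```python
import math

def fit_calibration(votes, labels):
    """Estimate P(label = 1 | vote) from a labeled calibration set (Laplace-smoothed)."""
    table = {}
    for v in (0, 1):
        ys = [y for vote, y in zip(votes, labels) if vote == v]
        table[v] = (sum(ys) + 1) / (len(ys) + 2)
    return table

def mean_nll(probs, labels):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(math.log(p if y == 1 else 1 - p) for p, y in zip(probs, labels)) / len(labels)

# Hypothetical judge that agrees with the true label only 30% of the time.
labels = [1] * 50 + [0] * 50
votes = [1] * 15 + [0] * 35 + [1] * 35 + [0] * 15

table = fit_calibration(votes, labels)
calibrated = [table[v] for v in votes]
raw = [0.9 if v == 1 else 0.1 for v in votes]  # trusting the vote at face value

nll_calibrated = mean_nll(calibrated, labels)
nll_raw = mean_nll(raw, labels)
nll_uninformed = math.log(2)  # always predicting 0.5
```

In this toy setting the calibrated judge beats the uninformed baseline even though its raw accuracy is below chance, because calibration learns that its votes are anti-correlated with the label.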

2605.09699 2026-05-12 eess.IV cs.CV cs.GR cs.LG

A Real-Calibrated Synthetic-First Data Engine

Yukang Shen

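AI summary: Modern computer vision systems often face performance limits in data-scarce domains; synthetic data generation is promising, but applying it directly often gives unstable results because of data-quality problems and insufficient feedback mechanisms. This paper proposes a "real-calibrated, synthetic-first" data engine that systematically improves the practicality and reliability of synthetic augmentation through a unified pipeline combining controllable diffusion models with multi-stage curation and filtering. Experiments show that on tasks such as human pose estimation, combining synthetic with real data improves performance, highlighting the value of data-centric strategies in low-data settings.

Comments 7 pages, 6 figures

Details
English abstract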

Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.

2605.09684 2026-05-12 cs.CR cs.AI

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Monika Jotautaitė, Maria Angelica Martinez, Ollie Matthews, Tyler Tracy

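AI summary: This paper presents MonitoringBench, a semi-automated red-teaming approach for evaluating the security of coding-agent monitoring systems, and argues that current monitoring practices may underestimate attack risk and overestimate monitor effectiveness. By building an attack taxonomy, decomposing attack generation, and introducing a semi-automated pipeline, the work addresses mode collapse in attack generation, the gap between conceiving and executing attacks, and the high cost of manual testing. Applied to the BashArena environment, the method yields a benchmark of 2,644 attack trajectories; experiments show that detection rates of current state-of-the-art monitors drop markedly against optimized attacks, exposing weaknesses in recognizing suspicious behavior and in calibrating scores, and pointing to feasible directions for improvement.

Details
English abstract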

We introduce a red-teaming methodology that exposes harder-to-catch attacks for coding-agent monitors, suggesting that current practices may under-elicit attacks and overstate monitor performance. We identify three challenges with current red-teaming. First, mode collapse in attack generation, which we reduce with a novel attack taxonomy for broader coverage. Second, a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not both at once. We mitigate this by decomposing attack construction into strategy generation, execution, and post-hoc trajectory refinement. Third, manual elicitation is costly to scale, which we address with our semi-automated red-teaming pipeline. Applied to BashArena, an AI control setting for tool-using coding agents, this pipeline produces MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating monitor capabilities and failure modes. Our pipeline produces more diverse and stronger attacks: the Opus-4.5 monitor's catch rate falls from 94.9% on elicited-only Opus attacks to 60.3% on our best refined attacks, with larger drops for several mid-tier monitors. Attacks optimized against three development monitors generalize to ten held-out monitors, with catch rates generally increasing with monitor capability. Using this benchmark, we provide a snapshot of the current monitor capabilities and find that frontier monitors often detect suspicious actions but fall for persuasion or fail to calibrate suspiciousness scores appropriately, suggesting tractable paths for improvement. MonitoringBench provides both a static benchmark for current tool-use monitors and a reusable methodology for refreshing these evaluations as agents and monitors improve.

2605.09654 2026-05-12 stat.ML cs.LG stat.CO

Metropolis-Adjusted Diffusion Models

Kevin H. Lam, Tyler Farghly, Christopher Williams, Jun Yang, Yee Whye Teh, Arnaud Doucet

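AI summary: This paper studies sampling bias in score-based diffusion models and proposes a correction based on Metropolis-Hastings (MH) or Barker accept-reject steps to reduce the bias caused by time discretization and score-function approximation. The authors introduce an exact correction method based on a two-coin Bernoulli factory algorithm and propose an efficient approximation based on Simpson's rule, which markedly improves sample quality. Experiments show better sample generation on both synthetic and image datasets, with notable gains in FID.

Details
English abstract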

Sampling from score-based diffusion models incurs bias due to both time discretisation and the approximation of the score function. A common strategy for reducing this bias is to apply corrector steps based on the unadjusted Langevin algorithm (ULA) at each noise level within a predictor-corrector framework. However, ULA is itself a biased sampler, as it discretises a continuous diffusion process. In this work, we consider adjusted Langevin correctors that employ Metropolis--Hastings (MH) or Barker's accept-reject steps to correct for this bias. Since the target density ratio typically required by MH-based algorithms is unavailable, we propose methods that instead utilise the score function to compute the correct acceptance probability. We introduce the first exact method for adjusting Langevin corrections in diffusion models, based on a two-coin Bernoulli factory algorithm. We also propose an efficient approximation based on Simpson's rule that achieves accuracy of order $5/2$ in the step size at near-zero marginal cost. We demonstrate that these procedures improve sample quality on both synthetic and image datasets, yielding consistent gains in Fréchet Inception Distance (FID) on the latter.
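To make the difference between an unadjusted and an adjusted Langevin corrector concrete, here is a minimal sketch of the textbook Metropolis-adjusted Langevin algorithm (MALA) on a one-dimensional standard Gaussian. It is not the paper's method: it assumes the target density ratio is available, which the abstract says is typically not the case in diffusion models. The target, step size, seed, and chain length are illustrative choices.

```python
import math
import random

def log_density(x):
    """Log of the unnormalised target density: a standard Gaussian."""
    return -0.5 * x * x

def score(x):
    """Gradient of log_density."""
    return -x

def log_proposal(x_to, x_from, step):
    """Log density (up to a constant) of the Langevin proposal x_from -> x_to."""
    mean = x_from + step * score(x_from)
    return -((x_to - mean) ** 2) / (4.0 * step)

def mala_step(x, step, rng):
    """One Langevin proposal followed by a Metropolis-Hastings accept-reject step."""
    proposal = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    log_alpha = (log_density(proposal) + log_proposal(x, proposal, step)
                 - log_density(x) - log_proposal(proposal, x, step))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return x

rng = random.Random(0)
x, samples = 3.0, []
for i in range(21000):
    x = mala_step(x, 0.5, rng)
    if i >= 1000:  # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Without the accept-reject step, the same proposal with step size 0.5 is the recursion x' = 0.5x + z, whose stationary variance is 1/(1 - 0.25) ≈ 1.33 rather than 1. That gap is the discretisation bias the adjustment removes.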

2605.09652 2026-05-12 cs.NE cs.AI

RDEx-CASK: Cauchy Mutation, Archive, and Stagnation Kick for RDEx-CSOP

Dikshant, Dikshit Chauhan, Chen Hao, Anupam Trivedi, Harikumar Kandath, Senthilnath Jayavelu

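AI summary: This paper proposes RDEx-CASK, an improved variant of the RDEx-CSOP algorithm that targets stagnation and late-stage variance during optimization. By adding truncated Cauchy sampling, a small archive of feasible solutions, and a per-individual stagnation trigger, it strengthens the algorithm's exploration ability and convergence efficiency. Experimental results show that RDEx-CASK is competitive in feasibility-aware optimization quality and improves time-to-target on most problems.

Comments 5 pages, 2 tables, 1 algorithm. Technical report for the CEC 2026 CSOP competition track

Details
English abstract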

We extend RDEx-CSOP with three changes that target stagnation and late-stage variance, plus minor parameter tuning. The second scale factor in the standard branch is sampled independently from a truncated Cauchy distribution. A small feasible-only JADE-style archive (|A|_max = 50) is added and sampled with probability |A|/(|A|+|P|). A per-individual stagnation counter triggers, after 180 no-improvement generations, three local overrides on the standard branch: pull toward the global best, lift the archive sampling floor to 0.65, and saturate CR to 0.95 when the population success rate is below 0.10. The exploitation-biased branch and every other RDEx component are left untouched. On the CEC CSOP suite (D=30, 25 runs), RDEx-CASK is competitive with RDEx, UDE-III, and CL-SRDE in feasibility-aware quality and improves time-to-target on most problems.
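Two of the ingredients above are easy to sketch in isolation: sampling from a truncated Cauchy distribution and computing the archive sampling probability. The code below is a sketch under assumptions, not the authors' implementation. The location, scale, and truncation interval are hypothetical, since the abstract does not give them, and reading the stagnation "floor" as `max(p, 0.65)` is an interpretation.

```python
import math
import random

def truncated_cauchy(loc, scale, low, high, rng):
    """Sample Cauchy(loc, scale) restricted to [low, high] by inverse-CDF sampling."""
    cdf_low = math.atan((low - loc) / scale) / math.pi + 0.5
    cdf_high = math.atan((high - loc) / scale) / math.pi + 0.5
    u = cdf_low + rng.random() * (cdf_high - cdf_low)
    x = loc + scale * math.tan(math.pi * (u - 0.5))
    return min(max(x, low), high)  # guard against floating-point overshoot

def archive_probability(archive_size, population_size, stagnating=False, floor=0.65):
    """Probability |A| / (|A| + |P|) of drawing from the archive, with an optional floor."""
    p = archive_size / (archive_size + population_size)
    return max(p, floor) if stagnating else p

rng = random.Random(0)
# Hypothetical parameters: location 0.5, scale 0.1, truncated to [0, 1].
draws = [truncated_cauchy(0.5, 0.1, 0.0, 1.0, rng) for _ in range(1000)]
p_normal = archive_probability(50, 100)
p_stagnating = archive_probability(50, 100, stagnating=True)
```

Inverse-CDF sampling is used here instead of rejection so that every call returns a value in one pass, including when the interval captures little of the Cauchy mass.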

2605.09623 2026-05-12 cs.DC cs.AI cs.LG cs.NI cs.PF

Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum

Akuen Akoi Deng, Eimantas Butkus, Alfreds Lapkovskis, Praveen Kumar Donta

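AI summary: In recent years, artificial intelligence has been increasingly deployed on resource-constrained IoT devices, but existing deep neural network (DNN) partitioning and offloading methods are mostly static and adapt poorly to runtime dynamics. This paper proposes a dynamic partitioning framework that adaptively adjusts the distribution of network layers across a heterogeneous edge-cloud environment according to runtime conditions, and validates it on a real hardware testbed. Experimental results show that, compared with static partitioning, the framework reduces energy consumption by 27.09% to 35.82% and end-to-end latency by 6.34% to 22.92%, confirming the advantage of the adaptive approach.

Details
English abstract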

In recent years, the use of artificial intelligence on resource-constrained IoT devices has grown significantly. However, existing approaches to DNN partitioning and offloading across the edge-cloud continuum typically rely on static methods that ignore runtime dynamics. Furthermore, they are often evaluated in simulated environments rather than on real hardware. To address this gap, we propose a framework that dynamically splits neural network layers across the heterogeneous continuum. The framework profiles the model at startup, measures network link conditions between nodes, and periodically re-evaluates the partition to adapt to environmental changes. We created a physical testbed comprising a Raspberry Pi as the edge device, a laptop as the fog node, and a high-performance desktop PC as the cloud. We evaluated the framework on three widely adopted convolutional neural networks: VGG16, AlexNet, and MobileNetV2. Our results show that the framework achieves reductions in energy and end-to-end latency of 27.09--35.82% and 6.34--22.92%, respectively, compared to a static partitioning baseline. These findings confirm the superiority of adaptive over static partitioning.
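The idea of re-evaluating the partition as link conditions change can be sketched with a simplified two-tier latency model. This is not the paper's framework, which spans edge, fog, and cloud and also accounts for energy. The function name, the per-layer timings, and the payload sizes below are hypothetical, and the cost of returning the result is ignored.

```python
def best_split(edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms):
    """Return (split, latency_ms): layers [0, split) run on the edge, the rest on the cloud."""
    n = len(edge_ms)
    best = None
    for split in range(n + 1):
        compute = sum(edge_ms[:split]) + sum(cloud_ms[split:])
        if split == n:
            transfer = 0.0  # everything ran locally, nothing to send
        else:
            # Send the raw input if no layer ran locally, else the last local layer's output.
            payload = input_kb if split == 0 else out_kb[split - 1]
            transfer = payload / bandwidth_kb_per_ms
        latency = compute + transfer
        if best is None or latency < best[1]:
            best = (split, latency)
    return best

# Hypothetical three-layer profile.
edge_ms = [10.0, 20.0, 30.0]
cloud_ms = [1.0, 2.0, 3.0]
out_kb = [400.0, 50.0, 5.0]
input_kb = 600.0

medium_link = best_split(edge_ms, cloud_ms, out_kb, input_kb, 10.0)
fast_link = best_split(edge_ms, cloud_ms, out_kb, input_kb, 1000.0)
slow_link = best_split(edge_ms, cloud_ms, out_kb, input_kb, 0.5)
```

With these numbers the preferred split moves from full offloading on the fast link, to a mid-network cut on the medium link, to fully local execution on the slow link. A static partition cannot follow that shift.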

2605.09610 2026-05-12 cs.MA cs.AI cs.CE cs.LG cs.PL cs.SE

SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications

Abhinav Goel, Agostino Capponi, Alfio Gliozzo, Chaitya Shah

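AI summary: This paper introduces SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides 9,000 generated contracts with corresponding expert implementations, includes a five-dimensional evaluation rubric, and validates its reliability through several empirical studies. The work identifies typical failure modes of generated contracts and quantifies their advantage over the ground-truth implementations, providing a reproducible basis for empirical research on LLM smart contract synthesis quality.

Details
English abstract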

We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs' literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.

2605.09606 2026-05-12 cs.CR cs.CV

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

Yule Liu, Yilong Yang, Jiale Teng, Hanze Jia, Zeren Luo, Jingyi Zheng, Zifan Peng, Ke Li, Yifan Liao, Zhen Sun, Jiaheng Wei, Yang Liu, Zhuo Ma, Xinlei He

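AI summary: This paper studies the risk of image-to-3D models generating harmful geometry and how to mitigate it, showing that current models, given malicious inputs, may reconstruct 3D structures that are physical hazards, risky components, or deceptive replicas. A systematic evaluation of several open-source and commercial models finds that existing models are quite capable of generating harmful geometry, while existing safeguards have limited effect. The study further proposes a multi-layer defense strategy that effectively lowers the share of harmful outputs but still incurs a high false-positive rate, underscoring the shortcomings of current systems in geometry safety.

Details
English abstract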

Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. Such a powerful reconstruction capability that enables creative design can also be misused by an adversary to generate harmful geometries, which can be further fabricated via 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality that current image-to-3D models can effectively reconstruct the harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families, including input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to <1%, but still at 11% overall false-positive cost. Taken together, our findings demonstrate the risk in current systems and encourage better geometry-aware safeguards for moderation.

2605.09588 2026-05-12 cs.GT cs.AI cs.LG

Efficient Ensemble Selection from Binary and Pairwise Feedback

Tzeh Yuan Neoh, Nicholas Teh, Je Qin Chooi, Paul W. Goldberg, Milind Tambe

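AI summary: This paper studies the problem of selecting a high-performing ensemble of experts from multiple AI systems, modeling it as a distributional multiwinner voting problem. For binary feedback and pairwise feedback, it proposes corresponding optimization objectives and algorithms: under binary feedback, it designs a failure-conditioned greedy algorithm that reduces the number of queries while preserving performance guarantees; under pairwise feedback, it introduces a weighted ordinal coverage relaxation that is submodular and provides θ-type guarantees. Experiments validate the method's effectiveness in reducing queries and improving ensemble performance.

Details
English abstract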

Organizations increasingly deploy multiple AI systems across task domains, but selecting a small, high-performing ensemble can require costly model calls, benchmark runs, and human evaluation. We study this selection problem as a distributional variant of multiwinner voting: tasks are drawn from an unknown domain distribution, each task induces feedback over candidate experts, and a committee's value on a task is determined by its best-performing member. We analyze both binary feedback, for tasks with correct/incorrect outcomes, and pairwise feedback, for tasks where candidate outputs are compared by preference. In the binary setting, the induced objective is coverage. We give exhaustive-elicitation baselines and matching worst-case query lower bounds, and we design a failure-conditioned greedy algorithm that preserves the standard $(1-1/e)$ guarantee while obtaining instance-dependent query savings. In the pairwise setting, we study $θ$-winning committees. We show that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and that the objective is monotone but not submodular. This motivates a weighted ordinal coverage relaxation, which is submodular and supports a failure-conditioned greedy oracle under pairwise feedback. We then convert this oracle back into $θ$-type guarantees through finite-family auditing or a minimax wrapper. We also provide small-scale LLM experiments illustrating the predicted query savings and the role of complementarity in committee selection.
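In the binary setting the objective is coverage, for which the classical greedy rule gives the $(1-1/e)$ guarantee. The sketch below shows that standard full-information greedy on a known correctness table. It is not the paper's query-efficient, failure-conditioned variant, and the expert names and task sets are hypothetical.

```python
def greedy_committee(correct, k):
    """correct[e] is the set of task ids expert e solves; pick up to k experts greedily."""
    covered = set()
    committee = []
    for _ in range(k):
        candidates = [e for e in correct if e not in committee]
        if not candidates:
            break
        # Score each remaining expert only on tasks the current committee still fails.
        best = max(candidates, key=lambda e: len(correct[e] - covered))
        if not correct[best] - covered:
            break  # no remaining expert adds coverage
        committee.append(best)
        covered |= correct[best]
    return committee, covered

# Hypothetical experts and the tasks each one answers correctly.
correct = {
    "a": {0, 1, 2, 3},
    "b": {0, 1, 2, 4},
    "c": {4, 5},
    "d": {3},
}
committee, covered = greedy_committee(correct, 2)
```

Expert "b" is individually as strong as "a" but nearly redundant with it, so the second pick goes to the weaker but complementary "c". Marginal gains are computed only on tasks the committee still fails, which is presumably where conditioning on failures saves queries, though that reading of the paper's algorithm is an assumption.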

2605.09575 2026-05-12 eess.IV cs.CV

Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

Mingxuan Liu, Yingqi Hao, Yi Liao, Juncheng Zhu, Haoxiang Li, Hongjia Yang, Yifei Chen, Yijin Li, Kasidit Anmahapong, Zihan Li, Jialan Zheng, Min Kang, Yan Song, Hua Lai, Xiaoling Zhou, Nan Sun, Rong Hu, Gang Ning, Haibo Qu, Qiyuan Tian

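AI summary: This study proposes FreeHemoSeg, a deep learning framework that requires no annotated data, for automatically detecting and segmenting germinal matrix-intraventricular hemorrhage (GMH-IVH) in fetal brain MRI. The method trains on pseudo-lesion images generated with medical prior knowledge, effectively addressing the difficulty of obtaining annotated data. Experimental results show that FreeHemoSeg achieves superior detection and segmentation performance in both internal and external validation and markedly improves radiologists' diagnostic efficiency and accuracy.

Details
English abstract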

Background: Prenatal germinal matrix-intraventricular hemorrhage (GMH-IVH) is a leading cause of infant mortality and neurodevelopmental impairment. Manual diagnosis and lesion segmentation are labor-intensive and error-prone. Deep learning models offer potential for automation but typically require large annotated datasets, which are challenging to obtain. Purpose: To develop and validate an annotation-free deep learning framework for automated detection and segmentation of GMH-IVH on brain MRI. Materials and Methods: This retrospective study analyzed 2D T2-weighted MRI data from pregnant women collected from October 2015 to October 2023 at one hospital (internal validation) and two hospitals (external validation). Eligible participants included healthy fetuses and those with GMH-IVH. FreeHemoSeg was developed and trained using pseudo GMH-IVH images synthesized from normal fetal data guided by medical priors. Primary outcomes included diagnostic accuracy (area under the ROC curve [AUROC], sensitivity, specificity) and segmentation accuracy (Dice similarity coefficient [DSC]). A reader study evaluated clinical utility. Results: A total of 1674 stacks from 558 pregnant women were analyzed. FreeHemoSeg achieved the highest performance in both internal (sensitivity: 0.914, 95% CI 0.869-0.945; specificity: 0.966, 95% CI 0.946-0.978; DSC: 0.559, 95% CI 0.546-0.571) and external validation (sensitivity: 0.824, 95% CI 0.739-0.885; specificity: 0.943, 95% CI 0.913-0.964; DSC: 0.512, 95% CI 0.497-0.526), outperforming supervised and unsupervised methods. FreeHemoSeg assistance improved radiologists' sensitivity (from 0.882 to 0.941-1.000) and diagnostic confidence while reducing interpretation time by 16.0-52.7%. Conclusion: FreeHemoSeg accurately detects and localizes fetal brain hemorrhages without annotated training data, enabling earlier diagnosis and supporting timely clinical management.
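For readers unfamiliar with the reported metrics, the following sketch computes the Dice similarity coefficient, sensitivity, and specificity from binary predictions. The flat 0/1 lists and the example values are made up for illustration and are not study data. Returning 1.0 for two empty masks, and for a class with no cases, is a convention assumed here.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total

def sensitivity_specificity(pred, truth):
    """Return (sensitivity, specificity) for binary predictions against binary labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 1.0
    specificity = tn / (tn + fp) if tn + fp else 1.0
    return sensitivity, specificity

# Made-up example: five pixels (or five cases).
pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
dsc = dice(pred, truth)
sens, spec = sensitivity_specificity(pred, truth)
```

Dice is computed per pixel on segmentation masks, whereas sensitivity and specificity in the abstract are detection metrics, so the same function would be applied to per-case labels.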

2605.09552 2026-05-12 math.OC cs.LG stat.ML

Phases of Muon: When Muon Eclipses SignSGD

Elliot Paquette, Noah Marshall, Lucas Benigni, Guangyuan Wang, Atish Agarwala, Courtney Paquette

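AI summary: This paper studies the behavior of Muon and related spectral optimization methods on high-dimensional matrix least squares problems, clarifying their relationship to stochastic optimization methods such as SignSVD and SignSGD. By deriving a deterministic dynamical model, the analysis shows that Muon at large batch sizes amounts to square-root preconditioning of the data covariance spectrum, while at small batch sizes it exhibits SGD-like behavior and converges more slowly. The study also finds that SignSVD and SignSGD differ markedly in performance on anisotropic data, and identifies three distinct performance phases in a power-law covariance model.

Details
English abstract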

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent $α$ and target exponent $β$ shows there are three phases in the $(α,β)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.
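As a reminder of the two update rules being compared, their usual momentum-free forms are given below. The notation ($W_t$ for the weight matrix, $G_t$ for the minibatch gradient, $\eta$ for the learning rate) is assumed here and does not come from the abstract.

```latex
\begin{aligned}
\text{SignSGD:}\quad & W_{t+1} = W_t - \eta\,\operatorname{sign}(G_t) \quad \text{(entrywise sign)},\\
\text{SignSVD:}\quad & G_t = U_t \Sigma_t V_t^{\top} \ \text{(thin SVD)}, \qquad W_{t+1} = W_t - \eta\, U_t V_t^{\top}.
\end{aligned}
```

SignSGD discards the magnitude of each gradient entry, while SignSVD discards the singular values and keeps only the singular directions. Muon approximates the factor $U_t V_t^{\top}$ with an iterative orthogonalisation instead of an explicit SVD.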

2605.09534 2026-05-12 cs.CR cs.AI

Governing AI-Assisted Security Operations: A Design Science Framework for Operational Decision Support

Elyson A. De La Cruz, Rishikesh Sahay, Md Rasel Al Mamun

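AI summary: This paper studies how to introduce generative AI, retrieval-augmented generation, and coding agents into high-risk operational environments to support security operations decisions while preserving accountability, privacy, cost control, and auditability. It proposes a framework based on design science and builds a governed AI query-broker tool through mechanisms such as separating AI planning from operational execution, structured retrieval, approved templates, policy validation, and auditable agent traces. The core contribution is a management framework for governing AI-assisted operational decision support in high-risk digital infrastructure.

Comments 28 pages, 1 listing, 1 figure, 20 tables

Details
English abstract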

Engineering managers increasingly must decide how to introduce generative artificial intelligence (AI), retrieval-augmented generation, and coding agents into high-risk operational functions without weakening accountability, privacy, cost discipline, or auditability. The central message of this study is that AI-assisted operational decision support should be managed as a governed engineering capability before it is scaled as automation. Security operations centers (SOCs) provide a suitable setting because they combine privileged telemetry, specialist expertise, software repositories, cloud services, and evidence-sensitive decisions. This study uses Kusto Query Language (KQL) and Microsoft Azure security capabilities as a bounded technical instantiation of that broader engineering management problem. KQL is read-only in ordinary query use, but read-only does not mean risk-free: AI-assisted queries can still create privacy, cost, performance, schema-validity, and decision-quality risks through broad scans, sensitive-field exposure, stale intelligence, and misleading interpretations. Using design science research, the study develops a governed AI query-broker artifact that separates AI planning from operational execution through schema-grounded retrieval, approved templates, policy validation, read-only adapters, normalized outputs, auditable agent traces, and engineering review board gates. The contribution is not a new KQL technique, security product, or detection algorithm. Rather, the study contributes a management framework for governing AI-assisted operational decision support in high-risk digital infrastructure by specifying design propositions, role accountability, maturity stages, quality gates, evaluation criteria, and evidence boundaries.