arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2081
2605.06900 2026-05-11 cs.DS cs.LG

Accelerated Relax-and-Round for Concave Coverage Problems

Matthew Fahrbach, Mehraneh Liaee, Morteza Zadimoghaddam

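AI summary: This paper proposes an accelerated relax-and-round algorithm for concave coverage problems, a generalization of the maximum coverage problem. Building on the existing framework, it replaces the linear programming relaxation step with a projected accelerated gradient method and pairs it with a specialized rounding scheme for the hypersimplex, significantly improving efficiency and approximation accuracy. Experiments show that the algorithm outperforms approaches based on state-of-the-art LP solvers on both synthetic and real-world graph data.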

Comments 47 pages, 6 figures

Abstract

We present an accelerated relax-and-round algorithm for concave coverage problems, which generalize the classic maximum coverage problem. Building on the relax-and-round framework of Barman et al. [STACS 2021], we propose two significant improvements. First, we replace the linear programming (LP) relaxation step with a projected accelerated gradient method applied to a smooth surrogate objective to achieve a $\widetilde{O}(mn \varepsilon^{-1})$ running time. Second, we use a specialized rounding scheme for the hypersimplex that combines the Carathéodory decomposition algorithm in Karalias et al. [NeurIPS 2025] with randomized swap rounding of Chekuri et al. [FOCS 2010]. We prove tight approximation ratios for new reward functions, including a $0.827$-approximation for the logarithmic reward $φ(x) = \log(1 + x)$. Finally, we conduct maximum multi-coverage experiments on synthetic and real-world graphs, demonstrating that our algorithm outperforms approaches that use state-of-the-art LP solvers.
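
The projection step that a projected gradient method over the hypersimplex relies on can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the feasible set is $\{x \in [0,1]^n : \sum_i x_i = k\}$, and the function name `project_hypersimplex` and the fixed bisection count are hypothetical choices.

```python
import numpy as np

def project_hypersimplex(y, k, iters=100):
    """Euclidean projection of y onto {x in [0,1]^n : sum(x) = k}, 0 <= k <= n.

    The projection has the form clip(y - tau, 0, 1) for a scalar tau, so we
    search for tau by bisection on the non-increasing function
    g(tau) = sum(clip(y - tau, 0, 1)).
    """
    y = np.asarray(y, dtype=float)
    lo, hi = y.min() - 1.0, y.max()  # g(lo) = n >= k and g(hi) = 0 <= k
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(y - tau, 0.0, 1.0).sum() > k:
            lo = tau  # coordinates sum to more than k: shift further down
        else:
            hi = tau
    return np.clip(y - 0.5 * (lo + hi), 0.0, 1.0)
```

Bisection is used here for clarity; sorting-based projections are faster in practice but longer to write down.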

2605.06894 2026-05-11 cs.CR cs.LG

McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware

Md Mahmuduzzaman Kamol, Jesus Lopez, Saeefa Rubaiyet Nowmi, Emilia Rivas, Md Ahsanul Haque, Edward Raff, Aritran Piplai, Mohammad Saidur Rahman

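AI summary: This paper presents McNdroid, a longitudinal multimodal Android malware benchmark dataset covering 2013 to 2025 (excluding 2015), for studying concept drift in malware detection. The dataset provides three aligned modalities for each application (static features, dynamic behavioral features, and function-call graphs) and uses temporal splits to evaluate traditional machine learning and deep learning detectors across different train-test time gaps. Experiments show that multimodal fusion outperforms any single modality over long time gaps and that cross-modal agreement declines over time, revealing how drift affects both feature spaces and the relationships among modalities.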

Comments 28 pages, 14 figures, 14 tables

Abstract

Machine learning (ML) in real-world systems must contend with concept drift, adversarial actors, and a spectrum of potential features with varying costs and benefits. Malware naturally exhibits all of these complexities, but for the same reason, it is challenging to curate and organize data to study these factors. We present McNdroid, to our knowledge the largest longitudinal multimodal Android malware benchmark for malware detection and drift analysis. McNdroid spans 2013-2025, excluding 2015, and represents each application with three aligned modalities: static features from manifests and smali code, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs. Using temporally separated splits, we evaluate standard ML and deep-learning detectors across increasing train-test time gaps. Results show clear temporal degradation, while multimodal fusion outperforms the best single modality across long-term temporal gaps. Cross-modal agreement also declines over time, suggesting that drift affects both individual feature spaces and the consistency among modalities. We further analyze modality-specific drift, malware-family evolution, and temporal changes in model explanations. We publicly release McNdroid, benchmark splits, and code to support reproducible research on temporal generalization and robust multimodal learning in security-critical, non-stationary settings.

2605.06884 2026-05-11 math.OC cs.LG

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

Sayantan Choudhury, Xiaoran Cheng, Martin Takáč, Sen Na, Mladen Kolar

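AI summary: This paper studies how to use the Muon optimizer with Nesterov momentum effectively for non-convex matrix optimization under heavy-tailed noise. Working within an inexact polar decomposition framework and incorporating randomized low-rank decomposition, it establishes a convergence theory and analyzes how errors propagate through the optimization process. The main contributions include a convergence-rate analysis for heavy-tailed noise, convergence guarantees that require no prior knowledge of the heavy-tail index, and numerical experiments that validate the proposed methods.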

Comments 33 pages, 4 figures, 1 table

Abstract

Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O \left(\varepsilon^{\frac{-(3α-2)}{(α-1)}} \right)$ for finding an $\varepsilon$-stationary point, where $α\in(1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $σ_1=0$, we also provide guarantees that do not require prior knowledge of $α$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
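
As a rough illustration of what an inexact polar decomposition looks like, the sketch below runs the classical cubic Newton-Schulz iteration. It is an assumption-laden example rather than the paper's method: practical Muon implementations typically use tuned polynomial coefficients and far fewer steps, and the function name `newton_schulz_polar` and the step count are hypothetical.

```python
import numpy as np

def newton_schulz_polar(m, steps=30):
    """Approximate the polar factor U V^T of m = U S V^T.

    Iterates X <- 1.5 X - 0.5 X X^T X, which converges when every singular
    value of the starting matrix lies in (0, sqrt(3)). Dividing by the
    Frobenius norm places all singular values in (0, 1].
    """
    m = np.asarray(m, dtype=float)
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Stopping after only a few steps leaves the singular values short of 1, which is the kind of approximation error an inexact-polar analysis has to account for.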

2605.06883 2026-05-11 stat.ML cs.LG

Kernel Selection is Model Selection: A Unified Complexity-Penalized Approach for MMD Two-Sample Tests

Yijin Ni, Xiaoming Huo

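AI summary: This paper studies how to improve the statistical power of Maximum Mean Discrepancy (MMD) two-sample tests by selecting the kernel adaptively. The authors propose a unified complexity-penalized approach (CP-MMD) that treats kernel selection as a model selection problem; a penalty term accounting for optimization complexity allows direct, grid-free kernel optimization over continuous parameter spaces. The method significantly improves test power while maintaining Type-I error control, and it applies to a range of kernel classes, including bandwidth parameters, polynomial features, and deep networks.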

Abstract

The Maximum Mean Discrepancy (MMD) is a cornerstone statistic for nonparametric two-sample testing, but its test power is dictated entirely by the chosen kernel. Because any fixed kernel inherently fails to distinguish certain distributions, the kernel must be dynamically optimized. However, data-driven optimization violates the foundational i.i.d. assumption, forcing a strict trade-off in existing frameworks. Ratio criteria ignore this dependence, inducing overfitting and variance collapse on rich kernel classes. Conversely, aggregation methods bypass the dependence using finite grids, but this strategy cannot scale to continuous search spaces like deep kernels. To break this dichotomy, we establish data-driven kernel selection as a model selection problem. We propose Complexity-Penalized MMD (CP-MMD), a criterion derived by applying the two-sample uniform concentration inequality of preceding works to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization, so that CP-MMD enables direct, grid-free maximization over continuous parametric classes, including scalar bandwidths, polynomial feature bandwidths, and deep network parameters. By formally accounting for optimization complexity, we prove that CP-MMD maximizes true test power while ensuring unconditional Type-I validity. Consequently, CP-MMD enables grid-free kernel selection across linear, polynomial-feature, and deep regimes, matching or exceeding state-of-the-art test power.
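
For readers unfamiliar with the statistic being optimized, the following sketch computes the standard unbiased estimate of squared MMD with a Gaussian kernel of fixed bandwidth. It does not reproduce the CP-MMD penalty, whose form the abstract does not give; the function names and the default bandwidth are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x and y."""
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Within-sample averages exclude the diagonal (i == j) terms.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```

The value depends strongly on `bandwidth`, which is why the choice of kernel governs test power.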

2605.06878 2026-05-11 cs.AR cs.CC cs.RO eess.IV

CARMEN: CORDIC-Accelerated Resource-Efficient Multi-Precision Inference Engine for Deep Learning

Sonu Kumar, Mukul Lokhande, Santosh Kumar Vishvakarma, Adam Teman

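AI summary: This paper presents CARMEN, a CORDIC-accelerated deep learning inference engine that supports multi-precision computation and is resource-efficient. Its core idea is to adjust the CORDIC iteration depth dynamically, trading accuracy against computational efficiency and switching between approximate and accurate modes without hardware modification. The architecture combines a low-resource CORDIC multiply-accumulate (MAC) unit with a time-multiplexed multi-activation function block, supports flexible 8/16-bit precision, and in 28 nm CMOS achieves up to a 33% reduction in computation cycles and a 21% power reduction per MAC unit.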

Comments Under Review (VDAT 2026)

Abstract

This paper presents CARMEN, a runtime-adaptive, CORDIC-accelerated multi-precision vector engine for resource-efficient deep learning inference. The key insight is that CORDIC iteration depth directly governs computational accuracy, enabling dynamic switching between approximate and accurate execution modes without hardware modification. The architecture integrates a low-resource iterative CORDIC-based MAC unit with a time-multiplexed multi-activation function block, supporting flexible 8/16-bit precision and high hardware utilization. ASIC implementation in 28 nm CMOS achieves up to 33% reduction in computation cycles and 21% power savings per MAC stage; a 256-PE configuration delivers 4.83 TOPS/mm$^2$ compute density and 11.67 TOPS/W energy efficiency. FPGA deployment on Pynq-Z2 validates 154.6 ms latency at 0.43 W for real-time object detection.
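
The link between iteration depth and accuracy can be seen in a small software model of a shift-and-add multiply-accumulate. The sketch assumes linear-mode CORDIC, which the abstract does not confirm; the function name `cordic_mac` and its signature are hypothetical and do not describe the CARMEN datapath.

```python
def cordic_mac(acc, x, w, iterations):
    """Approximate acc + x * w with linear-mode CORDIC (requires |w| <= 2).

    Each iteration adds or subtracts x * 2**-i (a shift and an add in
    hardware) while driving the residual z toward zero. After n iterations
    the residual satisfies |z| <= 2**-(n-1), so the absolute error is at
    most |x| * 2**-(n-1).
    """
    y, z = acc, w
    for i in range(iterations):
        step = 2.0 ** -i
        direction = 1.0 if z >= 0 else -1.0
        y += direction * x * step
        z -= direction * step
    return y
```

Running fewer iterations gives a cheaper, coarser result, and running more tightens the bound, which is the kind of approximate/accurate switching the abstract describes.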

2605.06875 2026-05-11 cs.AR cs.AI cs.CV cs.NA eess.IV math.NA

EULER-ADAS: Energy-Efficient & SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration

Mukul Lokhande, Ratko Pilipovic, Omkar Kokane, Adam Teman, Santosh Kumar Vishvakarma

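AI summary: This paper presents EULER-ADAS, an energy-efficient, SIMD-unified logarithmic Posit compute engine for precision-reconfigurable approximate ADAS acceleration. The architecture combines a bounded-range Posit representation, stage-adaptive logarithmic mantissa multiplication with bit truncation, and a SIMD-shared accumulation path that supports several Posit precisions, enabling multi-precision operation without duplicating hardware. Experiments show strong energy efficiency and performance on both FPGA and 28 nm CMOS, making the design suitable for low-power real-time ADAS inference.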

Abstract

Advanced driver-assistance systems (ADAS) require neural compute engines that deliver low-latency inference under strict power and area constraints. Posit arithmetic is attractive for such accelerators because it provides high numerical fidelity at low precision, but its variable-length regime encoding increases encode/decode cost and exposes the datapath to large regime-field fault effects. This paper presents EULER-ADAS, a SIMD-enabled logarithmic bounded-Posit neural compute engine for energy-efficient and reliability-aware ADAS acceleration. The proposed datapath combines bounded-regime Posit representation, stage-adaptive logarithmic mantissa multiplication with bit truncation, and a SIMD-shared quire accumulation path supporting Posit-(8,0), Posit-(16,1), and Posit-(32,2) execution. The unified architecture enables 4xPosit-8, 2xPosit-16, or 1xPosit-32 operation without duplicating precision-specific hardware. FPGA implementation shows that the proposed configurations reduce LUT count by up to 41.4%, delay by up to 76.1%, and power by up to 71.9% relative to exact Posit neural compute engines, while achieving up to 10x lower energy-delay product than radix-4 Booth-based Posit multipliers. In 28-nm CMOS, the bounded variants occupy 0.013-0.016 mm$^2$, consume 19.8-22.1 mW, and operate at up to 1.84 GHz. Application-level evaluation across image-classification, ADAS, and edge-inference workloads shows that the evaluated Posit-16 and Posit-32 configurations remain within about 1.5 percentage points of FP32 accuracy. A TinyYOLOv3 prototype on Pynq-Z2 achieves 78 ms latency at 0.29 W and 22.6 mJ/frame, demonstrating the suitability of EULER-ADAS for low-power real-time ADAS inference.
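
To show why the variable-length regime field makes decoding costly, here is a small software decoder for the standard posit encoding. It is a sketch for intuition only: it does not model the bounded-regime variant or the logarithmic multiplier from the paper, and the function name `decode_posit` is hypothetical.

```python
def decode_posit(bits, n=8, es=0):
    """Decode an n-bit posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR (not a real)
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    body = bits & ((1 << (n - 1)) - 1)

    # Regime: a run of identical bits whose length is not known in advance.
    first = (body >> (n - 2)) & 1
    run, pos = 0, n - 2
    while pos >= 0 and ((body >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run
    pos -= 1  # skip the bit that terminates the run, if there is one

    # Exponent: up to es bits; bits cut off by the word boundary read as zero.
    exponent = 0
    for _ in range(es):
        exponent <<= 1
        if pos >= 0:
            exponent |= (body >> pos) & 1
            pos -= 1

    # Fraction: whatever bits remain, with an implicit leading one.
    frac_bits = max(pos + 1, 0)
    fraction = body & ((1 << frac_bits) - 1)
    value = (1.0 + fraction / (1 << frac_bits)) * 2.0 ** (k * (1 << es) + exponent)
    return -value if sign else value
```

The `while` loop is the data-dependent part: the positions of the exponent and fraction fields are only known once the regime run has been measured.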

2605.06839 2026-05-11 cond-mat.mtrl-sci cs.AI

LLM-Guided Open Hypothesis Learning from Autonomous Scanning Probe Microscopy Experiments

Boris Slautin, Utkarsh Pratiush, Yu Liu, Kamyar Barakati, Sergei Kalinin

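AI summary: This study proposes an open hypothesis-learning framework based on large language models for autonomous scanning probe microscopy experiments, aiming to generate new physical models from experimental data rather than only selecting measurements within a fixed hypothesis space. It combines symbolic regression with physical interpretability evaluation: candidate relationships are generated directly from sparse measurements, and a language model ranks them by physical plausibility, scaling behavior, and consistency. The method is validated on autonomous piezoresponse force microscopy experiments on ferroelectric domain-wall motion, where it evolves from initial measurements toward physically consistent voltage-time growth laws, moving autonomous microscopy from closed-loop optimization toward open hypothesis discovery.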

Comments 21 pages, 6 figures, 1 table

Abstract

Autonomous experimentation has transformed microscopy and materials discovery by enabling closed-loop optimization including imaging and spectroscopy tuning, structure-property relationship discovery, and exploration of combinatorial libraries. However, most current workflows remain limited to selecting measurements within fixed objective or hypothesis spaces, rather than generating new physical models from experimental data. Here, we introduce an open hypothesis-learning framework that combines symbolic regression with large-language-model-based physical evaluation and implement it for autonomous scanning probe microscopy. Symbolic regression generates candidate analytical relationships directly from sparse measurements, while the language-model evaluator ranks these candidates according to physical plausibility, scaling behavior, and consistency with known mechanisms. We demonstrate the approach on autonomous piezoresponse force microscopy measurements of ferroelectric domain switching in a PZT thin film. Starting from five seed measurements, the workflow evolves from physically incomplete candidate expressions toward interpretable voltage-time growth laws consistent with kinetic domain-wall motion. This work extends autonomous microscopy from closed-loop optimization toward open hypothesis discovery, where candidate physical laws emerge from the experiment itself rather than being specified in advance. More broadly, the framework establishes a route for integrating symbolic regression, physical reasoning, and adaptive experimentation into hierarchical autonomous scientific workflows.

2605.06833 2026-05-11 cs.CR cs.AI cs.NI

PAMPOS: Causal Transformer-based Trajectory Prediction for Attack-Agnostic Misbehavior Detection in V2X Networks

Konstantinos Kalogiannis, Ahmed Mohamed Hussain, Panos Papadimitratos

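AI summary: This paper presents PAMPOS, a causal-transformer-based trajectory prediction method for attack-agnostic misbehavior detection in vehicle-to-everything (V2X) networks. By learning normal traffic trajectory patterns, it identifies misbehavior without attack-labeled data, using a top-K normalized anomaly scoring mechanism to localize anomalies to specific kinematic features. Experiments show that PAMPOS performs well across a range of attack scenarios, with AUC values of up to 0.98 and F1-scores of up to 0.95.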

Comments Author's version; Accepted for presentation at the ACM Workshop on Wireless Security and Machine Learning (WiseML 2026)

Abstract

Misbehavior detection in Vehicle-to-Everything (V2X) networks is a second line of defense against insider falsification attacks that cryptographic mechanisms alone cannot address. Existing learning-based Misbehavior Detection Schemes (MDSs) are supervised, requiring labeled attack samples at training time, thus failing to counter unseen falsification attacks. We present PAMPOS, a causal transformer-decoder trained on benign VeReMi++ trajectories to learn normal mobility patterns. At inference time, misbehavior is identified as a deviation from the model's next-step kinematic predictions using a top-K normalized anomaly scoring mechanism that localizes falsification to specific kinematic features, without requiring attack-labeled training data. We evaluate PAMPOS across all 19 attack types in VeReMi++ under rush-hour and afternoon scenarios, achieving Area Under the Curve (AUC) values of up to 0.98 and F1-scores of up to 0.95 for most attack categories.
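
One plausible reading of a top-K normalized anomaly score is sketched below. The abstract does not define the scoring rule, so everything here is an assumption: per-feature prediction errors are divided by an error scale estimated on benign data, and the K largest normalized errors are averaged. The function name, the feature layout, and the use of a standard deviation as the scale are all hypothetical.

```python
import numpy as np

def topk_anomaly_score(predicted, observed, benign_std, k=2):
    """Score one time step and report which features deviate most.

    Returns (score, indices), where score is the mean of the k largest
    normalized absolute errors and indices identifies those features.
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    z = np.abs(observed - predicted) / np.asarray(benign_std, dtype=float)
    top = np.argsort(z)[::-1][:k]
    return float(z[top].mean()), top
```

Returning the indices alongside the score is what lets a detector point at the specific kinematic feature (for example speed or heading) that looks falsified.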

2605.06820 2026-05-11 physics.med-ph cs.AI

Overcoming data scarcity through multi-center federated learning for organs-at-risk segmentation in pediatric upper abdominal radiotherapy

Mianyong Ding, Maximilian Knoll, Semi Harrabi, Martine van Grotel, Annemieke S. Littooij, Max van Noesel, Jens-Peter Schenk, Marry M. van den Heuvel-Eibrink, Geert O. Janssens, Matteo Maspero

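AI summary: This study addresses the poor performance of organs-at-risk (OAR) segmentation models in pediatric upper abdominal radiotherapy caused by data scarcity, proposing a multi-center collaborative approach based on federated learning (FL). Models are trained on local data at two European medical centers and share model parameters rather than patient data, which improves generalization while preserving privacy. The results show that the FL model outperforms local models in cross-center testing, improving segmentation accuracy and robustness and offering a feasible route to automated contouring in pediatric radiotherapy.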

Abstract

Deep learning-based organs/structures-at-risk (OARs) auto-contouring models can improve radiotherapy workflows, but models trained on adult data often underperform in pediatric patients. Developing robust pediatric-specific models is hindered by data scarcity and fragmentation across centers. Federated learning (FL) enables privacy-preserving collaborative training without the need for data sharing. We evaluated the feasibility and performance of FL for developing pediatric-specific OAR segmentation models across two European medical centers. Computed tomography (CT) images from pediatric patients from Utrecht and Heidelberg with a renal tumor or abdominal neuroblastoma were retrospectively collected and locally processed. An nnU-Net-based framework segmented 19 OARs using local and FL schemes. FL was implemented with secure weight exchange via cloud storage across institutional firewalls. Performance was assessed using the Dice similarity coefficient (DSC), 95th percentile Hausdorff distance, and mean surface distance. Robustness to patient orientation, false-positive segmentation of surgically removed kidneys, and failure cases were also examined. A total of 310 postoperative CTs from 272 patients (105 renal tumors, 167 neuroblastomas) were included. Local models performed well on their respective center data but showed significantly reduced cross-center performance for four to seven of the nine evaluated OARs (DSC). In contrast, the FL model matched local performance for at least seven of nine OARs and achieved the best cross-center results across three metrics, with DSC gains of 0.003-0.007 over local models. FL also maintained stable performance across patient orientations and reduced false-positive kidney segmentations. Real-world FL improves cross-center robustness of CT-based OAR segmentation models in pediatric upper abdominal tumors.
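
The Dice similarity coefficient used as the primary metric is simple to compute for a pair of binary masks. The sketch below is a generic implementation, not the study's evaluation code; treating two empty masks as perfect agreement is a convention chosen here.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: counted as agreement by convention
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)
```

Because the score is bounded by 1, gains of a few thousandths such as those reported above are small in absolute terms.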

2605.06810 2026-05-11 cs.HC cs.CV

Enhancing Eye Movement Biometrics for User Authentication via Continuous Gaze Offset Score Fusion

Hashim Aziz, Mehedi Hasan Raju, Oleg V. Komogortsev

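AI summary: This study examines how fusing continuous gaze offset scores can improve eye-movement-based biometric performance. It combines continuous gaze offset information with existing biometric features and evaluates linear and nonlinear fusion methods on two public datasets. The results show that fusion improves authentication performance, most notably with nonlinear fusion, indicating that continuous gaze offset can serve as auxiliary information that makes the system more robust when eye-tracking quality degrades.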

Comments 10 Pages, 1 Figure, 1 Table, Submitted to IJCB 2026

Abstract

Eye movement biometrics (EMB) use subject-specific gaze dynamics for user authentication and identification. Recent deep learning-based EMB systems achieve strong performance by modeling temporal eye movement behavior. However, these systems typically overlook continuous gaze offset, despite prior evidence that it contains user-discriminative information. This work examines whether continuous gaze offset can improve biometric performance when combined with existing biometric features. We evaluate linear and nonlinear fusion methods on two publicly available datasets, collected with a lab-grade eye tracker and a virtual reality headset across multiple tasks and observation durations. Results indicate that fusion offers performance benefits on both datasets, particularly when using nonlinear fusion. Additionally, fusing biometric information across multiple tasks further improves authentication performance. These findings support the hypothesis that continuous gaze offset may serve as useful auxiliary information under conditions of degraded or noisy eye tracking.

2605.06762 2026-05-11 q-bio.GN cs.AI

A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine

Yibin Wang, Murukarthick Jayakodi, Silvas Kirubakaran, Ambika Chandra, Azlan Zahid

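AI summary: This paper presents LiT-G2P, a hybrid method that combines a linear model with a Transformer architecture for SNP-based genotype-to-phenotype prediction in grapevine. By integrating additive genetic effects with nonlinear gene interactions, it improves the stability and accuracy of complex-trait prediction across years. The results show that LiT-G2P outperforms baseline models in both single-year and cross-year prediction, particularly for leaf hair density and trichome density, and that attention weights can be used to extract key SNP loci as interpretable candidate markers for downstream validation.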

Comments 15 pages, 4 Figures

Abstract

Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide polymorphism (SNP) data. We evaluated LiT-G2P on a panel of diverse grape accessions, genotyped with SNP markers and measured for phenotypes across two consecutive years. Target phenotypic traits include leaf hair density and trichome density of grapevines. Across both single-year and cross-year testing scenarios, LiT-G2P consistently improves prediction performance compared with baseline models. For hair density, LiT-G2P achieves the lowest error in both single-year and cross-year evaluations, with RMSEs of 0.469 and 0.454, respectively, while maintaining strong tolerance accuracies of 79.2% and 74.6%, respectively. For trichome density, LiT-G2P also presents the best overall G2P performance. In addition, we extract model-prioritized SNPs from attention weights and apply genotype-stratified analysis to provide interpretable candidate markers for downstream validation. These results demonstrate that integrating stable additive effects with learned interaction patterns can enhance cross-year robustness and support practical SNP-based predictive modeling for genomic selection.

2605.06749 2026-05-11 stat.ME cs.AI

A Statistical Framework for Algorithmic Collective Action with Multiple Collectives

Claudio Battiloro, Pietro Greiner, Dario Rancati, Bret Nestor, Oumaima Amezgar, Francesca Dominici

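AI summary: As learning systems play a growing role in everyday decisions, algorithmic collective action (ACA), in which users coordinate changes to shared data to steer model behavior, complements regulatory policy and corporate model design. Existing work mostly considers a single collective, whereas in practice multiple collectives share overarching objectives but are fragmented by differences in size, strategy, and actionable goals. This paper proposes the first statistical framework for ACA with multiple collectives, studies how multiple collectives jointly influence a classifier's behavior, and provides quantitative statistical bounds in terms of collective sizes and goal alignment that each collective can compute with only partial knowledge of the others. The framework is validated on simulations of climate adaptation interventions in smart cities.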

Comments 27 pages, 16 figures

Abstract

As learning systems increasingly shape everyday decisions, Algorithmic Collective Action (ACA), i.e., users coordinating changes to shared data to steer model behavior, offers a complement to regulator-side policy and corporate model design. Real-world collective actions have traditionally been decentralized and fragmented into multiple collectives, despite sharing overarching objectives, with each collective differing in size, strategy, and actionable goals. However, most of the ACA literature focuses on single-collective settings. To address this, we propose the first comprehensive statistical framework for ACA with multiple collectives acting on the same system. In particular, we focus on collective action in classification, studying how multiple collectives can influence a classifier's behavior. We provide quantitative statistical bounds on the success of the collectives, considering the role and the interplay of the collectives' sizes and the alignment of their goals. We make such bounds computable by each collective with only partial knowledge of other collectives' sizes and strategies. Finally, we numerically illustrate our framework on simulations inspired by interventions for climate adaptation in smart cities, demonstrating the usefulness of our bounds.

2605.06737 2026-05-11 cs.SE cs.AI

A Self-Healing Framework for Reliable LLM-Based Autonomous Agents

Cheonsu Jeong, Younggun Shin

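AI summary: This paper proposes a reliability-aware self-healing framework for autonomous agents based on large language models (LLMs), targeting unpredictable failures in complex systems such as hallucinations, execution errors, and inconsistent reasoning. The framework integrates failure detection, reliability assessment, and automated recovery: it defines a taxonomy of failure types, builds a quantitative reliability model, designs a detection method based on execution patterns and output consistency, and recovers dynamically through adaptive replanning and corrective prompting. Experiments show that the approach significantly raises task success rates, reduces failure propagation, and improves robustness, and that a monitoring system combining the agent's internal reasoning with external execution results offers a new route to more stable autonomous systems.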

Comments 13 pages, 3 figures, 1 table

Abstract

Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent reasoning. This paper proposes a reliability-aware self-healing framework for LLM-based software agents. The framework integrates failure detection, reliability assessment, and automated recovery mechanisms. First, we define a taxonomy of failure types and introduce a quantitative reliability assessment model. Next, we propose a failure detection method that identifies abnormal agent behavior based on execution patterns and output consistency. Finally, we design a self-healing mechanism that dynamically recovers from failures through adaptive replanning and corrective prompting strategies. The proposed framework was implemented in a multi-agent workflow environment and evaluated using real-world task scenarios. Experimental results demonstrate that our approach significantly increases task success rates, reduces failure propagation, and enhances overall system robustness compared to existing methods. In particular, this study distinguishes itself by establishing an integrated monitoring system that combines the agent's internal reasoning process with external execution results. These findings are expected to contribute to securing the stability of advanced autonomous systems and lowering the barriers to LLM adoption in production environments.

2605.06731 2026-05-11 cs.CR cs.CL cs.LG

When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

Xiaoyu Xu, Minxin Du, Qipeng Xie, Haobin Ke, Qingqing Ye, Haibo Hu

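AI summary: Personalized LLM agents keep persistent state across sessions to support long-term collaboration, but this persistence introduces a subtle yet critical security vulnerability: routine user-agent interactions can gradually alter the agent's long-term state, weakening future confirmation boundaries, expanding tool-use defaults, and increasing autonomy over time. The paper formalizes this risk as "unintended long-term state poisoning", builds ULSPB, a bilingual benchmark of 350 settings, to study it systematically, and defines a Harm Score (HS) to quantify the harm. Experiments show that routine conversations alone can substantially poison long-term state, and that the proposed lightweight defense, StateGuard, effectively reduces the Harm Score and improves safety.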

Comments 23 pages

Abstract

Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as unintended long-term state poisoning. To systematically study it, we introduce the Unintended Long-Term State Poisoning Bench (ULSPB), a bilingual benchmark comprising 350 settings spanning five assistance categories, seven interaction patterns, 24-turn routine interactions, and matched single-injection counterparts. Furthermore, we define the Harm Score (HS), a state-centric metric that quantifies authorization drift, tool-use escalation, and unchecked autonomy. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single-injection is generally effective, routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts. Evaluations seeded with real-world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose StateGuard, a lightweight, post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false-negative rates, with acceptably high false-positive rates under a safety-first writeback defense and minimal overhead.
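
The idea of auditing state diffs at the writeback boundary and selectively rolling back edits can be sketched as follows. This is not StateGuard's implementation, which the abstract does not detail: the function `audit_writeback`, the dictionary-shaped state, and the example policy are all hypothetical.

```python
_MISSING = object()  # marks a key that is absent from one side of the diff

def audit_writeback(old_state, new_state, is_dangerous):
    """Accept safe edits from new_state and roll back the flagged ones.

    is_dangerous(key, before, after) is a caller-supplied policy.
    Returns (accepted_state, sorted list of rolled-back keys).
    """
    accepted = dict(old_state)
    rolled_back = []
    for key in set(old_state) | set(new_state):
        before = old_state.get(key, _MISSING)
        after = new_state.get(key, _MISSING)
        if before == after:
            continue  # unchanged key, nothing to audit
        if is_dangerous(key, before, after):
            rolled_back.append(key)  # keep the old value (or absence)
        elif after is _MISSING:
            accepted.pop(key, None)  # safe deletion
        else:
            accepted[key] = after  # safe insertion or update
    return accepted, sorted(rolled_back)


def weakens_confirmation(key, before, after):
    """Hypothetical policy: never let a confirmation flag be switched off."""
    return key.startswith("confirm_") and after is False
```

Auditing per key rather than accepting or rejecting the whole writeback is what makes the rollback selective: benign preference updates survive while the flagged edit is dropped.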

2605.06728 2026-05-11 q-bio.GN cs.AI q-bio.CB

OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

Maciej Sypetkowski, Joanna Krawczyk, Łukasz Smoliński, Remigiusz Kinas, Przemysław Pietrzak, Tomasz Jetka, Rafał Powalski

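AI summary: OmicsLM is a multimodal large language model for multi-sample omics reasoning that connects quantitative omics data with natural-language biological tasks. It represents transcriptomic data as compact continuous vectors and processes natural-language instructions, gene names, and multiple samples in a single context, enabling language-guided multi-sample reasoning. The study also introduces the GEO-OmicsQA benchmark for evaluating multi-sample biological question answering over real expression profiles, and shows that OmicsLM outperforms existing specialized models and general-purpose LLMs on language-guided biological reasoning tasks.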

Comments 13 pages (main text), 14 pages (appendix), 1 figure, 10 tables

Abstract

Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.

2605.06718 2026-05-11 cs.CR cs.LG

TUANDROMD-X: Advanced Entropy and Visual Analytics Dataset for Enhanced Malware Detection and Classification

Parthajit Borah, Upasana Sarmah, D. K. Bhattacharyya, J. K. Kalita

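AI summary: As malware attacks grow more complex, traditional signature-based defenses struggle to keep up. This paper introduces TUANDROMD-X, a multiclass malware dataset that uses static analysis to extract visual and entropy-based features for each sample, allowing malware to be distinguished from goodware. The dataset reduces the overhead of feature engineering and dynamic analysis and gives researchers a high-quality data resource for designing faster and more accurate malware detection systems.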

Abstract

Malware and malware-based attacks are becoming more prevalent and complex. Attackers regularly come up with new techniques that can evade conventional, signature-based malware defenses. To address such threats, there is an increasing demand for more advanced defense solutions. Machine learning-based techniques can defend effectively against malware and malware-based attacks. Nevertheless, creating and efficiently testing such techniques demands high-quality datasets containing samples of various malware families as well as goodware. The lack of such datasets continues to be a major bottleneck in malware research. In this paper, we introduce TUANDROMD-X, a multiclass malware dataset with visual and entropy-based features of each sample, distinctly identifying malware from goodware. The dataset is created based on static analysis, lowering the overhead that comes with heavy feature engineering and dynamic analysis. As a result, TUANDROMD-X helps researchers and cybersecurity experts design faster and better malware detection systems.

2605.06717 2026-05-11 cs.SE cs.AI

Agentic Coding Needs Proactivity, Not Just Autonomy

Nghi D. Q. Bui, Georgios Evangelopoulos

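AI summary: This paper argues that the next generation of coding agents needs proactivity, not just autonomy: agents should notice changes during software development, integrate information across tools, and intervene at the right time. It proposes a three-level taxonomy of proactive behavior (Reactive, Scheduled, and Situation Aware) and, grounded in mixed-initiative interaction principles, evaluation metrics centered on insight decision quality, context grounding, and learning lift, providing a theoretical framework and evaluation method for measuring how useful agent behavior is.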

Comments Position Paper

Abstract

Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook-triggered routines across the development life cycle. The next generation is increasingly described as proactive and long-horizon: agents should notice relevant changes before the developer asks, connect signals across tools, decide when to interrupt, and carry preferences across sessions. Yet the field still lacks a clear account of what proactivity means for software development, how it differs from autonomy, what acceptance criteria proactive long-horizon tasks should satisfy, and which metrics determine whether unsolicited agent behavior is useful rather than merely active. Proactive coding agents should be evaluated by the quality and improvement of their insight policy: the policy that decides what matters next, what evidence supports it, whether to show it, and how to adapt after feedback. This view is grounded in the principles of mixed-initiative interaction. We propose a three-level taxonomy of proactivity (Reactive, Scheduled, and Situation Aware), compare contemporary coding agents against five practical criteria, and sketch an active user simulation protocol with three evaluation targets: Insight Decision Quality (IDQ), Context Grounding Score (CGS), and Learning Lift

2605.06713 2026-05-11 cs.CR cs.AI cs.HC

Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand

Christopher Koch

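AI summary This paper examines the use of agentic AI in cyber offense and the threat it poses to large enterprises and the Mittelstand. It argues that agentic AI compresses the attack lifecycle and changes the economics of cyber offense by lowering the cost of steps such as reconnaissance, phishing, and exploitation. The paper introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study, and closes with a prioritized defense roadmap for large enterprises and the German and European Mittelstand.

Comments 7 pages

Details
English abstract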

Agentic AI systems can plan, call tools, inspect code, interact with web applications, and coordinate multi-step workflows. These same capabilities change the economics of cyber offense. The central near-term risk is not that every low-skill criminal immediately becomes a frontier exploit researcher; it is that agentic AI compresses the attack lifecycle by lowering the cost of reconnaissance, phishing, credential abuse, vulnerability triage, exploit adaptation, and post-compromise decision support. This paper synthesizes current public evidence from national cybersecurity agencies, industry threat reports, agent security guidance, and research on LLM agents' cyber capabilities. It introduces a Three Channel Agentic Cyber Risk Model and an Agentic Attack Compression Model, uses the 2026 Linux kernel Copy Fail incident as a case study for foothold-to-root acceleration, and develops a 2026 to 2028 forecast for large enterprises and the German and European Mittelstand. The paper concludes with a prioritized defense roadmap. Organizations should treat agentic AI security as an immediate operational problem: identity, phishing-resistant authentication, patch velocity, CI/CD and Linux/container hardening, agent governance, telemetry, and recovery readiness must be strengthened now.

2605.06710 2026-05-11 cs.IT cs.LG math.IT math.ST stat.TH

Information-theoretic Limits of Learning and Estimation

Abbas El Gamal, Maxim Raginsky

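AI summary This chapter introduces the fundamental limits that information theory places on learning and estimation, that is, the performance bounds that hold for any learning or estimation algorithm regardless of computational power. Starting from tools such as concentration inequalities, metric entropy, and Rademacher complexity, it derives upper bounds on generalization error and analyzes the learning-theoretic framework using mutual information and relative entropy. It then establishes lower bounds on minimax estimation risk via Fano's inequality, providing key analytical tools for understanding the theoretical limits of learning and estimation.

Details
English abstract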

Information theory plays a central role in establishing fundamental limits on what any learning or estimation algorithm can -- and cannot -- achieve, regardless of computational power. In this chapter, we provide an introduction to these connections. End-of-chapter exercises make the material suitable for both classroom use and self-study. We begin by introducing concentration inequalities along with the notions of covering and packing in metric spaces, and the associated concept of metric entropy. These tools are essential for our analysis. We then introduce the learning-theoretic framework and derive upper bounds on generalization error in terms of metric entropy, Rademacher complexity, and the VC dimension, as well as mutual information and relative entropy. Finally, we discuss the minimax estimation framework and establish lower bounds on minimax risk using Fano's inequality, yielding bounds in terms of relative entropy and covering and packing numbers. This manuscript contains a preprint of a chapter under consideration for inclusion in the forthcoming third edition of Cover and Thomas's Elements of Information Theory, posted with permission from Wiley. It would follow the chapter posted at arXiv:2605.02989. The table of contents of the new edition can be found at: https://docs.google.com/document/d/1L-m4oQEJw1PJhoxBeMwrrBD8S_HmvzMEkPbYvS24980/edit?usp=sharing . For feedback, please contact abbas@ee.stanford.edu.

2605.06707 2026-05-11 cs.SE cs.AI

The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking

Diego Cabezas Palacios

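AI summary In an eight-week observational study, this paper compares four large language model families (GPT, Gemini, Grok, and Claude) generating single-file HTML pages under a fixed public-interface protocol. Outputs are evaluated for functional correctness, UI quality, and prompt adherence, and the results are then published across social-media platforms to track reach. Claude was the strongest and most consistent family overall, and reasoning time was not associated with generation quality. HTML verbosity was driven largely by model family, while pre-publication technical and audio variables did not effectively predict reach on X (Twitter).

Comments 23 pages, 3 figures, 5 tables

Details
English abstract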

This paper presents an eight-week observational comparison of 68 single-file HTML generations collected across 17 public experiments in the "HTML AI Battle" project between December 10, 2025 and February 4, 2026. Four reasoning model families, GPT, Gemini, Grok, and Claude, were compared under a fixed public-interface protocol with no custom instructions, no personality tuning, and no repair prompts. Each output was evaluated from a rendered browser video using human scores and a Gemini LLM-as-a-judge layer for prompt adherence, functional correctness, and UI quality, then packaged into a standardized social-media protocol spanning X (Twitter), TikTok, and YouTube. The tracker was also used for two supervised predictive analyses: an experiment-level model for 24-hour X impressions and a generation-level model for HTML verbosity. Under this protocol, Claude was the strongest and most consistent family, leading mean performance and winning 9/17 prompts under the primary human weighted score. Longer measured reasoning time was not associated with higher quality overall. Gemini as a judge was significantly more lenient than the human evaluator on functional correctness and overall performance, while stable self-favoring bias remained unresolved. The exploratory X-impressions model remained weak under post-screen cross-validation (MAE = 46,874, R^2 = -0.377), whereas the HTML-lines model performed better, with a model-family-only baseline outperforming prompt-aware alternatives (MAE = 135.2, R^2 = 0.576). Overall, selected pre-publication technical/audio variables were not sufficient to predict 24-hour X reach, while code verbosity was driven much more by model family than by prompt wording. The comparisons remain observational and are limited by public-interface drift, access-path differences, and one primary human scorer.
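
The abstract above reports a negative R^2 for the X-impressions model. As a minimal sketch of why that can happen, the Python below computes MAE and R^2 from their standard definitions on made-up numbers (the values and function names are illustrative assumptions, not data or code from the paper). R^2 falls below zero whenever a model's squared error exceeds that of always predicting the mean of the targets.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |target - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative targets and two hypothetical sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
good = [1.5, 2.0, 3.0, 3.5]   # close to the targets
bad = [4.0, 3.0, 2.0, 1.0]    # worse than predicting the mean (2.5)

print(mae(y_true, good), r2(y_true, good))
print(mae(y_true, bad), r2(y_true, bad))
```

Under this definition a constant mean predictor scores exactly 0, so a cross-validated R^2 below zero indicates held-out predictions that do worse than that baseline.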

2605.06699 2026-05-11 eess.IV cs.AI cs.CV cs.LG

Multimodal synthesis of MRI and tabular data with diffusion in a joint latent space via cross-attention

Daniel Mensing, Jan Kapar, Jochen G. Hirsch, Matthias Günther, Horst Hahn, Marvin N. Wright

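AI summary This paper proposes a multimodal latent diffusion model based on cross-attention that jointly generates magnetic resonance imaging (MRI) and tabular clinical data in a shared latent space, enabling joint representation learning across the two modalities. The model fuses both modalities with a variational autoencoder, performs diffusion-based synthesis, and reconstructs MRI and tabular data with separate decoders. Experiments show that it generates anatomically plausible MRI volumes consistent with the tabular attributes and performs competitively against established baselines on quantitative metrics, offering a feasible approach to generating coherent multimodal patient data in healthcare.

Details
Journal ref
Proc. SPIE 13925, Medical Imaging 2026: Image Processing, 139252D (April 03, 2026)
English abstract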

We propose a multimodal latent diffusion model that jointly synthesizes volumetric magnetic resonance imaging (MRI) and tabular clinical data within a shared latent space via cross-attention. This approach enables coherent joint representation learning of MRI and tabular modalities for generative modeling. Our model utilizes a variational autoencoder to fuse the two modalities before diffusion-based synthesis, allowing modality-appropriate reconstruction with separate decoders for MRI and tabular data. We evaluated the framework on data from the German National Cohort (NAKO Gesundheitsstudie), comprising over 10,000 participants with MRI scans and clinical tabular features such as age, sex, body measurements, and ethnicity. The generated MRI volumes exhibited anatomical plausibility and body composition consistent with the synthesized tabular attributes. Quantitative evaluation using Fréchet distance and precision-recall metrics confirmed high-fidelity image generation. In the tabular modality, our model outperformed CTGAN across standard evaluation metrics and achieved results comparable to TVAE, demonstrating competitive performance relative to established unimodal baselines. This work is, to our knowledge, the first to demonstrate the feasibility of jointly modeling MRI and mixed-type tabular data in a single latent diffusion framework, offering a proof-of-concept for generating coherent synthetic multimodal patient data and aligning with the broader goal of developing digital twins in healthcare.

2605.06055 2026-05-11 cs.DC cs.LG

Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang, Xiaoming Bao, Yuxing Li, Wei Wang, Zhongzhe Hu, Lijun Li, Hongwei Sun, Jingbin Zhou

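AI summary This report addresses the communication bottleneck in Mixture-of-Experts (MoE) inference and proposes a relay-buffer-free communication design to improve inference efficiency on Ascend platforms. Using globally pooled high-bandwidth memory and symmetric-memory allocation, the method places data directly into destination expert windows and reads directly from remote expert windows, removing most intermediate relay and reordering buffers and keeping only lightweight control state. Experiments show reduced communication latency in both prefill and decode, improved time to first token, and competitive time per output token.

Details
English abstract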

Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including counts, offsets, and synchronization metadata. We instantiate the design as two schedules for the main phases of MoE inference: a prefill schedule with richer planning state for throughput-oriented execution, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads show reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. These results indicate that, on platforms with globally addressable device memory, reducing intermediate buffering and output restoration around expert execution is an effective direction for accelerating MoE inference.

2605.05995 2026-05-11 cs.CR cs.AI cs.CL

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou, Hua Dai, Fu Xiao

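AI summary The safety alignment of large language models remains vulnerable to Harmful Fine-tuning (HFT). Existing defenses constrain parameters, gradients, or internal representations, but can be circumvented under persistent HFT. This paper proposes a new defense, Safety Bottleneck Regularization (SBR), which shifts the defensive focus to the unembedding layer as a geometric bottleneck and anchors the final hidden states of harmful queries to those of the safety-aligned model, so that the model keeps responding safely under persistent HFT. Experiments show that a single safety anchor is enough to reduce the Harmful Score to below 10 while preserving competitive performance on benign tasks.

Comments Accepted to ICML 2026

Details
English abstract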

The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to $<$10 while preserving competitive performance on benign downstream tasks.

2605.05703 2026-05-11 cs.MA cs.AI cs.LG

Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems

Huchen Yang, Xinghao Dong, Dan Negrut, Jin-Long Wu

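AI summary This paper studies communication-structure optimization in LLM-based multi-agent systems, aiming to improve performance and reduce computational cost under limited training budgets. Because existing methods rely on randomly sampled tasks, which makes optimization unstable, the authors propose an ensemble-based information-theoretic task selection framework that picks the most informative tasks by estimating how much each task changes the distribution over graph parameters, combined with embedding-based representative selection and surrogate modeling to speed up optimization. Experiments show the method is effective in both benign and adversarial settings.

Details
English abstract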

Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training set. To actively identify the most valuable tasks for communication-structure optimization, we propose an ensemble-based information-theoretic task selection framework. The proposed method estimates task informativeness by how much a candidate task changes the distribution over graph parameters, using ensemble Kalman inversion as an efficient and derivative-free approximation of the corresponding Bayesian update. The resulting estimator is especially suitable for black-box and noisy multi-agent systems. To enhance scalability, we construct a compact candidate pool through embedding-based representative selection and combine the informative selection with surrogate modeling and batch Thompson sampling. We validate our method in both benign settings and settings with agent attacks, demonstrating its effectiveness for communication-structure optimization under constrained computational budgets.

2605.05340 2026-05-11 cs.CR cs.AI

How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

Junran Wang, Xinjie Shen, Zehao Jin, Pan Li

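AI summary As Vision-Language Models (VLMs) are increasingly used in embodied agents, evaluating their privacy awareness in physical environments becomes critical. This paper presents ImmersedPrivacy, a Unity-based interactive audio-visual evaluation framework that simulates realistic physical scenes and evaluates models across three tiers: identifying privacy-sensitive items in cluttered scenes, adapting to shifting social contexts, and resolving conflicts between privacy constraints and explicit commands. Experiments on 12 state-of-the-art models reveal significant deficits under perceptual complexity and changing social context, showing that VLMs in the physical world still suffer from perceptual fragility and weak behavioral decision-making.

Details
English abstract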

As Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy-sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments using a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model's ability to identify sensitive items in cluttered scenes, adapt to shifting social contexts, and resolve conflicts between explicit commands and inferred privacy constraints. Our evaluation of 12 state-of-the-art models reveals consistent deficits. In cluttered scenes, all models exhibit monotonic performance decay as scene complexity grows due to perceptual deficits. When social context shifts, no model exceeds 65% selection accuracy. Under conflicting commands, the best model, gemini-3.1-pro, perfectly balances task completion and privacy preservation in only 51% of cases. These findings reveal that current VLMs in the physical world suffer from perceptual fragility and fail to let their knowledge of privacy cues govern their situated behavior. Our code and data are available at https://github.com/immersed-privacy/immersed-privacy .

2605.04615 2026-05-11 cs.SE cs.AI

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

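AI summary This paper introduces CoREB, a multitask code search benchmark, together with a reranking model, aiming to go beyond first-stage retrieval and cover the full code search pipeline. The benchmark is built from counterfactually rewritten problems, covers five programming languages, and provides graded relevance judgments. Experiments show that code-specialised embedding models excel at code-to-code retrieval, but no single model wins all tasks, while the proposed CoREB-Reranker achieves consistent gains across all three tasks.

Comments project site: https://hq-bench.github.io/coreb-page/

Details
English abstract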

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval ($\sim 2\times$ over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
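
The abstract reports results in nDCG@10 over graded relevance judgments. A minimal sketch of that metric follows, assuming the common linear-gain form of DCG and assuming the ideal ranking is obtained by sorting the same list of labels; the paper's exact gain function, label scale, and ideal-list construction are not stated in the abstract, and the function names here are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded labels (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering.

    `relevances` holds the graded labels of the returned items in ranked
    order. Assumption: the list contains every relevant item, so sorting
    it gives the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels (3 = best match, 0 = irrelevant).
print(ndcg_at_k([3, 2, 1, 0]))  # ideal ordering
print(ndcg_at_k([0, 1, 2, 3]))  # reversed ordering scores lower
print(ndcg_at_k([0, 0, 0]))     # nothing relevant retrieved
```

A score near zero, as reported for short keyword queries, means relevant items rarely appear in the top ten positions at all.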

2604.18972 2026-05-11 stat.ML cs.LG math.OC

Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation

Yaowei Zheng, Richong Zhang, Shenxi Wu, Shirui Bian, Haosong Zhang, Li Zeng, Xingjian Ma, Yichi Zhang

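AI summary This paper studies finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. The standard Bellman approach is only first-order accurate; the authors estimate the time-dependent generator from multi-step transitions and use moment-matching coefficients to cancel lower-order truncation error, yielding a higher-order regression estimator. The theory gives an error decomposition and the conditions under which gains appear, and experiments on several benchmarks show the method improves on the Bellman baseline, supporting the effectiveness and stability of high-order generator regression for continuous-time policy evaluation.

Comments The authors are withdrawing this paper due to an unresolved dispute concerning authorship and the attribution of intellectual contributions

Details
English abstract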

We study finite-horizon continuous-time policy evaluation from discrete closed-loop trajectories under time-inhomogeneous dynamics. The target value surface solves a backward parabolic equation, but the Bellman baseline obtained from one-step recursion is only first-order in the grid width. We estimate the time-dependent generator from multi-step transitions using moment-matching coefficients that cancel lower-order truncation terms, and combine the resulting surrogate with backward regression. The main theory gives an end-to-end decomposition into generator misspecification, projection error, pooling bias, finite-sample error, and start-up error, together with a decision-frequency regime map explaining when higher-order gains should be visible. Across calibration studies, four-scale benchmarks, feature and start-up ablations, and gain-mismatch stress tests, the second-order estimator consistently improves on the Bellman baseline and remains stable in the regime where the theory predicts visible gains. These results position high-order generator regression as an interpretable continuous-time policy-evaluation method with a clear operating region.

2604.15533 2026-05-11 cs.PL cs.LG cs.LO cs.SE

Verification Modulo Tested Library Contracts

Abhishek Uppar, Omar Muhammad, Sumanth Prabhu, Deepak D'Souza, Madhusudan P, Adithya Murali

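AI summary This paper studies verification modulo tested library contracts, with the goal of automating the verification of client programs that use complex libraries. The core approach is to synthesize modular contracts adequate for the client program and to check those contracts with a testing engine; it also introduces a new form of contract, contextual contracts, that is easier to infer. The authors propose a counterexample-guided learning framework that combines a constraint solver and a testing engine to synthesize contracts and inductive invariants, implement it in the tool DUALIS, and show its efficacy on client programs that call large libraries.

Comments Removed LaTeX formatting from abstract text

Details
English abstract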

We consider the problem of verification modulo tested library contracts as a step towards automating the verification of client programs that use complex libraries. We formulate this problem as the synthesis of modular contracts for the library methods used by the client that are adequate to prove the client correct, and that also pass the scrutiny of a testing engine that tests the library against these contracts. We also consider a new form of method contracts called contextual contracts that arise in this setting that hold in the context of the client program, and can often be simpler and easier to infer than classical modular contracts. We provide a counterexample-guided learning framework to solve this problem, in which the synthesizer interacts with a constraint solver as well as the testing engine in order to infer adequate modular/contextual method contracts and inductive invariants for the client. The main synthesis engines we use are generalizing CHC solvers that are realized using ICE learning algorithms. We realize this framework in a tool called DUALIS and show its efficacy on benchmarks where clients call large libraries.

2604.15439 2026-05-11 stat.ML cs.LG math.PR

One-Shot Generative Flows: Existence and Obstructions

Panos Tsimpos, Daniel Sharp, Youssef Marzouk

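AI summary This paper studies dynamic measure transport for generative modeling, focusing on transport maps that carry a source distribution $P_0$ to a target distribution $P_1$ by integrating a velocity field. The central question is when this process yields a "straight-line flow", a flow whose pointwise acceleration vanishes and which any first-order method integrates exactly. The paper characterizes straight-line flows through PDEs and proves a sharp dichotomy under endpoint independence: explicit straight-line flows can be constructed for arbitrary Gaussian endpoints, whereas no straight-line flow exists for targets with sufficiently well-separated modes. These results clarify the conditions under which such generative flows can and cannot exist.

Details
English abstract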

We study dynamic measure transport for generative modeling, focusing on transport maps that connect a source measure $P_0$ to a target measure $P_1$ by integrating a velocity field of the form $v_t(x) = \mathbb{E}[\dot X_t \mid X_t = x]$, where $X_\bullet = (X_t)_t$ is a stochastic process satisfying $(X_0,X_1)\sim{P_0}\otimes{P_1}$ and $\dot X_t$ is its time derivative. We investigate when $X_\bullet$ induces a \emph{straight-line flow}: a flow whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. First, we develop multiple characterizations of straight-line flows in terms of PDEs involving the conditional statistics of the process. Then, we prove that straight-line flows under endpoint independence exhibit a sharp dichotomy. On the one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show that straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this obstruction through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent endpoints and the space-time geometry of this process' flow map. Taken together, these results provide a structural theory of when straight-line generative flows can, and cannot, exist.

2604.06738 2026-05-11 cs.GT cs.LG

Beyond Pessimism: Offline Learning in KL-regularized Games

Yuheng Zhang, Claire Chen, Nan Jiang

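AI summary This paper studies offline learning in KL-regularized two-player zero-sum games, where KL regularization keeps policy optimization aligned with a fixed reference policy. Unlike prior methods that rely on pessimistic value estimation, the authors propose a new pessimism-free algorithm and analytical framework that exploits the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium, obtaining the first pessimism-free offline learning guarantee for KL-regularized games with a faster $\widetilde{\mathcal{O}}(1/n)$ sample complexity. They also design an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative policy updates while preserving the same statistical guarantee up to a controlled optimization error.

Details
English abstract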

We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized with respect to a fixed reference policy through KL regularization. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields, to our knowledge, the first pessimism-free offline learning guarantee for KL-regularized games, with a fast $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that its last iterate preserves the same pessimism-free statistical guarantee up to a controlled optimization error.
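
The smoothness of KL-regularized best responses mentioned above can be seen from the standard closed form: maximizing expected payoff minus beta times KL(pi || pi_ref) over a finite action set gives pi(a) proportional to pi_ref(a) * exp(u(a) / beta). The sketch below implements only that textbook single-player formula; the function name, the payoff vector (which in a game would be the expected payoff of each action against a fixed opponent policy), and the coefficient `beta` are illustrative assumptions, not the paper's notation or algorithm.

```python
import math


def kl_regularized_best_response(payoffs, ref_policy, beta):
    """Maximiser of sum_a pi(a) * u(a) - beta * KL(pi || ref_policy).

    Closed form: pi(a) is proportional to ref_policy(a) * exp(u(a) / beta).
    Subtracting the maximum payoff first keeps exp() numerically stable
    and does not change the normalised result.
    """
    max_u = max(payoffs)
    weights = [p * math.exp((u - max_u) / beta)
               for u, p in zip(payoffs, ref_policy)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-action example with a uniform reference policy.
payoffs = [1.0, 0.0]
ref = [0.5, 0.5]
for beta in (10.0, 1.0, 0.1):
    print(beta, kl_regularized_best_response(payoffs, ref, beta))
```

Large `beta` keeps the response close to the reference policy and small `beta` approaches the unregularized best response; in between, the response varies smoothly with the payoffs, unlike a hard argmax.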