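arXivDaily, a daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.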

2605.21572 2026-05-22 cs.CV cs.RO

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

Affiliation: S-Lab, Nanyang Technological University

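AI summary: This paper proposes PhysX-Omni, a unified framework for simulation-ready physical 3D generation. It develops an efficient geometry representation tailored for Vision-Language Models, builds PhysXVerse, the first general simulation-ready 3D dataset, and introduces PhysX-Bench to evaluate both generation and understanding. The framework markedly improves generation and understanding performance and supports downstream applications such as embodied AI and physics-based simulation.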

Comments Project page: https://physx-omni.github.io/

Abstract

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

2605.21568 2026-05-22 cs.LG

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

Jack Kendall

Affiliation: Rain Neuromorphics

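AI summary: This paper extends the equilibrium propagation framework to skew-gradient systems and shows an equivalence between deep energy-based models and Hamiltonian neural networks. It focuses on diffusively coupled Fitzhugh-Nagumo neural networks as the canonical example. Because the steady-state solutions of the Fitzhugh-Nagumo model are described by a self-adjoint operator, equilibrium propagation can be applied for credit assignment. In addition, for Fitzhugh-Nagumo networks with a deep residual network topology, the steady-state solutions admit a (spatial) Hamiltonian, so Hamiltonian backpropagation methods can be applied. Finally, the paper derives an explicit layer-wise Hamiltonian recurrence relation for inferring steady-state solutions of deep Fitzhugh-Nagumo networks and deep energy-based models.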

2605.21566 2026-05-22 cs.LG

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

Michael O. Eniolade

Affiliation: University of the Cumberlands

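AI summary: This paper evaluates the importance of calibration, uncertainty quantification, and deployment readiness in chronic kidney disease risk prediction. Using five classifiers on the UCI CKD dataset, it finds excellent internal performance but poor external transfer, underscoring the need for calibration stability and validation on external data.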

Comments 27 pages, 6 figures, 4 tables. Supplementary materials (S1-S4) included as ancillary file

Abstract

Machine learning models for chronic kidney disease (CKD) risk prediction often post strong discrimination scores on internal test sets. Calibration and uncertainty quantification get far less attention, leaving clinicians without reliable information about whether the probability outputs are accurate. We trained five classifiers on the UCI CKD dataset (400 patients, 62.5% CKD prevalence): logistic regression, random forest, XGBoost, SVM with Platt scaling, and Gaussian naive Bayes. We evaluated each across calibration quality, conformal prediction coverage, and an eight-criterion deployment readiness framework. A distributional stress-test applied the best-calibrated variant of each model to the open-access MIMIC-IV demo cohort (97 patients, 23.7% CKD) to assess behaviour under prevalence shift and feature missingness. We measured calibration before and after Platt scaling and isotonic regression using Expected Calibration Error and Brier Score, and quantified uncertainty through split conformal prediction targeting 90% marginal coverage. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist. Near-perfect internal performance did not transfer. Calibration stability and conformal coverage should be evaluated on external data before any clinical prediction model moves toward deployment.
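As a rough illustration of the metrics named in the abstract above, the following pure-Python sketch computes a binned Expected Calibration Error, the Brier score, and a split conformal threshold for a binary classifier. It is not the paper's code: the function names, the ten equal-width bins, the positive-class form of ECE, and the nonconformity score (one minus the probability assigned to the true class) are all assumptions chosen for illustration.

```python
from math import ceil


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE on the positive-class probability (one common variant).

    Each bin contributes |mean predicted probability - observed positive rate|,
    weighted by the share of samples that fall into the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_prob - positive_rate)
    return ece


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal threshold from a held-out calibration set.

    Assumed nonconformity score: 1 - probability assigned to the true class.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    scores = sorted(
        1 - (p if y == 1 else 1 - p) for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    k = min(ceil((n + 1) * (1 - alpha)), n)
    return scores[k - 1]


def prediction_set(p, threshold):
    """Classes whose nonconformity score does not exceed the threshold."""
    class_probs = {0: 1 - p, 1: p}
    return {c for c, pc in class_probs.items() if 1 - pc <= threshold}
```

With `alpha=0.1` the threshold targets 90% marginal coverage on data exchangeable with the calibration set. That guarantee does not extend to a shifted cohort, which is consistent with the coverage drop the abstract reports on external data.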

2605.21565 2026-05-22 cs.LG

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Phuong-Anh Nguyen, The-Son Le, Duc-Trong Le, Cam-Van Thi Nguyen

Affiliation: VNU University of Engineering and Technology

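AI summary: This paper proposes a framework based on self-paced curriculum learning that uses a dual-level difficulty measurer to address modality imbalance in multimodal conversational emotion recognition. Experiments on the IEMOCAP and MELD datasets show that the method significantly improves model performance.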

Comments Accepted at Neural Computing and Applications (Springer), 2026

Abstract

Multimodal Emotion Recognition in Conversations (MERC) is a crucial task for understanding human interactions, where multimodal approaches integrating language, facial expressions, and vocal tone have achieved significant progress. However, modality misalignment and imbalanced learning remain major challenges, limiting the effective utilization of multimodal information. To address this issue, we propose a plug-and-play framework based on Self-Paced Curriculum Learning (SPCL) for MERC. We introduce a dual-level Difficulty Measurer that captures both utterance-level and conversation-level challenges. The utterance-level score models fine-grained modality-specific difficulty, while the conversation-level score captures broader dialogue structures, including emotional dependencies and modality coherence. Based on these scores, the Learning Scheduler dynamically guides training from easier to more difficult instances. By integrating SPCL into existing MERC architectures, our method alleviates modality imbalance and improves model robustness. Extensive experiments on the IEMOCAP and MELD datasets demonstrate consistent improvements across different architectures and modality settings. On IEMOCAP, SPCL improves weighted F1-score by approximately +1.2% to +6.6% over baseline models, while on MELD, gains reach up to +10.4%. These results highlight the effectiveness and generalizability of SPCL as a lightweight plug-and-play module for multimodal emotion recognition.
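The easy-to-hard scheduling idea described above can be sketched as follows. This is only a generic curriculum sketch under stated assumptions, not the paper's Difficulty Measurer or Learning Scheduler: the linear pacing function, the `mix` weight that combines the two difficulty levels, and the `start` fraction are hypothetical choices.

```python
from math import ceil


def pacing_fraction(epoch, total_epochs, start=0.5):
    """Linear pacing (assumed): share of the training set available at `epoch`."""
    if total_epochs <= 1:
        return 1.0
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (1.0 - start) * progress


def curriculum_subset(utt_difficulty, conv_difficulty, epoch, total_epochs,
                      mix=0.5, start=0.5):
    """Indices of the easiest samples for this epoch, easiest first.

    Difficulty is a weighted sum of an utterance-level and a conversation-level
    score (hypothetical combination rule); the kept share grows with the epoch.
    """
    combined = [
        mix * u + (1.0 - mix) * c
        for u, c in zip(utt_difficulty, conv_difficulty)
    ]
    order = sorted(range(len(combined)), key=combined.__getitem__)
    fraction = pacing_fraction(epoch, total_epochs, start)
    keep = min(len(order), max(1, ceil(fraction * len(order))))
    return order[:keep]
```

Early epochs see only the lowest-difficulty instances, and the final epoch sees the full training set.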

2605.21563 2026-05-22 cs.LG

Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

Fan Zhang, Simon Deltadahl, Majid Lotfian Delouee, Daniel Kreuter, Joseph Taylor, Allerdien Visser, BloodCounts Consortium, James H. F. Rudd, Nicholas S. Gleadall, Suthesh Sivapalaratnam, Folkert Asselbergs, Martijn C. Schut, Michael Roberts

Affiliations: Theoretical Physics, University of Cambridge, Cambridge, UK; Translational AI Laboratory, Dept. of Laboratory Medicine, Amsterdam UMC, Amsterdam, The Netherlands; Precision Health University Research Institute, Queen Mary Univ. of London, London, UK; Department of Medicine, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK; Transplant, Cambridge Biomedical Campus, Cambridge, UK; Dept. of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam UMC, Amsterdam, The Netherlands

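AI summary: This paper presents an embedding-based federated learning framework for predicting iron deficiency from routine full blood count data, deployed in two clinical environments, and shows the advantage of personalised aggregation when client sample sizes and task relevance differ.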

Abstract

Recent reviews find that the vast majority of published healthcare federated learning (FL) studies never reach real-world deployment. We developed an embedding-based FL pipeline for iron deficiency prediction from routine full blood count (FBC) data and deployed it across real institutional environments at Amsterdam University Medical Centre (AUMC) and NHS Blood and Transplant (NHSBT), two clinical environments that differ markedly in iron deficiency prevalence, ferritin distribution, and subject populations. A frozen domain-specific haematology foundation model, DeepCBC, performs site-local representation extraction, restricting federated training to a compact downstream classifier and substantially reducing recurrent communication relative to full-encoder federation. The two clinical datasets are structurally not independent and identically distributed (non-IID), with heterogeneity arising from distinct population differences rather than sampling artefacts. Runtime governance is enforced by FLA$^3$, a healthcare-oriented FL platform providing study-scoped execution, policy-based authorisation, and signed audit logging. Standard sample-size-weighted aggregation (FedAvg) reduced the area under the receiver operating characteristic curve (ROC-AUC) at both sites relative to local-only training, as the global update was biased towards the larger AUMC distribution. FedMAP, a personalised aggregation method, raised ROC-AUC from 0.9470 to 0.9594 at AUMC and from 0.8558 to 0.8671 at NHSBT relative to local-only training, achieving the highest macro ROC-AUC of 0.9133 and the best macro balanced accuracy overall. These results support personalised aggregation in clinical federations where client sample size and task relevance diverge substantially.
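The sample-size-weighted aggregation (FedAvg) mentioned above can be sketched in a few lines. This is a generic illustration of the weighting rule, not the paper's pipeline or the FLA$^3$ platform; the function name and the flat-list parameter format are assumptions.

```python
def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of per-client parameter vectors.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes:  number of training samples held by each client.
    """
    if len(client_params) != len(client_sizes):
        raise ValueError("one sample count is required per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]
```

Because each client's weight is its share of the total sample count, a client holding most of the data dominates the global update. This is the mechanism the abstract points to when it describes the global model as biased towards the larger site.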

2605.21561 2026-05-22 cs.LG

Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

Mathieu Cherpitel, Thomas Bäck, Martijn R. Tannemaat, Anna V. Kononova

Affiliations: LIACS, Leiden University; LUMC, Leiden University

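AI summary: This study examines how the choice of evaluation objective affects search dynamics and Pareto-front quality in multiobjective unsupervised feature selection. It finds that silhouette-based objectives tend to produce trivial low-cardinality solutions, whereas the proposed PCA loss objective yields compact subsets with test accuracy similar to that of supervised optimisation.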

Abstract

Unsupervised feature selection is commonly formulated as a multiobjective optimisation problem that jointly optimises subset quality and subset size. Yet the behaviour of this formulation depends critically on the choice of evaluation objective, the direction of subset-size regularisation, and the initialisation strategy. We study these factors in a controlled setting using a synthetic dataset with known informative, redundant, and irrelevant feature types. Six formulations are compared by combining three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss with subset-size minimisation or maximisation. The results show that formulation strongly affects both search dynamics and the quality of the resulting Pareto front. Silhouette-based formulations exhibit a strong bias toward trivial low-cardinality solutions and remain weak proxies for predictive performance. In contrast, the proposed PCA loss objective produces compact subsets with test accuracy comparable to subsets obtained by directly optimising supervised accuracy. Overall, the study shows that objective design is central to effective multiobjective unsupervised feature selection.

2605.21560 2026-05-22 cs.LG

AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

Penglin Dai, Zijie Zhou, Xincao Xu, Junhua Wang, Xiao Wu, Lixin Duan

Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; School of Computer Science and Engineering, Northeastern University

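AI summary: This paper presents AutoMCU, an LLM-based multi-agent system for automated neural network customization under MCU constraints. Given natural-language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters out infeasible designs before training using vendor toolchain feedback, evaluates feasible models under a controlled protocol, and verifies deployability.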

Abstract

Deploying neural networks on microcontroller units (MCUs) is critical for edge intelligence but remains challenging due to tight memory, storage, and computation constraints. Existing approaches, such as model compression and hardware-aware neural architecture search (HW-NAS), often depend on proxy metrics, incur high search cost, and do not fully bridge the gap between architecture design and verified deployment. This paper presents AutoMCU, a feasibility-first large language model (LLM)-based multi-agent system for automated neural network customization under MCU constraints. Given natural-language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters infeasible designs through vendor toolchain feedback before training, evaluates feasible models under a controlled protocol, and verifies deployability through backend-grounded deployment analysis. AutoMCU includes two key mechanisms: 1) hardware-in-the-loop architecture generation for early elimination of undeployable candidates under RAM and Flash constraints, and 2) state-isolated multi-agent scheduling for stable coordination of proposal, training, evaluation, and deployment stages. Experiments on CIFAR-10 and CIFAR-100 under strict MCU constraints show that AutoMCU achieves competitive accuracy while reducing customization time to about 1-2 hours, compared with hundreds of GPU hours for representative MCU-oriented HW-NAS baselines. Comparisons with ColabNAS and the LLM-based NAS method GENIUS on NAS-Bench-201 further demonstrate the effectiveness and stability of AutoMCU. Real-device deployments on multiple STM32 microcontrollers validate its practical applicability to MCU-scale edge intelligence.
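The feasibility-first idea, rejecting candidates that exceed the RAM or Flash budget before any training is spent on them, might look roughly like the sketch below. Everything here is hypothetical: `Budget`, `filter_feasible`, and the `footprint` callback are illustrative names, and the callback merely stands in for whatever the vendor toolchain reports. It does not reproduce AutoMCU's agents or any real toolchain API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Budget:
    """Memory limits of the target MCU, in bytes (hypothetical container)."""
    ram_bytes: int
    flash_bytes: int


def filter_feasible(
    candidates: Iterable[str],
    footprint: Callable[[str], Tuple[int, int]],
    budget: Budget,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split candidates into deployable ones and rejected ones with a reason.

    `footprint(candidate)` returns (ram_bytes, flash_bytes); in a real system
    this would come from toolchain analysis, here it is a stand-in callback.
    The rejection reason could be fed back to the candidate generator.
    """
    feasible: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for candidate in candidates:
        ram, flash = footprint(candidate)
        if ram <= budget.ram_bytes and flash <= budget.flash_bytes:
            feasible.append(candidate)
        else:
            reason = (
                f"RAM {ram}/{budget.ram_bytes} B, "
                f"Flash {flash}/{budget.flash_bytes} B"
            )
            rejected.append((candidate, reason))
    return feasible, rejected
```

Only the candidates in the first returned list would proceed to training. The point of the design is that a memory check costs far less than a training run.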

2605.21558 2026-05-22 cs.LG cs.CL

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

Hao Chen, Qi Zhang, Liyao Li, Zhanming Shen, Wentao Ye, Lirong Gao, Ningtao Wang, Xing Fu, Xiaoyu Shen, Junbo Zhao

Affiliations: Zhejiang University; Eastern Institute of Technology

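AI summary: This study proposes a task-parameter-guided fine-tuning pipeline that uses task-sensitive attention heads as a dual compass for sample mining and structural pruning, improving the efficiency of LLM alignment.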

Comments Accepted@ICML26, 28 pages, 11 figures, 26 tables

Abstract

Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

2605.21556 2026-05-22 cs.LG

Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

Zhaoqi Zhang, Jiaming Deng, Miao Xie, Linyou Cai, Qianlong Xie, Xingxing Wang, Siqiang Luo, Gao Cong

Affiliations: Nanyang Technological University; China Agricultural University

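AI summary: This paper proposes a joint optimization framework for multi-slot guaranteed display advertising that addresses slot-level redundancy, contract imbalance, and exposure concentration. Using an offline bipartite matching formulation and a contract roulette mechanism, it improves merchant ROI, platform revenue efficiency, and the robustness of contract fulfillment.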

Comments Accepted at SIGIR Industry Track 2026

DOI: 10.1145/3805712.3808398

Abstract

Guaranteed display advertising is crucial for platform monetization, yet existing methods often operate under a single-slot assumption, limiting their ability to optimize allocation across multi-slot page views. In this paper, we propose a novel joint optimization framework for multi-slot GD allocation, addressing key challenges such as slot-level redundancy, contract imbalance, and exposure concentration. Our approach formulates the allocation as an offline bipartite matching problem with a contract roulette mechanism for slot exclusivity and Page View constraints for impression control, and incorporates a scalable allocation optimization algorithm for efficient large-scale deployment. Extensive online tests on the Meituan advertising platform demonstrate that our method significantly improves merchant ROI, platform revenue efficiency, and contract fulfillment robustness. Specifically, online A/B tests show a 28.99% increase in Average Revenue Per User under 70% traffic, and DID analysis further indicates improved contract stability, demonstrating the strong applicability and effectiveness of our framework in real-world advertising deployments.

2605.21553 2026-05-22 cs.LG cs.IT eess.IV math.IT

TONIC: Token-Centric Semantic Communication for Task-Oriented Wireless Systems

Sige Liu, Kezhi Wang

Affiliation: Department of Computer Science, Brunel University London

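AI summary: This paper proposes TONIC, a token-centric semantic communication framework for task-oriented wireless systems that combines semantic-aware protection at the transmitter with confidence-aware gating at the receiver and outperforms conventional approaches.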

Comments 15 pages, 10 figures

Abstract

Tokens are becoming the basic units through which foundation models represent and process information for understanding and inference. However, traditional wireless communication, centered on bit-level fidelity, faces a mismatch between what is transmitted reliably and what downstream models actually consume. This mismatch calls for a communication design that directly accounts for token-level task relevance and downstream model requirements, rather than treating all transmitted bits as equally important. In this paper, we propose TONIC, a token-centric semantic communication framework for task-oriented wireless systems. The transmitter converts each source sample into a sequence of tokens, estimates token-level task relevance, and allocates protection through utility-aware unequal error protection under a fixed channel-use budget. At the receiver, token-level confidence is used to gate unreliable decisions, turning harmful substitutions into recoverable erasures before a Transformer-based completion model restores the masked tokens for final task inference. Our framework combines transmitter-side semantic-aware protection with receiver-side confidence-aware gating in a modular and interpretable architecture, rather than relying solely on fully black-box end-to-end learning. We further establish a utility-aware Bayes-risk interpretation for the receiver-side gating rule and study its interaction with unequal protection and completion. Experimental results on image classification show that TONIC consistently outperforms separation-based schemes, the pixel-domain DeepJSCC baseline, and token-domain baselines under matched communication budgets over AWGN, Rayleigh, and Rician channels.
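The receiver-side gating step described above, turning low-confidence token decisions into erasures so that a completion model can fill them in, can be illustrated with a minimal sketch. The single fixed `threshold`, the `MASK` sentinel, and the function names are assumptions. The paper's utility-aware gating rule and its Transformer-based completion model are not reproduced here.

```python
MASK = None  # sentinel for an erased token position (assumed convention)


def gate_tokens(decoded_tokens, confidences, threshold):
    """Keep a decoded token only if its confidence reaches the threshold.

    A low-confidence token is replaced by MASK, so a possibly wrong
    substitution becomes an explicit erasure for a later completion step.
    """
    if len(decoded_tokens) != len(confidences):
        raise ValueError("one confidence value is required per token")
    return [
        token if confidence >= threshold else MASK
        for token, confidence in zip(decoded_tokens, confidences)
    ]


def erasure_rate(gated_tokens):
    """Share of positions that were erased by the gate."""
    if not gated_tokens:
        return 0.0
    return sum(token is MASK for token in gated_tokens) / len(gated_tokens)
```

Raising the threshold trades more erasures for fewer undetected substitutions. How to set it per token is exactly what a utility-aware rule would decide.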

2605.21552 2026-05-22 cs.LG stat.ML

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

Jinzong Dong, Zhaohui Jiang, Bo Yang

Affiliation: School of Automation, Central South University, Changsha, China

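AI summary: Addressing confidence calibration under covariate shift, this paper proposes an unsupervised domain adaptation loss, the Expectation consistency loss (ECL), which is supported both theoretically and empirically and effectively calibrates confidence on the target domain.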

Comments Accepted by ICML 2026

Abstract

Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and identically distributed, limiting their effectiveness under covariate shifts. Previous calibration methods under covariate shift struggle with class-wise or canonical calibrations and often rely on unstable importance weighting when density ratios are large or unbounded. Given the above limitations, this paper rethinks confidence calibration under covariate shifts. First, we derive a necessary and sufficient condition for confidence calibration under covariate shifts, named Expectation consistency condition, which reveals covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, utilizing Expectation consistency condition, this paper proposes an unsupervised domain adaptation loss to calibrate confidence of the target domain, named Expectation consistency loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL loss has the same sample complexity as Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL loss. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.

2605.21550 2026-05-22 cs.LG

PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting

Wangzhi Yu, Peng Zhu, Qing Zhao, Yiwen Jiang, Dawei Cheng

Affiliations: School of Computer Science and Technology, Tongji University; Big Data Center, State Grid Corporation of China

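AI summary: This paper proposes PeakFocus, a unified multi-scale framework that tackles peak localization and intensity regression in electricity load peak forecasting, addressing multi-scale representation conflict and intensity smoothing to improve forecasting accuracy.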

Abstract

Electricity load peak forecasting (ELPF), simultaneously predicting peak timing and intensity, is a prerequisite for effective grid scheduling and risk management. However, existing methods face three limitations. First, they adopt a two-stage predict-then-locate paradigm, which severs the link between temporal localization and intensity regression. Second, they still struggle with the multi-scale representation conflict, leading to peak misjudgment and timing misalignment. Third, the lack of explicit peak timing context during intensity regression causes intensity smoothing because predictions are dominated by global smoothing trends. To address these limitations, we propose PeakFocus, a unified framework for ELPF. (i) A Unified Peak-Aware Pipeline (UPAP) utilizes a triple hybrid loss to jointly supervise temporal localization and intensity regression, alongside a tolerance-based evaluation protocol. (ii) A Multi-Scale Mixing Peak Locator (MSM-PL) exploits coarse-grained features to mitigate peak misjudgment caused by local fluctuations, and injects them into fine-grained features via a cascade mechanism to resolve timing misalignment. (iii) A Location-Aware Decoder (LAD) injects peak timing context into the intensity regression process, providing explicit guidance to counteract intensity smoothing and improve peak intensity estimation. Extensive experiments on the public Electricity (ELC) dataset and our industrial-scale World Large-scale Electricity Load (WLEL) dataset show that PeakFocus outperforms baselines in both timing precision and intensity estimation.

2605.21544 2026-05-22 cs.LG eess.SP

Tabular foundation models for robust calibration of near-infrared chemical sensing data

Robin Reiter, Denis Cornet, Fabien Michel, Lauriane Rouan, Gregory Beurier

Affiliations: CIRAD; UMR AGAP Institut; Université de Montpellier; INRAE; Institut Agro; LIRMM

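AI summary: This paper studies tabular foundation models for calibrating near-infrared chemical sensing data. Comparing models on regression and classification tasks, it finds that preprocessing-optimized TabPFN performs best in regression, while TabPFN applied directly to raw spectra performs best in classification. It also notes that classical chemometric models remain competitive on spectral outliers and extrapolated samples.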

Comments 56 pages, 17 figures, including supplementary material

Abstract

Near-infrared spectroscopy is increasingly used as a rapid, non-destructive chemical sensing technology for the analysis of food, pharmaceutical, biological, and environmental samples. However, the practical deployment of NIR sensors still depends on calibration models able to handle high-dimensional, collinear spectra, limited sample sizes, preprocessing dependence, spectral outliers, and extrapolation beyond the calibration domain. Here, we evaluate whether tabular foundation models can provide a new calibration strategy for NIR chemical sensing. We benchmark TabPFN on 66 NIR datasets covering 54 regression and 12 classification tasks, and compare direct inference on raw spectra with preprocessing-optimized inference against PLS/PLS-DA, Ridge, CatBoost, and one-dimensional convolutional neural networks. The study uses a unified validation framework in which preprocessing and model selection are performed exclusively on calibration data before external test evaluation. In regression, preprocessing-optimized TabPFN achieves the best overall average rank and significantly outperforms PLS, CatBoost, TabPFN on raw spectra, and CNN-1D, while remaining statistically comparable to Ridge. In classification, TabPFN applied directly to raw spectra provides the best average rank, with performance close to the optimized variant. Robustness analyses show that TabPFN provides strong average predictive performance but that its advantage decreases on spectral outliers and extrapolated samples, where classical chemometric models remain competitive. These results suggest that tabular foundation models can complement established chemometric workflows for NIR chemical sensing, especially in small- to medium-sized calibration settings, while highlighting the need for spectroscopy-specific priors and uncertainty-aware deployment strategies.

2605.21543 2026-05-22 cs.LG

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

面向多个大语言模型基准评测的可证明联合去污染

Zhenlong Liu, Hao Zeng, Hongxin Wei

发表机构 * Department of Statistics and Data Science, Southern University of Science and Technology（南方科技大学统计与数据科学系）； Shanghai Innovation Institute（上海创新研究院）

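AI summary: This paper proposes a provable decontamination method for benchmarking multiple large language models. A joint selection procedure controls the global contamination rate, which makes cross-model comparisons more reliable.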

AI中文摘要

在LLM评估中，基准数据污染已成为关键挑战：当评估示例出现在一个或多个受审模型的训练数据中时，报告性能可能被夸大，跨模型比较变得不可靠。大量训练数据检测工作设计了评分来量化模型对给定数据点的记忆程度，但这些基于评分的方法缺乏理论保证。最近的符合方法为单个模型提供了可证明的假识别控制；然而，分别应用它们到每个模型会产生模型特定的基准，破坏跨模型的公平比较。在本文中，我们将多模型基准去污染正式化为一个联合选择问题，并提出联合包络符合选择（JECS），一种符合程序，能够在给定假设下实现全局污染率（GCR）控制。具体而言，JECS计算每个模型的符合p值，通过每个项目的最大值进行汇总，并从高于数据驱动阈值的右尾观测中重建一个保守的包络最大p空分布。通过将自适应Benjamini-Hochberg（BH）程序应用于包络重新缩放值，我们选择了一个具有可证明GCR控制的基准。在各种模型和基准上的广泛实验表明，JECS在保持目标GCR控制的同时，比max-p基线具有更高的功效。

英文摘要

Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently maintaining the target GCR control.
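The max-p baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' JECS procedure: it omits the envelope reconstruction and uses the plain (non-adaptive) Benjamini-Hochberg rule. It assumes, hypothetically, that each model has calibration scores from items known to be contaminated and that scores are oriented so that larger values indicate less memorization; the function names are invented for this example.

```python
import numpy as np

def conformal_p_values(cal_scores, test_scores):
    """Conformal p-value per test item: smoothed fraction of calibration
    scores that are at least as large as the test score."""
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    n_at_least = (cal[None, :] >= test[:, None]).sum(axis=1)
    return (1.0 + n_at_least) / (len(cal) + 1.0)

def benjamini_hochberg(p_values, alpha):
    """Boolean mask of hypotheses rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected

def max_p_select(cal_scores_per_model, test_scores_per_model, alpha=0.1):
    """Select items whose per-item maximum p-value across models survives BH.
    (Assumption: rejecting the null means 'clean for every audited model'.)"""
    per_model = [
        conformal_p_values(cal, test)
        for cal, test in zip(cal_scores_per_model, test_scores_per_model)
    ]
    max_p = np.max(np.stack(per_model, axis=0), axis=0)
    return benjamini_hochberg(max_p, alpha)
```

Taking the per-item maximum is conservative: an item is kept only if it looks clean for every model, which is the power loss that the abstract says JECS is designed to reduce.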

2605.21542 2026-05-22 cs.LG

Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

发现实体-条件滞后异质性：一种用于面板时间序列的滞后门神经审计框架

Andi Xu

发表机构 * School of Engineering, Jönköping University（延雪平大学工程学院）

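AI summary: This paper proposes AC-GATE, a lag-gated neural audit framework for panel time series that addresses how different entities respond to historical signals over different time horizons. By introducing an adaptive-conditioning encoder and a scale-invariant lag gate, it discovers lag heterogeneity and exposes it as a structural output.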

Comments Preprint/technical paper. An interpretable neural audit framework for entity-conditioned lag discovery in panel time series. 10 pages, 5 figures, 16 tables. Code available at the GitHub repository

AI中文摘要

国家层面的时间面板被广泛用于实证分析。研究人员经常需要审计不同实体在不同时间跨度上对历史信号的响应。当前方法通常无法直接提供可审计的实体特定滞后汇总。我们将其公式化为时间面板挖掘任务，并提出AC-GATE，一种具有尺度不变滞后门的适应性编码器。它通过使用可观察的实体层面代理来条件化历史观测的滞后权重分布，从而将有效的滞后作为模型的结构输出，而不是事后解释。评估基于分层审计协议，将预测校准与滞后发现分开。使用具有已知真实滞后的人工面板进行机制恢复测试，并使用两个现实世界的国家层面面板进行外部审计和压力测试。结果表明，AC-GATE可以在合成数据中恢复异质滞后结构，并在真实数据中生成非退化的、结构化的有效滞后。

英文摘要

Country-level temporal panels are widely used in empirical analysis. Researchers often need to audit how different entities respond to historical signals over different time horizons. Current approaches typically do not provide directly auditable entity-specific lag summaries. We formulate entity-conditioned heterogeneous lag discovery as a temporal panel mining task and propose AC-GATE, an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate. It instantiates conditional Moderated Distributed Lag by using observable entity-level proxies to condition lag-weight distributions over historical observations, thereby making effective lags structural outputs of the model rather than post-hoc explanations. The evaluation is based on a layered audit protocol that separates predictive calibration from lag discovery. A synthetic panel with known ground-truth lags is used for mechanism recovery testing, and two real-world country-level panels are used for external audit and stress testing. The results show that AC-GATE can recover heterogeneous lag structure in synthetic data and generate non-degenerate, externally structured effective lags in real data.

2605.21539 2026-05-22 cs.LG

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

DualOptim+: 连接共享与解耦的优化器状态以改进大语言模型中的机器遗忘

Xuyang Zhong, Qizhang Li, Yiwen Guo, Chen Liu

发表机构 * Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； Independent Researcher（独立研究者）

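AI summary: This paper proposes DualOptim+, a new optimization framework for improving machine unlearning in large language models. It introduces a base state and a delta state to balance the forgetting and retention objectives, and proposes an 8-bit quantized variant to reduce memory overhead. Experiments show strong results across multiple tasks.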

Comments Accepted by ICML 2026

2605.21538 2026-05-22 cs.SD

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

学术文本到音乐大奖赛：数据集、基线和评估方法

Fang-Chih Hsieh, Wei-Jaw Lee, Chun-Ping Wang, Hung-yi Lee, Hao-Wen Dong, Yi-Hsuan Yang

发表机构 * Artificial Intelligence Center of Research Excellence, National Taiwan University, Taipei, Taiwan（台湾大学人工智能研究中心）； Department of Performing Arts Technology, University of Michigan, Ann Arbor, MI, United States（密歇根大学表演艺术技术系）

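AI summary: This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Although text-to-music generation (TTM) systems have progressed rapidly, the field is dominated by models trained on massive proprietary datasets with industrial-scale compute, which creates a significant barrier for academic research. The ATTM Challenge therefore sets up a fair benchmark in which participants train generative models from scratch on a standardized, CC-licensed, instrumental-only subset of the MTG-Jamendo dataset. It has two tracks: an Efficiency Track (limited to 500M parameters) and a Performance Track (no parameter limit). Submissions go through a multi-stage evaluation with objective metrics (Fréchet Audio Distance, CLAP score, and a new Concept Coverage Score, CCS) followed by a subjective listening test. Open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for FAD and CLAP are provided to promote TTM research in academic settings.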

Comments Accepted to IEEE ICME 2026 Grand Challenge Paper

AI中文摘要

本文介绍了ICME 2026学术文本到音乐生成大奖赛（ATTM）的概述和技术框架。尽管文本到音乐生成（TTM）系统取得了快速进展，但该领域目前主要由在大规模专有数据集上训练的模型主导，这些模型使用工业级计算资源，给学术研究带来了显著障碍。为此，ATTM挑战赛建立了一个公平的基准，要求参赛者使用标准化的、采用CC许可的MTG-Jamendo数据集子集（仅包含纯音乐）从头开始训练生成模型。该挑战分为两个赛道：效率赛道（限制在5亿参数以内）和性能赛道（无参数限制）。提交将通过多阶段评估过程进行评估，包括客观指标，如Fréchet音频距离、CLAP分数和新的概念覆盖分数（CCS），随后进行主观听觉测试。通过提供开源基线、预处理管道、参考标题和公开计算FAD和CLAP的评估代码，该挑战旨在促进学术环境中的TTM研究。

英文摘要

This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, creating a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated through a multi-stage process involving objective metrics, including Frechet Audio Distance, CLAP score, and a novel Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, this challenge aims to facilitate and promote TTM research in academic contexts.

2605.21516 2026-05-22 cs.LG cs.AI

Harnesses for Inference-Time Alignment over Execution Trajectories

在执行轨迹上进行推理时间对齐的工具

Boyuan Wang, Bochao Li, Minghan Wang, Yuxin Tao, Fang Kong

发表机构 * GitHub

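AI summary: This paper studies harness design for inference-time alignment over execution trajectories, where task decomposition and guided execution are used to improve long-term performance. It finds that more elaborate decomposition and guidance do not always yield better results, separates harness design into these two mechanisms, and validates the findings with synthetic experiments and real terminal agent benchmarks.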

英文摘要

Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This separation allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.

2605.21515 2026-05-22 cs.LG cs.AI

Predicting Performance of Symbolic and Prompt Programs with Examples

通过示例预测符号程序和提示程序的性能

Chengqi Zheng, Keya Hu, Shuzhi Liu, Tao Wu, Kevin Ellis, Yewen Pu

发表机构 * Nanyang Technological University, Singapore（南洋理工大学，新加坡）； Massachusetts Institute of Technology, USA（麻省理工学院，美国）； Cornell University, USA（康奈尔大学，美国）

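AI summary: This paper studies predicting program performance from examples. It proposes an approach based on a simple coin-flip model that uses observed execution outcomes and a performance prior, and develops RAP, which constructs a proxy prior to improve prediction.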

AI中文摘要

LLM提示广泛用于自然陈述的任务，但其不可靠，可能在少数测试用例上成功但在部署时失败。我们研究了性能预测：给定一个程序（例如符号程序或在LLM上执行的提示程序）和少量领域内示例，预测其在未见任务上的性能。我们使用一个简单的硬币翻转模型，将每次通过/失败的程序执行视为伯努利随机变量，其成功概率是程序未知的性能。在该模型中，性能完全取决于：1）在测试用例上观察到的执行结果，以及2）性能的先验分布。我们从多样化的程序和任务语料库中编译了经验性能先验，并发现符号程序（例如Python）都是全或无的，而提示程序具有弥漫的先验，有许多几乎正确的程序。这种差异解释了为什么少数通过测试可以认证符号程序但不能认证提示程序。基于这一见解，我们开发了RAP（检索近似先验），通过从现有语料库中检索相似任务和提示程序来构建代理先验，然后用于预测性能。我们展示了RAP实现了稳健的性能。

英文摘要

LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g., Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program's unknown performance. In this model, the predicted performance depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all-or-nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show that RAP achieves solid performance.
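The coin-flip model above can be made concrete with a small Bayesian update. This is a sketch under stated assumptions, not the paper's RAP method: the prior is a hypothetical discrete distribution over candidate performance values, and the two example priors in the usage note are invented for illustration rather than taken from the paper's corpus.

```python
import numpy as np

def posterior_mean_performance(prior_support, prior_weights, passes, trials):
    """Posterior mean of a program's pass probability under a discrete prior,
    after observing `passes` successes in `trials` Bernoulli executions."""
    if not 0 <= passes <= trials:
        raise ValueError("passes must lie between 0 and trials")
    theta = np.asarray(prior_support, dtype=float)   # candidate performances
    weights = np.asarray(prior_weights, dtype=float)  # prior mass on each
    likelihood = theta**passes * (1.0 - theta) ** (trials - passes)
    posterior = weights * likelihood
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("observations are impossible under this prior")
    posterior /= total
    return float((theta * posterior).sum())
```

With an all-or-nothing prior such as support `[0.0, 1.0]`, three passing tests out of three push the posterior mean to 1, whereas a diffuse prior such as support `[0.5, 0.9]` leaves the estimate well below 1 after the same evidence. That mirrors the abstract's point about why a few passing tests certify symbolic programs but not prompt programs.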

2605.21496 2026-05-22 cs.LG cs.AI cs.CL

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

HealthCraft: 一种用于急救医学的强化学习安全环境

Brandon Dent

发表机构 * GOATnote Inc.（GOATnote公司）

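AI summary: This paper presents HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions. Using a FHIR R4 world state, 24 MCP tools, and a dual-layer rubric, it evaluates model safety and performance on emergency-medicine tasks and reveals safety failures in multi-step workflows.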

Comments 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: https://github.com/GOATnote-Inc/healthcraft

AI中文摘要

前沿语言模型被部署到临床工作流程的速度超过了评估它们安全性的基础设施。静态医学问答基准测试忽略了急救医学中至关重要的失败模式：轨迹级安全崩溃、工具误用和在持续临床压力下的屈从。我们提出了HealthCraft，首个公开的强化学习环境，该环境在真实急救医学条件下奖励轨迹级安全，源自Corecraft。它基于FHIR R4世界状态，包含14个实体类型和3,987个种子实体，暴露24个MCP工具，并定义了双层评估标准，只要任何安全关键性标准被违反，就会将奖励设为零。我们发布了195个任务，涵盖六个类别，根据2,255个二元标准（其中515个为安全关键性标准）进行评分；一个事后10任务负类列表将此扩展到205个任务和2,337个标准。在两个前沿模型上的V8结果表明，Claude Opus 4.6在Pass@1达到24.8% [21.5-28.4]，GPT-5.4为12.6% [10.2-15.6]，安全失败率为27.5%和34.0%。在多步骤工作流——最接近真实急救护理的代理——中，性能降至接近零（Claude 1.0%，GPT-5.4 0.0%），尽管在单个步骤上部分具备能力。在试点v2和v8之间修复了六个基础设施错误，重新排列了哪些模型“看起来更强”，这表明基础设施的保真度是测量的一部分。一个确定性的LLM-判断器叠加限制了评估者的噪声，并且一个60次负类烟雾试点显示奖励信号不是可直接用于训练的安全：限制标准通过率为0.929的患病率，这在评估工具可以容忍但训练奖励不能。我们搭建了与Corecraft第5.2节中的Megatron+SGLang+GRPO循环的耦合，并将训练奖励的消融作为未来的工作。环境、任务、评估标准和工具均在Apache 2.0下发布。

英文摘要

Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
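The dual-layer rubric described above (reward is zeroed whenever any safety-critical criterion is violated) can be sketched as below. The abstract specifies only the zeroing rule; the base score used here, the fraction of binary criteria passed, is an assumption, and `Criterion` and `rubric_reward` are hypothetical names rather than part of the HealthCraft codebase.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class Criterion:
    name: str
    passed: bool
    safety_critical: bool = False

def rubric_reward(criteria: Sequence[Criterion]) -> float:
    """Two-layer score: a safety gate, then a quality score.

    Layer 1: any failed safety-critical criterion forces the reward to 0.
    Layer 2 (assumed): otherwise, reward is the fraction of criteria passed.
    """
    if not criteria:
        return 0.0
    if any(c.safety_critical and not c.passed for c in criteria):
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)
```

A gate of this kind means that partial competence on ordinary criteria cannot offset a single unsafe action within a trajectory.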

2605.21494 2026-05-22 cs.LG

Double descent for least-squares interpolation on contaminated data: A simulation study

过拟合模型的最小二乘插值在受污染数据中的双下降现象：一项模拟研究

Tino Werner

发表机构 * Institute for Mathematics, Carl von Ossietzky University Oldenburg（奥尔登堡卡尔·冯·奥西特齐克大学数学研究所）

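AI summary: This paper examines whether double descent occurs in linear regression on contaminated data by comparing the least-squares interpolation estimator with several robust alternatives. It finds that large overparametrization does produce double descent, giving the least-squares interpolator better generalization than the robust alternatives.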

AI中文摘要

过参数化模型尽管根据经典统计理论应容易过拟合，但能表现出出色的泛化性能。双下降现象的发现，即在达到一定模型复杂度后泛化误差减小，开辟了新的研究方向。稳健统计考虑在受污染数据上的统计估计，由于现实数据不满足假设，导致数据点相对于假设的“理想”分布出现异常值，可能严重扭曲任何经典估计器。本文探讨在受污染训练数据的线性回归设置中是否会出现双下降现象。比较了高度非鲁棒的最小二乘插值估计器与几种鲁棒替代方法的性能。结果表明，大规模过参数化确实导致双下降现象，使最小二乘插值器的泛化性能非常优异，优于鲁棒替代方法。

英文摘要

Overparametrized models can exhibit excellent generalization performance, although they should be prone to overfitting according to classical statistical theory. The discovery of "double descent", in which the generalization error decreases again after a certain model complexity has been reached, opened a new line of research. Robust statistics considers statistical estimation on contaminated data: because model assumptions do not hold on real data, some data points appear as outliers with respect to the assumed "ideal" distribution, potentially severely distorting any classical estimator. We address the question of whether a double descent phenomenon can be observed in a linear regression setting with contaminated training data. We compare the performance of the highly non-robust least-squares interpolation estimator with several robust alternatives. It turns out that large overparametrization indeed allows for a double descent phenomenon, resulting in a very good generalization performance of the least-squares interpolator, surpassing that of the robust alternatives.
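A simulation of this kind can be sketched with the minimum-norm least-squares interpolator, computed through the pseudoinverse. The setup below is an assumption made for illustration and is not the paper's design: Gaussian features, a dense true coefficient vector, and contamination modeled as a fixed shift added to a fraction of the training responses. The robust alternatives are not included.

```python
import numpy as np

def min_norm_least_squares(X, y):
    """Minimum-norm least-squares solution; it interpolates the training
    data whenever X has full row rank (more features than samples)."""
    return np.linalg.pinv(X) @ y

def simulate_test_error(widths, n_train=40, n_test=500, d=200,
                        outlier_frac=0.1, seed=0):
    """Test MSE of the interpolator when only the first p features are used,
    for each p in `widths`, with outliers injected into training responses."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d) / np.sqrt(d)
    X_train = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    y_train = X_train @ beta + 0.1 * rng.normal(size=n_train)
    y_test = X_test @ beta
    n_outliers = int(outlier_frac * n_train)
    y_train[:n_outliers] += 10.0  # assumed contamination: gross response shifts
    errors = []
    for p in widths:
        coef = min_norm_least_squares(X_train[:, :p], y_train)
        errors.append(float(np.mean((X_test[:, :p] @ coef - y_test) ** 2)))
    return errors
```

Calling `simulate_test_error(range(5, 201, 5))` and plotting the result against the number of features is one way to look for the characteristic peak near `p = n_train`.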

2605.21493 2026-05-22 cs.LG cs.AI cs.CV

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

不要压缩你的特征：为什么CenterLoss伤害OOD检测和多尺度Mahalanobis获胜

Rahul D Ray

发表机构 * Department of Electronics and Electrical Engineering（电子与电气工程系）

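AI summary: This paper proposes GOEN, which improves OOD detection by combining multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head. It finds that CenterLoss degrades OOD detection, and that GOEN-NoCenterLoss outperforms the other baselines on CIFAR-10 benchmarks.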

AI中文摘要

检测分布外（OOD）输入的能力是安全部署机器学习系统的基础。然而，当前方法往往依赖于仅优化分类准确性的特征表示，忽略了epistemic不确定性的要求。我们引入GOEN（几何优化的epistemic网络），一种结合多尺度特征、L2归一化、Mahalanobis距离和使用真实硬OOD示例训练的校准头的简单流程。通过系统消融，我们发现一个反直觉的发现：CenterLoss，一种用于特征紧凑性的流行正则化器，显著降低了OOD检测性能，尽管提高了分类准确性。最佳变体GOEN-NoCenterLoss在CIFAR-10基准上实现了0.9483的平均OOD AUROC，超过了包括深度集成（0.8827）、KNN（0.8967）和ODIN（0.8870）在内的所有基线方法，同时保持了有竞争力的分布内准确性。我们的结果挑战了普遍认为更好的分类几何自动导致更好的epistemic不确定性假设。相反，我们展示了过于紧致的特征簇会压缩类间边缘并扭曲所需的有效OOD检测的协方差结构。GOEN是高效的，在单个GPU上训练不到20分钟，并提供了一种构建可靠识别自身局限的AI系统的实用蓝图。

英文摘要

The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.
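Two of the ingredients named above, L2 normalisation and Mahalanobis distance, can be sketched on a single feature scale as follows. This is not the GOEN pipeline: the multi-scale fusion and the calibration head are omitted, a shared (tied) covariance is assumed, and the function names and the small ridge term are choices made for this example.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each feature vector to unit Euclidean norm."""
    f = np.asarray(features, dtype=float)
    return f / np.maximum(np.linalg.norm(f, axis=1, keepdims=True), eps)

def fit_class_gaussians(features, labels, ridge=1e-6):
    """Class means and a shared precision matrix on normalised features."""
    z = l2_normalize(features)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([z[labels == c].mean(axis=0) for c in classes])
    centered = z - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(z) + ridge * np.eye(z.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_ood_score(features, means, precision):
    """Squared Mahalanobis distance to the nearest class mean.
    Larger scores suggest the input is further from the training classes."""
    z = l2_normalize(features)
    diff = z[:, None, :] - means[None, :, :]
    d2 = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return d2.min(axis=1)
```

Because the score depends on the estimated covariance, anything that reshapes the within-class scatter of the features also changes the detector, which is consistent with the abstract's argument about overly tight clusters.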

2605.21492 2026-05-22 cs.LG cs.AI cs.LO stat.ML

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

特征归因不可能性：在共线性下，没有任何特征排名是忠实、稳定和完整的

Drake Caraker, Bryan Arnold, David Rhoads

发表机构 * Independent Researchers（独立研究人员）

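AI summary: This paper studies the impossibility of feature ranking under collinearity, proving that no ranking can be faithful, stable, and complete at once. It proposes DASH as a resolution and uses formal verification to support the theory and its practical implications.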

Comments 66 pages, 12 figures, 305 Lean 4 theorems. Code at https://github.com/DrakeCaraker/dash-impossibility-lean

DOI: 10.5281/zenodo.19468379

AI中文摘要

在共线性情况下，没有任何特征排名可以同时忠实、稳定和完整。对于共线性对，排名本质上等同于抛硬币。我们证明了这一不可能性，针对四种模型类别进行了量化分析，通过集成平均（DASH）方法解决该问题，并利用305个Lean 4定理进行机验证。我们刻画了完整的归因设计空间：恰好存在两种方法家族——忠实-完整方法（不稳定，排名可能翻转多达50%的时间）和集成方法如DASH（稳定，对称特征报告平局）。归因比在梯度提升中发散为1/(1-rho^2)，在Lasso中为无穷大，在随机森林中收敛。DASH（Diversified Aggregation of SHAP）在无偏聚合中被证明是帕累托最优的，达到Cramer-Rao方差下界并具有紧的集成大小公式。在77个公共数据集中，68%表现出归因不稳定性。在特征具有相等因果效应时，切换到条件SHAP无法逃脱这一不可能性。该框架包括实用的诊断工具——Z检验工作流程和单模型筛查工具——并直接影响公平性审计：基于SHAP的代理歧视审计在共线性下被证明不可靠。设计空间定理、诊断和不可能性均在Lean 4中形式化验证（305个定理从16个公理，0 sorry）——据我们所知，这是可解释AI领域首个形式化验证的不可能性。

英文摘要

No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine-verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist -- faithful-complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) -- and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1-rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto-optimal among unbiased aggregations, achieving the Cramer-Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics -- a Z-test workflow and single-model screening tool -- and has direct consequences for fairness auditing: SHAP-based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) -- to our knowledge, the first formally verified impossibility in explainable AI.

2605.21491 2026-05-22 cs.LG cs.AI cs.CL

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

通过比较想法评估教授语言模型预测研究成功的技巧

Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

发表机构 * IISER Pune（印度科学教育与研究学院浦那分校）

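AI summary: This study asks whether language models can forecast the empirical success of research ideas without running experiments. It builds a dataset of 11,488 idea pairs grounded in objective PapersWithCode outcomes, reports that reinforcement learning raises accuracy to 71.35%, and argues that small language models can serve as effective objective verifiers, offering a scalable path for autonomous scientific discovery.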

Comments ACL 2026 Findings

AI中文摘要

随着语言模型通过自动化假设生成和实现加速科学研究，出现了一个新的瓶颈：在没有彻底实验的情况下评估和过滤数百个AI生成的想法。我们问语言模型是否能学会在任何实验运行之前预测研究想法的实证成功。我们研究了比较实证预测：给定一个基准特定的研究目标和两个候选想法，预测哪个将实现更好的基准性能。我们构建了一个基于PapersWithCode客观结果的11,488对想法数据集。尽管现成的8B参数模型表现不佳（30%准确率），SFT显著提升了性能至77.1%，优于GPT-5（61.1%）。通过将评估框架为推理任务，通过可验证奖励的强化学习（RLVR），我们训练模型发现潜在的推理路径，实现71.35%的准确率，并具有可解释的依据。通过额外的消融和分布外测试，我们展示了对表面启发式的鲁棒性，并转移到了跨领域时间拆分测试集和独立构建的测试集。我们的结果表明，计算高效的轻量级语言模型可以作为有效的、客观的验证器，为自主科学发现提供可扩展的路径。

英文摘要

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

2605.21490 2026-05-22 cs.LG cs.CR

Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

用于金融犯罪检测的时间对比Transformer：通过预测对比编码实现自监督序列嵌入

Danny Butvinik, Yonit Marcus, Nitzan Tal, Gabrielle Azoulay

发表机构 * NICE Actimize

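AI summary: This paper proposes the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture temporal dynamics in sequences of financial transactions. The model is trained with a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time in support of downstream fraud detection. Experiments show that the embeddings alone achieve meaningful predictive performance (AUC 0.8644), but combining them with domain-engineered features yields no notable gain (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. The findings position TCT as a promising representation learning approach that captures relevant behavioral signal while highlighting how hard it is to add value on top of strong domain features.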

Comments 10 pages, 4 figures, one table

AI中文摘要

我们介绍了一种时间对比变压器（TCT），一种旨在捕捉金融交易序列中上下文时间动态的表示学习框架。该模型通过自监督对比目标进行训练，以生成编码时间行为模式的嵌入，以支持下游的欺诈检测任务。我们通过将学习到的嵌入作为输入特征送入梯度提升分类器，在现实环境中评估TCT。实验结果表明，仅使用嵌入本身就能实现有意义的预测性能（AUC 0.8644），表明模型能够捕捉非平凡的时间结构。然而，当结合领域工程特征时，与基线相比没有可观的提升（AUC 0.9205 vs. 0.9245），表明学习到的表示与现有特征抽象有较大重叠。这些发现将TCT定位为一种有前景的表示学习方法，能够捕捉相关的行为信号，同时凸显了在强领域特征上实现加性价值的挑战。这些结果反映了时间表示学习在金融犯罪检测中的发展中间阶段，并激励进一步研究模型架构、训练目标和整合策略。在这一早期阶段，实现与强特征工程基线相当的性能本身就是一个有意义的结果，表明学习到的表示可以近似于领域特定的特征，而无需手动工程。虽然尚未达到生产就绪状态，但这些结果指出了减少对特征工程依赖的有希望的方向。

英文摘要

We introduce the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture contextual temporal dynamics in sequences of financial transactions. The model is trained using a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time, with the goal of supporting downstream fraud detection tasks. We evaluate TCT in a realistic setting by using the learned embeddings as input features to a gradient boosting classifier. Experimental results show that embeddings alone achieve meaningful predictive performance (AUC 0.8644), indicating that the model captures non-trivial temporal structure. However, when combined with domain-engineered features, no measurable improvement is observed over the baseline (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. These findings position TCT as a promising representation learning approach that captures relevant behavioral signal, while highlighting the challenges of achieving additive value over strong domain features. The results reflect an intermediate stage in the development of temporal representation learning for financial crime detection and motivate further research on model architecture, training objectives, and integration strategies. At this early stage, achieving performance comparable to a strong feature-engineered baseline is itself a meaningful outcome, indicating that learned representations approximate domain-specific features without manual engineering. While not yet production-ready, these results point to a promising direction for reducing reliance on feature engineering in financial crime detection.
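The abstract does not spell out the contrastive objective. A common choice for predictive contrastive coding is the InfoNCE loss, sketched below as an assumption rather than as the TCT training loss. Each context embedding is paired with the target embedding at the same row index, and the other rows in the batch act as negatives; the function name and the temperature value are illustrative.

```python
import numpy as np

def info_nce_loss(context, targets, temperature=0.1):
    """InfoNCE loss for a batch of (context, target) embedding pairs.
    Row i of `targets` is the positive for row i of `context`."""
    c = np.asarray(context, dtype=float)
    t = np.asarray(targets, dtype=float)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = c @ t.T / temperature                    # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))
```

The loss is low when each context embedding is most similar to its own target and high when the pairing is scrambled, which is what pushes the encoder to capture sequence-specific temporal structure.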

2605.21282 2026-05-22 cs.LG cs.AI

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

随机MeanFlow策略：基于熵镜像下降的一步生成式控制

Zeyuan Wang, Da Li, Yulin Chen, Yuehu Gong, Yanming Guo, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu

发表机构 * Laboratory for Big Data and Decision（大数据与决策实验室）； National University of Defense Technology（国防科技大学）； Samsung AI Center Cambridge（三星AI研究中心）； Queen Mary University of London（伦敦玛丽女王大学）； Fudan University（复旦大学）； ShanghaiTech University（上海科技大学）； Shanghai Innovation Institute（上海创新研究院）

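AI summary: This paper proposes Stochastic MeanFlow Policies (SMFP), which map Gaussian noise to actions through a MeanFlow transformation, yielding a trainable generative policy that achieves exploratory yet stable improvement within an off-policy mirror descent framework.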

AI中文摘要

在线离线策略强化学习（RL）受到两个耦合选择的影响：策略类和更新规则。高斯策略速度快且具有可计算的熵，但难以处理多模态动作分布。生成策略更具表现力，但通常需要迭代采样或缺乏可计算的熵估计。在优化方面，SAC风格的软策略改进和镜降（MD）可以视为最小化不同的KL散度：前者将策略推向价值诱导的玻尔兹曼分布，后者则通过之前的策略正则化每个更新。将熵正则化与MD约束结合因此具有吸引力，因为它支持探索并稳定策略改进；然而，所得到的目标可能是多模态的，且与单峰高斯策略不匹配。我们提出随机均值流策略（SMFP），一种一步生成策略类，通过均值流变换将高斯噪声映射到动作。这种随机重参数化产生了一个可计算的熵替代物，并允许均值流策略在离线策略镜降框架下通过统一的目标进行训练，以实现探索性且稳定的改进。在七个MuJoCo基准测试中，SMFP在高斯和生成基线之上取得了改进，同时保留了单步推断效率。

英文摘要

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.

2605.21079 2026-05-22 cs.CV

VDFP: Video Deflickering with Flicker-banding Priors

VDFP：基于闪烁带先验的视频去闪烁

Zhiyi Zhou, Libo Zhu, Zihan Zhou, Yulun Zhang, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

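AI summary: This paper proposes VDFP, a video deflickering framework based on flicker-banding priors. By constructing the DeViD dataset and introducing the DFM and CPP modules, it addresses banding artifacts in screen capture. Experiments show that it outperforms existing methods in deflickering quality and spatio-temporal consistency.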

Comments Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP

AI中文摘要

使用智能手机捕捉数字屏幕时，由于硬件同步不匹配，经常会产生严重的带状伪影。现有的视频修复方法难以处理这些结构化、周期性的亮度波动，通常导致残留伪影或过度平滑的纹理。我们首先构建了DeViD数据集，以应对可用数据集不足的问题。然后我们提出了VDFP（Video Deflickering with Flicker-banding Priors），一种新颖的感知引导生成框架。首先，我们引入了一种基于滚动快门机制的退化场建模（DFM），能够合成复杂的多带状场景。其次，我们提出了空间-时间连续先验感知（CPP）。不同于传统的二元分割，该模块通过闪烁感知的均方误差（FA-MSE）进行优化，以捕捉亮度过渡。通过零初始化增强的输入层，我们的模型保留了预训练的生成先验以及空间-时间先验感知。广泛的实验表明，VDFP在去闪烁效果和时空一致性方面显著优于其他方法，能够高效消除复杂的带状伪影并保留高保真的空间细节。我们的数据集和代码将在https://github.com/ZhiyiZZhou/VDFP上发布。

英文摘要

Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We first construct DeViD, a real-world dataset covering various scenes, to address the lack of available datasets. We then propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM), which is capable of synthesizing complex multi-banding scenarios. Second, we present spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding while preserving high-fidelity spatial details and temporal consistency. Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP.

2605.20514 2026-05-22 cs.LG cs.NA math.NA stat.ML

Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

Dan DeGenaro, Xin Li, Obed Amo, Michael Pokojovy, Sarah Adel Bargal, Markus Lange-Hegermann, Bogdan Raiţă

Affiliations: Department of Computer Science, Georgetown University; Department of Mathematics, Georgetown University; Department of Mathematics and Statistics, Old Dominion University; School of Data Science, Old Dominion University; Institute Industrial IT, Department of Computer Science and Automation, OWL University of Applied Sciences and Arts

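AI summary: This paper proposes FLASH-MAX, a neural network architecture that predicts homogeneous electromagnetic fields from sparse point observations. The architecture satisfies Maxwell's equations by symbolic construction, which allows fast training from sparse data while keeping the PDE residual at zero, and improves the trade-off between accuracy and optimization speed in scientific machine learning.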

Comments 31 pages, 8 figures
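
To make the idea of satisfying Maxwell's equations "by construction" concrete, here is a standard textbook example, not necessarily the construction used by FLASH-MAX. It assumes source-free fields in vacuum; the symbols $\mathbf{E}_0$, $\mathbf{k}$, and $\omega$ are illustrative.

```latex
% Source-free Maxwell equations in vacuum
\nabla \cdot \mathbf{E} = 0, \qquad
\nabla \cdot \mathbf{B} = 0, \qquad
\nabla \times \mathbf{E} = -\partial_t \mathbf{B}, \qquad
\nabla \times \mathbf{B} = \frac{1}{c^2}\, \partial_t \mathbf{E}.

% Plane-wave ansatz with phase \varphi = \mathbf{k} \cdot \mathbf{x} - \omega t
\mathbf{E}(\mathbf{x}, t) = \mathbf{E}_0 \cos\varphi, \qquad
\mathbf{B}(\mathbf{x}, t) = \frac{\mathbf{k} \times \mathbf{E}_0}{\omega} \cos\varphi.

% The ansatz solves all four equations exactly whenever
\mathbf{k} \cdot \mathbf{E}_0 = 0 \quad \text{and} \quad \omega = c\, \lvert \mathbf{k} \rvert .
```

Because the equations are linear, any finite sum of such waves is also an exact solution. A model whose outputs are restricted to a family like this has zero PDE residual for every parameter setting, so only the fit to the observed data remains to be optimized.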

2605.20303 2026-05-22 cs.LG

AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

Zhijie Yang, Min Tang, Peng Du, Qiang Zou

Affiliations: State Key Laboratory of CAD & CG, Zhejiang University, Hangzhou, 310027, China

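AI summary: This paper proposes AirfoilGen, a new airfoil generation model. It introduces a circle sweeping representation that constrains the generative process so that generated airfoils respect essential airfoil characteristics, operates in a learned latent space to give explicit control over aerodynamic performance, and provides a new dataset of over 200,000 airfoils.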

Comments 15 pages

Abstract

Airfoil shape design is a fundamental task in aerospace engineering, with a direct impact on flight stability and fuel consumption. Deep learning has recently emerged as a promising tool for this task, but existing deep generative approaches remain limited in both geometric validity and physical controllability. They offer little control over the generated shapes, yielding invalid geometries, and they typically do not condition effectively on aerodynamic performance. To address these issues, this paper proposes AirfoilGen, a valid-by-construction and performance-aware latent diffusion model for airfoil generation. It first introduces a novel airfoil representation scheme, the circle sweeping representation, to constrain the generative process so that output shapes respect essential airfoil characteristics. It then enables explicit control over aerodynamic performance (e.g., lift and drag coefficients) by operating in a learned latent space: a transformer model encodes airfoil shapes into vector embeddings, and a conditional diffusion model denoises Gaussian noise into these latent embeddings while incorporating target aerodynamic performance. In addition, this paper presents a new dataset of over 200,000 airfoils, which is substantially larger than the widely used UIUC airfoil dataset (1,650 airfoils) and more suitable for training modern deep generative models. Experiments demonstrate that AirfoilGen enables airfoil generation with far greater geometric validity and aerodynamic performance controllability than previously achievable, with an average performance-conditioning accuracy of 98.41%.

2605.20302 2026-05-22 cs.LG cs.CV

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

Panagiotis Koromilas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis

Affiliations: The Cyprus Institute; University of Athens; Archimedes AI/Athena Research Center; University of Cyprus

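AI summary: This paper studies Neural Collapse (NC), the theoretical optimum of supervised classification, and argues that the two dominant paradigms, cross entropy (CE) and supervised contrastive learning (SCL), fail to reach it in practice. The authors improve both CE and SCL by contrasting prototypes on the hypersphere, coming closer to NC across multiple benchmarks.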

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/nc_by_design

Abstract

Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method that contrasts prototypes on the unit hypersphere, and that closing the gap requires fixing each at its point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and improved robustness to corruptions on ImageNet-C. Our work recasts supervised learning as prototype learning on the hypersphere, with NC reached by design.
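
Two ideas in the abstract can be sketched in a few lines: the Neural Collapse target geometry, and a classifier whose weights are class mean embeddings on the unit hypersphere. The sketch below is not the paper's NTCE, NONL, or SCL code. The function names are hypothetical, and it uses the well-known simplex equiangular tight frame (ETF) in which K unit prototypes have pairwise cosine $-1/(K-1)$. It assumes every class has at least one sample.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def simplex_etf(num_classes):
    """K unit vectors in R^K with pairwise cosine -1/(K-1).

    Construction: center the standard basis vectors, then renormalize.
    """
    k = num_classes
    return [
        normalize([(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)])
        for i in range(k)
    ]


def class_mean_prototypes(features, labels, num_classes):
    """Average the unit-normalized features of each class, then renormalize.

    Assumes each class index in range(num_classes) appears at least once.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    for f, y in zip(features, labels):
        for j, x in enumerate(normalize(f)):
            sums[y][j] += x
    return [normalize(s) for s in sums]


def predict(feature, prototypes):
    """Nearest-prototype rule: class with the highest cosine similarity."""
    scores = [cosine(feature, p) for p in prototypes]
    return scores.index(max(scores))


etf = simplex_etf(4)
pairwise = cosine(etf[0], etf[1])  # -1/(K-1) for K = 4

# Toy 2-D features for two classes (made-up numbers).
features = [[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]]
labels = [0, 0, 1, 1]
protos = class_mean_prototypes(features, labels, num_classes=2)
```

With prototypes defined this way, classification needs no separately trained linear layer: `predict(x, protos)` compares a normalized feature against the class means directly.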

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

TONIC: Token-Centric Semantic Communication for Task-Oriented Wireless Systems

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting

Tabular foundation models for robust calibration of near-infrared chemical sensing data

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

Harnesses for Inference-Time Alignment over Execution Trajectories

Predicting Performance of Symbolic and Prompt Programs with Examples

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

Double descent for least-squares interpolation on contaminated data: A simulation study

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

VDFP: Video Deflickering with Flicker-banding Priors

Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere