arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.21008 2026-05-27 cs.LG cs.AI math.OC

ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

Ruicheng Ao, David Simchi-Levi, Xinshang Wang

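AI summary: Proposes the ORLoopBench benchmark suite, which formalizes infeasible-model repair as a solver-in-the-loop Markov Decision Process driven by Irreducible Infeasible Subsystem (IIS) feedback. Solver-verified RLVR training lets an 8B model surpass frontier APIs on LP repair (95.3% vs 92.4% RR@5), and the evaluation exposes semantic drift in whole-model code regeneration.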

Comments 58 pages, accepted by ICML 2026

Abstract

Operations Research practitioners debug infeasible models through an iterative process: inspecting Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and repairing formulations until feasibility is restored. Existing LLM benchmarks mostly treat OR as one-shot translation from problem descriptions to solver code, omitting this diagnostic loop. We formalize infeasible-model repair as a solver-in-the-loop Markov Decision Process in which each action triggers solver re-execution and IIS recomputation, yielding deterministic, verifiable feedback. We introduce ORLoopBench, a benchmark suite with two components: OR-Debug-Bench releases 5,362 LP/MILP repair instances, while OR-Bias-Bench evaluates closed-form operational decision rationality across inventory settings. Solver-verified RLVR training enables an 8B model to surpass frontier APIs on LP repair (95.3% vs 92.4% RR@5), improves diagnostic behavior, and transfers to MILP repair. The same evaluation exposes semantic drift in whole-model code regeneration: feasible regenerated MILPs can solve the wrong problem. Process-level evaluation with solver oracles enables targeted training for reliable OR self-correction.

2501.06708 2026-05-27 cs.LG cs.AI

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan

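AI summary: Proposes the Mimic Score, a gradient- and geometry-based metric, and the Grad-Mimic framework, which re-weights samples online to accelerate training and builds data filters offline, improving data efficiency and CLIP model performance across six image datasets.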

Comments This work appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026) and was selected as an Oral paper at the ICML 2025 DataWorld Workshop

Abstract

Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods rely on hand-crafted heuristics or downstream datasets, or require expensive influence-based computations, all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple and geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids needing validation datasets, and incurs minimal computational overhead. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.
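
The gradient-alignment idea described in this abstract can be illustrated with a small sketch. It scores a sample by projecting its descent direction (the negative gradient) onto the unit vector that points from the current weights toward the reference weights. The function name `mimic_score`, the normalization, and the toy vectors are assumptions made for illustration; the abstract does not give the exact formula, so this is not the authors' definition.

```python
import math

def mimic_score(sample_grad, weights, ref_weights):
    """Project the descent direction (-grad) onto the unit vector pointing
    from the current weights toward the reference weights.
    Hypothetical formula: the abstract does not spell out the exact definition."""
    direction = [r - w for r, w in zip(ref_weights, weights)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return 0.0  # already at the reference weights, so no target direction
    return sum(-g * d for g, d in zip(sample_grad, direction)) / norm

weights = [0.0, 0.0]
ref_weights = [3.0, 4.0]      # unit target direction is (0.6, 0.8)
helpful_grad = [-3.0, -4.0]   # a gradient step moves toward the reference
harmful_grad = [3.0, 4.0]     # a gradient step moves away from it

helpful_score = mimic_score(helpful_grad, weights, ref_weights)
harmful_score = mimic_score(harmful_grad, weights, ref_weights)
print(helpful_score, harmful_score)
```

Under this reading, a positive score marks a sample whose update pulls the model toward the reference, so it could be up-weighted online or kept by an offline filter, while a negative score marks a candidate for down-weighting.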

2602.07120 2026-05-27 cs.CL

Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model

Jacqueline He, Jonathan Hayase, Wen-tau Yih, Sewoong Oh, Luke Zettlemoyer, Pang Wei Koh

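AI summary: Proposes Anchored Decoding, a plug-and-play inference-time method that provably suppresses verbatim copying of copyrighted content by constraining generation to stay close to a permissively trained safe model, enabling a tunable risk-utility trade-off.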

Comments Accepted to ICML 2026. 53 pages, 14 figures, 22 tables. Code is publicly available at https://github.com/jacqueline-he/anchored-decoding

Abstract

Language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). Across six model pairs on long-form metrics for copying risk and utility, Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while closing up to 75% of the measurable copying gap between the risky baseline and a safe reference, at a modest inference overhead.

2602.04990 2026-05-27 cs.LG cs.GT

Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives

Ioannis Anagnostides, Itai Zilberstein, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm

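AI summary: Argues that current machine learning approaches to optimizing organ allocation policy overlook incentives, proposes that next-generation allocation policies be incentive aware, and calls for integrating research on mechanism design, strategic classification, causal inference, and social choice.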

Comments To appear at ICML 2026 (position paper track). V3 incorporates reviewers' feedback

Abstract

The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely an optimization problem, but rather a complex game involving organ procurement organizations, transplant centers, clinicians, patients, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, fairness, and trust in the face of strategic behavior from the various constituent groups.

2512.06609 2026-05-27 cs.LG cs.CV

Training-Free Vector Quantization via Gaussian VAEs

Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang

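AI summary: Proposes Gaussian Quant (GQ), which trains a Gaussian VAE under constraints and converts it directly into a VQ-VAE without additional training, outperforming existing VQ-VAEs on UNet and ViT architectures.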

Abstract

Vector-quantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
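
The conversion step described in this abstract (random Gaussian noise as a codebook, then the noise vector closest to the posterior mean) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `gaussian_codebook` and `quantize`, the plain squared Euclidean distance, and the codebook size are all hypothetical, and the released implementation may weight dimensions differently.

```python
import random

def gaussian_codebook(num_codes, dim, seed=0):
    """Draw a fixed codebook of i.i.d. standard normal vectors (no training)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]

def quantize(posterior_mean, codebook):
    """Return the index of the codeword nearest to the posterior mean.
    Assumes unweighted squared Euclidean distance."""
    def sq_dist(code):
        return sum((c - m) ** 2 for c, m in zip(code, posterior_mean))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

codebook = gaussian_codebook(num_codes=256, dim=4)
posterior_mean = [0.3, -1.2, 0.0, 0.7]  # toy stand-in for an encoder output
index = quantize(posterior_mean, codebook)
print(index, codebook[index])
```

Because the codebook is regenerated from a seed, only the integer index needs to be stored as the discrete token in this sketch.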

2602.04931 2026-05-27 cs.LG cs.AI

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

Shahar Haim, Daniel C McNamee

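AI summary: Combines geometric analysis with causal interventions to reveal a cross-layer transition from context processing to prediction formation in decoder-only large language models, and finds that in late layers angular structure parameterizes next-token distributional similarity and enables selective causal control.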

Abstract

Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlational with respect to token prediction formation. Meanwhile, causal interventions expose depth-dependent efficacy profiles without a unifying account of their representational dynamics. A complete account of LLM function requires explaining how representational structure evolves across depth to causally produce predictions. We synthesize these perspectives by combining geometric analysis with mechanistic interventions, explicitly centralizing depth-wise dynamics as the organizing axis for interpreting LLM function. In decoder-only LLMs, we identify a sharp transition from context-processing to prediction-forming computation, accompanied by a more gradual reorganization of representational geometry across layers. This synthesis reveals a late-layer geometric code in which angular structure parameterizes next-token distributional similarity and enables selective causal control over predictions, while representation norms encode information largely decoupled from prediction. Together, our results provide a synthesis of causal and geometric perspectives, yielding a mechanistic account of how control-relevant geometric dynamics across depth transform context into prediction in language models. This perspective reconciles previously puzzling findings and implies that layer-wise function cannot be understood or effectively intervened upon in isolation, but only within the emergent global dynamical structure of the network.

2602.04599 2026-05-27 cs.LG

Stochastic Decision Horizons for Constrained Reinforcement Learning

Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev

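AI summary: Proposes the stochastic decision horizons (SDH) framework, which achieves every-step constraint satisfaction through state-action continuation probabilities, and develops the first off-policy and regularized algorithms (AS-SAC and VT-MPO), reaching state-of-the-art gait realism on a 90-muscle humanoid with 4x fewer environment steps.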

Abstract

We propose stochastic decision horizons (SDH), a theoretically grounded framework for solving constrained RL problems with every-step constraint satisfaction, a desirable property in many real-world applications. In SDH, a constraint violation yields an effective shortening of horizon via a state-action continuation probability. Using Control as Inference, we develop the first off-policy and regularized algorithms for RL with instantaneous constraints. We identify two principled semantics for what counts as a decision after a violation. Absorbing-state semantics end the decision process, so only surviving decisions pay entropy cost, yielding max-entropy AS-SAC. Virtual-termination keeps the decision process alive while stopping reward credit, yielding KL-regularized VT-MPO. To connect SDH with CMDPs, we track how violations accumulate along trajectories (their violation-depth profile). SDH effectively weights each trajectory by the exponential of its total violation; this matches an additive CMDP budget exactly when violations occur at a single characteristic scale, and we pinpoint where it cannot: when rare, deep violations mix with frequent, shallow ones. Experiments validate the theory. On the 90-muscle H2190 humanoid (Hyfydy), VT-MPO matches state-of-the-art gait realism with $4\times$ fewer environment steps and substantially more stable training. On Safety Gymnasium, violation-depth profiles correctly identify the regimes in which SDH delivers strong reward-violation trade-offs.
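
To make the horizon-shortening mechanism in this abstract concrete, the sketch below weights each reward by the probability that the trajectory has survived all earlier steps. The exponential mapping from violation to continuation probability, the `scale` parameter, and the choice to still credit the reward at the violating step are assumptions for illustration. They are chosen so that the final survival weight equals the exponential of the negative total violation, in line with the trajectory weighting the abstract mentions, and they do not reproduce the AS-SAC or VT-MPO algorithms.

```python
import math

def continuation_prob(violation, scale=1.0):
    """Map a non-negative violation magnitude to a continuation probability.
    Assumed exponential form, so the product over a trajectory equals
    exp(-scale * total violation)."""
    return math.exp(-scale * violation)

def sdh_return(rewards, violations, scale=1.0):
    """Survival-weighted return and the final survival probability."""
    total, survival = 0.0, 1.0
    for reward, violation in zip(rewards, violations):
        total += survival * reward  # reward counts only if still "alive"
        survival *= continuation_prob(violation, scale)
    return total, survival

rewards = [1.0, 1.0, 1.0]
safe_return, safe_survival = sdh_return(rewards, [0.0, 0.0, 0.0])
risky_return, risky_survival = sdh_return(rewards, [0.0, 2.0, 0.0])
print(safe_return, safe_survival)
print(risky_return, risky_survival)
```

In this toy run the violation at the second step leaves the first two rewards untouched and discounts everything after it, which is the sense in which a violation shortens the effective horizon.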

2602.03545 2026-05-27 cs.AI

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, Alexander Sasha Vezhnevets

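AI summary: Proposes Persona Generators, refined through iterative evolutionary optimization to produce diverse synthetic personas covering a wide range of opinions and preferences, substantially outperforming existing baselines on six diversity metrics.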

Abstract

Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting representative human data is often expensive or infeasible, particularly for novel technologies or hypothetical future scenarios. Recent work in Generative Agent-Based Modeling has shown that large language models can simulate human-like synthetic personas with high fidelity, accurately reproducing the beliefs and behaviors of specific individuals. However, most approaches require detailed data about target populations and often prioritize density matching (replicating what is most probable) rather than support coverage (spanning what is possible), leaving long-tail behaviors underexplored. We introduce Persona Generators, functions that can produce diverse synthetic populations tailored to arbitrary contexts. We apply an iterative improvement loop based on AlphaEvolve, using large language models as mutation operators to refine our Persona Generator code over hundreds of iterations. The optimization process produces lightweight Persona Generators that can automatically expand small descriptions into populations of diverse synthetic personas that maximize coverage of opinions and preferences along relevant diversity axes. We demonstrate that evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.

2602.03517 2026-05-27 cs.LG

Rank-Learner: Orthogonal Ranking of Treatment Effects

Henri Arno, Dennis Frauen, Emil Javurek, Thomas Demeester, Stefan Feuerriegel

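AI summary: Proposes Rank-Learner, a two-stage learner that directly learns the ranking of treatment effects through a pairwise learning objective without explicitly estimating conditional average treatment effects; it is Neyman-orthogonal and model-agnostic.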

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Many decision-making problems require ranking individuals by their treatment effects rather than estimating the exact effect magnitudes. Examples include prioritizing patients for preventive care interventions, or ranking customers by the expected incremental impact of an advertisement. Surprisingly, while causal effect estimation has received substantial attention in the literature, the problem of directly learning rankings of treatment effects has largely remained unexplored. In this paper, we introduce Rank-Learner, a novel two-stage learner that directly learns the ranking of treatment effects from observational data. We first show that naive approaches based on precise treatment effect estimation solve a harder problem than necessary for ranking, while our Rank-Learner optimizes a pairwise learning objective that recovers the true treatment effect ordering, without explicit CATE estimation. We further show that our Rank-Learner is Neyman-orthogonal and thus comes with strong theoretical guarantees, including robustness to estimation errors in the nuisance functions. In addition, our Rank-Learner is model-agnostic, and can be instantiated with arbitrary machine learning models (e.g., neural networks). We demonstrate the effectiveness of our method through extensive experiments where Rank-Learner consistently outperforms standard CATE estimators and non-orthogonal ranking methods. Overall, we provide practitioners with a new, orthogonal two-stage learner for ranking individuals by their treatment effects.

2602.03238 2026-05-27 cs.AI

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su

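AI summary: Observing that current LLM agent evaluation is confounded by factors such as system prompts, toolsets, and environmental dynamics, proposes a standardized, unified evaluation framework to improve fairness and reproducibility.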

Abstract

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.

2602.02518 2026-05-27 cs.LG cs.AI cs.CL

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

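AI summary: Proposes GraphDancer, a two-stage post-training framework whose graph-aware curriculum progressively increases task difficulty, teaching LLMs to explore and reason over heterogeneous graphs by interleaving natural-language reasoning with function calls; with only a 3B backbone it outperforms stronger baselines on a cross-domain benchmark.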

Comments 15 pages, Project website: https://yuyangbai.com/graphdancer/

Abstract

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We propose GraphDancer, a two-stage post-training framework that teaches LLMs to reason over graphs by interleaving natural-language reasoning with graph function execution. The first stage teaches the model how to interact with the graph under rule-based rewards, while the second stage further teaches it to prefer more grounded and efficient interaction trajectories. The key novelty of GraphDancer is a graph-aware curriculum that organizes both stages by the structural complexity of information-seeking trajectories, progressively increasing task difficulty during training. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with larger/stronger backbones, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code can be found at https://github.com/leopoldwhite/GraphDancer.

2602.01518 2026-05-27 cs.AI

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica

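AI summary: Proposes Qrita, a pivot-based algorithm that uses Gaussian sigma-truncation and quaternary pivot search to implement Top-k and Top-p sampling efficiently, improving end-to-end serving throughput by up to 1.4x and halving memory usage while producing the same output as sorting-based algorithms.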

Abstract

Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on pivot-based truncation and selection. Qrita leverages pivot-based search for both Top-k and Top-p with two key techniques: 1. Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and 2. Quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita using Triton and evaluate its performance against the Top-k and Top-p kernels of high-performance LLM execution engines such as SGLang and FlashInfer, improving end-to-end serving throughput by up to 1.4 times with half the memory usage, while providing the same output as the sorting-based algorithms. Qrita is now the default Top-k and Top-p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py.
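
The general idea of replacing a sort with a pivot search can be sketched in a few lines. The example below is a plain binary pivot search in pure Python, not the quaternary Triton kernel from the paper, and it omits sigma-truncation entirely. The function names `topk_threshold` and `top_k`, and the lowest-index tie-breaking rule, are assumptions made for illustration.

```python
def topk_threshold(values, k):
    """Value of the k-th largest element, found by bisecting on a pivot
    value and counting elements, without sorting."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    lo, hi = min(values), max(values)
    if sum(1 for x in values if x >= hi) >= k:
        return hi  # the maximum and its duplicates already fill k slots
    # Invariants: count(x >= lo) >= k and count(x >= hi) < k.
    while sum(1 for x in values if x > lo) >= k:
        pivot = lo + (hi - lo) / 2
        if not lo < pivot < hi:
            # Floating-point resolution exhausted: step to the next data value.
            pivot = min(x for x in values if x > lo)
        at_or_above = [x for x in values if x >= pivot]
        if len(at_or_above) >= k:
            lo = min(at_or_above)  # snap the lower bound up to a real element
        else:
            hi = pivot
    return lo

def top_k(values, k):
    """Indices of the k largest values; ties at the threshold are resolved
    by lowest index so the result is deterministic."""
    threshold = topk_threshold(values, k)
    above = [i for i, x in enumerate(values) if x > threshold]
    ties = [i for i, x in enumerate(values) if x == threshold]
    return above + ties[: k - len(above)]

logits = [0.1, 2.0, 0.7, 2.0, -1.3, 0.7]
selected = top_k(logits, 3)
print(selected)
```

Each loop iteration needs only counting and a minimum, both of which are reductions that parallelize well, which is one reason pivot searches suit GPUs better than full sorts. The explicit tie handling is what keeps the selected set the same size as a sort-based result when logits repeat.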

2602.00959 2026-05-27 cs.LG cs.CL

Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You

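AI summary: Proposes an interactive agentic framework that systematically extracts and quantifies the knowledge of large language models through four adaptive exploration policies and a three-stage knowledge processing pipeline; finds Recursive Taxonomy the most effective strategy and reveals a knowledge scaling law, a Pass@1 versus Pass@k trade-off, and the influence of training data on knowledge profiles.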

Comments Homepage: https://ulab-uiuc.github.io/KnowledgeExtraction/

Abstract

Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundary extends. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularity. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove strict duplicates, LLM-based adjudication to resolve ambiguous semantic overlap, and domain relevance auditing to retain valid knowledge units. Through extensive experiments, we find that Recursive Taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently recover more knowledge. In addition, we identify a Pass@1 versus Pass@k trade-off: domain-specialized models achieve higher initial accuracy but experience rapid degradation, while general-purpose models maintain stable performance over extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families, reflecting how pretraining shapes each model's parametric knowledge.

2602.00827 2026-05-27 cs.LG stat.ML

Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization

Taesun Yeom, Taehyeok Ha, Jaeho Lee

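AI summary: Shows through experiments and theoretical analysis that feature learning strength in deep networks has an optimal value: too large a value causes over-alignment and too small a value causes over-fitting, each of which hurts generalization.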

Comments ICML 2026

Abstract

Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes -- both in training time and FLS -- existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an $\textit{optimal FLS}$ -- neither too small nor too large -- that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.

2502.03946 2026-05-27 cs.LG

CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Yousef Koka, David Selby, Gerrit Großmann, Kathan Pandya, Sebastian Vollmer

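AI summary: Proposes CleanSurvival, a reinforcement-learning-based framework that automatically optimizes data preprocessing pipelines for survival analysis, improving the predictive performance of time-to-event models such as Cox, random forest, and neural network models.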

Comments Resubmitted after Peer Review Feedback to BMC Medical Informatics and Decision Making

Abstract

Data preprocessing is often paid little attention in machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like time-to-event models for censored data. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents CleanSurvival, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables. It builds upon Learn2Clean's Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The Python package is available on GitHub: https://github.com/datasciapps/CleanSurvival. Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing can improve predictive performance relative to simple baselines, while runtime behavior is condition-dependent and most clearly interpretable in the best-covered benchmark cells. Furthermore, a simulation study demonstrates effectiveness across different types and levels of missingness and noise. With an increase in the use of machine learning, it becomes important to generalise AutoML pipelines to a variety of models now present, including survival analysis. Tools like CleanSurvival, which integrate preprocessing for survival analysis, can make survival studies easier and quicker to perform, as well as make the results more robust.

2602.00491 2026-05-27 cs.CL

From Knowledge to Inference: Formalizing Specialized Public Health Reasoning on GlobalHealthAtlas

从知识到推理:形式化GlobalHealthAtlas上的专业公共卫生推理

Zhaokun Yan, Shan Xu, Wuzheng Dong, Zhaohan Liu, Lijie Feng, Chengxiao Dai, Chen Tianqi, Binfan Liu, Yunpu Ma, Wenting Wei, Yingting Li, Yi Zhang, Tongning Wu

AI总结 为解决公共卫生推理缺乏结构化监督信号和基准的问题,提出大规模多语言数据集GlobalHealthAtlas(280,210实例,15领域,17语言),并构建LLM辅助的构建与质量控制流水线及领域对齐评估器,支持安全关键型公共卫生推理的LLM训练与评估。

详情
Journal ref
ICML 2026 regular
AI中文摘要

公共卫生推理需要基于科学证据、专家共识和安全约束的群体层面推理。然而,作为一个结构化的机器学习问题,它仍然未被充分探索,且缺乏监督信号和基准。我们引入了GlobalHealthAtlas,一个大规模多语言数据集,包含280,210个实例,涵盖15个公共卫生领域和17种语言。我们进一步提出了一种大语言模型(LLM)辅助的构建和质量控制流水线,包括检索、去重、证据基础检查和标签验证,以提高大规模数据的一致性。最后,我们提出了一个从不同LLM的高置信度判断中提炼的领域对齐评估器,用于评估输出在六个维度上的表现:准确性、推理、完整性、共识一致性、术语规范性和洞察力。这些贡献共同使得LLM在安全关键的公共卫生推理中的可重复训练和评估成为可能,超越了传统的问答基准。我们公开发布了项目代码库、评估器和模型,网址为:https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator 和 https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model。

英文摘要

Public health reasoning requires population-level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem, with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We further propose a large language model (LLM)-assisted construction and quality control pipeline with retrieval, deduplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain-aligned evaluator distilled from high-confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety-critical public health reasoning beyond conventional QA benchmarks. We publicly release the project codebase, evaluator, and model at: https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator and https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model

2601.22648 2026-05-27 cs.AI cs.LG

UCPO: Uncertainty-Aware Policy Optimization

UCPO:不确定性感知策略优化

Xianzhou Zeng, Jing Huang, Chunmei Xie, Gongrui Nan, Siye Chen, Mengyu Lu, Weiqi Xiong, Qixuan Zhou, Junhao Zhang, Qiang Zhu, Yadong Li, Xingzhong Xu

AI总结 针对现有强化学习范式在不确定性奖励下存在的优势偏差和过度自信问题,提出三元优势解耦和动态不确定性奖励调整机制,显著提升模型在知识边界外的可靠性。

Comments Accepted by ICML 2026

详情
AI中文摘要

构建可信赖的大语言模型的关键在于赋予其内在的不确定性表达能力,从而减轻高风险应用中的过度自信错误。然而,现有的强化学习范式(如GRPO)由于二元决策空间和静态不确定性奖励,常常遭受优势偏差,导致过度保守或过度自信。为了解决这一挑战,本文揭示了当前结合不确定性奖励的强化学习范式中奖励破解和过度自信的根本原因,并在此基础上提出了不确定性感知策略优化(UCPO)框架。UCPO采用三元优势解耦来分离并独立归一化确定性和不确定性轨迹,从而消除优势偏差。此外,动态不确定性奖励调整机制根据模型演化和实例难度实时调整不确定性权重。在数学推理和通用任务上的实验结果表明,UCPO有效解决了奖励不平衡问题,显著提高了模型在知识边界外的可靠性。

英文摘要

The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby mitigating overconfident errors in high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism adapts uncertainty weights in real time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly improving the reliability of the model beyond its knowledge boundaries.
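One way to picture the idea of normalizing deterministic and uncertain rollouts separately is the sketch below. It is an illustration under stated assumptions, not the UCPO algorithm: it uses the familiar group-wise standardization (reward minus group mean, divided by group standard deviation), a hypothetical fixed abstention reward of 0.5, and it does not model the dynamic reward adjustment at all. The function names are made up for this example.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-8):
    """Group-wise standardization: (r - mean) / (std + eps)."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(rewards, is_uncertain):
    """Standardize answered and abstaining rollouts in separate pools."""
    adv = [0.0] * len(rewards)
    for flag in (False, True):
        idx = [i for i, u in enumerate(is_uncertain) if u == flag]
        pool = normalize([rewards[i] for i in idx])
        for i, a in zip(idx, pool):
            adv[i] = a
    return adv


# One prompt, five rollouts: three answered (reward 1 or 0) and two
# abstentions that receive a hypothetical static reward of 0.5.
rewards = [1.0, 0.0, 0.0, 0.5, 0.5]
is_uncertain = [False, False, False, True, True]

joint = normalize(rewards)                            # one shared pool
split = decoupled_advantages(rewards, is_uncertain)   # two separate pools
```

In the shared pool the abstentions receive a positive advantage simply because most answered rollouts were wrong, which is one plausible reading of how a static uncertainty reward can push a policy toward over-conservatism. In the split version the two identical abstention rewards standardize to zero, so they no longer compete with the answered rollouts.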

2601.22384 2026-05-27 cs.LG cs.AI

Graph is a Substrate Across Data Modalities

图是跨数据模态的基板

Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, Chuxu Zhang

AI总结 提出G-Substrate框架,通过统一结构模式和交错角色训练策略,使图结构作为共享基板跨模态和任务积累,优于孤立和朴素多任务方法。

Comments Graph structure across data modalities, accepted by ICML26

详情
AI中文摘要

图提供了跨不同领域出现的自然关系结构表示。尽管无处不在,图结构通常以模态和任务隔离的方式学习,即在单个任务上下文中构建图表示,然后丢弃。因此,跨模态和任务的结构规律被反复重建,而不是在中间图表示级别积累。这引发了一个表示学习问题:如何组织图结构,使其能够跨异构模态和任务持久存在并积累?我们采用以表示为中心的视角,将图结构视为跨学习上下文持久存在的结构基板。为了实例化这一视角,我们提出了G-Substrate,一个围绕共享图结构组织学习的图基板框架。G-Substrate包含两个互补机制:一个统一的结构模式,确保跨异构模态和任务的图表示兼容性;以及一个交错基于角色的训练策略,在学习过程中将同一图结构暴露给多个功能角色。跨多个领域、模态和任务的实验表明,G-Substrate优于任务隔离和朴素多任务学习方法。代码库、模型和数据集可在https://github.com/zmli6/G-Substrate获取。

英文摘要

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods. The codebase, model, and datasets are available at https://github.com/zmli6/G-Substrate.

2601.21845 2026-05-27 cs.LG

Constrained Meta Reinforcement Learning with Provable Test-Time Safety

具有可证明测试时安全性的约束元强化学习

Tingting Ni, Maryam Kamgarpour

AI总结 提出一种约束元强化学习算法,在测试任务上以可证明的安全性和样本复杂度保证学习近似最优策略,并证明样本复杂度下界。

详情
AI中文摘要

元强化学习允许智能体利用在可随意训练的任务分布上的经验,从而在新测试任务上更快地学习最优策略。尽管在提高测试任务样本复杂度方面取得了成功,但许多实际应用(如机器人和医疗保健)在测试期间施加了安全约束。约束元强化学习为将安全性整合到元强化学习中提供了一个有前景的框架。约束元强化学习中的一个开放问题是如何确保策略在真实世界测试任务上的安全性,同时降低样本复杂度,从而更快地学习最优策略。为了解决这一差距,我们提出了一种算法,该算法精炼训练期间学到的策略,具有可证明的安全性和样本复杂度保证,用于在测试任务上学习近似最优策略。我们进一步推导了一个匹配的下界,表明该样本复杂度是紧的。

英文摘要

Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure safety of the policy on the real-world test task while reducing the sample complexity, thus enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near-optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.

2601.21789 2026-05-27 cs.LG cs.AI stat.ML

ECSEL: Explainable Classification via Signomial Equation Learning

ECSEL: 通过符号方程学习的可解释分类

Adia Lumadjeng, Ilker Birbil, Erman Acar

AI总结 提出ECSEL方法,通过学习符号方程形式的闭式表达式实现可解释分类,在符号回归基准上以更低计算量恢复更多目标方程,并保持分类精度与可解释性。

Comments 9 pages, 4 figures, accepted at ICML 2026

详情
AI中文摘要

我们引入ECSEL,一种可解释的分类方法,它学习signomial方程形式的形式化表达式,其动机是观察到许多符号回归基准具有紧凑的signomial结构。ECSEL直接构建一个结构化的闭式表达式,同时作为分类器和解释。在标准符号回归基准上,我们的方法比竞争的最新方法恢复更大比例的目标方程,同时所需计算量显著更少。利用这种效率,ECSEL在不牺牲可解释性的情况下实现了与已建立的机器学习模型竞争的分类精度。此外,我们展示了ECSEL在全局特征行为、决策边界分析和局部特征归因方面满足一些理想性质。在基准数据集和两个真实世界案例研究(即电子商务和欺诈检测)上的实验表明,学习到的方程暴露了数据集偏差,支持反事实推理,并产生可操作的见解。

英文摘要

We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies, namely e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
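For readers unfamiliar with the term, a signomial is a sum of terms of the form c_k * prod_j x_j ** a_kj, with real coefficients and real exponents, evaluated on positive inputs. The sketch below only evaluates such an expression; the two-term equation and the rule of thresholding the value at zero to obtain a class label are assumptions made for illustration, since the abstract does not state how ECSEL maps the equation to a decision.

```python
import math


def signomial(x, terms):
    """Evaluate sum_k c_k * prod_j x_j ** a_kj for strictly positive inputs."""
    if any(v <= 0 for v in x):
        raise ValueError("this sketch only handles strictly positive inputs")
    total = 0.0
    for coeff, exponents in terms:
        total += coeff * math.prod(v ** a for v, a in zip(x, exponents))
    return total


# Hypothetical equation: 2 * x0**1.5 * x1**-1  -  3 * x1**0.5
TERMS = [(2.0, (1.5, -1.0)), (-3.0, (0.0, 0.5))]


def classify(x, terms=TERMS, threshold=0.0):
    """Assumed decision rule: class 1 when the signomial exceeds the threshold."""
    return int(signomial(x, terms) > threshold)


score_a = signomial((4.0, 1.0), TERMS)  # 2 * 8 / 1 - 3 * 1 = 13.0
score_b = signomial((1.0, 4.0), TERMS)  # 2 * 1 / 4 - 3 * 2 = -5.5
```

Because every term is a product of powers, the exponent of a feature shows directly how the output scales with that feature, which is the kind of closed-form readability the abstract refers to.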

2511.16870 2026-05-27 cs.CV cs.LG

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

对齐与反转:通过表示对齐解决扩散和流模型中的逆问题

Loukas Sfountouris, Giannis Daras, Paris Giampouras

AI总结 提出将扩散或流模型的内部表示与预训练自监督编码器(DINOv2)对齐(REPA),在推理时引导逆问题重建,显著提升重建质量和感知真实感。

详情
AI中文摘要

最近研究表明,强制扩散或流生成模型的内部表示与预训练自监督编码器的表示对齐,提供了强大的归纳偏置,改善了收敛性和样本质量。在这项工作中,我们将这一思想扩展到逆问题,其中预训练生成模型被用作先验。我们提出在扩散或流模型与DINOv2视觉编码器之间应用表示对齐(REPA),以在推理时指导重建过程。尽管逆问题中无法获得真实信号,但我们实验表明,对齐模型对近似目标特征的表示可以显著提升重建质量和感知真实感。我们提供了理论结果,显示(a) REPA正则化可以视为在DINOv2嵌入空间中最小化散度度量的变分方法,(b) 在一定的正则性假设下,REPA更新将潜在扩散状态引导向干净图像的状态。这些结果揭示了REPA在提升感知保真度中的作用。最后,我们通过将REPA集成到多个最先进的逆问题求解器中证明了方法的通用性,并在超分辨率、框内补全、高斯去模糊和运动去模糊上进行了大量实验,证实我们的方法一致地改善了重建质量,同时通过减少所需的离散化步骤数提高了效率。

英文摘要

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations of approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) how, under certain regularity assumptions, REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating REPA into multiple state-of-the-art inverse problem solvers, and provide extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirming that our method consistently improves reconstruction quality, while also providing efficiency gains by reducing the number of required discretization steps.

2601.21576 2026-05-27 cs.AI

Chain Of Thought Compression: A Theoretical Analysis

思维链压缩:理论分析

Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, Jeff Z. Pan

AI总结 本文通过引入Order-r Interaction理论,证明了隐式思维链压缩中高阶逻辑依赖的学习信号指数衰减问题,并提出ALiCoT框架通过对齐潜在令牌分布与中间推理状态来克服信号衰减,实现54.4倍加速且性能与显式CoT相当。

详情
AI中文摘要

思维链(CoT)通过中间步骤解锁了大语言模型(LLMs)的高级推理能力,但由于生成额外令牌而带来了高昂的计算成本。最近的研究通过实验表明,将推理步骤压缩到潜在状态中,即隐式CoT压缩,提供了一种令牌高效的替代方案。然而,CoT压缩背后的机制仍不清楚。在本文中,我们首次对学习内化中间推理步骤的难度进行了理论分析。通过引入Order-r Interaction,我们证明了在求解不可约问题时,高阶逻辑依赖的学习信号呈指数衰减,此时跳过中间步骤不可避免地导致高阶交互障碍。为了从实验上验证这一点,我们引入了NatBool-DAG,这是一个具有挑战性的基准测试,旨在强制执行不可约逻辑推理并消除语义捷径。在我们的理论发现指导下,我们提出了ALiCoT(对齐隐式CoT),一种新颖的框架,通过对齐潜在令牌分布与中间推理状态来克服信号衰减。实验结果表明,ALiCoT成功解锁了高效推理:它实现了54.4倍加速,同时保持与显式CoT相当的性能。

英文摘要

Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohibitive computational costs due to the generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies decays exponentially when solving irreducible problems, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.

2601.20796 2026-05-27 cs.CL cs.LG

Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

解析多模态上下文学习:现代Transformer中的模态不对称性与电路动力学

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

AI总结 通过可控实验,研究现代Transformer中多模态上下文学习的基本机制,发现模态间学习不对称性,并揭示其背后的归纳式电路机制。

Comments ICML 2026 Spotlight

详情
AI中文摘要

基于Transformer的多模态大语言模型通常展现出上下文学习(ICL)能力。受此现象启发,我们提出疑问:Transformer如何从上下文示例中跨模态关联信息?我们通过在合成分类任务上训练的小型Transformer进行可控实验来研究这一问题,从而能够精确操控数据统计和模型架构。我们首先重新审视现代Transformer中单模态ICL的核心原理。虽然多个先前发现得以复现,但我们发现旋转位置编码(RoPE)提高了ICL的数据复杂度阈值。扩展到多模态设置揭示了一个基本的学习不对称性:当在来自主要模态的高多样性数据上预训练时,次要模态中令人惊讶的低数据复杂度就足以使多模态ICL出现。机制分析表明,两种设置都依赖于一种归纳式机制,该机制从匹配的上下文示例中复制标签;多模态训练则跨模态细化和扩展这些电路。我们的发现为理解现代Transformer中的多模态ICL提供了机制基础,并为未来研究引入了一个可控的测试平台。代码可在 https://github.com/YiranHuangIrene/multimodal-icl 获取。

英文摘要

Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation. Code is available at: https://github.com/YiranHuangIrene/multimodal-icl

2508.02806 2026-05-27 cs.CV cs.LG

PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

PyCAT4: 基于层次化视觉Transformer的3D人体姿态估计框架

Zongyou Yang, Jonathan Loo, Yinghan Hou

AI总结 本研究提出PyCAT4框架,通过引入自注意力机制的Transformer特征提取层、特征时间融合技术和空间金字塔结构,优化Pymaf网络,在COCO和3DPW数据集上显著提升3D人体姿态估计的检测能力。

Comments 10 pages, 20 figures

详情
AI中文摘要

近年来,通过将卷积神经网络与金字塔网格对齐反馈循环相结合,3D人体姿态估计的准确性得到了显著提升。此外,基于Transformer的时间分析架构的采用在计算机视觉领域取得了创新性突破。鉴于这些进展,本研究旨在深度优化和改进现有的Pymaf网络架构。本文的主要创新包括:(1) 引入基于自注意力机制的Transformer特征提取网络层,以增强对低级特征的捕获;(2) 通过特征时间融合技术增强对视频序列中时间信号的理解和捕获;(3) 实现空间金字塔结构以实现多尺度特征融合,有效平衡不同尺度下的特征表示差异。本研究得到的新PyCAT4模型在COCO和3DPW数据集上进行了实验验证。结果表明,所提出的改进策略显著提升了网络在人体姿态估计中的检测能力,进一步推动了人体姿态估计技术的发展。

英文摘要

Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing differences in feature representations across scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.

2601.18904 2026-05-27 cs.SD cs.AI cs.CL

MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning

MetaSICL: 通过元语音上下文学习适应听觉大语言模型

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

AI总结 提出MetaSICL方法,利用高资源语音数据通过元学习增强听觉大语言模型的上下文学习能力,在低资源场景下优于直接微调。

详情
AI中文摘要

听觉大语言模型在广泛的语音和音频理解任务中表现出强大的性能。然而,当应用于低资源任务时,它们常常遇到困难。如果域内标注数据稀缺或与真实测试分布不匹配,直接微调可能不稳定。上下文学习通过基于少量域内示例的条件化来适应听觉大语言模型,提供了一种无需训练、推理时的解决方案。在这项工作中,我们首先表明,$\textit{Vanilla ICL}$ 在选定的模型上提高了跨多种语音和音频任务的零样本性能,这表明这种ICL适应能力可以推广到多模态设置。在此基础上,我们提出了 $\textbf{Meta Speech In-Context Learning (MetaSICL)}$,这是一种后训练方法,仅利用来自各种任务的高资源语音数据,旨在增强模型的上下文学习能力。实验表明,我们提出的方法在低资源场景下优于直接微调。

英文摘要

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. When in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that $\textit{Vanilla ICL}$ improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability can be generalized to multimodal settings. Building on this, we propose $\textbf{Meta Speech In-Context Learning (MetaSICL)}$, a post-training recipe that uses only high-resource speech data from various tasks to strengthen the model's in-context learning capability. Experiments indicate that our proposed method outperforms direct fine-tuning in low-resource scenarios.

2601.18381 2026-05-27 cs.AI cs.SE

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

AI Agent 用于逆向工程遗留有限差分代码并转换为 Devito

Yinghan Hou, Zongyou Yang

AI总结 本研究提出一个集成 AI Agent 框架,结合检索增强生成(RAG)和开源大语言模型,通过多阶段迭代工作流将遗留有限差分代码转换为 Devito 环境,并引入强化学习反馈机制实现动态自适应代码翻译。

Comments 14 pages, 7 figures

详情
AI中文摘要

为了促进遗留有限差分实现向 Devito 环境的转换,本研究开发了一个集成的 AI Agent 框架。检索增强生成(RAG)和开源大语言模型通过系统混合 LangGraph 架构中的多阶段迭代工作流相结合。该 Agent 通过文档解析、结构感知分割、实体关系提取和基于 Leiden 的社区检测构建了一个广泛的 Devito 知识图谱。GraphRAG 优化增强了跨语义社区的查询性能,这些社区包括地震波模拟、计算流体动力学和性能调优库。一个逆向工程组件通过 Fortran 源代码的静态分析推导出用于 RAG 检索的三级查询策略。为了为语言模型指导提供精确的上下文信息,多阶段检索流水线执行并行搜索、概念扩展、社区级检索和语义相似性分析。代码合成受基于 Pydantic 的约束控制,以保证结构化输出和可靠性。一个全面的验证框架将传统静态分析与 G-Eval 方法相结合,涵盖执行正确性、结构健全性、数学一致性和 API 合规性。整个 Agent 工作流在 LangGraph 框架上实现,并采用并发处理以支持基于质量的迭代细化和状态感知的动态路由。主要贡献在于引入了受强化学习启发的反馈机制,实现了从静态代码翻译向动态自适应分析行为的转变。

英文摘要

To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval-Augmented Generation (RAG) and open-source Large Language Models are combined through multi-stage iterative workflows in the system's hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure-aware segmentation, extraction of entity relationships, and Leiden-based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three-level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi-stage retrieval pipeline performs parallel searching, concept expansion, community-scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic-based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G-Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality-based iterative refinement and state-aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.

2512.01556 2026-05-27 cs.AI cs.CL cs.LG

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

LEC: 选择性预测与路由系统中基于选择条件风险控制的线性期望约束

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

AI总结 提出LEC框架,通过线性期望约束将选择性预测转化为决策问题,在可交换性假设下利用校准集计算风险约束下的保留最大化阈值,并扩展到双模型路由系统,实现选择条件误差控制。

Comments Accepted by ICML 2026 Regular

详情
AI中文摘要

基础模型常常生成不可靠的答案,而启发式不确定性估计器无法完全区分正确与错误输出,导致用户在没有统计保证的情况下接受错误答案。我们通过选择条件风险控制来解决这个问题,旨在确保接受的预测的错误概率不超过用户指定的风险水平。为此,我们提出了LEC,一个原则性框架,将选择性预测重新定义为由选择和错误指标上的线性期望约束控制的决策问题。该公式直接控制接受错误期望数与接受预测期望数之间的比率,这对应于选择条件下的边际错误概率。在可交换性下,我们推导出一个仅依赖于保留校准集的有限样本充分条件,从而能够计算风险约束下的保留最大化阈值。此外,我们将LEC扩展到双模型路由系统:如果主模型的不确定性超过其校准阈值,则输入被委托给后续模型,同时保持系统级的选择条件误差控制。在封闭式和开放式问答(QA)以及视觉问答(VQA)上的实验表明,LEC在接受的预测中维持了规定的风险水平,并且与基线相比显著提高了样本保留率。

英文摘要

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.
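The ratio constraint in the abstract can be written in linear form: requiring (accepted errors) / (accepted predictions) <= alpha is the same as requiring (accepted errors) - alpha * (accepted predictions) <= 0. The sketch below applies that linear check empirically on a calibration set and then routes between two models. It is a plain plug-in illustration and does not reproduce the finite-sample condition derived in the paper, so it carries no statistical guarantee. The function names, the toy data, and the choice to abstain when the second model is also too uncertain are assumptions.

```python
def calibrate_threshold(scores, errors, alpha):
    """Largest uncertainty threshold whose empirical accepted-error rate <= alpha.

    scores: uncertainty per calibration example (lower means more confident)
    errors: 1 if the model's answer was wrong, else 0
    Returns None when no candidate threshold meets the target.
    """
    best = None
    for tau in sorted(set(scores)):
        accepted = [e for s, e in zip(scores, errors) if s <= tau]
        # Linear form of the ratio constraint: errors - alpha * count <= 0.
        if sum(accepted) - alpha * len(accepted) <= 0:
            best = tau
    return best


def route(u_primary, u_secondary, tau_primary, tau_secondary):
    """Accept the primary answer if confident, else defer to the second model."""
    if u_primary <= tau_primary:
        return "primary"
    if u_secondary <= tau_secondary:
        return "secondary"
    return "abstain"  # assumed fallback when neither model is confident


# Toy calibration data (made up for illustration).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors = [0, 0, 0, 1, 0, 1, 1, 1]
tau = calibrate_threshold(scores, errors, alpha=0.25)
```

Scanning every candidate and keeping the largest feasible one maximizes retention, since the empirical error rate is not monotone in the threshold. On the toy data, thresholds up to 0.5 accept five examples with one error, while 0.6 would accept a second error and exceed the 25% target.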

2601.15283 2026-05-27 cs.CV cs.GR

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

LuxRemix: 室内场景的光照分解与重新混合

Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt

AI总结 提出一种基于图像的光照分解模型,从多视图场景捕获中分解室内光照为独立光源,并通过多视图光照协调集成到可重光照的3D高斯溅射表示中,实现交互式光源编辑。

Comments CVPR 2026. Project page: https://luxremix.github.io

详情
AI中文摘要

我们提出了一种新颖的方法,用于从单个多视图场景捕获中对室内场景进行交互式光照编辑。我们的方法利用基于生成图像的光照分解模型,将复杂的室内场景照明分解为其组成光源。这种分解能够独立操作各个光源,特别是控制其状态(开/关)、色度和强度。我们进一步引入了多视图光照协调,以确保光照分解在所有场景视图中的一致传播。这被集成到一个可重光照的3D高斯溅射表示中,提供对单个光源的实时交互控制。我们的结果展示了在多种室内场景中高度逼真的光照分解和重光照效果。我们在合成和真实世界数据集上评估了我们的方法,并与最先进的技术进行了定量和定性比较。视频结果和交互演示请参见 https://luxremix.github.io。

英文摘要

We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.

2601.14702 2026-05-27 cs.AI cs.CV cs.RO

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Drive-P2D:自动驾驶中视觉语言模型的渐进式感知到决策基准

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang

AI总结 提出Drive-P2D基准,通过分离推理与答案的协议,在目标、场景和决策三个层级上评估视觉语言模型的感知到决策能力,并分析错误模式。

详情
AI中文摘要

自动驾驶需要在复杂场景中实现可靠的感知和安全的决策。最近的视觉语言模型(VLM)展示了推理和泛化能力,为自动驾驶开辟了新的可能性;然而,现有的基准通常分别评估感知和决策,通过仅选择格式限制故障分析,或通过LLM评分的长格式输出引入评估偏差。为了解决这些问题,我们提出了Drive-P2D,一个渐进式感知到决策基准,包含6650个问题,涵盖目标、场景和决策三个层级。Drive-P2D采用分离的推理与答案协议:最终答案客观评分,而推理则用于分析沿渐进式感知到决策链暴露的错误模式。我们评估了所有场景和高风险场景下的主流VLM,并通过相关性分析和相似场景鲁棒性测试进一步刻画了感知到决策的能力边界。推理进一步揭示了逻辑推理错误和语义特征遗漏等故障模式,我们训练了一个轻量级分析器模型来自动化大规模推理错误模式标注。这些设计共同为构建更安全、更可靠的用于现实世界自动驾驶的VLM提供了实用见解。

英文摘要

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks often evaluate perception and decision-making separately, limit failure analysis with choice-only formats, or introduce evaluation bias through LLM-scored long-form outputs. To address these issues, we present Drive-P2D, a progressive perception-to-decision benchmark with 6,650 questions across Object, Scene, and Decision levels. Drive-P2D adopts a separated reasoning-and-answer protocol: final answers are scored objectively, while reasoning is analyzed to identify error modes exposed along the progressive perception-to-decision chain. We evaluate mainstream VLMs across all and high-risk scenarios, and further characterize the perception-to-decision capability boundary through correlation analysis and similar-scene robustness testing. Reasoning further exposes failure modes such as logical reasoning errors and semantic feature omissions, and we train a lightweight analyzer model to automate large-scale error-mode annotation of reasoning. Together, these designs provide practical insights for building safer and more reliable VLMs for real-world autonomous driving.

2508.03774 2026-05-27 cs.LG cs.AI

A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets

用于三维PEC目标微波散射分析的物理信息分层神经网络

Rui Zhu, Yuexing Peng, George C. Alexandropoulos, Wenbo Wang

AI总结 提出一种U形物理信息神经网络(U-PINet),结合近场图编码器和八叉树分层多尺度融合模块,通过电场积分方程残差训练,实现高效准确的三维PEC目标微波散射分析。

Comments Submitted to an IEEE Journal

详情
AI中文摘要

在微波频率下精确建模三维完美电导体(PEC)目标的散射是计算电磁学的一个基本目标,特别是在雷达截面(RCS)预测和微波散射分析中。经典求解器,如矩量法和多层快速多极子算法(MLFMA),虽然提供高物理保真度,但在涉及多次入射配置或频率的重复查询场景下变得昂贵,而纯数据驱动的代理模型通常在几何复杂目标上缺乏准确性。本文提出一种U形物理信息人工神经网络(U-PINet)用于三维微波散射分析。受MLFMA的近远场分解启发,U-PINet结合了由可学习单变量基函数参数化的近场图编码器,以及在八叉树分区上组织的分层多尺度融合模块。所提出的网络在表面配置点处针对电场积分方程的离散残差进行训练,无需参考电流标签。在多个频率和极化配置下,对典型和几何复杂的三维PEC目标进行的实验,并通过双站RCS重建评估,表明U-PINet优于代表性的物理信息基线,并在重复查询场景下相比经典MLFMA求解器实现了显著的运行时间节省。

英文摘要

Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies constitutes a fundamental objective in computational electromagnetics, particularly for radar cross section (RCS) prediction and microwave scattering analysis. Classical solvers, such as the method of moments and the Multilevel Fast Multipole Algorithm (MLFMA), provide high physical fidelity but become costly in repeated-query scenarios involving many incidence configurations or frequencies, whereas purely data-driven surrogates often lack accuracy on geometrically complex targets. This paper proposes a U-shaped physics-informed artificial neural network (U-PINet) for 3D microwave scattering analysis. Inspired by the near-far field decomposition of MLFMA, U-PINet combines a near-field graph encoder, parameterized by learnable univariate basis functions, with a hierarchical multi-scale fusion module organized on an octree partition. The proposed network is trained against a discretized residual of the electric-field integral equation at surface collocation points, without requiring reference current labels. Experiments on canonical and geometrically complex 3D PEC targets, conducted under multiple frequency and polarization configurations and assessed through bistatic RCS reconstruction, show that U-PINet outperforms representative physics-informed baselines and yields substantial runtime savings over the classical MLFMA solver in repeated-query scenarios.