arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.07102 2026-05-11 cs.CL

SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions

Tianyu Wang, Nianjun Zhou

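AI summary: This paper proposes SAGE, an ontology-grounded hierarchical LLM evaluation framework for systematically assessing the quality of literary works along interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication. The method uses multi-round iterative reflection and independent validation to produce structured evaluations, and is validated on 100 short stories, where it achieves high score convergence and high inter-rater agreement. Canonical literary works score significantly higher than pulp fiction and LLM-generated narratives on every dimension, and the evaluation layers discriminate between different aspects of literary quality, illustrating the potential of theory-driven LLM evaluation for reliable, systematic assessment.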

Comments 19 pages, 4 figures

Abstract

Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
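
The effect sizes quoted above are Cohen's d values. As a minimal sketch of how such a value is computed, the snippet below uses the standard pooled-standard-deviation form. The scores are invented for illustration and are not data from the paper; the abstract does not state which variant of d the authors used, so the pooled form is an assumption.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical scores, invented for illustration only.
canonical_scores = [8.0, 9.0, 10.0]
llm_scores = [5.0, 6.0, 7.0]
d = cohens_d(canonical_scores, llm_scores)
print(f"Cohen's d = {d:.2f}")
```

Both toy groups have unit sample variance and their means differ by 3, so d here is 3.0; under the usual convention, values above roughly 0.8 already count as large.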

2605.07094 2026-05-11 cs.LG

Actor-Critic with Active Importance Sampling

Majid Molaei, Gabor Paczolay, Matteo Papini, Alberto Maria Metelli, Marcello Restelli

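AI summary: This paper proposes Active-Importance-Sampling Actor-Critic (AISAC), an algorithm that uses active importance sampling to reduce the variance of policy gradient estimates. While keeping the gradient unbiased, the method optimizes the behavior policy to collect data more efficiently, improving both policy updates and value function estimation. Experiments on continuous-action tasks show faster learning, higher sample efficiency, and more stable training, suggesting potential for real-world applications.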

Abstract

This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.
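
The core idea, that a well-chosen behavior distribution keeps an importance-weighted estimate unbiased while shrinking its variance, can be illustrated on a two-outcome toy problem. This is a generic importance-sampling sketch, not AISAC itself: the distributions, the function `f`, and the helper name are all hypothetical, and the paper's cross-entropy optimization of Gaussian behavior policies is not reproduced here.

```python
def is_mean_and_variance(target, behavior, f):
    """Exact mean and variance of the one-sample estimator
    f(x) * target(x) / behavior(x) with x drawn from `behavior`."""
    weighted = {x: f(x) * target[x] / behavior[x] for x in target}
    mean = sum(behavior[x] * weighted[x] for x in target)
    second_moment = sum(behavior[x] * weighted[x] ** 2 for x in target)
    return mean, second_moment - mean ** 2


def f(x):
    # Hypothetical quantity whose expectation under `target` we want.
    return 1.0 if x == 0 else 3.0


target = {0: 0.5, 1: 0.5}

# Sampling from the target itself (the on-policy baseline).
on_policy_mean, on_policy_var = is_mean_and_variance(target, target, f)

# Behavior distribution proportional to target(x) * |f(x)|,
# the classic minimum-variance choice for a scalar expectation.
behavior = {0: 0.25, 1: 0.75}
active_mean, active_var = is_mean_and_variance(target, behavior, f)

print(on_policy_mean, on_policy_var)
print(active_mean, active_var)
```

Both estimators have mean 2.0, but the on-policy variance is 1.0 while the reweighted one is zero, which is the effect an actively optimized behavior policy aims for.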

2605.07093 2026-05-11 cs.CL cs.AI cs.LG

The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

Zezheng Lin, Fengming Liu, Handi Li

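AI summary: This study challenges the conventional view of the "Translation Tax" as a fixed scalar, i.e., the assumption that benchmarks translated into Chinese inflate model scores by preserving cues from the English source items. Using several estimators and controlled comparisons, it finds that the effect of translated benchmarks is not uniform but depends on the estimator and on item-level properties. The study also proposes a new naturalization stress test, shows that the effect on model performance varies across items, and provides a more rigorous evaluation framework and supporting data for multilingual benchmarks.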

Comments 13 pages, 3 figures. Submitted to NeurIPS 2026

Abstract

The Translation Tax is often treated as a scalar: translated benchmarks are assumed to inflate scores by preserving English-source cues. We audit this claim in an English-to-Chinese setting. Three proxy estimators disagree: back-translation gaps are small and parser-fragile; cue-score calibration does not predict item-level gains; and a six-model native-control comparison shows model-family rather than uniform benchmark effects. We add a same-item LLM-naturalization stress test that holds answer, options, and content fixed while rewriting Chinese surface form. After correcting a prompt-construction bug, this contrast no longer supports a model-family interaction, but it preserves a residue dose-response: high-residue items benefit while low-residue items do not. The result is not a single Translation Tax, but a set of estimator- and item-dependent validity risks. We release per-cell evidence, the naturalization protocol, human QC, and a reporting checklist for translated multilingual benchmark papers.

2605.07086 2026-05-11 cs.CV cs.LG

Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information

Houman Safaai, Andrew T. Landau, Celia C. Beron, Yasin Mazloumi, Bernardo L. Sabatini

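AI summary: This paper proposes a new view that splits channel importance in vision networks into two axes: task relevance and local replaceability. Whereas conventional methods summarize channel importance with a single score, separating the two axes reveals that they behave and correlate differently over the course of training. Experiments show that local replaceability predicts channel removability more reliably than task relevance, offering finer-grained guidance for model pruning.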

Abstract

Channel importance in vision networks is usually summarized by a single score. That summary hides two different questions: how much a channel is related to the task, and whether its function can be supplied by same-layer peers when the channel is removed. We call the second property local replaceability. We introduce a two-axis view that separates these questions. The local axis measures input capture and peer overlap, while the target axis measures task information and target-excess information. Across ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two axes are weakly aligned, induce different channel groupings, and separate rapidly during training despite being strongly coupled at random initialization. A Gaussian linear analysis accounts for how this separation can arise through residualized gradient directions, and lesion plus peer-replacement experiments show that peer support refines removability beyond input capture and task relevance alone. Under the fixed FLOPs-matched pruning protocol, local-axis metrics are more reliable predictors of removability than target-axis metrics across the three CIFAR-100 backbones, with the same direction preserved in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. These findings identify an axis-level distinction rather than a universal ranking of pruning scores: local replaceability is a more reliable guide to removability than target relevance, while norm-based baselines remain competitive in architectures such as VGG-16. Relevance-based scores ask what a channel says about the task; pruning asks whether the network still needs that channel when its peers remain available.

2605.07084 2026-05-11 cs.CL

Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation

Anna Seo Gyeong Choi, Maria Teleki, James Caverlee, Miguel del Rio, Corey Miller, Hoon Choi

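AI summary: This paper examines the epistemic injustice that can arise when automatic speech recognition (ASR) evaluation relies on a single reference transcript. Different transcription conventions (e.g., verbatim, non-verbatim, legal) yield different evaluations of the same ASR output, and enforcing a single convention ignores the speech features of some speakers, such as people with aphasia, producing systematic evaluation bias. The paper introduces the concept of the hermeneutical gap and an Epistemic Injustice Distance metric, shows empirically how the choice of convention affects WER, and proposes reporting a range of performance across conventions rather than assuming a single correct answer.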

Abstract

Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered - they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism - enforcing a single transcription convention as ground truth - commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.
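
To make the WER-Range idea concrete, the sketch below scores one ASR hypothesis against references written under two conventions and reports the spread. The abstract does not give the exact definition of WER-Range, so treating it as the minimum and maximum WER over conventions is an assumption, and the transcripts are invented examples rather than AphasiaBank data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed
    # prefix of `ref` and the first j words of `hyp`.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical references for the same utterance under two conventions.
references = {
    "verbatim": "um i i went to the store",
    "non_verbatim": "i went to the store",
}
hypothesis = "i went to the store"

wers = {name: word_error_rate(ref, hypothesis) for name, ref in references.items()}
wer_range = (min(wers.values()), max(wers.values()))
print(wers, wer_range)
```

The same output is perfect against the non-verbatim reference and has two deletions out of seven words against the verbatim one, so a single reported WER would hide which convention was assumed.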

2605.07082 2026-05-11 cs.CV

ImplantMamba: Long-range Sequential Modeling Mamba For Dental Implant Position Prediction

Xinquan Yang, Congmin Wang, Xuguang Li, Yulei Li, Linlin Shen, Yongqiang Deng, He Meng

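AI summary: Accurately predicting implant position is a key step in designing surgical guides for dental implant placement, but the implant region in medical images usually lacks distinctive texture, making direct localization difficult for AI models. The paper proposes ImplantMamba, a long-range sequential modeling architecture that combines convolutional neural networks with Mamba layers to integrate texture information from adjacent teeth, and adds a slope-coupled prediction branch that jointly predicts implant position and angulation, significantly improving prediction accuracy and anatomical plausibility.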

Abstract

In the design of surgical guides for implant placement, determining the precise implant position is a critical step. However, the implant region itself is often characterized by a lack of distinctive texture in medical images. Consequently, artificial intelligence (AI) models must infer the correct implant position and angulation (slope) primarily by analyzing the texture of the surrounding teeth, which poses a significant challenge. To address this, we propose ImplantMamba, a network architecture designed for long-range sequential modeling to integrate texture information from adjacent teeth. Our approach explicitly couples the regression of the implant position with its slope. The core of ImplantMamba is a hybrid encoder that combines Convolutional Neural Networks (CNNs) with Mamba layers. This design enables the network to hierarchically extract local anatomical features through CNNs while simultaneously modeling global contextual dependencies across the entire scan volume via Mamba's selective scan operations, leading to a more comprehensive understanding of the implant site. Furthermore, we introduce a Slope-Coupled Prediction Branch (SCP). This branch connects the prediction of implant position with the slope, ensuring internal consistency and anatomical plausibility by enforcing a coherent relationship between the predicted implant location and its angulation. Extensive experiments on a large-scale dental implant dataset demonstrate that the proposed ImplantMamba achieves superior performance compared to existing methods.

2605.07080 2026-05-11 cs.AI cs.DS

Online Allocation with Unknown Shared Supply

Tzeh Yuan Neoh, Davin Choo, Mengchu Yue, Milind Tambe

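AI summary: This paper studies online resource allocation with unknown shared supply, a problem common in real-world settings such as humanitarian logistics and vaccine distribution. The authors propose a model called OSSA in which a central hub must allocate a limited, unknown supply to multiple sites facing sequential demand and fixed transportation costs, with penalties for stockouts. They design a deterministic threshold-proportional policy, GPA, and prove that it is within a factor of 4/3 of the offline optimum; they also give a lower-bound analysis and extend the method to handle imperfect forecasts. Experiments show that the method outperforms conventional baselines when supply is scarce.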

Abstract

Many real-world resource allocation systems, such as humanitarian logistics and vaccine distribution, must preposition limited supply across multiple locations before demand is realized while stockouts incur irreversible service losses. To study this, we introduce the Online Shared Supply Allocation (OSSA) problem, a stateful online model in which a central hub allocates a finite, unknown supply to multiple sites facing sequential demand under fixed-charge transportation costs and lost-sales penalties. Unlike classical make-to-stock or make-to-order inventory models, OSSA precludes backlogging, and replenishment only hedges against future demand. To tackle OSSA, we propose a deterministic threshold-proportional policy GPA and prove that it achieves a $4/3$-approximation to the offline optimum up to an additive term independent of the total supply. We complement this with matching lower bounds showing that the $4/3$ ratio is tight and that the additive-error dependence is unavoidable, even for randomized algorithms that know the total supply upfront. Finally, we develop a learning-augmented extension to GPA that incorporates, in a principled way, the imperfect forecasts (e.g., from human experts or ML models) commonly available in practice, enabling us to exploit high-quality advice while remaining robust against arbitrarily bad advice. Synthetic and real-world experiments show that GPA outperforms natural baselines when global supply is scarce.

2605.07079 2026-05-11 cs.CV cs.AI cs.LG cs.RO

Learning Visual Feature-Based World Models via Residual Latent Action

Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias

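AI summary: This study proposes RLA-WM, a visual feature-based world model that introduces a Residual Latent Action (RLA) representation to address the blurry or collapsed predictions of conventional feature-regression methods in complex interactions. RLA is learned from DINO residuals, is predictive and generalizable, and encodes temporal progression; combined with flow matching, it enables efficient prediction. RLA-WM outperforms existing feature-based and video-diffusion models on simulation and real-world datasets while being substantially faster, and the authors further propose two robot learning techniques built on RLA-WM that improve the efficiency and effectiveness of policy learning.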

Abstract

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm

2605.07078 2026-05-11 cs.LG

Test-Time Compositional Generalization in Diffusion Models via Concept Discovery

Zekun Wang, Anant Gupta, Tianyi Zhu, Christopher J. MacLellan

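AI summary: This study investigates how diffusion models can achieve compositional generalization at test time through concept discovery. Unlike prior methods that rely on predefined conditioning signals, the authors extract query-specific concepts from the time-indexed noisy marginals learned by the diffusion model and compose them at test time to generate novel configurations. Experiments show that the method outperforms conventional baselines on several compositionality benchmarks, demonstrating that reusable density-mode concepts latent in the diffusion model are effective for compositional generation.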

Comments 9 pages

Abstract

Compositional generalization requires models to produce novel configurations from familiar parts. In diffusion models, prior compositional generation methods typically assume that the relevant concepts or conditioning signals are already available. We instead ask whether a pretrained diffusion model can discover query-specific concepts from the time-indexed scores it learns for the noisy marginals $p_t(x_t)$ and compose them at test time. Given a single out-of-distribution query, our method performs gradient ascent on $s_θ(x_t,t) \approx \nabla_{x_t}\log p_t(x_t)$ at multiple noising timesteps to recover local density modes, maps these modes into clean-space Gaussians, greedily selects relevant prototypes with a submodular likelihood objective, and combines them into a product-of-experts (PoE) teacher model with an analytic score. This teacher model can be sampled directly through classifier-free guidance or used to generate a sample pool for training a new class embedding and low-rank adapter. On held-out composition benchmarks built from ColorMNIST and CelebA, both the analytic PoE sampler and the low-rank adapted model outperform query-only and nearest trained-class baselines. These results suggest that the time-indexed score geometry of the diffusion model contains reusable density-mode concepts that support test-time compositional generation without a predefined concept library.
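
Two ingredients named above, a product of Gaussian experts with an analytic score and gradient ascent on a score to reach a density mode, can be sketched in one dimension. This is a toy under stated assumptions: the paper ascends the learned score $s_θ(x_t,t)$ at several noising timesteps and selects prototypes with a submodular objective, none of which is reproduced here, and the means, variances, step size, and function names are hypothetical.

```python
def poe_gaussian(means, variances):
    """Product of 1-D Gaussian experts: precisions add, and the mean
    is the precision-weighted average of the expert means."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


def gaussian_score(x, mean, variance):
    """Analytic score d/dx log N(x; mean, variance)."""
    return -(x - mean) / variance


def ascend_to_mode(x, mean, variance, step=0.1, iters=200):
    """Plain gradient ascent on the log-density using its score."""
    for _ in range(iters):
        x += step * gaussian_score(x, mean, variance)
    return x


# Two hypothetical prototype Gaussians standing in for discovered concepts.
poe_mean, poe_var = poe_gaussian(means=[0.0, 4.0], variances=[1.0, 1.0])
mode = ascend_to_mode(10.0, poe_mean, poe_var)
print(poe_mean, poe_var, mode)
```

With equal variances the product sits halfway between the two experts and is sharper than either, and ascent from a distant start converges to that combined mode.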

2605.07075 2026-05-11 cs.LG

ModelLens: Finding the Best for Your Task from Myriads of Models

Rui Cai, Weijie Jacky Mo, Xiaofei Wen, Qiyao Ma, Wenhui Zhu, Xiwen Chen, Muhao Chen, Zhe Zhao

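AI summary: As the number of open-source models surges, choosing the best model for a new dataset becomes increasingly difficult. ModelLens is a unified framework that analyzes model performance records from public leaderboards and learns latent representations of model-dataset-metric tuples, so that it can make recommendations for unseen models and datasets without running candidate models on the target dataset. Experiments on a large-scale benchmark show that ModelLens significantly outperforms baselines that rely on metadata or require per-model evaluation, and that it improves the performance of several routing methods.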

Abstract

The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model--dataset--metric tuples, ModelLens ranks unseen models on unseen datasets without running candidates on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools further improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks further confirm generalization to both text and vision-language tasks.

2605.07073 2026-05-11 cs.AI

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

Yubin Kim, Chanwoo Park, Taehan Kim, Eugene Park, Samuel Schmidgall, Salman Rahman, Chunjong Park, Cynthia Breazeal, Xin Liu, Hamid Palangi, Hae Won Park, Daniel McDuff

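AI summary: This study presents TeamBench, a benchmark for evaluating agent coordination under operating system-enforced role separation, with 851 task templates and 931 instances. Tasks are split across Planner, Executor, and Verifier roles, and each role's access is restricted so that no role can do another's work. Experiments show that prompt-only role assignment does not accurately reflect coordination, whereas enforced role separation reveals more detail about collaboration patterns and team value.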

Abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.

2605.07072 2026-05-11 cs.LG cs.CR stat.ML

Less Random, More Private: What is the Optimal Subsampling Scheme for DP-SGD?

Andy Dong, Ayfer Özgür

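AI summary: This paper studies the optimal subsampling scheme for differentially private stochastic gradient descent (DP-SGD). It argues that although conventional Poisson subsampling simplifies privacy analysis, the participation variance it introduces weakens privacy amplification. The authors propose Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in a fixed number of iterations; it achieves stronger privacy amplification than Poisson subsampling and is optimal in the limits where the noise tends to zero and to infinity. Experiments show that BIS reduces the required noise multiplier in low-noise regimes, improving model utility and privacy protection.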

Comments 17 pages, 1 table. Submitted to NeurIPS 2026

Abstract

Poisson subsampling is the default sampling scheme in differentially private machine learning, largely because its unstructured randomness yields tractable privacy amplification analyses. Yet this same randomness introduces substantial participation variance: each sample appears in very different numbers of training iterations. In this work, we show that this variance is not merely a practical artifact to be tolerated, but a fundamental source of suboptimal privacy amplification. We prove that Balanced Iteration Subsampling (BIS), a structured scheme in which each sample participates in exactly a fixed number of iterations, achieves stronger privacy amplification than Poisson subsampling and is optimal at both extremes of the noise spectrum ($σ\to 0$ and $σ\to \infty$). Our analysis reveals that the privacy-noise tradeoff is governed not by maximizing randomness, but by eliminating participation variance while preserving uniform marginal participation across iterations. To translate this asymptotic theory into finite-noise guarantees, we introduce a practical near-exact Monte Carlo accountant for BIS, which removes the analytical slack of existing RDP and composition-based PLD analyses. Evaluations across more than 60 practical DP-SGD configurations show that BIS consistently outperforms Poisson subsampling in the low-noise regimes most relevant for high-utility private training, reducing the required noise multiplier by up to $9.6\%$. These results overturn the common intuition that more sampling randomness necessarily yields stronger privacy amplification: in DP-SGD, structured participation can be both more practical and more private. Our implementation is available at https://github.com/dong-xin-ao-andy/bis-mc-accountant.
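
The participation-variance contrast can be seen in a small simulation. Poisson subsampling includes each sample in each iteration independently, so per-sample participation counts are binomial; the balanced scheme below fixes every count at exactly `k`. The balanced construction used here (each sample draws `k` distinct iterations uniformly at random) is one simple way to satisfy the two properties stated in the abstract and is an assumption, not necessarily the authors' exact BIS procedure; all sizes and names are hypothetical.

```python
import random
from statistics import pvariance


def poisson_counts(n_samples, n_iters, rate, rng):
    """Participation count per sample when each sample joins each
    iteration independently with probability `rate`."""
    return [
        sum(rng.random() < rate for _ in range(n_iters))
        for _ in range(n_samples)
    ]


def balanced_schedule(n_samples, n_iters, k, rng):
    """For each sample, pick exactly k distinct iterations uniformly at random."""
    return [sorted(rng.sample(range(n_iters), k)) for _ in range(n_samples)]


rng = random.Random(0)
n_samples, n_iters, k = 200, 100, 10

poisson = poisson_counts(n_samples, n_iters, k / n_iters, rng)
balanced = [len(s) for s in balanced_schedule(n_samples, n_iters, k, rng)]

poisson_var = pvariance(poisson)
balanced_var = pvariance(balanced)
print(poisson_var, balanced_var)
```

Both schemes give each sample the same expected participation of `k` iterations, but only the Poisson counts spread around that value; the balanced counts have zero variance by construction.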

2605.07068 2026-05-11 cs.CL cs.AI

WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems

Juan M. Huerta

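AI summary: This paper proposes WiCER, an iterative knowledge compilation method that addresses the loss of critical information during knowledge compilation in LLM wiki systems. Inspired by counterexample-guided abstraction refinement (CEGAR), the method evaluates the compiled wiki, identifies dropped facts, and preserves them in subsequent compilations, improving compilation quality. Experiments show that WiCER significantly reduces catastrophic failures and improves wiki system performance at scale.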

Abstract

The LLM Wiki pattern compiles domain knowledge into a persistent artifact and serves it to LLMs via KV cache inference, promising context access at sub-second latency with zero retrieval failure. Realizing this requires solving the compilation gap: LLM compilation must distill raw documents into a wiki without catastrophically discarding critical facts. We characterize this gap across 17 RepLiQA domains (6,800 questions): full-context KV cache inference outperforms RAG on curated knowledge (4.38 vs. 4.08 out of 5, 7.3x faster TTFT) but degrades below RAG at scale due to attention dilution, and blind compilation fails entirely (2.14 to 2.32 vs. 3.46, 53 to 60% catastrophic failure rate). To close the gap, we propose WiCER (Wiki-memory Compile, Evaluate, Refine), an iterative algorithm inspired by counterexample-guided abstraction refinement (CEGAR). WiCER evaluates compiled wikis against diagnostic probes, identifies dropped facts, and forces their preservation in subsequent compilations. One to two iterations recover 80% of lost quality (mean 3.24 vs. 3.47 for raw full-context across the 15 topics with baselines), a 55% relative reduction in catastrophic failures. An ablation across all 17 topics confirms that targeted diagnosis (+0.95), not generic pinning (+0.16), drives the gains. All code and benchmarks are released for reproducible research.

2605.07067 2026-05-11 cs.LG

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

Haozhou Zhang

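AI summary: This paper proposes PolarAdamW, an optimizer designed to disentangle spectral control from Schur gauge-equivariance in matrix optimization. It combines Muon's spectral-norm control with AdamW's coordinatewise preconditioning, preserving spectral control while breaking gauge-equivariance. Experiments show that PolarAdamW outperforms Muon and AdamW on standard transformer tasks, whereas Muon performs better on SO(3)-equivariant tasks with non-trivial multiplicity-basis freedom, confirming the separation between spectral control and Schur gauge-equivariance and their relevance in different settings.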

Abstract

Muon's matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon's polar spectral-norm control but breaks the gauge-equivariance, since AdamW's coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon's Newton-Schulz polar map to AdamW's preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon's polar step is Schur gauge-equivariant on multiplicity matrices while AdamW's coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.
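
The algorithmic step described above, applying a Newton-Schulz polar map to an AdamW-style preconditioned direction, can be sketched in pure Python on a 2x2 matrix. Several simplifications are assumptions: the classic cubic Newton-Schulz iteration is used in place of whatever tuned polynomial Muon or the paper uses, and bias correction, weight decay, and step scaling are omitted. The helper names and the moment values are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_polar(g, iters=30):
    """Approximate the orthogonal polar factor of a full-rank matrix g
    with the cubic iteration X <- 1.5 X - 0.5 X X^T X."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # inside the convergence region of the iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def polar_adamw_direction(m, v, eps=1e-8):
    """Coordinatewise Adam-style preconditioning followed by the polar map."""
    pre = [
        [mi / (math.sqrt(vi) + eps) for mi, vi in zip(m_row, v_row)]
        for m_row, v_row in zip(m, v)
    ]
    return newton_schulz_polar(pre)


# Hypothetical first- and second-moment estimates for one weight matrix.
m = [[2.0, 1.0], [0.0, 1.0]]
v = [[4.0, 1.0], [1.0, 0.25]]
u = polar_adamw_direction(m, v)
gram = matmul(u, transpose(u))
print(gram)
```

The returned direction is numerically orthogonal (its Gram matrix is close to the identity), so every singular value of the update is 1 regardless of how the coordinatewise preconditioner rescaled individual entries.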

2605.07064 2026-05-11 cs.CV

Learning to Track Instance from Single Nature Language Description

Yaozong Zheng, Bineng Zhong, Qihua Liang, Shuimu Zeng, Haiying Xia, Shuxiang Song

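AI summary: This paper studies how to perform vision-language (VL) tracking from natural language descriptions alone, without any bounding-box annotations. The authors propose a self-supervised VL tracking approach and introduce a new tracker, \tracker, that can track any target referred to by a language description. A Dynamic Token Aggregation Module treats visual tokens non-uniformly, improving semantic alignment and tracking performance and enabling self-supervised learning without large-scale annotation. Experiments show that \tracker outperforms state-of-the-art self-supervised methods on several VL tracking benchmarks.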

Comments CVPR 2026

Abstract

How can vision-language (VL) tracking be achieved using natural language descriptions from a video sequence without relying on any bounding-box ground truth? In this work, we achieve this goal by tackling self-supervised VL tracking, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \tracker, a novel self-supervised VL tracker that is capable of tracking any object referred to by a language description. Unlike traditional methods that fuse all language and visual tokens equally, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token unequally. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that \tracker surpasses SOTA self-supervised methods.

2605.07063 2026-05-11 cs.LG cs.AI

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

Pingbang Hu, Xueshen Liu, Z. Morley Mao, Jiaqi W. Ma

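AI summary: This paper studies how to make effective use of scarce, high-quality target data alongside abundant but imperfectly aligned general training data in LLM post-training. It proposes a new framework, Dr. Post-Training, which treats general data as a regularizer that prevents overfitting to the target data rather than merely as a pool for data selection. At each training step, the method builds a set of update directions guided by the general data and projects the update direction specified by the target data onto that set, allowing more flexible bias-variance tradeoffs. Experiments show that the method outperforms existing data selection methods across several post-training tasks and is computationally efficient.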

Abstract

Data selection methods address a critical challenge in LLM post-training: effectively leveraging scarce, high-fidelity target data alongside abundant but imperfectly aligned general training data. In this work, we move beyond the data-selection framing and introduce Dr. Post-Training (Data-Regularized Post-Training), a novel framework that reconceptualizes general training data as a data-induced regularizer that prevents overfitting to the scarce target objective, rather than serving as a pool for selection. Specifically, at each training step, our framework constructs a feasible set of model update directions using the general training data and projects the model update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods arise as special cases with different choices of the data-induced regularizer, and these methods correspond to different points on a bias--variance spectrum with different regularization strength. Building on this view, we propose a family of methods offering a richer design space and more flexible bias--variance tradeoffs. For practical LLM-scale use, we introduce careful system optimizations that realize these methods with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that our methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.

2605.07058 2026-05-11 cs.CL cs.AI

MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments

Yicheng Gao, Xiaolin Zhou, Yahan Li, Yue Zhao, Ruishan Liu

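AI summary: To capture the interactivity and uncertainty of real-world clinical diagnosis, this study proposes MedExAgent, a new reinforcement learning framework that models diagnosis as a partially observable Markov decision process (POMDP) with three action types: questioning the patient, ordering exams, and issuing a diagnosis. It introduces a systematic noise model with seven patient noise types and three exam noise types, and uses a two-stage training pipeline combining supervised fine-tuning with reward optimization, so that MedExAgent achieves diagnostic performance comparable to larger models while keeping its examination strategy cost-efficient.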

Abstract

Real-world clinical diagnosis is a complex process in which the doctor is required to obtain information from both interaction with the patient and conducting medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as noisy and incomplete information that can happen at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single-turn question answering, noise-free conversations, or sequential exam making, etc., ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, MedExAgent, through a two-stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary-Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.

2605.07057 2026-05-11 cs.LG

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure

Jiamin Xu, Jacqueline Maasch, Kyra Gan

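AI Summary This study examines how to integrate causal graph structure into deep reinforcement learning in order to construct minimal state representations that satisfy the Markov property. The authors propose MOSE, which feeds multi-order historical state constructions into the same Q-function and improves performance. They find that the minimal state representation alone does not improve performance, and that controlled redundancy is needed to exploit causal state information, which offers guidance for causal deep reinforcement learning.

Details
English abstract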

Online reinforcement learning (RL) relies on the Markov property for guaranteed performance, but real-world applications often lack well-defined states given raw observed variables. While causal RL has attracted growing interest, existing work typically assumes Markovian states are provided and focuses on using causality to accelerate learning, leaving a fundamental gap: given a longitudinal causal graph over observed variables, how does one construct MDP states that provably satisfy the Markov property? We address this by providing a procedure that constructs a provably minimal state representation. In deep RL, we observe that the minimal representation alone empirically fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. To address this, we propose MOSE (Multi-Order State Exposure), which feeds multi-order historical state constructions into the same $Q$-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. Our results establish a core principle for causal deep RL: minimal sufficiency is not enough, and controlled redundancy is necessary to unlock the benefit of causal state information.

2605.07055 2026-05-11 cs.CV cs.AI

Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness

Qiangqiang Wu, Grace McIlvain, Zhou Yu, Junhao Wen

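AI Summary This study proposes Pan-FM, a pan-organ foundation model intended to make multimodal medical imaging models more robust to missing data. The model is pre-trained on imaging from seven organs and introduces Saliency-Guided Masking (SGM) to reduce reliance on dominant organs and encourage more balanced whole-body learning. Experiments show that Pan-FM outperforms single-organ and multi-organ baselines on disease prediction, with stronger generalization in missing-data scenarios.

Details
English abstract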

Foundation models (FMs) have shown great promise in medical imaging, but most FMs are trained on unimodal data within isolated domains, such as brain MRI alone. Human aging and disease arise through coordinated biological processes across organs, therefore motivating multimodal FMs that learn whole-body representations. A key challenge, however, is that real-world multimodal biomedical data are often missing not at random, which can reduce power, limit generalizability, and introduce bias. We propose Pan-FM, a pan-organ foundation model pre-trained on imaging from seven organs (Brain, Heart, Adipose, Liver, Kidney, Spleen, and Pancreas) under realistic missing-organ scenarios. Pan-FM uses a unified backbone that handles organ missingness during both training and inference, and is pre-trained with masking-based self-distillation. We find that naive multimodal pre-training leads to dominant-organ shortcut learning bias, with the model over-relying on dominant organs such as adipose and heart. To address this, we introduce Saliency-Guided Masking (SGM), which uses the model attention distribution to adaptively mask dominant organs during pre-training, thus encouraging more balanced cross-organ, whole-body learning. Notably, SGM introduces negligible computational overhead and can be seamlessly integrated into existing self-supervised learning frameworks to improve multi-organ representation learning. On the UK Biobank, Pan-FM achieves stronger prediction across 13 disease categories and 14 single disease entities than single-organ and multi-organ baselines, with improved robustness under missing-organ settings. Pan-FM serves as a scalable solution to realistic modality-missingness in multimodal learning in system neuroscience and as a step toward more generalizable whole-body FMs.

2605.07051 2026-05-11 cs.CL

NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

George Boateng, Naafi Ibrahim, Samuel John, Philemon Badu, Patrick Agyeman-Budu, Jonathan Mensah, Kevin Yeboah, William Edor, Andrew Mensa-Onumah, Nana Yeboah, Victor Wumbor-Apin Kumbol

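AI Summary This paper presents NSMQ Riddles, a new benchmark of scientific and mathematical riddles drawn from Ghana's National Science and Maths Quiz (NSMQ), designed to evaluate the scientific and mathematical reasoning of large language models. The benchmark contains about 1.8K riddles, each with at least three clues. Answers are mostly a number, word, or short phrase, which allows automatic evaluation. Experiments show that even state-of-the-art language models perform worse than the best NSMQ student contestants, underscoring how challenging the benchmark is and its value for evaluating science-education models globally.

Comments 15 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education

Details
English abstract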

Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana's National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs' capabilities for science and mathematics education.
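
Because the answers are usually a number, word, or short phrase, scoring can be automated with a normalized match. The sketch below is one hypothetical way to do this. The function names `normalize` and `is_correct` and the normalization rules are assumptions for illustration, not the benchmark's actual scorer.

```python
import math
import re


def normalize(answer):
    """Lowercase, drop most punctuation, and collapse whitespace."""
    text = answer.strip().lower()
    text = re.sub(r"[^\w\s.\-]", "", text)  # keep word chars, spaces, '.', '-'
    return re.sub(r"\s+", " ", text)


def is_correct(prediction, gold):
    """Compare numerically when both sides parse as numbers, else as text."""
    pred, ref = normalize(prediction), normalize(gold)
    try:
        return math.isclose(float(pred), float(ref))
    except ValueError:
        return pred == ref
```

A stricter or looser rule (for example, accepting listed aliases) would change the reported accuracy, so any such choice should be stated alongside results.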

2605.07049 2026-05-11 cs.LG cs.AI

Towards Differentially Private Reinforcement Learning with General Function Approximation

Yi He, Xingyu Zhou

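AI Summary This paper gives the first theoretical guarantees for differentially private online reinforcement learning with general function approximation, going beyond earlier work limited to tabular and linear settings. It combines a batched policy update scheme with the exponential mechanism and a new regret analysis, showing that in the model-free setting the regret under general function approximation matches the state of the art for the linear case, scaling as $\widetilde{O}(K^{3/5})$. The study also uncovers fundamental gaps in recent results on private RL with linear function approximation, clarifying the landscape of the field.

Details
English abstract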

We present the first theoretical guarantees for differentially private online reinforcement learning (RL) with general function approximation, extending beyond prior work restricted to tabular and linear settings. Our approach combines a batched policy update scheme with the exponential mechanism, together with a novel regret analysis. We show that, even under general function approximation, the regret in the model-free setting under differential privacy matches the state of the art for the linear case, scaling as $\widetilde{O}(K^{3/5})$, where $K$ denotes the number of episodes. As an important by-product, we also establish the first regret bound for online RL with batch update that depends on the standard complexity measure of coverability, complementing existing results based on a newly introduced Eluder-Condition class. In addition, we uncover fundamental gaps in recent results for private RL with linear function approximation, thereby clarifying its landscape.
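
For readers unfamiliar with the exponential mechanism mentioned above, here is a minimal textbook sketch. It is not the paper's algorithm: the abstract does not say how the mechanism is applied inside the batched policy update, and the function name, the `utility` callable, and the `sensitivity` parameter are assumptions introduced here.

```python
import math
import random


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=random):
    """Sample a candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    top = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    return choice, probs
```

With `epsilon = 0` the selection is uniform and ignores the utilities entirely. Larger `epsilon` concentrates probability on high-utility candidates at the cost of weaker privacy.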

2605.07048 2026-05-11 cs.LG cs.AI

Unlocking High-Fidelity Molecular Generation from Mass Spectra via Dual-Stream Line Graph Diffusion

Xujun Che, Xiuxia Du, Depeng Xu

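AI Summary Reconstructing high-fidelity molecular structures from tandem mass spectra is an inverse problem with a circular dependency, and existing methods are limited by weak synchronization between atom and bond information. This paper proposes DualLGD, a dual-stream line graph diffusion model that decomposes molecular graph denoising into two coupled subproblems, one at the atom level and one at the bond level, each handled in its own representation space. Bidirectional cross-attention synchronizes the two streams so that each atom attends only to its incident bonds and vice versa. The method substantially improves generation accuracy on multiple benchmarks and surpasses the previous best models.

Details
English abstract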

De novo molecular generation from tandem mass spectra is a challenging inverse problem whose core difficulty lies in the circular dependency between atom-level and bond-level reasoning: determining a bond's type requires knowing its endpoint atoms' chemical environment, yet an atom's environment is in turn defined by its incident bonds. Existing graph diffusion methods process atoms and bonds within a single computation stream, where atom-bond information synchronization can only occur implicitly across layers. We argue that this single-stream paradigm, rather than the choice of any particular aggregation kernel, is a key architectural bottleneck. We propose DualLGD (Dual-stream Line Graph Diffusion), which reformulates molecular graph denoising as the alternating solution of two coupled subproblems: atom-level reasoning and bond-level reasoning, each operating in its own dedicated representation space. The line graph provides a natural mathematical construction for the bond space, in which bond angles, dihedrals, conjugation chains, and rings correspond to local topological motifs between bonds. Incidence-constrained bidirectional cross-attention synchronizes the two streams at every layer, ensuring that each atom attends only to its incident bonds and vice versa, respecting the fundamental chemical principle that an atom's environment is determined by its bonding context. On the NPLIB1 and MassSpecGym benchmarks, DualLGD achieves top-1 accuracy of 34.37% and 23.89%, approximately $3\times$ the previous state of the art. Ablation studies confirm the architecture as the primary source of improvement: DualLGD without any pre-training already surpasses the previous best fully pretrained model.
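
The line graph construction referred to above can be sketched in a few lines: each bond becomes a node, and two bonds are connected when they share an atom. This illustrates only the standard graph-theoretic definition, not DualLGD itself. The function name `line_graph` and the integer atom indices are assumptions made for the example.

```python
from itertools import combinations


def line_graph(bonds):
    """Return (edges, incident) for a molecule given as a list of atom pairs.

    edges: sorted pairs of bond indices that share an atom.
    incident: atom index -> list of indices of bonds touching that atom.
    """
    incident = {}
    for idx, (a, b) in enumerate(bonds):
        incident.setdefault(a, []).append(idx)
        incident.setdefault(b, []).append(idx)
    edges = set()
    for bond_ids in incident.values():
        for i, j in combinations(bond_ids, 2):
            edges.add((min(i, j), max(i, j)))
    return sorted(edges), incident
```

For a four-atom chain `[(0, 1), (1, 2), (2, 3)]`, the line graph links bond 0 to bond 1 and bond 1 to bond 2. The `incident` map is the kind of atom-to-bond incidence that an incidence-constrained attention mask could be built from.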

2605.07042 2026-05-11 cs.AI cs.LG

The Context Gathering Decision Process: A POMDP Framework for Agentic Search

Chinmaya Kausik, Adith Swaminathan, Nathan Kallus

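AI Summary This paper proposes the Context Gathering Decision Process (CGDP), a framework for the context window limits that large language model (LLM) agents face when searching complex environments. It models search as a partially observable Markov decision process and introduces a predicate-based belief state and a programmatic termination mechanism, which improve the agent's multi-hop reasoning and search efficiency. Experiments show that the approach improves performance and reduces redundant computation across several question-answering tasks.

Comments 25 pages

Details
English abstract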

Large Language Model (LLM) agents are deployed in complex environments (such as massive codebases, enterprise databases, and conversational histories) where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent's working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g., repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent's objective is to adaptively refine its belief state to isolate the necessary information for a task. We model an LLM's behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate-based method that decomposes an LLM's implicit search into explicit and modular operations. We then derive two plug-and-play interventions for iterative LLM agents: a persistent, predicate-based belief state that bounds context while preserving multi-hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question-answering domains, we empirically validate that replacing an LLM's implicit state with our CGDP-motivated belief state improves multi-hop reasoning by up to 11.4%, while the modular programmatic exhaustion detection saves up to 39% of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non-interfering improvements to agentic search harnesses.

2605.07041 2026-05-11 cs.RO cs.CV

Dr-BA: Separable Optimization for Direct Radar Bundle Adjustment & Localization

Daniil Lisus, Cedric Le Gentil, Timothy D. Barfoot

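AI Summary This paper proposes Dr-BA, a radar bundle adjustment (BA) framework that operates directly on 2D spinning radar intensity images. Unlike conventional methods that extract sparse point clouds from radar data, Dr-BA uses the radar returns from multiple scans to jointly estimate dense maps and sensor poses, and a separable optimization decouples pose estimation from mapping to give an efficient and general solution. The method extends naturally from radar bundle adjustment to direct radar localization within an existing map, and experiments on more than 200 km of on-road data across several distinct routes show state-of-the-art performance.

Comments Accepted for presentation at RSS 2026

Details
English abstract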

This paper introduces Dr-BA, a first-of-its-kind radar bundle adjustment (BA) framework that operates directly on 2D spinning radar intensity images. Unlike camera or lidar sensors, radar is largely unaffected by precipitation, making it a critical modality for autonomous systems that require all-weather robustness. Existing state estimation approaches using spinning radar typically extract sparse point clouds from range-azimuth-intensity measurements and apply point cloud alignment techniques to estimate vehicle motion, scene structure, or to localize within an existing map. In contrast, Dr-BA uses the full radar returns from multiple scans to jointly estimate dense maps and sensor poses. By formulating the problem as a separable optimization, we derive an efficient and general solution that decouples pose estimation from mapping. In addition to solving the BA problem, this formulation naturally extends to direct radar-only localization (DRL) within a previously built map. Dr-BA achieves state-of-the-art radar-based BA and cross-session localization performance, demonstrated on more than 200 km of on-road data across five distinct routes. Our implementation is publicly available at https://github.com/utiasASRL/dr_ba.

2605.07040 2026-05-11 cs.CL cs.AI cs.CY

Cognitive Agent Compilation for Explicit Problem Solver Modeling

Hyeongdon Moon, Carolyn Rosé, John Stamper

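AI Summary This study proposes Cognitive Agent Compilation (CAC), a framework that addresses the difficulty of constraining and controlling large language models in educational settings. Drawing on ideas from cognitive architectures, CAC uses a strong teacher language model to compile problem-solving knowledge into an explicit, editable agent, separating and making explicit the knowledge state, the problem-solving policy, and the verification rules. The approach gives educational systems a more inspectable and editable knowledge model and is a step toward bounded-knowledge AI.

Comments Accepted to AIED 2026 Blue Sky

Details
English abstract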

Large language models (LLMs) are widely used for tutoring, feedback generation, and content creation, but their broad pretraining makes them hard to constrain and poor substitutes for controllable learners. Educational systems often require inspectable and editable knowledge states: educators want to know what a system assumes the learner knows, and learners benefit when the system can justify actions in terms of explicit skills, misconceptions, and strategies. Inspired by cognitive architectures, we propose Cognitive Agent Compilation (CAC), a framework that uses a strong teacher LLM to compile problem-solving knowledge into an explicit target agent. CAC separates (i) knowledge representation, (ii) problem-solving policy, and (iii) verification and update rules, with the goal of making bounded problem solving more inspectable and editable in educational settings. We present an early proof of concept implemented with Small Language Models that surfaces key design trade-offs, particularly between explicit control and scalable generalization, and positions CAC as an initial step toward bounded-knowledge AI for educational applications.

2605.07039 2026-05-11 cs.LG

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Shuo Chen, Zhankui He, Noveen Sachdeva, Weili Wang, Ed H. Chi, Shivaram Venkataraman, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang

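AI Summary This paper proposes PACEvolve++, a reinforcement learning framework that improves how evolutionary search agents adapt their policy at test time. The method decouples strategic search decisions from implementation: a trainable advisor model generates, assesses, and selects hypotheses, and a stronger frontier model turns them into executable candidates. The study also proposes a phase-adaptive training strategy that lets the advisor adjust its optimization to different phases of the evolutionary process, yielding faster convergence and more stable test-time training across several tasks.

Details
English abstract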

Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and research tasks, where evaluations are expensive, and progress depends on learning task-specific search dynamics. We introduce PACEvolve++, an advisor-model reinforcement learning framework for test-time policy adaptation in evolutionary search agents. PACEvolve++ decouples strategic search decisions from implementation: a trainable advisor generates, assesses, and selects hypotheses, while a stronger frontier model translates selected hypotheses into executable candidates. To train the advisor under non-stationary feedback, we propose a phase-adaptive approach that adapts its optimization strategy to different phases of the evolutionary process. Early in evolution, it uses group-relative feedback to learn broad search preferences; later, as reward gaps compress, it emphasizes best-of-$k$ frontier contribution to support stable refinement. Across expert-parallel load balancing, sequential recommendation, and protein fitness extrapolation, PACEvolve++ outperforms the state-of-the-art evolutionary search framework with frontier models, achieving faster convergence and stabilizing test-time training during evolutionary search.

2605.07038 2026-05-11 cs.LG cs.MA cs.RO

Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation

Aditya Sai Ellendula, Yi Wang, Chandrajit Bajaj

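AI Summary This paper studies how to make risk-aware navigation policies selective, so that evasive actions are activated only when a safer feasible path exists and unnecessary maneuvers are otherwise suppressed. The core method adds a context-energy term to a port-Hamiltonian navigation policy, producing a force field with a falsifiable selectivity signature, and uses a conditional value-at-risk (CVaR) objective to focus gradient updates. Experiments show that the method markedly improves navigation safety and success rates in several settings while reducing false activations and catastrophic failures.

Details
English abstract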

Risk-aware navigation should be selective: a policy should expose evasive degrees of freedom only when the local scene admits a lower-risk feasible maneuver, and suppress them when no safer alternative exists. We show that adding one context-energy term to a port-Hamiltonian navigation policy produces a learned force channel with exactly this falsifiable signature. When the local risk field contains a feasible lower-risk direction, the induced context force activates toward it; when the apparent escape is blocked or not yet available, a route-aware gate suppresses lateral force rather than hallucinating an unsafe maneuver. A CVaR tail-risk objective focuses gradient updates on rare but consequential risk transitions. We validate the selectivity signature across four settings. In the primary delayed-required-escape benchmark, route-aware CVaR reduces premature force activation from 0.950 to 0.180 versus DWA while raising success from 0.480 to 0.810 with zero replans. On real off-road terrain (RELLIS-3D), route-aware enrichment achieves correct activation rate 0.837 and false activation rate 0.114, compared to 0.378/0.752 for scalar risk gradients. On static semantic maps (DFC2018), enrichment reduces catastrophic failure from 0.60 to 0.10 and oscillation by 90.7% while preserving path efficiency. In highway traffic, collisions drop from 100% to 0% when a lane escape is feasible; when no escape exists, the policy suppresses the lateral maneuver. The selectivity property follows from the gradient structure of the context energy rather than from training-time tuning.
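
The CVaR tail-risk objective mentioned above builds on a standard quantity: the mean of the worst fraction of outcomes. The sketch below shows the usual empirical estimator only. How the paper embeds it in training is not described in the abstract, and the function name `cvar` and the convention that larger losses are worse are assumptions made here.

```python
import math


def cvar(losses, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(losses, reverse=True)  # worst outcomes first
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

Setting `alpha = 0` recovers the ordinary mean. Pushing `alpha` toward 1 makes the estimate depend only on the rarest, most costly outcomes, which is why such an objective emphasizes rare but consequential transitions.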

2605.07037 2026-05-11 cs.RO

Intention assimilation control for accurate tracking with variable impedance in teleoperation

Atsushi Takagi, Yanan Li, Hiroaki Gomi, Etienne Burdet

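AI Summary This paper studies the trade-off between tracking accuracy and safety in teleoperation and proposes a new intention assimilation control (IAC) strategy that maintains tracking accuracy without high stiffness. The method estimates the leader's target position and transmits it to the follower robot, so the follower's impedance can be adjusted on the fly to match task requirements or the user's intention. Experiments show that IAC achieves higher tracking accuracy, task completion rates, and efficiency across several tasks, offering more flexible and precise control for teleoperation.

Details
English abstract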

Robot systems for teleoperation commonly use a spring-like force pulling the follower robot towards the leader's position to track their movements. With this control strategy, the tracking accuracy deteriorates when the follower's stiffness is low, but high stiffness poses a danger to objects or people in the follower robot's environment. To address this trade-off between tracking accuracy and safety, we propose an alternative intention assimilation control (IAC) strategy where the robot's tracking accuracy can be ensured without high stiffness. Different from traditional approaches, which transmit the leader's current position to the follower, this new controller estimates the leader's target position and transmits it to the follower. With this strategy, the follower impedance can be changed on-the-fly to continuously reflect the user's desired impedance or modulated automatically to fulfill the task requirements. Our controller was validated on two 7-degree-of-freedom manipulators, yielding high tracking accuracy with varying impedance. Four experiments were conducted to compare teleoperation with IAC to tele-impedance control (TIC) during free tracking, interaction with a balloon, peg insertion, and table polishing with force feedback. The results show that IAC increases tracking accuracy, improves task completion rate, and reduces completion time. IAC enables the robot to accurately replicate the user's movement while giving them freedom to modulate the impedance according to their intention, providing an unprecedented level of control of the follower's position and its impedance during unilateral and bilateral teleoperation.
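
The accuracy-versus-stiffness trade-off of the conventional spring-like scheme can be made concrete with a simplified one-dimensional model. This is a sketch under stated assumptions, not the paper's controller: the symbols $K$ (follower stiffness), $x_L$ and $x_F$ (leader and follower positions), and $F_{\mathrm{ext}}$ (a constant external load on the follower) are introduced here, and damping and inertia are ignored.

```latex
F = K\,(x_L - x_F), \qquad
F = F_{\mathrm{ext}} \;\Longrightarrow\; x_L - x_F = \frac{F_{\mathrm{ext}}}{K}
```

At equilibrium the spring force balances the load, so the steady-state tracking error is inversely proportional to $K$. Halving the stiffness doubles the error, while raising the stiffness increases the force the follower applies for a given position error.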

2605.07023 2026-05-11 cs.CV

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu

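AI Summary In many practical 6D object pose estimation scenarios, only a single real RGB-D reference view is available per object, often without a CAD model. This paper proposes OneViewAll, a semantic-prior-guided framework that performs single-view pose estimation without CAD models through a new Project-and-Compare paradigm. The method refines pose estimates progressively using three levels of semantic priors: category- and scene-level priors, object symmetry priors, and patch-level priors, improving accuracy on symmetric, texture-less, and occluded objects. Experiments show that OneViewAll reaches 92.5% ADD-0.1 accuracy on the LINEMOD dataset using only one real reference view, significantly outperforming the CVPR 2025 baseline.

Details
English abstract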

In many practical 6D object pose estimation scenarios, we often have access to only a single real-world RGB-D reference view per object, typically without CAD models. Existing methods largely rely on explicit 3D models or multi-view data, which limits their scalability. To address this challenging single-reference model-free setting, we propose OneViewAll, a semantic-prior-guided framework that performs pose estimation via a novel Project-and-Compare paradigm. Instead of relying on computationally expensive CAD-based rendering, our method directly aligns reference and query observations within a projection-equivariant space. OneViewAll progressively integrates hierarchical semantic priors across three levels: (1) category- and scene-level priors for efficient hypothesis initialization; (2) object-level symmetry priors for geometry completion via mirror fusion; and (3) patch-level priors for discriminative refinement. Extensive experiments demonstrate that OneViewAll achieves 92.5% ADD-0.1 accuracy on the LINEMOD dataset using only one real reference view, significantly outperforming the CVPR 2025 baseline One2Any (52.6%). It also yields consistent improvements on YCB-V, Real275, and Toyota-Light while maintaining low inference latency. Our results underscore the efficacy of symmetry-aware projection in handling symmetric, texture-less, and occluded objects.

2605.07020 2026-05-11 cs.LG cs.AI

FlashMol: High-Quality Molecule Generation in as Few as Four Steps

Xinyuan Wei, Zian Li, Shaoheng Yan, Cai Zhou, Muhan Zhang

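AI Summary Generating chemically valid 3D molecular conformations is critical for computational drug discovery. Classical diffusion-based models such as GeoLDM perform well but need hundreds of generation steps, which limits large-scale virtual screening. This paper proposes FlashMol, an ultra-fast generative model that produces high-quality molecular conformations in as few as 4 steps. It adapts distribution matching distillation and adds a regularization strategy to improve both generation efficiency and diversity, and experiments show sampling up to 250 times faster than the original model while maintaining molecular quality.

Details
English abstract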

Generating chemically valid 3D molecular conformations is critical for computational drug discovery. Classical diffusion-based models like GeoLDM perform well but require hundreds of steps, making large-scale in silico screening impractical. Recent efforts on few-step molecular generation have accelerated this process to 12-50 steps, but often at a substantial cost to sample stability. In this work, we present FlashMol, an ultra-fast molecule generative model producing high-quality molecular conformations in as few as 4 steps. To achieve this, we adapt distribution matching distillation (DMD), a reverse KL-divergence minimization objective, to the molecular domain for effective distillation. Considering the local minimization behavior of DMD, we respace the molecule generation timesteps, providing the generator with much better initialization and enabling effective distillation. Additionally, to mitigate the mode-seeking behavior of DMD and improve diversity, we further regularize it with a Jensen-Shannon divergence term, which incorporates the mean-seeking behavior of the forward KL divergence. Extensive experiments on QM9 and GEOM-DRUG datasets demonstrate that FlashMol matches and even surpasses the original 1000-step teacher, achieving up to $250\times$ acceleration in sampling speed while maintaining high molecular quality.
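
The divergences named above can be illustrated on small discrete distributions. The sketch below is not FlashMol's training loss, which is defined over diffusion model distributions. It only shows the textbook definitions of KL and Jensen-Shannon divergence, with the function names `kl` and `jsd` chosen here for the example.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    mix = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

KL is asymmetric, so `kl(p, q)` and `kl(q, p)` generally differ, which is what separates the reverse (mode-seeking) and forward (mean-seeking) objectives. The Jensen-Shannon divergence is symmetric and bounded above by `math.log(2)`.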