arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

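AI summary: Compares open-weight vision-language models with traditional OCR baselines on low-resource Ancient Greek critical editions, finds that VLMs produce fluent text even when wrong, indicating reliance on language priors, and introduces perturbations and token-level grounding measures to analyze visual evidence.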

English abstract

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
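
A token-level grounding measure of the kind described above can be sketched as a divergence between the next-token distribution decoded with the image and the one decoded without it. The abstract does not give the exact formula, so the KL divergence used here, the function names, and the toy probabilities are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def grounding_scores(with_image, without_image):
    """Per-token divergence between image-conditioned and image-free distributions."""
    return [kl_divergence(p, q) for p, q in zip(with_image, without_image)]

# Toy 3-symbol vocabulary and two decoding steps (hypothetical numbers).
with_image = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33]]
without_image = [[0.34, 0.33, 0.33], [0.34, 0.33, 0.33]]

scores = grounding_scores(with_image, without_image)
# Step 0: the image changes the prediction a lot, so the score is high.
# Step 1: the prediction is identical with or without the image, so the score is zero.
```

Under this reading, a fluent but wrong token with a score near zero would have been produced from the language prior alone, while a wrong token with a high score was still conditioned on the image.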

2605.27748 2026-05-28 cs.CV cs.AI cs.LG

Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

Niccolò Ferrari, Oligert Osmani, Evelina Lamma

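AI summary: Proposes Mahalanobis PatchCore, which improves PatchCore through covariance estimation and streaming processing, lowering peak memory while preserving performance and improving industrial inspection accuracy.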

Comments 57 pages, 7 figures

English abstract

Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986.
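
The whitening step can be illustrated with a small stdlib-only sketch: after transforming embeddings with the Cholesky factor of the covariance, plain squared Euclidean distance equals squared Mahalanobis distance. The ridge-style regulariser, the function names, and the 2-D toy covariance are assumptions for illustration, not the paper's implementation.

```python
import math

def regularise(cov, lam):
    """Add lam to the diagonal (a simple ridge-style regulariser; an assumption)."""
    return [[c + (lam if i == j else 0.0) for j, c in enumerate(row)]
            for i, row in enumerate(cov)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def whiten(x, mean, low):
    """Solve L z = (x - mean) by forward substitution."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    z = []
    for i in range(len(d)):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((d[i] - s) / low[i][i])
    return z

def sq_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

cov = [[4.0, 2.0], [2.0, 3.0]]
mean = [0.0, 0.0]
low = cholesky(regularise(cov, 0.0))  # lam = 0 here so the result can be checked by hand
query, bank_item = [1.0, 2.0], [0.0, 0.0]
d2 = sq_euclidean(whiten(query, mean, low), whiten(bank_item, mean, low))
# d2 equals (query - bank_item)^T cov^{-1} (query - bank_item) = 11 / 8
```

Because the transformation is applied once to the memory bank and once per query, an ordinary Euclidean nearest-neighbour index can be reused unchanged.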

2605.27744 2026-05-28 cs.AI

A Policy-Driven Runtime Layer for Agentic LLM Serving

Rui Zhang, Chaeeun Kim, Liting Hu

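AI summary: To address the difficulty of implementing cross-layer policies efficiently in multi-agent LLM systems, proposes inserting an agent runtime layer between the framework and the engine that supports arbitrary agent-aware policies through four primitives, validated on the KV-cache policy CacheSage.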

English abstract

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
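
The four primitives can be pictured with a toy policy that learns agent-to-agent transition counts online and evicts the cache entry least likely to be needed next. The class name, method signatures, and one-step eviction rule below are hypothetical simplifications; the abstract does not specify CacheSage's actual interface or its survival-based scoring.

```python
from collections import defaultdict

class AgentRuntime:
    """Toy observe/score/predict/act loop keyed by agent identity (hypothetical API)."""

    def __init__(self):
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.last_agent = None

    def observe(self, agent_id):
        """Record an engine-level event tagged with the agent that caused it."""
        if self.last_agent is not None:
            self.transitions[self.last_agent][agent_id] += 1
        self.last_agent = agent_id

    def predict(self, agent_id):
        """Estimated probability of each agent running right after agent_id."""
        row = self.transitions[agent_id]
        total = sum(row.values())
        return {nxt: n / total for nxt, n in row.items()} if total else {}

    def score(self, cached_agent):
        """Chance that cached_agent's KV entries are needed on the next step."""
        if self.last_agent is None:
            return 0.0
        return self.predict(self.last_agent).get(cached_agent, 0.0)

    def act(self, cached_agents):
        """Pick the cached agent least likely to run next as the eviction victim."""
        return min(cached_agents, key=self.score)

runtime = AgentRuntime()
for agent in ["planner", "coder", "planner", "coder", "planner", "reviewer", "planner"]:
    runtime.observe(agent)

victim = runtime.act(["coder", "reviewer"])
# After "planner", "coder" followed twice and "reviewer" once, so "reviewer" is evicted.
```

The same `predict` output could drive prefetching by loading the most likely next agent's prefix ahead of time.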

2605.27741 2026-05-28 cs.CL

Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo

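AI summary: Targets the late-stage modality collapse that arises when reinforcement learning post-training of multimodal LLMs applies uniform policy gradients that ignore modality dependence. Proposes the Modality-Aware Policy Optimization (MAPO) framework, which uses a modality relevance mask and an auxiliary attention loss branch to focus gradients dynamically and sustain cross-modal reasoning, setting new state-of-the-art results on complex audio reasoning benchmarks.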

English abstract

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model's internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.

2605.27740 2026-05-28 cs.CL

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li

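AI summary: Proposes the UNIQUE framework, which combines a page-importance score based on key means and standard deviations with soft-mask sparsity-aware training to provide efficient sparse attention over the KV cache in long-context LLM inference, with substantial speedups while preserving task performance.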

English abstract

Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page's keys as a representative vector with their standard deviation as an offset term. To further close the train-inference gap, this paper introduces a soft-mask sparsity-aware training scheme that uses the top-k score boundary as a per-query threshold and a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance on long-context benchmarks such as LongBench Pro and on long-form speech recognition, while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.
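
A rough sketch of page-level scoring and the soft mask follows. The abstract says only that the score combines the mean of a page's keys with their standard deviation as an offset; the specific combination used here (dot product with the mean plus an absolute-value-weighted spread term), the function names, and the temperature parameter are assumptions.

```python
import math

def page_stats(keys):
    """Per-dimension mean and population standard deviation of one page of key vectors."""
    n, dim = len(keys), len(keys[0])
    mean = [sum(k[d] for k in keys) / n for d in range(dim)]
    std = [math.sqrt(sum((k[d] - mean[d]) ** 2 for k in keys) / n) for d in range(dim)]
    return mean, std

def page_score(query, mean, std):
    """Assumed form: q . mean plus an |q|-weighted spread offset."""
    return sum(q * m + abs(q) * s for q, m, s in zip(query, mean, std))

def top_k_pages(query, pages, k):
    """Return the indices of the k highest-scoring pages and all page scores."""
    scores = [page_score(query, *page_stats(p)) for p in pages]
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores

def soft_mask(scores, k, temperature=1.0):
    """Sigmoid gate centred on the k-th largest score (the per-query threshold)."""
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]

pages = [
    [[1.0, 0.0], [1.0, 0.0]],    # tight page aligned with the query
    [[-1.0, 0.0], [-1.0, 0.0]],  # tight page opposed to the query
    [[0.0, 1.0], [2.0, -1.0]],   # spread-out page containing one strong match
]
selected, scores = top_k_pages([1.0, 0.0], pages, k=2)
mask = soft_mask(scores, k=2)
```

The spread term lets a page with a mediocre mean but one strongly matching key outrank a uniformly average page, which is the intuition behind adding an offset to a mean-only score.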

2605.27739 2026-05-28 cs.LG cs.AI

Worker Disagreement Reveals Sharp Directions in Local SGD

Tolga Dimlioglu, Kristi Topollai, Anna Choromanska

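AI summary: Shows through theoretical analysis and experiments that the worker-average gap covariance in Local SGD captures the sharp directions of the Hessian, providing a cheap Hessian-free estimator.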

Comments 5 pages main body, 18 pages appendix - Accepted to HiLD 2026, ICML

English abstract

Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a large flatter bulk. Gradients tend to align disproportionately with these dominant directions, although stable progress often requires movement through flatter bulk directions. Estimating the dominant subspace is therefore useful but costly with direct Hessian-based methods. We show that standard Local SGD exposes this geometry through worker disagreement. We theoretically show that the worker-average gap covariance is shaped by stochastic-gradient noise and Hessian curvature, causing workers to disagree along sharp, curvature-sensitive directions. Thus, worker-average gaps provide a cheap Hessian-free estimator of the dominant subspace. Experiments on MLPs, CNNs, and Transformers show that subspaces formed by worker-average gaps capture a substantial fraction of the gradient component lying in the dominant Hessian eigenspace.

2605.27737 2026-05-28 cs.CV

Bounded-Compute Multimodal Regression for Product-Rating Prediction

William Leach, Ru He, Sizhuo Ma, Yizhen Jia, Min Cao, Jian Wang, Rick Cao

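AI summary: For scalar regression under strict latency budgets, proposes a bounded-compute adaptation that replaces the language-model head with a lightweight MLP and fixes the inputs, yielding efficient multimodal regression in the LoViF 2026 challenge.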

Comments Accepted to the LoViF Workshop at CVPR 2026. 8 pages, 2 figures

English abstract

Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.

2605.27736 2026-05-28 cs.LG cs.CV

Explicit Critic Guidance for Aligning Diffusion Models

Zhengyang Liang, Qihang Zhang, Ceyuan Yang

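AI summary: Proposes a state-aligned latent actor-critic framework in which the diffusion model itself serves as a timestep-conditioned value function, enabling trajectory-level PPO training and inference-time steering and outperforming prior methods on single- and multi-reward benchmarks.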

English abstract

Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.
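
To make per-step credit assignment concrete, the sketch below computes advantages along a denoising trajectory from critic values at each noisy latent and a single reward on the final image. It uses generalised advantage estimation, a common PPO ingredient; the abstract does not say which estimator the authors use, so GAE, the function name, and the toy values are assumptions.

```python
def gae_advantages(values, final_reward, gamma=1.0, lam=1.0):
    """Generalised advantage estimates for a trajectory with one terminal reward.

    values[t] is the critic's estimate at denoising step t (t = 0 is the noisiest latent).
    """
    steps = len(values)
    advantages = [0.0] * steps
    running = 0.0
    for t in reversed(range(steps)):
        reward = final_reward if t == steps - 1 else 0.0
        next_value = values[t + 1] if t + 1 < steps else 0.0
        delta = reward + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

values = [0.2, 0.4, 0.7]  # hypothetical critic outputs on three noisy latents
advantages = gae_advantages(values, final_reward=1.0)
# With gamma = lam = 1 each advantage reduces to final_reward - values[t].
```

Lowering `lam` shifts each step's advantage toward its local TD error, which is how a critic defined on noisy latents can give different denoising steps different credit.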

2605.27734 2026-05-28 cs.LG

Learn from your own latents and not from tokens: A sample-complexity theory

Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart

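AI summary: Using data from a probabilistic context-free grammar, proves that latent prediction has an exponential sample-complexity advantage over token-level SSL, and analyzes whether multi-scale hierarchies are necessary.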

Comments 10 pages, 5 figures in main. 28 pages, 14 figures, 1 table in all

English abstract

Generative models, from diffusion models to large language models, achieve remarkable performance but at a cost in training data orders of magnitude larger than what biological learners require. An alternative paradigm has emerged in which networks are trained to predict their own latent representations of related views or masked regions, as in data2vec and JEPA, an idea related to predictive-coding accounts of the cortex. Despite strong empirical results, the theoretical understanding of these methods remains limited. Central questions include: by how much does latent prediction actually improve data efficiency? Is there a benefit to stacking such methods into multi-scale hierarchies? We answer both using as data a tractable probabilistic context-free grammar that captures the compositional structure of natural language and images. Such a grammar generates strings of visible tokens by recursively applying production rules along a tree of hidden symbols of depth $L$. For such data, supervised or token-level SSL require a number of samples exponential in $L$ to recover the latent tree; we prove that latent prediction achieves this with a number of samples constant in $L$, up to logarithmic factors. We confirm this bound with (i) a hierarchical clustering algorithm, (ii) an end-to-end neural network whose predictor-clusterer modules predict their own latents at each level via gradient descent, and (iii) the first sample-complexity analysis of data2vec, which we show implicitly performs hierarchical latent prediction. This suggests that explicit stacking such as H-JEPA is largely redundant.
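
The data model can be sketched as a small generator that expands a root symbol through $L$ levels of production rules and emits the leaves as visible tokens. The vocabulary size, branching factor, number of rules per symbol, and uniform rule sampling are illustrative assumptions; the grammar analysed in the paper may impose further constraints on its rules.

```python
import random

def make_grammar(vocab, branching, rules_per_symbol, depth, rng):
    """One rule table per level: each symbol maps to a few tuples of child symbols."""
    symbols = list(range(vocab))
    return [
        {s: [tuple(rng.choice(symbols) for _ in range(branching))
             for _ in range(rules_per_symbol)]
         for s in symbols}
        for _ in range(depth)
    ]

def generate(grammar, root, rng):
    """Expand the root through every level and return the visible leaf tokens."""
    layer = [root]
    for rules in grammar:
        layer = [child for sym in layer for child in rng.choice(rules[sym])]
    return layer

rng = random.Random(0)
grammar = make_grammar(vocab=4, branching=2, rules_per_symbol=2, depth=3, rng=rng)
tokens = generate(grammar, root=0, rng=rng)
# A tree of depth L = 3 with branching factor 2 yields 2**3 = 8 visible tokens.
```

Only `tokens` is observed by the learner; the intermediate `layer` lists are the hidden symbols that latent prediction is meant to recover.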

2605.27733 2026-05-28 cs.LG

Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich

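AI summary: Proposes an entry-wise clipping method that, by analyzing the localization property of gradient noise, achieves spectral control while preserving matrix structure, gives a convergence guarantee under Cauchy-contaminated noise, and shows experimentally that the method saves training tokens.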

English abstract

Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data and the composition of multiple layers, the noise impact is heavy-tailed and survives mini-batch averaging. Existing remedies trade off structure against cost: vector-norm clipping ignores the matrix structure of weight updates, while spectral normalization (e.g., Muon (Jordan et al., 2024)) respects it at additional cost. We show that this trade-off can be balanced. Real gradient noise appears to be similar to entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish an $O(\epsilon^{-4})$ convergence guarantee under Cauchy-contaminated noise. Empirically, we find that smooth shrinkage improves Adam on NanoGPT pretraining, saving ${\sim}7\%$ of training tokens. We further find that applying the entry-wise clipping before spectral normalization yields a ${\sim}2\%$ token saving on top of Muon.
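
The contrast between norm clipping and entry-wise treatment can be seen on a gradient matrix with a single heavy-tailed outlier. The `tanh`-based shrinkage below is one possible smooth shrinkage function chosen for illustration; the abstract does not give the form of the paper's surrogate estimator, and the threshold `tau` is a hypothetical parameter.

```python
import math

def norm_clip(grad, max_norm):
    """Rescale the whole matrix when its Frobenius norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for row in grad for g in row))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [[g * scale for g in row] for row in grad]

def entrywise_clip(grad, tau):
    """Clamp each entry to [-tau, tau] independently."""
    return [[max(-tau, min(tau, g)) for g in row] for row in grad]

def smooth_shrink(grad, tau):
    """Assumed smooth variant, tau * tanh(g / tau): near-identity for small g, bounded by tau."""
    return [[tau * math.tanh(g / tau) for g in row] for row in grad]

grad = [[0.1, -0.2], [100.0, 0.3]]  # one heavy-tailed outlier entry
clipped = entrywise_clip(grad, tau=1.0)
rescaled = norm_clip(grad, max_norm=1.0)
shrunk = smooth_shrink(grad, tau=1.0)
# Entry-wise clipping bounds the outlier and leaves the small entries untouched,
# whereas norm clipping shrinks every entry by roughly a factor of 100.
```

This is the sense in which a localized outlier can be handled without discarding the signal carried by the rest of the matrix.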

2605.27726 2026-05-28 cs.CV

Asynchronous Remote Sensing Time-Series Fusion for Cloud Removal and Anytime Reconstruction

Forouzan Fallah, Chia Yu Hsu, Wenwen Li, Anna Liljedahl, Yezhou Yang

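AI summary: Proposes the AGFlow model, which fuses asynchronous S1/S2 data through time-aligned generative flow matching to achieve cloud removal, missing-frame reconstruction, and anytime querying.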

Comments CVPR 2026 MORSE Workshop

English abstract

Frequent cloud cover severely limits the usability of Sentinel-2 (S2) optical time series for Earth surface monitoring. Sentinel-1 (S1) SAR provides all-weather complementary observations, but practical S1/S2 fusion remains difficult because acquisitions are irregular and asynchronous. Many existing approaches assume temporally aligned inputs (or require external nearest-date matching) and typically restore only observed timestamps, limiting reconstruction under long gaps and preventing on-demand synthesis. We propose AGFlow (Time Aligned Generative Flow Matching), a spatiotemporal flow-matching model for S1/S2 cloud removal and time-series reconstruction with three capabilities: (1) timestamp-conditioned internal alignment that fuses asynchronous S1 and cloudy S2 observations without preprocessing-based pairing; (2) spatiotemporal, context-aware denoising that models spatial structure jointly with temporal dynamics (rather than independent per-pixel time series); and (3) anytime querying, enabling generation of cloud-free S2 frames at both observed and user-specified timestamps within the monitoring window. We evaluate on the RESTORE-DiT benchmark protocol with quantitative metrics, qualitative comparisons, and component ablations. AGFlow notably improves reconstruction of fully missing frames (MAE and RMSE are reduced by 16-19% relative to RESTORE-DiT) and provides reliable reconstructions under persistent gaps, while also yielding competitive cloud removal performance and flexible temporal querying for downstream tasks such as dense vegetation monitoring.

2605.27724 2026-05-28 cs.RO cs.AI

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

Kevin Lin, Ajay Mandlekar, Caelan Reed Garrett, Nikita Chernyadev, Yu Fang, Runyu Ding, Yuqi Xie, Justin Tran, Linxi Fan, Yuke Zhu

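AI summary: Proposes HumanoidMimicGen, which uses whole-body planning to automatically generate humanoid loco-manipulation demonstration data, improving the performance of co-trained policies by 20% on a simulated benchmark.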

Comments website: https://humanoidmimicgen.github.io/

English abstract

Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demonstrations, which are time-intensive and difficult to collect via teleoperation. Existing data-generation algorithms can automatically synthesize demonstrations for manipulators, but they are ineffective on humanoids because their high-dimensional composite action spaces involve arms, legs, and torsos. We present HumanoidMimicGen, a method for generating humanoid legged loco-manipulation data. Our method adapts contact-rich whole-body skills from a handful of source demonstrations to new states, generalizing across changes in object pose. By interleaving these single- and dual-arm skills with whole-body locomotion and manipulation planning, the method generates stable, collision-free data across diverse scenes and layouts. To evaluate our approach, we introduce a new simulated loco-manipulation benchmark containing nine diverse tasks that test humanoid loco-manipulation capabilities. There, we demonstrate that HumanoidMimicGen automatically generates large datasets for imitation learning and enables a systematic study of how data generation and policy learning decisions impact model performance. We show that whole-body visuomotor policies co-trained with data generated by HumanoidMimicGen outperform those trained only on real-world data by 20%.

2605.27722 2026-05-28 cs.LG

NUCLEUS-MoE: Unified Model of Pool Boiling for Liquid Cooling

Arthur Feeney, Xianwei Zou, Sheikh Md Shakeel Hassan, Siddhartha Rachabathuni, Aparna Chandramowlishwaran

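AI summary: Proposes the mixture-of-experts model NUCLEUS, which uses neighborhood attention, signed distance field reinitialization, and expert routing to model pool boiling across different fluids and operating conditions in a unified way, with zero-shot and few-shot generalization.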

Comments 12 pages, 9 figures, KDD AI for Science

English abstract

Two-phase boiling enables heat transfer rates an order of magnitude higher than single-phase cooling, but it remains difficult to model due to the strong coupling between phase change, turbulence, and transport, as well as extreme sensitivity to fluid properties and thermodynamic conditions. Existing learning-based surrogates are either condition- or fluid-specific, limiting generalization and requiring separate models. We present NUCLEUS, a mixture-of-experts model for pool boiling that replaces collections of specialized surrogates with a single architecture. NUCLEUS combines neighborhood attention, signed distance field reinitialization for interface consistency, and expert routing that exhibits emergent specialization across distinct boiling dynamics. Trained on high-fidelity simulations of pool boiling, NUCLEUS jointly models saturated and subcooled boiling across three fluid classes (dielectrics, refrigerants, and cryogens), resolving failure modes of prior models on extreme fluids. We show that expert routing exhibits coherent spatial structure and specialization without explicit supervision. Quantitatively, NUCLEUS matches or exceeds baselines while maintaining physical consistency across heterogeneous boiling configurations. We also show zero-shot and few-shot generalization capabilities on downstream tasks such as a new fluid (Opteon 2P50 developed for immersion cooling). These results demonstrate that mixture-of-experts models are a scalable pathway toward unified surrogate modeling of boiling dynamics and lay the groundwork for broader generalization across scientific ML.

2605.27721 2026-05-28 cs.CL cs.AI

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

Cheng Qian, Jiayu Liu, Heng Ji

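AI summary: Proposes the UserHarness framework, which performs Theory-of-Mind reasoning by explicitly reconstructing the user's mental state (beliefs, intentions, and so on), reaching up to 95.94% macro accuracy across five benchmarks, a relative improvement of more than 15%.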

Comments 19 Pages, 4 Figures, 2 Tables

English abstract

Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user's mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user's mind, positioning user harnessing as a promising foundation for more adaptive future assistants.
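
The structure described above (beliefs updated only through observation, actions following from beliefs and intentions) can be illustrated with a classic false-belief scenario. The `UserMind` schema and the `event` helper are hypothetical names invented for this sketch and are not taken from UserHarness itself.

```python
from dataclasses import dataclass, field

@dataclass
class UserMind:
    """Explicit record of what one user believes and intends (hypothetical schema)."""
    name: str
    beliefs: dict = field(default_factory=dict)
    intention: str = ""

    def observe(self, fact, value):
        """Beliefs change only through events the user actually witnesses."""
        self.beliefs[fact] = value

    def predicted_action(self):
        """The user acts on their belief, not on the true world state."""
        return f"look in {self.beliefs.get(self.intention, 'unknown')}"

world = {}
sally = UserMind("Sally", intention="marble")

def event(fact, value, witnesses):
    """Update the world, then update only the minds that witnessed the event."""
    world[fact] = value
    for mind in witnesses:
        mind.observe(fact, value)

event("marble", "basket", witnesses=[sally])  # Sally sees the marble placed
event("marble", "box", witnesses=[])          # it is moved while Sally is away
action = sally.predicted_action()
# The world says "box", but Sally's belief, and therefore her action, still says "basket".
```

Keeping the world state and each mind's beliefs in separate stores is what makes the divergence between them explicit rather than something the model has to infer indirectly.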

2605.27720 2026-05-28 cs.LG stat.AP

Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation

Fei Jiang, Lei Yang

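AI summary: To address deployment uncertainty for learned autonomous controllers under limited simulation validation, proposes a deployment approval framework based on Bayesian posterior inference that provides uncertainty-calibrated evaluation through posterior approval probability and posterior deployment risk.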

Comments 16 pages, 4 figures and 4 tables

English abstract

Reinforcement learning and data-driven autonomous controllers are commonly evaluated using cumulative reward and empirical success frequency under finite simulation trajectories. However, such empirical metrics do not necessarily provide sufficient statistical evidence regarding deployment readiness under uncertainty. This work develops a Bayesian approval framework for learned autonomous landing controllers under finite rollout evidence. A probabilistic landing capability formulation is introduced based on touchdown safety satisfaction under uncertain operating conditions, while Bayesian posterior inference is used to quantify uncertainty regarding the true deployment capability of learned policies. Posterior approval probability and posterior deployment risk are further introduced for deployment-oriented evaluation, together with a sequential validation framework supporting approve/reject/continue decisions during progressive rollout testing. Simulation experiments using PPO and SAC controllers demonstrate that empirical success and reward optimization may produce overconfident deployment interpretation under limited validation evidence, whereas posterior approval inference provides a more uncertainty-calibrated assessment of deployment readiness. The proposed framework provides a practical statistical connection between conventional reinforcement-learning evaluation and deployment-oriented validation under uncertainty and may be generalized to broader classes of learned autonomous systems.
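
A minimal sketch of posterior approval under finite rollouts follows, assuming a Beta-Binomial model with a uniform prior on the landing success rate. The prior, the required capability of 0.9, and the approve and reject thresholds are all assumptions chosen for illustration; the abstract does not state the paper's actual prior or decision rule.

```python
from math import comb

def approval_probability(successes, rollouts, required):
    """P(p >= required | data) under a uniform Beta(1, 1) prior on the success rate p.

    The posterior is Beta(successes + 1, failures + 1); for integer parameters its
    upper tail equals a binomial CDF: P(Binomial(rollouts + 1, required) <= successes).
    """
    n = rollouts + 1
    return sum(comb(n, j) * required**j * (1 - required) ** (n - j)
               for j in range(successes + 1))

def decide(successes, rollouts, required=0.9, approve_at=0.95, reject_at=0.05):
    """Sequential approve / reject / continue rule on the posterior approval probability."""
    prob = approval_probability(successes, rollouts, required)
    if prob >= approve_at:
        return "approve", prob
    if prob <= reject_at:
        return "reject", prob
    return "continue", prob

early = decide(successes=10, rollouts=10)   # perfect empirical record, little evidence
later = decide(successes=98, rollouts=100)  # slightly imperfect record, much more evidence
```

Under these assumptions, ten successes out of ten give a posterior approval probability of only about 0.69, so the rule asks for more rollouts, which mirrors the abstract's point that empirical success frequency alone can be overconfident.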

2605.27715 2026-05-28 cs.CL

Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

超越输入理解:使用有向无环迹图诊断多语言数学推理

Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu

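AI summary This paper proposes the Directed Acyclic Trace Graph (DATG) framework, which maps reasoning traces to language-independent mathematical anchors and dependencies to diagnose how language affects multilingual mathematical reasoning, and designs two test-time controls, Loop-Retry and Formula-Retry, that improve performance in low-resource languages.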

Comments preprint

详情
AI中文摘要

大型推理模型(LRMs)在英语中表现出强大的数学推理能力,但在许多低资源和中资源语言中仍然不太可靠。这种差距通常被解释为无法理解非英语的问题陈述。我们表明这种观点是不完整的:即使问题以英语给出,控制模型的推理语言也会显著降低准确性,这表明语言也影响推理执行本身。为了研究这种效应,我们引入了DATG,一个有向无环迹图框架,将推理迹映射到与语言无关的数学锚点和依赖关系。这使我们能够将目标语言迹与参考DAG对齐,并测量它们是否覆盖所需的数学节点、尊重依赖边以及避免有害的数学动作。在Qwen3系列上跨12种语言的实验表明,非英语推理通常遭受锚点覆盖减少和依赖保真度降低,尤其是在低资源语言中。受此诊断启发,我们提出了Loop-Retry和Formula-Retry,两种针对DATG暴露的失败模式的简单测试时控制方法,并表明它们一致地改善了低资源语言中的目标语言推理性能。

英文摘要

Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model's reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.
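To make the two diagnostics concrete, here is a minimal sketch of anchor coverage and dependency fidelity. The definitions are assumptions made for illustration (coverage as the fraction of reference nodes that appear in the trace, fidelity as the fraction of reference edges whose source anchor first appears before its target); the paper's actual metrics may be defined differently, and the function names are hypothetical.

```python
def anchor_coverage(reference_nodes, trace_anchors):
    """Fraction of reference DAG nodes that appear anywhere in the trace."""
    ref = set(reference_nodes)
    return len(ref & set(trace_anchors)) / len(ref)


def dependency_fidelity(reference_edges, trace_anchors):
    """Fraction of reference edges (u, v) where u first appears before v."""
    first_pos = {}
    for index, anchor in enumerate(trace_anchors):
        first_pos.setdefault(anchor, index)  # keep the first occurrence only
    respected = sum(
        1
        for u, v in reference_edges
        if u in first_pos and v in first_pos and first_pos[u] < first_pos[v]
    )
    return respected / len(reference_edges)


# Hypothetical reference DAG and a trace that uses "time" before "rate".
nodes = ["total", "rate", "time", "answer"]
edges = [("total", "time"), ("rate", "time"), ("time", "answer")]
trace = ["total", "time", "rate", "answer"]
coverage = anchor_coverage(nodes, trace)
fidelity = dependency_fidelity(edges, trace)
```

In this toy case the trace covers every anchor but violates one of the three dependency edges, which is the kind of failure a coverage-only metric would miss.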

2605.27712 2026-05-28 cs.AI

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

前缀安全贝叶斯信念追踪用于LLM推理可靠性:将校准与排序分离

Zhenghan Song, Yunyi Li, Yulong Liu

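AI summary Proposes Sequential Bayesian Belief Tracking (SBBT), a prefix-safe framework that separates probability quality from ranking ability to provide online calibration and reliability estimates over long reasoning traces.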

详情
AI中文摘要

长推理轨迹需要在最终答案已知之前进行可靠性估计。我们研究前缀条件的事件成功估计 $P(y=1 \mid o_{1:t})$,使用前缀安全观测。序列贝叶斯信念追踪(SBBT)校准观测似然并递归更新两状态信念,为标量分数、文本和自我验证标记、隐藏聚类、令牌池探针以及潜在轨迹特征提供通用追踪器。在MATH-500、GSM8K、AIME 2025和RIMO-N上生成的开源权重轨迹中,概率质量和排序分离:仅使用分数的SBBT通常改善Brier分数,而AUROC提升需要超出强前缀安全基线的结构感知证据。在最强硬数学设置中,结构感知观测相对于标准前缀安全基线达到+0.110 AUROC。在相同前缀分类器审计下,MATH-500文本标记和RIMO-N自我验证信号保持正向。这些发现共同支持SBBT作为校准感知的在线推理框架,并揭示证据机制:标量分数主要支持概率质量,而结构感知前缀信号仅在强前缀安全基线尚未吸收排序证据时支持排序。

英文摘要

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves the Brier score, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
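The recursive two-state update can be sketched as a plain Bayes filter. This is a simplified illustration, not the paper's implementation: it assumes the latent state (eventual success or failure) is fixed for the whole trace and that observations are conditionally independent given that state. The likelihood values passed in are hypothetical stand-ins for calibrated observation likelihoods.

```python
def update_belief(prior_success: float, lik_success: float, lik_failure: float) -> float:
    """One Bayes step: P(y=1 | o_1..t) from P(y=1 | o_1..t-1) and P(o_t | y)."""
    numerator = prior_success * lik_success
    denominator = numerator + (1.0 - prior_success) * lik_failure
    return numerator / denominator


def track(prior_success: float, likelihood_pairs):
    """Return the belief after each prefix, given (P(o|y=1), P(o|y=0)) pairs."""
    beliefs = []
    belief = prior_success
    for lik_success, lik_failure in likelihood_pairs:
        belief = update_belief(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs


# Two observations that are each four times likelier under eventual success.
beliefs = track(0.5, [(0.8, 0.2), (0.8, 0.2)])
```

Because each belief depends only on observations up to step t, the estimate stays prefix-safe in the sense that no information from later tokens leaks into it.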

2605.27710 2026-05-28 cs.AI

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

DeepSciVerify: 通过LLM驱动的证据升级验证科学声明与引文对齐

Shaghayegh Sadeghi, Khashayar Khajavi, Rise Adhikari, Alexander Tessier

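AI summary Proposes DeepSciVerify, a two-stage pipeline that combines abstract-level reasoning with selective escalation to passage-level evidence; on the SCitance benchmark it reaches 86.7 Micro-F1, 4.5 points above abstract-only baselines, while 67% of instances need no full-text retrieval.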

详情
AI中文摘要

声明与其引用证据之间的错位是大语言模型生成报告中的常见失败模式,限制了其在科学及其他高风险场景中的可靠性。我们提出DeepSciVerify,一个用于科学声明-引文验证的两阶段流水线,结合摘要级推理与选择性升级到段落级证据。该系统首先使用摘要验证声明,并对不确定案例进行延迟处理,仅在必要时检索和分析全文段落。该设计利用了LLM之间的互补行为,因为某些模型在不确定性下更为保守,而另一些则更为果断。在SCitance基准上,DeepSciVerify达到了86.7 Micro-F1,比强纯摘要基线高出4.5点,同时67%的实例无需全文检索即可解决。这些结果表明,选择性证据升级提高了声明-引文验证的准确性和效率。

英文摘要

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.
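The control flow of selective escalation can be sketched as follows. All callables here (`abstract_verifier`, `fetch_passages`, `passage_verifier`) are hypothetical placeholders for LLM or retrieval calls; the abstract does not describe the actual interfaces, prompts, or deferral criterion, so this shows only the two-stage shape under that assumption.

```python
from typing import Callable, List, Optional, Tuple

# True = supported, False = not supported, None = uncertain (defer to stage 2).
Verdict = Optional[bool]


def verify_claim(
    claim: str,
    abstract: str,
    abstract_verifier: Callable[[str, str], Verdict],
    fetch_passages: Callable[[str], List[str]],
    passage_verifier: Callable[[str, List[str]], bool],
) -> Tuple[bool, bool]:
    """Return (verdict, escalated). Full text is fetched only on deferral."""
    verdict = abstract_verifier(claim, abstract)
    if verdict is not None:
        return verdict, False  # resolved from the abstract alone
    passages = fetch_passages(claim)  # hypothetical full-text retrieval step
    return passage_verifier(claim, passages), True


# Toy stand-ins: the abstract verifier defers whenever the claim mentions "dose".
def toy_abstract_verifier(claim: str, abstract: str) -> Verdict:
    return None if "dose" in claim else claim in abstract


fetch_calls = []


def toy_fetch(claim: str) -> List[str]:
    fetch_calls.append(claim)
    return ["the dose was 5 mg"]


def toy_passage_verifier(claim: str, passages: List[str]) -> bool:
    return any("5 mg" in passage for passage in passages)


easy = verify_claim("X improves Y", "We show X improves Y.",
                    toy_abstract_verifier, toy_fetch, toy_passage_verifier)
hard = verify_claim("the dose was 5 mg", "We show X improves Y.",
                    toy_abstract_verifier, toy_fetch, toy_passage_verifier)
```

The efficiency gain reported in the abstract corresponds to the first return path: instances resolved there never trigger retrieval.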

2605.27709 2026-05-28 cs.CL

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

ReverseMath: 面向可扩展和可验证数学问题生成的答案反转方法

Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

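AI summary Proposes ReverseMath, which automatically generates new math problems by reversing the input-output relation of an original problem; used for evaluation and training, it reveals memorization-like behavior and improves reasoning performance.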

详情
AI中文摘要

数学推理基准对于评估大型语言模型(LLM)至关重要,但许多基准是静态的,并通过公开评估和训练管道反复暴露,使得难以区分真正的推理与记忆。同时,手动构建具有可靠答案的新数学问题仍然成本高昂。我们引入ReverseMath,一种通过答案反转生成新数学问题的可扩展方法。给定一个问题及其答案,ReverseMath掩码原始问题中的一个数值,将原始答案视为已知条件,并重写问题,使得掩码值成为新答案。生成的问题反转了原始输入输出关系,使其答案通过构造已知。我们研究了ReverseMath在评估和训练中的应用。对于评估,配对的原始/反转问题揭示了显著的行为变化:模型有时在反转问题上失败,甚至错误地输出原始答案,暗示了类似记忆的行为。对于训练,ReverseMath提供自动标注的反转问题作为强化学习(RL)的数据增强。实验表明,包含ReverseMath生成的数据提高了多个基准上的数学推理性能,证明了其作为分析工具和可验证训练数据的可扩展来源的价值。

英文摘要

Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
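A toy sketch of answer inversion is shown below. It assumes the problem is available in a structured form (named values plus a hand-written reversed template), which is a simplification made for illustration; the abstract does not say how the rewriting is carried out, and `reverse_problem` and its arguments are hypothetical names.

```python
def reverse_problem(values: dict, answer, masked_key: str, reversed_template: str):
    """Mask one original value, expose the original answer, return (text, new_answer).

    The new answer is the masked value itself, so it is known by construction.
    """
    known = {key: value for key, value in values.items() if key != masked_key}
    known["answer"] = answer  # the original answer becomes a given condition
    return reversed_template.format(**known), values[masked_key]


# Original: "Tom has 3 apples and buys 5 more. How many does he have?" -> 8
original_values = {"start": 3, "bought": 5}
original_answer = 8

reversed_text, reversed_answer = reverse_problem(
    original_values,
    original_answer,
    masked_key="bought",
    reversed_template=(
        "Tom has {start} apples and buys some more. "
        "He now has {answer} apples. How many did he buy?"
    ),
)
```

A model that outputs 8 on the reversed problem is repeating the original answer rather than solving the new question, which is the memorization-like behavior the abstract describes.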

2605.27706 2026-05-28 cs.CL cs.IR

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

基于格点链式自适应重配置以减少幻觉

Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

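AI summary Proposes the CAROL framework, which defines a semantic uncertainty measure, builds a string-submodular objective over a lattice of textual sequences, and casts hallucination mitigation as a Markov chain accept-reject process to reduce hallucinations at test time.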

详情
AI中文摘要

我们介绍了CAROL(基于格点的链式自适应重配置),一个用于大型语言模型测试时减少幻觉的概率框架。CAROL不依赖于词元级别的不确定性,而是基于生成响应与可信上下文之间的一致性定义了一种语义不确定性度量,在文本序列格点上诱导出一个串子模目标。这种表述使得幻觉缓解可以被建模为一个具有可证明收敛性和接近最优性保证的马尔可夫链接受-拒绝过程,允许模型迭代地优化输出以实现语义一致性。通过在意义层面操作,CAROL将幻觉检测和缓解统一在一个框架内。在问答和多智能体推理基准上的实证结果表明,与基于似然和检索增强的基线相比,CAROL显著减少了幻觉,提高了可靠性和可解释性,同时保持了具有竞争力的计算效率。

英文摘要

We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

2605.27703 2026-05-28 cs.AI

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

面向资源受限智能体语言模型的分层提示域控制与学习

Joan Vendrell Gallart, Russell Bent, Michael Grosskopf

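AI summary Proposes a hierarchical control-and-learning framework that combines distillation of the output schema, online monitoring, and prompt-domain control to address the reliability of agentic language models under resource constraints.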

详情
AI中文摘要

大型语言模型越来越多地部署在智能体系统中,它们必须遵循结构化协议,适应不断变化的状态,并在内存、延迟和成本限制下运行。在这种场景下,提示扩展不可靠:增长的上下文可能将紧凑模型推离其有效提示域,而部署时的微调受限于稀缺的数据和计算资源。我们提出了一种分层控制与学习框架,其中紧凑模型首先通过蒸馏学习所需的输出模式,然后由预言机-控制器循环在线监督。控制器监控协议有效性和语义性能,将累积历史投影到可行的提示域中,并在发生漂移时触发轻量级的预言机监督微调。这将用于通信兼容性的模式学习与用于任务级纠正的语义适应分离开来。我们形式化了提示域可行性和注意力引起的饱和,从而激励对有效提示状态的控制,而非依赖名义上下文长度。使用多保真贝叶斯优化作为受控顺序测试平台,我们描述了一个核心部署故障模式,并展示了相对于非分层、仅蒸馏和非蒸馏基线的改进的可靠性和成本效益。

英文摘要

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

2605.27699 2026-05-28 cs.RO

AURA: Asymptotically Optimal Uncertainty-Robust Replanning Algorithm for Kinodynamic Systems

AURA: 动力学系统渐近最优的鲁棒重规划算法

Seyedali Golestaneh, Zhuoyun Zhong, Donghyung Lee, Constantinos Chamzas

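AI summary Proposes AURA, a meta-planner framework that replans online and refines control inputs to achieve asymptotically optimal trajectory planning and improved tracking accuracy under motion uncertainty.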

详情
AI中文摘要

基于采样的运动规划器为动力学运动规划提供了一种实用且可扩展的方法,尤其适用于高维、欠驱动或非完整系统。然而,这些规划器通常离线使用,要求在执行开始前完成轨迹计算。此外,在存在运动不确定性的情况下,规划轨迹可能无法被准确跟踪,导致偏离名义解。本文在一个统一框架AURA中解决了这些局限性,该框架是一个渐近最优的元规划器框架,在执行过程中同时提高路径质量和跟踪性能。除了主执行线程外,该框架包含一个重规划方法,在执行过程中持续探索状态空间并优化轨迹,以及一个优化过程,用于优化未来控制输入以减少跟踪误差。这些组件共同使AURA能够在线利用渐近最优规划,同时在不确定性下提高执行精度。所提出的方法在多个系统的仿真和真实环境中进行了评估,与基线方法相比,在轨迹质量、跟踪精度和整体性能方面表现出一致的改进。

英文摘要

Sampling-based motion planners offer a practical and scalable approach to kinodynamic motion planning, notably for high-dimensional, underactuated, or non-holonomic systems. However, these planners are typically used offline, so execution can begin only after the trajectory has been computed. In addition, the planned trajectory may not be accurately tracked in the presence of motion uncertainty, leading to deviations from the nominal solution. This work addresses these limitations within a unified framework, AURA, an asymptotically optimal meta-planner that improves both path quality and tracking performance during execution. In addition to the main execution thread, the framework comprises a replanning method that continuously explores the state space and refines the trajectory during execution, and an optimization process that refines future control inputs to reduce tracking error. Together, these components enable AURA to leverage asymptotically optimal planning online while improving execution accuracy under uncertainty. The proposed approach is evaluated in both simulation and real-world environments across multiple systems, demonstrating consistent improvements in trajectory quality, tracking accuracy, and overall performance compared with baseline methods.

2605.27697 2026-05-28 cs.RO cs.AI cs.LG

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

仿真引导的扩散方法用于去中心化多机器人运动规划

Jinhao Liang, Sven Koenig, Ferdinando Fioretto

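AI summary Proposes SID, a decentralized framework built on constraint-aware diffusion models that simulates neighbors' future trajectories and plans each robot's own trajectory under the resulting safety constraints, achieving efficient coordination in dense scenarios.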

详情
AI中文摘要

去中心化多机器人运动规划要求每个机器人仅根据局部观测生成无碰撞轨迹,无需全局感知或可靠通信。然而,大多数现有规划器(无论是经典方法还是基于学习的方法)都是从局部观测的静态快照生成轨迹,这限制了它们预测相邻机器人未来行为的能力。随着机器人数量增加和环境变得更加拥挤,这一限制变得至关重要。为了克服这一挑战,本文引入了仿真引导的扩散(SID),这是一种基于约束感知扩散模型(CADM)的去中心化框架。SID首先使用CADM从当前观测状态仿真相邻机器人的未来轨迹,然后利用这些仿真提供的安全约束,使用相同的CADM规划每个机器人自身的轨迹。关键的是,对邻居的精确仿真使得一种最小通信方案成为可能,该方案仅在高度拥挤的场景中必要时触发协调。在多种环境中的实验表明,SID在规划有效性和约束满足方面始终优于基线方法,并且可扩展到108个机器人和160个障碍物的场景。

英文摘要

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27690 2026-05-28 cs.CL cs.LG

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

TRACES: 通过轨迹状态建模实现多轮LLM智能体的主动安全审计

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

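AI summary Proposes TRACES, which learns prefix-level trajectory risk states from the hidden representations of an observer LLM to audit safety proactively in multi-turn tool-use settings, improving full-trajectory safety prediction and proactive risk discrimination.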

详情
AI中文摘要

LLM智能体越来越多地通过多轮工具使用和环境交互来运作,其中安全风险往往在最终结果显现之前的中间步骤中就已经出现。因此,反应式审计是不够的:事后诊断常常在风险正在展开时错过标记它们的机会。我们提出TRACES,一种基于表示的主动审计器,它从观察者LLM的隐藏表示中学习前缀级轨迹风险状态。TRACES从步骤表示中诱导潜在机制特征,并建模其时间演化,以估计部分轨迹是否正在向不安全行为漂移。为了规避步骤级风险标注的成本和歧义,TRACES在弱轨迹级监督下训练,同时仍能产生密集的前缀级风险估计。在多个智能体安全基准测试中,TRACES改进了全轨迹安全预测和主动风险判别。我们的分析进一步表明,这些风险状态可以帮助训练更安全的智能体,凸显了主动审计在长程智能体安全中的更广泛潜力。

英文摘要

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

2605.27689 2026-05-28 cs.LG cs.CR

Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

测试时集体行动:基于代理的扰动用于纠正算法危害

Meghana Bhange, Ulrich Aïvodji, Elliot Creager

AI总结 提出测试时集体行动框架,通过用户共享查询访问黑盒API提取代理模型并优化每类通用扰动,在推理时修正子群性能差距,无需平台参与训练。

详情
AI中文摘要

当机器学习系统对特定子群表现不佳时,受影响的用户通常无法在不依赖平台级修复的情况下纠正这些差异。现有的算法公平方法依赖于以提供者为中心的方法来纠正这些失败,用户在面临危害时没有外部杠杆。最近在算法集体行动方面的工作表明,协调的用户可以将算法系统引导向集体目标,但现有机制要求提供者在集体的修改数据上重新训练,而用户可能无法控制这些数据。我们提出测试时集体行动(TTCA),这是一个框架,通过该框架,一组共享平台查询访问的用户可以纠正影响服务不足子群的差异,而无需参与平台的训练循环。我们通过一种基于代理的机制实现这一点,其中集体池化对黑盒API的查询访问以提取平台的代理,然后针对代理优化每类通用扰动。每个成员在提交时将此扰动应用于自己的输入,无需平台合作。我们在CIFAR-10、CIFAR-100和FairFace上进行了实证评估,表明适度规模的集体可以缩小大部分子群准确率差距,跨架构迁移(小型代理可以攻击更大的平台),并改善最差组准确率、机会均等差距和差异性影响。查询预算分析比较了每用户黑盒攻击基线,表明池化比每个子群成员单独攻击更便宜。因此,当平台端修复不可用或延迟时,测试时集体行动为用户提供了纠正干预措施。

英文摘要

When machine learning systems underperform for particular subgroups, affected users typically have no way to correct these disparities without relying on platform-level fixes. Existing approaches to algorithmic fairness are provider-centric, leaving users with no external lever when faced with harm. Recent work in Algorithmic Collective Action shows that coordinated users can steer an algorithmic system toward a collective goal, but the existing mechanisms require the provider to retrain on the collective's modified data, over which users may have no control. We propose Test-Time Collective Action (TTCA), a framework through which a group of users who share query access to the platform can correct disparities affecting under-served subgroups without participating in the platform's training loop. We implement this through a proxy-based mechanism in which the collective pools query access to a black-box API to extract a proxy of the platform, then optimizes a per-class universal perturbation against the proxy. Each member applies this perturbation to their own inputs at submission time, requiring no cooperation from the platform. We empirically evaluate the mechanism on CIFAR-10, CIFAR-100, and FairFace, showing that modestly sized collectives close most of the subgroup accuracy gap, transfer across architectures (a small proxy can attack a larger platform), and improve worst-group accuracy, equal-opportunity gap, and disparate impact. A query-budget analysis against a per-user black-box attack baseline shows that pooling is cheaper than each subgroup member attacking alone. Test-time collective action thus offers users a corrective intervention when platform-side remediation is unavailable or delayed.

2605.27686 2026-05-28 cs.CV cs.AI

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

张量记忆:用于长程Transformer的固定大小循环状态

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

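AI summary Proposes Tensor Memory, a module that augments Transformers with a fixed-size recurrent 3D tensor state, decoupling state capacity from input length while preserving a spatial inductive bias; aimed at long-horizon video understanding.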

详情
AI中文摘要

Transformer通过将空间和时间展平为长令牌序列来处理图像和视频。虽然注意力和KV缓存保留了过去的特征,但其内存随序列长度增长,并且缺乏显式的、持久化的空间状态,这使得长程视频理解和遮挡敏感推理变得困难。我们提出张量记忆,一种轻量级模块,通过固定大小的循环3D记忆张量增强Transformer块:令牌通过可微的软写入将内容沉积为围绕预测连续3D位置的高斯加权体积到体素网格中,记忆通过高效的局部交互算子和门控循环动态更新,令牌通过连续采样和门控残差融合读取上下文。由于记忆张量大小固定,张量记忆将状态容量与输入长度解耦,同时保持空间归纳偏置。我们在标准语言、图像和视频基准测试以及一个旨在隔离持久状态何时有益的受控玩具诊断套件上评估该模块;它与标准Transformer训练流程集成,可以附加到现有块或从中移除,而无需其他架构更改。

英文摘要

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

2605.27681 2026-05-28 cs.AI cs.LG

Behavioural Analysis of Alignment Faking

对齐伪造的行为分析

Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, Alan Cooney

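AI summary Studies alignment faking in a controlled, minimal setup and finds that it is driven by values, goal guarding, and sycophancy, is more widespread than previously reported, and is predictable from situational cues and model tendencies.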

Comments preprint

详情
AI中文摘要

对齐伪造(AF)指的是模型为了保持其部署偏好,策略性地遵守训练目标以避免行为修改。理解AF何时以及为何出现很重要,因为模型在区分训练和部署方面越来越擅长。先前的工作发现AF脆弱、对提示敏感且依赖模型,其潜在驱动因素尚不清楚。我们在一个隔离其核心组件的可控最小设置中研究AF,并在比先前报告更广泛的模型中观察到它,包括小规模模型。我们识别出三个可分离的驱动因素——价值观、目标保护和谄媚——并通过有针对性的提示消融和激活引导表明每个因素独立地调节AF行为。我们的结果表明AF比先前报告更普遍,并且其发生可从情境线索和可测量的模型倾向(如基线谄媚和陈述的价值观)预测。这种分解为未来模型中检测和缓解AF提供了具体方向。

英文摘要

Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers -- values, goal guarding, and sycophancy -- and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

2605.27678 2026-05-28 cs.LG cs.DC

Heterogeneous Parallelism for Multimodal Large Language Model Training

多模态大语言模型训练的异构并行

Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh

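AI summary To address the throughput bottleneck caused by a single LLM-centric parallel layout in multimodal LLM training, proposes a heterogeneous parallelism abstraction that lets modules use independent layouts and placements, with boundary communicators preserving tensor semantics; experiments show TFLOPS/GPU gains of up to 49.3%.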

详情
AI中文摘要

基础模型训练正变得多模态,从后训练流程到大规模预训练。随着模态覆盖范围扩大、上下文窗口增长以及编码器LLM规模分化,单一的以LLM为中心的TP/CP/PP/DP/EP布局日益限制吞吐量。这种耦合迫使编码器继承LLM驱动的分片和放置选择,可能增加通信、限制编码器并行性或约束LLM调度;这种不匹配在长上下文中最为明显,此时融合的多模态序列需要LLM上下文并行,但编码器输入仍然受限。我们提出了多模态大语言模型训练的异构并行,这是一种抽象,允许端到端图中的模块使用独立的布局和秩放置,支持共享GPU上的共置执行和不相交秩集上的非共置执行。关键挑战是在独立布局间保持边界张量语义:前向激活必须为目标布局物化,而反向梯度必须路由回源布局。我们通过边界通信器解决这一问题,实现前向和反向布局变换,以及两种放置模式的调度扩展。我们评估了跨多模态工作负载和GPU规模的优化同构、共置异构和非共置异构配置,以刻画何时额外的布局和放置自由度能暴露更优的操作点。在这一扫描中,共置异构将TFLOPS/GPU提升高达49.3%,而非共置异构将总token吞吐量提升高达13.0%,TFLOPS/GPU提升高达9.6%。我们验证了与同构基线相比的损失收敛一致性,并将该系统作为开源Megatron-LM扩展发布。

英文摘要

Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.

2605.27673 2026-05-28 cs.LG

When do complex-valued neural networks help? A study of representation, geometry, and optimization

复值神经网络何时有帮助?表征、几何与优化的研究

Ashutosh Kumar

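AI summary Compares complex-valued neural networks with several real-valued baselines on synthetic RF, quantum-wavefunction, and EEG tasks, and finds that their advantage depends on representation, symmetry, and optimization rather than being universal.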

详情
AI中文摘要

复值神经网络(CVNN)通常应用于信息自然编码为幅度和相位的领域。然而,仅凭复值输入并不能确定复算术何时能改善学习:标签信号可能存在于振幅、相位、它们的耦合或某种对称性中,而实值模型在合适的坐标下也能表征这种对称性。我们通过将CVNN与笛卡尔实值、极坐标、仅相位、仅幅度、参数匹配实值和FLOP匹配实值基线进行表征优先的评估来研究这一问题。在合成射频任务中,复值表征有用但并非普遍优越。仅PSK任务有利于相位感知和复值模型,仅QAM任务有利于基于幅度的模型,混合PSK+QAM仅带来微小的复值优势,而未见过的载波相位旋转会破坏坐标依赖模型(无数据增强)。类似模式也出现在射频之外:在量子波函数预测中,动量对$|ψ|$不可见但可从相位恢复,而脑电图解析信号实验表明,相位锁定、幅度爆发和相位-幅度耦合各自偏好不同的坐标视图。我们还发现了RadioML 2018.01A上的一个基准测试伪影。在匹配共享试验选择下,CReLU复值模型超过最佳实值基线22.94个百分点;在相同数据和16次试验搜索空间下进行独立每族调参时,差距缩小至2.46个百分点。梯度分析将夸大的差距归因于实值基线在高学习率下的第一步不稳定性,而复值参数耦合更稳健地分布损失信号。学习率×激活函数的析因实验证实该失败主要是超参数驱动的。总体而言,CVNN应被视为结构化归纳偏置,其增益取决于表征、对称性和优化,而非普遍优越的架构。

英文摘要

Complex-valued Neural Networks (CVNNs) are often motivated by domains where information is naturally encoded in magnitude and phase. Yet complex-valued inputs alone do not determine when complex arithmetic improves learning: the label signal may lie in amplitude, phase, their coupling, or a symmetry that real-valued models can also represent under suitable coordinates. We study this through a representation-first evaluation of CVNNs against Cartesian real, polar, phase-only, magnitude-only, parameter-matched real, and FLOP-matched real baselines. Across synthetic RF tasks, complex representations are useful but not universally superior. PSK-only tasks favor phase-aware and complex-valued models, QAM-only tasks favor magnitude-based models, mixed PSK+QAM gives only a small complex-valued advantage, and unseen carrier-phase rotations break coordinate-dependent models without augmentation. Similar patterns appear beyond RF: in quantum-wavefunction prediction, momentum is invisible to $|\psi|$ but recoverable from phase, while EEG analytic-signal experiments show that phase locking, amplitude bursts, and phase-amplitude coupling each favor different coordinate views. We also identify a benchmarking artifact on RadioML 2018.01A. Under matched-shared-trial selection, a CReLU complex model exceeds the best real baseline by 22.94 percentage points; under independent per-family tuning on the same data and 16-trial search space, the gap collapses to 2.46 percentage points. Gradient analysis traces the inflated gap to high-learning-rate first-step instability in real baselines, while complex parameter coupling distributes the loss signal more robustly. A learning-rate $\times$ activation factorial confirms the failure is primarily hyperparameter-driven. Overall, CVNNs are best viewed as structured inductive biases whose gains depend on representation, symmetry, and optimization, not as universally superior architectures.
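The point about coordinate views can be seen on a tiny example. The sketch below builds ideal, noise-free QPSK symbols (an assumption made for illustration, not the paper's dataset) and compares what a magnitude-only, a phase-only, and a Cartesian view each retain, including under a carrier-phase rotation.

```python
import cmath
import math

# Four unit-magnitude QPSK symbols at 45, 135, 225, and 315 degrees.
qpsk = [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in range(4)]


def magnitude_view(z: complex) -> float:
    return abs(z)


def phase_view(z: complex) -> float:
    return cmath.phase(z)  # angle in (-pi, pi]


def cartesian_view(z: complex):
    return (z.real, z.imag)


def rotate(z: complex, theta: float) -> complex:
    """Apply a carrier-phase offset of theta radians."""
    return z * cmath.exp(1j * theta)


magnitudes = [magnitude_view(z) for z in qpsk]
phases = [phase_view(z) for z in qpsk]
rotated = [rotate(z, 0.3) for z in qpsk]
```

All four symbols share magnitude 1, so a magnitude-only view cannot separate the PSK classes, while the phases are distinct. The rotation leaves magnitudes unchanged but moves every Cartesian coordinate, which is consistent with the abstract's remark that unseen carrier-phase rotations break coordinate-dependent models.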

2605.27668 2026-05-28 cs.LG cs.AI cs.CL

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

将LLM与人类不确定性对齐:用于LLM预测的Beta-Bernoulli校准器

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

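AI summary Proposes the Beta-Bernoulli Calibrator (BBC), which combines binary outcomes with human forecast signals to convert an initial point estimate into a distribution over event likelihood, providing calibration and uncertainty quantification.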

详情
AI中文摘要

概率预测估计不确定未来事件的可能性。为了改进LLM预测,现有方法通常从二元结果中学习以输出语言化预测。然而,尽管聚合的人类预测在群体概率估计和预测者之间的一致程度中都包含丰富信息,如何利用这些信号仍未充分探索。为了解决这个问题,我们提出了Beta-Bernoulli校准器(BBC),它将来自任何模型的初始点估计转换为事件似然分布,使用来自二元结果和人类预测的监督。BBC对事件似然$p \sim \text{Beta}(α, β)$和结果$y \sim \text{Bernoulli}(p)$建模,均值作为校准的点预测,方差作为认知不确定性。我们的结果表明,BBC通常比传统的后验校准方法和专门为预测微调的模型提供更好校准和更准确的预测,同时保持轻量级并具有良好的泛化能力。我们还表明,BBC捕获的认知不确定性是比语言化置信度更可靠的预测误差指标。

英文摘要

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and generalizing well. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
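The two quantities the abstract reads off the Beta distribution follow from its standard moments: mean $\alpha/(\alpha+\beta)$ and variance $\alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$. The sketch below only evaluates these formulas; how BBC predicts $\alpha$ and $\beta$ from an initial forecast is not described in the abstract, so the parameter values used here are hypothetical.

```python
def beta_forecast(alpha: float, beta: float):
    """Return (mean, variance) of Beta(alpha, beta).

    The mean plays the role of the calibrated point forecast and the
    variance the role of epistemic uncertainty, as in the abstract.
    """
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1))
    return mean, variance


# Same point forecast of 0.5, but different concentration (hypothetical values).
diffuse_mean, diffuse_var = beta_forecast(2.0, 2.0)
sharp_mean, sharp_var = beta_forecast(20.0, 20.0)
```

Both parameter settings give the same point forecast, but the more concentrated one has a much smaller variance, which is how a single distribution can carry both a probability estimate and a degree of agreement.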