2606.04320 2026-06-04 cs.LG cs.AI

OpenRFM: Dissecting Relational In-Context Learning

Zhikai Chen, Junyu Yin, Jialiang Gu, Siheng Xiong, Xiaoze Liu, Ruowang Zhang, Keren Zhou, Kai Guo

AI summary: By analyzing both model-side and data-side issues in the Relational Transformer, this paper proposes a dual-stage in-context learning architecture and a homophily-aware pre-training mixture to build OpenRFM, which improves average task performance by about 30% over the RT backbone.

Comments: 25 pages, including appendix

Abstract

Relational Foundation Models (RFMs) promise a single pre-trained predictor that, given any relational database, returns predictions in one forward pass via relational in-context learning (ICL). Yet a substantial gap separates open RFMs from their commercial counterparts, and the origin of this gap has not been systematically understood. We dissect a representative framework, the Relational Transformer (RT), from two perspectives. Model side: we show that RT performs relation-level ICL, and a kernel regression view shows it fails when sparse label-cell coverage yields an underdetermined regression. Data side: we ablate RT's pre-training source and find that existing synthetic-only pre-training and in-distribution pre-training drive the same architecture into different regimes, lazy vs. feature-learning. Probing this gap reveals that the missing ingredient is a support-identifiable relational latent in the label-generation process. These two diagnoses translate into (1) a dual-stage ICL architecture that combines the relational backbone with a batch-level ICL layer lifted from a pre-trained tabular foundation model to overcome relation-level label scarcity, and (2) a homophily-aware synthetic plus continual real-data pre-training mixture, augmented with a prototype-based regularization. These choices define OpenRFM, a simple yet effective RFM that improves average task performance by approximately 30% over the RT backbone and surpasses the commercial model KumoRFMv1 on a large set of evaluation tasks.

2606.04319 2026-06-04 cs.GR cs.CV

PureLight: Learning Complex Luminaires with Light Tracing

Pedro Figueiredo, Zixuan Li, Beibei Wang, Miloš Hašan, Nima Khademi Kalantari

AI summary: Proposes a neural formulation that learns the radiance distribution of complex luminaires with light tracing and a normalizing flow network, then distills it into a lightweight MLP for efficient rendering.

Comments: 9 pages, 10 figures

Abstract

We propose a neural formulation for estimating the appearance of complex luminaires. We focus on challenging luminaires with complex light transport (e.g., small emitters enclosed by multiple specular layers) that are difficult for (bidirectional) path tracing. To this end, we use light tracing to construct paths from emitters to the exit surfaces and formulate appearance estimation as a distribution learning problem. Specifically, we model the probability density function (pdf) of outgoing radiance on the exit surfaces using a large normalizing flow network, and recover the outgoing radiance as the product of the estimated pdf and flux. To enable efficient inference, we distill the learned appearance into a lightweight MLP that directly estimates radiance on the exit surfaces. We additionally train a sampling network for effective direct illumination computation from the luminaire, and a blending network to composite the luminaire into the scene. Our formulation makes it feasible to render challenging luminaires using low sample counts in arbitrary scenes.
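
The "pdf times flux" recovery step can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a 1-D exit surface parameterized on [0, 1), and it swaps the normalizing flow for a plain histogram density estimate. The function name `estimate_radiance` and its parameters are hypothetical.

```python
def estimate_radiance(samples, total_flux, bins=10):
    """Toy stand-in for the learned pdf: value(x) = pdf(x) * total_flux.

    samples    -- hit positions in [0, 1) produced by light tracing (assumed 1-D)
    total_flux -- total flux leaving the emitter (assumed known)
    """
    counts = [0] * bins
    for x in samples:
        counts[min(int(x * bins), bins - 1)] += 1
    width = 1.0 / bins
    # Normalized histogram: integrates to 1 over [0, 1).
    pdf = [c / (len(samples) * width) for c in counts]
    # Scale the density by the total flux to recover the per-bin estimate.
    return [p * total_flux for p in pdf]
```

Integrating the returned values over the surface gives back `total_flux`, which is the property the product form relies on.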

2606.04317 2026-06-04 cs.CR cs.LG cs.SE

Toward a Generalized Defense Across Sparse, Continuous, and Structured Parameter Attacks

Bin Duan, Zeyu Bai, Guowei Yang

AI summary: Proposes ParDef, a framework that combines keyed channel reparameterization, QC-LDPC quantization, and adaptive robust inference into a generalized defense against multiple kinds of parameter attacks, lowering attack success rates while maintaining high model performance.

Abstract

Deep neural networks are increasingly deployed across heterogeneous and partially untrusted environments, where models are distributed through cloud storage, CI/CD pipelines, containerized services, and edge execution platforms. This broad deployment landscape exposes model parameters to various integrity risks. Unlike input-space adversarial attacks, parameter attacks directly tamper with the model's internal parameters and persist across all subsequent inferences. Existing defenses either require retraining, incur significant accuracy degradation, or are limited to specific attack classes. However, in real-world deployment scenarios, the forms of parameter attacks are often unpredictable. To address this challenge, we present ParDef, a generalized defense for deep neural networks against diverse types of parameter attacks. ParDef integrates keyed channel reparameterization, which obscures sensitive parameter directions, QC-LDPC quantization, which embeds redundancy and supports error correction, and adaptive robust inference, which stabilizes predictions under uncertainty. Our evaluation on CIFAR-10, CIFAR-100, and Tiny-ImageNet using ResNet and VGG models demonstrates that ParDef consistently reduces attack success rates across different parameter attacks while maintaining high model performance and incurring only moderate deployment overhead. These results highlight that ParDef is a practical and generalized defense for DNN deployments.

2606.04315 2026-06-04 cs.AI

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

Zhikai Chen, Jialiang Gu, Junyu Yin, Xianxuan Long, Shenglai Zeng, Xiaoze Liu, Kai Guo, Keren Zhou, Jiliang Tang

AI summary: After diagnosing how existing memory systems behave across several scenarios, the paper proposes AutoMEM, a self-managed memory harness built on tool calls, which achieves the best cross-scenario generality among the evaluated systems.

Comments: 14 pages

Abstract

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
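
To make "self-managed flat text-file storage via tool calls" concrete, here is a minimal sketch of what such a tool surface might look like. It is an assumption for illustration only: the class `FileMemory`, its two methods, and the file names are hypothetical and are not AutoMEM's interface.

```python
import os
import tempfile

class FileMemory:
    """Hypothetical flat-file memory exposed to an agent as two tools."""

    def __init__(self, root):
        self.root = root

    def write(self, name, text):
        # Tool 1: append a note to a plain-text file chosen by the agent.
        with open(os.path.join(self.root, name), "a", encoding="utf-8") as f:
            f.write(text.rstrip("\n") + "\n")

    def search(self, keyword):
        # Tool 2: case-insensitive keyword search over every stored file.
        hits = []
        for name in sorted(os.listdir(self.root)):
            with open(os.path.join(self.root, name), encoding="utf-8") as f:
                hits += [(name, line.strip()) for line in f
                         if keyword.lower() in line.lower()]
        return hits

def demo():
    # The agent decides what to store and when to look it up.
    with tempfile.TemporaryDirectory() as root:
        memory = FileMemory(root)
        memory.write("user.txt", "Prefers answers in French")
        memory.write("tasks.txt", "Bug #12 reproduced on main")
        return memory.search("french")
```

The point of the sketch is the control flow: the agent, not a fixed pipeline, chooses the file, the content, and the query.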

2606.04314 2026-06-04 cs.LG cs.SE

Testing Neural Networks via Bayesian-Guided Exploration of Decision Landscapes

Bin Duan, Meiru Che, Guowei Yang

AI summary: Proposes BayesWarp, a testing framework that uses interpretable saliency techniques to identify decision-critical input regions and uncertainty-aware Bayesian optimization to guide testing adaptively, uncovering diverse failures efficiently while staying close to the original data distribution and semantics.

Abstract

As neural networks are increasingly deployed in safety-critical domains, testing is essential to evaluate and improve their reliability. Existing testing methods, whether black-box or white-box, primarily use global mutation or coverage-guided strategies, both of which struggle to efficiently uncover diverse model failures while remaining proximate to the original data distribution and semantics. We propose BayesWarp, a testing framework that addresses this limitation by mutating decision-critical input regions identified via interpretable saliency techniques and adaptively guiding the testing process using an uncertainty-aware Bayesian Optimization strategy, enabling the discovery of diverse failures while preserving distributional and semantic proximity to the original data. Evaluation on MNIST, CIFAR-10, and ImageNet across six neural network models shows that BayesWarp improves failure discovery, failure diversity, test case quality, and critical neuron coverage under a fixed mutation budget. These results demonstrate that BayesWarp improves testing effectiveness. Moreover, fine-tuning with the generated failure cases leads to improvements in model performance.

2606.04310 2026-06-04 cs.LG cs.SE

Latent Anchor-Driven Test Generation for Deep Neural Networks

Bin Duan, Matthew B. Dwyer, Guowei Yang

AI summary: Proposes Latte, a framework that uses a pre-trained VQ-VAE to perform anchor-guided mutation in latent space, generating semantically proximate, diverse, and fault-revealing test cases and improving fault exposure and behavioral diversity.

Abstract

Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches explore either the input space or a learned latent space. While latent-space generation can better maintain plausibility than direct input-space mutation, current methods still face a trade-off among exploration controllability, failure diversity, and seed-relative semantic drift. To overcome these limitations, we propose Latte, a black-box testing framework that generates semantically proximate, diverse, and fault-revealing test cases by leveraging the latent space. Specifically, Latte encodes each input seed with a pre-trained VQ-VAE and performs a seed-centered, one-step latent mutation along directions defined by anchors sampled from alternative classes, followed by quantization and decoding back to the input space. This explores local neighborhoods around each seed within the learned latent manifold, resulting in a larger number and broader diversity of oracle-triggering prediction discrepancies under the same budget. We evaluated Latte on 5 datasets and 10 DNN models in single-model and multi-model testing scenarios. Across the evaluated datasets and models, Latte improves fault exposure and behavioral diversity under matched testing budgets. Under the single-model setting, it also maintains low seed-relative semantic drift with respect to the source seeds.
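
The encode, mutate, quantize sequence described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: a latent is a single short vector (a real VQ-VAE latent is a grid of such vectors), the codebook is a small list of tuples, and the names `nearest_code`, `mutate`, and `step` are hypothetical.

```python
def nearest_code(z, codebook):
    # Vector quantization: snap a latent to its closest codebook entry.
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))

def mutate(seed_z, anchor_z, step, codebook):
    """One seed-centered step toward an anchor drawn from another class."""
    moved = [s + step * (a - s) for s, a in zip(seed_z, anchor_z)]
    return nearest_code(moved, codebook)
```

A small `step` keeps the mutated latent in the seed's own codebook cell, while a larger one crosses into a neighboring cell; decoding the result back to input space is omitted here.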

2606.04307 2026-06-04 cs.LG stat.CO stat.ME

Folded Transport MCMC: Certifiable Quotient Posterior Computation for Symmetric Bayesian Models

Jun Hu

AI summary: Redundant multimodality in symmetric Bayesian models degrades MCMC convergence diagnostics. Folded Transport MCMC addresses this by constructing an independence sampler on the fundamental domain of the symmetry group, performing inference directly on the quotient posterior, and carries the LCNF oscillation-based certification framework over to the quotient metric to obtain provable certified lower bounds.

Comments: 48 pages (including supplementary material), 5 figures, 6 tables. Submitted to Journal of the Royal Statistical Society: Series B

Abstract

Bayesian models with finite symmetry - mixture models with exchangeable components, structural identification with closely-spaced modes - define posteriors that are invariant under a group of label permutations, creating redundant multimodality that degrades MCMC convergence diagnostics. We introduce Folded Transport MCMC (FolT-MCMC), which performs inference directly on the quotient posterior by constructing an independence sampler on the fundamental domain of the symmetry group. The quotient proposal is formed by symmetrising a learned normalising flow over the group orbits. We prove that the LCNF oscillation-based certification framework transfers to the quotient metric with a stabiliser-corrected ball-mass bound and improved covering radius, and that the quantile-core certified lower bound improves whenever the unfolded flow exhibits cross-mode proposal deficiency. On Gaussian mixtures (d = 2 - 20), label-switching targets (up to 24 equivalent modes), and a standard Bayesian three-component mixture posterior, the quantile-core certified improvement ratio ranges from 2x to 145x, with the folded certificate empirically nearly dimension-free. On real accelerometer data from a supertall building during Typhoon Mangkhut, FolT-MCMC yields a non-vacuous quantile-core certificate where the unfolded certificate is vacuous.
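
A small sketch may help picture folding and symmetrising for the label-permutation case. It assumes exchangeable scalar components, so the fundamental domain can be taken as the sorted ordering, and it averages a proposal density over the orbit; the normalisation convention and the names `fold` and `symmetrised_density` are assumptions, not the paper's definitions.

```python
import itertools

def fold(theta):
    # Map a parameter vector to the fundamental domain (sorted components).
    return tuple(sorted(theta))

def symmetrised_density(q, theta):
    # Average an arbitrary proposal density q over all label permutations,
    # which makes the result invariant under relabelling.
    perms = list(itertools.permutations(theta))
    return sum(q(p) for p in perms) / len(perms)
```

Every point of an orbit folds to the same representative, and the symmetrised density takes the same value on all of them, so the redundant modes collapse into one.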

2606.04305 2026-06-04 cs.LG stat.ML

Offline-to-Online Learning in Linear Bandits

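Kushagra Chandak, Toshinori Kitamura, Xiaoqi Tan

AI summary: For stochastic linear bandits, proposes an algorithm that balances offline data against online exploration, achieving sublinear regret and lowering the offline-reference regret as the number of offline samples grows.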

2606.04302 2026-06-04 cs.CL cs.LG

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park

AI summary: To address the poor reusability of positionally encoded KV caches in retrieval-augmented generation, proposes LazyAttention, which kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse, reducing time-to-first-token and raising inference throughput.

Comments: ICML 2026

Abstract

Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37× and increases inference throughput by 1.40× compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
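
The idea of deferring positional encoding can be sketched with a toy rotary encoding. The abstract does not name the positional scheme, so the 2-D rotation below is an assumption, as are the function names and the `theta` frequency; a real kernel would do this per head inside the attention computation.

```python
import math

def rotate(vec, pos, theta=0.5):
    # Toy 2-D rotary encoding: rotate the vector by pos * theta radians.
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def score_deferred(q, q_pos, cached_k, k_pos):
    """Attention logit where the cache holds the UN-rotated key.

    The position is applied only when the key is used, so one stored copy
    can be placed at any logical position without being rewritten.
    """
    qr, kr = rotate(q, q_pos), rotate(cached_k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]
```

Because rotations compose, the logit depends only on the offset between query and key positions, so the same cached key gives consistent scores wherever the document is spliced into the prompt.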

2606.04301 2026-06-04 cs.CV

XSSR: Cross-Domain Self-Supervised Representative Selection for Efficient Annotation in Medical Image Segmentation

Byunghyun Ko, Aleksei Anisimov, Kobe Ke, Suhas Bharthepude, Jeongkyu Lee

AI summary: Proposes XSSR, a framework that uses self-supervised learning to automatically select representative target-domain samples for annotation, approaching full-data performance with a 5% annotation budget.

Comments: Accepted to the Third International Conference on AI in Healthcare (AIiH 2026). This is the preprint version of the paper

Abstract

Acquiring labeled medical image data is resource-intensive and a challenge further exacerbated in cross-domain scenarios where source and target datasets differ in imaging equipment, population, or clinical site. This study introduces XSSR (Cross-Domain Self-Supervised Representative Selection), a framework designed to minimize annotation effort in the target domain while maintaining robust segmentation performance. XSSR comprises three stages: first, a Masked Autoencoder (MAE) is trained on unlabeled source data to establish a shared embedding space without requiring target labels; second, a greedy selection algorithm scores unlabeled target samples based on a composite density, novelty, and diversity criterion; and third, a U-Net segmentation model is trained exclusively on the selected subset. The novelty-diversity trade-off parameter, alpha, is automatically calibrated by minimizing embedding-space coverage, eliminating manual tuning. We evaluate XSSR on three public benchmarks: Chest X-ray, RIGA+ retinal fundus imaging, and multi-site Prostate MRI, each under a fixed 5% annotation budget. XSSR achieves 99.3% of full-data performance on Chest X-ray using only 22 labeled samples, surpasses random selection by up to 2.5 Dice points on Prostate MRI, and consistently outperforms the CoreSet baseline by 0.4 to 1.2 Dice points across all datasets. Ablation studies indicate that diversity is the most influential scoring component, and per-site analysis shows that performance correlates with scanner similarity to the source domain.

2606.04299 2026-06-04 cs.CV cs.LG

Efficient and Training-Free Single-Image Diffusion Models

Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell

AI summary: Proposes a training-free single-image diffusion model built on a multi-scale patch dataset; a closed-form optimal denoiser enables efficient generation with quality and diversity comparable to trained models.

Comments: CVPR 2026; Project Page: https://haojunqiu.github.io/efficient-SID/

Abstract

We consider the problem of generating images whose internal structure -- defined by the distribution of patches across multiple scales -- matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
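
For a finite dataset, the closed-form denoiser has a standard form: the posterior mean of the clean patch is a softmax-weighted average of the dataset patches. The sketch below assumes additive Gaussian noise of standard deviation `sigma` and a uniform prior over patches; the function name is hypothetical, and the paper's multi-scale machinery and accelerations are not shown.

```python
import math

def optimal_denoise(noisy, patches, sigma):
    """Posterior mean E[x0 | noisy] when x0 is uniform over `patches`."""
    # Log-weight of each patch: -||noisy - patch||^2 / (2 * sigma^2).
    logits = [-sum((a - b) ** 2 for a, b in zip(noisy, p)) / (2 * sigma ** 2)
              for p in patches]
    m = max(logits)                      # subtract the max for stability
    w = [math.exp(l - m) for l in logits]
    total = sum(w)
    return [sum(wi * p[d] for wi, p in zip(w, patches)) / total
            for d in range(len(noisy))]
```

At high noise the output blends many patches; as `sigma` shrinks it snaps to the nearest one, which is why no network needs to be trained for this step.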

2606.04298 2026-06-04 cs.NI cs.AI

Anycast Performance in Context

Eric Liang

AI summary: By comparing anycast latency in root DNS and CDNs, the paper proposes an optimization framework that separates resilience-driven from latency-driven objectives and concludes that operators should not optimize root DNS and CDN anycast with the same objective function.

Abstract

IP anycast lets a service advertise one address from many physical sites, leaving BGP to map each client to a site. It is central to the DNS root server system, public resolvers, and some content delivery networks, yet the same routing mechanism has very different consequences across applications. This paper compares anycast latency in two settings: root DNS, where recursive caching amortizes root-server delay over many users and long time-to-live values, and CDNs, where each additional round trip can directly affect page-load, video-start, or API latency. The synthesis finds that root DNS anycast can exhibit substantial path inflation while still producing limited user-visible delay, whereas CDN anycast requires active engineering of peering, route policy, catchment scope, and measurement feedback to keep inflation small. The paper contributes a comparative latency model, a reproducible measurement design, and an optimization framework that separates resilience-driven anycast objectives from latency-driven objectives. The central conclusion is practical: operators should not optimize root DNS and CDN anycast with the same objective function. For root DNS, robustness, reachability, and cache behavior dominate; for CDN services, tail latency, catchment correctness, and policy control dominate.
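
The amortization argument can be made concrete with a one-line model. This is a rough sketch, not the paper's comparative latency model: it assumes the user only pays the anycast round trip on a cache miss, and the function name and all numbers in the usage note are hypothetical.

```python
def user_visible_delay_ms(anycast_rtt_ms, cache_hit_rate):
    # Expected delay per request: the round trip is paid only on cache misses.
    return (1.0 - cache_hit_rate) * anycast_rtt_ms
```

With an inflated 150 ms root-server path but a 99% recursive-cache hit rate, the expected user-visible cost is about 1.5 ms, whereas a 40 ms CDN path with no caching in front of it costs the full 40 ms on every request. That asymmetry is why the two deployments call for different objectives.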

2606.04296 2026-06-04 cs.AI

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

Manvendra Modgil

AI summary: Using HEART, an 18-dimensional affective-dynamics engine, as a diagnostic probe for intervention timing on autonomous agents, the study finds a state saturation trap, a capability-and-context floor for LLM judges, and very low agreement among human annotators on when to intervene, indicating that intervention timing is a low-reliability construct.

Comments: 11 pages, 5 tables. Code and data: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

Abstract

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
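
The pairwise agreement statistic quoted above, Cohen's kappa, corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch of the textbook formula follows; it is not the study's analysis code, and the label encoding (one label per action) is an assumption.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)   # undefined when pe == 1
```

A value near 0 means agreement is no better than chance, and 1 means perfect agreement, which puts the reported +0.349 in perspective.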

2606.04291 2026-06-04 cs.CV

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Hongyang Du, Zongxia Li, Dawei Liu, Runhao Li, Haoyuan Song, Qingyu Zhang, Yubo Wang, Jingcheng Ni, Shihang Gui, Congchao Dong, Tao Hu

AI summary: Presents a data-centric taxonomy of 3D vision that analyzes geometric representations such as point clouds, meshes, voxels, and 3D Gaussians together with their acquisition pipelines, as well as dataset design, benchmark construction, and supervision regimes, clarifying how representations, learning paradigms, and downstream tasks (reconstruction, generation, video modeling) relate.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026
Comments: Accepted to the CVPR 2026 OpenSUN3D Workshop. Official version available at CVF Open Access. https://openaccess.thecvf.com/content/CVPR2026W/OpenSUN3D/html/Du_A_Cookbook_of_3D_Vision_Data_Learning_Paradigms_and_Application_CVPRW_2026_paper.html

Abstract

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data--point clouds, meshes, voxels, and 3D Gaussians--along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.

2606.04290 2026-06-04 cs.LG math.OC

PE-MHL: Physics-Encoded Modular Hybrid Layers for Scalable Learning of Complex Systems

Ismail Hassaballa, Mircea Lazar

AI summary: Proposes the Physics-Encoded Modular Hybrid Layer (PE-MHL) framework, which adds sub-models incrementally while guaranteeing monotonically non-increasing training error, enabling scalable and robust hybrid modeling that outperforms equivalently sized monolithic networks on a nonlinear NARX benchmark and the Quanser Aero 2 platform.

Abstract

Hybrid models that combine physics-based and data-driven components have shown strong potential for achieving accuracy and interpretability in control applications. While recent methods have made progress in incorporating physical consistency, challenges remain in scalability, robustness to noise, and control of model complexity. This paper proposes a Physics-Encoded Modular Hybrid Layer (PE-MHL) framework, in which a baseline physics-based model is incrementally refined through the addition of new sub-models, where each new component adds complexity while preserving what previous components have already learned. We establish a theoretical guarantee for this construction: with a least-squares initialization of each new sub-model, the training error is monotonically non-increasing in the number of sub-models and provably converges. Empirical evaluations on a nonlinear NARX benchmark and the Quanser Aero 2 platform demonstrate that PE-MHL outperforms equivalently sized monolithic networks in both accuracy and generalization, while also providing more stable training dynamics and better preservation of underlying data structures.
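
The monotonicity claim can be illustrated with a deliberately small analogue. The sketch assumes each sub-model is a single fixed feature with one scalar coefficient, fitted by least squares on the current residual; the paper's sub-models are network layers that are further trained after initialization, so this shows only why a least-squares start cannot increase the training error. All names are hypothetical.

```python
def sse(preds, ys):
    # Sum of squared training errors.
    return sum((p - y) ** 2 for p, y in zip(preds, ys))

def fit_incremental(xs, ys, physics, features):
    """Add one sub-model at a time; return the training error after each step."""
    preds = [physics(x) for x in xs]          # baseline physics-based model
    errors = [sse(preds, ys)]
    for phi in features:
        residual = [y - p for y, p in zip(ys, preds)]
        f = [phi(x) for x in xs]
        denom = sum(v * v for v in f)
        # Least-squares coefficient of the new sub-model on the residual.
        c = sum(r * v for r, v in zip(residual, f)) / denom if denom else 0.0
        preds = [p + c * v for p, v in zip(preds, f)]
        errors.append(sse(preds, ys))
    return errors
```

Since choosing `c = 0` would reproduce the previous model exactly, the least-squares choice can only match or lower the error, and earlier components are left untouched.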

2606.04287 2026-06-04 cs.LG cs.AI

Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models

Alessio Barboni, Massimiliano Lupo Pasini, Bishal Lakha, Edoardo Serra

AI summary: Proposes a lightweight autoregressive framework that uses a structure-guided topological ordering and a two-phase training strategy to generate graphs with high novelty, validity, and uniqueness on molecular and non-molecular benchmarks.

Abstract

Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cybersecurity, and beyond. However, current graph generative models remain limited by scalability and novelty. Diffusion-based methods often require costly full-adjacency operations and long denoising chains, while many autoregressive and hybrid models have at least quadratic complexity. In addition, these models often imitate training graphs rather than generalize beyond them. We propose a lightweight autoregressive framework to address these issues. It uses a structure-guided topological ordering to serialize graphs into regular edge sequences, enabling near log-linear generation, and a two-phase training strategy that combines exploration-oriented augmentation with iterative refinement to reduce overfitting and promote controlled novelty. Experiments on molecular and non-molecular benchmarks show that our approach improves novelty while preserving high validity and uniqueness. The framework also supports both LSTM and Mamba-style causal sequence backbones, with large-memory accelerators enabling longer graph-sequence experiments beyond typical GPU limits.
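
Serializing a graph into an edge sequence requires fixing a node order first. The abstract does not specify the structure-guided ordering, so the sketch below substitutes a plain breadth-first order as a stand-in; it also assumes a connected, undirected graph given as an adjacency dictionary. The function name is hypothetical.

```python
from collections import deque

def serialize(adj, start):
    """Relabel nodes in BFS order, then emit each edge once as (later, earlier)."""
    order, seen, queue = {}, {start}, deque([start])
    while queue:
        u = queue.popleft()
        order[u] = len(order)            # position of u in the ordering
        for v in sorted(adj[u]):         # sorted for a deterministic order
            if v not in seen:
                seen.add(v)
                queue.append(v)
    edges = {(max(order[u], order[v]), min(order[u], order[v]))
             for u in adj for v in adj[u]}
    return sorted(edges)
```

The resulting list is the kind of regular sequence an autoregressive model (for example an LSTM or a Mamba-style backbone) can consume token by token.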

2606.04286 2026-06-04 cs.CL

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

Linsen Li, Aron Culotta, Nicholas Mattei

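AI summary: Proposes a text-based causal analysis method built on CausalBERT, improved with temperature scaling, hyperparameter optimization, and interpretability methods, to disentangle the effect of each factor on overall ratings using more than 600K reviews of U.S. K-12 schools.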

DOI: 10.18653/v1/2025.naacl-long.562
Journal ref: In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
Comments: HLT/NAACL 2025

Abstract

Online reviews provide valuable insights into the perceived quality of facets of a product or service. While aspect-based sentiment analysis has focused on extracting these facets from reviews, there is less work understanding the impact of each aspect on overall perception. This is particularly challenging given correlations among aspects, making it difficult to isolate the effects of each. This paper introduces a methodology based on recent advances in text-based causal analysis, specifically CausalBERT, to disentangle the effect of each factor on overall review ratings. We enhance CausalBERT with three key improvements: temperature scaling for better calibrated treatment assignment estimates; hyperparameter optimization to reduce confound overadjustment; and interpretability methods to characterize discovered confounds. In this work, we treat the textual mentions in reviews as proxies for real-world attributes. We validate our approach on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools. We find that the proposed enhancements result in more reliable estimates, and that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.
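
Temperature scaling, the first of the three enhancements, can be sketched generically: logits are divided by a scalar T before the softmax, and T is chosen to minimize the negative log-likelihood on held-out data. The snippet below is a minimal illustration under stated assumptions, not the paper's code: the validation logits, the labels, the grid search, and all function names are made up for this example.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Made-up, overconfident validation logits for a binary treatment model.
val_logits = [[4.0, 0.0], [3.5, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]  # the last example is confidently wrong
grid = [0.5 + 0.25 * k for k in range(15)]  # 0.5, 0.75, ..., 4.0
best_t = fit_temperature(val_logits, val_labels, grid)
```

A fitted temperature above 1 softens the predicted probabilities, which is the usual outcome when a model is overconfident on held-out data.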

2606.04284 2026-06-04 cs.LG cs.AI cs.CL

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg

AI summary: Proposes a sparse Mixture-of-Experts reward model that, through sparse routing and expert-diversity training, learns interpretable expert patterns from binary preference data, improving test-time adaptation and interpretability in personalized preference modeling.

Abstract

Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
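
To make the idea of a sparsely routed reward concrete, here is a minimal sketch that assumes top-k softmax gating and a Bradley-Terry loss on binary preferences. The abstract does not specify the routing mechanism or the loss, so both are assumptions, as are the function names and the made-up numbers. For simplicity the same routing weights are used for both responses.

```python
import math


def top_k_gate(router_logits, k):
    """Softmax over the k largest router logits; all other experts get weight 0."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_reward(router_logits, expert_scores, k=2):
    """Scalar reward as a gate-weighted sum of the selected experts' scores."""
    gate = top_k_gate(router_logits, k)
    return sum(weight * expert_scores[i] for i, weight in gate.items())


def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood for one binary preference."""
    return math.log1p(math.exp(-(reward_chosen - reward_rejected)))


router_logits = [2.0, 0.5, 1.0, -1.0]    # made-up router output for one prompt
chosen_scores = [1.2, -0.3, 0.8, 0.0]    # made-up per-expert scores, chosen response
rejected_scores = [0.1, 0.9, -0.5, 0.4]  # made-up per-expert scores, rejected response
loss = preference_loss(
    moe_reward(router_logits, chosen_scores),
    moe_reward(router_logits, rejected_scores),
)
```

Under this kind of gating, only the selected experts contribute to the reward, so their weights can be inspected directly for each input.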

2606.04282 2026-06-04 cs.CV

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne

AI summary: Proposes the first comprehensive benchmark for evaluating the promptable localization abilities of generalist multimodal LLMs, covering four core tasks with standardized input and output formats, and reveals that models are sensitive to format constraints.

Abstract

Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models' ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.
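
The sketch below shows what enforcing parsable bounding-box outputs could look like in practice. It assumes a hypothetical `[x1, y1, x2, y2]` output convention, which is not taken from the benchmark; the regular expression, the function names, and the example responses are likewise illustrative. A strict parser extracts boxes from a model response, and intersection-over-union (IoU) scores a predicted box against a reference.

```python
import re

# Hypothetical output convention: boxes written as [x1, y1, x2, y2].
NUMBER = r"(-?\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(
    r"\[\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*\]"
)


def parse_boxes(text):
    """Return every well-formed box in a response; an empty list if none parse."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(group) for group in match.groups())
        if x2 > x1 and y2 > y1:  # drop degenerate or inverted boxes
            boxes.append((x1, y1, x2, y2))
    return boxes


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


in_format = "The dog is at [10, 20, 110, 220]."
off_format = "The dog is at (10, 20) to (110, 220)."
```

With a strict parser like this one, a response that describes the right region in a different notation yields no boxes at all, which is one way format sensitivity can show up as a localization failure.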

2606.04280 2026-06-04 cs.LG cs.AI cs.IR

The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning

Justinas Zaliaduonis, Patrick Putzky, Till Richter, Sergios Gatidis

AI summary: Formalizes the diversity condition in contrastive learning within a measure-theoretic framework, proposes a support-corrected InfoNCE variant, and experimentally validates the interaction between sampling diversity and encoder inductive bias.

Abstract

Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers meaningful latent geometry remain incompletely understood. We develop a measure-theoretic framework formalizing the diversity condition, a support requirement on positive-pair sampling that is necessary for isometric latent recovery. We show that the standard full-support von Mises-Fisher setting implies the satisfaction of the diversity condition and as a consequence global contrastive loss minimizers recover latent geometry up to orthogonal transformation, while restricted conditionals can make non-orthogonal maps attain strictly lower asymptotic contrastive loss. We introduce a support-corrected Information Noise Contrastive Estimation (InfoNCE) variant as a theoretical fix: this correction makes orthogonal latent space recovery achievable but does not uniquely select it. Experiments on synthetic benchmarks validate the identifiability predictions, and CIFAR-10 experiments are consistent with the qualitative prediction that architectural inductive bias becomes more important when sampling diversity is limited. Together, our results clarify how sampling mechanisms and encoder inductive bias interact in contrastive representation learning.
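
For reference, the standard InfoNCE objective that the abstract builds on can be sketched in a few lines. This is the plain loss on unit-normalized embeddings with in-batch negatives, not the support-corrected variant proposed in the paper; the toy vectors, the temperature, and the function names are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] pairs with anchors[i], the rest act as negatives."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every positive is mismatched

loss_aligned = info_nce(anchors, aligned)
loss_shuffled = info_nce(anchors, shuffled)
```

The loss is low when each anchor is closest to its own positive and high when the pairing is broken, so which pairs the sampling process can actually produce determines what the minimizer is pushed toward.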

2606.04279 2026-06-04 cs.LG quant-ph

Derivative Informed Learning of Exchange-Correlation Functionals

Eike S. Eberhard, Luca A. Thiede, Abdul Aldossary, Andreas Burger, Nicholas Gao, Vignesh Bhethanabotla, Alán Aspuru-Guzik, Stephan Günnemann

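AI summary: Proposes the Derivative Informed XC-Loss (DI-Loss), which supervises first and second derivatives of the energy on the Grassmannian of density matrices to train O(N^3)-scaling machine-learned exchange-correlation functionals to reproduce B3LYP/def2-SVP targets; across several architectures it lowers the total-energy MAE by 66% on average and reduces hybrid-functional SCF iterations by up to 50%.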

Comments: Proceedings of the 43rd International Conference on Machine Learning

Abstract

Machine-learned (ML) exchange-correlation (XC) functionals aim to replace human-designed density functional approximations by learning directly from reference data, but they still do not consistently outperform traditional $\mathcal{O}(N^4)$-scaling hybrid functionals. We study a hybrid-distillation setting in which $\mathcal{O}(N^3)$-scaling ML-XC functionals are trained to reproduce B3LYP/def2-SVP targets. We introduce Derivative Informed XC-Loss (DI-Loss), a loss that incorporates additional information from the reference hybrid functional by supervising first and second derivatives of the energy on the Grassmannian of admissible density matrices. Rather than only matching the self-consistent fixed point, DI-Loss aligns the local first- and second-order response of the learned functional with that of the target functional. Across four evaluated architectures, DI-Loss consistently improves the main energy metrics. Averaged uniformly across architectures, the total-energy MAE decreases by 66% relative to energy and density supervision alone. The density-sensitive mean-field energy metric $E_\rho$ improves from $1.2$ to $0.8$ mEh on average, while dipole and $\mathcal{L}_2$ density errors do not improve uniformly. We further show that densities from the distilled functionals reduce hybrid-functional SCF iterations by up to 50%. In downstream TDDFT calculations, Hessian supervision improves excited-state predictions, with XCdiff reducing the mean excitation-energy MAE by 19-35%.

2606.04275 2026-06-04 cs.LG cs.AI

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

Saket Tiwari, Tejas Kotwal, George Konidaris

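AI summary: Models deep reinforcement learning as a continuous-time stochastic process and, using stochastic control theory, derives for the first time an evolution equation for the state distribution of overparametrized neural actor-critic algorithms in continuous environments in the infinite-width limit.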

Comments: Presented at ICLR 2026: https://openreview.net/forum?id=TdiRLe3rPA

Abstract

We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a two-time-scale process: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.

2606.04274 2026-06-04 cs.CL cs.CY

Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

JooYoung Lee, Lin Tian, Angela Brillantes, Adriana-Simona Mihăiţă, Marian-Andrei Rizoiu

AI summary: Compares fine-tuned models with zero-shot LLMs on classifying Reddit comments about misinformation, finding that fine-tuned RoBERTa clearly outperforms the best zero-shot model in macro-F1 at lower cost, which indicates that task-specific fine-tuning remains more reliable for detecting implicit misinformation.

Abstract

As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.
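
Macro-F1 weights each class equally, which is why under-detecting the belief class is so costly on this metric. The sketch below computes it by hand on made-up labels (not the paper's data) for a classifier that never predicts belief; the function name and the toy examples are illustrative.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["belief", "fact-check", "other"]
# Made-up example: every "belief" comment is misclassified as "other".
y_true = ["belief", "belief", "fact-check", "other", "other", "other"]
y_pred = ["other", "other", "fact-check", "other", "other", "other"]

score = macro_f1(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case accuracy is 4/6 while macro-F1 is (0 + 1 + 0.75) / 3, so the missed class pulls the macro average below the accuracy.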

2606.04273 2026-06-04 cs.AI

Characterizing initial human-AI proof formalization workflows

Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou, Shanda Li, Sam Looi, Joshua B. Tenenbaum, Umang Bhatt, Adrian Weller, Jose Hernandez-Orallo, Cameron E. Freer, Valerie Chen, Ilia Sucholutsky

AI summary: Uses a mixed-methods analysis to study what people want from AI tools when formalizing proofs, the barriers they see, and how they actually use the tools; AI assistance raises formalization accuracy, and user preferences are diverse but generally favor keeping high-level human control over the proof discovery process.

Abstract

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.

2606.04272 2026-06-04 cs.LG

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham Kakade

AI summary: Questions the standard LLM training pipeline, which applies reinforcement learning only after pre-training and supervised fine-tuning; by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate checkpoints, the paper finds that RL is effective early and can match the full pipeline, and proposes merging the RL and SFT objectives by parallel averaging, which outperforms the other methods while preserving general capabilities.

Abstract

The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT$\to$RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model's distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms all other training methods discussed across metrics, while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.

2606.04271 2026-06-04 cs.CV cs.AI

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Stepan Konev

AI summary: Proposes the StandardE2E framework, which addresses incompatible formats across end-to-end autonomous driving datasets through a unified data schema, joint loading of multiple datasets, and a simplified process for adding new datasets.

Abstract

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

2606.04269 2026-06-04 cs.RO cs.AI cs.CV

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Yilong Wang, Cheng Qian, Edward Johns

AI summary: Proposes the Instant-Fold framework, which uses in-context imitation learning from a single human demonstration to infer and execute diverse deformable-object manipulation modes without gradient updates, and transfers zero-shot to the real world after training in simulation.

Abstract

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

2606.04266 2026-06-04 cs.CR cs.LG

Long-Term and Short-Term Transistor Aging in Deep Neural Networks: Impact and Mitigation

Alireza Sarmadi, Virinchi Roy Surabhi, Prashanth Krishnamurthy, Hussam Amrouch, Ramesh Karri, Farshad Khorrami

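AI summary: Studies the impact of long-term and short-term transistor aging on deep neural network inference accuracy and proposes an aging-aware retraining method to mitigate the degradation.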

Comments: 28 pages, 16 figures

Abstract

Deep neural networks (DNNs) are used in a variety of real-world applications including, for example, image classification and speech recognition. The inference accuracy of DNNs implemented on hardware in integrated circuits (ICs) degrades under phenomena such as transistor aging. Aging slows down the switching speed of transistors, resulting in system-level timing violations due to unsustainable clocks. To maintain reliability for the entire projected lifetime, designers add guardbands to prevent timing violations; however, adding large timing guardbands causes losses in performance (speed or throughput). This chapter provides a detailed discussion of the effects of long-term and short-term transistor aging on DNN inference accuracy. Furthermore, to mitigate the effects of aging on DNN accuracy and keep them at bay, a methodology for aging-aware retraining is presented in order to generate a resilient DNN even when aggressive (i.e., smaller than required) guardbands are used. This improves the inference accuracy of the DNNs even in the presence of aging-induced degradation. These effects are discussed in this chapter along with mitigation strategies on a hardware implementation of a DNN for image classification on an off-the-shelf image dataset. The application of short-term aging as an excitation mechanism for the detection of hardware Trojans in integrated circuits is also briefly discussed.

2606.04265 2026-06-04 math.OC cs.LG cs.NA math.NA

Nonlocal Mean Field Schrödinger Bridge with Learned Interactions

Daisuke Inoue, Mathieu Laurière, Dante Kalise

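AI summary: Proposes a Mean-Field Schrödinger Bridge method that approximates nonlocal interactions with neural network surrogates, reducing the per-step inference cost from quadratic to linear, and derives stability bounds on how surrogate errors propagate.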

Comments: 31 pages, 15 figures

Abstract

The Schrödinger Bridge Problem constructs a stochastic process that connects an initial distribution to a terminal distribution with minimum energy. This work considers its mean-field extension, the Mean-Field Schrödinger Bridge, for interacting particle systems. With nonlocal interactions, evaluating the resulting particle-dependent distributional terms can scale quadratically with the population size, which makes large-scale problems intractable. We address this bottleneck by approximating the nonlocal interactions with neural network surrogates. The resulting four-stage alternating algorithm reduces the per-step cost from quadratic to linear in the population size at inference. We also derive Grönwall-type stability bounds that show how surrogate errors propagate to the generated trajectories. In numerical experiments on navigation and opinion-dynamics tasks, the proposed method reproduces trajectories obtained with analytical evaluation and reduces training time.

2606.04264 2026-06-04 cs.CV

UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

Zeyuan Yang, Hao-Wei Chen, Xueyang Yu, Yuncong Yang, Haoyu Zhen, Ziqiao Ma, Maohao Shen, Chuang Gan

AI summary: Proposes UniCanvas, which uses a diffusion model to generate text and images jointly by embedding text in images on a pixel canvas, addressing the shortcomings of existing models in visual and text generation.

Abstract

Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.
