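arXivDaily: a daily digest of academic papers, synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.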

2605.21488 2026-05-21 cs.LG

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Benhao Huang, Zhengyang Geng, Zico Kolter

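AI summary: This paper proposes Equilibrium Reasoners (EqR), which achieve scalable reasoning by learning task-conditioned attractors. The method needs no external verifier or task-specific prior at test time and improves by scaling depth and breadth, raising accuracy on Sudoku-Extreme from 2.6% to over 99%.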

Comments: ICML 2026

Abstract

Scaling test-time compute by iteratively updating a latent state has emerged as a powerful paradigm for reasoning. Yet the internal mechanisms that enable these iterative models to generalize beyond memorized patterns remain unclear. We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions. We formalize this process through Equilibrium Reasoners (EqR), which enable test-time scaling without external verifiers or task-specific priors. EqR scales internal dynamics along two axes: depth, by running more iterations, and breadth, by aggregating stochastic trajectories from multiple initializations. Empirically, gains from test-time scaling are tightly coupled with stronger convergence toward solution-aligned attractors. This attractor perspective allows neural networks to adaptively allocate test-time compute based on task difficulty. While simple cases converge within 1 to 5 iteration steps, harder cases benefit from massive test-time scaling. By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme. These results suggest that learned attractor landscapes provide a useful mechanistic lens for understanding scalable reasoning in iterative latent models.
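
As a toy illustration of the depth and breadth axes described in this abstract, the sketch below iterates a hand-written update until the state stops changing (depth) and takes a majority vote over several initializations (breadth). The update rule `step`, the tolerance, the rounding, and the voting rule are all assumptions made for illustration; the paper's learned dynamics and its aggregation method are not given in the abstract.

```python
from collections import Counter

def step(z, x, lr=0.1):
    # Hypothetical latent update: gradient descent on the double-well potential
    # V(z) = z**4/4 - z**2/2 - x*z, whose minima play the role of attractors.
    return z - lr * (z**3 - z - x)

def run_to_equilibrium(z0, x, tol=1e-9, max_iters=10_000):
    # Depth axis: keep iterating until the state stops changing, so easy
    # starting points use fewer steps than hard ones.
    z = z0
    for k in range(1, max_iters + 1):
        z_next = step(z, x)
        if abs(z_next - z) < tol:
            return z_next, k
        z = z_next
    return z, max_iters

def solve(x, inits):
    # Breadth axis: run several initializations and majority-vote the endpoints.
    endpoints = [round(run_to_equilibrium(z0, x)[0], 3) for z0 in inits]
    return Counter(endpoints).most_common(1)[0][0]

answer = solve(0.3, [-2.0, -1.0, 0.0, 1.0, 2.0])
```

With these five starting points, three trajectories settle in the positive well and two in the negative one, so the vote returns the positive fixed point. The initializations are fixed here to keep the example deterministic; the abstract describes stochastic trajectories.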

2605.21486 2026-05-21 cs.LG cond-mat.dis-nn cs.AI stat.ML

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Dayal Singh Kalra, Maissam Barkeshli

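AI summary: This paper studies how to quantify hyperparameter transfer using three metrics. It finds that the main benefit of Maximal Update parameterization (μP) over standard parameterization comes from maximizing the embedding-layer learning rate, and that weight decay improves scaling-law fits but reduces extrapolation robustness.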

Comments: 10+28 pages, 5+17 figures

Abstract

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.
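
The embedding-layer change described in this abstract can be sketched as a per-parameter learning-rate table. This is only a rough illustration: the function name `learning_rates`, the rule that a parameter belongs to the embedding layer when its name contains "embed", and the example parameter names are all hypothetical, and how the paper defines the width factor is not stated in the abstract.

```python
def learning_rates(param_names, global_lr, width):
    # Assumption: a parameter is part of the embedding layer if its name
    # contains "embed". Every other parameter keeps the global rate.
    rates = {}
    for name in param_names:
        is_embedding = "embed" in name
        # Raise only the embedding-layer rate by a factor of width.
        rates[name] = global_lr * width if is_embedding else global_lr
    return rates

names = ["embed.weight", "block0.attn.weight", "block0.mlp.weight", "lm_head.weight"]
rates = learning_rates(names, global_lr=1e-4, width=512)
```

A table like this would typically be turned into optimizer parameter groups, one per distinct learning rate.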

2605.21485 2026-05-21 cs.LG

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

Mansoor Ahmed, Sujin Lee, Umar Khayaz, Murray Patterson

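AI summary: This paper proposes EvoStruct, which bridges evolutionary and structural priors by adapting a protein language model. It addresses vocabulary collapse in antibody CDR design, improving amino acid recovery and lowering perplexity.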

Abstract

Equivariant graph neural network (GNN) methods for antibody complementarity-determining region (CDR) design achieve the highest sequence recovery but suffer from severe vocabulary collapse. The current best GNN methods over-predict very few amino acids, such as tyrosine and glycine, while ignoring functionally important residues. We trace this failure to GNN encoders learning amino acid distributions de novo from limited structural data, discarding substitution patterns encoded in evolutionary databases. To resolve this, we propose EvoStruct, which bridges a frozen protein language model (PLM) with 3D structural context from an E(3)-equivariant GNN via a cross-attention adapter. Unlike prior PLM-structure adapters for general protein design, EvoStruct targets the vocabulary collapse problem specific to CDR design through progressive PLM unfreezing and R-Drop consistency regularization. On the CHIMERA-Bench dataset, EvoStruct achieves the highest amino acid recovery and lowest perplexity among several antibody design methods, improving sequence recovery by 16% and reducing perplexity by 43% relative to the best GNN baselines, while recovering 2.3x greater amino acid diversity and the highest binding-pair correlation with ground truth.

2605.21484 2026-05-21 cs.CV

One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

Chaoyang Wang, Yunhai Tong

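AI summary: This paper proposes Fixed-Point Distillation (FPD), an end-to-end framework that builds local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. It lifts discrete tokens into continuous features and applies a multi-bandwidth drift loss that iteratively accumulates these corrections. A straight-through estimator routes continuous gradients back to the student logits, and an optional unconditional adversarial objective improves perceptual realism. On class- and text-conditional generation, FPD reaches competitive visual fidelity and structural alignment in a single inference step, narrowing the gap to multi-step teachers and outperforming existing discrete distillation baselines.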

Abstract

Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

2605.21483 2026-05-21 astro-ph.CO cs.LG

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

Tilman Tröster, David Mirkovic, Veronika Oehl, Arne Thomsen

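AI summary: This study proposes Velocityformer, an equivariant graph transformer that matches the broken symmetry of the observational data to improve cosmological velocity reconstruction, raising the velocity correlation coefficient r by 35% over the standard linear theory baseline.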

Abstract

Precise measurement of the kinematic Sunyaev-Zel'dovich (kSZ) effect - a probe of the large-scale distribution of baryonic matter, a key observable for cosmological inference - requires accurate reconstruction of galaxy velocities from spectroscopic surveys. The signal-to-noise ratio (SNR) of kSZ measurements scales directly with the correlation coefficient $r$ between reconstructed and true velocities. We introduce Velocityformer, an equivariant graph transformer architecture designed to match the specific symmetry of the observational data. While the underlying physics is equivariant with respect to translations and rotations, observational effects break this symmetry due to the preferred line-of-sight direction. Matching the model's inductive bias to the data's broken symmetry consistently improves performance across all model sizes and training volumes, with Velocityformer improving $r$ by 35% over the standard linear theory baseline and outperforming ML baselines at every data volume. By matching the model's inductive bias to the data and conditioning on the physics-based long-wavelength solution, Velocityformer is highly data-efficient, training to high accuracy on as few as 4 low-fidelity simulations, and generalises zero-shot across input geometry, cosmological parameters, and galaxy sample. On high-fidelity simulated galaxy catalogues, this yields a 30% improvement in $r$ over the physical baseline, directly translating to the same SNR gain on observational data.
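
The figure of merit in this abstract is the correlation coefficient r between reconstructed and true velocities. A minimal sketch of that quantity is below, assuming r is the ordinary Pearson correlation over paired samples; the function name and the sample values are hypothetical and not taken from the paper.

```python
import math

def correlation(reconstructed, true):
    # Pearson correlation coefficient between two equal-length samples.
    n = len(true)
    mean_rec = sum(reconstructed) / n
    mean_true = sum(true) / n
    cov = sum((a - mean_rec) * (b - mean_true) for a, b in zip(reconstructed, true))
    var_rec = sum((a - mean_rec) ** 2 for a in reconstructed)
    var_true = sum((b - mean_true) ** 2 for b in true)
    return cov / math.sqrt(var_rec * var_true)

# Made-up velocities: a perfect linear match and a partial match.
r_perfect = correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_partial = correlation([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

Because the abstract states that the kSZ signal-to-noise ratio scales directly with r, a relative gain in r carries over as the same relative gain in SNR.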

2605.21482 2026-05-21 cs.AI

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma

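AI summary: This paper introduces DeepWeb-Bench, a benchmark that evaluates frontier language models on deep research tasks requiring massive evidence collection, cross-source reconciliation, and long-horizon derivation. It finds that retrieval is not the bottleneck, that strong and weak models fail in different ways, and that models specialize across domains.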

Comments: Work in Progress. 27 pages, 10 figures, 4 tables. Project page: https://sixiongxie1001-dot.github.io/deep-research-benchmark2.0

Abstract

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

2605.21481 2026-05-21 cs.AI cs.CL cs.LG

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

Junshu Pan, Panzhong Lu, Yixuan Weng, Qiyao Sun, Fang Guo, Zijie Yang, Qiji Zhou, Yue Zhang

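AI summary: This paper proposes AiraXiv, a platform built on AI-driven open preprints, AI-augmented analysis and review, and reader feedback, to address the growth in research output and the scalability problems facing traditional academic publishing in the AI era.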

Abstract

Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.

2605.21479 2026-05-21 cs.CV cs.AI

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

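AI summary: This paper introduces WikiVQABench, a knowledge-grounded visual question answering benchmark that combines Wikipedia images, article captions, and structured knowledge from Wikidata. Large language models generate candidate multiple-choice questions, human annotators review them for factual correctness and visual-text consistency, and the benchmark is used to evaluate vision-language models on knowledge-intensive reasoning.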

Abstract

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

2605.21478 2026-05-21 cs.CV cs.GR

Latent Dynamics for Full Body Avatar Animation

Shichong Peng, Chengxiang Yin, Fei Jiang, Zhongshi Jiang, Lingchen Yang, Qingyang Tan, Amin Jourabloo, Jason Saragih, Ke Li, Christian Häne

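AI summary: This paper proposes a latent-dynamics approach to full-body avatar animation. It adds a transformer-based decoder and a dynamics residual latent to model dynamics more accurately and improve animation quality.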

Comments: Supplementary video: https://youtu.be/xjnr3YM0yIE

Abstract

Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.
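
The decomposition into driving, restoring, and dissipative forces in this abstract resembles a damped, driven oscillator. The sketch below is a hand-written scalar analogue, not the learned model: the force formulas, the parameter names `stiffness`, `damping`, and `gain`, and the use of a one-step pose difference as the "pose history" are all assumptions for illustration.

```python
def rollout(pose_seq, stiffness=4.0, damping=1.0, gain=1.0, dt=0.1, z0=0.0, v0=0.0):
    # z is a scalar stand-in for the residual latent, v its rate of change.
    z, v = z0, v0
    prev_pose = pose_seq[0]
    trajectory = []
    for pose in pose_seq:
        driving = gain * (pose - prev_pose) / dt  # from the short pose history
        restoring = -stiffness * z                # pulls the latent back to rest
        dissipative = -damping * v                # removes energy over time
        v += dt * (driving + restoring + dissipative)
        z += dt * v                               # uses the previous latent state
        trajectory.append(z)
        prev_pose = pose
    return trajectory

# A single pose jump followed by a held pose: the latent swings, then settles.
poses = [0.0] + [1.0] * 200
settled = rollout(poses)
other_start = rollout(poses, z0=0.5)
```

Two rollouts with the same poses but different initial latents differ early on and both decay once the pose stops changing, which mirrors the history dependence the abstract describes. Raising `stiffness` in this toy makes the latent snap back faster, a rough counterpart to the stiffness control mentioned there.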

2605.21475 2026-05-21 cs.LG

Is Fixing Schema Graphs Necessary? Full-Resolution Graph Structure Learning for Relational Deep Learning

Yi Huang, Qingyun Sun, Jia Li, Xingcheng Fu, Jianxin Li

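AI summary: This paper proposes FROG, a full-resolution, optimizable graph structure learning framework for relational deep learning. It casts relational structure learning as learnable table role modeling, letting tables act as nodes and edges in message passing, designs role-driven message passing to capture relational semantics, and uses functional dependency constraints for semantic consistency. Experiments show it outperforms existing methods on multiple downstream tasks.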

Comments: Accepted by the Forty-third International Conference on Machine Learning (ICML2026)

Abstract

Relational prediction tasks are fundamental in many real-world applications, where data are naturally stored in relational databases (RDBs). Relational Deep Learning (RDL) addresses this problem by modeling RDBs as graphs and applying graph neural networks (GNNs) for end-to-end learning. However, the full-resolution property is commonly adopted as a design principle in graph construction for RDBs to preserve relational semantics, which leads most existing methods to rely on fixed graph structures. In this paper, we propose FROG, a Full-Resolution and Optimizable Graph Structure Learning framework for RDL that formulates relational structure learning as a learnable table role modeling problem, allowing tables to contribute as nodes and edges in message passing. We further design role-driven message passing mechanisms to capture relational semantics, enabling joint optimization of graph structure and GNN representations. To ensure semantic consistency, we introduce functional dependency constraints that regularize representations across table and entity levels. Extensive experiments demonstrate that our method outperforms existing approaches and reveal how table roles impact downstream tasks, offering new insights into graph construction for RDL.

2605.21468 2026-05-21 cs.LG cs.CL

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng

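AI summary: This paper studies extrapolating LLMs along rank-1 trajectories. It finds that RLVR parameter trajectories are extremely low-rank and highly predictable, and proposes RELEX, which extrapolates future checkpoints with simple linear regression and no learned model.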

Comments: preprint. Code: https://github.com/weizhepei/RELEX

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
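
The two steps named in this abstract, a rank-1 estimate of the parameter deltas followed by a linear fit of the projection against training step, can be sketched in plain Python. This is an illustrative reading, not the released RELEX code: it assumes checkpoints are flattened into lists of floats, that the rank-1 subspace is taken over the stacked (steps x parameters) delta matrix, and that power iteration is an acceptable stand-in for an SVD. The function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_direction(deltas, iters=100):
    # Power iteration for the leading right singular vector of the
    # (steps x params) delta matrix. Assumes the last delta is nonzero.
    u = list(deltas[-1])
    for _ in range(iters):
        coeffs = [dot(d, u) for d in deltas]
        u = [sum(c * d[j] for c, d in zip(coeffs, deltas)) for j in range(len(u))]
        norm = math.sqrt(dot(u, u))
        u = [x / norm for x in u]
    return u

def extrapolate(checkpoints, steps, target_step):
    # checkpoints[0] is the starting point; the rest form the observed window.
    theta0 = checkpoints[0]
    deltas = [[w - w0 for w, w0 in zip(c, theta0)] for c in checkpoints[1:]]
    u = top_direction(deltas)
    a = [dot(d, u) for d in deltas]  # projection magnitude at each observed step
    t = steps[1:]
    # Ordinary least squares fit: a is approximately slope * t + intercept.
    t_mean, a_mean = sum(t) / len(t), sum(a) / len(a)
    slope = sum((ti - t_mean) * (ai - a_mean) for ti, ai in zip(t, a))
    slope /= sum((ti - t_mean) ** 2 for ti in t)
    intercept = a_mean - slope * t_mean
    a_target = slope * target_step + intercept
    return [w0 + a_target * uj for w0, uj in zip(theta0, u)]

# Synthetic trajectory that moves linearly along one direction.
steps = [0, 10, 20, 30]
checkpoints = [[1.0 + 0.01 * t, 2.0 - 0.02 * t, 3.0] for t in steps]
predicted = extrapolate(checkpoints, steps, target_step=100)
```

On this synthetic trajectory the extrapolation recovers the checkpoint at step 100 because the motion is exactly rank-1 and linear. Real trajectories are noisy, and the abstract attributes part of the method's success to the projection discarding that noise.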

2605.21467 2026-05-21 cs.LG cs.CL

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang, Wei Wu, Yankai Lin

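AI summary: This paper proposes DelTA, which estimates token coefficients to amplify side-specific token-gradient directions, improving how RLVR updates token probabilities and raising performance on mathematical benchmarks.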

Abstract

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.
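
The centroid construction in this abstract can be made concrete with a small two-dimensional example. In the sketch below, the first coordinate stands for a shared formatting direction and the second for a discriminative direction. The function names, the toy vectors, and the hand-picked coefficients are assumptions; the abstract does not say how DelTA estimates its token coefficients, so they are passed in manually here.

```python
def centroid(vectors, weights):
    # Self-normalized weighted average of the vectors.
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) / total for j in range(dim)]

def update_direction(token_grads, advantages, coeffs=None):
    # Difference between the positive-side and negative-side centroids.
    # Assumes both sides contain at least one token.
    coeffs = coeffs or [1.0] * len(token_grads)
    rows = list(zip(token_grads, advantages, coeffs))
    pos = [(g, a * c) for g, a, c in rows if a > 0]
    neg = [(g, -a * c) for g, a, c in rows if a < 0]
    c_pos = centroid([g for g, _ in pos], [w for _, w in pos])
    c_neg = centroid([g for g, _ in neg], [w for _, w in neg])
    return [p - n for p, n in zip(c_pos, c_neg)]

# Three shared formatting tokens plus one discriminative token per side.
grads = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 0], [1, 0], [1, 0], [0, -1]]
advs = [1, 1, 1, 1, -1, -1, -1, -1]
plain = update_direction(grads, advs)
# Hypothetical coefficients that downweight the shared tokens.
coeffs = [0.1, 0.1, 0.1, 1.0, 0.1, 0.1, 0.1, 1.0]
reweighted = update_direction(grads, advs, coeffs)
```

In this toy case the shared direction cancels in the centroid difference, but the formatting tokens still shrink the discriminative component through the normalization. Downweighting them makes that component larger, which is the contrast effect the abstract describes.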

2605.21466 2026-05-21 cs.CV

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Guanlong Jiao, Chenyangguang Zhang, Jia Jun Cheng Xian, Zewei Zhang, Renjie Liao

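AI summary: This paper proposes StreamGVE, a video editing method built on a noise-to-data perspective. It introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting, and outperforms existing methods in few-step settings.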

Comments: Project Page: https://dsl-lab.github.io/StreamGVE/

Abstract

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

2605.21465 2026-05-21 cs.CL cs.SE

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

Weixing Zhang, Bowen Jiang, Rahul Sharma, Regina Hebig, Daniel Strüber

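AI summary: This paper studies using large language models to adapt grammars automatically by learning from grammar adaptations in previous versions. It examines the approach's strengths in complex grammar scenarios and its limits on large-scale grammars.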

Abstract

In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.

2605.21463 2026-05-21 cs.CL cs.AI

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Mem-$π$: 通过学习何时以及生成什么来实现自适应记忆

Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian

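AI summary: Mem-$π$ learns when to generate guidance and what to generate, using a dedicated language or vision-language model to produce context-specific guidance; across diverse agentic tasks it outperforms retrieval-based and prior RL-optimized memory baselines.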

Comments: Work in progress

AI中文摘要

我们提出了Mem-$π$，一种用于大语言模型（LLM）代理的自适应记忆框架，其中有用的指导是按需生成而非从外部内存存储中检索。现有的记忆增强代理通常依赖于从事件记忆库或技能库中基于相似性的检索，返回静态条目，这些条目往往与当前上下文不一致。相比之下，Mem-$π$ 使用一个具有自身参数的专用语言或视觉-语言模型，与下游代理分开，以生成复杂任务的上下文特定指导。在当前代理上下文中，模型联合决定何时生成指导以及生成什么指导。我们通过决策-内容解耦的强化学习（RL）目标对其进行训练，使其能够避免在生成不会有所帮助的情况，并在其他情况下生成简洁有用的信息。在涵盖网页导航、基于终端的工具使用和基于文本的具身交互等多样代理基准上，Mem-$π$ 一致优于基于检索和先前RL优化的记忆基线，实现网页导航任务超过30%的相对提升。

英文摘要

We present Mem-$π$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$π$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$π$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

2605.21461 2026-05-21 cs.LG

A Machine Learning Framework for Weighted Least Squares GNSS Positioning based on Activation Functions

一种基于激活函数的加权最小二乘GNSS定位机器学习框架

Pin-Hsun Lee, Harry Leib

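AI summary: This paper proposes a machine learning framework for weighted least squares GNSS positioning based on activation functions. Signal quality indicators serve as training features for ensemble learning algorithms that identify low-quality signals, and an activation function converts the predicted quality scores into weights to improve positioning accuracy.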

AI中文摘要

全球导航卫星系统（GNSS）被广泛用于为各种应用提供位置、速度和时间（PVT）信息，包括交通运输、基于位置的通信服务和智能农业。在城市峡谷中，高楼大厦和狭窄街道可能导致信号遮挡、非视距（NLOS）接收和多路径效应，这些都会引入GNSS伪距测量的误差。尽管多星座GNSS有效增加了可用卫星的数量，但包含退化信号可能导致严重的定位误差。本文提出了一种基于激活函数的加权最小二乘（WLS）算法的机器学习框架，以提高定位精度。几种信号质量指标被用作集成学习算法的训练特征，以通过提供质量分数来识别低质量信号。然后，激活函数被用来将机器学习预测的分数转换为适合WLS定位的适当权重。为了评估我们方法的性能，使用来自香港和东京城市地区的实际数据集进行了实验。对激活函数的比较分析表明，Sigmoid函数在不同的机器学习算法和GNSS星座配置下始终产生最大的改进。所提出的算法在单星座和多星座场景中均表现出显著的定位误差减少。此外，我们的结果表明，所提出的算法在训练数据来自其他具有类似城市化水平的地区时，表现出强大的地理迁移性。

英文摘要

Global Navigation Satellite Systems (GNSS) are widely used to provide position, velocity, and timing (PVT) information for various applications, including transportation, location-based communication services, and intelligent agriculture. In urban canyons, high-rise buildings and narrow streets can cause signal obstruction, non-line-of-sight (NLOS) reception, and multipath effects that introduce errors in GNSS pseudorange measurements. Although multi-constellation GNSS effectively increases the number of available satellites, the inclusion of degraded signals can lead to severe positioning errors. This study proposes a machine learning framework for the weighted least squares (WLS) algorithm incorporating activation functions to enhance positioning accuracy. Several signal quality indicators are employed as training features for ensemble learning algorithms to identify poor-quality signals by providing quality scores. Then, activation functions are employed to transform the machine-learning-predicted scores into appropriate weights for WLS positioning. To evaluate the performance of our approach, experiments are conducted using real-world datasets from Hong Kong and Tokyo urban areas. Comparative analysis of activation functions reveals that sigmoid functions consistently yield the greatest improvements with different machine learning algorithms and GNSS constellation configurations. The proposed algorithm demonstrates substantial reductions in positioning errors for both single- and multi-constellation scenarios. Furthermore, our results indicate that the proposed algorithm exhibits strong geographical transferability: it maintains a comparable level of performance when trained on data from other regions with similar levels of urbanization.
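As a rough illustration of the pipeline described in the abstract, the sketch below maps per-measurement quality scores through a sigmoid to obtain weights and then solves a small weighted least squares problem. The function names, the `gain` and `midpoint` parameters, and the toy linear measurement model are assumptions made for illustration; they are not taken from the paper, and real GNSS positioning would iterate over linearized pseudorange equations.

```python
import math


def sigmoid_weights(scores, gain=10.0, midpoint=0.5):
    """Map quality scores in [0, 1] to WLS weights in (0, 1).

    gain and midpoint are hypothetical tuning parameters.
    """
    return [1.0 / (1.0 + math.exp(-gain * (s - midpoint))) for s in scores]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_least_squares(h, y, w):
    """Return x minimizing sum_k w_k * (y_k - h_k . x)^2 via the normal equations."""
    rows, n = len(h), len(h[0])
    hwh = [[sum(w[k] * h[k][i] * h[k][j] for k in range(rows)) for j in range(n)]
           for i in range(n)]
    hwy = [sum(w[k] * h[k][i] * y[k] for k in range(rows)) for i in range(n)]
    return solve_linear(hwh, hwy)


# Toy problem: the true state is [2, 3]; the last measurement carries a
# large error (standing in for an NLOS signal) and has a low quality score.
h = [[1, 0], [0, 1], [1, 1], [1, -1], [1, 2]]
y = [2.0, 3.0, 5.0, -1.0, 18.0]          # last value should be 8.0
scores = [0.9, 0.9, 0.9, 0.9, 0.05]

x_unweighted = weighted_least_squares(h, y, [1.0] * len(y))
x_weighted = weighted_least_squares(h, y, sigmoid_weights(scores))
print("unweighted:", x_unweighted)
print("sigmoid-weighted:", x_weighted)
```

In this toy setup the degraded measurement pulls the unweighted estimate well away from the true state, while the sigmoid weighting suppresses it almost entirely.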

2605.21460 2026-05-21 cs.RO cs.AI cs.HC

HITL-D: Human In The Loop Diffusion Assisted Shared Control

HITL-D: 有人参与的扩散辅助共享控制

Riley Zilka, Sergey Khlynovskiy, Allie Wang, Martin Jagersand

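AI summary: This paper proposes HITL-D, a shared control framework that combines diffusion policies with human control to improve user performance on multi-step, insertion, and fine manipulation tasks. It reduces the number of joystick control axes and the cognitive load, and in a multi-task user study it significantly shortens task completion time and improves user ratings.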

Comments: Accepted for presentation at ICRA 2026

AI中文摘要

自主操作系统已展现出显著能力，但将人类专业知识与基于扩散的策略结合在共享控制中仍较为不成熟。本文提出人类在环扩散（HITL-D），一种共享控制框架，通过结合扩散策略和人类控制，提供基于场景点云和末端执行器笛卡尔位置的自主末端执行器方向更新。该方法减少了所需joystick控制轴的数量，从而降低认知负荷。在12名参与者的多任务用户研究中，HITL-D将平均任务完成时间减少了40%，降低了37%的感知负荷，并在独立性、直观性和信心等李克特量表评分上优于传统遥控方法。这些结果表明，HITL-D有效整合了人类专业知识与自主协助，提高了遥控的客观和主观方面。

英文摘要

Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.

2605.21458 2026-05-21 cs.AI cs.LG stat.ME

Mind the Sim-to-Real Gap & Think Like a Scientist

注意仿真到现实的差距并像科学家一样思考

Harsh Parikh, Gabriel Levin-Konigsberg, Dominique Perrault-Joncas, Alexander Volfovsky

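AI summary: This paper studies when and how a planner should supplement a simulator with real-world experiments to reduce the value gap between simulation and reality, proposes the Fisher-SEP method, and illustrates it with two case studies.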

AI中文摘要

假设有规划者拥有一个预先训练的序列决策问题的仿真器，并有机会在现实中进行实验。仿真器查询成本低，但继承了校准数据中的混杂因素和漂移。实验是无偏的，但每次试验消耗一个现实单位。我们研究了规划者何时以及如何补充仿真器进行实验。我们给出了三个结果。首先，扩展的仿真引理将仿真器的价值误差分解为校准-部署偏移，该偏移可以随机化识别，以及一个参数残差，无法通过进一步交互减少。第二，仿真器最优策略与最优解之间的价值差距分为局部部分，这部分在部署策略已访问的状态上，以及可达性部分，这部分在部署策略未访问的状态上。在纯被动学习下，可达性部分在任何时间范围内都保持远离零。第三，我们提出了Fisher-SEP，一种辅助仿真的实验策略（SEP），该策略最小化目标策略价值的后验预测方差，具有仅奖励和仅转换的特殊化版本。两个案例研究展示了这些制度。在自动售货机供应链中，前端实验在时间范围足够长以抵消试点成本后超过后验更新。在HIV移动测试示例中，有一个走廊将一个受监控区域与一个受监控较差的区域分开，只有设计的探索才能到达受监控较差的区域。

英文摘要

Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.

2605.21455 2026-05-21 cs.LG

Mitigating Label Bias with Interpretable Rubric Embeddings

通过可解释的评分标准嵌入缓解标签偏差

Calvin Isley, Johann D. Gaebler, Sharad Goel

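AI summary: This paper proposes rubric embeddings, an interpretable representation, to mitigate label bias, and shows theoretically and empirically that under plausible conditions the approach reduces label bias and improves measures of cohort quality.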

AI中文摘要

统计决策算法越来越多地应用于难以获取真实标签的领域，如招聘、大学录取和内容审核。在这些情况下，模型通常是在历史人类评估上进行训练——例如使用过去招聘决定作为真实申请者质量的代理。然而，如果过去的评估不公正地偏袒某些群体，基于这些标签训练的模型可能会继承这些偏见。为了解决这个问题，我们提出基于评分标准嵌入进行预测，这是一种表示框架，用专家定义的准则派生的特征替代标准黑盒嵌入，这些准则与感兴趣的底层构造对齐。通过将预测锚定在语义有意义的维度上，这种方法可以防止受偏见代理信号的影响。我们提供了理论和实证证据，证明在合理条件下评分标准嵌入能够缓解标签偏见。实证上，我们在一个新型的数据集上评估了我们的方法，该数据集包含申请大型硕士项目的申请。我们发现，基于评分标准嵌入训练的模型在减少群体差异的同时提高了群体质量的衡量标准。我们的结果表明，基于可解释、领域相关的表示进行预测，为存在偏见标签的学习提供了一种实用方法。

英文摘要

Statistical decision algorithms are increasingly deployed in domains where ground-truth labels are hard to obtain, such as hiring, university admissions, and content moderation. In these settings, models are typically trained on historical human evaluations -- for example, using past hiring decisions as a proxy for true applicant quality. However, if past evaluations unjustly favor certain groups, models trained on these labels may inherit those biases. To address this problem, we propose basing predictions on rubric embeddings, a representation framework that replaces standard black-box embeddings with features derived from expert-defined criteria that align with the underlying construct of interest. By anchoring predictions to semantically meaningful dimensions, this approach guards against biased proxy signals. We provide both theoretical and empirical evidence that rubric embeddings mitigate label bias under plausible conditions. Empirically, we evaluate our method on a novel dataset of applications to a large master's program. We find that models trained on rubric embeddings reduce group disparities while improving measures of cohort quality. Our results suggest that basing predictions on interpretable, domain-grounded representations offers a practical approach to learning in the presence of biased labels.

2605.21454 2026-05-21 cs.CV q-bio.QM q-bio.TO

ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

ProtoPathway: 为多模态癌症生存预测设计的生物结构化原型-路径融合

Amaya Gallagher-Syed, Costantino Pitzalis, Myles J. Lewis, Michael R. Barnes, Gregory Slabaugh

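AI summary: This paper proposes ProtoPathway, a framework that unifies whole slide imaging and transcriptomics through encoders that produce biologically grounded representations, improving the biological interpretability and computational efficiency of cancer survival prediction.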

Comments: Currently under peer review

AI中文摘要

我们介绍了ProtoPathway，一种为癌症生存预测设计的可解释多模态框架，通过编码器在两个融合侧生成生物基础的表示。在组织病理学侧，$K$个可学习的形态原型通过端到端训练与生存目标相结合，作为切片本身的表示：片段通过软分配流入原型标记，将可变长度的片段集压缩成固定任务适应的标记。在基因组侧，双分图神经网络在Reactome通路层级编码基因表达，生成反映构成基因及其更广泛生物背景的通路嵌入，通过双向消息传递在共享的基因-通路图上进行。跨模态注意机制则在紧凑的原型$\times$通路矩阵上操作，其中原型查询通路，建模分子程序如何导致组织形态的生物方向。由于两个轴都携带稳定的任务学习身份，注意矩阵本身是可解释性输出，从而在完整的生物层级上实现原生的推理时间归因，从基因通过通路和原型到空间组织图。我们在五个TCGA癌症队列上进行评估，展示了与现有方法相比具有竞争力或更优的生存预测能力，同时具有显著改进的生物可解释性和减少的计算成本，通过折叠分层的基于排名的群体水平分析验证了可解释性声明。我们的源代码、模型权重和Reactome通路，以及一个重新实现所有多模态生存基准的统一代码库，在相同预处理和评估条件下可用：https://github.com/AmayaGS/ProtoPathway.

英文摘要

We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: https://github.com/AmayaGS/ProtoPathway.

2605.21453 2026-05-21 cs.SE cs.AI

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

AI生成Python重构拉取请求中的质量和安全信号

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

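AI summary: By analyzing Python refactoring pull requests in the AIDev dataset, this study examines how AI-generated changes affect code quality and security. Agentic commits improve a quality attribute in 22.5% of the studied changes but also introduce new code issues; the study derives a taxonomy of 24 recurring change operations and argues for stronger quality and security gating.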

AI中文摘要

随着AI代理在代码开发和维护中的作用日益增强，关于其在真实项目中变更的质量和风险特征仍缺乏实证证据，特别是针对重构类贡献。为了填补这一空白，我们对AIDev数据集中的Python重构拉取请求进行了实证研究。我们使用基于机器学习的质量评估工具PyQu分析代理重构拉取请求，以量化五个质量属性的变化，并通过领域无关的静态分析（Pylint和Bandit）来测量每次更改前后代码质量和安全问题。我们的结果表明，平均而言，代理提交在22.5%的案例中提升了质量属性，其中可用性提升最频繁（36.5%）。同时，24.17%的修改文件引入了新的Pylint问题，主要为约定层面的违规（如长行），而4.7%引入了新的Bandit发现。从观察到的差异中，我们推导出24种反复出现的更改操作，并将其映射到最常影响的lint和安全发现。尽管这些混合结果，开发者接受度很高：73.5%的分析拉取请求被合并，包括引入新lint或安全发现的案例，通常伴随现有问题的移除。总体而言，这些发现突显了代理重构的潜力和当前限制，并推动了更强的工具在循环中质量与安全门控，以应对AI驱动的开发工作流。

英文摘要

As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues, predominantly convention-level violations such as long lines, while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.
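The before-and-after comparison of static analysis findings can be pictured with a small sketch. The abstract does not specify how findings are diffed, so the multiset comparison below is an assumption, as are the `diff_findings` helper and the sample data; the rule IDs are real Pylint and Bandit identifiers used only as examples.

```python
from collections import Counter


def diff_findings(before, after):
    """Compare two lists of finding IDs reported for the same file.

    Returns (introduced, removed) as Counters keyed by rule ID, so a rule
    that fires more often after the change counts as introduced.
    """
    before_counts, after_counts = Counter(before), Counter(after)
    introduced = after_counts - before_counts  # Counter subtraction keeps positive counts only
    removed = before_counts - after_counts
    return introduced, removed


# Hypothetical findings for one file before and after a refactoring PR.
before = ["W0611", "C0301"]          # Pylint: unused-import, line-too-long
after = ["C0301", "C0301", "B602"]   # one extra long line, plus a Bandit shell=True finding

introduced, removed = diff_findings(before, after)
print("introduced:", dict(introduced))
print("removed:", dict(removed))
```

A file like this one would count both as introducing new findings and as removing an existing issue, which matches the mixed outcomes the abstract reports for merged PRs.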

2605.21451 2026-05-21 cs.LG cond-mat.dis-nn cs.AI cs.NE

Approximation Theory for Neural Networks: Old and New

神经网络的近似理论：旧与新

Soumendu Sundar Mukherjee, Himasish Talukdar

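AI summary: This survey reviews the approximation theory of neural networks, covering classical density results for single-hidden-layer networks, quantitative error bounds, and depth-width trade-offs, and also discusses the theoretical properties of newer architectures such as Kolmogorov-Arnold Networks.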

Comments: 31 pages, 4 figures

AI中文摘要

通用近似定理为神经网络的表达能力提供了数学解释。它们断言，在激活函数的温和条件下，前馈神经网络在广泛的函数类中是密集的，例如实数空间$\mathbb{R}^d$的紧致子集上的连续函数、$L^p$空间或Sobolev空间。在过去四十年里，这些定性的一般性结果已发展成丰富的定量理论，涉及近似速率、参数效率以及深度和宽度等架构特征的作用。本文综述了该理论的几个方面。我们回顾了单隐层网络的经典密度结果，以及将近似误差与网络大小和目标函数的光滑性假设联系起来的量化界限。特别强调了深度-宽度权衡以及证明更深层次架构在结构函数类中可实现更高参数效率的结果。除了标准前馈神经网络外，我们还回顾了Kolmogorov-Arnold网络（KANs）等近期发展的理论性质。

英文摘要

Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.
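For orientation, one classical density result of the kind the survey reviews can be stated as follows. This is the well-known formulation due to Leshno, Lin, Pinkus, and Schocken, given here as a representative example; the survey's own statements and notation may differ.

```latex
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be continuous and define
\[
  \mathcal{N}_\sigma
  = \operatorname{span}\bigl\{\, x \mapsto \sigma(w^\top x + b)
    \;:\; w \in \mathbb{R}^d,\ b \in \mathbb{R} \,\bigr\}.
\]
Then $\mathcal{N}_\sigma$ is dense in $C(K)$, with respect to the uniform norm,
for every compact $K \subset \mathbb{R}^d$ if and only if $\sigma$ is not a polynomial.
```

The result is purely qualitative: it says nothing about how many hidden units a given accuracy requires, which is the gap the quantitative bounds mentioned in the abstract address.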

2605.21443 2026-05-21 cs.CV cs.AI

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

TempGlitch: 评估视觉-语言模型在游戏视频中检测时间故障的能力

Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

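AI summary: This paper introduces the TempGlitch benchmark for evaluating how well vision-language models detect temporal glitches in gameplay videos. Existing models perform poorly on temporal glitches, and neither denser frame sampling nor larger model size reliably resolves the failures.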

AI中文摘要

视觉-语言模型（VLMs）正被越来越多地探索用于视频游戏质量保证，特别是游戏故障检测。然而，大多数现有评估将故障视为静态视觉异常，要求模型从单个帧中检测故障。我们主张这种框架忽略了关键区别：一些故障是空间性的，在孤立帧中可见，而另一些是时间性的，只有通过连续帧的变化才能显现。初步研究证实了这一差距，显示时间故障对VLMs的检测比空间故障要困难得多。为系统评估这一未被充分探索的设置，我们引入了TempGlitch，一个受控的游戏视频基准测试，用于时间故障检测。TempGlitch涵盖五种时间故障类型，每类样本平衡，同时配有配对的无故障视频，以实现可靠的二元评估。我们评估了12个专有和开源的VLMs，在多个帧采样设置下。我们的结果表明，当前VLMs在TempGlitch上仍接近随机猜测，通常会陷入过于保守的行为，错过大多数故障，或过于敏感的行为，将干净的视频标记为有故障。此外，更密集的帧采样和更大的模型尺寸并不能可靠地解决这些失败。TempGlitch为时间推理、稳健的游戏理解以及自动化故障检测提供了专注的测试平台。代码和数据可在项目网站上获得。

英文摘要

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.
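The point about paired glitch-free videos can be made concrete with a small sketch. On a balanced binary set, a model that collapses to always answering "clean" or always answering "glitch" lands at chance accuracy, and the two failure modes are only distinguishable through recall and specificity. The `binary_metrics` helper and the synthetic labels below are illustrative assumptions, not the benchmark's evaluation code.

```python
def binary_metrics(labels, preds):
    """Return accuracy, recall on glitch videos, and specificity on clean videos.

    Labels and predictions use 1 for "glitch" and 0 for "clean".
    """
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return {
        "accuracy": (tp + tn) / len(labels),
        "recall": tp / positives,
        "specificity": tn / negatives,
    }


# Synthetic balanced set: each glitch video is paired with a clean one.
labels = [1, 0] * 50

conservative = binary_metrics(labels, [0] * len(labels))  # never flags a glitch
sensitive = binary_metrics(labels, [1] * len(labels))     # flags every video

print("conservative:", conservative)
print("sensitive:", sensitive)
```

Both degenerate predictors score the same accuracy, which is why a balanced, paired design makes near-chance behavior easy to spot.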

2605.21442 2026-05-21 cs.LG cs.AI

torchtune: PyTorch native post-training library

torchtune: 一种基于PyTorch的后训练库

Mark Obozov, Maxime Griot, Joseph Cummings, Evan Smothers, Felipe Mello, Rafi Ayub, Philip John Bontrager, Salman Mohammadi, Ariel Kwiatkowski, Nathan Azrak, Mircea Mironenco

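AI summary: This paper introduces torchtune, a PyTorch-native post-training library that aims to streamline the post-training lifecycle of large language models, supporting efficient fine-tuning, experimentation, and deployment-oriented workflows while emphasizing modularity and extensibility.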

Comments: 14 pages

AI中文摘要

现代大语言模型通常需要多阶段训练流水线才能实现强大的下游性能，后训练是适应开放式模型的主要接口。我们介绍了torchtune，一种基于PyTorch的库，旨在简化大语言模型的后训练生命周期，使微调、实验和面向部署的工作流程更加高效。与许多现有的微调框架不同，这些框架往往在易用性、专用食谱或硬件效率方面进行优化，而牺牲了透明性和扩展性，torchtune强调模块化、可修改性和对底层PyTorch组件的直接访问。在本文中，我们阐述了torchtune的设计原则，描述了这些原则如何体现在其模型构建器、训练食谱和分布式训练堆栈中，并在具有代表性的后训练设置中评估了该库。我们对比了流行的微调框架，包括Axolotl和Unsloth，并展示了torchtune在许多设置中提供了强大的性能和内存效率，同时保持足够的灵活性以支持快速的研究迭代。这些结果将torchtune定位为可重复的大语言模型后训练研究的实用基础。

英文摘要

Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLM post-training research.

2605.21440 2026-05-21 cs.CV

ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes

ReMATF: 基于循环的运动自适应多尺度湍流抑制用于动态场景

Zhiming Liu, Zhicheng Zou, Nantheera Anantrasirichai

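AI summary: This paper proposes ReMATF, a lightweight recurrent framework that restores turbulence-degraded video using only two frames at a time while preserving spatial detail and temporal stability, mitigating turbulence and improving video quality.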

AI中文摘要

大气湍流严重降质视频质量，通过引入几何扭曲、模糊和时间闪烁等失真，对视觉清晰度和时间一致性构成重大挑战。当前最先进的方法基于transformer、3D架构和多帧输入，但其大计算成本和内存使用限制了实时部署，特别是在资源受限的场景中。在本工作中，我们提出ReMATF，一种轻量级循环框架，通过仅使用两帧恢复视频，同时保持空间细节和时间稳定性。ReMATF结合多尺度编码器-解码器、时间扭曲和运动自适应时间融合模块，通过将扭曲的前一输出与当前预测进行逐像素融合，增强一致性而不扩大时间窗口。该设计减少了闪烁，提升了细节清晰度，并保持了效率。在合成和真实湍流数据集上的实验显示，ReMATF在PSNR/SSIM和感知质量（LPIPS）上表现出一致的改进，同时比多帧transformer基线有显著更快的推理速度，使其适合资源受限场景中的湍流抑制。

英文摘要

Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformers or 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose ReMATF, a lightweight recurrent framework that restores videos using only two frames at a time while preserving spatial detail and temporal stability. ReMATF combines a multi-scale encoder-decoder with temporal warping and a motion-adaptive temporal fusion module that performs per-pixel fusion between the warped previous output and the current prediction to enhance coherence without enlarging the temporal window. This design reduces flicker, sharpens details, and remains efficient. Experiments on synthetic and real turbulence datasets show consistent improvements in PSNR/SSIM and perceptual quality (LPIPS), along with substantially faster inference than multi-frame transformer baselines, making ReMATF suitable for turbulence mitigation in resource-constrained scenarios.
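The per-pixel fusion step can be sketched in a few lines. In the paper the fusion module is learned; the hand-written rule below, which blends toward the warped previous output where the two frames agree and falls back to the current prediction where they differ, is a stand-in chosen for illustration. The function name and the `tau` and `max_blend` parameters are assumptions.

```python
import math


def motion_adaptive_fuse(warped_prev, current, tau=0.1, max_blend=0.5):
    """Fuse two frames (nested lists of floats in [0, 1]) pixel by pixel.

    The blend weight decays with the absolute difference between frames:
    small differences (flicker) are smoothed, large ones (motion) keep
    the current prediction. tau and max_blend are hypothetical settings.
    """
    fused = []
    for row_prev, row_cur in zip(warped_prev, current):
        out_row = []
        for p, c in zip(row_prev, row_cur):
            m = max_blend * math.exp(-abs(p - c) / tau)  # weight on the previous output
            out_row.append(m * p + (1.0 - m) * c)
        fused.append(out_row)
    return fused


# One row with three pixels: identical, slightly flickering, and moving.
warped_prev = [[0.50, 0.50, 0.00]]
current = [[0.50, 0.52, 1.00]]
fused = motion_adaptive_fuse(warped_prev, current)
print(fused)
```

Because the fusion only ever looks at the previous output and the current prediction, temporal information propagates recurrently without widening the input window, which is the efficiency argument the abstract makes.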

2605.21439 2026-05-21 eess.SY cs.RO cs.SY

Fully Actuated Manifold Constraint Based Output Feedback Control for Input-Constrained Uncertain Nonlinear Systems

全驱动流形约束基于输出反馈控制的输入受限不确定非线性系统

Dianrui Mu, Changchun Hua, Yafeng Li, Jiannan Chen, Rao Wei

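AI summary: This paper proposes a low-complexity, model-free output feedback controller for unknown time-varying nonlinear systems with unknown input constraints. It achieves a preset control accuracy and retains flexible control accuracy after actuator saturation. The method extends existing linear manifold constraint control approaches to cover the construction of nonlinear manifolds and various constraint types, so that the preset control accuracy is reached in finite or fixed time. In addition, error-driven flexible constraints are constructed to allow flexible control under unknown saturation. Control examples and simulations for second-order and higher-order systems are provided.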

2605.21437 2026-05-21 physics.geo-ph cs.LG stat.ML

Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment

基于神经网络的负二项回归用于每周地震预测：每个单元的分散估计和尾部风险评估

Alim Igilik

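AI summary: This paper proposes a neural-network-based seismicity forecasting method that estimates a per-cell dispersion parameter and assesses tail risk, relaxing the conventional Poisson assumption and improving forecasts of extreme events.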

Comments: 28 pages, 9 figures. Source code available at https://github.com/Al1mkaYandere/seismic-probabilistic-modeling

AI中文摘要

传统方法在空间网格上预测每周地震数量时依赖于具有单一全局分散假设的泊松分布。我们证明在中亚（2010-2024）的地震数据中，这一假设系统性地被违反，通过具有边界校正的似然比检验，强烈拒绝泊松假设（p < 10^{-179}）。本文的主要贡献是EarthquakeNet架构，它通过神经网络（空间嵌入+MLP）提供每个单元的过分散参数alpha的内生估计，而无需显式空间协方差指定。与现有地震预测中的负二项回归方法不同，后者通常假设单一全局alpha，所提出的每个单元公式允许模型识别地震聚类的空间异质性，并通过预测分布的分位数构建概率风险意识警报。在2018-2023年对四个系统进行的前向滚动（walk-forward）评估中，与负二项GLM基线相比，平均分位数损失偏差（mean pinball deviation，MPD）减少了8.6%。在尾部区域（Y >= 5）的改进最为显著，所提出模型的连续分级概率评分（CRPS）比基线低12.5%，表明极端事件预测的校准得到改善。

英文摘要

Standard approaches to forecasting the weekly number of earthquakes on a spatial grid rely on the Poisson distribution with a single global dispersion assumption. We show that this assumption is systematically violated in seismic data from Central Asia (2010-2024), where a likelihood-ratio test with boundary correction strongly rejects the Poisson hypothesis (p < 10^{-179}). The main contribution of this work is the EarthquakeNet architecture, which provides an endogenous per-cell estimate of the overdispersion parameter alpha via a neural network (spatial embeddings + MLP), without explicit spatial covariance specification. In contrast to existing negative binomial regression approaches in seismological forecasting, which typically assume a single global alpha, the proposed per-cell formulation allows the model to identify spatial heterogeneity in seismic clustering and to construct probabilistic risk-aware alerts via quantiles of the predicted distribution. A walk-forward evaluation (2018-2023) over four systems shows an 8.6 percent reduction in mean pinball deviation (MPD) relative to a negative binomial GLM baseline. The strongest improvements are observed in the tail regime (Y >= 5), where the continuous ranked probability score (CRPS) of the proposed model is 12.5 percent lower than that of the baseline, indicating improved calibration in extreme-event forecasting.
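To make the role of the dispersion parameter concrete, the sketch below evaluates a negative binomial distribution with mean `mu` and overdispersion `alpha`, reads off a forecast quantile, and scores it with the pinball loss. The NB2 parameterization (variance `mu + alpha * mu**2`) and the standard pinball loss are assumptions here; the abstract does not state which parameterization the paper uses or exactly how its mean pinball deviation is computed. In the paper, `alpha` would be predicted per cell by the network rather than fixed by hand.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y under NB2 with mean mu and overdispersion alpha."""
    r = 1.0 / alpha  # shape parameter; variance is mu + alpha * mu**2
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def nb_quantile(q, mu, alpha, y_max=10000):
    """Smallest count whose cumulative probability reaches q."""
    cdf = 0.0
    for y in range(y_max + 1):
        cdf += math.exp(nb_log_pmf(y, mu, alpha))
        if cdf >= q:
            return y
    return y_max


def pinball_loss(y, forecast, q):
    """Standard pinball (quantile) loss of a forecast for quantile level q."""
    diff = y - forecast
    return q * diff if diff >= 0 else (q - 1.0) * diff


# Two cells with the same mean weekly count but different dispersion.
mu = 2.0
calm_alert = nb_quantile(0.95, mu, alpha=0.01)      # close to Poisson
clustered_alert = nb_quantile(0.95, mu, alpha=1.0)  # strongly overdispersed
print("95% alert threshold, calm cell:", calm_alert)
print("95% alert threshold, clustered cell:", clustered_alert)
print("pinball loss if 9 events occur:", pinball_loss(9, clustered_alert, 0.95))
```

With a single global `alpha`, both cells would receive the same alert threshold; a per-cell value lets the threshold rise where seismicity clusters.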

2605.21435 2026-05-21 cs.LG math.AT math.CT

Gaussian Sheaf Neural Networks

高斯sheaf神经网络

André Ribeiro, Ana Luiza Tenório, Tiago da Silva, Diego Mesquita

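AI summary: This paper proposes Gaussian Sheaf Neural Networks (GSNNs), which take the means and covariance matrices of Gaussian distributions as node features to address the shortcomings of traditional GNNs on distribution-valued features; it derives a new Laplacian operator and validates it experimentally.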

AI中文摘要

图神经网络（GNNs）已成为学习关系数据的主流方法。尽管传统GNN的消息传递机制适合向量值节点特征，但某些情况下节点特征更适合用概率分布表示而非实数向量。具体来说，当节点特征是高斯分布时，其由均值和协方差矩阵描述，简单地将参数拼接成单一向量并应用标准消息传递会丢失均值和协方差的几何和代数结构。我们提出高斯sheaf神经网络（GSNNs），这是一个将这些归纳偏置纳入图学习的系统框架。基于细胞sheaf理论，我们推导出一个新的拉普拉斯算子，该算子扩展到此设置并保留其关键性质。我们通过合成和实际数据的实验补充了我们的理论贡献，展示了GSNNs的实用相关性。

英文摘要

Graph Neural Networks (GNNs) have become the de facto standard for learning on relational data. While traditional GNNs' message passing is well suited for vector-valued node features, there are cases in which node features are better represented by probability distributions than real vectors. Concretely, when node features are Gaussians, characterized by a mean and a covariance matrix, naively concatenating their parameters into a single vector and applying standard message passing discards the geometric and algebraic structure that governs means and covariances. We propose Gaussian Sheaf Neural Networks (GSNNs), a principled framework that incorporates these inductive biases into graph-based learning. Building on the theory of cellular sheaves, we derive a new Laplacian operator that generalizes the sheaf Laplacian to this setting and preserves its key properties. We complement our theoretical contributions with experiments on synthetic and real-world data that illustrate the practical relevance of GSNNs.
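For context, the standard sheaf Laplacian that the abstract says is being generalized can be written as below. This is the usual cellular-sheaf definition with linear restriction maps acting on vector-valued node features, shown as background only; the Gaussian generalization derived in the paper is not reproduced here.

```latex
For a cellular sheaf $\mathcal{F}$ on a graph $G = (V, E)$ with restriction maps
$\mathcal{F}_{v \trianglelefteq e}$ and node features $x_v$, the sheaf Laplacian acts as
\[
  (L_{\mathcal{F}}\, x)_v
  = \sum_{e = \{u, v\} \in E}
    \mathcal{F}_{v \trianglelefteq e}^{\top}
    \bigl( \mathcal{F}_{v \trianglelefteq e}\, x_v
         - \mathcal{F}_{u \trianglelefteq e}\, x_u \bigr),
\]
so that
\[
  x^{\top} L_{\mathcal{F}}\, x
  = \sum_{e = \{u, v\} \in E}
    \bigl\| \mathcal{F}_{v \trianglelefteq e}\, x_v
          - \mathcal{F}_{u \trianglelefteq e}\, x_u \bigr\|^2 .
\]
```

When every restriction map is the identity, this reduces to the ordinary graph Laplacian applied to each feature channel.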

2605.21433 2026-05-21 cs.SD

Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches

基于辅助条件分支的纯器乐文本到音乐生成

Junyoung Koh

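AI summary: This paper studies which design choices remain effective for text-to-music generation when data and pretraining are controlled. Models retrained without the auxiliary branches score lower on several evaluation metrics, and adding DiT depth recovers only a small part of the gap, suggesting that the auxiliary branches may act as architectural anchors during training.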

Comments: ICME 2026 Grand Challenge on Academic Text-to-Music Generation

AI中文摘要

文本到音乐生成已经取得了快速进展，现代自回归和扩散模型能够从自然语言提示生成逼真的音乐。然而，大部分进展依赖于大规模训练数据和外部预训练，使得在控制数据和预训练的情况下难以确定哪些设计选择仍然有效。我们使用扩散变压器主干网络，结合歌词和音色条件，针对仅乐器的文本到音乐任务进行了调整，在此任务中辅助的歌词和音色分支仅接收退化条件信号。通过受控消融分析，我们发现没有这些分支重新训练的模型在AudioBox美学、LLM-as-judge和人类MOS评分上得分较低，而将节省的参数作为额外的DiT深度重新投资只能略微恢复性能。这表明辅助分支可能在训练时起到架构锚定作用，其贡献超出了其显式条件内容。我们通过与外部乐器基线的比较以及通过我们的ICME 2026学术文本到音乐（ATTM）大奖挑战提交进行了验证，在该挑战中，我们的Performance提交在客观指标和随后组织者管理的MOS评分上均排名第一，获得所有提交中最高的总体MOS评分，而我们的Efficiency提交是决赛选手，以客观指标第二名的成绩并列。

英文摘要

Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isolate which design choices remain effective when data and pretraining are controlled. We study this setting using a Diffusion Transformer backbone with lyric and timbre conditioning, adapted to an instrumental-only text-to-music task in which the auxiliary lyric and timbre branches receive only degenerate conditioning signals. Through controlled ablations, we find that models retrained without these branches score lower across AudioBox aesthetics, LLM-as-judge, and human MOS, and that reinvesting the saved parameters as additional DiT depth recovers only marginally. This suggests the auxiliary branches may act as training-time architectural anchors whose contribution goes beyond their explicit conditioning content. We validate the same model through comparisons with external instrumental baselines and through our submission to the ICME 2026 Academic Text-to-Music (ATTM) Grand Challenge, where our Performance submission ranked first under both the objective metrics and the subsequent organizer-administered MOS over 35 raters, attaining the highest overall MOS across all challenge submissions, while our Efficiency submission was a finalist that tied for second under the objective metrics.

2605.21431 2026-05-21 cs.CV

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

iTryOn: 通过空间-语义引导掌握交互式视频虚拟试穿

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

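AI summary: This paper proposes iTryOn, a framework that uses spatial-semantic guidance to address semantic ambiguity and complex garment deformation in interactive video virtual try-on, enabling a more dynamic and controllable try-on experience.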

Comments: Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

AI中文摘要

视频虚拟试穿（VVT）旨在无缝替换视频中人物身上的衣物。尽管现有方法在保持时间一致性方面取得了显著进展，但它们主要局限于非交互场景，其中模型仅展示衣物。这种限制忽略了现实世界服装展示中的关键方面：主动的人-衣物互动。为弥合这一差距，我们引入并正式化了一个新的挑战性任务：交互式视频虚拟试穿（Interactive VVT），其中视频中的主体主动与衣物互动。该任务引入了超出简单纹理保留的独特挑战，包括：（1）从标准姿态信息中解决交互的语义模糊性，以及（2）从视频中学习复杂的衣物变形，其中交互时刻稀少且短暂。为了解决这些挑战，我们提出了iTryOn，一种基于大规模视频扩散Transformer的新型框架。iTryOn首创多级交互注入机制，以引导复杂动态的生成。在空间层面，我们引入了服装无关的3D手先验，以提供精细的指导，精确的手-服装接触，有效解决空间模糊性。在语义层面，iTryOn利用全局描述词提供整体上下文，并利用时间戳动作描述词提供局部交互，通过我们新颖的Action-aware Rotational Position Embedding（A-RoPE）进行同步。广泛的实验表明，iTryOn不仅在传统VVT基准上实现了最先进的性能，还在新的交互设置中建立了显著的领先优势，标志着更动态和可控的虚拟试穿体验的重要一步。

英文摘要

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
