arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15397 2026-05-18 cs.CV

ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest

Kangning Cui, Surendra Bohara, Suraj Prasai, Zishan Shao, Wei Tang, Martin Pillaca, Edwin Flores, Zhen Yang, Gregory Larsen, Evan Dethier, David Lutz, Jean-Michel Morel, Miles Silman, Victor Pauca, Fan Yang

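AI summary: ELDOR is a large-scale UAV benchmark for monitoring the environmental and landscape impact of illegal gold mining in the Amazon rainforest. It contains more than 2,500 hectares of manually annotated orthomosaic imagery with pixel-level semantic labels covering mining activities and ecological structures, and it evaluates multiple models on recognizing fine-grained and small-scale structures.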

Comments 70 pages, 35 figures, 28 tables

Abstract

Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.
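
One plausible reading of "segmentation-derived recognition" is that image-level class-presence labels are derived from a predicted segmentation mask. The sketch below works under that assumption only; the class names, the label ids, the `min_fraction` threshold, and the function name are hypothetical and are not taken from ELDOR.

```python
from collections import Counter

def presence_from_mask(mask, class_names, min_fraction=0.01):
    """Return the names of classes that cover at least min_fraction of the pixels.

    mask is a 2D list of integer label ids; class_names maps id -> name.
    """
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {
        class_names[label]
        for label, n in counts.items()
        if n / total >= min_fraction
    }

# Hypothetical label ids, for illustration only.
CLASS_NAMES = {0: "forest", 1: "mining_pond", 2: "dredge"}
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
]
# "dredge" covers 1 of 16 pixels (6.25%), so a 10% threshold drops it.
present = presence_from_mask(mask, CLASS_NAMES, min_fraction=0.10)
print(sorted(present))
```

The threshold matters for rare, small structures: a single-pixel class survives only when `min_fraction` is low, which is one reason small-scale classes are hard to recognize from masks.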

2605.15394 2026-05-18 cs.LG cs.AI stat.ML

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

Biswa Sengupta

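AI summary: This paper audits JEPA-style auxiliary objectives for LLM fine-tuning, testing twenty-two training-time auxiliaries on natural-language-to-regex generation. Several auxiliaries are significant in uncorrected single-cell paired tests, but none survives multiple-comparison correction.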

Abstract

Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must both reach the language-model head and improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired $\alpha = 0.10$ without correction (T3-Local at $\Delta = +2.53$ pp, $p = 0.003$ being the strongest), but none survives Bonferroni or Holm-Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive gradient cosine between the auxiliary and cross-entropy in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at $n = 5$ seeds reproduces the null on both benchmarks (TURK: $\Delta = +0.04$ pp, $p_{\text{paired}} = 0.96$; SYNTH: $\Delta = +0.52$ pp, $p_{\text{paired}} = 0.28$), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
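
To illustrate how a result can clear an uncorrected threshold yet fail family-wise correction, here is a minimal sketch of the Bonferroni and Holm-Bonferroni procedures. Only the smallest p-value (0.003) comes from the abstract; the other p-values, the family size of 22, and the family-wise level of 0.05 are assumptions chosen for illustration and may not match the paper's setup.

```python
def bonferroni(p_values, alpha):
    """Reject H0_i when p_i <= alpha / m (single-step correction)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_bonferroni(p_values, alpha):
    """Step-down Holm procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) against alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# Hypothetical family of 22 paired tests; only 0.003 is taken from the abstract.
p_values = [0.003, 0.02, 0.04, 0.08] + [0.5] * 18

uncorrected = [p <= 0.10 for p in p_values]
bonf = bonferroni(p_values, alpha=0.05)
holm = holm_bonferroni(p_values, alpha=0.05)
print(sum(uncorrected), sum(bonf), sum(holm))
```

Under these assumed numbers, four tests clear the uncorrected level, but the smallest p-value already exceeds 0.05 / 22, so neither procedure rejects anything.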

2605.15393 2026-05-18 cs.LG

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

Philipp Mondorf, Samuel J. Bell, Jesse Dodge, Dieuwke Hupkes

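AI summary: This paper proposes the LPDS framework, which evaluates LLM robustness by systematically searching the space of logic-preserving variations. Performance declines as difficulty increases, and fine-tuning on difficult variations yields more consistent robustness gains.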

Comments 41 pages, 31 figures

Abstract

As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities (such as names, numbers, or other contextual details) have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that (i) quantifies the difficulty of a problem variation and (ii) systematically searches the space of allowable variations to find those that maximize difficulty and expose failures. We show that as difficulty increases, performance declines and errors in the models' reasoning chains become more pronounced. We further demonstrate that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling. Finally, we show that fine-tuning on more difficult variations leads to more consistent robustness gains than training on easier ones.

2605.15391 2026-05-18 cs.CV cs.AI

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

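AI summary: PanoWorld generates consistent 360-degree video through geometry- and dynamics-consistent modeling, improving spatial understanding for embodied AI applications.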

Abstract

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

2605.15388 2026-05-18 cs.LG

Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation

Zhankun Luo, Antesh Upadhyay, M. Berk Sahin, Sang Bin Moon, Anuran Makur, Abolfazl Hashemi

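AI summary: This paper proposes a unified framework for analyzing stochastic variance-reduced estimation, built on a recursion with memory retention, a reset probability, and a correction term for iterate movement. It derives high-probability bounds and improves the oracle complexity of stochastic optimization.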

Abstract

Stochastic estimators are fundamental to large-scale optimization, where population quantities must be inferred from noisy oracle observations. Although influential methods such as momentum, SPIDER, STORM, and PAGE have been highly successful, their analyses are largely estimator-specific and expectation-based, obscuring the structural tradeoffs that determine reliability. In this paper, we develop a unified framework for stochastic variance-reduced estimation based on a recursion with three components: memory retention, reset probability, and a correction term for iterate movement. This framework recovers several classical estimators, motivates new second-order variants, and yields a bias-variance decomposition of estimation error. Our main result is a unified high-probability bound proved using a new dimension-free vector-valued Freedman inequality, valid for smooth normed spaces involving random sums of vector martingales. The result applies in both Euclidean and non-Euclidean settings, including the analysis of mirror-descent-based methods in Banach spaces. As applications, we obtain high-probability oracle complexities for unconstrained optimization with mirror descent, establishing the logarithmic dependence on the confidence level. We also derive the first $\tilde{\mathcal{O}}(\varepsilon^{-3})$ oracle-complexity bounds for stochastic optimization with expectation constraints, improving upon the existing $\tilde{\mathcal{O}}(\varepsilon^{-4})$ complexity by leveraging variance-reduced estimation for the first time in this setting.
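
As a rough illustration of how one recursion with memory retention, a reset probability, and an iterate-movement correction can cover several known estimators, here is a scalar sketch. The parameterization (`beta`, `p_reset`, `use_correction`) and the function name are assumptions made for this example and are not the paper's recursion; the special cases it reproduces are the standard momentum, STORM-style, and PAGE-style updates.

```python
import random

def vr_step(d_prev, g_new, g_old, g_reset, beta, p_reset, use_correction, rng):
    """One update of a generic variance-reduced gradient estimate (scalar case).

    d_prev   previous estimate
    g_new    stochastic gradient at the current iterate
    g_old    stochastic gradient at the previous iterate, same sample
    g_reset  fresh (for example large-batch) gradient used when a reset fires
    beta     memory retention in [0, 1]
    p_reset  probability of discarding memory and restarting from g_reset
    """
    if rng.random() < p_reset:
        return g_reset
    correction = (g_new - g_old) if use_correction else 0.0
    return beta * (d_prev + correction) + (1.0 - beta) * g_new

rng = random.Random(0)
# Momentum: beta * d_prev + (1 - beta) * g_new
momentum = vr_step(1.0, 2.0, 1.5, 9.0, beta=0.9, p_reset=0.0, use_correction=False, rng=rng)
# STORM-style: g_new + beta * (d_prev - g_old)
storm = vr_step(1.0, 2.0, 1.5, 9.0, beta=0.9, p_reset=0.0, use_correction=True, rng=rng)
# PAGE-style without a reset: d_prev + g_new - g_old
page = vr_step(1.0, 2.0, 1.5, 9.0, beta=1.0, p_reset=0.0, use_correction=True, rng=rng)
# A reset replaces the estimate with the fresh gradient.
reset = vr_step(1.0, 2.0, 1.5, 9.0, beta=1.0, p_reset=1.0, use_correction=True, rng=rng)
print(momentum, storm, page, reset)
```

The design point is that the three knobs trade bias against variance: more memory lowers variance but lets stale gradients linger, the correction removes the bias from iterate movement, and resets bound how long any error can persist.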

2605.15384 2026-05-18 cs.LG cs.AI

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

Songwei Dong, Zihan Chen, Chengshuai Shi, Peng Wang, Jundong Li, Cong Shen

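AI summary: This paper proposes the SeqMem-Eval framework, which evaluates how memory states evolve, generalize, consolidate experience, and retain information, revealing differences in memory quality that conventional metrics fail to capture.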

Comments 29 pages, 13 figures

Abstract

Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
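
Backward transfer and forgetting have standard definitions in continual learning, and the sketch below computes them from an accuracy matrix. It assumes `acc[i][j]` is the accuracy on task `j` when using the memory state obtained after task `i`; that indexing, the numbers, and the function names are illustrative assumptions, and SeqMem-Eval's exact formulas may differ.

```python
def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks between the step where each
    task was just processed and the final step. Negative means later updates hurt."""
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)

def forgetting(acc):
    """Mean gap between the best accuracy ever reached on a task (from the
    step it was processed up to the second-to-last step) and its final accuracy."""
    T = len(acc)
    return sum(
        max(acc[i][j] for i in range(j, T - 1)) - acc[T - 1][j]
        for j in range(T - 1)
    ) / (T - 1)

# Hypothetical accuracies for three sequential tasks.
acc = [
    [0.90, 0.20, 0.10],
    [0.70, 0.80, 0.20],
    [0.60, 0.85, 0.90],
]
final_avg = sum(acc[-1]) / len(acc)
bwt = backward_transfer(acc)
fgt = forgetting(acc)
print(round(final_avg, 3), round(bwt, 3), round(fgt, 3))
```

In this made-up example the final average accuracy looks reasonable, yet accuracy on the first task dropped from 0.90 to 0.60, which only the backward-transfer and forgetting numbers expose.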

2605.15383 2026-05-18 cs.CV

MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays

Emre Hayir, Lorin Crawford, Alex X. Lu

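AI summary: MorphoHELM is a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting. Its tasks are evaluated at different degrees of batch effects, which reveals trade-offs between methods and shows the strength of classic computer vision methods across many settings.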

Abstract

Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.

2605.15380 2026-05-18 cs.CL cs.CY cs.HC

Eskwai for Students: Generative AI Assistant for Legal Education in Ghana

George Boateng, Philemon Badu, Patrick Agyeman-Budu, Samuel Ansah, Evans Atompoya, Evan Igwilo, Lord Baah, Frederick Abu-Bonsrah, Victor Wumbor-Apin Kumbol

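AI summary: This paper presents Eskwai for Students, a retrieval-augmented generation (RAG) assistant that helps law students in Ghana answer legal questions, grounded in a database of more than 12,000 case laws and 1,400 pieces of legislation. It evaluates the helpfulness of the assistant for student queries and discusses the ethical implications.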

Comments 10 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract

Recent advances in generative AI have shown their potential to be leveraged for legal education. Yet, work on the development and deployment of such systems for legal education in the Global South is limited. In this work, we developed Eskwai for Students, a generative AI assistant to help law students with their legal education. Eskwai for Students is a retrieval augmented generation (RAG) system that provides answers to a wide range of legal questions for law students grounded in a curated database of over 12K case laws and 1.4K legislation in Ghana. We deployed Eskwai for Students in a longitudinal study of 30 months (2.5 years) used by 3.1K law students in Ghana who made 32K queries. We evaluated the helpfulness of our AI, and provided insight into the kinds of queries law students submit to this generative AI tool, which raises some ethical concerns. This work contributes to an understanding of how law students in the Global South are using generative AI for their studies and the ways it could be leveraged responsibly to advance legal education.

2605.15376 2026-05-18 cs.CL cs.CY

Adesua: Development and Feasibility Study of an AI WhatsApp Bot for Science Learning in West Africa

George Boateng, Evans Atompoya, Philemon Badu, Samuel John, Samuel Ansah, Patrick Agyeman-Budu, Victor Wumbor-Apin Kumbol

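AI summary: Adesua provides science learning support over WhatsApp for junior and senior high school students in West Africa, using generative AI for question answering and automated assessment. Preliminary testing shows high perceived usefulness.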

Comments 11 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

Abstract

Sub-Saharan Africa faces persistently high student-teacher ratios and shortages of qualified teachers, limiting students' access to personalized learning support and formative assessment. To address this challenge, we present Adesua, a WhatsApp-based AI Teaching Assistant for science education that extends the Kwame for Science platform. Adesua leverages WhatsApp's widespread adoption in Africa to provide accessible, curriculum-aligned learning support for Junior High School (JHS) and Senior High School (SHS) students across West Africa. The system integrates curated textbooks and 33 years of national examination questions with generative AI to enable conversational question answering and automated assessment with feedback via a WhatsApp bot. Students can ask science questions, take timed or untimed multiple-choice tests by topic or exam year, and receive instant grading and detailed explanations of correct and incorrect responses. A 6-month feasibility deployment in 2025 had 56 active users in Ghana, including students and parents. Quantitative evaluation showed high perceived usefulness, with a helpfulness score of 93.75% for AI-generated answers, albeit with a small number of ratings (n=16). These preliminary results provide a basis for more extensive future evaluation of a WhatsApp-based AI assistant to assess its potential to offer scalable, low-cost personalized learning support and formative assessment in resource-constrained educational contexts.

2605.15375 2026-05-18 cs.CV cs.AI

ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing

Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

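AI summary: This paper proposes ChangeFlow, a framework that synthesizes change masks via rectified flow in latent space, modeling a distribution of plausible masks to improve global consistency and robustness. It achieves an average F1 of 80.4%.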

Abstract

Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is a generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimate that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz-r.github.io/changeflow_cd
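
The sampling-based ensembling idea can be sketched as a pixelwise vote over several sampled masks, with agreement between samples used as a confidence map. The voting rule, the agreement score `|2f - 1|`, and the function name are assumptions for illustration; the abstract does not specify how ChangeFlow aggregates its samples.

```python
def ensemble_masks(samples, vote_threshold=0.5):
    """Combine binary masks (2D lists of 0/1 with identical shape).

    Returns (mask, agreement): the voted mask, and a per-pixel agreement
    score in [0, 1] where 1 means all samples agree and 0 means an even split.
    """
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    # Fraction of samples that mark each pixel as changed.
    freq = [
        [sum(s[r][c] for s in samples) / n for c in range(width)]
        for r in range(height)
    ]
    mask = [[1 if f >= vote_threshold else 0 for f in row] for row in freq]
    agreement = [[abs(2.0 * f - 1.0) for f in row] for row in freq]
    return mask, agreement

# Four hypothetical sampled masks for a 1 x 3 image.
samples = [[[1, 1, 0]], [[1, 0, 0]], [[1, 1, 0]], [[1, 0, 0]]]
mask, agreement = ensemble_masks(samples)
print(mask, agreement)
```

Here the middle pixel is marked changed by half of the samples, so its agreement is 0, flagging it as ambiguous, while the outer pixels have full agreement.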

2605.15368 2026-05-18 cs.CV cs.GR cs.LG

Discretizing Group-Convolutional Neural Networks for 3D Geometry in Feature Space

Daniel Franzen, Jean Philip Filling, Michael Wand

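AI summary: This paper proposes sampling in feature space, selecting representative samples by feature similarity, which decouples geometric resolution from memory and processing costs and offers a trade-off between computational effort and accuracy. Experiments show that coarse feature-space sampling preserves classification accuracy well and accelerates the training of equivariant 3D classifiers.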

Comments 11 pages, 7 figures, 2 tables

Abstract

Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: In each linear layer, GCNNs sample a transformation group $G$ densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to $G$. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.

2605.15365 2026-05-18 cs.CL

Greedy or not, here I come: Language production under vocabulary constraints in humans and resource-rational models

Thomas Hikaru Clark, Sihan Chen, Laura Nicolae

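AI summary: This study examines how humans produce language under vocabulary constraints. Humans generally resemble greedy sampling, but more skilled humans backtrack and revise, which is a non-greedy behavior.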

Abstract

Communicating using only a limited vocabulary is a common but challenging cognitive phenomenon, requiring an ideal communicator to plan carefully to optimize for intelligibility while circumventing a constrained lexicon. In this work, we investigate how humans respond to a broad array of questions under variable vocabulary limitations, consisting of only 250 highly frequent words at the most restrictive. We provide theoretically motivated comparisons to greedy and globally optimal sampling algorithms using Sequential Monte Carlo inference with large language models. Humans generally resemble greedy sampling more than globally optimal sampling, though more skilled humans are more likely to backtrack and revise, a non-greedy behavior. An observed human pattern of leaning on semantically light words in high-constraint settings falls out of both greedy and globally optimal sampling. We discuss the results and their broader implications for resource-rational cognition, psycholinguistics, L2 communication, and language impairments.

2605.15363 2026-05-18 cs.LG eess.SP

PRB-RUPFormer: A Recursive Unified Probabilistic Transformer for Residual PRB Forecasting

Saad Masrur, Yuxuan Jiang, Matti Hiltunen, Ajay Rajkumar, Ismail Guvenc

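AI summary: This paper proposes PRB-RUPFormer, which jointly processes multivariate KPI time series using temporal, seasonal, and carrier-aware embeddings to forecast residual PRBs, with low computational overhead and estimates of forecast uncertainty. Experiments on LTE network data show high forecast accuracy and reliability.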

Comments Accepted for publication in the Proceedings of the 2026 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN 2026), Washington, DC, USA, May 11-14, 2026

Abstract

Accurate forecasting of residual Physical Resource Blocks (PRBs) is critical for proactive network slice provisioning, energy-efficient operation, and spectrum-aware decision making in cellular systems, where residual PRBs serve as a practical proxy for short- and medium-term spectrum availability. Existing PRB prediction methods typically rely only on historical PRB values and are trained independently per carrier or sector, limiting their ability to capture cross-carrier dependencies and providing no measure of forecast uncertainty. Moreover, point forecasts alone are insufficient for robust spectrum-aware control under highly variable traffic conditions. This paper proposes PRB-RUPFormer, a recursive unified probabilistic Transformer for residual PRB forecasting. The proposed model jointly processes multivariate KPI time series using temporal, seasonal, and carrier-aware embeddings, preserving inter-metric temporal coupling during recursive rollout and stabilizing long-horizon forecasting. A single shared model is trained across all carriers and sectors of an eNB, enabling efficient learning of joint traffic dynamics with low computational overhead. Forecast uncertainty is captured through quantile-based prediction intervals, providing confidence-aware estimates of future PRB availability. Evaluations on six months of commercial LTE network data from multiple U.S. locations demonstrate median MAE below 0.05 and hit probabilities above 0.80 for both one-day and seven-day recursive forecasts. These probabilistic predictions directly support spectrum-aware RAN functions such as dynamic carrier activation, congestion avoidance, and proactive spectrum sharing, making the proposed framework well-suited for dynamic spectrum access scenarios.
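
Quantile-based prediction intervals are typically trained with the pinball (quantile) loss and checked by how often the truth falls inside the interval. The sketch below shows both, assuming that "hit probability" means empirical interval coverage; that interpretation, the sample values, and the function names are assumptions and are not taken from the paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q).
    """
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

def hit_probability(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

# Hypothetical normalized residual-PRB values and a predicted interval.
y_true = [0.30, 0.50, 0.70, 0.20, 0.90]
lower = [0.20, 0.40, 0.50, 0.25, 0.60]
upper = [0.40, 0.60, 0.80, 0.45, 0.95]
coverage = hit_probability(y_true, lower, upper)
print(coverage)
```

With these made-up values, four of the five observations lie inside their intervals, giving a coverage of 0.8. An interval built from the 0.1 and 0.9 quantiles would nominally target that same level.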

2605.15362 2026-05-18 cs.CL cs.DL cs.IR

Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering

Volodymyr Ovcharov

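AI summary: Using large-scale extraction and topological analysis, this study builds the first large-scale legal citation graph, showing that judicial citation structure encodes legal domain boundaries and predicts legislative importance.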

Comments 15 pages, 7 figures, 2 tables, 21 references

Abstract

Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
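
The "citation entropy" signal can be illustrated as the Shannon entropy of the distribution of cited targets within a time window. This is a minimal sketch under that assumption; the use of base-2 logarithms, the placeholder target labels, and the function name are illustrative and may not match the paper's definition.

```python
import math
from collections import Counter

def citation_entropy(cited_targets):
    """Shannon entropy (in bits) of the empirical distribution of cited targets."""
    counts = Counter(cited_targets)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Placeholder labels stand in for cited articles.
concentrated = ["A"] * 8                # every citation hits the same article
spread = ["A", "B", "C", "D"] * 2       # citations spread evenly over four articles
h_concentrated = citation_entropy(concentrated)
h_spread = citation_entropy(spread)
print(h_concentrated, h_spread)
```

Entropy rises when citations spread across more targets, which is the kind of shift one would expect if new legislation suddenly starts being cited alongside the established hub articles.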

2605.15355 2026-05-18 cs.LG

Federated Learning of Spiking Neural Networks under Heterogeneous Temporal Resolutions

在异构时间分辨率下训练脉冲神经网络的联邦学习

Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale

AI总结 本文提出一种联邦学习框架,解决不同时间分辨率下脉冲神经网络参数整合问题,通过在SHD和DVS-Gesture数据集上验证,有效恢复因时间不匹配导致的精度损失。

详情
AI中文摘要

脉冲神经网络(SNNs)是生物启发的节能模型,通过稀疏二进制脉冲在神经元间通信,使其适用于资源受限的边缘设备。联邦学习使这些设备能够协作训练而不共享原始数据。在时间序列应用中,边缘设备由于硬件和能耗限制,往往以不同时间分辨率收集数据。这种时间异质性对联邦学习构成了根本挑战:在一种时间分辨率上学习的参数不一定能直接转移到另一种,这可能导致朴素的联邦平均无效。针对SNNs以及更广泛的具有状态神经元的深度网络,我们提出了一种联邦学习框架,解决这种时间分辨率不匹配问题。我们研究了从不同时间分辨率数据中学习的神经元参数和模型聚合应如何整合。我们在两个SNN原生基准数据集(SHD和DVS-Gesture)上评估了所提框架,在多种分辨率异质场景下进行测试。我们的结果表明,所提适应方法能显著恢复因时间不匹配导致的精度损失,从而使每个客户端能够在本地时间分辨率上训练,同时保持与全局模型的兼容性。

英文摘要

Spiking neural networks (SNNs) are biologically inspired energy-efficient models that use sparse binary spike-based communication between neurons, making them attractive for resource-constrained edge devices. Federated learning enables such devices to train collaboratively without sharing raw data. In time-series applications, edge devices often collect data at different time resolutions due to hardware and energy constraints. This temporal heterogeneity poses a fundamental challenge for federated learning: parameters learned at one temporal resolution do not necessarily transfer directly to another, which can render naive federated averaging ineffective. Targeting SNNs and, more broadly, deep networks with stateful neurons, we propose a federated learning framework that addresses this temporal resolution mismatch. We investigate how neuron parameters learned from data at different temporal resolutions and model aggregation should be integrated. We evaluate the proposed framework on two SNN-native benchmark datasets (SHD and DVS-Gesture) under a range of resolution heterogeneity scenarios. Our results show that the proposed adaptation methods can substantially recover accuracy lost due to temporal mismatch, enabling each client to train at its local temporal resolution while remaining compatible with the global model.

2605.15353 2026-05-18 cs.LG cs.AI q-bio.MN q-bio.QM

PACER: Acyclic Causal Discovery from Large-Scale Interventional Data

PACER:从大规模干预数据中进行无环因果发现

Ramon Viñas Torné, Sílvia Fàbregas Salazar, Soyon Park, Ivo Alexander Ban, Artyom Gadetsky, Nikita Doikov, Maria Brbić

AI总结 PACER通过构建无环性保证的因果发现框架,在大规模高维干预数据中实现高效且准确的因果结构推断,优于现有方法。

Comments Accepted at the 43rd International Conference on Machine Learning (2026)

详情
AI中文摘要

从数据中推断有向无环图(DAG)的结构是因果发现中的核心挑战,在大规模干预数据日益可得的现代高维场景中尤其如此。尽管干预数据可以提高可识别性,现有方法仍受限于软无环约束,导致在无效的有环图上进行优化、数值不稳定以及可扩展性下降。我们提出PACER(Perturbation-driven Acyclic Causal Edge Recovery,扰动驱动的无环因果边恢复),一种通过构造保证无环性的可扩展因果发现框架。PACER通过变量排列与边概率的联合模型对DAG上的分布进行参数化,从而无需替代惩罚项即可直接在有效因果结构上优化。该框架支持对观测数据和干预数据的统一似然处理、灵活的条件密度模型,以及结构先验知识的引入。对于线性高斯机制,我们推导出期望干预对数似然及其梯度的闭式表达式,带来显著的计算收益。实证上,PACER在蛋白质信号和大规模基因扰动基准上达到或超过最先进方法,同时可高效扩展到含数千个变量的网络,并相对基于惩罚的可微方法实现最高两个数量级的加速。这些结果表明,通过有原则的搜索空间设计,可以从高维扰动数据中实现精确且可扩展的因果发现。

英文摘要

Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern high-dimensional settings where large-scale interventional data are increasingly available. While interventional data can improve identifiability, existing methods remain limited by soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and reduced scalability. We introduce PACER (Perturbation-driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood-based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear-Gaussian mechanisms, we derive closed-form expressions for the expected interventional log-likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state-of-the-art methods on protein signaling and large-scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty-based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high-dimensional perturbation data is achievable through principled search space design.
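The idea of guaranteeing acyclicity by construction can be illustrated with a small sketch: if edges are only ever allowed from a node to nodes that come later in some ordering, any sampled graph is a DAG, so no acyclicity penalty is needed. This is not PACER's actual parameterization, which the abstract does not spell out; the names `sample_dag` and `is_acyclic`, the fixed edge probability, and the use of a hard (non-relaxed) permutation are all assumptions made for illustration.

```python
import random


def sample_dag(order, edge_prob, rng):
    """Sample an adjacency matrix whose edges only go from earlier to later nodes in `order`."""
    d = len(order)
    rank = {node: pos for pos, node in enumerate(order)}  # position of each node in the ordering
    adj = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # An edge i -> j is a candidate only if i precedes j; then draw it with its probability.
            if rank[i] < rank[j] and rng.random() < edge_prob[i][j]:
                adj[i][j] = 1
    return adj


def is_acyclic(adj):
    """Kahn's algorithm: a graph is acyclic iff every node can be removed in topological order."""
    d = len(adj)
    indegree = [sum(adj[i][j] for i in range(d)) for j in range(d)]
    stack = [j for j in range(d) if indegree[j] == 0]
    removed = 0
    while stack:
        i = stack.pop()
        removed += 1
        for j in range(d):
            if adj[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    stack.append(j)
    return removed == d


rng = random.Random(0)
order = list(range(6))
rng.shuffle(order)                          # hypothetical variable ordering
edge_prob = [[0.7] * 6 for _ in range(6)]   # hypothetical edge probabilities
adj = sample_dag(order, edge_prob, rng)
print(is_acyclic(adj))
```

Because the ordering restricts which edges may exist, every sample is a valid causal structure, whereas a penalty-based method has to push cyclic candidates toward acyclicity during optimization.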

2605.15352 2026-05-18 cs.RO

Diffusion Policy for Coordinated Control of a Nonholonomic Mobile Base and Dual Arms in Door Opening and Passing

扩散策略用于非完整约束移动底盘与双臂的协调控制:开门与通过

Shangqun Yu, Matthew En, Daniel Wu, Sangjun Park, Ziyi Zhou, Seyed Fakoorian, Donghyun Kim

AI总结 本文提出基于扩散的视觉运动控制策略,实现非完整约束移动底盘与双臂的协调控制,以完成开门和通过任务,在复杂约束下取得了高成功率和鲁棒性。

详情
AI中文摘要

打开沉重的自闭门,尤其是需要拉开的门,一直是机器人领域长期存在的挑战。人类会自然而灵巧地使用双臂:旋转把手、扩大门缝、扶住门、必要时换手,并在通过时保持间隙。为复现此类行为,机器人必须执行跨越多个阶段、并与门的不同部位交互的长序列动作。传统方法依赖状态机,在人工定义的阶段之间切换(例如,旋转把手后拉门,门缝足够宽后通过)。这些方法虽然直观,但缺乏鲁棒性,因为手工设计的轨迹若不投入大量工程工作,就无法泛化到多样的真实环境。模仿学习的最新进展提供了一种可扩展的替代方案,但现有的视觉动作模型尚未展示出同时协调非完整约束底盘和双臂以完成完整开门并通过任务的能力。本文使用基于扩散的视觉运动控制策略来解决这一复杂且高度受约束的问题。结果表明,单个端到端策略可以学会执行需要操作与移动紧密协调的长时程任务。所得策略不仅在打开并通过带阻尼的拉门时取得了高成功率,还对外部干扰表现出很强的鲁棒性,而这些能力是传统方法难以实现的。

英文摘要

Opening heavy, self-closing doors, especially those that require pulling, remains a long-standing challenge in robotics. Humans naturally employ both arms in a dexterous manner, rotating the handle, widening the gap, holding the door, switching arms when needed, and moving through while maintaining clearance. To replicate such behaviors, a robot must perform a long sequence of motions spanning multiple stages and interactions with different parts of the door. Traditional approaches rely on state machines that transition between manually defined stages (e.g., pulling after the knob is rotated, passing after the gap is sufficiently wide). While intuitive, these methods lack robustness, as hand-crafted trajectories fail to generalize to the diversity of real-world conditions without extensive engineering effort. Recent advances in imitation learning offer a scalable alternative, yet no existing visual action model has demonstrated simultaneous coordination of a nonholonomic base and dual arms for the complete door opening and passing task. In this paper, we tackle this complex, highly constrained problem using a diffusion-based visuomotor control policy. Our results demonstrate that a single end-to-end policy can be learned to execute long-horizon tasks requiring tight coordination between manipulation and locomotion. The resulting policy not only achieves a high success rate in opening and traversing damped pull doors but also demonstrates strong robustness to external disturbances, capabilities that are difficult to realize with traditional methods.

2605.15343 2026-05-18 cs.AI cs.LG cs.MA

Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

信念引擎:多智能体大语言模型协商中的可配置和可检查立场动态

Joshua C. Yang, Maurice Flechtner, Damian Dailisan, Michiel A. Bakker

AI总结 本文提出Belief Engine,通过可配置的信念更新机制,研究多智能体协商中的立场动态,揭示立场变化背后的证据吸收与锚定因素。

详情
AI中文摘要

基于大语言模型的智能体越来越多地被用于模拟协商式交互,如谈判、冲突解决和多轮意见交换。然而,生成的对话记录往往无法揭示智能体立场变化的原因:这种变化可能反映证据吸收、锚定、角色漂移、附和,或提示与检索上下文的改变。我们提出Belief Engine(BE),一个可审计的信念更新层,它将“信念”视为关于某一命题的证据状态,并将其表示为标量立场。BE将论点提取到结构化记忆中,并通过由证据吸收u和先验锚定a控制的对数几率规则更新立场。在多个基础LLM上,参数扫描表明这些控制参数能可靠地塑造立场动态,同时保留证据层面的更新轨迹。在带有前后意见的人类协商数据集DEBATE上,BE对最终立场与所提取证据一致的参与者重建效果最好;立场稳定和与证据相悖的案例则指向锚定或提取证据流之外的因素。BE为研究基于证据的协商提供了可配置的基础设施,使开放性、承诺、收敛和分歧可以与显式的更新假设相关联,而非归因于隐藏的提示效应。

英文摘要

LLM-based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi-turn opinion exchange. Yet generated transcripts often do not reveal why an agent's stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief-update layer that treats "belief" as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log-odds rule controlled by evidence uptake u and prior anchoring a. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence-level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence-opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence-grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.
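To make the notion of a log-odds update with uptake and anchoring concrete, here is one possible form, written as a minimal sketch. The abstract does not give the Belief Engine's actual rule, so the update below (add `u` times a signed evidence strength to the current log-odds, then pull back toward the prior log-odds with weight `a`) is purely an assumed example, and `update_stance` is a hypothetical name.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def update_stance(stance: float, prior: float, evidence: float, u: float, a: float) -> float:
    """One hypothetical log-odds update (an assumed form, not the paper's rule).

    stance, prior: probabilities in (0, 1).
    evidence: signed argument strength (positive supports the proposition).
    u: evidence uptake (0 ignores evidence entirely).
    a: prior anchoring in [0, 1] (1 snaps back to the prior when u is 0).
    """
    log_odds = logit(stance)
    new_log_odds = log_odds + u * evidence - a * (log_odds - logit(prior))
    return sigmoid(new_log_odds)


# An open agent (high uptake, no anchoring) versus an anchored one, given the same arguments.
arguments = [1.0, 0.5, -0.2, 0.8]
open_agent = anchored_agent = 0.5
for strength in arguments:
    open_agent = update_stance(open_agent, 0.5, strength, u=1.0, a=0.0)
    anchored_agent = update_stance(anchored_agent, 0.5, strength, u=1.0, a=0.8)
print(open_agent, anchored_agent)
```

Under this assumed rule, each argument leaves an explicit, inspectable increment in log-odds space, which is the kind of evidence-level update trail the abstract describes.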

2605.15342 2026-05-18 cs.CV cs.LG

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Minerva-Ego:面向第一人称视角视频理解的时空提示

Arsha Nagrani, Jasper Uijlings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid

AI总结 本文提出Minerva-Ego基准,通过多步骤多模态问题和密集标注的时空推理轨迹评估第一人称视角视频理解模型,发现提供'何时'与'何处'的提示能显著提升性能。

详情
AI中文摘要

视频推理模型是第一人称视角智能体和具身智能体的核心组成部分。然而,用于评估模型的标准基准只评估输出(例如问题的答案),而不评估中间推理步骤,且大多数仅提供文本形式的答案。我们提出Minerva-Ego,一个用于评估复杂第一人称视角视觉推理的基准。我们在近期录制于第一人称视角/具身场景的高质量视频数据源基础上,加入了一组具有挑战性的多步骤多模态问题,以及时空密集的人工标注推理轨迹。基准测试实验表明,最先进的模型与人类表现之间仍有较大差距。为了深入研究这一差距,我们为数据集中的每条推理轨迹标注了解答问题所需的关注对象,形式为时空掩码标注。通过广泛的评估,我们发现向前沿模型提供'何处'和'何时'观看的提示能显著提升性能。Minerva-Ego可在https://github.com/google-deepmind/neptune下载。

英文摘要

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.

2605.15341 2026-05-18 cs.LG cs.AI

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

LEAP:LLM在迭代科学设计中的轨迹级评估

Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

AI总结 本文提出LEAPBench框架,通过轨迹级评估方法揭示LLM在迭代科学设计中的学习效率,发现传统基于结果的评估方法存在偏差,轨迹指标能更准确反映效率提升。

详情
AI中文摘要

LLM正越来越多地被部署于自主实验室,其假设是:领域先验以及对迭代反馈的推理,能让它们比仅依赖反馈的基线以更少的迭代收敛到好的设计。然而,当前的迭代科学设计基准只对固定时间范围内的结果快照评分,未对学习轨迹进行度量,而轨迹恰恰体现了学习效率,每节省一次迭代都意味着成本和时间的实际节省。为此,我们考察了三种会改变LLM学习效率结论的评估选择:测量什么、与什么基线比较,以及以什么为依据。我们提出LEAPBench(Learning Efficiency in Adaptive Processes),一个包含55个任务的框架,将最佳迄今值(best-so-far)曲线下面积(AUC)轨迹指标与经典贝叶斯优化参照以及基于已发表文献的审计相结合。应用于八个当前的LLM时,在相同时间范围下,从最终结果评分切换到轨迹评分会改变53%任务上的最佳模型判定,并揭示出被基于结果的评分所忽视的效率提升。LLM并未优于经典贝叶斯基线。在16个生物学任务中,oracle的奖励信号与已发表最佳设计的配置一致;在这些任务上,领域感知提示使LLM的选择在第30次迭代时与已发表最佳设计相符的频率比领域无关提示低约10个百分点。这一模式在文献典型配置与已发表最佳配置出现分歧的6个任务上最为明显,在全部6个任务上领域无关提示都更常与已发表最佳设计相符。轨迹指标还可作为易于处理的训练目标:以该指标为奖励的离线强化学习在21个保留任务中的14个上提升了性能。

英文摘要

LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best design approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best design more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
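The difference between final-outcome and trajectory scoring can be seen in a small sketch of a best-so-far AUC. The normalization used here (the mean of the best-so-far curve over the horizon, for a maximization task) is an assumption; the abstract does not specify LEAPBench's exact definition, and the two example trajectories are made-up numbers.

```python
def best_so_far(values):
    """Running maximum of the objective values observed up to each iteration."""
    best, curve = float("-inf"), []
    for v in values:
        best = max(best, v)
        curve.append(best)
    return curve


def best_so_far_auc(values):
    """Area under the best-so-far curve with unit-width steps, divided by the horizon."""
    if not values:
        raise ValueError("trajectory must contain at least one iteration")
    curve = best_so_far(values)
    return sum(curve) / len(curve)


# Two hypothetical optimizers that reach the same final design quality at iteration 4.
fast = [0.2, 0.8, 0.8, 0.9]
slow = [0.2, 0.3, 0.4, 0.9]

print(best_so_far(fast)[-1] == best_so_far(slow)[-1])  # identical under final-outcome scoring
print(best_so_far_auc(fast), best_so_far_auc(slow))    # the trajectory metric separates them
```

Both runs tie on the final snapshot, but the run that found a good design early scores higher on the trajectory metric, which is the kind of efficiency gain that outcome-only scoring cannot register.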

2605.15340 2026-05-18 cs.LG

Bounded-Rationality, Hedging, and Generalization

有限理性、对冲与泛化

Pedro A. Ortega

AI总结 本文研究了学习者在训练样本影响下的对冲策略,通过分析响应规律与信息几何关系,提出泛化是学习者自身响应规律的可检验对冲属性。

Comments 32 pages, 9 figures

详情
AI中文摘要

学习者不仅拟合数据,还决定了训练样本能在多大程度上塑造其输出,以及它能对冲多少失真。我们将这种关系作为有限理性决策问题来研究,其基本对象是从样本到输出的诱导信道。学习者的响应律决定了该信道的哪些变化代价低、哪些代价高,从而同时诱导出训练损失与样本依赖之间的下界权衡曲线,以及与之匹配的上界证书曲线。当响应律由$f$-散度正则化项表示时,这些曲线位于该正则化项自身的信息几何之中,其中KL散度是对应于香农互信息的特例。我们展示了如何通过观察学习者对缩放损失和局部损失扰动的响应,从黑盒行为中恢复对冲量和这两条曲线。在学习中,总体损失等于经验损失加上由特定训练样本引起的失真。当恢复出的对冲量覆盖该失真时,它就给出了一个实用的证书。因此,泛化被视为学习者自身响应律的一种可检验的对冲性质。

英文摘要

A learner does not only fit data; it also determines how strongly the training sample may shape its output and how much distortion it can hedge. We study this relation as a bounded-rational decision problem whose primitive object is the induced channel from samples to outputs. The learner's response law determines which changes in this channel are cheap or costly, and therefore induces both a lower tradeoff curve between training loss and sample dependence and a matched upper certificate curve. When the response law is represented by an $f$-divergence regularizer, these curves live in the regularizer's native information geometry, with KL as the special case corresponding to Shannon mutual information. We show how the hedge and the two curves can be recovered from black-box behavior by observing responses to scaled losses and local loss perturbations. In learning, population loss is empirical loss plus the distortion induced by the particular training sample. The recovered hedge gives a practical certificate when it covers that distortion. Thus generalization is treated as a testable hedging property of the learner's own response law.

2605.15334 2026-05-18 cs.LG cs.AI cs.CL cs.SE

From I/O to Code with Discovery Agent

从输入输出到代码:发现代理

Yihong Dong, Jiaru Qian, Haoran Zhang, Peixu Wang, Binhua Li, Zhi Jin, Yongbin Li, Ge Li, Xiaokang Yang, Xue Jiang

AI总结 本文提出DIO-Agent,通过将IO2Code视为离散程序空间的进化搜索,利用LLM作为突变算子,结合执行误差信号指导突变,解决从输入输出行为合成代码的难题。

详情
AI中文摘要

从任意形式的规约自动合成程序被视为计算机科学的圣杯。在LLM的推动下,NL2Code取得了巨大成功,但根据输入输出行为合成程序这一本质上更具挑战性的任务(我们称之为IO2Code)在很大程度上仍未解决。NL2Code可以利用预训练中获得的自然语言与代码之间的语义对齐,而IO2Code需要从具体的计算行为中恢复底层原理,并在广阔且欠规约的假设空间中搜索。为此,我们提出DIO-Agent,一个面向IO2Code的发现代理。我们的方法将IO2Code表述为离散程序空间上的进化搜索,其中LLM充当变异算子,来自执行的具体错误信号指导每次变异。为防止搜索陷入结构复杂但错误的死胡同,我们引入变换优先前提(Transformation Priority Premise)作为变异先验,使LLM偏向与当前证据一致的最简单假设,仅当较简单的构造不足时才逐步从常量升级到条件语句再到迭代。为便于系统研究,我们进一步构建了跨越多个难度级别的IO2CodeBench。大量实验表明,DIO-Agent在所有难度级别和多种LLM上均持续优于传统的示例编程(program-by-example)方法和SOTA进化代理基线,同时在相同采样预算下大幅超越测试时扩展策略。

英文摘要

The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from concrete computational behavior, navigating a vast and underspecified hypothesis space. To address this, we propose DIO-Agent, a discovery agent for IO2Code. Our method frames IO2Code as an evolutionary search over discrete program space, in which an LLM serves as the mutation operator and concrete error signals from execution guide each mutation. To prevent the search from wandering into structurally complex yet incorrect dead ends, we introduce the Transformation Priority Premise as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, progressively escalating from constants to conditionals to iteration only when simpler constructs are insufficient. To facilitate systematic study, we further construct an IO2CodeBench spanning multiple difficulty levels. Extensive experiments show that DIO-Agent consistently outperforms both traditional program-by-example method and SOTA evolution-agent baselines across all difficulty levels and various LLMs, while substantially surpassing test-time scaling strategies with equivalent sampling budgets.

2605.15333 2026-05-18 cs.AI

Zero-Shot Goal Recognition with Large Language Models

基于大语言模型的零样本目标识别

Kin Max Piamolini Gusmão, Nathan Gavenski, Nir Oren, Felipe Meneguzzi

AI总结 本文首次系统评估前沿大语言模型在经典PDDL基准上的零样本目标识别能力,发现其表现不均,部分模型随证据增加而提升精度,而另一些模型则依赖世界知识先验。

Comments 9 pages, 1 figure, 1 table; appendix with 8 figures and 2 code listings (29 pages total); submitted to NeurIPS 2026

详情
AI中文摘要

大语言模型最近在知名规划领域上已接近与经典规划器持平,但这种能力依赖于对世界知识的利用,而非真正的符号推理。目标识别是一种与之互补的溯因任务,在结构上更契合LLM的长处:它需要评估与世界知识的一致性,而非生成新的动作序列。本文首次系统地以零样本方式评估了前沿LLM在关键经典PDDL基准上作为目标识别器的表现。结果表明,LLM的目标识别能力并不均衡:一些模型随证据增加而提升,并在完全观测下接近基于地标的方法的精度,而另一些模型无论积累多少证据都始终锚定于世界知识先验。对模型推理轨迹的定性分析表明,这种分化反映的是证据整合方式的根本差异,而非对领域的熟悉程度。这些发现使目标识别成为衡量LLM基础规划知识的一个有原则的基准。

英文摘要

Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.

2605.15328 2026-05-18 cs.LG

From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks

从权重扰动到特征归因:解释全连接神经网络

Thodoris Lymperopoulos, Denia Kanellopoulou

AI总结 本文提出通过扰动特征权重而非值来估计特征归因,改进了Occlusion方法的局限性,提出XWP和XWP_c两种新方法,在简单DNN中表现优异,为模型可解释性提供稳健框架。

Comments 9 pages, 5 figures

详情
AI中文摘要

全连接神经网络(FCNN)常被视为简单直观的架构,但仍是复杂模型的基础。然而,其可解释性缺乏共识仍是个挑战。本文提出一种新的特征归因估计方法,通过扰动特征的权重而非值来解决Occlusion技术中的常见问题,如Added Bias和Out-of-Distribution数据。该方法形成了两种新的归因方法XWP和XWP_c,基于简单规则,在识别图像信号方面与最成熟的归因方法在标准基准指标上具有竞争力。本文为可解释性领域引入了稳健的框架,为解决长期存在的漏洞提供了途径,从而产生更可靠和可解释的模型解释。

英文摘要

Fully Connected Neural Networks (FCNNs) are often regarded as simple and intuitive architectures, yet they serve as the foundation for more complex models. Nonetheless, the lack of consensus on their interpretability continues to pose challenges, underscoring the enduring relevance of simpler, attribution-based approaches for understanding even the most advanced neural architectures. In this regard, we explore a novel idea for estimating feature attribution, by applying perturbation to the features' attached weights instead of their values. This method offers a fresh perspective aimed at mitigating common limitations in Occlusion techniques, such as Added Bias and Out-of-Distribution data. The application of this rule leads to the formation of a pair of novel attribution methods we call XWP and XWP_c. Founded on simple rules, our methods achieve competitive performance in identifying image signals for simple DNNs, competing with the most established attribution methods on standard baseline metrics. Our work thus contributes to the field of Explainability by introducing a robust framework that paves the way for addressing these long-standing vulnerabilities, and leads to more reliable and interpretable model explanations.

2605.15326 2026-05-18 cs.CV

Multimodal Object Detection Under Sparse Forest-Canopy Occlusion

多模态目标检测在稀疏森林冠层遮挡下的应用

Nitik Jain, Mangal Kothari

AI总结 本文提出一种多模态管道,结合激光雷达、可见-热成像融合和合成孔径成像技术,以提高森林冠层下人类检测的可靠性,展示了改进的YOLOv5检测器在热成像和融合图像上的性能。

详情
AI中文摘要

由于遮挡稀疏、结构化且依赖视角,可靠地检测森林冠层下的人员仍是一项困难的遥感挑战。本文提出一个多模态概念验证流程,整合三种互补方法:(i) 对穿过植被的激光雷达回波进行实验评估,以考察主动传感的可行性;(ii) 使用多尺度变换和稀疏表示框架进行可见光-热红外图像融合,以增强人体显著性;(iii) 通过机载光学切片(AOS)进行合成孔径成像,以抑制冠层杂波。我们在Teledyne FLIR热成像数据集上微调YOLOv5检测器,并在热成像和融合图像上进行评估。结果表明,所测试的地面激光雷达配置对目标级检测的穿透能力有限,而可见光-热红外融合提高了低对比度场景中的目标可见性,AOS则增强了合成森林图像中的地平面检测。微调后的YOLOv5在FLIR前三个类别上的平均精度均值(mAP)约为0.83。这些发现为可部署于无人机、在森林环境中运行的搜救与监视系统建立了初始基线,并推动未来在专用森林数据集和实时多模态集成方面的工作。

英文摘要

Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible--thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible--thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of $\sim$0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.

2605.15325 2026-05-18 cs.CV

COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection

COPRA:基于强化学习的条件参数适应用于视频异常检测

Darryl Cherian Jacob, Xinyu Liu, Kai Wang, Pan He

AI总结 COPRA通过生成输入特定的参数更新,动态适应冻结的VLM,提升视频异常检测的适应性和泛化能力,同时拓展到多选视频问答和密集标注等任务。

Comments Manuscript currently under review for publication

详情
AI中文摘要

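视觉-语言模型(VLM)在视频异常检测(VAD)中表现出色,并能提供可解释的预测。然而,现有基于VLM的VAD方法在数据分布和模型配置两方面都存在训练与推理之间的根本性不匹配。首先,大多数方法依赖静态的训练后适配,限制了在未见环境或未见异常类型等分布偏移下的泛化能力。其次,它们在长视频的稀疏帧上训练VLM,却在密集采样的短片段上进行推理,造成训练与测试之间的不一致。为解决这些局限,我们提出COPRA,一个面向基于VLM的VAD的条件参数适配框架。COPRA不使用固定提示或共享的参数更新,而是生成输入特定的参数更新,在训练和推理阶段针对每个视频片段动态适配冻结的VLM。实验表明,COPRA在标准VAD基准上表现出色,在域内和跨域设置下均持续优于静态基线。此外,COPRA还能泛化到VAD之外的未见任务,如多选视频问答和密集描述。这些结果表明,COPRA是一个有效的权重空间生成框架,可用于可扩展、自适应且上下文感知的视频理解。代码将发布于https://github.com/THE-MALT-LAB/COPRA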

英文摘要

Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at https://github.com/THE-MALT-LAB/COPRA

2605.15315 2026-05-18 cs.AI cs.CL

Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

通过多标准潜在推理进行编码代理的上下文剪枝

Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He, Feiyang Cai, Ana S. Carreon-Rascon, Xuanzhao Dong, Feng Luo

AI总结 本文提出LaMR框架,通过分解代码相关性为语义证据和依赖支持两个维度,利用多任务CRF模型提升编码代理的上下文剪枝效果,实验表明其在多个基准测试中表现优异。

详情
AI中文摘要

LLM驱动的编码代理花费大部分token预算阅读仓库文件,但检索到的代码大多与任务无关。现有学习剪枝器使用单一目标序列标注器压缩上下文,将代码相关性所有方面压缩为一个分数和一个转移矩阵。我们证明这种建模瓶颈:单一CRF转移先验必须服务于异质保留模式,包括连续语义跨度和稀疏结构支持线。我们提出LaMR(潜在多标准),一个结构化剪枝框架,将代码相关性分解为两个可解释的质量维度,语义证据和依赖支持,每个由专用CRF建模,具有维度特定的转移动态。混合专家门控网络动态加权每个标准的发射量,根据查询条件。最终CRF层在融合的发射量上产生汇总的保留或剪枝决策。为了监督每个维度而无需额外标注成本,我们通过基于AST的程序分析从现有训练语料中推导出多标准标签,同时去噪教师的二元标签。通过有效过滤干扰噪声,LaMR经常匹配或甚至优于未修剪的完整上下文基线。在四个基准测试(SWE-Bench Verified,SWE-QA,LCC,LongCodeQA)上的实验表明,LaMR在16次头对头多轮比较中胜出12次。它在多轮代理任务中节省多达31%的token,并在单轮任务中将Exact Match提高多达+3.5,同时性能经常通过去噪上下文得到增强,任何剩余的下降都是微小的。

英文摘要

LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.

2605.15314 2026-05-18 cs.LG math.OC

Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise

超越有界方差:非凸优化中在Blum-Gladyshev噪声下的方差缩减规范化方法

Antesh Upadhyay, Arda Fazla, Abolfazl Hashemi

AI总结 本文研究了在Blum-Gladyshev噪声模型下非凸随机优化问题,提出了一种规范化动量随机梯度下降方法,证明其在该噪声下具有O(ε^{-6})的收敛率,并进一步研究了方差缩减的规范化STORM方法,实现了最优复杂度。

详情
AI中文摘要

我们研究Blum-Gladyshev(BG-0)噪声模型下的非凸随机优化,其中随机梯度方差随到初始点距离的平方增长。我们在标准光滑性和对称广义光滑性两种框架下考虑该问题,后者刻画了局部曲率可随梯度范数变化的目标函数。我们证明,每次迭代仅使用一个随机梯度的带动量规范化随机梯度下降法在BG-0噪声下收敛,oracle复杂度为O(ε^{-6})。该速率在标准光滑性和α-对称广义光滑性下均成立,表明在此设置下广义光滑性对规范化动量方法是速率中性的。随后,我们研究一种方差缩减的规范化STORM方法。在均方光滑性和精确(sharp)初始化下,该方法达到极小极大最优的O(ε^{-4})复杂度,与下界匹配。在期望α-对称广义光滑性下,STORM递归将依赖梯度的光滑性与依赖距离的噪声耦合,导致复杂度为O(ε^{-(4+α)})(α∈(0,1))和O(ε^{-5})(α=1)。当噪声模型中的距离增长参数为零时,我们的保证恢复为标准有界方差速率:动量方法为O(ε^{-4}),方差缩减为O(ε^{-3}),确定性情形为O(ε^{-2})。据我们所知,这是首批在BG-0噪声下、无需有界域、递增批大小或显式锚定的规范化方法在非凸随机优化中的收敛保证,同时覆盖标准和广义光滑性情形。

英文摘要

We study nonconvex stochastic optimization under the Blum-Gladyshev ($\mathsf{BG}$-0) noise model, where the stochastic gradient variance grows quadratically with the distance from the initialization. We consider this problem under both standard smoothness and the symmetric generalized-smoothness framework, which captures objectives whose local curvature can scale with the gradient norm. We prove that normalized stochastic gradient descent with momentum, using only one stochastic gradient per iteration, converges under $\mathsf{BG}$-0 noise with oracle complexity $O(\varepsilon^{-6})$. This rate holds both for standard smoothness and for $\alpha$-symmetric generalized smoothness, showing that generalized smoothness is rate-neutral for normalized momentum in this setting. We then study a variance-reduced normalized STORM method. Under mean-square smoothness and sharp initialization, the method achieves the minimax optimal $O(\varepsilon^{-4})$ complexity, matching the lower bound. Under expected $\alpha$-symmetric generalized smoothness, the STORM recursion couples gradient-dependent smoothness with distance-dependent noise, leading to complexity $O(\varepsilon^{-(4+\alpha)})$ for $\alpha\in(0,1)$ and $O(\varepsilon^{-5})$ for $\alpha=1$. When the distance-growth parameter in the noise model vanishes, our guarantees recover the standard bounded-variance rates: $O(\varepsilon^{-4})$ for momentum, $O(\varepsilon^{-3})$ for variance reduction, and $O(\varepsilon^{-2})$ in the deterministic case. To our knowledge, these are the first convergence guarantees for normalized methods in nonconvex stochastic optimization under $\mathsf{BG}$-0 noise without bounded domains, increasing batch sizes, or explicit anchoring, covering both standard and generalized smoothness regimes.
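A minimal sketch of normalized SGD with momentum may help make the setting concrete. It uses one stochastic gradient per iteration and always moves by exactly the step size in the direction of the momentum buffer. The toy objective, the noise oracle (standard deviation growing linearly with the distance from the initial point, so variance grows quadratically), and all constants are assumptions for illustration; the paper's step-size schedule and analysis are not reproduced here.

```python
import math
import random


def normalized_momentum_sgd(grad_oracle, x0, lr=0.01, beta=0.9, steps=50):
    """Normalized SGD with momentum: x <- x - lr * m / ||m||, one gradient sample per step."""
    x = list(x0)
    m = [0.0] * len(x)
    path = [list(x)]
    for _ in range(steps):
        g = grad_oracle(x)                                  # single stochastic gradient
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        if norm > 0:
            x = [xi - lr * mi / norm for xi, mi in zip(x, m)]
        path.append(list(x))
    return path


rng = random.Random(0)
x0 = [3.0, -4.0]


def noisy_grad(x):
    """Gradient of the toy objective 0.5 * ||x||^2 plus distance-dependent Gaussian noise."""
    sigma = 0.1 + 0.5 * math.dist(x, x0)  # std grows linearly, so variance grows quadratically
    return [xi + rng.gauss(0.0, sigma) for xi in x]


path = normalized_momentum_sgd(noisy_grad, x0)
step_lengths = [math.dist(p, q) for p, q in zip(path, path[1:])]
print(min(step_lengths), max(step_lengths))
```

Because the update is normalized, the step length does not depend on the gradient magnitude, so the iterates cannot be thrown arbitrarily far by a single noisy gradient even when the noise level grows with distance.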

2605.15311 2026-05-18 cs.LG cs.SY eess.SY

Time-Varying Deep State Space Models for Sequences with Switching Dynamics

具有切换动态的序列的时变深度状态空间模型

Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale

AI总结 本文提出了一种基于时变状态空间模型的神经网络,用于处理具有切换动态的序列数据,通过可学习的基函数捕捉时变动态,优于传统时不变模型。

详情
AI中文摘要

对时变系统的识别和建模是信号处理和系统识别中的基本挑战。为了解决这一挑战,我们提出了一类基于时变状态空间模型(SSM)的神经网络,其中神经元的状态由时变动态控制。所提出的模型通过基函数字典学习时变动态,每个基函数随时间演变不同。我们在切换系统合成数据和受切换动态噪声干扰的语音去噪任务上评估了所提出的方法。结果表明,所提出的时变模型在保持计算复杂度相当的情况下,始终优于传统时不变模型。我们的研究还揭示了数据时变动态中哪些方面最需要被所提出的时不变模型捕捉,如何在模型组件中分配额外的自由度,以及更大模型能有多大程度地补偿时不变限制。

英文摘要

The identification and modeling of time-varying systems is a fundamental challenge in signal processing and system identification. To address this challenge, we propose a class of time-varying state-space model (SSM) based neural networks in which the neurons' states are governed by time-varying dynamics. The proposed model provides the learnable time-varying dynamics through a dictionary of basis functions, where each basis function evolves differently over time. We evaluate the proposed approach on both synthetic data from switching systems and a speech denoising task where real audio is corrupted with switching dynamics noise. The results show that the proposed time-varying model consistently outperforms its time-invariant counterparts while maintaining comparable computational complexity. Our investigations also reveal which aspects of the time-varying dynamics of the data most need to be captured by the proposed time-invariant models, how the additional freedom provided by time-varying basis functions should be allocated across model components, and to what extent larger models can compensate for time-invariant limitations.

2605.15309 2026-05-18 cs.CV

One Pass Is Not Enough: Recursive Latent Refinement for Generative Models

一次不够:生成模型的递归潜在细化

Mehdi Esmaeilzadeh, Alexia Jolicoeur-Martineau, Chirag Vashist, Ke Li

AI总结 本文提出RTM方法,通过递归细化提升生成模型的多样性和覆盖范围,改进了FID指标下的精度和召回率,适用于多个数据集和基准测试。

详情
AI中文摘要

尽管在图像生成领域取得了显著进展,但问题仍未解决。主导指标FID将样本保真度与模式覆盖混淆,并接近饱和。然而,模型仍可能在低FID下出现模式崩溃,因为少量锐利的近似图像可能优于覆盖完整数据分布的模型。我们主张精度和召回率是FID的必要补充,由于FID已饱和,更重要的目标是提高多样性和覆盖范围。高召回率需要优先考虑模式覆盖的模型,而非大多数生成模型优化样本保真度。我们引入RTM,将基于风格的生成器中的单次潜在映射替换为迭代细化过程,证明这能一致提高质量和多样性。与隐式最大似然估计(IMLE)结合,IMLE通过设计优化模式覆盖,RTM在当前最先进的方法中实现了最高精度和召回率,同时保持竞争性的FID,改进了CIFAR-10、CelebA-HQ(256x256)和九个少样本基准。RTM还改进了StyleGAN2和StyleGAN2-ADA在CIFAR-10和AFHQ-v1(512x512)上的表现,证明其益处不限于IMLE。不同于通过牺牲覆盖范围获得竞争性FID的流匹配基线,递归细化同时提高了质量和多样性。

英文摘要

Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to being saturated. Yet a model can still exhibit mode collapse while achieving a low FID, since a handful of sharp, near-duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID, and that because FID is already saturated, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce RTM, which replaces the single-pass latent mapping in style-based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state-of-the-art approaches while maintaining competitive FID, with improvements across CIFAR-10, CelebA-HQ at 256x256, and nine few-shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2-ADA on CIFAR-10 and AFHQ-v1 at 512x512, demonstrating that the benefit is not specific to IMLE. Unlike flow-matching baselines that achieve competitive FID at the expense of coverage, recursive refinement improves both quality and diversity simultaneously.