arXivDaily: arXiv daily academic digest, updated Monday through Friday
2508.11836 2026-05-22 cs.AI

Finite Automata Extraction: Low-data World Model Learning as Programs from Gameplay Video

Dave Goel, Matthew Guzdial, Anurag Sarkar

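AI summary: This paper proposes Finite Automata Extraction (FAE), a method that learns neuro-symbolic world models from gameplay video, represented as programs in a new domain-specific language (DSL), Retro Coder. Compared with prior approaches, FAE models the environment more precisely and generates more general code.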

Abstract

World models are defined as a compressed spatial and temporal learned representation of an environment. The learned representation is typically a neural network, making transfer of the learned environment dynamics and explainability a challenge. In this paper, we propose an approach, Finite Automata Extraction (FAE), that learns a neuro-symbolic world model from gameplay video represented as programs in a novel domain-specific language (DSL): Retro Coder. Compared to prior world model approaches, FAE learns a more precise model of the environment and more general code than prior DSL-based approaches.

2508.02127 2026-05-22 cs.CV

Enhancing Event-based Object Detection with Monocular Normal Maps

Mingjie Liu, Hanqing Liu, Luoping Cui, Chuang Zhu

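AI summary: This paper proposes the NRE-Net framework, which combines structural priors from normal maps, appearance context from RGB images, and high-frequency dynamics from events, using an Adaptive Dual-stream Fusion Module and an Event-modality Aware Fusion Module to improve object detection under complex illumination in autonomous driving.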

Abstract

Object detection in autonomous driving is frequently compromised by complex illumination. While event cameras offer a robust solution, they are susceptible to sudden contrast changes, such as reflections, which often trigger dense, misleading event signals. To overcome this, we leverage RGB-derived surface normal maps as explicit geometric constraints. Crucially, even when RGB degrades, the normal maps preserve low-frequency structural priors that effectively assist event-based detection. Consequently, we present NRE-Net, a trimodal framework that integrates structural priors from surface Normal maps, appearance context from RGB images, and high-frequency dynamics from Events. The Adaptive Dual-stream Fusion Module (ADFM) first aligns geometric and appearance cues, followed by the Event-modality Aware Fusion Module (EAFM), which selectively integrates event dynamics. Extensive evaluations on DSEC-Det-sub and PKU-DAVIS-SOD demonstrate that incorporating geometric priors yields an additional 3.0% AP50 gain over dual-modal baselines, while our approach consistently outperforms fusion methods such as SFNet (+2.7%) and SODFormer (+7.1%).

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO

General Agentic Planning Through Simulative Reasoning with World Models

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

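AI summary: This paper proposes simulative reasoning as a route to general agentic planning: a world model predicts future states to improve decision-making, and the SiRA architecture achieves higher task completion rates across different tasks.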

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

Abstract

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.
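The contrast between reactive and simulative decision-making described in this abstract can be illustrated with a toy sketch. It is not the SiRA implementation: the abstract describes an LLM-based world model with natural-language belief states, whereas `toy_world_model`, `toy_score`, the integer state, and the action set below are hypothetical stand-ins chosen only so the example runs. The planner simulates each candidate action one step ahead and picks the action whose predicted state scores best.

```python
def plan_by_simulation(state, actions, world_model, score):
    """Return the action whose simulated next state has the highest score."""
    return max(actions, key=lambda action: score(world_model(state, action)))


# Hypothetical toy environment: the state is an integer position on a line.
GOAL = 5


def toy_world_model(state, action):
    """Predict the next state; here the prediction is exact by construction."""
    return state + action


def toy_score(state):
    """Higher is better: negative distance to the goal."""
    return -abs(GOAL - state)


state = 0
trace = []
while state != GOAL:
    action = plan_by_simulation(state, [-1, 1, 2], toy_world_model, toy_score)
    state = toy_world_model(state, action)  # the toy model doubles as the environment
    trace.append(action)

print(trace)  # prints [2, 2, 1]
```

A reactive policy would map the current state directly to an action; here the choice depends on predicted future states, which is the distinction the abstract draws. The sketch looks only one step ahead and uses a perfect model, both simplifications.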

2506.22316 2026-05-22 cs.CL

Evaluating Scoring Bias in LLM-as-a-Judge

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

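AI summary: This paper studies scoring bias in LLM-as-a-Judge, identifies three new types of scoring bias, and develops a framework to quantify them, with the aim of improving scoring prompt design.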

Comments Accepted by DASFAA 2026

Abstract

The "LLM-as-a-Judge" paradigm, using Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development, offering scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has heavily concentrated on biases in comparative evaluations. In contrast, scoring-based evaluations, which assign an absolute score and are often more practical in industrial applications, remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

2506.04708 2026-05-22 cs.CL

Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

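AI summary: This paper proposes STAND, a model-free speculative decoding method that exploits redundancy in reasoning trajectories to speed up inference without sacrificing accuracy. Evaluated across multiple models and tasks, STAND reduces inference latency by 60-65% while maintaining accuracy.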

Comments EMNLP 2025 Oral

Abstract

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
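The idea of drafting tokens without a separate draft model, by reusing patterns already seen in the reasoning trajectory, can be sketched with a plain N-gram lookup table. This is a deliberately simplified illustration and not the STAND algorithm: it stores counts rather than logits, drafts greedily rather than stochastically, and omits the verification step in which the target model accepts or rejects the drafted tokens. The function names and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n=2):
    """Map each context of n tokens to a Counter of the tokens that followed it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        table[context][tokens[i + n]] += 1
    return table


def draft(table, prefix, n=2, max_draft=4):
    """Greedily propose up to max_draft tokens by repeated n-gram lookup."""
    sequence = list(prefix)
    proposed = []
    for _ in range(max_draft):
        context = tuple(sequence[-n:])
        if context not in table:
            break  # unseen context: stop drafting and fall back to the model
        token, _count = table[context].most_common(1)[0]
        proposed.append(token)
        sequence.append(token)
    return proposed


# Toy "reasoning trajectory" that repeats a pattern.
history = "so x = 2 so y = 3 so x = 2".split()
table = build_ngram_table(history, n=2)
print(draft(table, ["so", "x"], n=2, max_draft=3))  # prints ['=', '2', 'so']
```

In a full speculative decoding loop, the drafted tokens would be checked by the target model in a single forward pass, so a high acceptance rate translates into fewer sequential decoding steps.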

2505.17123 2026-05-22 cs.CL

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

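AI summary: This paper proposes MTR-Bench, a comprehensive benchmark of 4 classes, 40 tasks, and 3600 instances for evaluating the multi-turn reasoning of large language models. An automated framework enables large-scale evaluation, which reveals the shortcomings of current advanced reasoning models on multi-turn interactive tasks.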

Comments ACL 2026 Main Conference

Abstract

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating the Multi-Turn Reasoning of LLMs. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and requires multi-turn interaction with the environment. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research on interactive AI systems.

2505.16416 2026-05-22 cs.CV cs.AI

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

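AI summary: This paper proposes Circle-RoPE, which maps image-token coordinates onto an annulus orthogonal to the text position axis, decoupling cross-modal positions while preserving intra-image spatial structure, and uses Alternating Geometry Encoding to strengthen both the cross-modal decoupling and the preservation of fine-grained image spatial structure.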

Comments Accepted at ICML 2026

Abstract

Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
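The cone-like geometry mentioned in this abstract rests on a simple fact: points on a circle are all equally far from any point on the axis through the circle's centre and perpendicular to its plane. The sketch below checks that fact numerically. It is a geometric illustration only, under assumptions that are not from the paper: image tokens are placed evenly on a single circle in 3D, text tokens sit on the z axis, and the function names are hypothetical. It does not reproduce how Circle-RoPE remaps 2D grid coordinates or feeds positions into rotary embeddings.

```python
import math


def circle_positions(num_image_tokens, radius=1.0):
    """Place image tokens evenly on a circle in the (x, y) plane at z = 0."""
    return [
        (
            radius * math.cos(2 * math.pi * k / num_image_tokens),
            radius * math.sin(2 * math.pi * k / num_image_tokens),
            0.0,
        )
        for k in range(num_image_tokens)
    ]


def text_position(index):
    """Place a text token on the z axis, which passes through the circle centre."""
    return (0.0, 0.0, float(index))


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)


image_tokens = circle_positions(8, radius=2.0)
distances = [distance(text_position(3), p) for p in image_tokens]

# Every distance equals sqrt(radius**2 + index**2), independent of the image token.
print(max(distances) - min(distances))
```

With radius 2 and text index 3, each distance is the square root of 13, so the printed spread is zero up to floating-point error. A text token and the circle of image tokens together trace a cone, which matches the "cone-like" wording of the title.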

2505.05406 2026-05-22 cs.CL

Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries

Valeria Pastorino, Nafise Sadat Moosavi

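AI summary: This paper proposes the FIFO benchmark for measuring framing presence in LLM-generated news summaries. It finds that LLM-generated summaries show elevated framing rates in science and public health topics, indicating that framing is an overlooked but important dimension of summarization quality.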

Comments Accepted to The 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026) co-located with ACL 2026

Abstract

News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances ($\kappa = 0.61$) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.

2503.17599 2026-05-22 cs.CL cs.AI

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Lin Yao

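AI summary: This paper proposes a new evaluation framework and a general practice benchmark (GPBench) for assessing the capabilities of large language models in general practice. It finds that current LLMs cannot be applied autonomously in clinical care and require continuous human oversight.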

Abstract

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

2502.01476 2026-05-22 cs.LG cs.NA math.NA physics.comp-ph

Neuro-Symbolic AI for Analytical Solutions of Differential Equations

Orestis Oikonomou, Levi Lingsch, Dana Grund, Siddhartha Mishra, Georgios Kissas

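AI summary: This paper proposes SIGS, a neuro-symbolic framework that uses a context-free grammar to generate mathematically valid and physically meaningful building blocks, combines them according to a user-specified Ansatz, embeds them into a topology-regularised continuous latent manifold, and discovers analytical solutions through a two-stage search, improving the accuracy and efficiency of finding analytical solutions to differential equations.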

Comments Updated the method and added extra results

Abstract

Analytical solutions to differential equations offer exact, interpretable insight but are rarely available because discovering them requires expert intuition or exhaustive search of combinatorial spaces. We introduce SIGS, a neuro-symbolic framework for equation-driven closed-form solution discovery. SIGS uses a context-free grammar to generate mathematically valid and physically meaningful building blocks, with a user-specified Ansatz prescribing how these blocks combine, embeds them into a topology-regularised continuous latent manifold, and searches this manifold in two stages: structure selection followed by coefficient refinement using gradient descent, scoring candidates only against the PDE residual and prescribed boundary and initial conditions. This design unifies symbolic reasoning with numerical optimization; the grammar constrains candidate solution blocks to be proper by construction, while the latent search makes exploration tractable and data-free. SIGS is the first neuro-symbolic method to (i) recover analytical solutions for coupled nonlinear PDE systems, (ii) discover equivalent symbolic forms when the grammar lacks the natural primitives, and (iii) produce accurate symbolic approximations for PDEs lacking known closed-form solutions. Overall, SIGS improves over existing symbolic methods by orders of magnitude in both accuracy and runtime across standard PDE benchmarks.

2410.19787 2026-05-22 cs.CV cs.LG

Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

Clement Wang, Antoine Debouchage, Valentin Goldité, Aurélien Wery, Jules Salzinger

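AI summary: This paper proposes a deep learning method for pixel-wise Leaf Area Index prediction from multi-temporal Sentinel 1 radar data and Sentinel 2 multi-spectral data. A network built from multiple U-Nets with a common latent space fuses the complementary information of the input modalities, reaching 0.06 RMSE and a 0.93 R2 score on publicly available data.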

Journal ref
Proc. 2023 Conference on Big Data from Space (BiDS'23), Publications Office of the European Union, Luxembourg, 2023

Abstract

The Leaf Area Index (LAI) is a critical parameter for understanding ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it comprises several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes seasonality into account, which we find to play an important role. Our method achieved an RMSE of 0.06 and an $R^2$ score of 0.93 on publicly available data. We make our contributions available at https://github.com/valentingol/LeafNothingBehind so that future work can build on our current progress.

2410.18151 2026-05-22 cs.SD cs.LG cs.MM eess.AS

Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Weiliang Luo

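AI summary: This paper proposes Music102, an equivariant transformer grounded in group theory and musical structure, to improve the quality of chord progression accompaniment. By incorporating musical symmetries such as transposition and reflection, it improves on the non-equivariant Music101 prototype.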

Comments 10 pages, 3 figures

Journal ref
Proceedings of the 2025 International Computer Music Conference (https://hdl.handle.net/2027/fulcrum.zg64tq53m)

Abstract

We present Music102, an advanced model aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetries, such as transposition and reflection operations, and integrates these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over the non-equivariant Music101 prototype in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.
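The transposition and reflection symmetries named in this abstract can be made concrete on the 12 pitch classes. The sketch below treats a chord as a set of pitch classes modulo 12 and applies all 12 transpositions, with and without a reflection, giving the 24 symmetries of a regular 12-gon. Reading the $D_{12}$ of the title as this group is an assumption, and the functions are hypothetical illustrations rather than anything from the Music102 code.

```python
def transpose(pitch_classes, k):
    """Shift every pitch class up by k semitones, modulo 12."""
    return frozenset((p + k) % 12 for p in pitch_classes)


def reflect(pitch_classes):
    """Reflect every pitch class about 0 (musical inversion), modulo 12."""
    return frozenset((-p) % 12 for p in pitch_classes)


def orbit(pitch_classes):
    """Collect the images of a chord under all transpositions and reflections."""
    base = frozenset(pitch_classes)
    images = set()
    for k in range(12):
        images.add(transpose(base, k))
        images.add(transpose(reflect(base), k))
    return images


c_major = {0, 4, 7}    # C, E, G
augmented = {0, 4, 8}  # C, E, G#: a chord with internal symmetry

print(len(orbit(c_major)))    # prints 24
print(len(orbit(augmented)))  # prints 4
```

The orbit of the C major triad contains the 12 major and 12 minor triads, since reflection turns a major triad into a minor one. An equivariant model is one whose output transforms in the same way when its input is transformed by any of these operations, so it does not have to learn each key separately.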

2408.13002 2026-05-22 cs.LG

Measuring Variable Importance in Heterogeneous Treatment Effects with Confidence

Joseph Paillard, Angel Reyero Lobo, Vitaliy Kolodyazhniy, Bertrand Thirion, Denis A. Engemann

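AI summary: This paper proposes PermuCATE, an algorithm for statistically rigorous global variable importance assessment in Conditional Average Treatment Effect estimation. Theoretical analysis and empirical studies show that it has lower variance than the LOCO method, which increases statistical power and suits the limited-data settings of biomedical applications.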

Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47456-47477, 2025

Abstract

Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.

2404.05307 2026-05-22 cs.CV cs.RO

4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

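AI summary: This paper proposes the TMVA4D network, which performs semantic segmentation of people from 4D radar data, using multi-view projections to separate background from people, and reaches a Dice score of 75.9% and an IoU of 61.2% in low-visibility conditions.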

Abstract

Reliable people detection is crucial for the safe autonomy of mobile robots and heavy vehicles, both on roads and in industrial settings like mining and construction. However, common sensors like cameras or lidars are prone to failure in adverse conditions such as dust, fog, or smoke, which limits their use in real-world robotic systems. Radar, on the other hand, delivers robust measurements in a wide range of environmental conditions. In particular, modern high-resolution 4D imaging radars provide 4D point clouds across range, azimuth, and elevation, as well as per-point Doppler velocity data, well suited for robot perception. We propose TMVA4D, a family of artificial neural network architectures based on CNN and ConvLSTM encoders that leverage the 4D radar modality for semantic segmentation. The architectures are trained to distinguish between background and person classes using a series of 2D projections of the 4D radar data, encompassing elevation, azimuth, range, and Doppler velocity dimensions. Evaluated across several operational sites, our models achieve promising performance (Dice 75.9%, IoU 61.2% for class person) even in low-visibility conditions. The data and code will be made publicly available upon publication.
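The two segmentation metrics quoted in this abstract are related: for a single class computed from one set of true-positive, false-positive, and false-negative counts, Dice = 2 IoU / (1 + IoU). The sketch below states both metrics and checks that the reported pair is consistent with that identity. The helper names and the toy counts are assumptions for illustration; whether the paper aggregates its metrics in exactly this way is not stated in the abstract.

```python
def iou_from_counts(tp, fp, fn):
    """Intersection over union for one class from pixel (or point) counts."""
    return tp / (tp + fp + fn)


def dice_from_counts(tp, fp, fn):
    """Dice coefficient for one class from the same counts."""
    return 2 * tp / (2 * tp + fp + fn)


def dice_from_iou(iou):
    """Convert IoU to Dice; valid when both come from the same counts."""
    return 2.0 * iou / (1.0 + iou)


# Toy counts: 6 true positives, 2 false positives, 2 false negatives.
print(iou_from_counts(6, 2, 2))   # prints 0.6
print(dice_from_counts(6, 2, 2))  # prints 0.75

# The reported IoU of 61.2% corresponds to a Dice score of about 75.9%.
print(round(100 * dice_from_iou(0.612), 1))  # prints 75.9
```

Because Dice counts the overlap twice, it is always at least as large as IoU for the same prediction, which is why the Dice figure in the abstract is the higher of the two.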

2308.04371 2026-05-22 cs.AI

Cumulative Reasoning with Large Language Models

Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao

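AI summary: This paper proposes Cumulative Reasoning (CR), a framework that enhances the problem-solving ability of large language models (LLMs) by emulating human iterative and cumulative thinking. CR decomposes tasks, generates and verifies intermediate reasoning steps, and composes a solution by building a dynamic Directed Acyclic Graph (DAG), yielding significant gains on logical inference, the Game of 24, and math problems.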

Comments Published in Transactions on Machine Learning Research (TMLR). Project Page: https://github.com/iiis-ai/cumulative-reasoning

Abstract

Recent advancements in large language models (LLMs) have shown remarkable progress, yet their ability to solve complex problems remains limited. In this work, we introduce Cumulative Reasoning (CR), a structured framework that enhances LLM problem-solving by emulating human-like iterative and cumulative thought processes. CR orchestrates LLMs in three distinct roles: Proposer, Verifier(s), and Reporter, to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution by building a dynamic Directed Acyclic Graph (DAG) of verified propositions. This approach substantially enhances problem-solving capabilities. We demonstrate CR's advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging level 5 problems. When incorporating a code environment with CR, we further harness LLMs' reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%.

1709.03806 2026-05-22 cs.CV

Do Vision Models Encode Object-Level Semantic Relatedness? A Cognitive Psychology-Inspired Benchmark

Hansang Lee, Haeil Lee, Junmo Kim

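AI summary: Using a benchmark inspired by cognitive psychology, this paper examines whether vision models encode object-level semantic relatedness. It studies two image-only test sets and exposes representational properties beyond classification accuracy.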

Abstract

Modern vision models have achieved strong object-recognition performance, yet it remains unclear whether their representations encode object-level semantic relatedness, the meaningful connection between object concepts that supports human visual cognition. Existing benchmarks predominantly target category prediction or rely on image-text matching, leaving the visual representation itself underexamined. Drawing on cognitive psychology, we recast semantic relatedness as a triplet-ranking task and study two image-only test beds: POPORO, an existing 400-triplet psychological stimulus set repurposed for representation evaluation, and PoporoIN, a newly constructed and manually curated 1,000-triplet ImageNet-validation extension. Each triplet is annotated along two orthogonal axes: a related-target axis distinguishing Categorical Relatedness (CR, taxonomic) from conTextual Relatedness (TR, thematic), and a distractor axis distinguishing Color-matched Distractors (CD) from Shape-matched Distractors (SD). Twenty pretrained models spanning supervised, self-supervised, vision-language, and generative paradigms were evaluated by cosine similarity in an inference-only protocol. Transformer-based representations exceeded convolutional counterparts by up to 18.30 percentage points on PoporoIN at comparable ImageNet accuracy, and vision-language encoders exceeded vision-only counterparts by up to 22.50 percentage points under matched ImageNet accuracy on POPORO. Across paradigms, models recognized taxonomic targets more reliably than thematic ones and were more easily misled by shape-matched than by color-matched distractors. The benchmarks expose representational properties that classification accuracy alone does not fully predict, bridging cognitive psychology and visual representation evaluation.
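A triplet-ranking evaluation by cosine similarity, as described in this abstract, can be sketched in a few lines. The scoring rule assumed here is that a triplet counts as correct when the anchor embedding is closer to the related target than to the distractor; the abstract does not spell out the exact rule, and the two-dimensional vectors below are made-up stand-ins for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets):
    """Fraction of (anchor, target, distractor) triplets ranked correctly."""
    correct = sum(
        1
        for anchor, target, distractor in triplets
        if cosine(anchor, target) > cosine(anchor, distractor)
    )
    return correct / len(triplets)


# Hypothetical embeddings: the first triplet is ranked correctly, the second is not.
toy_triplets = [
    ((1.0, 0.0), (0.9, 0.1), (0.0, 1.0)),
    ((1.0, 1.0), (-1.0, 0.0), (1.0, 0.9)),
]
print(triplet_accuracy(toy_triplets))  # prints 0.5
```

Because the protocol needs only embeddings and a similarity function, it applies to any pretrained encoder without fine-tuning, which is what makes an inference-only comparison across twenty models practical.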

2605.22086 2026-05-22 cs.CV

GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

GenHAR:面向最后一公里配送的跨领域人类活动识别通用化

Zhiqing Hong, Zelong Li, Xiubin Fan, Guang Yang, Baoshen Guo, Haotian Wang, Tian He, Desheng Zhang

AI总结 本文提出GenHAR框架,通过学习领域不变的传感器表示来解决跨领域人类活动识别中的分布偏移问题,提升了目标领域的泛化能力,并在实际部署中实现了高效率和高精度的实时活动检测。

详情
AI中文摘要

人类活动识别(HAR)在各种应用中表现出显著的有效性,如智能医疗和智能制造。然而,HAR面临的主要挑战是不同传感器数据域之间的分布偏移,这通常会导致在现实应用中性能下降。为了解决这个问题,本文引入了GenHAR,一种新的框架,旨在通过学习领域不变的传感器表示来缩小领域差距。GenHAR的目标是通过仅使用源域的数据来增强HAR在目标域上的泛化能力。GenHAR的关键创新体现在两个方面:首先,GenHAR对传感器数据进行分词,并学习频率传感器通道维度之间的相关性,以提高HAR模型的鲁棒性;其次,GenHAR通过选择性掩码和高效的注意力机制来提高效率。我们通过在现实世界的人类活动数据集上与最先进的HAR方法进行比较,系统分析了GenHAR。结果表明,GenHAR在准确性上比最先进的方法高出9.97%,并减少了6.4倍的浮点运算。此外,我们还在四个城市的一家领先物流公司部署了GenHAR,并检测到21.5亿次实时活动。我们发布了代码:https://github.com/Sensor-FoundationModel/GenHAR。

英文摘要

Human Activity Recognition (HAR) has shown remarkable effectiveness in various applications, such as smart healthcare and intelligent manufacturing. However, a major challenge faced by HAR is the distribution shift across different sensor data domains, which often leads to decreased performance when deployed for real-world applications. To address this issue, this paper introduces GenHAR, a novel framework designed to mitigate the domain gap by learning domain-invariant sensor representations. GenHAR aims to enhance the generalization capabilities of HAR on target domains purely with data from the source domain. The key novelty of GenHAR lies in two aspects. Firstly, GenHAR tokenizes sensor data and learns correlations among frequency sensor channel dimensions to improve the robustness of HAR models. Secondly, GenHAR improves the efficiency via selective masking and an efficient attention mechanism. We conduct a systematic analysis of GenHAR by comparing it with state-of-the-art HAR methods on real-world human activity datasets. Results show that GenHAR outperforms state-of-the-art methods by 9.97% in accuracy, and reduces Floating Point Operations by 6.4 times. Moreover, we deploy GenHAR at a leading logistics company in 4 cities, and have detected 2.15 billion real-time activities. We release our code at: https://github.com/Sensor-FoundationModel/GenHAR.

2605.22083 2026-05-22 cs.SD cs.LG eess.AS

RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

RobustSpeechFlow: 通过基于增强的对比流匹配学习鲁棒的文本到语音轨迹

Jinhyeok Yang, Hyeongju Kim, Yechan Yu, Joon Byun, Frederik Bous, Juheon Lee

AI总结 本文提出RobustSpeechFlow,一种通过引入长度保持重复和跳过潜在增强来改进对齐鲁棒性的训练策略,从而在无需外部对齐器或偏好数据的情况下,直接惩罚现实中的失败模式,并能无缝集成到现有流程中,实验表明其在文本到语音任务中显著提升了语音质量与鲁棒性。

Comments Submitted to INTERSPEECH 2026

详情
AI中文摘要

尽管流匹配文本到语音(TTS)在零样本说话人相似性和自然度方面表现强劲,但仍易受内容保真度问题影响,特别是由于不完美的对齐导致的跳过和重复错误。我们提出了RobustSpeechFlow,一种训练策略,通过扩展对比流匹配,引入长度保持重复和跳过潜在增强来提高对齐鲁棒性。该方法无需外部对齐器或偏好数据,直接惩罚现实中的失败模式,并能无缝集成到现有流程中。在Seed-TTS-eval上,仅使用0.06B参数,其将词错误率(WER)从1.44降至1.38。在我们的ZERO500基准测试中,它在多样化的说话人和语调条件下实现了稳定的可理解性提升;在NFE=24时,其将英文字符错误率(CER)从0.48%降至0.35%,将韩文CER从0.81%降至0.57%。音频样本:https://robustspeechflow.github.io/

英文摘要

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/

2605.22081 2026-05-22 cs.CL

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

ArabDiscrim: 一个十年的阿拉伯语Facebook语料库,涉及种族主义和歧视

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

AI总结 本文提出了ArabDiscrim,一个包含293,000条阿拉伯语Facebook公开帖子、跨度十年(2014-2024年)的语料库,用于研究种族主义和歧视。该语料库整合了平台原生的互动信号,如反应、分享、评论和页面元数据,支持语言和受众反应的联合分析。该资源包括200个精心挑选的术语(100个与种族主义相关,100个与歧视相关)以及20个歧视轴,捕捉基于身份的不平等对待。它还提供了显式的归属模式。ArabDiscrim在伦理合规的限制研究使用许可下发布,支持弱监督、轴感知采样和平台生态研究。通过连接词汇深度和生态效度,它为公平导向、平台意识的阿拉伯语NLP建立了基础。

Comments Accepted at LREC 2026 Main Conference

详情
AI中文摘要

我们介绍了ArabDiscrim,一个跨度十年的词汇资源和语料库,包含293,000条讨论种族主义和歧视的公开阿拉伯语Facebook帖子(2014-2024)。不同于现有以推特为中心的数据集,ArabDiscrim整合了平台原生的互动信号,包括反应、分享、评论和页面元数据,使语言和受众反应的联合分析成为可能。该资源包括200个精心挑选的术语(100个与种族主义相关,100个与歧视相关),并配有形态学正则表达式族(每个词元13种以上屈折形式),以及20个歧视轴,捕捉基于身份的不平等对待依据。它还提供了显式的归属模式。ArabDiscrim在限制性研究使用许可下发布,以符合平台条款的伦理要求,支持弱监督、轴感知采样和平台生态研究。通过连接词汇深度和生态效度,它为公平导向、平台意识的阿拉伯语NLP建立了基础。

英文摘要

We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014-2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.

2605.22078 2026-05-22 cs.AI cs.CV

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

通过无训练空间-时间池化和栅格化增强视频大语言模型的视觉令牌表示

Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding

AI总结 本文提出了一种无需训练的空间-时间池化和栅格化方法ST-GridPool,用于提升视频大语言模型的视觉令牌表示,通过多级时空交互和基于范数的空间池化技术,在不需重新训练的情况下提高性能。

Comments Accepted by ICLR 2026

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在视频理解任务中取得了显著进展,但如何高效压缩视觉令牌同时保持时空交互仍面临挑战。现有方法如LLaVA家族使用简单的池化或插值技术,忽视了视觉令牌的复杂动态。为弥合这一差距,我们提出了ST-GridPool,一种专为视频LLM设计的新型无训练视觉令牌增强方法。我们的方法整合了金字塔时间栅格(PTG),通过层次化时间栅格捕捉多粒度时空交互,以及基于范数的空间池化(NSP),通过利用令牌范数与语义丰富度之间的相关性来保留高信息视觉区域。在各种基准测试中,ST-GridPool在不需成本高昂重新训练的情况下,一致提升了视频LLM的性能。我们的方法提供了一种高效且即插即用的解决方案来改进视觉令牌表示。我们的代码可在https://github.com/bingjunluo/ST-GridPool上获得。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient and plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.
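To make the idea of norm-based spatial pooling concrete, here is a minimal sketch under a strong simplifying assumption: within each pooling window, the token with the largest L2 norm is kept, instead of averaging all tokens. This is only one plausible reading of "leveraging the correlation between token norms and semantic richness"; the function `norm_based_pool` and its window rule are hypothetical and are not taken from the ST-GridPool repository.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a token vector."""
    return math.sqrt(sum(x * x for x in vec))

def norm_based_pool(grid, window=2):
    """Downsample an H x W grid of token vectors by keeping, in each
    window x window block, the token with the largest L2 norm."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for top in range(0, height, window):
        row = []
        for left in range(0, width, window):
            block = [
                grid[r][c]
                for r in range(top, min(top + window, height))
                for c in range(left, min(left + window, width))
            ]
            row.append(max(block, key=l2_norm))
        pooled.append(row)
    return pooled

# A 2 x 2 grid of 2-dimensional tokens (hypothetical values).
grid = [
    [[0.1, 0.0], [3.0, 4.0]],
    [[0.2, 0.1], [1.0, 1.0]],
]
pooled = norm_based_pool(grid)
```

Average pooling would blend the high-norm token with its low-norm neighbors; selecting by norm keeps it intact, which is the behavior the abstract attributes to NSP at a high level.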

2605.22075 2026-05-22 cs.LG q-bio.QM

Can Breath Biomarkers Causally Influence Blood Glucose? Investigating VOC-Mediated Modulation in Diabetes

呼吸生物标志物能否因果影响血糖?探讨VOC介导的糖尿病调节

Varsha Sharma, Prasanta K. Guha, Avik Ghose

AI总结 本研究通过非侵入式数据驱动框架,利用挥发性有机化合物(VOCs)和生活方式变量识别糖尿病高风险个体,采用因果推断技术估计VOCs如丙酮、异丙醇、异戊二烯和乙醇对血糖水平的影响,并设计分类器区分糖尿病患者与非糖尿病患者,建立基于风险的排名系统和高斯混合模型识别自然聚类。

详情
Journal ref
Proceedings of the IJCAI workshop on Advanced Neural Systems for Next-Generation Biomedical Intelligence, 2025
AI中文摘要

糖尿病是一种全球健康负担,早期检测对于及时干预至关重要。本研究探讨了一种非侵入式、数据驱动的框架,利用挥发性有机化合物(VOCs)和生活方式变量识别糖尿病高风险个体。我们使用因果推断技术估计丙酮、异丙醇、异戊二烯和乙醇等VOCs对血糖水平的影响。此外,我们设计了一个分类器,利用非侵入式标志物区分糖尿病患者和非糖尿病患者。我们为“灰色区域”中的个体建立了基于风险的排名系统,并使用高斯混合模型识别人群中的自然聚类。我们的结果表明,特定的VOCs对血糖水平表现出强因果影响,且机器学习模型能够可靠地分类和分层高风险个体。这种集成的因果-可解释分析可以支持非侵入式糖尿病早期筛查工具的开发。

英文摘要

Diabetes is a global health burden, and early detection is critical for timely intervention. This study explores a non-invasive, data-driven framework to identify individuals at risk of diabetes using Volatile Organic Compounds (VOCs) and lifestyle variables. We use causal inference techniques to estimate the impact of VOCs such as acetone, isopropanol, isoprene, and ethanol on blood glucose levels. Additionally, we designed a classifier to distinguish diabetics from non-diabetics using non-invasive markers. We created a risk-based ranking system for individuals in the "gray zone," and identified natural clusters in the population using a Gaussian Mixture Model. Our results suggest that specific VOCs exhibit a strong causal influence on glucose levels and that machine learning models can reliably classify and stratify individuals at high risk. This integrated causal-explainable analysis can support the development of tools for non-invasive early screening of diabetes.

2605.22074 2026-05-22 cs.LG cs.AI cs.CL

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

从推理链到可验证子问题:课程强化学习使LLM推理能够进行信用分配

Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang

AI总结 该研究提出SCRL框架,通过从参考推理链中生成可验证子问题,解决LLM推理中信用分配问题,提升了在数学推理任务中的性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在LLM推理中展现出强大潜力,但基于结果的RLVR在处理难题时效率低下,因为最终答案正确的rollout很少,且样本层面的信用分配无法利用失败尝试中的部分进展。我们引入SCRL(子问题课程强化学习),一种课程强化学习框架,从参考推理链中推导出可验证子问题,并将最终子问题固定为原始问题。这将难题中的部分进展转化为可验证的学习信号。算法上,SCRL使用子问题层面的归一化,在每个子问题位置独立归一化奖励,并将得到的优势分配给相应的答案片段,从而在没有外部评分标准或奖励模型的情况下实现更细粒度的信用分配。我们的分析表明,子问题课程将难题从梯度死区中拉出,且原始问题越难,相对收益越大。在七个数学推理基准测试中,SCRL超越了强大的课程学习基线,与GRPO相比,平均准确率在Qwen3-4B-Base上提高4.1点,在Qwen3-14B-Base上提高1.9点。在AIME24、AIME25和IMO-Bench上,SCRL进一步将Qwen3-4B-Base的pass@1提高3.7点,pass@64提高4.6点,表明其在困难推理问题上具有更好的探索能力。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
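The subproblem-level normalization described in this abstract can be sketched as follows. The sketch assumes binary rewards stored as `rewards[i][k]` (rollout `i`, subproblem position `k`) and a GRPO-style standardization applied column by column; the variable names, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details confirmed by the paper.

```python
import statistics

def subproblem_advantages(rewards, eps=1e-6):
    """rewards[i][k] is the 0/1 reward of rollout i on subproblem k.
    Returns advantages standardized independently at each subproblem
    position, i.e. column-wise rather than per whole rollout."""
    num_positions = len(rewards[0])
    advantages = [[0.0] * num_positions for _ in rewards]
    for k in range(num_positions):
        column = [rollout[k] for rollout in rewards]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for i, reward in enumerate(column):
            advantages[i][k] = (reward - mean) / (std + eps)
    return advantages

# Four rollouts, three subproblems; the last column is the original
# (hard) problem, which no rollout solves in this toy example.
rewards = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
]
adv = subproblem_advantages(rewards)
```

In this toy batch the final-answer column is all zeros, so outcome-only normalization would give every rollout zero advantage. The two earlier columns still produce nonzero advantages, which is the kind of signal the abstract credits with lifting hard problems out of gradient dead zones.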

2605.22072 2026-05-22 cs.CL cs.CV

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Faithful-MR1: 通过锚定和强化视觉注意力实现忠实的多模态推理

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

AI总结 本文提出Faithful-MR1框架,通过锚定和强化视觉注意力解决多模态推理中的忠实性问题,提升模型在多模态基准上的表现。

Comments 20 pages, 7 figures, 3 tables. Preprint

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为推动大语言模型复杂推理发展的有希望范式,最近的研究将其扩展到多模态大语言模型(MLLMs)。然而,这种转移带来了忠实性挑战:任务相关视觉证据的忠实感知以及在推理中忠实使用该证据,导致多模态基准上的收益不尽如人意。具体而言,现有的感知监督通常基于文本描述而非原生的图像区域,且忠实使用在很大程度上被忽视,暴露出感知-推理断层,即正确感知的证据在推理中被丢弃或被否定。为弥合这些差距,我们提出Faithful-MR1,一个训练框架,通过锚定和强化视觉注意力来解决忠实多模态推理的两个方面。锚定阶段将感知转化为一个显式的预推理子任务,直接以图像区域而非文本描述来监督专门的<Focus>标记的注意力。强化阶段通过反事实图像干预揭示忠实使用,奖励那些答案正确且将视觉注意力集中在视觉具有因果作用之处的轨迹。广泛实验表明,Faithful-MR1在Qwen2.5-VL-Instruct 3B和7B骨干模型上均优于最近的多模态推理基线,同时使用的训练数据显著更少。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.

2605.22068 2026-05-22 cs.CV

COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

COCOTree: 一个用于开放树状视觉分解的数据集和基准

Junhyub Lee, Seunghun Chae, Hyosu Kim

AI总结 本文提出COCOTree数据集和基准,通过自动化生成管道和开放词汇空间,实现了对复杂物理组装的长尾分布的捕捉,并提出了Open Tree Quality (OTQ)评估指标。

详情
AI中文摘要

我们正式化并启用了开放树分解任务,该任务将图像分割为具有无约束粒度和灵活性的层次树状视觉组件。具体而言,我们为这一新范式提供了基础基准,有三个关键贡献:首先,通过开发一个完全自动化的生成管道,结合大视觉-语言模型的语义推理与SAM 3的精确几何定位,克服了手动标注的高认知和物理瓶颈;其次,利用该管道构建了COCOTree大规模基准,包含超过21,000张图像和180万个结构节点,通过超过3,500个唯一标签的开放词汇空间,成功捕捉了复杂物理组装的长尾分布;最后,我们通过提出Open Tree Quality (OTQ)指标建立了标准化评估协议,该指标联合评估掩码精度、标签准确性和结构一致性。我们已发布数据集和基准代码:https://github.com/melonkick3090/COCOTree.

英文摘要

We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.

2605.22066 2026-05-22 cs.CV cs.AI

Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

Echo4DIR: 从2D超声视频重建4D隐式心脏结构

Yanan Liu, Qinya Li, Hao Zhang, Kangjian He, Xuan Yang, Hao Li, Dan Xu, Lei Li

AI总结 本文提出Echo4DIR框架,通过隐式重建方法从稀疏2D超声视频中重建4D心脏几何结构,解决了几何歧义和时间不连续性问题,实现了高精度的临床重叠度。

详情
AI中文摘要

从稀疏的2D超声图像中重建4D(3D+t)心脏几何结构具有高度的实用性,但受到几何歧义和时间不连续性的根本挑战。为了解决这些问题,我们提出了Echo4DIR,一种新颖的测试时4D隐式重建框架。具体来说,我们通过心脏条件SDF学习鲁棒的3D形状先验,构建了具有极线交叉注意力的Epipolar Mask Encoder模块,以有效融合多视角特征。为了弥合合成到现实的领域差距,我们引入了一种自监督的SDF定制可微渲染策略,利用未经校准的临床掩码进行患者特定的3D形状适应,而无需3D地面真实数据。关键的是,隐式表示的内在连续性克服了稀疏观测,使在任意分辨率下都能获得解剖学可靠的几何结构。此外,为了使我们的框架具备物理连续的4D扩展能力,我们引入了一种径向SDF对齐策略,严格锁定形状演变到预测的速度场,从根本上消除了网格漂移。在合成基准和真实临床数据集上的广泛实验表明,Echo4DIR实现了最先进的4D心脏网格重建,特别是在临床重叠度方面,达到了高达98.35%的Dice和96.75%的IoU。

英文摘要

Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentally challenged by geometric ambiguity and temporal discontinuity. To tackle these issues, we propose Echo4DIR, a novel test-time 4D implicit reconstruction framework. Specifically, we learn robust 3D shape priors from statistical shape models (SSMs) via a cardiac conditional SDF, constructing an Epipolar Mask Encoder module with epipolar cross attention to effectively fuse multi-view features. To bridge the synthetic-to-real domain gap, we introduce a self-supervised SDF-tailored differentiable rendering strategy for patient-specific 3D shape adaptation using uncalibrated clinical masks without requiring 3D ground truth. Crucially, the inherent continuity of implicit representation overcomes sparse observations, enabling anatomically reliable geometry at arbitrary resolutions. Furthermore, to empower our framework with physically continuous 4D extension, we introduce a Radial SDF Alignment strategy that strictly locks shape evolution to the predicted velocity field, fundamentally eliminating mesh drift. Extensive experiments on synthetic benchmarks and real clinical datasets demonstrate that Echo4DIR achieves state-of-the-art 4D cardiac mesh reconstruction, notably yielding an impressive clinical overlap of up to 98.35% Dice and 96.75% IoU.

2605.22061 2026-05-22 cs.CV

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

分布式图像压缩与多模态侧信息在极低比特率下的应用

Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou

AI总结 本文提出了一种多模态分布式图像压缩框架(MDIC),通过利用多模态侧信息在极低比特率下实现高质量图像重建,核心方法是引入文本到图像扩散解码器和特征掩码生成器,以提升全局感知质量和局部细节保留能力。

Comments Accepted by CVPR2026

详情
AI中文摘要

分布式图像压缩(DIC)对于多视图传输至关重要,尤其是在极低比特率(< 0.1 bpp)下。其核心挑战是有效利用侧信息以在严格比特率预算下实现高质量重建。然而,现有DIC方法难以利用全局上下文和对象级细节,导致局部模糊和细节丢失。为此,我们提出了一种多模态DIC框架(MDIC),首次将多模态侧信息引入DIC范式,有效保留细粒度局部细节并提升重建图像的全局感知质量。具体而言,我们引入基于文本到图像扩散的解码器,该解码器根据从相关图像中提取的文本侧信息进行条件化,以捕捉共享的全局语义。此外,我们设计了一个由多模态细粒度对齐任务监督的特征掩码生成器,以加强视觉侧信息的利用。生成的掩码具有两个作用:首先,它指导从无损传输的侧信息中提取细粒度细节,以保持重建细节的语义一致性;其次,它调节从量化VQ-VAE嵌入中提取的聚类特征表示,补偿主图像在极端压缩下的类别信息丢失。在广泛使用的KITTI立体和Cityscapes数据集上的大量实验表明,MDIC在极低比特率下实现了最先进的感知质量。

英文摘要

Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.

2605.22057 2026-05-22 cs.CL

FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing

FlyRoute: 通过数据飞轮实现自进化代理配置以实现适应性任务路由

Rongjun Li, Ziyu Zhou, Yihang Wu

AI总结 本文提出FlyRoute,一种自进化配置框架,通过真实流量增长能力证据,提高适应性任务路由的性能。

Comments 13 pages, 5 figures, 5 tables

详情
AI中文摘要

企业路由器将查询分配给专家代理,但在代理不断演进(提示、工具、模型)的同时,已部署的配置档案保持静态,开发者也很少及时更新描述或示例。我们提出了FlyRoute,一种自进化的配置档案框架,从真实流量中积累能力证据:调度候选代理,将通过质量门控的成功配对存入每个代理的成功存储,定期将证据提炼为学习到的能力描述,并将这些描述与BM25检索到的成功案例一起注入LLM路由器。为了使这一飞轮具有数据效率,FlyRoute引入了一种针对性探索策略,结合档案不确定性、BM25相关性和词汇新颖性,仅在查询合理匹配时优先选择档案不足的代理,并避免冗余的证据收集。在我们专有的、由真实路由查询构成的企业开发者支持数据集上的实验中,FlyRoute仅使用每个代理五个种子查询,就将同一骨干模型的零样本LLM路由器的准确率从72.57%提升到78.04%,表明档案检索已经能够增强冷启动路由。在7,211个带标注的训练查询流经飞轮后,准确率提升到89.83%(比零样本高17.26个百分点;比冷启动高11.79个百分点),并且在单一标准答案测试查询的标准路由准确率下,四个专家领域均获得一致的提升。

英文摘要

Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.

2605.22055 2026-05-22 cs.LG cs.AI

Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series

基于原型的分类子任务解耦框架:提升多变量时间序列的泛化能力与可解释性

Xianhao Song, Yuang Zhang, Yuqi She, Liping Wang, Xuemin Lin

AI总结 本文提出PDFTime框架,通过多阶段决策过程解耦时间序列分类任务,提升模型的泛化能力和可解释性,实现了在UEA和UCR基准测试中的最优性能。

详情
AI中文摘要

时间序列分类(TSC)是一个长期存在的研究问题,近年来随着大规模时间数据的快速增长而受到越来越多的关注。尽管深度学习带来了显著进展,但设计出既准确又可解释的TSC模型仍然是一个具有挑战性的任务。许多现有方法采用直接的特征到标签分类范式:通过单一线性投影(通常在全局池化后)将高维时间嵌入压缩为类别logits,这种范式将特征提取和决策逻辑合并为不可分割的映射。为了解决这些限制,我们提出了PDFTime,一个基于原型的框架,将时间序列分类重新表述为多阶段决策过程。不同于直接的特征到标签映射,PDFTime利用学习到的原型来近似潜在空间中的类别条件特征分布,通过不同粒度的分类子任务实现逐步辨别。据我们所知,PDFTime是第一个将时间序列分类重新表述为解耦、多阶段、基于相似性的推理过程的框架,打破了长期以来直接、黑箱的特征到标签映射范式。广泛的评估表明,PDFTime在UEA和UCR基准测试中实现了最先进的性能。值得注意的是,它在UCR档案的128个数据集中的80个上取得了top-1准确率,在一致性和泛化性上均显著优于最近的强基线方法。

英文摘要

Time Series Classification (TSC) is a long-standing research problem that has gained increasing attention in recent years with the rapid growth of large-scale temporal data. Despite substantial progress enabled by deep learning, designing TSC models that are both accurate and interpretable remains a challenging task. Many existing approaches adopt a direct feature-to-label classification paradigm: by collapsing high-dimensional temporal embeddings into class logits via a single linear projection (often after global pooling), this paradigm conflates feature extraction and decision logic into an inseparable mapping. To address these limitations, we propose PDFTime, a prototype-guided framework that reformulates time series classification as a multi-stage decision process. Instead of direct feature-to-label mapping, PDFTime leverages learned prototypes to approximate class-conditional feature distributions in the latent space, enabling progressive discrimination through classification sub-tasks of varying granularity. To our knowledge, PDFTime is the first framework to reformulate time series classification as a decoupled, multi-stage similarity-based reasoning process, breaking the long-standing paradigm of direct, black-box feature-to-label mapping. Extensive evaluations demonstrate that PDFTime achieves state-of-the-art (SOTA) performance across UEA and UCR benchmarks. Notably, it secures the top-1 accuracy on 80 out of 128 datasets in the UCR archive, significantly outperforming recent strong baselines in both consistency and generalization.

2605.22054 2026-05-22 cs.LG cs.AI

LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation

LABO: 通过广泛探索和选择性实验实现的LLM加速贝叶斯优化

Zhuo Chen, Xinzhe Yuan, Jianshu Zhang, Jinzong Dong, Ruichen Zhou, Yingchun Niu, Tianhang Zhou, Yu Yang Fredrik Liu, Yuqiang Li, Nanyang Ye, Qinying Gu

AI总结 本文提出LABO框架,通过结合LLM预测与实验观测,在贝叶斯优化中实现更高效的样本优化,理论分析和实验结果表明其在科学任务中优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

科学探索中的高成本和数据稀缺性推动了将大型语言模型(LLMs)作为知识驱动组件应用于贝叶斯优化(BO)的研究。然而,现有方法通常将LLMs直接嵌入到采样或替代建模流程中,未能充分利用其显著低于现实实验的评估成本。为了解决这一限制,我们提出了LLM加速贝叶斯优化(LABO)框架,该框架在单个BO循环中结合LLM预测与实验观测。LABO采用门控标准来动态平衡对LLM预测和实际实验的依赖。通过利用低成本的LLM评估进行广泛探索搜索空间,并仅在高不确定性区域保留昂贵的现实实验,LABO实现了更高效的样本优化。我们提供了理论分析,通过累积遗憾界正式化这一效率增益。在多样化的科学任务中,实验结果表明LABO在相同实验预算下一致优于现有方法。我们的结果表明,LABO为将LLMs整合到科学发现流程中提供了一种实用且理论严谨的方法。

英文摘要

The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledge-driven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directly into the sampling or surrogate modeling pipeline, without fully leveraging their significantly lower evaluation cost compared to real-world experiments. To address this limitation, we propose LLM-Accelerated Bayesian Optimization (LABO), a framework that combines LLM predictions with experimental observations within a single BO loop. LABO employs a gating criterion to dynamically balance the reliance on LLM predictions versus actual experiments. By leveraging inexpensive LLM evaluations to broadly explore the search space and reserving costly real experiments only for regions with high uncertainty, LABO achieves more sample-efficient optimization. We provide a theoretical analysis with a cumulative regret bound that formalizes this efficiency gain. Empirical results across diverse scientific tasks demonstrate that LABO consistently outperforms existing methods under identical experimental budgets. Our results suggest that LABO offers a practical and theoretically grounded approach for integrating LLMs into scientific discovery workflows.
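The gating idea in this abstract can be illustrated with a deliberately simple sketch. It assumes a scalar uncertainty score per candidate and a fixed threshold: candidates above the threshold get a real experiment, and the rest use the cheap LLM estimate. The threshold rule, the function names, and the toy objective are all hypothetical; the abstract does not specify the actual gating criterion or how it interacts with the surrogate model.

```python
def gated_evaluate(candidate, uncertainty, llm_predict, run_experiment, threshold=0.5):
    """Return (value, source). High-uncertainty candidates get a real
    experiment; the others fall back to the cheap LLM estimate."""
    if uncertainty > threshold:
        return run_experiment(candidate), "experiment"
    return llm_predict(candidate), "llm"

def llm_predict(x):
    """Stand-in for an inexpensive but biased LLM prediction."""
    return -(x - 2.0) ** 2 + 0.3

def run_experiment(x):
    """Stand-in for a costly ground-truth measurement."""
    return -(x - 2.0) ** 2

# (candidate, uncertainty) pairs; only the middle one exceeds the threshold.
candidates = [(1.0, 0.2), (2.0, 0.9), (3.0, 0.4)]
results = [gated_evaluate(x, u, llm_predict, run_experiment) for x, u in candidates]
sources = [source for _, source in results]
experiments_used = sources.count("experiment")
```

Only one of the three candidates consumes experimental budget here. In a full Bayesian optimization loop, both kinds of observation would presumably update the surrogate, with the LLM-derived ones treated as noisier.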

2605.22051 2026-05-22 cs.CV

EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

EasyVFX: 用于资源高效视觉效果生成的频率驱动解耦

Yue Ma, Xu Ye, Qinghe Wang, Yucheng Wang, Hongyu Liu, Yinhan Zhang, Xinyu Wang, Yuanpeng Che, Shanhui Mo, Paul Liang, Fangneng Zhan, Qifeng Chen

AI总结 本文提出EasyVFX框架,通过频率域分解解耦高频和低频成分,降低视觉效果生成的计算和数据依赖性,实现高效且高质量的视觉效果合成。

Comments Accepted by SIGGRAPH 2026. Project page: https://easy-vfx.github.io/

详情
AI中文摘要

生成高保真视觉效果(VFX)通常需要大量数据集和高昂的计算资源,因为空间纹理和时间动态之间存在复杂的耦合。本文介绍了EasyVFX,一个资源高效的框架,能够在严格约束下实现逼真的VFX合成。我们的核心理念在于频域分解:我们观察到通过解耦高频成分(代表复杂的空间外观)和低频成分(代表全局运动动态),可以显著降低VFX的复杂性。这种频域解耦将高维学习问题转化为可管理的子任务,从而降低优化障碍并减少数据依赖性。基于这一见解,我们提出了一种两阶段训练范式。首先,我们设计了一种频率感知的专家混合(Freq-MoE)架构。通过利用软路由机制,我们的模型将专门的专家分配到不同的频谱带,使它们能够为外观和运动动态建立稳健的先验。这种专业化使模型能够以更少的GPU资源获取基础的VFX知识。其次,我们引入了一种由新型频率约束损失驱动的测试时训练策略。这使预训练模型能够通过局部优化快速适应特定的、未见过的效果,仅需在单个GPU上进行约100步的训练。实验结果表明,EasyVFX能够生成结构一致且视觉震撼的效果,证明了频率感知学习是使专业级VFX大众化的关键催化剂。

英文摘要

Generating high-fidelity visual effects (VFX) typically demands massive datasets and prohibitive computational power due to the intricate coupling of spatial textures and temporal dynamics. In this paper, we introduce EasyVFX, a resource-efficient framework that achieves realistic VFX synthesis under stringent constraints. Our core philosophy lies in frequency-domain decomposition: we observe that the complexity of VFX can be significantly mitigated by decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics. This spectral disentanglement transforms a high-dimensional learning problem into manageable sub-tasks, thereby lowering the optimization barrier and reducing data dependency. Building upon this insight, we propose a two-stage training paradigm. First, we design a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture. By utilizing a soft routing mechanism, our model assigns specialized experts to distinct spectral bands, enabling them to cultivate robust priors for appearance and motion dynamics. This specialization allows the model to acquire foundational VFX knowledge with fewer GPU resources. Second, we introduce a Test-Time Training strategy powered by a novel Frequency-constraint Loss. This allows the pre-trained model to swiftly adapt to specific, unseen effects through localized optimizations, requiring only about 100 steps on a single GPU. Experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects, proving that frequency-aware learning is a key catalyst for democratizing professional-grade VFX.
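The frequency-domain decomposition at the core of this abstract can be sketched in one dimension. The sketch assumes a moving-average low-pass filter, with the high-frequency part defined as the residual, so the two bands sum back to the input exactly. The function name and the box filter are illustrative choices only; the paper may well use a different transform, such as an FFT- or wavelet-based split, on video latents.

```python
def split_frequencies(signal, radius=1):
    """Split a 1D signal into a low-frequency part (moving average) and a
    high-frequency residual, so that low[i] + high[i] == signal[i]."""
    n = len(signal)
    low = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high

# A slow step (global trend) with a fast alternating pattern on top.
signal = [0.0, 1.0, 0.0, 1.0, 4.0, 5.0, 4.0, 5.0]
low, high = split_frequencies(signal)
reconstructed = [l + h for l, h in zip(low, high)]
```

Because the split is lossless, each band could in principle be handed to a separate expert and the outputs recombined, which loosely mirrors the division of labor between appearance and motion that the abstract describes.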