2602.01124 2026-05-08 cs.LG

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

Lianying Chao, Linfeng Yin, Peiyu Ren, Yifan Jiang, Qiaoyu Ren, Dingcheng Shan, Jing-cheng Pang, Sijie Wu, Xubin Li, Kai Zhang, Xin Chen

Affiliations: GTS, AI Data Department, Huawei Technologies Co., Ltd.

AI summary: This paper proposes LFS, which learns to select temporally diverse, event-relevant frames to improve the detail quality of video captions, with significant gains on two benchmarks and ICH-CC.

Abstract

Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven distribution of events. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify a gap between existing benchmarks and human cognition. Thus, we introduce ICH-CC, built from questions carefully designed by annotators to reflect human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that captions enhanced with LFS lead to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.
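The contrast between uniform sampling and a stratified, importance-aware selection can be illustrated with a minimal sketch. This is not the authors' implementation: the per-frame scores are assumed to be given (in the paper they would come from the learned selector), and the function names `stratified_select` and `uniform_select` are hypothetical.

```python
def stratified_select(scores, k):
    """Pick one frame index per temporal stratum: the highest-scoring frame in it.

    `scores` is a hypothetical list of per-frame importance values.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    selected = []
    for s in range(k):
        # Stratum s covers frame indices [start, end); the strata tile 0..n without gaps.
        start = s * n // k
        end = (s + 1) * n // k
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    return selected


def uniform_select(n, k):
    """Baseline: the centre frame of each of k equal segments, ignoring content."""
    return [(2 * s + 1) * n // (2 * k) for s in range(k)]


scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.1, 0.2]
picked = stratified_select(scores, 4)
baseline = uniform_select(len(scores), 4)
```

Taking one frame per stratum keeps the selection spread over the whole video, while the score decides which frame inside each stratum is kept, so selected frames cannot all cluster around a single event.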

2601.12355 2026-05-08 cs.LG

Sensoformer: Robust Sim-to-Real Inference on Variable-Geometry Sensor Sets via Physics-Structured Randomization

Zhe Jia, Xiaotian Zhang, Junpeng Li

Affiliations: Institute for Geophysics, University of Texas at Austin

AI summary: Sensoformer uses physics-structured randomization to address the sim-to-real challenges of inference from sparse sensor arrays, infers high-dimensional physical states, and outperforms MPNNs and neural operators on seismic source inversion.

Abstract

Inferring high-dimensional physical states from sparse, ad-hoc sensor arrays is a fundamental challenge across AI for Science and industrial IoT. Standard machine learning architectures struggle in these domains due to irregular, variable-cardinality sensor geometries and the profound sim-to-real distribution shift caused by unmodeled physical heterogeneities. To address these challenges, we propose Sensoformer, a set-attention framework integrated with Physics-Structured Domain Randomization (PSDR). By explicitly randomizing the underlying physical dynamics (e.g., propagation media, extreme noise, and network availability dropout) rather than just visual features, PSDR enforces the learning of domain-invariant physical operators. Using seismic source inversion as a rigorous real-world testbed, Sensoformer is pre-trained on 100,000 synthetics and evaluated on a highly complex real-world catalog. We demonstrate that Sensoformer achieves state-of-the-art precision and outperforms Message Passing Neural Networks (MPNNs) and Neural Operators (e.g., DeepONet), which struggle with extreme spatial sparsity and mixed-modality inputs. Furthermore, interpretability analysis reveals that the attention mechanism autonomously discovers optimal experimental design principles, dynamically prioritizing sparse orthogonal sensors to overcome information bottlenecks.

2601.01400 2026-05-08 cs.CL

EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

Jicheng Ma, Guohua Wang, Xinhua Feng, Yiming Liu, Zhichao Hu, Yuhong Liu

Affiliations: School of Mathematics, Renmin University of China; Tencent

AI summary: This paper proposes an automated framework for evaluating mathematical reasoning that converts recent mathematical literature into executable tasks, builds the scalable EternalMath benchmark, and reveals the performance gap of LLMs on frontier mathematics.

Abstract

Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.

2601.00655 2026-05-08 cs.LG cs.AI

Interpretability-Guided Bi-objective Optimization: Aligning Accuracy and Explainability

Kasra Fouladi, Hamta Rahmani

Affiliations: Department of Computer Science, Iran University of Science and Technology

AI summary: This paper proposes the IGBO framework, which trains interpretable models by incorporating domain knowledge through a bi-objective formulation, encodes feature importance as a DAG, measures importance with TIG, introduces the relative importance score Hk(X, θ), and proves convergence to Pareto-stationary points.

Comments: 12 pages

Abstract

This paper introduces Interpretability-Guided Bi-objective Optimization (IGBO), a framework that trains interpretable models by incorporating structured domain knowledge via a bi-objective formulation. IGBO encodes feature importance hierarchies as a Directed Acyclic Graph (DAG) via Central Limit Theorem-based construction and uses Temporal Integrated Gradients (TIG) to measure feature importance. The framework employs a novel Relative Importance Score Hk(X, θ) that quantifies the normalized cumulative attribution of each feature over time. We propose a geometric projection mapping P for combining task and interpretability gradients, and prove convergence to Pareto-stationary points. To address the Out-of-Distribution problem in TIG computation, we outline an Optimal Path Oracle architecture, which we leave for future work. Central Limit Theorem-based construction of the interpretability DAG provides statistical guarantees on acyclicity and transitivity, with an unconditional guarantee for the median threshold and conditional guarantees for higher confidence levels.
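For readers unfamiliar with the convergence target, the standard two-objective notion of Pareto stationarity can be written out as below. The symbols $L_{\mathrm{task}}$ and $L_{\mathrm{int}}$ (task loss and interpretability loss) are notation assumed here for illustration; the abstract does not give the paper's own notation, and the projection mapping P is not reproduced.

```latex
\theta^{*}\ \text{is Pareto-stationary}
\iff
\exists\, \lambda \in [0,1] :\quad
\lambda\, \nabla_{\theta} L_{\mathrm{task}}(\theta^{*})
+ (1-\lambda)\, \nabla_{\theta} L_{\mathrm{int}}(\theta^{*}) = 0 .
```

In words, no direction decreases both objectives to first order, because some convex combination of the two gradients vanishes.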

2512.22991 2026-05-08 cs.LG

Fusion or Confusion? Multimodal Complexity Is Not All You Need

Tillmann Rheude, Roland Eils, Benjamin Wild

Affiliations: Berlin Institute of Health, Charité - Universitätsmedizin Berlin; Intelligent Medicine Institute, Fudan University; Department of Mathematics and Computer Science, Freie Universität Berlin

AI summary: Through large-scale experiments, this paper challenges the assumption that complex architectures improve performance in multimodal learning, finds that added complexity often produces confusion rather than effective fusion, and argues for a shift toward methodological rigor.

Abstract

Multimodal learning has become a prominent research area, with the potential of substantial performance gains by combining information across modalities. At the same time, model development has trended toward increasingly complex deep learning architectures, motivated by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study by reimplementing 19 high-impact multimodal methods across nine diverse datasets with up to 23 modalities. Under standardized experimental conditions, including hyperparameter tuning, weight initialization, cross-validation, and statistical testing, increased multimodal complexity often yields confusion rather than effective fusion of data modalities. Accordingly, complex multimodal architectures do not reliably outperform unimodal baselines and a Simple Baseline for Multimodal Learning (SimBaMM). Through a focused case study, we further demonstrate concrete methodological shortcomings even in top-tier multimodal learning publications, underscoring the need for standardized evaluation practices. In summary, we argue for a shift in focus for multimodal learning: away from the pursuit of architectural novelty and toward methodological rigor.

2512.20854 2026-05-08 cs.CL cs.IR

How important is Recall for Measuring Retrieval Quality?

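Shelly Schwartz, Oleg Vasilyev, Randy Sawaya

Affiliations: Primer Technologies Inc.

AI summary: This paper evaluates how to measure retrieval quality in large, dynamic knowledge bases by using an LLM to judge response quality, and proposes an efficient method that does not require knowing the total number of relevant documents.

Comments: Dataset: https://huggingface.co/datasets/primer-ai/retrieval-response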

2512.18181 2026-05-08 cs.CV

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He

Affiliations: Renmin University of China; AMAP, Alibaba Group; Wuhan University; Tsinghua University

AI summary: This paper proposes MACE-Dance, a framework built on a cascaded mixture of experts for high-quality music-driven dance video generation; a motion expert generates 3D motion and an appearance expert generates video, improving visual consistency and motion realism.

Comments: Accepted by SIGGRAPH 2026

Abstract

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.

2512.18034 2026-05-08 cs.AI

Accelerating Discrete Facility Layout Optimization: A Hybrid CDCL and CP-SAT Architecture

Joshua Gibson, Kapil Dhakal

Affiliations: University of Alabama in Huntsville

AI summary: This paper proposes a hybrid CDCL and CP-SAT architecture that exploits CDCL's fast feasibility detection to speed up discrete layout optimization and relies on CP-SAT for exact optimization.

Abstract

Discrete facility layout design involves placing physical entities to minimize handling costs while adhering to strict safety and spatial constraints. This combinatorial problem is typically addressed using Mixed Integer Linear Programming (MILP) or Constraint Programming (CP), though these methods often face scalability challenges as constraint density increases. This study systematically evaluates the potential of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as an alternative computational engine for discrete layout problems. Using a unified benchmarking harness, we conducted a controlled comparison of CDCL, CP-SAT, and MILP across varying grid sizes and constraint densities. Experimental results reveal a distinct performance dichotomy: while CDCL struggles with optimization objectives due to cost-blind branching, it demonstrates unrivaled dominance in feasibility detection, solving highly constrained instances orders of magnitude faster than competing paradigms. Leveraging this finding, we developed a novel "Warm-Start" hybrid architecture that utilizes CDCL to rapidly generate valid feasibility hints, which are then injected into a CP-SAT optimizer. Our results confirm that this layered approach successfully accelerates exact optimization, using SAT-driven pruning to bridge the gap between rapid satisfiability and proven optimality.

2512.13281 2026-05-08 cs.CV

VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin

Affiliations: The Chinese University of Hong Kong; National University of Singapore; Peking University; Monash University; Video Rebirth; University of Oxford

AI summary: VideoASMR-Bench evaluates the detection of AI-generated ASMR videos through fine-grained audio-visual perception and sensory immersion, revealing the shortcomings of current VLMs at identifying such videos and the ability of VGMs to generate realistic ones.

Comments: Code is at https://github.com/video-reality-test/video-reality-test, page is at https://video-reality-test.github.io/

Abstract

With AI-generated videos increasingly indistinguishable from reality, current benchmarks primarily focus on broad semantic alignment and basic physical consistency, offering limited discriminative power for evaluating them. To address this, we introduce VideoASMR-Bench, a benchmark based on Autonomous Sensory Meridian Response (ASMR) videos that emphasizes fine-grained audio-visual perception and sensory immersion. This benchmark aims to answer two key questions: (i) Are today's video understanding models (VLMs) sensitive enough to detect AI-generated ASMR videos by recognizing minor visual, physical, or auditory artifacts? (ii) Can today's video generation models (VGMs) produce convincing ASMR videos with immersive experiences? This benchmark comprises a diverse set of 1,500 high-quality real ASMR videos curated from social media, alongside 2,235 synthetic counterparts generated by nine VGMs. Additionally, we open-source an extensible suite of prompts and reference images, enabling the benchmark to scale dynamically with future video models. Moreover, we design an automatic understanding-generation evaluation framework between VGMs and VLMs, where VGMs aim to produce realistic fake videos to fool the VLMs, while the VLMs seek to detect them, forming an adversarial game between the two parties. Our evaluation on VideoASMR-Bench reveals that even state-of-the-art VLMs, such as Gemini-3-Pro, fail to reliably detect AI-generated ASMR videos. Meanwhile, current frontier video generation models can produce ASMR videos that are difficult for VLMs to distinguish from real ones, while humans can still identify them relatively easily.

2512.10248 2026-05-08 cs.CV cs.AI

Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

Rochana Chaturvedi, Yue Zhou, Andrew D. Boyd, Brian T. Layden, Mudassir Rashid, Lu Cheng, Ali Cinar, Barbara Di Eugenio

Affiliations: Kellogg School of Management, Northwestern University; University of Illinois Chicago; Illinois Institute of Technology

AI summary: This paper proposes two methods for early screening of type 2 diabetes: HiTGNN models patient trajectories with a temporal graph neural network, and ReVeAL is a lightweight framework that improves predictive accuracy and fairness.

Abstract

Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight test-time framework that distills LLMs' reasoning into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.

2511.21471 2026-05-08 cs.AI

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

Affiliations: Sun Yat-Sen University; HKUST (GZ); Zhejiang University; Peking University; CAICT; UCAS; CUC

AI summary: This paper proposes the SpatialBench benchmark, which evaluates the spatial abilities of multimodal large language models through a five-level spatial cognition framework and reveals performance differences between perception and symbolic reasoning.

Abstract

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

2511.19972 2026-05-08 cs.CV

Boosting Reasoning in Large Multimodal Models via Activation Replay

Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang

Affiliations: Nanyang Technological University; National University of Singapore; Tencent Youtu Lab; Zhejiang University; Fudan University

AI summary: This paper uses Activation Replay to improve the multimodal reasoning of large models, manipulating low-entropy activations to strengthen reasoning, and validates the method on mathematics, visual agents, and video reasoning.

Comments: CVPR 2026

Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), but the underlying mechanisms behind this post-training paradigm remain poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of the logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate through controlled experiments that such phenomena are associated with LMM reasoning, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel, simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate their RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code is publicly available at https://github.com/latentcraft/replay.

2511.00751 2026-05-08 cs.AI cs.CL

Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs

Chiyan Loo

Affiliations: Chiyan Loo

AI summary: The study argues that as models grow stronger, self-consistency becomes inefficient and may even degrade performance; experiments show that increasing the number of reasoning paths yields limited accuracy gains while costs rise substantially, and the authors recommend multi-path sampling only for problems beyond a model's single-pass reliability.

Comments: 7 pages, 3 figures

Abstract

Self-consistency -- sampling multiple reasoning paths and selecting the most frequent answer -- was designed for an era when language models made frequent, unpredictable errors. This study argues that the technique has become increasingly wasteful as models grow stronger, and may degrade performance on problems that modern models already solve reliably. Using Gemini 2.5 models on HotpotQA and MATH-500, we show that accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count. Critically, performance plateaued early and in some configurations declined at high sample counts, suggesting that additional paths introduce noise rather than signal when models already solve problems reliably. As inference costs rise with model scale, indiscriminate self-consistency is difficult to justify. We recommend reserving multi-path sampling for problems that demonstrably exceed a model's single-pass reliability.
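The technique under discussion reduces to a majority vote over sampled answers, with cost growing in step with the number of samples. The sketch below shows only that aggregation step and a simple cost model; the sampled answers are assumed to come from model calls that are not shown, and the names `self_consistency`, `token_cost`, and `tokens_per_path` are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer among sampled reasoning paths."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    answer, _ = counts.most_common(1)[0]
    return answer


def token_cost(num_samples, tokens_per_path):
    """Idealized cost model: every sampled path costs the same number of tokens."""
    return num_samples * tokens_per_path


sampled = ["42", "41", "42", "42", "17"]  # hypothetical final answers from 5 paths
voted = self_consistency(sampled)
cost_1 = token_cost(1, 500)    # single pass
cost_20 = token_cost(20, 500)  # 20 sampled paths
```

Under this cost model, 20 paths cost 20 times a single pass, which is why the small accuracy gains reported in the abstract are hard to justify when one pass is already reliable.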

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Action-to-Action Flow Matching

Parity, Sensitivity, and Transformers

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

AROpt: An Optimization Method for Autoregressive Time Series Forecasting

SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing

ChronoSpike: An Adaptive Spiking Graph Neural Network for Dynamic Graphs

Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

Tree-Structured Synergy of Large Language Models and Bayesian Optimization for Efficient CASH

Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

Sensoformer: Robust Sim-to-Real Inference on Variable-Geometry Sensor Sets via Physics-Structured Randomization

EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

Interpretability-Guided Bi-objective Optimization: Aligning Accuracy and Explainability

Fusion or Confusion? Multimodal Complexity Is Not All You Need

How important is Recall for Measuring Retrieval Quality?

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Accelerating Discrete Facility Layout Optimization: A Hybrid CDCL and CP-SAT Architecture

VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

Greedy Alignment Principle for Optimizer Selection

LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Boosting Reasoning in Large Multimodal Models via Activation Replay

Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs