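arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.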

2603.24324 2026-06-02 cs.LG cs.AI cs.SY eess.SY

Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning


Dogan Urgun, Gokhan Gungor

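Affiliations: Department of Electrical and Electronics Engineering; Karabuk University; Department of Mechatronics Engineering

AI summary: Proposes using large language models to automatically generate executable reward programs, combined with Multi-Agent Proximal Policy Optimization (MAPPO) training, which markedly improves cooperative task returns in the Overcooked-AI environment.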

Abstract

Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient grounding for coordinated behavior. This study introduces an autonomous reward design framework that uses large language models (LLMs) to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and trains policies from scratch using Multi-Agent Proximal Policy Optimization (MAPPO) under a fixed computational budget. Candidates are then evaluated on their performance, and selection across generations is based solely on the sparse task returns. The framework is evaluated in four Overcooked-AI layouts characterized by varying levels of corridor congestion, handoff dependencies, and structural asymmetries. The proposed reward design approach consistently yields higher task returns and delivery counts, with the most pronounced gains observed in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components reveals stronger interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the proposed LLM-guided reward search framework mitigates the need for manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
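The search procedure in this abstract can be pictured as a generational loop. The sketch below is only an illustration under strong assumptions, not the authors' implementation: `propose_candidates` is a hypothetical stand-in for the LLM, `train_and_evaluate` is a hypothetical stand-in for MAPPO training under a fixed budget, and a candidate reward program is reduced to a single shaping weight. The only part that mirrors the text is the selection rule, which keeps a candidate solely when its sparse task return improves.

```python
import random


def propose_candidates(parent_weight, count, rng):
    """Hypothetical stand-in for the LLM: perturb the parent's shaping weight."""
    return [parent_weight + rng.gauss(0.0, 0.5) for _ in range(count)]


def is_valid(weight):
    """Hypothetical validity envelope: reject weights outside a fixed range."""
    return 0.0 <= weight <= 5.0


def train_and_evaluate(weight, rng):
    """Hypothetical stand-in for fixed-budget training.

    Returns a noisy sparse task return that peaks at weight 2.0.
    """
    return -((weight - 2.0) ** 2) + rng.gauss(0.0, 0.01)


def reward_search(generations, population, seed=0):
    rng = random.Random(seed)
    best_weight = 0.0  # start from "no shaping"
    best_return = train_and_evaluate(best_weight, rng)
    for _ in range(generations):
        for weight in propose_candidates(best_weight, population, rng):
            if not is_valid(weight):
                continue  # discard candidates outside the envelope
            task_return = train_and_evaluate(weight, rng)
            # Selection uses the sparse task return only, never the shaped reward.
            if task_return > best_return:
                best_weight, best_return = weight, task_return
    return best_weight, best_return


best_weight, best_return = reward_search(generations=5, population=8)
print(f"selected weight={best_weight:.2f}, sparse return={best_return:.3f}")
```

Selecting on the sparse return keeps a candidate from winning merely because its own shaping terms are easy to maximize.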


2603.10742 2026-06-02 cs.LG

A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time


Simon Roth

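Affiliations: GitHub

AI summary: Proposes a directed-acyclic-graph grammar with eight typed primitives and four hard constraints; through an evaluation/assessment boundary that is enforced at call time for the first time, the most severe classes of data leakage become structurally unrepresentable within the scope of the grammar.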

Comments 40 pages, v1.3. Two maintained implementations: Python (PyPI: mlw), R (CRAN: ml), Code under github.com/epagogy/ml

2604.03893 2026-06-02 cs.AI

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning


Zeyu Wang, Jingye Xu, Xiaogang Li, Peiyao Xiao, Qinhao Kong, Ben Wang, Chengliang Xu, Zichao Chen, Bing Zhao, Hu Wei

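Affiliations: Alibaba Group; Skylenage

AI summary: Introduces the FeynmanBench benchmark of more than 2,000 Feynman-diagram tasks to evaluate multimodal large models on global structural reasoning such as topology, conservation constraints, and visual-to-algebraic mapping; models perform well on local recognition but fall far short on topological reconstruction and algebraic derivation.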

Comments 9 pages, 5 figures

Abstract

Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction: models recognize symbols and values and then perform textual inference. They do not assess whether models can reason over the global structural properties of formal diagrams, such as topology, conservation constraints, and the consistent mapping between visual patterns and algebraic expressions. We introduce FeynmanBench, a benchmark of over 2,000 tasks centered on Feynman diagrams spanning the electromagnetic, weak, and strong interactions of the Standard Model. Each instance couples a diagram image with minimal textual conventions and requires models to recover the full physical content: vertex inventory, propagator types, topological connectivity, momentum routing, and the complete scattering amplitude. An automated generation and verification pipeline produces the diagrams, annotations, and reference answers under standardized rules. Evaluating 19 state-of-the-art multimodal LLMs, we find a consistent failure pattern: models achieve 70-95% on local recognition (vertex and propagator identification) but collapse to 13-17% on topological reconstruction (CP3), and near zero on full algebraic derivation (CP5). FeynmanBench offers a controlled testbed for multimodal reasoning over formal scientific diagrams and highlights fundamental limitations of current architectures in topology-sensitive scientific reasoning.


2604.03789 2026-06-02 cs.LG cs.AI

Automated Conjecture Resolution with Formal Verification


Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Shurui Liu, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong

Affiliations: School of Mathematical Sciences, Peking University; Westlake Institute for Advanced Study, Westlake University; School of Mathematics, Tianjin University; Research Institute for Mathematical Sciences, Kyoto University; Department of Mathematics, Stanford University; IQuest Research; New Cornerstone Science Laboratory, School of Mathematical Sciences, Peking University; Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University; Center for Machine Learning Research, Peking University; Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University; Zhongguancun Academy

AI summary: Proposes an automated framework that integrates informal reasoning with formal verification, using two components, Rethlas and Archon, to tackle research-level mathematical problems; it resolves an open problem in commutative algebra and formally verifies the proof in Lean 4.

Comments Code and resources are available at: Rethlas (https://github.com/frenzymath/Rethlas), Rethlas Results (https://github.com/frenzymath/Rethlas_results), Archon (https://github.com/frenzymath/Archon), and the formalization results (https://github.com/frenzymath/Anderson-Conjecture)

Abstract

Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research-level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework that integrates natural language reasoning with formal verification to tackle research-level mathematical problems. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas combines reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with LeanSearch, translates informal arguments into formalized Lean 4 projects through task decomposition, iterative refinement, and automated proof synthesis, ensuring machine-checkable correctness. Using this framework, we resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Additional case studies illustrate the capabilities of Rethlas in informal mathematical reasoning and discovery, as well as the ability of Archon to formalize research-level proofs in Lean 4. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, while the formal agent can autonomously fill nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, reduce human effort, and support human-AI collaborative mathematical research.


2602.00906 2026-06-02 cs.LG cs.AI cs.CL cs.DS cs.IT math.IT

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing


Anxin Guo, Jingwei Li

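Affiliations: arXiv.org

AI summary: Formalizes hallucination as a membership testing problem and establishes a rate-distortion theorem, proving that under limited capacity the information-theoretically optimal strategy necessarily assigns high confidence to some non-facts, which produces hallucination.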

Comments ICML 2026

Abstract

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination under an idealized setting: even with optimal training, perfect data, and a simplified "closed world" setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on both synthetic and real-world data, showing that hallucinations persist as a natural consequence of lossy compression. The same theorem recovers and sharpens classical space lower bounds for Bloom-type filters, pinning down an additive constant left open for two-sided filters.
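The Bloom-filter analogy in this abstract can be made concrete with a small experiment. The sketch below is a textbook Bloom filter, not the paper's construction; the sizes (`num_bits=4000`, `num_hashes=3`) and the synthetic "fact" and "claim" strings are illustrative assumptions. It shows the one-sided behaviour the abstract alludes to: every stored fact is recognised, while a limited memory budget forces the filter to accept some claims that were never stored.

```python
import hashlib


class BloomFilter:
    """A standard Bloom filter backed by a byte array (one byte per bit, for clarity)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


facts = [f"fact-{i}" for i in range(1000)]        # stored items
non_facts = [f"claim-{i}" for i in range(10000)]  # never stored

bloom = BloomFilter(num_bits=4000, num_hashes=3)
for fact in facts:
    bloom.add(fact)

false_negatives = sum(fact not in bloom for fact in facts)
false_positive_rate = sum(claim in bloom for claim in non_facts) / len(non_facts)
print(f"false negatives: {false_negatives}")
print(f"false positive rate: {false_positive_rate:.3f}")
```

With 4 bits per stored item the filter cannot separate facts from non-facts perfectly, so it confidently accepts a fraction of unseen claims; giving it more bits lowers that fraction but does not remove it.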


2604.02941 2026-06-02 cs.CV

MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion


Bin Liu, Zhixiang Xiong, Zhifen He, Bo Li

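Affiliations: IEEE Publication Technology Group; Piscataway, NJ

AI summary: Proposes MMTalker, a speech-driven 3D facial animation synthesis method based on multi-resolution representation and multimodal feature fusion; it uses mesh parameterization, non-uniform differentiable sampling, a residual graph convolutional network, and a dual cross-attention mechanism to achieve high lip-sync accuracy and realistic facial expressions.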

Comments This article presents only the preliminary research results, which are not yet complete and lack necessary supplementary experiments. The author has decided to withdraw it to improve the research work, and will submit a more complete version in the future

Abstract

Speech-driven three-dimensional (3D) facial animation synthesis aims to build a mapping from one-dimensional (1D) speech signals to time-varying 3D facial motion signals. Current methods still face challenges in maintaining lip-sync accuracy and producing realistic facial expressions, primarily due to the highly ill-posed nature of this cross-modal mapping. In this paper, we introduce MMTalker, a novel 3D audio-driven facial animation synthesis method based on multi-resolution representation and multi-modal feature fusion that can accurately reconstruct the rich details of 3D facial motion. We first achieve a detailed continuous representation of the 3D face by mesh parameterization and non-uniform differentiable sampling. The mesh parameterization technique establishes the correspondence between the UV plane and the 3D facial mesh and is used to provide ground truth for the continuous learning. Differentiable non-uniform sampling enables precise facial detail acquisition by assigning a learnable sampling probability to each triangular face. Next, we employ a residual graph convolutional network and a dual cross-attention mechanism to extract discriminative facial motion features from multiple input modalities. The proposed multimodal fusion strategy makes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of the facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the sampled points in the canonical UV space and the encoded facial motion features. Comprehensive experiments demonstrate that significant improvements are achieved over state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.


2604.02878 2026-06-02 cs.RO cs.SY eess.SY

An Asynchronous Two-Speed Kalman Filter for Real-Time UUV Cooperative Navigation Under Acoustic Delays


Shuyue Li, Miguel López-Benítez, Eng Gee Lim, Fei Ma, Qian Dong, Mengze Cao, Limin Yu, Xiaohui Qin

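Affiliations: Xi’an Jiaotong-Liverpool University; Suzhou Municipal Key Laboratory Broadband Wireless Access Technology; XJTLU-JITRI Academy

AI summary: To address the difficulty of real-time state estimation under underwater acoustic communication delays, proposes an asynchronous Two-Speed Kalman Filter (TSKF) that uses a Variational History Distillation (VHD) mechanism to decouple the estimation process, handling high-frequency real-time control alongside delayed collaborative information, and keeps trajectory error comparable to batch-optimization methods under severe delays.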

Comments 6 pages, 6 figures. Accepted for publication in the 2026 IEEE International Conference on Industrial Informatics (INDIN). © 2026 IEEE. Personal use of this material is permitted. See PDF for the full IEEE copyright notice

Abstract

In Global Navigation Satellite System (GNSS)-denied underwater environments, individual unmanned underwater vehicles (UUVs) suffer from unbounded dead-reckoning drift, making collaborative navigation (CN) crucial for accurate state estimation. However, the severe communication delay inherent in underwater acoustic channels poses serious challenges to real-time state estimation. Traditional filters, such as Extended Kalman Filters (EKFs) or Unscented Kalman Filters (UKFs), usually block the main control loop while waiting for delayed data, or effectively discard Out-of-Sequence Measurements (OOSMs), resulting in serious drift. To address this, we propose an Asynchronous Two-Speed Kalman Filter (TSKF) enhanced by a novel projection mechanism, which we term Variational History Distillation (VHD). The proposed architecture decouples the estimation process into two parallel threads: a fast-rate thread that utilizes Gaussian Process (GP) compensated dead reckoning to guarantee high-frequency real-time control, and a slow-rate thread dedicated to processing asynchronously delayed collaborative information. By introducing a Finite-Length Circular State Buffer (FLCSB), the algorithm applies delayed measurements to their corresponding historical states, and utilizes a VHD-based projection to fast-forward the correction to the current time without computationally heavy recalculations. Simulation results demonstrate that the proposed TSKF maintains a trajectory error comparable to computationally intensive batch-optimization methods under severe delays (up to 30 s). Executing in sub-millisecond time, it significantly outperforms standard EKF/UKF. The results demonstrate an effective control, communication, and computing (3C) co-design that significantly enhances the resilience of autonomous marine automation systems.
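The idea of applying a delayed measurement to a buffered historical state and then projecting the correction to the present can be illustrated on a toy model. The sketch below is not the authors' VHD mechanism: it assumes a scalar dead-reckoning model `x[t+1] = x[t] + u[t] + noise` with no other measurements arriving during the delay, in which case the correction computed at the historical state can be added directly to the current estimate. All numbers (`q`, `r`, `controls`, the measurement `z`) are hypothetical. The replay loop at the end is included only to check that the shortcut matches a full recomputation.

```python
from collections import deque


def predict(x, p, u, q):
    """Dead-reckoning step for the scalar model x <- x + u, with process noise q."""
    return x + u, p + q


def update(x, p, z, r):
    """Standard scalar Kalman update for a direct position measurement z."""
    k = p / (p + r)
    return x + k * (z - x), (1.0 - k) * p


def delayed_update_by_projection(x_now, p_now, x_hist, p_hist, z, r):
    """Apply a delayed measurement using only the buffered historical state.

    Valid for this scalar model because the current state equals the
    historical state plus the accumulated controls and noise.
    """
    k = p_hist / (p_hist + r)
    return x_now + k * (z - x_hist), p_now - k * p_hist


q, r = 0.01, 0.25
controls = [1.0, 0.5, -0.2, 0.8, 0.3]

x, p = 0.0, 1.0
history = deque(maxlen=16)  # finite-length circular buffer of (step, x, p)
history.append((0, x, p))
for step, u in enumerate(controls, start=1):
    x, p = predict(x, p, u, q)
    history.append((step, x, p))

# A measurement taken at step 2 arrives only now, at step 5.
z, stamp = 1.7, 2
_, x_hist, p_hist = next(entry for entry in history if entry[0] == stamp)
x_proj, p_proj = delayed_update_by_projection(x, p, x_hist, p_hist, z, r)

# Reference: update at step 2, then replay the remaining controls.
x_ref, p_ref = update(x_hist, p_hist, z, r)
for u in controls[stamp:]:
    x_ref, p_ref = predict(x_ref, p_ref, u, q)

print(f"projected: x={x_proj:.4f}, p={p_proj:.4f}")
print(f"replayed:  x={x_ref:.4f}, p={p_ref:.4f}")
```

The `deque` with `maxlen` discards the oldest entries automatically, so a measurement older than the buffer length can no longer be matched to a historical state.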


2603.03312 2026-06-02 cs.CL cs.AI cs.HC eess.AS q-bio.NC

Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding


Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li

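Affiliations: The Hong Kong University of Science and Technology; Shenzhen Loop Area Institute

AI summary: To address semantic bias, signal neglect, and the BLEU trap in EEG-to-text decoding, proposes SemKey, a multi-stage framework that uses decoupled semantic objectives (sentiment, topic, length, surprisal) and an active retrieval decoding mechanism to force generation to rely on the signal rather than linguistic priors, and establishes an evaluation protocol with retrieval and distribution-based metrics (such as Fréchet distance); it effectively mitigates hallucination and reaches state-of-the-art performance.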

Abstract

Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental issues: Semantic Bias, where outputs collapse into generic linguistic templates; Signal Neglect, where models rely heavily on LLM priors to hallucinate fluent text even in the absence of meaningful signals; and the "BLEU Trap", where high-frequency stopwords inflate n-gram metrics, masking a lack of true semantic fidelity. To resolve these challenges, we move beyond conventional end-to-end pipelines and propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We extract these semantic anchors from EEG embeddings directly, then unify them with an Active Retrieval Decoding mechanism, compelling the LLM to ground its token generation in the neural signals rather than defaulting to linguistic priors. Furthermore, we break the BLEU Trap by establishing a comprehensive evaluation protocol using rigorous retrieval and distribution-based metrics such as Fréchet Distance. Extensive experiments demonstrate that SemKey effectively mitigates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.
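The "BLEU Trap" described in this abstract can be reproduced with a few lines of arithmetic. The sketch below computes clipped unigram precision, which is only one ingredient of BLEU (no higher-order n-grams, no brevity penalty); the stopword list and both sentences are made-up examples, not data from the paper. A generic sentence that shares no content words with the reference still scores well until stopwords are removed.

```python
from collections import Counter

# Illustrative stopword list; real evaluations use larger lists.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "was", "it"}


def unigram_precision(candidate, reference, drop_stopwords=False):
    """Clipped unigram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if drop_stopwords:
        cand = [w for w in cand if w not in STOPWORDS]
        ref = [w for w in ref if w not in STOPWORDS]
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    return clipped / len(cand)


reference = "the film was a moving portrait of the war and it is worth seeing"
generic = "the book was a part of the city and it is in the house"

with_stopwords = unigram_precision(generic, reference)
content_only = unigram_precision(generic, reference, drop_stopwords=True)
print(f"with stopwords: {with_stopwords:.2f}")
print(f"content words only: {content_only:.2f}")
```

Here 8 of the 14 candidate tokens are matched stopwords, so the score with stopwords is above 0.5 even though the content-word overlap is zero.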


2601.16884 2026-06-02 cs.LG cs.NA math.NA stat.ML

Multigrade Neural Network Approximation


Shijun Zhang, Zuowei Shen, Yuesheng Xu

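Affiliations: Department of Applied Mathematics, Hong Kong Polytechnic University; Department of Mathematics, National University of Singapore; Department of Mathematics and Statistics, Old Dominion University

AI summary: Proposes the multigrade deep learning (MGDL) framework, which freezes earlier grades and trains each new subnetwork to fit the residual, yielding structured error refinement, and proves that fixed-width multigrade ReLU networks can uniformly approximate continuous functions.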

Abstract

We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly nonconvex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably certain one-hidden-layer ReLU models, training admits convex reformulations with global guarantees under appropriate settings, motivating learning paradigms that improve stability while scaling to depth. MGDL builds on this insight by training deep networks grade by grade: previously learned grades are frozen, and each newly added grade-wise subnetwork is composed on top of the previously learned grades and trained to fit the residual left by the current approximation, yielding a structured and interpretable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function defined on a hypercube, there exists a fixed-width multigrade ReLU scheme whose residuals are pointwise nonincreasing in magnitude and converge uniformly to zero, with strict $L^p$-norm decay at every nontrivial grade for $p\in [1,\infty)$. To the best of our knowledge, this work provides the first rigorous constructive approximation guarantee showing that a grade-wise residual refinement scheme can achieve vanishing error in a fixed-width multigrade ReLU architecture.
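The grade-by-grade training loop described in this abstract can be sketched in NumPy. This is a deliberately simplified illustration, not the construction analysed in the paper: each grade's hidden weights are drawn at random and frozen (an assumption made to keep the example short), and only the linear output layer of each grade is fitted, by least squares, to the residual left by the earlier grades. Because a zero correction is always available to the least-squares fit, the training residual cannot grow from one grade to the next; the paper's pointwise and uniform guarantees are stronger statements that this sketch does not reproduce.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def fit_grades(x, y, num_grades=3, width=32, seed=0):
    """Fit y grade by grade; returns the prediction and the RMS residual per grade."""
    rng = np.random.default_rng(seed)
    features = x                      # shape (n_samples, n_inputs)
    residual = y.astype(float).copy()
    prediction = np.zeros_like(residual)
    errors = [float(np.sqrt(np.mean(residual ** 2)))]
    for _ in range(num_grades):
        fan_in = features.shape[1]
        # New hidden layer, composed on top of the frozen earlier grades.
        weights = rng.normal(size=(fan_in, width)) / np.sqrt(fan_in)
        bias = rng.normal(size=width)
        features = relu(features @ weights + bias)
        # Fit this grade's output layer to the current residual.
        design = np.column_stack([features, np.ones(len(features))])
        coef, *_ = np.linalg.lstsq(design, residual, rcond=None)
        correction = design @ coef
        prediction += correction
        residual = residual - correction
        errors.append(float(np.sqrt(np.mean(residual ** 2))))
    return prediction, errors


x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.sin(3.0 * x[:, 0])
prediction, errors = fit_grades(x, y)
print("RMS residual after each grade:", [round(e, 4) for e in errors])
```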


2604.01841 2026-06-02 cs.AI

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints


Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica

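Affiliations: University of Cambridge

AI summary: To address high dimensionality, heterogeneity, class imbalance, and distribution shift in electronic health records, proposes AWARE, a task-aligned retrieval framework that uses supervised embedding learning and lightweight adapters to improve tabular in-context learning, raising AUPRC by up to 12.2% under extreme imbalance.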

Comments Not peer-reviewed

DOI: 10.21203/rs.3.rs-9085469/v1

Abstract

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.


2604.01802 2026-06-02 cs.LG

Real-Time Sensing of Inaccessible Physical Fields via an Edge-Deployable Hardware-Portable Graph Neural Operator


William Howes, Jason Yoo, Kazuma Kobayashi, Subhankar Sarkar, Farid Ahmed, Souvik Chakraborty, Syed Bahauddin Alam

Affiliations: Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering Department, University of Illinois Urbana-Champaign; Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi; Department of Applied Mechanics, Indian Institute of Technology Delhi; National Center for Supercomputing Applications

AI summary: Proposes VIRSO, a neural operator with a distinctive spatial-spectral architecture that, through hardware co-design, performs real-time inference of interior multiphysics fields from sparse boundary observations on edge devices, reducing energy consumption while maintaining high accuracy.

Comments 36 pages, 5 figures, 16 tables

Abstract

Real-time inference of inaccessible interior physical fields from sparse boundary observations is a fundamental but unresolved problem in scientific machine learning, with direct relevance to safety-critical monitoring across many engineering applications. Existing neural operators achieve high accuracy but leave deployment to embedded edge platforms unaddressed. Here we introduce VIRSO (Virtual Irregular Real-Time Sparse Operator), the first neural operator with a unique spatial-spectral architecture that explicitly addresses edge-deployment hardware. VIRSO learns a nonlinear mapping from sparse, geometrically disjoint boundary inputs to spatially continuous interior multiphysics fields on irregular unstructured meshes through a spectral-spatial decomposition explicitly aligned with hardware execution: a compute-bound graph spectral pathway and a memory-bandwidth-bound spatial-aggregation pathway, each independently characterized on datacenter and embedded accelerators. The design reduces the inference energy-delay product by 29$\times$ relative to the vanilla graph-operator baseline (206 J$\cdot$ms $\to$ 7.0 J$\cdot$ms on an NVIDIA H200) and enables 17.0 samples/s embedded inference on an NVIDIA Jetson Orin Nano within 7.06 W board-level power, without modification. A mesh-density-adaptive graph construction strategy (V-KNN) simultaneously improves accuracy and reduces graph edge count by 34%. Across three benchmarks with reconstruction ratios from 47:1 to 156:1, VIRSO achieves mean relative $L_2$ errors below 1% with fewer parameters than operator baselines and delivers an inference speedup of $\approx 10^4$ times over the high-fidelity reference solver. To our knowledge, this is the first demonstration of a single-digit-watt neural operator, establishing hardware co-design as a missing ingredient in operator-based inference and a tractable path to real-time deployment.
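The headline efficiency figures in this abstract follow from simple arithmetic, reproduced below from the quoted numbers only. The energy-delay product (EDP) is energy multiplied by latency, so the reduction factor is the ratio of the two reported EDP values. The per-sample energy on the Jetson is an estimate that assumes the board draws the full 7.06 W during inference, which the abstract states only as an upper bound ("within 7.06 W").

```python
def edp_reduction(baseline_edp, new_edp):
    """Ratio of two energy-delay products given in the same units."""
    return baseline_edp / new_edp


def energy_per_sample(power_watts, samples_per_second):
    """Joules per sample, assuming the stated power is drawn continuously."""
    return power_watts / samples_per_second


# Reported on an NVIDIA H200: 206 J*ms for the baseline, 7.0 J*ms for VIRSO.
reduction = edp_reduction(206.0, 7.0)

# Reported on a Jetson Orin Nano: 17.0 samples/s within 7.06 W board-level power.
joules = energy_per_sample(7.06, 17.0)

print(f"EDP reduction: {reduction:.1f}x")        # about 29.4x, quoted as 29x
print(f"energy per sample: at most {joules:.3f} J")
```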


2511.06676 2026-06-02 cs.CL cs.CY cs.HC

How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models


Subhojit Ghimire

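Affiliations: University of California, Berkeley

AI summary: Using a quantitative benchmark and an interactive pedagogical tool, reveals systematic bias in an automated toxicity model against African-American English text, and stresses that the human-set sensitivity threshold is what operationalizes the discrimination.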

Comments 9 pages, 5 figures, 4 tables, 14 references. Preliminary abstract presented at the International Conference on Envisioning the Himalayan Future: Pathways to Sustainability and Development (PUiCON 2026) p. 105; abstract available online at: https://pufoe.edu.np/wp-content/uploads/2026/05/PUiCON_2026_Book_of-_Abstracts.pdf

Abstract

Now that AI-driven moderation has become pervasive in everyday life, we often hear claims that "the AI is biased". While this is often said jokingly, the light-hearted remark reflects a deeper concern. How can we be certain that an online post flagged as "inappropriate" was not simply the victim of a biased algorithm? This paper investigates this problem using a dual approach. First, I conduct a quantitative benchmark of a widely used toxicity model (unitary/toxic-bert) to measure performance disparity between text in African-American English (AAE) and Standard American English (SAE). The benchmark reveals a clear, systematic bias: on average, the model scores AAE text as 1.8 times more toxic and 8.8 times higher for "identity hate". Second, I introduce an interactive pedagogical tool that makes these abstract biases tangible. The tool's core mechanic, a user-controlled "sensitivity threshold," demonstrates that the biased score itself is not the only harm; instead, the more-concerning harm is the human-set, seemingly neutral policy that ultimately operationalises discrimination. This work provides both statistical evidence of disparate impact and a public-facing tool designed to foster critical AI literacy.
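The role of the "sensitivity threshold" can be shown with a small calculation. The scores below are hypothetical, not outputs of unitary/toxic-bert: the SAE scores are made up, and the AAE scores are obtained by applying the reported average factor of 1.8 uniformly, which is a simplifying assumption. The point of the sketch is that the same seemingly neutral cut-off turns a score gap into very different flagging rates.

```python
def flag_rate(scores, threshold):
    """Fraction of posts whose toxicity score meets or exceeds the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical toxicity scores for comparable posts in each dialect.
sae_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
aae_scores = [min(1.0, 1.8 * score) for score in sae_scores]

for threshold in (0.3, 0.5, 0.7):
    sae_rate = flag_rate(sae_scores, threshold)
    aae_rate = flag_rate(aae_scores, threshold)
    print(f"threshold={threshold:.1f}  SAE flagged={sae_rate:.0%}  AAE flagged={aae_rate:.0%}")
```

At a threshold of 0.5, one in five of the hypothetical SAE posts is flagged against three in five of the AAE posts, so the choice of threshold, not only the score, determines who is moderated.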


2510.22276 2026-06-02 cs.CV cs.CL

WAON: A Large-Scale Japanese Image-Text Dataset for Cultural Adaptation in Contrastive Vision-Language Models


Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki

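Affiliations: Kyoto University; NII LLMC; NII; Waseda University; Institute of Science Tokyo

AI summary: Presents WAON, the largest publicly available native Japanese image-text dataset, built from Common Crawl with approximately 155 million examples, and shows through fine-tuning experiments that it outperforms translated data on Japanese cultural benchmarks.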

Comments 13 pages, 7 figures

Abstract

Contrastive vision-language models have achieved remarkable progress through large-scale pretraining. Recent work has shown that removing English-only caption filters and pretraining on global data is effective for improving multicultural performance. We study whether such global pretraining is sufficient for culture-specific understanding, or whether further adaptation with natively sourced data can boost performance beyond what global pretraining alone achieves. To enable this investigation, we present WAON, the largest publicly available native Japanese image-text dataset constructed from native Japanese web content in Common Crawl, containing approximately 155 million examples. We also introduce WAON-Bench, a manually curated Japanese cultural benchmark spanning 374 classes. Through comparative fine-tuning experiments on multiple Japanese image-text datasets, we observe that models fine-tuned on WAON consistently achieve stronger performance on Japanese cultural benchmarks than those fine-tuned on English-to-Japanese translated data. We release our dataset and code.


2510.03690 2026-06-02 cs.LG stat.ML

From Moments to Models: Graphon-Mixture Learning for Mixup and Contrastive Learning


Ali Azizpour, Reza Ramezanpour, Santiago Segarra

发表机构 * University of Michigan（密歇根大学）

AI总结提出一个统一框架，将图数据建模为图模型（graphon）混合，利用图矩（motif密度）聚类并估计混合成分，进而提出图模型感知的混合（GMAM）和对比学习（MGCL）方法，在监督和无监督任务上取得最优或竞争性能。

Abstract

Real-world graph datasets often arise from mixtures of populations, where graphs are generated by multiple distinct underlying distributions. In this work, we propose a unified framework that explicitly models graph data as a mixture of probabilistic graph generative models represented by graphons. To characterize and estimate these graphons, we leverage graph moments (motif densities) to cluster graphs generated from the same underlying model. We establish a novel theoretical guarantee, deriving a tighter bound showing that graphs sampled from structurally similar graphons exhibit similar motif densities with high probability. This result enables principled estimation of graphon mixture components. We show how incorporating estimated graphon mixture components enhances two widely used downstream paradigms: graph data augmentation via mixup and graph contrastive learning. By conditioning these methods on the underlying generative models, we develop graphon-mixture-aware mixup (GMAM) and model-aware graph contrastive learning (MGCL). Extensive experiments on both simulated and real-world datasets demonstrate strong empirical performance. In supervised learning, GMAM outperforms existing augmentation strategies, achieving new state-of-the-art accuracy on 6 out of 7 datasets. In unsupervised learning, MGCL performs competitively across seven benchmark datasets and achieves the lowest average rank overall.
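
As a rough illustration of the graph moments mentioned in the abstract, the sketch below computes two motif densities (edges and triangles) for a small undirected graph. The normalization by the number of possible node pairs and triples is one common convention and is an assumption here; the function name `motif_densities` is hypothetical and not taken from the paper.

```python
from itertools import combinations


def motif_densities(n, edges):
    """Return (edge density, triangle density) of an undirected graph on n >= 3 nodes.

    Assumption: densities are counts divided by the number of possible
    node pairs / node triples. Other normalizations exist.
    """
    adj = {frozenset(e) for e in edges}
    edge_density = len(adj) / (n * (n - 1) / 2)
    triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if frozenset((a, b)) in adj
        and frozenset((b, c)) in adj
        and frozenset((a, c)) in adj
    )
    triangle_density = triangles / (n * (n - 1) * (n - 2) / 6)
    return edge_density, triangle_density


# A triangle and a 4-cycle have clearly different moment vectors,
# which is what makes moments usable as clustering features.
triangle = motif_densities(3, [(0, 1), (1, 2), (0, 2)])
four_cycle = motif_densities(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

Graphs drawn from the same underlying model would be expected to give nearby density vectors, so a standard clustering method could then group them.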


2603.29488 2026-06-02 cs.LG

What Cosine Similarity of Label Representations Can and Cannot Tell us

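Beatrix M. G. Nielsen, Andreas Grivas

Affiliations: IT University of Copenhagen; School of Mathematics, University of Edinburgh

AI summary: Proves that for softmax classifiers the cosine similarity between label representations (unembeddings) carries no information about model probabilities, whereas for sigmoid classifiers the full set of pairwise cosine similarities determines the set of possible label combinations.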

Abstract

Cosine similarity is often used to measure the similarity of vector representations of neural network models. However, the cosine similarity of representations is not guaranteed to tell us anything about model probabilities. In this paper we show that for a softmax classifier, be it an image classifier or an autoregressive language model, the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that given two unembeddings, it is possible to create another model which assigns the same probabilities for all inputs, but where the cosine similarity between the representations is now either 1 or -1. We also show that for a sigmoid classifier (where each input can be assigned multiple labels), all pairwise cosine similarities between the unembeddings define the set of possible label combinations. However, for softmax classifiers (where each input is assigned a ranking of the labels from most to least likely), we need all pairwise cosine similarities between all differences of unembeddings to know which rankings the model can predict. We conclude that it is misleading to interpret the cosine similarity between unembeddings without reference to the classifier that produced them.
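
One way to see why this can happen is the shift invariance of softmax: adding the same vector to every unembedding changes all logits by the same amount, so the probabilities stay fixed while the cosine similarities move. The sketch below demonstrates only this well-known invariance on toy vectors; it is not claimed to be the construction used in the paper's proof, and all names and values are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probs(unembeddings, h):
    """Softmax probabilities for hidden state h, with logits u_i . h."""
    return softmax([dot(u, h) for u in unembeddings])


def shift(unembeddings, c):
    """Add the same vector c to every unembedding."""
    return [[ui + ci for ui, ci in zip(u, c)] for u in unembeddings]


U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy unembeddings
U_shifted = shift(U, [100.0, 100.0])
h = [0.3, -0.7]  # toy hidden state

p_before = probs(U, h)
p_after = probs(U_shifted, h)
cos_before = cosine(U[0], U[1])  # orthogonal vectors
cos_after = cosine(U_shifted[0], U_shifted[1])  # nearly parallel after the shift
```

The two models assign the same probabilities to every input, yet the first pair of label vectors goes from orthogonal to almost parallel.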


2602.12724 2026-06-02 cs.RO

TRANS: Terrain-aware Reinforcement Learning for Agile Navigation of Quadruped Robots under Social Interactions

Wei Zhu, Irfan Tito Kurniawan, Ye Zhao, Mitsuhiro Hayashibe

Affiliations: Department of Robotics, Graduate School of Engineering, Tohoku University; Laboratory for Intelligent Decision and Autonomous Robots, Woodruff School of Mechanical Engineering, Georgia Institute of Technology

AI summary: Proposes TRANS, a two-stage deep reinforcement learning framework with three DRL pipelines (TRANS-Loco, TRANS-Nav, and the unified TRANS) for quadrupedal social navigation over unstructured terrain, addressing the separation of motion planning from locomotion control, the lack of terrain awareness, and the static-environment assumption of conventional methods.

Abstract

This study introduces TRANS: Terrain-aware Reinforcement learning for Agile Navigation under Social interactions, a deep reinforcement learning (DRL) framework for quadrupedal social navigation over unstructured terrains. Conventional quadrupedal navigation typically separates motion planning from locomotion control, neglecting whole-body constraints and terrain awareness. On the other hand, end-to-end methods are more integrated but require high-frequency sensing, which is often noisy and computationally costly. In addition, most existing approaches assume static environments, limiting their use in human-populated settings. To address these limitations, we propose a two-stage training framework with three DRL pipelines. (1) TRANS-Loco employs an asymmetric actor-critic (AC) model for quadrupedal locomotion, enabling traversal of uneven terrains without explicit terrain or contact observations. (2) TRANS-Nav applies a symmetric AC framework for social navigation, directly mapping transformed LiDAR data to ego-agent actions under differential-drive kinematics. (3) A unified pipeline, TRANS, integrates TRANS-Loco and TRANS-Nav, supporting terrain-aware quadrupedal navigation in uneven and socially interactive environments. Comprehensive benchmarks against locomotion and social navigation baselines demonstrate the effectiveness of TRANS. Hardware experiments further confirm its potential for sim-to-real transfer.


2510.01009 2026-06-02 cs.CV cs.MM

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

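Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

Affiliations: University of Southern Mississippi

AI summary: Proposes POVQA, which compresses video frames by temporal pooling and applies supervised fine-tuning plus preference optimization for data-efficient reasoning in long-video question answering.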

Comments Accepted in MAR at CVPR Workshop (Proceedings Track)

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 11533-11542

Abstract

Long-video multimodal question answering requires structured reasoning over visual evidence and dialogue, but Large Vision-Language Models (LVLMs) are constrained by context-window and compute limits. We propose POVQA, which compresses each second into a temporally pooled image (1 fps pooled images) to maintain dense temporal coverage under a fixed token budget. We then train Qwen2.5-VL-7B with supervised fine-tuning (SFT) on rationale+answer targets, and optionally apply Direct Preference Optimization (DPO) for preference alignment. We introduce ReasonVQA as a pilot diagnostic dataset with 12 movies and 239 human-annotated QA+rationale triplets for controlled analysis of long-context multimodal reasoning under compression. On ReasonVQA, SFT improves the best pooled-only baseline from 0.212 to 0.550 F1, showing that pooled evidence plus rationale supervision provides the main performance gains in this setting. In zero-shot transfer, POVQA also reaches 64.7% on TVQA after SFT+DPO. These results are preliminary: ReasonVQA is small, pooling can lose fine-grained temporal order, and DPO effects are not uniformly positive across settings. Code, dataset, and additional qualitative evaluations are available at https://povqa.github.io.
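
The per-second compression step can be pictured with a minimal sketch. It assumes plain mean pooling over the frames of each second and represents frames as flat lists of pixel values; the abstract does not specify the pooling operator, so both choices, along with the name `pool_per_second`, are illustrative assumptions.

```python
def pool_per_second(frames, fps):
    """Collapse every `fps` consecutive frames into one mean-pooled frame.

    frames: list of frames, each a flat list of pixel values (assumed layout).
    Returns one pooled frame per second of video.
    """
    pooled = []
    for start in range(0, len(frames), fps):
        chunk = frames[start:start + fps]
        # Average each pixel position across the frames of this second.
        pooled.append([sum(px) / len(chunk) for px in zip(*chunk)])
    return pooled


# Four 2-pixel frames at 2 fps give two pooled images.
frames = [[0, 0], [2, 4], [10, 10], [20, 30]]
pooled = pool_per_second(frames, fps=2)
```

Averaging discards the order of frames within a second, which matches the limitation noted in the abstract that pooling can lose fine-grained temporal order.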


2603.28759 2026-06-02 cs.CV

FlowIt: Global Matching via Hierarchical Transformers and Optimal Transport for Optical Flow

Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney

Affiliations: Department of Computer Engineering and KUIS AI Center, Koç University, Istanbul, Turkey; Department of Computer Science and Engineering (DISI), University of Bologna, Italy

AI summary: Proposes FlowIt, an architecture that combines a hierarchical transformer with optimal transport for global matching, followed by confidence- and occlusion-guided refinement, and reaches state-of-the-art results on several benchmarks.

Comments Project Page: https://kuis-ai.github.io/FlowIt/

Abstract

We present FlowIt, a novel architecture for optical flow estimation that combines global matching with confidence and occlusion-guided refinement. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the effectiveness of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel benchmark and establishes new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow, while also delivering competitive performance on both the KITTI benchmark and KITTI zero-shot generalization settings.
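
To make the optimal transport formulation of matching concrete, here is a small sketch using Sinkhorn iterations for entropy-regularized transport. The abstract does not name a solver, so Sinkhorn, the toy cost matrix, and the parameters `eps` and `iters` are all assumptions for illustration, not FlowIt's actual implementation.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropy-regularized optimal transport plan between histograms a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy matching cost between 3 source and 3 target pixels (low = similar).
cost = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
a = [1 / 3, 1 / 3, 1 / 3]
b = [1 / 3, 1 / 3, 1 / 3]
plan = sinkhorn(cost, a, b)

# Read off a hard match per source pixel from the soft transport plan.
matches = [max(range(3), key=lambda j: row[j]) for row in plan]
```

In a flow setting the row-wise mass of such a plan could also serve as a soft confidence signal, although how FlowIt derives its confidence and occlusion maps is not described in the abstract.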


2603.28439 2026-06-02 cs.RO

A Predictive Control Strategy to Offset-Point Tracking for Agricultural Mobile Robots

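Stephane Ngnepiepaye Wembe, Vincent Rousseau, Johann Laconte, Roland Lenain

Affiliations: Université Clermont Auvergne, INRAE, UR TSCF; SABI AGRI

AI summary: To address the large tracking errors that arise when agricultural robots ignore implement position, proposes a closed-form predictive control strategy that explicitly models the implement as a rigid offset point and accounts for lateral slip and lever-arm effects; field experiments show the median tracking error reduced by 24% to 56% and peak errors reduced by up to 70%.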

Comments Accepted in the journal IEEE Transactions on Field Robotics

Abstract

Robots are increasingly being deployed in agriculture to support sustainable practices and improve productivity. They offer strong potential to enable precise, efficient, and environmentally friendly operations. However, most existing path-following controllers focus solely on the robot's center of motion and neglect the spatial footprint and dynamics of attached implements. In practice, implements such as mechanical weeders or spring-tine cultivators are often large, rigidly mounted, and directly interacting with crops and soil; ignoring their position can degrade tracking performance and increase the risk of crop damage. To address this limitation, we propose a closed-form predictive control strategy extending the approach introduced in [1]. The method is developed specifically for Ackermann-type agricultural vehicles and explicitly models the implement as a rigid offset point, while accounting for lateral slip and lever-arm effects. The approach is benchmarked against state-of-the-art baseline controllers, including a reactive geometric method, a reactive backstepping method, and a model-based predictive scheme. Real-world agricultural experiments with two different implements show that the proposed method reduces the median tracking error by 24% to 56%, and decreases peak errors during curvature transitions by up to 70%. These improvements translate into enhanced operational safety, particularly in scenarios where the implement operates in close proximity to crop rows.
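
The difference between tracking the vehicle's center and tracking a rigidly mounted implement can be seen with basic rigid-body geometry. The sketch below is only that geometry, under the assumption of a planar pose and a straight reference line; it does not include the paper's controller, slip model, or prediction, and the function names and offset values are hypothetical.

```python
import math


def offset_point(x, y, heading, dx, dy):
    """World position of a point rigidly mounted at (dx, dy) in the vehicle frame."""
    px = x + dx * math.cos(heading) - dy * math.sin(heading)
    py = y + dx * math.sin(heading) + dy * math.cos(heading)
    return px, py


def lateral_error(point, path_point, path_heading):
    """Signed lateral distance from a straight line through path_point."""
    ex = point[0] - path_point[0]
    ey = point[1] - path_point[1]
    return -ex * math.sin(path_heading) + ey * math.cos(path_heading)


# Vehicle center sits exactly on the path (the x-axis) but is yawed by 0.1 rad.
center = (0.0, 0.0)
heading = 0.1
implement = offset_point(center[0], center[1], heading, dx=-2.0, dy=0.0)

center_err = lateral_error(center, (0.0, 0.0), 0.0)
implement_err = lateral_error(implement, (0.0, 0.0), 0.0)
```

Here the center has zero lateral error while an implement mounted 2 m behind it is about 0.2 m off the line, which is the lever-arm effect a center-only controller would not see.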


2603.27223 2026-06-02 cs.CV cs.AI

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

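Jaeseong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

Affiliations: School of Computer Science / Data Intelligence Lab

AI summary: Proposes EuraGovExam, a multilingual multimodal benchmark of more than 8,000 real civil service exam questions that requires layout-aware, cross-lingual reasoning directly from images; current state-of-the-art vision-language models reach only 86% accuracy.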

DOI: 10.1145/3770855.3817532

Abstract

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content (including problem statements, answer choices, and visual elements) within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.


2603.22999 2026-06-02 cs.CL

PaperVoyager: Building Interactive Web with Visual Language Models

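Dasen Dai, Biao Wu, Meng Fang, Wenhao Wang

Affiliations: Vast Intelligence Lab; UTS; University of Liverpool

AI summary: Proposes the PaperVoyager framework, which automatically converts research papers into executable interactive web systems and, by explicitly modeling mechanisms and interaction logic, significantly improves the quality of the generated systems.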

Comments 9 pages, 5 figures

Abstract

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.


2603.26779 2026-06-02 cs.CV cs.AI

Limits of Spatial Imagery Reasoning in Frontier LLM Models

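Sergio Y. Hayashi, Nina S. T. Hirata

Affiliations: Institute of Mathematics and Statistics, University of São Paulo

AI summary: Introduces an external "imagery module" to assist with 3D model rotation tasks and finds that, even when maintenance of the holistic 3D state is outsourced, frontier models lack foundational visual-spatial primitives, with accuracy reaching at most 62.5%.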

Comments 25 pages. v2: Title updated; added a section on object/spatial imagery and propositional reasoning; added new experimental results for the single-object rotation probe

Abstract

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.


2603.26028 2026-06-02 cs.CV

Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA

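Zibo Xu, Qiang Li, Weizhi Nie, Yuting Su

Affiliations: School of Microelectronics, Tianjin University; School of Electrical and Information Engineering, Tianjin University

AI summary: Proposes the Learnable Causal Trimming (LCT) framework, which uses a Dynamic Anatomical Feature Bank (DAFB) and a differentiable trimming module to suppress spurious correlations and strengthen causal signals within end-to-end optimization, improving the robustness and generalization of medical VQA.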

Abstract

Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.
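
The two ingredients named above, a momentum-updated prototype and soft suppression of features that resemble it, can be sketched in a few lines. This is a simplified stand-in under stated assumptions: the bank holds a single prototype, the update is an exponential moving average, and the gate is a fixed cosine-based rule rather than the learnable trimming module the paper describes. All function names are hypothetical.

```python
import math


def ema_update(bank, feature, momentum=0.9):
    """Momentum (exponential moving average) update of one prototype vector."""
    return [momentum * b + (1.0 - momentum) * f for b, f in zip(bank, feature)]


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def soft_trim(feature, bank):
    """Scale a feature down in proportion to its similarity to the prototype.

    Assumption: a fixed gate 1 - max(0, cos); the paper's module is learned.
    """
    gate = 1.0 - max(0.0, cosine(feature, bank))
    return [gate * f for f in feature]


bank = [1.0, 0.0]  # prototype of a frequent, dataset-level pattern
updated = ema_update(bank, [0.0, 1.0])  # drifts slowly toward new features

suppressed = soft_trim([1.0, 0.0], bank)  # matches the prototype: damped
kept = soft_trim([0.0, 1.0], bank)  # instance-specific: passes through
```

A feature aligned with the prototype is scaled to zero while an orthogonal one is left unchanged, which mirrors the intent of emphasizing instance-specific evidence.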


2603.20176 2026-06-02 cs.CV

LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

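Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi

Affiliations: Visual Geometry Group, University of Oxford; Meta AI

AI summary: Proposes LagerNVS, an encoder-decoder neural network built on 3D-aware latent features: the encoder is initialized from pretraining with explicit 3D supervision and is trained end to end with a lightweight decoder and a photometric loss, giving real-time, generalizable novel view synthesis that reaches 31.4 PSNR on Re10k.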

Comments IEEE CVF Conference on Computer Vision and Pattern Recognition 2026. Project page with code, models and examples: szymanowiczs.github.io/lagernvs

2603.23582 2026-06-02 cs.LG cs.AI

AI Generalisation Gap In Comorbid Sleep Disorder Staging

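Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

AI summary: Addresses the poor generalization of deep learning sleep-staging models between healthy and clinical populations in stroke patients; using Grad-CAM visualizations and the new iSLEEPS dataset, shows that the model attends to physiologically uninformative regions and argues for disease-specific models.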

DOI: 10.1109/ISBI61048.2026.11515484

Abstract

Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/


2511.16992 2026-06-02 cs.LG

FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models

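Fatemeh Nourzad, Amirhossein Roknilamouki, Eylem Ekici, Jia Liu, Ness Shroff

AI summary: Proposes FIRM, an algorithm that uses in-client regularization to mitigate client disagreement drift and improve communication efficiency in federated multi-objective alignment, and gives the first finite-time convergence guarantees for this setting.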

Abstract

Aligning Large Language Models (LLMs) with human values often involves balancing multiple, conflicting objectives such as helpfulness and harmlessness. Training these models is computationally intensive, and centralizing the process raises significant data privacy concerns. Federated Learning (FL) offers a compelling alternative, but existing Federated Multi-Objective Optimization (FMOO) methods face severe communication bottlenecks as their reliance on transmitting multiple gradients to a server is unscalable for large models. We introduce FIRM (Federated In-client Regularized Multi-objective alignment), a novel algorithm that achieves both client disagreement drift mitigation and communication efficiency. In FIRM, each client locally solves a regularized multi-objective optimization problem. By directly mitigating client disagreement drift through in-client regularization, our method eliminates the need for the multi-gradient transmissions common in prior works. Consequently, clients need only to transmit a single set of adapted parameters, maintaining high communication efficiency. We prove that our algorithm converges to Pareto-stationary points and, to our knowledge, provide the first finite-time convergence guarantees for this federated multi-objective alignment setting. Empirically, we show that FIRM leads to smoother training dynamics, reduced client disagreement drift, and improved reward trade-offs compared to baselines. We further propose a method to incorporate a preference over the objectives and report empirical Pareto plots, demonstrating that FIRM can smoothly adapt trade-offs between objectives in response to specified preferences.


2603.23902 2026-06-02 cs.CV cs.AI

Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

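Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang

Affiliations: School of Software Engineering, Xi’an Jiaotong University; Faculty of Computer Science, Electrical Engineering and Information Technology, Universität Stuttgart

AI summary: Targets the information-density mismatch and inadequate attention mechanisms in retrieving partially relevant segments from untrimmed videos, and proposes KDC-Net, which combines hierarchical semantic aggregation, dynamic temporal attention, and a CLIP-based distillation strategy to significantly improve retrieval performance.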

Comments Accepted in ICME 2026

Abstract

Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.


2603.23485 2026-06-02 cs.CL cs.AI cs.CY

Failure of contextual invariance in large language models

Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

Affiliations: Network Science Institute, Northeastern University; Center for Health Informatics Program, Boston Children’s Hospital; Dept. of Mathematics, City St George’s, University of London; IT University of Copenhagen

AI summary: Using a pronoun selection task, finds that contextually equivalent but uninformative discourse context causes systematic shifts in LLM outputs, showing that the models violate contextual invariance, with consequences for bias evaluation and high-stakes applications.

Abstract

Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behavior. A Contextuality-by-Default analysis reveals that, in 19% to 52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.


2603.23398 2026-06-02 cs.LG cs.AI stat.ML

Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

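Michal Balcerak, Suprosanna Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze

Affiliations: University of Zurich; Harvard University; Kempner Institute

AI summary: Proposes Graph Energy Matching (GEM), which learns a permutation-invariant potential energy from the JKO transport-map optimization perspective and uses an energy-based switching strategy for high-quality generation of discrete graphs, matching or surpassing discrete diffusion models on molecular graph benchmarks.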

详情

Abstract

Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery and materials design. In these domains, probabilistic inference is particularly valuable, as it enables composable generation and principled incorporation of desired constraints, such as structural or functional properties. Energy-based models naturally support this goal by capturing relative likelihoods and enabling composable inference by directly enforcing constraints during inference. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities, resulting in a fidelity gap compared to discrete diffusion models. To address this gap, we introduce Graph Energy Matching (GEM), a discrete generative framework inspired by the Jordan-Kinderlehrer-Otto (JKO) transport-map optimization perspective. GEM learns a permutation-invariant potential energy that simultaneously guides discrete transport from noise toward high-likelihood graph regions and refines samples within these regions. We further introduce a sampling protocol leveraging an energy-based switching strategy, seamlessly bridging rapid, gradient-guided transport and a local mixing regime for effective exploration. On molecular graph benchmarks, GEM matches or surpasses strong discrete diffusion baselines on most reported metrics. Beyond improving generation quality, GEM's relative likelihood modeling enables targeted exploration, facilitating compositional generation, property-constrained sampling, and interpolation between graphs. Project page: https://michalbalcerak.ai/graph-energy-matching/.
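
To make the phrase "permutation-invariant potential energy" concrete, here is a minimal sketch. It is not GEM's learned energy or its sampling protocol: the hand-written `energy` function, the `target_degree` parameter, and the greedy edge-flip descent are all assumptions chosen purely for illustration. The sketch shows that an energy built from a sum over nodes is unchanged by node relabeling, and that local edge flips can move a graph from an empty "noise-like" start toward a low-energy region.

```python
def degrees(adj):
    """Node degrees of an undirected graph given as a 0/1 adjacency matrix."""
    return [sum(row) for row in adj]

def energy(adj, target_degree=2):
    # Hypothetical toy energy: a sum over nodes, so relabeling nodes
    # (permuting rows and columns together) leaves the value unchanged.
    return sum((d - target_degree) ** 2 for d in degrees(adj))

def permute(adj, perm):
    """Relabel nodes: new node i is old node perm[i]."""
    n = len(adj)
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def greedy_edge_flip(adj):
    """Apply the single edge flip that lowers the energy most.

    Returns True if a flip was applied, False at a local minimum.
    """
    n = len(adj)
    best, best_e = None, energy(adj)
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] ^= 1
            adj[j][i] ^= 1
            e = energy(adj)
            if e < best_e:
                best, best_e = (i, j), e
            adj[i][j] ^= 1  # undo the trial flip
            adj[j][i] ^= 1
    if best is None:
        return False
    i, j = best
    adj[i][j] ^= 1
    adj[j][i] ^= 1
    return True

def descend(adj):
    """Run greedy flips to a local minimum; return the energy trace."""
    trace = [energy(adj)]
    while greedy_edge_flip(adj):
        trace.append(energy(adj))
    return trace

# Start from an empty 4-node graph and descend the toy energy.
graph = [[0] * 4 for _ in range(4)]
print(descend(graph))
```

A greedy descent like this can stall in local minima on harder energies, which is the failure mode the abstract attributes to discrete energy-based samplers; the abstract's energy-based switching between transport and local mixing is not reproduced here.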


2603.06989 2026-06-02 cs.CV

MipSLAM: Alias-Free Gaussian Splatting SLAM

Yingzhao Li, Yan Li, Shixiong Tian, Yanjie Liu, Lijun Zhao, Gim Hee Lee

Affiliations: State Key Laboratory of Robotics and Systems (HIT), Harbin Institute of Technology; Yangtze River Delta HIT Robot Technology Research Institute; Department of Computer Science, National University of Singapore

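AI summary: Proposes the MipSLAM framework, which combines an elliptical adaptive anti-aliasing algorithm with spectral-aware pose graph optimization to achieve high-fidelity anti-aliased novel view synthesis and robust pose estimation.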

Comments: Accepted to ICRA 2026

Abstract

This paper introduces MipSLAM, a frequency-aware 3D Gaussian Splatting (3DGS) SLAM framework capable of high-fidelity anti-aliased novel view synthesis and robust pose estimation under varying camera configurations. Existing 3DGS-based SLAM systems often suffer from aliasing artifacts and trajectory drift due to inadequate filtering and purely spatial optimization. To overcome these limitations, we propose an Elliptical Adaptive Anti-aliasing (EAA) algorithm that approximates Gaussian contributions via geometry-aware numerical integration, avoiding costly analytic computation. Furthermore, we present a Spectral-Aware Pose Graph Optimization (SA-PGO) module that reformulates trajectory estimation in the frequency domain, effectively suppressing high-frequency noise and drift through graph Laplacian analysis. Extensive evaluations on Replica and TUM datasets demonstrate that MipSLAM achieves state-of-the-art rendering quality and localization accuracy across multiple resolutions. Code is available at https://github.com/yzli1998/MipSLAM.
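
As a rough illustration of how a graph Laplacian gives a notion of "frequency" on a trajectory, the sketch below applies generic Laplacian smoothing to a scalar signal on a path graph. It is not the SA-PGO module: the path-graph topology, unit edge weights, scalar per-node values, and the `alpha` and `steps` parameters are all assumptions made for this toy example. Each update `x <- x - alpha * L x` scales the component along a Laplacian eigenvector with eigenvalue `lam` by `1 - alpha * lam`, so high-frequency (large `lam`) components shrink fastest.

```python
def laplacian_apply(x):
    """Return L x for a path graph with unit edge weights (L = D - A)."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        if i > 0:
            acc += x[i] - x[i - 1]
        if i < n - 1:
            acc += x[i] - x[i + 1]
        out.append(acc)
    return out

def low_pass(x, alpha=0.25, steps=10):
    """Repeated Laplacian smoothing.

    Path-graph eigenvalues lie in [0, 4), so alpha = 0.25 keeps every
    per-step factor 1 - alpha * lam inside (0, 1].
    """
    x = list(x)
    for _ in range(steps):
        lx = laplacian_apply(x)
        x = [xi - alpha * li for xi, li in zip(x, lx)]
    return x

def roughness(x):
    """Quadratic form x^T L x: sum of squared neighbor differences."""
    return sum((a - b) ** 2 for a, b in zip(x, x[1:]))

# Hypothetical noisy 1-D positions along a trajectory.
noisy = [0.0, 1.2, 1.8, 3.3, 3.9, 5.1, 6.2, 6.8]
smooth = low_pass(noisy)
print(roughness(noisy), roughness(smooth))
```

Note that this simple filter also flattens slow trends if run for many steps, because only the constant signal has zero graph frequency; a pose-graph optimizer would balance such a smoothness term against measurement constraints.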
