arXivDaily: daily arXiv digest, updated Monday through Friday
2605.09609 2026-05-12 cs.LG math.AG

Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects

Kevin Dao, Jose Israel Rodriguez

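AI summary: This paper presents a counterexample to the minimal unimodal conjecture on minimal filling architectures for polynomial neural networks (PNNs) with power activation functions. The counterexample was found by a frontier search and certified using recursive dimension bounds and symbolic computation. Several of its subarchitectures exhibit large defect, in contrast with the predominantly small-defect behavior observed in prior work, which points to new features of PNN architecture design.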

Abstract

We provide a counterexample to the minimal unimodal conjecture for polynomial neural networks (PNNs) with power activation functions. Fixing the input and output widths, the conjecture states that any minimal filling architecture has unimodal widths for the hidden layers. We found a counterexample via a frontier search and certified it using recursive dimension bounds and symbolic computation. Notably, several subarchitectures of this example exhibit large defect, in contrast with the predominantly small-defect behavior observed in prior examples.
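To make the conjecture's condition concrete, the sketch below checks whether a list of hidden-layer widths is unimodal. It assumes "unimodal" means weakly increasing and then weakly decreasing; the function name and the example widths are illustrative and are not taken from the paper.

```python
def is_unimodal(widths):
    """Return True if widths weakly increase and then weakly decrease."""
    i = 0
    n = len(widths)
    # Climb the non-decreasing prefix up to a peak.
    while i + 1 < n and widths[i] <= widths[i + 1]:
        i += 1
    # Everything after the peak must be non-increasing.
    while i + 1 < n and widths[i] >= widths[i + 1]:
        i += 1
    # Unimodal only if both passes together consumed the whole list.
    return i >= n - 1


# Hypothetical width sequences, for illustration only.
print(is_unimodal([2, 5, 5, 3]))
print(is_unimodal([3, 1, 4]))
```

A counterexample to the conjecture would be a minimal filling architecture whose hidden widths make this check fail.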

2605.09608 2026-05-12 cs.LG cs.IT math.IT

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

Yuanyi Wang, Yifan Yang, Su Lu, Yanggan Gu, Pengkai Wang, Wenjun Wang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Jialun Cao, Shing-Chi Cheung, Hongxia Yang

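AI summary: This study examines what causes forgetting in continual post-training of large language models and how to control it, analyzing the compatibility between the evolving model state and new updates through the covariance geometry of task parameter updates. Building on the notion of geometry conflict, it proposes GCWM, a data-free update-integration method that constructs a shared metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate the correction. Experiments show that the method improves retention and final performance in continual training across several model scales.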

Abstract

Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that forgetting can be considered a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B--14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.
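For readers unfamiliar with Gaussian Wasserstein barycenters, the standard closed forms are sketched here. The symbols $m_i$, $\Sigma_i$, and $\lambda_i$ are notation introduced for this note; how GCWM derives covariances from parameter updates is not stated in the abstract and is not assumed here.

```latex
% Squared 2-Wasserstein distance between Gaussians N(m_1, \Sigma_1) and N(m_2, \Sigma_2)
W_2^2 = \lVert m_1 - m_2 \rVert_2^2
      + \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
      - 2\bigl(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\bigr)^{1/2}\Bigr)

% Barycenter covariance \bar{\Sigma} of zero-mean Gaussians with weights
% \lambda_i \ge 0, \sum_i \lambda_i = 1 (fixed-point characterization)
\bar{\Sigma} = \sum_i \lambda_i
      \bigl(\bar{\Sigma}^{1/2}\,\Sigma_i\,\bar{\Sigma}^{1/2}\bigr)^{1/2}
```

The second equation has no closed-form solution in general and is usually solved by fixed-point iteration.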

2605.09604 2026-05-12 cs.CV

DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

Jiaying Lin, Shiman Wu, Jinfu Liu, Can Wang, Mengyuan Liu

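AI summary: This study addresses human action recognition (HAR) with mmWave radar in heterogeneous settings. It introduces UniMM-HAR, the first large-scale heterogeneous multi-source mmWave point cloud dataset, and designs DAP-Net to handle the distribution shifts caused by different devices and frequency bands. DAP-Net combines cross-modal information with a Doppler-aware mechanism to improve robustness to heterogeneous radar sources, and experiments show superior performance on cross-source recognition.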

Abstract

Millimeter-wave (mmWave) radar provides privacy-preserving sensing and is valuable for human action recognition (HAR). Existing mmWave point cloud datasets are limited in scale and mostly collected under homogeneous single-source settings, preventing current methods from handling real-world distribution shifts caused by heterogeneous radar sources, such as different devices and frequency bands. To address this, we introduce UniMM-HAR, the largest and first mmWave point cloud HAR dataset for heterogeneous multi-source scenarios, standardizing three distinct radar configurations to realistically evaluate cross-source generalization. We further propose the Doppler-aware Point Cloud Network (DAP-Net) to tackle heterogeneity challenges. DAP-Net enhances intra-modal representations and performs cross-modal alignment to learn source-invariant action semantics. Leveraging action-consistent spatio-temporal Doppler patterns as anchors, the Dual-space Doppler Reparameterization (D2R) module performs sample-adaptive geometric densification and Doppler-guided feature recalibration, while the Text Alignment Module (TAM) provides stable semantic anchors via a pretrained textual space. Experiments show that DAP-Net significantly outperforms existing methods under heterogeneous radar settings, achieving state-of-the-art accuracy and strong cross-source robustness.

2605.09603 2026-05-12 cs.CL

Edit-Based Refinement for Parallel Masked Diffusion Language Models

Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Junting Pan, Hongsheng Li

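AI summary: This paper proposes ME-DLM, an edit-based refinement framework that improves parallel masked diffusion language models when they generate multiple tokens at once. After an initial complete response is generated, the method refines it with minimal edit operations (replacement, deletion, and insertion) to strengthen sequence consistency. Experiments show that ME-DLM improves generation quality and robustness while keeping the efficiency of parallel generation; built on LLaDA, it gains 11.6 points on HumanEval and 33.6 points on GSM8K.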

Comments Accepted to ICML 2026

Abstract

Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at https://github.com/renhouxing/ME-DLM.
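The idea of deriving edit supervision from edit distance can be illustrated with a standard Levenshtein dynamic program plus a backtrace. This is a minimal sketch, not the authors' implementation: the function name, the operation tuples, and the tie-break order (diagonal, then delete, then insert) are assumptions standing in for the paper's unspecified canonicalization scheme.

```python
def edit_ops(source, target):
    """Return a minimal list of edit operations turning source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep or replace
    # Backtrace with a fixed tie-break order so the result is deterministic.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(("replace", i - 1, target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            # Insert target[j - 1] before source position i.
            ops.append(("insert", i, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


print(edit_ops(["a", "b", "c"], ["a", "x", "c"]))
```

Fixing the tie-break order matters because several minimal edit scripts can exist for the same pair; a deterministic choice gives one training target per example.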

2605.09591 2026-05-12 cs.CV

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang

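AI summary: This paper asks whether promptable segmentation models truly understand the concepts they segment rather than relying on visually salient but semantically misleading cues. The authors propose a new benchmark, CAFE, which evaluates concept faithfulness through attribute-level counterfactual manipulation. Experiments show that although models produce accurate masks, they still show weak concept understanding when given misleading prompts, revealing a systematic gap between localization quality and semantic understanding.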

Comments 30 pages, 8 figures

Abstract

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.

2605.09584 2026-05-12 cs.CL cs.AI cs.LG

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

Aishik Nagar, Arun-Kumar Kaliya-Perumal, Yu-Hsuan Han, Andrew Sheng-Han Huang, Kristen Kee, Yushi Cao, Yiming Chen, Hongchao Jiang

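AI summary: CLR-voyance is a new framework for reinforcing open-ended reasoning in inpatient clinical decision support. It models clinical reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are grounded in clinical outcomes and validated by clinicians. By splitting each patient journey into a policy-visible past and an oracle-only future, it generates verifiable clinical reasoning rubrics used for both training and evaluation. Experiments show that models trained with CLR-voyance perform strongly on inpatient clinical reasoning, ahead of existing frontier models, and the system has been deployed at a hospital.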

Abstract

Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL reward signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair and the first adaptive rubric for clinical reasoning, which is verifiable in the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%), and has comparable or better performance on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, where physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences of model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.

2605.09581 2026-05-12 cs.CV

FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision

Michal Filipkowski, Marcin Kowalczyk, Tomasz Kryjak

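AI summary: This paper presents an FPGA-based hardware architecture implementing the Contrast Maximization (CM) algorithm for event-based vision systems. The architecture exploits the FPGA's parallelism to compute the contrast of images reconstructed from asynchronous event streams and to run the iterative optimization that estimates motion parameters. The paper details the design and optimization of the hardware modules and reports experiments showing clear gains in speed and energy efficiency, with execution over 200 times faster than CPU and GPU implementations, providing a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.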

Comments Accepted for ARC 2026

Abstract

This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
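The core of Contrast Maximization can be sketched in software as follows. This is an illustrative toy under stated assumptions, not the paper's hardware design: it assumes a constant-velocity motion model, uses the variance of the Image of Warped Events (IWE) as the contrast measure, and replaces the iterative optimizer with a search over a hypothetical list of candidate velocities.

```python
def iwe_contrast(events, vx, vy, width, height):
    """Variance of the Image of Warped Events for one candidate velocity."""
    image = [[0] * width for _ in range(height)]
    for x, y, t in events:
        # Warp each event back to the reference time t = 0.
        wx = int(round(x - vx * t))
        wy = int(round(y - vy * t))
        if 0 <= wx < width and 0 <= wy < height:
            image[wy][wx] += 1
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    return sum((v - mean) ** 2 for v in pixels) / len(pixels)


def best_velocity(events, candidates, width, height):
    """Pick the candidate (vx, vy) whose warped image has the highest contrast."""
    return max(candidates,
               key=lambda v: iwe_contrast(events, v[0], v[1], width, height))


# Synthetic events from a point moving at 2 px per time unit along x.
events = [(3 + 2 * t, 4, t) for t in range(5)]
candidates = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(best_velocity(events, candidates, width=16, height=8))
```

With the correct velocity all events collapse onto one pixel, which maximizes the variance; wrong velocities smear them across several pixels.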

2605.09579 2026-05-12 cs.LG cs.AI

Biosignal Fingerprinting: A Cross-Modal PPG-ECG Foundation Model

Zhangdaihong Liu, Chang Liu, Fenglin Liu, Yixuan Chen, Yang Yang, David A. Clifton, Xiao Gu

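AI summary: This study proposes cross-modal biosignal fingerprints to bridge the gap between electrocardiography (ECG) and photoplethysmography (PPG) in cardiovascular monitoring. A Multi-modal Masked Autoencoder (M2AE) learns compact, transferable latent representations from a large set of paired ECG and PPG signals, usable across clinical tasks without task-specific fine-tuning. Experiments show strong performance on tasks such as cardiovascular disease classification and hypertension detection, and performance holds with a single input modality, which suits resource-constrained wearable devices.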

Comments 21 pages, 8 figures, 7 tables

Abstract

Cardiovascular disease remains the leading cause of global mortality, yet scalable cardiac monitoring is hindered by the gap between diagnostic-rich ECG and ubiquitous wearable PPG. Bridging this gap requires representations that are compact, transferable across modalities and devices, and deployable without task-specific retraining. Here we introduce biosignal fingerprints: compact latent representations of cardiovascular state derived from a cross-modal foundation model, the Multi-modal Masked Autoencoder (M2AE), trained on over 3.4 million paired ECG and PPG signals. M2AE integrates modality-specific encoders with a shared bottleneck and dual decoders, jointly optimized using reconstruction and cross-modal contrastive objectives, yielding generalizable fingerprints that retain intra- and inter-modality features. Like a biometric fingerprint, these representations uniquely encode an individual's cardiovascular state in a modality-agnostic, privacy-preserving form reusable across clinical tasks without exposing raw waveform data or requiring model retraining. Across 7 downstream tasks, spanning cross-modal reconstruction, cardiovascular disease classification, hypertension detection, mortality prediction, and demographic inference, biosignal fingerprints achieve competitive or superior performance compared to leading domain-specialist foundation models in frozen settings, including an AUROC of 0.974 for five-class CVD classification and 0.877 for hypertension detection, with a maximum improvement of 27.7% in AUROC across 5 classification tasks. Critically, strong performance is maintained with only a single modality, enabling deployment in resource-constrained, single-sensor environments typical of real-world wearable monitoring, with direct implications for continuous cardiovascular monitoring across clinical and consumer health settings.

2605.09572 2026-05-12 cs.CV cs.AI cs.MM

KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

Guanyi Du, Lintao Wang, Kun Hu, Ziyang Wang

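AI summary: This study explores Kolmogorov-Arnold Networks (KANs) for generating sign language pose animation from symbolic notation. It proposes KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. A coarse-to-fine strategy with multi-scale supervision first generates the overall body structure and then refines hand articulation, and KAN modules are integrated into a Transformer backbone to model the non-linear mapping from symbols to continuous poses compactly. Experiments on several sign language corpora show better performance than existing methods with far fewer parameters and indicate that multi-scale supervision is the key driver of the gains.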

Comments Accepted at Neurocomputing

Abstract

Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body--hand--face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov--Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic time warping based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
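The evaluation metric mentioned above relies on dynamic time warping (DTW). A minimal sketch of DTW over 2D pose sequences follows; the function name, the per-frame distance (mean Euclidean joint distance), and the lack of path-length normalization are assumptions, since the abstract does not specify the exact metric definition.

```python
import math


def dtw_cost(pred, ref):
    """Accumulated DTW cost between two pose sequences.

    Each sequence is a list of frames; each frame is a list of (x, y) joints.
    """
    def frame_dist(a, b):
        # Mean Euclidean distance over corresponding joints.
        return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

    n, m = len(pred), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(pred[i - 1], ref[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# A reference that holds the first pose one frame longer aligns at zero cost.
pred = [[(0, 0)], [(1, 0)]]
ref = [[(0, 0)], [(0, 0)], [(1, 0)]]
print(dtw_cost(pred, ref))
```

DTW is used here because generated and reference sign sequences can differ in timing even when the poses themselves match.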

2605.09570 2026-05-12 cs.LG

End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor

Wiktor Matykiewicz, Piotr Wzorek, Kamil Jeziorek, Tomás Muñoz, Antonio Rios-Navarro, Angel Jiménez-Fernández, Tomasz Kryjak

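AI summary: With the rapid growth of mobile robotics and embedded intelligence, edge platforms increasingly need efficient on-device data processing. This paper presents an end-to-end keyword spotting system on a field-programmable gate array (FPGA), the first to integrate a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, processing event-based audio streams directly without conventional signal preprocessing. Using a compute-near-memory architecture, the system achieves real-time processing with low latency and low power consumption at 87.43% accuracy.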

Comments Accepted for the ARC 2026 conference

Abstract

With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 us and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at https://github.com/vision-agh/NAS-GNN-KWS.

2605.09566 2026-05-12 cs.CV

Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing

Tianyi Lu, Wenxue Cui, Shaohui Liu

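AI summary: This paper proposes a Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) for image compressive sensing reconstruction. The method partitions the measurements into two subsets and uses hyperprior information to guide reconstruction, improving quality across regions with different textures. Its main contributions are lightweight neural modules that generate multi-domain hyperprior knowledge, together with dynamically generated adaptive step sizes and attention mechanisms during reconstruction. Experiments show that it outperforms existing compressive sensing methods on several benchmark datasets.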

Abstract

Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into two subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules is designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two well-designed hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, improving CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.
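The two steps named in the abstract correspond to the generic proximal-gradient iteration that deep unfolding networks unroll. The notation below ($\Phi$, $y$, $\rho$, $\psi$, $P$) is introduced for this note; the last line is one plausible reading of "spatially varying step maps" and is an assumption, not the paper's stated formulation.

```latex
% Measurement model: y = \Phi x, with sampling matrix \Phi
% Gradient descent step at stage k, scalar step size \rho^{(k)}
r^{(k)} = x^{(k-1)} - \rho^{(k)}\,\Phi^{\top}\bigl(\Phi x^{(k-1)} - y\bigr)

% Proximal mapping step for a regularizer \psi (a learned network in a DUN)
x^{(k)} = \operatorname{prox}_{\lambda\psi}\bigl(r^{(k)}\bigr)
        = \arg\min_{x}\ \tfrac{1}{2}\lVert x - r^{(k)}\rVert_2^2 + \lambda\,\psi(x)

% Assumed variant: a per-pixel step map P^{(k)} applied elementwise
r^{(k)} = x^{(k-1)} - P^{(k)} \odot \Phi^{\top}\bigl(\Phi x^{(k-1)} - y\bigr)
```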

2605.09565 2026-05-12 cs.LG

Online Set Learning from Precision and Recall Feedback

Lee Cohen, Yishay Mansour, Shay Moran, Han Shao

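AI summary: This paper studies learning an unknown subset from precision and recall feedback in an online setting. In each round the learner predicts a subset and receives partial information depending on the feedback type (precision or recall); the goal is to maximize cumulative reward. The paper proves that a hypothesis class is learnable in this setting if and only if it has finite VC dimension, and develops algorithms that handle the dependencies induced by the feedback, obtaining regret bounds in both the realizable and agnostic settings. It also points to several open questions.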

Abstract

We consider the problem of learning an unknown subset $N_\text{target}$ of a domain in an online setting. In each round $t$, the learner predicts a set of items ${N}_t$ and receives one of two types of feedback, each with equal probability: precision feedback, in which a randomly chosen item from the predicted set $N_t$ is revealed and the learner is told whether it belongs to $N_\text{target}$ (incurring a reward if it does), or recall feedback, in which a randomly chosen item from the target set $N_\text{target}$ is revealed and the learner is told whether it belongs to $N_t$ (incurring a reward if it does). The goal is to maximize the cumulative reward over time. This simple online set learning problem abstracts a variety of learning scenarios with precision- and recall-type feedback. We show that a hypothesis class (a family of subsets of the domain) is learnable in this setting if and only if it has finite Vapnik-Chervonenkis (VC) dimension, mirroring the classical PAC characterization. However, the resulting algorithmic structure is markedly more intricate: in contrast to standard Probably Approximately Correct (PAC) learning -- where the algorithmic landscape is governed by the simple principle of Empirical Risk Minimization (ERM) -- our partial feedback model can invalidate ERM and even all proper learning rules. We develop algorithms to address the dependencies induced by the feedback, obtaining regret guarantees in both the realizable and agnostic settings. Our results provide a qualitative characterization of learnability in this model, addressing its most basic question, while pointing to a range of natural and intriguing open questions, including the determination of optimal regret rates.
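The feedback protocol described above can be simulated in a few lines. This is a sketch of one round only, assuming non-empty finite sets and uniform random item choice; the function names are hypothetical and no learning algorithm from the paper is implemented.

```python
import random


def feedback_round(predicted, target, rng):
    """Simulate one round; returns (feedback_type, revealed_item, reward)."""
    if rng.random() < 0.5:
        # Precision feedback: reveal a random item of the predicted set.
        item = rng.choice(sorted(predicted))
        return "precision", item, int(item in target)
    # Recall feedback: reveal a random item of the target set.
    item = rng.choice(sorted(target))
    return "recall", item, int(item in predicted)


def expected_reward(predicted, target):
    """Expected one-round reward: the average of precision and recall."""
    overlap = len(predicted & target)
    return 0.5 * overlap / len(predicted) + 0.5 * overlap / len(target)


rng = random.Random(0)
print(feedback_round({1, 2, 3, 4}, {3, 4, 5}, rng))
print(expected_reward({1, 2, 3, 4}, {3, 4, 5}))
```

Under these assumptions the expected reward of a prediction is the mean of its precision and recall, so predicting the target set itself earns reward 1 in every round.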

2605.09554 2026-05-12 cs.CL cs.CV

Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs

Kuanwei Chen, Mengfeng Tsai

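AI summary: This paper studies the trade-off between frame rate and model size in sign language translation (SLT), aiming for a more compact and efficient system. The authors propose a lightweight pipeline with only 77M parameters that couples MMPose skeletal pose extraction with a single linear projection into T5-small. Lowering the input frame rate substantially reduces computation with little loss in translation quality: at 12 fps the BLEU-4 score drops only slightly relative to 24 fps, and the model is roughly one third the size of prior T5-base systems, showing that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.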

Comments 2 pages, 1 figure, 2 tables

Abstract

Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
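The 75% figure follows from the quadratic dependence of self-attention cost on sequence length. The short calculation below assumes sequence length scales linearly with frame rate and ignores all non-quadratic terms; the function name is illustrative.

```python
def attention_cost_ratio(fps_new, fps_base):
    """Relative quadratic self-attention cost when length scales with frame rate."""
    length_ratio = fps_new / fps_base   # 12 / 24 halves the sequence length
    return length_ratio ** 2            # attention cost grows with length squared


ratio = attention_cost_ratio(12, 24)
print(f"cost ratio: {ratio}, reduction: {1 - ratio:.0%}")
```

Halving the length leaves one quarter of the quadratic cost, which is the stated 75% reduction.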

2605.09549 2026-05-12 cs.LG

When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning

Yunxuan Fang, Ziwei Zhang, Xinhe Wang

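AI summary: This paper studies the failure of adaptive gating in frozen few-shot vision-language prompt learning. It finds that adaptive gates and prompt-selection modules often produce nearly constant outputs, contribute negligible gradient signal, and underperform fixed prompts. Through systematic experiments the authors identify two main failure modes, gradient magnitude imbalance and gate degradation, clarifying the limits of adaptive gating in this regime and questioning the indiscriminate addition of architectural complexity in parameter-efficient learning.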

Abstract

Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, we systematically observe that adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. To further explore this issue, we present a systematic diagnostic study to uncover the underlying causes and conditions of adaptation failure. Through controlled experiments across datasets and multiple prompt learning architectures, we identify two recurring failure modes: gradient magnitude imbalance and gate degradation. Our findings invite a re-examination of indiscriminately adding architectural complexity in parameter-efficient learning and clarify when prompt-level adaptive gating is, and is not, effective in this regime.

2605.09548 2026-05-12 cs.CL

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, Hinrich Schütze

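AI summary: To address weaker reasoning in low-resource languages, this study proposes Crosslingual On-Policy Self-Distillation (COPSD). The same model serves as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the English translation and reference solution. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision. Experiments on 17 low-resource African languages show that COPSD substantially improves mathematical reasoning, outperforms GRPO, and improves answer-format adherence, test-time scaling, and generalization to harder benchmarks.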

Comments preprint

Abstract

Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
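A full-distribution token-level divergence can be sketched as a per-position KL term averaged over a rollout. The abstract does not say which divergence or direction COPSD uses, so the forward KL from teacher to student below is an assumption, as are the function names and the toy logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the full vocabulary at one position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rollout_loss(teacher_steps, student_steps):
    """Mean per-token divergence across the positions of one student rollout."""
    pairs = list(zip(teacher_steps, student_steps))
    return sum(token_kl(t, s) for t, s in pairs) / len(pairs)


# Toy logits over a 3-token vocabulary at two rollout positions.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.0, 1.0, 0.1], [0.2, 1.5, 0.3]]
print(rollout_loss(teacher, student))
```

Every position of the rollout contributes a training signal here, which is what makes the supervision dense compared with a single outcome reward.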

2605.09544 2026-05-12 cs.AI

TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

Yize Li, Junzhi Li, Jason Song, Chuxiong Sun, Rui Wang, Changwen Zheng

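AI summary: TIDE-Bench is a holistic and efficient benchmark for evaluating tool-integrated reasoning (TIR) methods, designed to address the limited task diversity, diagnostic comprehensiveness, and evaluation efficiency of current TIR evaluations. It covers diverse task settings, including mathematical reasoning, knowledge-intensive QA, and two newly designed tasks that probe complex tool invocation and multi-tool coordination. TIDE-Bench adopts a comprehensive, task-aware evaluation protocol and improves efficiency by selecting high-quality samples. Experiments reveal persistent bottlenecks in tool grounding in current TIR methods.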

Comments 10 pages, 5 figures, 10 tables

Abstract

Tool-integrated reasoning has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol, jointly measuring final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality and discriminative evaluation sets by filtering low-discrimination instances from existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.

2605.09542 2026-05-12 cs.AI

LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

Rishabh Jakhar, Michel Dumontier, Remzi Celebi

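AI summary: This study proposes TESSERA, a neuro-symbolic framework that combines knowledge graphs with large language models (LLMs) to compose multi-step mechanistic explanations for drug-disease pairs. LLMs supply local judgements and state evaluation, while Monte Carlo Tree Search (MCTS) handles structured long-horizon search and credit assignment, yielding plausible and diverse explanation paths that stay faithful to curated biology. Experiments on two complementary knowledge graphs show that the framework recovers drug mechanisms, with ablations confirming the contribution of the LLM components.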

Comments Accepted at IJCAI-ECAI 2026. 9 pages (7 content + 2 references), 5 figures, 3 tables. Includes supplementary material (26 pages)

Abstract

Extracting multi-step explanations from knowledge graphs poses a combinatorial challenge requiring both heuristic guidance (as candidates proliferate with depth) and credit assignment (as path quality emerges over extended sequences). Frontier LLMs, strong on knowledge/reasoning benchmarks, offer a compelling source of such heuristics, yet their knowledge comes sans guarantees and compositional performance degrades as chains lengthen. We thus present TESSERA, a 3-part neuro-symbolic framework that uses LLMs in a circumscribed role: for local discriminative judgement rather than autonomous multi-step generation; the knowledge graph then defines the hypothesis space enforcing hard structural constraints, and MCTS coordinates the long-horizon search with principled credit assignment via backpropagation. LLMs perform dual roles as a prior policy biasing exploration and a comparative state evaluator supplying reward signals. Evaluation on drug mechanism elucidation across two complementary knowledge graphs demonstrates fidelity to curated biology while surfacing coherent alternative mechanisms, with ablations confirming discriminative contribution from both LLM components. Beyond its current application, our framework offers a general paradigm for compositional reasoning over structured knowledge.

2605.09539 2026-05-12 cs.CL

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

Chen Xu, Yicheng Hu, Ruizi Wang, Xinyu Lin, Wenjie Wang, Dongrui Liu, Fuli Feng

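AI summary: This paper proposes TacoMAS, a test-time co-evolution framework for multi-agent systems that dynamically adapts both agent capabilities and the communication topology. Capabilities are updated quickly to handle emerging subtasks, while the topology is adjusted on a longer time scale to preserve coordination stability. On four benchmarks TacoMAS outperforms nearly 20 existing methods, with an average improvement of 13.3% over the strongest baseline.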

Abstract

Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or only the capability during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs agent birth-death operations on the MAS, including edge edit, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The code is released at https://github.com/chenxu2-gif/TacoMAS-MultiAgent.

2605.09538 2026-05-12 cs.CV cs.AI cs.RO

PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions

Jihyun Lee, Changmin Lee, Donghwan Kim, Tae-Kyun Kim

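AI summary PhysHanDI is a physics-based framework for jointly reconstructing 3D interactions between hands and non-rigid objects (such as cloth and stuffed toys). It drives object deformation by simulating the forces induced by densely reconstructed hand motion, so that the reconstructed object dynamics are both physically plausible and consistent with the hand motion. The simulated object deformation can in turn improve hand reconstruction accuracy via inverse physics. Experiments show that PhysHanDI outperforms the best existing methods on both reconstruction and future prediction.

Comments Accepted to ICML 2026

Details
English abstract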

While existing methods for reconstructing hand-object interactions have made impressive progress, they either focus on rigid or part-wise rigid objects, limiting their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations, or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHanDI (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline across reconstruction and future prediction.

2605.09537 2026-05-12 cs.RO

Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning

Kewei Chen, Yayu Long, Mingsheng Shang

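AI summary Although Vision-Language-Action (VLA) models have advanced rapidly in robotic control, they still suffer from instruction drift in long-horizon tasks. This paper reframes the phenomenon as a systematic sampling error and proposes Context-Aware Power Sampling (CAPS), a training-free inference-time computation framework. CAPS uses power distributions to sharpen global trajectory probabilities and adds a metacognitive control mechanism based on Signal-to-Noise Ratio (SNR) that triggers adaptive MCMC search when drift risk is detected, switching dynamically between "intuitive fast thinking" and "rational slow search". Experiments show that CAPS substantially outperforms existing methods on several long-horizon benchmarks, improving robustness in long-horizon robotic tasks.

Comments Accepted at ICML 2026

Details
English abstract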

Despite rapid progress in Vision-Language-Action (VLA) models for robotic control, instruction drift remains a persistent failure mode in long-horizon tasks. This paper reconceptualizes this phenomenon, positing that instruction drift is fundamentally a systematic sampling error: local greedy sampling is prone to collapsing into "Negative Pivotal Windows", irreversible local optima with high local probability that sever global success pathways. To address this, we propose Context-Aware Power Sampling (CAPS), a training-free inference-time computation framework. CAPS leverages power distributions to sharpen global trajectory probabilities, enabling lookahead search over the model's conditional generative trajectory distribution. Furthermore, we introduce a metacognitive control mechanism based on Signal-to-Noise Ratio (SNR). This mechanism triggers adaptive MCMC search solely when drift risk is detected, enabling a dynamic transition from "intuitive fast thinking" to "rational slow search." Experiments on RoboTwin, Simpler-WindowX, and Libero-long benchmarks show that CAPS achieves substantial improvements over strong baselines, including OpenVLA and TACO, without parameter updates. These results support the effectiveness of adaptive inference-time computation for improving long-horizon robustness in embodied control.
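
To make the notion of a power distribution concrete, the sketch below sharpens a finite set of candidate trajectory log-probabilities by raising them to a power `alpha` and renormalising, and gives the Metropolis-Hastings acceptance probability for the simple case where proposals are drawn from the base model itself. Both function names, the finite candidate set, and the choice of proposal are assumptions for illustration; the abstract does not specify the CAPS sampler or how its SNR trigger is computed, so the trigger is not modelled here.

```python
import math
from typing import List


def power_distribution(log_probs: List[float], alpha: float) -> List[float]:
    """Normalise p_i ** alpha over a finite candidate set (alpha > 1 sharpens)."""
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)                                  # subtract max for stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]


def mh_accept_prob(logp_current: float, logp_proposal: float, alpha: float) -> float:
    """Acceptance probability for a target proportional to p ** alpha when the
    proposal is drawn from p itself: min(1, (p_new / p_old) ** (alpha - 1))."""
    return min(1.0, math.exp((alpha - 1.0) * (logp_proposal - logp_current)))
```

With `alpha = 1` the distribution is unchanged and every proposal is accepted; larger values concentrate mass on trajectories that are likely as a whole.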

2605.09536 2026-05-12 cs.CL cs.AI

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

Haoyang Zhou, Li Kong, Shijie Ren, Xiting Wang, Shuang Liang, Guowei Wang, Zhenxuan Pan

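AI summary Diffusion large language models (dLLMs) show promise for parallel text generation but face a trade-off between generation speed and accuracy. This paper proposes TAD, a temporal-aware trajectory self-distillation framework: a teacher model generates decoding trajectories, masked positions are partitioned by the number of decoding steps, and the two groups are trained with a cross-entropy loss and a KL divergence loss respectively, improving parallel efficiency while maintaining generation quality. Experiments show that TAD improves the balance between accuracy and parallelism, with marked gains on several metrics.

Details
English abstract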

Diffusion large language models (dLLMs) offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off, where increasing tokens per forward (TPF) often degrades generation quality. Existing acceleration methods often gain speed at the cost of accuracy. To address this limitation, we propose TAD, a Temporal-Aware trajectory self-Distillation framework. During data construction, we condition a teacher model on both the prompt and the ground-truth response to generate decoding trajectories, recording the intermediate masked states throughout the process. Based on how many decoding steps remain before each masked token is revealed, we partition masked positions into near and distant subsets. For near tokens, we train the student with a hard cross-entropy loss using the teacher trajectory tokens as labels, encouraging confident predictions for tokens that are about to be decoded. For distant tokens, we apply a soft KL divergence loss between the teacher and student token distributions, providing softer supervision and preserving future planning knowledge. This temporal-aware partition naturally gives rise to two deployment configurations: a Quality model that prioritizes accuracy and a Speed model that favors more aggressive acceleration. Experiments show that TAD consistently improves the accuracy-parallelism trade-off. On LLaDA, it raises average accuracy from 46.2% to 51.6% with the Quality model and average AUP from 46.2 to 257.1 with the Speed model. Our code is available at: https://github.com/BHmingyang/TAD
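
The near/distant split described in the abstract can be sketched as a per-position loss that switches between hard cross-entropy and a KL term. This is a minimal sketch, not the released implementation: the function name, the `near_threshold` cutoff, the KL direction (teacher relative to student), and the equal weighting of the two terms are all assumptions.

```python
import math
from typing import Sequence


def tad_style_loss(
    student_probs: Sequence[Sequence[float]],
    teacher_probs: Sequence[Sequence[float]],
    teacher_tokens: Sequence[int],
    steps_remaining: Sequence[int],
    near_threshold: int = 2,   # hypothetical cutoff, not taken from the abstract
) -> float:
    """Average loss over masked positions.

    Near positions (revealed within `near_threshold` steps) use hard
    cross-entropy against the teacher's trajectory token; distant positions
    use KL(teacher || student) over the full token distribution.
    """
    total = 0.0
    for s, t, tok, remaining in zip(
        student_probs, teacher_probs, teacher_tokens, steps_remaining
    ):
        if remaining <= near_threshold:
            total += -math.log(s[tok])               # hard label from teacher
        else:
            total += sum(                            # soft distribution match
                ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0.0
            )
    return total / len(steps_remaining)
```

A real training loop would compute the same quantities on logits with a tensor library; plain lists are used here only to keep the sketch dependency-free.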

2605.09533 2026-05-12 cs.CL cs.AI

Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

Jakob Sturm, Josef Pichlmeier, Christian Bernhard, Maka Karalashvili, Johannes Klepsch, Georg Groh, Andre Luckow

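AI summary This study evaluates Retrieval-Augmented Generation (RAG) and fine-tuning (FT) in industrial question-answering scenarios, focusing on their performance on datasets specific to the automotive industry. By extending the Cost-of-Pass framework, it jointly considers output quality and operational cost. The study finds that although premium models perform best out of the box, open-source models combined with RAG can reach comparable quality, and that RAG overall proves to be the more effective and lower-cost adaptation method.

Comments Accepted at AAAI 2026 Workshop on New Frontiers in Information Retrieval

Details
English abstract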

Large Language Models (LLMs) are increasingly employed in enterprise question-answering (QA) systems, requiring adaptation to domain-specific knowledge. Among the most prevalent methods for incorporating such knowledge are Retrieval-Augmented Generation (RAG) and fine-tuning (FT). Yet, from a cost-accuracy trade-off perspective, it remains unclear which approach best suits industry scenarios. This study examines the impact of RAG and FT on two closed datasets specific to the automotive industry, assessing answer quality and operational costs. We extend the Cost-of-Pass framework proposed by Erol et al. (arXiv:2504.13359) to jointly assess output quality, generation cost, and user interaction cost. Our findings reveal that while premium models perform best out of the box, open-source models can achieve comparable quality when enhanced with RAG. Overall, RAG emerges as the most effective and cost-efficient adaptation method for both closed- and open-source models.
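
For readers unfamiliar with cost-of-pass style metrics, the sketch below assumes the common reading of the idea: the expected cost of one attempt divided by the probability that an attempt is correct. How this study folds user interaction cost into the metric is not stated in the abstract, so the additive form in `extended_cost_of_pass`, both function names, and all parameter names are assumptions for illustration.

```python
def cost_of_pass(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to obtain one correct answer (assumed definition)."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must lie in [0, 1]")
    if success_rate == 0.0:
        return float("inf")          # a model that never succeeds has no finite cost
    return cost_per_attempt / success_rate


def extended_cost_of_pass(
    generation_cost: float,
    interaction_cost: float,         # hypothetical per-attempt user cost
    success_rate: float,
) -> float:
    """Same ratio with an extra per-attempt cost term (an assumed extension)."""
    return cost_of_pass(generation_cost + interaction_cost, success_rate)
```

Under this reading, a cheaper model with a lower success rate can still come out ahead of a premium model if its cost per correct answer is smaller.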

2605.09528 2026-05-12 cs.AI

Cplus2ASP: Computing Action Language C+ in Answer Set Programming

Joseph Babb, Joohyung Lee

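AI summary This paper presents Version 2 of the Cplus2ASP system, which implements the definite fragment of the action language C+. By building on modern answer set solving techniques, the system runs significantly faster while remaining compatible with the input language of the Causal Calculator Version 2. It combines several recent theoretical results, supports an incremental execution mode and a range of practical features, and provides extensible multi-modal translations for other action languages.

Details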
Journal ref
In Proceedings of the 12th International Conference on Logic Programming and Nonmonotonic Reasoning (LPNMR 2013), 122-134, 2013
English abstract

We present Version 2 of system Cplus2ASP, which implements the definite fragment of action language C+. Its input language is fully compatible with the language of the Causal Calculator Version 2, but the new system is significantly faster thanks to modern answer set solving techniques. The translation implemented in the system is a composition of several recent theoretical results. The system orchestrates a tool chain, consisting of f2lp, clingo, iclingo, and as2transition. Under the incremental execution mode, the system translates a C+ description into the input language of iclingo, exploiting its incremental grounding mechanism. The correctness of this execution is justified by the module theorem extended to programs with nested expressions. In addition, the input language of the system has many useful features, such as external atoms by means of Lua calls and the user interactive mode. The system supports extensible multi-modal translations for other action languages, such as B and BC, as well.

2605.09524 2026-05-12 cs.AI

Functional Stable Model Semantics and Answer Set Programming Modulo Theories

Michael Bartholomew, Joohyung Lee

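AI summary This paper studies the introduction of "intensional" functions into the framework of Answer Set Programming Modulo Theories (ASPMT) and examines the important role that the functional stable model semantics plays there. The authors note that functions in standard answer set programming are pre-defined, whereas the values of intensional functions can be described by other functions and predicates, which lets ASPMT handle complex constraints more flexibly. The work shows how "tight" ASPMT programs can be translated into SMT instances, extending the connection between answer set programming and satisfiability modulo theories.

Details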
Journal ref
In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), pages 718-724, 2013
English abstract

Recently there has been an increasing interest in incorporating "intensional" functions in answer set programming. Intensional functions are those whose values can be described by other functions and predicates, rather than being pre-defined as in standard answer set programming. We demonstrate that the functional stable model semantics plays an important role in the framework of "Answer Set Programming Modulo Theories (ASPMT)", a tight integration of answer set programming and satisfiability modulo theories under which existing integration approaches can be viewed as special cases where the role of functions is limited. We show that "tight" ASPMT programs can be translated into SMT instances, which is similar to the known relationship between ASP and SAT.

2605.09519 2026-05-12 cs.AI cs.LO

Weighted Rules under the Stable Model Semantics

Joohyung Lee, Yi Wang

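AI summary This paper proposes weighted rules under the stable model semantics, following the log-linear models of Markov Logic, to overcome the deterministic nature of the traditional stable model semantics. The approach can resolve inconsistencies in answer set programs, rank stable models, assign probabilities to stable models, and perform statistical inference. The paper also gives formal comparisons with related formalisms such as answer set programs, Markov Logic, ProbLog, and P-log.

Details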
Journal ref
In Proceedings of the 15th International Conference on Principles of Knowledge Representation and Reasoning (KR 2016), pages 145-154, 2016
English abstract

We introduce the concept of weighted rules under the stable model semantics following the log-linear models of Markov Logic. This provides versatile methods to overcome the deterministic nature of the stable model semantics, such as resolving inconsistencies in answer set programs, ranking stable models, associating probability to stable models, and applying statistical inference to computing weighted stable models. We also present formal comparisons with related formalisms, such as answer set programs, Markov Logic, ProbLog, and P-log.
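
The log-linear step borrowed from Markov Logic can be sketched as follows: each candidate stable model receives an unnormalised weight equal to the exponential of the total weight of the rules it satisfies, and probabilities follow by normalisation. The sketch assumes that a solver has already enumerated the candidate stable models and summed the satisfied rule weights for each; that precomputation, the function name, and the model names are assumptions, and the paper's full semantics is not reproduced here.

```python
import math
from typing import Dict


def stable_model_probabilities(weight_sums: Dict[str, float]) -> Dict[str, float]:
    """Map each candidate model's total satisfied-rule weight to a probability.

    Unnormalised weight: exp(total weight of satisfied rules).
    """
    m = max(weight_sums.values())                    # stabilise the exponentials
    unnorm = {name: math.exp(w - m) for name, w in weight_sums.items()}
    z = sum(unnorm.values())
    return {name: u / z for name, u in unnorm.items()}
```

Ranking stable models then amounts to sorting by these probabilities, and a larger weight gap between two models makes the preferred one exponentially more probable.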

2605.09518 2026-05-12 cs.LG

LLM-Driven Performance-Space Augmentation for Meta-Learning-Based Algorithm Selection

Darren Zhu, Daren Ler

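AI summary This study addresses the sparsity of meta-datasets in meta-learning for algorithm selection, which results from the scarcity of real-world datasets, by using a large language model to generate synthetic regression datasets that augment the meta-dataset. The language model is steered to generate data with specific performance characteristics, with an emphasis on improving coverage of key regions of the algorithm performance space. Experiments show that this performance-space augmentation strategy substantially improves meta-learner performance, with uniform sampling performing best, offering a new data augmentation approach for meta-learning-based algorithm selection.

Details
English abstract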

Meta-learning for algorithm selection relies on a meta-dataset in which each row corresponds to a supervised learning dataset described by meta-features and labelled with a target value that is associated with algorithm choice (typically, some function of algorithm performance). A persistent limitation is that the number of curated real-world datasets is small, resulting in sparse meta-datasets that constrain meta-learner generalisation. In this paper, we address this problem by augmenting the meta-dataset with synthetic regression datasets produced via a large language model (LLM), with generation steered toward target regions of a low-dimensionality performance space. In our experiments, we adopt a two-dimensional geometric setting defined by the cross-validated $R^2$ scores of two anchor algorithms, known as landmarkers. We compare two augmentation strategies: (1) uniform sampling, which distributes synthetic datasets across the performance space; and (2) margin-based sampling, which concentrates them near the decision boundary where landmarker preference is most ambiguous. Across 42 real-world UCI regression datasets and 730 synthetic datasets, both strategies substantially improve meta-learner performance over the unaugmented baseline under regression and multi-label evaluation formulations. However, uniform augmentation consistently outperforms margin-based augmentation, achieving a 17.47% relative reduction in Hamming loss, a 100.41% relative improvement in subset accuracy, and a +6.09% relative gain in pooled out-of-fold $R^2$. These results lead us to postulate a central thesis: the performance of algorithms resides on a low-dimensional performance manifold, whose reconstruction bias may be minimised by user-guided LLMs that seek to maximise uniform $ε$-cover, and consequently, lead to improved meta-learning for algorithm selection.
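
The multi-label metrics quoted above have standard definitions, sketched here in plain Python together with the relative-change arithmetic used to report them. The function names and the toy label matrices in the usage note are illustrative assumptions; none of the paper's data is reproduced.

```python
from typing import Sequence


def hamming_loss(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual labels that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of a lower-is-better metric, e.g. Hamming loss."""
    return (baseline - new) / baseline
```

For example, with true labels `[[1, 0], [0, 1]]` and predictions `[[1, 1], [0, 1]]`, one of four labels is wrong and one of two samples matches exactly. Subset accuracy is the stricter of the two metrics, which is one reason its relative change can be much larger than the change in Hamming loss.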

2605.09516 2026-05-12 cs.LG cs.AI

Mixture of Layers with Hybrid Attention

Ivan Ternovtsii, Yurii Bilak

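AI summary This paper proposes Mixture of Layers (MoL) with hybrid attention, a new architecture that revises the structure of conventional Mixture-of-Experts (MoE) transformers. Each layer uses multiple low-dimensional sub-blocks, with a routing mechanism selecting which blocks are active, to improve the model's efficiency and expressiveness. To address the insufficient attention coverage caused by sparse routing, the authors introduce hybrid attention, which combines global softmax attention with linear attention to capture both global context and local detail.

Details
English abstract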

Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
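
Top-k block routing can be sketched for a single token as: score every thin block, keep the k highest-scoring ones, and mix their outputs with gate weights. The abstract does not say how the gates are normalised, so the softmax over the selected logits, the function names, and the toy shapes are assumptions; the learned down/up projections and the hybrid attention are omitted.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (block_index, gate_weight) for the k highest-scoring blocks.

    Gate weights are a softmax over the selected logits only (assumed).
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    m = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - m) for i in chosen]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(chosen, exps)]


def mix_block_outputs(
    block_outputs: List[List[float]], routes: List[Tuple[int, float]]
) -> List[float]:
    """Gate-weighted sum of the outputs of the selected thin blocks."""
    dim = len(block_outputs[0])
    return [sum(w * block_outputs[i][d] for i, w in routes) for d in range(dim)]
```

Only the selected blocks need to be evaluated for a given token, which is where the compute saving of sparse routing comes from, and also why each block sees fewer tokens as the number of blocks grows.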

2605.09515 2026-05-12 cs.AI

A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models

Djamel Bouchaffra

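AI summary This paper studies higher-order synergy among heads in the multi-head attention of large language models. It proposes an analysis framework based on the Game Theoretic Free Energy Principle (GTFEP) that treats attention heads as rational agents and explains their collective behavior through variational free energy minimization. The study finds that third-order interaction information among heads is generally negative, revealing higher-order redundancy in the model; the pruning method derived from this finding substantially reduces computational cost while keeping performance largely unchanged.

Comments This manuscript has been submitted to Neural Networks

Details
English abstract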

Large language models rely on multihead attention, but interactions among heads remain poorly understood. We apply the Game Theoretic Free Energy Principle (GTFEP), a framework that casts multiagent systems as distributed variational inference, to analyze attention heads as bounded rational agents. According to GTFEP, each head minimizes its variational free energy, and collective behavior follows a Gibbs distribution over coalition structures whose energy is decomposed into Harsanyi dividends. Using a tractable approximation (uniform prior, deterministic dynamics), coalition free energy reduces to joint Shannon entropy of discretized head outputs (argmax key index). Pairwise dividends become mutual information (nonnegative), while triple dividends correspond to interaction information and can be negative. On BERT, GPT2, and Llama with GSM8K, triple dividends are consistently negative, revealing higher order redundancy. The Nash FEP correspondence guarantees that stationary points of collective free energy are epsilon Nash equilibria; thus, heads with negligible contribution can be pruned with minimal performance loss. Pruning heads with low marginal contribution accordingly reduces computational cost: for example, pruning 20% of heads in GPT2 reduces FLOPs by 18%, increases throughput by 22%, and raises perplexity only modestly (from 28.4 to 33.4 on GSM8K). Our work shows that GTFEP provides a principled foundation for analyzing and optimizing transformer architectures.
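
The entropy-based quantities named in the abstract can be estimated from discretized head outputs as sketched below. Each sample is assumed to be a tuple holding one discrete value per head (for instance an argmax key index); that data layout and the function names are assumptions. Sign conventions for three-way interaction information differ across the literature: this sketch computes co-information, which is positive for redundant copies and negative for XOR-style synergy. Since the abstract associates negative triple dividends with redundancy, the paper presumably uses the opposite sign, but that reading is an assumption.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple

Sample = Tuple[int, ...]


def joint_entropy(samples: Sequence[Sample], heads: Tuple[int, ...]) -> float:
    """Empirical joint Shannon entropy (bits) of the selected heads' outputs."""
    counts = Counter(tuple(s[h] for h in heads) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(samples: Sequence[Sample], i: int, j: int) -> float:
    """I(i; j) = H(i) + H(j) - H(i, j); always nonnegative."""
    return (
        joint_entropy(samples, (i,))
        + joint_entropy(samples, (j,))
        - joint_entropy(samples, (i, j))
    )


def co_information(samples: Sequence[Sample], i: int, j: int, k: int) -> float:
    """Inclusion-exclusion over entropies: singles - pairs + triple."""
    singles = sum(joint_entropy(samples, (h,)) for h in (i, j, k))
    pairs = sum(joint_entropy(samples, pair) for pair in combinations((i, j, k), 2))
    return singles - pairs + joint_entropy(samples, (i, j, k))
```

Empirical entropies from a small number of samples are biased, so estimates on real attention heads would need enough data per joint outcome to be trustworthy.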

2605.09514 2026-05-12 cs.LG

Doubly Robust Proxy Causal Learning with Neural Mean Embeddings

Bariscan Bozkurt, Alexandre Galashov, Dimitri Meunier, Zikai Shen, Arthur Gretton, Houssam Zenati

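AI summary This paper studies how to identify causal response functions with proxy causal learning in the presence of unobserved confounders. It proposes a doubly robust proxy causal learning framework based on neural mean embeddings, combining neural estimators of the treatment bridge and the outcome bridge, with the doubly robust correction obtained through a final regression stage. The method applies to continuous and structured treatments and can estimate population, heterogeneous, and conditional dose-response functions; it outperforms existing methods on synthetic and image datasets.

Details
English abstract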

Unobserved confounding prevents standard covariate adjustment from identifying causal response functions in observational studies. Proxy causal learning addresses this problem through bridge equations involving treatment- and outcome-inducing proxies, avoiding direct recovery of the latent confounder. Existing doubly robust proxy estimators combine outcome and treatment bridges, but typically rely on fixed kernels, sieves, or low-dimensional semiparametric models; existing neural proxy methods are more flexible, but are largely single-bridge estimators. We develop a neural doubly robust framework for proxy causal learning with continuous and structured treatments. Our method introduces a neural mean-embedding estimator for the treatment bridge, combines it with a neural outcome bridge, and estimates the doubly robust correction through a final regression stage. The framework covers population, heterogeneous, and conditional dose-response functions, yielding full response-curve estimators rather than binary-treatment effects. The algorithms use two stages for each bridge and history-aware updates of the final linear layers to stabilize stochastic multi-stage training. We prove consistency of the algorithms, showing that the doubly robust error is controlled by the final averaging and regression errors together with the smaller of the outcome- and treatment-side weak-norm bridge errors. Across synthetic and image-valued benchmarks, the proposed estimators outperform existing baselines and single-bridge neural estimators, showing the benefit of combining learned outcome and treatment bridges in a doubly robust construction. Our implementation is available at https://github.com/BariscanBozkurt/DRPCL-Neural-Mean-Embedding.

2605.09513 2026-05-12 cs.CV cs.RO

QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking

Mayank Anand, Mohammad Saqlain, Kyan Mahajan, Priya Shukla, Gora Chand Nandi, Andrew Melnik

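AI summary This paper proposes QueST, a semantic monitoring framework for long-horizon tracking that addresses the semantic drift caused by error accumulation in conventional frame-to-frame matching in complex scenes. QueST treats interaction-relevant entities as persistent semantic queries rather than transient point tracks and attends globally over spatio-temporal video features at every time step, providing stable semantic anchors. By adding lightweight 3D physical constraints, QueST suppresses drift under occlusion and similar conditions; experiments show that its tracking accuracy on long-horizon articulated sequences is markedly better than that of existing methods.

Details
English abstract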

Tracking points in videos is typically formulated as frame-to-frame correspondence, where each point is matched locally to the next frame. While this works over short horizons, errors accumulate under articulation, occlusion, and viewpoint change, leading to silent semantic drift that existing trackers cannot detect or correct. In this work, we revisit long-horizon tracking from a monitoring perspective and introduce QueST, a monitoring-by-design framework that treats interaction-relevant entities as persistent semantic queries rather than transient point tracks. Instead of local propagation, each query attends globally over spatio-temporal video features at every time step, providing a stable semantic anchor across time. We further constrain query trajectories with lightweight 3D physical grounding, using geometric plausibility to suppress unbounded drift under occlusion. We evaluate QueST on long-horizon articulated sequences from PartNet-Mobility in SAPIEN and compare against RAFT-3D, CoTracker, and TAP-Net. QueST substantially reduces terminal drift, achieving a 67.7% Absolute Point Error (APE) improvement over TAP-Net while better preserving identity over extended horizons. Our results show that embedding semantic monitoring directly into perception enables more reliable long-horizon tracking under distribution shift.