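arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.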

2604.23325 2026-04-28 cs.CV cs.AI eess.IV

EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence

Yahui Li, Yinfeng Yu, Liejun Wang, Shengjie Shen

Comments Main paper (10 pages). Accepted for publication by ICMR (International Conference on Multimedia Retrieval) 2026

Details

English abstract

Emotional talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.

2604.23324 2026-04-28 cs.LG cs.AI

Layer Embedding Deep Fusion Graph Neural Network

Taihua Xu, Genhao Tian, Jicong Fan, Xibei Yang, Qinghua Zhang, Yun Cui

2604.23323 2026-04-28 cs.CL cs.SD

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi, Hitesh Laxmichand Patel, Paul Li, Kyu J. Han, Tao Sheng, Sujith Ravi, Dan Roth

2604.23320 2026-04-28 cs.CV

KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

Zhaoxiang Liu, Zhicheng Ma, Kaikai Zhao, Kai Wang, Shiguo Lian

Comments Published in the journal "Image and Vision Computing"

Details

DOI: 10.1016/j.imavis.2026.105983

English abstract

Convolutional Neural Networks (CNNs) have been the dominant and effective approach for general computer vision tasks. Recently, Kolmogorov-Arnold networks (KANs), based on the Kolmogorov-Arnold representation theorem, have shown potential to replace Multi-Layer Perceptrons (MLPs) in deep learning. KANs, which use learnable nonlinear activations on edges and simple summation on nodes, offer fewer parameters and greater explainability compared to MLPs. However, there has been limited exploration of integrating the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision tasks. Existing attempts have merely replaced learnable activation functions with weights, undermining KANs' theoretical foundation and limiting their potential effectiveness. Additionally, the B-spline curves used in KANs suffer from computational inefficiency and a tendency to overfit. In this paper, we propose a novel Kolmogorov-Arnold Convolutional Layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution. This layer is more interpretable because it is grounded in an established mathematical theorem and its design is theoretically aligned with it. Building on the Kolmogorov-Arnold Convolutional Layer, we design an efficient network architecture called KAConvNet, which outperforms existing methods combining KAN and convolution, and achieves competitive performance compared to mainstream ViTs and CNNs. We believe that our work offers valuable insight into the field of artificial intelligence and will inspire the development of more innovative CNNs in the 2020s. The code is publicly available at https://github.com/UnicomAI/KAConvNet.
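
The "learnable nonlinear activations on edges and simple summation on nodes" idea can be illustrated with a minimal sketch. This is not the KAConvNet layer from the paper: the function name `kan_layer`, the Gaussian basis (swapped in here for B-splines purely for brevity), and all shapes are assumptions made for illustration.

```python
import numpy as np

def kan_layer(x, coeffs, centers, width=1.0):
    """Toy KAN-style dense layer (illustrative assumption, not the paper's layer).

    x:       (in_dim,) input vector
    coeffs:  (out_dim, in_dim, n_basis) learnable coefficients, one set per edge
    centers: (n_basis,) centres of the Gaussian basis functions
    """
    # Evaluate every basis function on every input coordinate: (in_dim, n_basis)
    basis = np.exp(-((x[:, None] - centers[None, :]) / width) ** 2)
    # Each edge (o, i) applies its own learnable 1-D function phi_{o,i}(x_i)
    edge_out = np.einsum("oib,ib->oi", coeffs, basis)
    # Nodes only sum their incoming edges; there are no separate weights
    return edge_out.sum(axis=1)

x = np.array([0.0, 1.0])
centers = np.array([0.0, 1.0])
coeffs = np.ones((3, 2, 2))
out = kan_layer(x, coeffs, centers)
```

In an MLP the nonlinearity is fixed and the edge weights are learned; here the per-edge function itself is what would be trained.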

2604.23318 2026-04-28 cs.CL cs.LG

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun, Xuanru Wang, Jiuchong Gao, Jinghua Hao, Renqing He, Weijie Yu

Details

English abstract

Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose Span-level Hidden state Enabled Advantage Reweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.
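
A rough sketch of the two ingredients the abstract names: the group-relative advantage shared by all tokens of a rollout, and a span-level rescaling of it. This is not the authors' SHEAR implementation. The helper names, the 1-D simplification of the Wasserstein distance (real hidden states are vectors), and the mean-normalised weighting are all assumptions for illustration.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: one scalar per rollout, shared by all its tokens."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def w1_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-size samples.

    Assumption: hidden states were already reduced to scalars; the abstract
    does not say how the distance is computed on full hidden-state vectors.
    """
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    return float(np.mean(np.abs(a - b)))

def reweight_tokens(advantage, span_distances, span_lengths):
    """Hypothetical reweighting: scale one rollout's advantage span by span."""
    d = np.asarray(span_distances, dtype=float)
    weights = d / (d.mean() + 1e-8)  # mean weight of about 1 (an assumed choice)
    # Every token in a span receives that span's scaled advantage
    return np.repeat(advantage * weights, span_lengths)

adv = grpo_advantages([1, 0, 1, 0])
token_adv = reweight_tokens(1.0, [1.0, 3.0], [2, 1])
```

With plain GRPO every token in a rollout would receive the same value; here tokens in the span with the larger distance receive a larger share of the update.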

2604.23314 2026-04-28 cs.CV

Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM

Jingxuan Kang, Ziqi Zhang, Shaoming Zheng, Shuang Li, Uday Bharat Patel, Alexander Harry Fitzhugh, Phillip Lung, Yusuf Kiberu, Nikesh Jathanna, Shahnaz Jamil-Copley, Bernhard Kainz, Chen Qin

Comments Accepted to CVPR 2026 (Findings Track)

2604.23312 2026-04-28 cs.LG cs.AI

GIFT: Global stabilisation via Intrinsic Fine Tuning

Rory Young, Nicolas Pugeault

2604.23309 2026-04-28 cs.CV cs.LG

STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning

Yanpei Gong, Beichen Zhang, Hao Wang, Zhaobo Qi, Xinyan Liu, Yuanrong Xu, Ruiyang Gao, Weigang Zhang

2604.23308 2026-04-28 cs.LG stat.ML

CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong

2604.23307 2026-04-28 cs.LG cs.AI

CombiMOTS: Combinatorial Multi-Objective Tree Search for Dual-Target Molecule Generation

Thibaud Southiratn, Bonil Koo, Yijingxiu Lu, Sun Kim

Comments Accepted as a poster at ICML 2025 (Main Track)

2604.23296 2026-04-28 cs.CL cs.AI

$\mathcal{S}^2$IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction

Bingfeng Chen, Chenjie Qiu, Yifeng Xie, Boyan Xu, Ruichu Cai, Zhifeng Hao

Comments Accepted to Findings of NAACL 2025

2604.23290 2026-04-28 cs.LG cs.AI cs.NI

An Analysis of Active Learning Algorithms using Real-World Crowd-sourced Text Annotations

Varun Totakura, Ankita Singh, Yushun Dong, Shayok Chakraborty

Comments The proposed dataset can be accessed at https://github.com/varuntotakura/al_rcta/. To appear in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2026)

Details

Journal ref: IEEE International Joint Conference on Neural Networks (IJCNN 2026)

English abstract

Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and tremendously reduce human annotation effort in inducing a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible, that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances, which cannot be guaranteed in real-world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect / noisy oracles. Existing research on active learning with noisy oracles typically simulates the oracles using machine learning models; however, real-world situations are much more challenging, and using ML models to simulate the annotation patterns may not appropriately capture the nuances of real-world annotation challenges. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from crowd-sourced workers through a crowd-sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses shed light on the performance of these techniques under real-world challenges, where annotators can provide incorrect labels, and can also refuse to provide labels. We hope this research will provide valuable insights that will be useful for the deployment of deep active learning systems in real-world applications. The obtained annotations can be accessed at https://github.com/varuntotakura/al_rcta/.
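
The setting described above, where a learner picks informative samples and an annotator may refuse to answer, can be sketched as follows. Entropy-based uncertainty sampling is used here only as a familiar example; the abstract does not list which 8 techniques were studied. The function names and the `annotate` callback are hypothetical.

```python
import numpy as np

def entropy(probs):
    """Predictive entropy per sample; probs has shape (n_samples, n_classes)."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_queries(probs, budget):
    """Indices of the `budget` most uncertain unlabeled samples."""
    return np.argsort(-entropy(probs))[:budget]

def collect_labels(indices, annotate):
    """Query a (hypothetical) annotator; `annotate(i)` returns None on refusal."""
    labeled = {}
    for i in indices:
        y = annotate(int(i))
        if y is not None:  # refused samples stay in the unlabeled pool
            labeled[int(i)] = y
    return labeled

probs = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
picked = select_queries(probs, budget=2)
labels = collect_labels(picked, lambda i: None if i == 2 else 1)
```

A returned label may still be wrong; handling that noise is a separate design choice the sketch leaves out.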

2604.23289 2026-04-28 cs.CV cs.AI cs.LG cs.MM

MetaErr: Towards Predicting Error Patterns in Deep Neural Networks

Varun Totakura, Shayok Chakraborty

Comments Accepted and presented at the IEEE International Conference on SMART MULTIMEDIA (ICSM 2025)

2604.23284 2026-04-28 cs.CL cs.AI

Au-M-ol: A Unified Model for Medical Audio and Language Understanding

Meizhu Liu, Nistha Mitra, Paul Li, Amine Abdaoui, Adam Ledyard, Tao Sheng

2604.23283 2026-04-28 cs.LG

Revisable by Design: A Theory of Streaming LLM Agent Execution

Zhiyuan Zhai, Ming Li, Xin Wang

2604.23281 2026-04-28 cs.LG cs.CV

Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

Long Jing, Zhixiong Yang, Yajun Zhang, Xinlong Feng

2604.23280 2026-04-28 cs.AI cs.CR

AI Identity: Standards, Gaps, and Research Directions for AI Agents

Takumi Otsuka, Kentaroh Toyoda, Alex Leung

2604.23278 2026-04-28 cs.AI

Active Inference: A method for Phenotyping Agency in AI systems?

Philip Wilson, Axel Constant, Mahault Albarracin, Nicolás Hinrichs, Jasmine Moore, Daniel Polani, Karl Friston

2604.23277 2026-04-28 cs.CL cs.AI

From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors

Yitian Zhou, Chaoning Zhang, Jiaquan Zhang, Zhenzhen Huang, Jinyu Guo, Sung-Ho Bae, Lik-Hang Lee, Caiyan Qin, Yang Yang

2604.23276 2026-04-28 cs.CV cs.AI cs.CL

Lightweight and Production-Ready PDF Visual Element Parsing

Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li

2604.23274 2026-04-28 cs.CV

SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation

Kaiwen Huang, Yi Zhou, Yizhe Zhang, Jingxiong Li, Tao Zhou

Comments This paper has been accepted by CVPR 2026

2604.23272 2026-04-28 cs.RO

Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

Jimin Lee, Huiwon Jang, Myungkyu Koo, Jungwoo Park, Jinwoo Shin

Comments 14 pages, 8 figures, Project page: https://jiminlx.github.io/MoSS

2604.23271 2026-04-28 cs.CV

A Hierarchical Ensemble Inference Pipeline for Robust White Blood Cell Classification Under Domain Shifts

Ruyi Dai, Tingkwong Ng, Hao Chen

2604.23270 2026-04-28 cs.AI

CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

Shuxu Chen, Yitian Zhou, Jiaquan Zhang, Haoyu Bian, Aming Wu, Sungyoung Lee, Chaoning Zhang, Hyundong Shin

2604.23264 2026-04-28 cs.CV

MotionHiFlow: Text-to-motion via hierarchical flow matching

Heng Li, Xiaotong Lin, Ling-An Zeng, Yulei Kang, Shuai Li, Jian-Fang Hu

Comments Accepted to CVPR 2026

2604.23263 2026-04-28 cs.CL cs.AI

Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt

Zhenzhen Huang, Chaoning Zhang, Fachrina Dewi Puspitasari, Jiaquan Zhang, Yitian Zhou, Shuxu Chen, Yang Yang

2604.23249 2026-04-28 cs.RO

BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances

Yifan Han, Jianxiang Liu, Haoyu Zhang, Yuqi Gu, Yunhan Guo, Wenzhao Lian

2604.23247 2026-04-28 cs.CV

Micro-Expression-Aware Avatar Fingerprinting via Inter-Frame Feature Differencing

Masoumeh Chapariniya, Jean-Marc Odobez, Volker Dellwo, Teodora Vuković

Comments Accepted to TrustFA Workshop, IEEE FG 2026

2604.23241 2026-04-28 cs.SD cs.CL

Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

Khalid Zaman, Masashi Unoki

2604.23239 2026-04-28 cs.AI

AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting

Xudong Jiang, Mingshan Loo, Hanchen Yang, Wengen Li, Mingrui Zhang, Yichao Zhang, Jihong Guan, Shuigeng Zhou