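arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.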

2603.01465 2026-03-03 cs.RO cs.AI

Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

Yipeng Chen, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen

Abstract

Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.

2603.01464 2026-03-03 cs.AI cs.CL

ProtRLSearch: A Multi-Round Multimodal Protein Search Agent with Large Language Models Trained via Reinforcement Learning

Congying Liu, Taihao Li, Ming Huang, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui

2603.01461 2026-03-03 cs.CV

UltraStar: Semantic-Aware Star Graph Modeling for Echocardiography Navigation

Teng Wang, Haojun Jiang, Chenxi Li, Diwen Wang, Yihang Tang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

2603.01454 2026-03-03 cs.CV cs.AI

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang, Xiao Yang, Jianyu Wang, Siqi Cai

2603.01452 2026-03-03 cs.AI cs.RO

Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning

Shaohuai Liu, Weirui Ye, Yilun Du, Le Xie

2603.01450 2026-03-03 cs.CV

Deepfake Forensics Adapter: A Dual-Stream Network for Generalizable Deepfake Detection

Jianfeng Liao, Yichen Wei, Raymond Chan Ching Bon, Shulan Wang, Kam-Pui Chow, Kwok-Yan Lam

Comments: Accepted at ICDF2C 2025

2603.01441 2026-03-03 cs.CV cs.RO

Unifying Language-Action Understanding and Generation for Autonomous Driving

Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen

2603.01438 2026-03-03 cs.CL cs.AI

Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents

Yuxin Liu, Mingye Zhu, Siyuan Liu, Bo Hu, Lei Zhang

Comments: ICLR 2026

2603.01437 2026-03-03 cs.AI

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

Kyle Cox, Darius Kianersi, Adrià Garriga-Alonso

2603.01436 2026-03-03 cs.RO

PhysGraph: Physically-Grounded Graph-Transformer Policies for Bimanual Dexterous Hand-Tool-Object Manipulation

Runfa Blark Li, David Kim, Xinshuang Liu, Keito Suzuki, Dwait Bhatt, Nikola Raicevic, Xin Lin, Ki Myung Brian Lee, Nikolay Atanasov, Truong Nguyen

2603.01431 2026-03-03 cs.CV

SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation

Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding, Xin Gu, Bin Fan, Shiming Xiang

Comments: Accepted by Machine Intelligence Research

2603.01426 2026-03-03 cs.CL

Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty

2603.01425 2026-03-03 cs.CL cs.IR

LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, Zhicheng Dou

Comments: Under Review

2603.01423 2026-03-03 cs.CL

Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction

Jiyoon Myung

Comments: Accepted at the Workshop on Assessing and Improving Reliability of Foundation Models in the Real World (AAAI 2026)

2603.01418 2026-03-03 cs.CV cs.MM cs.SD

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang

Comments: Accepted at CVPR 2026 (Findings Track)

2603.01416 2026-03-03 cs.AI

Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Zhixiang Wang, Jingxuan Xu, Dajun Chen, Yunfang Wu, Wei Jiang, Yong Li

2603.01414 2026-03-03 cs.RO

Jailbreaking Embodied LLMs via Action-level Manipulation

Xinyu Huang, Qiang Yang, Leming Shen, Zijing Ma, Yuanqing Zheng

Comments: This paper has been officially accepted for ACM SenSys 2026

2603.01409 2026-03-03 cs.AI cs.LG cs.SE

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Sicheng Zhu, Jiajun Wang, Jiawei Ai, Xin Li

Comments: Preprint. 17 pages

2603.01407 2026-03-03 cs.AI cs.MA cs.SI

The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition

Saad Alqithami

Comments: Extended version of the AAMAS 2026 paper with the same title

DOI: 10.65109/CHZG9392
Journal ref: 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25-29, 2026, IFAAMAS, 18 pages

Abstract

Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer-Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective-aware cognition. OSL is a finite complete lattice where each element represents a unique observer-situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph-based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption-based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective-aware autonomous agents.

2603.01385 2026-03-03 cs.CL cs.AI

Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning

Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, Chuan Shi

Comments: Accepted by WWW 2026

2603.01382 2026-03-03 cs.SD cs.CL

End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation

Minghui Wu, Haitao Tang, Jiahuan Fan, Ruizhi Liao, Yanyong Zhang

Comments: Submitted to 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

DOI: 10.1109/APSIPAASC65261.2025.11248992
Journal ref: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1092-1097

Abstract

Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: https://wflrz123.github.io/

2603.01376 2026-03-03 cs.LG stat.ML

3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs

Mehdi Makni, Xiang Meng, Rahul Mazumder

Comments: The Thirty-ninth Annual Conference on Neural Information Processing Systems

2603.01375 2026-03-03 cs.AI cs.LG

Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation

Chenxing Wei, Hong Wang, Ying He, Zhongxiang Dai, Bo Jiang, F. Richard Yu, Yao Shu

2603.01369 2026-03-03 cs.SD cs.CL

DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement

Minghui Wu, Xueling Liu, Jiahuan Fan, Haitao Tang, Yanyong Zhang, Yue Zhang

Comments: Submitted to 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

2603.01365 2026-03-03 cs.LG cs.AI cs.RO cs.SY eess.SY

Align and Filter: Improving Performance in Asynchronous On-Policy RL

Homayoun Honari, Roger Creus Castanyer, Michael Przystupa, Michael Noukhovitch, Pablo Samuel Castro, Glen Berseth

2603.01363 2026-03-03 cs.LG cs.DC

Fed-GAME: Personalized Federated Learning with Graph Attention Mixture-of-Experts For Time-Series Forecasting

Yi Li, Han Liu, Mingfeng Fan, Guo Chen, Chaojie Li, Biplab Sikdar

2603.01361 2026-03-03 cs.CV cs.AI

MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Zilong Zhao, Zhengming Ding, Pei Niu, Wenhao Sun, Feng Guo

Comments: Accepted by CVPR 2026

2603.01357 2026-03-03 cs.AI

ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Zidi Xiu, David Q. Sun, Kevin Cheng, Maitrik Patel, Josh Date, Yizhe Zhang, Jiarui Lu, Omar Attia, Raviteja Vemulapalli, Oncel Tuzel, Meng Cao, Samy Bengio

2603.01353 2026-03-03 cs.LG cs.AI cs.CL

Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain

Yuma Okochi, Fabio Milentiansen Sim, Tomoyasu Okada

Comments: 8 pages, 2 figures. Japanese version published in NLP2026

2603.01348 2026-03-03 cs.LG cs.AI

UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification

Yessin Moakher, Youssef Attia El Hili, Vasilii Feofanov