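arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and other fields.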

2512.19396 2026-04-13 cs.AI

EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang

Comments: CVPR 2026 Findings

Abstract

Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital "amnesia" results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable "memories". Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.
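
The retrieve-then-inject idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how relevance is scored or how the prompt is laid out, so everything here is an assumption: the bag-of-words cosine similarity, the `retrieve` and `build_prompt` helpers, and the `MEMORY` entries are hypothetical and are not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Lowercased bag-of-words vector for a task description."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(memory, task, k=2):
    """Return the k stored trajectories whose task text is most similar to `task`."""
    query = bow(task)
    ranked = sorted(memory, key=lambda m: cosine(query, bow(m["task"])), reverse=True)
    return ranked[:k]


def build_prompt(task, memories):
    """Place retrieved trajectories ahead of the new task as in-context guidance."""
    lines = ["Relevant past trajectories:"]
    for m in memories:
        lines.append(f"- Task: {m['task']}")
        lines.append(f"  Steps: {' -> '.join(m['steps'])}")
    lines.append(f"New task: {task}")
    return "\n".join(lines)


# Hypothetical memory of successful trajectories (illustrative data only).
MEMORY = [
    {"task": "set an alarm for 7 am",
     "steps": ["open Clock", "tap Alarm", "tap +", "set 07:00", "save"]},
    {"task": "send a text message to Alice",
     "steps": ["open Messages", "tap New", "type Alice", "type text", "send"]},
    {"task": "turn on wifi",
     "steps": ["open Settings", "tap Network", "toggle Wi-Fi"]},
]

if __name__ == "__main__":
    new_task = "set an alarm for 6 am"
    print(build_prompt(new_task, retrieve(MEMORY, new_task, k=1)))
```

A real system would more plausibly use learned embeddings for retrieval; the word-overlap score is used here only to keep the sketch dependency-free.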

2512.19099 2026-04-13 cs.LG

Dual Model Deep Learning for Alzheimer Prognostication

Alireza Moayedikia, Sara Fin, Uffe Kock Wiil

DOI: 10.1016/j.compbiomed.2026.111672

Abstract

Disease modifying therapies for Alzheimer's disease demand precise timing decisions, yet current predictive models require longitudinal observations and provide no uncertainty quantification, rendering them impractical at the critical first visit when treatment decisions must be made. We developed PROGRESS (PRognostic Generalization from REsting Static Signatures), a dual-model deep learning framework that transforms a single baseline cerebrospinal fluid biomarker assessment into actionable prognostic estimates without requiring prior clinical history. The framework addresses two complementary clinical questions: a probabilistic trajectory network predicts individualized cognitive decline with calibrated uncertainty bounds achieving near-nominal coverage, enabling honest prognostic communication; and a deep survival model estimates time to conversion from mild cognitive impairment to dementia. Using data from over 3,000 participants across 43 Alzheimer's Disease Research Centers in the National Alzheimer's Coordinating Center database, PROGRESS substantially outperforms Cox proportional hazards, Random Survival Forests, and gradient boosting methods for survival prediction. Risk stratification identifies patient groups with seven-fold differences in conversion rates, enabling clinically meaningful treatment prioritization. Leave-one-center-out validation demonstrates robust generalizability, with survival discrimination remaining strong across held-out sites despite heterogeneous measurement conditions spanning four decades of assay technologies. By combining superior survival prediction with trustworthy trajectory uncertainty quantification, PROGRESS bridges the gap between biomarker measurement and personalized clinical decision-making.
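
"Near-nominal coverage" means that the fraction of observed outcomes falling inside the predicted uncertainty bounds is close to the stated interval level (for example, about 95% of outcomes inside 95% intervals). A minimal sketch of that check follows; the `empirical_coverage` helper and the numbers are hypothetical and are not taken from the paper.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of observed values that fall inside [lower, upper]."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


# Hypothetical observed outcomes and predicted interval bounds (illustrative only).
observed = [1.0, 2.5, 0.5, 3.0, 4.2]
lower = [0.5, 2.0, 1.0, 2.0, 3.0]
upper = [1.5, 3.0, 2.0, 4.0, 5.0]

# Compare this value against the nominal level the intervals were built for.
coverage = empirical_coverage(observed, lower, upper)

if __name__ == "__main__":
    print(f"empirical coverage: {coverage:.2f}")
```

Coverage well below the nominal level indicates overconfident intervals, and coverage well above it indicates intervals that are wider than necessary.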

2512.17425 2026-04-13 cs.RO

The Impact of Gait Pattern Personalization on the Perception of Rigid Robotic Guidance: A Pilot User Experience Evaluation

Beatrice Luciani, Katherine Lin Poggensee, Heike Vallery, Alex van den Berg, Severin David Woernle, Mostafa Mogharabi, Stefano Dalla Gasperina, Laura Marchal-Crespo

Abstract

Exoskeletons modulate human movement across diverse applications, from performance augmentation to daily-life assistance. These systems often enforce specific kinematic patterns to mitigate injury risks and motivate users to keep moving despite diminished capacity. However, little is known about users' perception of such robot-imposed guidance, especially when personalized to the uniqueness of individual human walk. Given the usually substantial computational cost for personalization, understanding its subjective impact is essential to justify its implementation over standard patterns. Ten unimpaired participants completed a within-subject experiment in a multi-planar treadmill-based exoskeleton that enforced three different gait patterns: personalized, standard, and a randomly selected pattern from a publicly available database. Personalization was achieved using a data-driven framework that predicts hip, knee, and pelvis trajectories from walking speed, anthropometric, and demographic data. The standard pattern was obtained by averaging gait patterns from the aforementioned database. After each condition, participants rated enjoyment, comfort, and perceived naturalness. Knee joint interaction forces were also recorded. Subjective ratings revealed no significant differences among patterns, despite all trajectories being executed with high accuracy. However, gait patterns experienced last were rated as significantly more comfortable and natural, indicating adaptation to the system. Higher interaction forces were observed only for the random vs. standard pattern. Personalizing gait kinematics had minimal short-term influence on user experience relative to the dominant effect of adaptation to the exoskeleton. These findings highlight the importance of integrating subjective feedback and accounting for user adaptation when designing personalized robot controllers.

2512.12641 2026-04-13 cs.CL

Which Pieces Does Unigram Tokenization Really Need?

Sander Land, Yuval Pinter

Comments: 10 pages, 1 figure. For associated code, see https://github.com/sanderland/script_tok

2512.11179 2026-04-13 cs.LG cs.MA

Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning

Wei Duan, Jie Lu, En Yu, Junyu Xuan

Comments: Accepted by AAMAS 2026 (oral) with appendix

2512.07833 2026-04-13 cs.CV cs.AI cs.LG

Relational Visual Similarity

Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li

Comments: CVPR 2026 camera-ready; Project page, data, and code: https://thaoshibe.github.io/relsim

2512.06838 2026-04-13 cs.CV

SparseCoop: Cooperative Perception with Kinematic-Grounded Queries

Jiahao Wang, Zhongwei Jiang, Wenchao Sun, Jiaru Zhong, Haibao Yu, Yuner Zhang, Chenyang Lu, Chuang Zhang, Lei He, Shaobing Xu, Jianqiang Wang

Comments: Accepted by AAAI 2026

2512.04292 2026-04-13 cs.CL

SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh

Comments: Accepted in The IEEE International Workshop on Large Language Models in Finance, Dec 8-11, Macau, China, 2025, Preprint Copy

2512.04175 2026-04-13 cs.CV

Beyond Flicker: Detecting Kinematic Inconsistencies for Generalizable Deepfake Video Detection

Alejandro Cobo, Roberto Valle, José Miguel Buenaposada, Luis Baumela

2512.04072 2026-04-13 cs.CL cs.AI

SkillFactory: Self-Distillation For Learning Cognitive Behaviors

Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett

Comments: Published at ICLR 2026; code at https://github.com/Zayne-sprague/SkillFactory

2512.03370 2026-04-13 cs.CV

ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding

Lingjun Zhao, Yandong Luo, James Hays, Lu Gan

2512.02826 2026-04-13 cs.LG cs.AI

From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen

Comments: Accepted to CVPR 2026 (Findings track); 16 pages, 17 figures

2512.02231 2026-04-13 cs.CV cs.AI cs.LG

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

Comments: Findings of CVPR 2026

2511.23369 2026-04-13 cs.CV cs.RO

SimScale: Learning to Drive via Real-World Simulation at Scale

Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li

Comments: CVPR 2026 Oral. Project page: https://opendrivelab.com/SimScale

2511.23071 2026-04-13 cs.CV cs.AI cs.CL

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Comments: Accepted in International Journal on Document Analysis and Recognition (IJDAR)

2511.20068 2026-04-13 cs.CV

PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer

Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Findings Track (CVPRF 2026)

2511.19704 2026-04-13 cs.CV

RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang, Sebastian Scherer

Comments: Accepted to CVPR'26 Findings. Code at https://radseg-ovss.github.io/

2511.17687 2026-04-13 cs.LG cs.NE

Boosting Brain-inspired Path Integration Efficiency via Learning-based Replication of Continuous Attractor Neurodynamics

Zhangyu Ge, Xu He, Lingfei Mo, Xiaolin Meng, Wenxuan Yin, Youdong Zhang, Lansong Jiang, Fengyuan Liu

2511.16136 2026-04-13 cs.CV

How Noise Benefits AI-generated Image Detection

Ziqiang Li, Jiazhen Yan, Fan Wang, Kai Zeng, Zhangjie Fu

2511.15578 2026-04-13 cs.CV

AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning

Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar

Comments: Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)

DOI: 10.1109/BigData66926.2025.11401853
Journal ref: 2025 IEEE International Conference on Big Data (BigData), Macau, China

Abstract

With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.

2511.14603 2026-04-13 cs.CL cs.AI cs.LG

A Method for Characterizing Disease Progression from Acute Kidney Injury to Chronic Kidney Disease

Yilu Fang, Jordan G. Nestor, Casey N. Ta, Jerard Z. Kneifati-Hayek, Chunhua Weng

2511.13053 2026-04-13 cs.LG cs.NE

Self-Organization and Spectral Mechanism of Attractor Landscapes in High-Capacity Kernel Hopfield Networks

Akira Tamamori

Comments: 16 pages, 8 figures; accepted to NOLTA, IEICE

2511.09829 2026-04-13 cs.AI

Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems

Jiahuan Long, Tingsong Jiang, Hanqing Liu, Chao Ma, Weien Zhou, Yang Yang, Wen Yao

Comments: Accepted by CVPR 2026 (Highlight)

2511.09324 2026-04-13 cs.LG

MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

Mohsen Amiri, Konstantin Avrachenkov, Ibtihal El Mimouni, Sindri Magnússon

2511.08947 2026-04-13 cs.AI

AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting

Xiaohan Zhang, Tian Gao, Mingyue Cheng, Bokai Pan, Ze Guo, Yaguo Liu, Xiaoyu Tao, Qi Liu

2511.08798 2026-04-13 cs.CL cs.AI

Structured Uncertainty guided Clarification for LLM Agents

Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha

2511.06756 2026-04-13 cs.LG

Dual Mamba for Node-Specific Representation Learning: Tackling Over-Smoothing with Selective State Space Modeling

Xin He, Yili Wang, Yiwei Dai, Xin Wang

Comments: Accepted by The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)

2511.05168 2026-04-13 cs.CV cs.LG

Another BRIXEL in the Wall: Towards Cheaper Dense Features

Alexander Lappe, Martin A. Giese

2510.26641 2026-04-13 cs.CV

All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Mahlagha Fazeli, Abolfazl Razi

DOI: 10.1016/j.imavis.2026.105944

Abstract

Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.

2510.24718 2026-04-13 cs.CV cs.LG

Generative View Stitching

Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann

Comments: Published at ICLR 2026. Camera-ready Submission. Project website: https://andrewsonga.github.io/gvs