arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2602.21203 2026-02-25 cs.RO cs.CV cs.LG

Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics

Abdulaziz Almuzairee, Henrik I. Christensen

Comments For website and code, see https://aalmuzairee.github.io/squint

详情

英文摘要

Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

URL PDF HTML ☆

赞 0 踩 0

2602.21196 2026-02-25 cs.LG cs.DC

Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin

Comments 14 pages, 6 figures

2602.21195 2026-02-25 cs.CV

Region of Interest Segmentation and Morphological Analysis for Membranes in Cryo-Electron Tomography

Xingyi Cheng, Julien Maufront, Aurélie Di Cicco, Daniël M. Pelt, Manuela Dezi, Daniel Lévy

2602.21193 2026-02-25 cs.CL

On Data Engineering for Scaling LLM Terminal Capabilities

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping

2602.21191 2026-02-25 cs.LG cs.DS stat.ML

Statistical Query Lower Bounds for Smoothed Agnostic Learning

Ilias Diakonikolas, Daniel M. Kane

2602.21188 2026-02-25 cs.CV

Human Video Generation from a Single Image with 3D Pose and View Control

Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani

2602.21186 2026-02-25 cs.CV

Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang

2602.21183 2026-02-25 cs.SD

823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio

Ratnajit Dhar, Arpita Mallik

2602.21179 2026-02-25 cs.CV

Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision

Nicolás Gaggion, Maria J. Ledesma-Carbayo, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante

详情

英文摘要

Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.

URL PDF HTML ☆

赞 0 踩 0

2602.21178 2026-02-25 cs.CV cs.AI

XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence

Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser

Comments Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences

2602.21175 2026-02-25 cs.CV

Seeing Through Words: Controlling Visual Retrieval Quality with Language Models

Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu

2602.21174 2026-02-25 cs.RO cs.AI

Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids

Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott

Comments 12 pages, 9 figures, 4 tables, accepted to RSS 2025, code is open-source: https://github.com/ethz-asl/wavestar

2602.21168 2026-02-25 cs.LG

Sequential Counterfactual Inference for Temporal Clinical Data: Addressing the Time Traveler Dilemma

Jingya Cheng, Alaleh Azhir, Jiazi Tian, Hossein Estiri

2602.21165 2026-02-25 cs.CL cs.AI

PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

2602.21161 2026-02-25 cs.RO

ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Guangming Wang, Qizhen Ying, Yixiong Jing, Olaf Wysocki, Brian Sheil

Comments 8 pages, 5 figures, accepted by the 2026 IEEE International Conference on Robotics and Automation

2602.21154 2026-02-25 cs.AI

CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin, Yuxin Liu, Yueming Jin

Comments Accepted by ICASSP 2026

2602.21153 2026-02-25 cs.CV

SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement

Bastien Gimbert

Comments 11 pages, 17 figures. Code available at https://github.com/BastienGimbert/SpriteToMesh

2602.21148 2026-02-25 cs.RO cs.MA

A Micro-Macro Model of Encounter-Driven Information Diffusion in Robot Swarms

Davis S. Catherman, Carlo Pinciroli

Comments 10 pages, 5 figures, published at ANTS 2026

2602.21143 2026-02-25 cs.AI cs.CL cs.IR cs.LG

A Benchmark for Deep Information Synthesis

Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger, Aysim Toker, Roy Miles, Andreea-Maria Oncescu, Jasivan Alex Sivakumar, Philipp Borchert, Ismail Elezi, Meiru Zhang, Ka Yiu Lee, Guchun Zhang, Jun Wang, Gerasimos Lampouras

Comments Accepted at ICLR 2026

2602.21142 2026-02-25 cs.CV cs.LG

LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis

Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru

Comments Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026

2602.21137 2026-02-25 cs.CV

UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

详情

英文摘要

Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.

URL PDF HTML ☆

赞 0 踩 0

2602.21133 2026-02-25 cs.LG stat.ML

SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models

Alessandro Londei, Denise Lanzieri, Matteo Benati

2602.21119 2026-02-25 cs.RO cs.AI

Cooperative-Competitive Team Play of Real-World Craft Robots

Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang, Yufeng Zhang, Cheng Zhou, Zhengyou Zhang, Lei Han

Comments Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), Vienna, Austria

2602.21104 2026-02-25 cs.LG cs.DS

Ski Rental with Distributional Predictions of Unknown Quality

Qiming Cui, Michael Dinitz

2602.21098 2026-02-25 cs.CV

Optimizing Occupancy Sensor Placement in Smart Environments

Hao Lu, Richard J. Radke

2602.21092 2026-02-25 cs.LG cs.AI

Probing Graph Neural Network Activation Patterns Through Graph Topology

Floriano Tori, Lorenzo Bini, Marco Sorbi, Stéphane Marchand-Maillet, Vincent Ginis

2602.21082 2026-02-25 cs.CL

Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification

Vishal Patil, Shree Vaishnavi Bacha, Revanth Yamani, Yidan Sun, Mayank Kejriwal

2602.21081 2026-02-25 cs.LG eess.SP

Scaling Vision Transformers: Evaluating DeepSpeed for Image-Centric Workloads

Huy Trinh, Rebecca Ma, Zeqi Yu, Tahsin Reza

2602.21078 2026-02-25 cs.LG cs.CV

ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning

Duowen Chen, Yan Wang

Comments CVPR 2026. code: https://github.com/DuowenC/FSSLlib

2602.21072 2026-02-25 cs.LG cs.AI cs.RO

Localized Dynamics-Aware Domain Adaption for Off-Dynamics Offline Reinforcement Learning

Zhangjie Xia, Yu Yang, Pan Xu

Comments 33 pages, 9 figures, 11 tables