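arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.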

2509.19115 2026-03-17 cs.CV

Track-On2: Enhancing Online Point Tracking with Memory

Görkay Aydemir, Weidi Xie, Fatma Güney

Comments TPAMI 2026

DOI: 10.1109/TPAMI.2026.3675257

Abstract

In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e. tracking points frame-by-frame, making it suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. Through comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2

2509.18802 2026-03-17 cs.CV

Surgical Video Understanding with Label Interpolation

Garam Kim, Tae Kyeong Jeong, Juyoun Park

Comments Accepted to ICRA 2026. Video: https://youtu.be/24LlhqvgFBU | Dataset: https://huggingface.co/datasets/KIST-HARILAB/MISAW-Seg

2509.17450 2026-03-17 cs.RO

Learning Dexterous Manipulation with Quantized Hand State

Ying Feng, Hongjie Fang, Yinong He, Jingjing Chen, Chenxi Wang, Zihao He, Ruonan Liu, Cewu Lu

Comments Accepted by ICRA 2026

2509.17141 2026-03-17 cs.RO

History-Aware Visuomotor Policy Learning via Point Tracking

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu

Comments Accepted by ICRA 2026

2509.15540 2026-03-17 cs.CV cs.CL

Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues

Wei Chen, Tongguan Wang, Feiyue Xue, Junkai Li, Hui Liu, Ying Sha

Comments Accepted by WWW 2026

2509.15107 2026-03-17 cs.LG cs.DL

Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges

Amy Rafferty, Ajitha Rajan

2509.14769 2026-03-17 cs.CV cs.CL

Frame Sampling Strategies Matter: A Benchmark for small vision language models

Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi

2509.08731 2026-03-17 cs.LG stat.ML

Generating solution paths of Markovian stochastic differential equations using diffusion models

Xuefeng Gao, Jiale Zha, Xun Yu Zhou

2508.21556 2026-03-17 cs.CV

ECHO: Ego-Centric modeling of Human-Object interactions

Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll

2508.17130 2026-03-17 cs.CV

Structural Damage Detection Using AI Super Resolution and Visual Language Model

Catherine Hoier, Khandaker Mamun Ahmed

2508.13587 2026-03-17 cs.AI cs.CV

Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma

Comments Accepted to ICLR 2026

Abstract

While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring deep understanding of information-rich images and structured output generation remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to produce structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies tailored to structured outputs. In this paper, we systematically investigate the performance plateau of SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. We construct the largest training corpus to date, with 3 million chart-code pairs curated from real-world tables in arXiv papers, addressing the limitations of previous synthetic datasets. Despite achieving state-of-the-art performance, our experiments show that simply increasing SFT data eventually leads to diminishing improvements. To break this plateau, MSRL employs a multi-granularity reward system that integrates both textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details, while at the visual level, a model-based reward assesses the structural similarity between rendered code and ground-truth charts. We implement a two-stage curriculum training strategy, first optimizing the model with textual rewards and then incorporating visual signals for further enhancement. Experimental results demonstrate that MSRL substantially breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks, respectively. Notably, our method outperforms all existing approaches in the chart domain and achieves competitive results with advanced closed-source models.

2508.07657 2026-03-17 cs.RO

MoRoCo: An Online Topology-Adaptive Framework for Multi-Operator Multi-Robot Coordination under Restricted Communication

Zhuoli Tian, Yanze Bao, Yuyang Zhang, Meng Guo

Comments 20 pages, 19 figures. Submitted to IEEE Transactions on Robotics (TRO)

2508.06351 2026-03-17 cs.CV math.OC

An Implemention of Two-Phase Image Segmentation using the Split Bregman Method

Olakunle S. Abawonse, Günay Doğan

Comments 15 pages

2508.04166 2026-03-17 cs.CV cs.CL

STEMTOX: From Social Tags to Fine-Grained Toxic Meme Detection via Entropy-Guided Multi-Task Learning

Subhankar Swain, Naquee Rizwan, Vishwa Gangadhar S, Nayandeep Deb, Animesh Mukherjee

2507.16443 2026-03-17 cs.CV

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie

Comments IEEE International Conference on Robotics & Automation (ICRA 2026)

2507.14642 2026-03-17 cs.AI cs.SE

Efficient Story Point Estimation With Comparative Learning

Monoshiz Mahbub Khan, Xiaoyin Xi, Andrew Meneely, Yiming Tang, Zhe Yu

2507.14172 2026-03-17 cs.LG cs.AI cs.NE

Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI

Julien Pourcel, Cédric Colas, Pierre-Yves Oudeyer

Comments update related work

2507.10401 2026-03-17 cs.LG math.PR

Stochastic Operator Network: A Stochastic Maximum Principle Based Approach to Operator Learning

Ryan Bausback, Jingqiao Tang, Lu Lu, Feng Bao, Toan Huynh

2507.07685 2026-03-17 cs.CV cs.AI cs.LG

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa

Comments Accepted to CVPR 2026 (Main); Code is available at https://github.com/yshinya6/red/

2507.07610 2026-03-17 cs.CV cs.CL cs.HC

SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs

Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Yuchen Li, Kun Shao, Zheng Tian, Haifeng Zhang, Jun Wang

2506.21509 2026-03-17 cs.CV

Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models

Jiahe Chen, Jiaying He, Qiyuan Chen, Qian Shao, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu

2506.13766 2026-03-17 cs.CV

LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D

Lingteng Qiu, Peihao Li, Heyuan Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Rui Peng, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong

Comments HomePage: https://lingtengqiu.github.io/LHM++/ Online Demo: https://huggingface.co/spaces/Lingteng/LHMPP

2506.06955 2026-03-17 cs.CL cs.AI

BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Ha-Thanh Nguyen, Hideyuki Tachibana, Chaoran Liu, Qianying Liu, Su Myat Noe, Koichi Takeda, Sadao Kurohashi

Comments Accepted at LREC 2026

2506.06632 2026-03-17 cs.LG cs.AI cs.CL

Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

2506.06097 2026-03-17 cs.CV

VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang

2506.04779 2026-03-17 cs.CL cs.SD eess.AS

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang, Junan Li, Jincenzi Wu, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng

Comments ICLR 2026. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Project page https://github.com/dingdongwang/MMSU

2506.01208 2026-03-17 cs.LG

Multiresolution Analysis and Statistical Thresholding on Dynamic Networks

Raphaël Romero, Tijl De Bie, Nick Heard, Alexander Modell

2505.23973 2026-03-17 cs.LG

Adaptive Deadline and Batch Layered Synchronized Federated Learning

Asaf Goren, Natalie Lang, Nir Shlezinger, Alejandro Cohen

2505.22636 2026-03-17 cs.CV

Precise Object and Effect Removal with Adaptive Target-Aware Attention

Jixin Zhao, Zhouxia Wang, Peiqing Yang, Shangchen Zhou

Comments Accepted to CVPR 2026. Project page: https://zjx0101.github.io/projects/ObjectClear/

2505.22596 2026-03-17 cs.CV

SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, Kehong Yuan

Comments Accepted to NeurIPS 2025