arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2512.03619 2026-03-31 cs.CV

LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan

Comments CVPR 2026. Project Page: https://cyberiada.github.io/LAMP/

详情

英文摘要

Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives establishing the first framework for generating both object and camera motions directly from natural language specifications. Code, models and data are available on our project page.

URL PDF HTML ☆

赞 0 踩 0

2512.02711 2026-03-31 cs.CL cs.LG

CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Lavish Bansal, Naman Mishra

Comments 9 Pages, 5 Figures, Accepted at LREC 2026

2512.02369 2026-03-31 cs.CV

SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains

Qingmei Li, Yang Zhang, Peifeng Zhang, Haohuan Fu, Juepeng Zheng

2512.01342 2026-03-31 cs.CV

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Lang Lin, Ziang Yan, Yali Wang, Yi Wang, Limin Wang

2512.00255 2026-03-31 cs.CV

Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler, Rishabh Dabral, Marc Habermann, Christian Theobalt

2511.23342 2026-03-31 cs.CV cs.AI

Overcoming the Curvature Bottleneck in MeanFlow

Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Chengzhi Mao, Dimitris Metaxas, Vladimir Pavlovic

2511.23227 2026-03-31 cs.CV

PointCNN++: Performant Convolution on Native Points

Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li

2511.22264 2026-03-31 cs.CV

DriveVGGT: Calibration-Constrained Visual Geometry Transformers for Multi-Camera Autonomous Driving

Xiaosong Jia, Yanhao Liu, Yu Hong, Renqiu Xia, Junqi You, Bin Sun, Zhihui Hao, Junchi Yan

详情

英文摘要

Feed-forward reconstruction has been progressed rapidly, with the Visual Geometry Grounded Transformer (VGGT) being a notable baseline. However, directly applying VGGT to autonomous driving (AD) fails to capture three domain-specific priors: (i) Sparse Spatial Overlap: the overlap among mutli-view cameras is minimal due to $360^{\circ}$ coverage requirements under budget control, which renders global attention among all images inefficient; (ii) Calibrated Geometric Constraints: the absolute distance among cameras is generally accessible for AD data with calibration process before driving. Standard VGGT is unable to directly utilize such information for absolute scale scene reconstruction; (iii) Rigid Extrinsic Constancy: relative poses of multi-view cameras are approximately static, i.e., the ego-motion is the same for all cameras. To bridge these gaps, we propose DriveVGGT, a scale-aware reconstruction framework that explicitly integrates these priors through three targeted components. First, for the Sparse Spatial Overlap in (i), we introduce a Temporal Video Attention (TVA) module to process multi-camera videos independently. Second, for Calibrated Geometric Constraints in (ii), a Multi-camera Consistency Attention (MCA) module is designed to directly utilize the calibration information among cameras with a scale head for absolute scale scene reconstruction. Finally, to utilize Rigid Extrinsic Constancy in (iii), we reformulate the decoding process of VGGT into factorized sequential pose head and ego motion head. On AD datasets, experiments demonstrate that DriveVGGT reduces inference time by 49.3\% while improving depth and pose estimation compared to vanilla VGGT in long-sequence scenarios. It consistently outperforms recent SOTA variants. Meanwhile, extensive ablation studies verify the effectiveness of each devised module.

URL PDF HTML ☆

赞 0 踩 0

2511.21428 2026-03-31 cs.CV cs.AI cs.RO

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner

Comments 10 pages, 5 figures, Accepted to CVPR 2026

2511.19413 2026-03-31 cs.LG cs.AI cs.CV

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang

Comments Accepted to CVPR 2026

2511.18174 2026-03-31 cs.CV

Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera

Mukai Yu, Mosam Dabhi, Liuyue Xie, Sebastian Scherer, László A. Jeni

Comments Accepted to CVPR 2026. Camera-ready version. Added computation benchmark

2511.17290 2026-03-31 cs.CL

Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

Marii Ojastu, Hele-Andra Kuulmets, Aleksei Dorkin, Marika Borovikova, Dage Särg, Kairit Sirts

Comments LREC 2026

2511.17209 2026-03-31 cs.CV

Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen

2511.16216 2026-03-31 cs.AI cs.LG

FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis

Zhen Hao Wong, Jingwen Deng, Yuzhao Wang, Wenkai Yu, Jihao Huang, Runming He, Chengyu Shen, Hao Liang, Wentao Zhang

2511.15355 2026-03-31 cs.CL

HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares

Comments LREC 2026 camera-ready version

2511.15211 2026-03-31 cs.CL cs.AI

OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Xinli Tao, Xin Dong, Xuezhong Zhou

Comments 12 pages, 4 figures, 4 tables

详情

DOI: 10.1093/jamiaopen/ooag049
Journal ref: JAMIA Open, 2026, ooag049

英文摘要

With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.

URL PDF HTML ☆

赞 0 踩 0

2511.14262 2026-03-31 cs.LG cs.AI

Object-Centric World Models for Causality-Aware Reinforcement Learning

Yosuke Nishimoto, Takashi Matsubara

Comments Accepted by AAAI-26. Codes are available at https://github.com/nishimoto0430/STICA

2511.03121 2026-03-31 cs.CL cs.AI cs.SY eess.SY

Control Barrier Function for Aligning Large Language Models

Yuya Miyaoka, Masaki Inoue

Comments This work is an extenede version of arXiv:2408.15625 and has been submitted to the IEEE for possible publication

2511.01187 2026-03-31 cs.CL cs.CY

Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata

2510.24434 2026-03-31 cs.CL

LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot

2510.20351 2026-03-31 cs.CL cs.AI

Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

2510.20212 2026-03-31 cs.CV

Target-aware Image Editing via Cycle-consistent Constraints

Yanghao Wang, Zhen Wang, Long Chen

2510.16518 2026-03-31 cs.RO cs.AI

DIV-Nav: Open-Vocabulary Spatial Relationships for Multi-Object Navigation

Jesús Ortega-Peimbert, Finn Lukas Busch, Timon Homberger, Quantao Yang, Olov Andersson

2510.14255 2026-03-31 cs.CV

Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng

Comments accepted by CVPR 2026

2510.14035 2026-03-31 cs.AI

GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations

Rajesh Mangannavar, Prasad Tadepalli

Comments 10 pages content. 2 pages references. 2 pages appendix. Updated paper with more results from multiple domains and added appendix

2510.13044 2026-03-31 cs.CV cs.AI

SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

Jungbin Cho, Minsu Kim, Jisoo Kim, Ce Zheng, Laszlo A. Jeni, Ming-Hsuan Yang, Youngjae Yu, Seonjoo Kim

Comments 15 pages

2510.10102 2026-03-31 cs.LG

PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

Guilin Li, Yun Zhang, Xiuyuan Chen, Chengqi Li, Bo Wang, Linghe Kong, Wenjia Wang, Weiran Huang, Matthias Hwai Yong Tan

详情

英文摘要

Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative-discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories; and (4) Real-time scalability enabled by offline caching of pretrained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6 percent boost in next-transaction prediction HitRate@1 and a 38.6 percent relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks show strong generalization, achieving up to 21 percent HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial sequential user behavior modeling.

URL PDF HTML ☆

赞 0 踩 0

2510.08673 2026-03-31 cs.CV

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy

Comments Accepted by ICLR2026. Project Page: https://kangliao929.github.io/projects/puffin/

2510.08553 2026-03-31 cs.CV cs.AI cs.RO

Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

Yunzhe Xu, Yiyuan Pan, Zhe Liu

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

2510.06961 2026-03-31 cs.CL cs.AI cs.SD eess.AS

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, Sanchit Gandhi

Comments Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard ; Code: https://github.com/huggingface/open_asr_leaderboard