arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2603.18627 2026-03-20 cs.AI

Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen

详情

英文摘要

Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

URL PDF HTML ☆

赞 0 踩 0

2603.18626 2026-03-20 cs.CV

GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Zelin Liu, Bocheng Li, Yuling Zhou, Xuanting Li, Yixuan Yang, Jing Wang, Weishu Zhao, Xiaofeng Gao

2603.18625 2026-03-20 cs.CV

GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He

Comments ECCV 2026 submission. 14 pages, 6 figures, 4 tables. Supplementary material included

2603.18624 2026-03-20 cs.RO cs.AI cs.CV

REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation

Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong

2603.18623 2026-03-20 cs.CV cs.AI

OpenT2M: No-frill Motion Generation with Open-source,Large-scale, High-quality Data

Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu

2603.18620 2026-03-20 cs.CL cs.AI

Learning to Self-Evolve

Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He

2603.18616 2026-03-20 cs.CV

Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Lukas Bayer, Sheethal Bhat, Andreas Maier

2603.18614 2026-03-20 cs.AI

ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen

2603.18612 2026-03-20 cs.CL cs.SD eess.AS

DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Maxime Poli, Manel Khentout, Angelo Ortiz Tandazo, Ewan Dunbar, Emmanuel Chemla, Emmanuel Dupoux

Comments 6 pages, 2 figures. Submitted to Interspeech 2026

2603.18611 2026-03-20 cs.CL cs.CV

Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl

Comments Accepted at WWW 2026

详情

DOI: 10.1145/3774904.3792991

英文摘要

Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

URL PDF HTML ☆

赞 0 踩 0

2603.18600 2026-03-20 cs.CV

Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, Yu Liu

2603.18598 2026-03-20 cs.CV

Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu, Haiyang Zhang, Changsheng Xu

Comments Accepted to TPAMI 2026. arXiv admin note: substantial text overlap with arXiv:2410.21802

2603.18411 2026-03-20 cs.CL cs.AI cs.LG

TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang

2603.17944 2026-03-20 cs.CV

TransText: Alpha-as-RGB Representation for Transparent Text Animation

Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel

Comments 19 pages, publication review

2603.17927 2026-03-20 cs.RO

RoboForge: Physically Optimized Text-guided Whole-Body Locomotion for Humanoids

Xichen Yuan, Zhe Li, Bofan Lyu, Kuangji Zuo, Yanshuo Lu, Gen Li, Jianfei Yang

Comments 10 pages, 5 figures

详情

英文摘要

While generative models have become effective at producing human-like motions from text, transferring these motions to humanoid robots for physical execution remains challenging. Existing pipelines are often limited by retargeting, where kinematic quality is undermined by physical infeasibility, contact-transition errors, and the high cost of real-world dynamical data. We present a unified latent-driven framework that bridges natural language and whole-body humanoid locomotion through a retarget-free, physics-optimized pipeline. Rather than treating generation and control as separate stages, our key insight is to couple them bidirectionally under physical constraints.We introduce a Physical Plausibility Optimization (PP-Opt) module as the coupling interface. In the forward direction, PP-Opt refines a teacher-student distillation policy with a plausibility-centric reward to suppress artifacts such as floating, skating, and penetration. In the backward direction, it converts reward-optimized simulation rollouts into high-quality explicit motion data, which is used to fine-tune the motion generator toward a more physically plausible latent distribution. This bidirectional design forms a self-improving cycle: the generator learns a physically grounded latent space, while the controller learns to execute latent-conditioned behaviors with dynamical integrity.Extensive experiments on the Unitree G1 humanoid show that our bidirectional optimization improves tracking accuracy and success rates. Across IsaacLab and MuJoCo, the implicit latent-driven pipeline consistently outperforms conventional explicit retargeting baselines in both precision and stability. By coupling diffusion-based motion generation with physical plausibility optimization, our framework provides a practical path toward deployable text-guided humanoid intelligence.

URL PDF HTML ☆

赞 0 踩 0

2603.17769 2026-03-20 cs.SD cs.CL cs.LG eess.AS

Modeling Overlapped Speech with Shuffles

Matthew Wiesner, Samuele Cornell, Alexander Polok, Lucas Ondel Yang, Lukáš Burget, Sanjeev Khudanpur

2603.17759 2026-03-20 cs.CL cs.AI

Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Ahmed Sharshar, Hosam Elgendy, Saad El Dine Ahmed, Yasser Rohaim, Yuxia Wang

2603.17558 2026-03-20 cs.CL cs.SD

Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition

Yuxiang Mei, Delai Qiu, Shengping Liu, Jiaen Liang, Yanhua Long

Comments 13 pages, 8 figures

2603.16749 2026-03-20 cs.CL cs.LG

Probing Cultural Signals in Large Language Models through Author Profiling

Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes

2603.16549 2026-03-20 cs.CV cs.LG

Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

Jilles S. van Hulst, W. P. M. H. Heemels, Duarte J. Antunes

Comments This work has been submitted to the IEEE for possible publication

2603.16340 2026-03-20 cs.CV

Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang

Comments Accepted by CVPR2026

2603.16313 2026-03-20 cs.AI cs.LG

Learning to Predict, Discover, and Reason in High-Dimensional Event Sequences

Hugo Math

Comments PhD dissertation, 135 pages of main content, 201 pages in total

详情

英文摘要

Electronic control units (ECUs) embedded within modern vehicles generate a large number of asynchronous events known as diagnostic trouble codes (DTCs). These discrete events form complex temporal sequences that reflect the evolving health of the vehicle's subsystems. In the automotive industry, domain experts manually group these codes into higher-level error patterns (EPs) using Boolean rules to characterize system faults and ensure safety. However, as vehicle complexity grows, this manual process becomes increasingly costly, error-prone, and difficult to scale. Notably, the number of unique DTCs in a modern vehicle is on the same order of magnitude as the vocabulary of a natural language, often numbering in the tens of thousands. This observation motivates a paradigm shift: treating diagnostic sequences as a language that can be modeled, predicted, and ultimately explained. Traditional statistical approaches fail to capture the rich dependencies and do not scale to high-dimensional datasets characterized by thousands of nodes, large sample sizes, and long sequence lengths. Specifically, the high cardinality of categorical event spaces in industrial logs poses a significant challenge, necessitating new machine learning architectures tailored to such event-driven systems. This thesis addresses automated fault diagnostics by unifying event sequence modeling, causal discovery, and large language models (LLMs) into a coherent framework for high-dimensional event streams. It is structured in three parts, reflecting a progressive transition from prediction to causal understanding and finally to reasoning for vehicle diagnostics. Consequently, we introduce several Transformer-based architectures for predictive maintenance, scalable sample- and population-level causal discovery frameworks and a multi-agent system that automates the synthesis of Boolean EP rules.

URL PDF HTML ☆

赞 0 踩 0

2603.16137 2026-03-20 cs.CL

SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang

2603.16128 2026-03-20 cs.CL

Social Simulacra in the Wild: AI Agent Communities on Moltbook

Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

Comments Preprint: 13 pages, 4 figures, 5 tables

2603.15030 2026-03-20 cs.AI

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou

2603.13032 2026-03-20 cs.CV

Multimodal OCR: Parse Anything from Documents

Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei ma, Yu Chen, Yuqiu Ji, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai

详情

英文摘要

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, our model achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.

URL PDF HTML ☆

赞 0 踩 0

2603.11717 2026-03-20 cs.CV

COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection

Guillem González, Guillem Alenyà, Sergi Foix

Comments 15 pages, 11 figures. This paper will be submitted to Computers and Electronics in Agriculture, special issue

2603.11667 2026-03-20 cs.CL cs.HC

A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy

María Isabel Rivas Ginel, Janiça Hackenbuchner, Alina Secară, Ralph Krüger, Caroline Rossi

Comments Under review

2603.10651 2026-03-20 cs.RO cs.AI

Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions

Elisa Tosello, Arthur Bit-Monnot, Davide Lusuardi, Alessandro Valentini, Andrea Micheli

2603.09583 2026-03-20 cs.LG

Nonparametric Variational Differential Privacy via Embedding Parameter Clipping

Dina El Zein, Shashi Kumar, James Henderson

Comments 8 pages, 1 figure