arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2509.14117 2026-03-10 cs.RO

GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model

Ali Abouzeid, Malak Mansour, Qinbo Sun, Zezhou Sun, Dezhen Song

Comments Under Review, Project Page https://alisharey.github.io/GeoAware-VLA/

详情

英文摘要

Vision-Language-Action (VLA) models often fail to generalize to unseen camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A lightweight, trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on the LIBERO and CALVIN benchmarks, we show that GeoAware-VLA preserves and even improves in-distribution performance while achieving substantial gains in zero-shot generalization to unseen camera poses, improving unseen-view success rates by an average of 35 percentage points on LIBERO and over 11 percentage points on CALVIN compared to their respective baselines. Crucially, these gains transfer to the physical world, where our model shows significant improvement on a real robotic platform. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key ingredient for building more generalizable robotic agents.

URL PDF HTML ☆

赞 0 踩 0

2509.12871 2026-03-10 cs.CV

Cumulative Consensus Score: Label-Free and Model-Agnostic Evaluation of Object Detectors in Deployment

Avinaash Manoharan, Xiangyu Yin, Domenik Helm, Chih-Hong Cheng

2509.09349 2026-03-10 cs.CV cs.AI cs.ET cs.RO eess.IV

Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles

Ian Nell, Shane Gilroy

2509.02541 2026-03-10 cs.CV

Mix-modal Federated Learning for MRI Image Segmentation

Guyue Hu, Siyuan Song, Jingpeng Sun, Zhe Jin, Chenglong Li, Jin Tang

2509.01487 2026-03-10 cs.CV

PointSlice: Accurate and Efficient Slice-Based Representation for 3D Object Detection from Point Clouds

Liu Qifeng, Zhao Dawei, Dong Yabo, Xiao Liang, Wang Juan, Min Chen, Li Fuyang, Jiang Weizhong, Lu Dongming, Nie Yiming

Comments Accepted by Pattern Recognition

2509.01370 2026-03-10 cs.LG cond-mat.mtrl-sci

CbLDM: A Diffusion Model for recovering nanostructure from atomic pair distribution function

Jiarui Cao, Zhiyang Zhang, Heming Wang, Jun Xu, Ling Lan, Simon J. L. Billinge, Ran Gu

2508.11954 2026-03-10 cs.AI

UniCast: A Unified Framework for Instance-Conditioned Multimodal Time-Series Forecasting

Sehyuk Park, Soyeon Caren Han, Eduard Hovy

2508.11952 2026-03-10 cs.CV

UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding

Yueming Xu, Jiahui Zhang, Ze Huang, Yurui Chen, Yanpeng Zhou, Zhenyu Chen, Yu-Jie Yuan, Pengxiang Xia, Guowei Huang, Xinyue Cai, Zhongang Qi, Xingyue Quan, Jianye Hao, Hang Xu, Li Zhang

2508.08660 2026-03-10 cs.CV

Unified and Semantically Grounded Domain Adaptation for Medical Image Segmentation

Xin Wang, Yin Guo, Jiamin Xia, Kaiyu Zhang, Niranjan Balu, Mahmud Mossa-Basha, Linda Shapiro, Chun Yuan

Comments Accepted by IEEE Transactions on Medical Imaging

详情

英文摘要

Most prior unsupervised domain adaptation approaches for medical image segmentation are narrowly tailored to either the source-accessible setting, where adaptation is guided by source-target alignment, or the source-free setting, which typically resorts to implicit adaptation mechanisms such as pseudo-labeling and network distillation. This substantial divergence in methodological designs between the two settings reveals an inherent flaw: the lack of an explicit, structured construction of anatomical knowledge that naturally generalizes across domains and settings. To bridge this longstanding divide, we introduce a unified, semantically grounded framework that supports both source-accessible and source-free adaptation. Fundamentally distinct from all prior works, our framework's adaptability emerges naturally as a direct consequence of the model architecture, without relying on explicit cross-domain alignment strategies. Specifically, our model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities, mirroring how humans establish visual understanding. Thus, the structural content in each image can be interpreted as a canonical anatomy retrieved from the manifold and a spatial transformation capturing individual-specific geometry. This disentangled, interpretable formulation enables semantically meaningful prediction with intrinsic adaptability. Extensive experiments on challenging cardiac and abdominal datasets show that our framework achieves state-of-the-art results in both settings, with source-free performance closely approaching its source-accessible counterpart, a level of consistency rarely observed in prior works. The results provide a principled foundation for anatomically informed, interpretable, and unified solutions for domain adaptation in medical imaging. The code is available at https://github.com/wxdrizzle/remind

URL PDF HTML ☆

赞 0 踩 0

2508.05592 2026-03-10 cs.CL

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tan

Comments Accepted to AAAI 2026

2508.05202 2026-03-10 cs.CV

SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images

Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao, Jing Zhang, Minqiang Xu, Jianbo Zhan, Jianshe Wang, Lin Liu, Bo Du, Liangpei Zhang

Comments Accepted to IEEE TGRS

2508.02879 2026-03-10 cs.LG cs.AI

CauKer: Classification Time Series Foundation Models Can Be Pretrained on Synthetic Data

Shifeng Xie, Vasilii Feofanov, Ambroise Odonnat, Lei Zan, Marius Alonso, Jianfeng Zhang, Themis Palpanas, Lujia Pan, Keli Zhang, Ievgen Redko

Comments This manuscript combines material from the ICML 2025 TSFM Workshop paper and the ICLR 2026 Main Track paper

2507.21426 2026-03-10 cs.SD eess.AS

Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer

Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Martijn Wieling, Defne Abur, Tomoki Toda

Comments 5 pages, 1 figure, 1 table. Accepted at Interspeech 2025

2507.20152 2026-03-10 cs.CL cs.AI

Goal Alignment in LLM-Based User Simulators for Conversational AI

Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-Tür

2507.19914 2026-03-10 cs.RO

They See Me Rolling: High-Speed Event Vision-Based Tactile Roller Sensor for Large Surface Inspection

Akram Khairi, Hussain Sajwani, Abdallah Mohammad Alkilany, Laith AbuAssi, Mohamad Halwani, Islam Mohamed Zaid, Ahmed Awadalla, Dewald Swart, Abdulla Ayyad, Yahya Zweiri

Comments Accepted to IEEE T-RO - Project Page: https://akramekhairi.github.io/TheySeeMeRolling/

2507.18858 2026-03-10 cs.LG

Weak-to-Strong Generalization with Failure Trajectories: A Tree-based Approach to Elicit Optimal Policy in Strong Models

Ruimeng Ye, Zihan Wang, Yang Xiao, Zinan Ling, Manling Li, Bo Hui

2507.17731 2026-03-10 cs.LG cs.AI

Flow Matching Meets Biology and Life Science: A Survey

Zihao Li, Zhichen Zeng, Xiao Lin, Feihao Fang, Yanru Qu, Zhe Xu, Zhining Liu, Xuying Ning, Tianxin Wei, Ge Liu, Hanghang Tong, Jingrui He

Comments Nature Portfolio Journal Artificial Intelligence, 34 pages

2507.16849 2026-03-10 cs.CV cs.AI

Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery

Yi-Shan Chu, Hsuan-Cheng Wei

2507.14899 2026-03-10 cs.AI cs.CV

InsightX Agent: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis

Jiale Liu, Huan Wang, Yue Zhang, Xiaoyu Luo, Jiaxiang Hu, Zhiliang Liu, Min Xie

详情

DOI: 10.1109/TR.2026.3668946

英文摘要

Non-destructive testing (NDT), particularly X-ray inspection, is vital for industrial quality assurance, yet existing deep-learning-based approaches often lack interactivity, interpretability, and the capacity for critical self-assessment, limiting their reliability and operator trust. To address these shortcomings, this paper proposes InsightX Agent, a novel LMM-based agentic framework designed to deliver reliable, interpretable, and interactive X-ray NDT analysis. Unlike typical sequential pipelines, InsightX Agent positions a Large Multimodal Model (LMM) as a central orchestrator, coordinating between the Sparse Deformable Multi-Scale Detector (SDMSD) and the Evidence-Grounded Reflection (EGR) tool. The SDMSD generates dense defect region proposals from multi-scale feature maps and sparsifies them through Non-Maximum Suppression (NMS), optimizing detection of small, dense targets in X-ray images while maintaining computational efficiency. The EGR tool guides the LMM agent through a chain-of-thought-inspired review process, incorporating context assessment, individual defect analysis, false positive elimination, confidence recalibration and quality assurance to validate and refine the SDMSD's initial proposals. By strategically employing and intelligently using tools, InsightX Agent moves beyond passive data processing to active reasoning, enhancing diagnostic reliability and providing interpretations that integrate diverse information sources. Experimental evaluations on the GDXray+ dataset demonstrate that InsightX Agent not only achieves a high object detection F1-score of 96.54\% but also offers significantly improved interpretability and trustworthiness in its analyses, highlighting the transformative potential of LMM-based agentic frameworks for industrial inspection tasks.

URL PDF HTML ☆

赞 0 踩 0

2507.11662 2026-03-10 cs.AI cs.CL cs.LG cs.MA cs.RO

Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira

Comments ICLR 2026. Code, models, and data publicly available at https://mshalimay.github.io/agreement-bias-sgv/

详情

英文摘要

Verifiers--functions assigning rewards to agent behavior--have been key to AI progress in math, code, and games. However, extending gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior--a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments/rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV, a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena--surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.

URL PDF HTML ☆

赞 0 踩 0

2507.03831 2026-03-10 cs.CV cs.RO

Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Jiuhong Xiao, Yang Zhou, Giuseppe Loianno

Comments 8 pages, 4 figures, accepted at ICRA 2026

2507.03168 2026-03-10 cs.LG cs.CV

Adopting a human developmental visual diet yields robust, shape-based AI vision

Zejin Lu, Sushrut Thorat, Radoslaw M Cichy, Tim C Kietzmann

2506.19300 2026-03-10 cs.CV

Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models

Kai Zhao, Wubang Yuan, Zheng Wang, Guanyi Li, Xiaoqiang Zhu, Deng-ping Fan, Dan Zeng

Comments Accepted to Computational Visual Media (CVMJ) 2026

详情

DOI: 10.26599/CVM.2025.9450512

英文摘要

Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories.Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions using Vision Language Models (VLMs).However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs' full-image training and cropped-region inference, and (2) depend on generic segmentation models optimized for well-delineated objects, making them less effective for camouflaged objects.Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation.In this paper,we introduce a novel VLM-guided cascaded framework to address these issues in OVCOS.For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM.Our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy.For classification, we avoid the domain gap introduced by hard cropping.Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects.The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency.Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.

URL PDF HTML ☆

赞 0 踩 0

2506.18882 2026-03-10 cs.CV

Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Houyuan Chen, Hong Li, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

Comments Home: https://houyuanchen111.github.io/lino.github.io Github: https://github.com/houyuanchen111/LINO_UniPS HuggingFace

2506.18485 2026-03-10 cs.CL cs.AI

A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models

Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao

详情

英文摘要

Reinforcement Learning with Verifiable Rewards~(RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by numerously generating responses and learn from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make the natural language description of the reward function possible, and meanwhile, LLMs have demonstrated strong in-context learning ability. This motivates us to explore if large reasoning models can benefit from a \textbf{motivation} of the task, \textit{i.e.}, awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce \textit{\textbf{M}otivation-\textbf{e}nhanced \textbf{R}einforcement \textbf{F}inetuning}~(\textbf{MeRF}), an intuitive yet effective method enhancing reinforcement finetuning of LLMs by involving \emph{``telling LLMs rules of the game''}. Specifically, \textbf{MeRF} directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations demonstrate that \textbf{MeRF} achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.

URL PDF HTML ☆

赞 0 踩 0

2506.17252 2026-03-10 cs.LG cs.AI

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

2506.16563 2026-03-10 cs.CV cs.AI cs.LG

From Semantic To Instance: A Semi-Self-Supervised Learning Approach

Keyhan Najafian, Farhad Maleki, Lingling Jin, Ian Stavness

2506.15828 2026-03-10 cs.RO cs.AI

Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning

Emanuele Musumeci, Michele Brienza, Francesco Argenziano, Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Domenico D. Bloisi

2506.09487 2026-03-10 cs.SD cs.AI cs.LG cs.LO eess.AS

BemaGANv2: Discriminator Combination Strategies for GAN-based Vocoders in Long-Term Audio Generation

Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon

Comments Currently under review at ICT Express as an extended version of our ICAIIC 2025 paper

2506.08523 2026-03-10 cs.LG cond-mat.dis-nn nlin.CD physics.data-an

Leveraging chaotic transients in the training of artificial neural networks

Pedro Jiménez-González, Miguel C. Soriano, Lucas Lacasa