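arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.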

2507.18064 2026-05-04 cs.CV

Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

Xiaoran Sun, Liyan Wang, Yeying Jin, Kin-man Lam, Zhixun Su, Yang Yang, Jinshan Pan, Cong Wang

Comments 11 pages, 8 figures, CVPR 2026 Findings

Abstract

Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a framework that adapts large vision-language models with iterative and manual instructions for generative LLIE. VLM-IMI mainly contains two branches: Normal-Light Instruction Prior Generation (NL-IPG) and Instruction-aware Light Enhancement Diffusion (IA-LED). The NL-IPG incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. IA-LED incorporates instruction priors from the NL-IPG to guide the diffusion process, enabling precise illumination enhancement. To effectively integrate cross-modal priors, we introduce a learnable instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. Because ground-truth normal-light images are not available during inference, we propose an iterative-instruction inference strategy that refines the textual instructions, progressively improving visual quality. Our VLM-IMI also inherently supports manual instruction control by allowing users to directly input custom instructions into the LLM to generate user-expected outputs. Experiments across diverse scenarios demonstrate that VLM-IMI outperforms SOTA methods in terms of perception and realism. The source code is available at: https://github.com/sunxiaoran01/VLM-IMI.

2507.01955 2026-05-04 cs.CV cs.AI cs.LG

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

Comments ICLR 2026. Project page at https://fm-vision-evals.epfl.ch/

2506.22982 2026-05-04 cs.CV

Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models

Atharv Mittal, Agam Pandey, Amritanshu Tiwari, Sukrit Jindal, Swadesh Swain

Comments Accepted to MLRC 2025

2506.13015 2026-05-04 cs.LG cs.AI

Geometric Embedding Alignment via Curvature Matching in Transfer Learning

Sung Moon Ko, Jaewan Lee, Sumin Lee, Soorin Yim, Kyunghoon Bae, Sehui Han

Comments 13+19 pages, 7 figures, 8 tables, 1 pseudo code

2506.11991 2026-05-04 cs.CV cs.AI cs.CL

VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao

Comments 9 pages, 4 figures

2506.11989 2026-05-04 cs.CV

Thought Graph Traversal for Test-time Scaling in Chest X-ray VLLMs

Yue Yao, Zelin Wen, Yan Tong, Xinyu Tian, Xuqing Li, Xiao Ma, Dongliang Xu, Tom Gedeon

2506.00166 2026-05-04 cs.LG cs.AI cs.CL

Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

Kundan Krishna, Joseph Y Cheng, Charles Maalouf, Leon A Gatys

Comments ICLR 2026 Workshop: Principled Design for Trustworthy AI

2505.23875 2026-05-04 cs.LG cs.AI

A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants

Peter Samoaa, Marcus Vukojevic, Morteza Haghir Chehreghani, Antonio Longa

2505.23723 2026-05-04 cs.CL cs.AI cs.LG

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen

2505.22003 2026-05-04 cs.CL cs.AI

Lightweight Domain Adaptation of a Large Language Model for Legal Assistance in the Indian Context

Jatin Gupta, Akhil Sharma, Saransh Singhania, Ali Imam Abidi

Comments 8 pages, 2 tables, 5 figures. This is a revised version of a preprint previously available at this DOI: https://doi.org/10.48550/arXiv.2505.22003

2505.20948 2026-05-04 cs.AI

Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs

Yisen Gao, Jiaxin Bai, Tianshi Zheng, Qingyun Sun, Ziwei Zhang, Xingcheng Fu, Jianxin Li, Yangqiu Song

Comments Accepted by ICLR 2026

2505.13007 2026-05-04 cs.LG cs.CE

Latent Generative Modeling of Random Fields from Limited Training Data

James E. Warner, Tristan A. Shah, Patrick E. Leser, Geoffrey F. Bomarito, Joshua D. Pribe, Michael C. Stanley

Comments 24 pages plus references and appendices, 26 figures

Abstract

The ability to accurately model random fields plays a critical role in science and engineering for problems involving uncertain, spatially-varying quantities such as heterogeneous material properties and turbulent flows. Deep generative models offer a powerful tool for sampling high- or infinite-dimensional uncertainties like random fields, but their reliance on large, dense training datasets limits their applicability in contexts where sufficient data is difficult or expensive to obtain. In this work, we propose a latent-space approach to generative modeling of random fields that incorporates domain knowledge to supplement limited training data. A constraint-aware variational autoencoder (VAE) with a function decoder is first used to learn compact latent representations of continuous functions that adhere to known physical or statistical constraints, even when training data is sparse or indirect. Generative modeling is then performed in the learned latent space, decoupling constraint enforcement from the sampling process. This decoupling enables expressive multi-step generative methods to be deployed in data-limited settings where existing constrained multi-step approaches are not directly applicable. The richer latent distributions captured by the generative model also overcome limitations of standard VAEs, which rely on simple parametric priors and struggle to represent complex, multimodal, or heavy-tailed distributions over functions. Efficacy is demonstrated on two challenging applications: wind velocity field reconstruction from sparse sensors and material property inference from indirect measurements. Results show the effectiveness of incorporating domain knowledge constraints for data-limited problems and the improved sample quality and robustness of the latent generative modeling approach versus directly sampling a constrained VAE.

2505.10887 2026-05-04 cs.AI

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding

2505.09971 2026-05-04 cs.CV

APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds

Yuan Gao, Shaobo Xia, Sheng Nie, Cheng Wang, Xiaohuan Xi, Bisheng Yang

Comments 18 pages, 12 figures

DOI: 10.1016/j.isprsjprs.2026.04.040
Journal ref: ISPRS Journal of Photogrammetry and Remote Sensing Volume 237, July 2026, Pages 339-354

Abstract

Airborne laser scanning (ALS) point cloud semantic segmentation is a fundamental task for large-scale 3D scene understanding. Fixed models deployed in real-world scenarios often suffer from performance degradation due to continuous domain shifts caused by environmental and sensor changes. Continuous Test-Time Adaptation (CTTA) enables adaptation to evolving unlabeled domains, but its application to ALS point clouds remains underexplored, hindered by the lack of benchmarks and the risks of catastrophic forgetting and error accumulation. To address these challenges, we propose APCoTTA (ALS Point cloud Continuous Test-Time Adaptation), a novel CTTA framework tailored for ALS point cloud semantic segmentation. APCoTTA consists of three key components. First, we adapt a gradient-driven layer selection mechanism for ALS point clouds, selectively updating low-confidence layers while freezing stable ones to preserve source knowledge and mitigate catastrophic forgetting. Second, an entropy-based consistency loss discards unreliable samples and enforces consistency regularization solely on reliable ones, effectively reducing error accumulation and improving adaptation stability. Third, a random parameter interpolation mechanism stochastically blends adapted parameters with source model parameters, further balancing target adaptation and source knowledge retention. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Extensive experiments demonstrate that APCoTTA achieves superior performance on both benchmarks, improving mIoU by approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
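
The second and third components described in the abstract above (entropy-based filtering of unreliable samples, and stochastic blending of adapted parameters with source parameters) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed entropy threshold, and the `keep_prob` and `alpha` parameters are all assumptions introduced for illustration, and the abstract does not specify the exact rules the paper uses.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reliable_mask(batch_probs, threshold):
    """Flag predictions whose entropy is below `threshold` as reliable.

    The fixed threshold is an assumption; a consistency loss would then be
    computed only on the samples flagged True.
    """
    return [entropy(p) < threshold for p in batch_probs]


def random_interpolate(source, adapted, keep_prob, alpha, rng):
    """Stochastically blend adapted parameters back toward source values.

    With probability `keep_prob` a parameter keeps its adapted value;
    otherwise it becomes alpha * source + (1 - alpha) * adapted.
    Both `keep_prob` and `alpha` are hypothetical knobs for this sketch.
    """
    blended = []
    for s, a in zip(source, adapted):
        if rng.random() < keep_prob:
            blended.append(a)
        else:
            blended.append(alpha * s + (1.0 - alpha) * a)
    return blended


# Usage: one confident and one near-uniform prediction over three classes.
probs = [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33]]
mask = reliable_mask(probs, threshold=0.5)

rng = random.Random(0)
source_params = [0.0, 1.0, 2.0]
adapted_params = [0.5, 1.5, 2.5]
blended_params = random_interpolate(source_params, adapted_params,
                                    keep_prob=0.9, alpha=0.5, rng=rng)
```

In this sketch the confident prediction is kept and the near-uniform one is discarded, while each blended parameter is either its adapted value or the midpoint between source and adapted values.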

2505.06698 2026-05-04 cs.CL

SCAN: Structured Capability Assessment and Navigation for LLMs

Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang

Comments Accepted by ACL 2026 Main

2505.03500 2026-05-04 cs.RO

VLAs are Confined yet Capable of Generalizing to Novel Instructions

Quanyi Li

2504.11901 2026-05-04 cs.RO cs.AI

Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments

Luca Castri, Gloria Beraldo, Nicola Bellotto

Comments Causal Discovery and Inference - Robot Autonomy - Human-Robot Spatial Interaction - Decision-Making

2504.05679 2026-05-04 cs.CV

Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark

Udayanga G. W. K. N. Gamage, Xuanni Huo, Luca Zanatta, T Delbruck, Cesar Cadena, Matteo Fumagalli, Silvia Tolu

Comments Accepted version of the journal paper published in the Sage journal Structural Health Monitoring; it is currently under review. Consists of 29 pages, with 20 figures and 7 tables. Keywords: event-based vision, civil structural health monitoring, defect detection, crack, spalling, DVS, dataset, YOLOv6, SSD, 2D event histograms

DOI: 10.1177/14759217251411320

Abstract

Small unmanned aerial vehicle (UAV)-based visual inspections are a more efficient alternative to manual methods for examining civil structural defects, offering safe access to hazardous areas and significant cost savings by reducing labor requirements. However, traditional frame-based cameras, widely used in UAV-based inspections, often struggle to capture defects under low or dynamic lighting conditions. In contrast, dynamic vision sensors (DVS), or event-based cameras, excel in such scenarios by minimizing motion blur, enhancing power efficiency, and maintaining high-quality imaging across diverse lighting conditions without saturation or information loss. Despite these advantages, existing research lacks studies exploring the feasibility of using DVS for detecting civil structural defects. Moreover, there is no dedicated event-based dataset tailored for this purpose. Addressing this gap, this study introduces the first event-based civil infrastructure defect detection dataset, capturing defective surfaces as a spatio-temporal event stream using DVS. In addition to event-based data, the dataset includes grayscale intensity image frames captured simultaneously using an active pixel sensor (APS). Both data types were collected using the DAVIS346 camera, which integrates DVS and APS sensors. The dataset focuses on two types of defects: cracks and spalling, and includes data from both field and laboratory environments. The field dataset comprises 318 recording sequences, documenting 458 distinct cracks and 121 distinct spalling instances. The laboratory dataset includes 362 recording sequences, covering 220 distinct cracks and 308 spalling instances. We evaluated the dataset using four real-time object detection models. The results demonstrate the applicability of DVS cameras for robust detection of civil infrastructure defects under challenging lighting conditions.

2503.19034 2026-05-04 cs.CV

Color Conditional Generation with Sliced Wasserstein Guidance

Alexander Lobashev, Maria Larchenko, Dmitry Guskov

Comments NeurIPS 2025, spotlight

2503.06740 2026-05-04 cs.CV

Diffusion Models are Secretly Zero-Shot 3DGS Harmonizers

Vsevolod Skorokhodov, Nikita Durasov, Pascal Fua

2501.06540 2026-05-04 cs.CV math.ST stat.AP stat.ME stat.TH

Copula-enhanced Vision Transformer for high myopia diagnosis through OU UWF fundus images

Chong Zhong, Yunhao Liu, Yang Li, Xiang Fu, Jin Yang, Danjuan Yang, Meiyan Li, Jinfeng Xu, Aiyi Liu, Alan H. Welsh, Xingtao Zhou, Bo Fu, Catherine C. Liu

2501.00885 2026-05-04 cs.CL cs.AI cs.LG

Representation in large language models

Cameron Yetman

Comments Preprint, forthcoming in Ergo: An Open Access Journal of Philosophy, 34 pages, 2 figures

2412.07010 2026-05-04 cs.LG physics.comp-ph

TAEN: A Model-Constrained Tikhonov Autoencoder Network for Forward and Inverse Problems

Hai V. Nguyen, Tan Bui-Thanh, Clint Dawson

DOI: 10.1016/j.cma.2025.118245

Abstract

Efficient real-time solvers for forward and inverse problems are essential in engineering and science applications. Machine learning surrogate models have emerged as promising alternatives to traditional methods, offering substantially reduced computational time. Nevertheless, these models typically demand extensive training datasets to achieve robust generalization across diverse scenarios. While physics-based approaches can partially mitigate this data dependency and ensure physics-interpretable solutions, addressing scarce data regimes remains a challenge. Both purely data-driven and physics-based machine learning approaches demonstrate severe overfitting issues when trained with insufficient data. We propose a novel Tikhonov autoencoder model-constrained framework, called TAE, capable of learning both forward and inverse surrogate models using a single arbitrary observation sample. We develop comprehensive theoretical foundations including forward and inverse inference error bounds for the proposed approach for linear cases. For comparative analysis, we derive equivalent formulations for pure data-driven and model-constrained approach counterparts. At the heart of our approach is a data randomization strategy, which functions as a generative mechanism for exploring the training data space, enabling effective training of both forward and inverse surrogate models from a single observation, while regularizing the learning process. We validate our approach through extensive numerical experiments on two challenging inverse problems: 2D heat conductivity inversion and initial condition reconstruction for time-dependent 2D Navier-Stokes equations. Results demonstrate that TAE achieves accuracy comparable to traditional Tikhonov solvers and numerical forward solvers for both inverse and forward problems, respectively, while delivering orders of magnitude computational speedups.

2411.17429 2026-05-04 cs.LG cs.AI

Graph Rewiring in GNNs to Mitigate Over-Squashing and Over-Smoothing: A Survey

Hugo Attali, Davide Buscaldi, Nathalie Pernelle, Fragkiskos D. Malliaros

Comments Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2026), Survey Track

2411.10915 2026-05-04 cs.CL cs.LG

Bias in Large Language Models: Origin, Evaluation, and Mitigation

Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, Shuo Shuo Liu

2408.11513 2026-05-04 cs.LG cs.AI

Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Washim Uddin Mondal, Vaneet Aggarwal

Comments Published in Transactions on Machine Learning Research (TMLR)

2408.11349 2026-05-04 cs.CV

Image Score: Learning and Evaluating Human Preferences for Mercari Search

Chingis Oinar, Miao Cao, Shanshan Fu

2406.14429 2026-05-04 cs.LG cs.AI cs.CV

CollaFuse: Collaborative Diffusion Models

Simeon Allmendinger, Domenique Zipperling, Lukas Struppek, Niklas Kühl

Comments Conditionally Accepted at the Journal of Artificial Intelligence Research (JAIR)

2405.14093 2026-05-04 cs.RO cs.CL cs.CV

A Survey on Vision-Language-Action Models for Embodied AI

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King

Comments Project page: https://github.com/yueen-ma/Awesome-VLA

2405.13693 2026-05-04 cs.LG

Mutatis Mutandis: Revisiting the Comparator in Discrimination Testing

Jose M. Alvarez, Salvatore Ruggieri