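arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.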

2507.00170 2026-05-04 cs.CV

SelvaBox: A high-resolution dataset for tropical tree crown detection

Hugo Baudchon, Arthur Ouaknine, Martin Weiss, Mélisande Teng, Thomas R. Walla, Antoine Caron-Guay, Christopher Pal, Etienne Laliberté

Details

Journal ref: The Fourteenth International Conference on Learning Representations, 2026 (ICLR)

English abstract

Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open-access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than 83,000 manually labeled crowns - an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: (1) higher-resolution inputs consistently boost detection accuracy; and (2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.

2505.09901 2026-05-04 cs.LG cs.AI cs.CL cs.HC

Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Experiments

Ziyuan Zhang, Darcy Wang, Ningyuan Chen, Rodrigo Mansur, Vahid Sarhangian

2504.10368 2026-05-04 cs.CL cs.AI

Exploring the System 1 Thinking Capability of Large Reasoning Models

Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu

Comments: Accepted by IJCAI 2026 (Main Track)

2412.02125 2026-05-04 cs.AI cs.LG

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Guangyu Zhao, Kewei Lian, Haoxuan Ru, Borong Zhang, Haowei Lin, Zhancun Mu, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang

2411.02327 2026-05-04 cs.CV

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

Shangkun Sun, Ruyang Liu, Haoran Tang, Yixiao Ge, Haibo Lu, Wei Gao, Jiankun Yang, Chen Li

Comments: Accepted to ICLR'26

2410.04299 2026-05-04 cs.LG cs.NA math.DS math.NA

Dynamics-Encoded Deep Learning for Robust System Identification and Parameter Estimation

Caitlin Ho, Andrea Arnold

Comments: 33 pages, 20 figures

2605.00725 2026-05-04 cs.LG

Weisfeiler Lehman Test on Combinatorial Complexes: Generalized Expressive Power of Topological Neural Networks

Jiawen Chen, Qi Shao, Duxin Chen, Wenwu Yu

2605.00722 2026-05-04 cs.CV

Exploring the Limits of End-to-End Feature-Affinity Propagation for Single-Point Supervised Infrared Small Target Detection

Qiancheng Zhou, Wenhua Zhang

2605.00721 2026-05-04 cs.SD cs.AI eess.AS eess.SP

Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

Anton Ratnarajah, Mehmet Ergezer, Arun Nair, Mrudula Athi

Comments: Accepted to Generative Data Augmentation for Real-World Signal Processing Applications (GenDA 2025). An ICASSP 2025 Satellite Workshop and IEEE Data Science and Learning Workshop: Room Acoustics and Speaker Distance Estimation Challenge

2605.00719 2026-05-04 cs.CV

Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy

Yinghao Chen, Yeying Jin, Xiang Chen, Yanyan Wei, Ziyang Yan, Yaowen Fu

2605.00718 2026-05-04 cs.CV

Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels

Tongxu Zhang

2605.00708 2026-05-04 cs.LG

Deep Kernel Learning for Stratifying Glaucoma Trajectories

Bruce Rushing, Angela Danquah, Alireza Namazi, Arjun Dirghangi, Heman Shakeri

2605.00707 2026-05-04 cs.CV

PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning

Guandong Li, Mengxia Ye

Comments: 11 pages, 7 figures

2605.00706 2026-05-04 cs.CL

FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios

Yutao Hou, Yihan Jiang, Yuhan Xie, Jian Yang, Liwen Zhang, Hailiang Huang, Guanhua Chen, Yun Chen

Comments: Accepted by Findings of ACL 2026

2605.00702 2026-05-04 cs.CL

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia, Yingyi Zhang, Yi Wen, Yimin Deng, Wenlin Zhang, Enhong Chen, Xiangyu Zhao, Tong Xu

2605.00689 2026-05-04 cs.CL cs.CR

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang, Bo Li

2605.00684 2026-05-04 cs.CV

Static and Dynamic Graph Alignment Network for Temporal Video Grounding

Zhanjie Hu, Bolin Zhang, Jianhua Wang, Jianbo Zheng, Chenchen Yan, Takahiro Komamizu, Ichiro Ide, Jiangbo Qian

Details

English abstract

Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCNs) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary semantics; 2) most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation; and 3) most methods rely on single-granularity semantic matching, while direct training on the complex temporal localization task may lead to slow convergence and suboptimal precision. To address these challenges, we propose the Static and Dynamic Graph Alignment Network (SDGAN). First, SDGAN jointly exploits static and dynamic visual features to construct two complementary temporal graphs and performs Position-wise Nodes Alignment, enabling more expressive and robust visual representation. Second, SDGAN introduces Query-Clip Contrastive Learning and Adaptive Graph Modeling to explicitly align visual clips with their corresponding textual queries, yielding query-aware visual representations. Third, SDGAN incorporates multi-granularity temporal proposals within a Progressive Easy-to-Hard Training Strategy, effectively bridging coarse-grained semantic localization and fine-grained temporal boundary refinement. Extensive experiments on three benchmark datasets demonstrate that SDGAN achieves superior performance across complex TVG scenarios. Code and datasets are available at https://github.com/ZhanJieHu/SDGAN.

2605.00678 2026-05-04 cs.CV

Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data

Zahid Hassan Tushar, Sanjay Purushotham

Comments: 5 pages, 4 figures, to appear in 2026 IEEE International Geoscience and Remote Sensing Symposium

2605.00677 2026-05-04 cs.LG

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Lixing Li

Comments: 4 pages. Accepted as a short paper to the AAAI 2026 Spring Symposium on Machine Learning and Knowledge Engineering for Knowledge-Grounded Semantic Agents (MAKE 2026)

2605.00675 2026-05-04 cs.CV

DMDSC: A Dynamic-Margin Deep Simplex Classifier for Open-Set Recognition on Medical Image Datasets

Vishal, Arnav Aditya, Nitin Kumar, Saurabh J. Shigwan

2605.00667 2026-05-04 cs.LG cs.AI

Augmented Lagrangian Multiplier Network for State-wise Safety in Reinforcement Learning

Jiaming Zhang, Yujie Yang, Yao Lyu, Shengbo Eben Li, Liping Zhang

Comments: 13 pages, 41 figures, 1 table

2605.00665 2026-05-04 cs.CV

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Seowung Leem, Yunchao Yang, Adam J. Woods, Ruogu Fang

Comments: Accepted for publication in the "Journal of Alzheimer's Disease"

Details

English abstract

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether colored fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. This study aims to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using UK Biobank CFPs, DL models were trained on 62,876 images from 44,501 unique participants to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from R² = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. They also aligned with present morphometric variations. Several saliency-based scores differed significantly between incident AD and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful, risk-related structural changes that mirror potential AD vulnerability.

2605.00664 2026-05-04 cs.CV cs.AI

InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization

Jaeyoung Chung, Suyoung Lee, Kyoung Mu Lee

Comments: Project page: https://robot0321.github.io/InpaintSLat/index.html

2605.00658 2026-05-04 cs.CV

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu, Weiqing Xiao, Yuwei Guo, Chongjie Ye, Lvmin Zhang, Hao Zhao, Anyi Rao

Comments: Accepted to ACM Transactions on Graphics (Proceedings of SIGGRAPH 2026). Project page: https://houyuanchen111.github.io/UniVidX.github.io/

Details

DOI: 10.1145/3811304
Journal ref: ACM Trans. Graph. 45, 4, Article 51 (July 2026)

English abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/

2605.00654 2026-05-04 cs.LG cs.AI math.OC stat.ML

Reinforcement Learning with Markov Risk Measures and Multipattern Risk Approximation

Andrzej Ruszczynski, Tiangang Zhang

2605.00650 2026-05-04 cs.LG cs.AI

AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

Zhijie Cai, Haolong Chen, Guangxu Zhu

2605.00645 2026-05-04 cs.LG

From Prediction to Practice: A Task-Aware Evaluation Framework for Blood Glucose Forecasting

Alireza Namazi, Heman Shakeri

2605.00644 2026-05-04 cs.LG cs.AI

Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

Jiali Cui, Zhiqiang Lao, Heather Yu

Comments: Transactions on Machine Learning Research, 2026

2605.00641 2026-05-04 cs.LG

Bridging Graph Drawing and Dimensionality Reduction with Stochastic Stress Optimization

Daniel Hangan, Stephen Kobourov, Jacob Miller

Comments: To appear in GDxDR workshop 2026

2605.00640 2026-05-04 cs.LG physics.chem-ph

Knowing when to trust machine-learned interatomic potentials

Shams Mehdi, Ilkwon Cho, Olexandr Isayev