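arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.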

2602.12652 2026-05-01 cs.CV

CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

Marco Stricker, Masakazu Iwamura, Koichi Kise

Comments We are currently in the process of selecting an appropriate journal for submission

English abstract

Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, the literature often performs cloudless analysis, in which cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time-sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloud-free solutions do not fail under such conditions. However, cloud removal methods are still under active research and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud-robust methods that are less affected by cloudy weather. Cloud-robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvements of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: https://github.com/mstricker13/CBEN
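
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of average precision for a single class, computed from ranked scores. It is not taken from the CBEN code; the function name `average_precision` and the convention of returning 0.0 when a class has no positives are assumptions made for illustration, and the abstract does not say how AP is averaged across classes.

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k over the ranks k where a positive appears."""
    # Rank example indices by descending score (stable sort keeps input order on ties).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumed convention: a class with no positive examples gets AP 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


# Positives sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2, about 0.833.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A drop "in percentage points" is then the plain difference between two such AP values multiplied by 100, not a ratio.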

2602.11084 2026-05-01 cs.LG cs.AI

GRASP: group-Shapley feature selection for patients

Yuheng Luo, Shuyan Li, Zhong Cao

Comments 5 pages, 4 figures, 2 tables. Accepted at IEEE ICASSP 2026

2602.07915 2026-05-01 cs.LG cs.AI stat.ME stat.ML

CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios

Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu

Comments Major revision from the previous version

2602.07408 2026-05-01 cs.AI cs.MA

Progressive Multi-Agent Reasoning for Biological Perturbation Prediction

Hyomin Kim, Sang-Yeon Hwang, Jaechang Lim, Yinhua Piao, Yunhak Oh, Woo Youn Kim, Chanyoung Park, Sungsoo Ahn, Junhyeok Jeon

Comments 17 pages, 4 figures, 9 tables

2602.00937 2026-05-01 cs.RO cs.AI cs.CV cs.LG

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang, Connor Schenck

Comments Accepted to Robotics: Science and Systems (RSS) 2026

2602.00095 2026-05-01 cs.CV cs.AI cs.CY

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. Project Website: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL GitHub and Dataset: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL

2601.22228 2026-05-01 cs.CV cs.AI cs.CL

Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

2601.22199 2026-05-01 cs.RO cs.HC

Game-Based and Gamified Robotics Education: A Comparative Systematic Review and Design Guidelines

Syed T. Mubarrat, Byung-Cheol Min, Tianyu Shao, E. Cho Smith, Bedrich Benes, Alejandra J. Magana, Christos Mousas, Dominic Kao

Comments Accepted for publication in the Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. 26 pages, 14 figures, 7 tables

2601.20969 2026-05-01 cs.AI

The Epistemic Planning Domain Definition Language: Official Guideline

Alessandro Burigana, Francesco Fabiano

2601.19768 2026-05-01 cs.AI cs.CR cs.LG

GAVEL: Towards Rule-Based Safety Through Activation Monitoring

Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

Comments Accepted to ICLR 2026

2601.14289 2026-05-01 cs.CL cs.AI

RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li

Comments ACL'26, 12 pages, 23 appendix pages

2601.11908 2026-05-01 cs.CL

PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

Byeongjin Kim, Gyuwan Kim, Seo Yeon Park

Comments Accepted to the Main Conference of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). 27 pages, 6 figures

2601.08418 2026-05-01 cs.LG cs.AI

Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance

Jihang Li, Qing Liu, Zulong Chen, Jing Wang, Wei Wang, Chuanfei Xu, Zeyi Wen

2601.06394 2026-05-01 cs.CV cs.AI

Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification

Ahmed Abdelkawy, Ahmed Elsayed, Asem Ali, Aly Farag, Thomas Tretter, Michael McIntyre

Comments Accepted to the Computer Vision for Education (CV4Edu) workshop, CVPR 2026

2601.05052 2026-05-01 cs.LG stat.ML

DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul

Comments 25 pages, 20 tables, 2 figures

2601.02845 2026-05-01 cs.CL cs.AI

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

Kai Li, Xuanqing Yu, Ziyi Ni, Yi Zeng, Yao Xu, Zheqing Zhang, Xin Li, Jitao Sang, Xiaogang Duan, Xuelei Wang, Chengbao Liu, Jie Tan

Comments ACL 2026 Findings

2601.01885 2026-05-01 cs.CL

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu

Comments The code is available at https://github.com/y1y5/AgeMem

2512.24329 2026-05-01 cs.CL

World model inspired sarcasm reasoning with large language model agents

Keito Inoshita, Shinnosuke Mizuno

DOI: 10.1007/s44163-026-01360-7

English abstract

Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.

2512.23192 2026-05-01 cs.LG

PGOT: A Physics-Geometry Operator Transformer for Complex PDEs

Zhuo Zhang, Xi Yang, Ying Miao, Xiaobin Hu, Yifu Gao, Yuan Zhao, Yong Yang, Canqun Yang, Boocheong Khoo

Comments 24 pages, 17 figures

2512.22046 2026-05-01 cs.CV cs.CR

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He, Xingshuo Han, Shengmin Xu, Xinyi Huang

2512.17435 2026-05-01 cs.RO

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Teng Wang, Xinxin Zhao, Wenzhe Cai, Changyin Sun

Comments 17 pages, 10 figures. arXiv admin note: text overlap with arXiv:2410.09874

2512.14067 2026-05-01 cs.CL cs.AI cs.LG

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

English abstract

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.
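
The block-wise attention pattern described in the abstract (causal across blocks, bidirectional within each block) can be sketched as a boolean mask. This is only an illustration under stated assumptions, not the Efficient-DLM implementation: the function name `blockwise_attention_mask`, the fixed `block_size`, and the pure-Python list representation are all hypothetical choices.

```python
def blockwise_attention_mask(seq_len, block_size):
    """mask[i][j] is True when query position i may attend to key position j."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Allowed when j's block is the same as, or earlier than, i's block:
            # bidirectional inside a block, causal from one block to the next.
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


mask = blockwise_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 110000
# 110000
# 111100
# 111100
# 111111
# 111111
```

With `block_size=1` the mask reduces to ordinary causal attention, and with `block_size=seq_len` it becomes fully bidirectional, which is one way to see the pattern as an interpolation between the two. Because earlier blocks never attend to later ones, their keys and values can be cached once a block is finalized.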

2512.10267 2026-05-01 cs.CV

Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin

Journal ref: IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings (CVPRF), 2026

English abstract

Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

2512.09111 2026-05-01 cs.RO cs.AI math.OC

Language-Conditioned Safe Trajectory Generation for Spacecraft Rendezvous

Yuji Takubo, Arpit Dwivedi, Sukeerth Ramkumar, Luis A. Pabon, Daniele Gammelli, Marco Pavone, Simone D'Amico

Comments 42 pages, 12 figures. Submitted to AIAA Journal of Guidance, Control, and Dynamics

2512.02535 2026-05-01 cs.RO

AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning

Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti

2511.12386 2026-05-01 cs.CV

Leveraging Quantum-Based Architectures for Robust Diagnostics

Shabnam Sodagari, Tommy Long

2510.23498 2026-05-01 cs.LG cs.AI cs.NA math.NA

Mixed Precision Training of Neural ODEs

Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang

Comments Code available at https://github.com/EmoryMLIP/rampde; 30 pages, 5 figures

2510.20933 2026-05-01 cs.CV cs.AI

Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation

Moin Safdar, Shahzaib Iqbal, Mubeen Ghafoor, Tariq M. Khan, Imran Razzak, Thantrira Porntaveetus, Hamid Alinejad-Rokny

2510.18731 2026-05-01 cs.CL cs.AI cs.LG

Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards

Ming Li, Pei Chen, Zhenhao Zhang, Tao Yang, Xinyang Zhang, Han Li, Tianyu Cao, Ming Zeng, Zhuofeng Wu, Meng Jiang, Huasheng Li, Lihong Li, Bing Yin

Comments ACL 2026, camera-ready

2510.15350 2026-05-01 cs.RO cs.NE

Nauplius Optimisation for Autonomous Hydrodynamics

Shyalan Ramesh, Scott Mann, Alex Stumpf

Comments IEEE Access, 2026