arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2604.27499 2026-05-01 cs.CV

Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark

Shuo Wang, Jilin Mei, Wenfei Guan, Shuai Wang, Yan Xing, Chen Min, Yu Hu

详情

英文摘要

Off-road nighttime autonomous driving suffers from unreliable visible-light perception, making infrared modality crucial for accurate freespace detection. However, progress remains limited due to the scarcity of annotated infrared off-road datasets and the inter-frame inconsistencies inherent to current single-frame methods. To address these gaps, we present the IRON dataset, which, to our knowledge, is the first large-scale infrared dataset for off-road temporal freespace detection under all-day conditions, with strong support for nighttime perception. The dataset comprises 24,314 densely annotated infrared images with synchronized RGB images in diverse scenes and different light conditions. Building upon this dataset, we propose IRONet, a novel flow-free framework for temporal freespace detection that addresses inter-frame inconsistencies by aggregating historical context via a memory-attention mechanism and a carefully designed mask decoder. On our IRON dataset, IRONet achieves state-of-the-art performance, reaching 82.93%(+1.19%) IoU and 90.66%(+0.71%) F1 score at real-time inference. Remarkably, IRONet also exhibits robust generalization to RGB modalities on ORFD and Rellis datasets. Overall, our work establishes a foundation for reliable all-day off-road autonomous driving and future research in infrared temporal perception. The code and IRON dataset are available at https://github.com/wsnbws/IRON.

URL PDF HTML ☆

赞 0 踩 0

2604.27495 2026-05-01 cs.CL cs.AI

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

Comments Accepted to ACL 2026 Main Conference

2604.27491 2026-05-01 cs.CV

Uni-HOI:A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

Mengfei Zhang, Jinlu Zhang, Zhigang Tu

Comments 10 pages

2604.27488 2026-05-01 cs.CL

Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Yu Tian, Jiawei Chen, Lifan Zheng, Mingxiang Tao, Xinyi Zeng, Zhaoxia Yin, Hang Su, Xian Sun

2604.27487 2026-05-01 cs.LG cs.CR

Low Rank Adaptation for Adversarial Perturbation

Han Liu, Shanghao Shi, Yevgeniy Vorobeychik, Chongjie Zhang, Ning Zhang

2604.27478 2026-05-01 cs.LG cs.SY eess.SY

Toward Scalable SDN for LEO Mega-Constellations: A Graph Learning Approach

Sivaram Krishnan, Bassel Al Homssi, Zhouyou Gu, Jihong Park, Sung-Min Oh, Jinho Choi

2604.27472 2026-05-01 cs.AI cs.LG cs.RO

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

Yang Zhang, Jiangyuan Zhao, Chenyou Fan, Fangzheng Yan, Tian Li, Haitong Tang, Sen Fu, Xuan'er Wu, Qizhen Weng, Weinan Zhang, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li

Comments 38 pages, 12 figures

详情

英文摘要

Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot learning as a goal-reaching process that requires understanding temporal task progress. We present \textbf{PRTS} (\textbf{P}rimitive \textbf{R}easoning and \textbf{T}asking \textbf{S}ystem), a VLA foundation model that reformulates pretraining through Goal-Conditioned Reinforcement Learning. By treating language instructions as goals and employing contrastive reinforcement learning, PRTS learns a unified embedding space where the inner product of state-action and goal embeddings approximates the log-discounted goal occupancy, the probability of reaching the language-specified goal from the current state-action, quantitatively assessing physical feasibility beyond static semantic matching. PRTS draws this dense goal-reachability supervision directly from offline trajectories without reward annotations, and folds it into the VLM backbone via a role-aware causal mask, incurring negligible overhead over vanilla behavior cloning. This paradigm endows the high-level reasoning system with intrinsic goal reachability awareness, bridging semantic reasoning and temporal task progress, and further benefits goal-conditioned action prediction. Pretrained on 167B tokens of diverse manipulation and embodied-reasoning data, PRTS reaches state-of-the-art performance on LIBERO, LIBERO-Pro, LIBERO-Plus, SimplerEnv, and a real-world suite of 14 complex tasks, with particularly substantial gains on long-horizon, contact-rich, and zero-shot novel-instruction settings, confirming that injecting goal-reachability awareness significantly improves both execution success and long-horizon planning of general-purpose robotic foundation policies.

URL PDF HTML ☆

赞 0 踩 0

2604.27470 2026-05-01 cs.CL

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, Karan Singhal

Comments Data link in paper; Blog: https://openai.com/index/making-chatgpt-better-for-clinicians/

2604.27462 2026-05-01 cs.LG cs.AI

Improving Graph Few-shot Learning with Hyperbolic Space and Denoising Diffusion

Yonghao Liu, Jialu Sun, Wei Pang, Fausto Giunchiglia, Ximing Li, Xiaoyue Feng, Renchu Guan

2604.27453 2026-05-01 cs.CL

From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks

Qingyu Ren, Tianjun Pan, Xingzhou Chen, Xuhong Wang

2604.27450 2026-05-01 cs.RO cs.AI

RAY-TOLD: Ray-Based Latent Dynamics for Dense Dynamic Obstacle Avoidance with TDMPC

Seungho Han, Seokju Lee, Jeonguk Kang

Comments 8 pages, 4 figures

2604.27448 2026-05-01 cs.CV

LA-Pose: Latent Action Pretraining Meets Pose Estimation

Zhengqing Wang, Saurabh Nair, Prajwal Chidananda, Pujith Kachana, Samuel Li, Matthew Brown, Yasutaka Furukawa

Comments Project page: https://la-pose.github.io/

2604.27445 2026-05-01 cs.CV

Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed

Wenqian Zhang, Zehao Wang

Comments Accepted to the CVPR 2026 Animal Workshop

2604.27443 2026-05-01 cs.LG cs.AI

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

Gabe Guo, Thanawat Sornwanee, Lutong Hao, Elon Litman, Stefano Ermon, Jose Blanchet

2604.27439 2026-05-01 cs.CL

Sentiment Analysis of AI Adoption in Indonesian Higher Education Using Machine Learning and Transformer-Based Models

Happy Syahrul Ramadhan, Ahmad Sahidin Akbar, Karin Yehezkiel Sinaga, Luluk Muthoharoh, Ardika Satria, Martin C. T. Manullang

Comments 8 pages, 6 figures, 7 tables. The paper compares TF-IDF-based machine learning models and DistilBERT for Indonesian sentiment analysis on student opinions about AI adoption in higher education. The manuscript reports that DistilBERT achieves the best overall test performance, while SVM is the strongest classical baseline

2604.27437 2026-05-01 cs.CV

Softmax-GS: Generalized Gaussians Learning When to Blend or Bound

Chen Ziwen, Peng Wang, Hao Tan, Zexiang Xu, Li Fuxin

2604.27434 2026-05-01 cs.LG cs.AI cs.CR

AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Bzantine-Robust Federated Learning

Zehui Tang, Yuchen Liu, Feihu Huang

Comments 24 pages

2604.27422 2026-05-01 cs.CV

Sparse-View 3D Gaussian Splatting in the Wild

Wongi Park, Jordan A. James, Myeongseok Nam, Minjae Lee, Soomok Lee, Sang-Hyun Lee, William J. Beksi

Comments 18 pages, 14 figures, and 14 tables

2604.27419 2026-05-01 cs.AI cs.CL

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Qiyao Wang, Haoran Hu, Longze Chen, Hongbo Wang, Hamid Alinejad-Rokny, Yuan Lin, Min Yang

Comments 21 pages, 13 figures, 7 tables

2604.27415 2026-05-01 cs.LG

ChipLingo: A Systematic Training Framework for Large Language Models in EDA

Lei Li, Xingwen Yu, Jianguo Ni, Junxuan Zhu, Jieqiong Zhang, Jian Zhao, Zhi Liu

2604.27414 2026-05-01 cs.CV cs.CR cs.LG

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese

Comments 9 pages, 2 figures. Accepted at SAE WCX 2026

2604.27411 2026-05-01 cs.LG

Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

Haiyang Zhao

2604.27405 2026-05-01 cs.CL cs.AI

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

Jon-Paul Cacioli

Comments 7 pages, 4 figures, 2 tables. Pre-registered study. Code and data available

2604.27401 2026-05-01 cs.CL cs.LG

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Hongliang Liu, Tung-Ling Li, Yuhao Wu

详情

英文摘要

Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight behavioral circuits, 13 models, and four architecture families, we identify two circuit structures that organize LLM behavior. Opposition circuits appear when RLHF suppresses a pre-training tendency. In safety refusal, about 50 neurons, or 0.014 percent of all neurons, control the refusal template; ablating them changes 80 percent of response formats on 520 AdvBench prompts while producing near-zero harmful compliance, 3 of 520 cases, all with disclaimers. Routing circuits appear for pre-training behaviors distributed through attention. For language selection, residual-stream direction injection switches English to Chinese output on 99.1 percent of 580 benchmark prompts in the 3 of 19 tested models that satisfy three observed conditions: bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability. The same intervention fails on the other 16 models and on math, code, and factual circuits, defining the limits of directional steering. The FFN-to-skip signal ratio, computed from the same two forward passes, distinguishes the two structures and predicts the appropriate intervention. Circuit topology varies by architecture, from Qwen's concentrated FFN bottleneck to Gemma's normalization-shielded circuit. In Qwen3.5-2B, ablating 20 neurons eliminates multi-turn sycophantic capitulation, while amplifying 10 related neurons improves factual correction from 52 percent to 88 percent on 200 TruthfulQA prompts. These results show that perturbation probing offers mechanistic insight into RLHF-organized behavior and a practical toolkit for precision template-layer editing.

URL PDF HTML ☆

赞 0 踩 0

2604.27398 2026-05-01 cs.CL

Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

Tomomasa Hara, Hiroto Kurita, Masaaki Imaizumi, Kentaro Inui, Sho Yokoi

Comments ACL 2026 Main Conference; GitHub: https://github.com/tohoku-nlp/socm-text-embedding

2604.27393 2026-05-01 cs.CL

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, Yuan Yao

详情

英文摘要

Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.

URL PDF HTML ☆

赞 0 踩 0

2604.27392 2026-05-01 cs.AI cs.CL cs.CY cs.HC

Leading Across the Spectrum of Human-AI Relationships: A Conceptual Framework for Increasingly Heterogeneous Teams

Alejandro R. Jadad

Comments 13 pages, 1 figure, 1 table, 1 appendix, 8 references

详情

英文摘要

What shapes a consequential decision when human and artificial intelligence work on it together? The answer is becoming harder to see. A decision may look human-led after AI has set the frame, or appear automated while human judgment still carries decisive force. This paper offers a leadership-facing spectrum to see those relationships within a bounded mandate: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI. The spectrum asks where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows. The five positions are landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision. The central risk is misrecognition: leaders may keep a human-centered story in place after decision-shaping authority has shifted elsewhere. They may believe oversight remains meaningful when it has become ceremonial, or keep humans in the loop when their involvement could make the decision worse. The framework introduces co-adaptability, the capacity of a configuration to improve as human and non-human participants adjust together, and places it within heterogeneous teaming, where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation. The aim is practical: to help strategic leaders and those designing or deploying AI systems recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them. These configurations will shape how power, responsibility, and trust are distributed in organizational life. Whether the futures they help create remain governable and worth inhabiting will depend on leaders who can see, early enough, where and how consequential decisions are actually being shaped.

URL PDF HTML ☆

赞 0 踩 0

2604.27387 2026-05-01 cs.AI

Robust Learning on Heterogeneous Graphs with Heterophily: A Graph Structure Learning Approach

Yihan Zhang, Ercan E. Kuruoglu

2604.27385 2026-05-01 cs.RO cs.HC cs.SY eess.SY

An Experimental Modular Instrument With a Haptic Feedback Framework for Robotic Surgery Training

Walid Shaker, Mustafa Suphi Erden

Comments Accepted to the 11th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob 2026)

2604.27379 2026-05-01 cs.CL cs.LG

Proactive Dialogue Model with Intent Prediction

Yang Luo

Comments 9 pages, 1 figure