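arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.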

2604.19643 2026-04-22 cs.RO

A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots

Alex Lin, Lei Gao, Narsimlu Kemsaram, Sriram Subramanian

Comments This paper has been accepted for publication in the Proceedings of the 2026 4th International Conference on Robotics, Control and Vision Engineering (RCVE 2026)

Abstract

AcoustoBots are mobile acoustophoretic robots capable of delivering mid-air haptics, directional audio, and acoustic levitation, but existing implementations rely on scripted commands and lack an intuitive interface for real-time human control. This work presents a gesture-based visual learning framework for contactless human-swarm interaction with a multimodal AcoustoBot platform. The system combines ESP32-CAM gesture capture, PhaseSpace motion tracking, centralized processing, and an OpenCLIP-based visual learning model (VLM) with linear probing to classify three hand gestures and map them to haptics, audio, and levitation modalities. Validation accuracy improved from about 67% with a small dataset to nearly 98% with the largest dataset. In integrated experiments with two AcoustoBots, the system achieved an overall gesture-to-modality switching accuracy of 87.8% across 90 trials, with an average end-to-end latency of 3.95 seconds. These results demonstrate the feasibility of using a vision-language-model-based gesture interface for multimodal human-swarm interaction. While the current system is limited by centralized processing, a static gesture set, and controlled-environment evaluation, it establishes a foundation for more expressive, scalable, and accessible swarm robotic interfaces.
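
The linear-probing step mentioned in the abstract can be illustrated with a minimal sketch: a linear (softmax) classifier trained on top of fixed features, with the encoder itself left untouched. Everything here is an assumption for illustration: the 3-d `FEATURES` stand in for frozen OpenCLIP embeddings, the gesture-to-modality order in `MODALITIES` is hypothetical, and the learning rate and epoch count are arbitrary. It is not the authors' implementation.

```python
import math

# Hypothetical stand-ins for frozen encoder outputs (not real OpenCLIP features).
FEATURES = [
    [1.0, 0.1, 0.0], [0.9, 0.0, 0.1],  # gesture class 0
    [0.1, 1.0, 0.0], [0.0, 0.9, 0.1],  # gesture class 1
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9],  # gesture class 2
]
LABELS = [0, 0, 1, 1, 2, 2]
# Assumed mapping order; the abstract does not say which gesture selects which modality.
MODALITIES = ["haptics", "audio", "levitation"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, bias, x):
    """Linear layer: one dot product plus bias per class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, bias)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit only the linear head with SGD on cross-entropy; features stay fixed."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, bias, x))
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
                bias[k] -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    """Return the index of the class with the largest logit."""
    scores = logits_for(weights, bias, x)
    return max(range(len(scores)), key=scores.__getitem__)


weights, bias = train_linear_probe(FEATURES, LABELS, num_classes=3)
predictions = [predict(weights, bias, x) for x in FEATURES]
accuracy = sum(p == y for p, y in zip(predictions, LABELS)) / len(LABELS)
selected_modality = MODALITIES[predict(weights, bias, [0.0, 0.1, 1.0])]
```

In a real pipeline the feature vectors would come from the frozen image encoder, and accuracy would be measured on held-out gestures rather than on the training set as in this toy run.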

2604.19642 2026-04-22 cs.CL

Micro Language Models Enable Instant Responses

Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota

2604.19638 2026-04-22 cs.AI cs.CL cs.RO

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai

Comments Work accepted at ACL 2026 Findings

2604.19636 2026-04-22 cs.CV

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma

Comments The project page: https://xinxiaozhe12345.github.io/CoInteract_Project/

2604.19635 2026-04-22 cs.SD cs.AI

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong, Jiakui Li, Huimeng Wang, Haiyun Li, Liang Cao, Shiyin Kang, Zhiyong Wu

2604.19633 2026-04-22 cs.AI cs.CE

Time Series Augmented Generation for Financial Applications

Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena

Comments 11 pages, 3 figures, 2 tables

2604.19632 2026-04-22 cs.CV

CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers

Weidong Chen, Dexiang Hong, Zhendong Mao, Yutao Cheng, Xinyan Liu, Lei Zhang, Yongdong Zhang

2604.19631 2026-04-22 cs.CV

MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation

Xuejiao Wang, Bohao Zhang, Changbo Wang, Gaoqi He

2604.19624 2026-04-22 cs.CV

GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

Pradyumna YM, Yuxuan Xue, Yue Chen, Nikita Kister, István Sárándi, Gerard Pons-Moll

Comments Project Page: https://pradyumnaym.github.io/graft

2604.19623 2026-04-22 cs.LG cs.CV eess.SP

SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets

Inhyeok Choi, Hyuncheol Park

Comments 11 pages, 9 figures

2604.19620 2026-04-22 cs.CL

The "Small World of Words" German Free-Association Norms

Samuel Aeschbach, Rui Mata, Kaidi Lõo, Simon De Deyne, Dirk U. Wulff

2604.19618 2026-04-22 cs.RO

Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing

Wen Li, Hui Wang, Jinya Su, Cunjia Liu, Wen-Hua Chen, Shihua Li

Comments 11 pages, 12 figures, Under Review

2604.19609 2026-04-22 cs.CV

Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

Kadir Yilmaz, Adrian Kruse, Tristan Höfer, Daan de Geus, Bastian Leibe

Comments Project page: https://vision.rwth-aachen.de/Volt

2604.19592 2026-04-22 cs.LG cs.GT

An Efficient Black-Box Reduction from Online Learning to Multicalibration, and a New Route to $\Phi$-Regret Minimization

Gabriele Farina, Juan Carlos Perdomo

2604.19587 2026-04-22 cs.CV

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang, Ruiyang Fan, Linxiao Shi, Qirui Yang, Jian Zhang, Chengcheng Liu, Siming Zheng, Jinwei Chen, Bo Li, Peng-Tao Jiang

Comments Tech report

Abstract

Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method that formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first uses the Image Critic module to assess image quality and identify deficiencies; the Photographic Artist module then applies targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.

2604.19584 2026-04-22 cs.CL

A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry

Silvio Calderaro, Johanna Monti

Comments Accepted at the DIALRES Workshop, LREC-COLING 2026

2604.19578 2026-04-22 cs.CL cs.AI cs.DL cs.IR

Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao

Comments Scientometrics

2604.19570 2026-04-22 cs.CV

RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

Ahmed Marouane Djouama, Abir Belaala, Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Cosimo Distante, Abdenour Hadid

2604.19569 2026-04-22 cs.LG cs.AI cs.SY eess.SY

Lyapunov-Certified Direct Switching Theory for Q-Learning

Donghwan Lee

2604.19567 2026-04-22 cs.AI

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Chuou Xu, Liya Ji, Qifeng Chen

Abstract

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy "king"-"man"+"woman" = "queen" illustrates relational reasoning, yet replacing text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
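
The text analogy "king" - "man" + "woman" = "queen" cited in the abstract can be sketched with plain vector arithmetic followed by a nearest-neighbour lookup. The 3-d embeddings below are hand-made toy values chosen purely for illustration, and `analogy` is a hypothetical helper; this shows the classic text-only baseline, not the paper's SAri-RFT method or its image-based tasks.

```python
import math

# Hypothetical toy embeddings; real word vectors have hundreds of dimensions.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "cake":  [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


result = analogy("king", "man", "woman", EMBEDDINGS)
print(result)
```

With these toy vectors the lookup returns "queen". The abstract's point is that the same operation becomes much harder once the inputs are images, because the relevant concept must first be separated from irrelevant visual detail.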

2604.19565 2026-04-22 cs.CL cs.AI cs.LG

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov

Comments Accepted to Findings of ACL 2026

2604.19562 2026-04-22 cs.LG

Structure-guided molecular design with contrastive 3D protein-ligand learning

Carles Navarro, Philipp Tholke, Gianni de Fabritiis

2604.19561 2026-04-22 cs.AI

Detecting Data Contamination in Large Language Models

Juliusz Janicki, Savvas Chamezopoulos, Evangelos Kanoulas, Georgios Tsatsaronis

2604.19560 2026-04-22 cs.LG math.OC stat.ML

Separating Geometry from Probability in the Analysis of Generalization

Maxim Raginsky, Benjamin Recht

Comments 19 pages

2604.19559 2026-04-22 cs.AI cs.CL cs.LG

Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics

Syed Sajid Ullah, Amir Khan

2604.19556 2026-04-22 cs.CV

Paparazzo: Active Mapping of Moving 3D Objects

Davide Allegro, Shiyao Li, Stefano Ghidoni, Vincent Lepetit

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

2604.19548 2026-04-22 cs.CL cs.AI cs.CY

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu

Comments ACL 2026 Main Conference. Project page: https://unikcc.github.io/ReTAS/

2604.19547 2026-04-22 cs.CL

Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment

Tianxiang Ma, Weijie Feng, Xinyu Wang, Zhiyong Cheng

2604.19544 2026-04-22 cs.AI

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen

Comments code will be uploaded to https://github.com/zhang123434/DT2IT-MRM

2604.19538 2026-04-22 cs.AI cs.HC cs.MA

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

Farbod Zorriassatine, Ahmad Lotfi

Comments 6 pages, 3 figures