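arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.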

2509.21782 2026-03-05 cs.AI

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Junliang Liu, Jingyu Xiao, Wenxin Tang, Zhixian Wang, Zipeng Xie, Wenxuan Wang, Minrui Zhang, Shuanghe Yu

Abstract

Multimodal large language models (MLLMs) are increasingly deployed as the core reasoning engine for web-facing systems, powering GUI agents and front-end automation that must interpret page structure, select actionable widgets, and execute multi-step interactions reliably. However, existing benchmarks largely emphasize visual perception or UI code generation and offer insufficient evaluation of the reasoning, robustness, and safety capabilities required for end-to-end web applications. To bridge this gap, we introduce a comprehensive web understanding benchmark, WebRRSBench, that jointly evaluates Reasoning, Robustness, and Safety across eight tasks, such as position relationship reasoning, color robustness, and safety-critical detection. The benchmark is constructed from 729 websites and contains 3799 QA pairs that probe multi-step inference over page structure, text, widgets, and safety-critical interactions. To ensure reliable measurement, we adopt standardized prompts, a protocolized and deterministic evaluation pipeline, and multi-stage quality control combining automatic checks with targeted human verification. We evaluate 11 MLLMs on WebRRSBench. The results reveal significant gaps: models still struggle with compositional and cross-element reasoning over realistic layouts, show limited robustness when facing perturbations in user interfaces and content such as layout rearrangements or visual style shifts, and are rather conservative in recognizing and avoiding safety-critical or irreversible actions. Our code and appendix are available at https://github.com/annoy-worker/WebRSSBench.


2509.21675 2026-03-05 cs.LG math.OC

Scalable Second-order Riemannian Optimization for $K$-means Clustering

Peng Xu, Chun-Ying Hou, Xiaohui Chen, Richard Y. Zhang

2509.19624 2026-03-05 cs.CV

Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG

Mahmoud Afifi, Ran Zhang, Michael S. Brown

2509.19084 2026-03-05 cs.LG cs.AI

Bridging Computational Social Science and Deep Learning: Cultural Dissemination-Inspired Graph Neural Networks

Asela Hevapathige

2509.18979 2026-03-05 cs.RO cs.CV

Category-Level Object Shape and Pose Estimation in Less Than a Millisecond

Lorenzo Shaikewitz, Tim Nguyen, Luca Carlone

Comments Accepted to ICRA 2026. This version contains appendices

2509.18311 2026-03-05 cs.RO

Fine-Tuning Robot Policies While Maintaining User Privacy

Benjamin A. Christie, Sagar Parekh, Dylan P. Losey

2509.17874 2026-03-05 cs.LG

Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models

Paulius Rauba, Mihaela van der Schaar

Comments Published as a conference paper at ICLR 2026

2509.17844 2026-03-05 cs.CL

Trust Me, I Can Convince You: The Contextualized Argument Appraisal Framework

Lynn Greschner, Sabine Weber, Roman Klinger

Comments Accepted at LREC 2026

2509.16677 2026-03-05 cs.CV cs.LG cs.RO eess.IV

Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang

Comments Accepted to ICRA 2026. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL

Abstract

Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build the first benchmark for action-based video object segmentation under label noise, ActiSeg-NL, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off in which some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.


2509.14858 2026-03-05 cs.SD cs.AI

MeanFlowSE: one-step generative speech enhancement via conditional mean flow

Duojia Li, Shenghui Lu, Hongchen Pan, Zongyi Zhan, Qingyang Hong, Lin Li

2509.14610 2026-03-05 cs.CV

Enhancing Feature Fusion of U-like Networks with Dynamic Skip Connections

Yue Cao, Quansong He, Kaishen Wang, Jianlong Xiong, Zhang Yi, Tao He

2509.14516 2026-03-05 cs.RO

Event-LAB: Towards Standardized Evaluation of Neuromorphic Localization Methods

Adam D. Hines, Alejandro Fontan, Michael Milford, Tobias Fischer

Comments 8 pages, 6 figures, accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

2509.06415 2026-03-05 cs.CV cs.AI cs.CL

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Jaemin Son, Sujin Choi, Inyong Yun

Comments Accepted to ICLR 2026 Workshop MM Intelligence

2509.04111 2026-03-05 cs.CL

MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

Comments Camera-ready version for LREC 2026

2509.02071 2026-03-05 cs.RO

A Geometric Method for Base Parameter Analysis in Robot Inertia Identification Based on Projective Geometric Algebra

Guangzhen Sun, Ye Ding, Xiangyang Zhu

Comments 20 pages, 10 figures

2508.21648 2026-03-05 cs.AI

Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI

Farhad Abtahi, Mehdi Astaraki, Fernando Seoane

2508.17986 2026-03-05 cs.RO

No Need to Look! Locating and Grasping Objects by a Robot Arm Covered with Sensitive Skin

Karel Bartunek, Lukas Rustler, Matej Hoffmann

Comments Karel Bartunek, Lukas Rustler: Authors contributed equally. Accepted to IEEE ICRA 2026

2508.12880 2026-03-05 cs.CV

Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models

Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Chen Zhu, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li

Comments Accepted by ICLR 2026

2508.11538 2026-03-05 cs.CV

Reinforcing Video Reasoning Segmentation to Think Before It Segments

Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu

Comments 12 pages

2508.07782 2026-03-05 cs.CV

GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

Saihui Hou, Chenye Wang, Wenpeng Lang, Zhengxiang Lan, Yongzhen Huang

Comments Accepted by ICLR 2026

2508.07321 2026-03-05 cs.CL cs.AI cs.LG

ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh

Comments LREC 2026

2508.06803 2026-03-05 cs.CL cs.MA

SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Ziqi Liu, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan, Zhijie Xu, Yangbin Chen

2508.03284 2026-03-05 cs.AI

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Shaofeng Yin, Ting Lei, Yang Liu

Comments Project page: https://fugtemypt123.github.io/ToolVQA-website/

2508.03099 2026-03-05 cs.RO

Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

Sang Min Kim, Hyeongjun Heo, Junho Kim, Yonghyeon Lee, Young Min Kim

Comments Accepted to ICRA 2026

2508.01603 2026-03-05 cs.CV

Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

Yiheng Li, Zichang Tan, Guoqing Xu, Zhen Lei, Xu Zhou, Yang Yang

Comments Accepted by CVPR 2026

2508.01222 2026-03-05 cs.CL cs.AI

WebDS: An End-to-End Benchmark for Web-based Data Science

Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning

Comments 14 pages, ICLR 2026

2507.20704 2026-03-05 cs.CL cs.AI cs.CR

Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas

Comments 9 pages, 9 figures. Jake Thomas served as Editor for this manuscript

2507.16405 2026-03-05 cs.AI cs.LG

Self-Supervised Inductive Logic Programming

Stassa Patsantzis

Comments To appear in AAAI-26 conference proceedings. Updated 04 March 2026 to include appendices, update affiliation, contact details

2507.15796 2026-03-05 cs.AI

From Privacy to Trust in the Agentic Era: A Taxonomy of Challenges in Trustworthy Federated Learning Through the Lens of Trust Report 2.0

Nuria Rodríguez-Barroso, Mario García-Márquez, M. Victoria Luzón, Francisco Herrera

Comments Already published in Information Fusion

2507.15557 2026-03-05 cs.CL

Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

Comments LREC 2026