arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1559
2509.21782 2026-03-05 cs.AI

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Junliang Liu, Jingyu Xiao, Wenxin Tang, Zhixian Wang, Zipeng Xie, Wenxuan Wang, Minrui Zhang, Shuanghe Yu

详情
英文摘要

Multimodal large language models (MLLMs) are increasingly deployed as the core reasoning engine for web-facing systems, powering GUI agents and front-end automation that must interpret page structure, select actionable widgets, and execute multi-step interactions reliably. However, existing benchmarks largely emphasize visual perception or UI code generation, showing insufficient evaluation on the reasoning, robustness and safety capability required for end-to-end web applications. To bridge the gap, we introduce a comprehensive web understanding benchmark, named WebRRSBench, that jointly evaluates Reasoning, Robustness, and Safety across eight tasks, such as position relationship reasoning, color robustness, and safety critical detection, etc. The benchmark is constructed from 729 websites and contains 3799 QA pairs that probe multi-step inference over page structure, text, widgets, and safety-critical interactions. To ensure reliable measurement, we adopt standardized prompts, a protocolized and deterministic evaluation pipeline, and multi-stage quality control combining automatic checks with targeted human verification. We evaluate 11 MLLMs on WebRRSBench. The results reveal significant gaps: models still struggle with compositional and cross-element reasoning over realistic layouts, show limited robustness when facing perturbations in user interfaces and content such as layout rearrangements or visual style shifts, and are rather conservative in recognizing and avoiding safety critical or irreversible actions. Our code and appendix are available at https: //github.com/annoy-worker/WebRSSBench.

2509.21675 2026-03-05 cs.LG math.OC

Scalable Second-order Riemannian Optimization for $K$-means Clustering

Peng Xu, Chun-Ying Hou, Xiaohui Chen, Richard Y. Zhang

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR), 2026
英文摘要

Clustering is a hard discrete optimization problem. Nonconvex approaches such as low-rank semidefinite programming (SDP) have recently demonstrated promising statistical and local algorithmic guarantees for cluster recovery. Due to the combinatorial structure of the $K$-means clustering problem, current relaxation algorithms struggle to balance their constraint feasibility and objective optimality, presenting tremendous challenges in computing the second-order critical points with rigorous guarantees. In this paper, we provide a new formulation of the $K$-means problem as a smooth unconstrained optimization over a submanifold and characterize its Riemannian structures to allow it to be solved using a second-order cubic-regularized Riemannian Newton algorithm. By factorizing the $K$-means manifold into a product manifold, we show how each Newton subproblem can be solved in linear time. Our numerical experiments show that the proposed method converges significantly faster than the state-of-the-art first-order nonnegative low-rank factorization method, while achieving similarly optimal statistical accuracy.

2509.19624 2026-03-05 cs.CV

Raw-JPEG Adapter: Efficient Raw Image Compression with JPEG

Mahmoud Afifi, Ran Zhang, Michael S. Brown

详情
英文摘要

Digital cameras digitize scene light into linear raw representations, which the image signal processor (ISP) converts into display-ready outputs. While raw data preserves full sensor information--valuable for editing and vision tasks--formats such as Digital Negative (DNG) require large storage, making them impractical in constrained scenarios. In contrast, JPEG is a widely supported format, offering high compression efficiency and broad compatibility, but it is not well-suited for raw storage. This paper presents RawJPEG Adapter, a lightweight, learnable, and invertible preprocessing pipeline that adapts raw images for standard JPEG compression. Our method applies spatial and optional frequency-domain transforms, with compact parameters stored in the JPEG comment field, enabling accurate raw reconstruction. Experiments across multiple datasets show that our method achieves higher fidelity than direct JPEG storage, supports other codecs, and provides a favorable trade-off between compression ratio and reconstruction accuracy.

2509.19084 2026-03-05 cs.LG cs.AI

Bridging Computational Social Science and Deep Learning: Cultural Dissemination-Inspired Graph Neural Networks

Asela Hevapathige

详情
英文摘要

Graph Neural Networks (GNNs) have become vital in applications like document classification in citation networks, epidemic forecasting, viral marketing, user recommendation in social networks, and network monitoring. However, their deployment faces three key challenges: feature oversmoothing in deep architectures, poor handling of heterogeneous relationships, and monolithic feature aggregation. To address these, we introduce AxelGNN, a novel architecture based on Axelrod's cultural dissemination model that incorporates three key innovations: (1) similarity-gated interactions that adaptively promote convergence or divergence based on feature similarity, (2) segment-wise feature copying that enables fine-grained aggregation of semantic feature groups rather than monolithic vectors, and (3) global polarization that maintains multiple distinct representation clusters to prevent oversmoothing. This model demonstrates empirically the capability to handle both homophilic and heterophilic graphs within a single architecture, without requiring specialized model selection based on graph characteristics. Our experiments demonstrate that AxelGNN achieves competitive or superior performance compared to existing methods in node classification and influence estimation while maintaining computational efficiency.

2509.18979 2026-03-05 cs.RO cs.CV

Category-Level Object Shape and Pose Estimation in Less Than a Millisecond

Lorenzo Shaikewitz, Tim Nguyen, Luca Carlone

Comments Accepted to ICRA 2026. This version contains appendices

详情
英文摘要

Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object's unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario. Code is released at https://github.com/MIT-SPARK/Fast-ShapeAndPose.

2509.18311 2026-03-05 cs.RO

Fine-Tuning Robot Policies While Maintaining User Privacy

Benjamin A. Christie, Sagar Parekh, Dylan P. Losey

详情
英文摘要

Recent works introduce general-purpose robot policies. These policies provide a strong prior over how robots should behave -- e.g., how a robot arm should manipulate food items. But in order for robots to match an individual person's needs, users typically fine-tune these generalized policies -- e.g., showing the robot arm how to make their own preferred dinners. Importantly, during the process of personalizing robots, end-users leak data about their preferences, habits, and styles (e.g., the foods they prefer to eat). Other agents can simply roll-out the fine-tuned policy and see these personally-trained behaviors. This leads to a fundamental challenge: how can we develop robots that personalize actions while keeping learning private from external agents? We here explore this emerging topic in human-robot interaction and develop PRoP, a model-agnostic framework for personalized and private robot policies. Our core idea is to equip each user with a unique key; this key is then used to mathematically transform the weights of the robot's network. With the correct key, the robot's policy switches to match that user's preferences -- but with incorrect keys, the robot reverts to its baseline behaviors. We show the general applicability of our method across multiple model types in imitation learning, reinforcement learning, and classification tasks. PRoP is practically advantageous because it retains the architecture and behaviors of the original policy, and experimentally outperforms existing encoder-based approaches.

2509.17874 2026-03-05 cs.LG

Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models

Paulius Rauba, Mihaela van der Schaar

Comments Published as a conference paper at ICLR 2026

详情
英文摘要

Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose Nested Subspace Networks (NSNs), a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.

2509.17844 2026-03-05 cs.CL

Trust Me, I Can Convince You: The Contextualized Argument Appraisal Framework

Lynn Greschner, Sabine Weber, Roman Klinger

Comments Accepted at LREC 2026

详情
英文摘要

Emotions that somebody develops based on an argument do not only depend on the argument itself - they are also influenced by a subjective evaluation of the argument's potential impact on the self. For instance, an argument to ban plastic bottles might cause fear of losing a job for a bottle industry worker, which lowers the convincingness - presumably independent of its content. While binary emotionality of arguments has been studied, such cognitive appraisal models have only been proposed in other subtasks of emotion analysis, but not in the context of arguments and their convincingness. To fill this research gap, we propose the Contextualized Argument Appraisal Framework to model the interplay between the sender, receiver, and argument. We adapt established appraisal models from psychology to argument mining, including argument pleasantness, familiarity, response urgency, and expected effort, as well as convincingness variables. To evaluate the framework and pave the way for computational modeling, we develop a novel role-playing-based annotation setup, mimicking real-world exposure to arguments. Participants disclose their emotion, explain the main cause, the argument appraisal, and the perceived convincingness. To consider the subjective nature of such annotations, we also collect demographic data and personality traits of both the participants and ask them to disclose the same variables for their perception of the argument sender. The analysis of the resulting ContArgA corpus of 4000 annotations reveals that convincingness is positively correlated with positive emotions (e.g., trust) and negatively correlated with negative emotions (e.g., anger). The appraisal variables particularly point to the importance of the annotator's familiarity with the argument.

2509.16677 2026-03-05 cs.CV cs.LG cs.RO eess.IV

Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence

Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang

Comments Accepted to ICRA 2026. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL

详情
英文摘要

Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noises for the action-based video object segmentation task. Second, we build up the first action-based video object segmentation under a label noise benchmark ActiSeg-NL and adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.

2509.14858 2026-03-05 cs.SD cs.AI

MeanFlowSE: one-step generative speech enhancement via conditional mean flow

Duojia Li, Shenghui Lu, Hongchen Pan, Zongyi Zhan, Qingyang Hong, Lin Li

详情
英文摘要

Multistep inference is a bottleneck for real-time generative speech enhancement because flow- and diffusion-based systems learn an instantaneous velocity field and therefore rely on iterative ordinary differential equation (ODE) solvers. We introduce MeanFlowSE, a conditional generative model that learns the average velocity over finite intervals along a trajectory. Using a Jacobian-vector product (JVP) to instantiate the MeanFlow identity, we derive a local training objective that directly supervises finite-interval displacement while remaining consistent with the instantaneous-field constraint on the diagonal. At inference, MeanFlowSE performs single-step generation via a backward-in-time displacement, removing the need for multistep solvers; an optional few-step variant offers additional refinement. On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines. The method requires no knowledge distillation or external teachers, providing an efficient, high-fidelity framework for real-time generative speech enhancement. The proposed method is open-sourced at https://github.com/liduojia1/MeanFlowSE.

2509.14610 2026-03-05 cs.CV

Enhancing Feature Fusion of U-like Networks with Dynamic Skip Connections

Yue Cao, Quansong He, Kaishen Wang, Jianlong Xiong, Zhang Yi, Tao He

详情
英文摘要

U-like networks have become fundamental frameworks in medical image segmentation through skip connections that bridge high-level semantics and low-level spatial details. Despite their success, conventional skip connections exhibit two key limitations: inter-feature constraints and intra-feature constraints. The inter-feature constraint refers to the static nature of feature fusion in traditional skip connections, where information is transmitted along fixed pathways regardless of feature content. The intra-feature constraint arises from the insufficient modeling of multi-scale feature interactions, thereby hindering the effective aggregation of global contextual information. To overcome these limitations, we propose a novel Dynamic Skip Connection (DSC) block that fundamentally enhances cross-layer connectivity through adaptive mechanisms. The DSC block integrates two complementary components. (1) Test-Time Training (TTT) module. This module addresses the inter-feature constraint by enabling dynamic adaptation of hidden representations during inference, facilitating content-aware feature refinement. (2) Dynamic Multi-Scale Kernel (DMSK) module. To mitigate the intra-feature constraint, this module adaptively selects kernel sizes based on global contextual cues, enhancing the network capacity for multi-scale feature integration. The DSC block is architecture-agnostic and can be seamlessly incorporated into existing U-like network structures. Extensive experiments demonstrate the plug-and-play effectiveness of the proposed DSC block across CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based U-like networks.

2509.14516 2026-03-05 cs.RO

Event-LAB: Towards Standardized Evaluation of Neuromorphic Localization Methods

Adam D. Hines, Alejandro Fontan, Michael Milford, Tobias Fischer

Comments 8 pages, 6 figures, accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

详情
英文摘要

Event-based localization research and datasets are a rapidly growing area of interest, with a tenfold increase in the cumulative total number of published papers on this topic over the past 10 years. Whilst the rapid expansion in the field is exciting, it brings with it an associated challenge: a growth in the variety of required code and package dependencies as well as data formats, making comparisons difficult and cumbersome for researchers to implement reliably. To address this challenge, we present Event-LAB: a new and unified framework for running several event-based localization methodologies across multiple datasets. Event-LAB is implemented using the Pixi package and dependency manager, that enables a single command-line installation and invocation for combinations of localization methods and datasets. To demonstrate the capabilities of the framework, we implement two common event-based localization pipelines: Visual Place Recognition (VPR) and Simultaneous Localization and Mapping (SLAM). We demonstrate the ability of the framework to systematically visualize and analyze the results of multiple methods and datasets, revealing key insights such as the association of parameters that control event collection counts and window sizes for frame generation to large variations in performance. The results and analysis demonstrate the importance of fairly comparing methodologies with consistent event image generation parameters. Our Event-LAB framework provides this ability for the research community, by contributing a streamlined workflow for easily setting up multiple conditions.

2509.06415 2026-03-05 cs.CV cs.AI cs.CL

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Jaemin Son, Sujin Choi, Inyong Yun

Comments Accepted to ICLR 2026 Workshop MM Intelligence

详情
英文摘要

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.

2509.04111 2026-03-05 cs.CL

MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

Comments Camera-ready version for LREC 2026

详情
英文摘要

We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages and has 1,220,757 samples in total. We start with Wikipedia articles, which also provide the context for the dataset samples, and use an LLM to generate question/answer pairs related to the Wikipedia article, ensuring that the answer appears verbatim within the article. Next, the question is then rephrased to hinder simple word matching methods from performing well on the dataset. We conduct a crowdsourced human evaluation of the fluency of the generated questions, which included 156 respondents across 30 of the languages (both low- and high-resource). All 30 languages received a mean fluency rating above ``mostly natural'', showing that the samples are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. Both the dataset and survey evaluations are publicly available.

2509.02071 2026-03-05 cs.RO

A Geometric Method for Base Parameter Analysis in Robot Inertia Identification Based on Projective Geometric Algebra

Guangzhen Sun, Ye Ding, Xiangyang Zhu

Comments 20 pages, 10 figures

详情
英文摘要

This paper proposes a novel geometric method for analytically determining the base inertial parameters of robotic systems. The rigid body dynamics is reformulated using projective geometric algebra, leading to a new identification model named ``tetrahedral-point (TP)" model. Based on the rigid body TP model, coefficients in the regresoor matrix of the identification model are derived in closed-form, exhibiting clear geometric interpretations. Building directly from the dynamic model, three foundational principles for base parameter analysis are proposed: the shared points principle, fixed points principle, and planar rotations principle. With these principles, algorithms are developed to automatically determine all the base parameters. The core algorithm, referred to as Dynamics Regressor Nullspace Generator (DRNG), achieves $O(1)$-complexity theoretically following an $O(N)$-complexity preprocessing stage, where $N$ is the number of rigid bodies. The proposed method and algorithms are validated across four robots: Puma560, Unitree Go2, a 2RRU-1RRS parallel kinematics mechanism (PKM), and a 2PRS-1PSR PKM. In all cases, the algorithms successfully identify the complete set of base parameters. Notably, the approach demonstrates high robustness and computational efficiency, particularly in the cases of PKMs. Through the comprehensive demonstrations, the method is shown to be general, robust, and efficient.

2508.21648 2026-03-05 cs.AI

Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI

Farhad Abtahi, Mehdi Astaraki, Fernando Seoane

详情
Journal ref
Front. Artif. Intell., Volume 9 (2026)
英文摘要

Bias in medical artificial intelligence is conventionally viewed as a defect requiring elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potentially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof-of-concept demonstrator was developed using over 30 large language models, creating a minimum viable product that preserved both consensus and minority views in synthetic cases, making diagnostic uncertainty and latent biases transparent for clinical oversight. While not yet a validated clinical tool, the demonstration illustrates how structured diversity can enhance medical reasoning under clinician supervision. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathways for developing trustworthy medical AI systems.

2508.17986 2026-03-05 cs.RO

No Need to Look! Locating and Grasping Objects by a Robot Arm Covered with Sensitive Skin

Karel Bartunek, Lukas Rustler, Matej Hoffmann

Comments Karel Bartunek, Lukas Rustler: Authors contributed equally Accepted to IEEE ICRA 2026

详情
英文摘要

Locating and grasping of objects by robots is typically performed using visual sensors. Haptic feedback from contacts with the environment is only secondary if present at all. In this work, we explored an extreme case of searching for and grasping objects in complete absence of visual input, relying on haptic feedback only. The main novelty lies in the use of contacts over the complete surface of a robot manipulator covered with sensitive skin. The search is divided into two phases: (1) coarse workspace exploration with the complete robot surface, followed by (2) precise localization using the end-effector equipped with a force/torque sensor. We systematically evaluated this method in simulation and on the real robot, demonstrating that diverse objects can be located, grasped, and put in a basket. The overall success rate on the real robot for one object was 85.7% with failures mainly while grasping specific objects. The method using whole-body contacts is six times faster compared to a baseline that uses haptic feedback only on the end-effector. We also show locating and grasping multiple objects on the table. This method is not restricted to our specific setup and can be deployed on any platform with the ability of sensing contacts over the entire body surface. This work holds promise for diverse applications in areas with challenging visual perception (due to lighting, dust, smoke, occlusion) such as in agriculture when fruits or vegetables need to be located inside foliage and picked.

2508.12880 2026-03-05 cs.CV

Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models

Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Chen Zhu, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li

Comments Accepted by ICLR 2026

详情
英文摘要

Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S$^2$-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.

2508.11538 2026-03-05 cs.CV

Reinforcing Video Reasoning Segmentation to Think Before It Segments

Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu

Comments 12 pages

详情
Journal ref
CVPR 2026
英文摘要

Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into <SEG> tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J &F in ReVOS and +10.0 J &F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.

2508.07782 2026-03-05 cs.CV

GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

Saihui Hou, Chenye Wang, Wenpeng Lang, Zhengxiang Lan, Yongzhen Huang

Comments Accepted by ICLR 2026

详情
英文摘要

Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.

2508.07321 2026-03-05 cs.CL cs.AI cs.LG

ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh

Comments LREC 2026

详情
英文摘要

The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs' robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte, and leveraging the same, introduce ObfusQA, a comprehensive, first-of-its-kind framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.

2508.06803 2026-03-05 cs.CL cs.MA

SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Ziqi Liu, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan, Zhijie Xu, Yangbin Chen

详情
英文摘要

Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose **SEVADE**, a novel **S**elf-**Ev**olving multi-agent **A**nalysis framework with **D**ecoupled **E**valuation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of **6.75%** in Accuracy and **6.29%** in Macro-F1 score.

2508.03284 2026-03-05 cs.AI

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Shaofeng Yin, Ting Lei, Yang Liu

Comments Project page: https://fugtemypt123.github.io/ToolVQA-website/

详情
英文摘要

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The fine-tuned 7B LFMs on ToolVQA not only achieve impressive performance on our test set but also surpass the large close-sourced model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.

2508.03099 2026-03-05 cs.RO

Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

Sang Min Kim, Hyeongjun Heo, Junho Kim, Yonghyeon Lee, Young Min Kim

Comments Accepted to ICRA 2026

详情
英文摘要

We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models opened the possibility for generalist robots that can perform a zero-shot task following natural language descriptions within an unseen environment. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features deduce blurry 2D regions and struggle to find precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently imbue lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments due to geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, reasoning fine-grained 3D spatial context that can directly transfer to an explicit position for physical action at the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks. Project page: https://sangminkim-99.github.io/point2act/

2508.01603 2026-03-05 cs.CV

Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning

Yiheng Li, Zichang Tan, Guoqing Xu, Zhen Lei, Xu Zhou, Yang Yang

Comments Accepted by CVPR2026

详情
英文摘要

In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively. Codes and weights will be released on https://github.com/liyih/IAPL.

2508.01222 2026-03-05 cs.CL cs.AI

WebDS: An End-to-End Benchmark for Web-based Data Science

Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning

Comments 14 pages, ICLR 2026

详情
英文摘要

Many real-world data science tasks involve complex web-based interactions: finding appropriate data available on the internet, synthesizing multimodal data from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions and often do not require diverse tool-using capabilities. Conversely, traditional data science benchmarks typically concentrate on static, highly structured datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step, tool-based operations, across heterogeneous data formats, to better reflect the realities of modern data analytics. Evaluations of current SOTA LLM agents indicate significant performance gaps in accomplishing these tasks. For instance, Browser Use, which accomplishes $80\%$ of tasks on WebVoyager, completes only 15% of tasks in WebDS, which our analysis suggests is due to new failure modes, such as poor information grounding, repetitive behavior and shortcut-taking that agents performing WebDS's tasks display. By contrast, humans achieve around 90% accuracy, highlighting a substantial gap between current agents and human performance. By providing a more robust and realistic testing ground, WebDS sets the stage for significant advances in the development of practically useful LLM-based data science.

2507.20704 2026-03-05 cs.CL cs.AI cs.CR

Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas

Comments 9 pages, 9 figures. Jake Thomas served as Editor for this manuscript

详情
Journal ref
Proceedings of the 2025 Conference on Applied Machine Learning for Information Security, Proceedings of Machine Learning Research PMLR Vol 299 pp 28 41
英文摘要

The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets heavily lean towards text-only prompts, leaving visual vulnerabilities under evaluated. To address this gap, we propose \textbf{Text2VLM}, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models' alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, ensuring the alignment of extracted salient concepts; text summarization and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.

2507.16405 2026-03-05 cs.AI cs.LG

Self-Supervised Inductive Logic Programming

Stassa Patsantzis

Comments To appear in AAAI-26 conference proceedings. Updated 04 March 2026 to include appendices, update affiliation, contact details

详情
英文摘要

Inductive Logic Programming (ILP) approaches like Meta \-/ Interpretive Learning (MIL) can learn, from few examples, recursive logic programs with invented predicates that generalise well to unseen instances. This ability relies on a background theory and negative examples, both carefully selected with expert knowledge of a learning problem and its solutions. But what if such a problem-specific background theory or negative examples are not available? We formalise this question as a new setting for Self-Supervised ILP and present a new MIL algorithm that learns in the new setting from some positive labelled, and zero or more unlabelled examples, and automatically generates, and labels, new positive and negative examples during learning. We implement this algorithm in Prolog in a new MIL system, called Poker. We compare Poker to state-of-the-art MIL system Louise on experiments learning grammars for Context-Free and L-System languages from labelled, positive example strings, no negative examples, and just the terminal vocabulary of a language, seen in examples, as a first-order background theory. We introduce a new approach for the principled selection of a second-order background theory as a Second Order Definite Normal Form (SONF), sufficiently general to learn all programs in a class, thus removing the need for a backgound theory tailored to a learning task. We find that Poker's performance improves with increasing numbers of automatically generated examples while Louise, bereft of negative examples, over-generalises.

2507.15796 2026-03-05 cs.AI

From Privacy to Trust in the Agentic Era: A Taxonomy of Challenges in Trustworthy Federated Learning Through the Lens of Trust Report 2.0

Nuria Rodríguez-Barroso, Mario García-Márquez, M. Victoria Luzón, Francisco Herrera

Comments Already published in Information Fusion

详情
Journal ref
Rodríguez-Barroso, et. al. (2026). From Privacy to Trust in the Agentic Era: A Taxonomy of Challenges in Trustworthy Federated Learning Through the Lens of Trust Report 2.0. Information Fusion, 104236
英文摘要

Federated Learning (FL) enables privacy-preserving collaborative learning, yet deployments increasingly show that privacy guarantees alone do not sustain trust in high-risk settings. As FL systems move toward agentic AI, large language model-enabled, and dynamically adaptive architectures, trustworthiness becomes a system-level problem shaped by autonomous decision-making, non-stationary environments, and multi-stakeholder governance. We argue for Trustworthy FL (TFL), treating trust as a continuously maintained operating condition rather than a static model property. Through the lens of Trust Report 2.0, we propose a requirement-driven taxonomy of challenges grounded in TAI and explicitly extended to account for control-plane decisions, agency, and system dynamics across the federated lifecycle. Building on this diagnosis, we introduce a coordination blueprint that structures cross-requirement trade-offs, decision justification, and governance alignment in TFL systems. To operationalize assurance, Trust Report 2.0 is instantiated as a lightweight, privacy-preserving artifact that surfaces decision-centric trust evidence without centralizing raw data. We illustrate applicability via healthcare as a stress-test domain, focusing on oncology FL under regulatory pressure and clinical risk.

2507.15557 2026-03-05 cs.CL

Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification

Vitaly Protasov, Nikolay Babakov, Daryna Dementieva, Alexander Panchenko

Comments LREC 2026

详情
英文摘要

Despite notable advances in large language models (LLMs), reliable evaluation of text generation tasks such as text style transfer (TST) remains an open challenge. Existing research has shown that automatic metrics often correlate poorly with human judgments (Dementieva et al., 2024; Pauli et al., 2025), limiting our ability to assess model performance accurately. Furthermore, most prior work has focused primarily on English, while the evaluation of multilingual TST systems, particularly for text detoxification, remains largely underexplored. In this paper, we present the first comprehensive multilingual benchmarking study of evaluation metrics for text detoxification evaluation across nine languages: Arabic, Amharic, Chinese, English, German, Hindi, Russian, Spanish, and Ukrainian. Drawing inspiration from machine translation evaluation, we compare neural-based automatic metrics with LLM-as-a-judge approaches together with experiments on task-specific fine-tuned models. Our analysis reveals that the proposed metrics achieve significantly higher correlation with human judgments compared to baseline approaches. We also provide actionable insights and practical guidelines for building robust and reliable multilingual evaluation pipelines for text detoxification and related TST tasks.