arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1208
2602.15922 2026-02-19 cs.RO cs.CV cs.LG

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi Wang, Ryan Julian, Danfei Xu, Yilun Du, Yevgen Chebotar, Scott Reed, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang

Comments Project page: https://dreamzero0.github.io/

详情
英文摘要

State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.

2602.15918 2026-02-19 cs.CV cs.AI

EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao, Zibo Liu, Shigang Chen, Da Yan, Zhe Jiang

详情
英文摘要

Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose \textbf{EarthSpatialBench}, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.

2602.15915 2026-02-19 cs.CV cs.AI

MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu

详情
英文摘要

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge . This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.

2602.15904 2026-02-19 cs.CV cs.RO

A Comprehensive Survey on Deep Learning-Based LiDAR Super-Resolution for Autonomous Driving

June Moh Goo, Zichao Zeng, Jan Boehm

Comments Accepted to The IEEE Intelligent Vehicles Symposium 2026 (IEEE IV 2026)

详情
英文摘要

LiDAR sensors are often considered essential for autonomous driving, but high-resolution sensors remain expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution addresses this challenge by using deep learning to enhance sparse point clouds, bridging the gap between different sensor types and enabling cross-sensor compatibility in real-world deployments. This paper presents the first comprehensive survey of LiDAR super-resolution methods for autonomous driving. Despite the importance of practical deployment, no systematic review has been conducted until now. We organize existing approaches into four categories: CNN-based architectures, model-based deep unrolling, implicit representation methods, and Transformer and Mamba-based approaches. We establish fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics. Current trends include the adoption of range image representation for efficient processing, extreme model compression and the development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment. We conclude by identifying open challenges and future research directions for advancing LiDAR super-resolution technology.

2602.15903 2026-02-19 cs.CV

Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment

Jingwei Li, Jiaxin Tong, Pengfei Wu

详情
英文摘要

The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32\% and 4.02\%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27\%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.

2602.15902 2026-02-19 cs.CL cs.AI

Doc-to-LoRA: Learning to Instantly Internalize Contexts

Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange

详情
英文摘要

Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.

2602.15901 2026-02-19 cs.RO

Coverage Path Planning for Autonomous Sailboats in Inhomogeneous and Time-Varying Oceans: A Spatiotemporal Optimization Approach

Yang An, Zhikang Ge, Taiyu Zhang, Jean-Baptiste R. G. Souppez, Gaofei Xu, Zhengru Ren

Comments 12 pages, 9 figures. This work has been submitted to the IEEE for possible publication

详情
英文摘要

Autonomous sailboats are well suited for long-duration ocean observation due to their wind-driven endurance. However, their performance is highly anisotropic and strongly influenced by inhomogeneous and time-varying wind and current fields, limiting the effectiveness of existing coverage methods such as boustrophedon sweeping. Planning under these environmental and maneuvering constraints remains underexplored. This paper presents a spatiotemporal coverage path planning framework tailored to autonomous sailboats, combining (1) topology-based morphological constraints in the spatial domain to promote compact and continuous coverage, and (2) forecast-aware look-ahead planning in the temporal domain to anticipate environmental evolution and enable foresighted decision-making. Simulations conducted under stochastic inhomogeneous and time-varying ocean environments, including scenarios with partial directional accessibility, demonstrate that the proposed method generates efficient and feasible coverage paths where traditional strategies often fail. To the best of our knowledge, this study provides the first dedicated solution to the coverage path planning problem for autonomous sailboats operating in inhomogeneous and time-varying ocean environments, establishing a foundation for future cooperative multi-sailboat coverage.

2602.15900 2026-02-19 cs.RO cs.CV

Adaptive Illumination Control for Robot Perception

Yash Turkar, Shekoufeh Sadeghi, Karthik Dantu

详情
英文摘要

Robot perception under low light or high dynamic range is usually improved downstream - via more robust feature extraction, image enhancement, or closed-loop exposure control. However, all of these approaches are limited by the image captured these conditions. An alternate approach is to utilize a programmable onboard light that adds to ambient illumination and improves captured images. However, it is not straightforward to predict its impact on image formation. Illumination interacts nonlinearly with depth, surface reflectance, and scene geometry. It can both reveal structure and induce failure modes such as specular highlights and saturation. We introduce Lightning, a closed-loop illumination-control framework for visual SLAM that combines relighting, offline optimization, and imitation learning. This is performed in three stages. First, we train a Co-Located Illumination Decomposition (CLID) relighting model that decomposes a robot observation into an ambient component and a light-contribution field. CLID enables physically consistent synthesis of the same scene under alternative light intensities and thereby creates dense multi-intensity training data without requiring us to repeatedly re-run trajectories. Second, using these synthesized candidates, we formulate an offline Optimal Intensity Schedule (OIS) problem that selects illumination levels over a sequence trading off SLAM-relevant image utility against power consumption and temporal smoothness. Third, we distill this ideal solution into a real-time controller through behavior cloning, producing an Illumination Control Policy (ILC) that generalizes beyond the initial training distribution and runs online on a mobile robot to command discrete light-intensity levels. Across our evaluation, Lightning substantially improves SLAM trajectory robustness while reducing unnecessary illumination power.

2602.15898 2026-02-19 cs.CL

MultiCube-RAG for Multi-hop Question Answering

Jimeng Shi, Wei Hu, Runchu Tian, Bowen Jin, Wonbin Kweon, SeongKu Kang, Yunfan Kang, Dingqi Ye, Sizhe Zhou, Shaowen Wang, Jiawei Han

Comments 12 pages

详情
英文摘要

Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.

2602.15897 2026-02-19 cs.CL

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang, Alsharif Abuadbba, Guangdong Bai

详情
英文摘要

Training and fine-tuning large-scale language models largely benefit from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs' direct mapping from gradient space to token space. However, these methods often fall short due to the retention of semantics similarity across gradient, embedding, and token spaces. In this work, we propose a novel defense mechanism named GHOST (gradient shield with obfuscated tokens), a token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across gradient, embedding, and token spaces. GHOST is built upon an important insight: due to the large scale of the token space, there exist semantically distinct yet embedding-proximate tokens that can serve as the shadow substitutes of the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks against state-of-the-art GIAs and adaptive attack scenarios.

2602.15896 2026-02-19 cs.CL

Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens

Yichi Zhang, Zhuo Chen, Lingbing Guo, Wen Zhang, Huajun Chen

Comments Work in progress

详情
英文摘要

Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.

2602.15893 2026-02-19 cs.RO cs.IT cs.LG math.IT

Statistical-Geometric Degeneracy in UAV Search: A Physics-Aware Asymmetric Filtering Approach

Zhiyuan Ren, Yudong Fang, Tao Zhang, Wenchi Cheng, Ben Lan

详情
英文摘要

Post-disaster survivor localization using Unmanned Aerial Vehicles (UAVs) faces a fundamental physical challenge: the prevalence of Non-Line-of-Sight (NLOS) propagation in collapsed structures. Unlike standard Gaussian noise, signal reflection from debris introduces strictly non-negative ranging biases. Existing robust estimators, typically designed with symmetric loss functions (e.g., Huber or Tukey), implicitly rely on the assumption of error symmetry. Consequently, they experience a theoretical mismatch in this regime, leading to a phenomenon we formally identify as Statistical-Geometric Degeneracy (SGD)-a state where the estimator stagnates due to the coupling of persistent asymmetric bias and limited observation geometry. While emerging data-driven approaches offer alternatives, they often struggle with the scarcity of training data and the sim-to-real gap inherent in unstructured disaster zones. In this work, we propose a physically-grounded solution, the AsymmetricHuberEKF, which explicitly incorporates the non-negative physical prior of NLOS biases via a derived asymmetric loss function. Theoretically, we show that standard symmetric filters correspond to a degenerate case of our framework where the physical constraint is relaxed. Furthermore, we demonstrate that resolving SGD requires not just a robust filter, but specific bilateral information, which we achieve through a co-designed active sensing strategy. Validated in a 2D nadir-view scanning scenario, our approach significantly accelerates convergence compared to symmetric baselines, offering a resilient building block for search operations where data is scarce and geometry is constrained.

2602.15892 2026-02-19 cs.CV cs.AI

Egocentric Bias in Vision-Language Models

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, Hokin Deng, Dezhi Luo

详情
英文摘要

Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.

2602.15891 2026-02-19 cs.RO cs.LG cs.MA

Learning to Drive in New Cities Without Human Demonstrations

Zilin Wang, Saeed Rahmani, Daphne Cornelisse, Bidipta Sarkar, Alexander David Goldie, Jakob Nicolaus Foerster, Shimon Whiteson

Comments Autonomous Driving, Reinforcement Learning, Self-play, Simulation, Transfer Learning, Data-efficient Adaptation. Project Page: https://nomaddrive.github.io/

详情
英文摘要

While autonomous vehicles have achieved reliable performance within specific operating regions, their deployment to new cities remains costly and slow. A key bottleneck is the need to collect many human demonstration trajectories when adapting driving policies to new cities that differ from those seen in training in terms of road geometry, traffic rules, and interaction patterns. In this paper, we show that self-play multi-agent reinforcement learning can adapt a driving policy to a substantially different target city using only the map and meta-information, without requiring any human demonstrations from that city. We introduce NO data Map-based self-play for Autonomous Driving (NOMAD), which enables policy adaptation in a simulator constructed based on the target-city map. Using a simple reward function, NOMAD substantially improves both task success rate and trajectory realism in target cities, demonstrating an effective and scalable alternative to data-intensive city-transfer methods. Project Page: https://nomaddrive.github.io/

2602.15886 2026-02-19 cs.RO

Optimization of an Augmented R-CUBE mechanism for Cervical Surgery

Terence Essomba, Yu-Wen Wu, Abdelbadia Chaker, Med Amine Laribi

Journal ref the 9th International Workshop on Medical and Service Robots (MESROB), Jul 2025, Poitiers, France. pp.82-91

详情
英文摘要

In some surgical operations targeting the spine, it is required to drill cavities in the vertebrae for the insertion of pedicle screws. A new mechanical architecture is proposed for this application. It is based on an augmented version of the full translational R-CUBE mechanism, with improved linkages to implement additional rotational motion. Using this concept, a mechanism presented with a 3T2R motion that is required for the manipulation of the surgical drill. It is mainly composed three stages: one translational, one transmitting and one rotational. Their respective kinematic and velocity models are separately derived, then combined. Based on the drilling trajectories obtained from a real patient case, the mechanism is optimized for generating the highest kinematic performances.

2602.15885 2026-02-19 cs.RO

A novel Integrated Motion Tracking Device (IMTD) for Objective Laparoscopic Training Assessment: Development and Validation

Siwar Bouzid, Abdelbadia Chaker, Marc Arsicault, Sami Bennour, Med Amine Laribi

Comments Journal of Medical Devices, 2025

详情
英文摘要

This paper presents a novel, compact four-degree-of-freedom motion-tracking device (IMTD) designed for training and evaluation in laparoscopic surgery. The device's kinematics, mechanical design, instrumentation, and prototypes are developed and presented to meet the specific requirements of laparoscopic training context, including movement around a fixed center of motion and seamless integration into standard box trainers. The system IMTD's tracking accuracy and reliability are compared to a motion capture system (MoCap), assessing its ability to capture both angular and translational motions of surgical instruments. The study then focuses on key performance parameters including precision, fluidity, speed, and overall motion efficiency. The results highlight the system's effectiveness in tracking surgical gestures, providing valuable insights into its potential as a tool for training and performance evaluation in minimally invasive surgery. Additionally, IMTD's low cost and integrated design allow for easy integration and implementation in training rooms, offering a practical and accessible solution for general use. By offering objective, real-time feedback, the system can significantly contribute to improving surgical skills and shortening the learning curve for novice students, while also providing a foundation for future development of gesture scoring algorithms and standardized training protocols.

2602.15884 2026-02-19 cs.RO

The SLAM Confidence Trap

Sebastian Sansoni, Santiago Ramón Tosetti Sanz

详情
英文摘要

The SLAM community has fallen into a "Confidence Trap" by prioritizing benchmark scores over principled uncertainty estimation. This yields systems that are geometrically accurate but probabilitistically inconsistent and brittle. We advocate for a paradigm shift where the consistent, real-time computation of uncertainty becomes a primary metric of success.

2602.15883 2026-02-19 cs.LG physics.flu-dyn

Distributed physics-informed neural networks via domain decomposition for fast flow reconstruction

Yixiao Qian, Jiaxu Liu, Zewei Xia, Song Chen, Chao Xu, Shengze Cai

详情
英文摘要

Physics-Informed Neural Networks (PINNs) offer a powerful paradigm for flow reconstruction, seamlessly integrating sparse velocity measurements with the governing Navier-Stokes equations to recover complete velocity and latent pressure fields. However, scaling such models to large spatiotemporal domains is hindered by computational bottlenecks and optimization instabilities. In this work, we propose a robust distributed PINNs framework designed for efficient flow reconstruction via spatiotemporal domain decomposition. A critical challenge in such distributed solvers is pressure indeterminacy, where independent sub-networks drift into inconsistent local pressure baselines. We address this issue through a reference anchor normalization strategy coupled with decoupled asymmetric weighting. By enforcing a unidirectional information flow from designated master ranks where the anchor point lies to neighboring ranks, our approach eliminates gauge freedom and guarantees global pressure uniqueness while preserving temporal continuity. Furthermore, to mitigate the Python interpreter overhead associated with computing high-order physics residuals, we implement a high-performance training pipeline accelerated by CUDA graphs and JIT compilation. Extensive validation on complex flow benchmarks demonstrates that our method achieves near-linear strong scaling and high-fidelity reconstruction, establishing a scalable and physically rigorous pathway for flow reconstruction and understanding of complex hydrodynamics.

2602.15882 2026-02-19 cs.RO cs.AI

FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution

Jingjing Fan, Yushan Liu, Shoujie Li, Botao Ren, Siyuan Li, Xiao-Ping Zhang, Wenbo Ding, Zhidong Deng

详情
英文摘要

General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based on interpretable future previews. Extensive evaluations demonstrate that FUTURE-VLA establishes new state-of-the-art performance, attaining success rates of 99.2% on LIBERO, 75.4% on RoboTwin, and 78.0% on a real-world Piper platform, all with a $16\times$ extended spatiotemporal window while maintaining the inference latency of a single-frame baseline.

2602.15879 2026-02-19 cs.LG cs.CY

BamaER: A Behavior-Aware Memory-Augmented Model for Exercise Recommendation

Qing Yang, Yuhao Jiang, Rui Wang, Jipeng Guo, Yejiang Wang, Xinghe Cheng, Zezheng Wu, Jiapu Wang, Jingwei Zhang

详情
英文摘要

Exercise recommendation focuses on personalized exercise selection conditioned on students' learning history, personal interests, and other individualized characteristics. Despite notable progress, most existing methods represent student learning solely as exercise sequences, overlooking rich behavioral interaction information. This limited representation often leads to biased and unreliable estimates of learning progress. Moreover, fixed-length sequence segmentation limits the incorporation of early learning experiences, thereby hindering the modeling of long-term dependencies and the accurate estimation of knowledge mastery. To address these limitations, we propose BamaER, a Behavior-aware memory-augmented Exercise Recommendation framework that comprises three core modules: (i) the learning progress prediction module that captures heterogeneous student interaction behaviors via a tri-directional hybrid encoding scheme; (ii) the memory-augmented knowledge tracing module that maintains a dynamic memory matrix to jointly model historical and current knowledge states for robust mastery estimation; and (iii) the exercise filtering module that formulates candidate selection as a diversity-aware optimization problem, solved via the Hippopotamus Optimization Algorithm to reduce redundancy and improve recommendation coverage. Experiments on five real-world educational datasets show that BamaER consistently outperforms state-of-the-art baselines across a range of evaluation metrics.

2602.15878 2026-02-19 cs.LG cs.AI

IT-OSE: Exploring Optimal Sample Size for Industrial Data Augmentation

Mingchun Sun, Rongqiang Zhao, Zhennan Huang, Songyu Ding, Jie Liu

详情
英文摘要

In industrial scenarios, data augmentation is an effective approach to improve model performance. However, its benefits are not unidirectionally beneficial. There is no theoretical research or established estimation for the optimal sample size (OSS) in augmentation, nor is there an established metric to evaluate the accuracy of OSS or its deviation from the ground truth. To address these issues, we propose an information-theoretic optimal sample size estimation (IT-OSE) to provide reliable OSS estimation for industrial data augmentation. An interval coverage and deviation (ICD) score is proposed to evaluate the estimated OSS intuitively. The relationship between OSS and dominant factors is theoretically analyzed and formulated, thereby enhancing the interpretability. Experiments show that, compared to empirical estimation, the IT-OSE increases accuracy in classification tasks across baseline models by an average of 4.38%, and reduces MAPE in regression tasks across baseline models by an average of 18.80%. The improvements in downstream model performance are more stable. ICDdev in the ICD score is also reduced by an average of 49.30%. The determinism of OSS is enhanced. Compared to exhaustive search, the IT-OSE achieves the same OSS while reducing computational and data costs by an average of 83.97% and 93.46%. Furthermore, practicality experiments demonstrate that the IT-OSE exhibits generality across representative sensor-based industrial scenarios.

2602.15877 2026-02-19 cs.LG cs.AI cs.NE

Genetic Generalized Additive Models

Kaaustaaub Shankar, Kelly Cohen

Comments Accepted to NAFIPS 2026

详情
英文摘要

Generalized Additive Models (GAMs) balance predictive accuracy and interpretability, but manually configuring their structure is challenging. We propose using the multi-objective genetic algorithm NSGA-II to automatically optimize GAMs, jointly minimizing prediction error (RMSE) and a Complexity Penalty that captures sparsity, smoothness, and uncertainty. Experiments on the California Housing dataset show that NSGA-II discovers GAMs that outperform baseline LinearGAMs in accuracy or match performance with substantially lower complexity. The resulting models are simpler, smoother, and exhibit narrower confidence intervals, enhancing interpretability. This framework provides a general approach for automated optimization of transparent, high-performing models. The code can be found at https://github.com/KaaustaaubShankar/GeneticAdditiveModels.

2602.15875 2026-02-19 cs.RO cs.AI

Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation

Zhenxing Xu, Brikit Lu, Weidong Bao, Zhengqiu Zhu, Junsong Zhang, Hui Yan, Wenhao Lu, Ji Wang

详情
英文摘要

Current Visual-Language Navigation (VLN) methodologies face a trade-off between semantic understanding and control precision. While Multimodal Large Language Models (MLLMs) offer superior reasoning, deploying them as low-level controllers leads to high latency, trajectory oscillations, and poor generalization due to weak geometric grounding. To address these limitations, we propose Fly0, a framework that decouples semantic reasoning from geometric planning. The proposed method operates through a three-stage pipeline: (1) an MLLM-driven module for grounding natural language instructions into 2D pixel coordinates; (2) a geometric projection module that utilizes depth data to localize targets in 3D space; and (3) a geometric planner that generates collision-free trajectories. This mechanism enables robust navigation even when visual contact is lost. By eliminating the need for continuous inference, Fly0 reduces computational overhead and improves system stability. Extensive experiments in simulation and real-world environments demonstrate that Fly0 outperforms state-of-the-art baselines, improving the Success Rate by over 20\% and reducing Navigation Error (NE) by approximately 50\% in unstructured environments. Our code is available at https://github.com/xuzhenxing1/Fly0.

2602.15874 2026-02-19 cs.CL cs.LG

P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA

Xingda Lyu, Gongfu Lyu, Zitai Yan, Yuxin Jiang

Journal ref Proceedings of International Conference on Computing and Data Science Symposium: Application of Machine Learning in Engineering, 2025

详情
英文摘要

Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants-Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning-on both general and biomedical datasets. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG's potential for accurate, scalable, and contextually adaptive biomedical question answering. Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.

2602.15873 2026-02-19 cs.RO cs.AI

Test-Time Adaptation for Tactile-Vision-Language Models

Chuyang Ye, Haoxian Jing, Qinting Jiang, Yixi Lin, Qiang Li, Xing Tang, Jingyan Jiang

详情
英文摘要

Tactile-vision-language (TVL) models are increasingly deployed in real-world robotic and multimodal perception tasks, where test-time distribution shifts are unavoidable. Existing test-time adaptation (TTA) methods provide filtering in unimodal settings but lack explicit treatment of modality-wise reliability under asynchronous cross-modal shifts, leaving them brittle when some modalities become unreliable. We study TTA for TVL models under such shifts and propose a reliability-aware framework that estimates per-modality reliability from prediction uncertainty and perturbation-based responses. This shared reliability signal is used to (i) filter unreliable test samples, (ii) adaptively fuse tactile, visual, and language features, and (iii) regularize test-time optimization with a reliability-guided objective. On the TAG-C benchmark and additional TVL scenarios, our approach consistently outperforms strong TTA baselines, achieving accuracy gains of up to 49.9\% under severe modality corruptions, underscoring the importance of explicit modality-wise reliability modeling for robust test-time adaptation.

2602.15871 2026-02-19 cs.CL cs.CY

CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content

Diletta Abbonato

详情
英文摘要

The proliferation of large language models (LLMs) in academic workflows has introduced unprecedented challenges to bibliographic integrity, particularly through reference hallucination -- the generation of plausible but non-existent citations. Recent investigations have documented the presence of AI-hallucinated citations even in papers accepted at premier machine learning conferences such as NeurIPS and ICLR, underscoring the urgency of automated verification mechanisms. This paper presents "CheckIfExist", an open-source web-based tool designed to provide immediate verification of bibliographic references through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex scholarly databases. While existing reference management tools offer bibliographic organization capabilities, they do not provide real-time validation of citation authenticity. Commercial hallucination detection services, though increasingly available, often impose restrictive usage limits on free tiers or require substantial subscription fees. The proposed tool fills this gap by employing a cascading validation architecture with string similarity algorithms to compute multi-dimensional match confidence scores, delivering instant feedback on reference authenticity. The system supports both single-reference verification and batch processing of BibTeX entries through a unified interface, returning validated APA citations and exportable BibTeX records within seconds.

2602.15870 2026-02-19 cs.CL

VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering

Shuhui Qu

详情
英文摘要

Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose \textbf{VDLM}, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a \textbf{Vec2Text} renderer and introduce \textbf{embedding perturbations} to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.

2602.15869 2026-02-19 cs.CL

Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches

Noopur Zambare, Kiana Aghakasiri, Carissa Lin, Carrie Ye, J. Ross Mitchell, Mohamed Abdalla

Comments Accpted to the Findings of EACL 2026

详情
英文摘要

Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: https://doi.org/10.5281/zenodo.18342291

2602.15867 2026-02-19 cs.CL cs.AI

Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

Berry Gerrits

Comments 18 pages, 1 figure

详情
英文摘要

In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling ''extended thinking''. Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.

2602.15866 2026-02-19 cs.CL cs.AI cs.CR cs.CY cs.HC

NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey

Dhiman Goswami, Jai Kruthunz Naveen Kumar, Sanchari Das

Journal ref In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2026

详情
英文摘要

Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58-0.84, but incur a 1% - 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24) revealing substantial gaps in privacy research. We further found a (reduced by 2% - 9%) trade-off in model utility, MIA AUC (membership inference attacks) 0.81, AIA accuracy 0.75 (attribute inference attacks). Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.