arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1832
2602.12652 2026-05-01 cs.CV

CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

Marco Stricker, Masakazu Iwamura, Koichi Kise

Comments We are currently in the process of selecting an appropriate journal for submission

详情
英文摘要

Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature cloudless analysis is often performed where cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloudfree solutions are not failing under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud robust methods that are less affected by cloudy weather. Cloud robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: https://github.com/mstricker13/CBEN

2602.11084 2026-05-01 cs.LG cs.AI

GRASP: group-Shapley feature selection for patients

Yuheng Luo, Shuyan Li, Zhong Cao

Comments 5 pages, 4 figures, 2 tables. Accepted at IEEE ICASSP 2026

详情
英文摘要

Feature selection remains a major challenge in medical prediction, where existing approaches such as LASSO often lack robustness and interpretability. We introduce GRASP, a novel framework that couples Shapley value driven attribution with group $L_{21}$ regularization to extract compact and non-redundant feature sets. GRASP first distills group level importance scores from a pretrained tree model via SHAP, then enforces structured sparsity through group $L_{21}$ regularized logistic regression, yielding stable and interpretable selections. Extensive comparisons with LASSO, SHAP, and deep learning based methods show that GRASP consistently delivers comparable or superior predictive accuracy, while identifying fewer, less redundant, and more stable features.

2602.07915 2026-05-01 cs.LG cs.AI stat.ME stat.ML

CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios

Huiyang Yi, Xiaojian Shen, Yonggang Wu, Duxin Chen, He Wang, Wenwu Yu

Comments Major revision from the previous version

详情
英文摘要

Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark framework designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We additionally conduct ablation experiments to explain the strong performance of deep learning-based methods under assumption violations. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications. The user-friendly implementation, documentation and datasets are available at https://anonymous.4open.science/r/CausalCompass-anonymous-5B4F/.

2602.07408 2026-05-01 cs.AI cs.MA

Progressive Multi-Agent Reasoning for Biological Perturbation Prediction

Hyomin Kim, Sang-Yeon Hwang, Jaechang Lim, Yinhua Piao, Yunhak Oh, Woo Youn Kim, Chanyoung Park, Sungsoo Ahn, Junhyeok Jeon

Comments 17 pages, 4 figures, 9 tables

详情
英文摘要

Predicting gene regulation responses to biological perturbations requires reasoning about underlying biological causalities. While large language models (LLMs) show promise for such tasks, they are often overwhelmed by the entangled nature of high-dimensional perturbation results. Moreover, recent works have primarily focused on genetic perturbations in single-cell experiments, leaving bulk-cell chemical perturbations, which is central to drug discovery, largely unexplored. Motivated by this, we present LINCSQA, a novel benchmark for predicting target gene regulation under complex chemical perturbations in bulk-cell environments. We further propose PBio-Agent, a multi-agent framework that integrates difficulty-aware task sequencing with iterative knowledge refinement. Our key insight is that genes affected by the same perturbation share causal structure, allowing confidently predicted genes to contextualize more challenging cases. The framework employs specialized agents enriched with biological knowledge graphs, while a synthesis agent integrates outputs and specialized judges ensure logical coherence. PBio-Agent outperforms existing baselines on both LINCSQA and PerturbQA, enabling even smaller models to predict and explain complex biological processes without additional training.

2602.00937 2026-05-01 cs.RO cs.AI cs.CV cs.LG

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

I-Chun Arthur Liu, Krzysztof Choromanski, Sandy Huang, Connor Schenck

Comments Accepted to the Robotics: Science and Systems (RSS) 2026

详情
英文摘要

Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks. The project website and videos can be found at https://clamp3d.github.io/CLAMP/.

2602.00095 2026-05-01 cs.CV cs.AI cs.CY

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. Project Website: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL GitHub and Dataset: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL

详情
英文摘要

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of the deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.

2601.22228 2026-05-01 cs.CV cs.AI cs.CL

Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

详情
英文摘要

We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and introduce \texttt{VRRPI-Bench}, built from real RGB-D frames with object-centric camera motion, and \texttt{VRRPI-Diag}, which isolates individual motion degrees of freedom. Humans (0.91) and specialized geometric pipelines such as LoFTR (0.99) solve the task reliably, yet the best VLM reaches only 0.66 and most others remain near random. Our analyses show that this gap is not basic spatial competence: strong VLMs are near ceiling on single-image benchmarks, but most remain near random once reasoning must span views. They are unstable under source-target reversal (best 59.7\% consistency) and remain weak even in simplified single-DoF settings, especially on optical-axis motions such as roll and depth translation (GPT-5: 0.46 on roll). These failures are useful: they localize concrete missing capabilities, namely cross-view correspondence, view-consistent reasoning, and projective camera-motion understanding, making RCPE a targeted diagnostic for improving multi-view spatial reasoning in VLMs.

2601.22199 2026-05-01 cs.RO cs.HC

Game-Based and Gamified Robotics Education: A Comparative Systematic Review and Design Guidelines

Syed T. Mubarrat, Byung-Cheol Min, Tianyu Shao, E. Cho Smith, Bedrich Benes, Alejandra J. Magana, Christos Mousas, Dominic Kao

Comments Accepted for publication at Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. 26 pages, 14 figures, 7 tables;

详情
英文摘要

Robotics education fosters computational thinking, creativity, and problem-solving, but remains challenging due to technical complexity. Game-based learning (GBL) and gamification offer engagement benefits, yet their comparative impact remains unclear. We present the first PRISMA-aligned systematic review and comparative synthesis of GBL and gamification in robotics education, analyzing 95 studies from 12,485 records across four databases (2014-2025). We coded each study's approach, learning context, skill level, modality, pedagogy, and outcomes (k = .918). Three patterns emerged: (1) approach-context-pedagogy coupling (GBL more prevalent in informal settings, while gamification dominated formal classrooms [p < .001] and favored project-based learning [p = .009]); (2) emphasis on introductory programming and modular kits, with limited adoption of advanced software (~17%), advanced hardware (~5%), or immersive technologies (~22%); and (3) short study horizons, relying on self-report. We propose eight research directions and a design space outlining best practices and pitfalls, offering actionable guidance for robotics education.

2601.20969 2026-05-01 cs.AI

The Epistemic Planning Domain Definition Language: Official Guideline

Alessandro Burigana, Francesco Fabiano

详情
英文摘要

Epistemic planning extends (multi-agent) automated planning by making agents' knowledge and beliefs first-class aspects of the planning formalism. One of the most well-known frameworks for epistemic planning is Dynamic Epistemic Logic (DEL), which offers an rich and natural semantics for modelling problems in this setting. The high expressive power provided by DEL make DEL-based epistemic planning a challenging problem to tackle both theoretically, and in practical implementations. As a result, existing epistemic planners often target different DEL fragments, and typically rely on ad hoc languages to represent benchmarks, and sometimes no language at all. This fragmentation hampers comparison, reuse, and systematic benchmark development. We address these issues by introducing the Epistemic Planning Domain Definition Language (EPDDL). EPDDL provides a unique PDDL-like representation that captures the entire DEL semantics, enabling uniform specification of epistemic planning tasks. Our main contributions are: 1. A formal development of abstract event models, a novel representation for epistemic actions used to define the semantics of our language; 2. A formal specification of EPDDL's syntax and semantics grounded in DEL with abstract event models. Through examples of representative benchmarks, we illustrate how EPDDL facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.

2601.19768 2026-05-01 cs.AI cs.CR cs.LG

GAVEL: Towards Rule-Based Safety Through Activation Monitoring

Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

Comments Accepted to ICLR 2026

详情
英文摘要

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as 'making a threat' and 'payment processing', that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We open source GAVEL and introduce GAVEL Studio, an interactive rule authoring and management tool. Code and datasets are available at github.com/Offensive-AI-Lab/gavel.

2601.14289 2026-05-01 cs.CL cs.AI

RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension

Yelin Chen, Fanjin Zhang, Suping Sun, Yunhe Pang, Yuanchun Wang, Jian Song, Xiaoyan Li, Lei Hou, Shu Zhao, Jie Tang, Juanzi Li

Comments ACL'26, 12 pages, 23 appendix pages

详情
英文摘要

Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM-human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement to human judgment. Experiments reveal that even the strongest models (GPT-5) achieve only 68.2% correctness-completeness, dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding. Our code and data are available at https://rpc-bench.github.io/.

2601.11908 2026-05-01 cs.CL

PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

Byeongjin Kim, Gyuwan Kim, Seo Yeon Park

Comments Accepted to the Main Conference of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). 27 pages, 6 figures

详情
英文摘要

Large language models (LLMs) struggle with reasoning over long contexts where relevant information is sparsely distributed. Although plan-and-execute frameworks mitigate this by decomposing tasks into planning and execution, their effectiveness is often limited by unreliable plan generation due to dependence on surface-level cues. Consequently, plans may be based on incorrect assumptions, and once a plan is formed, identifying what went wrong and revising it reliably becomes difficult, limiting the effectiveness of reactive refinement. To address this limitation, we propose PPA-Plan, a proactive planning strategy for long-context reasoning that focuses on preventing such failures before plan generation. PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints. Experiments on long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting.

2601.08418 2026-05-01 cs.LG cs.AI

Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance

Jihang Li, Qing Liu, Zulong Chen, Jing Wang, Wei Wang, Chuanfei Xu, Zeyi Wen

详情
英文摘要

Tax code prediction is a crucial yet underexplored task in automating invoicing and compliance management for large-scale e-commerce platforms. Each product must be accurately mapped to a node within a multi-level taxonomic hierarchy defined by national standards, where errors lead to financial inconsistencies and regulatory risks. This paper presents Taxon, a semantically aligned and expert-guided framework for hierarchical tax code prediction. Taxon integrates (i) a feature-gating mixture-of-experts architecture that adaptively routes multi-modal features across taxonomy levels, and (ii) a semantic consistency model distilled from large language models acting as domain experts to verify alignment between product titles and official tax definitions. To address noisy supervision in real business records, we design a multi-source training pipeline that combines curated tax databases, invoice validation logs, and merchant registration data to provide both structural and semantic supervision. Extensive experiments on the proprietary TaxCode dataset and public benchmarks demonstrate that Taxon achieves state-of-the-art performance, outperforming strong baselines. Further, an additional full hierarchical paths reconstruction procedure significantly improves structural consistency, yielding the highest overall F1 scores. Taxon has been deployed in production within Alibaba's tax service system, handling an average of over 500,000 tax code queries per day and reaching peak volumes above five million requests during business event with improved accuracy, interpretability, and robustness.

2601.06394 2026-05-01 cs.CV cs.AI

Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification

Ahmed Abdelkawy, Ahmed Elsayed, Asem Ali, Aly Farag, Thomas Tretter, Michael McIntyre

Comments accepted to the Computer Vision for Education (CV4Edu) workshop, CVPR 2026

详情
英文摘要

Understanding student behavior in the classroom is essential to improve both pedagogical quality and student engagement. Existing methods for predicting student engagement typically require substantial annotated data to model the diversity of student behaviors, yet privacy concerns often restrict researchers to their own proprietary datasets. Moreover, the classroom context, represented in peers' actions, is ignored. To address the aforementioned limitation, we propose a novel three-stage framework for video-based student engagement measurement. First, we explore the few-shot adaptation of the vision-language model for student action recognition, which is fine-tuned to distinguish among action categories with a few training samples. Second, to handle continuous and unpredictable student actions, we utilize the sliding temporal window technique to divide each student's 2-minute-long video into non-overlapping segments. Each segment is assigned an action category via the fine-tuned VLM model, generating a sequence of action predictions. Finally, we leverage the large language model to classify this entire sequence of actions, together with the classroom context, as belonging to an engaged or disengaged student. The experimental results demonstrate the effectiveness of the proposed approach in identifying student engagement. The source code will be available at https://github.com/ahmed-nady/context_aware_student_engagement.

2601.05052 2026-05-01 cs.LG stat.ML

DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul

Comments 25 pages, 20 tables, 2 figures

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026): https://openreview.net/forum?id=fOwsr1VTi8
英文摘要

Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.

2601.02845 2026-05-01 cs.CL cs.AI

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

Kai Li, Xuanqing Yu, Ziyi Ni, Yi Zeng, Yao Xu, Zheqing Zhang, Xin Li, Jitao Sang, Xiaogang Duan, Xuelei Wang, Chengbao Liu, Jie Tan

Comments ACL 2026 Findings

详情
英文摘要

Long-horizon conversational agents have to manage ever-growing interaction histories that quickly exceed the finite context windows of large language models (LLMs). Existing memory frameworks provide limited support for temporally structured information across hierarchical levels, often leading to fragmented memories and unstable long-horizon personalization. We present TiMem, a temporal--hierarchical memory framework that organizes conversations through a Temporal Memory Tree (TMT), enabling systematic memory consolidation from raw conversational observations to progressively abstracted persona representations. TiMem is characterized by three core properties: (1) temporal--hierarchical organization through TMT; (2) semantic-guided consolidation that enables memory integration across hierarchical levels without fine-tuning; and (3) complexity-aware memory recall that balances precision and efficiency across queries of varying complexity. Under a consistent evaluation setup, TiMem achieves state-of-the-art accuracy on both benchmarks, reaching 75.30% on LoCoMo and 76.88% on LongMemEval-S. It outperforms all evaluated baselines while reducing the recalled memory length by 52.20% on LoCoMo. Manifold analysis indicates clear persona separation on LoCoMo and reduced dispersion on LongMemEval-S. Overall, TiMem treats temporal continuity as a first-class organizing principle for long-horizon memory in conversational agents. The code is available at https://github.com/TiMEM-AI/timem.

2601.01885 2026-05-01 cs.CL

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, Libing Wu

Comments The code is available at https://github.com/y1y5/AgeMem

详情
英文摘要

Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.

2512.24329 2026-05-01 cs.CL

World model inspired sarcasm reasoning with large language model agents

Keito Inoshita, Shinnosuke Mizuno

详情
英文摘要

Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.

2512.23192 2026-05-01 cs.LG

PGOT: A Physics-Geometry Operator Transformer for Complex PDEs

Zhuo Zhang, Xi Yang, Ying Miao, Xiaobin Hu, Yifu Gao, Yuan Zhao, Yong Yang, Canqun Yang, Boocheong Khoo

Comments 24 pages, 17 figures

详情
英文摘要

While Transformers have demonstrated remarkable potential in modeling Partial Differential Equations (PDEs), modeling large-scale unstructured meshes with complex geometries remains a significant challenge. Existing efficient architectures often employ feature dimensionality reduction strategies, which inadvertently induces Geometric Aliasing, resulting in the loss of critical physical boundary information. To address this, we propose the Physics-Geometry Operator Transformer (PGOT), designed to reconstruct physical feature learning through explicit geometry awareness. Specifically, we propose Spectrum-Preserving Geometric Attention (SpecGeo-Attention). Utilizing a ``physics slicing-geometry injection" mechanism, this module incorporates multi-scale geometric encodings to explicitly preserve multi-scale geometric features while maintaining linear computational complexity $O(N)$. Furthermore, PGOT dynamically routes computations to low-order linear paths for smooth regions and high-order non-linear paths for shock waves and discontinuities based on spatial coordinates, enabling spatially adaptive and high-precision physical field modeling. PGOT achieves consistent state-of-the-art performance across four standard benchmarks and excels in large-scale industrial tasks including airfoil and car designs.

2512.22046 2026-05-01 cs.CV cs.CR

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He, Xingshuo Han, Shengmin Xu, Xinyi Huang

详情
英文摘要

Prompt-driven Video Segmentation Foundation Models (VSFMs), such as SAM2, are increasingly used in applications including autonomous driving and digital pathology, yet their security risks remain underexplored. We study backdoor attacks against VSFMs and show that directly applying classic attacks such as BadNet is largely ineffective, yielding attack success rates (ASR) below 5%. Through gradient-similarity and attention-map analyses, we find that traditional backdoor training fails because clean and triggered samples induce aligned image-encoder gradients, while model attention remains focused on the prompt-specified object rather than the trigger. To address this limitation, we propose BadVSFM, the first backdoor attack framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy that first learns trigger-specific encoder features and then trains the decoder to map triggered frame prompt representations to an attacker-specified target mask while preserving clean segmentation behavior. Experiments on five VSFMs and two datasets show that BadVSFM achieves strong, controllable backdoor effects across triggers and prompt types with limited clean-performance degradation. Ablations and interpretability analyses validate the necessity of the two-stage design, and five representative defenses remain largely ineffective. Our results reveal a practical and underexplored vulnerability of current VSFMs to backdoor threats.

2512.17435 2026-05-01 cs.RO

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Teng Wang, Xinxin Zhao, Wenzhe Cai, Changyin Sun

Comments 17 pages, 10 figures. arXiv admin note: text overlap with arXiv:2410.09874

详情
英文摘要

Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry--critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves SOTA performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.

2512.14067 2026-05-01 cs.CL cs.AI cs.LG

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu, Lexington Whalen, Zhifan Ye, Xin Dong, Shizhe Diao, Jingyu Liu, Chengyue Wu, Hao Zhang, Enze Xie, Song Han, Maksim Khadkevich, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

详情
英文摘要

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal across blocks while enabling bidirectional modeling within each block. We find that this approach can better preserve pretrained AR models' weight distributions than fully bidirectional modeling, in addition to its known benefit of enabling KV caching, and leads to a win-win in accuracy and efficiency. Second, to mitigate the training-test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. Leveraging this framework, we conduct extensive studies of dLMs' attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. These studies lead to the Efficient-DLM family, which outperforms state-of-the-art AR models and dLMs, e.g., our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5x/2.7x higher throughput compared to Dream 7B and Qwen3 4B, respectively.

2512.10267 2026-05-01 cs.CV

Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin

详情
Journal ref
IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings (CVPRF), 2026
英文摘要

Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

2512.09111 2026-05-01 cs.RO cs.AI math.OC

Language-Conditioned Safe Trajectory Generation for Spacecraft Rendezvous

Yuji Takubo, Arpit Dwivedi, Sukeerth Ramkumar, Luis A. Pabon, Daniele Gammelli, Marco Pavone, Simone D'Amico

Comments 42 pages, 12 figures. Submitted to AIAA Journal of Guidance, Control, and Dynamics

详情
英文摘要

Reliable real-time trajectory generation is essential for future autonomous spacecraft. While recent progress in nonconvex guidance and control is paving the way for onboard autonomous trajectory optimization, these methods still rely on extensive expert input (e.g., waypoints, constraints, mission timelines, etc.), which limits operational scalability in complex missions such as rendezvous and proximity operations. This paper introduces SAGES (Semantic Autonomous Guidance Engine for Space), a trajectory-generation framework that translates natural-language commands into spacecraft trajectories that reflect high-level intent while respecting nonconvex constraints. Experiments in two settings (fault-tolerant proximity operations with continuous-time constraint enforcement and a free-flying robotic platform) demonstrate that SAGES reliably produces trajectories aligned with human commands, achieving over 90% semantic-behavioral consistency across diverse behavior modes. Ultimately, this work marks an initial step toward language-conditioned, constraint-aware spacecraft trajectory generation, enabling operators to interactively guide both safety and behavior through intuitive natural-language commands with reduced expert burden. Project Website: https://semantic-guidance4space.github.io/

2512.02535 2026-05-01 cs.RO

AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning

Jeric Lew, Yuhong Cao, Derek Ming Siang Tan, Guillaume Sartoretti

详情
英文摘要

Information gathering in large-scale or time-critical scenarios (e.g., environmental monitoring, search and rescue) requires broad coverage within limited time budgets, motivating the use of multi-agent systems. These scenarios are commonly formulated as multi-agent informative path planning (MAIPP), where multiple agents must coordinate to maximize information gain while operating under budget constraints. A central challenge in MAIPP is ensuring effective coordination while the belief over the environment evolves with incoming measurements. Recent learning-based approaches address this by using distributions over future positions as "intent" to support coordination. However, these autoregressive intent predictors are computationally expensive and prone to compounding errors. Inspired by the effectiveness of diffusion models as expressive, long-horizon policies, we propose AID, a fully decentralized MAIPP framework that leverages diffusion models to generate long-term trajectories in a non-autoregressive manner. AID first performs behavior cloning on trajectories produced by existing MAIPP planners and then fine-tunes the policy using reinforcement learning via Diffusion Policy Policy Optimization (DPPO). This two-stage pipeline enables the policy to inherit expert behavior while learning improved coordination through online reward feedback. Experiments demonstrate that AID consistently improves upon the MAIPP planners it is trained from, achieving 4x faster execution and up to 17% increased information gain, while scaling effectively to larger numbers of agents. Our implementation is publicly available at https://github.com/marmotlab/AID.

2511.12386 2026-05-01 cs.CV

Leveraging Quantum-Based Architectures for Robust Diagnostics

Shabnam Sodagari, Tommy Long

详情
英文摘要

Quantum machine learning has emerged as a promising approach for medical image analysis, particularly in settings where compact models and expressive feature representations are desired. This paper presents a hybrid classical--quantum diagnostic framework that integrates dataset-specific preprocessing, transfer learning, and quantum convolutional neural networks (QCNNs) for multi-class medical image classification. This approach is evaluated on three distinct tasks: kidney disease diagnosis from computed tomography images, cervical cell classification from pap smear images, and brain tumor classification from magnetic resonance imaging. For each dataset, a pretrained encoder is used to extract latent features, which are then embedded into quantum states through angle or amplitude encoding and processed by a QCNN. Experimental results show strong and stable convergence across all datasets. The proposed hybrid models achieve 99% test accuracy on kidney CT classification, 97% on cervical cell classification, and 99% on brain tumor classification. In comparative evaluations for precision, recall, and F1, the hybrid QCNN models consistently outperform classical CNN baselines using the same pretrained encoders and similar hyperparameter settings, while requiring fewer trainable parameters. These results demonstrate the potential of quantum-enhanced architectures for robust and efficient medical diagnostics.

2510.23498 2026-05-01 cs.LG cs.AI cs.NA math.NA

Mixed Precision Training of Neural ODEs

Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang

Comments Code available at https://github.com/EmoryMLIP/rampde; 30 pages, 5 figures

详情
英文摘要

Exploiting low-precision computations has become a standard strategy in deep learning to address the growing computational costs imposed by ever larger models and datasets. However, naively performing all computations in low precision can lead to roundoff errors and instabilities. Therefore, mixed precision training schemes usually store the weights in high precision and use low-precision computations only for whitelisted operations. Despite their success, these principles are currently not reliable for training continuous-time architectures such as neural ordinary differential equations (Neural ODEs). This paper presents a mixed precision training framework for neural ODEs consisting of explicit ODE solvers and a custom backpropagation scheme and shows their effectiveness in a range of learning tasks. Our scheme uses low-precision computations for evaluating the velocity, parameterized by the neural network, and for storing intermediate states, while numerical reliability is provided by custom dynamic adjoint scaling and by accumulating the solution and gradients in higher precision. These contributions address two key challenges in training neural ODEs: the computational cost of repeated network evaluations and the growth of memory requirements with the number of time steps or layers. Along with the paper we publish our extendable, open-source PyTorch package \texttt{rampde}, whose syntax resembles that of leading packages to provide a drop-in replacement in existing codes. We demonstrate the reliability and effectiveness of our scheme using challenging test cases and on neural ODE applications in image classification and generative models, achieving approximately 50\% memory reduction and up to 2x speedup while maintaining accuracy comparable to single-precision training.

2510.20933 2026-05-01 cs.CV cs.AI

Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation

Moin Safdar, Shahzaib Iqbal, Mubeen Ghafoor, Tariq M. Khan, Imran Razzak, Thantrira Porntaveetus, Hamid Alinejad-Rokny

详情
英文摘要

Medical image segmentation is essential for clinical applications such as disease diagnosis, treatment planning, and disease development monitoring because it provides precise morphological and spatial information on anatomical structures that directly influence treatment decisions. Convolutional neural networks significantly impact image segmentation; however, since convolution operations are local, capturing global contextual information and long-range dependencies is still challenging. Their capacity to precisely segment structures with complicated borders and a variety of sizes is impacted by this restriction. Since transformers use self-attention methods to capture global context and long-range dependencies efficiently, integrating transformer-based architecture with CNNs is a feasible approach to overcoming these challenges. To address these challenges, we propose the Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation, referred to as FM-BFF-Net in the remainder of this paper. The network combines convolutional and transformer components, employs a focal modulation attention mechanism to refine context awareness, and introduces a bidirectional feature fusion module that enables efficient interaction between encoder and decoder representations across scales. Through this design, FM-BFF-Net enhances boundary precision and robustness to variations in lesion size, shape, and contrast. Extensive experiments on eight publicly available datasets, including polyp detection, skin lesion segmentation, and ultrasound imaging, show that FM-BFF-Net consistently surpasses recent state-of-the-art methods in Jaccard index and Dice coefficient, confirming its effectiveness and adaptability for diverse medical imaging scenarios.

2510.18731 2026-05-01 cs.CL cs.AI cs.LG

Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards

Ming Li, Pei Chen, Zhenhao Zhang, Tao Yang, Xinyang Zhang, Han Li, Tianyu Cao, Ming Zeng, Zhuofeng Wu, Meng Jiang, Huasheng Li, Lihong Li, Bing Yin

Comments ACL2026, camera-ready

详情
英文摘要

Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building multi-turn reliable and trustworthy LLMs.

2510.15350 2026-05-01 cs.RO cs.NE

Nauplius Optimisation for Autonomous Hydrodynamics

Shyalan Ramesh, Scott Mann, Alex Stumpf

Comments IEEE Access, 2026

详情
英文摘要

Autonomous Underwater vehicles must operate in strong currents, limited acoustic bandwidth, and persistent sensing requirements where conventional swarm optimisation methods are unreliable. This paper formulates an irreversible hydrodynamic deployment problem for Autonomous Underwater Vehicle (AUV) swarms and presents Nauplius Optimisation for Autonomous Hydrodynamics (NOAH), a novel nature-inspired swarm optimisation algorithm that combines current-aware drift, irreversible settlement in persistent sensing nodes, and colony-based communication. Drawing inspiration from the behaviour of barnacle nauplii, NOAH addresses the critical limitations of existing swarm algorithms by providing hydrodynamic awareness, irreversible anchoring mechanisms, and colony-based communication capabilities essential for underwater exploration missions. The algorithm establishes a comprehensive foundation for scalable and energy-efficient underwater swarm robotics with validated performance analysis. Validation studies demonstrate an 86% success rate for permanent anchoring scenarios, providing a unified formulation for hydrodynamic constraints and irreversible settlement behaviours with an empirical study under flow.