arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1502
2509.20739 2026-03-09 cs.RO cs.CV

Decision-Driven Semantic Object Exploration for Legged Robots via Confidence-Calibrated Perception and Topological Subgoal Selection

Guoyang Zhao, Yudong Li, Weiqing Qi, Kai Zhang, Bonan Liu, Kai Chen, Haoang Li, Jun Ma

详情
英文摘要

Conventional navigation pipelines for legged robots remain largely geometry-centric, relying on dense SLAM representations that are fragile under rapid motion and offer limited support for semantic decision making in open-world exploration. In this work, we focus on decision-driven semantic object exploration, where the primary challenge is not map consistency but how noisy and heterogeneous semantic observations can be transformed into stable and executable exploration decisions. We propose a vision-based approach that explicitly addresses this problem through confidence-calibrated semantic evidence arbitration, a controlled-growth semantic topological memory, and a semantic utility-driven subgoal selection mechanism. These components enable the robot to accumulate task-relevant semantic knowledge over time and select exploration targets that balance semantic relevance, reliability, and reachability, without requiring dense geometric reconstruction. Extensive experiments in both simulation and real-world environments demonstrate that the proposed mechanisms consistently improve the quality of semantic decision inputs, subgoal selection accuracy, and overall exploration performance on legged robots.

2509.20507 2026-03-09 cs.LG

Auto-Regressive U-Net for Full-Field Prediction of Shrinkage-Induced Damage in Concrete

Liya Gaynutdinova, Petr Havlásek, Ondřej Rokoš, Fleur Hendriks, Martin Doškář

详情
英文摘要

This paper introduces a deep learning approach for predicting time-dependent full-field damage in concrete. The study uses an auto-regressive U-Net model to predict the evolution of the scalar damage field in a unit cell given microstructural geometry and evolution of an imposed shrinkage profile. By sequentially using the predicted damage output as input for subsequent predictions, the model facilitates the continuous assessment of damage progression. Complementarily, a convolutional neural network (CNN) utilises the damage estimations to forecast key mechanical properties, including observed shrinkage and residual stiffness. The proposed dual-network architecture demonstrates high computational efficiency and robust predictive performance on the synthesised datasets. The approach reduces the computational load traditionally associated with full-field damage evaluations and is used to gain insights into the relationship between aggregate properties, such as shape, size, and distribution, and the effective shrinkage and reduction in stiffness. Ultimately, this can help to optimize concrete mix designs, leading to improved durability and reduced internal damage.

2509.19674 2026-03-09 cs.LG cs.CV

C^2Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning

Kunlun Xu, Yibo Feng, Jiangmeng Li, Yongsheng Qi, Jiahuan Zhou

Comments Accepted by NeurIPS 2025

详情
英文摘要

Federated continual learning (FCL) tackles scenarios of learning from continuously emerging task data across distributed clients, where the key challenge lies in addressing both temporal forgetting over time and spatial forgetting simultaneously. Recently, prompt-based FCL methods have shown advanced performance through task-wise prompt communication.In this study, we underscore that the existing prompt-based FCL methods are prone to class-wise knowledge coherence between prompts across clients. The class-wise knowledge coherence includes two aspects: (1) intra-class distribution gap across clients, which degrades the learned semantics across prompts, (2) inter-prompt class-wise relevance, which highlights cross-class knowledge confusion. During prompt communication, insufficient class-wise coherence exacerbates knowledge conflicts among new prompts and induces interference with old prompts, intensifying both spatial and temporal forgetting. To address these issues, we propose a novel Class-aware Client Knowledge Interaction (C${}^2$Prompt) method that explicitly enhances class-wise knowledge coherence during prompt communication. Specifically, a local class distribution compensation mechanism (LCDC) is introduced to reduce intra-class distribution disparities across clients, thereby reinforcing intra-class knowledge consistency. Additionally, a class-aware prompt aggregation scheme (CPA) is designed to alleviate inter-class knowledge confusion by selectively strengthening class-relevant knowledge aggregation. Extensive experiments on multiple FCL benchmarks demonstrate that C${}^2$Prompt achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/NeurIPS2025-C2Prompt

2509.14063 2026-03-09 cs.RO

Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Non-Towered Airspace

Sundhar Vinodh Sangeetha, Chih-Yuan Chiu, Sarah H. Q. Li, Shreyas Kousik

Comments The last two authors advised equally. Accepted to the 2026 IEEE International Conference on Robotics and Automation. 8 pages, 6 figures

详情
英文摘要

Autonomous aircraft must safely operate in non-towered airspace, where coordination relies on voice-based communication among human pilots. Safe operation requires an aircraft to predict the intent, and corresponding goal location, of other aircraft. This paper introduces a multimodal framework for aircraft goal prediction that integrates natural language understanding with spatial reasoning to improve autonomous decision-making in such environments. We leverage automatic speech recognition and large language models to transcribe and interpret pilot radio calls, identify aircraft, and extract discrete intent labels. These intent labels are fused with observed trajectories to condition a temporal convolutional network and Gaussian mixture model for probabilistic goal prediction. Our method significantly reduces goal prediction error compared to baselines that rely solely on motion history, demonstrating that language-conditioned prediction increases prediction accuracy. Experiments on a real-world dataset from a non-towered airport validate the approach and highlight its potential to enable socially aware, language-conditioned robotic motion planning.

2509.13386 2026-03-09 cs.RO cs.LG

VEGA: Electric Vehicle Navigation Agent via Physics-Informed Neural Operator and Proximal Policy Optimization

Hansol Lim, Minhyeok Im, Jonathan Boyack, Jee Won Lee, Jongseong Brad Choi

Comments This work has been submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication

详情
英文摘要

We present VEGA, a vehicle-adaptive energy-aware routing system for electric vehicles (EVs) that integrates physics-informed parameter estimation with RL-based charge-aware path planning. VEGA consists of two copupled modules: (1) a physics-informed neural operator (PINO) that estimates vehicle-specific physical parameters-drag, rolling resistance, mass, motor and regenerative-braking efficiencies, and auxiliary load-from short windows of onboard speed and acceleration data; (2) a Proximal Policy Optimization (PPO) agent that navigates a charger-annotated road graph, jointly selecting routes and charging stops under state-of-charge constraints. The agent is initialized via behavior cloning from an A* teacher and fine-tuned with cirriculum-guided PPO on the full U.S. highway network with Tesla Supercharger locations. On a cross-country San Francisco-to-New York route (~4,860km), VEGA produces a feasible 20-stop plan with 56.12h total trip time and minimum SoC 11.41%. Against the controlled Energy-aware A* baseline, the distance and driving-time gaps are small (-8.49km and +0.37h), while inference is >20x faster. The learned policy generalizes without retraining to road networks in France and Japan.

2509.11629 2026-03-09 cs.LG cs.AI

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Chentao Cao, Xiaojun Xu, Bo Han, Hang Li

Comments ICLR 2026

详情
英文摘要

As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by applying thinking ability to mitigate jailbreaking problems before producing a final answer to the user. Our method enables models to answer the question in their thoughts directly and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K samples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates. Notably, the fine-tuned model maintains general reasoning capabilities on benchmarks like MMLU, MATH500, and HumanEval. Besides, our method equips models with the ability to perform safe completion, while post-hoc detection methods can only directly reject sensitive, harmful queries (e.g., self-harm). Our results show that inference-time strategies alone are insufficient, highlighting the necessity of safety training, and we find even $500$ samples can yield performance comparable to the entire dataset, suggesting a promising path for data-efficient safety alignment. The dataset is publicly available at: https://huggingface.co/datasets/ByteDance-Seed/ReSA.

2509.08105 2026-03-09 cs.CL

MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning

Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani

Comments Accepted to EACL 2026 (main conference)

详情
英文摘要

Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy -- from general bilingual bitext to task-specific data -- and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.

2509.08087 2026-03-09 cs.LG cs.AI

Performance Assessment Strategies for Language Model Applications in Healthcare

Victor Garcia, Mariia Sidulova, Aldo Badano

详情
Journal ref
Artificial Intelligence in the Life Sciences. 2026;9:100162
英文摘要

Language models (LMs) represent an emerging paradigm within artificial intelligence, with applications throughout the medical enterprise. A comprehensive understanding of the clinical task and awareness of the variability in performance when implemented in actual clinical environments lays the foundation for the LM application assessment. Presently, a prevalent method for evaluating the performance of these generative models relies on quantitative benchmarks. Such benchmarks have limitations and may suffer from train-to-the-test overfitting, optimizing performance for a specified test set at the cost of generalizability across other tasks and data distributions. Evaluation strategies leveraging human expertise and utilizing cost-effective computational models as evaluators are gaining interest. We discuss current state-of-the-art methodologies for assessing the performance of LM applications in healthcare and medical devices.

2509.07945 2026-03-09 cs.LG

One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li

Comments 55 pages, 20 figures. Accepted as a conference paper at ICLR 2026

详情
英文摘要

In heterogeneous multi-task decision-making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture-of-Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task-specific representations to specialized sub-networks. This finding leads to our proposed model, \textit{ScaleZero}. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single-task agents. With the DPS strategy, it remains competitive while using just 71.5% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi-task planning. Our code is available at \textcolor{magenta}{https://github.com/opendilab/LightZero}.

2509.05554 2026-03-09 cs.CV cs.IR

RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentanglement

Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang

详情
英文摘要

Event-guided motion deblurring reconstructs sharp images using the high-temporal-resolution motion cues from event cameras. However, in real capture, thresholding-induced event under-reporting causes missing and fragmented motion cues, under which existing methods often degrade in performance due to two limitations: i) assumptions of dense and stable events, and ii) modality-indiscriminate extraction and fusion that fail to separate useful motion cues from disrupted events, allowing them to contaminate cross-modal representations. In this paper, we first introduce a Robustness-Oriented Perturbation Strategy (RPS) that mimics various trigger thresholds of dynamic vision sensors, exposing our model to diverse under-reporting patterns and thereby improving robustness under unknown conditions. Built upon this setting, we propose RED, a Robust Event-guided Deblurring network, following the principle of disentangle first and then fuse selectively. Specifically, the Modality-specific Representation Mechanism disentangles the inputs into image-semantic, event-motion, and cross-modal representations, capturing appearance, motion, and complementary interactions, respectively. With the reliable disentangled features, we selectively fuse modalities to enhance motion-sensitive areas in blurry images and enrich under-reported events with semantic context. Extensive experiments on synthetic and real-world datasets demonstrate RED consistently achieves state-of-the-art performance in terms of both accuracy and robustness.

2509.03953 2026-03-09 cs.AI cs.SC cs.SY eess.SY

Handling Infinite Domain Parameters in Planning Through Best-First Search with Delayed Partial Expansions

Ángel Aso-Mollar, Diego Aineto, Enrico Scala, Eva Onaindia

详情
Journal ref
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence. 2025. Main Track. Pages 8456-8464
英文摘要

In automated planning, control parameters extend standard action representations through the introduction of continuous numeric decision variables. Existing state-of-the-art approaches have primarily handled control parameters as embedded constraints alongside other temporal and numeric restrictions, and thus have implicitly treated them as additional constraints rather than as decision points in the search space. In this paper, we propose an efficient alternative that explicitly handles control parameters as true decision points within a systematic search scheme. We develop a best-first, heuristic search algorithm that operates over infinite decision spaces defined by control parameters and prove a notion of completeness in the limit under certain conditions. Our algorithm leverages the concept of delayed partial expansion, where a state is not fully expanded but instead incrementally expands a subset of its successors. Our results demonstrate that this novel search algorithm is a competitive alternative to existing approaches for solving planning problems involving control parameters.

2508.21513 2026-03-09 cs.LG cond-mat.dis-nn cs.AI

A Geometric Perspective on the Difficulties of Learning GNN-based SAT Solvers

Geri Skenderi

Comments Accepted in the Proceedings track of the GRaM Workshop @ ICLR 2026

详情
英文摘要

Graph Neural Networks (GNNs) have gathered increasing interest as learnable solvers of Boolean Satisfiability Problems (SATs), operating on graph representations of logical formulas. However, their performance degrades sharply on harder and more constrained instances, raising questions about architectural limitations. In this paper, we work towards a geometric explanation built upon graph Ricci Curvature (RC). We prove that bipartite graphs derived from random k-SAT formulas are inherently negatively curved, and that this curvature decreases with instance difficulty. Given that negative graph RC indicates local connectivity bottlenecks, we argue that GNN solvers are affected by oversquashing, a phenomenon where long-range dependencies become impossible to compress into fixed-length representations. We validate our claims empirically across different SAT benchmarks and confirm that curvature is both a strong indicator of problem complexity and can be used to predict generalization error. Finally, we connect our findings to the design of existing solvers and outline promising directions for future work.

2508.13238 2026-03-09 cs.CV

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

详情
英文摘要

Recent advances in vision-language models (VLMs) have enabled end-to-end document parsing and understanding, achieving strong performance on diverse optical character recognition (OCR) tasks. However, VLMs are prone to generate words that do not exist in the input image due to over-reliance on language priors. By contrast, traditional OCR models, whose architectures are tailored for specific recognition tasks, often achieve stronger fine-grained visual perception with fewer hallucinations, but they typically lack the contextual semantic understanding and reasoning capabilities needed in more challenging cases. To bridge this gap, we propose DianJin-OCR-R1, a reasoning-enhanced framework for recognition that trains VLMs in a reasoning-and-tool interleaved paradigm. Our DianJin-OCR-R1 model first recognizes the content in the input image through its own OCR capabilities, and then calls other expert models for extra results as references. After that, it is guided to "look again" at the image and compare its own recognized content with other results to find errors or omissions. Finally, it integrates all available evidence to generate a more accurate output. This design empowers the model to learn how to implicitly re-focus on the visual input and effectively leverage the results of other expert models for better performance. We evaluate our DianJin-OCR-R1 model on ReST and OmniDocBench, where it consistently outperforms both its non-reasoning counterparts and expert models, demonstrating the effectiveness of our method.

2508.10154 2026-03-09 cs.LG

Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression

Zhankun Luo, Abolfazl Hashemi

Comments This paper was accepted by Transactions on Machine Learning Research (TMLR). The code for numerical experiments is available at https://github.com/dassein/em_overspecified_mlr

详情
英文摘要

Mixture models have attracted significant attention due to practical effectiveness and comprehensive theoretical foundations. A persisting challenge is model misspecification, which occurs when the model to be fitted has more mixture components than those in the data distribution. In this paper, we develop a theoretical understanding of the Expectation-Maximization (EM) algorithm's behavior in the context of targeted model misspecification for overspecified two-component Mixed Linear Regression (2MLR) with unknown $d$-dimensional regression parameters and mixing weights. In Theorem 5.1 at the population level, with an unbalanced initial guess for mixing weights, we establish linear convergence of regression parameters in $O(\log(1/ε))$ steps. Conversely, with a balanced initial guess for mixing weights, we observe sublinear convergence in $O(ε^{-2})$ steps to achieve the $ε$-accuracy at Euclidean distance. In Theorem 6.1 at the finite-sample level, for mixtures with sufficiently unbalanced fixed mixing weights, we demonstrate a statistical accuracy of $O((d/n)^{1/2})$, whereas for those with sufficiently balanced fixed mixing weights, the accuracy is $O((d/n)^{1/4})$ given $n$ data samples. Furthermore, we underscore the connection between our population level and finite-sample level results: by setting the desired final accuracy $ε$ in Theorem 5.1 to match that in Theorem 6.1 at the finite-sample level, namely letting $ε= O((d/n)^{1/2})$ for sufficiently unbalanced fixed mixing weights and $ε= O((d/n)^{1/4})$ for sufficiently balanced fixed mixing weights, we intuitively derive iteration complexity bounds $O(\log (1/ε))=O(\log (n/d))$ and $O(ε^{-2})=O((n/d)^{1/2})$ at the finite-sample level for sufficiently unbalanced and balanced initial mixing weights. We further extend our analysis in overspecified setting to low SNR regime.

2508.03351 2026-03-09 cs.CV cs.AI cs.CL

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Yufei Xue, Yushi Huang, Jiawei Shao, Lunjie Zhu, Chi Zhang, Xuelong Li, Jun Zhang

详情
英文摘要

Post-training quantization (PTQ) has emerged as an effective technique for compressing large models and accelerating inference without retraining. While PTQ has been extensively studied in large language models (LLMs), its application to vision-language models (VLMs) remains underexplored. In this work, we identify two intrinsic characteristics of VLM activations: 1) visual over-representation, where vision tokens are excessive and often redundant, and 2) modality gap, which refers to the clear distribution gap between text and vision tokens in the latent feature space. Together, these two factors significantly deteriorate quantization performance but have been overlooked by existing PTQ methods. To address these challenges, we propose VLMQ, A VLM-tailored PTQ framework that selectively prioritizes salient tokens while suppressing redundant ones during quantization. In particular, we introduce a gradient-driven importance factor to capture the token-wise importance variance, the effectiveness of which is substantiated through both empirical and theoretical analysis. To ensure efficiency, we propose to use lightweight block-wise backpropagation for factor acquisition. Finally, we reformulate the optimization objective into an importance-aware form to preserve important activation information. Extensive evaluations on 8 benchmarks across 0.5B$\sim$32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial \textbf{16.45\%} improvement on MME-RealWorld under 2-bit quantization.

2508.01653 2026-03-09 cs.CV cs.AI

MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing

Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, Xiande Huang

详情
英文摘要

Large Vision-Language Models (LVLMs) have achieved impressive performance in multimodal tasks, but they still suffer from hallucinations, i.e., generating content that is grammatically accurate but inconsistent with visual inputs. In this work, we introduce a novel map-level perspective to mitigate hallucinations in LVLMs, interpreting the hidden states of the model as a 2D semantic map. We observe that factual information is widely distributed across this map, extending beyond the localized inter- or intra-layer regions targeted by most existing methods (e.g., contrastive decoding and layer-wise consistency). Building on this insight, we propose Map-Level Attention Processing (MAP), a training-free decoding method that effectively leverages factual information through attention-based map-level operations to improve factual consistency. Specifically, we employ Layer-Wise Criss-Cross Attention to progressively refine token representations at each decoding layer by aggregating tokens from both inter- and intra-layer dimensions. Additionally, a Global-Local Logit Fusion mechanism combines logits obtained before and after global attention to further refine predictions and improve accuracy. Our method consistently improves the truthfulness and performance of LVLMs across benchmarks, such as POPE, MME, and MMHal-Bench, demonstrating the potential of the map-level decoding strategy.

2507.23428 2026-03-09 cs.LG

Merging Memory and Space: A State Space Neural Operator

Nodens Koren, Samuel Lanthaler

详情
英文摘要

We propose the *State Space Neural Operator* (SS-NO), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). Our formulation extends structured state space models (SSMs) to joint spatiotemporal modeling, introducing two key mechanisms: *adaptive damping*, which stabilizes learning by localizing receptive fields, and *learnable frequency modulation*, which enables data-driven spectral selection. These components provide a unified framework for capturing long-range dependencies with parameter efficiency. Theoretically, we establish connections between SSMs and neural operators, proving a universality theorem for convolutional architectures with full field-of-view. Empirically, SS-NO achieves state-of-the-art performance across diverse PDE benchmarks-including 1D Burgers' and Kuramoto-Sivashinsky equations, and 2D Navier-Stokes and compressible Euler flows-while using significantly fewer parameters than competing approaches. A factorized variant of SS-NO further demonstrates scalable performance on challenging 2D problems. Our results highlight the effectiveness of damping and frequency learning in operator modeling, while showing that lightweight factorization provides a complementary path toward efficient large-scale PDE learning.

2507.20230 2026-03-09 cs.AI cs.CV cs.MA

A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature

Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao

详情
英文摘要

To fully expedite AI-powered chemical research, high-quality chemical databases are the foundation. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction. It utilizes the MLLM's strong reasoning capability to understand the structure of diverse chemical graphics and decompose the extraction task into sub-tasks. It then coordinates a set of specialized agents, each combining the capabilities of the MLLM with the precise, domain-specific strengths of dedicated tools and web services, to solve the subtasks accurately and integrate the results into a unified output. Our system achieved an F1 score of 76.27% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score of 39.13%) by a significant margin. Additionally, it demonstrated versatile applicability in a range of other information extraction tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.

2507.19883 2026-03-09 cs.RO

Bridging Simulation and Usability: A User-Friendly Framework for Scenario Generation in CARLA

Ahmed Abouelazm, Mohammad Mahmoud, Conrad Walter, Oleksandr Shchetsura, Erne Hussong, Helen Gremmelmaier, J. Marius Zöllner

Comments Paper is accepted in IEEE International Automated Vehicle Validation Conference (IAVVC 2025)

详情
英文摘要

Autonomous driving promises safer roads, reduced congestion, and improved mobility, yet validating these systems across diverse conditions remains a major challenge. Real-world testing is expensive, time-consuming, and sometimes unsafe, making large-scale validation impractical. In contrast, simulation environments offer a scalable and cost-effective alternative for rigorous verification and validation. A critical component of the validation process is scenario generation, which involves designing and configuring traffic scenarios to evaluate autonomous systems' responses to various events and uncertainties. However, existing scenario generation tools often require programming knowledge, limiting accessibility for non-technical users. To address this limitation, we present an interactive, no-code framework for scenario generation. Our framework features a graphical interface that enables users to create, modify, save, load, and execute scenarios without needing coding expertise or detailed simulation knowledge. Unlike script-based tools such as Scenic or ScenarioRunner, our approach lowers the barrier to entry and supports a broader user base. Central to our framework is a graph-based scenario representation that facilitates structured management, supports both manual and automated generation, and enables integration with deep learning-based scenario and behavior generation methods. In automated mode, the framework can randomly sample parameters such as actor types, behaviors, and environmental conditions, allowing the generation of diverse and realistic test datasets. By simplifying the scenario generation process, this framework supports more efficient testing workflows and increases the accessibility of simulation-based validation for researchers, engineers, and policymakers.

2507.19146 2026-03-09 cs.RO cs.LG

Diverse and Adaptive Behavior Curriculum for Autonomous Driving: A Student-Teacher Framework with Multi-Agent RL

Ahmed Abouelazm, Johannes Ratz, Philip Schörner, J. Marius Zöllner

Comments First and Second authors contributed equally; Paper accepted in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

详情
英文摘要

Autonomous driving faces challenges in navigating complex real-world traffic, requiring safe handling of both common and critical scenarios. Reinforcement learning (RL), a prominent method in end-to-end driving, enables agents to learn through trial and error in simulation. However, RL training often relies on rule-based traffic scenarios, limiting generalization. Additionally, current scenario generation methods focus heavily on critical scenarios, neglecting a balance with routine driving behaviors. Curriculum learning, which progressively trains agents on increasingly complex tasks, is a promising approach to improving the robustness and coverage of RL driving policies. However, existing research mainly emphasizes manually designed curricula, focusing on scenery and actor placement rather than traffic behavior dynamics. This work introduces a novel student-teacher framework for automatic curriculum learning. The teacher, a graph-based multi-agent RL component, adaptively generates traffic behaviors across diverse difficulty levels. An adaptive mechanism adjusts task difficulty based on student performance, ensuring exposure to behaviors ranging from common to critical. The student, though exchangeable, is realized as a deep RL agent with partial observability, reflecting real-world perception constraints. Results demonstrate the teacher's ability to generate diverse traffic behaviors. The student, trained with automatic curricula, outperformed agents trained on rule-based traffic, achieving higher rewards and exhibiting balanced, assertive driving.

2507.18923 2026-03-09 cs.CV

Gaussian Set Surface Reconstruction through Per-Gaussian Optimization

Zhentao Huang, Di Wu, Zhenbang He, Minglun Gong

详情
英文摘要

3D Gaussian Splatting (3DGS) effectively synthesizes novel views through its flexible representation, yet fails to accurately reconstruct scene geometry. While modern variants like PGSR introduce additional losses to ensure proper depth and normal maps through Gaussian fusion, they still neglect individual placement optimization. This results in unevenly distributed Gaussians that deviate from the latent surface, complicating both reconstruction refinement and scene editing. Motivated by pioneering work on Point Set Surfaces, we propose Gaussian Set Surface Reconstruction (GSSR), a method designed to distribute Gaussians evenly along the latent surface while aligning their dominant normals with the surface normal. GSSR enforces fine-grained geometric alignment through a combination of pixel-level and Gaussian-level single-view normal consistency and multi-view photometric consistency, optimizing both local and global perspectives. To further refine the representation, we introduce an opacity regularization loss to eliminate redundant Gaussians and apply periodic depth- and normal-guided Gaussian reinitialization for a cleaner, more uniform spatial distribution. Our reconstruction results demonstrate significantly improved geometric precision in Gaussian placement, enabling intuitive scene editing and efficient generation of novel Gaussian-based 3D environments. Extensive experiments validate GSSR's effectiveness, showing enhanced geometric accuracy while preserving high-quality rendering performance.

2507.15335 2026-03-09 cs.CV cs.AI

ExDD: Explicit Dual Distribution Learning for Surface Defect Detection via Diffusion Synthesis

Muhammad Aqeel, Federico Leonardi, Francesco Setti

Comments Accepted to ICIAP 2025

详情
英文摘要

Industrial defect detection systems face critical limitations when confined to one-class anomaly detection paradigms, which assume uniform outlier distributions and struggle with data scarcity in real-world manufacturing environments. We present ExDD (Explicit Dual Distribution), a novel framework that transcends these limitations by explicitly modeling dual feature distributions. Our approach leverages parallel memory banks that capture the distinct statistical properties of both normality and anomalous patterns, addressing the fundamental flaw of uniform outlier assumptions. To overcome data scarcity, we employ latent diffusion models with domain-specific textual conditioning, generating in-distribution synthetic defects that preserve industrial context. Our neighborhood-aware ratio scoring mechanism elegantly fuses complementary distance metrics, amplifying signals in regions exhibiting both deviation from normality and similarity to known defect patterns. Experimental validation on KSDD2 demonstrates superior performance (94.2% I-AUROC, 97.7% P-AUROC), with optimal augmentation at 100 synthetic samples. https://github.com/aqeeelmirza/ExDD-Defect-Detection

2507.11279 2026-03-09 cs.CV

Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping

Yujie Zhang, Sabine Struckmeyer, Andreas Kolb, Sven Reichardt

详情
Journal ref
Nature Scientific Data, 2026
英文摘要

Observer bias and inconsistencies in traditional plant phenotyping methods limit the accuracy and reproducibility of fine-grained plant analysis. To overcome these challenges, we developed TomatoMAP, a comprehensive dataset for Solanum lycopersicum using an Internet of Things (IoT) based imaging system with standardized data acquisition protocols. Our dataset contains 64,464 RGB images that capture 12 different plant poses from four camera elevation angles. Each image includes manually annotated bounding boxes for seven regions of interest (ROIs), including leaves, panicle, batch of flowers, batch of fruits, axillary shoot, shoot and whole plant area, along with 50 fine-grained growth stage classifications based on the BBCH scale. Additionally, we provide 3,616 high-resolution image subset with pixel-wise semantic and instance segmentation annotations for fine-grained phenotyping. We validated our dataset using a cascading model deep learning framework combining MobileNetv3 for classification, YOLOv11 for object detection, and MaskRCNN for segmentation. Through AI vs. Human analysis involving five domain experts, we demonstrate that the models trained on our dataset achieve accuracy and speed comparable to the experts. Cohen's Kappa and inter-rater agreement heatmap confirm the reliability of automated fine-grained phenotyping using our approach.

2507.11245 2026-03-09 cs.CV

NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation

X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang

Comments Project Page: https://amap-ml.github.io/NarrLV-Website/

详情
英文摘要

With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

2507.09095 2026-03-09 cs.LG

Temporal Misalignment Attacks against Multimodal Perception in Autonomous Driving

Md Hasan Shahriar, Md Mohaimin Al Barat, Harshavardhan Sundar, Ning Zhang, Naren Ramakrishnan, Y. Thomas Hou, Wenjing Lou

Comments 19 pages, 18 Figures

详情
英文摘要

Multimodal fusion (MMF) plays a critical role in the perception of autonomous driving, which primarily fuses camera and LiDAR streams for a comprehensive and efficient scene understanding. However, its strict reliance on precise temporal synchronization exposes it to new vulnerabilities. In this paper, we introduce DejaVu, an attack that exploits the in-vehicular network to manipulate the integrity of time and create subtle temporal misalignments, severely degrading downstream MMF-based perception tasks. Our comprehensive attack analysis across different models and datasets reveals the sensors' task-specific imbalanced sensitivities: object detection is overly dependent on LiDAR inputs, while object tracking is highly reliant on the camera inputs. Consequently, with a single-frame LiDAR delay, an attacker can reduce the car detection mAP by up to 88.5%, while with a three-frame camera delay, multiple object tracking accuracy (MOTA) for car drops by 73%. We further demonstrated two attack scenarios using an automotive Ethernet testbed for hardware-in-the-loop validation and the Autoware stack for end-to-end AD simulation, demonstrating the feasibility of the DejaVu attack and its severe impact, such as collisions and phantom braking. Our code and artifacts are publicly available at: https://github.com/shahriar0651/DejaVu.

2507.06543 2026-03-09 cs.CV

Token Bottleneck: One Token to Remember Dynamics

Taekyung Kim, Dongyoon Han, Byeongho Heo, Jeongeun Park, Sangdoo Yun

Comments NeurIPS 2025, 18 pages, 9 figures, 10 tables, project page: https://token-bottleneck.github.io, code: https://github.com/naver-ai/tobo

详情
英文摘要

Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the reconstruction step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments in diverse sequential tasks, including video label propagation and robot manipulation in simulated environments demonstrate the superiority of \ours~over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales. Code is available at https://github.com/naver-ai/tobo.

2507.06265 2026-03-09 cs.CV cs.AI

SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini

Comments Accepted at TMLR 2026

详情
英文摘要

Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval. Code and models are available at https://github.com/AtlasAnalyticsLab/SPARC

2506.13082 2026-03-09 cs.AI

Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs

Daniel Kilov, Caroline Hendy, Secil Yanik Guyot, Aaron J. Snoswell, Seth Lazar

详情
英文摘要

Moral competence is the ability to act in accordance with moral principles. As large language models (LLMs) are increasingly deployed in situations demanding moral competence, there is increasing interest in evaluating this ability empirically. We review existing literature and identify three significant shortcoming: (i) Over-reliance on prepackaged moral scenarios with explicitly highlighted moral features; (ii) Focus on verdict prediction rather than moral reasoning; and (iii) Inadequate testing of models' (in)ability to recognize when additional information is needed. Grounded in philosophical research on moral skill, we then introduce a novel method for assessing moral competence in LLMs. Our approach moves beyond simple verdict comparisons to evaluate five dimensions of moral competence: identifying morally relevant features, weighting their importance, assigning moral reasons to these features, synthesizing coherent moral judgments, and recognizing information gaps. We conduct two experiments comparing six leading LLMs against non-expert humans and professional philosophers. In our first experiment using ethical vignettes standard to existing work, LLMs generally outperformed non-expert humans across multiple dimensions of moral reasoning. However, our second experiment, featuring novel scenarios designed to test moral sensitivity by embedding relevant features among irrelevant details, revealed a striking reversal: several LLMs performed significantly worse than humans. Our findings suggest that current evaluations may substantially overestimate LLMs' moral reasoning capabilities by eliminating the task of discerning moral relevance from noisy information, which we take to be a prerequisite for genuine moral skill. This work provides a more nuanced framework for assessing AI moral competence and highlights important directions for improving moral competence in advanced AI systems.

2506.07658 2026-03-09 cs.CL

From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

Comments 36 pages, 24 figures. Third version

详情
英文摘要

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on multiple-choice questions, which suffer from inherent biases. To measure domain-specific knowledge in LLMs, we present a deterministic pipeline that transforms raw domain corpora into completion-style benchmarks without relying on other LLMs or costly human annotation. Our approach first extracts domain-specific keywords and related target vocabulary from an input corpus. It then constructs prompt-target pairs where domain-specific words serve as prediction targets. By measuring LLMs' ability to complete these prompts, we provide a direct assessment of domain knowledge at low computational cost. Our pipeline avoids benchmark contamination, enables automated updates with new domain data, and facilitates fair comparisons between base and instruction-tuned (chat) models. We validate our approach by showing that model performances on our benchmark significantly correlate with those on an expert-curated benchmark. We then demonstrate how our benchmark provides insights into knowledge acquisition in domain-adaptive, continual, and general pretraining. Finally, we examine the effects of instruction fine-tuning by comparing base and chat models within our unified evaluation framework. In conclusion, our pipeline enables scalable, domain-specific, LLM-independent, and unbiased evaluation of both base and chat models.

2506.06727 2026-03-09 cs.AI cs.CV

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Can Li, Ying Liu, Ting Zhang, Mei Wang, Hua Huang

Comments Accepted to ICLR 2026

详情
英文摘要

Large Multimodal Models have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify correct solutions. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Analysis indicates that the dominant failure mode stems from image-text misalignment: rather than grounding reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and finetuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and catalyst for developing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration.