arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1658
2510.05486 2026-04-17 cs.CL

Language Model as Planner and Formalizer under Constraints

Cassie Huang, Stuti Mohan, Ziyi Yang, Stefanie Tellex, Li Zhang

Comments In ACL 2026 main conference

详情
英文摘要

LLMs have been widely used in planning, either as planners to generate action sequences end-to-end, or as formalizers to represent the planning domain and problem in a formal language that can derive plans deterministically. However, both lines of work rely on standard benchmarks that include only generic and simplistic environmental specifications, leading to potential overestimation of the planning ability of LLMs and safety concerns in downstream tasks. We bridge this gap by augmenting widely used planning benchmarks with manually annotated, fine-grained, and rich natural language constraints spanning four formally defined categories. Over 4 state-of-the-art reasoning LLMs, 4 formal languages, and 4 datasets, we show that the introduction of one-sentence constraints consistently halves performance, indicating current LLMs' lack of robustness and an avenue for future research.

2510.03851 2026-04-17 cs.AI

MetaMuse: Algorithm Generation via Creative Ideation

Ruiying Ma, Chieh-Jan Mike Liang, Yanjie Gao, Francis Y. Yan

Comments ICLR 2026

详情
英文摘要

Designing system algorithms remains challenging, where the discontinuous nature of the solution space often forces system engineers to rely on generic heuristics at the expense of performance. We study whether LLMs can practically drive algorithm generation, and find that they are biased towards well-known generic designs, rather than making the creative leaps needed to navigate the discontinuous solution space. To address this limitation, we introduce MetaMuse, a framework for creative ideation built on three self-reflection principles: (1) quantifying solution diversity and usefulness in measurable performance space, rather than abstract idea space, (2) steering ideation through external stimuli, rather than internal randomness, and (3) constructing executable solutions using waypoint reasoning, rather than free-form chain-of-thought. Considering two critical online problems at a global cloud provider, extensive evaluations show that MetaMuse can generate high-performing solutions: it reduces cache misses by up to 35.76% in cache replacement and reduces bin usage by up to 30.93% in online bin packing.

2510.02738 2026-04-17 cs.RO cs.LG

Flow with the Force Field: Learning 3D Compliant Flow Matching Policies from Force and Demonstration-Guided Simulation Data

Tianyu Li, Yihan Li, Zizhe Zhang, Nadia Figueroa

Comments Accepted to ICRA 2026

详情
英文摘要

While visuomotor policy has made advancements in recent years, contact-rich tasks still remain a challenge. Robotic manipulation tasks that require continuous contact demand explicit handling of compliance and force. However, most visuomotor policies ignore compliance, overlooking the importance of physical interaction with the real world, often leading to excessive contact forces or fragile behavior under uncertainty. Introducing force information into vision-based imitation learning could help improve awareness of contacts, but could also require a lot of data to perform well. One remedy for data scarcity is to generate data in simulation, yet computationally taxing processes are required to generate data good enough not to suffer from the Sim2Real gap. In this work, we introduce a framework for generating force-informed data in simulation, instantiated by a single human demonstration, and show how coupling with a compliant policy improves the performance of a visuomotor policy learned from synthetic data. We validate our approach on real-robot tasks, including non-prehensile block flipping and a bi-manual object moving, where the learned policy exhibits reliable contact maintenance and adaptation to novel conditions. Project Website: https://flow-with-the-force-field.github.io/webpage/

2510.02539 2026-04-17 cs.CL cs.IR

Hierarchical Semantic Retrieval with Cobweb

Anant Gupta, Karthik Singaravadivelan, Zekun Wang

Comments 20 pages, 7 tables, 4 figures

详情
英文摘要

Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb--a hierarchy-aware framework--to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match the dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.

2510.01433 2026-04-17 cs.RO cs.AI

AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation

Anukriti Singh, Kasra Torshizi, Khuzema Habib, Kelin Yu, Ruohan Gao, Pratap Tokekar

详情
英文摘要

Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints, yielding a compact 38-dimensional state policy that can be trained in 15 minutes, which performs well in real-time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.

2509.23638 2026-04-17 cs.LG

LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers

Enda Yu, Dezun Dong, Zhaoning Zhang, Zhe Bai, Weiling Yang, Haojie Wang, Dongsheng Li, Yongwei Wu, Xiangke Liao

Comments publishing in ICS 2026

详情
英文摘要

Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware. Offloading expert weights to CPU memory results in PCIe transfer latency that exceeds GPU computation by several folds. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth competition, and cross-device scheduling complexity. Our solution includes: 1) Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs and loading overhead; 3) Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.

2509.23468 2026-04-17 cs.RO cs.AI cs.LG

Multi-Modal Manipulation via Multi-Modal Policy Consensus

Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell

Comments 8 pages, 7 figures. Project website: https://policyconsensus.github.io

详情
英文摘要

Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental of new representations. We evaluate our approach on simulated manipulation tasks in {RLBench}, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy further demonstrates robustness to physical perturbations and sensor corruption. We further conduct perturbation-based importance analysis, which reveals adaptive shifts between modalities.

2509.23249 2026-04-17 cs.LG cs.NA math.NA

Deep Learning for Subspace Regression

Vladimir Fanaskov, Vladislav Trifonov, Alexander Rudikov, Ekaterina Muravleva, Ivan Oseledets

Comments Accepted to ICLR 2026, reviewed at https://openreview.net/forum?id=HF60Lu1Maj

详情
英文摘要

It is often possible to perform reduced order modelling by specifying linear subspace which accurately captures the dynamics of the system. This approach becomes especially appealing when linear subspace explicitly depends on parameters of the problem. A practical way to apply such a scheme is to compute subspaces for a selected set of parameters in the computationally demanding offline stage and in the online stage approximate subspace for unknown parameters by interpolation. For realistic problems the space of parameters is high dimensional, which renders classical interpolation strategies infeasible or unreliable. We propose to relax the interpolation problem to regression, introduce several loss functions suitable for subspace data, and use a neural network as an approximation to high-dimensional target function. To further simplify a learning problem we introduce redundancy: in place of predicting subspace of a given dimension we predict larger subspace. We show theoretically that this strategy decreases the complexity of the mapping for elliptic eigenproblems with constant coefficients and makes the mapping smoother for general smooth function on the Grassmann manifold. Empirical results also show that accuracy significantly improves when larger-than-required subspaces are predicted. With the set of numerical illustrations we demonstrate that subspace regression can be useful for a range of tasks including parametric eigenproblems, deflation techniques, relaxation methods, optimal control and solution of parametric partial differential equations.

2509.22378 2026-04-17 cs.SD cs.AI cs.MM eess.AS

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Zijian Zhao, Dian Jin, Zijing Zhou

详情
英文摘要

Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generation results. Even methods based on emotion mapping face controversy, as emotion represents only a singular aspect of art. Additionally, most learning-based methods require substantial computational resources and large datasets for training, hindering accessibility for common users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated motivations in text and the attention maps from the VLM to provide explanations for the generated results in both text and image modalities. To validate our method, we conduct both human studies and machine evaluations, where our method outperforms others in terms of music quality and music-image consistency, indicating promising results. Our code is available at https://github.com/RS2002/Image2Music .

2509.20869 2026-04-17 cs.LG cs.AI

Model-Based Reinforcement Learning under Random Observation Delays

Armin Karamzade, Kyungmin Kim, JB Lanier, Davide Corsi, Roy Fox

详情
英文摘要

Delays frequently occur in real-world environments, yet standard reinforcement learning (RL) algorithms often assume instantaneous perception of the environment. We study random sensor delays in POMDPs, where observations may arrive out-of-sequence, a setting that has not been previously addressed in RL. We analyze the structure of such delays and demonstrate that naive approaches, such as stacking past observations, are insufficient for reliable performance. To address this, we propose a model-based filtering process that sequentially updates the belief state based on an incoming stream of observations. We then introduce a simple delay-aware framework that incorporates this idea into model-based RL, enabling agents to effectively handle random delays. Applying this framework to the Dreamer world-modeling scheme, our method consistently outperforms delay-aware baselines developed for MDPs and demonstrates robustness to delay distribution shifts during deployment. Additionally, we present experiments on simulated robotic tasks, comparing our method to common practical heuristics and emphasizing the importance of explicitly modeling observation delays.

2509.15602 2026-04-17 cs.CV

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Zhongyuan Bao, Lejun Zhang

详情
英文摘要

Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 8 tasks from the stroke level to the rally level and includes 2527 human-verified questions. Evaluating 17 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.

2509.14255 2026-04-17 cs.CL cs.AI

Cosine-Similarity Routing with Semantic Anchors for Interpretable Mixture-of-Experts Language Models

Ivan Ternovtsii, Yurii Bilak

Comments 23 pages, 6 figures. Code available at https://github.com/ITernovtsii/semantic-resonance. Preprint

详情
英文摘要

Mixture-of-Experts (MoE) models improve efficiency through sparse activation, but their learned gating functions provide limited insight into routing decisions. This work introduces the Semantic Resonance Architecture (SRA), which routes tokens to experts via cosine similarity between token representations and learnable semantic anchors, making every routing decision directly traceable to anchor-token similarity scores. We evaluate SRA on WikiText-103 across 17 configurations. In a controlled multi-seed comparison (3 seeds x 4 configurations, 256 experts, $D_{ff}=256$), cosine routing achieves competitive perplexity with standard linear routing ($12.57 \pm 0.03$ vs $12.45 \pm 0.03$ for $K=1 \to 4$; $12.52 \pm 0.02$ vs $12.57 \pm 0.02$ for $K=2 \to 4$). The training recipe -- not the routing function -- drives specialization quality, while cosine routing provides inherent inspectability. We introduce a bandpass routing loss -- a floor-and-ceiling corridor on expert utilization -- that reduces dead experts from 30-45% to 0-6% and transfers to both routing types. Routing-space evaluation shows cosine routing provides significantly better word-level subtoken coherence in deeper layers ($p < 0.001$), with 44-54% of expert specialization being syntactic rather than semantic. Extended analysis reveals cosine routing maintains more stable router saturation and tighter per-expert vocabulary distributions -- structural advantages from the bounded cosine similarity range. An inference-time $k$-sweep shows that $k=5$ yields a free 0.08-0.16 perplexity gain over $k=4$. Cross-dataset validation on OpenWebText confirms generalization: cosine routing achieves comparable perplexity (44.88 vs 45.44), the bandpass loss eliminates dead experts, and specialization patterns are preserved.

2509.14003 2026-04-17 cs.SD cs.AI

RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing

Liting Gao, Yi Yuan, Yaru Chen, Yuelan Cheng, Zhenbo Li, Juan Wen, Shubin Zhang, Wenwu Wang

Comments Accepted to ICASSP 2026

详情
英文摘要

Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods that rely on full-caption or costly optimization often struggle with complex editing or lack practicality. In this work, we propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing, and construct a dataset featuring overlapping multi-event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.

2509.12833 2026-04-17 cs.LG

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

Hannah Markgraf, Shambhuraj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff

详情
英文摘要

Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking. In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard. Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.

2509.12712 2026-04-17 cs.SD cs.IR

A Lightweight Two-Branch Architecture for Multi-Instrument Transcription via Note-Level Contrastive Clustering

Ruigang Li, Yongxu Zhu

Comments Published in TISMIR, Vol. 9, No. 1, pp. 119-130, 2026

详情
Journal ref
Transactions of the International Society for Music Information Retrieval, 9(1), 119-130 (2026)
英文摘要

Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments, rigid source-count constraints, and high computational demands that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations including spectral normalization, dilated convolutions, and contrastive clustering further improve efficiency and robustness. Despite its small size and fast inference, the model achieves competitive performance with heavier baselines in terms of transcription accuracy and separation quality, and shows promising generalization ability, making it highly suitable for real-world deployment in practical and resource-constrained settings.

2509.06593 2026-04-17 cs.RO

A Robust Approach for LiDAR-Inertial Odometry Without Sensor-Specific Modeling

Meher V. R. Malladi, Tiziano Guadagnino, Luca Lobefaro, Cyrill Stachniss

详情
英文摘要

Accurate odometry is a critical component in a robotic navigation stack, and subsequent modules such as planning and control often rely on an estimate of the robot's motion. Sensor-based odometry approaches should be robust across sensor types and deployable in different target domains, from solid-state LiDARs mounted on cars in urban-driving scenarios to spinning LiDARs on handheld packages used in unstructured natural environments. In this paper, we propose a robust LiDAR-inertial odometry system that does not rely on sensor-specific modeling. Sensor fusion techniques for LiDAR and inertial measurement unit (IMU) data typically integrate IMU data iteratively in a Kalman filter or use pre-integration in a factor graph framework, combined with LiDAR scan matching often exploiting some form of feature extraction. We propose an alternative strategy that only requires a simplified motion model for IMU integration and directly registers LiDAR scans in a scan-to-map approach. Our approach allows us to impose a novel regularization on the LiDAR registration, improving the overall odometry performance. We detail extensive experiments on a number of datasets covering a wide array of commonly used robotic sensors and platforms. We show that our approach works with the exact same configuration in all these scenarios, demonstrating its robustness. We have open-sourced our implementation so that the community can build further on our work and use it in their navigation stacks.

2509.06591 2026-04-17 cs.CV

Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising

Yichao Liu, Hengzhi Xue, YueYang Teng, Junwen Guo

详情
英文摘要

Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, current approaches often face a trade$-$off between training stability and computational efficiency. In this study, we propose a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module to address these limitations. The EGA modules enhance both spatial and channel-wise interaction, improving the network's capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to state of the art methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. Thus, our approach demonstrates significant potential for practical, real-world clinical applications.

2509.03472 2026-04-17 cs.LG cs.AI cs.DC

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

Yubo Gao, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar

详情
英文摘要

Differentially-Private SGD (DP-SGD) and its adaptive variant DP-Adam are powerful techniques to protect user privacy when using sensitive data to train neural networks. During training, converting model weights and activations into low-precision formats, i.e., quantization, can drastically reduce training times, energy consumption, and cost, and is thus a widely used technique. In this work, we demonstrate for the first time that quantization causes significantly higher accuracy degradation in DP training compared to regular SGD. We observe that this is caused by noise injection, which amplifies quantization variance, leading to disproportionately large accuracy degradation. To address this challenge, we present DPQuant, a dynamic quantization framework that adaptively selects a changing subset of layers to quantize at each epoch. Our method combines two key ideas that effectively reduce quantization variance: (i) probabilistic sampling that rotates which layers are quantized every epoch, and (ii) loss-aware layer prioritization, which uses a differentially private loss sensitivity estimator to identify layers that can be quantized with minimal impact on model quality. This estimator consumes a negligible fraction of the overall privacy budget, preserving DP guarantees. Empirical evaluations on ResNet18, ResNet50, and DenseNet121 across a range of datasets demonstrate that DPQuant consistently outperforms static quantization baselines, achieving near Pareto-optimal accuracy-compute trade-offs and up to $2.21\times$ theoretical throughput improvements on low-precision hardware, with less than 2% drop in validation accuracy. We further show that our framework extends to DP-Adam with similar gains.

2508.20705 2026-04-17 cs.LG cs.AI

EEGDM: Learning EEG Representation with Latent Diffusion Model

Shaocong Wang, Tong Liu, Yihan Li, Ming Li, Kairui Wen, Pei Yang, Wenqi Ji, Minjing Yu, Yong-Jin Liu

详情
英文摘要

Recent advances in self-supervised learning for EEG representation have largely relied on masked reconstruction, where models are trained to recover randomly masked signal segments. While effective at modeling local dependencies, such objectives are inherently limited in capturing the global dynamics and long-range dependencies essential for characterizing neural activity. To address this limitation, we propose EEGDM, a novel self-supervised framework that leverages latent diffusion models to generate EEG signals as an objective. Unlike masked reconstruction, diffusion-based generation progressively denoises signals from noise to realism, compelling the model to capture holistic temporal patterns and cross-channel relationships. Specifically, EEGDM incorporates an EEG encoder that distills raw signals and their channel augmentations into a compact representation, acting as conditional information to guide the diffusion model for generating EEG signals. This design endows EEGDM with a compact latent space, which not only offers ample control over the generative process but also can be leveraged for downstream tasks. Experimental results show that EEGDM (1) reconstructs high-quality EEG signals, (2) learns robust representations, and (3) achieves competitive performance across diverse downstream tasks, thus exploring a new direction for self-supervised EEG representation learning.

2508.10962 2026-04-17 cs.CV

CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving

Jiarong Li, Imad Ali Shah, Diarmaid Geever, Fiachra Collins, Enda Ward, Martin Glavin, Edward Jones, Brian Deegan

Comments Under Review at IEEE OJITS, July, 2025

详情
Journal ref
IEEE Open Journal of Intelligent Transportation Systems, vol. 7, pp. 1021-1033, 2026
英文摘要

Protecting Vulnerable Road Users (VRU) is a critical safety challenge for automotive perception systems, particularly under visual ambiguity caused by metamerism, a phenomenon where distinct materials appear similar in RGB imagery. This work investigates hyperspectral imaging (HSI) to overcome this limitation by capturing unique material signatures beyond the visible spectrum, especially in the Near-Infrared (NIR). To manage the inherent high-dimensionality of HSI data, we propose a band selection strategy that integrates information theory techniques (joint mutual information maximization, correlation analysis) with a novel application of an image quality metric (contrast signal-to-noise ratio) to identify the most spectrally informative bands. Using the Hyperspectral City V2 (H-City) dataset, we identify three informative bands (497 nm, 607 nm, and 895 nm, $\pm$27 nm) and reconstruct pseudo-color images for comparison with co-registered RGB. Quantitative results demonstrate increased dissimilarity and perceptual separability of VRU from the background. The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% for dissimilarity (Euclidean, SAM, $T^2$) and perception (CIE $ΔE$) metrics, consistently outperforming RGB and confirming a marked reduction in metameric confusion. By providing a spectrally optimized input, our method enhances VRU separability, establishing a robust foundation for downstream perception tasks in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), ultimately contributing to improved road safety.

2508.05015 2026-04-17 cs.LG cs.AI

SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning

Dai Do, Manh Nguyen, Svetha Venkatesh, Hung Le

详情
英文摘要

Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical under many realistic training budgets. Many existing pipelines sample training examples uniformly across steps or epochs, ignoring differences in difficulty, redundancy, and learning value, which slows learning and wastes computation. We propose \textbf{SPaCe}, a self-paced learning framework that enables efficient learning based on the capability of the model being trained through optimizing which data to use and when. First, we apply \emph{cluster-based data reduction} to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a \textit{multi-armed bandit} treats data clusters as arms, allocating training samples based on the model's solve rates and learning progress. Experiments across multiple reasoning benchmarks show that SPaCe achieves comparable or better accuracy than state-of-the-art baselines while using up to \(100\times\) fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources.

2508.03341 2026-04-17 cs.AI

What Deserves Memory: Adaptive Memory Distillation for LLM Agents

Wenquan Ma, Jiayan Nan, Wenlong Wu, Yize Chen

详情
英文摘要

Memory systems for LLM agents struggle to determine what information deserves retention. Existing approaches rely on predefined heuristics such as importance scores, emotional tags, or factual templates, encoding designer intuition rather than learning from the data itself. Inspired by cognitive ideas, we propose NEMORI, an adaptive memory distillation framework that casts the assessment of experience's future utility as a matter of predictability. Specifically, NEMORI comprises two cascading modules: Episodic Memory Integration transforms raw interactions into coherent narratives, and Semantic Knowledge Distillation extracts insights via prediction error. Centering on distillation, the framework remains agnostic to downstream management. Extensive experiments confirm that NEMORI achieves strong performance, efficiency, and storage reduction. Our work suggests that observing the intrinsic properties of interaction sequences offers a viable, data-driven alternative to heuristic-based memory design. Code: https://github.com/nemori-ai/nemori.

2507.15351 2026-04-17 cs.AI cs.ET cs.MA

One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms

Zijian Zhao, Sen Li

详情
英文摘要

Order dispatch is a critical task in ride-sharing systems with Autonomous Vehicles (AVs), directly influencing efficiency and profits. Recently, Multi-Agent Reinforcement Learning (MARL) has emerged as a promising solution to this problem by decomposing the large state and action spaces among individual agents, effectively addressing the Curse of Dimensionality (CoD) in transportation market, which is caused by the substantial number of vehicles, passengers, and orders. However, conventional MARL-based approaches heavily rely on accurate estimation of the value function, which becomes problematic in large-scale, highly uncertain environments. To address this issue, we propose two novel methods that bypass value function estimation, leveraging the homogeneous property of AV fleets. First, we draw an analogy between AV fleets and groups in Group Relative Policy Optimization (GRPO), adapting it to the order dispatch task. By replacing the Proximal Policy Optimization (PPO) baseline with the group average reward-to-go, GRPO eliminates critic estimation errors and reduces training bias. Inspired by this baseline replacement, we further propose One-Step Policy Optimization (OSPO), demonstrating that the optimal policy can be trained using only one-step group rewards under a homogeneous fleet. Experiments on a real-world ride-hailing dataset show that both GRPO and OSPO achieve promising performance across all scenarios, efficiently optimizing pickup times and the number of served orders using simple Multilayer Perceptron (MLP) networks. Furthermore, OSPO outperforms GRPO in all scenarios, attributed to its elimination of bias caused by the bounded time horizon of GRPO. Our code, trained models, and processed data are provided at https://github.com/RS2002/OSPO .

2507.11812 2026-04-17 cs.SD eess.AS eess.SP

A Multimodal Data Fusion Generative Adversarial Network for Real Time Underwater Sound Speed Field Construction

Wei Huang, Yuqiang Huang, Jixuan Zhou, Fang Ji, Hao Zhang, Tianhe Xu

详情
英文摘要

Sound speed profiles (SSPs) are essential parameters underwater that affects the propagation mode of underwater signals and has a critical impact on the energy efficiency of underwater acoustic communication and accuracy of underwater acoustic positioning. Traditionally, SSPs can be obtained by matching field processing (MFP), compressive sensing (CS), and deep learning (DL) methods. However, existing methods mainly rely on on-site underwater sonar observation data, which put forward strict requirements on the deployment of sonar observation systems. To achieve high-precision estimation of sound velocity distribution in a given sea area without on-site underwater data measurement, we propose a multi-modal data-fusion generative adversarial network model with residual attention block (MDF-RAGAN) for SSP construction. To improve the model's ability for capturing global spatial feature correlations, we embedded the attention mechanisms, and use residual modules for deeply capturing small disturbances in the deep ocean sound velocity distribution caused by changes of SST. Experimental results on real open dataset show that the proposed model outperforms other state-of-the-art methods, which achieves an accuracy with an error of less than 0.3m/s. Specifically, MDF-RAGAN not only outperforms convolutional neural network (CNN) and spatial interpolation (SITP) by nearly a factor of two, but also achieves about 65.8\% root mean square error (RMSE) reduction compared to mean profile, which fully reflects the enhancement of overall profile matching by multi-source fusion and cross-modal attention.

2506.14121 2026-04-17 cs.CV

FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution

Siyu Xu, Wenjie Li, Guangwei Gao, Jian Yang, Guo-Jun Qi, Chia-Wen Lin

Comments 12 pages, 10 figures, 8 tales

详情
英文摘要

Face super-resolution (FSR) under limited computational budgets remains challenging. Existing methods often treat all facial pixels equally, leading to suboptimal resource allocation and degraded performance. CNNs are sensitive to high-frequency facial features such as contours and outlines, while Mamba excels at capturing low-frequency attributes like facial color and texture with lower complexity than Transformers. Motivated by this, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components for dedicated processing. The low-frequency branch employs a Mamba-based Low-Frequency Enhancement Block (LFEB) that integrates state-space attention with squeeze-and-excitation to restore global interactions and emphasize informative channels. The high-frequency branch uses a CNN-based Deep Position-Aware Attention (DPA) module to refine structural details, followed by a lightweight High-Frequency Refinement (HFR) module for further frequency-specific refinement. These designs enable FADPNet to achieve a strong balance between FSR quality and efficiency, outperforming existing methods.

2506.13763 2026-04-17 cs.LG cs.AI cs.CV stat.ML

Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value

Yixian Xu, Shengjie Luo, Liwei Wang, Di He, Chang Liu

Comments 33 pages, 12 figures, 9 tables. ICLR 2026 Camera Ready version

详情
英文摘要

Diffusion models have achieved remarkable success in generative modeling. Despite more stable training, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically not zero but unknown, leading to confusion between large optimal loss and insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.

2506.09457 2026-04-17 cs.CL cs.LG

Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang

Comments Findings of ACL 2026

详情
英文摘要

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap", a discrepancy between training objectives and autoregressive decoding dynamics. In this paper, we consider that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze its limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 11.8 points in AlpacaEval 2 and overall improvements across downstream tasks. These results underscore the need to mitigate the reward-generation gap in DAAs by better aligning training objectives with autoregressive decoding dynamics.

2506.00433 2026-04-17 cs.CV cs.LG eess.IV

Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis

Luigi Sigillo, Shengfeng He, Danilo Comminiello

Comments Accepted at ICLR 2026

详情
英文摘要

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling. The code is available at https://github.com/LuigiSigillo/LatentWaveletDiffusion

2505.20122 2026-04-17 cs.CV

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

详情
英文摘要

This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes. We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. We find that these VLMs exhibit weak ME bias, while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings. Project page: http://mebench.github.io/.

2505.18129 2026-04-17 cs.CV cs.CL

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan

Comments Technical Report

详情
英文摘要

Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline.The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.