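arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and related fields.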

2510.05875 2026-03-11 cs.SD

LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment

Jiahao Mei, Xuenan Xu, Zeyu Xie, Zihao Zheng, Ye Tao, Yue Ding, Mengyue Wu

Comments Submitted to Interspeech 2026

Details

English abstract

Recent advances in text-to-music models have enabled coherent music generation from text prompts, yet fine-grained emotional control remains unresolved. We introduce LARA-Gen, a framework for continuous emotion control that aligns the internal hidden states with an external music understanding model through Latent Affective Representation Alignment (LARA), enabling effective training. In addition, we design an emotion control module based on a continuous valence-arousal space, disentangling emotional attributes from textual content and bypassing the bottlenecks of text-based prompting. Furthermore, we establish a benchmark with a curated test set and a robust Emotion Predictor, facilitating objective evaluation of emotional controllability in music generation. Extensive experiments demonstrate that LARA-Gen achieves continuous, fine-grained control of emotion and significantly outperforms baselines in both emotion adherence and music quality. Generated samples are available at https://anonymous2232330.github.io/laragen-web/.

2510.02490 2026-03-11 cs.LG cs.SY eess.SY

Improved Robustness of Deep Reinforcement Learning for Control of Time-Varying Systems by Bounded Extremum Seeking

Shaifalee Saxena, Alan Williams, Rafael Fierro, Alexander Scheinker

2510.01068 2026-03-11 cs.RO cs.LG

Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, Wen Zhao, Qiang Zhang, Yijie Guo, Qihao Zheng, Chunfeng Song, Xiao Li, Ping Luo, Andrew F. Luo

Comments Accepted to ICLR 2026. Project Page: https://sagecao1125.github.io/GPC-Site/

Details

English abstract

Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contribution is threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
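
As an illustration of the convex score composition described in this abstract, the following minimal Python sketch combines the scores of two 1-D Gaussians. The Gaussian toy scores and the names `gaussian_score` and `composed_score` are assumptions made for illustration; they are not taken from the GPC code, and the test-time search over the weight is not shown.

```python
def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian evaluated at x."""
    return -(x - mean) / (std ** 2)


def composed_score(x, w, score_a, score_b):
    """Convex combination of two score functions with weight w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * score_a(x) + (1.0 - w) * score_b(x)


# Hypothetical "parent policies": two unit-variance Gaussians centred at 0 and 2.
def score_a(x):
    return gaussian_score(x, mean=0.0, std=1.0)


def score_b(x):
    return gaussian_score(x, mean=2.0, std=1.0)
```

With equal weights, `composed_score(1.0, 0.5, score_a, score_b)` is zero: the two scores pull in opposite directions with equal strength midway between the means. Setting `w` to 1 or 0 recovers either parent score unchanged.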

2509.25896 2026-03-11 cs.CV

LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Guolei Huang, Qinzhi Peng, Gan Xu, Yao Huang, Yuxuan Lu, Yongjun Shen

Comments Accepted to CVPR 2026

2509.25275 2026-03-11 cs.SD cs.AI eess.AS

VoiceBridge: General Speech Restoration with One-step Latent Bridge Models

Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu

2509.23926 2026-03-11 cs.CV

Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas

Comments 80 Pages. The paper's abstract was shortened to fit the character limit. Accepted at TMLR

Details

English abstract

Empirical evidence shows that deep vision networks often represent concepts as directions in latent space with concept information written along directional components in the vector representation of the input. However, the mechanism to encode (write) and decode (read) concept information to and from vector representations is not directly accessible as it constitutes a latent mechanism that naturally emerges from the training process of the network. Recovering this mechanism unlocks significant potential to open the black-box nature of deep networks, enabling understanding, debugging, and improving deep learning models. In this work, we propose an unsupervised method to recover this mechanism. For each concept, we explain that under the hypothesis of linear concept representations, this mechanism can be implemented with the help of two directions: the first facilitating encoding of concept information and the second facilitating decoding. Unlike prior matrix decomposition, autoencoder, or dictionary learning methods that rely on feature reconstruction, we propose a new perspective: decoding directions are identified via directional clustering of activations, and encoding directions are estimated with signal vectors under a probabilistic view. We further leverage network weights through a novel technique, Uncertainty Region Alignment, which reveals interpretable directions affecting predictions. Our analysis shows that (a) on synthetic data, our method recovers ground-truth direction pairs; (b) on real data, decoding directions map to monosemantic, interpretable concepts and outperform unsupervised baselines; and (c) signal vectors faithfully estimate encoding directions, validated via activation maximization. Finally, we demonstrate applications in understanding global model behavior, explaining individual predictions, and intervening to produce counterfactuals or correct errors.

2509.17299 2026-03-11 cs.RO cs.CV

Automated Coral Spawn Monitoring for Reef Restoration: The Coral Spawn and Larvae Imaging Camera System (CSLICS)

Dorian Tsai, Christopher A. Brunner, Riki Lamont, F. Mikaela Nordborg, Andrea Severati, Java Terry, Karen Jackel, Matthew Dunbabin, Tobias Fischer, Scarlett Raine

Comments 8 pages, 7 figures, accepted for presentation at the IEEE International Conference on Robotics and Automation, 2026

2509.15328 2026-03-11 cs.LG cs.CV q-bio.NC

Kuramoto Orientation Diffusion Models

Yue Song, T. Anderson Keller, Sevan Brodjian, Takeru Miyato, Yisong Yue, Pietro Perona, Max Welling

Comments NeurIPS 2025

2509.14932 2026-03-11 cs.RO cs.LG

Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale

Tobias Jülg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter

Comments Accepted at ICRA 2026

2509.07968 2026-03-11 cs.CL

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das

2509.06067 2026-03-11 cs.LG

A Surrogate model for High Temperature Superconducting Magnets to Predict Current Distribution with Neural Network

Mianjun Xiao, Peng Song, Yulong Liu, Cedric Korte, Ziyang Xu, Jiale Gao, Jiaqi Lu, Haoyang Nie, Qiantong Deng, Timing Qu

2509.04859 2026-03-11 cs.CV

CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus

Hannah Schieber, Dominik Frischmann, Victor Schaack, Simon Boche, Angela Schoellig, Stefan Leutenegger, Daniel Roth

2509.01267 2026-03-11 cs.LG

Iterative In-Context Learning to Enhance LLMs Abstract Reasoning: The Case-Study of Algebraic Tasks

Stefano Fioravanti, Matteo Zavatteri, Roberto Confalonieri, Kamyar Zeinalipour, Paolo Frazzetto, Alessandro Sperduti, Nicolò Navarin

Comments Accepted at KNLP 2026 - ACM SAC 2026 Special Track on Knowledge and Natural Language Processing. https://knlp-sac.github.io/2026/index.html

2508.18722 2026-03-11 cs.AI

VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft

Honghao Fu, Junlong Ren, Qi Chai, Deheng Ye, Yujun Cai, Hao Wang

Comments Accepted by EMNLP 2025 main

2508.16403 2026-03-11 cs.LG

RF-Informed Graph Neural Networks for Accurate and Data-Efficient Circuit Performance Prediction

Anahita Asadi, Leonid Popryho, Inna Partin-Vaisband

Comments This work has been submitted to the IEEE for possible publication

2508.14965 2026-03-11 cs.CV cs.RO

You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

Hakjin Lee, Junghoon Seo, Jaehoon Sim

Comments This paper has been accepted by IEEE ICRA 2026

2508.13532 2026-03-11 cs.LG cs.SY eess.SY

MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination

Ziyan Wu, Ivan Korolija, Rui Tang

Comments The platform is released open-source on GitHub: https://github.com/BuildNexusX/MuFlex

Details

DOI: 10.1016/j.energy.2026.140565
Journal ref: Energy, vol. 349, 140565, 2026

English abstract

With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which lack the ability to fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for multi-building flexibility coordination, was developed. MuFlex enables synchronous information exchange and co-simulation across multiple detailed building models programmed in EnergyPlus and Modelica, and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform's physics-based capabilities and workflow were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic (SAC) algorithm. The results show that, when coordinating the four buildings, SAC reduced the aggregated peak demand by nearly 12% while maintaining indoor comfort and keeping the power demand below the threshold. Additionally, the platform's scalability was investigated through computational benchmarking on building clusters with varying sizes, model types, and simulation programs.

2508.10729 2026-03-11 cs.CV cs.AI

EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

Yanjun Li, Yuqian Fu, Tianwen Qian, Qi'ao Xu, Silong Dai, Danda Pani Paudel, Luc Van Gool, Xiaoling Wang

2508.05433 2026-03-11 cs.LG cs.NE

Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies

Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang

2507.20804 2026-03-11 cs.AI

MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

Xueyao Wan, Hang Yu

2507.11531 2026-03-11 cs.LG q-bio.NC

Langevin Flows for Modeling Neural Latent Dynamics

Yue Song, T. Anderson Keller, Yisong Yue, Pietro Perona, Max Welling

Comments Full version of the Cognitive Computational Neuroscience (CCN) 2025 poster

2507.10368 2026-03-11 cs.LG physics.geo-ph

Operator Learning for Consolidation: An Architectural Comparison for DeepONet Variants

Yongjin Choi, Chenying Liu, Jorge Macedo

2507.09155 2026-03-11 cs.CL cs.AI

OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering

Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim

Comments Accepted at Digital Discovery (Royal Society of Chemistry)

2506.07737 2026-03-11 cs.CV

SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding

Xuemei Chen, Huamin Wang, Jing Peng, Hangchi Shen, Shukai Duan, Shiping Wen, Tingwen Huang

Details

DOI: 10.1109/TETCI.2026.3670672

English abstract

With the wide application of 3D object detection in fields such as autonomous driving, its energy consumption is constantly increasing, making low-power alternatives a key research area. Spiking neural networks (SNNs), with their low power consumption, offer a novel solution. Consequently, we apply SNNs to monocular 3D object detection and propose the SpikeSMOKE architecture, which represents a new attempt at low-power monocular 3D object detection. It is well known that the discrete signals of SNNs can lead to information loss compared to artificial neural networks (ANNs), which limits their feature representation capabilities. To solve this problem, inspired by the synaptic filtering mechanism of biological neurons, we propose a new Cross-Scale Gating Coding Mechanism (CSGC), which can enhance feature representation by combining cross-scale fusion of attentional methods and gated filtering mechanisms. In addition, to reduce computation and accelerate training, we present a novel lightweight residual block that maintains the spiking computing paradigm while preserving the highest possible detection performance. Our method is effective on the KITTI, NuScenes-mini and CIFAR10/100 datasets. Compared to the baseline SpikeSMOKE on 3D object detection, the proposed SpikeSMOKE with CSGC achieves 11.78 (+2.82, Easy), 10.69 (+3.2, Moderate), and 10.48 (+3.17, Hard) on the KITTI autonomous driving dataset, measured by AP|R11 at a 0.7 IoU threshold. It is worth noting that SpikeSMOKE significantly reduces energy consumption compared with SMOKE. Moreover, SpikeSMOKE-L (lightweight) further reduces the number of parameters by 3 times and computation by 10 times compared to SMOKE.

2506.01290 2026-03-11 cs.LG cs.AI

Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment

Shunyu Wu, Dan Li, Wenjie Feng, Haozheng Ye, Jian Lou, See-Kiong Ng

Comments Accepted at ICLR 2026

2505.24417 2026-03-11 cs.CV

EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song

2505.22473 2026-03-11 cs.LG

Pure Exploration with Infinite Answers

Riccardo Poiani, Martino Bernasconi, Andrea Celli

2505.21147 2026-03-11 cs.LG

Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score

Xuanning Zhou, Zihao Shi, Hao Zeng, Xiaobo Xia, Bingyi Jing, Hongxin Wei

Comments Accepted by CVPR 2026

2505.16368 2026-03-11 cs.LG cs.AI

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong

Comments Camera-ready version for Neural Information Processing Systems (NeurIPS) 2025, Spotlight Paper

Details

English abstract

How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
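
The rule-based verification mentioned in this abstract can be illustrated with a small Python sketch: checking a candidate assignment against a CNF formula takes one pass over the clauses. The DIMACS-style integer encoding, the function names, and the random k-SAT generator with `num_vars`, `num_clauses`, and `k` as difficulty knobs are assumptions for illustration, not Saturn's actual pipeline; in particular, random instances produced this way are not guaranteed to be satisfiable.

```python
import random


def satisfies(clauses, assignment):
    """Return True if every clause contains at least one true literal.

    clauses: list of clauses, each a list of non-zero ints; a positive int v
             means variable v, a negative int -v means NOT v.
    assignment: dict mapping each variable index to a bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def random_ksat(num_vars, num_clauses, k, seed=None):
    """Draw a random k-SAT formula (hypothetical generator, may be unsatisfiable)."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        # Pick k distinct variables, then negate each with probability 1/2.
        variables = rng.sample(range(1, num_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses
```

For the formula `[[1, -2], [2, 3]]`, the assignment `{1: True, 2: False, 3: True}` passes the check, while `{1: False, 2: True, 3: False}` fails on the first clause. Because the check is purely mechanical, a reward signal built on it needs no human or LLM judge.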

2505.14679 2026-03-11 cs.CL cs.AI cs.LG

UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models

Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang

Comments TMLR 2026

Details

English abstract

Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds more than $7\times$ faster than the previous state-of-the-art method, while requiring $4\times$ less VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at https://github.com/XiaojieGu/UltraEdit.
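
The idea of continuously updating feature statistics can be sketched with a generic running mean and variance (Welford's algorithm). This is a standard technique swapped in for illustration, on scalar features, and is not claimed to be UltraEdit's lifelong normalization; the class name `RunningStats` and its methods are hypothetical.

```python
class RunningStats:
    """Running mean and population variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of all observations seen so far."""
        return self._m2 / self.count if self.count else 0.0

    def normalize(self, value, eps=1e-8):
        """Standardise a value with the current statistics."""
        return (value - self.mean) / ((self.variance + eps) ** 0.5)
```

Each call to `update` costs constant time and memory, so the statistics can follow a stream of arbitrary length without storing past observations.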
