arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2797
专题追踪
2509.21765 2026-02-10 cs.AI cs.LG

Lifelong Learning with Behavior Consolidation for Vehicle Routing

Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang, Xin Yao

Comments Accepted by The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
英文摘要

Recent neural solvers have demonstrated promising performance in learning to solve routing problems. However, existing studies are primarily based on one-off training on one or a set of predefined problem distributions and scales, i.e., tasks. When a new task arises, they typically rely on either zero-shot generalization, which may be poor due to the discrepancies between the new task and the training task(s), or fine-tuning the pretrained solver on the new task, which possibly leads to catastrophic forgetting of knowledge acquired from previous tasks. This paper explores a novel lifelong learning paradigm for neural VRP solvers, where multiple tasks with diverse distributions and scales arise sequentially over time. Solvers are required to effectively and efficiently learn to solve new tasks while maintaining their performance on previously learned tasks. Consequently, a novel framework called Lifelong Learning Router with Behavior Consolidation (LLR-BC) is proposed. LLR-BC consolidates prior knowledge effectively by aligning behaviors of the solver trained on a new task with the buffered ones in a decision-seeking way. To encourage more focus on crucial experiences, LLR-BC assigns greater consolidated weights to decisions with lower confidence. Extensive experiments on capacitated vehicle routing problems and traveling salesman problems demonstrate LLR-BC's effectiveness in training high-performance neural solvers in a lifelong learning setting, addressing the catastrophic forgetting issue, maintaining their plasticity, and improving zero-shot generalization ability.

2509.19191 2026-02-10 cs.CV

Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang

详情
英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.

2509.17321 2026-02-10 cs.RO cs.CL

OpenGVL -- Benchmarking Visual Temporal Progress for Data Curation

Paweł Budzianowski, Emilia Wiśnios, Michał Tyrolski, Gracjan Góral, Igor Kulakov, Viktor Petrenko, Krzysztof Walas

Comments Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025

详情
英文摘要

Data scarcity remains one of the most limiting factors in driving progress in robotics. However, the amount of available robotics data in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately $70\%$ of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at \href{github.com/budzianowski/opengvl}{OpenGVL}.

2509.17107 2026-02-10 cs.CV cs.RO eess.IV

CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception

Lingzhao Kong, Jiacheng Lin, Siyu Li, Kai Luo, Zhiyong Li, Kailun Yang

Comments Accepted to ICRA 2026. The source code will be made publicly available at https://github.com/godk0509/CoBEVMoE

详情
英文摘要

Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird's Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for Camera-based BEV segmentation by +1.5% on OPV2V and the AP@0.5 for LiDAR-based 3D object detection by +3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made publicly available at https://github.com/godk0509/CoBEVMoE.

2509.14579 2026-02-10 cs.SD

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

Qingyu Liu, Yushen Chen, Zhikang Niu, Chunhui Wang, Yunting Yang, Bowen Zhang, Jian Zhao, Pengcheng Zhu, Kai Yu, Xie Chen

Comments 5 pages, 2 figures

详情
英文摘要

Flow-matching-based text-to-speech (TTS) models have shown high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on reference transcripts corresponding to the audio prompt for synthesis. This dependency prevents cross-lingual voice cloning when audio prompt transcripts are unavailable, particularly for unseen languages. The key challenges for flow-matching-based TTS models to remove audio prompt transcripts are identifying word boundaries during training and determining appropriate duration during inference. In this paper, we introduce Cross-Lingual F5-TTS, a framework that enables cross-lingual voice cloning without audio prompt transcripts. Our method preprocesses audio prompts by forced alignment to obtain word boundaries, enabling direct synthesis from audio prompts while excluding transcripts during training. To address the duration modeling challenge, we train speaking rate predictors at different linguistic granularities to derive duration from speaker pace. Experiments show that our approach matches the performance of F5-TTS while enabling cross-lingual voice cloning.

2509.07820 2026-02-10 cs.AI cs.CL

Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach

João Paulo Nogueira, Wentao Sun, Alonso Silva, Laith Zumot

详情
英文摘要

Large reasoning language models are typically run with fixed inference budgets, which can waste computation or terminate reasoning prematurely. We introduce Certainty-Guided Reasoning (CGR), a model-agnostic adaptive inference procedure that periodically probes whether the current reasoning supports a confident final answer and terminates early once a target certainty threshold is reached, otherwise continuing until the end-of-thinking token or the budget limit. Certainty is estimated from the model's predicted probabilities over the answer tokens, yielding a lightweight stopping criterion. On AIME2025, CGR preserves baseline accuracy while reducing token usage, providing a tunable certainty-efficiency trade-off that can eliminate millions of tokens in aggregate. Across 64 random seeds, CGR exhibits consistent behavior. We also introduce a Grade metric that penalizes incorrect answers and permits abstention, capturing risk-sensitive performance. Results show that CGR improves Grade by abstaining when certainty remains low.

2509.03231 2026-02-10 cs.RO

Exploring persuasive interactions with generative social robots: An experimental framework

Stephan Vonschallen, Larissa Julia Corina Finsler, Theresa Schmiedel, Friederike Eyssel

Comments A shortened version of this paper was accepted as poster for the Thirteenth International Conference on Human-Agent Interaction (HAI2025)

详情
英文摘要

Integrating generative AI such as Large Language Models into social robots has improved their ability to engage in natural, human-like communication. This study presents a method to examine their persuasive capabilities. We designed an experimental framework focused on decision making and tested it in a pilot that varied robot appearance and self-knowledge. Using qualitative analysis, we evaluated interaction quality, persuasion effectiveness, and the robot's communicative strategies. Participants generally experienced the interaction positively, describing the robot as competent, friendly, and supportive, while noting practical limits such as delayed responses and occasional speech-recognition errors. Persuasiveness was highly context dependent and shaped by robot behavior: Participants responded well to polite, reasoned suggestions and expressive gestures, but emphasized the need for more personalized, context-aware arguments and clearer social roles. These findings suggest that generative social robots can influence user decisions, but their effectiveness depends on communicative nuance and contextual relevance. We propose refinements to the framework to further study persuasive dynamics between robots and human users.

2509.00560 2026-02-10 cs.LG cs.PF

Efficient Graph Knowledge Distillation from GNNs to Kolmogorov--Arnold Networks via Self-Attention Dynamic Sampling

Can Cui, Zilong Fu, Penghe Huang, Yuanyuan Li, Wu Deng, Dongyan Li

详情
英文摘要

Recent success of graph neural networks (GNNs) in modeling complex graph-structured data has fueled interest in deploying them on resource-constrained edge devices. However, their substantial computational and memory demands present ongoing challenges. Knowledge distillation (KD) from GNNs to MLPs offers a lightweight alternative, but MLPs remain limited by fixed activations and the absence of neighborhood aggregation, constraining distilled performance. To tackle these intertwined limitations, we propose SA-DSD, a novel self-attention-guided dynamic sampling distillation framework. To the best of our knowledge, this is the first work to employ an enhanced Kolmogorov-Arnold Network (KAN) as the student model. We improve Fourier KAN (FR-KAN+) with learnable frequency bases, phase shifts, and optimized algorithms, substantially improving nonlinear fitting capability over MLPs while preserving low computational complexity. To explicitly compensate for the absence of neighborhood aggregation that is inherent to both MLPs and KAN-based students, SA-DSD leverages a self-attention mechanism to dynamically identify influential nodes, construct adaptive sampling probability matrices, and enforce teacher-student prediction consistency. Extensive experiments on six real world datasets demonstrate that, under inductive and most of transductive settings, SA-DSD surpasses three GNN teachers by 3.05%-3.62% and improves FR-KAN+ by 15.61%. Moreover, it achieves a 16.69x parameter reduction and a 55.75% decrease in average runtime per epoch compared to key benchmarks.

2508.20033 2026-02-10 cs.CL cs.AI

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin

详情
英文摘要

The ability to research and synthesize knowledge is central to human expertise and progress. A new class of AI systems--designed for generative research synthesis--aims to automate this process by retrieving information from the live web and producing long-form, cited reports. Yet, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short, factual answers, while expert-curated datasets risk staleness and data contamination. Neither captures the complexity and evolving nature of real research synthesis tasks. We introduce DeepScholar-bench, a live benchmark and automated evaluation framework for generative research synthesis. DeepScholar-bench draws queries and human-written exemplars from recent, high-quality ArXiv papers and evaluates a real synthesis task: generating a related work section by retrieving, synthesizing, and citing prior work. Our automated framework holistically measures performance across three key dimensions--knowledge synthesis, retrieval quality, and verifiability. To further future work, we also contribute DeepScholar-ref, a simple, open-source reference pipeline, which is implemented on the LOTUS framework and provides a strong baseline. Using DeepScholar-bench, we systematically evaluate prior open-source systems, search agents with strong models, OpenAI's DeepResearch, and DeepScholar-ref. We find DeepScholar-bench is far from saturated: no system surpasses a geometric mean of $31\%$ across all metrics. These results highlight both the difficulty and importance of DeepScholar-bench as a foundation for advancing AI systems capable of generative research synthesis. We make our benchmark code and data available at https://github.com/guestrin-lab/deepscholar-bench.

2508.15305 2026-02-10 cs.AI

Coarse-to-Fine Grounded Memory for LLM Agent Planning

Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, Bo Xu

Comments Accepted to EMNLP 2025 Main Conference;27 pages,15 figures

详情
英文摘要

Recent advancements in Large Language Models (LLMs) have driven growing interest in LLM-based agents for complex planning tasks. To avoid costly agent training, many studies adopted memory mechanism that enhances LLM with offline experiences or online trajectory analysis. However, existing works focus on single-granularity memory derived from dynamic environmental interactions, which are inherently constrained by the quality of the collected experiences. This limitation, in turn, constrain the diversity of knowledge and the flexibility of planning. We propose Coarse-to-Fine Grounded Memory (\Ours{}), a novel framework that grounds coarse-to-fine memories with LLM, thereby fully leverage them for flexible adaptation to diverse scenarios. \Ours{} grounds environmental information into coarse-grained focus points to guide experience collection in training tasks, followed by grounding of actionable hybrid-grained tips from each experience. At inference, \Ours{} retrieves task-relevant experiences and tips to support planning. When facing environmental anomalies, the LLM grounds the current situation into fine-grained key information, enabling flexible self-QA reflection and plan correction.

2508.14893 2026-02-10 cs.CV cs.CL cs.RO

Virtual Community: An Open World for Humans, Robots, and Society

Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen, Wenjun Liu, Zunzhe Zhang, Sunli Chen, Lixing Fang, Qiushi Lyu, Xinyu Sun, Jincheng Yang, Zeyuan Wang, Bao Chi Dang, Zhehuan Chen, Daksha Ladia, Quang Vinh Dang, Jiageng Liu, Chuang Gan

Comments website https://virtual-community-ai.github.io/

详情
英文摘要

The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community-an open-world platform for humans, robots, and society-built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to enable the study of embodied social intelligence at scale. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.

2508.13898 2026-02-10 cs.LG cs.AI

Beyond the Mean: Fisher-Orthogonal Projection for Natural Gradient Descent in Large Batch Training

Yishun Lu, Wesley Armour

Comments Accepted at Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26) (Oral)

详情
英文摘要

Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such a large batch size. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively washes out the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher-metric.

2508.13786 2026-02-10 cs.SD cs.AI

DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer

Yisu Liu, Chenxing Li, Wanqian Zhang, Wenfu Wang, Meng Yu, Ruibo Fu, Zheng Lin, Weiping Wang, Dong Yu

Journal ref IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2026

详情
英文摘要

Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, facilitating audio generation through consensus among multiple reward signals. Extensive experiments on AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performances across a variety of objective and subjective evaluation metrics.

2508.13490 2026-02-10 cs.LG nlin.CD

DyMixOp: A Neural Operator Designed from a Complex Dynamics Perspective with Local-Global Mixing for Solving PDEs

Pengyu Lai, Yixiao Chen, Dewu Yang, Rui Wang, Feng Wang, Hui Xu

详情
英文摘要

A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) lies in recasting these systems into a tractable representation particularly when the dynamics are inherently non-linearizable or require infinite-dimensional spaces for linearization. To address this challenge, we introduce DyMixOp, a novel neural operator framework for PDEs that integrates theoretical insights from complex dynamical systems. Grounded in dynamics-aware priors and inertial manifold theory, DyMixOp projects the original infinite-dimensional PDE dynamics onto a finite-dimensional latent space. This reduction preserves both essential linear structures and dominant nonlinear interactions, thereby establishing a physically interpretable and computationally structured foundation. Central to this approach is the local-global mixing (LGM) transformation, a key architectural innovation inspired by the convective nonlinearity in turbulent flows. By multiplicatively coupling local fine-scale features with global spectral information, LGM effectively captures high-frequency details and complex nonlinear couplings while mitigating the spectral bias that plagues many existing neural operators. The framework is further enhanced by a dynamics-informed architecture that stacks multiple LGM layers in a hybrid configuration, incorporating timescale-adaptive gating and parallel aggregation of intermediate dynamics. This design enables robust approximation of general evolutionary dynamics across diverse physical regimes. Extensive experiments on seven benchmark PDE systems spanning 1D to 3D, elliptic to hyperbolic types demonstrate that DyMixOp achieves state-of-the-art performance on six of them, significantly reducing prediction errors (by up to 94.3% in chaotic regimes) while maintaining computational efficiency and strong scalability.

2508.12281 2026-02-10 cs.CL

Legal$Δ$: Enhancing Legal Reasoning in LLMs via Reinforcement Learning with Chain-of-Thought Guided Information Gain

Xin Dai, Buqiang Xu, Zhenghao Liu, Yukun Yan, Huiyuan Xie, Xiaoyuan Yi, Shuo Wang, Ge Yu

详情
英文摘要

Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior by producing direct answers without explicit multi-step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose Legal$Δ$, a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, Legal$Δ$ employs a dual-mode input setup-comprising direct answer and reasoning-augmented modes-and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. Legal$Δ$ follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that Legal$Δ$ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at https://github.com/NEUIR/LegalDelta.

2508.05599 2026-02-10 cs.CV

WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Xiaohui Li, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang

Comments 32 pages, 15 figures, 39 tables

详情
英文摘要

Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoder (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratio. On the ImageNet 50k validation set, at a high-fidelity setting, WeTok achieves a record-low zero-shot rFID of 0.12, outperforming leading continuous tokenizers like FLUX-VAE (0.18) and SD-VAE 3.5 (0.19) with 400% compression ratio. Furthermore, in a high-compression regime, WeTok achieves a zero-shot rFID of 3.49 at a 768$\times$ compression ratio, substantially surpassing Cosmos, which scores 4.57 at only 50% our compression ratio. Code and models are available: https://github.com/zhuangshaobin/WeTok.

2507.23278 2026-02-10 cs.CV

UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang

详情
英文摘要

In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.

2507.15375 2026-02-10 cs.CL eess.AS

STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Comments ICLR 2026 camera-ready version. Project page: https://d223302.github.io/STITCH/

详情
英文摘要

Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.

2507.12412 2026-02-10 cs.LG cs.AI

NOCTA: Non-Greedy Objective Cost-Tradeoff Acquisition for Longitudinal Data

Dzung Dinh, Boqi Chen, Yunni Qu, Marc Niethammer, Junier Oliva

详情
英文摘要

In many critical domains, features are not freely available at inference time: each measurement may come with a cost of time, money, and risk. Longitudinal prediction further complicates this setting because both features and labels evolve over time, and missing measurements at earlier timepoints may become permanently unavailable. We propose NOCTA, a Non-Greedy Objective Cost-Tradeoff Acquisition framework that sequentially acquires the most informative features at inference time while accounting for both temporal dynamics and acquisition cost. NOCTA is driven by a novel objective, NOCT, which evaluates a candidate set of future feature-time acquisitions by its expected predictive loss together with its acquisition cost. Since NOCT depends on unobserved future trajectories at inference time, we develop two complementary estimators: (i) NOCT-Contrastive, which learns an embedding of partial observations utilizing the induced distribution over future acquisitions, and (ii) NOCT-Amortized, which directly predicts NOCT for candidate plans with a neural network. Experiments on synthetic and real-world medical datasets demonstrate that both NOCTA estimators outperform existing baselines, achieving higher accuracy at lower acquisition costs.

2507.11589 2026-02-10 cs.LG gr-qc

Einstein Fields: A Neural Perspective To Computational General Relativity

Sandeep Suresh Cranganore, Andrei Bodnar, Arturs Berzins, Johannes Brandstetter

Comments Accepted at ICLR 2026: 64 pages, 23 figures, 14 Tables, Github: https://github.com/AndreiB137/EinFields, added (i) EinFields applied to Oscillating neutron star NR simulation using Fixed Mesh Refinement (FMR) for four refinement levels from EinsteinToolkit, (ii) Jit-based query speeds of EinFields and its derivative over 17 Million simulation grid points, (iii) Bianchi Identity values

Journal ref The Fourteenth International Conference on Learning Representations 2026

详情
英文摘要

We introduce Einstein Fields, a neural representation designed to compress computationally intensive four-dimensional numerical relativity simulations into compact implicit neural network weights. By modeling the metric, the core tensor field of general relativity, Einstein Fields enable the derivation of physical quantities via automatic differentiation. Unlike conventional neural fields (e.g., signed distance, occupancy, or radiance fields), Einstein Fields fall into the class of Neural Tensor Fields with the key difference that, when encoding the spacetime geometry into neural field representations, dynamics emerge naturally as a byproduct. Our novel implicit approach demonstrates remarkable potential, including continuum modeling of four-dimensional spacetime, mesh-agnosticity, storage efficiency, derivative accuracy, and ease of use. It achieves up to a $4,000$-fold reduction in storage memory compared to discrete representations while retaining a numerical accuracy of five to seven decimal places. Moreover, in single precision, differentiation of the Einstein Fields-parameterized metric tensor is up to five orders of magnitude more accurate compared to naive finite differencing methods. We demonstrate these properties on several canonical test beds of general relativity and numerical relativity simulation data, while also releasing an open-source JAX-based library: \href{https://github.com/AndreiB137/EinFields}{https://github.com/AndreiB137/EinFields}, taking the first steps to studying the potential of machine learning in numerical relativity.

2507.05806 2026-02-10 cs.LG stat.ML

Predicting Graph Structure via Adapted Flux Balance Analysis

Sevvandi Kandanaarachchi, Ziqi Xu, Stefan Westerlund, Conrad Sanderson

Comments extended and revised version of arXiv:2401.04280

Journal ref Lecture Notes in Computer Science (LNCS), Vol. 16370, pp. 351-363, 2026

详情
英文摘要

Many dynamic processes such as telecommunication and transport networks can be described through discrete time series of graphs. Modelling the dynamics of such time series enables prediction of graph structure at future time steps, which can be used in applications such as detection of anomalies. Existing approaches for graph prediction have limitations such as assuming that the vertices do not to change between consecutive graphs. To address this, we propose to exploit time series prediction methods in combination with an adapted form of flux balance analysis (FBA), a linear programming method originating from biochemistry. FBA is adapted to incorporate various constraints applicable to the scenario of growing graphs. Empirical evaluations on synthetic datasets (constructed via Preferential Attachment model) and real datasets (UCI Message, HePH, Facebook, Bitcoin) demonstrate the efficacy of the proposed approach.

2507.05713 2026-02-10 cs.CL cs.AI

DRAGOn: Designing RAG On Periodically Updated Corpus

Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova

Comments EACL 2026

详情
英文摘要

This paper introduces DRAGOn, method to design a RAG benchmark on a regularly updated corpus. It features recent reference datasets, a question generation framework, an automatic evaluation pipeline, and a public leaderboard. Specified reference datasets allow for uniform comparison of RAG systems, while newly generated dataset versions mitigate data leakage and ensure that all models are evaluated on unseen, comparable data. The pipeline for automatic question generation extracts the Knowledge Graph from the text corpus and produces multiple question-answer pairs utilizing modern LLM capabilities. A set of diverse LLM-as-Judge metrics is provided for a comprehensive model evaluation. We used Russian news outlets to form the datasets and demonstrate our methodology. We launch a public leaderboard to track the development of RAG systems and encourage community participation.

2507.04034 2026-02-10 cs.AI

Lyria: A Genetic Algorithm-Driven Neuro-Symbolic Reasoning Framework for LLMs

Weizhi Tang, Kwabena Nuamah, Vaishak Belle

详情
英文摘要

While LLMs have demonstrated impressive abilities across various domains, they struggle with two major issues. The first is that LLMs trap themselves into local optima and the second is that they lack exhaustive coverage of the solution space. To investigate and improve these two issues, we propose Lyria, a neuro-symbolic reasoning framework building on the integration of LLMs, genetic algorithms, and symbolic systems, comprising 7 essential components. Through conducting extensive experiments with 4 LLMs across 3 types of problems, we demonstrated the efficacy of Lyria. Furthermore, with 7 additional ablation experiments, we further systematically analyzed and elucidated the factors that affect its performance. In addition, based on Lyria, we extend the ideas to the fine-tuning process of LLMs and introduce LAFT which enables a weaker model to imitate the reasoning process of a stronger model that reason under the Lyria reasoning framework. We demonstrate that the significant effectiveness of LAFT by conducting extensive experiments against 9 constructed baselines. We finally reveal the limitations and provide insights into future directions.

2507.03041 2026-02-10 cs.LG cs.AI

Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, Jure Leskovec

Comments Accepted to ICLR 2026. 22 pages

详情
英文摘要

Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at https://optimas.stanford.edu.

2507.00075 2026-02-10 cs.LG cs.AI

Theoretical Modeling of Large Language Model Self-Improvement Training Dynamics Through Solver-Verifier Gap

Yifan Sun, Yushan Liang, Zhen Zhang, Xin Liu, Jiaye Teng

Comments 37 pages

详情
英文摘要

Self-improvement is a significant techniques within the realm of large language model (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM's solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experiment results. We validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

2506.20487 2026-02-10 cs.RO

A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots

Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Huijiang Wang, Jiayu Chen, Bo Li, Wei Zhang, Wenjun Zeng, Hua Chen, Xin Jin

Comments 19 pages, 10 figures

Journal ref IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025

详情
英文摘要

Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pre-training to learn reusable primitive skills and broad behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and regularly updated collection of BFM papers and projects to facilitate more subsequent research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.

2506.19385 2026-02-10 cs.AI

CID-GraphRAG: Enhancing Multi-Turn Dialogue Systems through Dual-Pathway Retrieval of Conversation Flow and Context Semantics

Ziqi Zhu, Tao Hu, Honglong Zhang, Dan Yang, Hangeng Chen, Mengran Zhang, Xilun Chen

详情
英文摘要

We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval-Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity or static knowledge graphs, CID-GraphRAG constructs intent transition graphs from goal-achieved historical dialogues and implements a dual-retrieval mechanism that balances intent-based graph traversal with semantic search. This approach enables the system to simultaneously leverage both conversational intent flow patterns and contextual semantics, significantly improving retrieval quality and response quality. In extensive experiments on real-world customer service dialogues, we demonstrated that CID-GraphRAG significantly outperforms both semantic-based and intent-based baselines across automatic metrics, LLM-as-a-Judge evaluations and human evaluations, with relative gains of 11.4% in BLEU, 4.9% in ROUGE, and 5.9% in METEOR. Most notably, CID-GraphRAG achieves a 57.9% improvement in response quality according to LLM-as-a-Judge evaluations. These results demonstrate that integrating intent transition structures with semantic retrieval creates a synergistic effect that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for real-world multi-turn dialogue systems in customer service and other knowledge-intensive domains.

2506.16982 2026-02-10 cs.CL cs.AI cs.LG

Language Bottleneck Models for Qualitative Knowledge State Modeling

Antonin Berthon, Mihaela van der Schaar

详情
英文摘要

Accurately assessing student knowledge is central to education. Cognitive Diagnosis (CD) models estimate student proficiency at a fixed point in time, while Knowledge Tracing (KT) methods model evolving knowledge states to predict future performance. However, existing approaches either provide quantitative concept mastery estimates with limited expressivity (CD, probabilistic KT) or prioritize predictive accuracy at the cost of interpretability (deep learning KT). We propose Language Bottleneck Models (LBMs), where an encoder LLM produces textual knowledge state summaries, which a decoder LLM uses to predict future performance. This produces interpretable summaries that can express nuanced insights--such as misconceptions--that CD and KT models cannot capture. Extensive validation across synthetic and real-world datasets shows LBMs reveal qualitative insights beyond what CD and KT models can capture, while achieving competitive accuracy with improved sample efficiency. We demonstrate that the encoder and decoder can be fine-tuned with reinforcement learning and supervised fine-tuning respectively to improve both summary quality and predictive performance.

2506.11472 2026-02-10 cs.CV cs.LG

Toward Inherently Robust VLMs Against Visual Perception Attacks

Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad, Sebastian Steinhorst, Habeeb Olufowobi, Bing Li, Mert D. Pesé

Comments Accepted to the 2026 IEEE Intelligent Vehicles Symposium (IV 2026)

详情
英文摘要

Autonomous vehicles rely on deep neural networks (DNNs) for traffic sign recognition, lane centering, and vehicle detection, yet these models are vulnerable to attacks that induce misclassification and threaten safety. Existing defenses (e.g., adversarial training) often fail to generalize and degrade clean accuracy. We introduce Vehicle Vision-Language Models (V2LMs), fine-tuned vision-language models specialized for autonomous vehicle perception, and show that they are inherently more robust to unseen attacks without adversarial training, maintaining substantially higher adversarial accuracy than conventional DNNs. We study two deployments: Solo (task-specific V2LMs) and Tandem (a single V2LM for all three tasks). Under attacks, DNNs drop 33-74%, whereas V2LMs decline by under 8% on average. Tandem achieves comparable robustness to Solo while being more memory-efficient. We also explore integrating V2LMs in parallel with existing perception stacks to enhance resilience. Our results suggest V2LMs are a promising path toward secure, robust AV perception.

2506.11389 2026-02-10 cs.CL

Curriculum-Guided Layer Scaling for Language Model Pretraining

Karanpartap Singh, Neil Band, Ehsan Adeli

详情
英文摘要

As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.