arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1795
专题追踪
2512.18718 2026-02-04 cs.CV

Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Linwei Qiu, Gongzhe Li, Xiaozhe Zhang, Qilin Sun, Fengying Xie

Comments AAAI 2026

详情
英文摘要

Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a {Deformation Module}, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.

2512.10398 2026-02-04 cs.CL cs.AI cs.LG cs.SE

Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

Sherman Wong, Zhenting Qi, Zhaodong Wang, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Wenlin Chen, Yilun Du, Minlan Yu, Ying Zhang

Comments The latest version

详情
英文摘要

Real-world software engineering tasks require coding agents that can operate on massive repositories, sustain long-horizon sessions, and reliably coordinate complex toolchains at test time. Existing research-grade coding agents offer transparency but struggle when scaled to heavier, production-level workloads, while production-grade systems achieve strong practical performance but provide limited extensibility, interpretability, and controllability. We introduce the Confucius Code Agent (CCA), a software engineering agent that can operate at large-scale codebases. CCA is built on top of the Confucius SDK, an agent development platform structured around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK supports a unified orchestrator with advanced context management for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension system for reliable tool use. In addition, we introduce a meta-agent that automates the construction, evaluation, and refinement of agents through a build-test-improve cycle, enabling rapid agent development on new tasks and tool stacks. Instantiated on the Confucius SDK using the meta-agent, CCA demonstrates strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a Resolve@1 of 59%, exceeding prior research baselines as well as commercial results, under identical repositories, model backends, and tool access.

2512.09369 2026-02-04 cs.LG cs.CL

Encoder-Free Knowledge-Graph Reasoning with LLMs via Hyperdimensional Path Retrieval

Yezi Liu, William Youngwoo Chung, Hanning Chen, Calvin Yeung, Mohsen Imani

详情
英文摘要

Recent progress in large language models (LLMs) has made knowledge-grounded reasoning increasingly practical, yet KG-based QA systems often pay a steep price in efficiency and transparency. In typical pipelines, symbolic paths are scored by neural encoders or repeatedly re-ranked by multiple LLM calls, which inflates latency and GPU cost and makes the decision process hard to audit. We introduce PathHD, an encoder-free framework for knowledge-graph reasoning that couples hyperdimensional computing (HDC) with a single LLM call per query. Given a query, PathHD represents relation paths as block-diagonal GHRR hypervectors, retrieves candidate paths using a calibrated blockwise cosine similarity with Top-K pruning, and then performs a one-shot LLM adjudication that outputs the final answer together with supporting, citeable paths. The design is enabled by three technical components: (i) an order-sensitive, non-commutative binding operator for composing multi-hop paths, (ii) a robust similarity calibration that stabilizes hypervector retrieval, and (iii) an adjudication stage that preserves interpretability while avoiding per-path LLM scoring. Across WebQSP, CWQ, and GrailQA, PathHD matches or improves Hits@1 compared to strong neural baselines while using only one LLM call per query, reduces end-to-end latency by $40-60\%$, and lowers GPU memory by $3-5\times$ due to encoder-free retrieval. Overall, the results suggest that carefully engineered HDC path representations can serve as an effective substrate for efficient and faithful KG-LLM reasoning, achieving a strong accuracy-efficiency-interpretability trade-off.

2512.04601 2026-02-04 cs.LG cs.CL

Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine

Comments 21 pages, 4 figures

详情
英文摘要

Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.

2512.01834 2026-02-04 cs.LG cs.AI cs.CY

Mitigating Gender Bias in Depression Detection via Counterfactual Inference

Mingxuan Hu, Hongbo Ma, Xinlan Wu, Ziqi Liu, Jiaqi Liu, Yangbin Chen

Comments To be published in CSCWD 2026

详情
英文摘要

Audio-based depression detection models have demonstrated promising performance but often suffer from gender bias due to imbalanced training data. Epidemiological statistics show a higher prevalence of depression in females, leading models to learn spurious correlations between gender and depression. Consequently, models tend to over-diagnose female patients while underperforming on male patients, raising significant fairness concerns. To address this, we propose a novel Counterfactual Debiasing Framework grounded in causal inference. We construct a causal graph to model the decision-making process and identify gender bias as the direct causal effect of gender on the prediction. During inference, we employ counterfactual inference to estimate and subtract this direct effect, ensuring the model relies primarily on authentic acoustic pathological features. Extensive experiments on the DAIC-WOZ dataset using two advanced acoustic backbones demonstrate that our framework not only significantly reduces gender bias but also improves overall detection performance compared to existing debiasing strategies.

2511.22254 2026-02-04 cs.AI

Co-Evolving Agents: Learning from Failures as Hard Negatives

Yeonsung Jung, Trilok Padhi, Sina Shaham, Dipika Khullar, Joonhyun Jeong, Ninareh Mehrabi, Eunho Yang

详情
英文摘要

The rapid progress of large foundation models has accelerated the development of task-specialized agents across diverse domains. However, the effectiveness of agents remains tightly coupled with the quality of training data, while curating task-specific datasets remains costly and often infeasible in real-world scenarios. Recent work has explored self-improving agents that autonomously generate, refine, and re-train on their own trajectories. A prominent line of approaches further leverages preference optimization by pairing predicted trajectories with scarce ground-truth trajectories, enabling agents to learn directly from their own failures. While these methods outperform supervised fine-tuning, their heavy reliance on predicted trajectories under limited ground-truth supervision leaves them prone to overfitting. To address this, we propose a co-evolving agents framework in which a target agent improves jointly with an auxiliary failure agent. The failure agent learns through preference optimization over failure trajectories from both the target and itself, thereby generating hard negatives that are close to success yet remain failures. Incorporating these informative hard negatives into the target agent's optimization sharpens decision boundaries and enhances generalization. Our comprehensive analysis and experiments across benchmark datasets show that our method not only shows improved performance but also demonstrates that failures, instead of being used as-is, can be systematically transformed into structured and valuable learning signals in self-improving agents.

2511.20348 2026-02-04 cs.CV cs.RO

Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin

Andy Huynh, João Malheiro Silva, Holger Caesar, Tong Duy Son

Comments 8 pages, 5 figures. Accepted to IEEE Intelligent Vehicles Symposium (IV) 2026. Revised version (v3) presents camera-ready publication

详情
英文摘要

3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.

2511.18793 2026-02-04 cs.AI cs.LG

NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

Yejing Wang, Shengyu Zhou, Jinyu Lu, Ziwei Liu, Langming Liu, Maolin Wang, Wenlin Zhang, Feng Li, Wenbo Su, Pengjie Wang, Jian Xu, Xiangyu Zhao

详情
英文摘要

Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, their practical application is severely hindered by high inference latency, which makes them infeasible for high-throughput, real-time services and limits their overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically require separate draft models and model-based verifiers, requiring additional training and increasing the latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, driving the billion-level advertising revenue and serving hundreds of millions of daily active users.

2511.11698 2026-02-04 cs.LG

Moirai 2.0: When Less Is More for Time Series Forecasting

Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, Junnan Li

Comments 16 pages, 13 figures, and 1 table

详情
英文摘要

We introduce Moirai 2.0, a decoder-only time-series foundation model trained on a new corpus of 36M series. The model adopts quantile forecasting and multi-token prediction, improving both probabilistic accuracy and inference efficiency. On the Gift-Eval benchmark, it ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. Compared to Moirai 1.0, Moirai 2.0 replaces masked-encoder training, multi-patch inputs, and mixture-distribution outputs with a simpler decoder-only architecture, single patch, and quantile loss. Ablation studies isolate these changes -- showing that the decoder-only backbone along with recursive multi-quantile decoding contribute most to the gains. Additional experiments show that Moirai 2.0 outperforms larger models from the same family and exhibits robust domain-level results. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large, while also performing better. Model performance plateaus with increasing parameter count and declines at longer horizons, motivating future work on data scaling and long-horizon modeling. We release code and evaluation details to support further research.

2511.11460 2026-02-04 cs.CV

Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification

Qinghao Gao, Jiahui Qu, Wenqian Dong

Comments 11 pages, 5 figures

详情
英文摘要

Multimodal remote sensing classification often suffers from missing modalities caused by sensor failures and environmental interference, leading to severe performance degradation. In this work, we rethink missing-modality learning from a conditional computation perspective and investigate whether Mixture-of-Experts (MoE) models can inherently adapt to diverse modality-missing scenarios. We first conduct a systematic study of representative MoE paradigms under various missing-modality settings, revealing both their potential and limitations. Building on these insights, we propose a Missing-aware Mixture-of-LoRAs (MaMOL), a parameter-efficient MoE framework that unifies multiple modality-missing cases within a single model. MaMOL introduces a dual-routing mechanism to decouple modality-invariant shared experts and modality-aware dynamic experts, enabling automatic expert activation conditioned on available modalities. Extensive experiments on multiple remote sensing benchmarks demonstrate that MaMOL significantly improves robustness and generalization under diverse missing-modality scenarios with minimal computational overhead. Transfer experiments on natural image datasets further validate its scalability and cross-domain applicability.

2511.09483 2026-02-04 cs.AI

CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?

Peiyu Li, Xiaobao Huang, Ting Hua, Nitesh V. Chawla

Comments code available at https://github.com/Peiyu-Georgia-Li/crochetBench

详情
英文摘要

While multimodal large language models can describe visual content, their ability to generate executable procedures remains underexplored. CrochetBench presented in this paper evaluates this shift from describing to doing through fine-grained procedural reasoning in crochet: models must recognize stitches, select structurally appropriate instructions, and generate compilable procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers tasks including stitch classification, instruction grounding, and both natural language and image-to-DSL translation. Across all tasks, performance sharply decreases as the evaluation shifts from surface-level similarity to executable correctness, revealing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. Our proposed CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains. Code is available at https://anonymous.4open.science/r/crochet-82E6/README.md.

2511.03938 2026-02-04 cs.LG

LogHD: Robust Compression of Hyperdimensional Classifiers via Logarithmic Class-Axis Reduction

Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Pietro Mercati, Nathaniel D. Bastian, Mohsen Imani

Comments Accepted to DATE 2026

详情
英文摘要

Hyperdimensional computing (HDC) suits memory, energy, and reliability-constrained systems, yet the standard "one prototype per class" design requires $O(CD)$ memory (with $C$ classes and dimensionality $D$). Prior compaction reduces $D$ (feature axis), improving storage/compute but weakening robustness. We introduce LogHD, a logarithmic class-axis reduction that replaces the $C$ per-class prototypes with $n\!\approx\!\lceil\log_k C\rceil$ bundle hypervectors (alphabet size $k$) and decodes in an $n$-dimensional activation space, cutting memory to $O(D\log_k C)$ while preserving $D$. LogHD uses a capacity-aware codebook and profile-based decoding, and composes with feature-axis sparsification. Across datasets and injected bit flips, LogHD attains competitive accuracy with smaller models and higher resilience at matched memory. Under equal memory, it sustains target accuracy at roughly $2.5$-$3.0\times$ higher bit-flip rates than feature-axis compression; an ASIC instantiation delivers $498\times$ energy efficiency and $62.6\times$ speedup over an AMD Ryzen 9 9950X and $24.3\times$/$6.58\times$ over an NVIDIA RTX 4090, and is $4.06\times$ more energy-efficient and $2.19\times$ faster than a feature-axis HDC ASIC baseline.

2511.03911 2026-02-04 cs.LG

DecoHD: Decomposed Hyperdimensional Classification under Extreme Memory Budgets

Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Mohsen Imani

Comments Accepted to DATE 2026

详情
英文摘要

Decomposition is a proven way to shrink deep networks without changing input-output dimensionality or interface semantics. We bring this idea to hyperdimensional computing (HDC), where footprint cuts usually shrink the feature axis and erode concentration and robustness. Prior HDC decompositions decode via fixed atomic hypervectors, which are ill-suited for compressing learned class prototypes. We introduce DecoHD, which learns directly in a decomposed HDC parameterization: a small, shared set of per-layer channels with multiplicative binding across layers and bundling at the end, yielding a large representational space from compact factors. DecoHD compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score; training is end-to-end, and inference remains pure HDC, aligning with in/near-memory accelerators. In evaluation, DecoHD attains extreme memory savings with only minor accuracy degradation under tight deployment budgets. On average it stays within about 0.1-0.15% of a strong non-reduced HDC baseline (worst case 5.7%), is more robust to random bit-flip noise, reaches its accuracy plateau with up to ~97% fewer trainable parameters, and--in hardware--delivers roughly 277x/35x energy/speed gains over a CPU (AMD Ryzen 9 9950X), 13.5x/3.7x over a GPU (NVIDIA RTX 4090), and 2.0x/2.4x over a baseline HDC ASIC.

2511.02280 2026-02-04 cs.CV cs.CL

SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng

详情
英文摘要

We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.

2510.22124 2026-02-04 cs.LG cs.AI

Efficient Utility-Preserving Machine Unlearning with Implicit Gradient Surgery

Shiji Zhou, Tianbai Yu, Zhi Zhang, Heng Chang, Xiao Zhou, Dong Wu, Han Zhao

Comments Corresponding author: Shiji Zhou (zhoushiji25@buaa.edu.cn). Shiji Zhou and Tianbai Yu contributed equally

详情
英文摘要

Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a pre-trained model. The key challenge is to balance the potential tradeoff between unlearning efficacy and utility preservation, which involves forgetting undesirable information as defined while maintaining the model's original performance. One potential way to tackle this problem is to use multi-objective optimization to jointly optimize both the unlearning and utility preservation objectives. However, existing multi-objective methods only guarantee finding a Pareto-optimal solution without fine-grained control, which causes under-optimization of the unlearning objective. To this end, we first model MU as a constrained optimization problem, that is, optimizing the unlearning objective under the constraint of a bounded increase for utility loss. We then show that solving this optimization problem is equivalent to unilateral gradient surgery on the unlearning objective. To resolve the additional computational cost brought by gradient surgery, we propose an implicit gradient surgery method, which approximates the solution to the aforementioned constrained optimization problem via only one backpropagation, thereby achieving efficient utility-preserving MU. Theoretically, we provide a tight convergence analysis of the algorithm. Empirically, our extensive experiments show that the proposed algorithm achieves better tradeoff results than existing baselines. Codes are available at https://github.com/anseryuer/EUPMU-Efficient-Utility-Preserving-Machine-Unlearning.

2510.14009 2026-02-04 cs.LG

Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training

Jie Hao, Xiaochuan Gong, Jie Xu, Zhengdao Wang, Mingrui Liu

详情
英文摘要

Geometry-aware optimization algorithms, such as Muon, have achieved remarkable success in training deep neural networks (DNNs). These methods leverage the underlying geometry of DNNs by selecting appropriate norms for different layers and updating parameters via norm-constrained linear minimization oracles (LMOs). However, even within a group of layers associated with the same norm, the local curvature can be heterogeneous across layers and vary dynamically over the course of training. For example, recent work shows that sharpness varies substantially across transformer layers and throughout training, yet standard geometry-aware optimizers impose fixed learning rates to layers within the same group, which may be inefficient for DNN training. In this paper, we introduce a noise-adaptive layerwise learning rate scheme on top of geometry-aware optimization algorithms and substantially accelerate DNN training compared to methods that use fixed learning rates within each group. Our method estimates gradient variance in the dual norm induced by the chosen LMO on the fly, and uses it to assign time-varying noise-adaptive layerwise learning rates within each group. We provide a theoretical analysis showing that our algorithm achieves a sharp convergence rate. Empirical results on transformer architectures such as LLaMA and GPT demonstrate that our approach achieves faster convergence than state-of-the-art optimizers.

2510.13847 2026-02-04 cs.CL cs.AI cs.LG

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

详情
英文摘要

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens which a large target model verifies once per speculation step. As vocabularies scale past 10e5 tokens,verification cost in the target model is largely unchanged, but the drafter can become bottlenecked by its O(|V|d) output projection. Recent approaches (e.g., FR-Spec, VocabTrim) mitigate this by restricting drafting to a fixed, frequency-ranked shortlist; however, such static truncation is corpus-dependent and suppresses rare or domain-specific tokens, reducing acceptance and limiting speedups. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism for large-vocabulary speculative decoding. DynaSpec trains lightweight meta-classifiers that route each context to a small set of coarse token clusters; the union of the top-selected clusters defines the drafter's shortlist, while the target model still verifies over the full vocabulary, preserving exactness. Systems-wise, routing is overlapped with draft computation via parallel execution streams, reducing end-to-end overhead. Across standard speculative decoding benchmarks, DynaSpec consistently improves mean accepted length-recovering 98.4% of full-vocabulary performance for Llama-3-8B versus 93.6% for fixed-shortlist baselines-and achieves up to a 2.23x throughput gain compared to 1.91x for static approaches on the dataset with rare tokens.

2510.12036 2026-02-04 cs.CL

On the Interplay between Human Label Variation and Model Fairness

Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau

Comments 10 pages, 7 figures. Accepted to EACL Findings 2026

详情
英文摘要

The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness under certain configurations.

2510.11129 2026-02-04 cs.CV cs.AI

video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li, Zejun Ma, Chao Zhang

详情
英文摘要

Long-duration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes over 3-hour videos at 1 FPS and 360p resolution, outperforming strong non-streaming models under the same memory budget. In addition to token merging or downsampling, video-SALMONN S is the first to employ test-time training (TTT) as a streaming memory mechanism for video understanding. TTT continuously transforms short-term multimodal representations into long-term memory embedded in model parameters. To improve long-range dependency modeling and memory capacity, we propose (i) a TTT_MEM layer with an additional long-span prediction objective, (ii) a two-stage training scheme, and (iii) a modality-aware memory reader. We further introduce the Episodic Learning from Video Memory (ELViM) benchmark, simulating agent-like scenarios where models must learn from videos observed hours earlier. video-SALMONN S consistently outperforms both streaming and non-streaming baselines by 3-7% on long video benchmarks. Notably, video-SALMONN S achieves a 15% absolute accuracy improvement over strong non-streaming models on ELViM, demonstrating strong learning abilities from video memory.

2510.10915 2026-02-04 cs.LG cs.AI cs.ET

LPCVAE: A Conditional VAE with Long-Term Dependency and Probabilistic Time-Frequency Fusion for Time Series Anomaly Detection

Hanchang Cheng, Weimin Mu, Fan Liu, Weilin Zhu, Can Ma

详情
英文摘要

Time series anomaly detection(TSAD) is a critical task in signal processing field, ensuring the reliability of complex systems. Reconstruction-based methods dominate in TSAD. Among these methods, VAE-based methods have achieved promising results. Existing VAE-based methods suffer from the limitation of single-window feature and insufficient leveraging of long-term time and frequency information. We propose a Conditional Variational AutoEncoder with Long-term dependency and Probabilistic time-frequency fusion, named LPCVAE. LPCVAE introduces LSTM to capture long-term dependencies beyond windows. It further incorporates a Product-of-Experts (PoE) mechanism for adaptive and distribution-level probabilistic fusion. This design effectively mitigates time-frequency information loss. Extensive experiments on four public datasets demonstrate it outperforms state-of-the-art methods. The results confirm that integrating long-term time and frequency representations with adaptive fusion yields a robust and efficient solution for TSAD.

2510.10706 2026-02-04 cs.LG cs.DM

Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance

Mamoona Ghafoor, Tatsuya Akutsu

详情
英文摘要

The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet Σ, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over Σwith tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.

2510.10000 2026-02-04 cs.LG math.OC stat.ML

Tight Robustness Certificates and Wasserstein Distributional Attacks for Deep Neural Networks

Bach C. Le, Tung V. Dao, Binh T. Nguyen, Hong T. M. Chu

详情
英文摘要

Wasserstein distributionally robust optimization (WDRO) provides a framework for adversarial robustness, yet existing methods based on global Lipschitz continuity or strong duality often yield loose upper bounds or require prohibitive computation. We address these limitations with a primal approach and adopt a notion of exact Lipschitz certificates to tighten this upper bound of WDRO. For ReLU networks, we leverage the piecewise-affine structure on activation cells to obtain an exact tractable characterization of the corresponding WDRO problem. We further extend our analysis to modern architectures with smooth activations (e.g., GELU, SiLU), such as Transformers. Additionally, we propose novel Wasserstein Distributional Attacks (WDA, WDA++) that construct candidates for the worst-case distribution. Compared to existing attacks that are restricted to point-wise perturbations, our methods offer greater flexibility in the number and location of attack points. Extensive evaluations demonstrate that our proposed framework achieves competitive robust accuracy against state-of-the-art baselines while offering tighter certificates than existing methods. Our code is available at https://github.com/OLab-Repo/WDA.

2510.07707 2026-02-04 cs.CL cs.AI cs.LG

Causality Guided Representation Learning for Cross-Style Hate Speech Detection

Chengshuai Zhao, Shu Wan, Paras Sheth, Karan Patwa, K. Selçuk Candan, Huan Liu

Comments Accepted by the ACM Web Conference 2026 (WWW 26)

详情
英文摘要

The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language -- making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.

2510.07459 2026-02-04 cs.LG cs.AI

MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting

Gilad Aviv, Jacob Goldberger, Yoli Shavit

详情
英文摘要

We introduce Mixture-of-Gaussians with Uncertainty-based Gating (MoGU), a novel Mixture-of-Experts (MoE) framework designed for regression tasks. MoGU replaces standard learned gating with an intrinsic routing paradigm where expert-specific uncertainty serves as the native gating signal. By modeling each prediction as a Gaussian distribution, the system utilizes predicted variance to dynamically weight expert contributions. We validate MoGU on multivariate time-series forecasting, a domain defined by high volatility and varying noise patterns. Empirical results across multiple benchmarks, horizon lengths, and backbones demonstrate that MoGU consistently improves forecasting accuracy compared to traditional MoE. Further evaluation via conformal prediction indicates that our approach yields more efficient prediction intervals than existing baselines. These findings highlight MoGU's capacity for providing both competitive performance and reliable, high-fidelity uncertainty quantification. Our code is available at: https://github.com/yolish/moe_unc_tsf

2510.06548 2026-02-04 cs.CL cs.LG

Reusing Overtrained Language Models Saturates Scaling

Seng Pei Liew, Takuya Kato

详情
英文摘要

Reusing pretrained base models for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, the effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling properties of model reuse and find that the scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage training tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. Such saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a base model is pretrained, the less benefit additional pretraining provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.

2510.04838 2026-02-04 cs.CV cs.LG

Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation

Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, Ke Qin

Comments Accepted by NeurIPS 2025

详情
英文摘要

The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training dataset while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages-early, middle, and late-making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while saving 63% memory cost.

2510.02125 2026-02-04 cs.AI cs.CL

Do AI Models Perform Human-like Abstract Reasoning Across Modalities?

Claas Beger, Ryan Yi, Shuhao Fu, Kaleda Denton, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell

Comments 9 pages, 3 figures

详情
英文摘要

OpenAI's o3-preview reasoning model exceeded human accuracy on the ARC-AGI-1 benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions the benchmark was designed to test? Here we investigate abstraction abilities of AI models using the closely related but simpler ConceptARC benchmark. Our evaluations vary input modality (textual vs. visual), use of external Python tools, and reasoning effort. Beyond output accuracy, we evaluate the natural-language rules that models generate to explain their solutions, enabling us to assess whether models recognize the abstractions that ConceptARC was designed to elicit. We show that the best models' rules are frequently based on surface-level ``shortcuts,'' capturing intended abstractions considerably less often than humans. In the visual modality, AI models' output accuracy drops sharply; however, our rule-level analysis reveals that a substantial share of their rules capture the intended abstractions, even as the models struggle to apply these concepts to generate correct solutions. In short, we show that using accuracy alone to evaluate abstract reasoning can substantially overestimate AI capabilities in textual modalities and underestimate it in visual modalities. Our results offer a more faithful picture of AI models' abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.

2510.00457 2026-02-04 cs.LG cs.AI cs.CE

UrbanGraph: Physics-Informed Spatio-Temporal Dynamic Heterogeneous Graphs for Urban Microclimate Prediction

Weilin Xin, Chenyu Huang, Peilin Li, Jing Zhong, Jiawei Yao

详情
英文摘要

With rapid urbanization, predicting urban microclimates has become critical, as it affects building energy demand and public health risks. However, existing generative and homogeneous graph approaches fall short in capturing physical consistency, spatial dependencies, and temporal variability. To address this, we introduce UrbanGraph, a framework founded on a novel structure-based inductive bias. Unlike implicit graph learning, UrbanGraph transforms physical first principles into a dynamic causal topology, explicitly encoding time-varying causalities (e.g., shading and convection) directly into the graph structure to ensure physical consistency and data efficiency. Results show that UrbanGraph achieves state-of-the-art performance across all baselines. Specifically, the use of explicit causal pruning significantly reduces the model's floating-point operations (FLOPs) by 73.8% and increases training speed by 21% compared to implicit graphs. Our contribution includes the first high-resolution benchmark for spatio-temporal microclimate modeling, and a generalizable explicit topological encoding paradigm applicable to urban spatio-temporal dynamics governed by known physical equations.

2510.00294 2026-02-04 cs.LG cs.AI

Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

Shutong Wu, Jiawei Zhang

详情
英文摘要

Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient inference with parallel decoding algorithms, which enable multi-token prediction. However, the high generation quality often requires the number of decoding steps equal to the sequence length, which performs a one-token-per-step decoding, and existing parallel decoding algorithms, which yield suboptimal decoding paths, bring inference speedup at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for DLLMs that achieves lossless parallel decoding without any model modification or extra modules. Specifically, we propose an algorithm of parallel-decoded candidate generation and verification, which is theoretically guaranteed to use the fewest model forward calls to reproduce the same sequence generated by one-token-per-step decoding. By extensive evaluations on math reasoning and code generation benchmarks across different DLLMs, FreeDave is proven to accelerate the inference up to $2.83\times$ without performance degradation.

2509.26096 2026-02-04 cs.CV cs.IT cs.LG math.IT math.OC stat.ML

EVODiff: Entropy-aware Variance Optimized Diffusion Inference

Shigui Li, Wei Chen, Delu Zeng

Comments NeurIPS 2025, 41 pages, 14 figures

详情
英文摘要

Diffusion models (DMs) excel in image generation but suffer from slow inference and training-inference discrepancies. Although gradient-based solvers for DMs accelerate denoising inference, they often lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5\% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25\% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.