arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1795
专题追踪
2602.02636 2026-02-04 cs.CL cs.AI cs.IR

WideSeek: Advancing Wide Research via Multi-Agent Scaling

Ziyang Huang, Haolin Ren, Xiaowei Yuan, Jiawei Wang, Zhongtao Jiang, Kun Xu, Shizhu He, Jun Zhao, Kang Liu

详情
英文摘要

Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

2602.02635 2026-02-04 cs.CL

Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management

Siyu Li, Chenwei Song, Qi Zhou, Wan Zhou, Xinyi Liu

详情
英文摘要

This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.

2602.02634 2026-02-04 cs.LG

A Reduction from Delayed to Immediate Feedback for Online Convex Optimization with Improved Guarantees

Alexander Ryabchenko, Idan Attias, Daniel M. Roy

详情
英文摘要

We develop a reduction-based framework for online learning with delayed feedback that recovers and improves upon existing results for both first-order and bandit convex optimization. Our approach introduces a continuous-time model under which regret decomposes into a delay-independent learning term and a delay-induced drift term, yielding a delay-adaptive reduction that converts any algorithm for online linear optimization into one that handles round-dependent delays. For bandit convex optimization, we significantly improve existing regret bounds, with delay-dependent terms matching state-of-the-art first-order rates. For first-order feedback, we recover state-of-the-art regret bounds via a simpler, unified analysis. Quantitatively, for bandit convex optimization we obtain $O(\sqrt{d_{\text{tot}}} + T^{\frac{3}{4}}\sqrt{k})$ regret, improving the delay-dependent term from $O(\min\{\sqrt{T d_{\text{max}}},(Td_{\text{tot}})^{\frac{1}{3}}\})$ in previous work to $O(\sqrt{d_{\text{tot}}})$. Here, $k$, $T$, $d_{\text{max}}$, and $d_{\text{tot}}$ denote the dimension, time horizon, maximum delay, and total delay, respectively. Under strong convexity, we achieve $O(\min\{σ_{\text{max}} \ln T, \sqrt{d_{\text{tot}}}\} + (T^2\ln T)^{\frac{1}{3}} {k}^{\frac{2}{3}})$, improving the delay-dependent term from $O(d_{\text{max}} \ln T)$ in previous work to $O(\min\{σ_{\text{max}} \ln T, \sqrt{d_{\text{tot}}}\})$, where $σ_{\text{max}}$ denotes the maximum number of outstanding observations and may be considerably smaller than $d_{\text{max}}$.

2602.02626 2026-02-04 cs.LG stat.ML

Learning Better Certified Models from Empirically-Robust Teachers

Alessandro De Palma

详情
英文摘要

Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes directly train on bounds from network relaxations to obtain models that are certifiably robust, but display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, differently from empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state-of-the-art in certified training for ReLU networks across a series of robust computer vision benchmarks.

2602.02623 2026-02-04 cs.LG cs.AI eess.SP

Learning Consistent Causal Abstraction Networks

Gabriele D'Acunto, Paolo Di Lorenzo, Sergio Barbarossa

Comments To be published in the proceedings of ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv admin note: substantial text overlap with arXiv:2509.25236

详情
英文摘要

Causal artificial intelligence aims to enhance explainability, trustworthiness, and robustness in AI by leveraging structural causal models (SCMs). In this pursuit, recent advances formalize network sheaves and cosheaves of causal knowledge. Pushing in the same direction, we tackle the learning of consistent causal abstraction network (CAN), a sheaf-theoretic framework where (i) SCMs are Gaussian, (ii) restriction maps are transposes of constructive linear causal abstractions (CAs) adhering to the semantic embedding principle, and (iii) edge stalks correspond--up to permutation--to the node stalks of more detailed SCMs. Our problem formulation separates into edge-specific local Riemannian problems and avoids nonconvex objectives. We propose an efficient search procedure, solving the local problems with SPECTRAL, our iterative method with closed-form updates and suitable for positive definite and semidefinite covariance matrices. Experiments on synthetic data show competitive performance in the CA learning task, and successful recovery of diverse CAN structures.

2602.02618 2026-02-04 cs.LG cs.AI

A Semi-Supervised Pipeline for Generalized Behavior Discovery from Animal-Borne Motion Time Series

Fatemeh Karimi Nejadasl, Judy Shamoun-Baranes, Eldar Rakhimberdiev

详情
英文摘要

Learning behavioral taxonomies from animal-borne sensors is challenging because labels are scarce, classes are highly imbalanced, and behaviors may be absent from the annotated set. We study generalized behavior discovery in short multivariate motion snippets from gulls, where each sample is a sequence with 3-axis IMU acceleration (20 Hz) and GPS speed, spanning nine expert-annotated behavior categories. We propose a semi-supervised discovery pipeline that (i) learns an embedding function from the labeled subset, (ii) performs label-guided clustering over embeddings of both labeled and unlabeled samples to form candidate behavior groups, and (iii) decides whether a discovered group is truly novel using a containment score. Our key contribution is a KDE + HDR (highest-density region) containment score that measures how much a discovered cluster distribution is contained within, or contains, each known-class distribution; the best-match containment score serves as an interpretable novelty statistic. In experiments where an entire behavior is withheld from supervision and appears only in the unlabeled pool, the method recovers a distinct cluster and the containment score flags novelty via low overlap, while a negative-control setting with no novel behavior yields consistently higher overlaps. These results suggest that HDR-based containment provides a practical, quantitative test for generalized class discovery in ecological motion time series under limited annotation and severe class imbalance.

2602.02611 2026-02-04 cs.LG cs.AI

Discovering Data Manifold Geometry via Non-Contracting Flows

David Vigouroux, Lucas Drumetz, Ronan Fablet, François Rousseau

详情
英文摘要

We introduce an unsupervised approach for constructing a global reference system by learning, in the ambient space, vector fields that span the tangent spaces of an unknown data manifold. In contrast to isometric objectives, which implicitly assume manifold flatness, our method learns tangent vector fields whose flows transport all samples to a common, learnable reference point. The resulting arc-lengths along these flows define interpretable intrinsic coordinates tied to a shared global frame. To prevent degenerate collapse, we enforce a non-shrinking constraint and derive a scalable, integration-free objective inspired by flow matching. Within our theoretical framework, we prove that minimizing the proposed objective recovers a global coordinate chart when one exists. Empirically, we obtain correct tangent alignment and coherent global coordinate structure on synthetic manifolds. We also demonstrate the scalability of our method on CIFAR-10, where the learned coordinates achieve competitive downstream classification performance.

2602.02597 2026-02-04 cs.LG cs.AI

ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

Hongyuan Su, Yu Zheng, Yong Li

详情
英文摘要

Large language models are transforming systems research by automating the discovery of performance-critical algorithms for computer systems. Despite plausible codes generated by LLMs, producing solutions that meet the stringent correctness and performance requirements of systems demands iterative optimization. Test-time reinforcement learning offers high search efficiency but requires parameter updates infeasible under API-only access, while existing training-free evolutionary methods suffer from inefficient context utilization and undirected search. We introduce ContextEvolve, a multi-agent framework that achieves RL-level search efficiency under strict parameter-blind constraints by decomposing optimization context into three orthogonal dimensions: a Summarizer Agent condenses semantic state via code-to-language abstraction, a Navigator Agent distills optimization direction from trajectory analysis, and a Sampler Agent curates experience distribution through prioritized exemplar retrieval. This orchestration forms a functional isomorphism with RL-mapping to state representation, policy gradient, and experience replay-enabling principled optimization in a textual latent space. On the ADRS benchmark, ContextEvolve outperforms state-of-the-art baselines by 33.3% while reducing token consumption by 29.0%. Codes for our work are released at https://anonymous.4open.science/r/ContextEvolve-ACC

2602.02596 2026-02-04 cs.LG stat.ML

Fubini Study geometry of representation drift in high dimensional data

Arturo Tozzi

Comments 8 pages, 1 figure

详情
英文摘要

High dimensional representation drift is commonly quantified using Euclidean or cosine distances, which presuppose fixed coordinates when comparing representations across time, training or preprocessing stages. While effective in many settings, these measures entangle intrinsic changes in the data with variations induced by arbitrary parametrizations. We introduce a projective geometric view of representation drift grounded in the Fubini Study metric, which identifies representations that differ only by gauge transformations such as global rescalings or sign flips. Applying this framework to empirical high dimensional datasets, we explicitly construct representation trajectories and track their evolution through cumulative geometric drift. Comparing Euclidean, cosine and Fubini Study distances along these trajectories reveals that conventional metrics systematically overestimate change whenever representations carry genuine projective ambiguity. By contrast, the Fubini Study metric isolates intrinsic evolution by remaining invariant under gauge-induced fluctuations. We further show that the difference between cosine and Fubini Study drift defines a computable, monotone quantity that directly captures representation churn attributable to gauge freedom. This separation provides a diagnostic for distinguishing meaningful structural evolution from parametrization artifacts, without introducing model-specific assumptions. Overall, we establish a geometric criterion for assessing representation stability in high-dimensional systems and clarify the limits of angular distances. Embedding representation dynamics in projective space connects data analysis with established geometric programs and yields observables that are directly testable in empirical workflows.

2602.02593 2026-02-04 cs.LG cs.AI math.OC

Effective Frontiers: A Unification of Neural Scaling Laws

Jiaxuan Zou, Zixuan Gong, Ye Su, Huayi Tang, Yong Liu

详情
英文摘要

Neural scaling laws govern the prediction power-law improvement of test loss with respect to model capacity ($N$), datasize ($D$), and compute ($C$). However, existing theoretical explanations often rely on specific architectures or complex kernel methods, lacking intuitive universality. In this paper, we propose a unified framework that abstracts general learning tasks as the progressive coverage of patterns from a long-tail (Zipfian) distribution. We introduce the Effective Frontier ($k_\star$), a threshold in the pattern rank space that separates learned knowledge from the unlearned tail. We prove that reducible loss is asymptotically determined by the probability mass of the tail a resource-dependent frontier truncation. Based on our framework, we derive the precise scaling laws for $N$, $D$, and $C$, attributing them to capacity, coverage, and optimization bottlenecks, respectively. Furthermore, we unify these mechanisms via a Max-Bottleneck principle, demonstrating that the Kaplan and Chinchilla scaling laws are not contradictory, but equilibrium solutions to the same constrained optimization problem under different active bottlenecks.

2602.02591 2026-02-04 cs.SD cs.AI

VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

Chengyuan Ma, Jiawei Jin, Ruijie Xiong, Chunxiang Jin, Canxiang Yan, Wenming Yang

Comments Accepted by ICASSP 2026

详情
英文摘要

We introduce and define a novel task-Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.

2602.02590 2026-02-04 cs.RO

StepNav: Structured Trajectory Priors for Efficient and Multimodal Visual Navigation

Xubo Luo, Aodi Wu, Haodong Han, Xue Wan, Wei Zhang, Leizheng Shu, Ruisuo Wang

Comments 8 pages, 7 figures; Accepted by ICRA 2026

详情
英文摘要

Visual navigation is fundamental to autonomous systems, yet generating reliable trajectories in cluttered and uncertain environments remains a core challenge. Recent generative models promise end-to-end synthesis, but their reliance on unstructured noise priors often yields unsafe, inefficient, or unimodal plans that cannot meet real-time requirements. We propose StepNav, a novel framework that bridges this gap by introducing structured, multimodal trajectory priors derived from variational principles. StepNav first learns a geometry-aware success probability field to identify all feasible navigation corridors. These corridors are then used to construct an explicit, multi-modal mixture prior that initializes a conditional flow-matching process. This refinement is formulated as an optimal control problem with explicit smoothness and safety regularization. By replacing unstructured noise with physically-grounded candidates, StepNav generates safer and more efficient plans in significantly fewer steps. Experiments in both simulation and real-world benchmarks demonstrate consistent improvements in robustness, efficiency, and safety over state-of-the-art generative planners, advancing reliable trajectory generation for practical autonomous navigation. The code has been released at https://github.com/LuoXubo/StepNav.

2602.02589 2026-02-04 cs.AI cs.LG

PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review

Yanki Margalit, Erni Avram, Ran Taig, Oded Margalit, Nurit Cohen-Inger

详情
英文摘要

Evaluating large language models typically relies on human-authored benchmarks, reference answers, and human or single-model judgments, approaches that scale poorly, become quickly outdated, and mismatch open-world deployments that depend on web retrieval and synthesis. We introduce PeerRank, a fully autonomous end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses and aggregate dense peer assessments into relative performance estimates, without human supervision or gold references. PeerRank treats evaluation as a multi-agent process where each model participates symmetrically as task designer, respondent, and evaluator, while removing biased judgments. In a large-scale study over 12 commercially available models and 420 autonomously generated questions, PeerRank produces stable, discriminative rankings and reveals measurable identity and presentation biases. Rankings are robust, and mean peer scores agree with Elo. We further validate PeerRank on TruthfulQA and GSM8K, where peer scores correlate with objective accuracy. Together, these results suggest that bias-aware peer evaluation with selective web-grounded answering can scale open-world LLM assessment beyond static and human curated benchmarks.

2602.02583 2026-02-04 cs.LG stat.AP

Copula-Based Aggregation and Context-Aware Conformal Prediction for Reliable Renewable Energy Forecasting

Alireza Moradi, Mathieu Tanneau, Reza Zandehshahvar, Pascal Van Hentenryck

详情
英文摘要

The rapid growth of renewable energy penetration has intensified the need for reliable probabilistic forecasts to support grid operations at aggregated (fleet or system) levels. In practice, however, system operators often lack access to fleet-level probabilistic models and instead rely on site-level forecasts produced by heterogeneous third-party providers. Constructing coherent and calibrated fleet-level probabilistic forecasts from such inputs remains challenging due to complex cross-site dependencies and aggregation-induced miscalibration. This paper proposes a calibrated probabilistic aggregation framework that directly converts site-level probabilistic forecasts into reliable fleet-level forecasts in settings where system-level models cannot be trained or maintained. The framework integrates copula-based dependence modeling to capture cross-site correlations with Context-Aware Conformal Prediction (CACP) to correct miscalibration at the aggregated level. This combination enables dependence-aware aggregation while providing valid coverage and maintaining sharp prediction intervals. Experiments on large-scale solar generation datasets from MISO, ERCOT, and SPP demonstrate that the proposed Copula+CACP approach consistently achieves near-nominal coverage with significantly sharper intervals than uncalibrated aggregation baselines.

2602.02582 2026-02-04 cs.AI cs.CL cs.CY cs.IR cs.LG cs.SE

Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems

Chandan Kumar Sah, Xiaoli Lian, Li Zhang, Tony Xu, Syed Shazaib Shah

Comments Accepted at the Second Conference of the International Association for Safe and Ethical Artificial Intelligence, IASEAI26, 14 pages

详情
英文摘要

Large language models (LLMs) enable powerful zero-shot recommendations by leveraging broad contextual knowledge, yet predictive uncertainty and embedded biases threaten reliability and fairness. This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of LLM-generated recommendations. We introduce a benchmark of curated metrics and a dataset annotated for eight demographic attributes (31 categorical values) across two domains: movies and music. Through in-depth case studies, we quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind's Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes; measured similarity-based gaps are SNSR at 0.1363 and SNSV at 0.0507. These disparities persist under prompt perturbations such as typographical errors and multilingual inputs. We further integrate personality-aware fairness into the RecLLM evaluation pipeline to reveal personality-linked bias patterns and expose trade-offs between personalization and group fairness. We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark that advances explainability and equity in LLM recommendations. Together, these contributions establish a foundation for safer, more interpretable RecLLMs and motivate future work on multi-model benchmarks and adaptive calibration for trustworthy deployment.

2602.02581 2026-02-04 cs.LG cs.AI

QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

Nan Zhang, Eugene Kwek, Yusen Zhang, Muyu Pan, Suhang Wang, Prasenjit Mitra, Rui Zhang

详情
英文摘要

Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by the spirit of classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term "protecting both ends". Upon hypothesis validation, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. We fit simple restricted quadratic functions on weight updates to protect both ends. By multiplying the average quadratic values with the count of zero weight updates of channels, we compute channel importance that is more effective than using activation or second-order information. We run QuantLRM to quantize various fine-tuned models (including supervised, direct preference optimization, and reinforcement learning fine-tuning) over four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond) and empirically find that QuantLRM delivers a consistent improvement for LRMs quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. Also supporting non-fine-tuned LRMs, QuantLRM gathers effective signals via pseudo-fine-tuning, which greatly enhances its applicability.

2602.02573 2026-02-04 cs.LG cs.AI

Product Interaction: An Algebraic Formalism for Deep Learning Architectures

Haonan Dong, Chun-Wun Cheng, Angelica I. Aviles-Rivero

详情
英文摘要

In this paper, we introduce product interactions, an algebraic formalism in which neural network layers are constructed from compositions of a multiplication operator defined over suitable algebras. Product interactions provide a principled way to generate and organize algebraic expressions by increasing interaction order. Our central observation is that algebraic expressions in modern neural networks admit a unified construction in terms of linear, quadratic, and higher-order product interactions. Convolutional and equivariant networks arise as symmetry-constrained linear product interactions, while attention and Mamba correspond to higher-order product interactions.

2602.02571 2026-02-04 cs.LG cs.AI cs.CV

Trajectory Consistency for One-Step Generation on Euler Mean Flows

Zhiqi Li, Yuchen Sun, Duowen Chen, Jinjin He, Bo Zhu

Comments 40 pages, 27 figures

详情
英文摘要

We propose \emph{Euler Mean Flows (EMF)}, a flow-based generative framework for one-step and few-step generation that enforces long-range trajectory consistency with minimal sampling cost. The key idea of EMF is to replace the trajectory consistency constraint, which is difficult to supervise and optimize over long time scales, with a principled linear surrogate that enables direct data supervision for long-horizon flow-map compositions. We derive this approximation from the semigroup formulation of flow-based models and show that, under mild regularity assumptions, it faithfully approximates the original consistency objective while being substantially easier to optimize. This formulation leads to a unified, JVP-free training framework that supports both $u$-prediction and $x_1$-prediction variants, avoiding explicit Jacobian computations and significantly reducing memory and computational overhead. Experiments on image synthesis, particle-based geometry generation, and functional generation demonstrate improved optimization stability and sample quality under fixed sampling budgets, together with approximately $50\%$ reductions in training time and memory consumption compared to existing one-step methods for image generation.

2602.02568 2026-02-04 cs.LG

Mitigating Task-Order Sensitivity and Forgetting via Hierarchical Second-Order Consolidation

Protik Nag, Krishnan Raghavan, Vignesh Narayanan

Comments 21 pages, 8 figures

详情
英文摘要

We introduce $\textbf{Hierarchical Taylor Series-based Continual Learning (HTCL)}$, a framework that couples fast local adaptation with conservative, second-order global consolidation to address the high variance introduced by random task ordering. To address task-order effects, HTCL identifies the best intra-group task sequence and integrates the resulting local updates through a Hessian-regularized Taylor expansion, yielding a consolidation step with theoretical guarantees. The approach naturally extends to an $L$-level hierarchy, enabling multiscale knowledge integration in a manner not supported by conventional single-level CL systems. Across a wide range of datasets and replay and regularization baselines, HTCL acts as a model-agnostic consolidation layer that consistently enhances performance, yielding mean accuracy gains of $7\%$ to $25\%$ while reducing the standard deviation of final accuracy by up to $68\%$ across random task permutations.

2602.02567 2026-02-04 cs.LG cs.AI eess.IV

IceBench-S2S: A Benchmark of Deep Learning for Challenging Subseasonal-to-Seasonal Daily Arctic Sea Ice Forecasting in Deep Latent Space

Jingyi Xu, Shengnan Wang, Weidong Yang, Siwei Tu, Lei Bai, Ben Fei

Comments 9 pages, 6 figures

详情
英文摘要

Arctic sea ice plays a critical role in regulating Earth's climate system, significantly influencing polar ecological stability and human activities in coastal regions. Recent advances in artificial intelligence have facilitated the development of skillful pan-Arctic sea ice forecasting systems, where data-driven approaches showcase tremendous potential to outperform conventional physics-based numerical models in terms of accuracy, computational efficiency and forecasting lead times. Despite the latest progress made by deep learning (DL) forecasting models, most of their skillful forecasting lead times are confined to daily subseasonal scale and monthly averaged values for up to six months, which drastically hinders their deployment for real-world applications, e.g., maritime routine planning for Arctic transportation and scientific investigation. Extending daily forecasts from subseasonal to seasonal (S2S) scale is scientifically crucial for operational applications. To bridge the gap between the forecasting lead time of current DL models and the significant daily S2S scale, we introduce IceBench-S2S, the first comprehensive benchmark for evaluating DL approaches in mitigating the challenge of forecasting Arctic sea ice concentration in successive 180-day periods. It proposes a generalized framework that first compresses spatial features of daily sea ice data into a deep latent space. The temporally concatenated deep features are subsequently modeled by DL-based forecasting backbones to predict the sea ice variation at S2S scale. IceBench-S2S provides a unified training and evaluation pipeline for different backbones, along with practical guidance for model selection in polar environmental monitoring tasks.

2602.02566 2026-02-04 cs.LG cs.AI cs.CY

A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City

Samin Semsar, Kiran Laxmikant Prabhu, Gabriella Waters, James Foulds

Comments 36 pages, 27 figures

详情
英文摘要

There are ongoing discussions about predictive policing systems, such as those deployed in Los Angeles, California and Baltimore, Maryland, being unfair, for example, by exhibiting racial bias. Studies found that unfairness may be due to feedback loops and being trained on historically biased recorded data. However, comparative studies on predictive policing systems are few and are not sufficiently comprehensive. In this work, we perform a comprehensive comparative simulation study on the fairness and accuracy of predictive policing technologies in Baltimore. Our results suggest that the situation around bias in predictive policing is more complex than was previously assumed. While predictive policing exhibited bias due to feedback loops as was previously reported, we found that the traditional alternative, hot spots policing, had similar issues. Predictive policing was found to be more fair and accurate than hot spots policing in the short term, although it amplified bias faster, suggesting the potential for worse long-run behavior. In Baltimore, in some cases the bias in these systems tended toward over-policing in White neighborhoods, unlike in previous studies. Overall, this work demonstrates a methodology for city-specific evaluation and behavioral-tendency comparison of predictive policing systems, showing how such simulations can reveal inequities and long-term tendencies.

2602.02565 2026-02-04 cs.LG cs.AI

High Rank Matrix Completion via Grassmannian Proxy Fusion

Huanran Li, Jeremy Johnson, Daniel Pimentel-Alarcón

详情
英文摘要

This paper approaches high-rank matrix completion (HRMC) by filling missing entries in a data matrix where columns lie near a union of subspaces, clustering these columns, and identifying the underlying subspaces. Current methods often lack theoretical support, produce uninterpretable results, and require more samples than theoretically necessary. We propose clustering incomplete vectors by grouping proxy subspaces and minimizing two criteria over the Grassmannian: (a) the chordal distance between each point and its corresponding subspace and (b) the geodesic distances between subspaces of all data points. Experiments on synthetic and real datasets demonstrate that our method performs comparably to leading methods in high sampling rates and significantly better in low sampling rates, thus narrowing the gap to the theoretical sampling limit of HRMC.

2602.02564 2026-02-04 cs.LG cs.MA

Label Curation Using Agentic AI

Subhodeep Ghosh, Bayan Divaaniaazar, Md Ishat-E-Rabban, Spencer Clarke, Senjuti Basu Roy

详情
英文摘要

Data annotation is essential for supervised learning, yet producing accurate, unbiased, and scalable labels remains challenging as datasets grow in size and modality. Traditional human-centric pipelines are costly, slow, and prone to annotator variability, motivating reliability-aware automated annotation. We present AURA (Agentic AI for Unified Reliability Modeling and Annotation Aggregation), an agentic AI framework for large-scale, multi-modal data annotation. AURA coordinates multiple AI agents to generate and validate labels without requiring ground truth. At its core, AURA adapts a classical probabilistic model that jointly infers latent true labels and annotator reliability via confusion matrices, using Expectation-Maximization to reconcile conflicting annotations and aggregate noisy predictions. Across the four benchmark datasets evaluated, AURA achieves accuracy improvements of up to 5.8% over baseline. In more challenging settings with poor quality annotators, the improvement is up to 50% over baseline. AURA also accurately estimates the reliability of annotators, allowing assessment of annotator quality even without any pre-validation steps.

2602.02563 2026-02-04 cs.LG

A General ReLearner: Empowering Spatiotemporal Prediction by Re-learning Input-label Residual

Jiaming Ma, Binwu Wang, Pengkun Wang, Xu Wang, Zhengyang Zhou, Yang Wang

详情
英文摘要

Prevailing spatiotemporal prediction models typically operate under a forward (unidirectional) learning paradigm, in which models extract spatiotemporal features from historical observation input and map them to target spatiotemporal space for future forecasting (label). However, these models frequently exhibit suboptimal performance when spatiotemporal discrepancies exist between inputs and labels, for instance, when nodes with similar time-series inputs manifest distinct future labels, or vice versa. To address this limitation, we propose explicitly incorporating label features during the training phase. Specifically, we introduce the Spatiotemporal Residual Theorem, which generalizes the conventional unidirectional spatiotemporal prediction paradigm into a bidirectional learning framework. Building upon this theoretical foundation, we design an universal module, termed ReLearner, which seamlessly augments Spatiotemporal Neural Networks (STNNs) with a bidirectional learning capability via an auxiliary inverse learning process. In this process, the model relearns the spatiotemporal feature residuals between input data and future data. The proposed ReLearner comprises two critical components: (1) a Residual Learning Module, designed to effectively disentangle spatiotemporal feature discrepancies between input and label representations; and (2) a Residual Smoothing Module, employed to smooth residual terms and facilitate stable convergence. Extensive experiments conducted on 11 real-world datasets across 14 backbone models demonstrate that ReLearner significantly enhances the predictive performance of existing STNNs.Our code is available on GitHub.

2602.02559 2026-02-04 cs.AI cs.CV cs.LG cs.MA

Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

Pengyu Dai, Weihao Xuan, Junjue Wang, Hongruixuan Chen, Jian Song, Yafei Ou, Naoto Yokoya

Comments 21 pages, 6 figures

详情
英文摘要

Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce \textbf{GeoEvolver}, a self-evolving multi-agent system~(MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12\% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.

2602.02558 2026-02-04 cs.LG cs.AI q-bio.QM

PA-MIL: Phenotype-Aware Multiple Instance Learning Guided by Language Prompting and Genotype-to-Phenotype Relationships

Zekang Yang, Hong Liu, Xiangdong Wang

详情
英文摘要

Deep learning has been extensively researched in the analysis of pathology whole-slide images (WSIs). However, most existing methods are limited to providing prediction interpretability by locating the model's salient areas in a post-hoc manner, failing to offer more reliable and accountable explanations. In this work, we propose Phenotype-Aware Multiple Instance Learning (PA-MIL), a novel ante-hoc interpretable framework that identifies cancer-related phenotypes from WSIs and utilizes them for cancer subtyping. To facilitate PA-MIL in learning phenotype-aware features, we 1) construct a phenotype knowledge base containing cancer-related phenotypes and their associated genotypes. 2) utilize the morphological descriptions of phenotypes as language prompting to aggregate phenotype-related features. 3) devise the Genotype-to-Phenotype Neural Network (GP-NN) grounded in genotype-to-phenotype relationships, which provides multi-level guidance for PA-MIL. Experimental results on multiple datasets demonstrate that PA-MIL achieves competitive performance compared to existing MIL methods while offering improved interpretability. PA-MIL leverages phenotype saliency as evidence and, using a linear classifier, achieves competitive results compared to state-of-the-art methods. Additionally, we thoroughly analyze the genotype-phenotype relationships, as well as cohort-level and case-level interpretability, demonstrating the reliability and accountability of PA-MIL.

2602.02554 2026-02-04 cs.LG cs.AI cs.SE

BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

Jingwen Xu, Yiyang Lu, Zisu Huang, Changze Lv, Xiaohua Wang, Shizheng Li, Zhibo Xu, Zhengkang Guo, Zhengyuan Wang, Muzhao Tian, Xuanjing Huang, Xiaoqing Zheng

详情
英文摘要

Training LLMs for code-related tasks typically depends on high-quality code-documentation pairs, which are costly to curate and often scarce for niche programming languages. We introduce BatCoder, a self-supervised reinforcement learning framework designed to jointly optimize code generation and documentation production. BatCoder employs a back-translation strategy: a documentation is first generated from code, and then the generated documentation is used to reconstruct the original code. The semantic similarity between the original and reconstructed code serves as an implicit reward, enabling reinforcement learning to improve the model's performance both in generating code from documentation and vice versa. This approach allows models to be trained using only code, substantially increasing the available training examples. Evaluated on HumanEval and MBPP with a 7B model, BatCoder achieved 83.5% and 81.0% pass@1, outperforming strong open-source baselines. Moreover, the framework demonstrates consistent scaling with respect to both training corpus size and model capacity.

2602.02551 2026-02-04 cs.LG cs.AI cs.CV

EEO-TFV: Escape-Explore Optimizer for Web-Scale Time-Series Forecasting and Vision Analysis

Hua Wang, Jinghao Lu, Fan Zhang

Comments Main paper: 12 pages

详情
英文摘要

Transformer-based foundation models have achieved remarkable progress in tasks such as time-series forecasting and image segmentation. However, they frequently suffer from error accumulation in multivariate long-sequence prediction and exhibit vulnerability to out-of-distribution samples in image-related tasks. Furthermore, these challenges become particularly pronounced in large-scale Web data analysis tasks, which typically involve complex temporal patterns and multimodal features. This complexity substantially increases optimization difficulty, rendering models prone to stagnation at saddle points within high-dimensional parameter spaces. To address these issues, we propose a lightweight Transformer architecture in conjunction with a novel Escape-Explore Optimizer (EEO). The optimizer enhances both exploration and generalization while effectively avoiding sharp minima and saddle-point traps. Experimental results show that, in representative Web data scenarios, our method achieves performance on par with state-of-the-art models across 11 time-series benchmark datasets and the Synapse medical image segmentation task. Moreover, it demonstrates superior generalization and stability, thereby validating its potential as a versatile cross-task foundation model for Web-scale data mining and analysis.

2602.02550 2026-02-04 cs.LG cs.AI

HyPAC: Cost-Efficient LLMs-Human Hybrid Annotation with PAC Error Guarantees

Hao Zeng, Huipeng Huang, Xinhao Qu, Jianguo Huang, Bingyi Jing, Hongxin Wei

详情
英文摘要

Data annotation often involves multiple sources with different cost-quality trade-offs, such as fast large language models (LLMs), slow reasoning models, and human experts. In this work, we study the problem of routing inputs to the most cost-efficient annotation source while controlling the labeling error on test instances. We propose \textbf{HyPAC}, a method that adaptively labels inputs to the most cost-efficient annotation source while providing distribution-free guarantees on annotation error. HyPAC calibrates two decision thresholds using importance sampling and upper confidence bounds, partitioning inputs into three regions based on uncertainty and routing each to the appropriate annotation source. We prove that HyPAC achieves the minimum expected cost with a probably approximately correct (PAC) guarantee on the annotation error, free of data distribution and pre-trained models. Experiments on common benchmarks demonstrate the effectiveness of our method, reducing the annotation cost by 78.51\% while tightly controlling the annotation error.

2602.02548 2026-02-04 cs.LG cs.AI cs.CV cs.MA

ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li

Comments 8 pages main paper, 18 pages total, 8 figures, 5 tables, code at https://github.com/ZephinueCode/ToolTok

详情
英文摘要

Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open-source at https://github.com/ZephinueCode/ToolTok.