arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2975
专题追踪
2505.11891 2026-02-03 cs.CL cs.AI

Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An

详情
英文摘要

VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single-path trajectories, which stands in contrast to the inherently multi-solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions due to a lack of noisy apps or overly full instructions during the evaluation process. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ads apps, and a contaminated split named AITZ-Noise to formulate a real noisy environment. Furthermore, an ambiguous instruction split with preset Q\&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.

2505.09134 2026-02-03 cs.LG stat.ML

Scaling Gaussian Process Regression with Full Derivative Observations

Daniel Huang

Comments 13 pages, Published in TMLR

详情
英文摘要

We present a scalable Gaussian Process (GP) method called DSoftKI that can fit and predict full derivative observations. It extends SoftKI, a method that approximates a kernel via softmax interpolation, to the setting with derivatives. DSoftKI enhances SoftKI's interpolation scheme by replacing its global temperature vector with local temperature vectors associated with each interpolation point. This modification allows the model to encode local directional sensitivity, enabling the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. Moreover, the interpolation scheme eliminates the need for kernel derivatives, facilitating extensions such as Deep Kernel Learning (DKL). We evaluate DSoftKI on synthetic benchmarks, a toy n-body physics simulation, standard regression datasets with synthetic gradients, and high-dimensional molecular force field prediction (100-1000 dimensions). Our results demonstrate that DSoftKI is accurate and scales to larger datasets with full derivative observations than previously possible.

2505.08664 2026-02-03 cs.RO cs.AI

A Social Robot with Inner Speech for Dietary Guidance

Valerio Belcamino, Alessandro Carfì, Valeria Seidita, Fulvio Mastrogiovanni, Antonio Chella

详情
英文摘要

We explore the use of inner speech as a mechanism to enhance transparency and trust in social robots for dietary advice. In humans, inner speech structures thought processes and decision-making; in robotics, it improves explainability by making reasoning explicit. This is crucial in healthcare scenarios, where trust in robotic assistants depends on both accurate recommendations and human-like dialogue, which make interactions more natural and engaging. Building on this, we developed a social robot that provides dietary advice, and we provided the architecture with inner speech capabilities to validate user input, refine reasoning, and generate clear justifications. The system integrates large language models for natural language understanding and a knowledge graph for structured dietary information. By making decisions more transparent, our approach strengthens trust and improves human-robot interaction in healthcare. We validated this by measuring the computational efficiency of our architecture and conducting a small user study, which assessed the reliability of inner speech in explaining the robot's behavior.

2505.05064 2026-02-03 cs.LG

WaterDrum: Watermarking for Data-centric Unlearning Metric

Xinyang Lu, Xinyuan Niu, Gregory Kang Ruey Lau, Bui Thi Cam Nhung, Rachael Hwee Ling Sim, John Russell Himawan, Fanyu Wen, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low

详情
英文摘要

Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. Existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when the forget and retain sets have semantically similar content and/or retraining the model from scratch on the retain set is impractical. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking to overcome these limitations. We introduce new benchmark datasets (with different levels of data similarity) for LLM unlearning that can be used to rigorously evaluate unlearning algorithms via WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.

2504.19472 2026-02-03 cs.CL

Conflicts in Texts: Data, Implications and Challenges

Siyi Liu, Dan Roth

详情
英文摘要

As NLP models become increasingly integrated into real-world applications, it becomes clear that there is a need to address the fact that models often rely on and generate conflicting information. Conflicts could reflect the complexity of situations, changes that need to be explained and dealt with, difficulties in data annotation, and mistakes in generated outputs. In all cases, disregarding the conflicts in data could result in undesired behaviors of models and undermine NLP models' reliability and trustworthiness. This survey categorizes these conflicts into three key areas: (1) natural texts on the web, where factual inconsistencies, subjective biases, and multiple perspectives introduce contradictions; (2) human-annotated data, where annotator disagreements, mistakes, and societal biases impact model training; and (3) model interactions, where hallucinations and knowledge conflicts emerge during deployment. While prior work has addressed some of these conflicts in isolation, we unify them under the broader concept of conflicting information, analyze their implications, and discuss mitigation strategies. We highlight key challenges and future directions for developing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.

2504.19110 2026-02-03 cs.CL

APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries

Huajian Xin, Luming Li, Xiaoran Jin, Jacques Fleuriot, Wenda Li

详情
英文摘要

While frontier formal mathematics systems now routinely develop repository-scale proof engineering artifacts requiring multi-file coordination and semantic correctness beyond compilation, existing evaluation benchmarks remain focused on isolated theorem proving. We introduce Automated Proof Engineering (APE), the first systematic framework for evaluating repository-scale proof engineering through dual verification that validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. We present a complete infrastructure comprising APE-Bench, which automatically extracts proof engineering tasks from real library commit histories, and APE-Harness, a unified execution framework based on task contract abstraction. This contract-based design enables standardized evaluation across diverse formal mathematics tasks and fair systematic comparison of different agent implementations (including our APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. We demonstrate the framework's effectiveness through comprehensive evaluation. All code and benchmark dataset are released as open-source at https://github.com/xinhjBrant/APE-Bench.

2504.18881 2026-02-03 cs.LG

TSCAN: Context-Aware Uplift Modeling via Two-Stage Training for Online Merchant Business Diagnosis

Hangtao Zhang, Zhe Li, Kairui Zhang

Comments 15 pages,7 figures

详情
英文摘要

A primary challenge in ITE estimation is sample selection bias. Traditional approaches utilize treatment regularization techniques such as the Integral Probability Metrics (IPM), re-weighting, and propensity score modeling to mitigate this bias. However, these regularizations may introduce undesirable information loss and limit the performance of the model. Furthermore, treatment effects vary across different external contexts, and the existing methods are insufficient in fully interacting with and utilizing these contextual features. To address these issues, we propose a Context-Aware uplift model based on the Two-Stage training approach (TSCAN), comprising CAN-U and CAN-D sub-models. In the first stage, we train an uplift model, called CAN-U, which includes the treatment regularizations of IPM and propensity score prediction, to generate a complete dataset with counterfactual uplift labels. In the second stage, we train a model named CAN-D, which utilizes an isotonic output layer to directly model uplift effects, thereby eliminating the reliance on the regularization components. CAN-D adaptively corrects the errors estimated by CAN-U through reinforcing the factual samples, while avoiding the negative impacts associated with the aforementioned regularizations. Additionally, we introduce a Context-Aware Attention Layer throughout the two-stage process to manage the interactions between treatment, merchant, and contextual features, thereby modeling the varying treatment effect in different contexts. We conduct extensive experiments on two real-world datasets to validate the effectiveness of TSCAN. Ultimately, the deployment of our model for real-world merchant diagnosis on one of China's largest online food ordering platforms validates its practical utility and impact.

2504.16063 2026-02-03 cs.CL cs.DB cs.IR

Free Access to World News: Reconstructing Full-Text Articles from GDELT

A. Fronzetti Colladon, R. Vestrelli

Journal ref Big Data and Cognitive Computing, 10(2), 45 (2026)

详情
英文摘要

News data have become essential resources across various disciplines. Still, access to full-text news corpora remains challenging due to high costs and the limited availability of free alternatives. This paper presents a novel Python package (gdeltnews) that reconstructs full-text newspaper articles at near-zero cost by leveraging the Global Database of Events, Language, and Tone (GDELT) Web News NGrams 3.0 dataset. Our method merges overlapping n-grams extracted from global online news to rebuild complete articles. We validate the approach on a benchmark set of 2211 articles from major U.S. news outlets, achieving up to 95% text similarity against original articles based on Levenshtein and SequenceMatcher metrics. Our tool facilitates economic forecasting, computational social science, information science, and natural language processing applications by enabling free and large-scale access to full-text news data.

2504.09970 2026-02-03 cs.LG

ASIL: Augmented Structural Information Learning for Deep Graph Clustering in Hyperbolic Space

Li Sun, Zhenhao Huang, Yujie Wang, Hongbo Lv, Chunyang Liu, Hao Peng, Philip S. Yu

Comments Accepted by IEEE TPAMI, 36 pages

详情
英文摘要

Graph clustering is a longstanding topic in machine learning. Recently, deep methods have achieved results but still require predefined cluster numbers K and struggle with imbalanced graphs. We study deep graph clustering without K considering realistic imbalance through structural information theory. In the literature, structural information is rarely used in deep clustering, and its classic discrete definition neglects node attributes while exhibiting prohibitive complexity. In this paper, we establish a differentiable structural information framework, generalizing the discrete formalism to the continuous realm. We design a hyperbolic model (LSEnet) to learn the neural partitioning tree in the Lorentz model. Theoretically, we demonstrate its capability in clustering without K and identifying minority clusters. Second, we refine hyperbolic representations to enhance graph semantics. Since tree contrastive learning is non-trivial and costs quadratic complexity, we advance our theory by discovering that structural entropy bounds the tree contrastive loss. Finally, we approach graph clustering through a novel augmented structural information learning (ASIL), which offers an efficient objective to integrate hyperbolic partitioning tree construction and contrastive learning. With a provable improvement in graph conductance, ASIL achieves effective debiased graph clustering in linear complexity. Extensive experiments show ASIL outperforms 20 strong baselines by an average of +12.42% in NMI on the Citeseer dataset.

2504.08697 2026-02-03 cs.CL

LLMs as Span Annotators: A Comparative Study of LLMs and Humans

Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu

Comments Accepted to the MME workshop @ EACL 2026

详情
英文摘要

Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.

2504.05711 2026-02-03 cs.AI cs.DL cs.IR cs.LG

Automated Archival Descriptions with Federated Intelligence of LLMs

Jinghua Groppe, Andreas Marquet, Annabel Walz, Sven Groppe

Comments 16 pages

详情
英文摘要

Enforcing archival standards requires specialized expertise, and manually creating metadata descriptions for archival materials is a tedious and error-prone task. This work aims at exploring the potential of agentic AI and large language models (LLMs) in addressing the challenges of implementing a standardized archival description process. To this end, we introduce an agentic AI-driven system for automated generation of high-quality metadata descriptions of archival materials. We develop a federated optimization approach that unites the intelligence of multiple LLMs to construct optimal archival metadata. We also suggest methods to overcome the challenges associated with using LLMs for consistent metadata generation. To evaluate the feasibility and effectiveness of our techniques, we conducted extensive experiments using a real-world dataset of archival materials, which covers a variety of document types and formats. The evaluation results demonstrate the feasibility of our techniques and highlight the superior performance of the federated optimization approach compared to single-model solutions in metadata quality and reliability.

2504.05520 2026-02-03 cs.LG cs.CL

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, Jieyu Zhao

Comments 23 pages, 8 figures, 7 tables

详情
英文摘要

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model's recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.

2504.01842 2026-02-03 cs.LG stat.CO

shapr: Explaining Machine Learning Models with Conditional Shapley Values in R and Python

Martin Jullum, Lars Henry Berge Olsen, Jon Lachmann, Annabelle Redelmeier

详情
英文摘要

This paper introduces the shapr R package, a versatile tool for generating Shapley value-based prediction explanations for machine learning and statistical regression models. Moreover, the shaprpy Python library brings the core capabilities of shapr to the Python ecosystem. Shapley values originate from cooperative game theory in the 1950s, but have over the past few years become a widely used method for quantifying how a model's features/covariates contribute to specific prediction outcomes. The shapr package emphasizes conditional Shapley value estimates, providing a comprehensive range of approaches for accurately capturing feature dependencies -- a crucial aspect for correct model explanation, typically lacking in similar software. In addition to regular tabular data, the shapr R package includes specialized functionality for explaining time series forecasts. The package offers a minimal set of user functions with sensible default values for most use cases while providing extensive flexibility for advanced users to fine-tune computations. Additional features include parallelized computations, iterative estimation with convergence detection, and rich visualization tools. shapr also extends its functionality to compute causal and asymmetric Shapley values when causal information is available. Overall, the shapr and shaprpy packages aim to enhance the interpretability of predictive models within a powerful and user-friendly framework.

2504.01154 2026-02-03 cs.AI cs.GT cs.MA

Past-Discounting is Key for Learning Markovian Fairness with Long Horizons

Ashwin Kumar, William Yeoh

详情
英文摘要

Fairness is an important consideration for dynamic resource allocation in multi-agent systems. Many existing methods treat fairness as a one-shot problem without considering temporal dynamics, which misses the nuances of accumulating inequalities over time. Recent approaches overcome this limitation by tracking allocations over time, assuming perfect recall of all past utilities. While the former neglects long-term equity, the latter introduces a critical challenge: the augmented state space required to track cumulative utilities grows unboundedly with time, hindering the scalability and convergence of learning algorithms. Motivated by behavioral insights that human fairness judgments discount distant events, we introduce a framework for temporal fairness that incorporates past-discounting into the learning problem. This approach offers a principled interpolation between instantaneous and perfect-recall fairness. Our central contribution is a past-discounted framework for memory tracking and a theoretical analysis of fairness memories, showing past-discounting guarantees a bounded, horizon-independent state space, a property that we prove perfect-recall methods lack. This result unlocks the ability to learn fair policies tractably over arbitrarily long horizons. We formalize this framework, demonstrate its necessity with experiments showing that perfect recall fails where past-discounting succeeds, and provide a clear path toward building scalable and equitable resource allocation systems.

2504.00573 2026-02-03 cs.CL

Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models

Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng

Comments EMNLP 2025 Main Conference (Long paper)

详情
英文摘要

Retrieval-Augmented Language Models boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantics relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors, multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility and generalize across tasks. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.

2503.24047 2026-02-03 cs.AI cs.MA

Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

Shuo Ren, Can Xie, Pu Jian, Zhenjiang Ren, Chunlin Leng, Jiajun Zhang

详情
英文摘要

As scientific research becomes increasingly complex, innovative tools are needed to manage vast data, facilitate interdisciplinary collaboration, and accelerate discovery. Large language models (LLMs) are now evolving into LLM-based scientific agents that automate critical tasks ranging from hypothesis generation and experiment design to data analysis and simulation. Unlike general-purpose LLMs, these specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms, enabling them to handle complex data types, ensure reproducibility, and drive scientific breakthroughs. This survey provides a focused review of the architectures, design, benchmarks, applications, and ethical considerations surrounding LLM-based scientific agents. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields. By examining their development and challenges, this survey offers a comprehensive roadmap for researchers and practitioners to harness these agents for more efficient, reliable, and ethically sound scientific discovery.

2503.17279 2026-02-03 cs.CL

CASE -- Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement

Gaifan Zhang, Yi Zhou, Danushka Bollegala

Comments Accepted to EACL2026

详情
英文摘要

The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear as how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM) encoder, where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised method is learnt to align the LLM-based text embeddings with the Conditional Semantic Textual Similarity (C-STS) task. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings by improving the isotropy of the embedding space. Moreover, our supervised projection method significantly improves the performance of LLM-based embeddings despite requiring a small number of embedding dimensions.

2503.16718 2026-02-03 cs.SD cs.CL cs.LG

CAARMA: Class Augmentation with Adversarial Mixup Regularization

Massa Baali, Xiang Li, Hao Chen, Syed Abdul Hannan, Rita Singh, Bhiksha Raj

Comments Accepted to EMNLP 2025 Findings

详情
英文摘要

Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8\% over all baseline models. The code is available at: https://github.com/massabaali7/CAARMA/

2503.14858 2026-02-03 cs.LG cs.AI

1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities

Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, Benjamin Eysenbach

Comments Link to project website: https://wang-kevin3290.github.io/scaling-crl/

详情
英文摘要

Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 - 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by $2\times$ - $50\times$, outperforming other goal-conditioned baselines. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned. The project webpage and code can be found here: https://wang-kevin3290.github.io/scaling-crl/.

2503.10304 2026-02-03 cs.LG cs.AI cs.GT

Large-Scale Auto-bidding with Nash Equilibrium Constraints

Zhiyu Mou, Miao Xu, Rongquan Bai, Zhuoran Yang, Chuan Yu, Jian Xu, Bo Zheng

详情
英文摘要

Auto-bidding has become a cornerstone of modern online advertising platforms, enabling many advertisers to automate bidding at scale and optimize campaign performance. However, prevailing industrial systems rely on single-agent auto-bidding methods that are scalable but overlook the strategic interdependence among advertisers' bids, leading to unstable or suboptimal outcomes. While recent works recognize the game-theoretic nature of auto-bidding, existing approaches remain either computationally intractable at scale or lack a principled equilibrium-selection that aligns with platform-wide objectives. In this paper, we bridge this gap by introducing Nash Equilibrium-Constrained Bidding (NCB), a principled and scalable auto-bidding framework that recasts auto-bidding as a platform-wide optimization problem subject to Nash equilibrium constraints. This approach accounts for fine-grained strategic interdependencies among advertisers, ensuring both agent-level stability and ecosystem-level optimality. Notably, we develop a theoretically sound penalty-based primal-dual gradient method with rigorous convergence guarantees, supported by an efficient algorithm suitable for industrial deployment. Extensive experiments validate the effectiveness of our approach.

2503.09701 2026-02-03 cs.CL cs.LG

Reassessing Active Learning Adoption in Contemporary NLP: A Community Survey

Julia Romberg, Christopher Schröder, Julius Gonsior, Katrin Tomanek, Fredrik Olsson

Comments EACL 2026 Main Conference

详情
英文摘要

Supervised learning relies on data annotation which usually is time-consuming and therefore expensive. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Research in active learning has made considerable progress, especially with the rise of large language models (LLMs). However, we still know little about how these remarkable advances have translated into real-world applications, or contributed to removing key barriers to active learning adoption. To fill in this gap, we conduct an online survey in the NLP community to collect previously intangible insights on current implementation practices, common obstacles in application, and future prospects in active learning. We also reassess the perceived relevance of data annotation and active learning as fundamental assumptions. Our findings show that data annotation is expected to remain important and active learning to stay relevant while benefiting from LLMs. Consistent with a community survey from over 15 years ago, three key challenges yet persist -- setup complexity, uncertain cost reduction, and tooling -- for which we propose alleviation strategies. We publish an anonymized version of the dataset.

2502.16570 2026-02-03 cs.LG cs.AI cs.CV

Entropy-Lens: Uncovering Decision Strategies in LLMs

Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Liò

详情
英文摘要

In large language models (LLMs), each block operates on the residual stream to map input token sequences to output token distributions. However, most of the interpretability literature focuses on internal latent representations, leaving token-space dynamics underexplored. The high dimensionality and categoricity of token distributions hinder their analysis, as standard statistical descriptors are not suitable. We show that the entropy of logit-lens predictions overcomes these issues. In doing so, it provides a per-layer scalar, permutation-invariant metric. We introduce Entropy-Lens to distill the token-space dynamics of the residual stream into a low-dimensional signal. We call this signal the entropy profile. We apply our method to a variety of model sizes and families, showing that (i) entropy profiles uncover token prediction dynamics driven by expansion and pruning strategies; (ii) these dynamics are family-specific and invariant under depth rescaling; (iii) they are characteristic of task type and output format; (iv) these strategies have unequal impact on downstream performance, with the expansion strategy usually being more critical. Ultimately, our findings further enhance our understanding of the residual stream, enabling a granular assessment of how information is processed across model depth.

2502.12769 2026-02-03 cs.CL cs.AI

How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination

Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš

Comments EMNLP 2025

详情
英文摘要

In the age of misinformation, hallucination - the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses - represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination are (a) English-centric and (b) focus on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rates estimation, we build open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seems uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more, and significantly, LLMs with broader language support display higher hallucination rates.

2502.11367 2026-02-03 cs.LG cs.AI cs.CL

Sparse Autoencoder Features for Classifications and Transferability

Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

详情
英文摘要

Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.

2502.05568 2026-02-03 cs.CL cs.AI cs.LG

Large Multimodal Models for Low-Resource Languages: A Survey

Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

Comments Accepted in Information Fusion

详情
英文摘要

In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.

2502.04528 2026-02-03 cs.CL cs.LG

Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection

Minseok Jung, Cynthia Fuertes Panizo, Liam Dugan, Yi R., Fung, Pin-Yu Chen, Paul Pu Liang

详情
英文摘要

The advancement of large language models (LLMs) has made it difficult to differentiate human-written text from AI-generated text. Several AI-text detectors have been developed in response, which typically utilize a fixed global threshold (e.g., $θ= 0.5$) to classify machine-generated text. However, one universal threshold could fail to account for distributional variations by subgroups. For example, when using a fixed threshold, detectors make more false positive errors on shorter human-written text, and more positive classifications of neurotic writing styles among long texts. These discrepancies can lead to misclassifications that disproportionately affect certain groups. We address this critical limitation by introducing FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors. We partitioned data into subgroups based on attributes (e.g., text length and writing style) and implemented FairOPT to learn decision thresholds for each group to reduce discrepancy. FairOPT showed notable discrepancy mitigation across nine detectors and three heterogeneous datasets, and the remarkable mitigation of the minimax problem by decreasing overall discrepancy 27.4% across five metrics while minimally sacrificing accuracy by 0.005%. Our framework paves the way for more robust classification in AI-generated content detection via post-processing. We release our data, code, and project information at URL.

2502.01477 2026-02-03 cs.LG cs.AI

Achieving Time Series Reasoning Requires Rethinking Model Design, Tasks Formulation, and Evaluation

Yaxuan Kong, Yiyuan Yang, Shiyu Wang, Chenghao Liu, Yuxuan Liang, Ming Jin, Stefan Zohren, Dan Pei, Yan Liu, Qingsong Wen

详情
英文摘要

Understanding time series data is fundamental to many real-world applications. Recent work explores multimodal large language models (MLLMs) to enhance time series understanding with contextual information beyond numerical signals. This area has grown from 7 papers in 2023 to over 580 in 2025, yet existing methods struggle in real-world settings. We analyze 20 influential works from 2025 across model design, task formulation, and evaluation, and identify critical gaps: methods adapt NLP techniques with limited attention to core time series properties; tasks remain restricted to traditional prediction and classification; and evaluations emphasize benchmarks over robustness, interpretability, or decision relevance. We argue that achieving time series reasoning requires rethinking model design, task formulation, and evaluation together. We define time series reasoning, outline challenges and future directions, and call on researchers to develop unified frameworks for robust, interpretable, and decision-relevant reasoning in real-world applications. The material is available at https://github.com/Eleanorkong/Awesome-Time-Series-Reasoning.

2501.15098 2026-02-03 cs.LG cs.AI

CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter

Zihang Li, Yangdong Ruan, Wenjun Liu, Zhengyang Wang, Tong Yang

详情
英文摘要

Although retrieval-augmented generation(RAG) significantly improves generation quality by retrieving external knowledge bases and integrating generated content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks involving hierarchical structures for Tree-RAG. This paper proposes a Tree-RAG acceleration method based on the improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG effectively organizes entities through the introduction of a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. The experiment results demonstrate that our method is much faster than naive Tree-RAG while maintaining high levels of generative quality. When the number of trees is large, our method is hundreds of times faster than naive Tree-RAG. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.

2501.14687 2026-02-03 cs.LG cs.AI

Decoding Generalization from Memorization in Deep Neural Networks

Simran Ketha, Venkatakrishnan Ramaswamy

Journal ref Transactions on Machine Learning Research, 2026

详情
英文摘要

Overparameterized deep networks that generalize well have been key to the dramatic success of deep learning in recent years. The reasons for their remarkable ability to generalize are not well understood yet. When class labels in the training set are shuffled to varying degrees, it is known that deep networks can still reach perfect training accuracy at the detriment of generalization to true labels -- a phenomenon that has been called memorization. It has, however, been unclear why the poor generalization to true labels that accompanies such memorization, comes about. One possibility is that during training, all layers of the network irretrievably re-organize their representations in a manner that makes generalization to true labels difficult. The other possibility is that one or more layers of the trained network retain significantly more latent ability to generalize to true labels, but the network somehow "chooses" to readout in a manner that is detrimental to generalization to true labels. Here, we provide evidence for the latter possibility by demonstrating, empirically, that such models possess information in their representations for substantially-improved generalization to true labels. Furthermore, such abilities can be easily decoded from the internals of the trained model, and we build a technique to do so. We demonstrate results on multiple models trained with standard datasets. Our code is available at: https://github.com/simranketha/MASC_DNN.

2501.08907 2026-02-03 cs.LG cs.AI

PIQL: Projective Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning

Xinchen Han, Hossam Afifi, Michel Marot

详情
英文摘要

Offline Reinforcement Learning (RL) faces a fundamental challenge of extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) employs expectile regression to achieve in-sample learning. Nevertheless, IQL relies on a fixed expectile hyperparameter and a density-based policy improvement method, both of which impede its adaptability and performance. In this paper, we propose Projective IQL (PIQL), a projective variant of IQL enhanced with a support constraint. In the policy evaluation stage, PIQL substitutes the fixed expectile hyperparameter with a projection-based parameter and extends the one-step value estimation to a multi-step formulation. In the policy improvement stage, PIQL adopts a support constraint instead of a density constraint, ensuring closer alignment with the policy evaluation. Theoretically, we demonstrate that PIQL maintains the expectile regression and in-sample learning framework, guarantees monotonic policy improvement, and introduces a progressively more rigorous criterion for advantageous actions. Experiments on D4RL and NeoRL2 benchmarks demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.