arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1447
专题追踪
2601.22447 2026-02-02 cs.LG

Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features

Yiting Liu, Zhi-Hong Deng

详情
英文摘要

Sparse autoencoders (SAEs) have emerged as a powerful technique for decomposing language model representations into interpretable features. Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data. Through three experiments on Gemma-2 and Llama-3.1 models, we demonstrate that (1) 1/4 of features directly predict output tokens, (2) features actively participate in attention mechanisms with depth-dependent structure, and (3) semantic and non-semantic feature populations exhibit distinct distribution profiles in attention circuits. Our analysis provides the missing out-of-context half of SAE feature interpretability.

2601.22446 2026-02-02 cs.AI

Anytime Safe PAC Efficient Reasoning

Chengyao Yu, Hao Zeng, Youxin Zhu, Jianguo Huang, Huajun Zeng, Bingyi Jing

详情
英文摘要

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks but suffer from high computational costs and latency. While selective thinking strategies improve efficiency by routing easy queries to non-thinking models, existing approaches often incur uncontrollable errors, especially in online settings where the performance loss of a non-thinking model is only partially observed and data are non-stationary. To address this, we propose Betting Probably Approximately Correct (B-PAC) reasoning, a principled method that enables anytime safe and efficient online reasoning under partial feedback. Specifically, we utilize inverse propensity scoring estimators to construct test supermartingales for candidate thresholds, and then dynamically adjust the routing threshold based on the accumulated statistical evidence of safety. Theoretically, we establish the anytime-valid performance loss control and the efficiency of B-PAC reasoning. Extensive experiments demonstrate that B-PAC reasoning significantly reduces computational overhead, decreasing thinking model usage by up to 81.01\%, while controlling the performance loss below the user-specified level.

2601.22445 2026-02-02 cs.RO cs.CV

High-Definition 5MP Stereo Vision Sensing for Robotics

Leaf Jiang, Matthew Holzel, Bernhard Kaplan, Hsiou-Yuan Liu, Sabyasachi Paul, Karen Rankin, Piotr Swierczynski

详情
英文摘要

High-resolution (5MP+) stereo vision systems are essential for advancing robotic capabilities, enabling operation over longer ranges and generating significantly denser and accurate 3D point clouds. However, realizing the full potential of high-angular-resolution sensors requires a commensurately higher level of calibration accuracy and faster processing -- requirements often unmet by conventional methods. This study addresses that critical gap by processing 5MP camera imagery using a novel, advanced frame-to-frame calibration and stereo matching methodology designed to achieve both high accuracy and speed. Furthermore, we introduce a new approach to evaluate real-time performance by comparing real-time disparity maps with ground-truth disparity maps derived from more computationally intensive stereo matching algorithms. Crucially, the research demonstrates that high-pixel-count cameras yield high-quality point clouds only through the implementation of high-accuracy calibration.

2601.22442 2026-02-02 cs.LG cs.DC

AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham, Hadi Mohaghegh Dolatabadi, Chamin P Hewa Koneputugodage, Violetta Shevchenko, Yan Zuo, Alexander Long

详情
英文摘要

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to \em 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.

2601.22439 2026-02-02 cs.CL

Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss

Galim Turumtaev

Comments Accepted at LoResLM 2025 (COLING 2025 workshop). Oral presentation

Journal ref In Proceedings of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025), pages 373-386, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

详情
英文摘要

Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.

2601.22433 2026-02-02 cs.AI cs.SE

When LLM meets Fuzzy-TOPSIS for Personnel Selection through Automated Profile Analysis

Shahria Hoque, Ahmed Akib Jawad Karim, Md. Golam Rabiul Alam, Nirjhar Gope

Comments 10 pages, 8 figures. This paper has been peer-reviewed and published in IEEE Access. The arXiv version corresponds to the accepted author manuscript (AAM)

Journal ref IEEE Access, vol. 14, 2026, Article ID 3658575

详情
英文摘要

In this highly competitive employment environment, the selection of suitable personnel is essential for organizational success. This study presents an automated personnel selection system that utilizes sophisticated natural language processing (NLP) methods to assess and rank software engineering applicants. A distinctive dataset was created by aggregating LinkedIn profiles that include essential features such as education, work experience, abilities, and self-introduction, further enhanced with expert assessments to function as standards. The research combines large language models (LLMs) with multicriteria decision-making (MCDM) theory to develop the LLM-TOPSIS framework. In this context, we utilized the TOPSIS method enhanced by fuzzy logic (Fuzzy TOPSIS) to address the intrinsic ambiguity and subjectivity in human assessments. We utilized triangular fuzzy numbers (TFNs) to describe criteria weights and scores, thereby addressing the ambiguity frequently encountered in candidate evaluations. For candidate ranking, the DistilRoBERTa model was fine-tuned and integrated with the fuzzy TOPSIS method, achieving rankings closely aligned with human expert evaluations and attaining an accuracy of up to 91% for the Experience attribute and the Overall attribute. The study underlines the potential of NLP-driven frameworks to improve recruitment procedures by boosting scalability, consistency, and minimizing prejudice. Future endeavors will concentrate on augmenting the dataset, enhancing model interpretability, and verifying the system in actual recruitment scenarios to better evaluate its practical applicability. This research highlights the intriguing potential of merging NLP with fuzzy decision-making methods in personnel selection, enabling scalable and unbiased solutions to recruitment difficulties.

2601.22432 2026-02-02 cs.LG cs.CL

ReNCE: Learning to Reason by Noise Contrastive Estimation

Wenzheng Zhang, Karl Stratos

详情
英文摘要

GRPO is a standard approach to endowing pretrained LLMs with reasoning capabilities. It estimates the advantage of an outcome from a group of $K$ outcomes, and promotes those with positive advantages inside a trust region. Since GRPO discriminates between good and bad outcomes softly, it benefits from additional refinements such as asymmetric clipping and zero-variance data filtering. While effective, these refinements require significant empirical insight and can be challenging to identify. We instead propose an explicit contrastive learning approach. Instead of estimating advantages, we bifurcate $K$ outcomes into positive and negative sets, then maximize the likelihood of positive outcomes. Our approach can be viewed as an online instantiation of (multi-label) noise contrastive estimation for LLM reasoning. We validate our method by demonstrating competitive performance on a suite of challenging math benchmarks against strong baselines such as DAPO and online DPO.

2601.22420 2026-02-02 cs.LG cs.AI

MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments

Roelien C. Timmer, Necva Bölücü, Stephen Wan

Comments EACL 2026

详情
英文摘要

Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate leaderboard generation, but existing datasets for this purpose are limited by capturing only the best results from each paper and limited metadata. We present MetaLead, a fully human-annotated ML Leaderboard dataset that captures all experimental results for result transparency and contains extra metadata, such as the result experimental type: baseline, proposed method, or variation of proposed method for experiment-type guided comparisons, and explicitly separates train and test dataset for cross-domain assessment. This enriched structure makes MetaLead a powerful resource for more transparent and nuanced evaluations across ML research.

2601.22418 2026-02-02 cs.AI

AI-Enabled Waste Classification as a Data-Driven Decision Support Tool for Circular Economy and Urban Sustainability

Julius Sechang Mboli, Omolara Aderonke Ogungbemi

Comments Accepted version of Conference paper

Journal ref 2025 IEEE International Smart Cities Conference (ISC2), Patras, Greece, 2025, pp. 1-6

详情
英文摘要

Efficient waste sorting is crucial for enabling circular-economy practices and resource recovery in smart cities. This paper evaluates both traditional machine-learning (Random Forest, SVM, AdaBoost) and deep-learning techniques including custom CNNs, VGG16, ResNet50, and three transfer-learning models (DenseNet121, EfficientNetB0, InceptionV3) for binary classification of 25 077 waste images (80/20 train/test split, augmented and resized to 150x150 px). The paper assesses the impact of Principal Component Analysis for dimensionality reduction on traditional models. DenseNet121 achieved the highest accuracy (91 %) and ROC-AUC (0.98), outperforming the best traditional classifier by 20 pp. Principal Component Analysis (PCA) showed negligible benefit for classical methods, whereas transfer learning substantially improved performance under limited-data conditions. Finally, we outline how these models integrate into a real-time Data-Driven Decision Support System for automated waste sorting, highlighting potential reductions in landfill use and lifecycle environmental impacts.)

2601.22416 2026-02-02 cs.LG

MM-OpenFGL: A Comprehensive Benchmark for Multimodal Federated Graph Learning

Xunkai Li, Yuming Ai, Yinlin Zhu, Haodong Lu, Yi Zhang, Guohao Fu, Bowen Fan, Qiangqiang Dai, Rong-Hua Li, Guoren Wang

Comments Under Review

详情
英文摘要

Multimodal-attributed graphs (MMAGs) provide a unified framework for modeling complex relational data by integrating heterogeneous modalities with graph structures. While centralized learning has shown promising performance, MMAGs in real-world applications are often distributed across isolated platforms and cannot be shared due to privacy concerns or commercial constraints. Federated graph learning (FGL) offers a natural solution for collaborative training under such settings; however, existing studies largely focus on single-modality graphs and do not adequately address the challenges unique to multimodal federated graph learning (MMFGL). To bridge this gap, we present MM-OpenFGL, the first comprehensive benchmark that systematically formalizes the MMFGL paradigm and enables rigorous evaluation. MM-OpenFGL comprises 19 multimodal datasets spanning 7 application domains, 8 simulation strategies capturing modality and topology variations, 6 downstream tasks, and 57 state-of-the-art methods implemented through a modular API. Extensive experiments investigate MMFGL from the perspectives of necessity, effectiveness, robustness, and efficiency, offering valuable insights for future research on MMFGL.

2601.22410 2026-02-02 cs.CL

Word-Centered Semantic Graphs for Interpretable Diachronic Sense Tracking

Imene Kolli, Kai-Robin Lange, Jonas Rieger, Carsten Jentsch

Comments 20 pages, 16 figures

详情
英文摘要

We propose an interpretable, graph-based framework for analyzing semantic shift in diachronic corpora. For each target word and time slice, we induce a word-centered semantic network that integrates distributional similarity from diachronic Skip-gram embeddings with lexical substitutability from time-specific masked language models. We identify sense-related structure by clustering the peripheral graph, align clusters across time via node overlap, and track change through cluster composition and normalized cluster mass. In an application study on a corpus of New York Times Magazine articles (1980 - 2017), we show that graph connectivity reflects polysemy dynamics and that the induced communities capture contrasting trajectories: event-driven sense replacement (trump), semantic stability with cluster over-segmentation effects (god), and gradual association shifts tied to digital communication (post). Overall, word-centered semantic graphs offer a compact and transparent representation for exploring sense evolution without relying on predefined sense inventories.

2601.22406 2026-02-02 cs.RO

Accurate Pedestrian Tracking in Urban Canyons: A Multi-Modal Fusion Approach

Shahar Dubiner, Peng Ren, Roberto Manduchi

详情
英文摘要

The contribution describes a pedestrian navigation approach designed to improve localization accuracy in urban environments where GNSS performance is degraded, a problem that is especially critical for blind or low-vision users who depend on precise guidance such as identifying the correct side of a street. To address GNSS limitations and the impracticality of camera-based visual positioning, the work proposes a particle filter based fusion of GNSS and inertial data that incorporates spatial priors from maps, such as impassable buildings and unlikely walking areas, functioning as a probabilistic form of map matching. Inertial localization is provided by the RoNIN machine learning method, and fusion with GNSS is achieved by weighting particles based on their consistency with GNSS estimates and uncertainty. The system was evaluated on six challenging walking routes in downtown San Francisco using three metrics related to sidewalk correctness and localization error. Results show that the fused approach (GNSS+RoNIN+PF) significantly outperforms GNSS only localization on most metrics, while inertial-only localization with particle filtering also surpasses GNSS alone for critical measures such as sidewalk assignment and across street error.

2601.22402 2026-02-02 cs.CL cs.FL cs.LG

Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization

Kanishk Awadhiya

详情
英文摘要

Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term ''Spectral Rigidity'': standard RoPE utilizes a fixed geometric decay ($θ^{-i}$) optimized for local syntactic coherence, which fails to capture the long-range, periodic structures inherent in recursive logic and algorithmic reasoning. This results in a ''Structure Gap'', where models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps. In this work, we introduce Bifocal Attention, an architectural paradigm that decouples positional encoding into two distinct modalities: Geometric Eyes (Standard RoPE) for precise token-level manipulation, and Spectral Eyes (Learnable Harmonic Operators) for tracking long-range recursive depth. We propose a novel training protocol, Spectral Evolution, which initializes positional frequencies as static geometric parameters but allows them to evolve via gradient descent into a harmonic basis optimized for the specific algorithmic topology of the task.

2601.22399 2026-02-02 cs.LG cs.AI

Score-based Integrated Gradient for Root Cause Explanations of Outliers

Phuoc Nguyen, Truyen Tran, Sunil Gupta, Svetha Venkatesh

Comments Accepted at ICDM 2025

详情
英文摘要

Identifying the root causes of outliers is a fundamental problem in causal inference and anomaly detection. Traditional approaches based on heuristics or counterfactual reasoning often struggle under uncertainty and high-dimensional dependencies. We introduce SIREN, a novel and scalable method that attributes the root causes of outliers by estimating the score functions of the data likelihood. Attribution is computed via integrated gradients that accumulate score contributions along paths from the outlier toward the normal data distribution. Our method satisfies three of the four classic Shapley value axioms - dummy, efficiency, and linearity - as well as an asymmetry axiom derived from the underlying causal structure. Unlike prior work, SIREN operates directly on the score function, enabling tractable and uncertainty-aware root cause attribution in nonlinear, high-dimensional, and heteroscedastic causal models. Extensive experiments on synthetic random graphs and real-world cloud service and supply chain datasets show that SIREN outperforms state-of-the-art baselines in both attribution accuracy and computational efficiency.

2601.22398 2026-02-02 cs.CV cs.AI

Jailbreaks on Vision Language Model via Multimodal Reasoning

Aarush Noheria, Yuguang Yao

详情
英文摘要

Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy significantly improves ASR while maintaining naturalness in both text and visual domains.

2601.22397 2026-02-02 cs.LG cs.DC

SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

Jianchang Su, Yifan Zhang, Shengkai Lin, Shizhen Zhao, Yusheng Zheng, Yiwei Yang, Wei Zhang

详情
英文摘要

Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.

2601.22390 2026-02-02 cs.SD cs.CR eess.AS

An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems

Chanwoo Park, Chanwoo Kim

详情
英文摘要

Evasion attacks pose significant threats to AI systems, exploiting vulnerabilities in machine learning models to bypass detection mechanisms. The widespread use of voice data, including deepfakes, in promising future industries is currently hindered by insufficient legal frameworks. Adversarial attack methods have emerged as the most effective countermeasure against the indiscriminate use of such data. This research introduces masked energy perturbation (MEP), a novel approach using power spectrum for energy masking of original voice data. MEP applies masking to small energy regions in the frequency domain before generating adversarial perturbations, targeting areas less noticeable to the human auditory model. The study primarily employs advanced speaker recognition models, including ECAPA-TDNN and ResNet34, which have shown remarkable performance in speaker verification tasks. The proposed MEP method demonstrated strong performance in both audio quality and evasion effectiveness. The energy masking approach effectively minimizes the perceptual evaluation of speech quality (PESQ) degradation, indicating that minimal perceptual distortion occurs to the human listener despite the adversarial perturbations. Specifically, in the PESQ evaluation, the relative performance of the MEP method was 26.68% when compared to the fast gradient sign method (FGSM) and iterative FGSM.

2601.22387 2026-02-02 cs.RO cs.HC

Plant-Inspired Robot Design Metaphors for Ambient HRI

Victor Nikhil Antony, Adithya R N, Sarah Derrick, Zhili Gong, Peter M. Donley, Chien-Ming Huang

详情
英文摘要

Plants offer a paradoxical model for interaction: they are ambient, low-demand presences that nonetheless shape atmosphere, routines, and relationships through temporal rhythms and subtle expressions. In contrast, most human-robot interaction (HRI) has been grounded in anthropomorphic and zoomorphic paradigms, producing overt, high-demand forms of engagement. Using a Research through Design (RtD) methodology, we explore plants as metaphoric inspiration for HRI; we conducted iterative cycles of ideation, prototyping, and reflection to investigate what design primitives emerge from plant metaphors and morphologies, and how these primitives can be combined into expressive robotic forms. We present a suite of speculative, open-source prototypes that help probe plant-inspired presence, temporality, form, and gestures. We deepened our learnings from design and prototyping through prototype-centered workshops that explored people's perceptions and imaginaries of plant-inspired robots. This work contributes: (1) Set of plant-inspired robotic artifacts; (2) Designerly insights on how people perceive plant-inspired robots; and (3) Design consideration to inform how to use plant metaphors to reshape HRI.

2601.22386 2026-02-02 cs.CL cs.MA

Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading

Jamiu Adekunle Idowu, Ahmed Almasoud

详情
英文摘要

Automated essay scoring (AES) systems increasingly rely on large language models, yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single-agent and multi-agent LLM architectures for essay grading using the ASAP 2.0 corpus. Our multi-agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that implements rubric-aligned logic including veto rules and score capping. We test both architectures in zero-shot and few-shot conditions using GPT-5.1. Results show that the multi-agent system is significantly better at identifying weak essays while the single-agent system performs better on mid-range essays. Both architectures struggle with high-quality essays. Critically, few-shot calibration emerges as the dominant factor in system performance -- providing just two examples per score level improves QWK by approximately 26% for both architectures. These findings suggest architectural choice should align with specific deployment priorities, with multi-agent AI particularly suited for diagnostic screening of at-risk students, while single-agent models provide a cost-effective solution for general assessment.

2601.22385 2026-02-02 cs.CL cs.AI cs.LG

SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

Chaoyue He, Xin Zhou, Di Wang, Hong Xu, Wei Liu, Chunyan Miao

Comments 39 pages, 15 figures, 16 tables, 60 equations

详情
英文摘要

Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.

2601.22381 2026-02-02 cs.RO cs.HC

Lantern: A Minimalist Robotic Object Platform

Victor Nikhil Antony, Zhili Gong, Guanchen Li, Clara Jeon, Chien-Ming Huang

详情
英文摘要

Robotic objects are simple actuated systems that subtly blend into human environments. We design and introduce Lantern, a minimalist robotic object platform to enable building simple robotic artifacts. We conducted in-depth design and engineering iterations of Lantern's mechatronic architecture to meet specific design goals while maintaining a low build cost (~40 USD). As an extendable, open-source platform, Lantern aims to enable exploration of a range of HRI scenarios by leveraging human tendency to assign social meaning to simple forms. To evaluate Lantern's potential for HRI, we conducted a series of explorations: 1) a co-design workshop, 2) a sensory room case study, 3) distribution to external HRI labs, 4) integration into a graduate-level HRI course, and 5) public exhibitions with older adults and children. Our findings show that Lantern effectively evokes engagement, can support versatile applications ranging from emotion regulation to focused work, and serves as a viable platform for lowering barriers to HRI as a field.

2601.22379 2026-02-02 cs.CL

SPLA: Block Sparse Plus Linear Attention for Long Context Modeling

Bailin Wang, Dan Friedman, Tao Lei, Chong Wang

Comments v1

详情
英文摘要

Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining "long tail," SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA -- calculating the residual as the difference between global and selected linear attention -- ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.

2601.22376 2026-02-02 cs.CV

FlexMap: Generalized HD Map Construction from Flexible Camera Configurations

Run Wang, Chaoyi Zhou, Amir Salarpour, Xi Liu, Zhi-Qi Cheng, Feng Luo, Mert D. Pesé, Siyu Huang

详情
英文摘要

High-definition (HD) maps provide essential semantic information of road structures for autonomous driving systems, yet current HD map construction methods require calibrated multi-camera setups and either implicit or explicit 2D-to-BEV transformations, making them fragile when sensors fail or camera configurations vary across vehicle fleets. We introduce FlexMap, unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes or per-configuration retraining. Our key innovation eliminates explicit geometric projections by using a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding in feature space. FlexMap features two core components: a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics, and a camera-aware decoder with latent camera tokens, enabling view-adaptive attention without the need for projection matrices. Experiments demonstrate that FlexMap outperforms existing methods across multiple configurations while maintaining robustness to missing views and sensor variations, enabling more practical real-world deployment.

2601.22373 2026-02-02 cs.CL

Stability-Aware Prompt Optimization for Clinical Data Abstraction

Arinbjörn Kolbeinsson, Daniel Timbie, Sajjan Narsinghani, Sanjay Hariharan

详情
英文摘要

Large language models used for clinical abstraction are sensitive to prompt wording, yet most work treats prompts as fixed and studies uncertainty in isolation. We argue these should be treated jointly. Across two clinical tasks (MedAlign applicability/correctness and MS subtype abstraction) and multiple open and proprietary models, we measure prompt sensitivity via flip rates and relate it to calibration and selective prediction. We find that higher accuracy does not guarantee prompt stability, and that models can appear well-calibrated yet remain fragile to paraphrases. We propose a dual-objective prompt optimization loop that jointly targets accuracy and stability, showing that explicitly including a stability term reduces flip rates across tasks and models, sometimes at modest accuracy cost. Our results suggest prompt sensitivity should be an explicit objective when validating clinical LLM systems.

2601.22371 2026-02-02 cs.LG

FIRE: Multi-fidelity Regression with Distribution-conditioned In-context Learning using Tabular Foundation Models

Rosen Ting-Ying Yu, Nicholas Sung, Faez Ahmed

详情
英文摘要

Multi-fidelity (MF) regression often operates in regimes of extreme data imbalance, where the commonly-used Gaussian-process (GP) surrogates struggle with cubic scaling costs and overfit to sparse high-fidelity observations, limiting efficiency and generalization in real-world applications. We introduce FIRE, a training-free MF framework that couples tabular foundation models (TFMs) to perform zero-shot in-context Bayesian inference via a high-fidelity correction model conditioned on the low-fidelity model's posterior predictive distributions. This cross-fidelity information transfer via distributional summaries captures heteroscedastic errors, enabling robust residual learning without model retraining. Across 31 benchmark problems spanning synthetic and real-world tasks (e.g., DrivAerNet, LCBench), FIRE delivers a stronger performance-time trade-off than seven state-of-the-art GP-based or deep learning MF regression methods, ranking highest in accuracy and uncertainty quantification with runtime advantages. Limitations include context window constraints and dependence on the quality of the pre-trained TFM's.

2601.22369 2026-02-02 cs.AI cs.DC

Learning Provably Correct Distributed Protocols Without Human Knowledge

Yujie Hui, Xiaoyi Lu, Andrew Perrault, Yang Wang

详情
英文摘要

Provably correct distributed protocols, which are a critical component of modern distributed systems, are highly challenging to design and have often required decades of human effort. These protocols allow multiple agents to coordinate to come to a common agreement in an environment with uncertainty and failures. We formulate protocol design as a search problem over strategies in a game with imperfect information, and the desired correctness conditions are specified in Satisfiability Modulo Theories (SMT). However, standard methods for solving multi-agent games fail to learn correct protocols in this setting, even when the number of agents is small. We propose a learning framework, GGMS, which integrates a specialized variant of Monte Carlo Tree Search with a transformer-based action encoder, a global depth-first search to break out of local minima, and repeated feedback from a model checker. Protocols output by GGMS are verified correct via exhaustive model checking for all executions within the bounded setting. We further prove that, under mild assumptions, the search process is complete: if a correct protocol exists, GGMS will eventually find it. In experiments, we show that GGMS can learn correct protocols for larger settings than existing methods.

2601.22364 2026-02-02 cs.CL cs.AI

Context Structure Reshapes the Representational Geometry of Language Models

Eghbal A. Hosseini, Yuxuan Li, Yasaman Bahri, Declan Campbell, Andrew Kyle Lampinen

详情
英文摘要

Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs \emph{within} a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs' representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent -- it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.

2601.22362 2026-02-02 cs.LG

Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use

Julien Delavande, Regis Pierrard, Sasha Luccioni

详情
英文摘要

Large Language Models (LLMs) are increasingly deployed in production, contributing towards shifting the burden in terms of computational resources and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how \emph{system-level design choices} - such as numerical precision, batching strategy, and request scheduling - can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face's Text Generation Inference server). Our results reveal that lower-precision formats only yield energy gains in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases like decoding; and that structured request timing (arrival shaping) can reduce per-request energy by up to 100 times. We argue that sustainable LLM deployment depends not only on model internals, but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.

2601.22359 2026-02-02 cs.LG cs.AI

The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples

Hsiang Hsu, Pradeep Niroula, Zichang He, Ivan Brugere, Freddy Lecue, Chun-Fu Chen

Comments Presented at NeurIPS 2025

详情
英文摘要

Machine unlearning offers a practical alternative to avoid full model re-training by approximately removing the influence of specific user data. While existing methods certify unlearning via statistical indistinguishability from re-trained models, these guarantees do not naturally extend to model outputs when inputs are adversarially perturbed. In particular, slight perturbations of forget samples may still be correctly recognized by the unlearned model - even when a re-trained model fails to do so - revealing a novel privacy risk: information about the forget samples may persist in their local neighborhood. In this work, we formalize this vulnerability as residual knowledge and show that it is inevitable in high-dimensional settings. To mitigate this risk, we propose a fine-tuning strategy, named RURK, that penalizes the model's ability to re-recognize perturbed forget samples. Experiments on vision benchmarks with deep neural networks demonstrate that residual knowledge is prevalent across existing unlearning methods and that our approach effectively prevents residual knowledge.

2601.22357 2026-02-02 cs.LG

Small Talk, Big Impact: The Energy Cost of Thanking AI

Julien Delavande, Regis Pierrard, Sasha Luccioni

详情
英文摘要

Being polite is free - or is it? In this paper, we quantify the energy cost of seemingly innocuous messages such as ``thank you'' when interacting with large language models, often used by users to convey politeness. Using real-world conversation traces and fine-grained energy measurements, we quantify how input length, output length and model size affect energy use. While politeness is our motivating example, it also serves as a controlled and reproducible proxy for measuring the energy footprint of a typical LLM interaction. Our findings provide actionable insights for building more sustainable and efficient LLM applications, especially in increasingly widespread real-world contexts like chat. As user adoption grows and billions of prompts are processed daily, understanding and mitigating this cost becomes crucial - not just for efficiency, but for sustainable AI deployment.