arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1559
2603.03499 2026-03-05 cs.RO

Overlapping Domain Decomposition for Distributed Pose Graph Optimization

Aneesa Sonawalla, Yulun Tian, Jonathan P. How

Comments Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

详情
英文摘要

We present ROBO (Riemannian Overlapping Block Optimization), a distributed and parallel approach to multi-robot pose graph optimization (PGO) based on the idea of overlapping domain decomposition. ROBO offers a middle ground between centralized and fully distributed solvers, where the amount of pose information shared between robots at each optimization iteration can be set according to the available communication resources. Sharing additional pose information between neighboring robots effectively creates overlapping optimization blocks in the underlying pose graph, which substantially reduces the number of iterations required to converge. Through extensive experiments on benchmark PGO datasets, we demonstrate the applicability and feasibility of ROBO in different initialization scenarios, using various cost functions, and under different communication regimes. We also analyze the tradeoff between the increased communication and local computation required by ROBO's overlapping blocks and the resulting faster convergence. We show that overlaps with an average inter-robot data cost of only 36 Kb per iteration can converge 3.1$\times$ faster in terms of iterations than state-of-the-art distributed PGO approaches. Furthermore, we develop an asynchronous variant of ROBO that is robust to network delays and suitable for real-world robotic applications.

2603.03495 2026-03-05 cs.RO

Navigating in Uncertain Environments with Heterogeneous Visibility

Jongann Lee, Melkior Ornik

详情
英文摘要

Navigating an environment with uncertain connectivity requires a strategic balance between minimizing the cost of traversal and seeking information to resolve map ambiguities. Unlike previous approaches that rely on local sensing, we utilize a framework where nodes possess varying visibility levels, allowing for observation of distant edges from certain vantage points. We propose a novel heuristic algorithm that balances the cost of detouring to high-visibility locations against the gain in information by optimizing the sum of a custom observation reward and the cost of traversal. We introduce a technique to sample the shortest path on numerous realizations of the environment, which we use to define an edge's utility for observation and to quickly estimate the path with the highest reward. Our approach can be easily adapted to a variety of scenarios by tuning a single hyperparameter that determines the importance of observation. We test our method on a variety of uncertain navigation tasks, including a map based on real-world topographical data. The method demonstrates lower mean cost of traversal compared to a shortest path baseline that does not consider observation and has exponentially lower computational overhead compared to an existing method for balancing observation with path cost minimization.

2603.03491 2026-03-05 cs.LG cs.AR

When Small Variations Become Big Failures: Reliability Challenges in Compute-in-Memory Neural Accelerators

Yifan Qin, Jiahao Zheng, Zheyu Yan, Wujie Wen, Xiaobo Sharon Hu, Yiyu Shi

Comments 2026 International VLSI Symposium on Technology, Systems and Applications (VLSI TSA)

详情
英文摘要

Compute-in-memory (CiM) architectures promise significant improvements in energy efficiency and throughput for deep neural network acceleration by alleviating the von Neumann bottleneck. However, their reliance on emerging non-volatile memory devices introduces device-level non-idealities-such as write variability, conductance drift, and stochastic noise-that fundamentally challenge reliability, predictability, and safety, especially in safety-critical applications. This talk examines the reliability limits of CiM-based neural accelerators and presents a series of techniques that bridge device physics, architecture, and learning algorithms to address these challenges. We first demonstrate that even small device variations can lead to disproportionately large accuracy degradation and catastrophic failures in safety-critical inference workloads, revealing a critical gap between average-case evaluations and worst-case behavior. Building on this insight, we introduce SWIM, a selective write-verify mechanism that strategically applies verification only where it is most impactful, significantly improving reliability while maintaining CiM's efficiency advantages. Finally, we explore a learning-centric solution that improves realistic worst-case performance by training neural networks with right-censored Gaussian noise, aligning training assumptions with hardware-induced variability and enabling robust deployment without excessive hardware overhead. Together, these works highlight the necessity of cross-layer co-design for CiM accelerators and provide a principled path toward dependable, efficient neural inference on emerging memory technologies-paving the way for their adoption in safety- and reliability-critical systems.

2603.03484 2026-03-05 cs.LG cs.AI

Optimal trajectory-guided stochastic co-optimization for e-fuel system design and real-time operation

Jeongdong Kim, Minsu Kim, Jonggeol Na, Junghwan Kim

Comments 29 pages, 6 figures. Supplementary Information included

详情
英文摘要

E-fuels are promising long-term energy carriers supporting the net-zero transition. However, the large combinatorial design-operation spaces under renewable uncertainty make the use of mathematical programming impractical for co-optimizing e-fuel production systems. Here, we present MasCOR, a machine-learning-assisted co-optimization framework that learns from global operational trajectories. By encoding system design and renewable trends, a single MasCOR agent generalizes dynamic operation across diverse configurations and scenarios, substantially simplifying design-operation co-optimization under uncertainty. Benchmark comparisons against state-of-the-art reinforcement learning baselines demonstrate near-optimal performance, while computational costs are substantially lower than those of mathematical programming, enabling rapid parallel evaluation of designs within the co-optimization loop. This framework enables rapid screening of feasible design spaces together with corresponding operational policies. When applied to four potential European sites targeting e-methanol production, MasCOR shows that most locations benefit from reducing system load below 50 MW to achieve carbon-neutral methanol production, with production costs of 1.0-1.2 USD per kg. In contrast, Dunkirk (France), with limited renewable availability and high grid prices, favors system loads above 200 MW and expanded storage to exploit dynamic grid exchange and hydrogen sales to the market. These results underscore the value of the MasCOR framework for site-specific guidance from system design to real-time operation.

2603.03475 2026-03-05 cs.LG cs.AI cs.CL

When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning

Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary

Comments Accepted at ICLR 2026 Workshop on Latent & Implicit Thinking - Going Beyond CoT Reasoning. 19 Pages and 5 Figures

详情
英文摘要

Mathematical reasoning models are widely deployed in education, automated tutoring, and decision support systems despite exhibiting fundamental computational instabilities. We demonstrate that state-of-the-art models (Qwen2.5-Math-7B) achieve 61% accuracy through a mixture of reliable and unreliable reasoning pathways: 18.4% of correct predictions employ stable, faithful reasoning while 81.6% emerge through computationally inconsistent pathways. Additionally, 8.8% of all predictions are silent failures -- confident yet incorrect outputs. Through comprehensive analysis using novel faithfulness metrics, we reveal: (1) reasoning quality shows weak negative correlation with correctness (r=-0.21, p=0.002), reflecting a binary classification threshold artifact rather than a monotonic inverse relationship; (2) scaling from 1.5B to 7B parameters (4.7x increase) provides zero accuracy benefit on our evaluated subset (6% of GSM8K), requiring validation on the complete benchmark; and (3) latent reasoning employs diverse computational strategies, with ~20% sharing CoT-like patterns. These findings highlight that benchmark accuracy can mask computational unreliability, demanding evaluation reforms measuring stability beyond single-sample metrics.

2603.03469 2026-03-05 cs.LG cond-mat.dis-nn cond-mat.stat-mech

Biased Generalization in Diffusion Models

Jerome Garnier-Brun, Luca Biggio, Davide Beltrame, Marc Mézard, Luca Saglietti

Comments 10 pages, 6 figures

详情
英文摘要

Generalization in generative modeling is defined as the ability to learn an underlying distribution from a finite dataset and produce novel samples, with evaluation largely driven by held-out performance and perceived sample quality. In practice, training is often stopped at the minimum of the test loss, taken as an operational indicator of generalization. We challenge this viewpoint by identifying a phase of biased generalization during training, in which the model continues to decrease the test loss while favoring samples with anomalously high proximity to training data. By training the same network on two disjoint datasets and comparing the mutual distances of generated samples and their similarity to training data, we introduce a quantitative measure of bias and demonstrate its presence on real images. We then study the mechanism of bias, using a controlled hierarchical data model where access to exact scores and ground-truth statistics allows us to precisely characterize its onset. We attribute this phenomenon to the sequential nature of feature learning in deep networks, where coarse structure is learned early in a data-independent manner, while finer features are resolved later in a way that increasingly depends on individual training samples. Our results show that early stopping at the test loss minimum, while optimal under standard generalization criteria, may be insufficient for privacy-critical applications.

2603.03464 2026-03-05 cs.LG cs.AI cs.IR

Graph Hopfield Networks: Energy-Based Node Classification with Associative Memory

Abinav Rao, Alex Wa, Rishi Athavale

Comments 10 Pages, 4 Figures, Acceptted at ICLR NFAM Workshop 2026

详情
英文摘要

We introduce Graph Hopfield Networks, whose energy function couples associative memory retrieval with graph Laplacian smoothing for node classification. Gradient descent on this joint energy yields an iterative update interleaving Hopfield retrieval with Laplacian propagation. Memory retrieval provides regime-dependent benefits: up to 2.0~pp on sparse citation networks and up to 5 pp additional robustness under feature masking; the iterative energy-descent architecture itself is a strong inductive bias, with all variants (including the memory-disabled NoMem ablation) outperforming standard baselines on Amazon co-purchase graphs. Tuning enables graph sharpening for heterophilous benchmarks without architectural changes.

2603.03453 2026-03-05 cs.RO

Radar-based Pose Optimization for HD Map Generation from Noisy Multi-Drive Vehicle Fleet Data

Alexander Blumberg, Jonas Merkert, Christoph Stiller

Comments Accepted for the 37th IEEE Intelligent Vehicles Symposium (IV 2026), 7 pages

详情
英文摘要

High-definition (HD) maps are important for autonomous driving, but their manual generation and maintenance is very expensive. This motivates the usage of an automated map generation pipeline. Fleet vehicles provide sufficient sensors for map generation, but their measurements are less precise, introducing noise into the mapping pipeline. This work focuses on mitigating the localization noise component through aligning radar measurements in terms of raw radar point clouds of vehicle poses of different drives and performing pose graph optimization to produce a globally optimized solution between all drives present in the dataset. Improved poses are first used to generate a global radar occupancy map, aimed to facilitate precise on-vehicle localization. Through qualitative analysis we show contrast-rich feature clarity, focusing on omnipresent guardrail posts as the main feature type observable in the map. Second, the improved poses can be used as a basis for an existing lane boundary map generation pipeline, majorly improving map output compared to its original pure line detection based optimization approach.

2603.03452 2026-03-05 cs.RO cs.CV

Impact of Localization Errors on Label Quality for Online HD Map Construction

Alexander Blumberg, Jonas Merkert, Richard Fehler, Fabian Immel, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

Comments Accepted for the 36th IEEE Intelligent Vehicles Symposium (IV 2025), 8 pages

详情
英文摘要

High-definition (HD) maps are crucial for autonomous vehicles, but their creation and maintenance is very costly. This motivates the idea of online HD map construction. To provide a continuous large-scale stream of training data, existing HD maps can be used as labels for onboard sensor data from consumer vehicle fleets. However, compared to current, well curated HD map perception datasets, this fleet data suffers from localization errors, resulting in distorted map labels. We introduce three kinds of localization errors, Ramp, Gaussian, and Perlin noise, to examine their influence on generated map labels. We train a variant of MapTRv2, a state-of-the-art online HD map construction model, on the Argoverse 2 dataset with various levels of localization errors and assess the degradation of model performance. Since localization errors affect distant labels more severely, but are also less significant to driving performance, we introduce a distance-based map construction metric. Our experiments reveal that localization noise affects the model performance significantly. We demonstrate that errors in heading angle exert a more substantial influence than position errors, as angle errors result in a greater distortion of labels as distance to the vehicle increases. Furthermore, we can demonstrate that the model benefits from non-distorted ground truth (GT) data and that the performance decreases more than linearly with the increase in noisy data. Our study additionally provides a qualitative evaluation of the extent to which localization errors influence the construction of HD maps.

2603.03437 2026-03-05 cs.CV

Beyond Accuracy: Evaluating Visual Grounding In Multimodal Medical Reasoning

Anas Zafar, Leema Krishna Murali, Ashish Vashist

Comments 12 pages, 2 figures, 2 tables, medical VQA / multimodal reasoning evaluation

详情
英文摘要

Recent work shows that text-only reinforcement learning with verifiable rewards (RLVR) can match or outperform image-text RLVR on multimodal medical VQA benchmarks, suggesting current evaluation protocols may fail to measure causal visual dependence. We introduce a counterfactual evaluation framework using real, blank, and shuffled images across four medical VQA benchmarks: PathVQA, PMC-VQA, SLAKE, and VQA-RAD. Beyond accuracy, we measure Visual Reliance Score (VRS), Image Sensitivity (IS), and introduce Hallucinated Visual Reasoning Rate (HVRR) to detect cases where models generate visual claims despite producing image-invariant answers. Our findings reveal that RLVR improves accuracy while degrading visual grounding: text-only RLVR achieves negative VRS on PathVQA (-0.09), performing better with mismatched images, while image-text RLVR reduces image sensitivity to 39.8% overall despite improving accuracy. On VQA-RAD, both variants achieve 63% accuracy through different mechanisms: text-only RLVR retains 81% performance with blank images, while image-text RLVR shows only 29% image sensitivity. Models generate visual claims in 68-74% of responses, yet 38-43% are ungrounded (HVRR). These findings demonstrate that accuracy-only rewards enable shortcut exploitation, and progress requires grounding-aware evaluation protocols and training objectives that explicitly enforce visual dependence.

2603.03409 2026-03-05 cs.LG

A Short Note on a Variant of the Squint Algorithm

Haipeng Luo

详情
英文摘要

This short note describes a simple variant of the Squint algorithm of Koolen and Van Erven [2015] for the classic expert problem. Via an equally simple modification of their proof, we prove that this variant ensures a regret bound that resembles the one shown in a recent work by Freund et al. [2026] for a variant of the NormalHedge algorithm [Chaudhuri et al., 2009].

2603.03407 2026-03-05 cs.CL

Tracing Pharmacological Knowledge In Large Language Models

Basil Hasan Khwaja, Dylan Chen, Guntas Toor, Anastasiya Kuznetsova

Comments Accepted, Learning Meaningful Representations of Life (LMRL) Workshop @ ICLR 2026

详情
英文摘要

Large language models (LLMs) have shown strong empirical performance across pharmacology and drug discovery tasks, yet the internal mechanisms by which they encode pharmacological knowledge remain poorly understood. In this work, we investigate how drug-group semantics are represented and retrieved within Llama-based biomedical language models using causal and probing-based interpretability methods. We apply activation patching to localize where drug-group information is stored across model layers and token positions, and complement this analysis with linear probes trained on token-level and sum-pooled activations. Our results demonstrate that early layers play a key role in encoding drug-group knowledge, with the strongest causal effects arising from intermediate tokens within the drug-group span rather than the final drug-group token. Linear probing further reveals that pharmacological semantics are distributed across tokens and are already present in the embedding space, with token-level probes performing near chance while sum-pooled representations achieve maximal accuracy. Together, these findings suggest that drug-group semantics in LLMs are not localized to single tokens but instead arise from distributed representations. This study provides the first systematic mechanistic analysis of pharmacological knowledge in LLMs, offering insights into how biomedical semantics are encoded in large language models.

2603.03402 2026-03-05 cs.LG cs.AI

Heterogeneous Time Constants Improve Stability in Equilibrium Propagation

Yoshimasa Kubo, Suhani Pragnesh Modi, Smit Patel

详情
英文摘要

Equilibrium propagation (EP) is a biologically plausible alternative to backpropagation for training neural networks. However, existing EP models use a uniform scalar time step dt, which corresponds biologically to a membrane time constant that is heterogeneous across neurons. Here, we introduce heterogeneous time steps (HTS) for EP by assigning neuron-specific time constants drawn from biologically motivated distributions. We show that HTS improves training stability while maintaining competitive task performance. These results suggest that incorporating heterogeneous temporal dynamics enhances both the biological realism and robustness of equilibrium propagation.

2603.03390 2026-03-05 cs.RO cs.AI

Multi-Agent-Based Simulation of Archaeological Mobility in Uneven Landscapes

Chairi Kiourt, Vassilis Evangelidis, Dimitris Grigoropoulos

详情
英文摘要

Understanding mobility, movement, and interaction in archaeological landscapes is essential for interpreting past human behavior, transport strategies, and spatial organization, yet such processes are difficult to reconstruct from static archaeological evidence alone. This paper presents a multi-agent-based modeling framework for simulating archaeological mobility in uneven landscapes, integrating realistic terrain reconstruction, heterogeneous agent modeling, and adaptive navigation strategies. The proposed approach combines global path planning with local dynamic adaptation, through reinforcment learning, enabling agents to respond efficiently to dynamic obstacles and interactions without costly global replanning. Real-world digital elevation data are processed into high-fidelity three-dimensional environments, preserving slope and terrain constraints that directly influence agent movement. The framework explicitly models diverse agent types, including human groups and animal-based transport systems, each parameterized by empirically grounded mobility characteristics such as load, slope tolerance, and physical dimensions. Two archaeological-inspired use cases demonstrate the applicability of the approach: a terrain-aware pursuit and evasion scenario and a comparative transport analysis involving pack animals and wheeled carts. The results highlight the impact of terrain morphology, visibility, and agent heterogeneity on movement outcomes, while the proposed hybrid navigation strategy provides a computationally efficient and interpretable solution for large-scale, dynamic archaeological simulations.

2603.03389 2026-03-05 cs.LG

Towards Improved Sentence Representations using Token Graphs

Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Zorah Lähner, Moshe Eliasof

Comments ICLR 2026, 29 Pages, 17 Tables, 5 Figures

详情
英文摘要

Obtaining a single-vector representation from a Large Language Model's (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state-of-the-art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters and speeds up the training time by over 100x compared with parameter-efficient fine-tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs. Our code is published at https://github.com/ipsitmantri/GLOT.

2603.03380 2026-03-05 cs.RO cs.AI

LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics

Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

详情
英文摘要

Vision-Language-Action (VLA) models provide a unified framework for perception, language conditioning, and action generation, but many existing systems remain difficult to deploy in embedded robotic settings because of their computational requirements and inference latency. In this paper, we present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference through the \texttt{llama.cpp} runtime. Under our deployment configuration, LiteVLA-Edge achieves a mean end-to-end latency of 150.5\,ms (approximately 6.6\,Hz) while operating entirely offline within a ROS~2-integrated perception--reasoning--action pipeline. Rather than introducing a new policy objective, our contribution is a practical systems path for executing compact multimodal control models locally on embedded hardware while preserving modular interfaces between perception, reasoning, and actuation. These results establish timing feasibility for reactive language-conditioned control and provide a reproducible baseline for future task-level evaluation of on-device VLAs in robotics.

2603.03336 2026-03-05 cs.CL cs.LG

Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

Angel Rodrigo Avelar Menendez, Yufeng Liu, Xiaowu Dai

详情
英文摘要

Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous confidence intervals for pairwise utility differences. This approach yields statistically valid marginal and simultaneous confidence sets for prompt-specific ranks. Our framework connects recent advances in rank inference to contextual preference learning and provides tools for robust ranking-based decision-making. Empirically, using large-scale human preference data from LLM evaluations, we show that rankings vary substantially across prompt characteristics and that many apparent rank differences are not statistically distinguishable. We further demonstrate how uncertainty-aware rankings identify dominance only when supported by the data and otherwise return partial orders.

2603.03335 2026-03-05 cs.CL

Compressed Sensing for Capability Localization in Large Language Models

Anna Bair, Yixuan Even Xu, Mingjie Sun, J. Zico Kolter

详情
英文摘要

Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five task-specific heads can degrade performance by up to $65\%$ on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety. Code is released at https://github.com/locuslab/llm-components.

2603.03334 2026-03-05 cs.CL

The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

Bianca Raimondi, Francesco Pivi, Davide Evangelista, Maurizio Gabbrielli

Comments Preprint. Under review

详情
英文摘要

The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1{,}500 originally authored questions by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three option choices are provided for each question, with exactly one of them being correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: https://github.com/biancaraimondi/CompMath-MCQ.git

2603.03333 2026-03-05 cs.CL

Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding

Jeongtae Lee, Minjung Jo, Hyunjoon Jeong, Gunho Park, Sunghyeon Woo, Joonghoon Kim, Se Jung Kwon, Dongsoo Lee

Comments 14 pages, 6 figures, 10 tables

详情
英文摘要

Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft tokens to the predictive distribution of the target model via Monte Carlo dropout applied exclusively to the LM head, enabling sampling-based acceptance decisions. By generating multiple decoding paths, our method forms an empirical token distribution against which draft tokens are evaluated for consistency. This acceptance mechanism enables the model to adaptively control the size of decoding paths under an appropriate dropout probability, preventing substantial distortion of the target model predictive distribution. The proposed method operates in a training-free, data-free, and calibration-free manner, requires no architectural modification to pretrained models, and can be orthogonally integrated with a wide range of existing speculative decoding and inference acceleration techniques. Experiments across multiple benchmarks demonstrate that our approach increases acceptance length while maintaining competitive task performance, yielding inference speedups ranging from 1.09x to 1.33x over the standard baseline, and up to an additional 1.09x speedup when applied on top of EAGLE3.

2603.03330 2026-03-05 cs.CL cs.AI

Certainty robustness: Evaluating LLM stability under self-challenging prompts

Mohammadreza Saadat, Steve Nemzer

Comments 20 pages, 7 tables Benchmark and evaluation study of large language models

详情
英文摘要

Large language models (LLMs) often present answers with high apparent confidence despite lacking an explicit mechanism for reasoning about certainty or truth. While existing benchmarks primarily evaluate single-turn accuracy, truthfulness or confidence calibration, they do not capture how models behave when their responses are challenged in interactive settings. We introduce the Certainty Robustness Benchmark, a two-turn evaluation framework that measures how LLMs balance stability and adaptability under self-challenging prompts such as uncertainty ("Are you sure?") and explicit contradiction ("You are wrong!"), alongside numeric confidence elicitation. Using 200 reasoning and mathematics questions from LiveBench, we evaluate four state-of-the-art LLMs and distinguish between justified self-corrections and unjustified answer changes. Our results reveal substantial differences in interactive reliability that are not explained by baseline accuracy alone: some models abandon correct answers under conversational pressure, while others demonstrate strong resistance to challenge and better alignment between confidence and correctness. These findings identify certainty robustness as a distinct and critical dimension of LLM evaluation, with important implications for alignment, trustworthiness and real-world deployment.

2603.03329 2026-03-05 cs.CL cs.AI

AutoHarness: improving LLM agents by automatically synthesizing a code harness

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy

Comments agent harness, code synthesis, self-improvement, code-as-policy, text games

详情
英文摘要

Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.

2603.03327 2026-03-05 cs.CL cs.AI

A benchmark for joint dialogue satisfaction, emotion recognition, and emotion state transition prediction

Jing Bian, Haoxiang Su, Liting Jiang, Di Wu, Ruiyu Fang, Xiaomeng Huang, Yanbing Li, Shuangyong Song, Hao Huang

详情
英文摘要

User satisfaction is closely related to enterprises, as it not only directly reflects users' subjective evaluation of service quality or products, but also affects customer loyalty and long-term business revenue. Monitoring and understanding user emotions during interactions helps predict and improve satisfaction. However, relevant Chinese datasets are limited, and user emotions are dynamic; relying on single-turn dialogue cannot fully track emotional changes across multiple turns, which may affect satisfaction prediction. To address this, we constructed a multi-task, multi-label Chinese dialogue dataset that supports satisfaction recognition, as well as emotion recognition and emotional state transition prediction, providing new resources for studying emotion and satisfaction in dialogue systems.

2603.03326 2026-03-05 cs.CL cs.AI

Controllable and explainable personality sliders for LLMs at inference time

Florian Hoppe, David Khachaturov, Robert Mullins, Mark Huasong Meng

Comments 20 pages, 18 figures

详情
英文摘要

Aligning Large Language Models (LLMs) with specific personas typically relies on expensive and monolithic Supervised Fine-Tuning (SFT) or RLHF. While effective, these methods require training distinct models for every target personality profile. Inference-time activation steering offers a parameter-efficient alternative, yet naive approaches fail to control multiple traits simultaneously due to destructive vector interference. In this work, we propose a modular framework for continuous, multi-dimensional personality control. Our key innovation is Sequential Adaptive Steering (SAS): a method that orthogonalizes steering vectors by training subsequent probes on the residual stream shifted by prior interventions. This approach transforms steering vectors into reusable primitives, allowing users to instantly synthesize complex, high-fidelity personality profiles by simply adjusting coefficients alpha. We validate our framework on the Big Five personality traits, demonstrating that it outperforms naive baselines in both goal adherence and coherence, enabling precise, holistic personality modulation without updating model parameters.

2603.03325 2026-03-05 cs.CL cs.AI cs.LG

IntPro: A Proxy Agent for Context-Aware Intent Understanding via Retrieval-conditioned Inference

Guanming Liu, Meng Wu, Peng Zhang, Yu Zhang, Yubo Shu, Xianliang Huang, Kainan Tu, Ning Gu, Liuxin Zhang, Qianying Wang, Tun Lu

详情
英文摘要

Large language models (LLMs) have become integral to modern Human-AI collaboration workflows, where accurately understanding user intent serves as a crucial step for generating satisfactory responses. Context-aware intent understanding, which involves inferring user intentions from situational environments, is inherently challenging because it requires reasoning over both the immediate context and the user's underlying motivations that drive their behavior. Moreover, existing approaches often treat intent understanding as a static recognition task, overlooking users' accumulated intent patterns that could provide valuable references for more accurate and generalizable understanding. To address this gap, we propose IntPro, a proxy agent that learns to adapt to individual users via retrieval-conditioned intent inference. We design intent explanations that abstract how contextual signals connect to expressed intents, and store them in an individual intent history library for retrieval. We train IntPro through supervised fine-tuning on retrieval-conditioned trajectories and multi-turn Group Relative Policy Optimization (GRPO) with tool-aware reward functions, enabling the agent to learn when to leverage historical intent patterns and when to infer directly. Experiments across three diverse scenarios (Highlight-Intent, MIntRec2.0, and Weibo Post-Sync) demonstrate that IntPro achieves strong intent understanding performance with effective context-aware reasoning capabilities across different scenarios and model types.

2603.03324 2026-03-05 cs.CL cs.AI

Controlling Chat Style in Language Models via Single-Direction Editing

Zhenyu Xu, Victor S. Sheng

详情
英文摘要

Controlling stylistic attributes in large language models (LLMs) remains challenging, with existing approaches relying on either prompt engineering or post-training alignment. This paper investigates this challenge through the lens of representation engineering, testing the hypothesis that distinct stylistic attributes - from emotional tone to linguistic structure - are encoded as linear directions in the model's activation space. We provide strong empirical evidence for this hypothesis across a wide range of styles and, based on this finding, present a lightweight, training-free method for precise style control. Our approach supports linear style composition, enhances safety by ablating undesirable behaviors, and, as confirmed by experiments on over a dozen models, achieves high style adherence while preserving core capabilities at minimal computational cost.

2603.03323 2026-03-05 cs.CL cs.AI

Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi

Comments 10 Pages

详情
英文摘要

Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

2603.03322 2026-03-05 cs.CL cs.AI

Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Chaoqun Yang, Xinyu Lin, Shulin Li, Wenjie Wang, Ruihan Guo, Fuli Feng, Tat-Seng Chua

详情
英文摘要

Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI's capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, leading to inevitable data contamination where models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs render static benchmarks quickly outdated, failing to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench-Bio, a dynamic and fully automated benchmark designed to evaluate AI's biological knowledge discovery ability. DBench-Bio employs a three-stage pipeline: (1) data acquisition of rigorous, authoritative paper abstracts; (2) QA extraction utilizing LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filter to ensure quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. Extensive evaluations of SOTA models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for AI research community to catalyze the development of knowledge discovery.

2603.03321 2026-03-05 cs.CL cs.AI

DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following

Nardine Basta, Dali Kaafar

Comments PAKDD 2026

详情
英文摘要

Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing satisfaction--tasks currently dependent on manual annotation and uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type-theoretic framework using dual LLM agents to automate instruction decomposition into typed predicates and implement type-specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria--semantic equivalence for content predicates, exact precision for numerical predicates--mirroring empirically observed human assessment patterns. Extended to multi-turn dialogues through history-aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single-turn methods fail. Validation demonstrates 90.38% accuracy (26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.

2603.03320 2026-03-05 cs.CL cs.AI

From We to Me: Theory Informed Narrative Shift with Abductive Reasoning

Jaikrishna Manojkumar Patil, Divyagna Bavikadi, Kaustuv Mukherji, Ashby Steward-Nolan, Peggy-Jean Allin, Tumininu Awonuga, Joshua Garland, Paulo Shakarian

详情
英文摘要

Effective communication often relies on aligning a message with an audience's narrative and worldview. Narrative shift involves transforming text to reflect a different narrative framework while preserving its original core message--a task we demonstrate is significantly challenging for current Large Language Models (LLMs). To address this, we propose a neurosymbolic approach grounded in social science theory and abductive reasoning. Our method automatically extracts rules to abduce the specific story elements needed to guide an LLM through a consistent and targeted narrative transformation. Across multiple LLMs, abduction-guided transformed stories shifted the narrative while maintaining the fidelity with the original story. For example, with GPT-4o we outperform the zero-shot LLM baseline by 55.88% for collectivistic to individualistic narrative shift while maintaining superior semantic similarity with the original stories (40.4% improvement in KL divergence). For individualistic to collectivistic transformation, we achieve comparable improvements. We show similar performance across both directions for Llama-4, and Grok-4 and competitive performance for Deepseek-R1.