arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1220
2602.16855 2026-02-20 cs.AI cs.CL

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan

Comments 25 pages, 11 figures, 11 tables

详情
英文摘要

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. GUI-Owl-1.5 achieves state-of-the-art results on more than 20+ GUI benchmarks on open-source models: (1) on GUI automation tasks, it obtains 56.5 on OSWorld, 71.6 on AndroidWorld, and 48.4 on WebArena; (2) on grounding tasks, it obtains 80.3 on ScreenSpotPro; (3) on tool-calling tasks, it obtains 47.6 on OSWorld-MCP, and 46.8 on MobileWorld; (4) on memory and knowledge tasks, it obtains 75.5 on GUI-Knowledge Bench. GUI-Owl-1.5 incorporates several key innovations: (1) Hybird Data Flywheel: we construct the data pipeline for UI understanding and trajectory generation based on a combination of simulated environments and cloud-based sandbox environments, in order to improve the efficiency and quality of data collection. (2) Unified Enhancement of Agent Capabilities: we use a unified thought-synthesis pipeline to enhance the model's reasoning capabilities, while placing particular emphasis on improving key agent abilities, including Tool/MCP use, memory and multi-agent adaptation; (3) Multi-platform Environment RL Scaling: We propose a new environment RL algorithm, MRPO, to address the challenges of multi-platform conflicts and the low training efficiency of long-horizon tasks. The GUI-Owl-1.5 models are open-sourced, and an online cloud-sandbox demo is available at https://github.com/X-PLUG/MobileAgent.

2602.16849 2026-02-20 cs.LG math.OC stat.ML

On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang

详情
英文摘要

We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.

2602.16846 2026-02-20 cs.RO

Sound of Touch: Active Acoustic Tactile Sensing via String Vibrations

Xili Yi, Ying Xing, Zachary Manchester, Nima Fazeli

Comments 12 pages, 10 figures

详情
英文摘要

Distributed tactile sensing remains difficult to scale over large areas: dense sensor arrays increase wiring, cost, and fragility, while many alternatives provide limited coverage or miss fast interaction dynamics. We present Sound of Touch, an active acoustic tactile-sensing methodology that uses vibrating tensioned strings as sensing elements. The string is continuously excited electromagnetically, and a small number of pickups (contact microphones) observe spectral changes induced by contact. From short-duration audio signals, our system estimates contact location and normal force, and detects slip. To guide design and interpret the sensing mechanism, we derive a physics-based string-vibration simulator that predicts how contact position and force shift vibration modes. Experiments demonstrate millimeter-scale localization, reliable force estimation, and real-time slip detection. Our contributions are: (i) a lightweight, scalable string-based tactile sensing hardware concept for instrumenting extended robot surfaces; (ii) a physics-grounded simulation and analysis tool for contact-induced spectral shifts; and (iii) a real-time inference pipeline that maps vibration measurements to contact state.

2602.16843 2026-02-20 cs.CL

BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam

Comments Accepted in 2nd LoResLM at EACL 2026

详情
英文摘要

Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $ρ= 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.

2602.16842 2026-02-20 cs.LG

What is the Value of Censored Data? An Exact Analysis for the Data-driven Newsvendor

Rachitesh Kumar, Omar Mouchtaki

详情
英文摘要

We study the offline data-driven newsvendor problem with censored demand data. In contrast to prior works where demand is fully observed, we consider the setting where demand is censored at the inventory level and only sales are observed; sales match demand when there is sufficient inventory, and equal the available inventory otherwise. We provide a general procedure to compute the exact worst-case regret of classical data-driven inventory policies, evaluated over all demand distributions. Our main technical result shows that this infinite-dimensional, non-convex optimization problem can be reduced to a finite-dimensional one, enabling an exact characterization of the performance of policies for any sample size and censoring levels. We leverage this reduction to derive sharp insights on the achievable performance of standard inventory policies under demand censoring. In particular, our analysis of the Kaplan-Meier policy shows that while demand censoring fundamentally limits what can be learned from passive sales data, just a small amount of targeted exploration at high inventory levels can substantially improve worst-case guarantees, enabling near-optimal performance even under heavy censoring. In contrast, when the point-of-sale system does not record stockout events and only reports realized sales, a natural and commonly used approach is to treat sales as demand. Our results show that policies based on this sales-as-demand heuristic can suffer severe performance degradation as censored data accumulates, highlighting how the quality of point-of-sale information critically shapes what can, and cannot, be learned offline.

2602.16839 2026-02-20 cs.LG cs.CL

Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao

Comments ICLR 2026, 15 pages

详情
英文摘要

Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.

2602.16836 2026-02-20 cs.CL

Claim Automation using Large Language Model

Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong

Comments 46 pages, 12 figures. Code and data processing pipeline described

详情
英文摘要

While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.

2602.16833 2026-02-20 cs.LG cs.AI

VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

详情
英文摘要

Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.

2602.16832 2026-02-20 cs.AI cs.CL

IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages

Priyaranjan Pattnayak, Sanchari Chowdhuri

Comments Accepted in EACL Industry Track Oral, 2026

详情
英文摘要

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied. We introduce \textbf{Indic Jailbreak Robustness (IJR)}, a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (2.1 Billion speakers), covering 45216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach 1.0 with refusals collapsing. (2) English to Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized or mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (approx 0.28 to 0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.

2602.16829 2026-02-20 cs.LG cs.AI cs.NE

Learning under noisy supervision is governed by a feedback-truth gap

Elan Schonfeld, Elias Wisnia

Comments 33 pages, 5 figures, 10 extended data figures, 4 extended data tables; 10-page supplementary information

详情
英文摘要

When feedback is absorbed faster than task structure can be evaluated, the learner will favor feedback over truth. A two-timescale model shows this feedback-truth gap is inevitable whenever the two rates differ and vanishes only when they match. We test this prediction across neural networks trained with noisy labels (30 datasets, 2,700 runs), human probabilistic reversal learning (N = 292), and human reward/punishment learning with concurrent EEG (N = 25). In each system, truth is defined operationally: held-out labels, the objectively correct option, or the participant's pre-feedback expectation - the only non-circular reference decodable from post-feedback EEG. The gap appeared universally but was regulated differently: dense networks accumulated it as memorization; sparse-residual scaffolding suppressed it; humans generated transient over-commitment that was actively recovered. Neural over-commitment (~0.04-0.10) was amplified tenfold into behavioral commitment (d = 3.3-3.9). The gap is a fundamental constraint on learning under noisy supervision; its consequences depend on the regulation each system employs.

2602.16827 2026-02-20 cs.AI

An order-oriented approach to scoring hesitant fuzzy elements

Luis Merino, Gabriel Navarro, Carlos Salvatierra, Evangelina Santos

详情
英文摘要

Traditional scoring approaches on hesitant fuzzy sets often lack a formal base in order theory. This paper proposes a unified framework, where each score is explicitly defined with respect to a given order. This order-oriented perspective enables more flexible and coherent scoring mechanisms. We examine several classical orders on hesitant fuzzy elements, that is, nonempty subsets in [0,1], and show that, contrary to prior claims, they do not induce lattice structures. In contrast, we prove that the scores defined with respect to the symmetric order satisfy key normative criteria for scoring functions, including strong monotonicity with respect to unions and the Gärdenfors condition. Following this analysis, we introduce a class of functions, called dominance functions, for ranking hesitant fuzzy elements. They aim to compare hesitant fuzzy elements relative to control sets incorporating minimum acceptability thresholds. Two concrete examples of dominance functions for finite sets are provided: the discrete dominance function and the relative dominance function. We show that these can be employed to construct fuzzy preference relations on typical hesitant fuzzy sets and support group decision-making.

2602.16826 2026-02-20 cs.LG cs.AI

HiVAE: Hierarchical Latent Variables for Scalable Theory of Mind

Nigel Doering, Rahath Malladi, Arshia Sangwan, David Danks, Tauhidur Rahman

Comments Accepted at the Workshop on Theory of Mind for AI (ToM4AI) at the 40th AAAI Conference on Artificial Intelligence (AAAI-26), Singapore, 2026

详情
英文摘要

Theory of mind (ToM) enables AI systems to infer agents' hidden goals and mental states, but existing approaches focus mainly on small human understandable gridworld spaces. We introduce HiVAE, a hierarchical variational architecture that scales ToM reasoning to realistic spatiotemporal domains. Inspired by the belief-desire-intention structure of human cognition, our three-level VAE hierarchy achieves substantial performance improvements on a 3,185-node campus navigation task. However, we identify a critical limitation: while our hierarchical structure improves prediction, learned latent representations lack explicit grounding to actual mental states. We propose self-supervised alignment strategies and present this work to solicit community feedback on grounding approaches.

2602.16825 2026-02-20 cs.RO

RRT$^η$: Sampling-based Motion Planning and Control from STL Specifications using Arithmetic-Geometric Mean Robustness

Ahmad Ahmad, Shuo Liu, Roberto Tron, Calin Belta

详情
英文摘要

Sampling-based motion planning has emerged as a powerful approach for robotics, enabling exploration of complex, high-dimensional configuration spaces. When combined with Signal Temporal Logic (STL), a temporal logic widely used for formalizing interpretable robotic tasks, these methods can address complex spatiotemporal constraints. However, traditional approaches rely on min-max robustness measures that focus only on critical time points and subformulae, creating non-smooth optimization landscapes with sharp decision boundaries that hinder efficient tree exploration. We propose RRT$^η$, a sampling-based planning framework that integrates the Arithmetic-Geometric Mean (AGM) robustness measure to evaluate satisfaction across all time points and subformulae. Our key contributions include: (1) AGM robustness interval semantics for reasoning about partial trajectories during tree construction, (2) an efficient incremental monitoring algorithm computing these intervals, and (3) enhanced Direction of Increasing Satisfaction vectors leveraging Fulfillment Priority Logic (FPL) for principled objective composition. Our framework synthesizes dynamically feasible control sequences satisfying STL specifications with high robustness while maintaining the probabilistic completeness and asymptotic optimality of RRT$^\ast$. We validate our approach on three robotic systems. A double integrator point robot, a unicycle mobile robot, and a 7-DOF robot arm, demonstrating superior performance over traditional STL robustness-based planners in multi-constraint scenarios with limited guidance signals.

2602.16823 2026-02-20 cs.LG cs.LO

Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

Itamar Hadad, Guy Katz, Shahaf Bassan

Comments To appear in ICLR 2026

详情
英文摘要

*Automated circuit discovery* is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with *provable guarantees*. We focus on three types of guarantees: (1) *input domain robustness*, ensuring the circuit agrees with the model across a continuous input region; (2) *robust patching*, certifying circuit alignment under continuous patching perturbations; and (3) *minimality*, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we uncover a diverse set of novel theoretical connections among these three families of guarantees, with critical implications for the convergence of our algorithms. Finally, we conduct experiments with state-of-the-art verifiers on various vision models, showing that our algorithms yield circuits with substantially stronger robustness guarantees than standard circuit discovery methods, establishing a principled foundation for provable circuit discovery.

2602.16814 2026-02-20 cs.AI

Node Learning: A Framework for Adaptive, Decentralised and Collaborative Network Edge AI

Eiman Kanjo, Mustafa Aslanov

Comments 16 pages, 3 figures, 3 tables, this paper introduces a new concept

详情
英文摘要

The expansion of AI toward the edge increasingly exposes the cost and fragility of cen- tralised intelligence. Data transmission, latency, energy consumption, and dependence on large data centres create bottlenecks that scale poorly across heterogeneous, mobile, and resource-constrained environments. In this paper, we introduce Node Learning, a decen- tralised learning paradigm in which intelligence resides at individual edge nodes and expands through selective peer interaction. Nodes learn continuously from local data, maintain their own model state, and exchange learned knowledge opportunistically when collaboration is beneficial. Learning propagates through overlap and diffusion rather than global synchro- nisation or central aggregation. It unifies autonomous and cooperative behaviour within a single abstraction and accommodates heterogeneity in data, hardware, objectives, and connectivity. This concept paper develops the conceptual foundations of this paradigm, contrasts it with existing decentralised approaches, and examines implications for communi- cation, hardware, trust, and governance. Node Learning does not discard existing paradigms, but places them within a broader decentralised perspective

2602.16812 2026-02-20 cs.AI

NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron Crystallography

Zhongcan Xiao, Leyi Zhang, Guannan Zhang, Xiaoping Wang

详情
英文摘要

Large-scale facilities increasingly face analysis and reporting latency as the limiting step in scientific throughput, particularly for structurally and magnetically complex samples that require iterative reduction, integration, refinement, and validation. To improve time-to-result and analysis efficiency, NeuDiff Agent is introduced as a governed, tool-using AI workflow for TOPAZ at the Spallation Neutron Source that takes instrument data products through reduction, integration, refinement, and validation to a validated crystal structure and a publication-ready CIF. NeuDiff Agent executes this established pipeline under explicit governance by restricting actions to allowlisted tools, enforcing fail-closed verification gates at key workflow boundaries, and capturing complete provenance for inspection, auditing, and controlled replay. Performance is assessed using a fixed prompt protocol and repeated end-to-end runs with two large language model backends, with user and machine time partitioned and intervention burden and recovery behaviors quantified under gating. In a reference-case benchmark, NeuDiff Agent reduces wall time from 435 minutes (manual) to 86.5(4.7) to 94.4(3.5) minutes (4.6-5.0x faster) while producing a validated CIF with no checkCIF level A or B alerts. These results establish a practical route to deploy agentic AI in facility crystallography while preserving traceability and publication-facing validation requirements.

2602.16811 2026-02-20 cs.CL cs.AI

Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis

详情
英文摘要

Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.

2602.16807 2026-02-20 cs.AI cs.DM math.CO

Improved Upper Bounds for Slicing the Hypercube

Duncan Soiffer, Nathaniel Itty, Christopher D. Rosin, Blake Bruell, Mason DiCicco, Gábor N. Sárközy, Ryan Offstein, Daniel Reichman

详情
英文摘要

A collection of hyperplanes $\mathcal{H}$ slices all edges of the $n$-dimensional hypercube $Q_n$ with vertex set $\{-1,1\}^n$ if, for every edge $e$ in the hypercube, there exists a hyperplane in $\mathcal{H}$ intersecting $e$ in its interior. Let $S(n)$ be the minimum number of hyperplanes needed to slice $Q_n$. We prove that $S(n) \leq \lceil \frac{4n}{5} \rceil$, except when $n$ is an odd multiple of $5$, in which case $S(n) \leq \frac{4n}{5} +1$. This improves upon the previously known upper bound of $S(n) \leq \lceil\frac{5n}{6} \rceil$ due to Paterson reported in 1971. We also obtain new lower bounds on the maximum number of edges in $Q_n$ that can be sliced using $k<n$ hyperplanes. We prove the improved upper bound on $S(n)$ by constructing $8$ hyperplanes slicing $Q_{10}$ aided by the recently introduced CPro1: an automatic tool that uses reasoning LLMs coupled with automated hyperparameter tuning to create search algorithms for the discovery of mathematical constructions.

2602.16805 2026-02-20 cs.AI cs.LG

Simple Baselines are Competitive with Code Evolution

Yonatan Gideoni, Sebastian Risi, Yarin Gal

详情
英文摘要

Code evolution is a family of techniques that rely on large language models to search through possible computer programs by evolving or mutating existing code. Many proposed code evolution pipelines show impressive performance but are often not compared to simpler baselines. We test how well two simple baselines do over three domains: finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions. We find that simple baselines match or exceed much more sophisticated methods in all three. By analyzing these results we find various shortcomings in how code evolution is both developed and used. For the mathematical bounds, a problem's search space and domain knowledge in the prompt are chiefly what dictate a search's performance ceiling and efficiency, with the code evolution pipeline being secondary. Thus, the primary challenge in finding improved bounds is designing good search spaces, which is done by domain experts, and not the search itself. When designing agentic scaffolds we find that high variance in the scaffolds coupled with small datasets leads to suboptimal scaffolds being selected, resulting in hand-designed majority vote scaffolds performing best. We propose better evaluation methods that reduce evaluation stochasticity while keeping the code evolution economically feasible. We finish with a discussion of avenues and best practices to enable more rigorous code evolution in future work.

2602.16802 2026-02-20 cs.CL cs.AI cs.LG

References Improve LLM Alignment in Non-Verifiable Domains

Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan

Comments ICLR 2026 Camera Ready

详情
英文摘要

While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.

2602.16796 2026-02-20 cs.LG math.OC

Efficient Tail-Aware Generative Optimization via Flow Model Fine-Tuning

Zifan Wang, Riccardo De Santi, Xiaoyu Mo, Michael M. Zavlanos, Andreas Krause, Karl H. Johansson

Comments 33 pages

详情
英文摘要

Fine-tuning pre-trained diffusion and flow models to optimize downstream utilities is central to real-world deployment. Existing entropy-regularized methods primarily maximize expected reward, providing no mechanism to shape tail behavior. However, tail control is often essential: the lower tail determines reliability by limiting low-reward failures, while the upper tail enables discovery by prioritizing rare, high-reward outcomes. In this work, we present Tail-aware Flow Fine-Tuning (TFFT), a principled and efficient distributional fine-tuning algorithm based on the Conditional Value-at-Risk (CVaR). We address two distinct tail-shaping goals: right-CVaR for seeking novel samples in the high-reward tail and left-CVaR for controlling worst-case samples in the low-reward tail. Unlike prior approaches that rely on non-linear optimization, we leverage the variational dual formulation of CVaR to decompose it into a decoupled two-stage procedure: a lightweight one-dimensional threshold optimization step, and a single entropy-regularized fine-tuning process via a specific pseudo-reward. This decomposition achieves CVaR fine-tuning efficiently with computational cost comparable to standard expected fine-tuning methods. We demonstrate the effectiveness of TFFT across illustrative experiments, high-dimensional text-to-image generation, and molecular design.

2602.16790 2026-02-20 cs.SD cs.MM

Generative Audio Extension and Morphing

Prem Seetharaman, Oriol Nieto, Justin Salamon

Comments Accepted to ICASSP 2026

详情
英文摘要

In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally, we validate our results through a subjective listener test, where subjects gave positive ratings to the proposed model generations. This technique paves the way for more controllable and expressive generative sound frameworks, enabling sound designers to focus less on tedious, repetitive tasks and more on their actual creative process.

2602.16787 2026-02-20 cs.LG cs.CL

Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency

Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad

详情
英文摘要

Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals is limited. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.

2602.16784 2026-02-20 cs.LG cs.CL stat.ME

Omitted Variable Bias in Language Models Under Distribution Shift

Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael

详情
英文摘要

Despite their impressive performance on a wide variety of tasks, modern language models remain susceptible to distribution shifts, exhibiting brittle behavior when evaluated on data that differs in distribution from their training data. In this paper, we describe how distribution shifts in language models can be separated into observable and unobservable components, and we discuss how established approaches for dealing with distribution shift address only the former. Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models. To address this challenge, we introduce a framework that maps the strength of the omitted variables to bounds on the worst-case generalization performance of language models under distribution shift. In empirical experiments, we show that using these bounds directly in language model evaluation and optimization provides more principled measures of out-of-distribution performance, improves true out-of-distribution performance relative to standard distribution shift adjustment methods, and further enables inference about the strength of the omitted variables when target distribution labels are available.

2602.16764 2026-02-20 cs.LG cs.RO cs.SY eess.SY

Machine Learning Argument of Latitude Error Model for LEO Satellite Orbit and Covariance Correction

Alex Moody, Penina Axelrad, Rebecca Russell

Comments Appearing in 2026 IEEE Aerospace Conference

详情
英文摘要

Low Earth orbit (LEO) satellites are leveraged to support new position, navigation, and timing (PNT) service alternatives to GNSS. These alternatives require accurate propagation of satellite position and velocity with a realistic quantification of uncertainty. It is commonly assumed that the propagated uncertainty distribution is Gaussian; however, the validity of this assumption can be quickly compromised by the mismodeling of atmospheric drag. We develop a machine learning approach that corrects error growth in the argument of latitude for a diverse set of LEO satellites. The improved orbit propagation accuracy extends the applicability of the Gaussian assumption and modeling of the errors with a corrected mean and covariance. We compare the performance of a time-conditioned neural network and a Gaussian Process on datasets computed with an open source orbit propagator and publicly available Vector Covariance Message (VCM) ephemerides. The learned models predict the argument of latitude error as a Gaussian distribution given parameters from a single VCM epoch and reverse propagation errors. We show that this one-dimensional model captures the effect of mismodeled drag, which can be mapped to the Cartesian state space. The correction method only updates information along the dimensions of dominant error growth, while maintaining the physics-based propagation of VCM covariance in the remaining dimensions. We therefore extend the utility of VCM ephemerides to longer time horizons without modifying the functionality of the existing propagator.

2602.16762 2026-02-20 cs.LG cs.AI cs.NI

Attending to Routers Aids Indoor Wireless Localization

Ayush Roy, Tahsin Fuad Hassan, Roshan Ayyalasomayajula, Vishnu Suresh Lokhande

Comments AAAI 2026 Workshop on Machine Learning for Wireless Communication and Networks (ML4Wireless)

详情
英文摘要

Modern machine learning-based wireless localization using Wi-Fi signals continues to face significant challenges in achieving groundbreaking performance across diverse environments. A major limitation is that most existing algorithms do not appropriately weight the information from different routers during aggregation, resulting in suboptimal convergence and reduced accuracy. Motivated by traditional weighted triangulation methods, this paper introduces the concept of attention to routers, ensuring that each router's contribution is weighted differently when aggregating information from multiple routers for triangulation. We demonstrate, by incorporating attention layers into a standard machine learning localization architecture, that emphasizing the relevance of each router can substantially improve overall performance. We have also shown through evaluation over the open-sourced datasets and demonstrate that Attention to Routers outperforms the benchmark architecture by over 30% in accuracy.

2602.16758 2026-02-20 cs.RO cs.SY eess.SY

Smooth trajectory generation and hybrid B-splines-Quaternions based tool path interpolation for a 3T1R parallel kinematic milling robot

Sina Akhbari, Mehran Mahboubkhah

Comments 30 pages, 17 figures, published in Elsevier Precision Engineering (https://www.sciencedirect.com/science/article/abs/pii/S0141635925001266)

Journal ref Precision Engineering 95 (2025): 117-135

详情
英文摘要

This paper presents a smooth trajectory generation method for a four-degree-of-freedom parallel kinematic milling robot. The proposed approach integrates B-spline and Quaternion interpolation techniques to manage decoupled position and orientation data points. The synchronization of orientation and arc-length-parameterized position data is achieved through the fitting of smooth piece-wise Bezier curves, which describe the non-linear relationship between path length and tool orientation, solved via sequential quadratic programming. By leveraging the convex hull properties of Bezier curves, the method ensures spatial and temporal separation constraints for multi-agent trajectory generation. Unit quaternions are employed for orientation interpolation, providing a robust and efficient representation that avoids gimbal lock and facilitates smooth, continuous rotation. Modifier polynomials are used for position interpolation. Temporal trajectories are optimized using minimum jerk, time-optimal piece-wise Bezier curves in two stages: task space followed by joint space, implemented on a low-cost microcontroller. Experimental results demonstrate that the proposed method offers enhanced accuracy, reduced velocity fluctuations, and computational efficiency compared to conventional interpolation methods.

2602.16747 2026-02-20 cs.LG cs.AI

LiveClin: A Live Clinical Benchmark without Leakage

Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, Benyou Wang

详情
英文摘要

The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. In benchmarking against human experts, Chief Physicians achieved the highest accuracy, followed closely by Attending Physicians, with both surpassing most models. LiveClin thus provides a continuously evolving, clinically grounded framework to guide the development of medical LLMs towards closing this gap and achieving greater reliability and real-world utility. Our data and code are publicly available at https://github.com/AQ-MedAI/LiveClin.

2602.16744 2026-02-20 cs.RO

ICP-Based Pallet Tracking for Unloading on Inclined Surfaces by Autonomous Forklifts

Takuro Kato, Mitsuharu Morisawa

Comments Accepted and published in IEEE/SICE SII 2024

Journal ref Proc. IEEE/SICE International Symposium on System Integration (SII), 2024, pp.1504-1510

详情
英文摘要

This paper proposes a control method for autonomous forklifts to unload pallets on inclined surfaces, enabling the fork to be withdrawn without dragging the pallets. The proposed method applies the Iterative Closest Point (ICP) algorithm to point clouds measured from the upper region of the pallet and thereby tracks the relative position and attitude angle difference between the pallet and the fork during the unloading operation in real-time. According to the tracking result, the fork is aligned parallel to the target surface. After the fork is aligned, it is possible to complete the unloading process by withdrawing the fork along the tilt, preventing any dragging of the pallet. The effectiveness of the proposed method is verified through dynamic simulations and experiments using a real forklift that replicate unloading operations onto the inclined bed of a truck.

2602.16742 2026-02-20 cs.LG cs.AI

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Haoxiang Sun, Lizhen Xu, Bing Zhao, Wotao Yin, Wei Wang, Boyu Yang, Rui Wang, Hu Wei

Comments Under review

详情
英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has been shown effective in enhancing the visual reflection and reasoning capabilities of Large Multimodal Models (LMMs). However, existing datasets are predominantly derived from either small-scale manual construction or recombination of prior resources, which limits data diversity and coverage, thereby constraining further gains in model performance. To this end, we introduce \textbf{DeepVision-103K}, a comprehensive dataset for RLVR training that covers diverse K12 mathematical topics, extensive knowledge points, and rich visual elements. Models trained on DeepVision achieve strong performance on multimodal mathematical benchmarks, and generalize effectively to general multimodal reasoning tasks. Further analysis reveals enhanced visual perception, reflection and reasoning capabilities in trained models, validating DeepVision's effectiveness for advancing multimodal reasoning. Data: \href{https://huggingface.co/datasets/skylenage/DeepVision-103K}{this url}.