arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.05495 2026-05-08 cs.LG

Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

William T. Redman, Erik C. Johnson, Brian Robinson

Comments 17 pages, 6 figures

Abstract

Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO"). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a canonical feedforward Transformer model, learns shortcut solutions that limit its ability to generalize and prevent strong forward transfer to new experiences. In contrast, we find evidence supporting the hypothesis that ALBERT, a recurrent version of BERT, learns a For loop-esque solution, which leads to better CL performance. When applying BERT and ALBERT models to a CL setting that requires composition across experiences, we find that both model families fail. Our investigation suggests that the performance drop of ALBERT models can be rescued by training strategies that combine data across experiences, but this is not true for BERT models, where a detrimental shortcut solution becomes entrenched with initial training. Our results demonstrate that the recurrent ALBERT model may have an inductive bias better suited for CL and motivate future investigation of the interplay between Transformer architecture and computational solutions that emerge in modern models and tasks.

2605.05492 2026-05-08 cs.LG

MEMOA: Massive Mixtures of Online Agents via Mean-Field Decentralized Nash Equilibria

Xuwei Yang, David B. Emerson, Fatemeh Tavakoli, Anastasis Kratsios

Comments 43 pages, 11 tables, 1 figure

Abstract

In the modern age of large-scale AI, federated learning has become an increasingly important tool for training large populations of AI agents; however, its computational and communication costs can rapidly fail to scale with the number of agents. This is precisely where decentralized agentic strategies shine: each agent acts autonomously, using only its own state together with a minimal summary of the ensemble, namely the mean-field. We derive the unique optimal decentralized policy in closed form. Optimality is characterized through a worst-client/minimax criterion: minimizing the under-performer regret, namely the maximal online cost incurred by the weakest agent in the ensemble. We further prove that the resulting decentralized policy asymptotically converges, in the large-population limit, to the Nash-optimal centralized policy, whose direct computation is not scalable. We use an online weighting mechanism to optimize the server-computed mixture of client predictions, thereby improving the mean prediction in addition to the previously optimized weakest-client prediction. Numerical experiments verify our theoretical guarantees and demonstrate that our decentralized policy typically outperforms natural greedy decentralized baselines.

2605.05488 2026-05-08 cs.LG

A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers

Taeyoung Kim, Joon-Hyuk Ko

Comments 14 pages, 3 figures

Abstract

We propose an architecture that augments the Flux Neural Operator (Flux NO), which combines the classical finite volume method (FVM) with neural operators, with ViT-based context injection. Our model is formulated as a hypernetwork: it extracts solution dynamics over a finite temporal window, encodes them with a recurrent Vision Transformer, and generates the parameters of a context-conditioned neural operator. This enables the model to infer and solve conservation laws without explicit access to the governing equation or PDE coefficients. Experimentally, we show that the proposed method preserves the robustness, generalization ability, and long-time prediction advantages of Flux NO over standard neural operators, while delivering reliable numerical solutions across a broad range of conservative systems, including previously unseen fluxes. Our code is available at https://github.com/xx257xx/CONTEXT_FLUX_NO.

2605.05485 2026-05-08 cs.CL cs.AI

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

Atharva Naik, Yash Mathur, Prakam, Carolyn Rose, David Mortensen

Abstract

LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers require no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling for the latter by +16.3 percentage points at zero LLM inference cost. They also complement LLM search, improving PBEBench-Hard accuracy from 68.4% to 85.8% while reducing reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared to directly using coding agents as per-instance solvers, induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a real historical linguistics task - predicting sound changes in natural language data - reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. Together, these results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and provide a scalable route to domain-general solver induction. We release code and data for reproducibility.
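
As a toy illustration of what a symbolic program synthesizer over a constrained DSL can look like at test time (no LLM calls), the sketch below enumerates short cascades of `replace(a, b)` rules until one is consistent with every input-output example. The DSL, the brute-force search, and all names here are assumptions made for illustration; the abstract does not describe the actual induced solvers.

```python
from itertools import product


def run(program, s):
    """Apply each (old, new) replacement rule to the string, in order."""
    for old, new in program:
        s = s.replace(old, new)
    return s


def synthesize(examples, alphabet, max_rules=2):
    """Return the first rule cascade consistent with all examples, or None.

    Hypothetical DSL: a program is a list of single-character
    replace(old, new) rules drawn from `alphabet`.
    """
    rules = [(a, b) for a in alphabet for b in alphabet if a != b]
    for n in range(1, max_rules + 1):
        for program in product(rules, repeat=n):
            if all(run(program, x) == y for x, y in examples):
                return list(program)
    return None


# Two input-output examples that need two rules (p -> b and t -> d).
examples = [("pata", "bada"), ("tap", "dab")]
program = synthesize(examples, alphabet="abdpt")
```

Once found, the cascade is an ordinary data structure that can be reapplied to new inputs with `run(program, "pat")` at no inference cost, which is the amortization argument the abstract makes.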

2605.05483 2026-05-08 cs.RO

Robust $\mathcal{H}_\infty$ Controller Design For INDI-Controlled Quadrotor Using Online Parameter Identification

Tom Aantjes, Till M. Blaha, Spilios Theodoulis, Ewoud J. J. Smeur

Comments 8 pages, 11 figures, Accepted to the ICUAS 2026 conference

Abstract

It has recently been shown that all physical parameters of an Incremental Nonlinear Dynamic Inversion (INDI) controller can be estimated onboard a multirotor within half a second, which is fast enough to do the full identification during a throw in the air. However, a robust method to tune outer loop gains for this feedback-linearizing INDI controller depending on the model parameters is still missing. This work presents the design of a robust gain-scheduled controller for attitude control of a quadrotor, using an INDI-based inner loop with online identification of its system parameters. A gain-scheduled cascaded attitude controller with a feedforward filter is synthesized for a symmetric quadrotor using signal-based $\mathcal{H}_\infty$ closed-loop shaping. The resulting controller exhibits good stability margins, with nonlinear simulations confirming effective tracking performance under uncertainty. Experimental evaluation is also conducted through flight tests with full online parameter identification. Even though the identified parameters during these tests are far outside the defined uncertainty range, acceptable flight performance comparable to simulation results is maintained for actuator time constants below 40 ms.

2605.05482 2026-05-08 cs.AI cs.CL cs.MA

FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking

Denys Katerenchuk, Pablo Duboue, Keelan Evanini, David Gondek, Nithin Govindugari, Olivier Allauzen, Joshua Baptiste, David J More, Joshua Schechter

Comments 7 pages, ACL 2026 conference

Abstract

Large language models (LLMs) are rapidly being adopted across various domains. However, their adoption in the banking industry faces resistance due to demands for high accuracy, regulatory compliance, and the need for verifiable and grounded responses. We present a unified, data-efficient framework for training grounded domain-specific LLMs that optimizes answer quality, citation grounding, and calibrated refusal under real-world deployment constraints. First, we describe a data generation pipeline that combines LLM-as-a-Judge filtering, citation annotation, and curriculum learning with only 143M tokens. The resulting 12B model achieves high answer quality outperforming GPT-4.1 on citation grounding, with a modest citation tradeoff versus the untuned base. Second, we propose a calibrated refusal mechanism: training on 22% unanswerable examples yields a 12% "I don't know" rate, substantially improving over the base model's unsafe 4.3% rate while avoiding GPT-4.1's over-refusal (20.2%). Third, we present an end-to-end methodology spanning from data curation to quantized serving. The system is deployed at 40+ financial institutions, achieving a 7.1 percentage point improvement in query resolution (p < 0.001). Additionally, the model delivers 3-5x faster responses at 20-50x lower cost compared to GPT-4.1.

2605.05481 2026-05-08 cs.LG

Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

Dillon Sandhu, Ronald Parr

Abstract

We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy fixed while an iteratively updated behavioral policy gathers relevant experience. It only commits to a new policy once a convergence criterion has been met. If certain stability criteria are met, the update is guaranteed to be safe; otherwise, it remains no less safe than standard approximate policy iteration. Applying SV-API to PPO yields Stable Value PPO (SV-PPO), which matches or improves performance on high-dimensional discrete (Atari) and continuous control benchmarks while executing substantially larger target policy updates. These results demonstrate the viability of ANPS as a new solution to this classic challenge in RL.

2605.05478 2026-05-08 cs.AI

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

Mahyar Alinejad, Yue Wang, Amrit Singh Bedi, George Atia

Abstract

Transfer learning in reinforcement learning (RL) seeks to accelerate learning in new tasks by leveraging knowledge from related sources. Existing neurosymbolic transfer methods, however, typically rely on manually specified task automata, assume a single source task, and use fixed knowledge-integration mechanisms that cannot adapt to varying source relevance. We propose LANTERN, a unified framework for multi-source neurosymbolic transfer that addresses these limitations through three components: (i) deterministic finite automata generated from natural language task descriptions using large language models, (ii) semantic embedding-based aggregation of multiple source policies weighted by cross-task similarity, and (iii) adaptive teacher-student gating based on temporal-difference error and semantic uncertainty. Across domains spanning resource management, navigation, and control, LANTERN achieves 40-60% improvements in sample efficiency over existing baselines while remaining robust to poorly aligned sources. These results demonstrate that multi-source, adaptively weighted neurosymbolic transfer can improve scalability and robustness in symbolic RL settings.

2605.05476 2026-05-08 cs.LG cs.AI cs.CL

A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks

Othmane Kabal, Mounira Harzallah, Fabrice Guillet, Hideaki Takeda, Ryutaro Ichise

Abstract

Knowledge graphs automatically constructed from text are increasingly used in real-world applications. However, their inherent noise, fragmentation, and semantic inconsistencies significantly affect the performance of Graph Neural Networks (GNNs) on downstream tasks. Assessing their performance and robustness remains difficult, as it is often unclear whether observed results stem from the learning model or from the quality of the constructed graph itself. In this work, we introduce a dual-purpose benchmark designed to jointly evaluate (i) the performance of GNNs on noisy, text-derived graphs and (ii) the effectiveness of graph construction methods on a downstream task. The benchmark is built in the biomedical domain from a single textual corpus and includes two automatically constructed graphs generated using different extraction methods, alongside a high-quality reference graph curated by experts that serves as an upper performance bound. This design enables controlled comparison of construction methods and systematic evaluation of GNN robustness through semi-supervised node classification. We further provide a standardized, reproducible, and extensible evaluation framework, facilitating the integration of new graph extraction methods and learning models.

2605.05475 2026-05-08 cs.AI

Intentionality is a Design Decision: Measuring Functional Intentionality for Accountable AI Systems

Allessia Chiappetta, Robert Mahari

Journal ref AutomationXP26 Workshop of the 2026 CHI Conference on Human Factors in Computing Systems
Abstract

As AI systems increasingly exhibit autonomous, goal-directed, and long-horizon behavior, users lack a standardized way to detect the degree to which a system functions like an intentional actor for governance and accountability purposes. This position paper defines intentionality not as consciousness, but as a behavioral profile characterized by purpose, foresight, volition, temporal commitment, and coherence - criteria long used in legal and philosophical contexts to infer intent. These properties are design-contingent: architectural choices such as memory persistence, planning depth, and tool autonomy shape the degree to which systems exhibit organized goal pursuit. If intentionality is design-contingent, it is in principle controllable. Yet control requires measurement. We introduce the Functional Intentionality Test (FIT), a multidimensional framework that quantifies intentional-like behavior across five observable dimensions, and propose FIT-Eval, a structured evaluation protocol for eliciting and scoring them. While reduced human agency can increase efficiency, rising intentional capacity heightens accountability risks. By translating intentionality into interpretable levels, FIT enables proportionate oversight and deliberate autonomy calibration in increasingly agentic systems.

2605.05463 2026-05-08 cs.LG cs.AI

Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs

Othmane Kabal, Mounira Harzallah, Fabrice Guillet, Hideaki Takeda, Ryutaro Ichise

Abstract

Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large-scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real-world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs for unsupervised term typing. We introduce Noise-Aware Text-Driven Graph GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean-graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD-GSSL provides practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.

2605.05461 2026-05-08 cs.RO

Contact-Free Grasp Stability Prediction with In-Hand Time-of-Flight Sensors

Kyle DuFrene, Cindy Grimm

Abstract

Current approaches to grasp planning for robotics demonstrate high success rates, but degrade with noisy sensors and other factors. Previous works have proposed tactile-based grasp stability classifiers to detect failures, but these approaches rely on making contact and grasping the object to do so. We propose a contact-free grasp stability predictor using multi-zone time-of-flight sensors mounted in the distal links of a gripper. Our method, as it does not require grasping the object to make a prediction, significantly speeds up the stability classification process, cycling at 15 Hz. We collected over 2,500 real-world grasps across 15 objects to train a classifier. Additionally, we conducted grasp attempts over six additional unseen objects, three for validation and model selection, and three for model testing. Our approach demonstrated strong classification performance, with an accuracy of 85.5% on validation and 86.0% on test objects.

2605.05460 2026-05-08 cs.AI physics.chem-ph

Agentic Discovery of Exchange-Correlation Density Functionals

Titouan Duston, Jiashu Liang, Yuanheng Wang, Weihao Gao, Xuelan Wen, Nan Sheng, Weiluo Ren, Yang Sun, Yixiao Chen

Comments 20 pages, 2 figures, 4 tables

Abstract

The development of accurate exchange-correlation (XC) functionals remains a longstanding challenge in density functional theory (DFT). The vast majority of XC functionals have been hand designed by human researchers combining physical insight, exact constraints, and empirical fitting. Recent advances in large language models enable a systematic, automated alternative to this human-driven design loop. This report presents an agentic search system in which an LLM proposes structured functional-form changes guided by evolutionary history. The system attempts to improve functional performance through an iterative plan-execute-summarize loop, where improvements are measurable by optimizing functional parameters against a standard thermochemistry dataset, then evaluating performance on a held-out subset. The strongest discovered functional, SAFS26-a (Seed Agentic Functional Search 2026), improves upon the gold-standard ωB97M-V baseline by ~9%. These results also surface a cautionary lesson for AI-assisted science: models powerful enough to discover genuine improvements are equally capable of exploiting unphysical shortcuts to game the benchmark; domain expertise translated into explicitly enforced constraints remains essential to keeping results scientifically grounded.

2605.05447 2026-05-08 cs.CV

EchoXFlow: A Beamspace Echocardiography Dataset for Cardiac Motion, Flow, and Function

Elias Stenhede, Joanna Sulkowska, Eivind Bjørkan Orstad, Henrik Schirmer, Arian Ranjbar

Abstract

We introduce EchoXFlow, a clinical echocardiography dataset for learning from ultrasound in its native acquisition geometry rather than from scan-converted Cartesian videos. Existing public datasets offer limited opportunities to study cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow, as Doppler is typically absent or fused as RGB overlays, and acquisitions are released after lossy vendor display processing. EchoXFlow comprises 37125 recordings from 666 routine-care examinations, preserving the timing, geometry, and modality relationships needed for physically grounded echo learning. Each recording is retained as separable modality-specific streams: temporally resolved 1D, 2D, and 3D data alongside multiple Doppler modalities, paired with a synchronized ECG. Clinical annotations span guideline-based measurements to dense 2D myocardial contours and 3D left-ventricular endocardial meshes. With its associated open-source tooling, EchoXFlow enables cross-modal, acquisition-aware learning tasks that cannot be formulated from conventional scan-converted videos alone, and serves as a testbed for 4D vision and physically grounded multi-modal learning more broadly.

2605.05440 2026-05-08 cs.AI

Authorization Propagation in Multi-Agent AI Systems: Identity Governance as Infrastructure

Krti Tallam

Comments Security and systems paper, 20 pages

Abstract

The security discussion around agentic AI focuses heavily on prompt injection. This paper argues that multi-agent systems also create a distinct authorization problem: maintaining authorization invariants as non-human principals retrieve data, delegate tasks, and synthesize results across changing boundaries. We call this problem authorization propagation. It is not reducible to prompt injection and is not fully addressed by classical access-control models such as RBAC, ABAC, or ReBAC. The paper formalizes authorization propagation as a workflow-level property, identifies three sub-problems (transitive delegation, aggregation inference, and temporal validity), and derives seven structural requirements for authorization architectures in multi-agent AI systems. Recent work on invocation-bound capability tokens, task-scoped authorization envelopes, dependency-graph policy enforcement, and execution-count revocation demonstrates that the field is converging on the problem, but not yet on a complete architecture. The central claim is that identity governance must be treated as infrastructure: evaluated continuously, enforced at every interaction boundary, and designed into the system before orchestration logic is allowed to scale. Preliminary implementation evidence from a production enterprise AI platform shows that ordinary system behavior, not only adversarial action, already produces the failures this model predicts.

2605.05439 2026-05-08 cs.CV

Safety-Critical Camera Reliability Monitoring for ADAS via Degradation-Aware Uncertainty Pattern Analysis

Shiva Aher

Abstract

Reliable camera input is essential for safety-critical ADAS perception, but most monitoring approaches detect sensor failures only after downstream performance has degraded. We propose a proactive camera reliability monitoring framework that estimates perception risk from degradation-induced uncertainty patterns before downstream failure becomes observable. The method introduces a Global Sensor Health Index (GSHI), a continuous reliability score that aggregates per-degradation severities using a risk-aware multiplicative formulation, allowing severe single-mode failures such as lens occlusion or motion blur to dominate the health estimate. A lightweight multi-task network predicts degradation type, severity, GSHI, and spatial uncertainty maps from a single RGB image without downstream task feedback. Training uses physics- and geometry-aware synthetic supervision over twelve camera degradation modes. Experiments on KITTI-derived degradations show that GSHI decreases monotonically with severity, achieves a health-estimation MAE of 0.064, and provides positive early-warning lead time of 0.47 $\pm$ 0.25 severity units before YOLOv8 detection failure. GSHI also outperforms IQA, detector-confidence, and clean-feature OOD baselines, and transfers zero-shot to real adverse-weather driving data. These results support degradation-aware uncertainty analysis as a practical direction for proactive camera reliability monitoring in intelligent vehicles.
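
The abstract does not give the GSHI formula. One multiplicative form with the property it describes (a single severe degradation dominating the score) is a product of per-mode "survival" terms. The function name, the optional weights, and the formula itself are illustrative assumptions, not the paper's definition.

```python
def health_index(severities, weights=None):
    """Hypothetical multiplicative health score in [0, 1].

    severities: dict mapping degradation mode -> severity in [0, 1].
    weights:    optional dict mapping mode -> risk weight in [0, 1]
                (assumed; defaults to 1.0 for every mode).
    """
    if weights is None:
        weights = {mode: 1.0 for mode in severities}
    score = 1.0
    for mode, severity in severities.items():
        if not 0.0 <= severity <= 1.0:
            raise ValueError(f"severity for {mode!r} must lie in [0, 1]")
        # Each mode scales the score down; one severe mode pulls it near 0.
        score *= 1.0 - weights[mode] * severity
    return score


def mean_health(severities):
    """Additive baseline for comparison: 1 minus the mean severity."""
    return 1.0 - sum(severities.values()) / len(severities)


frame = {"lens_occlusion": 0.9, "motion_blur": 0.0, "rain": 0.0}
multiplicative = health_index(frame)
additive = mean_health(frame)
```

With one heavily occluded mode and two clean ones, the multiplicative score falls to about 0.1 while the additive baseline stays near 0.7, which is the "severe single-mode failures dominate" behavior the abstract attributes to a multiplicative formulation.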

2605.05438 2026-05-08 cs.LG cs.AI

On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning

Pratik Deshmukh, Atirek Gupta

Comments 14 pages, 6 figures

Abstract

Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, where models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate that fine-tuning Gemma 270M on transitivity and d-separation tasks without semantic loss results in 100% collapse rate, with models achieving misleadingly high accuracy (73.9%) while learning no causal reasoning. We propose a semantic loss function with graph-based logical constraints and dynamic lambda scheduling that prevents this collapse. Our approach achieves 70.4% accuracy on transitivity tasks and 68.6% on d-separation tasks with stable, context-dependent predictions, representing a 42.7% improvement over collapsed baselines. Adversarial evaluation on 1,000 structural reasoning samples shows semantic models achieve 67-70% accuracy while collapsed models fail catastrophically at 43-71%. We validate our findings through comprehensive benchmarking on 200,000+ evaluation samples across five model variants, demonstrating that semantic loss is essential, not optional, for stable causal reasoning in transformers.
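
To see why accuracy alone can hide collapse, consider a constant predictor on an imbalanced label set. The sketch below assumes a hypothetical 1,000-example evaluation set in which 73.9% of the labels are "Yes" (the abstract reports the accuracy figure but not the label distribution) and compares plain accuracy with balanced accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 for any constant binary predictor."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical label balance, chosen to match the reported 73.9% figure.
labels = ["Yes"] * 739 + ["No"] * 261
collapsed = ["Yes"] * len(labels)  # model that ignores its input

print(accuracy(labels, collapsed))           # 0.739
print(balanced_accuracy(labels, collapsed))  # 0.5
```

A collapsed model scores well on the first metric purely through label imbalance, while the second metric exposes that it never identifies a "No" case.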

2605.05435 2026-05-08 cs.LG cs.NA math.NA

Active Learning for Conditional Generative Compressed Sensing

Alexander DeLise, Nick Dexter

Comments 33 pages, 11 figures

Abstract

Generative compressed sensing uses the range of a pretrained generator as a nonlinear model for recovering structured signals from limited measurements. We study a conditional version of this problem for image recovery from subsampled Fourier measurements using prompt-conditioned generative models. Our framework separates two roles of conditioning: the prompt used to design the sampling distribution and the prompt used to define the recovery model. For ReLU and Lipschitz conditional generators, we prove stable recovery bounds showing that prompt-matched Christoffel sampling retains the same Christoffel complexity constant as existing near-optimal generative compressed sensing theory, while prompt mismatch incurs an explicit compatibility penalty. Experiments with Stable Diffusion show that prompts meaningfully reshape Christoffel sampling distributions and influence image recovery. Overall, our results suggest that prompts should be treated as design variables with distinct effects on sensing, approximation, and recovery.

2605.05415 2026-05-08 cs.LG cs.AI cs.CR

Information Theoretic Adversarial Training of Large Language Models

Yiwei Zhang, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia, Jason Pacheco, Elisa Bertino

Abstract

Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous Adversarial Training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f-divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within a divergence ball around the empirical data distribution, automatically emphasizing harder adversarial examples. Using the convex dual formulation, the objective reduces to a log-sum-exp form under the KL divergence, with a dynamical parameter controlling the strength of reweighting. This study leads to a new class of information-theoretic objectives that significantly reduce attack success rates while maintaining model utility. Across multiple LLMs and attack settings, WARDEN substantially reduces attack success rates with computational and utility costs comparable to CAT-, CAPO-, and MixAT-based baselines, making it a practical approach for scalable robust alignment.
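
The log-sum-exp reduction mentioned above matches the standard dual of a KL-penalized worst-case expectation: the supremum over reweighted distributions Q of E_Q[loss] - lam * KL(Q || P) equals lam * log E_P[exp(loss / lam)], and the maximizing Q weights each example in proportion to exp(loss / lam). The sketch below shows only that generic identity on a list of per-example losses; the function names and the fixed temperature `lam` are assumptions, and the abstract does not specify WARDEN's exact objective or how its parameter is scheduled.

```python
import math


def kl_dro_objective(losses, lam):
    """Return lam * log(mean(exp(loss / lam))), computed stably."""
    scaled = [loss / lam for loss in losses]
    m = max(scaled)  # subtract the max before exponentiating
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return lam * (m + math.log(mean_exp))


def kl_dro_weights(losses, lam):
    """Worst-case example weights: a softmax of loss / lam."""
    scaled = [loss / lam for loss in losses]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


losses = [0.1, 0.5, 2.0]  # hypothetical per-example adversarial losses
weights = kl_dro_weights(losses, lam=1.0)
robust_loss = kl_dro_objective(losses, lam=1.0)
```

A small `lam` pushes the weights toward the hardest example (the objective approaches the maximum loss), while a large `lam` recovers the plain average, which is one way to read "controlling the strength of reweighting".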

2605.05413 2026-05-08 cs.AI

From History to State: Constant-Context Skill Learning for LLM Agents

Haoyang Xie, Xinyuan Wang, Yancheng Wang, Puda Zhao, Feng Ju

Abstract

Large language model (LLM) agents are increasingly used to operate browsers, files, code and tools, making personal assistants a natural deployment target. Yet personal agents face a privacy-cost-capability tension: cloud models execute multi-step workflows well but expose sensitive intermediate context to external APIs, while local models preserve privacy but remain less reliable. Both settings also pay repeatedly for long skill prompts and growing histories. We propose constant-context skill learning, a context-to-weights framework for recurring agent workflows: reusable procedures are learned in lightweight task-family modules, while inference conditions only on the current observation and a compact state block. A deterministic tracker renders this state block from task progress and supplies aligned subgoal rewards, so each module can be trained with step-level SFT and refined through online RL. Across ALFWorld, WebShop, and SciWorld, our agents achieve strong performance across Qwen3-4B, Qwen3-8B and Llama-3.1-8B. With Qwen3-8B, SFT+RL reaches 89.6% unseen success on ALFWorld, 76.8% success on WebShop, and 66.4% unseen success on SciWorld. They match or exceed strong published agent-training results while reducing prompt tokens per turn by 2-7x relative to controlled ReAct prompting baselines, showing that procedural context can be moved from prompts into weights.

2605.05411 2026-05-08 cs.RO cs.AI

Creative Robot Tool Use by Counterfactual Reasoning

M. Tuluhan Akbulut, Varun Satheesh, Ahmed Jaafar, Alper Ahmetoglu, Shane Parr, Aditya Ganeshan, Shivam Vats, George Konidaris

Comments Under review

Abstract

We propose a causal reasoning framework for creative robot tool use where a suitable tool for a task is correctly identified for use beyond its primary objectives. The proposed framework first discovers the causal relationships between the tool and the task by conducting simulated experiments in a dynamics model. We decouple the causal discovery problem into two complementary components: VLM-based feature suggestion and counterfactual tool generation via targeted geometric and physical feature perturbations. Then, novel objects are classified based on identified causal features, and the tool use skill is transferred via keypoint matching conditioned on the identified causal features. By reconstructing the task in a dynamics model, our approach grounds tool use in the physics of the problem. We illustrate our approach in reaching a distant object with different sticks, scooping candies from a bowl using diverse items, and using different boxes or crates as stepping platforms to retrieve an object from a high shelf. Our baseline comparisons show that identifying causal features and grounding them in physical tool properties leads to more reliable tool selection and stronger skill keypoint transfer.

2605.05410 2026-05-08 cs.AI cs.HC physics.ed-ph

LaTA: A Drop-in, FERPA-Compliant Local-LLM Autograder for Upper-Division STEM Coursework

Jesse A. Rodríguez

Comments Submitted to Computers & Education

Abstract

Large-language-model (LLM) graders promise to relieve the grading burden of upper-division STEM courses, but most deployments to date send student work to third-party APIs, violating FERPA and exposing institutions to data risk while requiring substantial assignment modification. We present $\textbf{LaTA}\ (\textit{LaTeX Teaching Assistant})$, a drop-in, open-source autograder that runs entirely on commodity on-premises hardware and assumes a LaTeX-native workflow already adopted by many engineering and physics courses. LaTA implements a four-stage pipeline (ingest, segment, grade, report) using a locally hosted open-weight chain-of-thought LLM grader (gpt-oss:120b) that compares student work to an instructor-authored reference solution and applies a YAML rubric with binary per-item scoring. We deployed LaTA in Winter~2026 in ME 373 (Mechanical Engineering Methods) at Oregon State University, grading every weekly assignment for approximately 200 students on a single Mac Studio at \$0 marginal cost per assignment and 1--3 minutes of wall-clock time per submission, enabling regrading of corrected assignments and greatly expanded TA office hour offerings. The instructor-confirmed grading-error rate held at roughly $0.02$--$0.04\%$ per rubric line item across the term. Relative to the same instructor's previous traditionally-graded cohort, the LaTA-graded cohort outperformed by approximately $11\%$ on the midterm exam and $8\%$ on the final exam, and reported large gains in self-assessed confidence on every stated learning objective ($N = 159$ survey responses, $\Delta \geq +1.49$ Likert points, $p < 10^{-27}$ on every comparison). We release the code under AGPLv3.
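The binary per-item rubric scoring mentioned in this abstract can be illustrated with a minimal Python sketch. The rubric structure, item names, and point values are hypothetical, since the abstract does not give LaTA's YAML schema. The rubric is written as the Python structure such a YAML file might parse to, and the pass/fail judgments that LaTA obtains from its LLM grader are hard-coded here.

```python
# Hypothetical rubric: the Python structure a YAML rubric file might parse to.
# Item ids, descriptions, and point values are invented for illustration.
RUBRIC = [
    {"id": "setup", "description": "States the governing equation", "points": 2},
    {"id": "algebra", "description": "Algebra matches the reference solution", "points": 3},
    {"id": "units", "description": "Final answer has correct units", "points": 1},
]


def score_submission(rubric, judgments):
    """Apply binary per-item scoring: full points if judged True, otherwise zero."""
    report = []
    for item in rubric:
        # A missing judgment is treated as a fail (an assumption of this sketch).
        passed = bool(judgments.get(item["id"], False))
        earned = item["points"] if passed else 0
        report.append({"id": item["id"], "earned": earned, "possible": item["points"]})
    total = sum(row["earned"] for row in report)
    possible = sum(row["possible"] for row in report)
    return total, possible, report


# Stand-in for grader output: one True/False judgment per rubric item.
total, possible, report = score_submission(
    RUBRIC, {"setup": True, "algebra": False, "units": True}
)
```

Binary items leave no partial-credit decision to the grader, which is one plausible reason such a rubric is easy for an instructor to audit line by line.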

2605.05409 2026-05-08 cs.AI cs.CL

Agentic Retrieval-Augmented Generation for Financial Document Question Answering

Yang Shu, Yingmin Liu, Zequn Xie

Comments 22 pages, 11 figures, 13 tables, submitted to Expert Systems with Applications

Abstract

Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence--structured tables, textual narratives, and footnotes--scattered across corporate filings. Existing retrieval-augmented generation (RAG) approaches adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis. We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification, specifically engineered for the precision requirements of financial numerical reasoning. The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy. Extensive experiments on three benchmark datasets--FinQA, ConvFinQA, and TAT-QA--demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy respectively, outperforming the strongest baseline by 5.62--9.32 percentage points. Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework's robustness and practical viability for financial institutions.
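The Program-of-Thought idea in this abstract (having the model emit executable Python for the arithmetic instead of computing numbers in free text) can be sketched as follows. This is an illustration under stated assumptions, not FinAgent-RAG's implementation: the helper name `run_generated_program`, the convention that the program stores its result in `answer`, and the revenue figures are all hypothetical, and the "generated" program is a hard-coded string standing in for model output.

```python
def run_generated_program(program, inputs):
    """Execute a generated arithmetic program and return its `answer` variable.

    Builtins are withheld from the executed code. This only limits accidental
    misuse; it is NOT a security sandbox.
    """
    namespace = {"__builtins__": {}}
    namespace.update(inputs)  # values retrieved from tables or text
    exec(program, namespace)
    return namespace["answer"]


# Hard-coded stand-in for a model-generated program (hypothetical).
generated = (
    "change = revenue_2023 - revenue_2022\n"
    "answer = change / revenue_2022 * 100\n"
)

# Hypothetical retrieved values, not taken from any dataset.
result = run_generated_program(
    generated, {"revenue_2022": 200.0, "revenue_2023": 230.0}
)
```

Keeping the arithmetic in code means the number reported is whatever the interpreter computes, so an error can be traced to a wrong operand or formula rather than to a slip in mental calculation.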

2605.05403 2026-05-08 cs.AI

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

Jiechen Li, Catherine A. Barry, Rishika Randev, Janet Chen, Ella Jorgensen, Brinnae Bent

Comments Currently under review

Abstract

This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. Existing work often operationalizes sycophancy through external behavior such as agreement with incorrect user beliefs, position reversals, or deviation from an objective standard of correctness. These formulations capture only overt forms of the phenomenon and leave subtler boundary failures involving epistemic integrity and social alignment underspecified. We argue that sycophancy should not be understood as agreement alone, but as alignment behavior that displaces independent epistemic judgment. To clarify this boundary, we propose a three-condition framework for sycophancy. First, the user expresses a cue in the form of a belief, preference, or self-concept. Second, the model shifts toward that cue through alignment behavior. Third, this shift compromises epistemic accuracy, independent reasoning, or appropriate correction. We also introduce a taxonomy for classifying sycophancy, consisting of alignment targets, mechanisms, and severity. The paper concludes by discussing implications for alignment evaluation and argues for boundary-aware assessment, structured rubrics, and mitigation strategies, while situating these proposals alongside alternative views of sycophancy.

2605.05402 2026-05-08 cs.AI cs.CV eess.IV

Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

Vinit Katariya, Seungjin Kim, Curtis Craig, Nichole Morris, Hamed Tabkhi

Comments 16 pages, 6 figures, 7 tables, Submitted/Under Review at the International Journal of Transportation Research (Submitted on 12 Jan 2026)

Abstract

Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary pedestrian refuges and curb extensions, on vehicle speed and safety. Using deep learning and perspective-based speed estimation, we evaluated driver behavior before and after interventions, with repeated post-installation monitoring in Week 1 and Week 2, in Minneapolis. Findings reveal that at unsignalized intersections, mean and 85th-percentile speeds fell by up to 18.75% and 16.56%, respectively, while pass-through traffic decreased by as much as 12.2%. Signalized intersections showed comparable reductions except at one location, with mean and 85th-percentile speeds dropping by up to 20.0% and 17.19%. These results demonstrate the traffic-calming effectiveness of soft infrastructure and underscore the utility of AI-powered methods for rapid, low-cost, and evidence-based transport policy evaluation.
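The before/after metrics reported in this abstract (percentage reductions in mean and 85th-percentile speed) can be computed as in the sketch below. The speed samples are invented for illustration and are not data from the study. The percentile uses linear interpolation between order statistics, which is one common convention; the abstract does not say which convention the authors used.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1) / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def mean(values):
    return sum(values) / len(values)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Hypothetical per-vehicle speeds (any consistent unit), before and after an intervention.
speeds_before = [25.0, 30.0, 35.0, 40.0, 45.0]
speeds_after = [22.0, 26.0, 30.0, 34.0, 38.0]

mean_drop = percent_reduction(mean(speeds_before), mean(speeds_after))
p85_drop = percent_reduction(
    percentile(speeds_before, 85), percentile(speeds_after, 85)
)
```

The 85th percentile is a standard traffic-engineering summary because it tracks the faster drivers, whose behavior a traffic-calming intervention is usually meant to change.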

2605.05395 2026-05-08 cs.LG cs.MS

Differentiable Parameter Optimization for DAEs with State-Dependent Events

Ion Matei, Maksym Zhenirovskyy, Anthony Wong

Abstract

Differential-algebraic equations (DAEs) with state-dependent events arise in systems whose continuous dynamics are constrained by algebraic equations and interrupted by mode changes, switching logic, impacts, or state reinitializations. Gradient-based parameter learning for such systems is challenging because algebraic variables are implicitly defined, event times depend on the parameters, and reset maps introduce discontinuities. This paper studies differentiable parameter optimization for semi-explicit DAEs with events. We formulate the learning problem as a constrained least-squares problem with DAE dynamics, algebraic constraints, guard equations, and reset maps. We then develop two complementary gradient-computation strategies. The first is an automatic-differentiation-through-simulation method that solves algebraic variables inside the vector field, differentiates the algebraic solve using the implicit function theorem, and handles events through segmented differentiable integration. The second is an explicit discrete-adjoint method that represents the forward simulation as an event-split residual system and computes gradients by solving for the Lagrange multipliers of smooth-segment and event residuals. The formulation clarifies that residual terms in the adjoint method are equality constraints, not heuristic penalties. We compare the two approaches in terms of gradient interpretation, event-time handling, implementation complexity, and local validity. Both methods provide gradients for the event path selected by the forward simulation and are valid under fixed event ordering and transversal guard crossings.
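The implicit-function-theorem step named in this abstract can be illustrated on a scalar toy problem. If an algebraic variable z is defined implicitly by g(z, θ) = 0, then dz/dθ = -(∂g/∂z)^(-1) ∂g/∂θ, so no derivative has to be propagated through the solver iterations. The constraint g(z, θ) = z³ + z − θ below is a hypothetical example chosen because it has a unique root for every θ; it is not taken from the paper.

```python
def g(z, theta):
    """Toy algebraic constraint g(z, theta) = 0 (hypothetical example)."""
    return z**3 + z - theta


def solve_algebraic(theta, z0=0.0, tol=1e-12, max_iter=50):
    """Solve g(z, theta) = 0 for z with Newton's method."""
    z = z0
    for _ in range(max_iter):
        step = g(z, theta) / (3 * z * z + 1)  # dg/dz = 3 z^2 + 1 > 0
        z -= step
        if abs(step) < tol:
            break
    return z


def dz_dtheta(theta):
    """Sensitivity of the solved z via the implicit function theorem."""
    z = solve_algebraic(theta)
    dg_dz = 3 * z * z + 1
    dg_dtheta = -1.0
    return -dg_dtheta / dg_dz


# Cross-check against a central finite difference at theta = 2 (where z = 1).
theta, h = 2.0, 1e-6
ift_grad = dz_dtheta(theta)
fd_grad = (solve_algebraic(theta + h) - solve_algebraic(theta - h)) / (2 * h)
```

At θ = 2 the root is z = 1, so the formula gives dz/dθ = 1/(3·1² + 1) = 0.25, which the finite-difference estimate should reproduce to within discretization error.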

2605.05392 2026-05-08 cs.CL cs.AI

Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

Yllias Chali, Deen Abdullah

Comments 7 pages, 1 figure

Abstract

Large-scale datasets are widely used to perform summarization tasks, but they may not include queries alongside documents and summaries. In the search for suitable datasets for Query-Focused Summarization (QFS), we identify two research questions: Is it possible to automatically generate evidence-based query keywords from query-free datasets? Does evidence-based query generation support the QFS task? This paper proposes an evidence-based model to generate queries from query-free datasets. To evaluate our model intrinsically, we compare the similarity between the original queries and the system-generated queries of two QFS datasets. We also perform summarization tasks using different pre-trained models, as well as a state-of-the-art (SOTA) QFS model, to measure the extrinsic performance of our query generation approach. Experimental results indicate that summaries generated using evidence-based queries achieve competitive ROUGE scores compared to those generated from the original queries.

2605.05390 2026-05-08 cs.CV

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

Nan Yang, Julian Straub, Fan Zhang, Richard Newcombe, Jakob Engel, Lingni Ma

Comments CVPR 2026. Project page: https://facebookresearch.github.io/LAMP

Abstract

Tracking 3D human motion from an egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusions, and a lack of training data. Existing methods designed for monocular video often require static or slowly-moving cameras and cannot efficiently leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known device 6 DoF motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This "lift-then-fit" approach allows LAMP to learn and leverage a natural human motion prior in the world-space, and provides an elegant framework to flexibly incorporate information from multiple temporally asynchronous, partially observing and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric setting.
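The "lift" step in this abstract (turning a detected 2D keypoint into a ray in the world frame using known calibration and device pose) can be sketched as below. The sketch assumes an ideal pinhole camera with intrinsics fx, fy, cx, cy and a camera-to-world pose given by rotation `R_wc` and camera center `t_wc`. These names and the pinhole model are assumptions for illustration; headset cameras often use wide-angle models, and the abstract does not specify LAMP's camera model.

```python
import math


def pixel_to_world_ray(u, v, fx, fy, cx, cy, R_wc, t_wc):
    """Back-project pixel (u, v) to a world-frame ray (origin, unit direction).

    Assumes a pinhole camera; R_wc is a 3x3 camera-to-world rotation given as
    nested lists, and t_wc is the camera center in world coordinates.
    """
    # Direction in the camera frame, then normalized to unit length.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    norm = math.sqrt(sum(c * c for c in d_cam))
    d_cam = [c / norm for c in d_cam]
    # Rotate into the world frame; the ray starts at the camera center.
    d_world = [sum(R_wc[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    return list(t_wc), d_world


# Hypothetical pose: a rotation about the y-axis that maps the camera's
# forward (+z) axis onto the world +x axis.
R_wc = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
origin, direction = pixel_to_world_ray(
    320.0, 240.0, 500.0, 500.0, 320.0, 240.0, R_wc, [1.0, 2.0, 3.0]
)
```

Because each ray carries its own origin and direction in a shared frame, rays from different cameras and timestamps can be pooled into one set, which is what makes a world-space "ray cloud" indifferent to how many cameras saw the person.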

2605.05389 2026-05-08 cs.LG cs.AI

Two-Stage Learned Decomposition for Scalable Routing on Multigraphs

Filip Rydin, Morteza Haghir Chehreghani, Balázs Kulcsár

Comments 20 pages, 3 figures

Abstract

Most neural methods for Vehicle Routing Problems (VRPs) are limited to Euclidean settings or simple graphs. In this work, we instead consider multigraphs, where parallel edges represent distinct travel options with varying trade-offs (e.g., distance vs time). Few methods are designed for such formulations and those that do exist face major scalability issues. We mitigate these scalability issues via a Node-Edge Policy Factorization (NEPF) approach, which splits the routing policy into a node permutation stage and an edge selection stage. To enable the decomposition, we introduce a pre-encoding edge aggregation scheme and a non-autoregressive architecture for the edge stage, as well as a hierarchical reinforcement learning method to train the stages jointly. Our experiments across six VRP variants demonstrate that NEPF matches or outperforms the state-of-the-art in terms of solution quality, while being significantly faster in training and inference.

2605.05387 2026-05-08 cs.LG cs.IT math.IT

Conditional Diffusion Under Linear Constraints: Langevin Mixing and Information-Theoretic Guarantees

Ahmad Aghapour, Erhan Bayraktar, Asaf Cohen

Abstract

We study zero-shot conditional sampling with pretrained diffusion models for linear inverse problems, including inpainting and super-resolution. In these problems, the observation determines only part of the unknown signal. The remaining degrees of freedom must be sampled according to the correct conditional data distribution. Existing projection-based samplers enforce measurement consistency by correcting the observed component during reverse diffusion. However, measurement consistency alone does not determine how probability mass should be distributed along the feasible set, and this can lead to biased conditional samples. We analyze this issue through a normal--tangent decomposition of the score function. For Gaussian noising, the observed-direction score is exactly determined by the measurement; only the tangent conditional score is unknown. We prove that the error from replacing this score by the unconditional tangent score is upper bounded by a dimension-free conditional mutual information between observed and unobserved components. This gives an information-theoretic decomposition into initialization and pathwise score-mismatch errors. Motivated by the theory, we propose a projected-Langevin initialization followed by guided reverse denoising, which outperforms a strong projection-based baseline in inpainting and super-resolution experiments.
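The claim that the observed-direction score is exactly determined by the measurement can be checked in a simplified setting. The derivation below is a sketch in notation introduced here, not the paper's: it assumes a noiseless measurement $y = A x_0$, a measurement matrix with orthonormal rows ($A A^\top = I$, as for an inpainting mask), a matrix $B$ whose orthonormal rows span the orthogonal complement, and Gaussian noising with scale $\alpha_t$ and noise level $\sigma_t$.

```latex
\begin{aligned}
x_t &= \alpha_t x_0 + \sigma_t \varepsilon, \qquad
  \varepsilon \sim \mathcal{N}(0, I), \qquad y = A x_0, \qquad A A^\top = I, \\
u_t &:= A x_t = \alpha_t y + \sigma_t A \varepsilon
  \;\Longrightarrow\; u_t \mid y \sim \mathcal{N}\!\left(\alpha_t y,\ \sigma_t^2 I\right), \\
v_t &:= B x_t = \alpha_t B x_0 + \sigma_t B \varepsilon, \qquad
  B A^\top = 0, \quad B B^\top = I, \\
p_t(x_t \mid y) &= \mathcal{N}\!\left(u_t;\ \alpha_t y,\ \sigma_t^2 I\right)\, p_t(v_t \mid y), \\
\nabla_{u_t} \log p_t(x_t \mid y) &= -\frac{u_t - \alpha_t y}{\sigma_t^2}.
\end{aligned}
```

The factorization in the fourth line holds because $A\varepsilon$ and $B\varepsilon$ are independent for isotropic Gaussian noise, so given $y$ the normal component $u_t$ is independent of the tangent component $v_t$. Only $\nabla_{v_t} \log p_t(v_t \mid y)$ depends on the data distribution, which is the tangent conditional score the abstract describes as unknown.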