arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.23074 2026-04-28 cs.RO

A Lightweight Toggleable Adhesion Prototype for Multirotor UAV Landing on Tilting Platforms

Teighin Nordholt, Melissa Greeff

Comments To be published in the proceedings of the International Conference on Unmanned Aircraft Systems (ICUAS) 2026

Autonomous multirotor landings on uncrewed surface vessels (USVs) are critical for persistent maritime operations but remain challenging due to wave-induced tilt, wind disturbances, and limited landing area. Many existing approaches exhibit small pose tolerance for reliable landing. This paper presents a lightweight toggleable adhesion mechanism to improve landing reliability. The system uses a motor-driven corkscrew that engages hook-and-loop material on the landing surface, enabling active adhesion during landing and controlled release during takeoff. We evaluate a prototype using a modified Crazyflie 2.0 and a custom tilting platform at fixed angles representative of extreme wave conditions. Using only a simple vertical PID controller, the proposed approach increases landing success from an average of 40% (baseline) to 80% across platform tilts up to 43 degrees using appropriately selected actuation settings.

2604.23072 2026-04-28 cs.AI

Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis

Junyan Cheng, Kyle Richardson, Peter Chin

Comments ICLR 2026 Camera-ready

Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting and scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM grounder agents, including a novel Jupyter Notebook agent for data-driven analysis, are employed to help validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise efficiently and scalably and that enable interactive "what-if" scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves accuracy by 15.84% on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder is strongly cost-effective, achieving a comparable 70.11% accuracy with 90.35% less cost and 52.85% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.

2604.23069 2026-04-28 cs.CL

ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents

Yating Wu, Yuhao Zhang, Sayan Ghosh, Sourya Basu, Anoop Deoras, Jun Huan, Gaurav Gupta

Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent's interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.

2604.23059 2026-04-28 cs.CL

Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort

Baris Karacan, Barbara Di Eugenio, Patrick Thornton, Joanna Tess, Subhash Kumar Kolar

Comments 10 pages. Accepted at IEEE ICHI 2026. This is the author-accepted manuscript

Clinical framing -- the linguistic manner in which clinical information is presented -- can influence patient understanding and decision-making, with important implications for healthcare outcomes. Obstetrics is a high-stakes domain in which physicians counsel patients on delivery mode choices such as vaginal birth after cesarean (VBAC) and repeat cesarean section (RCS), yet counseling language remains underexplored in large-scale clinical text analysis. In this work, we analyze physician counseling language in 2,024 obstetric history and physical narratives for a rigorously defined cohort of patients for whom both VBAC and RCS were clinically viable options. To control for confounding due to medical contraindications, we first construct a VBAC-eligible cohort using structured clinical data supplemented by a large language model (LLM)-based extraction pipeline constrained to grounded, verbatim evidence from free-text narratives. We then apply a zero-shot LLM framework to categorize counseling segments into predefined framing categories capturing how physicians linguistically present delivery options. Our analysis reveals a significant difference in counseling framing distributions between VBAC and RCS notes; risk-focused language accounts for a substantially larger share of counseling segments in RCS documentation than in VBAC, with category-level differences confirmed by statistical testing, highlighting the value of controlled LLM-based framing analysis in obstetric care.

2604.23056 2026-04-28 cs.LG cs.AI

K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

Zixuan Xia, Quanxi Li

Comments Accepted at the NewInML Workshop, The 42nd International Conference on Machine Learning (ICML 2025). Event page: https://icml.cc/virtual/2025/affinity-event/39980

We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high-variance returns and adapting to non-stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on LunarLander and CartPole demonstrate that Kalman-filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at https://github.com/Sumxiaa/Kalman_Normalization.
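
The recursive estimate of a latent reward mean described above can be illustrated with a textbook scalar Kalman filter. This is only a minimal sketch under stated assumptions, not the authors' implementation: the random-walk state model, the class name `ScalarKalman`, the helper `filter_rewards`, and the default noise variances are all hypothetical choices made for illustration, and the abstract does not say how the filtered value enters the policy gradient update.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """Textbook 1D Kalman filter tracking a slowly drifting mean.

    All defaults are illustrative assumptions, not values from the paper.
    """
    mean: float = 0.0          # current estimate of the latent reward mean
    var: float = 1.0           # uncertainty (variance) of that estimate
    process_var: float = 1e-3  # assumed random-walk drift of the true mean
    obs_var: float = 1.0       # assumed noise variance of each observed reward

    def update(self, reward: float) -> float:
        # Predict: under a random-walk model the mean is unchanged,
        # but uncertainty grows by the process variance.
        self.var += self.process_var
        # Correct: blend the prediction with the new observation.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (reward - self.mean)
        self.var *= 1.0 - gain
        return self.mean


def filter_rewards(rewards, **kwargs):
    """Run the filter over a reward sequence and return the running estimates."""
    kf = ScalarKalman(**kwargs)
    return [kf.update(r) for r in rewards]


if __name__ == "__main__":
    # Alternating rewards: the estimate settles between the two values.
    print(filter_rewards([0.0, 2.0] * 5))
```

A larger `process_var` makes the estimate follow non-stationary rewards more quickly, while a larger `obs_var` smooths more aggressively; both would need tuning per environment.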

2604.23054 2026-04-28 cs.CL cs.AI cs.LG

DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

Youze Zheng, Jianyou Wang, Yuhan Chen, Matthew Feng, Longtian Bao, Hanyuan Zhang, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Umber Dube, Ramamohan Paturi

Comments Preprint. Work in Progress

Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.

2604.23051 2026-04-28 cs.CL

Evaluating Temporal Consistency in Multi-Turn Language Models

Yash Kumar Atri, Steven L. Johnson, Tom Hartvigsen

Comments Accepted at ACL 2026

Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories. Through extensive evaluation of state-of-the-art language models, we find that temporal scope stability is frequently violated in controlled multi-turn settings, with models often drifting toward present-day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at https://github.com/yashkumaratri/ChronoScope

2604.23049 2026-04-28 cs.AI

A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agentic Workflows

Edward Cheng, Jeshua Cheng

Comments 8 pages, 2 figures

AI agents are increasingly deployed to execute tasks and make decisions within agentic workflows, introducing new requirements for safe and controlled autonomy. Prior work has established the importance of human oversight for ensuring transparency, accountability, and trustworthiness in such systems. However, existing implementations of Human-in-the-Loop (HITL) mechanisms are typically embedded within application logic, limiting reuse, consistency, and scalability across multi-agent environments. This paper presents a decoupled HITL system architecture that treats human oversight as an independent system component within the agent operating environment. The proposed design separates human interaction management from application workflows through explicit interfaces and a structured execution model. In addition, a design framework is introduced to formalize HITL integration along four dimensions: intervention conditions, role resolution, interaction semantics, and communication channel. This framework enables selective and context-aware human involvement while maintaining system-level consistency. The approach supports alignment with emerging agent communication protocols, allowing HITL to be implemented as a protocol-level concern. By externalizing HITL and structuring its integration, the system provides a foundation for scalable governance and progressive autonomy in agentic workflows.

2604.23046 2026-04-28 cs.LG cs.IT cs.SI math.IT stat.ML

Shape of Memory: a Geometric Analysis of Machine Unlearning in Second-Order Optimizers

Kennon Stewart

Comments Full experiment data available at secondstreetlabs.io

We argue that current definitions of machine unlearning are underspecified for second-order optimizers. We compare first-order and second-order learners for their ability to handle the data deletion task with varying degrees of eigendecomposition to mimic the loss model memory. While both first- and second-order methods realign with the ideal counterfactual in terms of performance and gradient, the second-order optimizer shows significant volatility in the optimizer state. This indicates residual information, supposedly deleted, that isn't detectable by first-order analysis. Various eigendecay treatments show that stability and information loss are regained only under controlled state perturbation where geometric information (or memory) is erased.

2604.23039 2026-04-28 cs.RO

Control Barrier Functions Solved with Hierarchical Quadratic Programming for Safe Physical Human-Robot Interaction

Rui Luo, Jonas Mariager Jakobsen, Wesley Roozing, Federico Califano, Cheng Fang

Comments 8 pages, 8 figures

Physical human-robot interaction offers the potential to leverage human intelligence and robot physical capabilities to enable a range of exciting applications, e.g., collaborative robots for rehabilitation. Safety is critical for the successful deployment of this kind of robotic system. In recent years, Control Barrier Functions (CBFs) have emerged as an effective approach to enforcing safety guarantees and have been widely applied, from adaptive cruise control to navigation of legged robots. CBF constraints can be enforced by solving a Quadratic Programming (QP) problem, which can include many CBF-formulated tasks. To manage a large number of safety tasks, hierarchical CBFs have been used to allow hierarchical relaxation of safety tasks, ensuring the feasibility of a solution in the presence of conflicting tasks. In this work, we propose to use a CBF-based Hierarchical Quadratic Programming (HQP) framework in physical human-robot interaction, which allows us to design both performance tasks (e.g., preserving the desired behavior at the human-robot interaction point) and safety tasks at any level of a hierarchy and thus balance safety and performance more flexibly. Extensive experiments were carried out on a real redundant robot to validate the effectiveness, flexibility, and generality of this approach.

2604.23036 2026-04-28 cs.LG cs.CL

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng, Yibo Zhao, Juncheng Billy Li, Heather Miller

Comments 36 pages

Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) for MoE architectures remains difficult because their router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mixing or auxiliary load-balancing losses, but these introduce noisy gradients that often degrade performance. In preliminary experiments, we systematically pruned experts and observed that while certain super experts are activated far more frequently, discarding less-used experts still leads to notable performance degradation. This suggests that even rarely activated experts encode non-trivial knowledge useful for downstream tasks. Motivated by this, we propose an auxiliary-loss-free MoE SFT framework that combines bias-driven sparsification with always-active gated condenser experts. Rather than enforcing balanced activation across all experts, our method encourages task-relevant experts to remain active while pushing long-tailed experts toward inactivity. The condenser experts provide a persistent, learnable pathway that alleviates gradient starvation and facilitates consolidation of information that would otherwise remain fragmented across sparsely activated experts. Analysis further suggests that this design better preserves long-tailed expert information under sparse routing. Experiments on large-scale MoE models demonstrate that our approach outperforms state-of-the-art SFT baselines such as DenseMixer and ESFT, achieving an average gain of more than 2.5% on both mathematical reasoning and commonsenseQA benchmarks.

2604.23033 2026-04-28 cs.RO

Equivariant Filter for Radar-Inertial Odometry

Giulio Delama, Jan Michalczyk, Morten Nissov, Martin Scheiber, Alessandro Fornasier, Kostas Alexis, Stephan Weiss

Radar-Inertial Odometry (RIO) based on the Extended Kalman Filter (EKF) relies on accurate extrinsic calibration between the radar and the Inertial Measurement Unit (IMU) and is sensitive to disturbances, as large linearization errors can degrade performance or even cause divergence. To address these limitations, this letter proposes an Equivariant Filter (EqF) for RIO based on a Lie group symmetry that geometrically couples navigation states and IMU biases, extending it to incorporate radar-IMU extrinsic calibration and multi-state constraint updates. This equivariant formulation inherently preserves consistency and enhances robustness, enabling reliable state estimation even under poor or completely wrong initialization of calibration states. Real-world experiments on two different Uncrewed Aerial Vehicles (UAVs) show that the proposed EqF-RIO achieves state-of-the-art accuracy under correct extrinsic calibration and offers improved convergence under large calibration errors, where the conventional EKF-RIO fails. Evaluation code is open-sourced.

2604.23027 2026-04-28 cs.AI

A Systematic Approach for Large Language Models Debugging

Basel Shbita, Anna Lisa Gentile, Bing Zhang, Sungeun An, Shailja Thakur, Shubhi Asthana, Yi Zhou, Saptha Surendran, Farhan Ahmed, Rohan Kulkarni, Yuya Jeremy Ong, Chad DeLuca, Hima Patel

Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.

2604.23019 2026-04-28 cs.CV cs.LG

Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery

Sulagna Saha, Arthur Ouaknine, Etienne Laliberté, Carol Altimas, Evan M. Gora, Adriane Esquivel Muelbert, Ian R. McGregor, Cesar Gutierrez, Vanessa E. Rubio, David Rolnick

Comments ML4RS @ICLR 2026 (Main)

Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.

2604.23012 2026-04-28 cs.LG cs.CV

On-Device Vision Training, Deployment, and Inference on a Thumb-Sized Microcontroller

Jeremy Ellis

Comments 25 pages; 3 figures; 3 tables. Code and datasets available at https://github.com/webmcu-ai/on-device-vision-ai. Paper 1 of the webmcu-ai series. Implements end-to-end on-device CNN training and inference on a thumb-sized microcontroller (ESP32-S3), the XIAO ML Kit, in ~1,750 lines of single-file C++ without external ML dependencies

This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing $15-40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64x64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary > baked-in header > He-initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at https://github.com/webmcu-ai/on-device-vision-ai

2604.23010 2026-04-28 cs.CV cs.RO

GenAssets: Generating in-the-wild 3D Assets in Latent Space

Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, Raquel Urtasun

Comments CVPR 2025. Project page: https://waabi.ai/genassets

High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a "reconstruct-then-generate" approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.

2604.23009 2026-04-28 cs.CL

Chinese-SkillSpan: A Span-Level Dataset for ESCO-Aligned Competency Extraction from Chinese Job Ads

Guojing Li, Zichuan Fu, Junyi Li, Wenxia Zhou, Xinyang Wu, Jinning Yang, Jingtong Gao, Feng Huang, Xiangyu Zhao

Comments 18 pages, 10 figures, 3 tables

Job Skill Named Entity Recognition (JobSkillNER) aims to automatically extract key skill information from large-scale job posting data, which is important for improving talent-market matching efficiency and supporting personalized employment services. To the best of our knowledge, this work presents the first Chinese JobSkillNER dataset for recruitment texts. We propose annotation guidelines tailored to Chinese job postings and an LLM-empowered Macro-Micro collaborative annotation pipeline. The pipeline leverages the contextual understanding ability of large language models (LLMs) for initial annotation and then refines the results through expert sentence-level adjudication. Using this pipeline, we annotate more than 20,000 instances collected from four major recruitment platforms over the period 2014-2025. Based on these efforts, we release Chinese-SkillSpan, the first Chinese JobSkillNER dataset aligned with the ESCO occupational skill standard across four dimensions: knowledge, skill, transversal competence, and language competence (LSKT). Experimental results show that the dataset supports effective model training and evaluation, indicating that Chinese-SkillSpan helps fill a major gap in Chinese JobSkillNER resources and provides a useful benchmark for intelligent recruitment research. Code and data are available at https://sites.google.com/view/cn-skillspan-resources .

2604.23003 2026-04-28 cs.LG cs.NE

Collocation-based Robust Physics Informed Neural Networks for time-dependent simulations of pollution propagation under thermal inversion conditions on Spitsbergen

Leszek Siwik, Maciej Sikora, Natalia Leszczyńska, Tomasz Maciej Ciesielski, Eirik Valseth, Manuela Bastidas Olivares, Marcin Łoś, Tomasz Służalec, Jacek Leszczyński, Maciej Paszyński

Comments Robust Variational Physics Informed Neural Networks; Pollution propagation simulations; Longyearbyen at Spitsbergen; Advection-diffusion model; In-field measurements; Open source software

In this paper, we propose a Physics-Informed Neural Network framework for time-dependent simulations of pollution propagation originating from moving emission sources. We formulate a robust variational framework for the time-dependent advection-diffusion problem and establish the boundedness and inf-sup stability of the corresponding discrete weak formulation. Based on this mathematical foundation, we construct a robust loss function that is directly related to the true approximation error, defined as the difference between the neural network approximation and the (unknown) exact solution. Additionally, a collocation-based strategy is introduced to speed up neural network training. As a case study, we investigate pollution propagation caused by snowmobile traffic in Longyearbyen, Spitsbergen, supported by detailed in-field measurements collected using dedicated sensors. The proposed framework is applied to analyze the effects of thermal inversion on pollutant accumulation. Our results demonstrate that thermal inversion traps dense and humid air masses near the ground, significantly enhancing particulate matter (PM) concentration and worsening local air quality.

2604.23002 2026-04-28 cs.AI cs.CL

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Jordan Meadows, Lan Zhang, Andre Freitas

Comments ACL 2026

Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (e.g., Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce syntactically correct and semantically aligned formal proofs of informal reasoning at low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation, which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics. https://github.com/jmeadows17/formal-science

2604.23001 2026-04-28 cs.RO cs.AI

Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

Ziyao Wang, Bingying Wang, Hanrong Zhang, Tingting Du, Tianyang Chen, Guoheng Sun, Yexiao He, Zheyu Shen, Wanghao Ye, Ang Li

Comments This survey paper has been accepted by TMLR after peer review. OpenReview link: https://openreview.net/forum?id=tAaWFpvnmm

Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon reasoning evaluation that existing protocols fail to address. For data engines, we examine simulation-based, video-reconstruction, and automated task-generation paradigms, identifying their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these analyses, we distill four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation. Addressing them, we argue, requires treating data infrastructure as a first-class research problem rather than a background concern.

2604.23000 2026-04-28 cs.RO

Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning

Soham Kulkarni, Raayan Dhar, Yuchen Cui

Comments 8 pages, 5 figures

Abstract

In behavioral cloning (BC), policy performance is fundamentally limited by demonstration data quality. Real-world datasets contain trajectories of varying quality due to operator skill differences, teleoperation artifacts, and procedural inconsistencies, yet standard BC treats all demonstrations equally. Existing curation methods require costly policy training in the loop or manual annotation, limiting scalability. We propose RINSE (Ranking and INdexing Smooth Examples), a lightweight framework for scoring demonstrations based on trajectory smoothness that is policy-architecture-agnostic and operates on trajectory data alone, with TED additionally using a phase-boundary/contact signal. Grounded in motor control theory, which establishes smoothness as a hallmark of skilled movement, RINSE uses two complementary metrics: Spectral Arc Length (SAL), a spectral measure of frequency-domain regularity, and Trajectory-Envelope Distance (TED), a spatial measure of contact-aware geometric deviation. We show that smoothness filtering can reduce the conditional action variance of the retained data distribution, with downstream effects that can be amplified by action chunking and compounding error. On RoboMimic benchmarks, SAL filtering achieves 16% higher success using one-sixth of the data. On real-world manipulation, TED filtering achieves 20% improvement with half the data. As a retrieval-stage filter within STRAP on LIBERO-10, RINSE re-ranking improves mean success by 5.6%. As soft weights in Re-Mix domain reweighting, RINSE scores produce domain allocations highly correlated with the learned Re-Mix allocations (Spearman $\rho \geq 0.89$). These results support smoothness as a useful quality signal across filtering, retrieval, and reweighting settings, especially in noisy or heterogeneous data regimes.
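To make the SAL idea concrete, here is a minimal Python sketch of a simplified spectral arc length. It assumes a fixed cutoff frequency and a naive DFT; the function name `spectral_arc_length`, the 10 Hz cutoff, and the toy velocity profiles are illustrative assumptions and are not taken from the RINSE implementation.

```python
import cmath
import math


def spectral_arc_length(signal, sample_rate, cutoff_hz=10.0):
    """Simplified spectral arc length: values closer to 0 mean smoother motion.

    Assumption: fixed cutoff and a naive O(n^2) DFT, for illustration only.
    """
    n = len(signal)
    freq_step = sample_rate / n
    num_bins = int(cutoff_hz / freq_step) + 1  # bins from 0 Hz up to the cutoff
    # Magnitude spectrum on the low-frequency bins.
    mags = []
    for k in range(num_bins):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        mags.append(abs(coeff))
    # Normalise by the DC magnitude so the score ignores movement amplitude.
    mags = [m / mags[0] for m in mags]
    # Arc length of the normalised spectrum, with frequency rescaled to [0, 1].
    df = 1.0 / (num_bins - 1)
    arc = sum(math.hypot(df, mags[k + 1] - mags[k]) for k in range(num_bins - 1))
    return -arc


# Toy profiles (hypothetical): a bell-shaped velocity and the same with a 5 Hz wobble.
N = 100
smooth = [0.5 - 0.5 * math.cos(2 * math.pi * i / N) for i in range(N)]
jerky = [s + 0.2 * math.sin(2 * math.pi * 5 * i / N) for i, s in enumerate(smooth)]
sal_smooth = spectral_arc_length(smooth, sample_rate=100.0)
sal_jerky = spectral_arc_length(jerky, sample_rate=100.0)
```

In this toy case the wobbling profile gets the more negative score, so ranking demonstrations by this value and keeping the top fraction would discard it first.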

2604.22992 2026-04-28 cs.CV cs.RO

Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation

Vitalii Tutevych, Raphael Memmesheimer, Luca Eichler, Dmytro Pavlichenko, Fynn Schilke, Rodja Krudewig, Sven Behnke

Comments 12 pages, 6 figures, 7 tables, submitted to RoboCup 2026 Symposium

Abstract

Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.

2604.22989 2026-04-28 cs.CV cs.AI

CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma, Zhihong Chen, Yunhe Gao, Greg Zaharchuk, Tara Taghavi, Krishnaram Kenthapadi, Akshay Chaudhari

Comments CVPR Findings (2026)

Abstract

Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.

2604.22985 2026-04-28 cs.CL

Uncertainty Quantification for LLM Function-Calling

Zihuiwen Ye, Lukas Aichberger, Michael Kirchhof, Sinead Williamson, Luca Zappella, Yarin Gal, Arno Blaas, Adam Golinski

Abstract

Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.
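As a rough illustration of clustering function-call samples by their abstract syntax tree, the sketch below parses Python-style call strings with the standard `ast` module and computes a frequency-based entropy over the resulting clusters. The helper names, the `transfer` calls, and the choice to sort keyword arguments are assumptions made for this example; the paper's exact clustering rule and entropy estimator may differ.

```python
import ast
import math
from collections import Counter


def canonical_call(call_text):
    """Parse a Python-style call and return a hashable canonical form.

    Keyword arguments are sorted so argument order does not split clusters.
    """
    node = ast.parse(call_text, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    name = ast.unparse(node.func)
    args = tuple(ast.dump(a) for a in node.args)
    kwargs = tuple(sorted((kw.arg or "**", ast.dump(kw.value))
                          for kw in node.keywords))
    return (name, args, kwargs)


def cluster_entropy(samples):
    """Entropy (in nats) of the empirical distribution over AST clusters."""
    counts = Counter(canonical_call(s) for s in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples drawn from a model for the same user request.
samples = [
    'transfer(to="bob", amount=100)',
    "transfer(amount=100, to='bob')",      # same call, different surface form
    'transfer( to = "bob", amount = 100 )',
    'transfer(to="bob", amount=1000)',     # genuinely different call
]
num_clusters = len({canonical_call(s) for s in samples})
entropy = cluster_entropy(samples)
```

The three surface variants of the same call fall into one cluster, so the entropy reflects disagreement about the call itself rather than about quoting or argument order.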

2604.22984 2026-04-28 cs.CV cs.GR

BrickNet: Graph-Backed Generative Brick Assembly

Peter Kulits, Cordelia Schmid

Comments CVPR 2026; project page: https://kulits.github.io/BrickNet

Abstract

We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet

2604.22981 2026-04-28 cs.LG

Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

Alex Nikulkov

Comments 27 pages, 14 figures

Abstract

Reward models in RLHF are trained to score only the final token of a response - a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model's output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories (middle-token pairwise accuracy improved from 50% to 88.9%, final-token accuracy preserved); state-of-the-art PRM performance on ProcessBench (44.9% average F1) among models trained only on outcome data; and unified reward/value modeling in PPO, reducing peak GPU memory by 27% and step time by 19% with matching LLM quality.
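The abstract does not give the exact form of the two regularizers, so the following is only a sketch under stated assumptions: a standard Bradley-Terry loss on final-token outputs, a Monte Carlo style term that pulls each token output toward the final output, and a TD(0) style term that pulls each token output toward the next one. The squared-error form, the function names, and the weights `lam_mc` and `lam_td` are hypothetical.

```python
import math


def bradley_terry_loss(final_chosen, final_rejected):
    """Standard pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return math.log1p(math.exp(-(final_chosen - final_rejected)))


def mc_regularizer(token_rewards):
    """Monte Carlo style: mean squared gap between each token output and the final one."""
    target = token_rewards[-1]  # would be treated as a constant during training
    return sum((r - target) ** 2 for r in token_rewards[:-1]) / (len(token_rewards) - 1)


def td_regularizer(token_rewards):
    """TD(0) style: mean squared gap between each token output and the next one."""
    pairs = zip(token_rewards[:-1], token_rewards[1:])
    return sum((r - nxt) ** 2 for r, nxt in pairs) / (len(token_rewards) - 1)


# Hypothetical per-token reward-model outputs for one preference pair.
chosen = [0.1, 0.4, 0.9, 1.0]
rejected = [0.0, -0.2, -0.5, -1.0]
lam_mc, lam_td = 0.1, 0.1  # assumed weights

bt = bradley_terry_loss(chosen[-1], rejected[-1])
mc = mc_regularizer(chosen) + mc_regularizer(rejected)
td = td_regularizer(chosen) + td_regularizer(rejected)
total_loss = bt + lam_mc * mc + lam_td * td
```

Only the first term looks at the preference label; the other two act on intermediate positions, which is where the token-level signal described above would come from.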

2604.22979 2026-04-28 cs.AI

Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction

Luca Cotti, Luca Lavazza, Marco Cominelli, Liying Han, Gaofeng Dong, Francesco Gringoli, Mani B. Srivastava, Trevor Bihl, Erik P. Blasch, Daniel O. Brigham, Kara Combs, Lance M. Kaplan, Federico Cerutti

Comments 8 pages, 1 figure. Accepted at FUSION 2026

Abstract

We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on high-dimensional raw signals. Deep neural models achieve strong predictive performance on CSI-based HAR (CHAR), yet rely on continuous latent representations that are opaque and difficult to modify; purely symbolic approaches, in contrast, cannot process raw CSI streams. We propose a fully automatic and strictly decoupled pipeline in which CSI magnitude windows are compressed by a categorical variational autoencoder with Gumbel-Softmax latent variables under a capacity-controlled objective, yielding a compact discrete representation. The encoder is then frozen and used as a deterministic mapping to one-hot latent trajectories. Causal discovery is performed on these trajectories to estimate class-conditional temporal dependency graphs. Statistically supported lagged dependencies are translated into Linear Temporal Logic (LTL) rules, producing a fully symbolic and deterministic classifier based solely on rule evaluation and aggregation, without any learned discriminative head. Because rules are defined over discrete latent variables, antenna-specific rule sets can in principle be combined at the symbolic level, enabling structured multi-antenna fusion without retraining the encoder. Results from CHAR Latent Temporal Rule Extraction (CHARL-TRE) indicate competitive performance while preserving explicit temporal and causal structure, showing that deterministic symbolic classification grounded in unsupervised discrete latent representations constitutes a viable alternative to end-to-end black-box models for wireless HAR.

2604.22973 2026-04-28 cs.RO

Collaborative Trajectory Prediction via Late Fusion

Nadya Abdel Madjid, Murad Mebrahtu, Zakhar Yagudin, Bilal Hassan, Naoufel Werghi, Jorge Dias, Dzmitry Tsetserukou, Majid Khonji

Abstract

Predicting future trajectories of surrounding traffic agents is critical for safe autonomous navigation and collision avoidance. Despite advances in trajectory forecasting, prediction models remain vulnerable to uncertainty caused by occlusions, limited sensing range, and perception errors. Collaborative vehicle-to-vehicle (V2V) approaches help reduce this uncertainty by sharing complementary information. Existing collaborative trajectory prediction methods typically fuse feature maps at the perception stage to construct a holistic scene view. This holistic representation is then decoded into future trajectories. Such a design incurs substantial communication overhead due to the exchange of high-dimensional feature representations and often assumes idealized bandwidth and synchronization, limiting practical deployment. We address these limitations by shifting collaboration from perception to the prediction module and introducing a late-fusion framework for shared forecasts. The framework is model-agnostic and treats collaborating vehicles as independent asynchronous agents. We evaluate the approach on the OPV2V, V2V4Real, and DeepAccident datasets, comparing individual and collaborative forecasting. Across all datasets, late fusion consistently reduces miss rate and improves trajectory success rate ($\mathrm{TSR}_{0.5}$), defined as the fraction of ground-truth agents with final displacement error below 0.5 m. On the real-world V2V4Real dataset, collaborative prediction improves the success rate by $1.69\%$ and $1.22\%$ for the two intelligent vehicles, respectively, compared with individual forecasting.
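The $\mathrm{TSR}_{0.5}$ definition above can be sketched in a few lines of Python. This version assumes one predicted endpoint per agent and counts ground-truth agents with no prediction as failures; both choices, along with the function name and the toy coordinates, are assumptions for illustration rather than details from the paper.

```python
import math


def trajectory_success_rate(pred_endpoints, gt_endpoints, threshold_m=0.5):
    """Fraction of ground-truth agents whose final displacement error is below threshold_m."""
    hits = 0
    for agent_id, (gx, gy) in gt_endpoints.items():
        if agent_id not in pred_endpoints:
            continue  # assumption: an agent without a prediction counts as a failure
        px, py = pred_endpoints[agent_id]
        if math.hypot(px - gx, py - gy) < threshold_m:
            hits += 1
    return hits / len(gt_endpoints)


# Hypothetical final positions in metres, keyed by agent id.
gt = {"a": (10.0, 0.0), "b": (5.0, 5.0), "c": (0.0, 3.0), "d": (2.0, 2.0)}
pred = {"a": (10.3, 0.2), "b": (5.0, 5.6), "c": (0.1, 3.1)}
tsr = trajectory_success_rate(pred, gt)  # agents "a" and "c" succeed: 2 of 4
```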

2604.22964 2026-04-28 cs.CV cs.LG cs.SE

AnemiaVision: Non-Invasive Anemia Detection via Smartphone Imagery Using EfficientNet-B3 with TrivialAugmentWide, Mixup Augmentation, and Persistent Patient History Management

Rahul Patel

Comments 6 pages, 6 figures, 6 tables. Final year personal project, Department of Electronics and Communication Engineering, Indian Institute of Information Technology Surat. Code: https://github.com/RAHULPATEL2002/anemia-detection Demo: https://anemia-detection-gbmj.onrender.com

Abstract

Anemia affects over one billion people globally and remains severely under-diagnosed in low-resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end-to-end web-based system for non-invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine-tunes a pre-trained EfficientNet-B3 backbone with a redesigned three-layer classifier head incorporating BatchNorm, GELU activations, and high-rate Dropout (0.45/0.35). Training employs four orthogonal accuracy-boosting techniques: TrivialAugmentWide for policy-free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter-class smoothing, and cosine-annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high-variance epochs. The deployed Flask application integrates persistent patient-history management backed by PostgreSQL on Render, with an automated database-migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy-first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC-ROC of 0.98, compared with 44.9% validation accuracy and AUC-ROC of 0.58 from the three-epoch CPU-only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first-line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.
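Two of the training techniques named above, Mixup with alpha=0.2 and cosine annealing with linear warmup, can be sketched in plain Python as follows. The function names, the step counts, and the base learning rate are illustrative assumptions; the project's repository may implement these differently (for example per batch in a deep learning framework).

```python
import math
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def lr_at(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine annealing toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


# Toy usage with hypothetical two-class examples (anemic vs. non-anemic).
rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 0.0, 0.5], [1.0, 0.0],
                          [0.0, 1.0, 0.5], [0.0, 1.0], rng=rng)
schedule = [lr_at(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(101)]
```

With a small alpha such as 0.2, the Beta distribution puts most of its mass near 0 and 1, so most mixed samples stay close to one of the two originals.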

2604.22958 2026-04-28 cs.AI

On the Existence of an Inverse Solution for Preference-Based Reductions in Argumentation

Alessio Zaninotto, Bruno Yun, Nir Oren, Srdjan Vesic

Comments 14 pages, 2 figures

Abstract

Preference-based argumentation frameworks (PAFs) extend Dung's approach to abstract argumentation (AAFs) by encoding preferences over arguments. Such preferences control the transformation of attacks into defeats, and different approaches to doing so result in different reductions from a PAF to an AAF. In this paper we consider a PAF inverse problem which takes an argumentation graph, a labelling and a semantics as an input, and outputs a "yes" or "no" as to whether there is a preference relation between the arguments which can yield the desired labelling. This inverse problem has applications in areas including preference elicitation and explainability. We consider this problem in the context of the four most widely used preference-based reductions under the complete semantics. We show that in most cases, the problem can be answered in polynomial time.
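For readers unfamiliar with the setup, the forward direction (apply a preference relation, then check a labelling) can be sketched in Python. The sketch uses one simple reduction, in which an attack is dropped when its target is strictly preferred to its source; treating this as one of the four reductions studied in the paper is an assumption, and the function names and the two-argument example are hypothetical. The inverse problem asks whether some preference relation makes the check succeed.

```python
def reduce_attacks(attacks, strictly_preferred):
    """Keep attack (a, b) as a defeat unless b is strictly preferred to a.

    strictly_preferred is a set of pairs (x, y) meaning x is preferred to y.
    """
    return {(a, b) for (a, b) in attacks if (b, a) not in strictly_preferred}


def is_complete_labelling(arguments, defeats, labelling):
    """Check the complete-labelling conditions for labels 'in', 'out', 'undec'."""
    for arg in arguments:
        attacker_labels = [labelling[a] for (a, b) in defeats if b == arg]
        all_out = all(lab == "out" for lab in attacker_labels)
        some_in = any(lab == "in" for lab in attacker_labels)
        # 'in' iff every defeater is 'out'; 'out' iff some defeater is 'in'.
        expected = "in" if all_out else "out" if some_in else "undec"
        if labelling[arg] != expected:
            return False
    return True


# Hypothetical instance: a attacks b, and we want both arguments labelled 'in'.
arguments = {"a", "b"}
attacks = {("a", "b")}
desired = {"a": "in", "b": "in"}

without_prefs = is_complete_labelling(arguments, attacks, desired)
defeats = reduce_attacks(attacks, strictly_preferred={("b", "a")})
with_prefs = is_complete_labelling(arguments, defeats, desired)
```

Here the desired labelling is not complete on the raw attack graph but becomes complete once b is preferred to a, so the answer to the inverse problem for this instance would be "yes".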