arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.05761 2026-03-09 cs.LG

Score-Guided Proximal Projection: A Unified Geometric Framework for Rectified Flow Editing

Vansh Bansal, James G Scott

Abstract

Rectified Flow (RF) models achieve state-of-the-art generation quality, yet controlling them for precise tasks -- such as semantic editing or blind image recovery -- remains a challenge. Current approaches bifurcate into inversion-based guidance, which suffers from "geometric locking" by rigidly adhering to the source trajectory, and posterior sampling approximations (e.g., DPS), which are computationally expensive and unstable. In this work, we propose Score-Guided Proximal Projection (SGPP), a unified framework that bridges the gap between deterministic optimization and stochastic sampling. We reformulate the recovery task as a proximal optimization problem, defining an energy landscape that balances fidelity to the input with realism from the pre-trained score field. We theoretically prove that this objective induces a normal contraction property, geometrically guaranteeing that out-of-distribution inputs are snapped onto the data manifold, and it effectively reaches the posterior mode constrained to the manifold. Crucially, we demonstrate that SGPP generalizes state-of-the-art editing methods: RF-inversion is effectively a limiting case of our framework. By relaxing the proximal variance, SGPP enables "soft guidance," offering a continuous, training-free trade-off between strict identity preservation and generative freedom.

2603.05760 2026-03-09 cs.LG

MIRACL: A Diverse Meta-Reinforcement Learning for Multi-Objective Multi-Echelon Combinatorial Supply Chain Optimisation

Rifny Rachman, Josh Tingey, Richard Allmendinger, Wei Pan, Pradyumn Shukla, Bahrul Ilmi Nasution

Abstract

Multi-objective reinforcement learning (MORL) is effective for multi-echelon combinatorial supply chain optimisation, where tasks involve high dimensionality, uncertainty, and competing objectives. However, its deployment in dynamic environments is hindered by the need for task-specific retraining and substantial computational cost. We introduce MIRACL (Meta multI-objective Reinforcement leArning with Composite Learning), a hierarchical Meta-MORL framework that enables few-shot generalisation across diverse tasks. MIRACL decomposes each task into structured subproblems for efficient policy adaptation and meta-learns a global policy across tasks using a Pareto-based adaptation strategy to encourage diversity in meta-training and fine-tuning. To our knowledge, this is the first integration of Meta-MORL with such mechanisms in combinatorial optimisation. Although validated in the supply chain domain, MIRACL is theoretically domain-agnostic and applicable to broader dynamic multi-objective decision-making problems. Empirical evaluations show that MIRACL outperforms conventional MORL baselines in simple to moderate tasks, achieving up to 10% higher hypervolume and 5% better expected utility. These results underscore the potential of MIRACL for robust, efficient adaptation in multi-objective problems.
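
The hypervolume figure quoted above is a standard quality indicator for a Pareto front. As a minimal sketch, assuming two objectives that are both maximised and a user-chosen reference point (the function name `hypervolume_2d` and the sample points are illustrative, not taken from the paper), it can be computed with a simple sweep:

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by `ref` (maximisation)."""
    # Keep only points that strictly improve on the reference in both objectives.
    valid = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective to the smallest.
    valid.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_second = ref[1]  # highest second objective seen so far
    for first, second in valid:
        if second > best_second:
            # Add the horizontal strip that this point newly dominates.
            area += (first - ref[0]) * (second - best_second)
            best_second = second
    return area


front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]  # last point is dominated
area = hypervolume_2d(front, ref=(0.0, 0.0))
```

A larger value means the front dominates more of the objective space relative to the reference point, which is why relative hypervolume gains are a common way to compare MORL methods.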

2603.05758 2026-03-09 cs.CV cs.GR cs.LG

Full Dynamic Range Sky-Modelling For Image Based Lighting

Ian J. Maquignaz

Abstract

Accurate environment maps are a key component of modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical capture, physical simulation and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated that these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends the current state of the art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate that Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).

2603.05757 2026-03-09 cs.RO

EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.

2603.05754 2026-03-09 cs.RO

Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

Dian Yu, Qingchuan Zhou, Bingkun Huang, Majid Khadiv, Zewen Yang

Abstract

Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
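
The abstract does not spell out the safety filter, so the following is only a toy illustration of the general control-barrier-function idea, assuming a one-dimensional single integrator (x' = u) and a single upper workspace bound. The names `cbf_filter` and `simulate` and all numeric values are hypothetical. With barrier h(x) = x_max - x, the condition h' + alpha * h >= 0 reduces to u <= alpha * (x_max - x), so the filter is a closed-form clip of the nominal command:

```python
from typing import List


def cbf_filter(u_nominal: float, x: float, x_max: float, alpha: float = 1.0) -> float:
    """Return the command closest to u_nominal that satisfies the CBF condition.

    Barrier: h(x) = x_max - x. For x' = u, h' = -u, so h' + alpha*h >= 0
    is equivalent to u <= alpha * (x_max - x).
    """
    return min(u_nominal, alpha * (x_max - x))


def simulate(x0: float, u_nominal: float, x_max: float,
             steps: int = 500, dt: float = 0.01, alpha: float = 5.0) -> List[float]:
    """Euler-integrate x' = u with the filter applied at every step."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = cbf_filter(u_nominal, x, x_max, alpha)
        x += dt * u
        trajectory.append(x)
    return trajectory


# A policy that always pushes forward is slowed down as it nears the bound.
traj = simulate(x0=0.0, u_nominal=1.0, x_max=0.5)
```

Far from the bound the nominal command passes through unchanged; close to it the command is scaled down so the state approaches the bound without crossing it. In this discrete-time sketch that holds when alpha * dt <= 1 and the state starts inside the bound.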

2603.05751 2026-03-09 cs.RO cs.HC

Vision-Language System using Open-Source LLMs for Gestures in Medical Interpreter Robots

Thanh-Tung Ngo, Emma Murphy, Robert J. Ross

Abstract

Effective communication is vital in healthcare, especially across language barriers, where non-verbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system utilizes a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1-Score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.

2603.05750 2026-03-09 cs.CL

NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack

Comments To be published (Accepted at WWW'26)

Abstract

Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.

2603.05748 2026-03-09 cs.RO

Environment-Aware Path Generation for Robotic Additive Manufacturing of Structures

Mahsa Rabiei, Reza Moini

Abstract

Robotic Additive Manufacturing (AM) has emerged as a scalable and customizable construction method in the last decade. However, current AM design methods rely on a pre-conceived (a priori) toolpath of the structure, often developed via offline slicing software. Moreover, given dynamic construction environments with obstacles in terrestrial and extraterrestrial settings, there is a need for online path generation methods. Here, an environment-aware path generation framework (PGF) is proposed for the first time, in which structures are designed in an online fashion by utilizing four path planning (PP) algorithms (two search-based and two sampling-based). To evaluate the performance of the proposed PGF in different obstacle arrangements (periodic, random) for two types of structures (closed and open), structural (path roughness, turns, offset, Root Mean Square Error (RMSE), deviation) and computational (run time) performance metrics are developed. The most challenging environments (i.e., dense with a high number of obstacles) are considered to saturate the feasibility limits of the PP algorithms. The capability of each of the four path planners used in the PGF in finding a feasible path is assessed. Finally, the effectiveness of the proposed structural performance metrics is evaluated individually and comparatively, and the most essential metrics necessary for evaluating the toolpath of the resulting structures are prescribed. Consequently, the most promising path planners in challenging environments are identified for robotic additive manufacturing applications.

2603.05739 2026-03-09 cs.LG cs.AI

Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment

Ved Sriraman, Adam Block

Comments 52 pages

Abstract

Best-of-N (BoN) sampling is a widely used inference-time alignment method for language models, whereby N candidate responses are sampled from a reference model and the one with the highest predicted reward according to a learned reward model is selected. Despite its widespread practical use, recent theoretical work has suggested that it is statistically suboptimal and vulnerable to reward hacking, the process by which models exploit weaknesses in the learned reward model to achieve high estimated reward without genuinely improving performance. We revisit this question under assumptions that more closely reflect practice than those of prior work. In particular, in contradistinction to earlier analyses that focused on expected true reward, which may not be meaningful in many practical settings, we investigate how inference-time alignment affects the win-rate, a pairwise comparison-based metric more closely aligned with how reward models are trained and evaluated in practice. We demonstrate that, under minimal conditions on the quality of the reference model and learned reward model, properly tuned BoN is both computationally and statistically optimal in achieving high win-rate, partially explaining its widespread practical success. Because BoN remains susceptible to reward-hacking in this setting, we propose a simple and practical variant that provably eliminates reward-hacking while maintaining optimal statistical performance. Finally, we show that prior approaches are provably suboptimal when considering win-rate, highlighting the importance of choosing appropriate objectives when analyzing inference-time alignment methods.

2603.05732 2026-03-09 cs.CV

From Phase Grounding to Intelligent Surgical Narratives

Ethan Peterson, Huixin Zhan

Abstract

Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.

2603.05729 2026-03-09 cs.CV

Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

Junyu Chen, Md Yousuf Harun, Christopher Kanan

Comments Accepted to CVPR 2026 Findings

Abstract

The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNet-V2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.

2603.05727 2026-03-09 cs.CL cs.NA math.NA

Structured Multidimensional Representation Learning for Large Language Models

Alaa El Ichi, Khalide Jbilou, Mohamed El Guide, Franck Dufrenois

Comments 25 pages, 6 figures. Preprint of a journal submission

Abstract

Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG News show that the proposed model can substantially reduce encoder parameters (up to 75% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG News at moderate width we observe a small accuracy decrease in exchange for a 4x reduction in encoder parameters; at BERT-base width (d=768), performance returns to parity.
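
The 1/p figure follows from the quadratic dependence of encoder weights on the embedding width. A back-of-the-envelope sketch, assuming a conventional encoder layer with four d-by-d attention projections and a feed-forward block whose hidden width is 4d, and ignoring biases and normalization parameters (these layer details and the function names are assumptions, not taken from the paper):

```python
def encoder_layer_weights(d: int, ffn_mult: int = 4) -> int:
    """Weight-matrix parameters of one standard encoder layer of width d."""
    attention = 4 * d * d            # query, key, value and output projections
    feed_forward = 2 * d * (ffn_mult * d)  # up- and down-projection
    return attention + feed_forward


def split_layer_weights(d: int, p: int, ffn_mult: int = 4) -> int:
    """Weights of p independent sub-layers, each operating on width d / p."""
    if d % p != 0:
        raise ValueError("d must be divisible by p")
    return p * encoder_layer_weights(d // p, ffn_mult)


full = encoder_layer_weights(768)        # 12 * 768**2
split = split_layer_weights(768, p=4)    # 4 * 12 * 192**2
reduction = 1 - split / full             # fraction of weights removed
```

Under these assumptions the split layer keeps exactly 1/p of the weights, so p=4 removes 75% of them, which agrees with the figure quoted in the abstract.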

2603.05723 2026-03-09 cs.CL cs.AI

Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach

Erin van Liemt, Renee Shelby, Andrew Smart, Sinchana Kumbale, Richard Zhang, Neha Dixit, Qazi Mamunur Rashid, Jamila Smith-Loud

Comments 21 pages, 5 figures, 6 tables

Abstract

There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of its conceptual complexities and how they relate to representations in Generative AI. We survey respondents across parts of Europe, North and South America, Asia, and Africa. We conclude with a set of recommendations for Culture and GenAI development. These include participatory approaches, prioritizing specific cultural dimensions beyond geography, such as religion and tradition, and a sensitivity framework for addressing cultural "redlines".

2603.05716 2026-03-09 cs.RO cs.SY eess.SY

Introducing the transitional autonomous vehicle lane-changing dataset: Empirical Experiments

Abhinav Sharma, Zijun He, Danjue Chen

Abstract

Transitional autonomous vehicles (tAVs), which operate beyond SAE Level 1-2 automation but short of full autonomy, are increasingly sharing the road with human-driven vehicles (HDVs). As these systems interact during complex maneuvers such as lane changes, new patterns may emerge with implications for traffic stability and safety. Assessing these dynamics, particularly during mandatory lane changes, requires high-resolution trajectory data, yet datasets capturing tAV lane-changing behavior are scarce. This study introduces the North Carolina Transitional Autonomous Vehicle Lane-Changing (NC-tALC) Dataset, a high-fidelity trajectory dataset designed to characterize tAV interactions during lane-changing maneuvers. The dataset includes two controlled experimental series. In the first, tAV lane-changing experiments, a tAV executes lane changes in the presence of adaptive cruise control (ACC) equipped target vehicles, enabling analysis of lane-changing execution. In the second, tAV responding experiments, two tAVs act as followers and respond to cut-in maneuvers initiated by another tAV, enabling analysis of follower response dynamics. The dataset contains 152 trials (72 lane-changing and 80 responding trials) sampled at 20 Hz with centimeter-level RTK-GPS accuracy. The NC-tALC dataset provides a rigorous empirical foundation for evaluating tAV decision-making and interaction dynamics in controlled mandatory lane-changing scenarios.

2603.05711 2026-03-09 cs.CV

Any to Full: Prompting Depth Anything for Depth Completion in One Stage

Zhiyuan Zhou, Ruofeng Liu, Taichi Liu, Weijian Zuo, Shanshan Wang, Zhiqing Hong, Desheng Zhang

Abstract

Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4× speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Codes and checkpoints are available at https://github.com/zhiyuandaily/Any2Full.

2603.05708 2026-03-09 cs.CV

Interpretable Perception and Reasoning for Audiovisual Geolocation

Yiyang Su, Xiaoming Liu

Abstract

While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results entail that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.

2603.05706 2026-03-09 cs.AI

Reasoning Models Struggle to Control their Chains of Thought

Chen Yueh-Han, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak

Abstract

Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.

2603.05697 2026-03-09 cs.CV

MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

Dannong Xu, Zhongyu Yang, Jun Chen, Yingfang Yuan, Ming Hu, Lei Sun, Luc Van Gool, Danda Pani Paudel, Chun-Mei Feng

Abstract

Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
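
For readers unfamiliar with the retrieval metric, Recall@k is the fraction of questions whose gold evidence item appears among the top k retrieved candidates. A minimal sketch, assuming exactly one gold item per question as the abstract describes (the identifiers and rankings below are made up for illustration):

```python
from typing import List, Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose single gold item is within the top-k results."""
    if len(rankings) != len(gold):
        raise ValueError("one ranking is required per gold item")
    if not gold:
        raise ValueError("at least one query is required")
    hits = sum(1 for ranked, target in zip(rankings, gold) if target in ranked[:k])
    return hits / len(gold)


# Hypothetical retrieval output for three questions.
rankings: List[List[str]] = [
    ["doc_1", "img_7", "vid_3"],
    ["img_2", "doc_9", "vid_5"],
    ["vid_8", "doc_4", "img_6"],
]
gold = ["doc_1", "doc_9", "img_0"]

r_at_1 = recall_at_k(rankings, gold, k=1)
r_at_2 = recall_at_k(rankings, gold, k=2)
```

Recall@1 is the strictest setting, since only the top-ranked candidate counts, so it is natural for scores to rise as k grows.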

2603.05694 2026-03-09 cs.LG cs.FL

Warm Starting State-Space Models with Automata Learning

William Fishell, Sam Nicholas Kouteili, Mark Santolucito

Abstract

We prove that Moore machines can be exactly realized as state-space models (SSMs), establishing a formal correspondence between symbolic automata and these continuous machine learning architectures. These Moore-SSMs preserve both the complete symbolic structure and input-output behavior of the original Moore machine, but operate in Euclidean space. With this correspondence, we compare the training of SSMs with both passive and active automata learning. In recovering automata from the SYNTCOMP benchmark, we show that SSMs require orders of magnitude more data than symbolic methods and fail to learn state structure. This suggests that symbolic structure provides a strong inductive bias for learning these systems. We leverage this insight to combine the strengths of both automata learning and SSMs in order to learn complex systems efficiently. We learn an adaptive arbitration policy on a suite of arbiters from SYNTCOMP and show that SSMs initialized with symbolically-learned approximations learn both faster and better. We see 2-5 times faster convergence compared to randomly initialized models and better overall model accuracies on test data. Our work lifts automata learning out of purely discrete spaces, enabling principled exploitation of symbolic structure in continuous domains for efficiently learning in complex settings.
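
One familiar way to see why such a correspondence is plausible (not necessarily the authors' construction) is to encode each automaton state as a one-hot vector, give every input symbol its own 0/1 transition matrix, and read the output through a linear map. The sketch below assumes this input-switched linear form and uses a two-state parity machine as the example; all names are illustrative:

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def moore_to_matrices(n_states: int, transitions: Dict[Tuple[int, str], int],
                      outputs: List[int], n_outputs: int) -> Tuple[Dict[str, Matrix], Matrix]:
    """Build per-symbol transition matrices A[a] and a readout matrix C."""
    symbols = sorted({a for (_, a) in transitions})
    A = {a: [[0.0] * n_states for _ in range(n_states)] for a in symbols}
    for (q, a), q_next in transitions.items():
        A[a][q_next][q] = 1.0  # column q routes state q to its successor
    C = [[0.0] * n_states for _ in range(n_outputs)]
    for q, o in enumerate(outputs):
        C[o][q] = 1.0  # state q emits output symbol o
    return A, C


def matvec(M: Matrix, v: List[float]) -> List[float]:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def run_ssm(A: Dict[str, Matrix], C: Matrix, word: str, start: int = 0) -> List[int]:
    """Iterate h <- A[a] h and decode y = C h after each input symbol."""
    n_states = len(C[0])
    h = [0.0] * n_states
    h[start] = 1.0  # one-hot encoding of the initial state
    decoded = []
    for a in word:
        h = matvec(A[a], h)
        y = matvec(C, h)
        decoded.append(y.index(max(y)))
    return decoded


# Parity machine: state 0 = even number of "1"s seen, state 1 = odd.
delta = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 1, (1, "1"): 0}
A, C = moore_to_matrices(n_states=2, transitions=delta, outputs=[0, 1], n_outputs=2)
outputs_seen = run_ssm(A, C, "1101")
```

Because the hidden vector stays one-hot at every step, the continuous recurrence reproduces the discrete run exactly. Such matrices could also serve as an initialization that gradient training then adjusts, in the spirit of the warm start described in the abstract.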

2603.05690 2026-03-09 cs.CL

FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation

Hung Nguyen Huy, Mo El-Haj, Dawn Knight, Paul Rayson

Comments 10 pages

Journal ref
Language Resources and Evaluation Conference (LREC) 2026
Abstract

FreeTxt-Vi is a free and open-source web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus analysis features such as concordancing, keyword analysis, word relation exploration, and interactive visualisation with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text analysis platforms, FreeTxt-Vi is evaluated as a set of language processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.

2603.05686 2026-03-09 cs.CV

OWL: A Novel Approach to Machine Perception During Motion

Daniel Raviv, Juan D. Yepes

English abstract

We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.

2603.05673 2026-03-09 cs.LG cs.SC math.AG

Reinforcement Learning for Power-Flow Network Analysis

Alperen Ergur, Julia Lindberg, Vinny Miller

Comments more experiments will be added soon

English abstract

The power flow equations are non-linear multivariate equations that describe the relationship between power injections and bus voltages of electric power networks. Given a network topology, we are interested in finding network parameters with many equilibrium points. This corresponds to finding instances of the power flow equations with many real solutions. Current state-of-the-art algorithms in computational algebra are not capable of answering this question for networks involving more than a small number of variables. To remedy this, we design a probabilistic reward function that gives a good approximation to this root count, and a state-space that mimics the space of power flow equations. We derive the average root count for a Gaussian model, and use this as a baseline for our RL agents. The agents discover instances of the power flow equations with many more solutions than the average baseline. This demonstrates the potential of RL for power-flow network design and analysis as well as the potential for RL to contribute meaningfully to problems that involve complex non-linear algebra or geometry. (Author order is alphabetical; all authors contributed equally.)

2603.05671 2026-03-09 cs.LG

The Value of Graph-based Encoding in NBA Salary Prediction

Junhao Su, David Grimsman, Christopher Archibald

Comments 6 pages, IEEE conference template style. Submitted to IETC 2026; decision on Mar 22

English abstract

Market valuation of professional athletes is a difficult problem, given the amount of variability in performance and location from year to year. In the National Basketball Association (NBA), a straightforward way to address this problem is to build a tabular data set and use supervised machine learning to predict a player's salary based on the player's performance in the previous year. For younger players, whose contracts are mostly built on draft position, this approach works well; however, it can fail for veterans or those whose salaries are in the upper tail of the distribution. In this paper, we show that building a knowledge graph with on- and off-court data, embedding that graph in a vector space, and including that vector in the tabular data allows the supervised learning to better understand the landscape of factors that affect salary. We compare several graph embedding algorithms and show that such a process is vital to NBA salary prediction.

2603.05670 2026-03-09 cs.RO

TransMASK: Masked State Representation through Learned Transformation

Sagar Parekh, Preston Culbertson, Dylan P. Losey

English abstract

Humans train robots to complete tasks in one environment, and expect robots to perform those same tasks in new environments. As humans, we know which aspects of the environment (i.e., the state) are relevant to the task. But there are also things that do not matter; e.g., the color of the table or the presence of clutter in the background. Ideally, the robot's policy learns to ignore these irrelevant state components. Achieving this invariance improves generalization: the robot knows not to factor irrelevant variables into its control decisions, making the policy more robust to environment changes. In this paper we therefore propose a self-supervised method to learn a mask which, when multiplied by the observed state, transforms that state into a latent representation that is biased towards relevant elements. Our method -- which we call TransMASK -- can be combined with a variety of imitation learning frameworks (such as diffusion policies) without any additional labels or alterations to the loss function. To achieve this, we recognize that the learned policy updates to better match the human's true policy. This true policy only depends on the relevant parts of the state; hence, as the gradients pass back through the learned policy and our proposed mask, they increase the value for elements that cause the robot to better imitate the human. We can therefore train TransMASK at the same time as we learn the policy. By normalizing the magnitude of each row in TransMASK, we force the mask to align with the Jacobian of the expert policy: columns that correspond to relevant states have large magnitudes, while columns for irrelevant states approach zero magnitude. We compare our approach to other methods that extract relevant states for downstream imitation learning. See our project website: https://collab.me.vt.edu/TransMASK/
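
The masking step described above can be sketched as follows. This is only the forward pass under stated assumptions: the mask is treated as a plain matrix with unit-norm rows, and the values are hand-picked rather than learned; the helper names (`normalize_rows`, `apply_mask`, `column_magnitudes`) are hypothetical and the training loop from the paper is not reproduced.

```python
import math

def normalize_rows(M):
    """Scale every row of the mask to unit L2 norm (zero rows are left as is)."""
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out

def apply_mask(M, state):
    """Latent representation z = M @ state."""
    return [sum(m * s for m, s in zip(row, state)) for row in M]

def column_magnitudes(M):
    """L2 norm of each column; a column near zero marks an ignored state entry."""
    return [math.sqrt(sum(row[j] ** 2 for row in M)) for j in range(len(M[0]))]

# Hypothetical mask: the third state entry (say, a background feature) has zero weight.
mask = normalize_rows([[3.0, 4.0, 0.0],
                       [0.0, 2.0, 0.0]])
state = [1.0, 2.0, 100.0]

print(apply_mask(mask, state))      # latent fed to the policy
print(column_magnitudes(mask))      # per-state relevance read off the mask
```

In this toy case the large value in the third state entry has no effect on the latent, which is the invariance the abstract is after.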

2603.05663 2026-03-09 cs.CV

Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan

English abstract

Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
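
The per-frame budget allocation mentioned above can be sketched roughly as below. The abstract does not give the scoring rule, so everything here is an assumption: the linear blend weight `alpha`, the per-frame `floor` that guards against over-pruned segments, and the largest-remainder rounding are illustrative choices, and `allocate_budgets` is a hypothetical name.

```python
def allocate_budgets(relevance, variation, total, alpha=0.5, floor=1):
    """Split `total` tokens across frames, proportional to a blended score.

    relevance, variation: per-frame scores (assumed non-negative).
    alpha: assumed blend weight between query relevance and inter-frame variation.
    floor: minimum tokens kept in every frame.
    """
    n = len(relevance)
    if total < floor * n:
        raise ValueError("total budget is smaller than the per-frame floor")
    scores = [alpha * r + (1 - alpha) * v for r, v in zip(relevance, variation)]
    if sum(scores) == 0:
        scores = [1.0] * n  # fall back to a uniform split
    spare = total - floor * n
    raw = [spare * s / sum(scores) for s in scores]
    budgets = [floor + int(x) for x in raw]
    # Largest-remainder rounding: hand leftover tokens to the biggest fractions.
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets

# Hypothetical scores for four frames and a 16-token budget.
budgets = allocate_budgets([0.9, 0.1, 0.5, 0.5], [0.1, 0.1, 0.9, 0.5], total=16)
print(budgets, sum(budgets))
```

A frame with low relevance but high variation still receives a sizeable share, and no frame drops below the floor.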

2603.05651 2026-03-09 cs.CL cs.AI cs.HC

The Fragility Of Moral Judgment In Large Language Models

Tom van Nuenen, Pratik S. Sachdeva

Comments 22 pages, 7 figures, 10 tables, plus appendices

English abstract

People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluated all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
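
The two summary statistics reported above, flip rate and Cohen's kappa, can be computed as in this minimal sketch. The verdict labels and toy data are hypothetical and do not come from the paper; only the formulas are standard.

```python
from collections import Counter

def flip_rate(base, perturbed):
    """Fraction of items whose verdict changes after a perturbation."""
    return sum(b != p for b, p in zip(base, perturbed)) / len(base)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when both raters always give the same single label.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts from two evaluation protocols on four dilemmas.
protocol_a = ["X", "X", "Y", "Y"]
protocol_b = ["X", "Y", "Y", "Y"]
print(flip_rate(protocol_a, protocol_b))     # 0.25
print(cohens_kappa(protocol_a, protocol_b))  # 0.5
```

Kappa sits well below raw agreement whenever the label distribution makes chance agreement likely, which is why the abstract reports both numbers.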

2603.05641 2026-03-09 cs.RO cs.HC

RFM-HRI: A Multimodal Dataset of Medical Robot Failure, User Reaction and Recovery Preferences for Item Retrieval Tasks

Yashika Batra, Giuliano Pioldi, Promise Ekpo, Arman Sayatqyzy, Purnjay Maruur, Shalom Otieno, Kevin Ching, Angelique Taylor

English abstract

While robots deployed in real-world environments inevitably experience interaction failures, understanding how users respond through verbal and non-verbal behaviors remains under-explored in human-robot interaction (HRI). This gap is particularly significant in healthcare-inspired settings, where interaction failures can directly affect task performance and user trust. We present the Robot Failures in Medical HRI (RFM-HRI) Dataset, a multimodal dataset capturing dyadic interactions between humans and robots embodied in crash carts, where communication failures are systematically induced during item retrieval tasks. Through Wizard-of-Oz studies with 41 participants across laboratory and hospital settings, we recorded responses to four failure types (speech, timing, comprehension, and search) derived from three years of crash-cart robot interaction data. The dataset contains 214 interaction samples including facial action units, head pose, speech transcripts, and post-interaction self-reports. Our analysis shows that failures significantly degrade affective valence and reduce perceived control compared to successful interactions. Failures are strongly associated with confusion, annoyance, and frustration, while successful interactions are characterized by surprise, relief, and confidence in task completion. Emotional responses also evolve across repeated failures, with confusion decreasing and frustration increasing over time. This work contributes (1) a publicly available multimodal dataset (RFM-HRI), (2) analysis of user responses to different failure types and preferred recovery strategies, and (3) a crash-cart retrieval scenario enabling systematic comparison of recovery strategies with implications for safety-critical failure recovery. Our findings provide a foundation for failure detection and recovery methods in embodied HRI.

2603.05638 2026-03-09 cs.RO

Control Lyapunov Functions for Underactuated Soft Robots

Huy Pham, Zach J. Patterson

Comments 8 pages, 5 figures, 2 tables. Submitted for publication to a conference

English abstract

Soft and soft-rigid hybrid robots are inherently underactuated and operate under tight actuator limits, making task-space control with stability guarantees challenging. Common nonlinear strategies for soft robots (e.g., those based on PD control) often rely on the assumption of full actuation with no actuator limits. This paper presents a general control framework for task-space regulation and tracking of underactuated soft robots under bounded inputs. The method enforces a rapidly exponentially stabilizing control Lyapunov function as a convex inequality constraint while simultaneously satisfying underactuated full-body dynamics and actuator bounds. We validate the approach in simulation on several platforms spanning increasing underactuation: a simple two-link tendon-driven "finger", a trimmed helicoid manipulator, and a highly underactuated spiral robot. We compare against a number of baseline methods from the literature. Results show improved task-space accuracy and consistent Lyapunov convergence under input limits, achieving superior set-point and trajectory-tracking performance.
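
For readers unfamiliar with the constraint mentioned above, a generic form of a Lyapunov-constrained quadratic program is sketched below. It assumes control-affine dynamics $\dot{x} = f(x) + g(x)u$ and the textbook rapidly exponentially stabilizing condition with rate $c_3/\varepsilon$; the symbols $V_\varepsilon$, $c_3$, $\varepsilon$, $u_{\min}$, $u_{\max}$ are assumptions, and the paper's own formulation (which also carries the full-body dynamics) may differ.

```latex
\begin{aligned}
u^\star(x) = \arg\min_{u}\ & \|u\|^2 \\
\text{s.t.}\ & L_f V_\varepsilon(x) + L_g V_\varepsilon(x)\,u \le -\frac{c_3}{\varepsilon}\,V_\varepsilon(x), \\
& u_{\min} \le u \le u_{\max}.
\end{aligned}
```

The Lyapunov constraint is affine in $u$, so the problem stays a convex QP. With hard input bounds it can become infeasible, which is why a slack variable on the first constraint is a common relaxation.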

2603.05629 2026-03-09 cs.CV

Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

Merve Tapli, Quentin Bouniot, Wolfgang Stammer, Zeynep Akata, Emre Akbas

Comments Accepted to CVPR 2026

English abstract

Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.

2603.05625 2026-03-09 cs.LG

Identifying Adversary Characteristics from an Observed Attack

Soyon Choi, Scott Alfeld, Meiyi Ma

English abstract

When used in automated decision-making systems, machine learning (ML) models are vulnerable to data-manipulation attacks. Some defense mechanisms (e.g., adversarial regularization) directly affect the ML models while others (e.g., anomaly detection) act within the broader system. In this paper we consider a different task for defending against the adversary, focusing on the attacker rather than the attack. We present and demonstrate a framework for identifying characteristics of the attacker from an observed attack. We prove that, without additional knowledge, the attacker is non-identifiable (multiple potential attackers would perform the same observed attack). To address this challenge, we propose a domain-agnostic framework to identify the most probable attacker. This framework aids the defender in two ways. First, knowledge about the attacker can be leveraged for exogenous mitigation (i.e., addressing the vulnerability by altering the decision-making system outside the learning algorithm and/or limiting the attacker's capability). Second, when implementing defense methods that directly affect the learning process (e.g., adversarial regularization), knowledge of the specific attacker improves performance. We present the details of our framework and illustrate its applicability through specific instantiations on a variety of learners.