arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1771
2603.22629 2026-03-25 cs.CL cs.AI

LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation

Hailay Teklehaymanot, Dren Fazlija, Wolfgang Nejdl

Comments 12 pages, 1 figure, 1 Table

详情
英文摘要

Adapting pretrained language models to low-resource, morphologically rich languages remains a significant challenge. Existing vocabulary expansion methods typically rely on arbitrarily segmented subword units, resulting in fragmented lexical representations and loss of critical morphological information. To address this limitation, we propose the Lexically Grounded Subword Embedding Initialization (LGSE) framework, which introduces morphologically informed segmentation for initializing embeddings of novel tokens. Instead of using random vectors or arbitrary subwords, LGSE decomposes words into their constituent morphemes and constructs semantically coherent embeddings by averaging pretrained subword or FastText-based morpheme representations. When a token cannot be segmented into meaningful morphemes, its embedding is constructed using character n-gram representations to capture structural information. During Language-Adaptive Pretraining, we apply a regularization term that penalizes large deviations of newly introduced embeddings from their initialized values, preserving alignment with the original pretrained embedding space while enabling adaptation to the target language. To isolate the effect of initialization, we retain the original pre-trained model vocabulary and tokenizer and update only the new embeddings during adaptation. We evaluate LGSE on three NLP tasks: Question Answering, Named Entity Recognition, and Text Classification, in two morphologically rich, low-resource languages: Amharic and Tigrinya, where morphological segmentation resources are available. Experimental results show that LGSE consistently outperforms baseline methods across all tasks, demonstrating the effectiveness of morphologically grounded embedding initialization for improving representation quality in underrepresented languages. Project resources are available in the GitHub link.

2603.22626 2026-03-25 cs.CV

PIVM: Diffusion-Based Prior-Integrated Variation Modeling for Anatomically Precise Abdominal CT Synthesis

Dinglun He, Baoming Zhang, Xu Wang, Yao Hao, Deshan Yang, Ye Duan

Comments Accepted at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026 (Oral). Equal contribution by the first three authors

详情
英文摘要

Abdominal CT data are limited by high annotation costs and privacy constraints, which hinder the development of robust segmentation and diagnostic models. We present a Prior-Integrated Variation Modeling (PIVM) framework, a diffusion-based method for anatomically accurate CT image synthesis. Instead of generating full images from noise, PIVM predicts voxel-wise intensity variations relative to organ-specific intensity priors derived from segmentation labels. These priors and labels jointly guide the diffusion process, ensuring spatial alignment and realistic organ boundaries. Unlike latent-space diffusion models, our approach operates directly in image space while preserving the full Hounsfield Unit (HU) range, capturing fine anatomical textures without smoothing. Source code is available at https://github.com/BZNR3/PIVM.

2603.22624 2026-03-25 cs.CV cs.AI

Toward Faithful Segmentation Attribution via Benchmarking and Dual-Evidence Fusion

Abu Noman Md Sakib, OFM Riaz Rahman Aranya, Kevin Desai, Zijie Zhang

Journal ref CVPR 2026

详情
英文摘要

Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model's prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention-based faithfulness, off-target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual-Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region-level intervention signals through agreement-weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion-based faithfulness over gradient-only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness-stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at https://github.com/anmspro/DEA.

2603.22623 2026-03-25 cs.CV cs.AI

To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models

OFM Riaz Rahman Aranya, Kevin Desai

详情
英文摘要

Vision-language models (VLMs) adapted to the medical domain have shown strong performance on visual question answering benchmarks, yet their robustness against two critical failure modes, hallucination and sycophancy, remains poorly understood, particularly in combination. We evaluate six VLMs (three general-purpose, three medical-specialist) on three medical VQA datasets and uncover a grounding-sycophancy tradeoff: models with the lowest hallucination propensity are the most sycophantic, while the most pressure-resistant model hallucinates more than all medical-specialist models. To characterize this tradeoff, we propose three metrics: L-VASE, a logit-space reformulation of VASE that avoids its double-normalization; CCS, a confidence-calibrated sycophancy score that penalizes high-confidence capitulation; and Clinical Safety Index (CSI), a unified safety index that combines grounding, autonomy, and calibration via a geometric mean. Across 1,151 test cases, no model achieves a CSI above 0.35, indicating that none of the evaluated 7-8B parameter VLMs is simultaneously well-grounded and robust to social pressure. Our findings suggest that joint evaluation of both properties is necessary before these models can be considered for clinical use. Code is available at https://github.com/UTSA-VIRLab/AgreeOrRight

2603.22622 2026-03-25 cs.CV

A Vision Language Model for Generating Procedural Plant Architecture Representations from Simulated Images

Heesup Yun, Isaac Kazuo Uyehara, Ioannis Droutsas, Earl Ranario, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles

详情
英文摘要

Three-dimensional (3D) procedural plant architecture models have emerged as an important tool for simulation-based studies of plant structure and function, extracting plant architectural parameters from field measurements, and for generating realistic plants in computer graphics. However, measuring the architectural parameters and nested structures for these models at the field scales remains prohibitively labor-intensive. We present a novel algorithm that generates a 3D plant architecture from an image, creating a functional structural plant model that reflects organ-level geometric and topological parameters and provides a more comprehensive representation of the plant's architecture. Instead of using 3D sensors or processing multi-view images with computer vision to obtain the 3D structure of plants, we proposed a method that generates token sequences that encode a procedural definition of plant architecture. This work used only synthetic images for training and testing, with exact architectural parameters known, allowing testing of the hypothesis that organ-level architectural parameters could be extracted from image data using a vision-language model (VLM). A synthetic dataset of cowpea plant images was generated using the Helios 3D plant simulator, with the detailed plant architecture encoded in XML files. We developed a plant architecture tokenizer for the XML file defining plant architecture, converting it into a token sequence that a language model can predict. The model achieved a token F1 score of 0.73 during teacher-forced training. Evaluation of the model was performed through autoregressive generation, achieving a BLEU-4 score of 94.00% and a ROUGE-L score of 0.5182. This led to the conclusion that such plant architecture model generation and parameter extraction were possible from synthetic images; thus, future work will extend the approach to real imagery data.

2603.22621 2026-03-25 cs.LG

Transfer learning via interpolating structures

T. A. Dardeno, A. J. Hughes, L. A. Bull, R. S. Mills, N. Dervilis, K. Worden

Comments preprint submitted to Mechanical Systems and Signal Processing

详情
英文摘要

Despite recent advances in population-based structural health monitoring (PBSHM), knowledge transfer between highly-disparate structures (i.e., heterogeneous populations) remains a challenge. The current work proposes that heterogeneous transfer may be accomplished via intermediate structures that bridge the gap in information between the structures of interest. A key aspect of the technique is the idea that by varying parameters such as material properties and geometry, one structure can be continuously morphed into another. The approach is demonstrated via a case study involving the parameterisation of (and transfer between) simulated heterogeneous bridge designs (Case 1). Transfer between simplified physical representations of a 'bridge' and 'aeroplane' is then demonstrated in Case 2, via a chain of finite-element models. The facetious question 'When is a bridge not an aeroplane?' has been previously asked in the context of predicting positive transfer based on structural similarity. While the obvious answer to this question is 'Always,' the results presented in the current paper show that, in some cases, positive transfer can indeed be achieved between highly-disparate systems.

2603.22619 2026-03-25 cs.AI

Bridging the Know-Act Gap via Task-Level Autoregressive Reasoning

Jihyun Janice Ahn, Ryo Kamoi, Berk Atil, Renze Lou, WonWoo Kang, Heehyun Park, Sarkar Snigdha Sarathi Das, Zhuoyang Zou, Xiaoxin Lu, Yusen Zhang, Asfahan Shah, Ridwanul Hasan Tanvir, Lingxiao Zhao, Hongxi Huang, Vignesh Venkatesh, Dianjun Lin, Hamid Shah, Wentao Wang, Zhanpeng Song, Joshua Reed Bassin, Dax Patel, Ishan Appareddy Agrahar, Sahil Pardasani, Xin Dong, Fatemeh Rahbari, Benjamin David Rishel, Soochan Andrew Lee, Yuv Boghani, Ali B. AlNaseeb, Pranav Suby, Seokhyeon Bae, Shreya Buddharaju, Damien Kula, Soumyadeep Das, Hanyang Frank Liu, Faye Mo, Wenpeng Yin

Comments 12 pages

详情
英文摘要

LLMs often generate seemingly valid answers to flawed or ill-posed inputs. This is not due to missing knowledge: under discriminative prompting, the same models can mostly identify such issues, yet fail to reflect this in standard generative responses. This reveals a fundamental know-act gap between discriminative recognition and generative behavior. Prior work largely characterizes this issue in narrow settings, such as math word problems or question answering, with limited focus on how to integrate these two modes. In this work, we present a comprehensive analysis using FaultyScience, a newly constructed large-scale, cross-disciplinary benchmark of faulty scientific questions. We show that the gap is pervasive and stems from token-level autoregression, which entangles task selection (validate vs. answer) with content generation, preventing discriminative knowledge from being utilized. To address this, we propose DeIllusionLLM, a task-level autoregressive framework that explicitly models this decision. Through self-distillation, the model unifies discriminative judgment and generative reasoning within a single backbone. Empirically, DeIllusionLLM substantially reduces answer-despite-error failures under natural prompting while maintaining general reasoning performance, demonstrating that self-distillation is an effective and scalable solution for bridging the discriminative-generative know-act gap

2603.22606 2026-03-25 cs.CV

TrajLoom: Dense Future Trajectory Generation from Video

Zewei Zhang, Jia Jun Cheng Xian, Kaiwen Liu, Ming Liang, Hang Chu, Jun Chen, Renjie Liao

Comments Project page, code, model checkpoints, and datasets: https://trajloom.github.io/

详情
英文摘要

Predicting future motion is crucial in video understanding and controllable video generation. Dense point trajectories are a compact, expressive motion representation, but modeling their future evolution from observed video remains challenging. We propose a framework that predicts future trajectories and visibility from past trajectories and video context. Our method has three components: (1) Grid-Anchor Offset Encoding, which reduces location-dependent bias by representing each point as an offset from its pixel-center anchor; (2) TrajLoom-VAE, which learns a compact spatiotemporal latent space for dense trajectories with masked reconstruction and a spatiotemporal consistency regularizer; and (3) TrajLoom-Flow, which generates future trajectories in latent space via flow matching, with boundary cues and on-policy K-step fine-tuning for stable sampling. We also introduce TrajLoomBench, a unified benchmark spanning real and synthetic videos with a standardized setup aligned with video-generation benchmarks. Compared with state-of-the-art methods, our approach extends the prediction horizon from 24 to 81 frames while improving motion realism and stability across datasets. The predicted trajectories directly support downstream video generation and editing. Code, model checkpoints, and datasets are available at https://trajloom.github.io/.

2603.22604 2026-03-25 cs.RO

Trajectory Generation for Underactuated Soft Robot Manipulators using Discrete Elastic Rod Dynamics

Beibei Liu, Akua K. Dickson, Ran Jing, Andrew P. Sabelhaus

详情
英文摘要

Soft robots are well suited for contact-rich tasks due to their compliance, yet this property makes accurate and tractable modeling challenging. Planning motions with dynamically-feasible trajectories requires models that capture arbitrary deformations, remain computationally efficient, and are compatible with underactuation. However, existing approaches balance these properties unevenly: continuum rod models provide physical accuracy but are computationally demanding, while reduced-order approximations improve efficiency at the cost of modeling fidelity. To address this, our work introduces a control-oriented reformulation of Discrete Elastic Rod (DER) dynamics for soft robots, and a method to generate trajectories with these dynamics. The proposed formulation yields a control-affine representation while preserving certain first-principles force-deformation relationships. As a result, the generated trajectories are both dynamically feasible and consistent with the underlying actuation assumptions. We present our trajectory generation framework and validate it experimentally on a pneumatic soft robotic limb. Hardware results demonstrate consistently improved trajectory tracking performance over a constant-curvature-based baseline, particularly under complex actuation conditions.

2603.22590 2026-03-25 cs.LG cs.CR eess.AS

Precision-Varying Prediction (PVP): Robustifying ASR systems against adversarial attacks

Matías Pizarro, Raghavan Narasimhan, Asja Fischer

详情
英文摘要

With the increasing deployment of automated and agentic systems, ensuring the adversarial robustness of automatic speech recognition (ASR) models has become critical. We observe that changing the precision of an ASR model during inference reduces the likelihood of adversarial attacks succeeding. We take advantage of this fact to make the models more robust by simple random sampling of the precision during prediction. Moreover, the insight can be turned into an adversarial example detection strategy by comparing outputs resulting from different precisions and leveraging a simple Gaussian classifier. An experimental analysis demonstrates a significant increase in robustness and competitive detection performance for various ASR models and attack types.

2603.22589 2026-03-25 cs.SD eess.AS eess.SP

Velocity Potential Neural Field for Efficient Ambisonics Impulse Response Modeling

Yoshiki Masuyama, Francois G. Germain, Gordon Wichern, Chiori Hori, Jonathan Le Roux

Comments Accepted to ICASSP 2026

详情
英文摘要

First-order Ambisonics (FOA) is a standard spatial audio format based on spherical harmonic decomposition. Its zeroth- and first-order components capture the sound pressure and particle velocity, respectively. Recently, physics-informed neural networks have been applied to the spatial interpolation of FOA signals, regularizing the network outputs based on soft penalty terms derived from physical principles, e.g., the linearized momentum equation. In this paper, we reformulate the task so that the predicted FOA signal automatically satisfies the linearized momentum equation. Our network approximates a scalar function called velocity potential, rather than the FOA signal itself. Then, the FOA signal can be readily recovered through the partial derivatives of the velocity potential with respect to the network inputs (i.e., time and microphone position) according to physics of sound propagation. By deriving the four channels of FOA from the single-channel velocity potential, the reconstructed signal follows the physical principle at any time and position by construction. Experimental results on room impulse response reconstruction confirm the effectiveness of the proposed framework.

2603.22583 2026-03-25 cs.CV cs.RO

A vision-language model and platform for temporally mapping surgery from video

Dani Kiyasseh

详情
英文摘要

Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously-unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.

2603.22582 2026-03-25 cs.CL cs.AI

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Richard J. Young

Comments 27 pages, 7 figures, 12 tables

详情
英文摘要

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.

2603.22580 2026-03-25 cs.RO

Task-Agnostic Exoskeleton Control Supports Elderly Joint Energetics during Hip-Intensive Tasks

Jiefu Zhang, Nikhil V. Divekar, Chandramouli Krishnan, Robert D. Gregg

详情
英文摘要

Age-related mobility decline is frequently accompanied by a redistribution of joint kinetics, where older adults compensate for reduced ankle function by increasing demand on the hip. Paradoxically, this compensatory shift typically coincides with age-related reductions in maximal hip power. Although robotic exoskeletons can provide immediate energetic benefits, conventional control strategies have limited previous studies in this population to specific tasks such as steady-state walking, which do not fully reflect mobility demands in the home and community. Here, we implement a task-agnostic hip exoskeleton controller that is inherently sensitive to joint power and validate its efficacy in eight older adults. Across a battery of hip-intensive activities that included level walking, ramp ascent, stair climbing, and sit-to-stand transitions, the exoskeleton matched biological power profiles with high accuracy (mean cosine similarity 0.89). Assistance significantly reduced sagittal plane biological positive work by 24.7% at the hip and by 9.3% for the lower limb, while simultaneously augmenting peak total (biological + exoskeleton) hip power and reducing peak biological hip power. These results suggest that hip exoskeletons can potentially enhance endurance through biological work reduction, and increase functional reserve through total power augmentation, serving as a promising biomechanical intervention to support older adults' mobility.

2603.22576 2026-03-25 cs.CL

CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

Giovana Kerche Bonás, Roseval Malaquias Junior, Marcos Piau, Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Celio Larcher, Ramon Pires, Rodrigo Nogueira

详情
英文摘要

We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at \$0.13 vs Claude-Haiku-4.5: 73.5% at \$1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.

2603.22574 2026-03-25 cs.RO

GIFT: Generalizing Intent for Flexible Test-Time Rewards

Fin Amin, Nathaniel Dennler, Andreea Bobu

Comments To appear at IEEE ICRA '26

详情
英文摘要

Robots learn reward functions from user demonstrations, but these rewards often fail to generalize to new environments. This failure occurs because learned rewards latch onto spurious correlations in training data rather than the underlying human intent that demonstrations represent. Existing methods leverage visual or semantic similarity to improve robustness, yet these surface-level cues often diverge from what humans actually care about. We present Generalizing Intent for Flexible Test-Time Rewards (GIFT), a framework that grounds reward generalization in human intent rather than surface cues. GIFT leverages language models to infer high-level intent from user demonstrations by contrasting preferred with non-preferred behaviors. At deployment, GIFT maps novel test states to behaviorally equivalent training states via intent-conditioned similarity, enabling learned rewards to generalize across distribution shifts without retraining. We evaluate GIFT on tabletop manipulation tasks with new objects and layouts. Across four simulated tasks with over 50 unseen objects, GIFT consistently outperforms visual and semantic similarity baselines in test-time pairwise win rate and state-alignment F1 score. Real-world experiments on a 7-DoF Franka Panda robot demonstrate that GIFT reliably transfers to physical settings. Further discussion can be found at https://mit-clear-lab.github.io/GIFT/

2603.22572 2026-03-25 cs.CV

FullCircle: Effortless 3D Reconstruction from Casual 360$^\circ$ Captures

Yalda Foroutan, Ipek Oztas, Daniel Rebain, Aysegul Dundar, Kwang Moo Yi, Lily Goli, Andrea Tagliasacchi

详情
英文摘要

Radiance fields have emerged as powerful tools for 3D scene reconstruction. However, casual capture remains challenging due to the narrow field of view of perspective cameras, which limits viewpoint coverage and feature correspondences necessary for reliable camera calibration and reconstruction. While commercially available 360$^\circ$ cameras offer significantly broader coverage than perspective cameras for the same capture effort, existing 360$^\circ$ reconstruction methods require special capture protocols and pre-processing steps that undermine the promise of radiance fields: effortless workflows to capture and reconstruct 3D scenes. We propose a practical pipeline for reconstructing 3D scenes directly from raw 360$^\circ$ camera captures. We require no special capture protocols or pre-processing, and exhibit robustness to a prevalent source of reconstruction errors: the human operator that is visible in all 360$^\circ$ imagery. To facilitate evaluation, we introduce a multi-tiered dataset of scenes captured as raw dual-fisheye images, establishing a benchmark for robust casual 360$^\circ$ reconstruction. Our method significantly outperforms not only vanilla 3DGS for 360$^\circ$ cameras but also robust perspective baselines when perspective cameras are simulated from the same capture, demonstrating the advantages of 360$^\circ$ capture for casual reconstruction. Additional results are available at: https://theialab.github.io/fullcircle

2603.22566 2026-03-25 cs.CL

Reddit After Roe: A Computational Analysis of Abortion Narratives and Barriers in the Wake of Dobbs

Aria Pessianzadeh, Alex H. Poole, Rezvaneh Rezapour

详情
英文摘要

The 2022 U.S. Supreme Court decision in Dobbs v. Jackson Women's Health Organization reshaped the reproductive rights landscape, introducing new uncertainty and barriers to abortion access. We present a large-scale computational analysis of abortion discourse on Reddit, examining how barriers to access are articulated across information-seeking and information-sharing behaviors, different stages of abortion (before, during, after), and three phases of the Dobbs decision in 2022. Drawing on more than 17,000 posts from four abortion-related subreddits, we employed a multi-step pipeline to classify posts by information type, abortion stage, barrier category, and expressed emotions. Using a codebook of eight barrier types, including legal, financial, emotional, and social obstacles, we analyzed their associations with emotions and information behaviors. Topic modeling of model-generated barrier rationales further revealed how discourse evolved in response to shifting legal and cultural contexts. Our findings show that emotional and psychological barriers consistently dominate abortion narratives online, with emotions such as nervousness, confusion, fear, and sadness prevalent across discourse. By linking information behaviors, barriers, emotions, and temporal dynamics, this study provides a multi-dimensional account of how abortion is navigated in online communities.

2603.22561 2026-03-25 cs.AI

AI Mental Models: Learned Intuition and Deliberation in a Bounded Neural Architecture

Laurence Anthony

详情
英文摘要

This paper asks whether a bounded neural architecture can exhibit a meaningful division of labor between intuition and deliberation on a classic 64-item syllogistic reasoning benchmark. More broadly, the benchmark is relevant to ongoing debates about world models and multi-stage reasoning in AI. It provides a controlled setting for testing whether a learned system can develop structured internal computation rather than only one-shot associative prediction. Experiment 1 evaluates a direct neural baseline for predicting full 9-way human response distributions under 5-fold cross-validation. Experiment 2 introduces a bounded dual-path architecture with separate intuition and deliberation pathways, motivated by computational mental-model theory (Khemlani & Johnson-Laird, 2022). Under cross-validation, bounded intuition reaches an aggregate correlation of r = 0.7272, whereas bounded deliberation reaches r = 0.8152, and the deliberation advantage is significant across folds (p = 0.0101). The largest held-out gains occur for NVC, Eca, and Oca, suggesting improved handling of rejection responses and c-a conclusions. A canonical 80:20 interpretability run and a five-seed stability sweep further indicate that the deliberation pathway develops sparse, differentiated internal structure, including an Oac-leaning state, a dominant workhorse state, and several weakly used or unused states whose exact indices vary across runs. These findings are consistent with reasoning-like internal organization under bounded conditions, while stopping short of any claim that the model reproduces full sequential processes of model construction, counterexample search, and conclusion revision.

2603.22539 2026-03-25 cs.CV

Generalized multi-object classification and tracking with sparse feature resonator networks

Lazar Supic, Alec Mullen, E. Paxon Frady

Comments 6 pages, 2 figures, NICE 2026

详情
英文摘要

In visual scene understanding tasks, it is essential to capture both invariant and equivariant structure. While neural networks are frequently trained to achieve invariance to transformations such as translation, this often comes at the cost of losing access to equivariant information - e.g., the precise location of an object. Moreover, invariance is not naturally guaranteed through supervised learning alone, and many architectures generalize poorly to input transformations not encountered during training. Here, we take an approach based on analysis-by-synthesis and factoring using resonator networks. A generative model describes the construction of simple scenes containing MNIST digits and their transformations, like color and position. The resonator network inverts the generative model, and provides both invariant and equivariant information about particular objects. Sparse features learned from training data act as a basis set to provide flexibility in representing variable shapes of objects, allowing the resonator network to handle previously unseen digit shapes from the test set. The modular structure provides a shape module which contains information about the object shape with translation factored out, allowing a simple classifier to operate on centered digits. The classification layer is trained solely on centered data, requiring much less training data, and the network as a whole can identify objects with arbitrary translations without data augmentation. The natural attention-like mechanism of the resonator network also allows for analysis of scenes with multiple objects, where the network dynamics selects and centers only one object at a time. Further, the specific position information of a particular object can be extracted from the translation module, and we show that the resonator can be designed to track multiple moving objects with precision of a few pixels.

2603.22531 2026-03-25 cs.CV

UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images

Kaizhen Tan, Fan Zhang

详情
英文摘要

Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large-scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street-view image. The method combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane. On a ground-truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV-SideWidth, a prototype sidewalk-width dataset covering 527 OpenStreetMap street segments. The results indicate that street-view imagery can support scalable generation of candidate sidewalk-width attributes, while broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.

2603.22529 2026-03-25 cs.CV cs.AI cs.CL

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong

Comments CVPR 2026. Project page: https://ego2web.github.io/

详情
英文摘要

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundings and then complete a related task online. To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification and refinement to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commerce, media retrieval, knowledge lookup, etc. To facilitate accurate and scalable evaluation for our benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method, Ego2WebJudge, which achieves approximately 84% agreement with human judgment, substantially higher than existing evaluation methods. Experiments with diverse SoTA agents on our Ego2Web show that their performance is weak, with substantial headroom across all task categories. We also conduct a comprehensive ablation study on task design, highlighting the necessity of accurate video understanding in the proposed task and the limitations of current agents. We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.

2603.22527 2026-03-25 cs.RO cs.CV

Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

详情
英文摘要

Sidewalk micromobility is a promising solution for last-mile transportation, but current learning-based control methods struggle in complex urban environments. Imitation learning (IL) learns policies from human demonstrations, yet its reliance on fixed offline data often leads to compounding errors, limited robustness, and poor generalization. To address these challenges, we propose a framework that advances IL through corrective behavior expansion and multi-scale imitation learning. On the data side, we augment teleoperation datasets with diverse corrective behaviors and sensor augmentations to enable the policy to learn to recover from its own mistakes. On the model side, we introduce a multi-scale IL architecture that captures both short-horizon interactive behaviors and long-horizon goal-directed intentions via horizon-based trajectory clustering and hierarchical supervision. Real-world experiments show that our approach significantly improves robustness and generalization in diverse sidewalk scenarios.

2603.22525 2026-03-25 cs.LG cs.CR

Adversarial Vulnerabilities in Neural Operator Digital Twins: Gradient-Free Attacks on Nuclear Thermal-Hydraulic Surrogates

Samrendra Roy, Kazuma Kobayashi, Souvik Chakraborty, Rizwan-uddin, Syed Bahauddin Alam

Comments 56 pages, 14 figures, 22 tables

详情
英文摘要

Operator learning models are rapidly emerging as the predictive core of digital twins for nuclear and energy systems, promising real-time field reconstruction from sparse sensor measurements. Yet their robustness to adversarial perturbations remains uncharacterized, a critical gap for deployment in safety-critical systems. Here we show that neural operators are acutely vulnerable to extremely sparse (fewer than 1% of inputs), physically plausible perturbations that exploit their sensitivity to boundary conditions. Using gradient-free differential evolution across four operator architectures, we demonstrate that minimal modifications trigger catastrophic prediction failures, increasing relative $L_2$ error from $\sim$1.5% (validated accuracy) to 37-63% while remaining completely undetectable by standard validation metrics. Notably, 100% of successful single-point attacks pass z-score anomaly detection. We introduce the effective perturbation dimension $d_{\text{eff}}$, a Jacobian-based diagnostic that, together with sensitivity magnitude, yields a two-factor vulnerability model explaining why architectures with extreme sensitivity concentration (POD-DeepONet, $d_{\text{eff}} \approx 1$) are not necessarily the most exploitable, since low-rank output projections cap maximum error, while moderate concentration with sufficient amplification (S-DeepONet, $d_{\text{eff}} \approx 4$) produces the highest attack success. Gradient-free search outperforms gradient-based alternatives (PGD) on architectures with gradient pathologies, while random perturbations of equal magnitude achieve near-zero success rates, confirming that the discovered vulnerabilities are structural. Our findings expose a previously overlooked attack surface in operator learning models and establish that these models require robustness guarantees beyond standard validation before deployment.

2603.22518 2026-03-25 cs.CV cs.AI

High Resolution Flood Extent Detection Using Deep Learning with Random Forest Derived Training Labels

Azizbek Nuriddinov, Ebrahim Ahmadisharaf, Mohammad Reza Alizadeh

Comments Accepted to IGARSS 2026

详情
英文摘要

Validation of flood models, used to support risk mitigation strategies, remains challenging due to limited observations during extreme events. High-frequency, high-resolution optical imagery (~3 m), such as PlanetScope, offers new opportunities for flood mapping, although applications remain limited by cloud cover and the lack of labeled training data during disasters. To address this, we develop a flood mapping framework that integrates PlanetScope optical imagery with topographic features using machine learning (ML) and deep learning (DL) algorithms. A Random Forest model was applied to expert-annotated flood masks to generate training labels for DL models, U-Net. Two U-Net models with ResNet18 backbone were trained using optical imagery only (4 bands) and optical imagery combined with Height Above Nearest Drainage (HAND) and topographic slope (6 bands). Hurricane Ida (September 2021), which caused catastrophic flooding across the eastern United States, including the New York City metropolitan area, was used as an example to evaluate the framework. Results demonstrate that the U-Net model with topographic features achieved very close performance to the optical-only configuration (F1=0.92 and IoU=0.85 by both modeling scenarios), indicating that HAND and slope provide only marginal value to inundation extent detection. The proposed framework offers a scalable and label-efficient approach for mapping inundation extent that enables modeling under data-scarce flood scenarios.

2603.22509 2026-03-25 cs.CV

Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation

Delin An, Chaoli Wang

详情
英文摘要

Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at https://github.com/adlsn/Sketch2CT.

2603.22507 2026-03-25 cs.RO cs.MA

Energy-Aware Collaborative Exploration for a UAV-UGV Team

Cahit Ikbal Er, Saikiran Juttu, Yasin Yazicioglu

详情
英文摘要

We present an energy-aware collaborative exploration framework for a UAV-UGV team operating in unknown environments, where the UAV's energy constraint is modeled as a maximum flight-time limit. The UAV executes a sequence of energy-bounded exploration tours, while the UGV simultaneously explores on the ground and serves as a mobile charging station. Rendezvous is enforced under a shared time budget so that the vehicles meet at the end of each tour before the UAV reaches its flight-time limit. We construct a sparsely coupled air-ground roadmap using a density-aware layered probabilistic roadmap (PRM) and formulate tour selection over the roadmap as coupled orienteering problems (OPs) to maximize information gain subject to the rendezvous constraint. The resulting tours are constructed over collision-validated roadmap edges. We validate our method through simulation studies, benchmark comparisons, and real-world experiments.

2603.22502 2026-03-25 cs.RO

MapForest: A Modular Field Robotics System for Forest Mapping and Invasive Species Localization

Sandeep Zachariah, Francisco Yandun, Sachet Korada, Abhisesh Silwal

Comments 8 pages, 9 figures. Under review

详情
英文摘要

Monitoring and controlling invasive tree species across large forests, parks, and trail networks is challenging due to limited accessibility, reliance on manual scouting, and degraded under-canopy GNSS. We present MapForest, a modular field robotics system that transforms multi-modal sensor data into GIS-ready invasive-species maps. Our system features: (i) a compact, platform-agnostic sensing payload that can be rapidly mounted on UAV, bicycle, or backpack platforms, and (ii) a software pipeline comprising LiDAR-inertial mapping, image-based invasive-species detection, and georeferenced map generation. To ensure reliable operation in GNSS-intermittent environments, we enhance a LiDAR-inertial mapping backbone with covariance-aware GNSS factors and robust loss kernels. We train an object detector to detect the Tree-of-Heaven (Ailanthus altissima) from onboard RGB imagery and fuse detections with the reconstructed map to produce geospatial outputs suitable for downstream decision making. We collected a dataset spanning six sites across urban environments, parks, trails, and forests to evaluate individual system modules, and report end-to-end results on two sites containing Tree-of-Heaven. The enhanced mapping module achieved a trajectory deviation error of 1.95 m over a 1.2 km forest traversal, and the Tree-of-Heaven detector achieved an F1 score of 0.653. The datasets and associated tooling are released to support reproducible research in forest mapping and invasive-species monitoring.

2603.22497 2026-03-25 cs.CL

Rashid: A Cipher-Based Framework for Exploring In-Context Language Learning

Niyati Bafna, Ryan Soh-Eun Shim, Barbara Plank, David Yarowsky, Hale Sirin

详情
英文摘要

Where there is growing interest in in-context language learning (ICLL) for unseen languages with large language models, such languages usually suffer from the lack of NLP tools, data resources, and researcher expertise. This means that progress is difficult to assess, the field does not allow for cheap large-scale experimentation, and findings on ICLL are often limited to very few languages and tasks. In light of such limitations, we introduce a framework (Rashid), for studying ICLL wherein we reversibly cipher high-resource languages (HRLs) to construct truly unseen languages with access to a wide range of resources available for HRLs, unlocking previously impossible exploration of ICLL phenomena. We use our framework to assess current methods in the field with SOTA evaluation tools and manual analysis, explore the utility of potentially expensive resources in improving ICLL, and test ICLL strategies on rich downstream tasks beyond machine translation. These lines of exploration showcase the possibilities enabled by our framework, as well as providing actionable insights regarding current performance and future directions in ICLL.

2603.22472 2026-03-25 cs.RO cs.LG cs.MA

Wake Up to the Past: Using Memory to Model Fluid Wake Effects on Robots

Luca Vendruscolo, Eduardo Sebastián, Amanda Prorok, Ajay Shankar

Comments 8 pages, 7 figures. Submitted to IROS 2026. Project website: https://sites.google.com/view/wake-up-to-the-past

详情
英文摘要

Autonomous aerial and aquatic robots that attain mobility by perturbing their medium, such as multicopters and torpedoes, produce wake effects that act as disturbances for adjacent robots. Wake effects are hard to model and predict due to the chaotic spatio-temporal dynamics of the fluid, entangled with the physical geometry of the robots and their complex motion patterns. Data-driven approaches using neural networks typically learn a memory-less function that maps the current states of the two robots to a force observed by the "sufferer" robot. Such models often perform poorly in agile scenarios: since the wake effect has a finite propagation time, the disturbance observed by a sufferer robot is some function of relative states in the past. In this work, we present an empirical study of the properties a wake-effect predictor must satisfy to accurately model the interactions between two robots mediated by a fluid. We explore seven data-driven models designed to capture the spatio-temporal evolution of fluid wake effects in four different media. This allows us to introspect the models and analyze the reasons why certain features enable improved accuracy in prediction across predictors and fluids. As experimental validation, we develop a planar rectilinear gantry for two spinning monocopters to test in real-world data with feedback control. The conclusion is that support of history of previous states as input and transport delay prediction substantially helps to learn an accurate wake-effect predictor.