arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2977
2512.03619 2026-03-31 cs.CV

LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan

Comments CVPR 2026. Project Page: https://cyberiada.github.io/LAMP/

详情
英文摘要

Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives establishing the first framework for generating both object and camera motions directly from natural language specifications. Code, models and data are available on our project page.

2512.02711 2026-03-31 cs.CL cs.LG

CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Lavish Bansal, Naman Mishra

Comments 9 Pages, 5 Figures, Accepted at LREC 2026

详情
英文摘要

Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world's population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

2512.02369 2026-03-31 cs.CV

SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains

Qingmei Li, Yang Zhang, Peifeng Zhang, Haohuan Fu, Juepeng Zheng

详情
英文摘要

Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a \textbf{S}tyle-\textbf{A}daptive \textbf{GE}neralization framework (\textbf{SAGE}), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.

2512.01342 2026-03-31 cs.CV

InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Chenting Wang, Yuhan Zhu, Yicheng Xu, Jiange Yang, Lang Lin, Ziang Yan, Yali Wang, Yi Wang, Limin Wang

详情
Journal ref
CVPR2026
英文摘要

Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, conventional linear decoder in pixel MVM enforces the predictor output latent to be linearly projected to, thus separable in pixel space, causing the conflict with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.

2512.00255 2026-03-31 cs.CV

Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

Kunwar Maheep Singh, Jianchun Chen, Vladislav Golyanik, Stephan J. Garbin, Thabo Beeler, Rishabh Dabral, Marc Habermann, Christian Theobalt

详情
英文摘要

We present Relightable Holoported Characters (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method's superior visual fidelity and lighting reproduction compared to state-of-the-art approaches. Project page: https://vcai.mpi-inf.mpg.de/projects/RHC/

2511.23342 2026-03-31 cs.CV cs.AI

Overcoming the Curvature Bottleneck in MeanFlow

Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Chengzhi Mao, Dimitris Metaxas, Vladimir Pavlovic

详情
英文摘要

MeanFlow offers a promising framework for one-step generative modeling by directly learning a mean-velocity field, bypassing expensive numerical integration. However, we find that the highly curved generative trajectories of existing models induce a noisy loss landscape, severely bottlenecking convergence and model quality. We leverage a fundamental geometric principle to overcome this: mean-velocity estimation is drastically simpler along straight paths. Building on this insight, we propose Rectified MeanFlow, a self-distillation approach that learns the mean-velocity field over a straightened velocity field, induced by rectified couplings from a pretrained model. To further promote linearity, we introduce a distance-based truncation heuristic that prunes residual high-curvature pairs. By smoothing the optimization landscape, our method achieves strong one-step generation performance. We improve the FID of baseline MeanFlow models from 30.9 to 8.6 under same training budget, and outperform the recent 2-rectified flow++ by 33.4% in FID while running 26x faster. Our work suggests that the difficulty of one-step flow generation stems partially from the rugged optimization landscapes induced by curved trajectories. Code is available at https://github.com/Xinxi-Zhang/Re-MeanFlow.

2511.23227 2026-03-31 cs.CV

PointCNN++: Performant Convolution on Native Points

Lihan Li, Haofeng Zhong, Rui Bu, Mingchao Sun, Wenzheng Chen, Baoquan Chen, Yangyan Li

详情
英文摘要

Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It $\textbf{generalizes sparse convolution from voxels to points}$, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates $\textbf{natively}$ on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ $\textbf{uses an order of magnitude less memory and is several times faster}$ than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it $\textbf{significantly improves point cloud registration accuracies while proving both more memory-efficient and faster}$. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency. Our code will be open sourced.

2511.22264 2026-03-31 cs.CV

DriveVGGT: Calibration-Constrained Visual Geometry Transformers for Multi-Camera Autonomous Driving

Xiaosong Jia, Yanhao Liu, Yu Hong, Renqiu Xia, Junqi You, Bin Sun, Zhihui Hao, Junchi Yan

详情
英文摘要

Feed-forward reconstruction has been progressed rapidly, with the Visual Geometry Grounded Transformer (VGGT) being a notable baseline. However, directly applying VGGT to autonomous driving (AD) fails to capture three domain-specific priors: (i) Sparse Spatial Overlap: the overlap among mutli-view cameras is minimal due to $360^{\circ}$ coverage requirements under budget control, which renders global attention among all images inefficient; (ii) Calibrated Geometric Constraints: the absolute distance among cameras is generally accessible for AD data with calibration process before driving. Standard VGGT is unable to directly utilize such information for absolute scale scene reconstruction; (iii) Rigid Extrinsic Constancy: relative poses of multi-view cameras are approximately static, i.e., the ego-motion is the same for all cameras. To bridge these gaps, we propose DriveVGGT, a scale-aware reconstruction framework that explicitly integrates these priors through three targeted components. First, for the Sparse Spatial Overlap in (i), we introduce a Temporal Video Attention (TVA) module to process multi-camera videos independently. Second, for Calibrated Geometric Constraints in (ii), a Multi-camera Consistency Attention (MCA) module is designed to directly utilize the calibration information among cameras with a scale head for absolute scale scene reconstruction. Finally, to utilize Rigid Extrinsic Constancy in (iii), we reformulate the decoding process of VGGT into factorized sequential pose head and ego motion head. On AD datasets, experiments demonstrate that DriveVGGT reduces inference time by 49.3\% while improving depth and pose estimation compared to vanilla VGGT in long-sequence scenarios. It consistently outperforms recent SOTA variants. Meanwhile, extensive ablation studies verify the effectiveness of each devised module.

2511.21428 2026-03-31 cs.CV cs.AI cs.RO

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner

Comments 10 pages, 5 figures, Accepted to CVPR 2026

详情
英文摘要

We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel "Latent Action Energy" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

2511.19413 2026-03-31 cs.LG cs.AI cs.CV

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang

Comments Accepted to CVPR 2026

详情
英文摘要

Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02)on GenEval, out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/TorchUMM

2511.18174 2026-03-31 cs.CV

Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera

Mukai Yu, Mosam Dabhi, Liuyue Xie, Sebastian Scherer, László A. Jeni

Comments Accepted to CVPR 2026. Camera-ready version. Added computation benchmark

详情
英文摘要

Modern perception increasingly relies on fisheye, panoramic, and other wide field-of-view (FoV) cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where pixel-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Traditional spherical CNNs partially address this mismatch but require costly spherical harmonic transform that constrains resolution and efficiency. We present Unified Spherical Frontend (USF), a distortion-free lens-agnostic framework that transforms images from any calibrated camera onto the unit sphere via ray-direction correspondences, and performs spherical resampling, convolution, and pooling canonically in the spatial domain. USF is modular: projection, location sampling, value interpolation, and resolution control are fully decoupled. Its configurable distance-only convolution kernels offer rotation-equivariance, mirroring translation-equivariance in planar CNNs while avoiding harmonic transforms entirely. We compare multiple standard planar backbones with their spherical counterparts across classification, detection, and segmentation tasks on synthetic (Spherical MNIST) and real-world (PANDORA, Stanford 2D-3D-S) datasets, and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF scales efficiently to high-resolution spherical imagery and maintains less than 1% performance drop under random test-time rotations without training-time rotational augmentation, and enables zero-shot generalization to any unseen (wide-FoV) lenses with minimal performance degradation.

2511.17290 2026-03-31 cs.CL

Estonian WinoGrande Dataset: Comparative Analysis of LLM Performance on Human and Machine Translation

Marii Ojastu, Hele-Andra Kuulmets, Aleksei Dorkin, Marika Borovikova, Dage Särg, Kairit Sirts

Comments LREC 2026

详情
英文摘要

In this paper, we present a localized and culturally adapted Estonian translation of the test set from the widely used commonsense reasoning benchmark, WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open source models on the human translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt. This prompt is specifically tailored to address both the linguistic characteristics of Estonian and the unique translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Additionally, our experiments indicate that prompt engineering offers limited improvement in translation quality or model accuracy, and highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of language competency and reasoning in large language models.

2511.17209 2026-03-31 cs.CV

Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers

Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen

详情
英文摘要

We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.

2511.16216 2026-03-31 cs.AI cs.LG

FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis

Zhen Hao Wong, Jingwen Deng, Yuzhao Wang, Wenkai Yu, Jihao Huang, Runming He, Chengyu Shen, Hao Liang, Wentao Zhang

详情
英文摘要

Textbooks are among the richest repositories of human-verified reasoning knowledge, yet their complex layouts contain multi-column typesetting, cross-page question answer separation, and interleaved figures, make automated extraction of structured QA and VQA pairs extremely challenging. Existing alternatives either synthesize data from scratch, which lacks authentic problem contexts, or rely on costly expert annotation that cannot scale. We propose $\textbf{FlipVQA-Miner}$, an automated pipeline that resolves long-range logical dependencies and cross-page discontinuities in OCR-parsed documents, recovering coherent question--answer--figure associations even when answers reside in separate companion volumes. A subsequent multi-stage curation pipeline transforms these raw extractions into AI-ready supervision signals. Using FlipVQA-Miner, we construct $\textbf{FlipVQA-83K}$, comprising 83K QA and VQA pairs spanning 11 academic disciplines, at a $\textbf{50$\times$}$ cost saving compared to manual annotation while maintaining high structural fidelity ($F_1 > 0.96$). Models fine-tuned on FlipVQA-83K demonstrate significantly improved reasoning ability and cross-domain generalization, establishing a scalable paradigm for human-knowledge-grounded data curation. Our dataset and the complete data generating and curating methods can be found in https://github.com/OpenDCAI/DataFlow-VQA .

2511.15355 2026-03-31 cs.CL

HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning

Alexis Correa-Guillén, Carlos Gómez-Rodríguez, David Vilares

Comments LREC 2026 camera-ready version

详情
英文摘要

We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.

2511.15211 2026-03-31 cs.CL cs.AI

OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Xinli Tao, Xin Dong, Xuezhong Zhou

Comments 12 pages, 4 figures, 4 tables

详情
Journal ref
JAMIA Open, 2026, ooag049
英文摘要

With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.

2511.14262 2026-03-31 cs.LG cs.AI

Object-Centric World Models for Causality-Aware Reinforcement Learning

Yosuke Nishimoto, Takashi Matsubara

Comments Accepted by AAAI-26. Codes are available at https://github.com/nishimoto0430/STICA

详情
英文摘要

World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed of multiple objects with rich interactions since most world models learn holistic representations of all environmental components. By contrast, humans perceive the environment by decomposing it into discrete objects, facilitating efficient decision-making. Motivated by this insight, we propose \emph{Slot Transformer Imagination with CAusality-aware reinforcement learning} (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. STICA represents each observation as a set of object-centric tokens, together with tokens for the agent action and the resulting reward, enabling the world model to predict token-level dynamics and interactions. The policy and value networks then estimate token-level cause--effect relations and use them in the attention layers, yielding causality-guided decision-making. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.

2511.03121 2026-03-31 cs.CL cs.AI cs.SY eess.SY

Control Barrier Function for Aligning Large Language Models

Yuya Miyaoka, Masaki Inoue

Comments This work is an extenede version of arXiv:2408.15625 and has been submitted to the IEEE for possible publication

详情
Journal ref
IEEE Transactions on Control Systems Technology 2026
英文摘要

This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the predicted token generated from the baseline LLM, to intervene in the generated text. The safety filter includes two significant advantages: this safety filter is an add-on type, allowing it to be used for alignment purposes without fine-tuning the baseline LLM, and if there is an evaluation model regarding the desired alignment, it can be directly applied to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.

2511.01187 2026-03-31 cs.CL cs.CY

Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata

详情
英文摘要

Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce \corpusname, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8{,}400 structured debate prompts spanning four sensitive domains -- Women's Rights, Backwardness, Terrorism, and Religion -- across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude~3.5~Haiku, DeepSeek-Chat, and LLaMA-3-70B), we generate over 100{,}000 debate responses and automatically classify which demographic groups are assigned stereotyped versus modern roles. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to Terrorism and Religion ($\geq$89\%), Africans to socioeconomic ``backwardness'' (up to 77\%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our \corpusname benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.

2510.24434 2026-03-31 cs.CL

LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot

详情
英文摘要

The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs. To investigate the practical utility of the dataset, we fine-tune 14 smaller-scale LLMs ($\leq$15B parameters) on LuxIT and evaluate them on standardized Luxembourgish proficiency exams and five downstream NLP tasks. Training on LuxIT yields a mean accuracy change of +5.37 percentage points on language exams across all 14 models, with 12 of 14 showing improvement. On NLP downstream tasks, 9 of 14 models improve in macro-averaged F1, though gains on the two benchmarks do not systematically correlate. These results underscore the feasibility of leveraging monolingual synthetic data to improve LLM capabilities in low-resource languages, while highlighting the multi-faceted nature of language proficiency.

2510.20351 2026-03-31 cs.CL cs.AI

Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

详情
英文摘要

Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely unexplored. Existing approaches primarily rely on memorization tests, which are too coarse to detect contamination. In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation. Given a dataset, we craft multiple-choice aligned queries that preserve task structure while allowing systematic transformations of the underlying data. These transformations are designed to selectively disrupt dataset information while preserving partial knowledge, enabling us to isolate performance attributable to contamination. We complement this setup with non-neural baselines that provide reference performance, and we introduce a statistical testing procedure to formally detect significant deviations indicative of contamination. Empirical results on eight widely used tabular datasets reveal clear evidence of contamination in four cases. These findings suggest that performance on downstream tasks involving such datasets may be substantially inflated, raising concerns about the reliability of current evaluation practices.

2510.20212 2026-03-31 cs.CV

Target-aware Image Editing via Cycle-consistent Constraints

Yanghao Wang, Zhen Wang, Long Chen

详情
英文摘要

Recent pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an editable ``intermediate state'' and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they mainly focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. Thus, we propose FlowCycle, an inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. For efficiency, we further accelerate the optimization by dynamically adjusting the sampling steps. Extensive ablations demonstrated that FlowCycle achieves superior editing performance.

2510.16518 2026-03-31 cs.RO cs.AI

DIV-Nav: Open-Vocabulary Spatial Relationships for Multi-Object Navigation

Jesús Ortega-Peimbert, Finn Lukas Busch, Timon Homberger, Quantao Yang, Olov Andersson

详情
英文摘要

Advances in open-vocabulary semantic mapping and object navigation have enabled robots to perform an informed search of their environment for an arbitrary object. However, such zero-shot object navigation is typically designed for simple queries with an object name like "television" or "blue rug". Here, we consider more complex free-text queries with spatial relationships, such as "find the remote on the table" while still leveraging robustness of a semantic map. We present DIV-Nav, a real-time navigation system that efficiently addresses this problem through a series of relaxations: i) Decomposing natural language instructions with complex spatial constraints into simpler object-level queries on a semantic map, ii) computing the Intersection of individual semantic belief maps to identify regions where all objects co-exist, and iii) Validating the discovered objects against the original, complex spatial constrains via a LVLM. We further investigate how to adapt the frontier exploration objectives of online semantic mapping to such spatial search queries to more effectively guide the search process. We validate our system through extensive experiments on the MultiON benchmark and real-world deployment on a Boston Dynamics Spot robot using a Jetson Orin AGX. More details and videos are available at https://anonsub42.github.io/reponame/

2510.14255 2026-03-31 cs.CV

Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng

Comments accepted by CVPR 2026

详情
英文摘要

Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.

2510.14035 2026-03-31 cs.AI

GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations

Rajesh Mangannavar, Prasad Tadepalli

Comments 10 pages content. 2 pages references. 2 pages appendix. Updated paper with more results from multiple domains and added appendix

详情
英文摘要

We introduce an uncertainty-aware graph representation framework for learning to guide planning in Partially Observable Markov Decision Processes (POMDPs). Unlike existing approaches that require domain or problem size specific neural architectures, GammaZero leverages a unified graph-based belief representation that enables generalization across problem sizes within a domain. Our key insight is that belief states can be systematically transformed into uncertainty-aware graphs where structural patterns learned on small problems transfer to larger instances. We employ a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations on computationally tractable problems, then apply these learned heuristics to guide Monte Carlo tree search on larger problems. Experimental results on standard POMDP benchmarks demonstrate that GammaZero achieves comparable performance to BetaZero when trained and tested on the same-sized problems, while enabling zero-shot generalization to problems 2-6x larger than those seen during training.

2510.13044 2026-03-31 cs.CV cs.AI

SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

Jungbin Cho, Minsu Kim, Jisoo Kim, Ce Zheng, Laszlo A. Jeni, Ming-Hsuan Yang, Youngjae Yu, Seonjoo Kim

Comments 15 pages

详情
英文摘要

Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches fail to generate semantically diverse motion while simultaneously respecting geometric scene constraints, since constructing large-scale datasets with both rich text-motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a two-stage adaptation framework that enables semantically diverse, scene-aware human motion generation from text without large-scale paired text--scene--motion data. Our key idea is to use motion inbetweening, a learnable proxy task that requires no text, as a bridge between two disjoint resources: a text-motion dataset and a scene-motion dataset. By first adapting a text-to-motion model through inbetweening and then through scene-aware inbetweening, SceneAdapt injects geometric scene constraints into text-conditioned generation while preserving semantic diversity. To enable adaptation for inbetweening, we propose a novel Context-aware Keyframing (CaKey) layer that modulates motion latents for keyframe-conditioned synthesis while preserving the original latent manifold. To further adapt the model for scene-aware inbetweening, we introduce a Scene-conditioning (SceneCo) layer that injects geometric scene information by adaptively querying local context via cross-attention. Experimental results show that SceneAdapt effectively injects scene-awareness into text-to-motion models without sacrificing semantic diversity, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released. Project page: \href{https://sceneadapt.github.io/}{sceneadapt.github.io}

2510.10102 2026-03-31 cs.LG

PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

Guilin Li, Yun Zhang, Xiuyuan Chen, Chengqi Li, Bo Wang, Linghe Kong, Wenjia Wang, Weiran Huang, Matthias Hwai Yong Tan

详情
英文摘要

Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative-discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories; and (4) Real-time scalability enabled by offline caching of pretrained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6 percent boost in next-transaction prediction HitRate@1 and a 38.6 percent relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks show strong generalization, achieving up to 21 percent HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial sequential user behavior modeling.

2510.08673 2026-03-31 cs.CV

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy

Comments Accepted by ICLR2026. Project Page: https://kangliao929.github.io/projects/puffin/

详情
英文摘要

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.

2510.08553 2026-03-31 cs.CV cs.AI cs.RO

Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

Yunzhe Xu, Yiyuan Pan, Zhe Liu

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

详情
英文摘要

Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinct testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm.

2510.06961 2026-03-31 cs.CL cs.AI cs.SD eess.AS

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, Sanchit Gandhi

Comments Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard ; Code: https://github.com/huggingface/open_asr_leaderboard

详情
英文摘要

We present the Open ASR Leaderboard, a reproducible benchmarking platform with community contributions from academia and industry. It compares 86 open-source and proprietary systems across 12 datasets, with English short- and long-form and multilingual short-form tracks. We standardize word error rate (WER) and inverse real-time factor (RTFx) evaluation for consistent accuracy-efficiency comparisons across model architectures and toolkits (e.g., ESPNet, NeMo, SpeechBrain, Transformers). We observe that Conformer-based encoders paired with transformer-based decoders achieve the best average WER, while connectionist temporal classification (CTC) and token-and-duration transducer (TDT) decoders offer superior RTFx, making them better suited for long-form and batched processing. All code and dataset loaders are open-sourced to support transparent, extensible evaluation. We present our evaluation methodology to facilitate community-driven benchmarking in ASR and other tasks.