arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2365
专题追踪
2602.18947 2026-02-24 cs.AI

(Perlin) Noise as AI coordinator

Kaijie Xu, Clark Verbrugge

详情
英文摘要

Large scale control of nonplayer agents is central to modern games, while production systems still struggle to balance several competing goals: locally smooth, natural behavior, and globally coordinated variety across space and time. Prior approaches rely on handcrafted rules or purely stochastic triggers, which either converge to mechanical synchrony or devolve into uncorrelated noise that is hard to tune. Continuous noise signals such as Perlin noise are well suited to this gap because they provide spatially and temporally coherent randomness, and they are already widely used for terrain, biomes, and other procedural assets. We adapt these signals for the first time to large scale AI control and present a general framework that treats continuous noise fields as an AI coordinator. The framework combines three layers of control: behavior parameterization for movement at the agent level, action time scheduling for when behaviors start and stop, and spawn or event type and feature generation for what appears and where. We instantiate the framework reproducibly and evaluate Perlin noise as a representative coordinator across multiple maps, scales, and seeds against random, filtered, deterministic, neighborhood constrained, and physics inspired baselines. Experiments show that coordinated noise fields provide stable activation statistics without lockstep, strong spatial coverage and regional balance, better diversity with controllable polarization, and competitive runtime. We hope this work motivates a broader exploration of coordinated noise in game AI as a practical path to combine efficiency, controllability, and quality.

2602.18943 2026-02-24 cs.AI

High Dimensional Procedural Content Generation

Kaijie Xu, Clark Verbrugge

详情
英文摘要

Procedural content generation (PCG) has made substantial progress in shaping static 2D/3D geometry, while most methods treat gameplay mechanics as auxiliary and optimize only over space. We argue that this limits controllability and expressivity, and formally introduce High-Dimensional PCG (HDPCG): a framework that elevates non-geometric gameplay dimensions to first-class coordinates of a joint state space. We instantiate HDPCG along two concrete directions. Direction-Space augments geometry with a discrete layer dimension and validates reachability in 4D (x,y,z,l), enabling unified treatment of 2.5D/3.5D mechanics such as gravity inversion and parallel-world switching. Direction-Time augments geometry with temporal dynamics via time-expanded graphs, capturing action semantics and conflict rules. For each direction, we present three general, practicable algorithms with a shared pipeline of abstract skeleton generation, controlled grounding, high-dimensional validation, and multi-metric evaluation. Large-scale experiments across diverse settings validate the integrity of our problem formulation and the effectiveness of our methods on playability, structure, style, robustness, and efficiency. Beyond quantitative results, Unity-based case studies recreate playable scenarios that accord with our metrics. We hope HDPCG encourages a shift in PCG toward general representations and the generation of gameplay-relevant dimensions beyond geometry, paving the way for controllable, verifiable, and extensible level generation.

2602.18941 2026-02-24 cs.CV

Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation

Kaiming Jin, Yuefan Wu, Shengqiong Wu, Bobo Li, Shuicheng Yan, Tat-Seng Chua

Comments 18 pages, 9 figures

详情
英文摘要

Vision-and-Language Scene navigation is a fundamental capability for embodied human-AI collaboration, requiring agents to follow natural language instructions to execute coherent action sequences in complex environments. Existing approaches either rely on multiple agents, incurring high coordination and resource costs, or adopt a single-agent paradigm, which overloads the agent with both global planning and local perception, often leading to degraded reasoning and instruction drift in long-horizon settings. To address these issues, we introduce DACo, a planning-grounding decoupled architecture that disentangles global deliberation from local grounding. Concretely, it employs a Global Commander for high-level strategic planning and a Local Operative for egocentric observing and fine-grained execution. By disentangling global reasoning from local action, DACo alleviates cognitive overload and improves long-horizon stability. The framework further integrates dynamic subgoal planning and adaptive replanning to enable structured and resilient navigation. Extensive evaluations on R2R, REVERIE, and R4R demonstrate that DACo achieves 4.9%, 6.5%, 5.4% absolute improvements over the best-performing baselines in zero-shot settings, and generalizes effectively across both closed-source (e.g., GPT-4o) and open-source (e.g., Qwen-VL Series) backbones. DACo provides a principled and extensible paradigm for robust long-horizon navigation. Project page: https://github.com/ChocoWu/DACo

2602.18940 2026-02-24 cs.AI

DREAM: Deep Research Evaluation with Agentic Metrics

Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, Ron Litman

详情
英文摘要

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.

2602.18934 2026-02-24 cs.LG cs.CR

LoMime: Query-Efficient Membership Inference using Model Extraction in Label-Only Settings

Abdullah Caglar Oksuz, Anisa Halimi, Erman Ayday

详情
英文摘要

Membership inference attacks (MIAs) threaten the privacy of machine learning models by revealing whether a specific data point was used during training. Existing MIAs often rely on impractical assumptions such as access to public datasets, shadow models, confidence scores, or training data distribution knowledge and making them vulnerable to defenses like confidence masking and adversarial regularization. Label-only MIAs, even under strict constraints suffer from high query requirements per sample. We propose a cost-effective label-only MIA framework based on transferability and model extraction. By querying the target model M using active sampling, perturbation-based selection, and synthetic data, we extract a functionally similar surrogate S on which membership inference is performed. This shifts query overhead to a one-time extraction phase, eliminating repeated queries to M . Operating under strict black-box constraints, our method matches the performance of state-of-the-art label-only MIAs while significantly reducing query costs. On benchmarks including Purchase, Location, and Texas Hospital, we show that a query budget equivalent to testing $\approx1\%$ of training samples suffices to extract S and achieve membership inference accuracy within $\pm1\%$ of M . We also evaluate the effectiveness of standard defenses proposed for label-only MIAs against our attack.

2602.18920 2026-02-24 cs.CL cs.AI

DeepInnovator: Triggering the Innovative Capabilities of LLMs

Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu, Chengen Huang, Junyang Lin, Chao Huang

详情
英文摘要

The application of Large Language Models (LLMs) in accelerating scientific discovery has garnered increasing attention, with a key focus on constructing research agents endowed with innovative capability, i.e., the ability to autonomously generate novel and significant research ideas. Existing approaches predominantly rely on sophisticated prompt engineering and lack a systematic training paradigm. To address this, we propose DeepInnovator, a training framework designed to trigger the innovative capability of LLMs. Our approach comprises two core components. (1) ``Standing on the shoulders of giants''. We construct an automated data extraction pipeline to extract and organize structured research knowledge from a vast corpus of unlabeled scientific literature. (2) ``Conjectures and refutations''. We introduce a ``Next Idea Prediction'' training paradigm, which models the generation of research ideas as an iterative process of continuously predicting, evaluating, and refining plausible and novel next idea. Both automatic and expert evaluations demonstrate that our DeepInnovator-14B significantly outperforms untrained baselines, achieving win rates of 80.53\%-93.81\%, and attains performance comparable to that of current leading LLMs. This work provides a scalable training pathway toward building research agents with genuine, originative innovative capability, and will open-source the dataset to foster community advancement. Source code and data are available at: https://github.com/HKUDS/DeepInnovator.

2602.18918 2026-02-24 cs.AI cs.LG

Early Evidence of Vibe-Proving with Consumer LLMs: A Case Study on Spectral Region Characterization with ChatGPT-5.2 (Thinking)

Brecht Verbeken, Brando Vagenende, Marie-Anne Guerry, Andres Algaba, Vincent Ginis

Comments 41 pages

详情
英文摘要

Large Language Models (LLMs) are increasingly used as scientific copilots, but evidence on their role in research-level mathematics remains limited, especially for workflows accessible to individual researchers. We present early evidence for vibe-proving with a consumer subscription LLM through an auditable case study that resolves Conjecture 20 of Ran and Teng (2024) on the exact nonreal spectral region of a 4-cycle row-stochastic nonnegative matrix family. We analyze seven shareable ChatGPT-5.2 (Thinking) threads and four versioned proof drafts, documenting an iterative pipeline of generate, referee, and repair. The model is most useful for high-level proof search, while human experts remain essential for correctness-critical closure. The final theorem provides necessary and sufficient region conditions and explicit boundary attainment constructions. Beyond the mathematical result, we contribute a process-level characterization of where LLM assistance materially helps and where verification bottlenecks persist, with implications for evaluation of AI-assisted research workflows and for designing human-in-the-loop theorem proving systems.

2602.18910 2026-02-24 cs.LG

SLDP: Semi-Local Differential Privacy for Density-Adaptive Analytics

Alexey Kroshnin, Alexandra Suvorikova

详情
英文摘要

Density-adaptive domain discretization is essential for high-utility privacy-preserving analytics but remains challenging under Local Differential Privacy (LDP) due to the privacy-budget costs associated with iterative refinement. We propose a novel framework, Semi-Local Differential Privacy (SLDP), that assigns a privacy region to each user based on local density and defines adjacency by the potential movement of a point within its privacy region. We present an interactive $(\varepsilon, δ)$-SLDP protocol, orchestrated by an honest-but-curious server over a public channel, to estimate these regions privately. Crucially, our framework decouples the privacy cost from the number of refinement iterations, allowing for high-resolution grids without additional privacy budget cost. We experimentally demonstrate the framework's effectiveness on estimation tasks across synthetic and real-world datasets.

2602.18906 2026-02-24 cs.CV

Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates

Shengjie Zhu, Ahmed Abdelkader, Mark J. Matthews, Xiaoming Liu, Wen-Sheng Chu

详情
英文摘要

Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density. With MBA, we show that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.

2602.18904 2026-02-24 cs.LG cs.CV

PCA-VAE: Differentiable Subspace Quantization without Codebook Collapse

Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan

详情
英文摘要

Vector-quantized autoencoders deliver high-fidelity latents but suffer inherent flaws: the quantizer is non-differentiable, requires straight-through hacks, and is prone to collapse. We address these issues at the root by replacing VQ with a simple, principled, and fully differentiable alternative: an online PCA bottleneck trained via Oja's rule. The resulting model, PCA-VAE, learns an orthogonal, variance-ordered latent basis without codebooks, commitment losses, or lookup noise. Despite its simplicity, PCA-VAE exceeds VQ-GAN and SimVQ in reconstruction quality on CelebAHQ while using 10-100x fewer latent bits. It also produces naturally interpretable dimensions (e.g., pose, lighting, gender cues) without adversarial regularization or disentanglement objectives. These results suggest that PCA is a viable replacement for VQ: mathematically grounded, stable, bit-efficient, and semantically structured, offering a new direction for generative models beyond vector quantization.

2602.18903 2026-02-24 cs.CV cs.HC

SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model

Luca Cazzaniga

Comments 24 pages, 8 tables. Based on SCHEMA Method v1.0 (deposited December 11, 2025). Previously published on Zenodo: doi:10.5281/zenodo.18721380

详情
英文摘要

This paper presents SCHEMA (Structured Components for Harmonized Engineered Modular Architecture), a structured prompt engineering methodology specifically developed for Google Gemini 3 Pro Image. Unlike generic prompt guidelines or model-agnostic tips, SCHEMA is an engineered framework built on systematic professional practice encompassing 850 verified API predictions within an estimated corpus of approximately 4,800 generated images, spanning six professional domains: real estate photography, commercial product photography, editorial content, storyboards, commercial campaigns, and information design. The methodology introduces a three-tier progressive system (BASE, MEDIO, AVANZATO) that scales practitioner control from exploratory (approximately 5%) to directive (approximately 95%), a modular label architecture with 7 core and 5 optional structured components, a decision tree with explicit routing rules to alternative tools, and systematically documented model limitations with corresponding workarounds. Key findings include an observed 91% Mandatory compliance rate and 94% Prohibitions compliance rate across 621 structured prompts, a comparative batch consistency test demonstrating substantially higher inter-generation coherence for structured prompts, independent practitioner validation (n=40), and a dedicated Information Design validation demonstrating >95% first-generation compliance for spatial and typographical control across approximately 300 publicly verifiable infographics. Previously published on Zenodo (doi:10.5281/zenodo.18721380).

2602.18897 2026-02-24 cs.LG cs.AI

HEHRGNN: A Unified Embedding Model for Knowledge Graphs with Hyperedges and Hyper-Relational Edges

Rajesh Rajagopalamenon, Unnikrishnan Cheramangalath

Comments 38 Pages paper and 1 highlights page, 9 figures, 10 tables, Submitted to Elsevier Knowledge-Based Systems journal

详情
英文摘要

Knowledge Graph(KG) has gained traction as a machine-readable organization of real-world knowledge for analytics using artificial intelligence systems. Graph Neural Network(GNN), is proven to be an effective KG embedding technique that enables various downstream tasks like link prediction, node classification, and graph classification. The focus of research in both KG embedding and GNNs has been mostly oriented towards simple graphs with binary relations. However, real-world knowledge bases have a significant share of complex and n-ary facts that cannot be represented by binary edges. More specifically, real-world knowledge bases are often a mix of two types of n-ary facts - (i) that require hyperedges and (ii) that require hyper-relational edges. Though there are research efforts catering to these n-ary fact types, they are pursued independently for each type. We propose $H$yper$E$dge $H$yper-$R$elational edge $GNN$(HEHRGNN), a unified embedding model for n-ary relational KGs with both hyperedges and hyper-relational edges. The two main components of the model are i)HEHR unified fact representation format, and ii)HEHRGNN encoder, a GNN-based encoder with a novel message propagation model capable of capturing complex graph structures comprising both hyperedges and hyper-relational edges. The experimental results of HEHRGNN on link prediction tasks show its effectiveness as a unified embedding model, with inductive prediction capability, for link prediction across real-world datasets having different types of n-ary facts. The model also shows improved link prediction performance over baseline models for hyperedge and hyper-relational datasets.

2602.18896 2026-02-24 cs.CV

Beyond Stationarity: Rethinking Codebook Collapse in Vector Quantization

Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan

详情
英文摘要

Vector Quantization (VQ) underpins many modern generative frameworks such as VQ-VAE, VQ-GAN, and latent diffusion models. Yet, it suffers from the persistent problem of codebook collapse, where a large fraction of code vectors remains unused during training. This work provides a new theoretical explanation by identifying the nonstationary nature of encoder updates as the fundamental cause of this phenomenon. We show that as the encoder drifts, unselected code vectors fail to receive updates and gradually become inactive. To address this, we propose two new methods: Non-Stationary Vector Quantization (NSVQ), which propagates encoder drift to non-selected codes through a kernel-based rule, and Transformer-based Vector Quantization (TransVQ), which employs a lightweight mapping to adaptively transform the entire codebook while preserving convergence to the k-means solution. Experiments on the CelebA-HQ dataset demonstrate that both methods achieve near-complete codebook utilization and superior reconstruction quality compared to baseline VQ variants, providing a principled and scalable foundation for future VQ-based generative models. The code is available at: https://github.com/CAIR- LAB- WFUSM/NSVQ-TransVQ.git

2602.18886 2026-02-24 cs.CV cs.GR

PhysConvex: Physics-Informed 3D Dynamic Convex Radiance Fields for Reconstruction and Simulation

Dan Wang, Xinrui Cui, Serge Belongie, Ravi Ramamoorthi

详情
英文摘要

Reconstructing and simulating dynamic 3D scenes with both visual realism and physical consistency remains a fundamental challenge. Existing neural representations, such as NeRFs and 3DGS, excel in appearance reconstruction but struggle to capture complex material deformation and dynamics. We propose PhysConvex, a Physics-informed 3D Dynamic Convex Radiance Field that unifies visual rendering and physical simulation. PhysConvex represents deformable radiance fields using physically grounded convex primitives governed by continuum mechanics. We introduce a boundary-driven dynamic convex representation that models deformation through vertex and surface dynamics, capturing spatially adaptive, non-uniform deformation, and evolving boundaries. To efficiently simulate complex geometries and heterogeneous materials, we further develop a reduced-order convex simulation that advects dynamic convex fields using neural skinning eigenmodes as shape- and material-aware deformation bases with time-varying reduced DOFs under Newtonian dynamics. Convex dynamics also offers compact, gap-free volumetric coverage, enhancing both geometric efficiency and simulation fidelity. Experiments demonstrate that PhysConvex achieves high-fidelity reconstruction of geometry, appearance, and physical properties from videos, outperforming existing methods.

2602.18884 2026-02-24 cs.AI

TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

Zhenkun Gao, Xuhong Wang, Xin Tan, Yuan Xie

Comments Accepted to ICLR 2026. 17 pages. Code, data, and models are available at: https://github.com/Stephen-gzk/TPRU

详情
英文摘要

Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real-world embodied AI. This gap is largely caused by a systemic failure in training paradigms, which lack large-scale, procedurally coherent data. To address this problem, we introduce TPRU, a large-scale dataset sourced from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. A key feature is the inclusion of challenging negative samples, compelling models to transition from passive observation to active, cross-modal validation. We leverage TPRU with a reinforcement learning (RL) fine-tuning methodology, specifically targeting the enhancement of resource-efficient models. Experiments show our approach yields dramatic gains: on our manually curated TPRU-Test, the accuracy of TPRU-7B soars from 50.33\% to 75.70\%, a state-of-the-art result that significantly outperforms vastly larger baselines, including GPT-4o. Crucially, these capabilities generalize effectively, demonstrating substantial improvements on established benchmarks. The codebase is available at https://github.com/Stephen-gzk/TPRU/ .

2602.18880 2026-02-24 cs.CV cs.AI

FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

Zhou Liu, Tonghua Su, Hongshi Zhang, Fuxiang Yang, Donglin Di, Yang Song, Lei Fan

详情
英文摘要

Advances in image tampering techniques, particularly generative models, pose significant challenges to media verification, digital forensics, and public trust. Existing image forgery detection and localization (IFDL) methods suffer from two key limitations: over-reliance on semantic content while neglecting textural cues, and limited interpretability of subtle low-level tampering traces. To address these issues, we propose FOCA, a multimodal large language model-based framework that integrates discriminative features from both the RGB spatial and frequency domains via a cross-attention fusion module. This design enables accurate forgery detection and localization while providing explicit, human-interpretable cross-domain explanations. We further introduce FSE-Set, a large-scale dataset with diverse authentic and tampered images, pixel-level masks, and dual-domain annotations. Extensive experiments show that FOCA outperforms state-of-the-art methods in detection performance and interpretability across both spatial and frequency domains.

2602.18874 2026-02-24 cs.CV cs.AI

Structure-Level Disentangled Diffusion for Few-Shot Chinese Font Generation

Jie Li, Suorong Yang, Jian Zhao, Furao Shen

详情
英文摘要

Few-shot Chinese font generation aims to synthesize new characters in a target style using only a handful of reference images. Achieving accurate content rendering and faithful style transfer requires effective disentanglement between content and style. However, existing approaches achieve only feature-level disentanglement, allowing the generator to re-entangle these features, leading to content distortion and degraded style fidelity. We propose the Structure-Level Disentangled Diffusion Model (SLD-Font), which receives content and style information from two separate channels. SimSun-style images are used as content templates and concatenated with noisy latent features as the input. Style features extracted by a CLIP model from target-style images are integrated via cross-attention. Additionally, we train a Background Noise Removal module in the pixel space to remove background noise in complex stroke regions. Based on theoretical validation of disentanglement effectiveness, we introduce a parameter-efficient fine-tuning strategy that updates only the style-related modules. This allows the model to better adapt to new styles while avoiding overfitting to the reference images' content. We further introduce the Grey and OCR metrics to evaluate the content quality of generated characters. Experimental results show that SLD-Font achieves significantly higher style fidelity while maintaining comparable content accuracy to existing state-of-the-art methods.

2602.18872 2026-02-24 cs.RO

Equivalence and Divergence of Bayesian Log-Odds and Dempster's Combination Rule for 2D Occupancy Grids

Tatiana Berlenko, Kirill Krinkin

Comments 29 pages, 6 figures, 6 tables. Includes complete proofs, ablation studies, and supplementary statistical analysis

详情
英文摘要

We introduce a pignistic-transform-based methodology for fair comparison of Bayesian log-odds and Dempster's combination rule in occupancy grid mapping, matching per-observation decision probabilities to isolate the fusion rule from sensor parameterization. Under BetP matching across simulation, two real lidar datasets, and downstream path planning, Bayesian fusion is consistently favored (15/15 directional consistency, p = 3.1e-5) with small absolute differences (0.001-0.022). Under normalized plausibility matching, the direction reverses, confirming the result is matching-criterion-specific. The methodology is reusable for any future Bayesian/belief function comparison.

2602.18869 2026-02-24 cs.CV

Enhancing 3D LiDAR Segmentation by Shaping Dense and Accurate 2D Semantic Predictions

Xiaoyu Dong, Tiankui Xian, Wanshui Gan, Naoto Yokoya

详情
英文摘要

Semantic segmentation of 3D LiDAR point clouds is important in urban remote sensing for understanding real-world street environments. This task, by projecting LiDAR point clouds and 3D semantic labels as sparse maps, can be reformulated as a 2D problem. However, the intrinsic sparsity of the projected LiDAR and label maps can result in sparse and inaccurate intermediate 2D semantic predictions, which in return limits the final 3D accuracy. To address this issue, we enhance this task by shaping dense and accurate 2D predictions. Specifically, we develop a multi-modal segmentation model, MM2D3D. By leveraging camera images as auxiliary data, we introduce cross-modal guided filtering to overcome label map sparsity by constraining intermediate 2D semantic predictions with dense semantic relations derived from the camera images; and we introduce dynamic cross pseudo supervision to overcome LiDAR map sparsity by encouraging the 2D predictions to emulate the dense distribution of the semantic predictions from the camera images. Experiments show that our techniques enable our model to achieve intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy. Comparisons with previous methods demonstrate our superior performance in both 2D and 3D spaces.

2602.18862 2026-02-24 cs.RO

Gait Asymmetry from Unilateral Weakness and Improvement With Ankle Assistance: a Reinforcement Learning based Simulation Study

Yifei Yuan, Ghaith Androwis, Xianlian Zhou

详情
英文摘要

Unilateral muscle weakness often leads to asymmetric gait, disrupting interlimb coordination and stance timing. This study presents a reinforcement learning (RL) based musculoskeletal simulation framework to (1) quantify how progressive unilateral muscle weakness affects gait symmetry and (2) evaluate whether ankle exoskeleton assistance can improve gait symmetry under impaired conditions. The overarching goal is to establish a simulation- and learning-based workflow that supports early controller development prior to patient experiments. Asymmetric gait was induced by reducing right-leg muscle strength to 75%, 50%, and 25% of baseline. Gait asymmetry was quantified using toe-off timing, peak contact forces, and joint-level symmetry metrics. Increasing weakness produced progressively larger temporal and kinematic asymmetry, most pronounced at the ankle. Ankle range of motion symmetry degraded from near-symmetric behavior at 100% strength (symmetry index, SI = +6.4%; correlation r=0.974) to severe asymmetry at 25% strength (SI = -47.1%, r=0.889), accompanied by a load shift toward the unimpaired limb. At 50% strength, ankle exoskeleton assistance improved kinematic symmetry relative to the unassisted impaired condition, reducing the magnitude of ankle SI from 25.8% to 18.5% and increasing ankle correlation from r=0.948 to 0.966, although peak loading remained biased toward the unimpaired side. Overall, this framework supports controlled evaluation of impairment severity and assistive strategies, and provides a basis for future validation in human experiments.

2602.18861 2026-02-24 cs.CV

Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation

Shile Li, Markus Karmann, Onay Urfalioglu

详情
英文摘要

We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a <adjective> photo of <adjective> <cls>".

2602.18857 2026-02-24 cs.LG

VariBASed: Variational Bayes-Adaptive Sequential Monte-Carlo Planning for Deep Reinforcement Learning

Joery A. de Vries, Jinke He, Yaniv Oren, Pascal R. van der Vaart, Mathijs M. de Weerdt, Matthijs T. J. Spaan

详情
英文摘要

Optimally trading-off exploration and exploitation is the holy grail of reinforcement learning as it promises maximal data-efficiency for solving any task. Bayes-optimal agents achieve this, but obtaining the belief-state and performing planning are both typically intractable. Although deep learning methods can greatly help in scaling this computation, existing methods are still costly to train. To accelerate this, this paper proposes a variational framework for learning and planning in Bayes-adaptive Markov decision processes that coalesces variational belief learning, sequential Monte-Carlo planning, and meta-reinforcement learning. In a single-GPU setup, our new method VariBASeD exhibits favorable scaling to larger planning budgets, improving sample- and runtime-efficiency over prior methods.

2602.18856 2026-02-24 cs.LG

Issues with Measuring Task Complexity via Random Policies in Robotic Tasks

Reabetswe M. Nkhumise, Mohamed S. Talamali, Aditya Gilra

Comments 16 pages, 9 figures, The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
英文摘要

Reinforcement learning (RL) has enabled major advances in fields such as robotics and natural language processing. A key challenge in RL is measuring task complexity, which is essential for creating meaningful benchmarks and designing effective curricula. While there are numerous well-established metrics for assessing task complexity in tabular settings, relatively few exist in non-tabular domains. These include (i) Statistical analysis of the performance of random policies via Random Weight Guessing (RWG), and (ii) information-theoretic metrics Policy Information Capacity (PIC) and Policy-Optimal Information Capacity (POIC), which are reliant on RWG. In this paper, we evaluate these methods using progressively difficult robotic manipulation setups, with known relative complexity, with both dense and sparse reward formulations. Our empirical results reveal that measuring complexity is still nuanced. Specifically, under the same reward formulation, PIC suggests that a two-link robotic arm setup is easier than a single-link setup - which contradicts the robotic control and empirical RL perspective whereby the two-link setup is inherently more complex. Likewise, for the same setup, POIC estimates that tasks with sparse rewards are easier than those with dense rewards. Thus, we show that both PIC and POIC contradict typical understanding and empirical results from RL. These findings highlight the need to move beyond RWG-based metrics towards better metrics that can more reliably capture task complexity in non-tabular RL with our task framework as a starting point.

2602.18851 2026-02-24 cs.LG cs.AI

Rank-Aware Spectral Bounds on Attention Logits for Stable Low-Precision Training

Seyed Morteza Emadi

Comments 17 pages, 3 figures

详情
英文摘要

Attention scores in transformers are bilinear forms $S_{ij} = x_i^\top M x_j / \sqrt{d_h}$ whose maximum magnitude governs overflow risk in low-precision training. We derive a \emph{rank-aware concentration inequality}: when the interaction matrix $M = W^Q W^{K\top}$ has rank $r \ll d$, tail probabilities for $\max_{i,j}|S_{ij}|$ decay as $\exp(-d^{2}α^{2}/(γr))$ rather than $\exp(-dα^{2})$, where $γ> 1$ is a typicality parameter. For transformer attention where $r = d_h$, this yields $8$--$28\times$ tighter concentration than rank-agnostic bounds in modern architectures. We apply this result to FP8 training, deriving \emph{geometry-aware scale factors} that provide principled overflow guarantees without observing activations. The method computes per-layer scales from the spectral norm $\|W^Q W^{K\top}\|_2$ via implicit power iteration, includes a grouped query attention formulation that avoids key expansion, and remains compatible with fused attention kernels. Across GPT-2 XL to Llama-2-70B, geometry-aware scaling eliminates overflows in transient scenarios where delayed scaling fails, while achieving comparable downstream MMLU accuracy.

2602.18850 2026-02-24 cs.RO cs.AI

When the Inference Meets the Explicitness or Why Multimodality Can Make Us Forget About the Perfect Predictor

J. E. Domínguez-Vidal, Alberto Sanfeliu

Comments Original version submitted to the International Journal of Social Robotics. Final version available on the SORO website

详情
英文摘要

Although in the literature it is common to find predictors and inference systems that try to predict human intentions, the uncertainty of these models due to the randomness of human behavior has led some authors to start advocating the use of communication systems that explicitly elicit human intention. In this work, it is analyzed the use of four different communication systems with a human-robot collaborative object transportation task as experimental testbed: two intention predictors (one based on force prediction and another with an enhanced velocity prediction algorithm) and two explicit communication methods (a button interface and a voice-command recognition system). These systems were integrated into IVO, a custom mobile social robot equipped with force sensor to detect the force exchange between both agents and LiDAR to detect the environment. The collaborative task required transporting an object over a 5-7 meter distance with obstacles in the middle, demanding rapid decisions and precise physical coordination. 75 volunteers perform a total of 255 executions divided into three groups, testing inference systems in the first round, communication systems in the second, and the combined strategies in the third. The results show that, 1) once sufficient performance is achieved, the human no longer notices and positively assesses technical improvements; 2) the human prefers systems that are more natural to them even though they have higher failure rates; and 3) the preferred option is the right combination of both systems.

2602.18849 2026-02-24 cs.LG cs.AI

Exact Attention Sensitivity and the Geometry of Transformer Stability

Seyed Morteza Emadi

Comments 18 pages, 6 figures

详情
英文摘要

Despite powering modern AI, transformers remain mysteriously brittle to train. We develop a stability theory that explains why pre-LayerNorm works, why DeepNorm uses $N^{-1/4}$ scaling, and why warmup is necessary, all from first principles. Our framework has two pillars: (1) We derive the \emph{exact} operator norm of the softmax Jacobian, $\|J_{softmax}(u/τ)\|_{\infty\to 1} = θ(p)/τ$, where the balanced-mass factor $θ(p)\in[0,1]$ quantifies attention sensitivity. (2) We introduce a block-$\infty$/RMS geometry aligned with tokenwise computation, yielding Lipschitz bounds independent of sequence length. Using this framework, we prove that pre-LN preserves identity gradient paths while post-LN compounds LayerNorm Jacobians exponentially with depth, and we show that DeepNorm's $N^{-1/4}$ emerges from the quartic structure of attention's four projection matrices. We validate our theory on 774M-parameter models and find that, contrary to the intuition that attention sharpens during training to reduce sensitivity, $θ(p) \approx 1$ persists throughout. Transformer stability arises entirely from architectural gradient flow, not from attention dynamics. This finding changes how we reason about training: the architecture itself must handle sensitivity, not learned attention patterns.

2602.18842 2026-02-24 cs.CV

Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification

Jiangling Zhang, Shuxuan Gao, Bofan Liu, Siqiang Feng, Jirui Huang, Yaxiong Chen, Ziyu Chen

Comments Accepted to CVPR 2026

详情
英文摘要

The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries and often struggle with novel manipulations as editing techniques continue to evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning "what is fake" to modeling "what is real". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, followed by a Task-Adaptive Prior Injection (TAPI) module that converts this coarse prediction into guiding prompts to steer the MAE decoder and amplify reconstruction failures in suspicious regions for precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.

2602.18833 2026-02-24 cs.CV

CLAP Convolutional Lightweight Autoencoder for Plant Disease Classification

Asish Bera, Subhajit Roy, Sudiptendu Banerjee

详情
英文摘要

Convolutional neural networks have remarkably progressed the performance of distinguishing plant diseases, severity grading, and nutrition deficiency prediction using leaf images. However, these tasks become more challenging in a realistic in-situ field condition. Often, a traditional machine learning model may fail to capture and interpret discriminative characteristics of plant health, growth and diseases due to subtle variations within leaf subcategories. A few deep learning methods have used additional preprocessing stages or network modules to address the problem, whereas several other methods have utilized pre-trained backbone CNNs, most of which are computationally intensive. Therefore, to address the challenge, we propose a lightweight autoencoder using separable convolutional layers in its encoder decoder blocks. A sigmoid gating is applied for refining the prowess of the encoders feature discriminability, which is improved further by the decoder. Finally, the feature maps of the encoder decoder are combined for rich feature representation before classification. The proposed Convolutional Lightweight Autoencoder for Plant disease classification, called CLAP, has been experimented on three public plant datasets consisting of cassava, tomato, maize, groundnut, grapes, etc. for determining plant health conditions. The CLAP has attained improved or competitive accuracies on the Integrated Plant Disease, Groundnut, and CCMT datasets balancing a tradeoff between the performance, and little computational cost requiring 5 million parameters. The training time is 20 milliseconds and inference time is 1 ms per image.

2602.18831 2026-02-24 cs.CV

IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbation

Fadi Boutros, Eduarda Caldeira, Tahar Chettaoui, Naser Damer

Comments Accepted at CVPR 2026

详情
英文摘要

Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDPERTURB, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDPERTURB perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results demonstrate that training FR on datasets generated using IDPERTURB yields improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches.

2602.18830 2026-02-24 cs.CV

Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation

Liying Yang, Jialun Liu, Jiakui Hu, Chenhao Guan, Haibin Huang, Fangqiu Yi, Chi Zhang, Yanyan Liang

详情
英文摘要

Generating high-quality 4D objects with spatial-temporal consistency is still formidable. Existing diffusion-based methods often struggle with spatial-temporal inconsistency, as they fail to leverage outputs from all previous timesteps to guide the generation at the current timestep. Therefore, we propose a Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects maintaining temporal-spatial consistency. 4DSTAR formulates the generation problem as the prediction of tokens that represent the 4D object. It consists of two key components: (1) The dynamic spatial-temporal state propagation autoregressive model (STAR) is proposed, which achieves spatial-temporal consistent generation. Unlike standard autoregressive models, STAR divides prediction tokens into groups based on timesteps. It models long-term dependencies by propagating spatial-temporal states from previous groups and utilizes these dependencies to guide generation at the next timestep. To this end, a spatial-temporal container is proposed, which dynamically updating the effective spatial-temporal state features from all historical groups, then updated features serve as conditional features to guide the prediction of the next token group. (2) The 4D VQ-VAE is proposed, which implicitly encodes the 4D structure into discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatial-temporal consistent 4D objects, and achieves performance competitive with diffusion models.