arXivDaily arXiv每日学术速递 周一至周五更新
重置
2603.22286 2026-03-24 cs.CV cs.AI cs.CL cs.LG

WorldCache: Content-Aware Caching for Accelerated Video World Models

Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan

Comments 33 Pages

详情
英文摘要

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose \textbf{WorldCache}, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves \textbf{2.3$\times$} inference speedup while preserving \textbf{99.4\%} of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed on \href{https://umair1221.github.io/World-Cache/}{World-Cache}.

2603.22283 2026-03-24 cs.CV cs.AI cs.GR cs.LG

End-to-End Training for Unified Tokenization and Latent Denoising

Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman

Comments First two authors contributed equally. Project: https://xingjianbai.com/unite-tokenization-generation/ Code: https://github.com/ShivamDuggal4/UNITE-tokenization-generation

详情
英文摘要

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single stage joint training of tokenization & generation from scratch is feasible.

2603.22282 2026-03-24 cs.CV cs.AI

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu

Comments 42 pages, 16 figures

详情
英文摘要

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.

2603.22280 2026-03-24 cs.CV cs.RO

DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li

详情
英文摘要

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a ``thinking before acting'' capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.

2603.22279 2026-03-24 cs.CV cs.AI

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing

Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan

详情
英文摘要

Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.

2603.22276 2026-03-24 cs.LG stat.ML

Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Alexandra Zelenin, Alexandra Zhuravlyova

Comments 30 pages, 15 figures, 15 tables, including appendices. Code and data at https://github.com/sockeye44/dorafactors

详情
英文摘要

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.

2603.22275 2026-03-24 cs.CV

Repurposing Geometric Foundation Models for Multi-view Diffusion

Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu

Comments project website: https://cvlab-kaist.github.io/GLD/

详情
英文摘要

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.

2603.22271 2026-03-24 cs.CV

DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution

Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong

Comments Accepted to CVPR 2026

详情
英文摘要

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.

2603.22270 2026-03-24 cs.CV

GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning

Yixuan Luo, Feng Qiao, Zhexiao Xiong, Yanjing Li, Nathan Jacobs

详情
英文摘要

Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce \textbf{\modelname}, a novel framework that synthesizes large-scale, perfectly aligned frame--flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an \textit{inconsistent pixel filtering} strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that \textbf{\modelname} achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.

2603.22264 2026-03-24 cs.RO

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu

Comments Accepted by CVPR 2026

详情
英文摘要

Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset over 50K trajectories across eight dexterous hands (6--24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinematic and visual gaps. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories to enable human-robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.

2603.22263 2026-03-24 cs.RO

DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming

Hung-Chieh Fang, Amber Xie, Jennifer Grannen, Kenneth Llontop, Dorsa Sadigh

Comments Website: https://dexdrummer.github.io/

详情
英文摘要

Performing in-hand, contact-rich, and long-horizon dexterous manipulation remains an unsolved challenge in robotics. Prior hand dexterity works have considered each of these three challenges in isolation, yet do not combine these skills into a single, complex task. To further test the capabilities of dexterity, we propose drumming as a testbed for dexterous manipulation. Drumming naturally integrates all three challenges: it involves in-hand control for stabilizing and adjusting the drumstick with the fingers, contact-rich interaction through repeated striking of the drum surface, and long-horizon coordination when switching between drums and sustaining rhythmic play. We present DexDrummer, a hierarchical object-centric bimanual drumming policy trained in simulation with sim-to-real transfer. The framework reduces the exploration difficulty of pure reinforcement learning by combining trajectory planning with residual RL corrections for fast transitions between drums. A dexterous manipulation policy handles contact-rich dynamics, guided by rewards that explicitly model both finger-stick and stick-drum interactions. In simulation, we show our policy can play two styles of music: multi-drum, bimanual songs and challenging, technical exercises that require increased dexterity. Across simulated bimanual tasks, our dexterous, reactive policy outperforms a fixed grasp policy by 1.87x across easy songs and 1.22x across hard songs F1 scores. In real-world tasks, we show song performance across a multi-drum setup. DexDrummer is able to play our training song and its extended version with an F1 score of 1.0.

2603.22262 2026-03-24 cs.CG math.CO

Flip Distance of Non-Crossing Spanning Trees: NP-Hardness and Improved Bounds

Håvard Bakke Bjerkevik, Joseph Dorfer, Linda Kleist, Torsten Ueckerdt, Birgit Vogtenhuber

详情
英文摘要

We consider the problem of reconfiguring non-crossing spanning trees on point sets. For a set $P$ of $n$ points in general position in the plane, the flip graph $F(P)$ has a vertex for each non-crossing spanning tree on $P$ and an edge between any two spanning trees that can be transformed into each other by the exchange of a single edge. This flip graph has been intensively studied, lately with an emphasis on determining its diameter diam$(F(P))$ for sets $P$ of $n$ points in convex position. The current best bounds are $\frac{14}{9}n-O(1) \leq$ diam$(F(P))<\frac{15}{9}n-3$ [Bjerkevik, Kleist, Ueckerdt, and Vogtenhuber; SODA 2025]. The crucial tool for both the upper and lower bound are so-called *conflict graphs*, which the authors stated might be the key ingredient for determining the diameter (up to lower-order terms). In this paper, we pick up the concept of conflict graphs and show that this tool is even more versatile than previously hoped. As our first main result, we use conflict graphs to show that computing the flip distance between two non-crossing spanning trees is NP-hard, even for point sets in convex position. Interestingly, the result still holds for more constrained flip operations, concretely, compatible flips (where the removed and the added edge do not cross) and rotations (where the removed and the added edge share an endpoint). Extending the line of research from [BKUV SODA25], we present new insights on the diameter of the flip graph. Their lower bound is based on a constant-size pair of trees, one of which is *stacked*. We show that if one of the trees is stacked, then the lower bound is indeed optimal up to a constant term, that is, there exists a flip sequence of length at most $\frac{14}{9}(n-1)$ to any other tree. Lastly, we improve the lower bound on the diameter of the flip graph $F(P)$ for $n$ points in convex position to $\frac{11}{7}n-o(n)$.

2603.22260 2026-03-24 cs.CL

Greater accessibility can amplify discrimination in generative AI

Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher

Comments Preprint

详情
英文摘要

Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant to undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.

2603.22254 2026-03-24 cond-mat.mtrl-sci cond-mat.mes-hall cs.LG physics.atm-clus physics.chem-ph

Characterizing High-Capacity Janus Aminobenzene-Graphene Anode for Sodium-Ion Batteries with Machine Learning

Claudia Islas-Vargas, L. Ricardo Montoya, Carlos A. Vital-José, Oliver T. Unke, Klaus-Robert Müller, Huziel E. Sauceda

Comments 8 pages, 5 figures, research article

详情
英文摘要

Sodium-ion batteries require anodes that combine high capacity, low operating voltage, fast Na-ion transport, and mechanical stability, which conventional anodes struggle to deliver. Here, we use the SpookyNet machine-learning force field (MLFF) together with all-electron density-functional theory calculations to characterize Na storage in aminobenzene-functionalized Janus graphene (Na$_x$AB) at room-temperature. Simulations across state of charge reveal a three-stage storage mechanism-site-specific adsorption at aminobenzene groups and Na$_n$@AB$_m$ structure formation, followed by interlayer gallery filling-contrasting the multi-stage pore-, graphite-interlayer-, and defect-controlled behavior in hard carbon. This leads to an OCV profile with an extended low-voltage plateau of 0.15 V vs. Na/Na$^{+}$, an estimated gravimetric capacity of $\sim$400 mAh g$^{-1}$, negligible volume change, and Na diffusivities of $\sim10^{-6}$ cm$^{2}$ s$^{-1}$, two to three orders of magnitude higher than in hard carbon. Our results establish Janus aminobenzene-graphene as a promising, structurally defined high-capacity Na-ion anode and illustrate the power of MLFF-based simulations for characterizing electrode materials.

2603.22252 2026-03-24 eess.AS cs.SD

SelfTTS: cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation

Lucas H. Ueda, João G. T. Lima, Pedro R. Corrêa, Flávio O. Simões, Mário U. Neto, Paula D. P. Costa

Comments Submitted to Interspeech 2026

详情
英文摘要

This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglement strategy utilizing Gradient Reversal Layers (GRL) combined with cosine similarity loss to decouple speaker and emotion information. We introduce Multi Positive Contrastive Learning (MPCL) to induce clustered representations of speaker and emotion embeddings based on their respective labels. Furthermore, SelfTTS employs a self-refinement strategy via Self-Augmentation, exploiting the model's voice conversion capabilities to enhance the naturalness of synthesized speech. Experimental results demonstrate that SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.

2603.22251 2026-03-24 cs.DC

exaCB: Reproducible Continuous Benchmark Collections at Scale Leveraging an Incremental Approach

Jayesh Badwaik, Mathis Bode, Michal Rajski, Andreas Herten

详情
英文摘要

The increasing heterogeneity of high-performance computing (HPC) systems and the transition to exascale architectures require systematic and reproducible performance evaluation across diverse workloads. While continuous integration (CI) ensures functional correctness in software engineering, performance and energy efficiency in HPC are typically evaluated outside CI workflows, motivating continuous benchmarking (CB) as a complementary approach. Integrating benchmarking into CI workflows enables reproducible evaluation, early detection of regressions, and continuous validation throughout the software development lifecycle. We present exaCB, a framework for continuous benchmarking developed in the context of the JUPITER exascale system. exaCB enables application teams to integrate benchmarking into their workflows while supporting large-scale, system-wide studies through reusable CI/CD components, established harnesses, and a shared reporting protocol. The framework supports incremental adoption, allowing benchmarks to be onboarded easily and to evolve from basic runnability to more advanced instrumentation and reproducibility. The approach is demonstrated in JUREAP, the early-access program for JUPITER, where exaCB enabled continuous benchmarking of over 70 applications at varying maturity levels, supporting cross-application analysis, performance tracking, and energy-aware studies. These results illustrate the practicality using exaCB for continuous benchmarking for exascale HPC systems across large, diverse collections of scientific applications.

2603.22249 2026-03-24 cs.CV

EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild

Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham, Guha Balakrishnan, Paola Cascante-Bonilla

Comments Project Page: https://lab-spell.github.io/EgoGroups/

详情
英文摘要

Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.

2603.22248 2026-03-24 cs.LG cs.AI cs.IT math.IT stat.ML

Confidence-Based Decoding is Provably Efficient for Diffusion Language Models

Changxiao Cai, Gen Li

详情
英文摘要

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the \emph{decoding strategy} -- which determines the order and number of tokens generated at each iteration -- critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde O(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.

2603.22240 2026-03-24 cs.DS cs.CC

A Dividing Line for Structural Kernelization of Component Order Connectivity via Distance to Bounded Pathwidth

Jakob Greilhuber, Roohani Sharma

Comments Abstract shortened due to arXiv length requirements

详情
英文摘要

In this work we study a classic generalization of the Vertex Cover (VC) problem, called the Component Order Connectivity (COC) problem. In COC, given an undirected graph $G$, integers $d \geq 1$ and $k$, the goal is to determine if there is a set of at most $k$ vertices whose deletion results in a graph where each connected component has at most $d$ vertices. When $d=1$, this is exactly VC. This work is inspired by polynomial kernelization results with respect to structural parameters for VC. On one hand, Jansen & Bodlaender [TOCS 2013] show that VC admits a polynomial kernel when the parameter is the distance to treewidth-$1$ graphs, on the other hand Cygan, Lokshtanov, Pilipczuk, Pilipczuk & Saurabh [TOCS 2014] showed that VC does not admit a polynomial kernel when the parameter is distance to treewidth-$2$ graphs. Greilhuber & Sharma [IPEC 2024] showed that, for any $d \geq 2$, $d$-COC cannot admit a polynomial kernel when the parameter is distance to a forest of pathwidth $2$. Here, $d$-COC is the same as COC only that $d$ is a fixed constant not part of the input. We complement this result and show that like for the VC problem where distance to treewidth-$1$ graphs versus distance to treewidth-$2$ graphs is the dividing line between structural parameterizations that allow and respectively disallow polynomial kernelization, for COC this dividing line happens between distance to pathwidth-$1$ graphs and distance to pathwidth-$2$ graphs. The main technical result of this work is that COC admits a polynomial kernel parameterized by distance to pathwidth-$1$ graphs plus $d$.

2603.22237 2026-03-24 cs.IT math.IT physics.soc-ph

Structure-aware divergences for comparing probability distributions

Rohit Sahasrabuddhe, Renaud Lambiotte

详情
英文摘要

Many natural and social science systems are described using probability distributions over elements that are related to each other: for instance, occupations with shared skills or species with similar traits. Standard information theory quantities such as entropies and $f$-divergences treat elements interchangeably and are blind to the similarity structure. We introduce a family of divergences that are sensitive to the geometry of the underlying domain. By virtue of being the Bregman divergences of structure-aware entropies, they provide a framework that retains several advantages of Kullback-Leibler divergence and Shannon entropy. Structure-aware divergences recover planted patterns in a synthetic clustering task that conventional divergences miss and are orders of magnitude faster than optimal transport distances. We demonstrate their applicability in economic geography and ecology, where structure plays an important role. Modelling different notions of occupation relatedness yields qualitatively different regionalisations of their geographic distribution. Our methods also reproduce established insights into functional $β$-diversity in ecology obtained with optimal transport methods.

2603.22231 2026-03-24 cs.IR cs.AI cs.GT cs.LG

One Model, Two Markets: Bid-Aware Generative Recommendation

Yanchen Jiang, Zhe Feng, Christopher P. Mah, Aranyak Mehta, Di Wang

详情
英文摘要

Generative Recommender Systems using semantic ids, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad's likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.

2603.22230 2026-03-24 cs.CV

Riverine Land Cover Mapping through Semantic Segmentation of Multispectral Point Clouds

Sopitta Thurachen, Josef Taher, Matti Lehtomäki, Leena Matikainen, Linnea Blåfield, Mikel Calle Navarro, Antero Kukko, Tomi Westerlund, Harri Kaartinen

详情
英文摘要

Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model's generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that using the full-feature configuration resulted in performance with a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were the key for accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.

2603.22229 2026-03-24 cs.CV

Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre

Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez, Mikel Galar

Comments 6 pages, 2 figures

详情
英文摘要

Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.

2603.22228 2026-03-24 cs.CV cs.AI

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao

详情
英文摘要

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present \textbf{SpatialReward}, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a \emph{Prompt Decomposer} extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce \textbf{SpatRelBench}, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

2603.22227 2026-03-24 cs.HC cs.AI cs.CL

Dyadic: A Scalable Platform for Human-Human and Human-AI Conversation Research

David M. Markowitz

详情
英文摘要

Conversation is ubiquitous in social life, but the empirical study of this interactive process has been thwarted by tools that are insufficiently modular and unadaptive to researcher needs. To relieve many constraints in conversation research, the current tutorial presents an overview and introduction to a new tool, Dyadic (https://www.chatdyadic.com/), a web-based platform for studying human-human and human-AI conversations using text-based or voice-based chats. Dyadic is distinct from other platforms by offering studies with multiple modalities, AI suggestions (e.g., in human-human studies, AI can suggest responses to a participant), live monitoring (e.g., researchers can evaluate, in real time, chats between communicators), and survey deployment (e.g., Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to humans for in situ evaluations of the interaction), among other consequential features. No coding is required to operate Dyadic directly, and integrations with existing survey platforms are offered.

2603.22222 2026-03-24 eess.SY cs.SY

A Portfolio-Level Optimization Framework for Coordinated Market Participation and Operational Scheduling of Hydrogen-Centric Companies

Seyed Amir Mansouri, Kenneth Bruninx

详情
英文摘要

The vision of electrolytic hydrogen as a clean energy vector prompts the emergence of hydrogen-centric companies that must simultaneously engage in electricity, hydrogen, and green certificate markets while operating complex, geographically distributed asset portfolios. This paper proposes a portfolio-level optimization framework tailored for the integrated operational scheduling and market participation of such companies. The model co-optimizes asset scheduling and market decisions across multiple sites, incorporating spatial distribution, technical constraints, and company-level policy requirements. It supports participation in the electricity market, physical and virtual Power Purchase Agreements (PPAs), bundled and unbundled hydrogen markets, and green certificate transactions. The model is applied to three operational scenarios to evaluate the economic and operational impacts of different compliance strategies. Results show that centralized, portfolio-level control unlocks the full flexibility of geographically distributed assets, enabling a 2.42-fold increase in hydrogen production and a 9.4% reduction in daily operational costs, while satisfying all company policy constraints.

2603.22220 2026-03-24 cs.DB

Accelerating Fresh Data Exploration with Fluid ETL Pipelines

Maxwell Norfolk, Dong Xie

详情
英文摘要

Recently, we have seen an increasing need for fresh data exploration, where data analysts seek to explore the main characteristics or detect anomalies of data being actively collected. In addition to the common challenges in classic data exploration, such as a lack of prior knowledge about the data or the analysis goal, fresh data exploration also demands an ingestion system with sufficient throughput to keep up with rapid data accumulation. However, leveraging traditional Extract-Transform-Load (ETL) pipelines to achieve low query latency can still be extremely resource-intensive as they must conduct an excessive amount of data preprocessing routines (DPRs) (e.g., parsing and indexing) to cover unpredictable data characteristics and analysis goals. To overcome this challenge, we seek to approach it from a different angle: leveraging occasional idle system capacity or cheap preemptive resources (e.g., Amazon Spot Instance) during ingestion. In particular, we introduce a new type of data ingestion system called fluid ETL pipelines, which allow users to start/stop arbitrary DPRs on demand without blocking data ingestion. With fluid ETL pipelines, users can start potentially useful DPRs to accelerate future exploration queries whenever idle/cheap resources are available. Moreover, users can dynamically change which DPRs to run with limited resources to adapt to users' evolving interests. We conducted experiments on a real-world dataset and verified that our vision is viable. The introduction of fluid ETL pipelines also raises new challenges in handling essential tasks, such as ad-hoc query processing, DPR generation, and DPR management. In this paper, we discuss open research challenges in detail and outline potential directions for addressing them.

2603.22219 2026-03-24 cs.LG stat.ML

Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting

Qilin Wang

详情
英文摘要

Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.

2603.22216 2026-03-24 cs.CL cs.LG

Gumbel Distillation for Parallel Text Generation

Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu

Comments ICLR 2026

详情
英文摘要

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at https://github.com/hxixixh/gumbel-distill.

2603.22214 2026-03-24 cs.CR cs.AI cs.LG

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber

详情
英文摘要

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.