arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2603.28766 2026-03-31 cs.CV

HandX: Scaling Bimanual Motion and Interaction Generation

Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui

Comments CVPR 2026. Project Page: https://handx-project.github.io. Code: https://github.com/handx-project/HandX

Abstract

Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.

2603.28765 2026-03-31 cs.CL

Adaptive Block-Scaled Data Types

Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P. Chandrakasan, Song Han

Comments 19 pages, 9 figures

Abstract

NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
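
The per-group choice between FP4 and INT4 described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the selection rule (lowest squared error), the symmetric INT4 range, and the function names are assumptions made for illustration, and the scale factor is kept in full precision rather than rounded to E4M3.

```python
# Illustrative sketch only: per-group choice between an FP4 (E2M1) grid and an
# INT4 grid. Scale factors stay in full precision here (no E4M3 rounding).

FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
INT4_LEVELS = [float(i) for i in range(8)]              # symmetric 0..7 (assumption)


def quantize_group(values, levels):
    """Scale so the largest magnitude hits the top level, then round to the nearest level."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / levels[-1]
    out = []
    for v in values:
        nearest = min(levels, key=lambda q: abs(abs(v) / scale - q))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def if4_quantize(values, group_size=16):
    """Return (dequantized values, per-group format tags)."""
    result, tags = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        fp4 = quantize_group(group, FP4_LEVELS)
        int4 = quantize_group(group, INT4_LEVELS)
        # Assumed selection rule: keep whichever grid reconstructs the group better.
        use_int = squared_error(group, int4) < squared_error(group, fp4)
        result.extend(int4 if use_int else fp4)
        tags.append("INT4" if use_int else "FP4")
    return result, tags
```

In this sketch the tag plays the role that the abstract assigns to the scale factor's sign bit: one extra bit of metadata per group of 16 values. A group of evenly spaced values tends to favor the INT4 grid, while a group with a few large values and many small ones tends to favor FP4.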

2603.28763 2026-03-31 cs.CV

PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht

Abstract

Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.

2603.28760 2026-03-31 cs.CV cs.RO

SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild

Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, Sizhe An, He Wen, Alex Wong, Tomas Hodan, Kun He

Comments CVPR 2026

Abstract

Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions while still generating precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. Project page: show3d-dataset.github.io

2603.28757 2026-03-31 cs.CV cs.MM cs.SD

SonoWorld: From One Image to a 3D Audio-Visual Scene

Derong Jin, Xiyi Chen, Ming C. Lin, Ruohan Gao

Comments Accepted by CVPR 2026, project page: https://humathe.github.io/sonoworld/

Abstract

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/

2603.28756 2026-03-31 cs.MS

Fast Large-Scale Model-Based Iterative Tomography via Exploiting Mathematical Structure, Hierarchical Optimization, Smart Initialization, and Distributed GPU Computing

Dinesh Kumar, Jeffrey Donatelli

Abstract

Model-Based Iterative Reconstruction (MBIR) is important because direct methods, such as Filtered Back-Projection (FBP), can introduce significant noise and artifacts in sparse-angle tomography, especially for time-evolving samples. Although MBIR produces high-quality reconstructions through prior-informed optimization, its computational cost has traditionally limited its broader adoption. In previous work, we addressed this limitation by expressing the Radon transform and its adjoint using non-uniform fast Fourier transforms (NUFFTs), reducing computational complexity relative to conventional projection-based methods. We further accelerated computation by employing a multi-GPU system for parallel processing. In this work, we further accelerate our Fourier-domain framework by introducing four main strategies: (1) a reformulation of the MBIR forward and adjoint operators that exploits their multi-level Toeplitz structure for efficient Fourier-domain computation; (2) an improved initialization strategy that uses back-projected data filtered with a standard ramp filter as the starting estimate; (3) a hierarchical multi-resolution reconstruction approach that first solves the problem on coarse grids and progressively transitions to finer grids using Lanczos interpolation; and (4) a distributed-memory implementation using MPI that enables near-linear scaling on large high-performance computing (HPC) systems. Together, these innovations significantly reduce iteration counts, improve parallel efficiency, and make high-quality MBIR reconstruction practical for large-scale tomographic imaging. These advances open the door to near-real-time MBIR for applications such as in situ, in operando, and time-evolving experiments.

2603.28755 2026-03-31 cs.CY

Graphilosophy: Graph-Based Digital Humanities Computing with The Four Books

Minh-Thu Do, Quynh-Chau Le-Tran, Duc-Duy Nguyen-Mai, Thien-Trang Nguyen, Khanh-Duy Le, Minh-Triet Tran, Tam V. Nguyen, Trung-Nghia Le

Comments AI & Society journal

Abstract

The Four Books have shaped East Asian intellectual traditions, yet their multi-layered interpretive complexity limits their accessibility in the digital age. While traditional bilingual commentaries provide a vital pedagogical bridge, computational frameworks are needed to preserve and explore this wisdom. This paper bridges AI and classical philosophy by introducing Graphilosophy, an ontology-guided, multi-layered knowledge graph framework for modeling and interpreting The Four Books. Integrating natural language processing, multilingual semantic embeddings, and humanistic analysis, the framework transforms a bilingual Chinese-Vietnamese corpus into an interpretively grounded resource. Graphilosophy encodes linguistic, conceptual, and interpretive relationships across interconnected layers, enabling cross-lingual retrieval and AI-assisted reasoning while explicitly preserving scholarly nuance and interpretive plurality. Through an interactive interface, the system also enables non-expert users to trace the evolution of ethical concepts across borders and languages, ensuring that ancient wisdom remains a living resource for modern moral discourse rather than a static relic of the past. A preliminary user study suggests the system's capacity to enhance conceptual understanding and cross-cultural learning. By linking algorithmic representation with ethical inquiry, this research exemplifies how AI can serve as a methodological bridge, accommodating the ambiguity of cultural heritage rather than reducing it to static data. The source code and data are released at https://github.com/ThuDoMinh1102/confucian-texts-knowledge-graph.

2603.28754 2026-03-31 eess.SY cs.SY

Sparse State-Space Realizations of Linear Controllers

Yaozhi Du, Jing Shuang Li

Comments Submitted to 2026 CDC

Abstract

This paper provides a novel approach for finding sparse state-space realizations of linear systems (e.g., controllers). Sparse controllers are commonly used in distributed control, where a controller is synthesized with some sparsity penalty. Here, motivated by a modeling problem in sensorimotor neuroscience, we study a complementary question: given a linear time-invariant system (e.g., controller) in transfer function form and a desired sparsity pattern, can we find a suitably sparse state-space realization for the transfer function? This problem is highly nonconvex, but we propose an exact method to solve it. We show that the problem reduces to finding an appropriate similarity transform from the modal realization, which in turn reduces to solving a system of multivariate polynomial equations. Finally, we leverage tools from algebraic geometry (namely, the Gröbner basis) to solve this problem exactly. We provide algorithms to find real- and complex-valued sparse realizations and demonstrate their efficacy on several examples.
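
The reduction to a similarity transform mentioned in the abstract above rests on a standard fact from linear systems theory, written out here for reference. The symbols $A, B, C, D$ (a state-space realization) and $T$ (an invertible matrix) are notation assumed for this note and do not appear in the abstract.

```latex
\begin{aligned}
\tilde{A} &= T^{-1} A T, \quad \tilde{B} = T^{-1} B, \quad \tilde{C} = C T, \quad \tilde{D} = D, \\
\tilde{C}\,(sI - \tilde{A})^{-1}\,\tilde{B} + \tilde{D}
  &= C T \left( T^{-1} (sI - A)\, T \right)^{-1} T^{-1} B + D \\
  &= C\,(sI - A)^{-1} B + D .
\end{aligned}
```

Every invertible $T$ therefore yields a realization of the same transfer function. Requiring selected entries of $\tilde{A}$, $\tilde{B}$, or $\tilde{C}$ to vanish turns a desired sparsity pattern into equations in the entries of $T$, which is plausibly where the polynomial system in the abstract comes from.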

2603.28753 2026-03-31 cs.NI

Iran's January 2026 Internet Shutdown: Public Data, Censorship Methods, and Circumvention Techniques

Giuseppe Aceto, Valerio Persico, Antonio Pescapè

Comments 12 pages, 3 figures, 1 table

Abstract

This paper analyzes the Internet shutdown that occurred in Iran in January 2026 in the context of protests, focusing on its impact on the country's digital communication infrastructure and on information access and control dynamics. The scale, complexity, and nation-state nature of the event motivate a comprehensive investigation that goes beyond isolated reports, aiming to provide a unified and systematic understanding of what happened and how it was observed. The study is guided by a set of research questions addressing: the characterization of the shutdown via the timeline of the disruption events and post-event "new normal"; the detectability of the event, encompassing monitoring initiatives, measurement techniques, and precursory signals; and the interplay between censorship and circumvention, assessing both the imposed restrictions and the effectiveness of tools designed to bypass them. To answer these questions, we adopt a multi-source, multi-perspective methodology that integrates heterogeneous public data, primarily from grey literature produced by network measurement and monitoring initiatives, complemented by additional private measurements. This approach enables a holistic view of the event and allows us to reconcile and compare partial observations from different sources.

2603.28747 2026-03-31 math.OC cs.SY eess.SY

Constrained Optimization on Matrix Lie Groups via Interior-Point Method

Aclécio J. Santos, Jean C. Pereira, Guilherme V. Raffo

Comments This is a preprint submitted to IEEE Control Systems Letters

Abstract

This paper proposes an interior-point framework for constrained optimization problems whose decision variables evolve on matrix Lie groups. The proposed method, termed the Matrix Lie Group Interior-Point Method (MLG-IPM), operates directly on the group structure using a minimal Lie algebra parametrization, avoiding redundant matrix representations and eliminating explicit dependence on Riemannian metrics. A primal-dual formulation is developed in which the Newton system is constructed through sensitivity and curvature matrices. Also, multiplicative updates are performed via the exponential map, ensuring intrinsic feasibility with respect to the group structure while maintaining strict positivity of slack and dual variables through a barrier strategy. A local analysis establishes quadratic convergence under standard regularity assumptions and characterizes the behavior under inexact Newton steps. Statistical comparisons against Riemannian Interior-Point Methods, specifically for optimization problems defined over the Special Orthogonal Group SO(n) and Special Linear Group SL(n), demonstrate that the proposed approach achieves higher success rates, fewer iterations, and superior numerical accuracy. Furthermore, its robustness under perturbations suggests that this method serves as a consistent and reliable alternative for structured manifold optimization.
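
The multiplicative update via the exponential map mentioned in the abstract above can be illustrated on SO(3). The sketch below shows only that one ingredient, not the interior-point method itself. The function names and the choice of right multiplication are assumptions made for illustration.

```python
import math


def hat(w):
    """Map a vector in R^3 to the corresponding skew-symmetric matrix in so(3)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def exp_so3(w):
    """Rodrigues' formula for the matrix exponential of hat(w)."""
    theta = math.sqrt(sum(c * c for c in w))
    k = hat(w)
    k2 = matmul(k, k)
    if theta < 1e-12:
        a, b = 1.0, 0.5  # limits of sin(t)/t and (1 - cos t)/t^2 as t -> 0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j] for j in range(3)]
            for i in range(3)]


def retract(rotation, step):
    """Multiplicative update R <- R exp(hat(step)); the result remains a rotation."""
    return matmul(rotation, exp_so3(step))
```

Because the step lives in the three-dimensional Lie algebra and is mapped back through the exponential, the iterate needs no re-orthogonalization: it stays orthogonal up to floating-point error by construction.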

2603.28744 2026-03-31 cs.LG

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt

Abstract

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning, not the inference procedure, as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.

2603.28740 2026-03-31 cs.RO

FocusVLA: Focused Visual Utilization for Vision-Language-Action Models

Yichi Zhang, Weihao Yuan, Yizhuo Zhang, Xidong Zhang, Jia Wan

Comments 25 pages, 18 figures

Abstract

Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current auto-regressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes attention difficult to focus on the correct regions, and (3) task-irrelevant visual information introduces substantial noise. Together, these bottlenecks severely impair action quality. In this paper, we investigate how to effectively utilize different visual representations for action generation. To this end, we first empirically validate the above issues and show that VLA performance is primarily limited by how visual information is utilized, rather than by the quality of visual representations. Based on these insights, we introduce FocusVLA, a novel paradigm that directs the model's attention to task-relevant visual regions to effectively bridge vision to action. Specifically, we first propose Modality Cascaded Attention to eliminate shortcut pathways, thereby compelling VLA models to rely on task-relevant visual details for action generation. Furthermore, we propose Focus Attention, which dynamically selects task-relevant visual patches to control information quantity while explicitly modulating their influence to suppress task-irrelevant noise. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that FocusVLA not only effectively leverages visual details to perform dexterous manipulations, but also substantially improves performance and accelerates convergence across a variety of tasks.

2603.28739 2026-03-31 cs.LG stat.ML

Expectation Error Bounds for Transfer Learning in Linear Regression and Linear Neural Networks

Meitong Liu, Christopher Jung, Rui Li, Xue Feng, Han Zhao

Abstract

In transfer learning, the learner leverages auxiliary data to improve generalization on a main task. However, the precise theoretical understanding of when and how auxiliary data help remains incomplete. We provide new insights on this issue in two canonical linear settings: ordinary least squares regression and under-parameterized linear neural networks. For linear regression, we derive exact closed-form expressions for the expected generalization error with bias-variance decomposition, yielding necessary and sufficient conditions for auxiliary tasks to improve generalization on the main task. We also derive globally optimal task weights as outputs of solvable optimization programs, with consistency guarantees for empirical estimates. For linear neural networks with shared representations of width $q \leq K$, where $K$ is the number of auxiliary tasks, we derive a non-asymptotic expectation bound on the generalization error, yielding the first non-vacuous sufficient condition for beneficial auxiliary learning in this setting, as well as principled directions for task weight curation. We achieve this by proving a new column-wise low-rank perturbation bound for random matrices, which improves upon existing bounds by preserving fine-grained column structures. Our results are verified on synthetic data simulated with controlled parameters.

2603.28737 2026-03-31 eess.AS cs.AI cs.CL cs.SD

ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining

Anuj Diwan, Eunsol Choi, David Harwath

Comments Under review

Abstract

We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap.
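
A dual-encoder contrastive objective of the kind the abstract above refers to is often written as a symmetric cross-entropy over a similarity matrix. The sketch below assumes that CLIP-style form. The abstract does not state the exact loss, so the function names, the temperature value, and the symmetric formulation are illustrative assumptions, and the encoders themselves are not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired speech/caption embeddings."""
    speech = [normalize(v) for v in speech_embs]
    text = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(s, t)) / temperature for t in text] for s in speech]
    transposed = [list(col) for col in zip(*logits)]
    # Average the speech-to-text and text-to-speech directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))
```

The loss is small when each speech clip is closest to its own caption and large when pairs are mismatched, which is what makes the shared embedding space usable for caption retrieval.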

2603.28735 2026-03-31 cs.SE cs.AI

RAD-AI: Rethinking Architecture Documentation for AI-Augmented Ecosystems

Oliver Aleksander Larsen, Mahyar T. Moghaddam

Comments Accepted at ANGE 2026, co-located with IEEE ICSA 2026. 8 pages

Abstract

AI-augmented ecosystems (interconnected systems where multiple AI components interact through shared data and infrastructure) are becoming the architectural norm for smart cities, autonomous fleets, and intelligent platforms. Yet the architecture documentation frameworks practitioners rely on, arc42 and the C4 model, were designed for deterministic software and cannot capture probabilistic behavior, data-dependent evolution, or dual ML/software lifecycles. This gap carries regulatory consequence: the EU AI Act (Regulation 2024/1689) mandates technical documentation through Annex IV that no existing framework provides structured support for, with enforcement for high-risk systems beginning August 2, 2026. We present RAD-AI, a backward-compatible extension framework that augments arc42 with eight AI-specific sections and C4 with three diagram extensions, complemented by a systematic EU AI Act Annex IV compliance mapping. A regulatory coverage assessment with six experienced software-architecture practitioners provides preliminary evidence that RAD-AI increases Annex IV addressability from approximately 36% to 93% (mean rating) and demonstrates substantial improvement over existing frameworks. Comparative analysis on two production AI platforms (Uber Michelangelo, Netflix Metaflow) captures eight additional AI-specific concerns missed by standard frameworks and demonstrates that documentation deficiencies are structural rather than domain-specific. An illustrative smart mobility ecosystem case study reveals ecosystem-level concerns, including cascading drift and differentiated compliance obligations, that are invisible under standard notation.

2603.28732 2026-03-31 cs.RO cs.CV

Pandora: Articulated 3D Scene Graphs from Egocentric Vision

Alan Yu, Yun Chang, Christopher Xie, Luca Carlone

Comments 14 pages, 5 figures. Presented at the 2025 British Machine Vision Conference (BMVC) in Sheffield, UK

Abstract

Robotic mapping systems typically build metric-semantic scene representations from the robot's own sensors and cameras. However, these "first person" maps inherit the robot's own limitations due to its embodiment or skillset, which may leave many aspects of the environment unexplored. For example, the robot might not be able to open drawers or access wall cabinets. In this sense, the map representation is not as complete, and requires a more capable robot to fill in the gaps. We narrow these blind spots in current methods by leveraging egocentric data captured as a human naturally explores a scene wearing Project Aria glasses, giving a way to directly transfer knowledge about articulation from the human to any deployable robot. We demonstrate that, by using simple heuristics, we can leverage egocentric data to recover models of articulated object parts, with quality comparable to that of state-of-the-art methods based on other input modalities. We also show how to integrate these models into 3D scene graph representations, leading to a better understanding of object dynamics and object-container relationships. We finally demonstrate that these articulated 3D scene graphs enhance a robot's ability to perform mobile manipulation tasks, showcasing an application where a Boston Dynamics Spot is tasked with retrieving concealed target items, given only the 3D scene graph as input.

2603.28731 2026-03-31 cs.SE cs.AI

SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability

Oliver Aleksander Larsen, Mahyar T. Moghaddam

Comments Accepted at SAGAI 2026, co-located with IEEE ICSA 2026. 8 pages

Abstract

Modern distributed systems integrate heterogeneous services (REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads) that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs 0.77 mean pass@1), while cost varies by over 30x across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.
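
The structural-diff half of the hybrid detection step in the abstract above could look roughly like the following sketch. It is not taken from SAGAI-MID: the flat `{field: type}` schema shape, the function names, and the rule for when to escalate to an LLM are all hypothetical, and the LLM call itself is left out.

```python
def structural_diff(source_schema, target_schema):
    """Compare two flat {field_name: type_name} schemas (hypothetical schema shape)."""
    source_fields, target_fields = set(source_schema), set(target_schema)
    return {
        "missing_in_source": sorted(target_fields - source_fields),
        "extra_in_source": sorted(source_fields - target_fields),
        "type_mismatch": sorted(
            f for f in source_fields & target_fields
            if source_schema[f] != target_schema[f]
        ),
    }


def needs_semantic_analysis(diff):
    """Assumed escalation rule: unmatched names on both sides may be renames,
    which a purely structural comparison cannot resolve."""
    return bool(diff["missing_in_source"] and diff["extra_in_source"])
```

For example, comparing `{"user_name": "string", "age": "string"}` against `{"username": "string", "age": "integer"}` reports `age` as a type mismatch and leaves `user_name` and `username` unmatched. Deciding that those two fields are the same thing is the kind of judgment the abstract assigns to the LLM semantic analysis.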

2603.28728 2026-03-31 cs.NI

Study of Post Quantum status of Widely Used Protocols

Tushin Mallick, Ashish Kundu, Ramana Kompella

Abstract

The advent of quantum computing poses significant threats to classical public-key cryptographic primitives such as RSA and elliptic-curve cryptography. As many critical network and security protocols depend on these primitives for key exchange and authentication, there is an urgent need to understand their quantum vulnerability and assess the progress made towards integrating post-quantum cryptography (PQC). This survey provides a detailed examination of nine widely deployed protocols (TLS, IPsec, BGP, DNSSEC, SSH, QUIC, OpenID Connect, OpenVPN, and Signal Protocol), analysing their cryptographic foundations, quantum risks, and the current state of PQC migration. We find that TLS and Signal lead the transition with hybrid post-quantum key exchange already deployed at scale, while IPsec and SSH have standardised mechanisms but lack widespread production adoption. DNSSEC and BGP face the most significant structural barriers, as post-quantum signature sizes conflict with fundamental protocol constraints. Across all protocols, key exchange proves consistently easier to migrate than authentication, and protocol-level limitations such as message size and fragmentation often dominate over raw algorithm performance. We also discuss experimental deployments and emerging standards that are shaping the path towards a quantum-resistant communication infrastructure.

2603.28727 2026-03-31 cs.CR cs.DC cs.NI cs.SE

BitSov: A Composable Bitcoin-Native Architecture for Sovereign Internet Infrastructure

Oliver Aleksander Larsen, Rasmus Thorsen Larsen, Mahyar T. Moghaddam

Comments Accepted at BlockArch 2026, co-located with IEEE ICSA 2026. 4 pages

Abstract

Today's internet concentrates identity, payments, communication, and content hosting under a small number of corporate intermediaries, creating single points of failure, enabling censorship, and extracting economic rent from participants. We present BitSov, an architectural framework for sovereign internet infrastructure that composes existing decentralized technologies (Bitcoin, Lightning Network, decentralized storage, federated messaging, and mesh connectivity) into a unified, eight-layer protocol stack anchored to Bitcoin's base layer. The framework introduces three architectural patterns: (1) payment-gated messaging, where every transmitted message requires cryptographic proof of a Bitcoin payment, deterring spam through economic incentives rather than moderation; (2) timechain-locked contracts, which anchor subscriptions and licenses to Bitcoin block height (the timechain) rather than calendar dates; and (3) a self-sustaining economic flywheel that converts service revenue into infrastructure growth. A dual settlement model supports both on-chain transactions for permanence and auditability and Lightning micropayments for high-frequency messaging. As a position paper, we analyze the quality attributes, discuss open challenges, and propose a research agenda for empirical validation.
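
The idea of anchoring a subscription to block height rather than to a calendar date, as described in the abstract above, can be sketched in a few lines. The class and function names are hypothetical, and the conversion from days to blocks relies only on Bitcoin's ten-minute target block interval, so real elapsed time will drift around the nominal figure.

```python
from dataclasses import dataclass

BLOCKS_PER_DAY = 144  # one block per ten minutes on average


@dataclass(frozen=True)
class TimechainLock:
    """A validity window expressed purely in block heights (hypothetical type)."""
    start_height: int
    duration_blocks: int

    @property
    def expiry_height(self) -> int:
        return self.start_height + self.duration_blocks

    def is_active(self, current_height: int) -> bool:
        # Active from the start block up to, but not including, the expiry block.
        return self.start_height <= current_height < self.expiry_height


def subscription_for_days(start_height: int, days: int) -> TimechainLock:
    """Convert a nominal duration in days into a block-height window."""
    return TimechainLock(start_height, days * BLOCKS_PER_DAY)
```

A nominal 30-day subscription starting at height 900,000 would expire at height 904,320 under this sketch. Any party that can read the chain tip can check validity without trusting a wall clock.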

2603.28719 2026-03-31 eess.SY cs.SY

Alertness Optimization for Shift Workers Using a Physiology-based Mathematical Model

Zidi Tao, A. Agung Julius, John T Wen

Comments 35 pages single column, 9 figures

Abstract

Sleep is vital for maintaining cognitive function, facilitating metabolic waste removal, and supporting memory consolidation. However, modern societal demands, particularly shift work, often disrupt natural sleep patterns. This can induce excessive sleepiness among shift workers in critical sectors such as healthcare and transportation and increase the risk of accidents. The primary contributors to this issue are misalignments of circadian rhythms and enforced sleep-wake schedules. Regulating circadian rhythms that are tied to alertness can be regarded as a control problem with control inputs in the form of light and sleep schedules. In this paper, we address the problem of optimizing alertness by optimizing light and sleep schedules to improve the cognitive performance of shift workers. A key tool in our approach is a mathematical model that relates the control input variables (sleep and lighting schedules) to the dynamics of the circadian clock and sleep. In the sleep and circadian modeling literature, the newer physiology-based model shows better accuracy in predicting the alertness of shift workers than the phenomenology-based model, but the dynamics of the physiology-based model involve differential equations with different time scales, which poses challenges for optimization. To overcome this challenge, we propose a hybrid version of the PR model by applying singular perturbation techniques to reduce the system to a non-stiff, differentiable hybrid system. This reformulation facilitates the application of the calculus of variations and the gradient descent method to find the optimal light and sleep schedules that maximize the subjective alertness of shift workers. Our approach is validated through numerical simulations, and the simulation results demonstrate improved alertness compared to other existing schedules.

2603.28718 2026-03-31 cs.LG cs.AI cs.CV

Stepwise Credit Assignment for GRPO on Flow-Matching Models

Yash Savani, Branislav Kveton, Yuchen Liu, Yilin Wang, Jing Shi, Subhojyoti Mukherjee, Nikos Vlassis, Krishna Kumar Singh

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. Project page: https://stepwiseflowgrpo.com

Details
English abstract

Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
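
The difference between uniform and stepwise credit can be illustrated with a minimal sketch. It assumes a hypothetical array `rewards[i, t]` holding an intermediate reward estimate for sample `i` of a group after step `t` (the abstract obtains such estimates via Tweedie's formula; here they are plain numbers). The group-wise standardization of the per-step gains is an assumption borrowed from GRPO-style normalization, not a formula confirmed by the abstract.

```python
import numpy as np

def stepwise_advantages(rewards, eps=1e-8):
    """Gain-based credit: one advantage per sample and per step.

    rewards: array of shape (G, T + 1), reward estimates before step 0
             and after each of the T steps, for a group of G samples.
    Returns an array of shape (G, T).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    gains = rewards[:, 1:] - rewards[:, :-1]      # reward improvement of each step
    mean = gains.mean(axis=0, keepdims=True)      # statistics across the group (assumed)
    std = gains.std(axis=0, keepdims=True)
    return (gains - mean) / (std + eps)

def uniform_advantages(rewards, eps=1e-8):
    """Uniform credit: every step inherits the advantage of the final reward."""
    rewards = np.asarray(rewards, dtype=np.float64)
    final = rewards[:, -1]
    adv = (final - final.mean()) / (final.std() + eps)
    return np.repeat(adv[:, None], rewards.shape[1] - 1, axis=1)

# Two samples reach the same final reward, but improve at different steps.
rewards = np.array([[0.0, 1.0, 1.0],
                    [0.0, 0.0, 1.0]])
print(stepwise_advantages(rewards))
print(uniform_advantages(rewards))
```

In this toy group both samples end with the same final reward, so uniform credit cannot distinguish them, while the stepwise variant credits each sample at the step where its reward estimate actually improved.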

2603.28713 2026-03-31 cs.CV

DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao

Comments https://carlofkl.github.io/dreamlite/

Details
English abstract

Diffusion models have made significant progress in both text-to-image (T2I) generation and text-guided image editing. However, these models are typically built with billions of parameters, leading to high latency and increased deployment challenges. While on-device diffusion models improve efficiency, they largely focus on T2I generation and lack support for image editing. In this paper, we propose DreamLite, a compact unified on-device diffusion model (0.39B) that supports both T2I generation and text-guided image editing within a single network. DreamLite is built on a pruned mobile U-Net backbone and unifies conditioning through in-context spatial concatenation in the latent space. It concatenates images horizontally as input, using a (target | blank) configuration for generation tasks and (target | source) for editing tasks. To stabilize the training of this compact model, we introduce a task-progressive joint pretraining strategy that sequentially targets T2I, editing, and joint tasks. After high-quality SFT and reinforcement learning, DreamLite achieves 0.72 on GenEval for image generation and 4.11 on ImgEdit for image editing, outperforming existing on-device models and remaining competitive with several server-side models. By employing step distillation, we further reduce denoising to just 4 steps, enabling DreamLite to generate or edit a 1024 x 1024 image in less than 1s on a Xiaomi 14 smartphone. To the best of our knowledge, DreamLite is the first unified on-device diffusion model that supports both image generation and image editing.
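
The (target | blank) and (target | source) layouts can be pictured with a small sketch. The function name `build_input`, the `(C, H, W)` latent shape, and the choice of an all-zero array for the "blank" half are illustrative assumptions; the abstract only states that images are concatenated horizontally in latent space.

```python
import numpy as np

def build_input(target, source=None):
    """Concatenate two latents side by side along the width axis.

    target, source: arrays of shape (C, H, W).
    With source=None the right half is a blank (all-zero, assumed) latent,
    which corresponds to the generation layout (target | blank).
    """
    if source is None:
        source = np.zeros_like(target)            # assumption: "blank" means zeros
    if source.shape != target.shape:
        raise ValueError("target and source latents must have the same shape")
    return np.concatenate([target, source], axis=-1)

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8, 8))
source = rng.normal(size=(4, 8, 8))

generation_input = build_input(target)            # (target | blank)
editing_input = build_input(target, source)       # (target | source)
print(generation_input.shape, editing_input.shape)
```

Because both tasks share one input layout, a single network can serve generation and editing without task-specific input branches.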

2603.28709 2026-03-31 cs.AR

Physical Design of UET-RVMCU: A Streamlined Open-Source RISC-V Microcontroller

Abdullah Azhar, Uneeb Kamal, Wajid Ali, Saad Gillani, Dr Suleman Sami Qazi

Details
English abstract

This paper presents the design and physical implementation of UET-RVMCU, a lightweight RISC-V microcontroller derived from the UETRV-PCore. Aimed at creating an accessible and flexible open-source RISC-V-based microcontroller, UET-RVMCU simplifies the application-class UETRV-PCore by reducing pipeline stages, removing MMU functionality, and integrating GPIO peripherals. The final GDSII layout was generated using an open-source RTL-to-GDS flow (OpenLane). This project demonstrates the feasibility of transforming an application-class SoC into a feature-rich microcontroller suitable for embedded systems, emphasizing low area, design simplicity, and open-source development.

2603.28708 2026-03-31 cs.LG cs.DC

GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

Soutrik Mukherjee, Sangwhan Cha

Comments 10 pages, 8 figures, 15 tables

Details
English abstract

This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
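
The hybrid precision idea (FP16 for linear layers, FP32 for softmax) can be reproduced in miniature with NumPy. This is a toy sketch, not the TensorRT pipeline from the paper: the helper names, tensor shapes, and the deliberately naive softmax without max-subtraction are assumptions chosen to make the FP16 overflow visible.

```python
import numpy as np

def linear(x, w, dtype):
    """Matrix multiply with both operands cast to `dtype`."""
    return x.astype(dtype) @ w.astype(dtype)

def naive_softmax(logits, dtype):
    """Softmax evaluated entirely in `dtype`, without max-subtraction."""
    z = logits.astype(dtype)
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)).astype(np.float32)
w = (rng.normal(size=(64, 8)) / 8.0).astype(np.float32)

# FP32 baseline versus hybrid (FP16 linear layer, FP32 softmax).
baseline = naive_softmax(linear(x, w, np.float32), np.float32)
hybrid = naive_softmax(linear(x, w, np.float16), np.float32)
print("cosine similarity:", cosine_similarity(baseline, hybrid))

# exp(12) exceeds the largest finite FP16 value (65504), so a softmax
# computed fully in FP16 produces inf / inf = NaN on these logits.
large_logits = np.array([[12.0, 0.0, -1.0]])
full_fp16 = naive_softmax(large_logits, np.float16)
fp32_only = naive_softmax(large_logits, np.float32)
print("FP16 has NaN:", bool(np.isnan(full_fp16).any()))
print("FP32 has NaN:", bool(np.isnan(fp32_only).any()))
```

The sketch only shows why exponent-heavy operations are the natural candidates for FP32; the speedup and memory figures quoted in the abstract depend on GPU kernels and are not reproduced here.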

2603.28706 2026-03-31 math.NA cs.NA physics.comp-ph

A Scalable Monolithic Modified Newton Multigrid Framework for Time-Dependent $p$-Navier-Stokes Flow

Nils Margenberg, Carolin Mehlmann

Comments 28 pages, 7 figures, 3 tables

Details
English abstract

Fully implicit tensor-product space-time discretizations of time-dependent $(p,\delta)$-Navier-Stokes models yield, on each time step, large nonlinear monolithic saddle-point systems. In the shear-thinning regime $1<p<2$, especially as $p\downarrow 1$ and $\delta\downarrow 0$, the decisive difficulty is the constitutive tangent: its ill-conditioning impairs Newton globalization and the preconditioning of the arising linear systems. We therefore develop a scalable monolithic modified Newton framework for tensor-product space-time finite elements in which the exact constitutive tangent in the Jacobian action is replaced by a better-conditioned surrogate. Picard and exact Newton serve as reference linearizations within the same algebraic framework. Scalability is achieved through matrix-free operator evaluation, a monolithic multigrid V-cycle preconditioner, order-preserving reduced Gauss-Radau time quadrature, and an inexact space-time Vanka smoother with single-time-point coefficient freezing in local patch matrices. We prove coercivity of the linearized viscous-Nitsche term in the uniformly elliptic regime $\nu_\infty>0$ and consistency of the reduced time quadrature. Numerical tests demonstrate robustness with respect to model parameters, nonlinear and linear iteration counts, and scalable parallel performance.

2603.28700 2026-03-31 cs.DS

Improved Approximation Algorithms for Multiway Cut by Large Mixtures of New and Old Rounding Schemes

Joshua Brakensiek, Neng Huang, Aaron Potechin, Uri Zwick

Comments 49 pages, full version of STOC 2026 paper

Details
English abstract

The input to the Multiway Cut problem is a weighted undirected graph, with nonnegative edge weights, and $k$ designated terminals. The goal is to partition the vertices of the graph into $k$ parts, each containing exactly one of the terminals, such that the sum of weights of the edges connecting vertices in different parts of the partition is minimized. The problem is APX-hard for $k\ge3$. The currently best known approximation algorithm for the problem for arbitrary $k$, obtained by Sharma and Vondrák [STOC 2014] more than a decade ago, has an approximation ratio of 1.2965. We present an algorithm with an improved approximation ratio of 1.2787. Also, for small values of $k \ge 4$ we obtain the first improvements in 25 years over the currently best approximation ratios obtained by Karger et al. [STOC 1999]. (For $k=3$ an optimal approximation algorithm is known.) Our main technical contributions are new insights on rounding the LP relaxation of Călinescu, Karloff, and Rabani [STOC 1998], whose integrality ratio matches Multiway Cut's approximability ratio, assuming the Unique Games Conjecture [Manokaran et al., STOC 2008]. First, we introduce a generalized form of a rounding scheme suggested by Kleinberg and Tardos [FOCS 1999] and use it to replace the Exponential Clocks rounding scheme used by Buchbinder et al. [STOC 2013] and by Sharma and Vondrák. Second, while previous algorithms use a mixture of two, three, or four basic rounding schemes, each from a different family of rounding schemes, our algorithm uses a computationally-discovered mixture of hundreds of basic rounding schemes, each parametrized by a random variable with a distinct probability distribution, including in particular many different rounding schemes from the same family. We give a completely rigorous analysis of our improved algorithms using a combination of analytical techniques and interval arithmetic.
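
For orientation, the LP relaxation of Călinescu, Karloff, and Rabani referred to above is usually stated as follows. The notation ($x_u$ for the vector assigned to vertex $u$, $t_i$ for the $i$-th terminal, $e_i$ for the $i$-th standard basis vector, $\Delta_k$ for the simplex) is the standard textbook form and is not taken from the abstract itself.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{\{u,v\} \in E} w(u,v)\,\tfrac{1}{2}\,\lVert x_u - x_v \rVert_1 \\
\text{subject to}\quad & x_u \in \Delta_k = \Bigl\{ x \in \mathbb{R}^k : x \ge 0,\ \sum_{i=1}^{k} x_i = 1 \Bigr\}
                         && \text{for all } u \in V, \\
                       & x_{t_i} = e_i && \text{for } i = 1, \dots, k.
\end{aligned}
```

A rounding scheme maps each fractional point $x_u$ to one of the $k$ simplex vertices, and the approximation ratio bounds how much the expected cut weight can exceed the LP objective.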

2603.28698 2026-03-31 cs.CL

EpiScreen: Early Epilepsy Detection from Electronic Health Records with Large Language Models

Shuang Zhou, Kai Yu, Zaifu Zhan, Huixue Zhou, Min Zeng, Feng Xie, Zhiyi Sha, Rui Zhang

Comments 24 pages, 5 figures, 4 tables

Details
English abstract

Epilepsy and psychogenic non-epileptic seizures often present with similar seizure-like manifestations but require fundamentally different management strategies. Misdiagnosis is common and can lead to prolonged diagnostic delays, unnecessary treatments, and substantial patient morbidity. Although prolonged video-electroencephalography is the diagnostic gold standard, its high cost and limited accessibility hinder timely diagnosis. Here, we developed a low-cost, effective approach, EpiScreen, for early epilepsy detection by utilizing routinely collected clinical notes from electronic health records. Through fine-tuning large language models on labeled notes, EpiScreen achieved an AUC of up to 0.875 on the MIMIC-IV dataset and 0.980 on a private cohort of the University of Minnesota. In a clinician-AI collaboration setting, EpiScreen-assisted neurologists outperformed unaided experts by up to 10.9%. Overall, this study demonstrates that EpiScreen supports early epilepsy detection, facilitating timely and cost-effective screening that may reduce diagnostic delays and avoid unnecessary interventions, particularly in resource-limited regions.

2603.28696 2026-03-31 cs.CV cs.AI

AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys

Comments Project page: https://haozheqi.github.io/adapt-token

Details
English abstract

Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance.
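
A rough sketch of how a response-entropy signal could drive budget allocation and early stopping is given below. The softmax-over-negative-entropy weighting, the `temperature` and `stop_threshold` parameters, and all function names are hypothetical; the abstract does not specify the actual mapping from entropy to token budget.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a response distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split a token budget across groups, favoring low-entropy groups.

    Assumed rule: weights are a softmax over negative entropy, so a group
    on which the model answers more confidently receives more tokens.
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    norm = sum(weights)
    raw = [total_budget * w / norm for w in weights]
    budgets = [int(r) for r in raw]               # floor of each share
    leftover = total_budget - sum(budgets)
    # Hand the remaining tokens to the largest fractional remainders.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets

def groups_to_process(entropies, stop_threshold):
    """Early stopping: process groups in order until one is certain enough."""
    processed = []
    for i, h in enumerate(entropies):
        processed.append(i)
        if h < stop_threshold:
            break
    return processed

group_entropies = [1.2, 0.9, 0.05, 1.0]
print(allocate_budget(group_entropies, total_budget=1000))
print(groups_to_process(group_entropies, stop_threshold=0.1))
```

In this toy run the third group is the first to fall below the threshold, so the fourth group is never processed, which mirrors the time saving attributed to AdaptToken-Lite.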

2603.28691 2026-03-31 cs.RO

DRIVE-Nav: Directional Reasoning, Inspection, and Verification for Efficient Open-Vocabulary Navigation

Maoguo Gao, Zejun Zhu, Zhiming Sun, Zhengwei Ma, Longze Yuan, Zhongjing Ma, Zhigang Gao, Jinhui Zhang, Suli Zou

Comments 8 pages, 4 figures. Project page: https://coolmaoguo.github.io/drive-nav-page/

Details
English abstract

Open-Vocabulary Object Navigation (OVON) requires an embodied agent to locate a language-specified target in unknown environments. Existing zero-shot methods often reason over dense frontier points under incomplete observations, causing unstable route selection, repeated revisits, and unnecessary action overhead. We present DRIVE-Nav, a structured framework that organizes exploration around persistent directions rather than raw frontiers. By inspecting encountered directions more completely and restricting subsequent decisions to still-relevant directions within a forward 240-degree view range, DRIVE-Nav reduces redundant revisits and improves path efficiency. The framework extracts and tracks directional candidates from weighted Fast Marching Method (FMM) paths, maintains representative views for semantic inspection, and combines vision-language-guided prompt enrichment with cross-frame verification to improve grounding reliability. Experiments on HM3D-OVON, HM3Dv2, and MP3D demonstrate strong overall performance and consistent efficiency gains. On HM3D-OVON, DRIVE-Nav achieves 50.2% SR and 32.6% SPL, improving on the previous best method by 1.9% SR and 5.6% SPL. It also delivers the best SPL on HM3Dv2 and MP3D, and real-world deployment on a physical humanoid robot demonstrates its effectiveness.

2603.28690 2026-03-31 cs.RO cs.CE

Vision-Based Robotic Disassembly Combined with Real-Time MFA Data Acquisition

Federico Zocco, Maria Pozzi, Monica Malvezzi

Comments Submitted

Details
English abstract

Stable and reliable supplies of rare-earth minerals and critical raw materials (CRMs) are essential for the development of the European Union. Since a large share of these materials enters the Union from outside, a valid option for CRM supply resilience and security is to recover them from end-of-use products. Hence, in this paper we present the preliminary phases of the development of real-time visual detection of desktop PC components running on edge devices to simultaneously achieve two goals. The first goal is to perform robotic disassembly of desktop PCs, where the adaptivity of learning-based vision can enable the processing of items with unpredictable geometry caused by accidental damage. We also discuss the robot end-effectors for different PC components, with the object contact points derivable from neural detector bounding boxes. The second goal is to provide, in an autonomous, highly granular, and timely fashion, the data needed to perform material flow analysis (MFA), since, to date, MFA often lacks the data needed to accurately study material stocks and flows. The second goal is achievable thanks to the recently proposed synchromaterials, which can generate both local and wide-area (e.g., national) material mass information in a real-time and synchronized fashion.