arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3007
2604.23688 2026-04-28 cs.CV

Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?

Ruiqing Sun, Xingshan Yao, Zhijing Wu, Tian Lan, Chenhao Cui, Huiyang Zhao, Jialing Shi, Chen Yang, Xianling Mao

详情
英文摘要

Proactive defense methods protect portrait images from unauthorized editing or talking face generation (TFG) by introducing pixel-level protective perturbations, and have already attracted increasing attention for privacy protection. In real-world scenarios, images inevitably undergo various transformations during cross-device display and dissemination--such as scale transformations and color compression--that directly alter pixel values. However, it remains unclear whether such pixel-level modifications affect the effectiveness of existing proactive defense methods that rely on pixel-level perturbations. To solve this problem, we conduct a systematic evaluation of representative proactive defenses under image transformation. The evaluated methods are selected to span different generation architectures such as diffusion and GAN-based models, as well as defense scopes covering both portrait and natural images, and are assessed using both qualitative and quantitative metrics for subjective and objective comparison. Experimental results indicate that defense methods based on pixel-level perturbations struggle to withstand common image transformations, posing a risk of defense failure in real-world applications. To further highlight this risk, we propose a simple yet effective purification framework by leveraging the vulnerabilities induced by real-world image transformations. Experimental results demonstrate that the proposed method can efficiently remove protective perturbations with low computational cost, highlighting previously overlooked risks to the research community.

2604.23685 2026-04-28 cs.CV

Reading in the Dark: Low-light Scene Text Recognition

Xuanshuo Fu, Lei Kang, Ernest Valveny, Dimosthenis Karatzas, Javier Vazquez-Corral

详情
英文摘要

Accurate text recognition in low-light environments is essential for intelligent systems in applications ranging from autonomous vehicles to smart surveillance. However, challenges such as poor illumination and noise interference remain underexplored. To address this gap, we introduce LSTR, a large-scale Low-light Scene Text Recognition dataset comprising 11,273 low-light images generated from well-lit datasets (ICDAR2015, IIIT5K, and WordArt), along with ESTR, which includes 60 real nighttime street-scene images in English and Spanish for exclusive evaluation. We explore two solution strategies: (1) employing Optical Character Recognition (OCR) models with fine-tuning and LoRA-based fine-tuning and (2) a joint training strategy that integrates a low-light image enhancement (LLIE) module with an OCR model. In particular, we propose a novel re-render LLIE (RLLIE) module, which demonstrates improved performance on real-world data. Through extensive experimentation, we analyze various training strategies and address a key research question: \emph{How bright is bright enough for effective scene text recognition?} Our results indicate that standalone LLIE or OCR models perform inadequately under low-light conditions, highlighting the advantages of specialized, jointly trained text-centric approaches. Additionally, we provide a comprehensive benchmark to support future research in robust low-light scene text recognition. https://huggingface.co/datasets/lumimusta/Low-light_Scene_Text_Dataset.

2604.23683 2026-04-28 cs.CV

Learning to Decipher from Pixels -- A Case Study of Copiale

Lei Kang, Giuseppe De Gregorio, Raphaela Heil, Alicia Fornés, Beáta Megyesi

Comments Accepted to HistoCrypt 2026

详情
英文摘要

Historical encrypted manuscripts require both paleographic interpretation of cipher symbols and cryptanalytic recovery of plaintext. Most existing computational workflows rely on a transcription-first paradigm, in which handwritten symbols are transcribed prior to decipherment. This intermediate step is labor-intensive, error-prone, and not always aligned with the goal of direct plaintext recovery. We propose an end-to-end, transcription-free approach that directly maps handwritten cipher images to plaintext. Using the Copiale cipher as a case study, we introduce the first text-line-level dataset pairing cipher images with German plaintext. We show that pretraining on generic handwriting data followed by cipher-specific fine-tuning substantially improves decipherment accuracy. Our results demonstrate that transcription-free image-to-plaintext decipherment is both feasible and effective for historical substitution ciphers, offering a simplified and scalable alternative to traditional pipelines. https://github.com/leitro/Decipher-from-Pixels-Copiale

2604.23681 2026-04-28 cs.LG cs.CL stat.ML

Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

Giansalvo Cirrincione

Comments 36 pages, 8 figures, 1 table. Submitted to Artificial Intelligence (Elsevier)

详情
英文摘要

A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.

2604.23678 2026-04-28 cs.AI

Transferable Human Mobility Network Reconstruction with neuroGravity

Jinming Yang, Shaoyu Huang, Zongyuan Huang, Yaohui Jin, Xiaokang Yang, Marta C. Gonzalez, Yanyan Xu

详情
英文摘要

Accurate modeling of human mobility is critical for tackling urban planning and public health challenges. In undeveloped regions, the absence of comprehensive travel surveys necessitates reconstructing mobility networks from publicly available data. Here we develop neuroGravity, a physics-informed deep learning model that reliably reconstructs mobility flows from limited observations and transfers to unobserved cities. Using only urban facility and population distributions, we find that neuroGravity's regional representations strongly correlate with socioeconomic and livability status, offering scalable proxies for costly surveys. Furthermore, we uncover that spatial income segregation plays a key role in model transferability: mobility networks are most reliably reconstructed when target cities share similar segregation levels with the source. We design an index to quantify this segregation and accurately predict transferability. Finally, we generate mobility flow proxies for over 1,200 cities worldwide, highlighting neuroGravity's potential to mitigate critical data shortages in resource-limited, underdeveloped areas.

2604.23674 2026-04-28 cs.AI

Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

Zihao Wu, Steven Xu, Bowen Chen, Shaowen Wan, Yiwei Li, Wei Ruan, Yanjun Lyu, Siyuan Li, Dajiang Zhu, Tianming Liu, Lin Zhao

详情
英文摘要

With the emergence of large language models (LLMs) and AI agent frameworks, the human-AI co-work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low-resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi-step analytical pipelines. In this paper, we introduce Vibe Medicine, a co-work paradigm in which clinicians and researchers direct skill-augmented AI agents through natural language to execute complex, multi-step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain-informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open-source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end-to-end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over-reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent-assisted research that advances research and technological equity and reduces health care resource disparities.

2604.23670 2026-04-28 cs.CV

Deploy DINO with Many-to-Many Association

Haodong Jiang, Mingzhe Li, Junfeng Wu

详情
英文摘要

Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, which requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of otherwise intractable likelihood calculation,and inspires us to propose a faster and finer-grained robust mechanism, termed as Harmonic Consensus Maximization (HCM). Take camera pose estimation as an exemplifying downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when mated with m-to-m association and the HCM mechanism.

2604.23665 2026-04-28 cs.CV

HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA

Francesco Dibitonto, Cigdem Beyan, Vittorio Murino

Comments This is the preprint version of the paper. The final version has been accepted for publication in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026)

详情
英文摘要

Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC's training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC's task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC

2604.23662 2026-04-28 cs.CV

SolarFCD: A Large-Scale Dataset and Benchmark for Solar Fault Classification in Photovoltaic Systems

Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Arooj Zaib, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

详情
英文摘要

The increasing global deployment of solar photovoltaic (PV) systems needs robust, scalable, and automated inspection technologies capable of detecting a wide range of panel flaws under a variety of operating situations. The lack of large-scale, multi-modal, publicly available annotated datasets is a major obstacle preventing advancement in this field. We introduce SolarFCD, an extensive dataset of solar panel defects created by methodically combining and reconciling three publicly accessible datasets covering two imaging modalities: RGB/Drone images and Thermal Infrared. The dataset consist of 4,435 images arranged under four unified defect classes such as: healthy images, Surface Obstruction, structural fault, and electrical fault. The dataset was divided into training, validation, and test splits at an 80:10:10 ratio through methodical label mapping, near-duplicate removal, and targeted augmentation of minority classes. Sixteen classification architectures from five design families were trained and assessed on the dataset to provide repeatable benchmark baselines. With an accuracy of 86.68%, precision of 88.65%, recall of 88.62%, and F1-score of 88.17%, ResNet101V2 performed the best overall. Per-class results showed balanced detection across all four defect categories within a narrow performance band of less than 1.2 percentage points. To promote open and repeatable research in automated PV inspection and solar energy operations and maintenance, the dataset, annotation files, and baseline code are made openly available.

2604.23655 2026-04-28 cs.CV

BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

Guoxi Huang, Ruirui Lin, Yini Li, David R. Bull, Nantheera Anantrasirichai

详情
英文摘要

Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available on line at https://github.com/russellllaputa/BVI-Mamba.

2604.23653 2026-04-28 cs.CV cs.AI

ResAF-Net: An Anchor-Free Attention-Based Network for Tree Detection and Agricultural Mapping in Palestine

Rabee Al-Qasem

详情
英文摘要

Reliable agricultural data is essential for food security, land-use planning, and economic resilience, yet in Palestine, such data remains difficult to collect at scale because of fragmented landscapes, limited field access, and restrictions on aerial monitoring. This paper presents ResAF-Net, a satellite-based tree detection framework designed for large-scale agricultural monitoring in resource-constrained settings. The proposed architecture combines a ResNet-50 encoder, Atrous Spatial Pyramid Pooling (ASPP), a feature-fusion stage, a multi-head self-attention refinement module, and an anchor-free FCOS detection head to improve tree localization in dense and heterogeneous scenes. Trained on the MillionTrees benchmark, the model achieved 82% Recall, 63.03% mAP@0.50, and 35.47% mAP@0.50:0.95 on the validation split, indicating strong sensitivity to tree presence while maintaining competitive localization quality. Beyond benchmark evaluation, we implemented the model within a web-based GIS application integrated with Palestinian cadastral data from GeoMolg, enabling tree analysis at scene, parcel, and community levels. This deployment demonstrates the practical feasibility of AI-assisted agricultural inventorying in Palestine. It provides a foundation for data-driven monitoring, reporting, and future species-level analysis of Mediterranean tree crops.

2604.23651 2026-04-28 cs.CV

Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation

Navid Aslankhani Khameneh, Marco Carletti, Cigdem Beyan

Comments This is the preprint version of the paper. The final version has been accepted for publication in the Proceedings of the 20th IEEE International Conference on Automatic Face and Gesture Recognition (IEEE FG 2026)

详情
英文摘要

Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose- LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust inbed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/ GeoDiffPose.

2604.23648 2026-04-28 cs.RO

Safe Navigation in Unknown and Cluttered Environments via Direction-Aware Convex Free-Region Generation

Zhicheng Song, Yongjian Li, Kai Chen, Yulin Li, Fan Shi, Jun Ma

详情
英文摘要

Convex free regions provide a structured and optimization-friendly representation of collision-free space for robot navigation in unknown and cluttered environments. However, existing methods typically enlarge local collision-free regions mainly according to surrounding obstacle geometry. In cluttered environments, such strategies may fail to generate regions that both accommodate robot geometry and preserve traversable extension along candidate motion directions, thereby limiting downstream traversal, especially in narrow passages. Even when such a region is available, safe motion generation remains challenging, because safety checking at discretized trajectory samples does not guarantee continuously collision-free motion when robot geometry is modeled explicitly. To address these issues, we propose a navigation framework that jointly incorporates candidate motion directions and robot geometry into convex free-region generation, and achieves continuously collision-free motion through continuous-safe trajectory generation. Within each region, the framework performs geometry-aware target pose selection and trajectory generation, together with Lipschitz-based continuous safety certification and local refinement. The resulting free regions and candidate motions are maintained in a region-based graph to support incremental planning. Quantitative results in cluttered 2D navigation scenarios show that the proposed method generates free regions better aligned with downstream traversal and enables reliable collision-free navigation, while additional 3D and real-world experiments on a quadrupedal robot and a UAV demonstrate the extensibility and practical applicability of the framework. The open-source project can be found at https://github.com/ZhichengSong6/FRGraph.

2604.23646 2026-04-28 cs.AI cs.CR

Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

Rong Xiang

详情
英文摘要

Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects semantically divergent intents below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured $K \times I \times P$ threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.

2604.23644 2026-04-28 cs.CV cs.AI

RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing

Pritesh Jha

详情
英文摘要

Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh-2711/RaV-IDP for experimentation and use.

2604.23641 2026-04-28 cs.CV

VDLF-Net: Variational Feature Fusion for Adaptive and Few-Shot Visual Learning

Jiawei Yan

详情
英文摘要

This paper introduces VDLF-Net, which attaches a compact VAE to a multi-scale CNN backbone. Latent vectors and softmax-gate support the backbone feature maps, while $\ell_2$-normalized embeddings from the gated maps contribute toward supervised classification or episodic few-shot prediction. Under standard CIFAR-100 and Mini-ImageNet protocols, VDLF-Net demonstrates an improved performance over ResNet-50 Enhanced, VGG-16, Prototypical Networks, and Matching Networks. Extensive ablations show that removing the fine-resolution scale has the greatest impact on VDLF-Net's performance. At the same time, KL and reconstruction at the chosen $α$ pose a minor performance reduction, demonstrating that performance gains over classical episodic baselines mainly originate from the full VDLF-Net architecture and training strategy.

2604.23636 2026-04-28 cs.CV

Discriminator-Guided Adaptive Diffusion for Source-Free Test-Time Adaptation under Image Corruptions

Francesco Olivato, Cigdem Beyan, Vittorio Murino

Comments This is the preprint (submitted version) of the paper. The final version has been accepted for publication in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026)

详情
英文摘要

In this work, we study Source-Free Unsupervised Domain Adaptation under corruption-induced domain shifts, where performance degradation is caused by natural image corruptions that go beyond additive noise, including blur, weather effects, and digital artifacts. We propose a diffusion-based, input-level adaptation framework that operates entirely at test time and keeps all source-trained models frozen, explicitly targeting robustness to corrupted target inputs. Our method leverages a source-trained diffusion model as a generative prior and introduces a discriminator-guided adaptive diffusion strategy that dynamically controls the amount of perturbation applied to each test sample. Rather than relying on a fixed diffusion depth, the discriminator determines, on a per-image basis, when sufficient forward diffusion has been applied to suppress corruption-specific artifacts, with each corruption type effectively defining a distinct target domain. This adaptive stopping mechanism applies only the necessary amount of noise to remove domainspecific corruption while preserving class-discriminative structure. The reverse diffusion process then reconstructs a source-aligned image, optionally stabilized through structural guidance, which is classified using a frozen source-trained classifier. We evaluate the proposed approach across a broad spectrum of corruption-induced target domains, covering 15 diverse corruption types, and demonstrate more balanced robustness with competitive or improved performance across non-noise corruptions. Additional analyses reveal how the adaptive diffusion schedule responds to different corruption characteristics, highlighting the practicality, generality, and robustness of the proposed framework. The code is publicly available at https://github.com/fmolivato/dgadiffusion/.

2604.23633 2026-04-28 cs.AI

Causal Discovery as Dialectical Aggregation: A Quantitative Argumentation Framework

Sheng Wei, Yulin Chen, Beishui Liao

Comments Accepted at the 23rd International Conference on Principles of Knowledge Representation and Reasoning (KR 2026). This arXiv version includes supplementary material and additional implementation details

详情
英文摘要

Constraint-based causal discovery is brittle in finite-sample regimes because erroneous conditional-independence (CI) decisions can cascade into substantial structural errors. We propose Quantitative Argumentation for Causal Discovery (QACD), a semantics-driven framework that represents CI outcomes as graded, defeasible arguments rather than irreversible constraints. QACD maps statistical test outcomes to argument strengths and aggregates conflicting evidence through connectivity-mediated witness propagation, producing a fixed-point acceptability labeling over candidate adjacencies. Experiments on standard benchmark Bayesian networks suggest that QACD improves structural coherence and interventional reliability in several noisy or inconsistent CI regimes, while remaining competitive with classical constraint-based, hybrid, and prior argumentation-based baselines.

2604.23632 2026-04-28 cs.CV cs.MM cs.SD

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

Chunyu Li, Jiaye Li, Ruiqiao Mei, Haoyuan Xia, Hao Zhu, Jingdong Wang, Siyu Zhu

详情
英文摘要

Real-time text-driven joint audio-video avatar generation requires jointly synthesizing portrait video and speech with high fidelity and precise synchronization, yet existing audio-visual diffusion models remain too slow for interactive use and often degrade noticeably after aggressive acceleration. We present Hallo-Live, a streaming framework for joint audio-visual avatar generation that combines asynchronous dual-stream diffusion with human-centric preference-guided distillation. To reduce articulation lag in causal generation, we introduce Future-Expanding Attention, which allows each video block to access synchronous audio together with a short horizon of future phonetic cues. To mitigate the quality loss of few-step distillation, we further propose Human-Centric Preference-Guided DMD (HP-DMD), which reweights training samples using rewards from visual fidelity, speech naturalness, and audio-visual synchronization. On two NVIDIA H200 GPUs, Hallo-Live runs at 20.38 FPS with 0.94 seconds latency, yielding 16.0x higher throughput and 99.3x lower latency than the teacher model Ovi. Despite this speedup, it retains strong generation quality, reaching comparable VideoAlign overall score and Sync Confidence score while outperforming other accelerated baselines in the overall quality-efficiency trade-off. Qualitative results further show robust generalization across photorealistic, multi-speaker, and stylized scenarios. To the best of our knowledge, Hallo-Live is the first framework to combine streaming dual-stream diffusion with preference-guided distillation for real-time, text-driven audio-visual generation.

2604.23627 2026-04-28 cs.CL cs.LG

Neural Grammatical Error Correction for Romanian

Teodor-Mihai Cotet, Stefan Ruseti, Mihai Dascalu

详情
英文摘要

Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract edits needed for evaluation. Multiple neural models were experimented, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger

2604.23626 2026-04-28 cs.CL

GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

Tao Feng, Haozhen Zhang, Zijie Lei, Peixuan Han, Jiaxuan You

详情
Journal ref
ICLR 2026
英文摘要

LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.

2604.23623 2026-04-28 cs.AI

Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

Zichuan Fu, Xian Wu, Guojing Li, Yejing Wang, Yijun Chen, Zihao Zhao, Yixuan Luo, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

Comments ACL 2026 Findings

详情
英文摘要

Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.

2604.23622 2026-04-28 cs.CV

A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification

Peng Chen, Wenxuan He, Feng Qian, Guangyao Shi, Jingwen Yan

详情
英文摘要

In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at https://github.com/chenpeng052/SCT-Net.git.

2604.23620 2026-04-28 cs.RO

Move-Then-Operate: Behavioral Phasing for Human-Like Robotic Manipulation

Haoming Xu, Lei Lei, Jie Gu, Chu Tang, Jingmin Chen, Ruiqi Wang

Comments 15 pages, 10 figures

详情
英文摘要

We present Move-Then-Operate, a Vision language action framework that explicitly decouples robotic manipulation into two distinct behavioral phases: coarse relocation (move) and contact-critical interaction (operate). Unlike monolithic policies that conflate these heterogeneous regimes, our architecture employs a dual-expert policy routed by a learnable phase selector, introducing a structural inductive bias that isolates phase-specific dynamics. Phase labels are automatically generated via an MLLM-based pipeline conditioned on lightweight contextual cues such as end-effector velocity and subtask decomposition to ensure alignment with human motor patterns. Evaluated on the RoboTwin2 benchmark, our method achieves an average success rate of $68.9\%$, outperforming the monolithic $π_0$ baseline by $24\%$. It matches or exceeds models trained on $10\times$ more data and reaches peak performance in $40\%$ fewer training steps, demonstrating that architectural disentanglement of move and operate phases is a highly effective and efficient strategy for mastering high-precision manipulation.

2604.23615 2026-04-28 cs.CL cs.AI

Applications of the Transformer Architecture in AI-Assisted English Reading Comprehension

Ping Li

Comments 9 pages, 5 figures, Conference paper for International Conference on Big Data Applications in Education and Engineering {ICBDAEE 2026)

详情
Journal ref
Neural Computing and Applications.35(2025)18603-18622
英文摘要

This paper studies interpretable and fair artificial intelligence architectures for understanding English reading. Introduced transformer-based models, integrating advanced attention mechanisms and gradient-based feature attribution. The model's lack of interpretability, reduction of algorithmic bias, and unreliable performance in learning environments are the current issues faced in natural language teaching. A unified technical pipeline has been constructed, including adversarial bias correction methods, token-level attribution analysis, and multi-head attention heatmap visualization. Experimental validation was conducted using a large-scale labeled English reading comprehension dataset, and the data partitioning scheme and parameter optimization procedures have been determined. The method significantly outperforms the state-of-the-art models for this task in terms of accuracy and macro-average F1 score; in some aspects, it even surpasses or closely matches the results of human evaluations. In multi-week user experiments, the explainable transformer improved teachers' trust and operability in feedback-based assessments within the scoring system. The proposed method aims to ensure high prediction accuracy and fairness for different learners. This indicates that it is a real-world educational application based on artificial intelligence with a focus on interpretation. Improve the user experience in AI-assisted reading comprehension systems, counteract biases, and enhance the details explained by transformers.

2604.23612 2026-04-28 cs.CV

Comparative Study of Weighted and Coupled Second- and Fourth-Order PDEs for Image Despeckling in Grayscale, Color, SAR, and Ultrasound

Manish Kumar, Rajendra K. Ray

详情
英文摘要

Partial Differential Equation (PDE)-based approaches have gained significant attention in image despeckling due to their strong capability to preserve structural details while suppressing noise. However, conventional second-order PDE models tend to generate blocky artifacts, whereas higher-order models often introduce speckle patterns. To resolve it, this paper proposes and comparatively analyzes two advanced PDE-based frameworks designed for speckle noise suppression while preserving the fine edges. The first model introduces a novel weighted formulation that combines second and fourth-order PDEs through a weighting parameter. The second-order diffusion coefficient employs grayscale and gradient-based indicators, while the fourth-order term is guided solely by a Laplacian-based indicator. The second model constructs a coupled PDE framework, where independent fourth and second-order components are explicitly solved in an iterative manner. In this coupled structure, each diffusion coefficient is defined separately to enhance adaptability in varying image regions. Both models are implemented using the explicit finite difference method. The proposed techniques are extensively evaluated on a variety of datasets, including standard grayscale, color, Synthetic Aperture Radar (SAR), and ultrasound images. Comparative experiments with the existing Telegraph Diffusion Model (TDM) and Fourth-Order Telegraph Diffusion Model (TDFM) demonstrate the superiority of the proposed approaches in reducing speckle noise while effectively preserving fine image structures and edges. Quantitative evaluations using PSNR, SSIM and Speckle Index metrics confirm that the proposed models produce higher image quality and enhanced visual perception. Overall, the presented PDE-based formulations provide a reliable and efficient framework for image despeckling in both natural and medical imaging.

2604.23609 2026-04-28 cs.RO

Tube Diffusion Policy: Reactive Visual-Tactile Policy Learning for Contact-rich Manipulation

Teng Xue, Alberto Rigo, Bingjian Huang, Jiayi Shen, Zhengtong Xu, Nick Colonnese, Amirhossein H. Memar

详情
英文摘要

Contact-rich manipulation is central to many everyday human activities, requiring continuous adaptation to contact uncertainty and external disturbances through multi-modal perception, particularly vision and tactile feedback. While imitation learning has shown strong potential for learning complex manipulation behaviors, most existing approaches rely on action chunking, which fundamentally limits their ability to react to unforeseen observations during execution. This limitation becomes especially critical in contact-rich scenarios, where physical uncertainty and high-frequency tactile feedback demand rapid, reactive control. To address this challenge, we propose Tube Diffusion Policy (TDP), a novel reactive visual-tactile policy learning framework that bridges diffusion-based imitation learning with tube-based feedback control. By leveraging the expressive power of generative models, TDP learns an observation-conditioned feedback flow around nominal action chunks, forming an action tube that enables fast and adaptive reactions during execution. We evaluate TDP on the widely used Push-T benchmark and three additional challenging visual-tactile dexterous manipulation tasks. Across all benchmarks, TDP consistently outperforms state-of-the-art imitation learning baselines. Two real-world experiments further validate its robust reactivity under contact uncertainty and external disturbances. Moreover, the step-wise correction mechanism enabled by action tube significantly reduces the required denoising steps, making TDP well suited for real-time, high-frequency feedback control in contact-rich manipulation.

2604.23606 2026-04-28 cs.LG math-ph math.MP

Hamiltonian Graph Inference Networks: Joint structure discovery and dynamics prediction for lattice Hamiltonian systems from trajectory data

Ru Geng, Panayotis Kevrekidis, Yixian Gao, Hong-Kun Zhang, Jian Zu

Comments 18 pages, 8 figures

详情
英文摘要

Lattice Hamiltonian systems underpin models across condensed matter, nonlinear optics, and biophysics, yet learning their dynamics from data is obstructed by two unknowns: the interaction topology and whether node dynamics are homogeneous. Existing graph-based approaches either assume the graph is given or, as in $α$-separable graph Hamiltonian network, infer it only for separable Hamiltonians with homogeneous node dynamics. We introduce the Hamiltonian Graph Inference Network (HGIN), which jointly recovers the interaction graph and predicts long-time trajectories from state data alone, for both separable and non-separable Hamiltonians and under heterogeneous node dynamics. HGIN couples a structure-learning module -- a learnable weighted adjacency matrix trained under a Hamilton's-equations loss -- with a trajectory-prediction module that partitions edges into physically distinct subgraphs via $k$-means clustering, assigning each subgraph its own encoder and thereby breaking the parameter-sharing bottleneck of conventional GNNs. On three benchmarks -- a Klein--Gordon lattice with long-range interactions and two discrete nonlinear Schrödinger lattices (homogeneous and heterogeneous) -- HGIN reduces long-time energy prediction error and trajectory prediction error by six to thirteen orders of magnitude relative to baselines. A symmetry argument on the Hamiltonian loss further shows that the learned weights encode the parity of the underlying pair potential, yielding an interpretable readout of the system's interaction structure.

2604.23605 2026-04-28 cs.AI

Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

Zhiqi Lv, Duofan Tu, Jun Li, Mingyue Zhao, Heqin Zhu, Wenliang Li, Shaohua Kevin Zhou

详情
英文摘要

The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations present in their processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician's cognitive trajectory that consists of "Memory Anchoring", "Navigation" and "Verification" phases. DxChain introduces three key methodological innovations to elicit the potential of LLM: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look ahead planning and resource aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing "Angel-Devil" adversarial debates to resolve complex evidence conflicts. Evaluated on two real world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performances in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at https://anonymous.4open.science/r/Dx-Chain.

2604.23604 2026-04-28 cs.CV cs.RO

Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation

Simone Mosco, Daniel Fusaro, Alberto Pretto

Comments This paper has been accepted at the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

详情
英文摘要

Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects is crucial in real-world environments, as done in Anomaly Segmentation. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To cover this lack, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido-page/.