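arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.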

2603.11631 2026-03-13 cs.AI cs.CV

VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim

Comments 30 pages, 21 figures, EACL 2026 Findings

Details

English abstract

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.

2603.11627 2026-03-13 cs.CV

Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography

Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Feiyang Xiao, Yuchen Liu, Xiaohui Zhang, Hongwei Zhang, Shuqi Wang, Gang Feng, Liling Peng, Xin Gao, Yuanfan Xu, Yuan Qi, Kuangyu Shi, Hong Zhang, Yuan Cheng, Mei Tian, Zixin Hu

2603.11625 2026-03-13 cs.CV cs.AI

MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models

Shengyuan Liu, Zanting Ye, Yunrui Lin, Chen Hu, Wanting Geng, Xu Han, Bulat Ibragimov, Yefeng Zheng, Yixuan Yuan

Comments 10 pages

2603.11623 2026-03-13 cs.AI

The Density of Cross-Persistence Diagrams and Its Applications

Alexander Mironenko, Evgeny Burnaev, Serguei Barannikov

Comments 19 pages, 20 figures

Details

DOI: 10.1109/ACCESS.2026.3669415
Journal ref: IEEE Access, vol. 14, pp. 34320-34338, 2026

English abstract

Topological Data Analysis (TDA) provides powerful tools to explore the shape and structure of data through topological features such as clusters, loops, and voids. Persistence diagrams are a cornerstone of TDA, capturing the evolution of these features across scales. While effective for analyzing individual manifolds, persistence diagrams do not account for interactions between pairs of them. Cross-persistence diagrams (cross-barcodes), introduced recently, address this limitation by characterizing relationships between topological features of two point clouds. In this work, we present the first systematic study of the density of cross-persistence diagrams. We prove its existence, establish theoretical foundations for its statistical use, and design the first machine learning framework for predicting cross-persistence density directly from point cloud coordinates and distance matrices. Our statistical approach enables the distinction of point clouds sampled from different manifolds by leveraging the linear characteristics of cross-persistence diagrams. Interestingly, we find that introducing noise can enhance our ability to distinguish point clouds, uncovering its novel utility in TDA applications. We demonstrate the effectiveness of our methods through experiments on diverse datasets, where our approach consistently outperforms existing techniques in density prediction and achieves superior results in point cloud distinction tasks. Our findings contribute to a broader understanding of cross-persistence diagrams and open new avenues for their application in data analysis, including potential insights into time-series domain tasks and the geometry of AI-generated texts. Our code is publicly available at https://github.com/Verdangeta/TDA_experiments

2603.11620 2026-03-13 cs.LG

Personalized Federated Learning via Gaussian Generative Modeling

Peng Hu, Jianwei Ma

Details

English abstract

Federated learning has emerged as a paradigm to train models collaboratively on inherently distributed client data while safeguarding privacy. In this context, personalized federated learning tackles the challenge of data heterogeneity by equipping each client with a dedicated model. A prevalent strategy decouples the model into a shared feature extractor and a personalized classifier head, where the latter actively guides the representation learning. However, previous works have focused on classifier head-guided personalization, neglecting the potential personalized characteristics in the representation distribution. Building on this insight, we propose pFedGM, a method based on Gaussian generative modeling. The approach begins by training a Gaussian generator that models client heterogeneity via weighted re-sampling. A balance between global collaboration and personalization is then struck by employing a dual objective: a shared objective that maximizes inter-class distance across clients, and a local objective that minimizes intra-class distance within them. To achieve this, we decouple the conventional Gaussian classifier into a navigator for global optimization, and a statistic extractor for capturing distributional statistics. Inspired by the Kalman gain, the algorithm then employs a dual-scale fusion framework at global and local levels to equip each client with a personalized classifier head. In this framework, we model the global representation distribution as a prior and the client-specific data as the likelihood, enabling Bayesian inference for class probability estimation. The evaluation covers a comprehensive range of scenarios: heterogeneity in class counts, environmental corruption, and multiple benchmark datasets and configurations. pFedGM achieves superior or competitive performance compared to state-of-the-art methods.
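The Kalman-gain-inspired fusion of a global prior with client-specific data can be illustrated with the textbook Gaussian update below. This is a minimal one-dimensional sketch and not pFedGM itself: the function name `fuse_gaussian`, its parameters, and the scalar setting are all assumptions made for illustration, since the abstract does not give the method's equations.

```python
def fuse_gaussian(prior_mean, prior_var, local_mean, local_var, n_local):
    """Fuse a global Gaussian prior with a client's local sample statistics.

    Hypothetical helper: the standard conjugate update for a Gaussian mean,
    written in Kalman-gain form. Not the update rule from the paper.
    """
    if prior_var <= 0 or local_var <= 0:
        raise ValueError("variances must be positive")
    if n_local == 0:
        # No local evidence: the client falls back to the global prior.
        return prior_mean, prior_var
    # Variance of the local sample mean shrinks with the number of samples.
    obs_var = local_var / n_local
    # Gain near 1 trusts the local data; gain near 0 trusts the global prior.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (local_mean - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var
```

With few local samples the fused estimate stays close to the global prior, and with many samples it moves toward the client's own statistics, which is the trade-off between collaboration and personalization that the abstract describes.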

2603.11617 2026-03-13 cs.CV

Noise-aware few-shot learning through bi-directional multi-view prompt alignment

Lu Niu, Cheng Xue

2603.11616 2026-03-13 cs.CV

SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation

Muyi Sun, Yifan Gao, Ziang Jia, Xingqun Qi, Qianli Zhang, Qian Liu, Tianzheng Deng

Comments 5 pages, 5 figures. Accepted to IEEE ICASSP 2026

2603.11611 2026-03-13 cs.LG cs.CL

Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE

Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

2603.11607 2026-03-13 cs.CV

DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling

Tong Zhao, Mingkun Lei, Liangyu Yuan, Yanming Yang, Chenxi Song, Yang Wang, Beier Zhu, Chi Zhang

Comments Code Link: https://github.com/Westlake-AGI-Lab/DyWeight

2603.11606 2026-03-13 cs.CV

Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints

Lijun Guo, Haoyu Zhao, Xingyue Zhao, Rong Fu, Linghao Zhuang, Siteng Huang, Zhongyu Li, Hua Zou

Comments 26 pages, 12 figures

2603.11605 2026-03-13 cs.CV

LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference

Junkun Jiang, Ho Yin Au, Jingyu Xiang, Jie Chen

Comments Accepted by CVPR 2026. Supplementary material included. Project page: https://jjkislele.github.io/LaMoGen/

2603.11603 2026-03-13 cs.LG

AutoScout: Structured Optimization for Automating ML System Configuration

Jimmy Shong, Yuhan Ding, Yihan Jiang, Liheng Jing, Haonan Chen, Gaokai Zhang, Aditya Akella, Fan Lai

2603.11598 2026-03-13 cs.LG cs.AI

Survival Meets Classification: A Novel Framework for Early Risk Prediction Models of Chronic Diseases

Shaheer Ahmad Khan, Muhammad Usamah Shahid, Muddassar Farooq

2603.11597 2026-03-13 cs.CL cs.AI

Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai, Yasuto Fujimoto, Yoshiyuki Takahara, Atsushi Ohara, Hirohiko Miyake, Genichiro Ishii

Comments 9 pages (including bibliography), 2 figures, 6 tables

2603.11594 2026-03-13 cs.AI

Leveraging Large Language Models and Survival Analysis for Early Prediction of Chemotherapy Outcomes

Muhammad Faisal Shahid, Asad Afzal, Abdullah Faiz, Muhammad Siddiqui, Arbaz Khan Shehzad, Fatima Aftab, Muhammad Usamah Shahid, Muddassar Farooq

2603.11593 2026-03-13 cs.CV

WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, Yu-Gang Jiang

2603.11589 2026-03-13 cs.SD cs.AI

Toward Complex-Valued Neural Networks for Waveform Generation

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

Comments ICLR 2026 (accepted)

2603.11578 2026-03-13 cs.CL

Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

Roman Koshkin, Jeon Haesung, Lianbo Liu, Hao Shi, Mengjie Zhao, Yusuke Fujita, Yui Sudo

Comments 16 pages, 6 figures

2603.11565 2026-03-13 cs.LG

CAETC: Causal Autoencoding and Treatment Conditioning for Counterfactual Estimation over Time

Nghia D. Nguyen, Pablo Robles-Granda, Lav R. Varshney

2603.11564 2026-03-13 cs.CL

Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Zhenxu Tian, Yi Su, Juntao Li, Min Zhang

2603.11563 2026-03-13 cs.CV cs.RO

SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning

Yuyuan Yang, Junkun Hong, Hongrong Wang, Honghao Cai, Xunpeng Ren, Ge Wang, Mingcong Lei, Shenhao Yan, Jiahao Yang, Chengsi Yao, Xi Li, Yiming Zhao, Yatong Han, Jinke Ren

2603.11559 2026-03-13 cs.AI cs.HC

AI Knows What's Wrong But Cannot Fix It: Helicoid Dynamics in Frontier LLMs Under High-Stakes Decisions

Alejandro R Jadad

Comments 22 pages, 2 tables, 1 appendix

2603.11557 2026-03-13 cs.CV

TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision

Robinson Umeike, Cuong Pham, Ryan Hausen, Thang Dao, Shane Crawford, Tanya Brown-Giammanco, Gerard Lemson, John van de Lindt, Blythe Johnston, Arik Mitschang, Trung Do

Details

English abstract

We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet
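To make the idea of soft ordinal targets concrete, here is a minimal sketch under stated assumptions. The abstract does not specify how TornadoNet builds its targets or defines MAOE, so both functions are hypothetical: the target decays exponentially with distance from the true damage level, and MAOE is assumed to be the mean absolute difference between predicted and true level indices.

```python
import math


def soft_ordinal_target(true_level, num_levels, temperature=1.0):
    """Return a probability vector that peaks at true_level.

    Assumed form: weight exp(-|k - true_level| / temperature), normalized.
    Neighbouring severity levels get more mass than distant ones.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 <= true_level < num_levels:
        raise ValueError("true_level out of range")
    weights = [math.exp(-abs(k - true_level) / temperature)
               for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_absolute_ordinal_error(true_levels, predicted_levels):
    """Average absolute gap between predicted and true level indices (assumed MAOE)."""
    if len(true_levels) != len(predicted_levels) or not true_levels:
        raise ValueError("inputs must be non-empty and of equal length")
    gaps = [abs(t - p) for t, p in zip(true_levels, predicted_levels)]
    return sum(gaps) / len(gaps)
```

Under this assumed definition, a prediction that is off by one severity level is penalized less than one that is off by three, which a plain one-hot target would not distinguish.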

2603.11556 2026-03-13 cs.CV

Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception

Xinyu Nan, Ning Wang, Yuyao Zhai, Mei Yang

2603.11554 2026-03-13 cs.CV cs.AI cs.RO

MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Lirong Che, Shuo Wen, Shan Huang, Chuang Wang, Yuzhe Yang, Gregory Dudek, Xueqian Wang, Jian Su

2603.11546 2026-03-13 cs.LG

Multi-Task Anti-Causal Learning for Reconstructing Urban Events from Residents' Reports

Liangkai Zhou, Susu Xu, Shuqi Zhong, Shan Lin

2603.11543 2026-03-13 cs.CV

Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting

Tingxuan Huang, Haowei Zhu, Jun-hai Yong, Hao Pan, Bin Wang

2603.11542 2026-03-13 cs.CV cs.AI

ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation

Md Jahidul Islam

Details

DOI: 10.2139/ssrn.6396531

English abstract

The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data -- specifically in the one-shot regime -- is often hindered by a significant "Stability-Plasticity" dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.
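For readers unfamiliar with the estimator the abstract refers to, the sketch below shows a generic Nadaraya-Watson classifier averaged over several RBF bandwidths. It is not the ReHARK pipeline (it omits the hybrid prior, bridging, rectification, and RKHS regularization), and the function names, the default bandwidths, and the plain averaging over scales are all assumptions for illustration.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def nadaraya_watson_scores(query, support_feats, support_labels,
                           num_classes, bandwidths=(0.5, 1.0, 2.0)):
    """Kernel-weighted class scores for one query, averaged over bandwidths.

    Each bandwidth yields a normalized weighted vote of the support labels;
    the votes from all scales are averaged (a hypothetical ensemble rule).
    """
    scores = [0.0] * num_classes
    for h in bandwidths:
        weights = [rbf(query, feat, h) for feat in support_feats]
        total = sum(weights)
        if total == 0.0:
            # All kernel weights underflowed at this scale; skip it.
            continue
        for w, label in zip(weights, support_labels):
            scores[label] += (w / total) / len(bandwidths)
    return scores
```

Because each score is a local weighted average of nearby support labels, the estimate depends heavily on the few cached samples close to the query, which is the locality that the abstract identifies as a source of boundary bias.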

2603.11535 2026-03-13 cs.AI cs.CL

Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun

2603.11534 2026-03-13 cs.CV

Risk-Controllable Multi-View Diffusion for Driving Scenario Generation

Hongyi Lin, Wenxiu Shi, Heye Huang, Dingyi Zhuang, Song Zhang, Yang Liu, Xiaobo Qu, Jinhua Zhao