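arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.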

2507.16861 2026-03-20 cs.CV cs.AI

Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

Comments Accepted to CVPR 2026

Abstract

Integrating LiDAR and camera inputs into a unified Bird's-Eye-View (BEV) representation is crucial for enhancing the 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that the locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries, which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To effectively utilize these aligned representations, we incorporate Structural Guidance Depth Modulator (SGDM), which uses a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves state-of-the-art performance on the nuScenes validation set, with mAP and NDS reaching 71.5% and 73.6%, respectively. Additionally, on the Argoverse 2 validation set, we achieve a competitive mAP of 41.7%.

2507.13323 2026-03-20 cs.LG

GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM

Kyeongjin Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, Meeyoung Cha

Comments 10 pages, 9 figures, 4 tables

2507.05751 2026-03-20 cs.CV

SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations

Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim

2507.02861 2026-03-20 cs.CV cs.AI cs.GR

LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, Joan Lasenby

Comments Project Page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c&feature=youtu.be; Camera-Ready Version

2506.13387 2026-03-20 cs.CV

TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast

Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren

Comments CVPR 2026

2506.08625 2026-03-20 cs.CL

RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee

2506.02535 2026-03-20 cs.CV

Video Anomaly Detection with Semantics-Aware Information Bottleneck

Juntong Li, Lingwei Dang, Qingxin Xiao, Shishuo Shang, Jiajia Cheng, Haomin Wu, Yun Hao, Qingyao Wu

Comments Accepted by ICME 2026

2505.21854 2026-03-20 cs.CV cs.AI

Rethinking Gradient-based Adversarial Attacks on Point Cloud Classification

Jun Chen, Xinke Li, Mingyue Xu, Chongshou Li, Tianrui Li

Comments ICME 2026

2505.09109 2026-03-20 cs.RO cs.CV

FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis

Yuxing Chen, Bowen Xiao, He Wang

Comments Project: https://pku-epic.github.io/FoldNet/

2505.02024 2026-03-20 cs.AI

From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

Minjie Shen, Yanshu Li, Lulu Chen, Zhichao Fan, Yanhang Li, Qikai Yang

2504.15995 2026-03-20 cs.LG cs.AI

OPUS-VFL: Incentivizing Optimal Privacy-Utility Tradeoffs in Vertical Federated Learning

Sindhuja Madabushi, Ahmad Faraz Khan, Haider Ali, Jin-Hee Cho

2504.14634 2026-03-20 cs.RO cs.CV

Latent Representations for Visual Proprioception in Inexpensive Robots

Sahara Sheikholeslami, Ladislau Bölöni

2504.12441 2026-03-20 cs.RO cs.LG cs.SY eess.SY

Learning Transferable Friction Models and LuGre Identification Via Physics-Informed Neural Networks

Asutay Ozmen, João P. Hespanha, Katie Byl

Comments 7 pages, 8 figures, Accepted to 2026 American Control Conference (ACC)

2503.21800 2026-03-20 cs.CL cs.AI cs.LG

ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries

Lovedeep Gondara, Jonathan Simkin, Shebnum Devji, Gregory Arbour, Raymond Ng

Abstract

Background: Population-based cancer registries (PBCRs) manually extract data from unstructured pathology reports, a labor-intensive process in which assigning reports to tumor groups can consume 900 person-hours annually for approximately 100,000 reports at a medium-sized registry. Current automated rule-based systems fail to handle the linguistic complexity of this classification task. Materials and Methods: We present ELM (Ensemble of Language Models), a novel hybrid approach combining small encoder-only language models with large language models (LLMs). ELM employs an ensemble of six fine-tuned encoder-only models: three analyzing the top portion and three analyzing the bottom portion of each report, to maximize text coverage given token limits. A tumor group is assigned when at least five of the six models agree; otherwise, an LLM arbitrates using a carefully curated prompt constrained to likely tumor groups. Results: On a held-out test set of 2,058 pathology reports spanning 19 tumor groups, ELM achieves weighted precision and recall of 0.94, a statistically significant improvement (p<0.001) over encoder-only ensembles (0.91 F1-score), and substantially outperforms rule-based approaches. ELM shows particular gains for challenging categories, including leukemia (F1: 0.76 to 0.88), lymphoma (0.76 to 0.89), and skin cancer (0.44 to 0.58). Discussion: Deployed in production at the British Columbia Cancer Registry, ELM has reduced manual review requirements by approximately 60-70%, saving an estimated 900 person-hours annually while maintaining data quality standards. Conclusion: ELM represents the first successful deployment of a hybrid architecture combining small encoder-only models with an LLM for tumor group classification in a real-world PBCR setting, demonstrating how a strategic combination of language models can achieve both high accuracy and operational efficiency.
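
The agreement rule described in the abstract (assign a tumor group when at least five of six encoder models agree, otherwise defer to an LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_tumor_group`, the `arbitrate` callable standing in for the LLM call, and the choice to pass the distinct voted labels as the "likely tumor groups" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence


def assign_tumor_group(
    votes: Sequence[str],
    arbitrate: Callable[[List[str]], str],
    min_agreement: int = 5,
) -> str:
    """Return the ensemble label, or defer to an arbiter on disagreement.

    votes: one predicted tumor group per encoder model (six in the abstract).
    arbitrate: hypothetical stand-in for the LLM call; it receives the
        candidate labels and returns one of them.
    min_agreement: number of matching votes needed to skip arbitration
        (five of six in the abstract).
    """
    if not votes:
        raise ValueError("at least one vote is required")

    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    if top_count >= min_agreement:
        return top_label

    # Assumption: the "likely tumor groups" offered to the arbiter are the
    # distinct labels the encoders voted for, most frequent first.
    candidates = [label for label, _ in counts.most_common()]
    return arbitrate(candidates)
```

For example, five votes for one group and one for another resolve without arbitration, while a three-to-three split is handed to the `arbitrate` callable together with both candidate labels.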

2503.18253 2026-03-20 cs.CL

Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages

Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Iqra Ameer, Grigori Sidorov, Seid Muhie Yimam

Comments LREC 2026

2503.16426 2026-03-20 cs.CV

DynamicVis: Dynamic Visual Perception for Efficient Remote Sensing Foundation Models

Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Shijian Lu, Zhenwei Shi

Abstract

The advancement of remote sensing (RS) technology has enabled high-resolution Earth observation; however, interpreting these images using modern visual foundation models (VFMs) remains a significant challenge. Unlike object-centric natural images, RS imagery is fundamentally characterized by extreme target sparsity and massive spatial redundancy. Key objects of interest (e.g., ships, vehicles) often occupy less than 1% of the spatial extent, surrounded by vast, target-free backgrounds. Existing VFMs predominantly rely on uniform dense processing (e.g., ViTs) and pixel-reconstruction pre-training paradigms (e.g., MAE). These approaches inherently waste substantial computational capacity on modeling redundant backgrounds and inadvertently dilute the feature representations of small, sparse targets. To bridge this structural misalignment, we propose DynamicVis, a visual foundation model explicitly tailored to the sparse nature of RS imagery. Architecturally, DynamicVis introduces a Dynamic Region-Aware SSM that bypasses uniform computation. It adaptively routes and incrementally models only task-relevant, high-salience tokens while employing a parameter-free integration for background context, drastically reducing the complexity of processing ultra-long 2D token sequences (~100,000). Crucially, to equip the network with robust spatial-selection capabilities, we propose a novel Region-Level Meta-Embedding Multi-Instance Learning (MIL) pre-training paradigm. Trained on a million-scale dataset, this paradigm explicitly disentangles sparse foreground instances from dense backgrounds in the latent semantic space, overcoming the semantic ambiguity of conventional pixel-reconstruction methods. Extensive evaluations across nine diverse downstream tasks show that DynamicVis exhibits exceptional efficacy, particularly in sparse-target and instance-level perception tasks (e.g., small object detection and change detection).

2502.19159 2026-03-20 cs.CV

Sliding-Window Merging for Compacting Patch-Redundant Layers in LLMs

Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Angelica I Aviles-Rivero, Chuanlong Xie, Yao Zhu

2502.10978 2026-03-20 cs.AI cs.CY

Agentic LLM Framework for Adaptive Decision Discourse

Antoine Dolant, Praveen Kumar

Comments 24 pages, 4 figures, 1 appendix

2502.03714 2026-03-20 cs.CV cs.LG

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis

2502.00340 2026-03-20 cs.LG cs.CL cs.DC

Unlocking Full Efficiency of Token Filtering in Large Language Model Training

Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Kaiqiang Xu, Binhang Yuan, Dian Shen, Junxue Zhang, Kai Chen

2501.02364 2026-03-20 cs.LG cs.CV stat.ML

Linearly Separable Features in Shallow Nonlinear Networks: Width Scales Polynomially with Intrinsic Data Dimension

Alec S. Xu, Can Yaras, Peng Wang, Qing Qu

Comments 33 pages, 10 figures

2412.10488 2026-03-20 cs.CV cs.AI cs.GR

SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers

Zehao Chen, Rong Pan

Comments Accepted by AAAI 2025. Project: https://svgbuilder.github.io

2412.02484 2026-03-20 cs.LG stat.AP stat.ML

Vector Optimization with Gaussian Process Bandits

İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat, Cem Tekin

2412.01113 2026-03-20 cs.CL

LLMs Faithfully and Iteratively Compute Answers During CoT: A Systematic Analysis With Multi-step Arithmetics

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

2411.08794 2026-03-20 cs.AI

LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed

Chang Yang, Xinrun Wang, Junzhe Jiang, Qinggang Zhang, Xiao Huang

Comments Accepted to TMLR

2410.13106 2026-03-20 cs.LG cs.AI

Cliqueformer: Model-Based Optimization with Structured Transformers

Jakub Grudzien Kuba, Pieter Abbeel, Sergey Levine

2409.16215 2026-03-20 cs.RO cs.CV

TiROD: Tiny Robotics Dataset and Benchmark for Continual Object Detection

Francesco Pasti, Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, Nicola Bellotto

2409.05585 2026-03-20 cs.CV cs.AI

Latent Causal Modeling for 3D Brain MRI Counterfactuals

Wei Peng, Tian Xia, Fabio De Sousa Ribeiro, Tomas Bosschieter, Ehsan Adeli, Qingyu Zhao, Ben Glocker, Kilian M. Pohl

2408.07221 2026-03-20 cs.CV cs.LG

A Review of Pseudo-Labeling for Computer Vision

Patrick Kage, Jay C. Rothenberger, Pavlos Andreadis, Dimitrios I. Diochnos

Comments 40 pages, 4 figures, 2 tables

2407.17869 2026-03-20 cs.LG

Modeling Inverse Ellipsometry Problem via Flow Matching with a Large-Scale Dataset

Yiming Ma, Jianzhi Teng, Xinjie Li, Xin Sun, Zhiyong Wang, Yuzhou Song, Lionel Z. Wang, Bin Chen