arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

检索范围排序方式

检索时间范围

重置

HOT 人工智能、机器人等 9

cs.AI 人工智能 cs.CV 计算机视觉 cs.CL 自然语言处理 cs.RO 机器人 cs.LG 机器学习 cs.SD 声音 cs.ET 新兴技术 eess.AS 音频语音 eess.IV 图像视频

CS 计算机 41

cs 计算机 cs.AI 人工智能 cs.AR 硬件架构 cs.CC 计算复杂性 cs.CE 计算工程 cs.CG 计算几何 cs.CL 自然语言处理 cs.CR 密码安全 cs.CV 计算机视觉 cs.CY 计算机与社会 cs.DB 数据库 cs.DC 分布式计算 cs.DL 数字图书馆 cs.DM 离散数学 cs.DS 数据结构 cs.ET 新兴技术 cs.FL 形式语言 cs.GL 综述文献 cs.GR 图形学 cs.GT 博弈论 cs.HC 人机交互 cs.IR 信息检索 cs.IT 信息论 cs.LG 机器学习 cs.LO 计算机逻辑 cs.MA 多智能体 cs.MM 多媒体 cs.MS 数学软件 cs.NA 数值分析 cs.NE 神经进化 cs.NI 网络架构 cs.OH 其他计算机 cs.OS 操作系统 cs.PF 性能 cs.PL 编程语言 cs.RO 机器人 cs.SC 符号计算 cs.SD 声音 cs.SE 软件工程 cs.SI 社会信息网络 cs.SY 系统控制

ECON 经济学 4

econ 经济学 econ.EM 计量经济 econ.GN 一般经济 econ.TH 理论经济

EESS 电气与系统 5

eess 电气与系统 eess.AS 音频语音 eess.IV 图像视频 eess.SP 信号处理 eess.SY 系统控制

MATH 数学 33

math 数学 math.AC 交换代数 math.AG 代数几何 math.AP 偏微分方程 math.AT 代数拓扑 math.CA 经典分析 math.CO 组合数学 math.CT 范畴论 math.CV 复变函数 math.DG 微分几何 math.DS 动力系统 math.FA 泛函分析 math.GM 一般数学 math.GN 一般拓扑 math.GR 群论 math.GT 几何拓扑 math.HO 历史综述 math.IT 信息论 math.KT K理论 math.LO 逻辑 math.MG 度量几何 math.MP 数学物理 math.NA 数值分析 math.NT 数论 math.OA 算子代数 math.OC 优化控制 math.PR 概率 math.QA 量子代数 math.RA 环与代数 math.RT 表示论 math.SG 辛几何 math.SP 谱理论 math.ST 统计理论

PHYSICS 物理 55

astro-ph 天体物理 astro-ph.CO 宇宙学 astro-ph.EP 地球行星 astro-ph.GA 星系物理 astro-ph.HE 高能天体 astro-ph.IM 天文仪器 astro-ph.SR 太阳恒星 cond-mat 凝聚态 cond-mat.dis-nn 无序神经 cond-mat.mes-hall 介观纳米 cond-mat.mtrl-sci 材料科学 cond-mat.other 其他凝聚态 cond-mat.quant-gas 量子气体 cond-mat.soft 软凝聚态 cond-mat.stat-mech 统计力学 cond-mat.str-el 强关联电子 cond-mat.supr-con 超导 gr-qc 广义相对论 hep-ex 高能实验 hep-lat 格点高能 hep-ph 高能唯象 hep-th 高能理论 math-ph 数学物理 nlin 非线性科学 nlin.AO 自适应系统 nlin.CD 混沌动力学 nlin.CG 胞自动机 nlin.PS 斑图孤子 nlin.SI 可积系统 nucl-ex 核物理实验 nucl-th 核物理理论 physics 物理 physics.acc-ph 加速器物理 physics.ao-ph 大气海洋 physics.app-ph 应用物理 physics.atm-clus 原子分子团簇 physics.atom-ph 原子物理 physics.bio-ph 生物物理 physics.chem-ph 化学物理 physics.class-ph 经典物理 physics.comp-ph 计算物理 physics.data-an 数据分析 physics.ed-ph 物理教育 physics.flu-dyn 流体动力学 physics.gen-ph 普通物理 physics.geo-ph 地球物理 physics.hist-ph 物理史哲 physics.ins-det 仪器探测 physics.med-ph 医学物理 physics.optics 光学 physics.plasm-ph 等离子体 physics.pop-ph 科普物理 physics.soc-ph 物理与社会 physics.space-ph 空间物理 quant-ph 量子物理

Q-BIO 定量生物 11

q-bio 定量生物 q-bio.BM 生物分子 q-bio.CB 细胞行为 q-bio.GN 基因组学 q-bio.MN 分子网络 q-bio.NC 神经认知 q-bio.OT 其他定量生物 q-bio.PE 种群进化 q-bio.QM 定量方法 q-bio.SC 亚细胞过程 q-bio.TO 组织器官

Q-FIN 定量金融 10

q-fin 定量金融 q-fin.CP 计算金融 q-fin.EC 经济学 q-fin.GN 一般金融 q-fin.MF 数学金融 q-fin.PM 投资组合 q-fin.PR 证券定价 q-fin.RM 风险管理 q-fin.ST 统计金融 q-fin.TR 交易微观结构

STAT 统计 7

stat 统计 stat.AP 统计应用 stat.CO 统计计算 stat.ME 统计方法 stat.ML 机器学习 stat.OT 其他统计 stat.TH 统计理论

2602.01908 2026-02-03 cs.SD

LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency

Jaejun Lee, Yoori Oh, Kyogu Lee

Comments This paper has been accepted to ICASSP 2026

2602.01906 2026-02-03 cs.CV cs.AI

DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification

Farhan Ullah, Irfan Ullah, Khalil Khan, Giovanni Pau, JaKeoung Koo

2602.01905 2026-02-03 cs.CV cs.AI cs.LG

Learning Sparse Visual Representations via Spatial-Semantic Factorization

Theodore Zhengde Zhao, Sid Kiblawi, Jianwei Yang, Naoto Usuyama, Reuben Tan, Noel C Codella, Tristan Naumann, Hoifung Poon, Mu Wei

2602.01901 2026-02-03 cs.CV

Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

Comments Accepted by AAAI26

详情

英文摘要

Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engenders a substantial computational load and key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation towards the attention mechanism of the model from a new perspective, and discern that attention within more than half of all decode layers are semantic similar. Upon this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.

URL PDF HTML ☆

赞 0 踩 0

2602.01899 2026-02-03 cs.RO cs.CV

Multi-Task Learning for Robot Perception with Imbalanced Data

Ozgur Erkent

Comments 16 pages

Journal ref Ordu Üniversitesi Bilim ve Teknoloji Dergisi, 15(2), 151-164 (2025)

2602.01898 2026-02-03 cs.LG stat.ML

Observation-dependent Bayesian active learning via input-warped Gaussian processes

Sanna Jarl, Maria Bånkestad, Jonathan J. S. Scragg, Jens Sjölund

Comments 13 pages

2602.01897 2026-02-03 cs.LG

Internal Flow Signatures for Self-Checking and Refinement in LLMs

Sungheon Jeong, Sanggeon Yun, Ryozo Masukawa, Wenjun Haung, Hanning Chen, Mohsen Imani

2602.01893 2026-02-03 cs.AI cs.LG

Geometric Analysis of Token Selection in Multi-Head Attention

Timur Mudarisov, Mikhal Burtsev, Tatiana Petrova, Radu State

2602.01892 2026-02-03 cs.RO cs.SY eess.SY

Path Tracking with Dynamic Control Point Blending for Autonomous Vehicles: An Experimental Study

Alexandre Lombard, Florent Perronnet, Nicolas Gaud, Abdeljalil Abbas-Turki

2602.01885 2026-02-03 cs.CL cs.AI

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

Tiantian Chen, Jiaqi Lu, Ying Shen, Lin Zhang

Comments 12 pages, 7 figures. Accepted to The Web Conference (WWW) 2026

2602.01884 2026-02-03 cs.AI cs.LG

Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models

Shidong Yang, Tongwen Huang, Hao Wen, Yong Wang, Li Chen, Xiangxiang Chu

2602.01881 2026-02-03 cs.CV

ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding

Ye Chen, Yupeng Zhu, Xiongzhen Zhang, Zhewen Wan, Yingzhe Li, Wenjun Zhang, Bingbing Ni

详情

英文摘要

Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.

URL PDF HTML ☆

赞 0 踩 0

2602.01879 2026-02-03 cs.SD

Speaking Without Sound: Multi-speaker Silent Speech Voicing with Facial Inputs Only

Jaejun Lee, Yoori Oh, Kyogu Lee

Comments This paper was presented at ICASSP 2025

2602.01877 2026-02-03 cs.LG math.OC

Autocorrelated Optimize-via-Estimate: Predict-then-Optimize versus Finite-sample Optimal

Zichun Wang, Gar Goei Loke, Ruiting Zuo

2602.01875 2026-02-03 cs.CL

PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

Langming Liu, Kangtao Lv, Haibin Chen, Weidong Zhang, Yejing Wang, Shilei Liu, Xin Tong, Yujin Yuan, Yongwei Wang, Wenbo Su, Bo Zheng

2602.01870 2026-02-03 cs.RO

BTGenBot-2: Efficient Behavior Tree Generation with Small Language Models

Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

2602.01864 2026-02-03 cs.CV

Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling

Yuan Wang, Yuhao Wan, Siming Zheng, Bo Li, Qibin Hou, Peng-Tao Jiang

Comments 26 pages, 19 figures. Accepted to ICLR 2026

2602.01860 2026-02-03 cs.RO

Vision-only UAV State Estimation for Fast Flights Without External Localization Systems: A2RL Drone Racing Finalist Approach

Filip Novák, Matěj Petrlík, Matej Novosad, Parakh M. Gupta, Robert Pěnička, Martin Saska

Comments Visit our webpage for more details: https://mrs.fel.cvut.cz/papers/vision-only-uav-state-estimation

2602.01858 2026-02-03 cs.AI

SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures

Liangtao Lin, Zhaomeng Zhu, Tianwei Zhang, Yonggang Wen

2602.01854 2026-02-03 cs.CV

Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection

A S M Sharifuzzaman Sagar, Mohammed Bennamoun, Farid Boussaid, Naeha Sharif, Lian Xu, Shaaban Sahmoud, Ali Kishk

详情

英文摘要

In multimodal misinformation, deception usually arises not just from pixel-level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image-text pair. Yet most deepfake detectors, engineered to detect pixel-level forgeries, do not account for claim-level meaning, despite their growing integration in automated fact-checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel-level detectors contribute useful signal for verifying image-text claims, or do they instead introduce misleading authenticity priors that undermine evidence-based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state-of-the-art image-only deepfake detectors, (2) an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi-Agent Debate (MAD), and (3) a hybrid fact-checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4, and that incorporating their predictions into fact-checking pipelines consistently reduces performance by 0.04-0.08 F1 due to non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel-level artifact signals do not reliably enhance reasoning over real-world image-text misinformation.

URL PDF HTML ☆

赞 0 踩 0

2602.01853 2026-02-03 cs.LG stat.ME stat.ML

Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

Xiangkun Wu, Qianglin Wen, Yingying Zhang, Hongtu Zhu, Ting Li, Chengchun Shi

2602.01852 2026-02-03 cs.LG cs.DC

FUPareto: Bridging the Forgetting-Utility Gap in Federated Unlearning via Pareto Augmented Optimization

Zeyan Wang, Zhengmao Liu, Yongxin Cai, Chi Li, Xiaoying Tang, Jingchao Chen, Zibin Pan, Jing Qiu

2602.01850 2026-02-03 cs.CV

WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

Pei Li, Jiaxi Yin, Lei Ouyang, Shihan Pan, Ge Wang, Han Ding, Fei Wang

Comments Under Review. 28 pages, 9 figures, 6 tables

详情

英文摘要

IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.

URL PDF HTML ☆

赞 0 踩 0

2602.01849 2026-02-03 cs.LG

Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

Ziwei Luo, Ziqi Jin, Lei Wang, Lidong Bing, Thomas B. Schön

Comments Project page: https://algolzw.github.io/sr-smc

2602.01845 2026-02-03 cs.LG q-bio.QM

No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Furkan Eris

2602.01843 2026-02-03 cs.CV

SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection

Qian Xu, Xi Li, Fei Gao, Jie Guo, Haojuan Yuan, Shuaipeng Fan, Mingjin Zhang

2602.01836 2026-02-03 cs.CV

Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery

Yin Wu, Daniel Slieter, Carl Esselborn, Ahmed Abouelazm, Tsung Yuan Tseng, J. Marius Zöllner

2602.01832 2026-02-03 cs.AI

Synesthesia of Vehicles: Tactile Data Synthesis from Visual Inputs

Rui Wang, Yaoguang Cao, Yuyi Chen, Jianyi Xu, Zhuoyang Li, Jiachen Shang, Shichun Yang

2602.01826 2026-02-03 cs.LG cs.AI

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Yaxiang Zhang, Yingru Li, Jiacai Liu, Jiawei Xu, Ziniu Li, Qian Liu, Haoyuan Li

2602.01816 2026-02-03 cs.CV

Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies

Wenjin Hou, Wei Liu, Han Hu, Xiaoxiao Sun, Serena Yeung-Levy, Hehe Fan

AI 大模型

视觉与机器人

科学与医疗

LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency

DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification

Learning Sparse Visual Representations via Spatial-Semantic Factorization

Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Multi-Task Learning for Robot Perception with Imbalanced Data

Observation-dependent Bayesian active learning via input-warped Gaussian processes

Internal Flow Signatures for Self-Checking and Refinement in LLMs

Geometric Analysis of Token Selection in Multi-Head Attention

Path Tracking with Dynamic Control Point Blending for Autonomous Vehicles: An Experimental Study

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models

ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding

Speaking Without Sound: Multi-speaker Silent Speech Voicing with Facial Inputs Only

Autocorrelated Optimize-via-Estimate: Predict-then-Optimize versus Finite-sample Optimal

PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

BTGenBot-2: Efficient Behavior Tree Generation with Small Language Models

Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling

Vision-only UAV State Estimation for Fast Flights Without External Localization Systems: A2RL Drone Racing Finalist Approach

SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures

Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection

Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

FUPareto: Bridging the Forgetting-Utility Gap in Federated Unlearning via Pareto Augmented Optimization

WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection

Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery

Synesthesia of Vehicles: Tactile Data Synthesis from Visual Inputs

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies