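arXivDaily daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.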

2603.17068 2026-03-19 cs.CV cs.RO

TrackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects

Yeheng Zong, Yizhou Chen, Alexander Bowler, Chia-Tung Yang, Ram Vasudevan

Abstract

Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.

2603.17067 2026-03-19 cs.CL cs.AI

Evaluating Ill-Defined Tasks in Large Language Models

Yi Zhou, Basel Shbita

2603.17063 2026-03-19 cs.AI

Transformers are Bayesian Networks

Gregory Coppola

2603.17056 2026-03-19 cs.CV cs.LG

DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems

Yasaswini Chebolu

Comments 10 pages, 6 figures, 3 tables. Preprint also available on Zenodo (DOI: 10.5281/zenodo.19053085)

2603.17055 2026-03-19 cs.CV

PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

Yijian Wang, Qingsen Yan, Jiantao Zhou, Duwei Dai, Wei Dong

2603.17052 2026-03-19 cs.LG cs.AI

Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

Wenhao Zhao, Qiran Zou, Rushi Shah, Yudi Wu, Zhouhan Lin, Dianbo Liu

2603.17051 2026-03-19 cs.CV

Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao

Comments 53 pages, 37 figures

2603.17048 2026-03-19 cs.LG cs.CV

SCE-LITE-HQ: Smooth visual counterfactual explanations with generative foundation models

Ahmed Zeid, Sidney Bender

2603.17043 2026-03-19 cs.CV

OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials

Sankalp Pandey, Xuan-Bac Nguyen, Hoang-Quan Nguyen, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu

Abstract

The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond the detection accuracy. While recent domain-specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics-informed reasoning, their outputs are optimized for step-by-step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real-world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics-Aware Instruction Multi-modal platforms for Quantum Material Discovery. This allows accessibility to the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain-expert MLLM, with QuPAINT, as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale-aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner. Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 μm) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain-specific experts, transforms isolated inferences into a context-aware assistant capable of accelerating high-throughput device fabrication.
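The scale-aware area computation mentioned in the abstract can be illustrated with a minimal sketch. Only the ratio 1 pixel = 0.25 μm comes from the abstract. The JSON file standing in for persistent memory and the names `save_scale`, `load_scale`, and `flake_area_um2` are assumptions made for illustration, not OpenQlaw's actual interface.

```python
import json
import os
import tempfile


def save_scale(path, um_per_pixel):
    """Persist the physical scale ratio so later queries can reuse it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"um_per_pixel": um_per_pixel}, f)


def load_scale(path):
    """Read the stored scale ratio back from the (hypothetical) memory file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["um_per_pixel"]


def flake_area_um2(pixel_count, um_per_pixel):
    """Area = number of pixels times the area of one pixel (side length squared)."""
    return pixel_count * um_per_pixel ** 2


# Hypothetical session: store the ratio once, then reuse it for an area query.
memory_path = os.path.join(tempfile.mkdtemp(), "scale.json")
save_scale(memory_path, 0.25)
area = flake_area_um2(1600, load_scale(memory_path))
print(f"{area} um^2")
```

With the stored ratio, each pixel covers 0.0625 μm², so a 1600-pixel segmentation mask corresponds to 100 μm².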

2603.17022 2026-03-19 cs.RO cs.SY eess.SY

Contingency-Aware Planning via Certified Neural Hamilton-Jacobi Reachability

Kasidit Muenprasitivej, Derya Aksaray

Comments 9 pages, 4 figures

2603.17019 2026-03-19 cs.LG

Transformers Can Learn Rules They've Never Seen: Proof of Computation Beyond Interpolation

Andy Gray

Comments 26 pages, 6 figures

Abstract

A central question in the LLM debate is whether transformers can infer rules absent from training, or whether apparent generalisation reduces to similarity-based interpolation over observed examples. We test a strong interpolation-only hypothesis in two controlled settings: one where interpolation is ruled out by construction and proof, and one where success requires emitting intermediate symbolic derivations rather than only final answers. In Experiment 1, we use a cellular automaton with a pure XOR transition rule and remove specific local input patterns from training; since XOR is linearly inseparable, each held-out pattern's nearest neighbours have the opposite label, so similarity-based predictors fail on the held-out region. Yet a two-layer transformer recovers the rule (best 100%; 47/60 converged runs), and circuit extraction identifies XOR computation. Performance depends on multi-step constraint propagation: without unrolling, accuracy matches output bias (63.1%), while soft unrolling reaches 96.7%. In Experiment 2, we study symbolic operator chains over integers with one operator pair held out; the model must emit intermediate steps and a final answer in a proof-like format. Across all 49 holdout pairs, the transformer exceeds every interpolation baseline (mean 41.8%, up to 78.6%; mean KRR 4.3%; KNN and MLP score 0% on every pair), while removing intermediate-step supervision degrades performance. Together with a construction showing that a standard transformer block can implement exact local Boolean rules, these results provide an existence proof that transformers can learn rule structure not directly observed in training and express it explicitly, ruling out the strongest architectural form of interpolation-only accounts: that transformers cannot in principle discover and communicate unseen rules, while leaving open when such behaviour arises in large-scale language training.
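The claim that each held-out pattern's nearest neighbours carry the opposite label can be checked directly on a small case. The sketch below assumes a 3-cell neighbourhood with a parity (XOR) rule and a Hamming-distance nearest-neighbour predictor. The abstract does not specify the neighbourhood size or the baselines' details, so these choices are illustrative only.

```python
from itertools import product


def xor_rule(cells):
    """Next state = XOR (parity) of all cells in the neighbourhood."""
    out = 0
    for c in cells:
        out ^= c
    return out


def hamming(a, b):
    """Number of positions at which two patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_predict(train, query):
    """Majority label among training patterns at minimal Hamming distance."""
    best = min(hamming(p, query) for p in train)
    votes = [label for p, label in train.items() if hamming(p, query) == best]
    return int(sum(votes) * 2 > len(votes))


patterns = list(product([0, 1], repeat=3))
results = {}
for held_out in patterns:
    # Train on every pattern except the held-out one.
    train = {p: xor_rule(p) for p in patterns if p != held_out}
    results[held_out] = (xor_rule(held_out),
                         nearest_neighbour_predict(train, held_out))

for pattern, (true_label, predicted) in results.items():
    print(pattern, "true:", true_label, "nearest-neighbour:", predicted)
```

Flipping any single bit flips the parity, so every pattern at Hamming distance 1 from the held-out one has the opposite label, and the similarity-based prediction is wrong for all eight patterns.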

2603.17017 2026-03-19 cs.CL cs.AI

LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings

Lifu Tu, Rongguang Wang, Tao Sheng, Sujjith Ravi, Dan Roth

2603.16987 2026-03-19 cs.CV cs.AI

Empirical Recipes for Efficient and Compact Vision-Language Models

Jiabo Huang, Zhizhong Li, Sina Sajadmanesh, Weiming Zhuang, Lingjuan Lyu

2603.16983 2026-03-19 cs.LG cs.LO

Formal verification of tree-based machine learning models for lateral spreading

Krishna Kumar

Abstract

Machine learning models for geotechnical hazard prediction can achieve high accuracy while learning physically inconsistent relationships from sparse or biased training data. Current remedies (post-hoc explainability, such as SHAP and LIME, and training-time constraints) either diagnose individual predictions approximately or restrict model capacity without providing exhaustive guarantees. This paper encodes trained tree ensembles as logical formulas in a Satisfiability Modulo Theories (SMT) solver and checks physical specifications across the entire input domain, not just sampled points. Four geotechnical specifications (water table depth, PGA monotonicity, distance safety, and flat-ground safety) are formalized as decidable logical formulas and verified via SMT against both XGBoost ensembles and Explainable Boosting Machines (EBMs) trained on the 2011 Christchurch earthquake lateral spreading dataset (7,291 sites, four features). The SMT solver either produces a concrete counterexample where a specification fails or proves that no violation exists. The unconstrained EBM (80.1% accuracy) violates all four specifications. A fully constrained EBM (67.2%) satisfies three of four specifications, demonstrating that iterative constraint application guided by verification can progressively improve physical consistency. A Pareto analysis of 33 model variants reveals a persistent trade-off, as none of the variants studied achieve both greater than 80% accuracy and full compliance with the specified set. SHAP analysis of specification counterexamples shows that the offending feature can rank last, demonstrating that post-hoc explanations do not substitute for formal verification. These results establish a verify-fix-verify engineering loop and a formal certification for deploying physically consistent ML models in safety-critical geotechnical applications.
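The idea of checking a specification over the whole input domain, and returning either a counterexample or a proof of absence, can be sketched without a solver. The example below is not the paper's SMT encoding. It swaps in plain enumeration, which is exhaustive here because a tree ensemble is piecewise constant between its split thresholds. The tree format, feature indices, and toy ensemble are all hypothetical.

```python
from itertools import product

# A tree is either a float leaf or a tuple (feature_index, threshold, left, right);
# a sample goes left when x[feature_index] <= threshold.


def predict_tree(tree, x):
    while isinstance(tree, tuple):
        feat, thr, left, right = tree
        tree = left if x[feat] <= thr else right
    return tree


def predict(ensemble, x):
    """Additive ensemble: sum of the leaf values of all trees."""
    return sum(predict_tree(t, x) for t in ensemble)


def collect_thresholds(tree, feat, acc):
    if isinstance(tree, tuple):
        f, thr, left, right = tree
        if f == feat:
            acc.add(thr)
        collect_thresholds(left, feat, acc)
        collect_thresholds(right, feat, acc)
    return acc


def representatives(ensemble, feat):
    """One point per constant interval: each threshold, plus one above the largest."""
    ts = set()
    for tree in ensemble:
        collect_thresholds(tree, feat, ts)
    ts = sorted(ts)
    if not ts:
        return [0.0]  # feature never used, any value is representative
    return ts + [ts[-1] + 1.0]


def find_monotonicity_violation(ensemble, n_features, feat):
    """Return (lower_input, higher_input) where the output drops, else None."""
    reps = [representatives(ensemble, f) for f in range(n_features)]
    others = [reps[f] for f in range(n_features) if f != feat]
    for combo in product(*others):
        prev = None
        for v in reps[feat]:
            x = list(combo[:feat]) + [v] + list(combo[feat:])
            y = predict(ensemble, x)
            if prev is not None and y < prev[1]:
                return prev[0], x
            prev = (x, y)
    return None


# Hypothetical features: index 0 = PGA, index 1 = water table depth.
tree_a = (0, 0.3, 0.1, 0.6)                   # output rises with PGA
tree_b = (1, 2.0, (0, 0.5, 0.2, 0.0), 0.0)    # output falls with PGA at shallow depth
violation = find_monotonicity_violation([tree_a, tree_b], 2, 0)
print("counterexample:", violation)
print("tree_a alone:", find_monotonicity_violation([tree_a], 2, 0))
```

Enumeration grows exponentially with the number of features and thresholds, which is one reason a solver-based encoding such as the one described in the abstract is preferable for realistic ensembles.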

2603.16978 2026-03-19 cs.RO cs.LG

Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models

Pierre Krack, Tobias Jülg, Wolfram Burgard, Florian Walter

Comments 10 pages, 5 figures, submitted to IEEE

2603.16974 2026-03-19 cs.CV cs.AI

Are a Thousand Words Better Than a Single Picture? Beyond Images -- A Framework for Multi-Modal Knowledge Graph Dataset Enrichment

Pengyu Zhang, Klim Zaporojets, Jie Liu, Jia-Hong Huang, Paul Groth

2603.16967 2026-03-19 cs.CV cs.AI

MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing

Zhaoyuan Qiu, Ken Chen, Xiangwei Wang, Yu Xia, Sachith Seneviratne, Saman Halgamuge

Comments 14 pages, 6 figures, 3 tables, appendix and references provided

2603.16966 2026-03-19 cs.CV cs.AI cs.MM cs.SD eess.AS

CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

Liangbin Huang, Xiaohua Liao, Chaoqun Cui, Shijing Wang, Zhaolong Huang, Yanlong Du, Wenji Mao

Comments Accepted to CVPR 2026

2603.16958 2026-03-19 cs.CV cs.AI

PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Hisayuki Yokomizo, Taiki Miyanishi, Yan Gang, Shuhei Kurita, Nakamasa Inoue, Yusuke Iwasawa

Comments Code and dataset will be available at https://github.com/hisasnow/PhysQuantAgent

2603.16945 2026-03-19 cs.CV cs.AI

Joint Optimization of Storage and Loading for High-Performance 3D Point Cloud Data Processing

Ke Wang, Yanfei Cao, Xiangzhi Tao, Naijie Gu, Jun Yu, Zhengdong Wang, Shouyang Dong, Fan Yu, Cong Wang, Yang Luo

2603.16944 2026-03-19 cs.CV cs.AI

Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Tiankun Yang, Chenxi Bao, Haopeng Jin, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Haijin Liang, Jin Ma, Xinming Wang, Ruiwen Tao, Hongzhu Yi

2603.16943 2026-03-19 cs.CV cs.AI

KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition

Yuhan Chen, Yicui Shi, Guofa Li, Liping Zhang, Jie Li, Jiaxin Gao, Wenbo Chu

Abstract

Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics. By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.
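The Bhattacharyya distance between two joint Gaussians has a closed form, which the sketch below implements for 2D Gaussians to keep the matrix algebra dependency-free. Turning the distance into an adjacency weight via `exp(-distance)` (the Bhattacharyya coefficient) is an assumption for illustration, since the abstract does not state how the adjacency prior is derived from the distance. The joint names, means, and covariances are made up.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """D_B = 1/8 d^T S^-1 d + 1/2 ln(det S / sqrt(det S1 * det S2)), S = (S1 + S2) / 2."""
    cov = [[(cov1[i][j] + cov2[i][j]) / 2 for j in range(2)] for i in range(2)]
    inv = inv2(cov)
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    maha = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    log_term = 0.5 * math.log(det2(cov) / math.sqrt(det2(cov1) * det2(cov2)))
    return maha / 8 + log_term


# Hypothetical joints as (mean, covariance); the ankle is anisotropic (stretched in x).
joints = {
    "wrist": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "elbow": ([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "ankle": ([6.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]),
}
names = list(joints)

# Assumed affinity: Bhattacharyya coefficient exp(-D_B), equal to 1 for identical Gaussians.
adjacency = [
    [math.exp(-bhattacharyya_distance(*joints[a], *joints[b])) for b in names]
    for a in names
]
for name, row in zip(names, adjacency):
    print(name, [round(w, 3) for w in row])
```

In this toy setup, joints whose distributions overlap more receive larger weights, regardless of whether they are physically connected in the skeleton.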

2603.16939 2026-03-19 cs.CV

Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion

Aislan Gabriel O. Souza, Agostinho Freire, Leandro Honorato Silva, Igor Lucas B. da Silva, João Vinícius R. de Andrade, Gabriel C. de Albuquerque, Lucas Matheus da S. Oliveira, Mário Stela Guerra, Luciana Machado

2603.16937 2026-03-19 cs.LG stat.AP stat.ME

Integrating Explainable Machine Learning and Mixed-Integer Optimization for Personalized Sleep Quality Intervention

Mahfuz Ahmed Anik, Mohsin Mahmud Topu, Azmine Toushik Wasi, Md Isfar Khan, MD Manjurul Ahsan

Comments 34 Pages. 7 Tables. 6 Figures

2603.16936 2026-03-19 cs.CV cs.AI

TDMM-LM: Bridging Facial Understanding and Animation via Language Models

Luchuan Song, Pinxin Liu, Haiyang Liu, Zhenchao Jin, Yolo Yunlong Tang, Zichong Xu, Susan Liang, Jing Bi, Jason J Corso, Chenliang Xu

Comments 12 pages, 13 figures

2603.16935 2026-03-19 cs.CV cs.AI

GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference

Zongshun Zhang, Yao Liu, Qiao Liu, Xuefeng Peng, Peiyuan Jiang, Jiaye Yang, Daibing Yao, Wei Lin

Comments Accepted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

2603.16934 2026-03-19 cs.CV cs.AI

AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding

Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed

2603.16932 2026-03-19 cs.CV cs.AI

Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz

2603.16931 2026-03-19 cs.CV cs.AI

Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation

Rena Suzuki, Masato Kikuchi, Tadachika Ozono

Comments The 21st International Conference on E-Service and Knowledge Management (ESKM 2025-Winter)

2603.16930 2026-03-19 cs.CV cs.AI

Facial beauty prediction fusing transfer learning and broad learning system

Junying Gan, Xiaoshan Xie, Yikui Zhai, Guohui He, Chaoyun Mai, Heng Luo