arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2508.20765 2026-04-09 cs.CV cs.AI

Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding

Gowreesh Mago, Pascal Mettes, Stevan Rudinac

Comments Accepted at IJCV

详情

DOI: 10.1007/s11263-026-02784-5
Journal ref: International Journal of Computer Vision 134, 217 (2026)

英文摘要

The automatic understanding of video content is advancing rapidly. Empowered by deeper neural networks and large datasets, machines are increasingly capable of understanding what is concretely visible in video frames, whether it be objects, actions, events, or scenes. In comparison, humans retain a unique ability to also look beyond concrete entities and recognize abstract concepts like justice, freedom, and togetherness. Abstract concept recognition forms a crucial open challenge in video understanding, where reasoning on multiple semantic levels based on contextual information is key. In this paper, we argue that the recent advances in foundation models make for an ideal setting to address abstract concept understanding in videos. Automated understanding of high-level abstract concepts is imperative as it enables models to be more aligned with human reasoning and values. In this survey, we study different tasks and datasets used to understand abstract concepts in video content. We observe that, periodically and over a long period, researchers have attempted to solve these tasks, making the best use of the tools available at their disposal. We advocate that drawing on decades of community experience will help us shed light on this important open grand challenge and avoid ``re-inventing the wheel'' as we start revisiting it in the era of multi-modal foundation models.

URL PDF HTML ☆

赞 0 踩 0

2508.15358 2026-04-09 cs.AI

Planning with Minimal Disruption

Alberto Pozanco, Marianela Morales, Daniel Borrajo, Manuela Veloso

2508.13657 2026-04-09 cs.LG cs.AI

In-Context Decision Making for Optimizing Complex AutoML Pipelines

Amir Rezaei Balef, Katharina Eggensperger

Comments Accepted at the European Conference on Artificial Intelligence (ECAI) 2025

2508.12243 2026-04-09 cs.CL

SEA-BED: How Do Embedding Models Represent Southeast Asian Languages?

Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Raymond Ng, Jann Railey Montalan, Thura Aung, Jian Gang Ngui, Yosephine Susanto, William Chandra Tjhi, Panuthep Tasawong, Erik Cambria, Ekapol Chuangsuwanich, Sarana Nutanong

2508.07299 2026-04-09 cs.LG cs.AI

Quantitative Estimation of Target Task Performance from Unsupervised Pretext Task in Semi/Self-Supervised Learning

Lin-Han Jia, Si-Yu Han, Wen-Chao Hu, Jie-Jing Shao, Wen-Da Wei, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li

2508.07112 2026-04-09 cs.CV cs.LG

AugLift: Depth-Aware Input Reparameterization Improves Domain Generalization in 2D-to-3D Pose Lifting

Nikolai Warner, Wenjin Zhang, Hamid Badiozamani, Irfan Essa, Apaar Sadhwani

Comments Preprint. Under review

2508.05423 2026-04-09 cs.LG stat.ML

Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling

Yixuan Zhang, Jinhao Sheng, Wenxin Zhang, Quyu Kong, Feng Zhou

2508.04691 2026-04-09 cs.RO cs.AI cs.MA

Before Humans Join the Team: Diagnosing Coordination Failures in Healthcare Robot Team Simulation

Yuanchen Bai, Zijian Ding, Shaoyue Wen, Xiang Chang, Angelique Taylor

Comments Revised version incorporating new analysis and restructuring

2507.20906 2026-04-09 cs.CL

Soft Head Selection for Injecting ICL-Derived Task Embeddings

Jungwon Park, Jimyeong Kim, Changin Choi, Wonjong Rhee

Comments Accepted by ACL 2026 Findings

2507.08390 2026-04-09 cs.LG

Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

2507.00102 2026-04-09 cs.LG cs.AI eess.SP

Towards transparent and data-driven fault detection in manufacturing: A case study on univariate, discrete time series

Bernd Hofmann, Patrick Bruendl, Huong Giang Nguyen, Joerg Franke

详情

DOI: 10.1007/s44244-026-00028-6

英文摘要

Ensuring consistent product quality in modern manufacturing is crucial, particularly in safety-critical applications. Conventional quality control approaches, reliant on manually defined thresholds and features, lack adaptability to the complexity and variability inherent in production data and necessitate extensive domain expertise. Conversely, data-driven methods, such as machine learning, demonstrate high detection performance but typically function as black-box models, thereby limiting their acceptance in industrial environments where interpretability is paramount. This paper introduces a methodology for industrial fault detection, which is both data-driven and transparent. The approach integrates a supervised machine learning model for multi-class fault classification, Shapley Additive Explanations for post-hoc interpretability, and a do-main-specific visualisation technique that maps model explanations to operator-interpretable features. Furthermore, the study proposes an evaluation methodology that assesses model explanations through quantitative perturbation analysis and evaluates visualisations by qualitative expert assessment. The approach was applied to the crimping process, a safety-critical joining technique, using a dataset of univariate, discrete time series. The system achieves a fault detection accuracy of 95.9 %, and both quantitative selectivity analysis and qualitative expert evaluations confirmed the relevance and inter-pretability of the generated explanations. This human-centric approach is designed to enhance trust and interpretability in data-driven fault detection, thereby contributing to applied system design in industrial quality control.

URL PDF HTML ☆

赞 0 踩 0

2506.19420 2026-04-09 cs.AI

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin, Prayag Tiwari

2506.18841 2026-04-09 cs.CL cs.AI cs.LG

LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning

Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li

Comments ICLR 2026 Oral

2506.11786 2026-04-09 cs.LG

SSPINNpose: A Self-Supervised PINN for Inertial Pose and Dynamics Estimation

Markus Gambietz, Eva Dorschky, Altan Akat, Marcel Schöckel, Jörg Miehling, Anne D. Koelewijn

详情

DOI: 10.1145/3806199

英文摘要

Accurate real-time estimation of human movement dynamics, including internal joint moments and muscle forces, is essential for applications in clinical diagnostics and sports performance monitoring. Inertial measurement units (IMUs) provide a minimally intrusive solution for capturing motion data, particularly when used in sparse sensor configurations. However, current real-time methods rely on supervised learning, where a ground truth dataset needs to be measured with laboratory measurement systems, such as optical motion capture. These systems are known to introduce measurement and processing errors and often fail to generalize to real-world or previously unseen movements, necessitating new data collection efforts that are time-consuming and impractical. To overcome these limitations, we propose SSPINNpose, a self-supervised, physics-informed neural network that estimates joint kinematics and kinetics directly from IMU data, without requiring ground truth labels for training. We run the network output through a physics model of the human body to optimize physical plausibility and generate virtual measurement data. Using this virtual sensor data, the network is trained directly on the measured sensor data instead of a ground truth. When compared to optical motion capture, SSPINNpose is able to accurately estimate joint angles and joint moments at an RMSD of 8.7 deg and 4.9 BWBH%, respectively, for walking and running at speeds up to 4.9 m/s at a latency of 3.5 ms. Furthermore, the framework demonstrates robustness across sparse sensor configurations and can infer the anatomical locations of the sensors. These results underscore the potential of SSPINNpose as a scalable and adaptable solution for real-time biomechanical analysis in both laboratory and field environments.

URL PDF HTML ☆

赞 0 踩 0

2505.20049 2026-04-09 cs.CV

Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay

Hongsong Wang, Ao Sun, Jie Gui, Liang Wang

Comments Code is on https://github.com/sunao-101/PGPFR-3/

详情

DOI: 10.1109/TIP.2026.3671627
Journal ref: IEEE Transactions on Image Processing, 2026

英文摘要

Gesture recognition is an important research area in the field of computer vision. Most gesture recognition efforts focus on close-set scenarios, thereby limiting the capacity to effectively handle unseen or novel gestures. We aim to address class-incremental gesture recognition, which entails the ability to accommodate new and previously unseen gestures over time. Specifically, we introduce a Prototype-Guided Pseudo Feature Replay (PGPFR) framework for data-free class-incremental gesture recognition. This framework comprises four components: Pseudo Feature Generation with Batch Prototypes (PFGBP), Variational Prototype Replay (VPR) for old classes, Truncated Cross-Entropy (TCE) for new classes, and Continual Classifier Re-Training (CCRT). To tackle the issue of catastrophic forgetting, the PFGBP dynamically generates a diversity of pseudo features in an online manner, leveraging class prototypes of old classes along with batch class prototypes of new classes. Furthermore, the VPR enforces consistency between the classifier's weights and the prototypes of old classes, leveraging class prototypes and covariance matrices to enhance robustness and generalization capabilities. The TCE mitigates the impact of domain differences of the classifier caused by pseudo features. Finally, the CCRT training strategy is designed to prevent overfitting to new classes and ensure the stability of features extracted from old classes. Extensive experiments conducted on two widely used gesture recognition datasets, namely SHREC 2017 3D and EgoGesture 3D, demonstrate that our approach outperforms existing state-of-the-art methods by 11.8\% and 12.8\% in terms of mean global accuracy, respectively. The code is available on https://github.com/sunao-101/PGPFR-3/.

URL PDF HTML ☆

赞 0 踩 0

2505.15325 2026-04-09 cs.CV

SoftHGNN: Soft Hypergraph Neural Networks for General Visual Recognition

Mengqi Lei, Yihong Wu, Siqi Li, Xinhu Zheng, Juan Wang, Shaoyi Du, Yue Gao

Comments This paper has been accepted by the International Journal of Computer Vision (IJCV)

详情

英文摘要

Visual recognition relies on understanding the semantics of image tokens and their complex interactions. Mainstream self-attention methods, while effective at modeling global pair-wise relations, fail to capture high-order associations inherent in real-world scenes and often suffer from redundant computation. Hypergraphs extend conventional graphs by modeling high-order interactions and offer a promising framework for addressing these limitations. However, existing hypergraph neural networks typically rely on static and hard hyperedge assignments, which lead to redundant hyperedges and overlooking the continuity of visual semantics. In this work, we present Soft Hypergraph Neural Networks (SoftHGNN), a lightweight plug-and-play hypergraph computation method for late-stage semantic reasoning in existing vision pipelines. Our SoftHGNN introduces the concept of soft hyperedges, where each vertex is associated with hyperedges via continuous and differentiable participation weights rather than hard binary assignments. These weights are produced by measuring similarities between vertex features and a small set of learnable hyperedge prototypes, yielding input-adaptive and semantically rich soft hyperedges. Using soft hyperedges as the medium for message aggregation and dissemination, SoftHGNN enriches feature representations with high-order contextual associations. To further enhance efficiency when scaling up the number of soft hyperedges, we incorporate a sparse hyperedge selection mechanism that activates only the top-k important hyperedges, along with a load-balancing regularizer to ensure adequate and balanced hyperedge utilization. Experimental results across three tasks on five datasets demonstrate that SoftHGNN efficiently captures high-order associations in visual scenes, achieving significant performance improvements. The code is available at: https://github.com/Mengqi-Lei/SoftHGNN.

URL PDF HTML ☆

赞 0 踩 0

2505.02781 2026-04-09 cs.AI

Local Markov Equivalence for PC-style Local Causal Discovery and Identification of Controlled Direct Effects

Timothée Loranchet, Charles K. Assaad

Comments Accepted to CleaR 2026 and to the UAI 2025 workshop on Causal Abstractions and Representations

2504.03468 2026-04-09 cs.CV

D-Garment: Physically Grounded Latent Diffusion for Dynamic Garment Deformations

Antoine Dumoulin, Adnane Boukhayma, Laurence Boissieux, Bharath Bhushan Damodaran, Pierre Hellier, Stefanie Wuhrer

Comments 18 pages, 11 figures

详情

Journal ref: Transactions on Machine Learning Research (TMLR), 2026

英文摘要

We present a method to dynamically deform 3D garments, in the form of a 3D polygon mesh, based on body shape, motion, and physical cloth material properties. Considering physical cloth properties allows to learn a physically grounded model, with the advantage of being more accurate in terms of physically inspired metrics such as strain or curvature. Existing work studies pose-dependent garment modeling to generate garment deformations from example data, and possibly data-driven dynamic cloth simulation to generate realistic garments in motion. We propose D-Garment, a learning-based approach trained on new data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations conditioned by physical material properties, which allows to model loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion. Furthermore, the model can be efficiently fitted to observations captured using vision sensors such as 3D point clouds. We leverage the capability of diffusion models to learn flexible and powerful generative priors by modeling the 3D garment in a 2D parameter space independently from the mesh resolution. This representation allows to learn a template-specific latent diffusion model. This allows to condition global and local geometry with body and cloth material information. We quantitatively and qualitatively evaluate D-Garment on both simulations and data captured with a multi-view acquisition platform. Compared to recent baselines, our method is more realistic and accurate in terms of shape similarity and physical validity metrics. Code and data are available for research purposes at https://dumoulina.github.io/d-garment/

URL PDF HTML ☆

赞 0 踩 0

2503.18018 2026-04-09 cs.AI cs.LG

Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar

2503.03088 2026-04-09 cs.CV cs.AR cs.LG

AHCQ-SAM: Toward Accurate and Hardware-Compatible Post-Training Segment Anything Model Quantization

Wenlun Zhang, Yunshan Zhong, Weiqi Yan, Shengchuan Zhang, Shimpei Ando, Kentaro Yoshioka

Comments Update to AHCQ-SAM

详情

英文摘要

The Segment Anything Model (SAM) has revolutionized image and video segmentation with its powerful zero-shot capabilities. However, its massive parameter scale and high computational demands hinder efficient deployment on resource-constrained edge devices. While Post-Training Quantization (PTQ) offers a practical solution, existing methods still fail to handle four critical quantization challenges: (1) ill-conditioned weights; (2) skewed and long-tailed post-GELU activations; (3) pronounced inter-channel variance in linear projections; and (4) exponentially scaled and heterogeneous attention scores. To mitigate these bottlenecks, we propose AHCQ-SAM, an accurate and hardware-compatible PTQ framework featuring four synergistic components: (1) Activation-aware Condition Number Reduction (ACNR), which regularizes weight matrices via a proximal point algorithm to suppress ill-conditioning; (2) Hybrid Log-Uniform Quantization (HLUQ), which combines power-of-two and uniform quantizers to capture skewed post-GELU activations; (3) Channel-Aware Grouping (CAG), which clusters channels with homogeneous statistics to achieve high accuracy with minimal hardware overhead; and (4) Logarithmic Nonlinear Quantization (LNQ), which utilizes logarithmic transformations to adaptively adjust quantization resolution for exponential and heterogeneous attention scores. Experimental results demonstrate that AHCQ-SAM outperforms current methods on SAM. Compared with the SOTA method, it achieves a 15.2% improvement in mAP for 4-bit SAM-B with Faster R-CNN on the COCO dataset. Furthermore, we establish a PTQ benchmark for SAM2, where AHCQ-SAM yields a 14.01% improvement in J&F for 4-bit SAM2-Tiny on the SA-V Test dataset. Finally, FPGA-based implementation validates the practical utility of AHCQ-SAM, delivering a 7.12x speedup and a 6.62x power efficiency improvement over the floating-point baseline.

URL PDF HTML ☆

赞 0 踩 0

2503.02129 2026-04-09 cs.LG cs.AI math.ST stat.TH

Path Regularization: A Near-Complete and Optimal Nonasymptotic Generalization Theory for Multilayer Neural Networks and Double Descent Phenomenon

Hao Yu

2503.00450 2026-04-09 cs.CV

Unsupervised Source-Free Ranking of Biomedical Segmentation Models Under Distribution Shift

Joshua Talks, Kevin Marchesini, Luca Lumetti, Federico Bolelli, Anna Kreshuk

Comments 15 pages, 4 figures

2502.18978 2026-04-09 cs.CL cs.AI

Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning

Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong

Comments Accepted to EMNLP Findings 2025

2502.17421 2026-04-09 cs.CL cs.AI cs.LG

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

Comments Accepted by ACL'25 (Main)

2501.16150 2026-04-09 cs.AI cs.HC cs.SY eess.SY

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann

详情

DOI: 10.1613/jair.1.19490
Journal ref: Journal of Artificial Intelligence Research, vol. 85, 2026

英文摘要

Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices -- such as desktops, mobile phones, and web platforms -- given instructions in natural language. These agents can automate tasks by controlling software via low-level actions like mouse clicks and touchscreen gestures. However, despite rapid progress, ACUs are not yet mature for everyday use. In this survey, we investigate the state-of-the-art, trends, and research gaps in the development of practical ACUs. We provide a comprehensive review of the ACU landscape, introducing a unifying taxonomy spanning three dimensions: (I) the domain perspective, characterizing agent operating contexts; (II) the interaction perspective, describing observation modalities (e.g., screenshots, HTML) and action modalities (e.g., mouse, keyboard, code execution); and (III) the agent perspective, detailing how agents perceive, reason, and learn. We review 87 ACUs and 33 datasets across foundation model-based and classical approaches through this taxonomy. Our analysis identifies six major research gaps: insufficient generalization, inefficient learning, limited planning, low task complexity in benchmarks, non-standardized evaluation, and a disconnect between research and practical conditions. To address these gaps, we advocate for: (a) vision-based observations and low-level control to enhance generalization; (b) adaptive learning beyond static prompting; (c) effective planning and reasoning methods and models; (d) benchmarks that reflect real-world task complexity; (e) standardized evaluation based on task success; (f) aligning agent design with real-world deployment constraints. Together, our taxonomy and analysis establish a foundation for advancing ACU research toward general-purpose agents for robust and scalable computer use.

URL PDF HTML ☆

赞 0 踩 0

2412.04880 2026-04-09 cs.CV eess.IV

MozzaVID: Mozzarella Volumetric Image Dataset

Pawel Tomasz Pieta, Peter Winkel Rasmussen, Anders Bjorholm Dahl, Jeppe Revall Frisvad, Siavash Arjomand Bigdeli, Carsten Gundlach, Anders Nymark Christensen

Comments Accepted at MetaFood (CVPR 2026 Workshop)

2411.19121 2026-04-09 cs.CV cs.AI

MSG Score: Automated Video Verification for Reliable Multi-Scene Generation

Daewon Yoon, Hyeongseok Lee, Wonsik Shin, Sangyu Han, Nojun Kwak

Comments 8 pages, 5 figures, 1 table, Accepted AAAI 2026 CVM workshop

2411.09553 2026-04-09 cs.CV

OOD-SEG: Exploiting out-of-distribution detection techniques for learning image segmentation from sparse multi-class positive-only annotations

Junwen Wang, Zhonghao Wang, Oscar MacCormac, Jonathan Shapey, Tom Vercauteren

Comments Accepted in MedIA

详情

DOI: 10.1016/j.media.2026.104046

英文摘要

Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach which broadly falls within the positive-unlabelled (PU) learning paradigm and exploits tools from OOD detection techniques. Our framework learns only from sparsely annotated pixels from multiple positive-only classes and does not use any annotation for the background class. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as \emph{background} in standard segmentation formulations. To the best of our knowledge, this work is the first to formulate multi-class segmentation with sparse positive-only annotations as a pixel-wise PU learning problem and to address it using OOD detection techniques. Here, we forgo the need for background annotation and consider these together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel-level, any OOD detection approaches designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metric for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.

URL PDF HTML ☆

赞 0 踩 0

2410.17473 2026-04-09 cs.LG

DROP: Distributional and Regular Optimism and Pessimism for Reinforcement Learning

Taisuke Kobayashi

Comments 51 pages, 15 figures (accepted in Neural Computation)

2410.08334 2026-04-09 cs.CL cs.AI cs.LG cs.MA

Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

Tirthankar Mittra