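arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.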

2512.23221 2026-04-20 cs.CV cs.AI

Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

Youngchae Kwon, Jinyoung Choi, Injung Kim

Comments 20 pages, 6 figures

Details

DOI: 10.1007/s00521-026-12030-1

English abstract

Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively, in terms of average precision (AP).


2512.22278 2026-04-20 cs.CV

FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

Hussain Alasmawi, Numan Saeed, Mohammad Yaqub

Details

English abstract

The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper is accepted.


2512.20025 2026-04-20 cs.CV

A Contextual Analysis of Driver-Facing and Dual-View Video Inputs for Distraction Detection in Naturalistic Driving Environments

Anthony Dontoh, Stephanie Ivey, Armstrong Aboah

Details

DOI: 10.1109/FMLDS67896.2025.00059
Journal ref: 2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), 02-05 Nov. 2025

English abstract

Despite increasing interest in computer vision-based distracted driving detection, most existing models rely exclusively on driver-facing views and overlook crucial environmental context that influences driving behavior. This study investigates whether incorporating road-facing views alongside driver-facing footage improves distraction detection accuracy in naturalistic driving conditions. Using synchronized dual-camera recordings from real-world driving, we benchmark three leading spatiotemporal action recognition architectures: SlowFast-R50, X3D-M, and SlowOnly-R50. Each model is evaluated under two input configurations: driver-only and stacked dual-view. Results show that while contextual inputs can improve detection in certain models, performance gains depend strongly on the underlying architecture. The single-pathway SlowOnly model achieved a 9.8 percent improvement with dual-view inputs, while the dual-pathway SlowFast model experienced a 7.2 percent drop in accuracy due to representational conflicts. These findings suggest that simply adding visual context is not sufficient and may lead to interference unless the architecture is specifically designed to support multi-view integration. This study presents one of the first systematic comparisons of single- and dual-view distraction detection models using naturalistic driving data and underscores the importance of fusion-aware design for future multimodal driver monitoring systems.


2512.14554 2026-04-20 cs.CL cs.AI

VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu

2512.12858 2026-04-20 cs.LG cs.AI

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta

Details

Journal ref: 2026 IEEE Conference on Artificial Intelligence (CAI)

English abstract

Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability when prompts are phrased with minor differences, even when semantically equivalent. Such inconsistency undermines trust, complicates compliance, and disrupts user experience. While personalization is desirable in certain contexts, many enterprise scenarios, such as HR onboarding, customer support, or policy disclosure, require invariant information delivery regardless of phrasing or prior conversational history. Existing approaches, including retrieval-augmented generation (RAG) and temperature tuning, improve factuality or reduce stochasticity, but cannot guarantee stability across equivalent prompts. In this paper, we propose a reinforcement learning framework based on Group Relative Policy Optimization (GRPO) to directly optimize for consistency. Unlike prior applications of GRPO, which have been limited to reasoning and code generation, we adapt GRPO to enforce the stability of information content across groups of semantically equivalent prompts. We introduce entropy-based helpfulness and stability rewards, treating prompt variants as groups and resetting conversational context to isolate phrasing effects. Experiments on investment and job recommendation tasks show that our GRPO-fine-tuned model reduces variability compared to the baseline LLM. To our knowledge, this is a novel application of GRPO for aligning LLMs toward information consistency, reframing variability not as an acceptable feature of generative diversity, but as a correctable flaw in enterprise deployments.
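
As a rough illustration of the group-relative idea behind GRPO mentioned in this abstract, the sketch below normalizes each response's reward against the mean and spread of its own group. It is not the authors' implementation: the function name `group_relative_advantages`, the `eps` term, and the reward values are assumptions made for illustration, and the abstract does not specify how the entropy-based rewards are computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its own group (mean-centered, spread-scaled)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for responses to four paraphrases of the same question.
rewards = [0.9, 0.7, 0.8, 0.2]
advantages = group_relative_advantages(rewards)
```

Under this normalization, a response that scores below its group's average gets a negative advantage, so treating prompt variants as one group penalizes the outlier phrasing rather than the group as a whole.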


2512.07515 2026-04-20 cs.CL cs.AI

TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang

Comments Accepted by ACL 2026

2512.07173 2026-04-20 cs.LG

Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu

Comments 12 pages, 3 figures. Accepted to Findings of ACL 2026

2512.05722 2026-04-20 cs.LG physics.chem-ph

Teaching Language Models Mechanistic Explainability Through MechSMILES

Théo A. Neukomm, Zlatko Jončev, Philippe Schwaller

Details

English abstract

Chemical reaction mechanisms are the foundation of how chemists evaluate reactivity and feasibility, yet current Computer-Assisted Synthesis Planning (CASP) systems operate without this mechanistic reasoning. We introduce a computational framework that teaches language models to predict reaction mechanisms through arrow-pushing formalism, a century-old notation that tracks electron flow while enforcing conservation of mass and charge. This mechanistic understanding enables three capabilities that are difficult or impossible with current methods: post-hoc validation of CASP proposals by reconstructing physically plausible electron pathways, holistic atom-to-atom mapping that tracks all atoms including hydrogens, and extraction of catalyst-aware reaction templates that distinguish recycled catalysts from spectator species. Central to our approach is MechSMILES, a compact textual format encoding molecular structure and electron flow through three arrow types, designed within a Python-based environment that enforces conservation laws and eliminates the possibility of atom hallucination. We trained and benchmarked models on four mechanism prediction tasks of increasing complexity using the main mechanistic datasets in the literature. On our most challenging task, predicting complete mechanisms given only reactants, conditions, and the desired product, our models achieve 93.2% and 73.3% pathway retrieval on the FlowER and mech-USPTO-31k datasets respectively, with top-3 retrieval reaching 97.6% and 86.5%. Furthermore, the framework rapidly learns new reaction classes, with strong mechanistic predictions for ozonolysis and Suzuki cross-coupling emerging from as few as 40 training examples each. By grounding predictions in physically meaningful electron movements, this work provides an architecture-agnostic, open-source foundation for more explainable and chemically valid CASP.


2512.04847 2026-04-20 cs.SD cs.AI

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed

2512.03053 2026-04-20 cs.LG cs.AI cs.AR cs.PL

Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation

Andrew S. Cassidy, Guillaume Garreau, Jay Sivagnaname, Mike Grassi, Bernard Brezzo, John V. Arthur, Dharmendra S. Modha

Comments 7 pages, 2 figures, 7 tables

2511.13131 2026-04-20 cs.AI cs.CV cs.ET cs.NI

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

Anshul Kumar, Gagan Raj Gupta, Manish Rai, Apu Chakraborty, Ashutosh Modi, Abdelaali Chaoub, Soumajit Pramanik, Moyank Giri, Yashwanth Holla, Sunny Kumar, M. V. Kiran Sooraj

2511.10262 2026-04-20 cs.CL cs.AI eess.AS

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, Irwin King

Comments Accepted to Findings of ACL 2026

2511.03056 2026-04-20 cs.CL cs.AI cs.LG

Reading Between the Lines: The One-Sided Conversation Problem

Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota

Comments 8 pages, 6 figures, 4 tables. Accepted to ACL Findings 2026

2511.00739 2026-04-20 cs.AI cs.LG cs.MA

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Ritik Raj, Souvik Kundu, Ishita Vohra, Hong Wang, Tushar Krishna

2510.27617 2026-04-20 cs.AI

VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation

Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan

2510.24887 2026-04-20 cs.CV

Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS

Daniele L. V. dos Santos, Thiago B. Pereira, Carlos Eduardo G. R. Alves, Richard J. M. G. Tello, Francisco de A. Boldt, Thiago M. Paixão

Comments This work was accepted for presentation at IEEE SAS 2026

2510.24328 2026-04-20 cs.CL cs.AI

Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Hunzalah Hassan Bhatti, Firoj Alam

Comments Cultural Knowledge, Everyday Knowledge, Open-Ended Question, Chain-of-Thought, Large Language Models, Native, Multilingual, Language Diversity

2510.22149 2026-04-20 cs.LG cs.AI cs.CL cs.CR cs.CV cs.DC

Power to the Clients: Federated Learning in a Dictatorship Setting

Mohammadsajad Alipour, Mohammad Mohammadi Amiri

2510.21977 2026-04-20 cs.AI

Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions

Ji Huang, Mengfei Li, Shuai Shao

2510.21934 2026-04-20 cs.LG stat.ML

Joint Score-Threshold Optimization for Interpretable Risk Assessment

Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Kimia Ghobadi

2510.20616 2026-04-20 cs.LG

On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

Aki Rehn, Linzh Zhao, Mikko A. Heikkilä, Antti Honkela

Comments ICLR 2026

2510.20299 2026-04-20 cs.LG cs.AI

DB-FGA-Net: Dual Backbone Frequency Gated Attention Network for Multi-Class Brain Tumor Classification with Grad-CAM Interpretability

Saraf Anzum Shreya, MD. Abu Ismail Siddique, Sharaf Tasnim

Comments 25 pages, 14 figures, 13 tables

2510.17210 2026-04-20 cs.CL

Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting

Chenchen Tan, Youyang Qu, Xinghao Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao

Comments Accepted by NeurIPS 2025

Details

English abstract

The increase in computing power and the necessity of AI-assisted decision-making are driving the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data by LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs' reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearned content. AS realizes these objectives through two attention-level interventions: importance-aware suppression, applied to the unlearning set to reduce reliance on memorized knowledge, and attention-guided retention enhancement, which reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.


2510.13920 2026-04-20 cs.CL

FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

Ye Yuan, Mohammad Amin Shabani, Siqi Liu

Comments Accepted by ACL 2026 Findings

2510.13829 2026-04-20 cs.CL cs.AI

A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han

Comments ACL 2026

2510.13220 2026-04-20 cs.AI cs.CL

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi

Comments ICLR 2026

2510.12700 2026-04-20 cs.LG cs.AI cs.CG math.AT stat.ML

Topological Signatures of ReLU Neural Network Activation Patterns

Vicente Bosca, Tatum Rask, Sunia Tanweer, Andrew R. Tawfeek, Branden Stone

2510.10959 2026-04-20 cs.LG cs.AI cs.CL stat.ML

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Ai Jian, Kejiang Chen, Xing Hu

Comments 16 pages, 4 figures

2510.08480 2026-04-20 cs.CV

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li

2510.07774 2026-04-20 cs.CL

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Pinjia He

Comments Accepted by ACL 2026 Main, 22 pages, 10 figures, 7 Tables