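arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.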

2603.14432 2026-04-22 cs.SD

Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Comments Accepted to Findings of ACL 2026

Abstract

Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.

2603.13779 2026-04-22 cs.CV cs.AI

AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng

Comments Code and models are released at https://github.com/jam-cc/AD-Copilot

Abstract

Multimodal Large Language Models (MLLMs) have achieved impressive success in natural visual understanding, yet they consistently underperform in industrial anomaly detection (IAD). This is because MLLMs trained mostly on general web data differ significantly from industrial images. Moreover, they encode each image independently and can only compare images in the language space, making them insensitive to subtle visual differences that are key to IAD. To tackle these issues, we present AD-Copilot, an interactive MLLM specialized for IAD via visual in-context comparison. We first design a novel data curation pipeline to mine inspection knowledge from sparsely labeled industrial images and generate precise samples for captioning, VQA, and defect localization, yielding a large-scale multimodal dataset Chat-AD rich in semantic signals for IAD. On this foundation, AD-Copilot incorporates a novel Comparison Encoder that employs cross-attention between paired image features to enhance multi-image fine-grained perception, and is trained with a multi-stage strategy that incorporates domain knowledge and gradually enhances IAD skills. In addition, we introduce MMAD-BBox, an extended benchmark for anomaly localization with bounding-box-based evaluation. The experiments show that AD-Copilot achieves 82.3% accuracy on the MMAD benchmark, outperforming all other models without any data leakage. In the MMAD-BBox test, it achieves a maximum improvement of $3.35\times$ over the baseline. AD-Copilot also exhibits excellent generalization of its performance gains across other specialized and general-purpose benchmarks. Remarkably, AD-Copilot surpasses human expert-level performance on several IAD tasks, demonstrating its potential as a reliable assistant for real-world industrial inspection. All datasets and models will be released for the broader benefit of the community.

2603.08899 2026-04-22 cs.CL cs.LG

ConFu: Contemplate the Future for Better Speculative Sampling

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

Comments v3: Added stress test with long drafts (DL=12, top-k=1) and tail-acceptance (survival) analysis. Earlier versions added Qwen3-4B results and ablations

2603.01455 2026-04-22 cs.CV cs.AI cs.CL cs.IR cs.MM

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia

Comments Accepted by ACL 2026 Main. 17 pages, 7 figures, 8 tables. TL;DR: We propose MM-Mem, a cognition-inspired, dual-trace hierarchical memory framework for long-horizon video understanding grounded in Fuzzy-Trace Theory. It features adaptive memory compression via the Information Bottleneck and employs an entropy-driven top-down retrieval to access fine-grained details only when necessary

2602.20323 2026-04-22 cs.RO cs.AI

PhysMem: Scaling Test-time Physical Memory for Robot Manipulation

Haoyang Li, Yang You, Hao Su, Leonidas Guibas

2602.19991 2026-04-22 cs.CL

Cross-lingual Matryoshka Representation Learning across Speech and Text

Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina

Comments Preprint, under review

2602.19790 2026-04-22 cs.LG stat.ML

Drift Localization using Conformal Predictions

Fabian Hinder, Valerie Vaquet, Johannes Brinkrolf, Barbara Hammer

Comments Paper is an extended version; the original was published at the 34th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) 2026

2602.15173 2026-04-22 cs.AI

Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs

Luise Ge, Yongyan Zhang, Yevgeniy Vorobeychik

2602.12735 2026-04-22 cs.CV cs.CL

VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding

2602.12708 2026-04-22 cs.LG

Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

Jon Irureta, Gorka Azkune, Jon Imaz, Aizea Lojo, Javier Fernandez-Marques

2602.11199 2026-04-22 cs.CL cs.LG

When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

Jiale Zhao, Ke Fang, Lu Cheng

2602.09642 2026-04-22 cs.CL cs.AI

MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do

Comments Accepted to ACL 2026 (findings)

2602.06430 2026-04-22 cs.CL cs.AI

Investigating the structure of emotions by analyzing similarity and association of emotion words

Fumitaka Iwaki, Tatsuji Takahashi

Comments 5 figures, 8 tables; data github link added

2602.06400 2026-04-22 cs.CV cs.AI cs.RO

TFusionOcc: T-Primitive Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction

Zhenxing Ming, Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall

2602.05437 2026-04-22 cs.CL

Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models

Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani

2602.03108 2026-04-22 cs.CL

ChemPro: A Progressive Chemistry Benchmark for Large Language Models

Aaditya Baranwal, Shruti Vyas

Comments Accepted at Artificial Intelligence Chemistry Journal

2602.01651 2026-04-22 cs.LG cs.AI

On the Spatiotemporal Dynamics of Generalization in Neural Networks

Zichao Wei

2602.00758 2026-04-22 cs.CL cs.IR

Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study

Ali El Lahib, Ying-Jieh Xia, Zehan Li, Yuxuan Wang, Xinyu Pi

Comments 9 pages, 2 figures. Accepted to ACL 2026

2601.22737 2026-04-22 cs.CV

Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models

Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu, Xiaobo Xia, Fei Shen, Tat-Seng Chua

Abstract

The robust safety of Vision-Language Large Models (VLLMs) against joint multilingual and multimodal threats remains severely underexplored. Current benchmarks typically isolate these dimensions, being either multilingual but text-only, or multimodal but monolingual. While recent red-teaming efforts attempt to bridge this gap by rendering harmful prompts as images, their overreliance on typography-style visuals and lack of semantically grounded image-text pairs fail to capture realistic cross-modal interactions under multilingual and multimodal conditions. To address this, we introduce Lingua-SafetyBench, a comprehensive benchmark of 100,440 harmful image-text pairs spanning 10 languages. Crucially, Lingua-SafetyBench explicitly partitions data into image-dominant and text-dominant subsets to precisely disentangle sources of risk. Extensive evaluations reveal that current VLLMs retain non-negligible vulnerabilities under these joint inputs. Linguistically, requests in Non-High-Resource Languages (Non-HRLs) and non-Latin scripts generally pose greater threats. Furthermore, analyzing modality-language interactions uncovers a striking asymmetry: in High-Resource Languages (HRLs), models are most vulnerable to image-dominant risks, whereas in Non-HRLs, text-dominant risks severely degrade safety performance. Finally, a controlled study on the Qwen series demonstrates that while model scaling and iterative upgrades improve overall safety, they disproportionately benefit HRLs. This exacerbates the safety disparity between HRLs and Non-HRLs under text-dominant risks, highlighting that achieving robust safety requires dedicated language- and modality-aware alignment strategies beyond mere scaling. The code and dataset will be available at https://github.com/zsxr15/Lingua-SafetyBench. Warning: this paper contains examples with unsafe content.

2601.18296 2026-04-22 cs.CL cs.AI cs.LG

Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang

Comments ACL 2026 main

2601.18027 2026-04-22 cs.AI cs.CL

Sentipolis: Emotion-Aware Agents for Social Simulations

Chiyuan Fu, Lyuhao Chen, Yunze Xiao, Weihao Xuan, Carlos Busso, Mona Diab

2601.17647 2026-04-22 cs.LG cs.AI

Knowledge-Guided Time-Varying Causal Inference for Arctic Sea Ice Dynamics

Akila Sampath, Vandana Janeja, Jianwu Wang

2601.15755 2026-04-22 cs.CL

Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs

Tristan Williams, Franziska Weeber, Sebastian Padó, Alan Akbik

Comments ACL 2026 Findings

2601.15488 2026-04-22 cs.CL cs.AI

Multi-Persona Thinking for Bias Mitigation in Large Language Models

Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou

Comments 15 pages

2601.15349 2026-04-22 cs.RO cs.SY eess.SY

Preparation and Motion Study of Magnetically Driven Micro Soft Robot Mimicking the Cownose Ray

Jiaqing Chang, Song Gao, Chaowei Dong, zhaobang Li, Yang Liu

Comments There have several mistakes on it

2601.13729 2026-04-22 cs.CL

On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation

Weichuan Wang, Mingyang Liu, Linqi Song, Chen Ma

Comments 9 pages, 22 figures

2601.11301 2026-04-22 cs.CV

SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2

Gergely Dinya, András Gelencsér, Krisztina Kupán, Clemens Küpper, Kristóf Karacs, Anna Gelencsér-Horváth

2601.09953 2026-04-22 cs.CL

Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger

2601.04237 2026-04-22 cs.AI cs.CL cs.LG

SAGE-32B: Agentic Reasoning via Iterative Distillation

Basab Jha, Firoj Paudel, Ujjwal Puri, Ethan Henkel, Zhang Yuting, Mateusz Kowalczyk, Mei Huang, Choi Donghyuk, Wang Junhao

Comments 23 Pages, 3 figures, 4 tables

2601.03066 2026-04-22 cs.CL cs.AI cs.LG

Do LLMs Encode Functional Importance of Reasoning Tokens?

Janvijay Singh, Dilek Hakkani-Tür

Comments Updated after ACL Main 2026 acceptance; 25 pages, 8 figures, 4 tables