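arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.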

2511.19365 2026-04-09 cs.CV cs.AI

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian

Comments Accepted to CVPR2026. Project Page: https://zehong-ma.github.io/DeCo. Code Repository: https://github.com/Zehong-Ma/DeCo

Details

Abstract

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework (DeCo). To decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT, which frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.
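The idea of a frequency-aware loss can be illustrated with a small sketch. This is not the authors' implementation: it assumes a 1D signal, an orthonormal DCT-II as the frequency transform, and a hypothetical `decaying_weights` schedule that down-weights higher frequencies. The names `dct2`, `frequency_weighted_loss`, and `decaying_weights` are introduced here for illustration only, and `pred` and `target` stand in for predicted and target flow-matching velocities.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a 1D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def decaying_weights(n, decay=0.5):
    """Hypothetical weighting: weight 1 at frequency 0, smaller at higher k."""
    return [1.0 / (1.0 + decay * k) for k in range(n)]


def frequency_weighted_loss(pred, target, weights):
    """Mean of per-frequency squared errors, each scaled by its weight."""
    err = [p - t for p, t in zip(pred, target)]
    coeffs = dct2(err)
    return sum(w * c * c for w, c in zip(weights, coeffs)) / len(coeffs)
```

Because the transform is orthonormal, uniform weights reduce this loss to the ordinary mean squared error; a non-uniform schedule only changes how much each frequency band contributes.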

2511.17844 2026-04-09 cs.CV cs.AI

Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov

2511.11663 2026-04-09 cs.LG cs.AI

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

Zhixiong Zhao, Fangxin Liu, Junjie Wang, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

Comments Accepted at AAAI 2026

2511.10354 2026-04-09 cs.CL cs.AI cs.DL cs.IR

Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Andrea Schimmenti, Valentina Pasqual, Fabio Vitali, Marieke van Erp

Comments 46 pages

Details

Abstract

Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts...), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

2511.08409 2026-04-09 cs.AI

Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs

Junxian Li, Xinyue Xu, Sai Ma, Di Zhang, Sichao Li

Comments Accepted by ACL 2026 Findings

2511.08019 2026-04-09 cs.RO cs.SY eess.SY

Model Predictive Control via Probabilistic Inference: A Tutorial and Survey

Kohei Honda

Comments 41 pages, 7 figures

2511.03595 2026-04-09 cs.LG cs.SY eess.SY

Tensor-Efficient High-Dimensional Q-learning

Junyi Wu, Dan Li

Comments 61 pages, 7 figures. v2 updated to include additional experimental results and refined proofs

2511.03295 2026-04-09 cs.CL cs.AI

How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli

Comments Main additions with respect to v2: Section 5.5 "Low-Resource Evaluation", Section 5.6 "Validation against Human Judgments", two instances of XLR-Segmenter: XLR-SimAlign and XLR-LaBSE, per-language analyses

Details

Abstract

Automatic evaluation of ST systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In MT, recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio, namely ASR transcripts and back-translations of the reference translation, and introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. The robustness of these findings is further confirmed by experiments on a low-resource language pair (Bemba-English) and by a direct validation against human quality judgments. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
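The 20% word error rate threshold reported above suggests a simple selection rule, sketched below. This is an illustration rather than the authors' code: `word_error_rate` is the standard word-level edit distance divided by reference length, while `choose_synthetic_source` and its labels are hypothetical names. It also assumes the WER is estimated on held-out data that has transcripts, since the target condition has none.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def choose_synthetic_source(wer, threshold=0.20):
    """Prefer ASR transcripts below the WER threshold, else back-translation."""
    return "asr_transcript" if wer < threshold else "back_translation"
```

For instance, one dropped word in a six-word reference gives a WER of 1/6, which falls under the threshold and selects the ASR transcript.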

2511.00643 2026-04-09 cs.CV

Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach

Oluwatosin Alabi, Meng Wei, Charlie Budd, Tom Vercauteren, Miaojing Shi

Details

DOI: 10.1007/s11548-026-03596-1

Abstract

Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.

2510.26083 2026-04-09 cs.LG cs.AI

Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism

Yuhua Jiang, Shuang Cheng, Yihao Liu, Ermo Hua, Che Jiang, Weigao Sun, Yu Cheng, Feifei Gao, Biqing Qi, Bowen Zhou

2510.20220 2026-04-09 cs.LG cs.NA math.NA

Alternatives to the Laplacian for Scalable Spectral Clustering with Group Fairness Constraints

Iván Ojeda-Ruiz, Young Ju Lee, Malcolm Dickens, Leonardo Cambisaca

2510.20200 2026-04-09 cs.LG

Approximate Replicability in Learning

Max Hopkins, Russell Impagliazzo, Christopher Ye

Comments 73 pages, 1 figure

2510.18196 2026-04-09 cs.CL cs.AI

Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

Yoshinari Fujinuma

Comments To appear at ACL 2026

2510.15510 2026-04-09 cs.CV cs.RO

Exploring Conditions for Diffusion models in Robotic Control

Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim

Comments Accepted to CVPR 2026. Project page: https://orca-rc.github.io/

2510.14718 2026-04-09 cs.CL

Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI

Xingmeng Zhao, Tongnian Wang, Dan Schumacher, Veronica Rammouz, Anthony Rios

Comments 8 pages main + Appendix; Accepted to ACL Findings 2026

2510.14063 2026-04-09 cs.RO

Adaptive Obstacle-Aware Task Assignment and Planning for Heterogeneous Robot Teaming

Nan Li, Jiming Ren, Haris Miller, Samuel Coogan, Karen M. Feigh, Ye Zhao

Comments 24 pages, 19 figures, 5 tables

2510.12088 2026-04-09 cs.AI cs.CL cs.LG

One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Zaid Khan, Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Comments Accepted to ICLR 2026. Project page: https://onelife-worldmodel.github.io/; 44 pages

Details

Abstract

Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting: learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
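The precondition-effect structure described above can be sketched in a few lines. This is a deterministic simplification for illustration only; OneLife itself places its laws in a probabilistic programming framework, which this sketch does not model. The `Law` class, the `active_laws` and `step` helpers, and the two example laws are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]


@dataclass
class Law:
    name: str
    precondition: Callable[[State, str], bool]  # does the law apply here?
    effect: Callable[[State], State]            # returns an updated copy


def active_laws(laws: List[Law], state: State, action: str) -> List[Law]:
    """Select only the laws whose precondition holds for this state and action."""
    return [law for law in laws if law.precondition(state, action)]


def step(laws: List[Law], state: State, action: str) -> State:
    """Apply the effects of the active laws; inactive laws are never evaluated."""
    next_state = dict(state)
    for law in active_laws(laws, state, action):
        next_state = law.effect(next_state)
    return next_state


# Hypothetical example laws for a Crafter-like world.
LAWS = [
    Law("collect_wood",
        lambda s, a: a == "collect" and s["near_tree"] == 1,
        lambda s: {**s, "wood": s["wood"] + 1}),
    Law("eat_food",
        lambda s, a: a == "eat" and s["food"] > 0,
        lambda s: {**s, "food": s["food"] - 1}),
]
```

Routing each transition through `active_laws` mirrors the abstract's point that only relevant laws take part in a prediction, so the cost of a step depends on how many laws fire rather than on how many exist.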

2510.11539 2026-04-09 cs.RO math.OC

Simultaneous Calibration of Noise Covariance and Kinematics for State Estimation of Legged Robots via Bi-level Optimization

Denglin Cheng, Jiarong Kang, Xiaobin Xiong

2510.08052 2026-04-09 cs.CV

RASALoRE: Region Aware Spatial Attention with Location-based Random Embeddings for Weakly Supervised Anomaly Detection in Brain MRI Scans

Bheeshm Sharma, Karthikeyan Jaganathan, Balamurugan Palaniappan

Comments Accepted at the 36th British Machine Vision Conference (BMVC-2025)

2510.00635 2026-04-09 cs.CV

Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang, Daiheng Gao, Yun Zhou, Yanxia Chang, Zheng Zhu, Yeying Jin, Wenjun Wu

2509.25477 2026-04-09 cs.CL

The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis

Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Eusebio Ricardez Vazquez, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad

2509.24857 2026-04-09 cs.CL cs.CY

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver

Comments Accepted for publication in JMIR Mental Health. DOI: 10.2196/88435

Details

DOI: 10.2196/88435

Abstract

Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, how safely they detect and respond to crises such as suicidal ideation and self-harm remains unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinically informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Performance varies across models; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.

2509.23435 2026-04-09 cs.SD cs.AI cs.MM eess.AS

AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

Wenyu Li, Xiaoqi Jiao, Yi Chang, Guangyan Zhang, Yiwen Guo

2509.22981 2026-04-09 cs.LG math.OC

MDP modeling for multi-stage stochastic programs

David P. Morton, Oscar Dowson, Bernardo K. Pagnoncelli

2509.17183 2026-04-09 cs.CL cs.AI cs.LG

LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

Junsong Li, Jie Zhou, Bihao Zhan, Yutao Yang, Qianjun Pan, Shilian Chen, Tianyu Huai, Xin Li, Qin Chen, Liang He

2509.15623 2026-04-09 cs.CV

PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Zhuoyao Liu, Yang Liu, Wentao Feng, Shudong Huang

Comments 7 pages, 3 figures

2509.09926 2026-04-09 cs.LG cs.CV

LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning in Open-World Scenarios

Zhiyuan Huang, Jiahao Chen, Bing Su

2509.08873 2026-04-09 cs.SD physics.data-an

In situ estimation of the acoustic surface impedance using simulation-based inference

Jonas M. Schmid, Johannes D. Schmid, Martin Eser, Steffen Marburg

Details

DOI: 10.1121/10.0042242
Journal ref: The Journal of the Acoustical Society of America, 159(1), 422-436, 2026

Abstract

Accurate acoustic simulations of enclosed spaces require precise boundary conditions, typically expressed through surface impedances for wave-based methods. Conventional measurement techniques often rely on simplifying assumptions about the sound field and mounting conditions, limiting their validity for real-world scenarios. To overcome these limitations, this study introduces a Bayesian framework for the in situ estimation of frequency-dependent acoustic surface impedances from sparse interior sound pressure measurements. The approach employs simulation-based inference, which leverages the expressiveness of modern neural network architectures to directly map simulated data to posterior distributions of model parameters, bypassing conventional sampling-based Bayesian approaches and offering advantages for high-dimensional inference problems. Impedance behavior is modeled using a damped oscillator model extended with a fractional calculus term. The framework is verified on a finite element model of a cuboid room and further tested with impedance tube measurements used as reference, achieving robust and accurate estimation of all six individual impedances. Application to a numerical car cabin model further demonstrates reliable uncertainty quantification and high predictive accuracy even for complex-shaped geometries. Posterior predictive checks and coverage diagnostics confirm well-calibrated inference, highlighting the method's potential for generalizable, efficient, and physically consistent characterization of acoustic boundary conditions in real-world interior environments.

2509.01986 2026-04-09 cs.CV cs.AI

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Ziyun Zeng, David Junhao Zhang, Wei Li, Mike Zheng Shou

Comments ICLR 2026 Camera Ready Version; Add more discussions and fix typos

2508.21618 2026-04-09 cs.LG cs.AI

Physics-Informed Spectral Modeling for Hyperspectral Imaging

Zuzanna Gawrysiak, Krzysztof Krawiec

Comments Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works