arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14758 2026-05-15 cs.AI

Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning

Luca Marzari, Enrico Marchesini

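AI Summary This paper studies probabilistic verification of recurrent neural network (RNN) policies in partially observable reinforcement learning. Existing tools for verifying RNN policies rely on restrictive assumptions or coarse approximations, which makes their results overly conservative. To address this, the paper proposes RNN-ProVe, a probabilistic verification framework that uses policy-driven sampling to estimate the probability of undesired behaviors over the hidden state space reachable under the policy, and derives statistical error bounds to give high-confidence verification results. Experiments show that the method provides more quantitative, feasibility-aware probabilistic guarantees on single-agent and multi-agent tasks.

Comments Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI) 2026

Abstract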

History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose RNN Probabilistic Verification (RNN-ProVe), a probabilistic framework that estimates the likelihood of undesired behaviors in RNN-based policies. RNN-ProVe uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that RNN-ProVe yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
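
Illustrative sketch (not from the paper): the abstract does not say which statistical error bound is derived. Assuming i.i.d. samples and a two-sided Hoeffding bound as a stand-in, a sampled violation rate can be turned into a high-confidence interval as shown here. The callables `violates` and `sample_state` are hypothetical placeholders for a property check and a policy-driven hidden-state sampler.

```python
import math
import random


def hoeffding_epsilon(n_samples: int, delta: float) -> float:
    """Half-width eps with P(|p_hat - p| <= eps) >= 1 - delta (two-sided Hoeffding)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))


def estimate_violation_rate(violates, sample_state, n_samples, delta=0.01, seed=0):
    """Monte Carlo estimate of the violation probability with a Hoeffding interval.

    Assumes the sampled states are independent and identically distributed.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if violates(sample_state(rng)))
    p_hat = hits / n_samples
    eps = hoeffding_epsilon(n_samples, delta)
    return p_hat, max(0.0, p_hat - eps), min(1.0, p_hat + eps)


# Toy stand-ins (hypothetical): a scalar "hidden state" drawn uniformly from [0, 1),
# with a violation whenever the state exceeds 0.9 (true probability 0.1).
p_hat, lower, upper = estimate_violation_rate(
    violates=lambda h: h > 0.9,
    sample_state=lambda rng: rng.random(),
    n_samples=20000,
)
print(f"estimate={p_hat:.4f}, interval=[{lower:.4f}, {upper:.4f}]")
```

With 20,000 samples and delta = 0.01 the half-width is about 0.0115; halving it requires four times as many samples, since the width shrinks with the square root of the sample count.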

2605.14754 2026-05-15 cs.AI

XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

Gong Zhiren, Tiantong Wu, Jiaming Zhang, Fuyao Zhang, Che Wang, Yurong Hao, Yikun Hou, Foo Ping, Yilei Zhao, Fei Huang, Chau Yuen, Wei Yang Bryan Lim

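AI Summary XDomainBench is a diagnostic benchmark for reasoning collapse in large language models during high-dimensional scientific knowledge composition. By systematically varying discipline combinations and task difficulty, the study shows that model reasoning ability degrades markedly as the complexity of knowledge composition grows. It attributes the collapse mainly to the added difficulty introduced by combining disciplines and to error accumulation and domain confusion during interaction, offering a new perspective and experimental framework for evaluating models on scientific knowledge synthesis.

Abstract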

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

2605.14752 2026-05-15 cs.LG cs.AI

Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions

Qirui Liu, Hao Chen, Weijie Shi, Jiajie Xu, Jia Zhu

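AI Summary This study aims to accurately identify student misconceptions in support of personalized education. To address data scarcity, heavy annotation noise, and deployment constraints, it proposes a two-stage knowledge distillation framework guided by cognitive uncertainty. The method mines high-value samples from existing data, uses the teacher model's uncertainty and confidence differences to identify critical samples, and applies a difficulty-adaptive mechanism so that the student model inherits inter-class relationships and can distinguish ambiguous error types. Experiments show that, trained on a small amount of data, the method substantially improves classification performance and outperforms current state-of-the-art models.

Comments ACL 2026 Findings. 10 pages, 5 figures, 19 tables

Abstract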

Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) a deployment paradox: large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge devices, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design a difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming the SOTA LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.
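
Illustrative sketch (not from the paper): the abstract reports MAP@3 without defining it. Assuming the common single-ground-truth definition, where each sample scores 1/rank if its true label appears among the top three predictions and 0 otherwise, the metric can be computed as follows. The label strings are made up for the example.

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean average precision at 3 with exactly one correct label per sample."""
    total = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        # Only the first three predictions count; credit is 1/rank of the hit.
        for rank, label in enumerate(ranked[:3], start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)


# Hypothetical labels: hits at rank 1, rank 2, rank 3, and a miss.
truths = ["a", "b", "c", "d"]
predictions = [
    ["a", "x", "y"],
    ["x", "b", "y"],
    ["x", "y", "c"],
    ["x", "y", "z"],
]
score = map_at_3(truths, predictions)
print(round(score, 4))
```

The four samples contribute 1, 1/2, 1/3, and 0, so the score is their mean.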

2605.14749 2026-05-15 cs.CL cs.AI cs.LG

Non-linear Interventions on Large Language Models

Sangwoo Kim

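AI Summary This paper studies how to intervene on non-linearly represented features in large language models, moving beyond the limits of existing linear intervention methods. The author proposes a general intervention framework that applies to non-linear features, together with a learning procedure that enables intervention on implicit features lacking a direct output signal. Experiments show that the method outperforms conventional linear methods on refusal bypass steering and intervenes more precisely.

Abstract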

Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.

2605.14747 2026-05-15 cs.CL cs.AI cs.CV cs.LG

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian

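AI Summary This paper proposes Video2GUI, a fully automated framework that extracts structured GUI interaction trajectories from unlabeled Internet videos, addressing the small scale and narrow domain coverage of current GUI agent pretraining data. The method uses a coarse-to-fine filtering strategy to select high-quality GUI tutorial videos and converts them into interaction trajectories usable for training, yielding WildGUI, a large-scale dataset of 12 million trajectories covering more than 1,500 applications and websites. Models pretrained on this dataset improve by 5-20% on multiple GUI grounding and action benchmarks, matching or surpassing the prior state of the art.

Comments Accepted at ICML 2026

Abstract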

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

2605.14746 2026-05-15 cs.LG

Selective Safety Steering via Value-Filtered Decoding

Bat-Sheva Einbinder, Hen Davidov, Yee Whye Teh, Yarin Gal, Yaniv Romano

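AI Summary This paper studies how to selectively steer large language models with a safety reward at decoding time, reducing unnecessary safety interventions while improving output safety. It proposes a value-filtered decoding method in which a threshold explicitly controls the probability of false interventions, striking a better balance between safety and the base model's original behavior. Experiments on multiple datasets show that the method outperforms existing approaches and achieves a better trade-off between safety and output quality.

Abstract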

While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model's sampling policy at decoding time using a safety reward. However, existing decoding-time steering methods often intervene unnecessarily, modifying generations that would have been safe under the base model. Such unnecessary interventions are undesirable, as they can distort key properties of the base model such as helpfulness, fluency, style, and coherence. We propose a new test-time steering method designed to reduce such unnecessary interventions while improving the safety of unsafe responses. Our approach filters tokens using a value-based safety criterion and provides an explicit bound on the probability of false interventions. A single threshold hyperparameter controls this bound, allowing practitioners to trade off higher rates of unnecessary intervention for better output safety. Across multiple datasets and experiments, we show that our value-filtered decoding method outperforms existing baselines, achieving better trade-offs between safety, helpfulness, and similarity to the base model.
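
Illustrative sketch (not the authors' algorithm): the abstract says tokens are filtered by a value-based safety criterion controlled by a single threshold, but gives no further detail. Assuming each candidate token comes with a scalar safety value estimate, one simple reading is to drop tokens whose value falls below the threshold and renormalize the rest. The fallback used when no token passes is an assumption made here, and the numbers are made up.

```python
import numpy as np


def value_filtered_distribution(base_probs, token_values, threshold):
    """Zero out tokens whose estimated safety value is below `threshold`, then renormalize."""
    base_probs = np.asarray(base_probs, dtype=float)
    token_values = np.asarray(token_values, dtype=float)
    keep = token_values >= threshold
    if not keep.any():
        # Assumed fallback: if nothing passes, keep only the highest-value token(s).
        keep = token_values == token_values.max()
    filtered = np.where(keep, base_probs, 0.0)
    total = filtered.sum()
    if total == 0.0:
        # The kept tokens carry no base probability: spread mass uniformly over them.
        return keep / keep.sum()
    return filtered / total


base = [0.5, 0.3, 0.2]
# When every token clears the threshold, the base distribution is left untouched.
print(value_filtered_distribution(base, [0.9, 0.6, 0.8], threshold=0.5))
# When one token falls below it, that token is removed and the rest are rescaled.
print(value_filtered_distribution(base, [0.9, 0.2, 0.8], threshold=0.5))
```

The first call shows the selective aspect: a step where all candidates look safe is passed through unchanged, so the base model's behavior is preserved there.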

2605.14744 2026-05-15 cs.CL cs.AI cs.CY

Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems

José Manuel de la Chica Rodríguez, Carlos Martí-González

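AI Summary This study examines how large language models (LLMs) in regulated financial decision systems are governed through natural-language policies, and points out that current evaluation looks only at task accuracy and ignores whether governance constrains the decision rationale. It proposes five metrics of governance compliance and introduces four mechanical enforcement primitives that operate outside the model's interpretive loop, which markedly improve the information content of decisions and task accuracy. Experiments show that mechanical enforcement sharply reduces the share of uninformative decisions and confirm a decoupling between governance and task performance: under system stress, governance quality can be preserved independently of task performance.

Abstract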

Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal-agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level, where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from an MCC of 0.43 to 0.88. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance; the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
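
Illustrative sketch (not from the paper): MCC is the Matthews correlation coefficient. The abstract does not say whether the binary or a multi-class variant is used; the binary form is assumed here, and the confusion counts in the usage lines are made up rather than taken from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        # Conventional value when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 90 of 100 decisions correct, errors split evenly.
balanced = matthews_corrcoef(tp=45, tn=45, fp=5, fn=5)
# Always predicting the positive class gives 0 regardless of raw accuracy.
degenerate = matthews_corrcoef(tp=90, tn=0, fp=10, fn=0)
print(balanced, degenerate)
```

The second call illustrates why MCC is often preferred over accuracy on imbalanced decisions: a classifier that always answers "positive" is 90% accurate on those counts but scores 0.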

2605.14742 2026-05-15 cs.CV cs.RO

EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding

Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau, Yi Wang

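AI Summary This paper proposes EARL, an egocentric analysis-guided reinforcement learning framework that aims to improve reasoning about human-environment interactions and the accuracy of pixel-level grounding for robots. EARL uses a two-stage parsing structure: it first generates a structured textual description, then produces an answer and a pixel mask in response to the user query, with an analysis-guided feature synthesizer integrating the semantic prior. Experiments show that EARL clearly improves over existing reinforcement-learning-based methods on pixel-level grounding and generalizes well.

Comments Accepted at ICML 2026. Project page: https://github.com/yuggiehk/EARL

Abstract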

Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
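
Illustrative sketch (not from the paper): the abstract does not define cIoU. Assuming it means cumulative IoU as commonly used in referring and reasoning segmentation, that is, total intersection pixels divided by total union pixels over the whole evaluation set, it can be computed as follows. The tiny masks are made up.

```python
import numpy as np


def cumulative_iou(pred_masks, gt_masks) -> float:
    """Sum intersections and unions over all samples, then divide once."""
    intersection = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        intersection += int(np.logical_and(pred, gt).sum())
        union += int(np.logical_or(pred, gt).sum())
    return intersection / union if union else 0.0


# Two hypothetical 2x2 masks: per-sample IoUs are 1/2 and 3/4.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 0]]]
ciou = cumulative_iou(preds, gts)
print(round(ciou, 4))
```

Because the division happens once over the pooled counts, large objects weigh more than small ones; here the result is (1 + 3) / (2 + 4), which differs from the mean of the per-sample IoUs (0.625).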

2605.14733 2026-05-15 cs.CV

Video-Zero: Self-Evolution Video Understanding

Ruixu Zhang, Deyi Ji, Lanyun Zhu, Xuanyi Liu, Yuxin Meng, Ruihang Chu, Yujiu Yang

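AI Summary This paper proposes Video-Zero, a self-evolving video understanding framework that aims to improve the reasoning ability of video understanding models without human annotation. Through a questioner-solver co-evolution system, the method focuses on temporally localized key evidence in videos, generates evidence-grounded questions, and performs evidence-aligned learning, which gives more effective supervision and training. Experiments show that Video-Zero markedly improves the performance of base models on multiple video understanding tasks, confirming its effectiveness and generalization.

Abstract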

Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

2605.14727 2026-05-15 cs.CV

CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

Pengcheng Fang, Hongli Chen, Yuxia Chen, Tengjiao Sun, Jiaxin Liu, Xiaohao Cai

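AI Summary This paper proposes CHASM, a cross-frequency harmonized axis-separable mixer that improves Fourier-transform-based spectral token operators. CHASM shares one learned channel eigenbasis across frequencies while keeping separate positive spectral gains for each frequency, combining cross-frequency alignment of channel directions with local frequency adaptivity. The method performs well on several vision tasks, and experiments indicate that its structured design improves model performance and that cross-frequency harmonization is an effective inductive bias for spectral operators.

Abstract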

Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.
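
Illustrative sketch (not the authors' implementation): one way to read "a shared channel eigenbasis with per-frequency positive gains" is that the channel-mixing matrix at frequency k has the form U diag(g_k) U^T, with the same U for all k and g_k > 0. The sketch below applies such an operator along a single axis with NumPy. The orthogonal basis, the softplus parameterization of the gains, and all names are assumptions made for the example.

```python
import numpy as np


def softplus(x):
    """log(1 + exp(x)), computed stably; always strictly positive."""
    return np.logaddexp(0.0, x)


def shared_basis_mix_1d(x, basis, raw_gains):
    """Apply U diag(g_k) U^T to the channel vector at every frequency k.

    x:         (length, channels) real-valued features along one spatial axis
    basis:     (channels, channels) orthogonal matrix U shared by all frequencies
    raw_gains: (length // 2 + 1, channels) unconstrained per-frequency parameters
    """
    spectrum = np.fft.rfft(x, axis=0)      # (frequencies, channels), complex
    gains = softplus(raw_gains)            # positive gain per frequency and direction
    coefficients = spectrum @ basis        # coordinates in the shared basis
    mixed = (coefficients * gains) @ basis.T
    return np.fft.irfft(mixed, n=x.shape[0], axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4))
# A random orthogonal matrix stands in for the learned basis.
shared_basis, _ = np.linalg.qr(rng.standard_normal((4, 4)))
raw = rng.standard_normal((8 // 2 + 1, 4))
output = shared_basis_mix_1d(features, shared_basis, raw)
print(output.shape)
```

An axis-separable version would apply the same function along the height axis and then the width axis of a feature map. When every gain equals 1 the operator reduces to the identity, since U U^T = I, which gives a quick sanity check on the construction.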

2605.14723 2026-05-15 cs.AI cs.CL cs.LG

Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

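AI Summary This paper proposes SepsisAgent, a new agent model for sepsis treatment decisions in intensive care. The model incorporates a clinical world model to simulate patient responses to different treatment options and follows a propose-simulate-refine workflow to optimize its decisions. The study shows that SepsisAgent performs strongly on guideline adherence and safety metrics and outperforms traditional reinforcement learning and large language model baselines. Its core contribution is that repeated interaction with the clinical world model lets the model learn regularities in patient physiology and make more reliable decisions.

Abstract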

Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid-vasopressor interventions, and follows a propose-simulate-refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.

2605.14721 2026-05-15 cs.AI

On Strong Equivalence Notions in Logic Programming and Abstract Argumentation

Giovanni Buraglio, Wolfgang Dvorak, Stefan Woltran

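AI Summary This paper studies the discrepancy between strong equivalence in logic programming and in abstract argumentation, noting that in dynamic settings the two formalisms use different update mechanisms, so strong equivalence does not carry over directly. The authors propose a new definition of strong equivalence for logic programs under which strong equivalence is preserved between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, restoring compatibility between the formalisms.

Abstract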

Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.

2605.14717 2026-05-15 cs.CV cs.AI

Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

Saqib Nazir, Ardhendu Behera

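AI Summary This study addresses the difficulty of inferring molecular phenotypes directly from bright-field images in label-free single-cell imaging. It proposes a multi-task deep learning framework that jointly performs white blood cell classification and regression of protein expression levels. The model uses a hybrid architecture combining a convolutional neural network with a Transformer, fusing local texture features and global representations through a learnable cross-branch gating module to enable robust joint morpho-molecular inference from differential phase contrast images. Experiments show strong performance on multiple benchmark datasets, pointing to a low-cost route to hematological analysis without fluorescent staining.

Comments Accepted at the 28th International Conference on Pattern Recognition (ICPR) 2026

Abstract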

Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.

2605.14712 2026-05-15 cs.RO cs.AI cs.CL cs.CV

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen

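AI Summary This study targets action conflicts in robot imitation learning caused by differences in short-horizon intent. It proposes IntentVLA, a history-conditioned vision-language-action (VLA) framework that encodes recent visual observations into a compact short-horizon intent representation used to guide action generation. The study also builds AliasBench, a benchmark for evaluating policies under short-horizon observation ambiguity. Experiments show that IntentVLA improves the stability of action execution on multiple tasks and outperforms existing VLA methods.

Comments Code is available at https://github.com/ZGC-EmbodyAI/IntentVLA

Abstract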

Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

2605.14710 2026-05-15 cs.CV cs.AI

Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao

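AI Summary This study addresses insufficient multi-modal fusion in ischemic stroke prognosis prediction and proposes a tri-modal fusion model that integrates medical images, structured clinical data, and unstructured text. The method uses a large language model to automatically generate semi-structured diagnostic text, easing the scarcity of expert annotations, and designs an alignment-fusion module conditioned on visual features to achieve deep cross-modal interaction and mitigate heterogeneity. Experiments show that the model achieves state-of-the-art predictive performance on real-world clinical data.

Comments Corresponding author: Ting Xiao

Abstract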

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

2605.14708 2026-05-15 cs.CV

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Zeyu Chen, Fangmin Zhao, Yan Shu, Yichao Liu, Liu Yu, Yu Zhou

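AI Summary StyleTextGen is a style-conditioned generation framework for multilingual scene text generation, designed to address the challenges of accurately extracting text style from complex backgrounds and maintaining fine-grained style consistency across characters. The method introduces a dual-branch style encoder, a text style consistency loss, and a mask-guided generation strategy, which improve the perception and replication of multilingual text styles. The study also builds StyleText-CE, the first bilingual scene text style benchmark, and reports state-of-the-art results on multiple metrics.

Comments This paper has been accepted to CVPR 2026

Abstract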

Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

2605.14705 2026-05-15 cs.CV

Towards Continuous Sign Language Conversation from Isolated Signs

Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim, Junhyeok Kim, Seong Jae Hwang

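AI Summary This study aims to model sign language conversation directly, to better support Deaf and Hard-of-Hearing people communicating in sign language. To address the limited vocabulary and weak generalization of existing sign language datasets, the researchers build SignaVox-W, a large-scale dataset of isolated signs, and use it to generate SignaVox-U, a dataset of continuous sign conversations. With a retrieval-guided spoken-to-gloss translation model and BRAID, a diffusion Transformer, they generate continuous conversations from isolated signs and train SignaVox, a direct sign-to-sign conversational model that does not depend on spoken or written language, markedly improving sign generation quality and semantic alignment.

Abstract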

Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

2605.14704 2026-05-15 cs.CV cs.AI cs.RO

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

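AI Summary In real-world scenes, target objects may lie in regions that are not visible, and current vision-language models (VLMs) still struggle to infer the locations of such occluded objects. The study proposes SceneFunRI, a benchmark built on the SceneFun3D dataset that poses a 2D spatial reasoning task with 855 instances, requiring models to locate invisible functional objects from task instructions and commonsense reasoning. Experiments show that even the strongest baseline performs poorly on the task, exposing the weakness of current models at reasoning about invisible regions and motivating models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

Abstract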

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

2605.14703 2026-05-15 cs.CV

Generating HDR Video from SDR Video

SaiKiran Tedla, Francesco Banterle, Trevor Canham, Karanpreet Raja, David B. Lindell, Kiriakos N. Kutulakos, Jiacheng Li, Feiran Li, Daisuke Iso

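AI Summary This paper studies how to generate high dynamic range (HDR) video from standard dynamic range (SDR) video and proposes a solution based on large-scale generative video models. The method introduces a Multi-Exposure Video Model (MEVM) and a learnable Video Merging Model (VMM), which generate a multi-exposure SDR sequence from a single nonlinear SDR video input and merge it into a high-quality HDR video that preserves detail in shadows and highlights. Experiments show that the method gives robust HDR conversion on in-the-wild consumer videos and classic films, and that it can be combined with existing SDR generative models to build HDR synthesis pipelines.

Abstract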

The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: sdr2hdrvideo.github.io

2605.14700 2026-05-15 cs.RO

SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis

Ben Wei Lim, Minh Duc Le, Thang Truong, Thanh Nguyen Canh

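AI summary SR-Platform is an agent-based system that automatically generates MuJoCo simulation environments for robot learning from natural language instructions. It decomposes scene synthesis into four stages (intent parsing, 3D asset generation, layout planning, and scene assembly), lowering the technical barrier to building training environments. Experiments show that SR-Platform produces executable MuJoCo scenes in under one minute, substantially improving the efficiency and automation of robot simulation environment creation.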

English abstract

Generating robot simulation environments remains a major bottleneck in simulation-based robot learning. Constructing a training-ready MuJoCo scene typically requires expertise in 3D asset modeling, MJCF specification, spatial layout, collision avoidance, and robot-model integration. We present SR-Platform, a production-deployed agentic system that converts free-form natural language descriptions into executable, physically valid MuJoCo environments. SR-Platform decomposes scene synthesis into four stages: an LLM-based orchestrator that converts user intent into a structured scene plan; an asset forge that retrieves cached assets or generates new 3D geometry through LLM-to-CadQuery synthesis; a layout architect that assigns object poses and verifies industrial constraints; and a bridge layer that assembles the final MJCF scene and merges the selected robot model. The system is deployed as a nine-service Docker stack with WebSocket progress streaming, MinIO-backed mesh storage, Qdrant-based semantic asset retrieval, Redis job state, and InfluxDB telemetry. Using 30 days of production telemetry covering 611 successful LLM calls, SR-Platform generates five-object scenes with a median end-to-end latency of approximately 50 s, while cache-accelerated scenes complete in approximately 30-40 s. The asset forge shows an 11.3% first-attempt retry rate with automatic recovery, and cached asset retrieval removes per-object LLM calls for previously generated object types. These results show that agentic scene synthesis can reduce the manual effort required to create diverse robot training environments, enabling users to produce executable MuJoCo scenes from plain English prompts in under one minute.

2605.14698 2026-05-15 cs.LG cs.AI

NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces

Konstantinos Kontras, Trui Osselaer, Stylianos G. Mouslech, Angeliki-Ilektra Karaiskou, Guido Gagliardi, Thomas Strypsteen, Mohammad Hossein Badiei, Anku Rani, Maarten Vanmarcke, Miguel Bhagubai, Chanakya Ekbote, Jaedong Hwang, Christos Chatzichristos, Paul Pu Liang, Maarten De Vos

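AI summary This paper introduces NeuroAtlas, the largest clinical electroencephalography (EEG) benchmark to date, comprising 42 datasets and 260,000 hours of EEG data across areas such as epilepsy, sleep medicine, and brain age estimation, together with dedicated clinical evaluation metrics. The study compares EEG-specific pretrained models with generic time-series models and finds that the latter perform comparably or even better on some tasks. It also notes that conventional machine learning metrics fail to capture clinical utility, proposes evaluation methods closer to real-world use, and shows that current pretrained models still fall well short of unified EEG modeling.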

English abstract

Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, with multiple datasets per task and bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which neither have EEG-focused architectures nor were pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.

2605.14696 2026-05-15 cs.CV

EponaV2: Driving World Model with Comprehensive Future Reasoning

Jiawei Xu, Zhizhou Zhong, Zhijian Shu, Mingkai Jia, Mingxiao Li, Jia-Wang Bian, Qian Zhang, Kaicheng Zhang, Jin Xie, Jian Yang, Wei Yin

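AI summary This paper proposes EponaV2, a new driving world model that addresses the reliance of existing autonomous driving systems on large amounts of manually annotated data for trajectory planning. By introducing comprehensive future reasoning, the model predicts future geometry and semantics, improving its understanding of the environment and its planning ability. In addition, inspired by the training recipes of large language models, EponaV2 introduces a flow matching group relative policy optimization mechanism that further improves planning accuracy, outperforming existing methods on several benchmarks.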

English abstract

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.

2605.14694 2026-05-15 cs.LG

The Rate-Distortion-Polysemanticity Tradeoff in SAEs

Tommaso Mencattini, Francesco Montagna, Francesco Locatello

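AI summary This paper studies the tradeoff in sparse autoencoders (SAEs) among reconstruction accuracy (minimizing distortion), coding efficiency (minimizing rate), and the semantic singularity of representations (monosemanticity), and introduces the three-way Rate-Distortion-Polysemanticity tradeoff. Through theoretical analysis and experiments, the authors show that forcing SAEs to learn monosemantic representations increases rate and distortion, and that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, in particular by feature co-occurrence probabilities. The study extends to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy and by evaluating existing metrics on SAEs trained on large language models, showing that polysemanticity is fundamentally a data-level problem that should be taken into account at the architectural and optimization level.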

English abstract

Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic (highly interpretable) representations, limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability that features co-occur. Finally, we extend the analysis to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy when the data-generating process is unknown, and we benchmark existing proxy metrics on SAEs trained on Large Language Models. Taken together, our findings show that polysemanticity is a data problem that should be accounted for when addressing it at the architectural and optimization level.

2605.14689 2026-05-15 cs.CV

Are Candidate Models Really Needed for Active Learning?

Harshini Mridula Mohan, Maanya Manjunath, Vipul Arya, S. H. Shabbeer Basha, Nitin Cheekatla

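AI summary This paper asks whether candidate models are really needed in active learning and proposes an active learning method that requires no initial candidate model. Using randomly initialized convolutional neural networks and transformers with confidence-based sampling strategies, the study shows label-burden reductions comparable to those of conventional methods. Experiments indicate that low-confidence sampling performs best in most cases, offering a new route to efficient and flexible active learning.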

Comments Accepted for publication in Computer Vision and Image Understanding (CVIU)

English abstract

Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly when training Convolutional Neural Networks (CNNs) and transformers, which have a large number of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive because they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC performed best in most of our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
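
The three confidence-based strategies named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "confidence" means the top-class predicted probability, and the function names and the `switch_round` parameter are hypothetical.

```python
def select_by_confidence(probabilities, k, strategy="LC"):
    """Return indices of k unlabeled samples ranked by top-class confidence.

    probabilities: one list of class probabilities per unlabeled sample.
    strategy: "LC" picks the least confident samples, "HC" the most confident.
    """
    if strategy not in ("LC", "HC"):
        raise ValueError("strategy must be 'LC' or 'HC'")
    # Assumption: confidence is the largest predicted class probability.
    confidence = [max(p) for p in probabilities]
    order = sorted(range(len(confidence)),
                   key=confidence.__getitem__,
                   reverse=(strategy == "HC"))
    return order[:k]


def hclc_strategy(round_index, switch_round):
    """HCLC: high confidence in early rounds, low confidence afterwards.

    switch_round is a hypothetical parameter marking where the switch happens.
    """
    return "HC" if round_index < switch_round else "LC"


probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.51, 0.49]]
lc_pick = select_by_confidence(probs, 2, "LC")
hc_pick = select_by_confidence(probs, 2, "HC")
```

With these toy probabilities, LC selects the two samples closest to a coin flip and HC selects the two the model is most sure about.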

2605.14686 2026-05-15 cs.LG

ReMIA: a Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators

Davide Scassola, Andrea Coser, Sebastiano Saccani

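AI summary As privacy protection grows in importance, synthetic data generators (SDGs) are widely used for data sharing, yet the data they produce remain exposed to membership inference attacks (MIAs). This paper proposes ReMIA, a new privacy evaluation method that needs only two SDG training runs and auxiliary data comparable in size to the original training set, making MIA-style evaluation far more practical. Experiments show that ReMIA retains high sensitivity while being more efficient than existing methods, and they indicate that SDGs can offer a better balance between privacy and data utility than traditional de-identification methods.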

English abstract

Tabular data sharing under privacy constraints is increasingly important for research and collaboration. Synthetic data generators (SDGs) are a promising solution, but synthetic data remains vulnerable to attacks, such as membership inference attacks (MIAs), which aim to determine whether a specific record was part of the training data. State-of-the-art MIAs are powerful but impractical: they rely on shadow modeling, requiring hundreds of SDG training runs, and need auxiliary data several times larger than the original training set. Fast proxy metrics like distance to closest record (DCR) are efficient but have limited sensitivity to MIA risk. We introduce ReMIA (Relative Membership Inference Attack), a practical privacy metric that requires only two SDG training runs and additional data no larger than the original training set. Rather than predicting whether a record was in the training set, ReMIA generates two synthetic datasets from two source datasets and measures whether a classifier can identify which source a record came from. Experiments across multiple tabular datasets and SDGs show that ReMIA has a sensitivity comparable to state-of-the-art MIAs while being substantially more practical. We further observe that SDGs can achieve privacy-utility trade-offs that traditional noise-based anonymization methods do not match. Code is available at https://github.com/aindo-com/remia.

2605.14685 2026-05-15 cs.LG cond-mat.stat-mech cs.AI

Spontaneous symmetry breaking and Goldstone modes for deep information propagation

Nabil Iqbal, T. Anderson Keller, Yue Song, Takeru Miyato, Max Welling

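AI summary This paper studies spontaneous symmetry breaking in deep neural networks with continuous symmetries and the resulting Goldstone-like degrees of freedom, showing that these degrees of freedom support coherent information propagation across network depth and recurrent iterations. Through theoretical analysis and experiments, the authors show that this mechanism yields stable information flow without architectural stabilizers such as residual connections or normalization, improves the trainability and representational diversity of feedforward networks, and strengthens long-term memory in recurrent networks, improving performance on long-sequence modeling tasks.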

Comments 28 pages. Code at https://github.com/nabiliqbal/ssb-goldstone-deep-info-prop

English abstract

In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.

2605.14683 2026-05-15 cs.RO cs.SY eess.SY

SeaVis: Modeling and Control of a Remotely Operated Towed Vehicle for Seabed Visualization and Mapping

Abdelhakim Amer, Aske Alstrup, Frederik Rasmussen, Yury Brodskiy, Andriy Sarabakha, Erdal Kayacan

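AI summary This paper presents a new mathematical model of SeaVis, a remotely operated towed underwater vehicle for seabed visualization and mapping, and designs a gain-scheduled linear-quadratic regulator (LQR) for robust depth and attitude control. In high-fidelity simulation, the LQR controller outperforms a conventional PID controller in disturbance rejection, control efficiency, and flap actuation amplitude, and it remains effective across the full operating speed range. The work provides an effective control method for precise, stable operation of underwater robots.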

Comments Accepted at IEEE/ASME AIM 2026

English abstract

High-resolution seafloor mapping necessitates stable and precise positioning for underwater robots. This paper introduces a novel mathematical model for SeaVis remotely operated towed vehicles (ROTVs) and develops a gain-scheduled linear-quadratic regulator (LQR) for robust depth and attitude control. We validate the approach in a high-fidelity simulation, benchmarking the LQR against a conventional PID controller over a challenging seabed profile. The presented results demonstrate the LQR's superior performance, with significantly enhanced robustness to disturbances, greater control efficiency, and substantially reduced flap actuation. The gain scheduling also confirms the controller's effectiveness across the full operational velocity range. The complete simulation environment and controller are open-sourced.
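
The idea of a gain-scheduled LQR can be illustrated on a one-state toy system. The sketch below is not the SeaVis model or the authors' controller: it assumes a scalar plant dx/dt = a·x + b·u whose coefficients depend on tow velocity, uses the closed-form solution of the scalar Riccati equation, and interpolates linearly between gains. The `toy_model` coefficients and all function names are hypothetical.

```python
import math


def scalar_lqr_gain(a, b, q, r):
    """LQR gain for dx/dt = a*x + b*u with cost integral of q*x^2 + r*u^2.

    Solves the scalar Riccati equation 2*a*P - (b*P)**2 / r + q = 0 for the
    positive root P and returns K = b*P/r, so that the control is u = -K*x.
    """
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return b * p / r


def build_schedule(velocities, model, q, r):
    """Precompute one gain per scheduling velocity; model(v) returns (a, b)."""
    return [(v, scalar_lqr_gain(*model(v), q, r)) for v in sorted(velocities)]


def scheduled_gain(schedule, v):
    """Linearly interpolate the gain at velocity v, clamping outside the grid."""
    if v <= schedule[0][0]:
        return schedule[0][1]
    for (v0, k0), (v1, k1) in zip(schedule, schedule[1:]):
        if v <= v1:
            t = (v - v0) / (v1 - v0)
            return k0 + t * (k1 - k0)
    return schedule[-1][1]


def toy_model(v):
    """Assumed toy dynamics: damping and control authority grow with speed."""
    return (-0.5 * v, 2.0 * v)


schedule = build_schedule([1.0, 2.0, 3.0], toy_model, q=1.0, r=1.0)
gain_mid = scheduled_gain(schedule, 1.5)
```

A multi-state vehicle model would replace the closed-form gain with a matrix Riccati solve at each scheduling point; the lookup-and-interpolate step stays the same.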

2605.14679 2026-05-15 cs.CL cs.AI

AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents

Vicent Briva-Iglesias, María Ferre-Fernández

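AI summary This study examines how AI assistance can improve the quality of multilingual dissemination in terminology-dense cultural heritage domains such as rock art documentation. It compares three machine translation setups for translating a Spanish academic text into English, focusing on how glossary-augmented prompting improves the accuracy of specialised terms. The results show that a large language model combined with a glossary (Gemini-RAG) achieves the highest terminology accuracy of the three, with overall quality on par with the basic-prompt model and above conventional neural machine translation, giving cultural institutions a low-cost, efficient solution for terminology control.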

English abstract

Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0-100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
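
Glossary-augmented prompting via term-pair retrieval can be sketched roughly as below. This is an illustration under stated assumptions, not the study's pipeline: retrieval is assumed to be a case-insensitive substring match, the prompt wording is invented, and the two-entry glossary and all function names are hypothetical. No model call is made; the sketch only builds the prompt text.

```python
def retrieve_term_pairs(source_text, glossary):
    """Return (source_term, target_term) pairs whose source term occurs in the text."""
    lowered = source_text.lower()
    return [(src, tgt) for src, tgt in glossary.items() if src.lower() in lowered]


def build_prompt(source_text, glossary):
    """Compose a translation prompt that lists only the retrieved glossary entries."""
    pairs = retrieve_term_pairs(source_text, glossary)
    lines = ["Translate the following Spanish text into English."]
    if pairs:
        lines.append("Use these term translations exactly:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in pairs)
    lines.append("Text: " + source_text)
    return "\n".join(lines)


# Hypothetical two-entry glossary, for illustration only.
glossary = {"arte rupestre": "rock art", "abrigo": "rock shelter"}
prompt = build_prompt("El arte rupestre levantino se conserva bien.", glossary)
```

Only the entry for "arte rupestre" reaches the prompt here, since "abrigo" does not occur in the source sentence; keeping the injected list short is the point of retrieving term pairs rather than pasting the whole glossary.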

2605.14672 2026-05-15 cs.LG

AQKA: Active Quantum Kernel Acquisition Under a Shot Budget

Jian Xu, Chao Li, Delu Zeng, John Paisley, Qibin Zhao

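AI summary This paper studies how to estimate quantum kernel matrices efficiently under limited measurement resources and proposes AQKA, a method that allocates measurement shots adaptively to improve classification performance. Its core contributions are a complete framework for choosing among allocation strategies and a pair-level shot allocation theory based on gradients and kernel values, which markedly improves model performance under limited budgets. Experiments show that AQKA outperforms existing methods on several quantum hardware platforms, especially on sparse-sensitivity tasks.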

English abstract

Estimating an $N \times N$ quantum kernel from circuit fidelities requires $\Theta(N^2 S)$ measurement shots, the dominant bottleneck for deployment on near-term hardware. Existing budget-saving methods (Nyström-QKE, ShoFaR, kernel-target alignment) sub-sample which entries to measure but allocate shots uniformly within their chosen subset, ignoring how much each entry drives the downstream classifier. We close this gap with two contributions. First, a complete regime decomposition for shot-budgeted quantum kernel learning: a principled menu of when each allocator wins. Our method, AQKA, dominates the budget-limited regime ($B \lesssim 16 n_{\mathrm{pairs}}$) on sparse-sensitivity KRR, with the gap growing from $+8$ to $+25$ pts over uniform as $N$ scales $225 \to 1000$ and reaching $+26$ to $+32$ pts on an ibm_pittsburgh (156-qubit Heron) hardware kernel; Nyström-QKE wins at saturating budgets on planted-sparse via low-rank reconstruction; ShoFaR is competitive only at extreme low budgets. Second, a closed-form pair-level acquisition theory: $s_{ij}^{\star} \propto |g_{ij}|\sqrt{K_{ij}(1-K_{ij})}$ with explicit gradient $g_{ij}$ for KRR (Lemma 1, $|\beta_i\alpha_j+\beta_j\alpha_i|\sqrt{K_{ij}(1-K_{ij})}$) and SVM via the envelope theorem ($|\eta_i^*\eta_j^*|\sqrt{K_{ij}(1-K_{ij})}$); a corrected sparsity-aware Cauchy-Schwarz rate $\rho \le 2m/N$ matching empirics (vs. the naive $m^2/N^2$); an explicit-constant plug-in regret bound (Theorem 2); and a tighter SVM ceiling $\rho^{\mathrm{SVM}} \le m_{\mathrm{sv}}^2/N^2$. We close with the first multi-seed live online adaptive shot allocation on quantum hardware: $+17.0 \pm 4.8$ pts at $N = 20$ on ibm_aachen ($3.5\sigma$, 5 seeds), with the advantage holding at $N = 30$ at higher budget on ibm_berlin ($+14.0 \pm 8.5$ pts, 5 seeds).
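
The pair-level rule $s_{ij}^{\star} \propto |g_{ij}|\sqrt{K_{ij}(1-K_{ij})}$ can be turned into a shot allocator along the following lines. This is a minimal sketch, not the AQKA implementation: it assumes the sensitivities $g_{ij}$ and current kernel estimates $K_{ij}$ are already available, rounds allocations down to integers, and falls back to a uniform split when every score is zero. The function name and the toy inputs are hypothetical.

```python
import math


def allocate_shots(grad, kernel, budget):
    """Split `budget` shots across pairs in proportion to |g_ij| * sqrt(K_ij * (1 - K_ij)).

    grad and kernel map a pair (i, j) to the sensitivity g_ij and the current
    kernel estimate K_ij in [0, 1]. Returns an integer shot count per pair;
    rounding down means the counts may sum to slightly less than the budget.
    """
    scores = {p: abs(grad[p]) * math.sqrt(kernel[p] * (1.0 - kernel[p]))
              for p in kernel}
    total = sum(scores.values())
    if total == 0.0:
        # Assumed fallback: no pair is informative, so split uniformly.
        return {p: budget // len(scores) for p in scores}
    return {p: int(budget * s / total) for p, s in scores.items()}


# Toy inputs (hypothetical): three pairs with different sensitivities.
kernel = {(0, 1): 0.5, (0, 2): 0.9, (1, 2): 0.5}
grad = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 0.0}
shots = allocate_shots(grad, kernel, budget=1000)
```

In this toy case the pair with zero sensitivity receives no shots, and the pair with both the largest sensitivity and a kernel value near 0.5 receives the most. A uniform allocator would give every pair the same share regardless of either factor.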

2605.14667 2026-05-15 cs.AI

How Sensitive Are Radiomic AI Models to Acquisition Parameters?

D. Gil, I. Sanchez, C. Sanchez

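AI summary This paper studies how sensitive radiomic AI models are to image acquisition parameters. It proposes a mixed-effects framework that quantifies the influence of clinically relevant parameters on model performance and identifies the parameter ranges associated with better cross-dataset robustness. Applying the framework to two independent multicentre CT datasets, the study finds that an optimized scan configuration (tube current ≥ 200 mA, spiral pitch ≤ 1.5, slice thickness ≤ 1.25 mm) balances diagnostic quality against radiation dose and markedly improves model sensitivity and specificity.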

English abstract

A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and data we collected ourselves) and several state-of-the-art architectures. To evaluate across-database reproducibility, CT parameters were adjusted using the collected data and tested on the public set. The optimal configuration selected is an X-ray tube current >= 200 mA, spiral pitch <= 1.5, and slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration raises performance from 0.79 +- 0.04 sensitivity and 0.47 +- 0.10 specificity in low-quality scans to 0.90 +- 0.10 sensitivity and 0.79 +- 0.13 specificity in high-quality ones.
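
The selected configuration can be expressed as a simple acquisition check. The thresholds come from the abstract; the function name, the dictionary keys, and the example scans are hypothetical, and the sketch says nothing about how the mixed-effects analysis arrived at these thresholds.

```python
def meets_selected_configuration(tube_current_ma, spiral_pitch, slice_thickness_mm):
    """True if a CT scan satisfies the configuration reported in the abstract:
    tube current >= 200 mA, spiral pitch <= 1.5, slice thickness <= 1.25 mm."""
    return (tube_current_ma >= 200
            and spiral_pitch <= 1.5
            and slice_thickness_mm <= 1.25)


# Hypothetical scan metadata, for illustration only.
scans = [
    {"id": "scan-a", "tube_current_ma": 250, "spiral_pitch": 1.2, "slice_thickness_mm": 1.0},
    {"id": "scan-b", "tube_current_ma": 120, "spiral_pitch": 1.2, "slice_thickness_mm": 1.0},
    {"id": "scan-c", "tube_current_ma": 300, "spiral_pitch": 1.5, "slice_thickness_mm": 2.5},
]
high_quality_ids = [
    s["id"] for s in scans
    if meets_selected_configuration(s["tube_current_ma"], s["spiral_pitch"], s["slice_thickness_mm"])
]
```

Splitting a cohort this way is what allows performance to be reported separately for scans inside and outside the selected parameter region.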