arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31186 2026-06-01 cs.LG

How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection evaluation

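Joanna Komorniczak

AI summary: This paper reviews the relationship between concept drift detection quality metrics and classification performance, studying eight drift detection quality metrics across seven synthetic data stream tools in order to identify the most informative set of metrics.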

Abstract

Data streams are among the most frequently analyzed data structures today, and concept drift poses a major challenge for the systems that process them. Although numerous solutions have been proposed to counteract the accuracy degradation caused by concept drift, the scientific community has not yet established a unified framework for evaluating concept drift detection. Existing research often relies on classification quality metrics, but these can be affected by multiple factors and may not reliably reflect drift detection quality. In this work, we present an in-depth overview of the relationship between metrics for quantifying drift detection quality and classification performance in synthetic nonstationary data streams. The study examines eight drift detection quality metrics in relation to classifier performance across seven synthetic data stream generation tools, additionally considering drift dynamics as a factor. It aims to identify the most informative set of drift detection quality metrics and to provide a deeper understanding of how drift detection methods are evaluated.
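To make the distinction between detection quality and classification accuracy concrete, the sketch below scores a detector's alarms directly against known drift positions. The function name, the tolerance window, and the three reported quantities are illustrative assumptions; the abstract does not name the eight metrics the paper studies.

```python
def score_detections(true_drifts, detections, tolerance):
    """Match each true drift to the first unused alarm raised within
    `tolerance` samples after it (hypothetical matching rule)."""
    alarms = sorted(detections)
    used = set()      # indices of alarms already matched to a drift
    delays = []       # detection delay for each matched drift
    for drift in sorted(true_drifts):
        for i, alarm in enumerate(alarms):
            if i not in used and drift <= alarm <= drift + tolerance:
                used.add(i)
                delays.append(alarm - drift)
                break
    detected = len(delays)
    false_alarms = len(alarms) - len(used)
    mean_delay = sum(delays) / detected if detected else None
    return detected, false_alarms, mean_delay


# Three real drifts, four alarms; the third drift is missed.
result = score_detections([1000, 2000, 3000], [1040, 1500, 2010, 3900], tolerance=200)
```

A detector can score well on such measures while the downstream classifier's accuracy barely changes, or the reverse, which is the kind of mismatch the abstract refers to.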

2605.31183 2026-06-01 cs.CL cs.AI cs.LG

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

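Mikkel Godsk Jørgensen, Lars Kai Hansen

AI summary: By selecting and labelling features with a supervised pipeline, this paper shows that sparse autoencoders can approach LoRA performance on model steering tasks, and finds that high sparsity is not crucial for interpretability-based steering.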

Abstract

Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).

2605.31177 2026-06-01 cs.CV

Vanilla ViT for Automotive Point Cloud Semantic Segmentation

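Gilles Puy, Nermin Samet, Alexandre Boulch, Spyros Gidaris, Tuan-Hung VU, Renaud Marlet

AI summary: This paper proposes VaViT, which combines a carefully designed tokenizer, a lightweight decoder head, and tailored data augmentations so that a plain, non-hierarchical ViT matches or exceeds existing state-of-the-art methods on large-scale lidar point cloud semantic segmentation.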

Abstract

Plain Transformers have become the de-facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state-of-the-art architectures for point cloud semantic segmentation remain dominated by U-Net architectures where convolutions are interleaved with local or windowed attention. In this work, we show how to effectively leverage vanilla, non-hierarchical ViTs for segmentation of large-scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT (for Vanilla ViT), matches or exceeds the performance of state-of-the-art methods while maintaining the simplicity of the ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at https://github.com/valeoai/VaViT.

2605.31176 2026-06-01 cs.LG cs.DS

Retriever Portfolios: A Principled Approach to Adaptive RAG

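Miltiadis Stouras, Vincent Cohen-Addad, Silvio Lattanzi, Ola Svensson

AI summary: Proposes a method that automatically selects a small, diverse subset (a portfolio) from a large pool of candidate retrievers by optimizing an expected best-of-k objective over the query distribution, enabling adaptive RAG. It outperforms single-retriever and naive multi-retriever baselines on multiple QA benchmarks while reducing latency and token cost.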

Comments
Accepted at ICML 2026. Code available at: https://github.com/mstou/retriever-portfolios
Abstract

Retrieval-augmented generation (RAG) systems typically rely on a single retriever and a single set of hyperparameters, despite facing highly heterogeneous queries that range from simple factoid questions to complex multi-hop reasoning. We propose a method that automatically selects a small, diverse subset of retrievers (a portfolio) from a large pool of candidates, to cover different regions of the target query distribution. We formalize this setting via an expected best-of-$k$ objective over the query distribution and show that it admits an efficient portfolio construction algorithm with near-optimal guarantees. Across multiple QA benchmarks, our learned portfolios and router pipeline consistently outperform single-retriever and naive multi-retriever baselines on both retrieval metrics and answer quality. In addition, compared to inference-time hyperparameter tuning approaches, fixed portfolios enable parallel retrieval and LLM calls, achieving comparable (and sometimes better) accuracy with substantially lower latency and token cost.
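The expected best-of-$k$ objective can be illustrated with a small sketch. It assumes a precomputed table `scores[q][r]` of nonnegative quality scores for retriever `r` on query `q` (a hypothetical input) and uses the standard greedy heuristic for monotone submodular maximization. The abstract does not describe the paper's actual construction algorithm, so this is only one plausible baseline.

```python
def best_of_k_value(scores, portfolio):
    """Average over queries of the best score any portfolio member achieves."""
    if not portfolio:
        return 0.0
    return sum(max(row[r] for r in portfolio) for row in scores) / len(scores)


def greedy_portfolio(scores, k):
    """Greedily add the retriever with the largest gain in the objective."""
    n_retrievers = len(scores[0])
    portfolio = []
    for _ in range(min(k, n_retrievers)):
        candidates = [r for r in range(n_retrievers) if r not in portfolio]
        best = max(candidates, key=lambda r: best_of_k_value(scores, portfolio + [r]))
        portfolio.append(best)
    return portfolio


# Rows are queries, columns are retrievers (made-up numbers).
scores = [
    [0.9, 0.8, 0.1],
    [0.9, 0.8, 0.2],
    [0.1, 0.5, 0.9],
    [0.2, 0.5, 0.8],
]
chosen = greedy_portfolio(scores, k=2)
```

On this toy table the greedy choice is retrievers 1 and 2 (value 0.825), while the pair 0 and 2 would reach 0.875. Greedy selection is near-optimal rather than optimal: for nonnegative scores the objective is monotone submodular, which gives the usual $(1 - 1/e)$ guarantee.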

2605.31175 2026-06-01 cs.CL

Towards Efficient LLMs Annealing with Principled Sample Selection

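Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang

AI summary: Using the spectral geometry of the loss landscape, this paper casts data selection in the annealing phase as a constrained optimization problem and proposes the DiReCT framework, which uses the Hessian spectrum to impose directional constraints on gradients for efficient sample selection, significantly improving model performance.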

Abstract

The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. To facilitate future research, code is available at https://github.com/xuyj233/Direct.

2605.31174 2026-06-01 cs.CV cs.LG

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

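Wenlun Zhang, Jun Yin, Kentaro Yoshioka

AI summary: Proposes the DetAS/DetAS-X agentic framework, which uses a multimodal large language model to adaptively compose restoration modules and specialized detectors and achieves experience-aware reasoning through self-evolving experience accumulation, improving average F1 by 28.36% across six benchmarks.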

Abstract

Object detection in real-world scenarios remains challenging due to diverse image degradations and heterogeneous object distributions, which significantly hinder the generalization of existing detectors. Conventional approaches, including scene-specific representation learning and end-to-end pipeline design, are inherently limited by their reliance on predefined conditions and lack adaptability to dynamic environments. In this paper, we propose DetAS, an agentic detection framework that formulates object detection as a dynamic decision process. Instead of relying on static pipelines, DetAS leverages a Multimodal Large Language Model (MLLM) as a central agent to adaptively compose detection workflows by selecting from a toolbox of restoration modules and specialized detectors. Specifically, DetAS consists of two key components: Self-Adaptive Image Restoration, which dynamically determines whether and how to enhance images for downstream detection, and Multi-Expertise Detection, which integrates multiple domain-specialized detectors and resolves their predictions through instance-level reasoning. To further improve decision quality under fine-grained conditions, we introduce Self-Evolving Experience Harvesting and extend the framework to DetAS-X, which accumulates node-level decision experience from a small set of annotated data and enables experience-aware reasoning during inference. This mechanism allows the system to progressively refine its decision policy and adapt to diverse real-world scenarios. Extensive experiments on six challenging benchmarks demonstrate that DetAS-X significantly outperforms existing MLLM-based detectors, achieving an average improvement of 28.36% in F1 score, with up to 37.01% gain on DarkFace. These results demonstrate the promise of agentic detection and establish a solid foundation for its application in complex and dynamic environments.

2605.31173 2026-06-01 cs.SD cs.AI

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

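Guangyin Bao, Taiping Zeng, Jianfeng Feng, Xiangyang Xue

AI summary: Proposes the MindVoice framework, which reconstructs intelligible speech from EEG/MEG signals by decoupling semantic and acoustic pathways and fusing pretrained generative models with voice cloning, significantly outperforming existing methods.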

Abstract

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, producing results that are spectrally similar but unintelligible. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising direction for auditory neuroscience research and non-invasive speech brain-computer interfaces.

2605.31172 2026-06-01 cs.LG stat.ML

Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

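Vagul Mahadevan, Claire Chen, Shuze Daniel Liu, Shangtong Zhang

AI summary: This paper studies the stability and convergence of two-timescale stochastic approximation under Markovian noise. By controlling the fast-timescale parameter with the running maximum of the slow-timescale parameter, it gives the first proof of almost sure convergence of TDC with eligibility traces under off-policy linear function approximation.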

Comments
ICML 2026
Abstract

This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal difference learning with gradient correction (TDC) and actor-critic methods. Previously, the stability (i.e., boundedness) and convergence of two-timescale SA were only established under i.i.d. noise. This work instead establishes the stability and convergence of two-timescale SA under Markovian noise, a setup that is more realistic in RL. Notably, we do not need to use any projection operator and the noise does not need to live in a compact space. Our key technical novelty is to control the fast timescale parameter with the running max of the slow timescale parameter, instead of with the current slow timescale parameter, as most prior works do. As a key application, we establish the first almost sure convergence of TDC with eligibility traces under off-policy learning with linear function approximation.
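For readers unfamiliar with the setting, the following toy sketch shows the shape of a two-timescale iteration driven by Markovian noise. Everything in it is an assumption chosen for illustration: the linear update functions, the step-size exponents, and the two-state noise chain. It is not TDC and does not reproduce the paper's analysis.

```python
import random


def two_timescale_sa(n_steps=50000, seed=0):
    """Toy two-timescale stochastic approximation with Markovian noise.

    Fast iterate w tracks theta; slow iterate theta solves 1 - theta - w = 0.
    The unique fixed point is theta = w = 0.5.
    """
    rng = random.Random(seed)
    theta, w = 0.0, 0.0      # slow and fast parameters
    y = 1.0                  # two-state Markov chain on {-1, +1}
    for n in range(1, n_steps + 1):
        alpha = 1.0 / n          # slow step size
        beta = 1.0 / n ** 0.6    # fast step size, so alpha / beta -> 0
        if rng.random() < 0.1:   # the chain flips state with probability 0.1
            y = -y
        noise = 0.1 * y          # zero mean under the stationary distribution
        new_theta = theta + alpha * (1.0 - theta - w + noise)
        w = w + beta * (theta - w + noise)
        theta = new_theta
    return theta, w


theta, w = two_timescale_sa()
```

Because the noise samples are correlated over time rather than i.i.d., this is the kind of setup the abstract calls Markovian; no projection step is applied to keep the iterates bounded.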

2605.31171 2026-06-01 cs.IR cs.AI

MIMO: Multilingual Information Retrieval via Monolingual Objectives

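Youngjoon Jang, Seongtae Hong, Heuiseok Lim

AI summary: Proposes MIMO, a two-stage framework that uses a teacher model's stable English semantic space and jointly optimizes knowledge distillation and cross-lingual contrastive learning, addressing language clustering and the embedding alignment-uniformity trade-off in multilingual information retrieval.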

Abstract

Multilingual Information Retrieval (MLIR) reflects real-world search environments in which queries and relevant documents may appear in different languages within a mixed-language corpus. However, existing embedding models are primarily optimized for Multi-Monolingual retrieval and their performance often degrades in MLIR settings. Moreover, directly applying conventional contrastive learning to MLIR can exacerbate language clustering and expose a trade-off between cross-lingual alignment and embedding uniformity. To address these limitations, we propose MIMO: Multilingual Information Retrieval via Monolingual Objectives, a two-stage framework that uses a stable English semantic space from a high-performing teacher model as an anchor. MIMO first initializes the student model's cross-lingual alignment through knowledge distillation, and then jointly optimizes distillation and cross-lingual contrastive learning to improve retrieval discrimination while preserving alignment. Extensive experiments show that MIMO consistently outperforms existing cross-lingual training baselines across various MLIR and Multi-Monolingual benchmarks. MIMO also remains competitive with off-the-shelf models of similar or larger parameter scales. Furthermore, our cross-lingual Alignment-Uniformity analysis clarifies the distinct roles of the two loss components and shows that their combination yields a favorable trade-off between alignment and uniformity.
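The alignment and uniformity quantities mentioned above are commonly computed with the definitions from Wang and Isola (2020): mean squared distance between positive pairs, and the log of the mean Gaussian potential over all pairs. The sketch below follows those definitions; whether MIMO's analysis uses exactly this form is an assumption.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(pairs):
    """Mean squared distance between normalized positive pairs (lower is better)."""
    return sum(sq_dist(normalize(a), normalize(b)) for a, b in pairs) / len(pairs)


def uniformity(embeddings, t=2.0):
    """Log mean Gaussian potential over distinct pairs (lower is more uniform).

    Requires at least two embeddings.
    """
    pts = [normalize(e) for e in embeddings]
    vals = [
        math.exp(-t * sq_dist(pts[i], pts[j]))
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
    ]
    return math.log(sum(vals) / len(vals))


spread = [[1, 0], [-1, 0], [0, 1], [0, -1]]
clustered = [[1, 0], [1, 0.01], [1, -0.01], [1, 0.02]]
```

In a cross-lingual setting the positive pairs would be a query and its translation (or a query and a relevant document in another language). Strong language clustering shows up as poor alignment even when uniformity looks acceptable.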

2605.31170 2026-06-01 cs.CL cs.AI

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

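Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen, Annemette Brok Pirchert, Filippo Tonini, Peter Schneider-Kamp, Lukas Galke Poech

AI summary: Studies emergent languages in populations of language model agents, identifying three categories (token efficiency, new natural languages, and oversight evasion) through a rule-based heuristic and zero-shot classification. It finds that oversight-evasion languages are judged less aligned and can be learned in-context, suggesting that monitoring surface behavior alone may be insufficient to control agent populations.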

Abstract

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols, such as embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in the ideation of these languages, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.

2605.31167 2026-06-01 cs.AI

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

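Tom Lucas, Alessio Buscemi, Alfredo Capozucca, German Castignani, Barbara Delacroix

AI summary: Proposes LLM-FACETS, an open-source framework with a browser interface and a plugin architecture that provides privacy-preserving LLM evaluation for technical experts, domain experts, and compliance officers, supporting transparency and accountability.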

Comments
Submitted to ACM Journal on Responsible Computing, Special Section: Collaborative Methods and Tools for Engineering and Evaluating Transparency in AI. 28 pages, 9 figures, 7 tables, 1 algorithm. Source code: https://github.com/Scriptor-Group/AIMVi
Abstract

Assessing whether Large Language Model outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non-technical practitioners: existing tools require programming expertise and non-trivial environment setup, and cloud-hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. We introduce LLM-FACETS (LLM FActuality Cross-EvaluaTion System): an open-source framework with a browser-accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self-hosted server with no outbound transmission; LLM-judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open-source implementation enables cross-checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. We verify the framework through cross-validation of 18 metric implementations against canonical reference libraries.

2605.31164 2026-06-01 cs.CL cs.AI

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

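Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li

AI summary: Proposes the D$^3$ framework, which models directional influence among training units with a dynamic directed graph and solves a constrained optimization problem to determine the training order, improving efficiency in both LLM pre-training and post-training.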

Abstract

Training data plays a central role in large language model (LLM) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improve learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. To facilitate future research, the code is available at https://github.com/xuyj233/D3.

2605.31163 2026-06-01 stat.ML cs.LG

Memory by Design: Probabilistic Sequence Layers

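Matthew Dowling, Hyungju Jeon, Cristina Savin, Il Memming Park

AI summary: Proposes the design-model framework, which derives efficient recurrent sequence maps through exact Bayesian filtering. In the linear-Gaussian instantiation, the Bayesian Layer propagates a mean and a covariance to track uncertainty; the framework unifies several sub-quadratic recurrences and improves robustness and long-context retrieval.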

Comments
Preprint, in submission
Abstract

We introduce the design-model framework: a way to derive efficient recurrent sequence maps from explicit assumptions about memory. A design model writes evidence into memory by exact Bayesian filtering; a query-dependent readout produces a predictive distribution whose mean is the layer output. In our linear-Gaussian instantiation, the Bayesian Layer propagates both a mean and a covariance: the covariance tracks uncertainty over stored associations, steering writes toward uncertain directions, attenuating gains as evidence accumulates, and preserving confident memories. The same framework unifies several sub-quadratic recurrences. Linear attention, GLA, and Mamba-2/SSD are exact filters under one design model, whereas DeltaNet and related Delta-rule models arise as covariance-reset reductions under another. Restoring the covariance yields closed-form predictions for retrieval dynamics, verified empirically, and improves robustness beyond the training regime across controlled collision studies, learned associative recall, and the Zoology MQAR benchmark; distilling Bayesian Layers into a pretrained 340M Gated DeltaNet improves RULER long-context retrieval at matched compute.
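As background for the recurrences named above, the sketch below contrasts the plain linear-attention write with the delta-rule write on a small key-value memory matrix. These are the textbook ungated, unnormalized forms. The Bayesian Layer itself, with its covariance propagation, is not reproduced here, and the helper names are illustrative.

```python
def outer_add(S, k, u, scale=1.0):
    """Return S + scale * k u^T for a (d_k x d_v) memory matrix S."""
    return [[S[i][j] + scale * k[i] * u[j] for j in range(len(u))]
            for i in range(len(k))]


def read(S, q):
    """Readout S^T q: the value currently associated with query q."""
    return [sum(S[i][j] * q[i] for i in range(len(q))) for j in range(len(S[0]))]


def linear_attention_write(S, k, v):
    """Additive write: S <- S + k v^T."""
    return outer_add(S, k, v)


def delta_rule_write(S, k, v, beta=1.0):
    """Error-correcting write: S <- S + beta * k (v - S^T k)^T."""
    prediction = read(S, k)
    error = [vj - pj for vj, pj in zip(v, prediction)]
    return outer_add(S, k, error, beta)


empty = [[0.0, 0.0], [0.0, 0.0]]
k, v1, v2 = [1.0, 0.0], [1.0, 2.0], [3.0, 4.0]

# Write two different values under the same key.
S_lin = linear_attention_write(linear_attention_write(empty, k, v1), k, v2)
S_delta = delta_rule_write(delta_rule_write(empty, k, v1), k, v2)
```

With a unit-norm key and `beta=1.0`, the delta rule overwrites the old association while linear attention accumulates both values. Key collisions of this kind are what the controlled collision studies in the abstract probe.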

2605.31159 2026-06-01 cs.LG cs.AI

Trust-Region Behavior Blending for On-Policy Distillation

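Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov

AI summary: Proposes Trust-Region behavior Blending (TRB), a warmup method that replaces the early student policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, addressing the poor quality of early student rollouts in on-policy distillation and achieving the best average performance in math-reasoning distillation.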

Abstract

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
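One way to picture a "closest-to-teacher policy inside a student-centered KL trust region" is the single-token sketch below. It assumes strictly positive categorical distributions, measures closeness as KL(b || teacher), and constrains KL(b || student) to stay within the budget. Under those assumed divergence directions the solution lies on the geometric interpolation path between student and teacher, so bisection over the mixing weight suffices. The paper's exact formulation may differ.

```python
import math


def kl(p, q):
    """KL(p || q) for categorical distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_blend(student, teacher, lam):
    """Normalized student^(1 - lam) * teacher^lam, for lam in [0, 1]."""
    weights = [s ** (1.0 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]


def trust_region_behavior(student, teacher, kl_budget, iters=50):
    """Move as far toward the teacher as the KL budget around the student allows."""
    if kl(teacher, student) <= kl_budget:
        return list(teacher)          # the teacher itself is inside the region
    lo, hi = 0.0, 1.0                 # lo always satisfies the constraint
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= kl_budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
behavior = trust_region_behavior(student, teacher, kl_budget=0.1)
```

Annealing `kl_budget` toward zero shrinks the region until the behavior policy coincides with the student, which mirrors the return to pure student rollouts described in the abstract.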

2605.31156 2026-06-01 cs.LG

TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery

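Zi-Rong Li, Si-Yang Liu, Tian-Zuo Wang, Han-Jia Ye

AI summary: Proposes TabCausal, a causal discovery foundation model pretrained at scale across diverse causal environments with a dynamic task construction strategy, outperforming existing methods on synthetic and semantic benchmarks.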

Abstract

Causal discovery aims to recover directed causal relations from observational and interventional data, providing a basis for mechanistic understanding and reliable decision-making. Causal discovery foundation models (CDFMs) seek to amortize this problem by mapping a dataset directly to a causal graph in a single forward pass, avoiding per-dataset testing, search, or optimization. However, existing CDFMs remain limited, often failing to consistently match strong classical methods, and we find that a key bottleneck is how causal pretraining tasks are constructed. Based on this observation, we propose TabCausal, a data-driven CDFM trained with broad causal pretraining over diverse graph priors, structural mechanisms, noise models, dimensions, sample sizes, and intervention regimes. A dynamic task construction strategy composes these causal environments into varied discovery tasks, enabling more transferable structural learning from observational and mixed-interventional data. On large-scale synthetic benchmarks, TabCausal achieves better macro-averaged performance than a diverse set of causal discovery baselines. To further bridge abstract synthetic generators and realistic causal reasoning scenarios, we introduce a protocol-guided and LLM-audited semantic causal environment benchmark, where domain-grounded SCMs generate interpretable observational and interventional datasets for out-of-distribution analysis. Across both synthetic and semantic environments, TabCausal demonstrates robust structure recovery, especially under interventional evidence, highlighting broad causal pretraining as a key ingredient for transferable amortized causal discovery.

2605.31155 2026-06-01 cs.LG

Learning Hyperspherical Time-Frequency Representations for Time-Series Out-of-Distribution Detection

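Willian T. Lunardi, Samridha Shrestha, Martin Andreoni

AI summary: This paper proposes a representation learning method based on hyperspherical embeddings that combines a von Mises-Fisher objective with time- and frequency-domain encoders for time-series out-of-distribution detection, outperforming contrastive learning and post-hoc methods on the UCR and UEA datasets.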

Comments
14 pages, 2 figures, 4 tables, accepted at IJCAI-ECAI 2026
Abstract

Out-of-distribution (OOD) detection for time-series data remains underexplored compared to vision and language, with limited principled understanding of how supervised time-series representations can be leveraged for reliable detection under distributional shifts. This work formulates time-series OOD detection as representation learning with hyperspherical embeddings, where class-conditional structure is induced by a von Mises-Fisher (vMF) likelihood-based objective on the unit sphere. The learned representation combines time- and frequency-domain views of the input signal via domain-specific encoders, integrating them into a joint embedding space for OOD detection. Detection uses distance-based scores over the learned embeddings, including k-nearest neighbors (k-NN) and Mahalanobis scores. We evaluate the approach at scale on the complete UCR and UEA time-series archives under a cross-dataset protocol. Empirical results show consistent improvements under both k-NN and Mahalanobis scoring over strong contrastive learning and post-hoc baselines in the same setting. Code is available at https://github.com/tiiuae/hypertf-time-series-ood.
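The k-NN scoring step can be sketched in a few lines. The version below assumes embeddings have already been produced by some encoder (not shown), L2-normalizes them onto the unit sphere, and uses the distance to the k-th nearest training embedding as the OOD score. Function names and the toy vectors are illustrative, and the authors' repository should be consulted for the actual implementation.

```python
import math


def l2_normalize(v):
    """Project an embedding onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def knn_ood_score(query, train_embeddings, k=1):
    """Distance from the query to its k-th nearest training embedding.

    Larger scores suggest the query is further from the training distribution.
    """
    q = l2_normalize(query)
    dists = sorted(math.dist(q, l2_normalize(e)) for e in train_embeddings)
    return dists[k - 1]


train = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.05]]
in_dist_score = knn_ood_score([1.0, 0.0], train, k=2)
ood_score = knn_ood_score([-1.0, 0.0], train, k=2)
```

A threshold on this score, typically chosen on held-out in-distribution data, turns it into a detector. The Mahalanobis variant mentioned in the abstract would replace the nearest-neighbor distance with a class-conditional Gaussian distance.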

2605.31153 2026-06-01 cs.CV

BIAS-ID: A Framework for Analyzing Transformation Biases in AI-Generated Image Detectors

Jonas Ricker, Asja Fischer, Erwin Quiring

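AI summary: This paper proposes BIAS-ID, a framework for analyzing and quantifying transformation biases in AI-generated image detectors, and shows experimentally that several state-of-the-art detection methods are strongly affected by bias.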

English abstract

Given the surge of harmful AI-generated imagery online, reliably distinguishing authentic images from generated ones has become an urgent research topic. While many proposed detection methods perform well under controlled settings, they often collapse when tested on real-world data. A potential root cause is subtle biases in the detectors' training data. As a result, detectors may rely on spurious correlations instead of learning true forensic artifacts. While a recent line of work has identified the problem, there is not yet an established protocol to evaluate how biased a detector actually is. In this work, we therefore take a step back: First, we discuss what it means for a detector to be biased, and how this differs from a lack of robustness. Second, we propose BIAS-ID, a transparent framework for analyzing and quantifying the presence of transformation biases in AI-generated image detectors. We validate our framework by performing an evaluation of six detectors across two datasets, revealing that several state-of-the-art detection methods are strongly affected by biases. Our results highlight the importance of bias-aware evaluation for developing reliable AI-generated image detectors.

2605.31152 2026-06-01 stat.ML cs.LG cs.NA math.NA

Approximation and learning of anisotropic and mixed smooth functions by deep ReLU neural networks

Yunfei Yang, Jun Fan

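AI summary: This paper studies the approximation rates of deep ReLU neural networks for anisotropic and mixed smooth function classes and proves that near-optimal approximation rates are attainable under a mean-smoothness condition.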

English abstract

This paper studies how efficiently deep ReLU neural networks can approximate and learn smooth functions. When the error is measured in the $L^p([0,1]^d)$ norm and the approximator is a network with width $W$ and depth $L$, recent works have proven the super-approximation rate $\mathcal{O}((WL)^{-2s/d})$ for the Besov space $\mathcal{B}^s_{q,r}([0,1]^d)$ under the Sobolev embedding condition $s/d>1/q-1/p$. In order to overcome the curse of dimensionality in this rate, we extend this result to anisotropic and mixed smooth function classes. We establish the approximation rate $\mathcal{O}((WL)^{-2\tilde{s}})$ for the anisotropic Besov space $\mathcal{B}^{\boldsymbol{s}}_{q,r}([0,1]^d)$ with anisotropic smoothness $\boldsymbol{s}=(s_1,\dots,s_d)$ under the embedding condition $\tilde{s} > 1/q-1/p$, where the mean smoothness is $\tilde{s} = (\sum_{i=1}^d s_i^{-1})^{-1}$. For the mixed smooth Besov space $\mathcal{MB}^s_{q,r}([0,1]^d)$ with mixed smoothness $s>1/q-1/p$, we show that the approximation rate $\mathcal{O}((WL)^{-2s})$ holds up to logarithmic factors. Using these results, we also derive approximation bounds for the composition of anisotropic Besov functions. As an application, it is shown that deep ReLU neural networks can achieve minimax optimal rates up to logarithmic factors for a wide range of smooth function classes.
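
A quick sanity check on the mean smoothness, worked out here from the formula in the abstract rather than taken from the paper. In the isotropic case $s_1=\dots=s_d=s$, the anisotropic rate reduces to the isotropic one:

$$
\tilde{s} = \Big(\sum_{i=1}^d s^{-1}\Big)^{-1} = \frac{s}{d}, \qquad \mathcal{O}((WL)^{-2\tilde{s}}) = \mathcal{O}((WL)^{-2s/d}).
$$

For a hypothetical anisotropic example with $d=3$ and $\boldsymbol{s}=(1,2,2)$:

$$
\tilde{s} = \Big(1 + \tfrac{1}{2} + \tfrac{1}{2}\Big)^{-1} = \tfrac{1}{2}, \qquad \mathcal{O}((WL)^{-2\tilde{s}}) = \mathcal{O}((WL)^{-1}).
$$

By comparison, plugging the smallest directional smoothness $s=1$ into the isotropic rate with $d=3$ would give only $\mathcal{O}((WL)^{-2/3})$, which suggests how extra smoothness in some directions can improve the exponent.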

2605.31149 2026-06-01 cs.HC cs.AI

Developing a UXR Point of View for Cognitive Accessibility in Mobile Learning with Generative AI

Fatima Ahmad Muazu, Festus Adedoyin, Huseyin Dogan, Abiodun Adedeji, Melike Akca, Olumuyiwa Ayorinde

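AI summary: This study combines UX research principles with large language model-supported analysis to propose a Cognitive Accessibility UXR Playbook for improving the quality of requirements for mobile learning systems aimed at learners with cognitive disabilities.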

English abstract

This study investigates how UX research (UXR) principles, combined with Large Language Model (LLM)-supported analysis, can be used to improve the quality of requirements for mobile learning systems designed for learners with cognitive disabilities. Using the UXR Point-of-View (PoV) pyramid as a methodological framework, the study progressed through four stages: foundational structuring of psychological, behavioral, and design layers; structured validation using the DeLone and McLean Information Systems Success Model and Quality Function Deployment (QFD); insight consolidation through the development of nine Cognitive Accessibility UXR Play Cards; and stakeholder-specific PoV articulation to support interdisciplinary communication. LLM-supported synthesis was integrated to assist in theme clustering, requirement refinement, and hypothesis formulation under human oversight. Findings suggest that many usability and engagement challenges in mobile learning originate from ambiguous or under-specified requirements rather than interface design alone. By embedding cognitive accessibility principles into measurable and technically traceable requirements, the proposed Cognitive Accessibility UXR Playbook provides a structured pathway for aligning theory, system architecture, and stakeholder strategy.

2605.31148 2026-06-01 cs.CV cs.AI cs.CL

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

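AI summary: This paper introduces the SpatialAct benchmark, which uses tasks such as multi-turn interactive refinement and single-step error detection and fix to reveal a significant gap between spatial reasoning and action for current vision-language models in 3D scenes.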

English abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2605.31147 2026-06-01 cs.HC cs.AI

Developing a Culturally Grounded, AI-Augmented UX Research Point of View (POV): An Exemplar Case Study from Telemedicine Dementia Care

Abiodun Adedeji, Huseyin Dogan, Festus Adedoyin, Michelle Heward, Melike Akca, Emmanuel Oluwatosin Oluokun, Fatima Ahmad Muhazu, Olumuyiwa Ayorinde

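AI summary: Through a telemedicine dementia care case, this paper shows how mixed-methods research, hypothesis generation, and ontology modelling can be combined, with generative AI integrated as a collaborative tool, to build a culturally sensitive, defensible user experience research point of view (POV).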

English abstract

User Experience Research (UXR) Points of View (POVs) distil complex and often fragmented research evidence into actionable perspectives that guide how teams interpret user needs, frame design decisions, and align stakeholders. Although POVs are widely used in industry practice, there are few published examples that explicitly document how POVs are constructed, particularly in culturally sensitive and low-resource contexts. This paper presents an exemplar case study demonstrating how a culturally grounded, AI-augmented UXR POV was developed to inform TeleDeCa, a telemedicine dementia care framework for family caregivers in Nigeria. Building on the UXR POV Playbook and pyramid framework, we illustrate how mixed-methods research, hypothesis generation, and ontology-based modelling can be combined to form a defensible POV without requiring a fully finalised system or validated outcomes. Generative AI (GenAI) is integrated across the UXR POV framework as a bounded research collaborator, supporting synthesis, hypothesis exploration, and narrative construction while preserving human judgment, ethical accountability, and cultural sensitivity. The contribution of this paper lies in the extraction of reusable Play Cards and a Play that extend the UXR POV Playbook and serve as exemplar material for the CHI 2026 workshop on developing AI-powered UXR POVs.

2605.31146 2026-06-01 cs.HC cs.AI

From Evidence to Design: Developing an AI-Augmented UX Research Point of View for Digital Wellbeing in Emergency and Public Safety Contexts

Olumuyiwa Ayorinde, Huseyin Dogan, Festus Adedoyin, Nan Jiang, Emmanuel Oluokun, Abiodun Adedeji, Melike Akca

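AI summary: This study combines user experience research methods with AI-supported analysis to develop design direction for digital wellbeing interventions for emergency and public safety personnel. It identifies patterns through literature analysis, integrates behaviour change techniques and persuasive design principles, and produces a UXR PoV Pyramid, nine UXR Play Cards, and stakeholder narratives.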

English abstract

This paper investigates how User Experience Research (UXR) methods can be combined with AI-supported analysis to develop clearer design direction for digital wellbeing interventions targeting Emergency and Public Safety Personnel (EPSP). EPSP work in high-stress, shift-based environments where cognitive fatigue and unpredictable schedules reduce engagement with conventional wellbeing tools. Using the UXR Point-of-View (PoV) framework, this study applied an AI-supported literature analysis process to identify recurring psychological, behavioural, and design patterns. Behaviour Change Techniques and Persuasive Technology principles were integrated throughout interpretation to connect evidence with practical design reasoning. The process resulted in a UXR PoV Pyramid, nine UXR Play Cards, and stakeholder-focused PoV narratives. Findings show that effective wellbeing systems for EPSP must minimise cognitive effort, adapt to operational context, and prioritise psychological safety. The work demonstrates how AI can assist large-scale evidence interpretation while human researchers maintain responsibility for contextual judgement and design direction.

2605.31145 2026-06-01 cs.CV cs.AI cs.LG

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

Mohammed Asad Karim, Vinay Kumar Verma

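AI summary: Proposes a two-stage training framework that optimizes in-context attention between support boxes and query images and adds GRPO reinforcement learning, achieving category-agnostic in-context object localization without category supervision; a 7B model outperforms models of up to 72B parameters.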

Comments
Accepted at ICML 2026. * Equal Contributions

English abstract

In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
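
To make the reinforcement learning step more concrete, the sketch below shows the group-relative advantage computation that gives GRPO its name, applied to a localization reward. This is a hedged illustration only: the abstract does not specify the reward, so using IoU as the reward is an assumption, as are the function names, the toy boxes, and the use of the population standard deviation for normalization.

```python
import statistics


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style).

    Each sample's advantage is its reward minus the group mean,
    divided by the group standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of boxes sampled for one query image.
target = (10.0, 10.0, 50.0, 50.0)
sampled = [
    (10.0, 10.0, 50.0, 50.0),      # exact match
    (20.0, 20.0, 60.0, 60.0),      # partially overlapping
    (100.0, 100.0, 120.0, 120.0),  # complete miss
]
rewards = [iou(box, target) for box in sampled]
advantages = group_relative_advantages(rewards)
```

Samples that localize better than the group average receive positive advantages and are reinforced; worse-than-average samples receive negative ones, so no separate value network is needed.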

2605.31143 2026-06-01 cs.HC cs.AI

Extending the UXR Point of View Pyramid: A Generative AI-Augmented Methodology for Human-Centred AI Systems

Festus Fatai Adedoyin, Huseyin Dogan, Melike Akca, Abiodun Adedeji

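AI summary: For AI financial systems in UK debt management, this paper extends the UXR PoV pyramid into a generative AI-augmented methodology comprising an AI-Augmented PoV Pyramid, a structured prompt architecture, and an AI-enabled Playbook Card system, aiming to improve interpretability, fairness, and accountability.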

English abstract

Rising household debt and cost-of-living pressures in the United Kingdom have intensified the role of AI-driven financial technologies in mediating credit assessment, repayment structuring, and debt support services. These systems increasingly shape consequential financial decisions, yet they operate within complex socio-technical environments characterised by regulatory constraint, algorithmic opacity, and heightened vulnerability risk. User Experience Research (UXR) Points of View (PoVs) are critical in translating heterogeneous research evidence into strategic direction for product and governance decisions. However, the existing UXR PoV framework was not designed for AI-mediated financial systems where interpretability, fairness, and accountability are central. This paper extends the UXR PoV pyramid into an AI-augmented methodological framework for Human-Centred AI debt management technologies in the UK financial services context. We formalise (1) an AI-Augmented PoV Pyramid, (2) a structured prompt architecture for synthesis and hypothesis generation, and (3) an AI-enabled Playbook Card system that embeds Generative AI into UXR workflows while preserving traceability and ethical oversight. Generative AI is positioned not as an analytic authority, but as an epistemic support mechanism subject to human validation and regulatory awareness. By grounding the framework in debt management technologies, including affordability assessment, repayment planning, and financial stress prediction systems, this work advances UXR methodology for high-stakes financial AI environments and contributes to the evolution of responsible, AI-powered UXR practice within the CHI community.

2605.31142 2026-06-01 cs.CL cs.AI

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov

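AI summary: By introducing dataset-composition robustness and ranking-scheme robustness indicators, this work systematically analyzes how sensitive multilingual model rankings in MTEB are to changes in evaluation design, finding that large LLM-based models are often robust top performers, though not uniformly across all tasks.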

English abstract

Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to changes in the aggregation method). These indicators enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.
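
The idea of ranking-scheme sensitivity can be sketched with two simple aggregation rules and a rank correlation between the orderings they produce. This is an illustrative sketch under stated assumptions, not the paper's indicators: the two schemes (mean score and mean per-dataset rank), the use of Kendall's tau, the model names, and the scores are all hypothetical.

```python
from itertools import combinations


def rank_by_mean_score(scores):
    """Order models by average score across datasets (higher is better)."""
    means = {m: sum(v) / len(v) for m, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def rank_by_mean_rank(scores):
    """Order models by their summed per-dataset rank (lower is better)."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    total_rank = {m: 0 for m in models}
    for d in range(n_datasets):
        ordered = sorted(models, key=lambda m: scores[m][d], reverse=True)
        for position, m in enumerate(ordered, start=1):
            total_rank[m] += position
    return sorted(models, key=total_rank.get)


def kendall_tau(ranking_a, ranking_b):
    """Kendall rank correlation between two tie-free orderings of the same items."""
    pos_a = {m: i for i, m in enumerate(ranking_a)}
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical per-dataset scores for three models on three datasets.
scores = {
    "model_a": [0.90, 0.50, 0.52],
    "model_b": [0.60, 0.62, 0.60],
    "model_c": [0.55, 0.55, 0.50],
}
by_score = rank_by_mean_score(scores)
by_rank = rank_by_mean_rank(scores)
tau = kendall_tau(by_score, by_rank)
```

Here `model_a` wins on mean score thanks to one very strong dataset, while `model_b` wins on mean rank, so the two schemes disagree on the top model. A tau well below 1 flags that the benchmark conclusion depends on the aggregation choice.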

2605.31140 2026-06-01 cs.CR cs.CL

EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

Yu Li, Yuenan Hou, Yingmei Wei, Yanming Guo, Chaochao Lu

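AI summary: Proposes EvoDefense, an experience-guided co-evolving black-box defense paradigm in which a guard LLM and an experience memory module refine strategies over attack-defense iterations, generalizing to unseen attacks and target models.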

English abstract

Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At the core of EvoDefense is a continuous attack-defense evolution loop, where an attack generator and the guard model iteratively refine their attack strategies and defense policies through experience-guided optimization. This design enables EvoDefense to generalize across unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks, while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.
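
The relative ASR reductions implied by these HarmBench figures are not stated in the abstract; computed here from the reported values, they come to roughly:

$$
\frac{29.4 - 8.4}{29.4} \approx 71.4\% \ \text{(Gemini-3-flash)}, \qquad \frac{43.4 - 6.2}{43.4} \approx 85.7\% \ \text{(LLaMA-3-8B-Instruct)}.
$$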

2605.31138 2026-06-01 cs.HC cs.AI

Developing an AI-Powered UX Research Point of View for Digital Health in A Regulatory Context: An Exemplar Case from MSM and Transgender HIV Care in Nigeria

Emmanuel Oluwatosin Oluokun, Festus Fatai Adedoyin, Huseyin Dogan, Nan Jiang, Melike Akca, Abiodun Adedeji, Olumuyiwa Ayorinde, Fatima Ahmad Muazu

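AI summary: This paper proposes a generative AI-augmented user experience research methodology that uses a four-stage UXR process and ten theory-informed UXR Play Cards to guide the design of digital health interventions for HIV care among men who have sex with men (MSM) and transgender individuals in Nigeria. Its core contribution is a replicable, stigma-aware, privacy-centred framework for responsible GenAI use.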

English abstract

User Experience Research (UXR) in legal and regulatory contexts presents unique challenges that require specialised approaches to protect vulnerable populations whilst generating actionable insights. Digital consultation, appointment booking, and medication delivery platforms show promise for extending care access; however, their real-world effectiveness is curtailed by an absence of theoretically grounded UXR methodologies that adequately account for the psychosocial conditions of these populations. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to guide the design of psychologically safe, low-cognitive-load digital health interventions for MSM and transgender individuals living with HIV/AIDS in Nigeria. Drawing from empirical research involving co-design workshops, thematic analysis, and requirements engineering, the methodology is operationalised through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in ten theory-informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. Each play contains actionable tasks, AI-augmented approaches, and ethical guardrails tailored for research with marginalised populations. The output is a set of ten theory-informed UXR Play Cards translating psychological insight and empirical evidence into actionable design guidance. The core contribution is a replicable, stigma-aware, and privacy-centred framework for responsible GenAI use in UXR practice, advancing human-centred digital health design for marginalised communities.

2605.31137 2026-06-01 cs.CV

PolSAR Image Classification using a Hybrid Complex-Valued Network (HybridCVNet)

Mohammed Q. Alkhatib

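AI summary: Proposes HybridCVNet, a hybrid complex-valued network combining CV-CNN and CV-ViT that improves PolSAR image classification by extracting complementary information and exploiting interdependencies within the data, reaching 97.39% overall accuracy on the Flevoland dataset and a Kappa value of 0.972 on the San Francisco dataset.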

Comments
Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

English abstract

Recently, convolutional neural networks (CNNs) have become popular for image classification due to their effectiveness in computer vision tasks. Now, researchers are exploring the potential of vision transformers (ViTs) in remote sensing and Earth observation. However, traditional Real-Valued networks often overlook important phase information in Complex-Valued (CV) data like polarimetric synthetic aperture radar (PolSAR) data. To address this, new CV deep architectures have emerged. HybridCVNet, a novel hybrid network, blends CV-CNN and CV vision transformer (CV-ViT) techniques. It efficiently combines CV 3D and 2D CNNs as feature extractors, enhancing PolSAR image classification by extracting complementary information and effectively leveraging interdependencies within the data. Experimental results from widely-used PolSAR datasets show HybridCVNet outperforms other methods, achieving an overall accuracy of 97.39% on the Flevoland dataset and showing promise even with just a 1% sampling ratio, with a Kappa value of 0.972 on the San Francisco dataset. Source code is accessible through https://github.com/mqalkhatib/HybridCVNet
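
For readers unfamiliar with the two metrics quoted above, the sketch below computes overall accuracy and Cohen's Kappa from a confusion matrix. It is a generic illustration of the standard definitions, not code from the HybridCVNet repository; the function names and the toy two-class matrix are hypothetical.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def cohens_kappa(confusion):
    """Cohen's Kappa: agreement corrected for chance agreement.

    Rows are true classes, columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_observed = sum(confusion[i][i] for i in range(n)) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    p_expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / (total * total)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical two-class confusion matrix with 100 test pixels.
confusion = [
    [45, 5],
    [10, 40],
]
accuracy = overall_accuracy(confusion)
kappa = cohens_kappa(confusion)
```

In this toy case the accuracy is 0.85 while Kappa is about 0.70, because half of the agreement would be expected by chance given the class balance.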

2605.31136 2026-06-01 cs.CL

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl

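AI summary: For lower-resource languages, this paper introduces MCN, a multilingual Citation Needed Detection corpus, and shows that small decoder-based language models fine-tuned with an encoder-style objective outperform large language models on cross-lingual tasks.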

English abstract

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn

2605.31131 2026-06-01 cs.HC cs.AI

UXR PoV for Neuroinclusive Emotion Regulation

Melike Akca, Mona Giff, Deniz Cetinkaya, Huseyin Dogan, Stephen Giff

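AI summary: This paper proposes a generative AI-augmented user experience research methodology that combines the DBT, SDT, and COM-B theoretical frameworks and uses a four-stage process to produce ten UXR Play Cards for designing neuroinclusive digital emotion regulation interventions for adults with ADHD.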

English abstract

Attention-deficit/hyperactivity disorder (ADHD) is a psychiatric disorder that presents in individuals through patterns of developmentally inappropriate levels of inattentiveness, hyperactivity, and impulsivity, with difficulties in decision making and emotional regulation (ER). Although digital and AI-based interventions have expanded access to ER support, many existing systems remain limited by weak theoretical integration, insufficient accommodation of neurodiversity, and a lack of structured user experience research (UXR) methodologies that bridge psychological insight with design practice. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to support the design of emotionally intelligent and Neuroinclusive digital ER interventions for adults with ADHD. The approach integrates empirical evidence with established psychological frameworks, namely Dialectical Behaviour Therapy (DBT), Self-Determination Theory (SDT), and the COM-B behavioural model, and leverages Generative AI as a co-analytic tool to support synthesis, hypothesis formation, and design articulation. The methodology is operationalized through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in a set of ten theory-informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. The primary contribution of this work is a replicable, bias-aware framework for integrating Generative AI into UXR practice, advancing human-centred and Neuroinclusive approaches to digital mental health design.