arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.12809 2026-05-27 cs.CV cs.AI cs.LG

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

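AI summary: Using a controllable 1D image-text testbed, the paper studies how Transformer-based vision-language encoders trained with CLIP-style contrastive learning come to understand left-right relations through the interaction of positional and token embeddings, and finds that label diversity matters more than layout diversity.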

Comments Accepted at ICML 2026

Abstract

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.
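
Illustrative sketch (not from the paper): the CLIP-style contrastive objective mentioned in this abstract is commonly written as a symmetric cross-entropy over image-text similarity logits. The version below is a minimal pure-Python rendering under that assumption; the function names, the temperature value of 0.07, and the toy 2D embeddings are all hypothetical and are not taken from the authors' setup.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: pair i of images and texts is the positive."""
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    # Cosine similarities divided by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: row i should select column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy embeddings (hypothetical): correctly paired vs. swapped captions.
matched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the correctly paired batch yields a much lower loss than the swapped one, which is the pressure that would force an encoder to distinguish "A left of B" from "B left of A" captions.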

2601.08267 2026-05-27 cs.CL

Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akiko Aizawa, Irene Li

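AI summary: Proposes the Med-CoReasoner framework, which integrates local clinical knowledge into an English logical scaffold through parallel English and local-language reasoning, structured concept abstraction, and concept-level alignment and retrieval, in order to narrow the multilingual gap in medical reasoning; across three benchmarks it improves multilingual reasoning performance by an average of 5%.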

Abstract

While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.

2601.08146 2026-05-27 cs.CL cs.AI cs.LG

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

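AI summary: Adapts Contextual Decomposition for Transformers (CD-T) to counterfactual-free circuit discovery via label-balanced activation means and task-directional relevance scoring, and uses the resulting circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), which minimizes catastrophic forgetting in low-resource cross-lingual sentiment transfer compared with global fine-tuning.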

Abstract

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.
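
Illustrative sketch (not from the paper): restricting parameter updates to a discovered circuit can be pictured as a gradient step that skips every parameter outside a chosen set. The parameter names, scalar "weights", and learning rate below are hypothetical stand-ins; how the circuit is actually selected (CD-T relevance scoring) is not reproduced here.

```python
def circuit_targeted_step(params, grads, circuit, lr=0.1):
    """Apply one SGD step, but only to parameters named in `circuit`."""
    return {
        name: value - lr * grads[name] if name in circuit else value
        for name, value in params.items()
    }

# Hypothetical parameter names; scalars stand in for weight tensors.
params = {"layer3.head7": 1.0, "layer3.layernorm": 0.5, "layer5.mlp": 2.0}
grads = {"layer3.head7": 0.2, "layer3.layernorm": -0.1, "layer5.mlp": 0.4}
circuit = {"layer3.head7", "layer3.layernorm"}

updated = circuit_targeted_step(params, grads, circuit)
```

Parameters outside `circuit` (here `layer5.mlp`) keep their original values, which is the intuition behind preserving source-language behavior.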

2511.02360 2026-05-27 cs.CV cs.CL

LaRe: Latent Refocusing for Multimodal Reasoning

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

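AI summary: Proposes the LaRe paradigm, which performs visual refocusing within the latent space and, combined with a semantic augmentation training strategy, improves reasoning accuracy while substantially reducing the number of tokens needed for inference.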

Abstract

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, but incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

2512.01572 2026-05-27 cs.LG cs.AI physics.app-ph

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su, Zhilu Lai

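AI summary: Proposes the Cascaded Sensing framework, which cascades a deterministic coarse-scale estimate with a fine-scale conditional diffusion model to address the ill-posedness and multimodal posterior of physical field reconstruction from extremely sparse measurements.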

Comments 34 pages, 22 figures

Abstract

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer physical fields from sparse measurements. In this regime, the posterior is severely underconstrained and inherently multimodal, making its approximation highly ill-conditioned. Specifically, deterministic mappings collapse uncertainty, direct conditional learning cannot cover the space of possible observation-conditioned solutions, and likelihood-guided sampling becomes highly sensitive to noise and sensor configurations. These limitations result in unstable posterior estimates and highlight the need for modeling uncertainty in a structural manner. To this end, we propose Cascaded Sensing, a hierarchical framework that restructures posterior inference across scales. Rather than modeling the full-field posterior directly, Cas-Sensing first resolves global structural ambiguity through a deterministic coarse-stage estimator. A neural-operator-based functional autoencoder, trained with masked inputs, maps sparse observations to a coarse-scale structural field, acting analogously to a maximum a posteriori estimator that selects the dominant global configuration. This structural anchor fixes the principal degrees of freedom of the posterior and transforms the problem into a better-conditioned residual inference task. A conditional diffusion model then learns only the refined-scale residual distribution, confining sampling to a stable neighborhood of plausible solutions and suppressing competition among observation-consistent modes. To enhance robustness under varying sensing conditions, we introduce mask-cascade training, which exposes the model to diverse sparse observation patterns through intermediate coarse reconstructions. During inference, manifold-constrained guidance enforces observation consistency as a refinement mechanism rather than a global mode-selection process.

2601.09886 2026-05-27 cs.CL

Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

Sathvik Nair, Byung-Doh Oh

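AI summary: Explains why language model probabilities outperform cloze data as predictors of processing effort by testing three hypotheses: freedom from low resolution, discrimination between semantically similar words, and accurate probabilities for low-frequency words.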

Comments 18 pages, 10 figures, accepted to ACL 2026 Main Conference

Abstract

How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.
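
Illustrative sketch (not from the paper): surprisal is conventionally defined as the negative log probability of a word, and cloze probabilities are relative frequencies of human completions. The example below, with made-up responses from a hypothetical pool of 40 participants, shows what "low resolution" means in practice: the smallest nonzero cloze probability is 1/N, so any word nobody produced has no finite cloze surprisal at all.

```python
import math
from collections import Counter

def surprisal_bits(p):
    """Surprisal in bits for a probability p > 0."""
    return -math.log2(p)

def cloze_probabilities(responses):
    """Relative frequency of each completion among the cloze responses."""
    counts = Counter(responses)
    total = len(responses)
    return {word: count / total for word, count in counts.items()}

# Hypothetical responses from 40 participants for a single context.
responses = ["dog"] * 30 + ["cat"] * 9 + ["puppy"] * 1
cloze = cloze_probabilities(responses)

# With N respondents the smallest nonzero probability is 1/N, so the
# largest finite cloze surprisal is log2(N) bits.
resolution_floor = 1 / len(responses)
max_finite_surprisal = surprisal_bits(resolution_floor)
```

A language model, by contrast, assigns some probability to every word in its vocabulary, so it can still separate a rare but plausible word such as "hound" from an implausible one, whereas both are simply absent from `cloze`.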

2601.08375 2026-05-27 cs.CV

Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation

Yuan Gao, Di Cao, Xiaohuan Xi, Sheng Nie, Shaobo Xia, Cheng Wang

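AI summary: Proposes LoGo, a source-free domain adaptation framework that addresses domain shift in geospatial point cloud semantic segmentation through local class-balanced prototype estimation and global distribution alignment based on optimal transport.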

Abstract

Semantic segmentation of 3D geospatial point clouds is fundamental to remote sensing applications, yet domain shifts caused by regional and acquisition-related variations often degrade model performance. Although domain adaptation can mitigate such shifts, existing methods typically require access to source-domain data, which is often infeasible due to privacy concerns and regulatory policies. To address this, we propose LoGo (Local-Global Dual-Consensus), a novel source-free unsupervised domain adaptation (SFUDA) framework requiring only a pretrained model and unlabeled target data. At the local level, we introduce a class-balanced prototype estimation module that ensures that robust feature prototypes can be generated even for sample-scarce tail classes, effectively mitigating the feature collapse caused by long-tailed distributions. At the global level, we introduce an optimal transport-based global distribution alignment module that formulates pseudo-label assignment as a global optimization problem, effectively correcting the over-dominance of head classes inherent in local greedy assignments, and thereby preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual-consistency pseudo-label filtering mechanism that retains only high-confidence pseudo-labels where local multi-augmented ensemble predictions align with global optimal transport assignments for self-training. Extensive experiments on two challenging benchmarks, encompassing cross-scene and cross-sensor settings, demonstrate that LoGo consistently outperforms existing state-of-the-art methods. The source code is available at https://github.com/GYproject/LoGo-SFUDA.
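
Illustrative sketch (not the LoGo implementation): one common way to cast pseudo-label assignment as a global optimal transport problem is entropic OT solved with Sinkhorn iterations, where the transport plan is forced to match a target class marginal. The function name, the uniform class marginal, the regularization strength, and the toy probabilities are assumptions made for illustration only.

```python
def sinkhorn_assign(probs, class_marginal, epsilon=0.1, n_iters=200):
    """Balanced pseudo-labels via entropic optimal transport (Sinkhorn)."""
    n, k = len(probs), len(probs[0])
    # Gibbs kernel for cost -log p: exp(log p / epsilon) = p ** (1 / epsilon).
    plan = [[max(p, 1e-12) ** (1.0 / epsilon) for p in row] for row in probs]
    for _ in range(n_iters):
        # Rescale columns so each class receives its target share of mass.
        for j in range(k):
            col = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= class_marginal[j] / col
        # Rescale rows so each sample carries mass 1 / n.
        for i in range(n):
            row = sum(plan[i])
            plan[i] = [x / (n * row) for x in plan[i]]
    # Each sample takes the class holding most of its transported mass.
    return [max(range(k), key=lambda j: row[j]) for row in plan]

# Toy softmax outputs (hypothetical): greedy argmax gives every point to class 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
greedy = [max(range(2), key=lambda j: row[j]) for row in probs]
balanced = sinkhorn_assign(probs, class_marginal=[0.5, 0.5])
```

In this toy case the greedy labels collapse onto the head class, while the transport-based labels move the two least confident points to the tail class, which is the over-dominance correction the abstract describes.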

2601.07737 2026-05-27 cs.CV cs.AI

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding

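AI summary: To evaluate how multimodal large language models handle counter-intuitive action scenes, the paper introduces the CAIT benchmark (400 high-fidelity synthetic scenes) and finds that open-source models ignore visual evidence because of language priors and perform near chance level; Chain-of-Thought reasoning improves accuracy but leads models to overthink and reject the visual content, and fine-tuning and structured prompting mitigate this bias.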

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as "a rabbit is chasing a tiger", where visual evidence explicitly contradicts common-sense expectations. We evaluate humans, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. While humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (achieving up to 0.88 accuracy), standard open-source instruction-tuned models perform at chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, the models automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.

2601.07284 2026-05-27 cs.RO

AdaMorph: Unified Motion Retargeting via Embodiment-Aware Adaptive Transformers

Haoyu Zhang, Shibo Jin, Lusong Li, Jun Li, Liang Lin, Xiaodong He, Zecui Zeng

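AI summary: Proposes AdaMorph, a unified framework that uses embodiment-aware adaptive Transformers to retarget human motion to diverse robot morphologies, with zero-shot generalization to unseen motions.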

Abstract

Retargeting human motion to heterogeneous robots is a fundamental challenge in robotics, primarily due to the severe kinematic and dynamic discrepancies between varying embodiments. Existing solutions typically resort to training embodiment-specific models, which scales poorly and fails to exploit shared motion semantics. To address this, we present AdaMorph, a unified neural retargeting framework that enables a single model to adapt human motion to diverse robot morphologies. Our approach treats retargeting as a conditional generation task. We map human motion into a morphology-agnostic latent intent space and utilize a dual-purpose prompting mechanism to condition the generation. Instead of simple input concatenation, we leverage Adaptive Layer Normalization (AdaLN) to dynamically modulate the decoder's feature space based on embodiment constraints. Furthermore, we enforce physical plausibility through a curriculum-based training objective that ensures orientation and trajectory consistency via integration. Experimental results on 12 distinct humanoid robots demonstrate that AdaMorph effectively unifies control across heterogeneous topologies, exhibiting strong zero-shot generalization to unseen complex motions while preserving the dynamic essence of the source behaviors.
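
Illustrative sketch (not the AdaMorph code): Adaptive Layer Normalization is usually described as a layer norm whose scale and shift are predicted from a conditioning vector instead of being fixed learned parameters. The version below assumes the common `(1 + scale) * normed + shift` form with linear projections of the condition; the weight matrices and the two-dimensional "embodiment" vector are hypothetical.

```python
import math

def adaptive_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """Layer norm whose scale and shift are linear functions of `cond`."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    # One (scale, shift) pair per feature, predicted from the condition.
    scale = [sum(w * c for w, c in zip(row, cond)) for row in w_scale]
    shift = [sum(w * c for w, c in zip(row, cond)) for row in w_shift]
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

features = [1.0, 2.0, 3.0, 4.0]
embodiment = [1.0, 0.0]  # hypothetical embodiment descriptor

zeros = [[0.0, 0.0] for _ in features]
plain = adaptive_layer_norm(features, embodiment, zeros, zeros)

# A nonzero shift projection moves every feature by 0.5 for this embodiment.
shift_weights = [[0.5, 0.0] for _ in features]
shifted = adaptive_layer_norm(features, embodiment, zeros, shift_weights)
```

With zero projection weights the layer reduces to ordinary layer normalization, so the conditioning only changes behavior where the embodiment descriptor calls for it, in contrast to concatenating the descriptor to the input.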

2601.06580 2026-05-27 cs.CL

Stylistic Evolution and LLM Neutrality in Singlish Language

Linus Tze En Foo, Weihan Angela Ng, Wenkai Li, Lynnette Hui Xian Ng

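AI summary: Analyzes stylistic change in informal digital messages over a decade and asks whether Large Language Models (LLMs) can generate temporally neutral outputs; stylistic separability increases with temporal distance, and LLMs face a structural trade-off between authenticity and temporal neutrality.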

Abstract

Singlish is a creole rooted in Singapore's multilingual environment that continues to evolve alongside social and technological change. We examine diachronic stylistic change across a decade of informal digital messages and ask whether Large Language Models (LLMs) can generate temporally neutral outputs approximating the stable essence of the variety. Using lexical, pragmatic, psycholinguistic, and encoder-based features, we find that stylistic separability increases with temporal distance, driven primarily by structural features such as length and complexity. Evaluated against a null distribution baseline, most LLMs fail to achieve both authenticity and temporal neutrality simultaneously, revealing a structural trade-off: models generating realistic Singlish inherit its temporal biases, while temporally neutral models produce inauthentic outputs. These findings position temporal neutrality as a diagnostic metric for assessing sociolectal grounding in LLMs.

2601.05899 2026-05-27 cs.AI

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

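AI summary: Presents TowerMind, a lightweight, multimodal game environment based on the tower defense subgenre, for evaluating the long-term planning and decision-making of large language models; it reveals a performance gap relative to human experts as well as key limitations.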

Comments AAAI 2026 Oral

Abstract

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).

2601.05729 2026-05-27 cs.CV

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

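AI summary: To address the weak effect of GRPO optimization in image-to-video generation, proposes TAGRPO, a contrastive-learning-inspired framework that aligns intermediate latents with high-reward trajectories and pushes them away from low-reward ones, uses a memory bank to increase diversity, and significantly outperforms DanceGRPO.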

Comments 18 pages, 12 figures

Abstract

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation. The deliverables will be updated at https://tagrpo.github.io/.
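
Illustrative sketch (not the TAGRPO loss): the "group relative" part of GRPO is commonly implemented by standardizing each rollout's reward against the other rollouts in its group, so that above-average samples receive positive advantages and below-average ones negative. The use of the population standard deviation and the `eps` constant are assumptions; implementations differ on both, and the trajectory-alignment term proposed in the paper is not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for three rollouts of one input image.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# When every rollout earns the same reward the group gives no learning signal.
flat = group_relative_advantages([2.0, 2.0])
```

The sign of each advantage is what would separate the high-reward trajectories to align with from the low-reward ones to move away from.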

2601.03525 2026-05-27 cs.LG cs.AI

Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation

Longwen Wang, Yirui Liu, Xuan'er Wu, Xiaohui Hu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

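AI summary: Proposes the VeRPO framework, which uses partial success on code tests as verifiable dense rewards, corrects cardinality bias with a dynamic, density-calibrated local reward, and combines it with global execution outcomes to improve reinforcement learning for code generation.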

Abstract

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewards enforce functional correctness but induce sparsity, while external Reward Models (RMs) provide dense supervision at the cost of misalignment and additional overhead. Since code evaluation naturally yields multiple test-case-level outcomes, partial success, i.e., passing a subset of test cases, offers an intrinsic, verifiable source of dense supervision. In this paper, we propose VeRPO (Verifiable Dense Reward Policy Optimization), an RL framework that systematically turns verifiable partial success into reliable dense rewards. We analyze partial-success rewards using a weighted sum formulation, theoretically identifying a critical cardinality bias that causes policy updates to disproportionately favor gains from easy-test successes over progress on frontier tests. Based on this, VeRPO introduces a dynamic, density-calibrated local reward that explicitly corrects this bias and provides robust dense supervision from partial success. To enhance alignment with end-to-end functional correctness, VeRPO further integrates the local dense reward with global execution outcomes. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO outperforms outcome-driven and RM-based baselines, achieving up to +8.83 pass@1 gain with negligible time cost (< 0.02%) and zero GPU memory overhead.
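
Illustrative sketch (not the VeRPO reward): the contrast between a sparse outcome reward and a weighted-sum partial-success reward can be shown in a few lines. The test results and the weights are made up; in particular, the weight of 3 on the hard test is a hypothetical stand-in and is not the density calibration proposed in the paper.

```python
def outcome_reward(test_results):
    """Sparse reward: 1 only when every test case passes."""
    return 1.0 if all(test_results) else 0.0

def partial_success_reward(test_results, weights=None):
    """Dense reward: weighted fraction of passed test cases."""
    if weights is None:
        weights = [1.0] * len(test_results)
    passed = sum(w for w, ok in zip(weights, test_results) if ok)
    return passed / sum(weights)

# Hypothetical run: three easy tests pass, one hard (frontier) test fails.
results = [True, True, True, False]

sparse = outcome_reward(results)
uniform = partial_success_reward(results)
reweighted = partial_success_reward(results, weights=[1.0, 1.0, 1.0, 3.0])
```

Under uniform weights the three easy passes already earn most of the reward, which hints at the cardinality bias the abstract describes; giving the frontier test more weight lowers the reward until real progress is made on it.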

2601.05028 2026-05-27 cs.LG

Approximate Equivariance via Projection-based Regularisation

Torben Berndt, Jan Stühmer

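AI summary: Proposes a projection-based regulariser that decomposes linear layers into equivariant and non-equivariant components and penalises the non-equivariant part at the operator level; the penalty can be computed exactly and efficiently, and the method outperforms prior approximate equivariance approaches, with substantial runtime gains over sample-based regularisers.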

Abstract

Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as $SO(3)$. This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.
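
Illustrative sketch (a toy finite-group analogue, not the paper's framework): for a finite group acting by permutations, the orthogonal projection of a linear layer onto its equivariant part is the group average of its conjugates, and the squared Frobenius distance to that projection can serve as a non-equivariance penalty. The example uses cyclic shifts, for which the equivariant matrices are exactly the circulant ones. The paper targets continuous groups such as $SO(3)$ and works at the operator level in spatial and spectral domains; none of that machinery is reproduced here, and all function names are hypothetical.

```python
def shift_conjugate(w, k):
    """Conjugate W by the cyclic shift k: entry (i, j) becomes W[i+k][j+k]."""
    n = len(w)
    return [[w[(i + k) % n][(j + k) % n] for j in range(n)] for i in range(n)]

def project_equivariant(w):
    """Average W over all cyclic shifts; the result is a circulant matrix."""
    n = len(w)
    total = [[0.0] * n for _ in range(n)]
    for k in range(n):
        conj = shift_conjugate(w, k)
        for i in range(n):
            for j in range(n):
                total[i][j] += conj[i][j]
    return [[v / n for v in row] for row in total]

def non_equivariance_penalty(w):
    """Squared Frobenius norm of the non-equivariant component of W."""
    w_eq = project_equivariant(w)
    return sum((a - b) ** 2 for ra, rb in zip(w, w_eq) for a, b in zip(ra, rb))

circulant = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]]
generic = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

penalty_circulant = non_equivariance_penalty(circulant)
penalty_generic = non_equivariance_penalty(generic)
```

Because the penalty is computed from the weights alone, no augmented samples are needed at training time, which is the contrast with sample-based regularisers drawn in the abstract.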

2410.00995 2026-05-27 cs.LG

CktGen: Automated Analog Circuit Design with Generative Artificial Intelligence

Yuxuan Hou, Hehe Fan, Jianrong Zhang, Yue Zhang, Hua Chen, Min Zhou, Faxin Yu, Roger Zimmermann, Yi Yang

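AI summary: Proposes CktGen, an analog circuit generation method based on a conditional variational autoencoder that decouples circuit and specification encoding and uses contrastive training and classifier guidance to generate valid circuits from target specifications, substantially outperforming existing methods.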

Comments Paper accepted by Engineering

Abstract

The automatic synthesis of analog circuits presents significant challenges. Most existing approaches formulate the problem as a single-objective optimization task, overlooking that design specifications for a given circuit type vary widely across applications. To address this, we introduce specification-conditioned analog circuit generation, a task that directly generates analog circuits based on target specifications. The motivation is to leverage existing well-designed circuits to improve automation in analog circuit design. Specifically, we propose CktGen, a simple yet effective variational autoencoder that maps discretized specifications and circuits into a joint latent space and reconstructs the circuit from that latent vector. Notably, as a single specification may correspond to multiple valid circuits, naively fusing specification information into the generative model does not capture these one-to-many relationships. To address this, we decouple the encoding of circuits and specifications and align their mapped latent space. Then, we employ contrastive training with a filter mask to maximize differences between encoded circuits and specifications. Furthermore, classifier guidance along with latent feature alignment promotes the clustering of circuits sharing the same specification, avoiding model collapse into trivial one-to-one mappings. By canonicalizing the latent space with respect to specifications, we can search for an optimal circuit that meets valid target specifications. We conduct comprehensive experiments on the open circuit benchmark and introduce metrics to evaluate cross-model consistency. Experimental results demonstrate that CktGen achieves substantial improvements over state-of-the-art methods.

2601.03089 2026-05-27 cs.CL cs.AI cs.LG

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

基于受控保留信息的仅解码器LLM归因忠实性评估

Xin Huang, Antoni B. Chan

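AI summary: Existing soft-perturbation faithfulness metrics are biased by differences in the number of retained words. This work proposes the π-Soft-NC and π-Soft-NS framework, which compares attribution methods fairly by controlling the expected retaining probability, and introduces Grad-ELLM, a gradient-based attribution method for autoregressive decoder-only LLMs.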

详情
AI中文摘要

大型语言模型(LLM)越来越多地使用输入归因方法进行评估,但比较这些解释仍然具有挑战性。现有的软扰动忠实性指标,如Soft-NC和Soft-NS,可能将归因质量与扰动期间保留的词数混为一谈:平均得分较高的归因方法可能保留更多词,从而获得膨胀的分数。为解决此问题,我们提出π-Soft-NC和π-Soft-NS,这是一个在相同期望保留概率下比较归因方法的评估框架,从而控制保留词数。我们进一步引入Grad-ELLM,一种针对自回归仅解码器LLM定制的基于梯度的归因方法,该方法在每个解码步骤将梯度导出的通道重要性与注意力导出的标记重要性相结合。在Llama和Mistral上的分类和开放生成任务实验表明,Grad-ELLM在π-Soft-NC下实现了强全面性导向的忠实性,而在π-Soft-NS下没有主导方法。我们的评估指标为比较LLM的可解释人工智能方法提供了一个严格的框架,将支持该领域的进展。

英文摘要

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $π$-Soft-NC and $π$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $π$-Soft-NC, while there is no dominant method under $π$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.

2601.01668 2026-05-27 cs.CL cs.AI

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

EHRSummarizer:一种隐私感知、FHIR原生的源接地EHR摘要参考架构

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

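AI summary: Proposes EHRSummarizer, a privacy-aware, FHIR-native reference architecture that retrieves HL7 FHIR R4 resources and uses constrained generation to produce source-grounded summaries in support of clinical chart review.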

Comments 15 pages, 2 figures, 2 tables. Version 2 clarifies missing-data status handling, medication-status ambiguity, controlled narrative-document handling, source-grounded resource grouping, and future source-to-summary traceability

详情
AI中文摘要

临床医生通常需要浏览碎片化的电子健康记录(EHR)界面,以整合患者问题、用药、近期就诊和纵向趋势的连贯图像。本文描述了EHRSummarizer,一种用于结构化EHR摘要的隐私感知、FHIR原生参考架构。该架构检索一组目标性的高收益HL7 FHIR R4资源,将其标准化为临床上下文包,并使用受约束的摘要阶段生成源接地摘要,旨在支持病历审查。该架构进一步阐明了缺失数据状态处理、用药状态模糊性、在可用时对叙述性临床文档的受控使用,以及未来的源到摘要可追溯性。本文描述的是参考架构和原型行为,而非经过验证的临床干预、自主临床决策支持系统或临床获益证据。在合成和测试FHIR环境上的原型演示展示了端到端行为和输出格式;然而,本文未报告临床结果、受控工作流研究或基准结果。我们概述了一个评估计划,重点关注忠实性、遗漏风险、时间正确性、可用性、隐私和操作监控,以指导未来的机构评估。

英文摘要

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This manuscript describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture for structured EHR summarization. The architecture retrieves a targeted set of high-yield HL7 FHIR R4 resources, normalizes them into a clinical context package, and uses a constrained summarization stage to produce source-grounded summaries intended to support chart review. The architecture further clarifies missing-data status handling, medication-status ambiguity, controlled use of narrative clinical documents when available, and future source-to-summary traceability. The manuscript describes a reference architecture and prototype behavior rather than a validated clinical intervention, autonomous clinical decision-support system, or evidence of clinical benefit. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes, controlled workflow studies, or benchmark results. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, privacy, and operational monitoring to guide future institutional assessment.

2601.01608 2026-05-27 cs.CV

Guiding Token-Sparse Diffusion Models

引导令牌稀疏扩散模型

Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

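AI summary: Sparsely trained diffusion models respond poorly to classifier-free guidance at inference. This work proposes a token-level Sparse Guidance method that lowers compute cost while keeping outputs high in quality and variance.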

详情
AI中文摘要

扩散模型在图像合成中质量高,但训练和推理成本昂贵。近期工作利用视觉内容固有的冗余性,仅对视觉信息子集进行训练以降低训练成本。虽然这些方法成功实现了更便宜且更有效的训练,但稀疏训练的扩散模型在推理时表现不佳,原因是它们对无分类器引导(CFG)响应不足。为解决此问题,我们提出稀疏引导(SG)。SG不使用条件丢弃作为引导扩散模型的信号,而是使用令牌级稀疏性。因此,SG更好地保留了条件预测的高方差,实现了高质量和高方差输出。在推理时利用令牌级稀疏性,SG以更低的计算量提高了保真度,在常用的ImageNet-256基准上以25%更少的FLOPs实现了1.58 FID,并在匹配基线质量时节省高达58%的FLOPs。为证明稀疏引导的有效性,我们使用训练时稀疏性训练了一个2.5B文本到图像扩散模型,并在推理时利用SG。SG在提高吞吐量的同时,在构图和人类偏好评分上取得了改进。

英文摘要

Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods succeed in making training cheaper and more effective, sparsely trained diffusion models struggle at inference: they respond poorly to Classifier-free Guidance (CFG), which leads to underwhelming performance. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving good-quality, high-variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training-time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while also increasing throughput.
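For orientation, standard classifier-free guidance extrapolates from an unconditional prediction toward the conditional one: guided = uncond + w * (cond - uncond). The sketch below writes that rule with a generic `guide` term. Treating a token-sparse forward pass as the `guide` input is an assumption made here for illustration; the abstract does not give SG's exact formulation, and `guided_prediction` is a hypothetical name.

```python
def guided_prediction(cond, guide, scale):
    """Extrapolate from the guiding prediction toward the conditional one.

    cond  : conditional model output (list of floats)
    guide : guiding output; unconditional in standard CFG, and (by assumption
            here) a token-sparse prediction in a Sparse-Guidance-like setup
    scale : guidance weight w; w = 1 returns cond, w > 1 extrapolates past it
    """
    return [g + scale * (c - g) for c, g in zip(cond, guide)]

cond_out = [1.0, 2.0]
guide_out = [0.5, 1.0]
no_guidance = guided_prediction(cond_out, guide_out, scale=1.0)
strong_guidance = guided_prediction(cond_out, guide_out, scale=2.0)
```

The combination rule is unchanged between the two settings in this sketch; only the source of the `guide` term differs.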

2601.00575 2026-05-27 cs.CL

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

InfoSynth: 信息引导的大语言模型基准合成

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

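AI summary: Proposes InfoSynth, a framework grounded in information theory (KL divergence and entropy) that automatically generates difficult, diverse Python coding benchmarks, producing accurate test cases and solutions 97% of the time.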

详情
AI中文摘要

大型语言模型(LLM)在推理和代码生成方面取得了显著进展,但高效创建新基准来评估这些能力仍然是一个挑战。传统的基准创建依赖人工,成本高且耗时。此外,现有基准常常污染LLM训练数据,因此需要新颖多样的基准来准确评估其真实能力。本文介绍了InfoSynth,一个基于信息论原理自动生成和评估推理基准的新框架。我们提出了基于KL散度和熵的度量标准,无需昂贵的模型评估即可量化基准的新颖性和多样性。在此框架基础上,我们开发了一个端到端的流水线,使用遗传算法和迭代代码反馈从种子数据集中合成稳健的Python编程问题。我们的方法在97%的情况下生成准确的新问题测试用例和解决方案,并且合成的基准在难度上始终高于先前的工作。此外,我们的算法提供了控制生成问题的新颖性/多样性和难度的方法。InfoSynth为LLM构建高质量、具有挑战性的编程基准提供了一个可扩展、自验证的流水线。项目页面:https://ishirgarg.github.io/infosynth_web/

英文摘要

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher difficulty compared to prior works. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, challenging coding benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/

2512.22666 2026-05-27 cs.CV cs.LG

INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

INTERACT-CMIL:用于结膜黑色素细胞上皮内病变分级的任务共享学习与任务间一致性

Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind, Julia M. Weller, Yousef Yeganeh, Martina C. Herwig-Carl, Shadi Albarqouni

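AI summary: Proposes INTERACT-CMIL, a multi-task deep learning framework that jointly predicts five histopathological axes through shared feature learning, combinatorial partial supervision, and an inter-task consistency loss, achieving relative macro F1 gains of up to 55.1% over CNN and foundation-model baselines on a dataset of 486 conjunctival biopsy patches.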

详情
Journal ref
IEEE ISBI 2026
AI中文摘要

结膜黑色素细胞上皮内病变(CMIL)的准确分级对于治疗和黑色素瘤预测至关重要,但由于细微的形态学线索和相互关联的诊断标准,仍然困难。我们提出INTERACT-CMIL,一个多头深度学习框架,通过共享特征学习与组合部分监督以及强制跨任务一致性的相互依赖损失,联合预测五个组织病理学轴:WHO4、WHO5、水平扩散、垂直扩散和细胞异型性。在来自三家大学医院的486张专家注释的结膜活检斑块的新整理多中心数据集上进行训练和评估,INTERACT-CMIL在CNN和基础模型(FM)基线上取得了一致的改进,相对宏F1增益高达55.1%(WHO4)和25.0%(垂直扩散)。该框架提供与专家分级一致的连贯、可解释的多标准预测,为CMIL诊断提供了可重复的计算基准,并朝着标准化数字眼科病理学迈出了一步。

英文摘要

Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.

2512.19602 2026-05-27 cs.CV

No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

无数据?没问题:面向缺失值的鲁棒视觉-表格学习

Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

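AI summary: Proposes RoVTL, a framework that achieves robust multimodal learning at any level of tabular data availability from 0% to 100% by using tabular attribute missingness as augmentation during contrastive pretraining and a Tabular More vs. Fewer loss during downstream tuning.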

详情
AI中文摘要

大规模医学数据库提供成像数据以及广泛的表格信息,如临床测量或人口统计数据。然而,这种丰富的表格属性并不反映现实世界的数据集,其中可能只有一部分属性可用。这种差异要求方法在推理时对缺失值保持鲁棒。为了解决这一挑战,我们提出了RoVTL(鲁棒视觉-表格学习),一个旨在处理任何级别表格数据可用性(从0%到100%)的框架。RoVTL包括两个关键阶段:对比预训练,其中我们将表格属性缺失作为数据增强引入以促进鲁棒性;以及下游任务微调,其中表格缺失通过一种新颖的Tabular More vs. Fewer损失来补充,该损失根据可用表格数据的数量对性能进行排序。结合门控交叉注意力融合模块,我们的微调方法在所有表格数据完整性场景下实现了一致的性能。我们在英国生物银行的 cardiac MRI 扫描上评估了RoVTL,证明了与先前方法相比对缺失表格数据的优越鲁棒性。此外,RoVTL成功泛化到外部 cardiac MRI 数据集进行多模态疾病分类,并扩展到自然图像领域,在汽车广告数据集上实现了鲁棒性能。模型权重和代码可在 https://github.com/marteczkah/RoVTL 获取。

英文摘要

Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as clinical measurements or demographics. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that remain robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning, where tabular missingness is complemented by a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with a gated cross-attention fusion module, our tuning approach enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural image domain, achieving robust performance on a car advertisements dataset. The model weights and code are available at https://github.com/marteczkah/RoVTL.
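The two ingredients named in the abstract, missingness as augmentation and a loss that ranks performance by the amount of available tabular data, can be sketched as follows. Both functions are hypothetical illustrations: the abstract does not give the masking scheme or the form of the Tabular More vs. Fewer loss, so the independent per-attribute masking and the hinge form below are assumptions. See the linked repository for the authors' code.

```python
import random

def mask_tabular(row, missing_rate, rng):
    """Return a copy of row with each attribute set to None with probability missing_rate."""
    return {key: (None if rng.random() < missing_rate else value)
            for key, value in row.items()}

def more_vs_fewer_penalty(loss_more, loss_fewer, margin=0.0):
    """Hinge penalty (assumed form): positive when the input with MORE tabular
    attributes has a higher task loss than the same input with FEWER."""
    return max(0.0, loss_more - loss_fewer + margin)

rng = random.Random(0)
patient = {"age": 63, "bmi": 27.4, "systolic_bp": 131}  # made-up example values
fully_observed = mask_tabular(patient, 0.0, rng)
fully_missing = mask_tabular(patient, 1.0, rng)
penalty_when_ranked_correctly = more_vs_fewer_penalty(loss_more=0.3, loss_fewer=0.5)
penalty_when_ranking_violated = more_vs_fewer_penalty(loss_more=0.5, loss_fewer=0.3)
```

A penalty of this shape is zero whenever extra tabular information does not hurt, which matches the stated goal of consistent performance across completeness levels.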

2512.19332 2026-05-27 cs.LG cs.LO

A Logical View of GNN-Style Computation and the Role of Activation Functions

GNN风格计算的逻辑视角与激活函数的作用

Pablo Barceló, Floris Geerts, Matthias Lanzinger, Klara Pakhomenko, Jan Van den Bussche

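AI summary: Studies the computational power of graph neural networks from a logical perspective via the language MPLang, analyzing how activation functions (ReLU versus bounded activations in particular) affect numerical and Boolean expressiveness, and proves for the first time that, in the presence of linear layers, ReLU is strictly more expressive for numerical queries than eventually constant (bounded) activations.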

详情
AI中文摘要

我们研究了MPLang的数值和布尔表达能力,MPLang是一种声明式语言,通过线性消息传递和激活函数捕获图神经网络(GNN)的计算。我们从A-MPLang(无激活函数的片段)开始,并基于游走求和特征刻画了其表达能力。对于有界激活函数,我们证明(在温和条件下)所有最终恒定的激活函数产生相同的表达能力——数值和布尔——并且它包含了先前为具有最终恒定激活函数但无线性层的GNN建立的逻辑。最后,我们证明了在存在线性层的情况下,无界激活函数与有界激活函数之间的第一个表达能力分离:使用ReLU的MPLang在数值查询上严格强于使用最终恒定激活函数(例如截断ReLU)的MPLang。这依赖于线性聚合与最终恒定非线性之间的微妙交互,并确立了使用ReLU的GNN比那些仅限于最终恒定激活函数和线性层的GNN更具表达能力。

英文摘要

We study the numerical and Boolean expressiveness of MPLang, a declarative language that captures the computation of graph neural networks (GNNs) through linear message passing and activation functions. We begin with A-MPLang, the fragment without activation functions, and give a characterization of its expressive power in terms of walk-summed features. For bounded activation functions, we show that (under mild conditions) all eventually constant activations yield the same expressive power - numerical and Boolean - and that it subsumes previously established logics for GNNs with eventually constant activation functions but without linear layers. Finally, we prove the first expressive separation between unbounded and bounded activations in the presence of linear layers: MPLang with ReLU is strictly more powerful for numerical queries than MPLang with eventually constant activation functions, e.g., truncated ReLU. This hinges on subtle interactions between linear aggregation and eventually constant non-linearities, and it establishes that GNNs using ReLU are more expressive than those restricted to eventually constant activations and linear layers.
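A small numerical sketch may help build intuition for the objects involved. It assumes that linear message passing means summing neighbor values, so that repeated aggregation counts walks, and it uses clipping to [0, 1] as one example of an eventually constant activation. The function names are hypothetical, and the example illustrates the flavor of the separation rather than the paper's proof.

```python
def relu(x):
    """Unbounded activation: identity on non-negative inputs."""
    return max(0.0, x)

def truncated_relu(x):
    """Eventually constant activation: constant 0 below 0 and constant 1 above 1."""
    return min(1.0, max(0.0, x))

def walk_counts(adjacency, length):
    """Number of walks of the given length starting at each node (A^length applied to 1)."""
    n = len(adjacency)
    counts = [1.0] * n
    for _ in range(length):
        counts = [sum(adjacency[v][u] * counts[u] for u in range(n)) for v in range(n)]
    return counts

# Path graph 0 - 1 - 2
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
features = walk_counts(path, 3)
after_relu = [relu(x) for x in features]
after_truncated = [truncated_relu(x) for x in features]
```

On this graph ReLU passes the walk counts through unchanged, while the truncated variant maps every node to the same value and loses the numerical distinction between the middle node and the endpoints.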

2512.17090 2026-05-27 cs.LG cs.AI

How to Square Tensor Networks and Circuits Without Squaring Them

如何平方张量网络和电路而不进行平方操作

Lorenzo Loconte, Adrián Javaloy, Antonio Vergari

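AI summary: Proposes a parameterization that uses orthogonality and determinism conditions to simplify marginalization in squared tensor networks and circuits, avoiding the extra complexity of squaring while preserving expressiveness and improving learning efficiency on distribution estimation.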

详情
AI中文摘要

平方张量网络(TNs)及其作为计算图的扩展——平方电路——已被用作表达性的分布估计器,同时支持闭式边际化。然而,平方操作在计算配分函数或边际化变量时引入了额外的复杂性,这阻碍了它们在机器学习中的应用。为了解决这个问题,张量网络的正则形式通过酉矩阵参数化以简化边际计算。然而,这些正则形式不适用于电路,因为电路可以表示不直接映射到已知张量网络的分解。受正则形式中的正交性和电路中实现可处理最大化的确定性的启发,我们展示了如何参数化平方电路以克服其边际化开销。我们的参数化即使在不同于张量网络的分解中也能实现高效的边际化,这些分解编码为电路,否则其结构会使边际化计算变得困难。最后,我们在分布估计上的实验表明,我们提出的平方电路条件在没有任何表达能力损失的情况下,实现了更高效的学习。

英文摘要

Squared tensor networks (TNs) and their extension as computational graphs--squared circuits--have been used as expressive distribution estimators that still support closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.
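Why orthogonality removes the squaring overhead can be seen in a toy case. Suppose a model scores a discrete input x with c(x) = sum_i w_i f_i(x) and defines p(x) proportional to c(x)^2. In general the partition function sums c(x)^2 over all x, but if the functions f_i are orthonormal over the domain it collapses to sum_i w_i^2. The single-variable setup and all names below are assumptions chosen for illustration; they are not the paper's circuit parameterization.

```python
def circuit_output(weights, basis, x):
    """c(x) = sum_i w_i * f_i(x), with basis[i][x] storing f_i(x)."""
    return sum(w * f[x] for w, f in zip(weights, basis))

def partition_brute_force(weights, basis, domain_size):
    """Z = sum_x c(x)^2, computed by enumerating the whole domain."""
    return sum(circuit_output(weights, basis, x) ** 2 for x in range(domain_size))

def partition_orthonormal(weights):
    """Z for an orthonormal basis: the cross terms vanish, leaving sum_i w_i^2."""
    return sum(w * w for w in weights)

# Two orthonormal functions over a domain of size 4 (scaled Hadamard rows).
basis = [[0.5, 0.5, 0.5, 0.5],
         [0.5, -0.5, 0.5, -0.5]]
weights = [3.0, 4.0]
z_brute = partition_brute_force(weights, basis, domain_size=4)
z_closed = partition_orthonormal(weights)
```

The closed form never touches the domain, which is the kind of saving that matters once the domain is exponentially large.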

2512.16111 2026-05-27 cs.LG eess.SP

BUILD with Precision: Bottom-Up Inference of Linear DAGs

精确构建:线性有向无环图的由底向上推断

Hamed Ajorlou, Samuel Rey, Gonzalo Mateos, Geert Leus, Antonio G. Marques

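AI summary: Proposes BUILD, an algorithm that exploits the distinctive structure of the ensemble precision matrix under a linear Gaussian SEM with equal noise variances to reconstruct the DAG exactly through a deterministic stepwise procedure, and that improves robustness with finite data by periodically re-estimating the precision matrix.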

详情
AI中文摘要

从观测数据中学习有向无环图(DAG)的结构是因果发现、统计信号处理和机器学习中的核心问题。在等噪声方差的线性高斯结构方程模型(SEM)下,该问题是可识别的,并且我们证明观测数据的集成精度矩阵展现出一种有助于DAG恢复的独特结构。利用这一性质,我们提出了BUILD(线性DAG的由底向上推断),一种确定性的逐步算法,该算法识别叶节点及其父节点,然后通过移除关联边来修剪叶节点以进入下一步,从真实的精度矩阵中精确重构DAG。在实践中,精度矩阵必须从有限数据中估计,而病态条件可能导致BUILD步骤中的误差累积。作为一种缓解策略,我们定期重新估计精度矩阵(随着叶节点被修剪,变量减少),以运行时换取增强的鲁棒性。在具有挑战性的合成基准上的可重复结果表明,BUILD与最先进的DAG学习算法相比具有优势,同时提供了对复杂性的明确控制。

英文摘要

Learning the structure of directed acyclic graphs (DAGs) from observational data is a central problem in causal discovery, statistical signal processing, and machine learning. Under a linear Gaussian structural equation model (SEM) with equal noise variances, the problem is identifiable, and we show that the ensemble precision matrix of the observations exhibits a distinctive structure that facilitates DAG recovery. Exploiting this property, we propose BUILD (Bottom-Up Inference of Linear DAGs), a deterministic stepwise algorithm that identifies leaf nodes and their parents, then prunes the leaves by removing incident edges to proceed to the next step, exactly reconstructing the DAG from the true precision matrix. In practice, precision matrices must be estimated from finite data, and ill-conditioning may lead to error accumulation across BUILD steps. As a mitigation strategy, we periodically re-estimate the precision matrix (with fewer variables as leaves are pruned), trading off runtime for enhanced robustness. Reproducible results on challenging synthetic benchmarks demonstrate that BUILD compares favorably to state-of-the-art DAG learning algorithms, while offering an explicit handle on complexity.
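The bottom-up idea can be reconstructed from standard linear-SEM algebra. This is a sketch under stated assumptions, not the authors' code. Assume x = Bx + e, where B[i][j] is the weight of edge j -> i and the noise has equal variance sigma^2, so the precision matrix is Theta = (I - B)^T (I - B) / sigma^2. A leaf then has the smallest diagonal entry (exactly 1 / sigma^2), its row of Theta gives its incoming weights, and pruning it amounts to a Schur complement. The function names are hypothetical, and the population (exact) precision matrix is assumed, with no re-estimation step.

```python
def precision_from_dag(B, sigma2):
    """Theta = (I - B)^T (I - B) / sigma2 for the SEM x = Bx + e."""
    d = len(B)
    M = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(d)] for i in range(d)]
    return [[sum(M[k][i] * M[k][j] for k in range(d)) / sigma2 for j in range(d)]
            for i in range(d)]

def recover_dag(theta):
    """Peel off one leaf per step from an exact precision matrix."""
    theta = [row[:] for row in theta]
    d = len(theta)
    nodes = list(range(d))          # original labels of the remaining variables
    B = [[0.0] * d for _ in range(d)]
    while len(nodes) > 1:
        # A leaf has no children, so its diagonal entry is minimal.
        k = min(range(len(nodes)), key=lambda i: theta[i][i])
        rest = [i for i in range(len(nodes)) if i != k]
        # For a leaf, Theta[leaf][p] = -B[leaf][p] / sigma2 and Theta[leaf][leaf] = 1 / sigma2.
        for i in rest:
            B[nodes[k]][nodes[i]] = -theta[k][i] / theta[k][k]
        # Marginalize the leaf out: Schur complement on the remaining block.
        theta = [[theta[i][j] - theta[i][k] * theta[k][j] / theta[k][k] for j in rest]
                 for i in rest]
        nodes = [nodes[i] for i in rest]
    return B

# Edges: 0 -> 1 (0.8), 0 -> 2 (-1.2), 1 -> 3 (0.5), 2 -> 3 (1.5)
B_true = [[0.0, 0.0, 0.0, 0.0],
          [0.8, 0.0, 0.0, 0.0],
          [-1.2, 0.0, 0.0, 0.0],
          [0.0, 0.5, 1.5, 0.0]]
B_hat = recover_dag(precision_from_dag(B_true, sigma2=0.7))
```

With an estimated precision matrix the argmin and the Schur complement both accumulate error, which is consistent with the abstract's motivation for periodically re-estimating the matrix on the remaining variables.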

2512.14561 2026-05-27 cs.CL

Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

大型语言模型与人类评分者在论文评分中的一致性:一项研究综合

Hongli Li, Che Han Chen, Kevin Fan, Chiho Young-Johnson, Soyoung Lim, Yali Feng

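AI summary: A synthesis of 65 studies finds that agreement between large language models and human raters in essay scoring is highly context-dependent and varies substantially both across and within studies.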

详情
AI中文摘要

尽管大型语言模型(LLMs)在自动论文评分(AES)中展现出越来越大的潜力,但关于其与人类评分者相比可靠性的实证结果仍然不一。遵循PRISMA 2020指南,我们综合了2022年1月至2025年8月间65项已发表和未发表的研究,这些研究考察了LLM生成的分数与人类评分之间的一致性。一致性水平在研究之间和研究内部均有显著差异,报告值范围广泛。总体而言,研究结果表明LLM-人类一致性高度依赖上下文。讨论了未来研究的启示、挑战和方向。

英文摘要

Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is highly context-dependent. Implications, challenges, and directions for future research are discussed.

2512.14140 2026-05-27 cs.CV

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

SketchAssist:一种用于语义编辑和精确局部重绘的实用助手

Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan

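AI summary: Proposes SketchAssist, an interactive sketch assistant that combines instruction-guided editing with line-guided region redrawing. A controllable data generation pipeline and a unified DiT-based framework with a Task-guided Mixture-of-Experts module enable efficient, controllable sketch manipulation, with state-of-the-art semantic and structural consistency.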

详情
AI中文摘要

草图编辑需要同时处理高级语义变化和精确的局部重绘,这种组合对于稀疏且风格敏感的线条艺术尤其具有挑战性。与自然图像不同,草图依赖于最小的视觉线索,使得现有方法难以在保持整体一致性的同时协调全局语义修改与细粒度结构控制。我们提出了SketchAssist,一种交互式草图助手,它统一了指令引导编辑和线条引导区域重绘,在保持整体构图的同时实现高效且可控的草图操作。为了支持这一任务,我们引入了一个可控数据生成管道,该管道构建具有精确属性变化的结构化编辑序列,并在多步修改中保持结构对齐,同时通过保持风格的变换扩展风格多样性。基于这些数据,SketchAssist采用基于DiT的统一框架,使用多通道输入表示在单一接口内编码草图、掩码和引导信号。为了进一步处理不同的编辑模式,我们将任务引导的混合专家(T-MoE)集成到LoRA层中,实现对语义和结构引导的自适应控制。大量实验表明,在两个任务上都达到了最先进的性能,与最近的方法相比,实现了更强的指令遵循以及改进的结构和风格一致性。总之,我们的方法为草图编辑提供了一种实用且可控的解决方案。

英文摘要

Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minimal visual cues, making it difficult for existing methods to reconcile global semantic modifications with fine-grained structural control while preserving overall coherence. We present SketchAssist, an interactive sketch assistant that unifies instruction-guided editing with line-guided region redrawing, enabling efficient and controllable sketch manipulation while preserving overall composition. To support this task, we introduce a controllable data generation pipeline that constructs structured edit sequences with precise attribute variations and maintains structural alignment across multi-step modifications, while expanding stylistic diversity via style-preserving transformations. Building on this data, SketchAssist adopts a unified framework based on DiT, using a multi-channel input representation to encode sketches, masks, and guidance signals within a single interface. To further handle different editing modes, we integrate a Task-guided Mixture-of-Experts (T-MoE) into LoRA layers, enabling adaptive control over semantic and structural guidance. Extensive experiments demonstrate state-of-the-art performance on both tasks, achieving strong instruction adherence and improved structural and style consistency compared to recent methods. Overall, our method provides a practical and controllable solution for sketch editing.

2512.12413 2026-05-27 cs.AI cs.HC

Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale

生成式人工智能使用中的批判性思维:批判性思维在AI使用中的量表开发、验证与关联因素

Gabriel R. Lau, Wei Yan Low, Louis Tay, Ysabel Guevarra, Dragan Gašević, Andree Hartanto

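AI summary: Develops and validates the 13-item Critical Thinking in AI Use Scale, finding three factors (Verification, Motivation, and Reflection), positive associations with openness, extraversion, positive affect, and frequency of AI use, and prediction of more frequent verification strategies and greater veracity-judgement accuracy.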

详情
Journal ref
Computers in Human Behavior Reports, 22, 101103 (2026)
AI中文摘要

生成式AI工具日益嵌入日常工作和学习中,但其流畅性、不透明性和产生幻觉的倾向意味着用户必须批判性地评估AI输出,而不是全盘接受。本研究将AI使用中的批判性思维概念化为一种倾向性特质,包括验证AI生成信息的来源和内容、理解模型的工作原理及其失败之处,以及反思依赖AI的更广泛影响。通过六项研究(N=1365),我们开发并验证了13项批判性思维在AI使用中的量表,并绘制了其法则网络。研究1生成并内容验证了量表项目。研究2支持了三因子结构(验证、动机和反思)。研究3、4和5确认了这一高阶模型,展示了内部一致性、重测信度、强因子载荷、性别不变性以及收敛和判别效度。研究3和4进一步揭示,AI使用中的批判性思维与开放性、外向性、积极特质情感和AI使用频率正相关。最后,研究6展示了量表的效标效度,更高的批判性思维在AI使用中的得分预测了更频繁和多样化的验证策略、在新型自然主义ChatGPT驱动的事实核查任务中更高的真实性判断准确性,以及对负责任AI的更深入反思。总之,当前工作阐明了人们为何以及如何对生成式AI输出进行监督,并提供了一个经过验证的量表和生态学基础的任务范式,以支持关于批判性参与生成式AI输出的理论检验、跨群体和纵向研究。

英文摘要

Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model and demonstrated internal consistency, test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing, cross-group, and longitudinal research on critical engagement with generative AI outputs.

2512.11280 2026-05-27 cs.CL

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

AdaSD:面向高效语言模型推理的自适应推测解码

Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

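AI summary: Proposes AdaSD, a hyperparameter-free adaptive speculative decoding method that dynamically adjusts generation length and acceptance criteria, achieving up to 1.46x speedup while keeping accuracy degradation under 1.8%.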

详情
AI中文摘要

大型语言模型(LLM)在广泛任务中取得了显著性能,但其不断增长的参数规模显著拖慢了推理速度。推测解码通过利用较小的草稿模型预测候选令牌,再由较大的目标模型验证,从而缓解这一问题。然而,现有方法通常需要额外训练、大量超参数调整或在部署前对模型和任务进行预先分析。在本文中,我们提出自适应推测解码(AdaSD),一种无超参数的解码方案,在推理过程中动态调整生成长度和接受标准。AdaSD引入两个自适应组件:一个决定何时停止候选令牌生成,另一个决定令牌接受,两者均基于令牌熵和Jensen-Shannon距离实时更新。该方法无需预先分析或微调,且兼容现有模型。在基准数据集上的实验表明,AdaSD相比普通推测解码实现最高1.46倍加速,同时将准确率下降限制在1.8%以内,使其成为高效且自适应的LLM推理的实用解决方案。

英文摘要

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive components: one to determine when to stop candidate token generation and the other to decide token acceptance, updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 1.46x speedup over vanilla speculative decoding while limiting accuracy degradation to under 1.8%, making it a practical solution for efficient and adaptive LLM inference.
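The two signals the abstract names, token entropy and Jensen-Shannon distance, have standard definitions that are easy to compute from a draft and a target next-token distribution. The sketch below only computes those quantities, in bits. How AdaSD turns them into stopping and acceptance decisions is not described in the abstract and is not reproduced here; the function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)

def kl_bits(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0.0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2, so in [0, 1])."""
    mixture = [0.5 * (x + y) for x, y in zip(p, q)]
    divergence = 0.5 * kl_bits(p, mixture) + 0.5 * kl_bits(q, mixture)
    return math.sqrt(max(divergence, 0.0))

draft_probs = [0.7, 0.2, 0.1, 0.0]    # made-up draft-model distribution
target_probs = [0.6, 0.3, 0.1, 0.0]   # made-up target-model distribution
draft_uncertainty = entropy_bits(draft_probs)
disagreement = js_distance(draft_probs, target_probs)
```

With base-2 logarithms the distance is 0 for identical distributions and 1 for distributions with disjoint support, which makes it a convenient bounded measure of draft-target disagreement.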

2512.08371 2026-05-27 cs.LG stat.ML

A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

基于多元伯努利的采样方法用于多标签数据及其在元研究中的应用

Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown

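AI summary: For multi-label data whose labels differ widely in frequency and depend on one another, proposes a weighted sampling algorithm based on the multivariate Bernoulli distribution that estimates weights for label combinations to reach target distribution characteristics, and shows on Web of Science research articles that it improves the representation of minority categories.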

详情
AI中文摘要

数据集可能包含具有多个标签的观测值。如果标签不是互斥的,并且标签的频率差异很大,那么获取一个样本,该样本包含足够多的稀有标签观测值以对这些标签进行推断,并且以已知方式偏离总体频率,这带来了挑战。在本文中,我们将多元伯努利分布视为多标签问题的底层分布。我们提出了一种新颖的采样算法,该算法考虑了标签依赖性。它使用观测到的标签频率来估计多元伯努利分布参数,并为每个标签组合计算权重。这种方法确保加权采样在考虑标签依赖性的同时获得目标分布特征。我们将该方法应用于各种数据集,包括来自Web of Science的研究文章样本,这些文章标有64个生物医学主题类别。我们的目标是保持类别频率顺序,减少最常见和最不常见类别之间的频率差异,并考虑类别依赖性。该方法产生了更平衡的子样本,增强了少数类别的代表性。

英文摘要

Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculates weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a variety of datasets, including a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.

2511.20586 2026-05-27 cs.AI cs.LG

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

PaTAS:基于主观逻辑的神经网络信任传播框架

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Houda Labiod, Frank Kargl

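AI summary: Proposes PaTAS, a framework that uses Subjective Logic to propagate trust in parallel with neural network computation, quantifying input, parameter, and activation trust through Trust Nodes and Trust Functions, with a Parameter Trust Update mechanism and an Inference-Path Trust Assessment method that yield interpretable trust estimates under adversarial or degraded conditions.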

详情
AI中文摘要

可信度已成为安全关键应用中人工智能系统部署的关键要求。传统的评估指标(如准确率和精确率)无法充分捕捉不确定性或模型预测的可靠性,尤其是在对抗或退化条件下。本文介绍了并行信任评估系统(PaTAS),这是一个使用主观逻辑(SL)对神经网络中的信任进行建模和传播的框架。PaTAS通过信任节点和信任函数与标准神经计算并行运行,这些节点和函数在网络中传播输入、参数和激活信任。该框架定义了一种参数信任更新机制,以在训练过程中优化参数可靠性,以及一种推理路径信任评估(IPTA)方法,以在推理时计算实例特定的信任。在真实世界和对抗性数据集上的实验表明,PaTAS产生可解释、对称且收敛的信任估计,这些估计补充了准确率,并揭示了在中毒、有偏或不确定数据场景中的可靠性差距。结果表明,PaTAS有效区分良性输入和对抗性输入,并识别模型置信度与实际可靠性不一致的情况。通过在神经架构中实现透明且可量化的信任推理,PaTAS为评估AI生命周期中的模型可靠性提供了基础。

英文摘要

Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a foundation for evaluating model reliability across the AI lifecycle.
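For readers new to Subjective Logic, its basic object is the binomial opinion: belief, disbelief, and uncertainty masses that sum to 1, plus a base rate. The sketch below shows the standard evidence-to-opinion mapping (with a non-informative prior weight of 2) and the projected probability, belief + base_rate * uncertainty. It covers only these generic building blocks. PaTAS's Trust Nodes, Trust Functions, and update rules are not specified in the abstract and are not modeled, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    """Binomial Subjective Logic opinion; belief + disbelief + uncertainty == 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self):
        """Expected probability: uncertainty mass is split according to the base rate."""
        return self.belief + self.base_rate * self.uncertainty

def opinion_from_evidence(positive, negative, prior_weight=2.0, base_rate=0.5):
    """Map counts of supporting and contradicting observations to an opinion."""
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total, prior_weight / total, base_rate)

well_supported = opinion_from_evidence(positive=8, negative=0)
no_evidence = opinion_from_evidence(positive=0, negative=0)
```

With no evidence the opinion is pure uncertainty and the projected probability falls back to the base rate, which is the property that lets trust be tracked separately from a model's point prediction.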