arXivDaily: daily arXiv academic digest, updated Monday through Friday
2503.13934 2026-05-19 cs.RO cs.AI

COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning

Kohei Matsumoto, Yuki Tomita, Yuki Hyodo, Ryo Kurazume

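AI summary: This paper proposes a diffusion-based reinforcement learning method for social navigation. Its flexible action distribution improves the adaptability and controllability of navigation and lets the policy adapt to previously unseen scenarios.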

Comments ICRA 2026

Abstract

Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these methods, those that assume continuous action spaces typically rely on Gaussian distributions, which limit the flexibility of the generated actions. In contrast, the application of diffusion models to reinforcement learning has advanced, enabling more flexible action distributions than Gaussian policy-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, by exploiting the characteristics of diffusion models, we propose extensions that enable adaptation to previously unseen scenarios without additional training. As concrete scenario examples, we demonstrate adaptability to scenarios in which static obstacles exist in the environment that were not present during training, as well as scenarios in which the objective differs from training, such as accompanying target pedestrians while avoiding others to reach the destination.
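The abstract's point about Gaussian policies limiting action flexibility can be illustrated with a toy example. The sketch below is not the authors' method and contains no diffusion model: it assumes hypothetical demonstration data in which a robot passes a pedestrian either on the left or on the right, fits a single Gaussian to those actions, and compares it with a sample-based stand-in for a multimodal policy. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: lateral velocity commands (m/s) for passing a pedestrian
# who stands directly ahead. Half of the demonstrations swerve left, half right.
left = rng.normal(loc=-0.5, scale=0.05, size=500)
right = rng.normal(loc=0.5, scale=0.05, size=500)
actions = np.concatenate([left, right])

# A single Gaussian policy head is summarized by one mean and one std.
mu, sigma = actions.mean(), actions.std()

# The most likely action under that Gaussian is its mean, which sits near 0:
# a command that heads straight at the pedestrian.
gaussian_mode = mu

# Stand-in for a multimodal policy (assumption: resampling the data, not a
# trained diffusion model). It keeps both modes, so samples avoid the center.
multimodal_samples = rng.choice(actions, size=1000)
frac_safe = float(np.mean(np.abs(multimodal_samples) > 0.25))

print(f"Gaussian mean={mu:.3f}, std={sigma:.3f}, multimodal safe fraction={frac_safe:.3f}")
```

The Gaussian fit averages the two valid behaviors into one that neither demonstration contains, which is the kind of limitation a more expressive action distribution is meant to avoid.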

2501.01046 2026-05-19 cs.CL

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

Youngjun Son, Chaewon Kim, Jaejin Lee

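AI summary: This paper proposes SEDD, a GPU-accelerated deduplication framework that combines a computationally efficient, partially reusable hash function, highly optimized GPU kernels, and hardware-aware automatic parameter selection. It significantly reduces communication bottlenecks and speeds up deduplication while maintaining high deduplication fidelity.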

Comments 13 pages, 7 figures

Abstract

Dataset deduplication is widely recognized as a crucial preprocessing step that enhances data quality and improves the performance of large language models. A commonly used method for this process is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been introduced to handle large-scale corpora; however, they remain suboptimal due to high communication overhead from physical data shuffling and underutilization of GPU resources. In this paper, we propose SEDD, a high-performance GPU-accelerated deduplication framework optimized for distributed cluster environments. SEDD introduces a computationally efficient, partially reusable hash function, alongside highly optimized GPU kernels and a hardware-aware automatic parameter selection mechanism. By replacing traditional data shuffling with a streaming-based approach, SEDD significantly mitigates communication bottlenecks. Our framework outperforms the CPU-based deduplication tool in SlimPajama by up to 158× and the GPU-based tool in NVIDIA NeMo Curator by up to 7.8× when processing 30 million documents on a node with four GPUs. Notably, SEDD dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speedups of up to 375× over the CPU baseline. Despite these gains in efficiency, SEDD maintains high deduplication fidelity, with duplicate document sets achieving Jaccard similarities of over 0.95 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 3 hours on an 8-node 32-GPU V100 cluster. The related code is publicly available on GitHub (https://github.com/mcrl/SEDD).
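For readers unfamiliar with the MinHash step mentioned above, the following is a minimal CPU sketch of standard MinHash signatures and the Jaccard estimate they give. It is not SEDD's hash function, GPU kernel, or streaming design; the shingle size, the number of hash functions, and the use of salted BLAKE2b as the hash family are assumptions chosen for illustration.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}

def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

# Hypothetical documents: a and b are near duplicates, c is unrelated.
doc_a = "Dataset deduplication is a crucial preprocessing step for language models."
doc_b = "Dataset deduplication is a crucial pre-processing step for language models!"
doc_c = "Mobile robots must navigate safely among pedestrians in crowded spaces."

sets = {name: shingles(text) for name, text in [("a", doc_a), ("b", doc_b), ("c", doc_c)]}
sigs = {name: minhash_signature(s) for name, s in sets.items()}
near_dup_estimate = estimate_jaccard(sigs["a"], sigs["b"])
unrelated_estimate = estimate_jaccard(sigs["a"], sigs["c"])
print(near_dup_estimate, exact_jaccard(sets["a"], sets["b"]), unrelated_estimate)
```

Signature generation loops over every shingle for every hash function, which is why it dominates the cost on a CPU and is a natural target for acceleration.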

2410.13181 2026-05-19 cs.CL

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin

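AI summary: This paper proposes a new LLM usage paradigm in which an adaptive mechanism switches between cloud-based and locally deployed LLMs to improve task performance and efficiency: a local agent handles simpler reasoning steps and a cloud agent handles more complex ones. Experiments on multiple benchmarks show that the method effectively improves the local agent's performance.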

Comments EMNLP 2024 Main Conference

Abstract

Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller locally deployed LLMs. Our framework comprises two primary modules: the local agent, instantiated with a relatively smaller LLM, which handles less complex reasoning steps, and the cloud agent, equipped with a larger LLM, which manages more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism in which the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally deployed and cloud-based LLMs and significantly enhancing task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks spanning mathematical reasoning and complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while incurring much less computational overhead.
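The local-to-cloud handoff described above can be sketched as a simple control loop. This is an illustrative assumption about the general shape of such a system, not AdaSwitch's actual mechanism: the abstract does not say how the local agent detects its own errors, so the sketch uses a hypothetical self-reported confidence score and threshold, and the agents are stand-in callables rather than real LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper):
# local_step returns (step_text, self_assessed_confidence in [0, 1]);
# cloud_step returns a replacement step_text.
LocalStep = Callable[[str, List[Tuple[str, str]]], Tuple[str, float]]
CloudStep = Callable[[str, List[Tuple[str, str]]], str]

def collaborative_solve(question: str,
                        local_step: LocalStep,
                        cloud_step: CloudStep,
                        is_done: Callable[[str], bool],
                        threshold: float = 0.6,
                        max_steps: int = 8) -> List[Tuple[str, str]]:
    """Run reasoning step by step, escalating a step to the cloud agent
    whenever the local agent's self-assessment falls below the threshold."""
    trace: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        text, confidence = local_step(question, trace)
        source = "local"
        if confidence < threshold:  # local agent flags its own draft as unreliable
            text = cloud_step(question, trace)
            source = "cloud"
        trace.append((source, text))
        if is_done(text):
            break
    return trace

# Toy stand-ins for the two agents, for demonstration only.
def toy_local(question, trace):
    drafts = [("parse the question", 0.9),
              ("guess the hard lemma", 0.2),
              ("answer: 42", 0.8)]
    return drafts[len(trace)]

def toy_cloud(question, trace):
    return "prove the hard lemma carefully"

def toy_done(text):
    return text.startswith("answer:")

trace = collaborative_solve("toy question", toy_local, toy_cloud, toy_done)
print(trace)
```

Only the low-confidence middle step is escalated, so the larger model is invoked once rather than for every step, which is the source of the cost saving the abstract describes.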

2401.09512 2026-05-19 cs.SD eess.AS

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

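AI summary: This paper presents version 10 of the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), which covers 175 text-to-speech (TTS) models and 1002.9 hours of synthetic speech in 54 languages for training and evaluating audio deepfake detection models. It reports that MLAAD outperforms comparable training datasets and complements ASVspoof 2019.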

Comments IJCNN 2024

Abstract

This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.

2307.08643 2026-05-19 cs.LG stat.ML

Corruptions of Supervised Learning Problems: Typology and Mitigations

Laura Iacovissi, Nan Lu, Robert C. Williamson

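AI summary: This paper develops a general theory of corruption that analyzes changes to the underlying probability distributions via Markov kernels, unifies existing corruption models, and examines mitigations for the various corruption types.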

Comments 73 pages. To be published in Journal of Machine Learning Research 27 (2026) 1-73

Abstract

Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelization and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of corruption's consequences on learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the necessity to generalize this classical corruption-corrected learning framework to a new paradigm with weaker requirements to encompass more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.
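As background for the loss-correction methods the abstract builds on, the sketch below shows the classical backward correction for class-conditional label noise, where the corruption is a Markov kernel represented by a row-stochastic matrix. It covers only this simple, feature-independent case and is not the paper's generalized formulas for dependent, attribute, or joint corruption. The matrix `T` and the predicted probabilities are made-up values.

```python
import numpy as np

# Assumed label-noise kernel: T[i, j] = P(noisy label = j | clean label = i).
# Each row sums to 1, so T is a Markov kernel on the label space.
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# Hypothetical model output for one input: predicted class probabilities.
p = np.array([0.9, 0.1])

# Cross-entropy loss the model would incur for each possible label.
loss = -np.log(p)

# Backward correction: choose a corrected loss vector such that averaging it
# over the noisy label recovers the clean loss, i.e. T @ corrected == loss.
corrected = np.linalg.solve(T, loss)

# Check unbiasedness: for each clean label i, the expectation of the corrected
# loss over the noisy label drawn from row T[i] equals the clean loss[i].
expected_under_noise = T @ corrected
print(loss, corrected, expected_under_noise)
```

The correction needs `T` to be invertible and known (or estimated), which is one of the requirements that becomes harder to meet for more general corruption types.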

2605.18150 2026-05-19 cs.AI

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, Yi Zhang

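AI summary: This paper studies how erased concepts can be recovered from pretrained diffusion models under black-box constraints. It proposes a training-free multi-agent framework that achieves controllable awakening by starting denoising from surrogate-guided noisy states, exposing limitations of current concept erasure methods.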

Abstract

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

2605.18147 2026-05-19 cs.LG

Foundation Models for Credit Risk Prediction: A Game Changer?

Bart Baesens, Andreas Goethals, Stefan Lessmann, Simon De Vos, Cristián Bravo, David Martens, Victor Medina-Olivares, Christophe Mues, Maria Oskarsdóttir, Seppe vanden Broucke, Tim Verdonck, Wouter Verbeke

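AI summary: This paper studies foundation models for credit risk prediction, examining whether they improve predictive performance in small-data settings. Benchmarks against a broad set of methods indicate that tabular foundation models generally perform best on PD and LGD modeling tasks.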

Abstract

Predictive models play a pivotal role in credit risk management, guiding critical decisions through accurate estimation of default probabilities and losses. Extensive research has introduced new modeling techniques, complemented by large-scale benchmarking studies consolidating the state-of-the-art. Today, quasi-standards such as gradient-boosting models paired with SHAP explainers have emerged, yet continuous improvement of risk models remains a top priority. Concurrently, rapid advancements in AI, most notably large language models, have disrupted predictive modeling paradigms. Foundation models, pretrained on extensive datasets from diverse domains, have demonstrated remarkable performance by leveraging prior knowledge. While prevalent in natural language processing and computer vision, foundation models for tabular data have only recently emerged. We conjecture that pretraining on out-of-domain data is particularly beneficial in small-data settings, such as SME lending or specialized corporate portfolios, and may help address longstanding challenges including low default portfolios and class imbalance. This paper benchmarks recently proposed tabular foundation models against a broad set of competitors, including established and advanced machine learning techniques, across two core tasks: PD and LGD modeling. Our evaluation encompasses various datasets, performance indicators, and experimental conditions. We find that tabular foundation models generally perform best across datasets and tasks. Moreover, they offer significant improvement in predictive performance as dataset size shrinks. These results are remarkable given that the models are tested out-of-the-box, without hyperparameter tuning, ensuring ease of use and mitigating computational costs.

2605.18144 2026-05-19 cs.AI

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

Christiaan G. A. Viviers, Koen de Bruin, Mirre M. Trines, Ayla M. Hokke, Roy van der Meel, Avi Schroeder, Twan Lammers, Willem J. M. Mulder, Fons van der Sommen

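AI summary: This study presents pArticleMap, a system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow to support research direction selection and hypothesis generation in nanomedicine. It generates and scores citation-grounded hypotheses to provide evidence-grounded research assistance.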

Abstract

Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused primarily on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow for grounded ideation. Rather than forecasting future concept co-occurrence, pArticleMap targets low-density article-level bridge regions and cluster interfaces, then generates and scores citation-grounded hypotheses with large language models in an agentic setup. We evaluate the system with a retrospective realization benchmark (generate later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, a pooled gold recovery rate of 10.8% was obtained, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.

2605.18143 2026-05-19 cs.AI

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

Lihi Idan, Bharat Anand

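AI summary: This study examines heterogeneity in the productivity effects of generative AI across users. It finds that AI Interaction Competence (AIC) is the key predictor of gains from AI use, that a conceptual-map scaffolding intervention reduces outcome variance, and that firms should pair GenAI access with AIC micro-training and standard operating procedures to capture value consistently.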

Abstract

Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants (analogs of early-career knowledge workers) were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by AI Interaction Competence (AIC): the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.

2605.18132 2026-05-19 cs.CV cs.AI

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

Sihan Ma, Siyuan Liang, Dacheng Tao

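AI summary: This study addresses identifying which generative model created a given 3D asset. It builds the first passive source attribution benchmark, finds that generative 3D models leave stable fingerprints, and proposes a hierarchical multi-view multi-modal Transformer, establishing a new benchmark for trustworthy 3D content provenance.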

Abstract

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

2605.18130 2026-05-19 cs.CV

Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis

Fengyi Zhang, Xujie Zeng, Mohan Liu, Zengyi Wang, Yalong Jiang

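AI summary: This paper proposes Rad-VLSM, a framework that uses a semantics-guided prompting mechanism to improve the accuracy of medical image segmentation and diagnosis, addressing the tendency of existing models to be distracted by background tissue and irrelevant visual correlations.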

Abstract

Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

2605.18128 2026-05-19 cs.AI

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

Suofei Zhang, Yaxuan Zheng, Haifeng Hu

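AI summary: This paper proposes a framework that unifies spatio-temporal modeling through joint prior-observation adversarial learning to address spatial over-generalization in multivariate time series anomaly detection. It reports new state-of-the-art results on public datasets and a purpose-built benchmark for both time-wise detection and spatial localization.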

Abstract

Existing Multivariate Time Series Anomaly Detection (MTSAD) frameworks increasingly rely on integrating Graph Neural Networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention is paid to the spatial over-generalization problem, where unconstrained structural modeling indiscriminately reconstructs anomalies, inevitably degrading detection recall. To tackle this problem, we propose a novel framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as structural prior and models the association discrepancy between prior and data-driven observation in a minimax manner during training. Such adversarial optimization not only improves the model sensitivity for time-wise detection, but also enables the model to localize anomalies to specific channels. To systematically evaluate this anomaly localization capability, we further construct a synthetic benchmark equipped with precise channel-wise annotations. Extensive experiments across public datasets and our dedicated benchmark demonstrate that the proposed framework establishes a new state-of-the-art in both time-wise detection and spatial localization tasks. Our code, pre-trained models, and benchmark are publicly available at https://github.com/anocodetest1/POST.

2605.18115 2026-05-19 cs.CV

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Yiwei Guo, Shaobin Zhuang, Zhipeng Huang, Canmiao Fu, Chen Li, Jing Lyu, Yali Wang

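AI summary: This paper proposes WinTok, a hybrid tokenizer that decouples visual understanding from generation. Transferable semantic tokens reduce cross-task interference, improving reconstruction, understanding, and generation across multiple benchmarks.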

Abstract

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

2605.18111 2026-05-19 cs.CL cs.CV

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Rafid Ahmed, Intesar Tahmid, Mir Sazzat Hossain, Tasnimul Hossain Tomal, Md Fahim, Md Farhad Alam Bhuiyan

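AI summary: This paper introduces the BanglaMedVQA dataset to evaluate current foundation models on Bangla medical visual question answering. Their performance is substantially lower than on English benchmarks, highlighting the challenges that low-resource languages pose for medical reasoning.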

Comments 14 pages, 7 figures, 5 tables, Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:1-14, 2026

Abstract

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.

2605.18109 2026-05-19 cs.AI cs.CV cs.RO

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

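AI summary This paper proposes TaskGround, a framework that improves full-scene household reasoning through structured task inference. It also introduces the FullHome evaluation suite, confirms the importance of inferring executable task structure in household scenes, and shows that compact local models can be effective in practical household deployment.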

Comments Project page: https://aaronfengzy.github.io/TaskGround/

English abstract

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

2605.18105 2026-05-19 cs.CL

How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World

Brielen Madureira, Andreas Niekler, Marc Keuschnigg, Mariana Madruga de Brito

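AI summary By analysing almost 60k news articles about 5.5k events over 25 years, this paper examines how German newspapers report on landslides around the world, reveals overreporting of Southern and Western Europe, and provides a reference point for studying inequalities in media attention to international disasters.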

Comments Work in progress

English abstract

Landslides often hit newsstands due to their destructive and potentially fatal effects. News is a valuable source of information for creating or enriching disaster databases and for expediting studies of the dynamics of media attention. To that end, news datasets must be filtered, geolocated and validated. This paper focuses on how landslides around the world are reported in German newspapers. We analyse almost 60k news articles about 5.5k news events in a 25-year period, compare them with external measures of countries' susceptibility to landslides, and provide insights, e.g., the overreporting of Southern and Western Europe, to foster further studies on inequalities in media attention to international disasters.

2605.18104 2026-05-19 cs.AI cs.CR

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao, Yanyan Zhao, Yutai Hou, Qianchao Wang, Dandan Tu, Bing Qin

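AI summary This paper studies why multimodal large language models fail to transfer safety across modalities, identifies the phenomenon of Safety Geometry Collapse, and improves model safety with an adaptive drift-correction method.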

English abstract

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.

2605.18101 2026-05-19 cs.CV cs.AI

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

Kailai Sun, Mingyi He, Heye Huang, Can Rong, Alok Prakash, Baoshen Guo, Shenhao Wang, Jinhua Zhao

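AI summary This paper proposes SENSE, a unified generative framework for urban building energy modeling that combines a generative diffusion model with knowledge from large vision models to produce high-resolution urban satellite imagery together with aligned, high-quality building energy consumption and height maps, improving downstream prediction for sustainable urban development.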

Comments Accepted by KDD 2026 (Oral)

English abstract

Urban Building Energy Modeling (UBEM) plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, three challenges remain: first, most existing studies are inherently predictive, failing to reflect the generative nature of urban planning; second, although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack urban functional generation (e.g., an energy layer); third, high-quality, high-resolution building energy data aligned with satellite imagery is scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. By conditioning on road networks and urban density metrics, SENSE, based on a controllable diffusion model, leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. They also show that SENSE can generate enough annotated synthetic data using less than 20% of the labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduces prediction error (by 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficient urban planning and physical generation solution for urban science, energy science and building science. The dataset and code are available at https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.
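
The two error metrics quoted in this abstract, NMBE and CVRMSE, can be illustrated with a minimal sketch. It assumes the commonly used ASHRAE Guideline 14 style definitions with a degrees-of-freedom adjustment `p` (set to 0 here); the function names and the toy readings are hypothetical and are not taken from the paper.

```python
import math

def nmbe(measured, predicted, p=0):
    """Normalized mean bias error in percent (assumed Guideline 14 style form)."""
    n = len(measured)
    mean = sum(measured) / n
    bias = sum(m - q for m, q in zip(measured, predicted))
    return 100.0 * bias / ((n - p) * mean)

def cvrmse(measured, predicted, p=0):
    """Coefficient of variation of the RMSE in percent."""
    n = len(measured)
    mean = sum(measured) / n
    squared = sum((m - q) ** 2 for m, q in zip(measured, predicted))
    return 100.0 * math.sqrt(squared / (n - p)) / mean

# Hypothetical energy readings: over- and under-prediction cancel in NMBE
# but still show up in CVRMSE.
measured = [100.0, 120.0, 80.0, 100.0]
predicted = [90.0, 130.0, 80.0, 100.0]
bias_pct = nmbe(measured, predicted)
spread_pct = cvrmse(measured, predicted)
```

Reporting both metrics is useful because a model can have near-zero bias while still making large individual errors.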

2605.18094 2026-05-19 cs.AI

Learning to Solve Compositional Geometry Routing Problems

Mingfeng Fan, Jianan Zhou, Jiaqi Cheng, Yifeng Zhang, Jie Zhang, Guillaume Adrien Sartoretti

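AI summary This paper studies the Compositional Geometry Routing Problem (CGRP), a unified superclass covering point, line, area, and arbitrary hybrid task geometries that provides a broad abstraction for real-world routing scenarios. To handle the asymmetry and complexity introduced by non-point tasks, the authors propose the DiCon framework, which improves representation learning and decision-making through contrastive learning and a differential attention mechanism.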

Comments 27 pages, 10 figures

English abstract

We study the Compositional Geometry Routing Problem (CGRP), a unified superclass of traditional routing problems that covers point-only, line-only, area-only, and arbitrary hybrid task geometries, providing a broad abstraction for real-world routing scenarios. Beyond standard point-based routing, CGRP with non-point tasks can be inherently asymmetric, tightly couples travel routes with the intrinsic path, and enlarges the action space with numerous feasible yet often irrelevant options, thereby posing significant challenges for both representation learning and decision-making. To address these challenges, we propose DiCon, a differential attention-assisted solver with contrastive learning, as a plug-and-play framework that tackles the problem from two complementary angles. First, we introduce a differential attention mechanism that actively suppresses the probability mass on less competitive candidate actions. Second, we design a double-level contrastive learning objective to promote robust global instance representations and regularize geometry-aware task representations. Extensive experiments demonstrate that DiCon achieves strong performance, broad versatility, and superior generalization across diverse CGRP instances with different compositions.

2605.18083 2026-05-19 cs.CL

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$Δ$ Integration into Upcycled MoE

Hao Zhou, Tianhao Li, Zhijun Wang, Shuaijie She, Linjuan Wu, Hao-Ran Wei, Baosong Yang, Jiajun Chen, Shujian Huang

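AI summary This paper proposes an efficient approach that converts a dense model into an MoE architecture and allocates different experts to different languages, improving multilingual LLM performance without a complex alignment phase while preserving original capabilities.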

English abstract

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce a method that upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting an MoE-expanded parameter delta ($Δ_{\text{post}}$) onto the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate the method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show that our approach is highly applicable across different models and post-training deltas.

2605.18082 2026-05-19 cs.LG

pyforce-1.0.0: Python Framework for data-driven model Order Reduction of multi-physiCs problEms

Stefano Riva, Yantao Luo, Carolina Introini, Antonio Cammi

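AI summary This paper presents pyforce-1.0.0, a framework implementing data-driven reduced order modelling techniques for multi-physics problems, mainly in nuclear engineering, with algorithms for finding optimal sensor positions and integrating real measurements to improve knowledge of physical systems.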

Comments Github Repo: https://github.com/ERMETE-Lab/ROSE-pyforce

English abstract

pyforce is a Python package implementing Data-Driven Reduced Order Modelling techniques for multi-physics problems, mainly in the field of Nuclear Engineering. The package is part of ROSE (Reduced Order modelling with data-driven techniques for multi-phySics problEms), a set of mathematical algorithms aimed at reducing the complexity of multi-physics models (for nuclear reactor applications), at searching for optimal sensor positions, and at integrating real measurements to improve knowledge of the physical systems. Compared with the original implementation based on the dolfinx package (v0.6.0), version 1.0.0 of pyforce has been completely rewritten using pyvista as the backend for mesh importing, computing integrals, and visualisation of results; in addition, functions are stored as numpy arrays, improving the ease of use of the package. This choice allows pyforce to be used with any software solver able to export results in VTK format.

2605.18079 2026-05-19 cs.LG cs.CC cs.CL

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

Moritz Brösamle, Stephan Eckstein

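AI summary This paper studies the expressive power of low-precision softmax transformers with Chain-of-Thought. It constructs hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines, converts these constructions into equivalent softmax transformers, and analyses the efficiency of the recently proposed summarized Chain-of-Thought paradigm for simulating Turing machines.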

Comments Accepted to ICML 2026

English abstract

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.
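
The role of well-separated attention scores can be illustrated with a small sketch: when one score exceeds the others by a wide margin, softmax weights rounded to a coarse grid coincide with hardmax weights, whereas closely spaced scores do not. The rounding scheme (a fixed number of fractional bits) and the score values are assumptions made for illustration; the abstract does not specify the paper's actual construction.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(scores):
    """Uniform weight on the maximal entries, zero elsewhere."""
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    count = sum(winners)
    return [w / count for w in winners]

def round_to_bits(x, bits):
    """Round x to the nearest multiple of 2**-bits (assumed rounding model)."""
    scale = 2 ** bits
    return round(x * scale) / scale

scores_separated = [10.0, 0.0, 0.0, 0.0]  # large gap between best and rest
scores_close = [1.0, 0.5, 0.0, 0.0]       # small gaps

rounded_separated = [round_to_bits(w, 4) for w in softmax(scores_separated)]
rounded_close = [round_to_bits(w, 4) for w in softmax(scores_close)]
```

With the separated scores, the rounded softmax weights match the hardmax weights exactly; with the close scores they remain spread over several positions.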

2605.18078 2026-05-19 cs.LG

Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

Yevhen Shcherbinin, Arina Redina, Maxim Kalpin, Vlad Kochetov

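AI summary This paper studies equilibrium selection for multi-agent policy-gradient methods that converge locally near stable Nash equilibria. It proposes an opponent-aware mechanism that raises the probability of entering the basin of a target equilibrium set, and experiments show that the mechanism increases entry into cooperative basins.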

English abstract

Multi-agent policy-gradient methods have been shown to converge locally near stable Nash equilibria. Local convergence, however, does not determine which equilibrium is reached. We study this question through basin-entry probability with respect to a target set of equilibria selected by an external criterion, such as payoff dominance. For finite-unroll Meta-MAPG, we show that the update decomposes into ordinary policy gradient plus own-learning and peer-learning corrections, with controlled sampling noise and finite-unroll bias. We identify the peer-learning correction as the main equilibrium-selection mechanism: under a local alignment condition, the probability of entering the certified attraction region of the target stable-Nash set increases, relative to ordinary policy gradient. Because persistent correction may shift zero-update points of the original game, annealing the correction after entering the basin recovers ordinary policy-gradient dynamics and inherits local stable-Nash convergence guarantees. Experiments in Stag Hunt, iterated Prisoner's Dilemma, and preliminary neural-policy coordination environments support this basin-entry view, showing increased entry into cooperative basins under peer-aware updates.
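
The equilibrium-selection question can be made concrete with Stag Hunt, one of the environments named in the abstract. The sketch below enumerates the pure-strategy Nash equilibria of a two-player Stag Hunt and picks the payoff-dominant one. The payoff values are a common textbook choice assumed here for illustration, not the ones used in the paper, and the code does not implement Meta-MAPG.

```python
STAG, HARE = 0, 1

# Assumed payoffs: (row player, column player) for each joint action.
PAYOFF = {
    (STAG, STAG): (4, 4),
    (STAG, HARE): (0, 3),
    (HARE, STAG): (3, 0),
    (HARE, HARE): (3, 3),
}

def is_pure_nash(a, b):
    """True if neither player gains by unilaterally deviating from (a, b)."""
    row_ok = all(PAYOFF[(a, b)][0] >= PAYOFF[(alt, b)][0] for alt in (STAG, HARE))
    col_ok = all(PAYOFF[(a, b)][1] >= PAYOFF[(a, alt)][1] for alt in (STAG, HARE))
    return row_ok and col_ok

equilibria = [(a, b) for a in (STAG, HARE) for b in (STAG, HARE) if is_pure_nash(a, b)]

# Payoff dominance as the external selection criterion: highest joint payoff.
payoff_dominant = max(equilibria, key=lambda joint: sum(PAYOFF[joint]))
```

Both (Stag, Stag) and (Hare, Hare) are equilibria, so local convergence alone cannot say which one a learner reaches; the target set here is the payoff-dominant (Stag, Stag) outcome.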

2605.18074 2026-05-19 cs.RO

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

Kane Qian, Xin Zhao, Yining Shi, Rujun Yan, Zhengqing Pan, Kaojin Zhu, Mengmeng Yang, Kai Sun, Diange Yang, Kun Jiang

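AI summary This paper presents 4DLidarOpen, an autonomous driving dataset built around 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. It includes point-wise radial velocity measurements, multiple Lidar types, surround-view cameras, and 6-DOF vehicle poses, uses a hybrid annotation strategy that combines large-scale auto-labeling with human refinement, and provides benchmarks for 3D object detection, bird's-eye view segmentation and flow prediction, and motion forecasting.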

Comments 15 pages, 9 figures

English abstract

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types (rotating, solid-state, and blind-spot variants), surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, bird's-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometry-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.

2605.18072 2026-05-19 cs.SD

MusicDET: Zero-Shot AI-Generated Music Detection

Chaolei Han, Hongsong Wang, Jie Gui

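AI summary This paper proposes MusicDET, a framework that uses frequency-guided normalizing flows to perform zero-shot detection of AI-generated music without access to any generated samples, effectively identifying out-of-distribution music signals.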

Comments Accepted by ICML 2026

English abstract

Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.
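
The likelihood-based decision rule described above can be sketched in a few lines. As a deliberate simplification, the sketch swaps the frequency-guided normalizing flow for a diagonal Gaussian fitted to "real" features only; the feature vectors, the threshold rule, and all names are hypothetical and only illustrate the idea of flagging low-likelihood inputs as out-of-distribution.

```python
import math

def fit_diag_gaussian(samples):
    """Fit per-dimension mean and variance to real-only training features."""
    n, dim = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dim)]
    variances = [
        sum((s[d] - means[d]) ** 2 for s in samples) / n + 1e-6  # small floor
        for d in range(dim)
    ]
    return means, variances

def log_likelihood(x, means, variances):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )

# Hypothetical 2-D features extracted from real music only.
real_features = [[0.9, 2.1], [1.1, 1.9], [1.0, 2.0], [0.8, 2.2], [1.2, 1.8]]
means, variances = fit_diag_gaussian(real_features)

# Assumed threshold: the lowest log-likelihood seen on the real training data.
threshold = min(log_likelihood(x, means, variances) for x in real_features)

def is_out_of_distribution(x):
    """Flag inputs that are less likely than any real training sample."""
    return log_likelihood(x, means, variances) < threshold
```

Because no generated samples are used to fit the density or set the threshold, the same rule applies unchanged to music from generators never seen before.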

2605.18071 2026-05-19 cs.CL

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

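AI summary This paper presents KVDrive, a multi-tier KV cache management system for long-context LLM inference that jointly orchestrates cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets while preserving accuracy.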

English abstract

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective, jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O-bound and CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74x higher throughput than state-of-the-art systems while preserving accuracy.

2605.18068 2026-05-19 cs.LG cs.AI

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

Seyed Mohamad Moghadas, Esther Rodrigo Bonet, Bruno Cornelis, Adrian Munteanu

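AI summary This paper proposes the Teger module, which improves error-correlated autoregressive forecasting through a spatial curvature-aware graph rewiring mechanism and yields a better Continuous Ranked Probability Score in spatio-temporal forecasting.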

English abstract

Residual error propagation remains a fundamental problem in recurrent models, where small prediction inaccuracies compound over time and degrade long-horizon performance. Accurately modeling the correlation structure of such residuals is critical for reliable uncertainty quantification in probabilistic multivariate time-series forecasting. While recent time-series deep models efficiently parametrize time-varying contemporaneous correlations, they often assume temporal independence of errors and neglect spatial correlation across the observed network. In this paper, we introduce Teger, a structured uncertainty module that overcomes the spatial and temporal limitations of error-correlated autoregressive forecasting. Teger proposes a spatial curvature-aware graph rewiring mechanism that explicitly strengthens information-bottleneck edges identified by discrete Forman curvature. The component is integrated into a low-rank-plus-diagonal covariance head, preserving tractable inference via the Woodbury identity. Teger is backbone-agnostic, requiring only the latent state produced by any autoregressive encoder. We provide theoretical support for Teger and experimentally evaluate it on LSTM, Transformer, and xLSTM backbones across four real-world spatio-temporal datasets, showing consistent improvement in Continuous Ranked Probability Score (CRPS). We further provide a formal theoretical analysis connecting curvature-aware rewiring to (i) over-squashing alleviation, (ii) improved spectral connectivity, (iii) reduced effective resistance, and (iv) improved covariance calibration bounds.
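
Why a low-rank-plus-diagonal covariance keeps inference tractable can be seen in the rank-1 special case of the Woodbury identity (the Sherman-Morrison formula), which inverts diag(d) + u uᵀ without a general matrix inversion. This is a minimal sketch with hypothetical values for `d` and `u`; a rank-r head would replace the scalar denominator with an r-by-r inner inverse, and nothing here reproduces Teger's actual covariance head.

```python
def woodbury_rank1_inverse(d, u):
    """Inverse of diag(d) + u u^T using only O(n^2) arithmetic."""
    n = len(d)
    d_inv_u = [u[i] / d[i] for i in range(n)]           # D^{-1} u
    denom = 1.0 + sum(u[i] * d_inv_u[i] for i in range(n))  # 1 + u^T D^{-1} u
    return [
        [(1.0 / d[i] if i == j else 0.0) - d_inv_u[i] * d_inv_u[j] / denom
         for j in range(n)]
        for i in range(n)
    ]

def matmul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical diagonal variances and one low-rank factor.
d = [1.0, 2.0, 4.0]
u = [1.0, 0.5, -1.0]

sigma = [[(d[i] if i == j else 0.0) + u[i] * u[j] for j in range(3)] for i in range(3)]
sigma_inv = woodbury_rank1_inverse(d, u)
product = matmul(sigma, sigma_inv)  # should be close to the identity matrix
```

The same structure is what makes Gaussian log-likelihoods cheap to evaluate when the number of series is large and the rank is small.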

2605.18067 2026-05-19 cs.CL

PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

Zile Wang, Qianli Liu, Kaibin Guo, Haodong Wang, Jian Lin, Zicong Hong, Song Guo

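AI summary This paper proposes PPAI, a system that enables collaboration between users based on agent specialization, addresses the challenges of a changing agent pool and load balancing, and improves task accuracy while reducing latency.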

English abstract

Deploying large language models (LLMs) on edge devices enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents better suited to the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, compared with existing P2P systems, the ever-changing pool of agents and their interchangeable capacity introduce new challenges in matching queries to agents and balancing loads. Therefore, we propose a scalable prototype-based scoring mechanism for query-agent pairs to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that can be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.

2605.18063 2026-05-19 cs.CV cs.LG

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

Corentin Dumery, Niki Amini-Naieni, Shervin Naini, Pascal Fua

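AI summary This paper presents the MixCount dataset, which uses an automatic generation pipeline to address the shortage of data for mixed-object scenes in open-vocabulary object counting, and shows substantial gains on real-world benchmarks.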

Comments Co-first authors. Dataset and project page https://corentindumery.github.io/projects/mixcount.html

English abstract

Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.

2605.18060 2026-05-19 cs.CV

Embedded ConvNet Ensembles: A Lightweight Approach to Recognize Arabic Handwritten Characters

Mohsine El Khayati, Rachid Elouahbi, Abdelillah Semma

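AI summary This paper proposes combining lightweight embedded ConvNets with ensemble learning for Arabic handwritten character recognition, and shows experimentally that lightweight models are competitive in accuracy and that ensembling further improves performance.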

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

English abstract

Arabic Handwritten Character Recognition (AHCR) has recently advanced significantly with deep Convolutional Neural Networks (ConvNets). However, many models in the literature are deep and computationally expensive in terms of parameters and FLOPs, limiting their deployment on resource-constrained devices, which are increasingly common. This study addresses this gap by proposing a combination of lightweight embedded ConvNet models and ensemble learning techniques. Extensive experiments were conducted to identify best practices in AHCR, considering training hyperparameters, learning strategies, model choices, and ensemble methods. Results show that embedded models can achieve accuracy comparable to, or even surpassing, heavier architectures. Ensemble learning further enhances performance with only modest computational overhead, particularly under challenging training scenarios. Among the ensembling strategies, soft voting yielded the best overall results.
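
Soft voting, the ensembling strategy the abstract reports as best, averages the class probabilities of the member models before taking the argmax, whereas hard voting counts each model's top choice. The sketch below contrasts the two on hypothetical probability vectors for three models and three classes; the numbers are invented for illustration and do not come from the paper.

```python
def soft_vote(prob_lists):
    """Average class probabilities across models, then pick the best class."""
    n_models, n_classes = len(prob_lists), len(prob_lists[0])
    averaged = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=averaged.__getitem__), averaged

def hard_vote(prob_lists):
    """Each model votes for its top class; the most frequent vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_lists]
    return max(set(votes), key=votes.count)

# Hypothetical outputs: two models narrowly prefer class 0,
# one model is very confident in class 1.
model_probs = [
    [0.50, 0.45, 0.05],
    [0.48, 0.47, 0.05],
    [0.05, 0.90, 0.05],
]

soft_label, averaged_probs = soft_vote(model_probs)
hard_label = hard_vote(model_probs)
```

Here hard voting picks class 0 while soft voting picks class 1, because averaging lets a confident model outweigh two nearly undecided ones at the cost of only one extra pass over the class probabilities.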