arXivDaily: arXiv daily academic digest, updated Monday through Friday
2506.10341 2026-06-09 cs.LG cs.CL Updated version

Formalizing Learning from Language Feedback with Provable Guarantees

Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

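Affiliations: University of California, Berkeley; University of Washington; University of Toronto

AI summary: This paper formalizes the problem of learning from language feedback, proposes the transfer eluder dimension to characterize learning difficulty, and develops HELiX, a no-regret algorithm with proven performance guarantees, showing that rich language feedback can speed up learning exponentially.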

Comments ICML 2026

Abstract

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, so far a principled framing of these decision problems remains lacking. We formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a measure to characterize the hardness of LLF. We formalize the intuition that information in the language feedback governs the learning complexity, and demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled interactive learning algorithms using generic language feedback.

2506.06295 2026-06-09 cs.LG cs.AI cs.CL Updated version

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Yichen Zhu, Linfeng Zhang

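Affiliations: Zhejiang University

AI summary: To address the high inference latency of diffusion large language models, this work proposes dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity, reusing intermediate computations efficiently and substantially reducing FLOPs while preserving output quality.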

Comments Accepted by ICML 2026

Abstract

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
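
As a rough illustration of the partial response update described in the abstract, the sketch below decides which response tokens to recompute at a denoising step by comparing each token's current feature vector with its cached one and refreshing only the least similar fraction. The function name `select_tokens_to_update`, the `update_ratio` parameter, the use of cosine similarity, and the plain-list feature layout are all assumptions made for illustration; none of them is taken from the dLLM-Cache code base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate case: a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def select_tokens_to_update(cached, current, update_ratio=0.25):
    """Return sorted indices of the tokens whose features changed the most.

    cached, current: lists of per-token feature vectors (hypothetical layout).
    update_ratio: fraction of tokens to recompute (illustrative parameter).
    """
    sims = [cosine_similarity(c, n) for c, n in zip(cached, current)]
    k = max(1, math.ceil(update_ratio * len(sims)))
    # Lowest similarity first: these tokens drifted most since they were cached.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])
    return sorted(ranked[:k])
```

Tokens not returned by `select_tokens_to_update` would keep their cached features in this sketch, which is where the saving in computation would come from.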

2505.21457 2026-06-09 cs.CV cs.AI Updated version

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Mingyu Liu, Zheng Huang, Anzhou Li, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

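Affiliations: University of California, Berkeley

AI summary: Proposes ACTIVE-o3, a GRPO-based reinforcement learning framework in which a modular sensing-action design and a dual-form reward let MLLMs learn efficient, accurate region selection strategies on their own, significantly improving active perception on open-world and domain-specific tasks.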

Comments Accepted to ICML 2026. Project page: https://aim-uofa.github.io/ACTIVE-o3

Abstract

Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-o3, a reinforcement learning framework built on GRPO that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, ACTIVE-o3 autonomously learns efficient and stable region selection strategies without explicit region-selection supervision. We further establish a comprehensive benchmark covering both open-world tasks, including small- and dense-object grounding, and domain-specific scenarios, including remote sensing, autonomous driving, and interactive segmentation. Experimental results demonstrate that ACTIVE-o3 significantly enhances active perception capabilities compared to baselines. Moreover, we show that our framework not only preserves the model's general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA and MME.
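
The abstract states that ACTIVE-o3 is built on GRPO. For readers unfamiliar with it, the sketch below shows the group-relative advantage that GRPO is generally described as using: the rewards of a group of sampled responses to the same prompt are normalized by the group's mean and standard deviation. The function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions for illustration, and the paper's dual-form reward is not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.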

2505.21239 2026-06-09 cs.CL Updated version

A Unified LLM-Adaptable Framework for Cold-Start Cognitive Diagnosis

Zihan Yao, Chentao Song, Yu He, Tianyu Qi, Jian Zhang, Weiping Fu, Jun Liu

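Affiliations: University of Science and Technology of China

AI summary: Proposes the LMCD framework, which uses large language models in two phases, knowledge diffusion and semantic-cognitive fusion, to improve cognitive diagnosis in cold-start scenarios.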

Comments Under review

Abstract

Cognitive Diagnosis has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional cognitive diagnosis models (CDMs) often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features, but they fail to fully bridge the gap between semantic understanding and cognitive profiling. To address this limitation, we propose Language Model-based Cognitive Diagnosis (LMCD), a unified, LLM-adaptable framework designed to tackle cold-start challenges by harnessing the advanced capabilities of large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched content for exercises and knowledge concepts (KCs) to establish stronger semantic links; and (2) Semantic-Cognitive Fusion, which leverages LLMs to deeply integrate textual information with student cognitive states. By unifying the semantic and cognitive spaces, LMCD creates comprehensive representations that serve as a plug-and-play enhancement for various off-the-shelf CDMs. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at https://github.com/TAL-auroraX/LMCD.

2505.17868 2026-06-09 cs.LG math.OC Updated version

SpectraLDS: Provable Distillation for Linear Dynamical Systems

Devan Shah, Shlomo Fortgang, Sofiia Druchyna, Elad Hazan

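Affiliations: Computer Science Department, Princeton University; Google DeepMind Princeton

AI summary: Proposes the first provable method for identifying symmetric linear dynamical systems, using spectral transforms to obtain accuracy guarantees independent of the state dimension and enabling constant-time, constant-space inference.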

Abstract

We present the first provable method for identifying symmetric linear dynamical systems (LDS) with accuracy guarantees that are independent of the systems' state dimension or effective memory. Our approach builds upon recent work that represents symmetric LDSs as convolutions learnable via fixed spectral transformations. We show how to invert this representation, thereby recovering an LDS model from its spectral transform and yielding an end-to-end convex optimization procedure. This distillation preserves predictive accuracy while enabling constant-time and constant-space inference per token, independent of sequence length. We evaluate our method, SpectraLDS, as a component in sequence prediction architectures and demonstrate that accuracy is preserved while inference efficiency is improved on tasks such as language modeling.
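
To make the contrast between the convolutional view and recurrent inference concrete, here is a minimal sketch. It assumes the symmetric transition matrix has already been diagonalized (a real symmetric matrix always can be), so the state update is elementwise, and it uses scalar inputs and outputs. The recurrence h_t = a * h_{t-1} + b * u_t, y_t = c . h_t does a fixed amount of work per token, whereas the equivalent convolution y_t = sum over s <= t of (sum_i c_i a_i^(t-s) b_i) u_s does work that grows with t. The names `lds_recurrent` and `lds_convolution` are illustrative and this is not the SpectraLDS distillation procedure itself.

```python
def lds_recurrent(a, b, c, inputs):
    """Run a diagonal LDS step by step; work per token depends only on state size."""
    h = [0.0] * len(a)
    outputs = []
    for u in inputs:
        # Elementwise state update: h_i <- a_i * h_i + b_i * u
        h = [ai * hi + bi * u for ai, hi, bi in zip(a, h, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, h)))
    return outputs


def lds_convolution(a, b, c, inputs):
    """Same outputs via an explicit convolution; work for token t grows with t."""
    outputs = []
    for t in range(len(inputs)):
        y = 0.0
        for s in range(t + 1):
            # Impulse response at lag (t - s): sum_i c_i * a_i^(t - s) * b_i
            kernel = sum(ci * ai ** (t - s) * bi for ai, bi, ci in zip(a, b, c))
            y += kernel * inputs[s]
        outputs.append(y)
    return outputs
```

Both functions compute the same sequence; only the recurrent form keeps a fixed-size state and so needs constant time and memory per token regardless of sequence length.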

2505.13225 2026-06-09 cs.CV Updated version

CoSeP: Complementary Separability Pruning via Class-Separability Clustering

David Levin, Gonen Singer

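Affiliations: Faculty of Engineering, Bar-Ilan University

AI summary: Proposes CoSeP, which models complementarity in class-separability space and selects pruning ratios automatically, matching or improving accuracy across multiple networks and datasets while reducing computation.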

Comments Major revision and extension of arXiv:2505.13225

Abstract

Neural network pruning aims to compress models for efficient deployment, yet two fundamental challenges remain. First, many methods rely on per-component importance scores, selecting filters or neurons independently and ignoring redundancy: the retained set may include multiple components capturing similar discriminative patterns while missing others entirely. Second, determining per-layer pruning ratios typically requires manual, architecture-specific tuning with no principled stopping criterion. We propose CoSeP (Complementary Separability Pruning) to address both issues. Rather than scoring components in isolation, CoSeP represents each component by its class-separability profile across all class pairs, computed via Jeffries-Matusita distances. This defines a separability space in which nearby components are potentially redundant and distant components capture complementary information. CoSeP selects a compact set of representatives in this space: components are grouped via k-medoids clustering, candidate subset sizes are evaluated using the Mean Simplified Silhouette, and a knee-detection criterion automatically determines how many components to retain. Across CIFAR-10, CIFAR-100, and ImageNet-1K, on ResNet, VGG, MobileNet, and DenseNet architectures, CoSeP matches or improves accuracy while reducing FLOPs, with measured wall-clock inference-time reductions of up to 20%. For example, it achieves a +0.66% top-1 accuracy gain with 2.30x FLOPs reduction on ResNet-50/ImageNet-1K, and a 0.37% gain with 2.59x FLOPs reduction on VGG-16/CIFAR-10. These results demonstrate that modeling complementarity in class-separability space provides an effective and principled approach to pruning.
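
The class-separability profile mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: each class's activations for a component are summarized by a univariate Gaussian (mean, variance), and the Jeffries-Matusita distance is taken as JM = 2(1 - exp(-B)), with B the Bhattacharyya distance, so that values lie in [0, 2] (some references use the square root of this quantity). The function names are hypothetical, and the abstract does not say how CoSeP estimates the class distributions.

```python
import math
from itertools import combinations


def bhattacharyya_gaussian(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two univariate Gaussians."""
    avg_var = 0.5 * (var1 + var2)
    mean_term = 0.125 * (mean1 - mean2) ** 2 / avg_var
    var_term = 0.5 * math.log(avg_var / math.sqrt(var1 * var2))
    return mean_term + var_term


def jeffries_matusita(mean1, var1, mean2, var2):
    """JM distance in [0, 2]; 0 means identical, 2 means fully separable."""
    b = bhattacharyya_gaussian(mean1, var1, mean2, var2)
    return 2.0 * (1.0 - math.exp(-b))


def separability_profile(class_stats):
    """JM distance for every unordered class pair, in a fixed pair order.

    class_stats: list of (mean, variance) tuples, one per class, describing
    a single component's activations (hypothetical input format).
    """
    return [
        jeffries_matusita(*class_stats[i], *class_stats[j])
        for i, j in combinations(range(len(class_stats)), 2)
    ]
```

Two components with nearly identical profiles would separate the same class pairs and so be candidates for redundancy, which is the intuition behind clustering in this space.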

2505.07573 2026-06-09 cs.CV cs.AI Updated version

Robust Renal Mass Segmentation on CT: A Validation Study of an AI-Based Framework

Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering

Affiliations: Department of Medical Imaging, Radboudumc, Nijmegen, The Netherlands; Department of Radiology, Charité - Universitätsmedizin Berlin, Berlin, Germany; Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Berlin, Germany; Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany; Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center, TUM University Hospital, Technical University of Munich, Munich, Germany; Fraunhofer MEVIS, Bremen, Germany

AI summary: Proposes Renal-Net, trained with nnU-Net on public data, for renal mass segmentation on CT images; validation shows that it outperforms existing models and is robust.

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:012. 23 pages, 12 figures

Journal ref Machine Learning for Biomedical Imaging 2026 (2026)

Abstract

Renal mass segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and kidney lesions, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated renal mass segmentation algorithm, named Renal-Net. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
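
For reference, the Dice coefficient used to quantify segmentation performance can be computed as in the minimal sketch below, which operates on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the study's own evaluation code may handle that case differently, and the 95th percentile Hausdorff distance is not covered here.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / total
```

A value of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.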

2505.03528 2026-06-09 cs.CV Updated version

Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

Chenguang Liu, Jianjun Chen, Yunfei Chen, Yubei He, Zhuangkun Wei, Hongjian Sun, Haiyan Lu, Qi Hao

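Affiliations: Department of Engineering, Durham University; Faculty of Engineering and Information Technology, University of Technology, Sydney; Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology

AI summary: To address the effect of V2V communication impairments on cooperative perception, proposes Coop-WD, a joint weighting and denoising framework that enhances features hierarchically with a self-supervised contrastive model and a conditional diffusion probabilistic model, along with an efficient variant, Coop-WD-eco, that reduces computational overhead; it outperforms conventional methods across all channel types.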

Comments submitted to IEEE Transactions on Intelligent Transportation Systems

Abstract

Cooperative perception, leveraging shared information from multiple vehicles via vehicle-to-vehicle (V2V) communication, plays a vital role in autonomous driving to alleviate the limitation of single-vehicle perception. Existing works have explored the effects of V2V communication impairments on perception precision, but they lack generalization to different levels of impairments. In this work, we propose a joint weighting and denoising framework, Coop-WD, to enhance cooperative perception subject to V2V channel impairments. In this framework, the self-supervised contrastive model and the conditional diffusion probabilistic model are adopted hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant model, Coop-WD-eco, is proposed to selectively deactivate denoising to reduce processing overhead. Rician fading, non-stationarity, and time-varying distortion are considered. Simulation results demonstrate that the proposed Coop-WD outperforms conventional benchmarks in all types of channels. Qualitative analysis with visual examples further proves the superiority of our proposed method. The proposed Coop-WD-eco achieves up to 50% reduction in computational cost under severe distortion while maintaining comparable accuracy as channel conditions improve.

2504.18451 2026-06-09 cs.LG Updated version

Enhancing Strawberry Yield Forecasting with Backcasted IoT Sensor Data and Machine Learning

Tewodros Alemu Ayall, Andy Li, Matthew Beddows, Milan Markovic, Georgios Leontidis

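Affiliations: The School of Natural and Computing Sciences and the Interdisciplinary Institute at the University of Aberdeen; UiT The Arctic University of Norway

AI summary: To address missing IoT data, proposes an AI-based backcasting method that synthesizes sensor data and combines it with real data to train yield forecasting models; experiments in strawberry production show that the synthetic data improves forecasting accuracy.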

Comments V2: 10 pages, 4 figures, 4 Tables

Abstract

Rapid global population growth underscores the need for digitally enabled agricultural systems that support sustainable food production and data-driven resource management for farmers and stakeholders. The adoption of Internet of Things (IoT) technologies, capable of capturing real-time environmental (e.g., temperature, humidity) and operational (e.g., irrigation) parameters, is a crucial step toward enabling advanced applications such as AI-based yield forecasting. However, the effectiveness of such models is often constrained by limited data availability, particularly in dynamic farm environments where IoT observations must be accumulated over multiple growing seasons. In this study, we deployed IoT sensors in strawberry production polytunnels over two growing seasons to collect data on water usage, internal and external temperature and humidity, soil moisture, soil temperature, and photosynthetically active radiation. These observations were combined with manually recorded yield data spanning four seasons. To address gaps in IoT data for the two seasons without sensor coverage, we developed an AI-based backcasting approach that synthesizes missing sensor observations using historical weather data from a nearby station and existing polytunnel measurements. We then trained AI-based yield forecasting models using both real and synthetic datasets. In this retrospective evaluation, results show that incorporating synthetic data improved yield forecasting accuracy, with models trained on the combined dataset outperforming those using only real sensor, weather, and yield data.

2504.00977 2026-06-09 cs.CL Updated version

Chinese Grammatical Error Correction: A Survey

Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park

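Affiliations: KAIST

AI summary: Surveys research on Chinese Grammatical Error Correction (CGEC), covering datasets, annotation schemes, evaluation methods, and system advances, and identifies key challenges and future directions.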

Abstract

Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.

2504.00375 2026-06-09 cs.CV Updated version

CamoSAM2: SAM2-oriented Prompt Auto-Refinement for Video Camouflaged Object Detection

Xin Zhang, Keren Fu, Qijun Zhao

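Affiliations: National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University; College of Computer Science, Sichuan University

AI summary: Proposes the CamoSAM2 framework, which uses a motion-appearance prompt inducer and an adaptive multi-prompt refinement strategy to generate and refine prompts for SAM2 automatically, significantly improving video camouflaged object detection.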

Comments 13 pages, 8 figures

Abstract

The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has performed remarkably in video object segmentation, drawing significant attention in the community. Because camouflaged objects closely resemble their surroundings, which makes them difficult to distinguish even for the human eye, applying SAM2 to automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompt generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework that automatically generates and refines prompts for SAM2, enabling high-quality automatic detection and segmentation in the video camouflaged object detection (VCOD) task. First, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Second, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt errors in the initial coarse masks and producing better prompts. Specifically, we introduce a novel three-step process that generates reliable prompts through camouflaged object determination, pivotal prompt frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in the mIoU metric. Additionally, our method achieves the fastest inference speed among current VCOD models.

2411.08314 2026-06-09 cs.LG Updated version

Modeling Stochastic Conditional Dynamics from Sparse Observations via Kernel-Stabilized Flow Matching

Adam P. Generale, Andreas E. Robertson, Surya R. Kalidindi

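Affiliations: Georgia Institute of Technology; Sandia National Laboratories

AI summary: Proposes the Conditional Variable Flow Matching (CVFM) framework, which jointly samples flows over state and conditioning variables and reweights the objective with a conditioning mismatch kernel and a Wasserstein distance to learn the time evolution of conditional distributions from sparse unpaired data, performing better on materials structure modeling.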

Comments Accepted to Transactions on Machine Learning Research (2026); OpenReview: https://openreview.net/forum?id=3A6oAS2TWo

Journal ref Transactions on Machine Learning Research, 2026

Abstract

Learning to transform conditional probability densities over time is a fundamental challenge spanning probabilistic modeling and the natural sciences. This task is paramount when forecasting the evolution of stochastic nonlinear dynamical systems in biological and physical domains. While flow-based models can predict the temporal evolution of probability distributions, existing approaches often assume discrete conditioning with samples that are paired across time, limiting their scientific applicability where frequently only sparse data with unpaired continuous conditioning is available. We propose Conditional Variable Flow Matching (CVFM), a framework for learning flows transforming conditional distributions with amortization across the continuous space of conditional densities. CVFM addresses the high-variance instability of prior methods by jointly sampling flows over state and conditioning variables, utilizing a conditioning mismatch kernel alongside a conditional Wasserstein distance to reweight the conditional optimal transport objective. Collectively, these advances allow for learning dynamics from sparse unpaired measurements of state-condition across time. We evaluate CVFM on conditional mapping benchmarks and a case study modeling the temporal evolution of materials internal structure during manufacturing processes, observing improved performance and convergence characteristics over existing conditional variants. Code is available at https://github.com/agenerale/conditional-variable-flow-matching.

2501.08238 2026-06-09 cs.SD eess.AS Updated version

CodecFake+: Codec-Based Resynthesized Data as a Proxy for Detecting CodecFake Speech

Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; Department of Computer Science and Information Engineering, National Taiwan University; Center for Language and Speech Processing at Johns Hopkins University; Department of Electrical Engineering, National Taiwan University; Research Center for Information Technology Innovation, Academia Sinica; NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)

AI summary: To address the emerging challenge of detecting CodecFake deepfake speech, proposes CodecFake+, a large-scale dataset with training data re-synthesized by 31 open-source codecs and web-sourced data from 17 advanced CoSG models, establishes a codec taxonomy, and validates re-synthesized speech as effective training data.

Comments Accepted by TASLP 2026

Abstract

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.

2503.05169 2026-06-09 cs.LG 版本更新

phepy: Visual benchmarks and improvements for out-of-distribution detectors

phepy: 面向分布外检测器的可视化基准与改进

Felix Krumbiegel, Juniper Tyree, Michael Boy, Petri Clusius, Andreas Rupp

Affiliations: Department of Mathematics, Saarland University; Institute for Atmospheric and Earth System Research, University of Helsinki; School of Engineering Sciences, LUT University

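AI summary: Proposes an OOD detection benchmark built from three easily visualized toy examples, evaluates existing methods on it, and introduces $t$-poking and OOD sample weighting to make supervised detectors more precise at the ID-OOD boundary.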

详情
AI中文摘要

将机器学习应用于日益高维且训练数据稀疏或有偏的问题,增加了模型在其训练领域之外的输入上使用的风险。对于此类分布外(OOD)输入,模型无法再做出有效预测,其误差可能无界。由于在真实数据集上测试OOD检测方法较为复杂,我们设计了一个OOD检测基准,其中包含三个新颖且易于可视化的玩具示例。这些简单示例提供了直接且直观的洞察,判断检测器是否能够检测(1)线性和(2)非线性概念,以及(3)在高维空间(干草堆)中识别细小的分布内(ID)子空间(针)。我们利用该基准评估了文献中多种方法的性能。由于OOD输入的触觉示例可能有益于OOD检测,我们还回顾了几种用于监督训练合成OOD输入的简单方法。我们引入了两项改进,即$t$-poking和OOD样本加权,使监督式检测器在ID-OOD边界上更加精确。当真实ID样本与合成OOD样本之间的冲突模糊了决策边界时,这一点尤为重要。最后,我们为在机器学习中构建和应用OOD检测器提供了建议。

英文摘要

Applying machine learning to increasingly high-dimensional problems with sparse or biased training data increases the risk that a model is used on inputs outside its training domain. For such out-of-distribution (OOD) inputs, the model can no longer make valid predictions, and its error is potentially unbounded. Since testing OOD detection methods on real-world datasets is complicated, we design a benchmark for OOD detection, which includes three novel and easily-visualisable toy examples. These simple examples provide direct and intuitive insight into whether the detector is able to detect (1) linear and (2) non-linear concepts and (3) identify thin in-distribution (ID) subspaces (needles) within high-dimensional spaces (haystacks). We use our benchmark to evaluate the performance of various methods from the literature. Since tactile examples of OOD inputs may benefit OOD detection, we also review several simple methods to synthesise OOD inputs for supervised training. We introduce two improvements, $t$-poking and OOD sample weighting, to make supervised detectors more precise at the ID-OOD boundary. This is especially important when conflicts between real ID and synthetic OOD samples blur the decision boundary. Finally, we provide recommendations for constructing and applying OOD detectors in machine learning.

2502.16584 2026-06-09 cs.SD cs.AI cs.CL cs.MM eess.AS 版本更新

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

Audio-FLAN:面向语音、音乐和声音的统一音频理解与生成的指令跟随数据集

Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Xingjian Du, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

Affiliations: The Hong Kong University of Science and Technology; Inner Mongolia University; Beihang University; Queen Mary University of London; The Chinese University of Hong Kong; National University of Singapore; University of Surrey; University of Rochester; Independent Researcher

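AI summary: Introduces the Audio-FLAN dataset, covering 80 tasks and over 100 million instances, to support zero-shot learning for unified audio understanding and generation.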

详情
AI中文摘要

最近音频标记化的进展显著增强了将音频能力集成到大语言模型(LLM)中的能力。然而,音频理解和生成通常被视为不同的任务,阻碍了真正统一的音频-语言模型的发展。虽然指令调优在文本和视觉领域已显示出在改善泛化和零样本学习方面的显著成功,但其在音频领域的应用仍基本未被探索。一个主要障碍是缺乏统一音频理解和生成的全面数据集。为解决这一问题,我们引入了Audio-FLAN,这是一个大规模指令调优数据集,涵盖语音、音乐和声音领域的80种不同任务,包含超过1亿个实例。Audio-FLAN为统一的音频-语言模型奠定了基础,这些模型能够以零样本方式无缝处理跨多种音频领域的理解(如转录、理解)和生成(如语音、音乐、声音)任务。Audio-FLAN数据集可在HuggingFace和GitHub上获取。

英文摘要

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.

2502.06819 2026-06-09 cs.LG cs.GR 版本更新

AccioScene: Compositional 3D Scene Generation via Graph Diffusion and Interaction-driven Critics

AccioScene: 基于图扩散与交互驱动评判的组合式3D场景生成

Yao Wei, Matteo Toso, Pietro Morerio, Changjae Oh, Michael Ying Yang, Alessio Del Bue

Affiliations: Queen Mary University of London, UK; Italian Institute of Technology (IIT), Italy; University of Bath, UK

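AI summary: Proposes a multi-stage pipeline that uses graph diffusion to generate a contextually coherent scene graph and predict the object layout, combining lightweight human-object interaction priors with spatial constraints to produce physically plausible 3D indoor scenes that support human interaction.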

详情
AI中文摘要

本文提出一个从文本提示生成3D室内场景的框架。现有方法通常将场景合成视为基于单一输入模态(如文本描述、房间形状或场景图)的物体布局预测问题,这种设计可能导致物体碰撞和功能合理性受限,降低了其实用性。为解决这些局限,我们引入一个多阶段流水线,更好地反映实际场景创建场景。给定描述部分场景内容的文本提示,我们的方法首先使用图扩散生成上下文连贯的场景图,然后预测合理的物体布局。此外,我们融入轻量级人-物交互先验以鼓励以人为中心和功能性的布局,并加入显式空间约束以减少相互穿透。我们的方法生成连贯的3D场景,其布局可行且更好地支持人类交互。在3D-FRONT数据集上的实验表明,与现有方法相比,我们的方法达到了有竞争力或最先进的性能,同时提高了生成场景的物理合理性。

英文摘要

This paper presents a framework for generating 3D indoor scenes from text prompts. Existing methods often formulate scene synthesis as an object layout prediction problem conditioned on a single input modality, such as a text description, room shape, or scene graph. This design can lead to object collisions and limited functional plausibility, reducing its practical applicability. To address these limitations, we introduce a multi-stage pipeline that better reflects practical scene creation scenarios. Given a text prompt describing partial scene content, our method first uses graph diffusion to produce a contextually coherent scene graph and then predicts a realistic object layout. In addition, we incorporate lightweight human-object interaction priors to encourage human-centric and functional arrangements, with explicit spatial constraints to reduce interpenetration. Our approach generates coherent 3D scenes with viable layouts that better support human interaction. Experiments on the 3D-FRONT dataset demonstrate that our method achieves competitive or state-of-the-art performance compared with existing approaches, while improving the physical plausibility of generated scenes.

2412.13858 2026-06-09 cs.AI cs.LG 版本更新

IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem (TSP) by Leveraging the Structure of the Solution Space

IDEQ -- 利用解空间结构改进旅行商问题的扩散模型

Mickael Basson, Philippe Preux

Affiliations: Université de Lille; CNRS; Inria; UMR 9198-CRIStAL

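AI summary: Proposes IDEQ, which improves diffusion-based TSP solvers by exploiting the constrained structure of the TSP solution space and a training objective based on a uniform distribution over tours whose 2-opt orbits converge to the optimal solution; it sets a new state of the art among neural methods on synthetic instances and TSPlib, approaching the performance of LKH3.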

详情
AI中文摘要

我们研究扩散模型求解旅行商问题。基于最近的DIFUSCO和T2TCO方法,我们提出IDEQ。IDEQ通过利用TSP状态空间的约束结构来提高解的质量。IDEQ的另一个关键组成部分是,将DIFUSCO课程学习的最后阶段替换为考虑哈密顿环上的均匀分布,这些环在2-opt算子下的轨道收敛到最优解作为训练目标。我们的实验表明,IDEQ在合成实例上改进了此类神经网络技术的现有水平。更重要的是,我们的实验表明,IDEQ在TSPlib(TSP社区的参考基准)的实例上表现非常好:它紧密匹配最佳启发式算法LKH3的性能,甚至在两个分别包含1577和3795个城市的TSPlib实例上能够获得比LKH3更好的解。IDEQ在500个城市的TSP实例上获得0.3%的最优性差距,在1000个城市的TSP实例上获得0.5%的最优性差距。这为基于神经网络的TSP求解方法设立了新的SOTA。此外,与DIFUSCO和T2TCO相比,IDEQ表现出更低的方差和更好的随城市数量扩展的能力。

英文摘要

We investigate diffusion models to solve the Traveling Salesman Problem. Building on the recent DIFUSCO and T2TCO approaches, we propose IDEQ. IDEQ improves the quality of the solutions by leveraging the constrained structure of the state space of the TSP. Another key component of IDEQ consists in replacing the last stages of DIFUSCO curriculum learning with a training objective defined as a uniform distribution over the Hamiltonian tours whose orbits under the 2-opt operator converge to the optimal solution. Our experiments show that IDEQ improves the state of the art for such neural-network-based techniques on synthetic instances. More importantly, our experiments show that IDEQ performs very well on the instances of the TSPlib, a reference benchmark in the TSP community: it closely matches the performance of the best heuristic, LKH3, and even obtains better solutions than LKH3 on 2 TSPlib instances with 1577 and 3795 cities. IDEQ obtains a 0.3% optimality gap on TSP instances with 500 cities, and 0.5% on TSP instances with 1000 cities. This sets a new SOTA for neural-based methods solving the TSP. Moreover, IDEQ exhibits lower variance and scales better with the number of cities than DIFUSCO and T2TCO.
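For readers unfamiliar with the 2-opt operator mentioned in the abstract, the following is a minimal sketch of the textbook 2-opt local search on a Euclidean tour. It is not the IDEQ training procedure or the authors' code; the function names `tour_length` and `two_opt` and the toy coordinates are assumptions made for illustration only.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour that visits `points` in `tour` order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


def two_opt(points, tour):
    """Textbook 2-opt: reverse a segment whenever that shortens the tour.

    Illustrative sketch only (hypothetical helper, not taken from the paper).
    """
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city, nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour


# Toy instance: the four corners of a unit square, visited in a crossing order.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
result = two_opt(points, [0, 1, 2, 3])
```

Repeatedly applying such moves from a starting tour traces out a sequence of tours that ends in a 2-opt local optimum; on the toy square the crossing tour is untangled into the perimeter.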

2412.06147 2026-06-09 cs.LG cs.ET 版本更新

Advancements in Machine Learning and Deep Learning for Early Detection and Management of Mental Health Disorder

机器学习和深度学习在心理健康障碍早期检测和管理中的进展

Kamala Devi Kannan, Senthil Kumar Jagatheesaperumal, Rajesh N. V. P. S. Kandala, Mojtaba Lotfaliany, Roohallah Alizadehsanid, Mohammadreza Mohebbi

Affiliations: Department of Computer Science and Engineering, Mepco Schlenk Engineering College; Department of Electronics and Communication Engineering, Mepco Schlenk Engineering College; School of Electronics Engineering (SENSE), VIT-AP University; The Institute for Mental and Physical Health and Clinical Translation (IMPACT), School of Medicine, Deakin University; Biostatistics Unit, Faculty of Health, Deakin University; School of Medicine, Deakin University

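AI summary: Surveys ML/DL applications to the early diagnosis of mental health disorders, covering medical imaging, genetic, and behavioral data, and discusses data integration, ethical challenges, and future directions.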

Comments 21 pages, 2 figures, 3 tables

详情
AI中文摘要

对于心理健康疾病的早期识别、诊断和治疗,深度学习(DL)和机器学习(ML)的整合已开始发挥重要作用。通过评估来自影像、遗传学和行为评估的复杂数据,这些技术有潜力显著改善临床结果。然而,它们也带来了与数据整合和伦理问题相关的独特挑战。本综述回顾了ML和DL方法在心理健康问题早期诊断和治疗中的发展。它考察了一系列应用,特别强调了行为评估、遗传和生物标志物分析,以及用于诊断抑郁症、双相情感障碍和精神分裂症等疾病的医学影像。综述进一步讨论了疾病发展的预测建模,重点关注风险预测模型和纵向研究的作用。重要发现显示了ML和DL如何提高诊断准确性和治疗结果,同时解决方法不一致、数据整合和伦理问题。研究强调了构建用于个性化治疗的实时监测系统、改进数据融合技术和跨学科合作的重要性。未来的研究应集中于克服这些障碍,以最大化ML和DL在心理健康服务中的有益和道德实施。

英文摘要

For the early identification, diagnosis, and treatment of mental health illnesses, the integration of deep learning (DL) and machine learning (ML) has started to play a significant role. By evaluating complex data from imaging, genetics, and behavioral assessments, these technologies have the potential to improve clinical results significantly. However, they also present unique challenges relating to data integration and ethical issues. The development of ML and DL methods for the early diagnosis and treatment of mental health issues is reviewed in this survey. It examines a range of applications, with a particular emphasis on behavioral assessments, genetic and biomarker analysis, and medical imaging for the diagnosis of diseases like depression, bipolar disorder, and schizophrenia. Predictive modeling for illness development is further discussed in the review, focusing on the role of risk prediction models and longitudinal investigations. Important findings show how ML and DL might improve treatment outcomes and diagnostic accuracy while tackling methodological inconsistency, data integration, and ethical concerns. The study emphasizes the significance of building real-time monitoring systems for individualized treatment, improving data fusion techniques, and interdisciplinary collaboration. Future studies should concentrate on overcoming these obstacles to maximize the beneficial and ethical implementation of ML and DL in mental health services.

2412.00508 2026-06-09 cs.LG cs.AI cs.CE 版本更新

Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Graph-to-SFILES: 基于生成式人工智能从过程拓扑预测控制结构

Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann

Affiliations: Process Intelligence Research Group, Department of Chemical Engineering, Delft University of Technology

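AI summary: Proposes the Graph-to-SFILES model, which uses graph neural networks to generate control-extended flowsheet sequences from flowsheet topologies and markedly improves control structure prediction accuracy on small datasets.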

详情
Journal ref
Computers & Chemical Engineering, Volume 199, 2025, Pages 109121
AI中文摘要

控制结构设计是P&ID开发中重要但繁琐的步骤。生成式人工智能有望通过支持工程师来减少P&ID开发时间。先前关于化学过程设计中生成式AI的研究主要用序列表示过程。然而,图因其置换不变性而成为一种有前景的替代方案。我们提出了Graph-to-SFILES模型,一种从流程图拓扑预测控制结构的生成式AI方法。Graph-to-SFILES模型将流程图拓扑作为图输入,并返回以SFILES 2.0符号表示的控制扩展流程图序列。我们比较了四种不同的图编码器架构,其中一种是本文提出的图神经网络(GNN)。Graph-to-SFILES模型在10,000个流程图拓扑上训练时达到了73.2%的top-5准确率。此外,所提出的GNN在编码器架构中表现最佳。与纯基于序列的方法相比,Graph-to-SFILES模型在相对较小的1,000个流程图训练数据集上将top-5准确率从0.9%提高到28.4%。然而,在100,000个流程图的大规模数据集上,基于序列的方法表现更好。这些结果突显了基于图的AI模型在小数据场景下加速P&ID开发的潜力,但其在工业相关案例研究中的有效性仍需进一步研究。

英文摘要

Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes, but their effectiveness on industry-relevant case studies still needs to be investigated.
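The top-5 accuracy quoted above is a standard ranking metric: a prediction counts as correct when the reference sequence appears among the model's five highest-ranked candidates. The sketch below shows how such a metric is commonly computed; the helper name `top_k_accuracy` and the placeholder strings are assumptions, and the strings are not real SFILES 2.0 sequences or outputs of the paper's model.

```python
def top_k_accuracy(ranked_predictions, targets, k=5):
    """Fraction of targets found among the first k candidates of each ranking.

    `ranked_predictions[i]` is a list of candidates for example i, best first.
    Hypothetical helper for illustration, not the paper's evaluation code.
    """
    hits = sum(1 for preds, target in zip(ranked_predictions, targets)
               if target in preds[:k])
    return hits / len(targets)


# Placeholder candidate strings (not real SFILES 2.0 sequences).
ranked = [
    ["seq_a", "seq_b", "seq_c"],   # target seq_b is ranked second
    ["seq_x", "seq_y", "seq_z"],   # target seq_q is missing
]
targets = ["seq_b", "seq_q"]
top1 = top_k_accuracy(ranked, targets, k=1)
top5 = top_k_accuracy(ranked, targets, k=5)
```

With these placeholders the first example is a miss at k=1 but a hit at k=5, which is why top-5 figures are always at least as high as top-1 figures.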

2411.19504 2026-06-09 cs.AI cs.CL cs.IR 版本更新

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering

TQA-Bench:评估大语言模型在多表问答中的表现

Zipeng Qiu, Chenyue Li, You Peng, Guangxin He, Binhang Yuan, Chen Wang

Affiliations: Hong Kong University of Science and Technology; Tsinghua University

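AI summary: Proposes the TQA-Bench benchmark, which evaluates LLMs on long-context multi-table question answering and reveals challenges and opportunities in complex, data-driven environments.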

Comments Accepted by IEEE Transactions on Big Data

详情
AI中文摘要

大语言模型(LLMs)的进步为复杂的多模态数据管理任务带来了巨大机遇,尤其是在涉及复杂多表关系数据的问答(QA)中。尽管取得了显著进展,但由于分析关系数据结构模态的固有复杂性以及序列化表格数据可能的大规模性,系统评估LLMs在多表QA上的表现仍然是一个关键挑战。现有基准主要关注单表QA,未能捕捉金融、医疗和电子商务等真实世界领域中多个关系表之间连接的复杂性。我们提出了TQA-Bench,一个基于真实世界公共数据集的长上下文分析型多表QA基准,具有灵活的采样机制,可变化上下文长度(8K--64K tokens)和符号扩展,以评估超越检索和模式匹配的推理能力。我们系统评估了一系列参数规模从20亿到6710亿的LLMs。大量实验揭示了LLMs在多表QA中的关键性能洞察,突出了推进其在复杂数据驱动环境中应用的挑战和机遇。

英文摘要

The advance of large language models (LLMs) has unlocked great opportunities in complex multi-modal data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing the modality of relational data structures and the potentially large scale of serialized tabular data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of connections across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. We present TQA-Bench, a long-context analytical multi-table QA benchmark derived from real-world public datasets, with a flexible sampling mechanism that varies context length (8K--64K tokens) and symbolic extensions for assessing reasoning beyond retrieval and pattern matching. We systematically evaluate a set of LLMs spanning model scales from 2 billion to 671 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments.

2411.16102 2026-06-09 cs.LG 版本更新

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

BlendServe: 利用资源感知批处理优化自回归大模型的离线推理

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

Affiliations: University of California, Berkeley; University of Washington; University of California, Davis; Rice University

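AI summary: To resolve the conflict between resource overlapping and prefix sharing in offline batch inference, proposes a resource-aware prefix tree that maximizes resource utilization, delivering up to 1.44x higher throughput than vLLM and SGLang.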

详情
AI中文摘要

离线批处理利用请求批处理的灵活性实现更高吞吐量和更低成本,在延迟不敏感的应用中越来越受欢迎。同时,模型能力和模态的最新进展使得请求在计算和内存需求上更加多样化,通过资源重叠为吞吐量提升创造了独特机会。然而,最大化资源重叠的请求调度可能与最大化前缀共享(一种广泛使用的性能优化)的调度冲突,导致次优的推理吞吐量。我们提出BlendServe,该系统通过结合资源重叠和前缀共享的优势,使用资源感知前缀树来最大化离线批处理的资源利用率。BlendServe利用离线批处理中宽松的延迟要求,重新排序和重叠具有不同资源需求的请求,同时确保高前缀共享。我们在各种合成多模态工作负载上评估BlendServe,结果表明,与广泛使用的行业标准vLLM和SGLang相比,它提供了高达1.44倍的吞吐量提升。

英文摘要

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to $1.44\times$ throughput boost compared to widely-used industry standards, vLLM and SGLang.

2401.01599 2026-06-09 cs.LG math.ST stat.TH 版本更新

Generalization Error Curves for Analytic Spectral Algorithms under Power-law Decay

幂律衰减下解析谱算法的泛化误差曲线

Yicheng Li, Weiye Gan, Zuoqiang Shi, Qian Lin

Affiliations: Tsinghua University

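AI summary: Under mild assumptions, fully characterizes the generalization error curves of kernel gradient descent and other analytic spectral algorithms in kernel regression, sharpening the near inconsistency of kernel interpolation and clarifying the saturation effects of algorithms with higher qualification; via neural tangent kernel theory, the results improve the understanding of how wide neural networks generalize.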

详情
AI中文摘要

某些核回归方法的泛化误差曲线旨在确定在不同源条件、噪声水平和正则化参数选择下泛化误差的精确阶数,而非极小极大速率。在这项工作中,在温和假设下,我们严格地提供了核梯度下降方法(以及一大类解析谱算法)在核回归中泛化误差曲线的完整刻画。因此,我们可以锐化核插值的近不一致性,并阐明具有更高资格的核回归算法的饱和效应等。得益于神经正切核理论,这些结果极大地提高了我们对训练宽神经网络泛化行为的理解。一个新颖的技术贡献——解析泛函论证——可能具有独立的意义。

英文摘要

The generalization error curve of a kernel regression method aims at determining the exact order of the generalization error under various source conditions, noise levels, and choices of the regularization parameter, rather than the minimax rate. In this work, under mild assumptions, we rigorously provide a full characterization of the generalization error curves of the kernel gradient descent method (and a large class of analytic spectral algorithms) in kernel regression. Consequently, we can sharpen the near inconsistency of kernel interpolation and clarify the saturation effects of kernel regression algorithms with higher qualification, etc. Thanks to the neural tangent kernel theory, these results greatly improve our understanding of the generalization behavior of training wide neural networks. A novel technical contribution, the analytic functional argument, might be of independent interest.

2411.11350 2026-06-09 cs.LG eess.SP 版本更新

Zero and Few Shot Load Forecasting with Large Language Models

基于大语言模型的零样本和少样本负荷预测

Wenlong Liao, Chengrui Zhang, Zhe Yang, Mengshuo Jia, Christian Rehtanz, Jiannong Fang, Fernando Porté-Agel

Affiliations: School of Electrical Engineering, Southeast University; Wind Engineering and Renewable Energy Laboratory, Ecole Polytechnique Federale de Lausanne (EPFL); College of Electrical Engineering and New Energy, China Three Gorges University; Department of Electrical and Electronic Engineering, Imperial College London; The Department of Automation, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; The Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai; State Key Laboratory of Submarine Geoscience, Shanghai; Institute of Energy Systems, Energy Efficiency and Energy Economics, TU Dortmund University

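AI summary: Proposes zero-shot and few-shot load forecasting with the pre-trained Chronos model, which significantly outperforms multiple baseline models in data-scarce scenarios.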

Comments 24 pages, 5 figures

详情
Journal ref
International Journal of Electrical Power & Energy Systems, Volume 177, April 2026
AI中文摘要

深度学习模型在负荷预测中表现出色,但通常需要大量数据进行模型训练才能应用于新场景,这限制了其在数据稀缺场景下的有效性。受预训练语言模型(LLMs)在自然语言处理中巨大成功的启发,本文提出了一种使用高级LLM框架(称为Chronos模型)的零样本和少样本负荷预测方法。通过利用其广泛的预训练知识,Chronos模型能够在数据稀缺场景下实现准确的负荷预测。在五个真实世界数据集上的仿真结果表明,Chronos模型在确定性和概率性负荷预测中,针对不同的预测时间范围(例如1至48小时),均显著优于九种流行的基线模型,尽管Chronos模型既未针对这些特定负荷数据集进行定制也未进行微调。值得注意的是,与基线模型相比,Chronos将均方根误差(RMSE)、连续排序概率得分(CRPS)和分位数得分(QS)分别降低了约7.34%-84.30%、19.63%-60.06%和22.83%-54.49%。这些结果突显了Chronos模型的优越性和灵活性,使其成为数据稀缺场景下的有效解决方案。

英文摘要

Deep learning models have shown strong performance in load forecasting, but they generally require large amounts of data for model training before being applied to new scenarios, which limits their effectiveness in data-scarce scenarios. Inspired by the great success of pre-trained language models (LLMs) in natural language processing, this paper proposes a zero and few shot load forecasting approach using an advanced LLM framework denoted as the Chronos model. By utilizing its extensive pre-trained knowledge, the Chronos model enables accurate load forecasting in data-scarce scenarios. Simulation results across five real-world datasets demonstrate that the Chronos model significantly outperforms nine popular baseline models for both deterministic and probabilistic load forecasting with various forecast horizons (e.g., 1 to 48 hours), even though the Chronos model is neither tailored nor fine-tuned to these specific load datasets. Notably, Chronos reduces root mean squared error (RMSE), continuous ranked probability score (CRPS), and quantile score (QS) by approximately 7.34%-84.30%, 19.63%-60.06%, and 22.83%-54.49%, respectively, compared to baseline models. These results highlight the superiority and flexibility of the Chronos model, positioning it as an effective solution in data-scarce scenarios.
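The abstract reports RMSE for deterministic forecasts and a quantile score for probabilistic ones. As a hedged illustration, the sketch below computes RMSE and the pinball loss, which is the usual per-quantile score in probabilistic load forecasting. Treating the paper's QS as an average pinball loss is an assumption, and the function names and load values are made up for the example.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of point forecasts."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pinball_loss(y_true, y_quantile, tau):
    """Average pinball (quantile) loss of forecasts for quantile level tau.

    Assumption: this is the per-quantile score behind a "quantile score";
    the paper's exact definition may differ.
    """
    total = 0.0
    for t, q in zip(y_true, y_quantile):
        diff = t - q
        # Under-forecasts are weighted by tau, over-forecasts by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


# Made-up hourly loads (MW) and forecasts, for illustration only.
actual = [100.0, 110.0, 120.0]
point_forecast = [100.0, 110.0, 126.0]
err = rmse(actual, point_forecast)
under = pinball_loss([10.0], [8.0], tau=0.9)   # forecast below the outcome
over = pinball_loss([10.0], [12.0], tau=0.9)   # forecast above the outcome
```

For a 0.9-quantile forecast, missing low is penalized nine times more than missing high by the same amount, which is what pushes the forecast towards the upper tail of the load distribution.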

2411.06469 2026-06-09 cs.CL 版本更新

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

ClinicalBench: 大型语言模型能在临床预测中击败传统机器学习模型吗?

Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Shuang Zhou, Yuan Luo, Rui Zhang, Danielle Bitterman, Fei Wang, Kai Shu

Affiliations: Department of Computer Science, Northwestern University, Evanston, USA; Department of Computer Science, University of Texas at Austin, Austin, USA; Boston Children's Hospital, Harvard Medical School, Boston, USA; Department of Computer Science, Imperial College London, London, UK; Department of Computer Science, Ohio State University, Columbus, USA; Massachusetts General Hospital, Harvard Medical School, Boston, USA; Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, USA; Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, USA; Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, USA; Department of Computer Science, Emory University, Atlanta, USA

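AI summary: Builds the ClinicalBench benchmark, which compares 14 general-purpose and 8 medical LLMs against 11 traditional ML models on three clinical prediction tasks and finds that LLMs still cannot beat traditional ML models in clinical prediction.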

Comments Accepted to Proceedings of KDD 2026. The first two authors contributed equally. 12 pages for main paper, 62 pages including appendix. Project website: https://clinicalbench.github.io

详情
AI中文摘要

大型语言模型(LLMs)因其在医学文本处理任务和医学执照考试中的卓越能力,有望彻底改变当前的临床系统。与此同时,传统机器学习模型如SVM和XGBoost仍然主要应用于临床预测任务。一个新兴的问题是:LLMs能否在临床预测中击败传统ML模型?因此,我们构建了一个新的基准ClinicalBench,全面研究通用和医学LLMs的临床预测建模能力,并将其与传统ML模型进行比较。ClinicalBench包含三个常见的临床预测任务、两个数据库、14个通用LLMs、8个医学LLMs和11个传统ML模型。通过广泛的实证研究,我们发现,无论是通用还是医学LLMs,即使采用不同的模型规模、多样的提示或微调策略,仍然无法在临床预测中击败传统ML模型,这揭示了它们在临床推理和决策中的潜在缺陷。我们呼吁从业者在临床应用中使用LLMs时保持谨慎。ClinicalBench可用于弥合LLMs在医疗保健领域的发展与现实临床实践之间的差距。

英文摘要

Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is: Can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.

2312.15946 2026-06-09 cs.SD cs.GR eess.AS 版本更新

EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

EnchantDance: 揭示音乐驱动舞蹈动作的潜力

Bo Han, Teng Zhang, Zeyu Ling, Feilin Han

Affiliations: Zhejiang University; Tongji University

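AI summary: Proposes the EnchantDance framework, which builds a dance latent space and trains a diffusion model on it, and combines the large-scale ChoreoSpectrum3D dataset with a music genre prediction network to improve the quality, diversity, and consistency of generated dance.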

Comments Project Page: https://fluide1022.github.io/EnchantDance/

详情
AI中文摘要

音乐驱动的舞蹈生成任务涉及创建与给定音乐相对应的连贯舞蹈动作。现有方法虽然能生成物理上合理的舞蹈,但往往难以泛化到未见数据。挑战来自三个方面:1)舞蹈动作的高度多样性和音乐模态分布的显著差异,使得生成与音乐对齐的舞蹈动作困难;2)缺乏大规模音乐-舞蹈数据集,阻碍了从音乐生成泛化舞蹈动作;3)舞蹈动作的持续性对保持一致的舞蹈风格构成挑战。在这项工作中,我们引入了EnchantDance框架,一种最先进的舞蹈生成方法。由于原始舞蹈序列在时间轴上的冗余性,EnchantDance首先构建一个强大的舞蹈潜在空间,然后在舞蹈潜在空间上训练舞蹈扩散模型。为了解决数据缺口,我们构建了一个大规模音乐-舞蹈数据集ChoreoSpectrum3D Dataset,包含四种舞蹈风格,总时长70.32小时,是迄今为止报道的最大音乐-舞蹈数据集。为了增强音乐流派与舞蹈风格之间的一致性,我们使用迁移学习预训练了一个音乐流派预测网络,并在舞蹈扩散模型的训练中将音乐流派作为额外的条件信息。大量实验表明,我们提出的框架在舞蹈质量、多样性和一致性方面达到了最先进的性能。

英文摘要

The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; and 3) the protracted nature of dance movements, which makes it challenging to maintain a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.

2411.03253 2026-06-09 cs.LG cs.AI cs.DS 版本更新

Discovering Data Structures: Nearest Neighbor Search and Beyond

发现数据结构:最近邻搜索及其他

Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

Affiliations: Université de Montréal; Mila; HEC Montréal; Microsoft Research; University of Southern California; Stanford University

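AI summary: Proposes a general framework for end-to-end learning of data structures that adapts to the data distribution and controls query and space complexity; for nearest neighbor search, the learned structures can be reverse-engineered and resemble binary search, interpolation search, k-d trees, and locality-sensitive hashing.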

Comments Neurips 2025 Version

详情
AI中文摘要

我们提出了一个用于端到端学习数据结构的通用框架。我们的框架适应底层数据分布,并对查询和空间复杂度提供细粒度控制。关键在于,数据结构是从头开始学习的,不需要仔细初始化或用候选数据结构/算法进行种子化。我们首先将该框架应用于最近邻搜索问题。在多种设置中,我们能够逆向工程出学习到的数据结构和查询算法。对于一维最近邻搜索,模型发现了最优的分布(不)依赖算法,如二分搜索和插值搜索的变体。在更高维度中,模型学习到的解决方案在某些情况下类似于k-d树,而在其他情况下则具有局部敏感哈希的元素。该模型还能学习高维数据的有用表示,并利用它们设计有效的数据结构。我们还将框架应用于数据流上的频率估计问题,并相信它也可以成为新问题的强大发现工具。

英文摘要

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
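To make the two classical 1D baselines named in the abstract concrete, here is a minimal hand-written sketch of nearest neighbor lookup in a sorted list, once with binary search and once with interpolation search. These are the textbook algorithms, not the learned data structures from the paper; the function names and the sample data are assumptions for illustration.

```python
import bisect


def nearest_binary(sorted_xs, q):
    """Nearest neighbor of q in a non-empty sorted list via binary search."""
    i = bisect.bisect_left(sorted_xs, q)
    # Only the elements just below and just above the insertion point can win.
    candidates = sorted_xs[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda x: abs(x - q))


def nearest_interpolation(sorted_xs, q):
    """Same query, but each probe position is guessed from the values themselves.

    Fast when values are spread roughly evenly (a distribution-dependent bet).
    """
    lo, hi = 0, len(sorted_xs) - 1
    while hi - lo > 1 and sorted_xs[lo] < q < sorted_xs[hi]:
        frac = (q - sorted_xs[lo]) / (sorted_xs[hi] - sorted_xs[lo])
        mid = lo + int(frac * (hi - lo))
        mid = min(max(mid, lo + 1), hi - 1)  # keep the probe strictly inside
        if sorted_xs[mid] <= q:
            lo = mid
        else:
            hi = mid
    return min((sorted_xs[lo], sorted_xs[hi]), key=lambda x: abs(x - q))


xs = [1, 4, 9, 16, 25, 36]
answers = [(nearest_binary(xs, q), nearest_interpolation(xs, q))
           for q in (-5, 10, 13, 100)]
```

Binary search makes no assumption about how the values are distributed, while interpolation search needs fewer probes on evenly spread data and more on skewed data, which matches the abstract's distinction between distribution-independent and distribution-dependent algorithms.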

2410.21747 2026-06-09 cs.CV 版本更新

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

MotionGPT-2:用于运动生成与理解的通用运动-语言模型

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Dan Xu, Shixiang Tang

Affiliations: Tsinghua University; The University of Sydney; University of Science and Technology of China; The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; Intime Department Store; Deepeleph; HKUST

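AI summary: Proposes MotionGPT-2, a unified Large Motion-Language Model that uses a pre-trained large language model to support multimodal control conditions across tasks such as motion generation, captioning, and completion, and introduces a Part-Aware VQVAE for fine-grained body and hand motion representation.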

详情
AI中文摘要

近年来,从描述性文本生成逼真的人体运动受到了显著的研究关注,这得益于数字内容创作等新兴需求。尽管取得了令人印象深刻的进展,现有方法通常受限于有限的控制模态、任务特异性,并且仅关注身体运动。在本文中,我们提出了MotionGPT-2,一种统一的大规模运动-语言模型(LMLM),以解决这些局限性。MotionGPT-2通过预训练的大语言模型(LLM)支持多种运动相关任务和多模态控制条件。它将多模态输入(如文本和单帧姿态)量化为离散的、LLM可解释的标记,无缝集成到LLM的词汇表中。这些标记随后被组织成统一的提示,通过预训练-微调范式引导LLM生成运动输出。我们还展示了所提出的MotionGPT-2通过创新的运动离散化框架Part-Aware VQVAE,能够高度适应具有挑战性的3D整体运动生成任务,该框架确保了身体和手部运动的细粒度表示。大量实验和可视化验证了我们方法的有效性,展示了MotionGPT-2在运动生成、运动描述和广义运动补全任务中的适应性。

英文摘要

Generating lifelike human motions from descriptive texts has attracted remarkable research attention in recent years, propelled by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and a sole focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the innovative motion discretization framework, Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.

2402.13425 2026-06-09 cs.LG cs.AI stat.ML Updated version

Investigating the Histogram Loss in Regression

Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, Martha White

Affiliations * Alberta Machine Intelligence Institute (Amii) and Reinforcement Learning and Artificial Intelligence Laboratory; Department of Computing Science, University of Alberta; University of Tübingen; Zuse School ELIZA

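AI summary Through theoretical and empirical analysis, this paper investigates why the Histogram Loss improves performance in regression, finds that its advantage stems from improved optimization rather than from modeling extra information, and validates its effectiveness in common deep learning applications.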

Comments 52 pages

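Details
Journal ref
JMLR, 2026
AI abstract (translated from Chinese)

In regression, it is increasingly common to train neural networks to model the entire distribution even when only the mean is needed for prediction. This additional modeling often improves performance, but the reasons are not yet fully understood. This paper studies a recent approach to regression, the Histogram Loss, which learns the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how the different components of the loss contribute to it. Our results suggest that the benefit of learning distributions in this setting comes from improvements in optimization rather than from modeling extra information. We then demonstrate the viability of the Histogram Loss in common deep learning applications without costly hyperparameter tuning.

English abstract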

It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often yields a performance gain, but the reasons behind the improvement are not fully understood. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than from modeling extra information. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.
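
A minimal sketch of the kind of loss the abstract describes, in plain Python. The abstract specifies only a cross-entropy between a target distribution and a histogram prediction; the Gaussian-smoothed target, the bin edges, the softmax parameterization, and all function names below are assumptions chosen for illustration.

```python
import math


def gaussian_bin_targets(y, edges, sigma):
    """Mass that N(y, sigma^2) assigns to each bin, renormalized over the histogram support.

    Assumes y lies close enough to the support that the total mass is nonzero.
    """
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - y) / (sigma * math.sqrt(2.0))))
    masses = [cdf(hi) - cdf(lo) for lo, hi in zip(edges[:-1], edges[1:])]
    total = sum(masses)
    return [m / total for m in masses]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def histogram_loss(logits, y, edges, sigma):
    """Cross-entropy between the smoothed target histogram and the predicted histogram."""
    target = gaussian_bin_targets(y, edges, sigma)
    pred = softmax(logits)
    # The floor guards against log(0) if a predicted probability underflows.
    return -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, pred))


def predicted_mean(logits, edges):
    """Point prediction: expected bin center under the predicted histogram."""
    pred = softmax(logits)
    centers = [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]
    return sum(p * c for p, c in zip(pred, centers))


edges = [0.0, 1.0, 2.0, 3.0, 4.0]
logits = [0.0, 0.0, 0.0, 0.0]  # uniform prediction over four bins
print(histogram_loss(logits, y=2.0, edges=edges, sigma=0.5))
print(predicted_mean(logits, edges))
```

With a uniform prediction over four bins the loss equals log 4 regardless of the target, which makes a convenient sanity check; the network is still scored at test time through `predicted_mean`, so only the mean is used for prediction.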

2312.15321 2026-06-09 cs.CL Updated version

Greedy Grammar Induction with Indirect Negative Evidence

Joseph Potashnik

Affiliations * London, United Kingdom

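AI summary Proposes a non-lexicalized grammar-induction procedure that uses a rule-coverage bound and indirect negative evidence to distinguish the observed presentation from hypothesis-generated strings, gives a greedy search algorithm, and proves a conditional weak-recovery theorem.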

Comments 29 pages (including appendices and references)

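Details
AI abstract (translated from Chinese)

This paper proposes a non-lexicalized grammar-induction procedure that separates two tests: recognizing the observed finite presentation, and rejecting short preterminal strings that a hypothesis generates but the evidence does not support. The central object is the rule-coverage bound \(\ell^*(G)\): the maximum, over the rules in \(G\), of the length of the shortest preterminal string whose derivation uses that rule. This bound induces the comparison universe \(\Sigma_{\mathrm{pre}}^{\le \ell^*(G)}\), in which unsupported generated strings serve as indirect evidence against overgenerating hypotheses. We give a greedy search algorithm over rule sets and prove a conditional weak-recovery theorem: under explicit reachability conditions and sufficient saturation of the presentation, the exact learner reaches a grammar weakly equivalent to the unknown target. The complexity analysis is slice-wise: for each fixed incrementality radius \(k\), the search explores polynomially many rule-set extensions in the finite rule universe. Across 31 benchmark runs covering Dyck-\(k\) languages \((1\le k\le4)\), palindromes, \(a^n b^n\), English-like recursive fragments, and an inherently ambiguous union language, grammar-level analysis establishes weak equivalence between every returned grammar and its target.

English abstract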

This paper proposes a non-lexicalized grammar-induction procedure that separates two tests: recognition of the observed finite presentation, and rejection of short preterminal strings generated by a hypothesis but unsupported by the evidence. The central object is the rule-coverage bound \(\ell^*(G)\): the maximum, over rules in \(G\), of the length of the shortest preterminal string whose derivation uses that rule. This bound induces the comparison universe \(\Sigma_{\mathrm{pre}}^{\le \ell^*(G)}\), where unsupported generated strings serve as indirect evidence against overgenerating hypotheses. We give a greedy search algorithm over rule sets and prove a conditional weak-recovery theorem: under explicit reachability conditions and sufficient saturation of the presentation, the exact learner reaches a grammar weakly equivalent to the unknown target. The complexity analysis is slice-wise: for each fixed incrementality radius \(k\), the search explores polynomially many rule-set extensions in the finite rule universe. Across 31 benchmark runs spanning Dyck-\(k\) languages \((1\le k\le4)\), palindromes, \(a^n b^n\), English-like recursive fragments, and an inherently ambiguous union language, grammar-level analysis establishes weak equivalence between every returned grammar and its target.
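
The definition of the rule-coverage bound \(\ell^*(G)\) can be made concrete for a small context-free grammar over preterminals. The sketch below is one straightforward way to compute it from the definition in the abstract, not the paper's own procedure: the grammar encoding, the function name, and the assumption that every rule is reachable from the start symbol and productive are choices made here for illustration.

```python
import math


def rule_coverage_bound(rules, start):
    """Compute l*(G) and the per-rule shortest lengths.

    rules: list of (lhs, rhs) pairs with rhs a tuple of symbols; any symbol
    that never appears as an lhs is treated as a preterminal of length 1.
    """
    nonterminals = {lhs for lhs, _ in rules}

    def sym_len(sym):
        return min_len[sym] if sym in nonterminals else 1

    # Shortest yield length of each nonterminal, relaxed to a fixed point.
    min_len = {a: math.inf for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            cand = sum(sym_len(s) for s in rhs)
            if cand < min_len[lhs]:
                min_len[lhs] = cand
                changed = True

    # Fewest preterminals that must surround each nonterminal in a derivation from start.
    ctx = {a: math.inf for a in nonterminals}
    ctx[start] = 0
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            lens = [sym_len(s) for s in rhs]
            for i, sym in enumerate(rhs):
                if sym in nonterminals:
                    cand = ctx[lhs] + sum(lens[:i]) + sum(lens[i + 1:])
                    if cand < ctx[sym]:
                        ctx[sym] = cand
                        changed = True

    # Shortest string whose derivation uses a rule: cheapest context plus cheapest expansion.
    shortest = {(lhs, rhs): ctx[lhs] + sum(sym_len(s) for s in rhs) for lhs, rhs in rules}
    return max(shortest.values()), shortest


# a^n b^n over the preterminals A and B.
rules = [("S", ("A", "S", "B")), ("S", ("A", "B"))]
bound, per_rule = rule_coverage_bound(rules, "S")
print(bound, per_rule)
```

For this grammar the recursive rule first appears in the derivation of A A B B, so the comparison universe would contain all preterminal strings of length at most four.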

2408.00684 2026-06-09 cs.CL Updated version

Assessing the Variety of a Concept Space Using an Unbiased Estimate of Rao's Quadratic Index

Anubhab Majumder, Ujjwal Pal, Amaresh Chakrabarti

Affiliations * Department of Design and Manufacturing, Indian Institute of Science

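AI summary Proposes a distance-based variety metric built on an unbiased estimate of Rao's quadratic index, and develops the software tool VariAnT to support variety assessment of concept spaces in early-stage engineering design.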

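Details
AI abstract (translated from Chinese)

Past research has linked design creativity to 'divergent thinking', that is, the extent to which the concept space is explored during the early phase of design. Researchers argue that generating multiple concepts increases the chance of producing better design solutions. 'Variety' is one of the parameters that quantify the breadth of the concept space explored by designers. Assessing variety at the conceptual design stage is useful because, at this stage, designers are free to explore different solution principles in order to satisfy the design problem with novel concepts. This article elaborates on and critically examines existing variety metrics from the engineering design literature and discusses their limitations. A new distance-based variety metric is proposed, together with a prescriptive framework that supports the assessment process. The framework measures the real-valued distance between two design concepts using a chosen representation of their underlying abstraction levels. The proposed framework is implemented in a software tool called 'VariAnT'. In addition, the tool's application is demonstrated through an illustrative example.

English abstract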

Past research relates design creativity to 'divergent thinking,' i.e., how well the concept space is explored during the early phase of design. Researchers have argued that generating several concepts would increase the chances of producing better design solutions. 'Variety' is one of the parameters by which one can quantify the breadth of a concept space explored by the designers. It is useful to assess variety at the conceptual design stage because, at this stage, designers have the freedom to explore different solution principles so as to satisfy a design problem with substantially novel concepts. This article elaborates on and critically examines the existing variety metrics from the engineering design literature, discussing their limitations. A new distance-based variety metric is proposed, along with a prescriptive framework to support the assessment process. The framework measures the real-valued distance between two design concepts using any chosen representation of their underlying abstraction levels. The proposed framework is implemented in a software tool called 'VariAnT.' Furthermore, the tool's application is demonstrated through an illustrative example.
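
The title refers to an unbiased estimate of Rao's quadratic index, but the abstract does not spell the estimator out. The Python sketch below shows the standard form under multinomial sampling, where \(Q = \sum_{i \ne j} d_{ij} p_i p_j\) is summed over ordered pairs and \(n_i n_j / (n(n-1))\) is an unbiased estimate of \(p_i p_j\) for \(i \ne j\). Whether the paper uses exactly this form is an assumption, as are the function names and the toy distance function.

```python
def rao_q_plugin(counts, dist):
    """Plug-in estimate: replaces each p_i by n_i / n, which is biased downward."""
    n = sum(counts.values())
    total = sum(dist(a, b) * counts[a] * counts[b]
                for a in counts for b in counts if a != b)
    return total / (n * n)


def rao_q_unbiased(counts, dist):
    """Unbiased estimate under multinomial sampling: divides by n * (n - 1) instead of n^2."""
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two sampled items are needed")
    total = sum(dist(a, b) * counts[a] * counts[b]
                for a in counts for b in counts if a != b)
    return total / (n * (n - 1))


# Hypothetical example: four generated concepts falling into three categories,
# with every pair of distinct categories at distance 1.
counts = {"a": 2, "b": 1, "c": 1}
unit_distance = lambda a, b: 1.0
print(rao_q_plugin(counts, unit_distance))    # 10 / 16
print(rao_q_unbiased(counts, unit_distance))  # 10 / 12
```

When every concept is its own category (all counts equal to one), the unbiased form reduces to the mean distance over all ordered pairs of distinct concepts.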