arXivDaily: arXiv daily academic digest, updated Monday to Friday
2509.20641 2026-05-18 cs.LG cs.SD

Investigating Modality Contribution in Audio LLMs for Music

Giovana Morais, Magdalena Fuentes

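AI summary This paper uses the MM-SHAP framework to quantify each modality's contribution in audio large language models. It finds that the more accurate model relies more on text to answer questions, yet audio still allows key sound events to be localized. It is the first application of MM-SHAP to Audio LLMs.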

Comments 5 pages, 2 figures, accepted at ICASSP 2026

English abstract

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
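
A minimal Python sketch of the kind of score the abstract describes, assuming the usual MM-SHAP recipe: compute a Shapley value per input token, then take each modality's share of the summed absolute values. The toy additive model, its weights, and the split into two audio and two text tokens are illustrative assumptions, not the paper's setup. Exact enumeration as done here is only practical for a handful of tokens.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function over n_features players."""
    phi = [0.0] * n_features
    players = range(n_features)
    for j in players:
        others = [p for p in players if p != j]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value_fn(s | {j}) - value_fn(s))
    return phi


def modality_shares(phi, audio_idx, text_idx):
    """Share of total absolute attribution carried by each modality."""
    audio = sum(abs(phi[i]) for i in audio_idx)
    text = sum(abs(phi[i]) for i in text_idx)
    total = audio + text
    if total == 0:
        raise ValueError("all attributions are zero; shares are undefined")
    return audio / total, text / total


# Hypothetical toy model: the score is the sum of the weights of the tokens
# that are present. Tokens 0-1 stand for audio, tokens 2-3 for text.
WEIGHTS = [0.1, 0.1, 0.5, 0.3]


def toy_model(present):
    return sum(WEIGHTS[i] for i in present)


phi = shapley_values(toy_model, 4)
a_shap, t_shap = modality_shares(phi, audio_idx=[0, 1], text_idx=[2, 3])
print(f"audio share = {a_shap:.2f}, text share = {t_shap:.2f}")
```

Because the shares are normalized by the total absolute attribution, they do not depend on whether the prediction is correct, which is what makes such a score performance-agnostic.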

2509.20349 2026-05-18 cs.LG

Process-Informed Forecasting of Complex Thermal Dynamics in Pharmaceutical Manufacturing

Ramona Rubini, Siavash Khodakarami, Aniruddha Bora, George Em Karniadakis, Michele Dassisti

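AI summary This paper proposes process-informed forecasting, which combines classical models and deep learning architectures and integrates process priors to improve predictive accuracy and physical consistency. Its effectiveness is validated on pharmaceutical lyophilization.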

English abstract

Accurate time-series forecasting for complex physical systems is the backbone of modern industrial monitoring and control, yet deep learning models often lack the physical consistency required in regulated environments. To bridge this gap, we introduce Process-Informed Forecasting (PIF) models for temperature in pharmaceutical lyophilization, embedding deterministic production recipes as macro-structural priors. We investigate classical methods (e.g., the Autoregressive Integrated Moving Average (ARIMA) model) and modern deep learning architectures, including Kolmogorov-Arnold Networks (KANs). We compare three different loss function formulations that integrate a process-informed trajectory prior: a fixed-weight loss, a dynamic uncertainty-based loss, and a Residual-Based Attention (RBA) mechanism. We evaluate all models not only for accuracy and physical consistency but also for robustness to sensor noise. Furthermore, we test the practical generalizability of the best model in a transfer-learning scenario to a new process. Our results show that PIF models outperform their data-driven counterparts in terms of accuracy, physical plausibility and noise resilience, offering a scalable framework for reliable and generalizable forecasting solutions in critical manufacturing.
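
As an illustration of the first formulation, a fixed-weight loss can be sketched as a data-fit term plus a weighted penalty for deviating from the recipe trajectory. The exact form, the weight `lam`, and the temperature values below are assumptions for illustration; the abstract does not give the loss.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fixed_weight_pif_loss(pred, observed, recipe_prior, lam=0.1):
    """Data-fit term plus a fixed-weight pull toward the recipe trajectory.

    lam is a hypothetical constant weight; a dynamic variant would adapt it
    during training instead of fixing it.
    """
    return mse(pred, observed) + lam * mse(pred, recipe_prior)


# Hypothetical shelf temperatures in degrees Celsius.
pred = [-40.0, -38.0, -35.0]      # model forecast
observed = [-41.0, -38.0, -36.0]  # sensor readings
recipe = [-40.0, -40.0, -35.0]    # setpoints from the production recipe

loss = fixed_weight_pif_loss(pred, observed, recipe, lam=0.1)
print(f"loss = {loss:.3f}")
```

A larger `lam` pulls forecasts toward the recipe, trading fit to noisy sensor data for consistency with the planned process.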

2509.15267 2026-05-18 cs.CV cs.AI cs.LG

Autoguided Online Data Curation for Diffusion Model Training

Valeria Pais, Luis Oala, Daniele Faccio, Marco Aversa

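AI summary This paper studies how autoguidance and online data selection affect the training efficiency of diffusion models, and shows on synthetic data tasks that autoguidance improves sample quality and diversity.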

Comments Accepted non-archival paper at ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL)

English abstract

The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

2509.10310 2026-05-18 cs.CV math.OC

A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments

Evan Murphy, Marco Viola, Vladimir A. Krylov

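AI summary This paper proposes a stochastic birth-and-death optimisation algorithm based on energy maps for precisely locating urban street furniture. Integrating geospatial information improves localisation accuracy, and the results support its feasibility for large-scale asset mapping.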

Comments Accepted for publication in the Proceedings of the 27th Irish Machine Vision and Image Processing Conference (IMVIP 2025)

English abstract

In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.
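
To make the optimisation idea concrete, here is a simplified birth-and-death scheme with annealing on a small grid energy map. It is a sketch under stated assumptions, not the authors' implementation (that is what the linked repository is for): the grid, the pairwise proximity penalty, and all parameter names and values (`delta`, `beta`, `cooling`, `heating`, `min_dist`, `penalty`) are hypothetical.

```python
import math
import random


def config_energy(points, energy_map, min_dist, penalty):
    """Data energy of every object plus a penalty per pair closer than min_dist."""
    pts = list(points)
    total = sum(energy_map[r][c] for r, c in pts)
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if math.dist(pts[i], pts[j]) < min_dist:
                total += penalty
    return total


def death_probability(delta, beta, energy_gain):
    """Kill probability a / (1 + a); energy_gain = E(with object) - E(without)."""
    a = delta * math.exp(min(50.0, beta * energy_gain))  # cap avoids overflow
    return a / (1.0 + a)


def birth_and_death(energy_map, steps=200, delta=0.2, beta=1.0, cooling=0.97,
                    heating=1.03, min_dist=2.0, penalty=5.0, seed=0):
    rng = random.Random(seed)
    rows, cols = len(energy_map), len(energy_map[0])
    points = set()
    for _ in range(steps):
        # Birth: empty cells spawn objects, more readily where energy is low.
        for r in range(rows):
            for c in range(cols):
                if (r, c) not in points:
                    p_birth = min(1.0, delta * math.exp(min(50.0, -energy_map[r][c])))
                    if rng.random() < p_birth:
                        points.add((r, c))
        # Death: visit objects from worst to best data energy.
        for pt in sorted(points, key=lambda p: energy_map[p[0]][p[1]], reverse=True):
            gain = (config_energy(points, energy_map, min_dist, penalty)
                    - config_energy(points - {pt}, energy_map, min_dist, penalty))
            if rng.random() < death_probability(delta, beta, gain):
                points.discard(pt)
        delta *= cooling  # births become rarer over time
        beta *= heating   # inverse temperature rises, so deaths become greedy
    return points


# Toy 5x5 energy map: two strongly favoured cells, everything else unfavourable.
energy = [[2.0] * 5 for _ in range(5)]
energy[1][1] = energy[3][3] = -10.0
detected = birth_and_death(energy)
print(sorted(detected))
```

Because the energy lives on a map, external layers such as road masks or placement constraints could be folded in simply by adding them to `energy_map` before the search.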

2507.17572 2026-05-18 cs.RO

Sampling-Based Global Optimal Control and Estimation via Semidefinite Programming

Antoine Groudiev, Fabian Schramm, Éloïse Berthier, Justin Carpentier, Frederike Dümbgen

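AI summary This paper applies KernelSOS theory to control and robotics, addresses practical issues such as restarting strategies and hyperparameter calibration, and demonstrates its advantages in high-dimensional, non-parametric trajectory optimization.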

English abstract

Global optimization has gained traction over the past decades, thanks to the development of both theoretical foundations and efficient numerical routines. Among recent advances, Kernel Sum of Squares (KernelSOS) provides a powerful theoretical framework, combining the expressivity of kernel methods with the guarantees of SOS optimization. In this paper, we take KernelSOS from theory to practice and demonstrate its use on challenging control and robotics problems. We identify and address the practical considerations required to make the method work in applied settings: restarting strategies, systematic calibration of hyperparameters, methods for recovering minimizers, and the combination with fast local solvers. As a proof of concept, the application of KernelSOS to robot localization highlights its competitiveness with existing SOS approaches that rely on heuristics and handcrafted reformulations to render the problem polynomial. Even in the high-dimensional, non-parametric setting of trajectory optimization with simulators treated as black boxes, we demonstrate how KernelSOS can be combined with fast local solvers to uncover higher-quality solutions without compromising overall runtimes.

2507.15970 2026-05-18 cs.SD cs.AI eess.AS

CIS-BWE: Chaos-Informed Speech Bandwidth Extension

Tarikul Islam Tamiti, Tonmoy Das, Nursadul Mamun, Anomadarshi Barua

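AI summary This paper proposes the NDSI-BWE framework, which uses four new discriminators inspired by nonlinear dynamical systems to capture complex temporal behavior in speech and depth-wise convolutions to cut parameters, improving speech bandwidth extension.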

English abstract

Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Bandwidth Extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships, and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships. These are complemented by a Multi-Period Discriminator (MPD) for cyclical patterns, and by a Multi-Resolution Amplitude Discriminator (MRAD) and a Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-times parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective tests comprising five human judges, NDSI-BWE establishes a new SoTA in BWE.

2507.14200 2026-05-18 cs.CL cs.AI cs.LG

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement

Shengji Tang, Jianjian Cao, Weihao Lin, Jiale Hong, Bo Zhang, Shuyue Hu, Lei Bai, Tao Chen, Wanli Ouyang, Peng Ye

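AI summary This paper proposes SMCS, which coordinates multiple open-source LLMs through a retrieval-based prior selection module and an exploration-exploitation-driven posterior enhancement module. Experiments show that it outperforms closed-source models on multiple tasks and exceeds the average of the best open-source results across datasets.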

English abstract

Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%), across multiple tasks. Remarkably, it even exceeds the average of best results on different datasets with open-source LLMs (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.
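
The abstract says only that RPS dynamically selects the most suitable LLMs for each input. One plausible reading, sketched below, is to retrieve the stored queries most similar to the new input and rank models by how well they scored on those neighbours. The retrieval rule, the memory layout, the model names, and the two-dimensional embeddings are all assumptions for illustration and are not taken from the SMCS code.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_models(query_vec, memory, k_neighbors=2, top_m=2):
    """Rank models by summed recorded score on the k most similar stored queries."""
    neighbors = sorted(memory,
                       key=lambda item: cosine(query_vec, item["embedding"]),
                       reverse=True)[:k_neighbors]
    totals = {}
    for item in neighbors:
        for model, score in item["scores"].items():
            totals[model] = totals.get(model, 0.0) + score
    # Highest total first; ties are broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda m: (-totals[m], m))
    return ranked[:top_m]


# Hypothetical memory: embeddings of past queries with per-model quality scores.
memory = [
    {"embedding": [1.0, 0.0], "scores": {"model_a": 0.9, "model_b": 0.4, "model_c": 0.2}},
    {"embedding": [0.9, 0.1], "scores": {"model_a": 0.8, "model_b": 0.5, "model_c": 0.1}},
    {"embedding": [0.0, 1.0], "scores": {"model_a": 0.2, "model_b": 0.3, "model_c": 0.9}},
]

chosen = select_models([1.0, 0.05], memory)
print(chosen)
```

Under this reading, adding a new LLM only requires recording its scores on the stored queries, which is one way a retrieval-based selector could stay scalable.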

2507.10236 2026-05-18 cs.CV

Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos

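AI summary This work studies the challenges of detecting AI-generated images in the wild, analyzes how design choices affect detection performance, and reports an average AUC improvement of 26.87%.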

Comments ACM International Workshop on Multimedia AI against Disinformation 2026 (MAD 2026)

English abstract

As generative Artificial Intelligence (AI) advances, the realism of AI-generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior, we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We employ it to analyze the effects of key design choices typically considered when building a detector, involving its architecture, pre-trained latent spaces, training data, and pre-processing approaches. We show that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice to enable the processing pipeline to propagate and effectively analyze both low-level traces and high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches and under real-world conditions, providing a roadmap for developing more resilient detectors. Our assets are available at https://mever-team.github.io/itw-sm.

2506.22604 2026-05-18 cs.AI cs.HC cs.RO

Bootstrapping Human-Like Planning via LLMs

David Porfirio, Vincent Hsiao, Morgan Fine-Morris, Leslie Smith, Laura M. Hiatt

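AI summary This paper studies how natural language interfaces can be combined with drag-and-drop interfaces, using large language models to generate human-like action sequences and comparing them with hand-specified action sequences.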

Comments Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

English abstract

Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.

2506.16129 2026-05-18 cs.CV

Neurosymbolic Object-Centric Learning with Distant Supervision

Stefano Colamonaco, David Debot, Giuseppe Marra

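AI summary This paper proposes DeepObjectLog, a probabilistic neurosymbolic model for object-centric learning that needs no per-object labels or masks and improves generalization to compositional, object-count, and rule shifts.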

English abstract

Neurosymbolic learning can use symbolic rules to provide supervision for latent concepts from weak labels, but it commonly assumes that the entities referenced by these rules are already specified. Object-centric models decompose images into slot-like representations; however, such slots are not necessarily aligned with the predicates required for symbolic reasoning. We investigate object-centric neurosymbolic learning under distant supervision, where the object-level arguments of a logic program are learned directly from images using only global task labels. We introduce DeepObjectLog, a probabilistic neurosymbolic model that integrates a slot-based perceptual encoder with a probabilistic logic layer. The encoder predicts objectness and class probabilities for candidate object representations, while the logic layer marginalizes over latent objectness and class assignments to compute the likelihood of the observed label. This formulation provides a differentiable task-level learning signal for object-centric perception without requiring per-object labels, masks, bounding boxes, or heuristic set matching. Evaluations across diverse visual reasoning tasks demonstrate that DeepObjectLog achieves superior out-of-distribution generalization to compositional, object-count, and rule shifts compared to neural object-centric and standard neurosymbolic baselines.
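
To illustrate what marginalizing over latent objectness and class assignments can look like, the sketch below computes the likelihood of a global counting label ("exactly n cubes in the image") from per-slot probabilities. It assumes independent slots and a single counting rule; the task, class names, and probabilities are hypothetical, and the model's actual logic layer handles general logic programs rather than this special case.

```python
def count_distribution(slot_probs):
    """Distribution of how many independent slots are 'on' (Poisson binomial)."""
    dist = [1.0]  # with zero slots, the count is 0 with probability 1
    for p in slot_probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)  # this slot is not a target object
            new[k + 1] += mass * p      # this slot is a target object
        dist = new
    return dist


def label_likelihood(objectness, class_probs, target_class, target_count):
    """P(exactly target_count objects of target_class), summed over all
    latent objectness and class assignments."""
    slot_probs = [o * cp[target_class] for o, cp in zip(objectness, class_probs)]
    dist = count_distribution(slot_probs)
    if 0 <= target_count < len(dist):
        return dist[target_count]
    return 0.0


# Hypothetical encoder outputs for two slots.
objectness = [1.0, 0.5]
class_probs = [{"cube": 1.0, "ball": 0.0}, {"cube": 0.5, "ball": 0.5}]

p_one_cube = label_likelihood(objectness, class_probs, "cube", 1)
p_two_cubes = label_likelihood(objectness, class_probs, "cube", 2)
print(p_one_cube, p_two_cubes)
```

Every operation here is a sum or product of probabilities, so the negative log of the label likelihood could serve as a differentiable training signal for the slot predictions when the same computation is written in an autodiff framework.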

2506.12405 2026-05-18 cs.SD

Methods for pitch analysis in contemporary popular music: multiple pitches from harmonic tones in Vitalic's music

Emmanuel Deruty, David Meredith, Maarten Grachten, Pascal Arbez-Nicolas, Andreas Hasselholt Jørgensen, Oliver Søndermølle Hansen, Magnus Stensli, Christian Nørkær Petersen

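AI summary The study examines how a single harmonic complex tone can give rise to multiple perceived pitches in contemporary popular music. Using examples from electronic artists such as Vitalic, it analyzes the relationship between signal characteristics and pitch perception, and finds significant variation across listeners in the perception of multiple ambiguous pitches.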

Comments Pending review, Journal of the Audio Engineering Society

English abstract

Aims. This study suggests that the use of multiple perceived pitches arising from a single harmonic complex tone is an active and intentional feature of contemporary popular music. The phenomenon is illustrated through examples drawn from the work of electronic artist Vitalic and others. Methods. Two listening tests were conducted: (1) evaluation of the number of simultaneous pitches perceived from single harmonic tones, and (2) manual pitch transcription of sequences of harmonic tones. Relationships between signal characteristics and pitch perception were then analyzed. Results. The synthetic harmonic tones found in the musical sequences under study were observed to transmit more perceived pitches than their acoustic counterparts, with significant variation across listeners. Multiple ambiguous pitches were associated with tone properties such as prominent upper partials and particular autocorrelation profiles. Conclusions. Harmonic tones in a context of contemporary popular music can, in general, convey several ambiguous pitches. The set of perceived pitches depends on both the listener and the listening conditions.

2506.07073 2026-05-18 cs.SD cs.HC eess.AS

Insights on Harmonic Tones from a Generative Music Experiment

Emmanuel Deruty, Maarten Grachten

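AI summary Generative music AI aims to advance music production. The experiment shows that an AI model can generate structured harmonic tones, raises questions about how humans perceive harmonics, and points to gains in both musical creativity and theoretical understanding.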

Comments 15th International Workshop on Machine Learning and Music, September 9, 2024, Vilnius, Lithuania

English abstract

The ultimate purpose of generative music AI is music production. The studio-lab, a social form within the art-science branch of cross-disciplinarity, is a way to advance music production with AI music models. During a studio-lab experiment involving researchers, music producers, and an AI model for music generating bass-like audio, it was observed that the producers used the model's output to convey two or more pitches with a single harmonic complex tone, which in turn revealed that the model had learned to generate structured and coherent simultaneous melodic lines using monophonic sequences of harmonic complex tones. These findings prompt a reconsideration of the long-standing debate on whether humans can perceive harmonics as distinct pitches and highlight how generative AI can not only enhance musical creativity but also contribute to a deeper understanding of music.

2505.23678 2026-05-18 cs.CV

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

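AI summary This paper proposes ViGoRL, which uses reinforcement learning for visual reasoning, anchoring each reasoning step to spatial coordinates and improving visual grounding and visual search over conventional methods.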

Comments Project website: https://visually-grounded-rl.github.io/

English abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

2505.18853 2026-05-18 cs.CL

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov

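AI summary This paper proposes Smoothie, which progressively smooths token embeddings based on semantic similarity, combining the strengths of continuous latent spaces and the categorical simplex to improve text generation quality.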

Comments 18 pages, 4 figures, 13 tables

English abstract

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence and unconditional generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. The code is available at https://github.com/ashaba1in/smoothie.
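
One way to picture smoothing on token embeddings is to turn each token into a distribution over the vocabulary using a Gaussian kernel on embedding distance, and to widen the kernel as the noise level grows. The kernel form, the `sigma` values, and the three-token toy vocabulary below are assumptions for illustration, not the paper's exact forward process.

```python
import math


def smooth_token(token_id, embeddings, sigma):
    """Distribution over the vocabulary: softmax of -squared_distance / (2 * sigma^2)."""
    anchor = embeddings[token_id]
    logits = [-sum((a - b) ** 2 for a, b in zip(anchor, e)) / (2.0 * sigma ** 2)
              for e in embeddings]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def decode(probs):
    """Natural decoding: pick the most probable token id."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Hypothetical vocabulary: tokens 0 and 1 are semantically close, token 2 is far.
embeddings = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]

sharp = smooth_token(0, embeddings, sigma=0.01)   # almost one-hot
medium = smooth_token(0, embeddings, sigma=0.1)   # mass leaks to the neighbour
flat = smooth_token(0, embeddings, sigma=100.0)   # close to uniform
print(sharp, medium, flat)
```

At small `sigma` the token is still identifiable, at intermediate `sigma` probability mass flows first to semantically close tokens, and at large `sigma` all information is removed. This is the gradual, similarity-aware corruption the abstract describes.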

2505.15692 2026-05-18 cs.CL cs.LG

TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Haoran Luo, Ling Yang, Huazhe Xu, Jianhua Tao

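AI summary TemplateRL improves LLM reasoning through structured template-guided reinforcement learning. It builds a library of problem-solving templates via MCTS and integrates it into RL training, raising trajectory hit rates and reducing ineffective exploration. Experiments show that it outperforms GRPO on AIME and AMC.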

Comments Accepted by ACL 2026

English abstract

Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO typically rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address this limitation, we propose TemplateRL, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates, enabling continuous updates during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.

2505.05583 2026-05-18 cs.CL

KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

Qianbo Zang, Christophe Zgrzendek, Igor Tchappi, Afshin Khadangi, Johannes Sedlmeir

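AI summary This paper proposes KG-HTC, which integrates knowledge graphs with large language models to address scarce annotated data, large label spaces, and long-tail distributions in hierarchical text classification. Experiments show strong performance in the zero-shot setting.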

English abstract

Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. Our KG-HTC can enhance LLMs to understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.

2504.18361 2026-05-18 cs.CV cs.AI

COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

Haozhen Yan, Yan Hong, Jiahui Zhan, Suning Lang, Yikun Ji, Huijia Zhu, Jun Lan, Jianfu Zhang

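AI summary This paper presents the COCO-Inpaint benchmark for detecting and localizing inpainting-based image manipulations. With high-quality samples, diverse scenarios, and large-scale coverage, it highlights intrinsic inconsistencies between inpainted and authentic regions.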

Comments 6 pages, 8 figures

英文摘要

Recent advances in image manipulation have enabled highly photorealistic content generation, but also lowered the barrier to arbitrary editing, raising concerns about multimedia authenticity and security. Existing Image Manipulation Detection and Localization (IMDL) methods mainly target splicing or copy-move forgeries, while benchmarks for inpainting-based manipulations remain limited. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection and localization, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 238,302 inpainted images with rich semantic diversity. Our benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

2504.00663 2026-05-18 cs.LG

Searching on a Budget: HW-NAS with 10 Latency Probes

在预算内搜索:具有10个延迟探针的HW-NAS

Francesco Capuano, Gabriele Tiboni, Niccolò Cavagnero, Giuseppe Averta

AI总结 本文提出一种两阶段HW-NAS框架,通过在合成设备上预训练控制器,再在目标设备上直接部署,利用少量高保真延迟测量实现目标设备架构设计,无需预收集信息。

详情
AI中文摘要

现有的硬件感知NAS(HW-NAS)方法通常假设可以访问目标设备的精确信息,要么通过后编译延迟模型的分析近似,要么通过学习的延迟预测器。此类近似方法可能引入估计误差,这对风险敏感的应用可能有害。在本工作中,我们提出了一种两阶段HW-NAS框架,首先在合成设备的分布上学习架构控制器,然后直接在目标设备上部署控制器。在测试时,我们的网络控制器直接部署到目标设备,不依赖任何预收集的信息,仅利用直接交互。特别是,预训练阶段在合成设备上使控制器能够通过少量高保真延迟测量与目标设备交互,设计出适合的目标设备架构。为保证方法的可访问性,我们仅使用无训练准确度代理进行训练,允许我们在不产生完整网络训练开销的情况下扩展元训练阶段。我们在HW-NATS-Bench上进行了基准测试,证明我们的方法能够泛化到未见过的设备,并通过上下文适应在测试时仅使用少量真实世界延迟评估来搜索延迟高效的架构。

英文摘要

Existing hardware-aware NAS (HW-NAS) methods typically assume access to precise information about the target device, either via analytical approximations of the post-compilation latency model or through learned latency predictors. Such approximations risk introducing estimation errors that may prove detrimental in risk-sensitive applications. In this work, we propose a two-stage HW-NAS framework, in which we first learn an architecture controller on a distribution of synthetic devices and then deploy the controller directly on a target device. At test time, the controller relies on no pre-collected information about the target device and exploits only direct interactions with it. In particular, the pre-training phase on synthetic devices enables the controller to design an architecture for the target device by interacting with it through a small number of high-fidelity latency measurements. To keep our method accessible, we train our controller only with training-free accuracy proxies, allowing us to scale the meta-training phase without incurring the overhead of full network training. We benchmark on HW-NATS-Bench, demonstrating that our method generalizes to unseen devices and searches for latency-efficient architectures by in-context adaptation using only a few real-world latency evaluations at test time.

2504.00289 2026-05-18 cs.CL cs.AI cs.CY

Do Chinese models speak Chinese languages?

中国的模型会说中国的各种语言吗?

Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno

AI总结 本文通过比较中西方开源大模型的多语言能力,发现中国模型在多数语言上表现与西方模型相似,但对部分中国少数民族语言识别能力较弱,揭示了多语言发展中的优先级与权衡。

Comments First and second author contribute equally

详情
AI中文摘要

顶级开源大模型的发布巩固了中国在AI发展中的领先地位。这些模型支持中国使用的语言吗?还是与美国或欧洲开发的模型支持相同的语言?比较多语言能力对于两个原因很重要:首先,语言能力提供了关于预训练数据编纂的见解,从而揭示了资源分配和发展优先级;其次,中国模型开发者需要在服务于国内语言多样化的群体与优化全球可见基准(主要为英语)之间取得平衡。我们通过比较中国开发和西方开发的开源大模型,在21种语言变体(包括亚洲地区、中文和欧洲语言)上进行了研究。我们的信息平衡和阅读理解实验表明,中国模型在这些语言上的表现与西方模型高度相关(r=0.93),唯一的例外是中文表现更好。中国开发的模型在法语和德语方面表现良好,但有时无法识别中国少数民族语言,如哈萨克语和维吾尔语。总体而言,所有研究的开源大模型在多语言表现上相似,尽管模型开发者所处的语言和文化背景各不相同。我们将这种同质化解释为全球基准实践和共享训练资源影响的结果。而不是将当前语言支持视为不可避免,我们的结果强调多语言发展是一个优先级和权衡的空间,对模型开发者、政策制定者和用户都有影响。

英文摘要

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they support the same languages as models developed in the United States or in Europe? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, Chinese model developers need to navigate the tension between serving a linguistically diverse population domestically, and optimizing for globally visible benchmarks that are predominantly English. We investigate Chinese model developers' priorities through a comparative study of Chinese-developed and Western-developed open-weight LLMs, on 21 language variants including Asian regional, Chinese, and European languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with their Western counterparts, with the sole exception being better Mandarin. Chinese-developed models are good at French and German, but they sometimes cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur. Overall, all open-weight LLMs we study have a similar multilingual performance profile, despite the diverse linguistic and cultural contexts the model developers operated within. We interpret the homogenization as consistent with the influence of global benchmarking practices and shared training resources. Rather than treating current language support as inevitable, our results highlight multilingual development as a space of prioritization and trade-offs, with implications for model developers, policymakers, and users.

2503.13113 2026-05-18 cs.LG math.OC

Exploring the Potential of Bilevel Optimization for Calibrating Neural Networks

探索双层优化在校准神经网络中的潜力

Gabriele Sanguin, Arjun Pakrashi, Marco Viola, Francesco Rinaldi

AI总结 本文提出基于双层优化的神经网络自校准训练方法,在Blobs、Spirals等玩具数据集和血液酒精浓度(BAC)等模拟数据集上与保序回归进行比较,结果表明该方法能在保持准确率的同时降低校准误差。

详情
AI中文摘要

处理不确定性对于确保智能系统中的可靠决策至关重要。现代神经网络已知校准不佳,导致预测置信度分数难以使用。本文探讨通过应用双层优化来改进置信度估计和校准,双层优化是一种用于求解各优化层次相互依赖的分层问题的框架。本文介绍了一种自校准的双层神经网络训练方法,以改进模型的预测置信度分数。我们使用Blobs和Spirals等玩具数据集以及血液酒精浓度(BAC)等更贴近实际的模拟数据集分析了所提框架的有效性,并将其与一种广为人知且广泛使用的校准策略保序回归(isotonic regression)进行比较。实验结果表明,所提出的双层优化方法在保持准确率的同时降低了校准误差。

英文摘要

Handling uncertainty is critical for ensuring reliable decision-making in intelligent systems. Modern neural networks are known to be poorly calibrated, resulting in predicted confidence scores that are difficult to use. This article explores improving confidence estimation and calibration through the application of bilevel optimization, a framework designed to solve hierarchical problems with interdependent optimization levels. A self-calibrating bilevel neural-network training approach is introduced to improve a model's predicted confidence scores. The effectiveness of the proposed framework is analyzed using toy datasets, such as Blobs and Spirals, as well as more practical simulated datasets, such as Blood Alcohol Concentration (BAC). It is compared with a well-known and widely used calibration strategy, isotonic regression. The reported experimental results reveal that the proposed bilevel optimization approach reduces the calibration error while preserving accuracy.
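As a rough illustration of what "calibration error" can mean in practice, the sketch below computes a binned expected calibration error (ECE). This is an assumption for illustration only: the abstract does not state which calibration metric the authors report, and the function name and the choice of ten equal-width bins are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    Illustrative metric only; not necessarily the one used in the paper.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Four predictions made with 90% confidence, three of which are correct.
score = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(score)
```

A well-calibrated model has a score close to zero, because within each bin the fraction of correct predictions matches the average stated confidence.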

2503.00794 2026-05-18 cs.RO

Detecting Heel Strike and toe off Events Using Kinematic Methods and LSTM Models

利用运动学方法和LSTM模型检测脚跟触地和脚尖离地事件

Longbin Zhang, Zhizhang Li, Xinyi Fu, Yi Xie, Xiaoyue Yan, Suiyuan Wang, Te Zhang, Hui Zhang, Kailun Yang, Tsung-Lin Wu, Prayook Jatesiktat, Ananda Sidarta, Wei Tech Ang

AI总结 本文评估了七种运动学方法和LSTM模型在检测脚跟触地和脚尖离地事件中的性能,发现Zeni等方法在运动学方法中准确率最高,而LSTM模型提供了无系统偏差的数据驱动替代方案。

详情
AI中文摘要

准确的步态事件检测对于步态分析、康复和辅助技术至关重要,特别是在外骨骼控制中,精确识别支撑相和摆动相尤为关键。本研究评估了七种基于运动学的方法和一个长短期记忆(LSTM)模型,在588名健康受试者4363个步态周期中检测脚跟触地和脚尖离地事件的表现。结果表明,尽管Zeni等方法在运动学方法中实现了最高准确率,其他方法表现出系统性偏差或需要数据集特定的调优。LSTM模型的表现与Zeni等方法相当,提供了一种数据驱动的替代方案,无系统性偏差。这些发现突显了基于深度学习的方法在步态事件检测中的潜力,同时强调了在临床人群和多样步态条件下进一步验证的必要性。未来研究将探索这些方法在病理人群(如中风后患者和膝关节骨性关节炎患者)中的泛化能力,以及在不同步态条件和数据收集设置中的鲁棒性,以提高其在康复和外骨骼控制中的应用性。

英文摘要

Accurate gait event detection is crucial for gait analysis, rehabilitation, and assistive technology, particularly in exoskeleton control, where precise identification of stance and swing phases is essential. This study evaluated the performance of seven kinematics-based methods and a Long Short-Term Memory (LSTM) model for detecting heel strike and toe-off events across 4363 gait cycles from 588 able-bodied subjects. The results indicated that while the Zeni et al. method achieved the highest accuracy among kinematics-based approaches, other methods exhibited systematic biases or required dataset-specific tuning. The LSTM model performed comparably to Zeni et al., providing a data-driven alternative without systematic bias. These findings highlight the potential of deep learning-based approaches for gait event detection while emphasizing the need for further validation in clinical populations and across diverse gait conditions. Future research will explore the generalizability of these methods in pathological populations, such as individuals with post-stroke conditions and knee osteoarthritis, as well as their robustness across varied gait conditions and data collection settings to enhance their applicability in rehabilitation and exoskeleton control.

2412.06853 2026-05-18 cs.LG cs.AI

Tube Loss: A Novel Approach for Prediction Interval Estimation

Tube Loss:预测区间估计的一种新方法

Pritam Anand, Tathagata Bandyopadhyay, Suresh Chandra

AI总结 本文提出Tube Loss损失函数,用于回归任务中同时估计预测区间边界。该方法能渐近达到指定置信水平,允许用户调整区间位置以优化覆盖范围和宽度,适用于偏斜分布。

详情
AI中文摘要

本文提出了一种名为'Tube Loss'的新损失函数,用于在回归任务中同时估计预测区间(PI)的上下界。基于Tube Loss最小化经验风险得到的PI在以下方面优于现有方法:首先,所得区间渐近达到预先指定的置信水平t∈(0,1),文中给出了理论证明。其次,用户可通过调整一个参数上下移动区间,以捕捉响应变量概率分布的密集区域,从而缩小区间宽度,这在响应变量的条件分布偏斜时尤其有用。该方法通过求解单个优化问题在覆盖率和平均宽度之间进行权衡,并可通过重新校准进一步减小平均宽度。不同于部分现有方法,该方法可以使用梯度下降法最小化经验风险。通过大量实验,我们证明了基于Tube Loss的PI估计在核机器和神经网络中的有效性,并展示了基于Tube Loss的深度概率预测模型在多个基准和风能数据集上优于现有概率预测技术。最后,我们在共形预测(conformal prediction)框架下实证验证了Tube Loss方法的优势。代码可在https://github.com/ltpritamanand/Tube_loss获取。

英文摘要

This paper proposes a novel loss function, called 'Tube Loss', for simultaneous estimation of the bounds of a Prediction Interval (PI) in the regression setup. The PIs obtained by minimizing the empirical risk based on the Tube Loss are shown to be of better quality than the PIs obtained by existing methods in the following sense. First, it yields intervals that attain the prespecified confidence level t ∈ (0,1) asymptotically. A theoretical proof of this fact is given. Second, the user can move the interval up or down by controlling the value of a parameter. This helps the user choose a PI that captures denser regions of the probability distribution of the response variable, thereby narrowing its width. This is shown to be especially useful when the conditional distribution of the response variable is skewed. Further, the Tube Loss based PI estimation method can trade off coverage against average width by solving a single optimization problem. It enables further reduction of the average width of the PI through re-calibration. Also, unlike a few existing PI estimation methods, the gradient descent (GD) method can be used to minimize the empirical risk. Through extensive experiments, we demonstrate the effectiveness of Tube Loss-based PI estimation in both kernel machines and neural networks. Additionally, we show that Tube Loss-based deep probabilistic forecasting models achieve superior performance compared to existing probabilistic forecasting techniques across several benchmark and wind datasets. Finally, we empirically validate the advantages of the Tube Loss approach within the conformal prediction framework. Codes are available at https://github.com/ltpritamanand/Tube_loss.
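The two quantities the abstract trades off, coverage and average width, can be measured on any set of predicted intervals. The sketch below shows that evaluation only; it does not implement the Tube Loss itself, whose formula is not given in the abstract, and the function name is hypothetical.

```python
def interval_quality(y_true, lower, upper):
    """Return (empirical coverage, mean width) for a set of prediction intervals.

    Coverage is the fraction of targets that fall inside [lower, upper];
    it should approach the nominal confidence level t for a good PI method.
    """
    n = len(y_true)
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_width


# Hypothetical targets and a constant interval [0, 2] for every sample.
coverage, width = interval_quality([1.0, 2.0, 3.0, 4.0], [0.0] * 4, [2.0] * 4)
print(coverage, width)
```

Between two methods that reach the same coverage, the one with the smaller mean width gives the more informative intervals.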

2405.13901 2026-05-18 cs.CV cs.LG eess.SP

Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers

基于离散余弦变换的去相关注意力机制用于视觉Transformer

Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Ahmet Enis Cetin, Ulas Bagci

AI总结 本文提出基于DCT的去相关注意力机制,通过改进初始化策略和压缩技术提升视觉Transformer的效率和性能,实验表明在Swin Transformer上显著降低计算开销且保持性能。

Comments This paper has been accepted to IJCAI-ECAI 2026

详情
AI中文摘要

自注意力机制是Transformer架构成功的关键,但学习查询、键和值投影仍具挑战性且计算成本高。本文提出两种互补方法,利用离散余弦变换(DCT)提升视觉Transformer的效率和性能。首先,引入基于DCT的初始化策略,通过DCT系数初始化投影权重,提升CIFAR-10和ImageNet-1K的分类精度。其次,提出基于DCT的注意力压缩技术,利用频域的去相关特性,通过截断高频成分减少查询、键和值投影的维度,不牺牲精度。实验表明,该压缩方法在Swin Transformer上显著降低计算开销,同时保持性能。

英文摘要

Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by introducing a simple yet effective DCT-based initialization strategy for self-attention, where projection weights are initialized using DCT coefficients. This structure-preserving approach consistently improves classification accuracy on the CIFAR-10 and ImageNet-1K benchmarks. Second, we propose a DCT-based attention compression technique that exploits the decorrelation properties of the frequency domain. By observing that high-frequency DCT coefficients typically correspond to noise, we truncate high-frequency components of the input patches, thereby reducing the dimensionality of the query, key, and value projections without sacrificing accuracy. Experiments on Swin Transformer models demonstrate that the proposed compression method achieves a substantial reduction in computational overhead while maintaining comparable performance.
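The compression idea, dropping high-frequency DCT coefficients to shrink the input dimension, can be sketched on a single 1-D feature vector. This is a minimal illustration assuming an orthonormal DCT-II and a hypothetical `keep` parameter; how the paper applies the transform to image patches and attention projections is not specified in the abstract.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append(
            [scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
        )
    return rows


def dct_truncate(x, keep):
    """Project x onto the `keep` lowest-frequency DCT basis vectors.

    The output has `keep` entries instead of len(x), so any projection
    applied afterwards operates on a smaller input dimension.
    """
    basis = dct_matrix(len(x))
    return [sum(b * v for b, v in zip(basis[k], x)) for k in range(keep)]


# A constant signal has all of its energy in the zero-frequency coefficient.
print(dct_truncate([1.0, 1.0, 1.0, 1.0], 2))
```

Because the basis is orthonormal, keeping all coefficients preserves the energy of the input, and truncation discards only the energy carried by the dropped high-frequency terms.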

2405.01557 2026-05-18 cs.LG

An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification

对不平衡分类中平衡方法拉什蒙效应的实验研究

Mustafa Cavus, Przemysław Biecek

AI总结 本文借助拉什蒙效应研究平衡方法对预测多样性的影响,发现平衡方法会放大预测多样性,并建议在平衡训练数据时使用扩展的性能-收益图来监控预测性能与预测多样性之间的权衡。

Comments 16 pages, 6 figures

详情
Journal ref
In Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2024, Communications in Computer and Information Science
AI中文摘要

预测模型在分类不平衡数据集时可能产生偏见预测,当模型偏向多数类时,少数类的准确预测性能会降低。为解决此问题,平衡或重采样方法是数据导向的AI关键方法,用于提升预测性能。然而,近年来对这些方法的功能存在争议。特别是许多候选模型可能在模型选择中表现出非常相似的预测性能,称为拉什蒙效应,且可能对同一观测产生不同预测。在不考虑预测多样性时选择模型可能导致盲目选择。本文通过拉什蒙效应考察了平衡方法对预测多样性的冲击。这很重要,因为数据导向的AI中,从一组几乎同样准确的模型中选择模型是危险的。这可能导致模型选择、验证和解释中的严重问题。为解决此问题,我们通过拉什蒙效应使用新提出的模糊度指标,结合现有的模糊性和差异性指标,进行了真实数据集实验,观察平衡方法对预测多样性的冲击。我们的发现表明,平衡方法会放大预测多样性并产生不同结果。为了监控预测性能与预测多样性之间的权衡,以负责任地进行建模过程,我们提出了在平衡训练数据时使用扩展的性能-收益图版本。

英文摘要

Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to low performance in accurately predicting the minority class. To address this issue, balancing or resampling methods are critical data-centric AI approaches in the modeling process to improve prediction performance. However, there have been debates and questions about the functionality of these methods in recent years. In particular, many candidate models may exhibit very similar predictive performance in model selection, a phenomenon called the Rashomon effect, and yet produce different predictions for the same observations. Selecting one of these models without considering predictive multiplicity, that is, conflicting predictions from competing models for the same sample, amounts to blind selection. In this paper, the impact of balancing methods on predictive multiplicity is examined using the Rashomon effect. This matters because blindly selecting a model from a set of approximately equally accurate models is risky in data-centric AI and may lead to severe problems in model selection, validation, and explanation. To tackle this matter, we conducted experiments on real datasets to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect, using a newly proposed metric, obscurity, in addition to the existing ones, ambiguity and discrepancy. Our findings showed that balancing methods inflate the predictive multiplicity and yield varying results. To monitor the trade-off between prediction performance and predictive multiplicity and thus conduct the modeling process responsibly, we proposed using the extended version of the performance-gain plot when balancing the training data.
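The two existing multiplicity metrics named in the abstract can be sketched as follows, using the definitions commonly given in the predictive multiplicity literature: ambiguity is the share of samples on which at least one competing model disagrees with the baseline, and discrepancy is the largest share of samples on which any single competing model disagrees with it. The function names are hypothetical, and the newly proposed obscurity metric is omitted because the abstract does not define it.

```python
def ambiguity(baseline, competitors):
    """Fraction of samples where at least one competitor disagrees with the baseline."""
    n = len(baseline)
    flipped = sum(
        1 for i in range(n) if any(model[i] != baseline[i] for model in competitors)
    )
    return flipped / n


def discrepancy(baseline, competitors):
    """Largest fraction of samples on which a single competitor disagrees with the baseline."""
    n = len(baseline)
    return max(
        sum(1 for i in range(n) if model[i] != baseline[i]) / n
        for model in competitors
    )


# Hypothetical predictions from a baseline and two near-equally accurate models.
base = [0, 1, 0, 1]
rashomon_set = [[0, 1, 1, 1], [1, 1, 0, 1]]
print(ambiguity(base, rashomon_set), discrepancy(base, rashomon_set))
```

Here each competitor flips one prediction, but on different samples, so ambiguity is higher than discrepancy.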

2312.05975 2026-05-18 cs.CV cs.AI cs.LG

FM-G-CAM: A Holistic Approach for Explainable AI in Computer Vision

FM-G-CAM:计算机视觉中可解释AI的综合方法

Ravidu Suien Rammuni Silva, Jordan J. Bird

AI总结 本文提出FM-G-CAM方法,通过综合考虑多个预测类别,提供CNN模型决策的全面解释,改进传统Grad-CAM的局限性。

详情
AI中文摘要

可解释性是现代AI在现实应用中的关键因素。本文旨在强调理解计算机视觉模型(特别是卷积神经网络)预测的必要性。现有方法主要基于梯度加权类激活图(Grad-CAM),仅关注单一目标类别,忽略了CNN预测过程的大部分内容。本文提出了一种全面的方法,称为融合多类梯度加权类激活图(FM-G-CAM),考虑多个高预测类别,提供预测器CNN的全面解释。我们还提供了详细数学和算法描述。此外,通过现实应用场景的定量和定性比较,展示了FM-G-CAM相较于Grad-CAM的优势。最后,我们提供了一个开源Python库,包含FM-G-CAM实现,方便生成CNN模型预测的显著图。

英文摘要

Explainability is a vital aspect of modern AI for real-world impact and usability. The main objective of this paper is to emphasise the need to understand the predictions of Computer Vision models, specifically Convolutional Neural Network (CNN) models. Existing methods for explaining CNN predictions are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus solely on a single target class; this assumption about the target class selection neglects a large portion of the predictor CNN's prediction process. In this paper, we present an exhaustive methodology, called Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), that considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN's rationale. We also provide a detailed mathematical and algorithmic description of our method. Furthermore, alongside a concise comparison of existing methods, we compare FM-G-CAM with Grad-CAM, quantitatively and qualitatively highlighting its benefits through real-world practical use cases. Finally, we present an open-source Python library with an FM-G-CAM implementation to conveniently generate saliency maps for CNN-based model predictions.

2308.06822 2026-05-18 cs.LG cs.AI cs.CR math.OC

Approximate and Weighted Data Reconstruction Attack in Federated Learning

联邦学习中的近似和加权数据重建攻击

Yongcun Song, Ziqi Wang, Enrique Zuazua

AI总结 本文提出了一种基于插值的近似方法,用于攻击联邦学习中的联邦平均场景,通过生成客户端本地训练过程中的中间模型更新,改进数据重建质量,并通过实验验证了其在图像数据重建中的优越性。

详情
AI中文摘要

联邦学习(FL)是一种分布式学习范式,允许多个客户端在不共享私人数据的情况下协作构建机器学习模型。尽管FL被设计为隐私保护,但最近的数据重建攻击表明,攻击者可以根据FL中共享的参数恢复客户端的训练数据。然而,大多数现有方法无法攻击最广泛使用的水平联邦平均(FedAvg)场景,其中客户端在多次本地训练步骤后共享模型参数。为了解决这个问题,我们提出了一种基于插值的近似方法,通过生成客户端本地训练过程中的中间模型更新,使攻击FedAvg场景成为可能。然后,我们设计了一种层间加权损失函数以提高数据重建质量。我们根据神经网络结构为不同层的模型更新分配不同的权重,权重通过贝叶斯优化进行调整。最后,实验结果验证了所提出的近似和加权攻击(AWA)方法在不同评估指标上优于其他最先进的方法,显示出在图像数据重建中的显著改进。

英文摘要

Federated Learning (FL) is a distributed learning paradigm that enables multiple clients to collaborate on building a machine learning model without sharing their private data. Although FL is considered privacy-preserving by design, recent data reconstruction attacks demonstrate that an attacker can recover clients' training data from the parameters shared in FL. However, most existing methods fail to attack the most widely used horizontal Federated Averaging (FedAvg) scenario, where clients share model parameters after multiple local training steps. To tackle this issue, we propose an interpolation-based approximation method, which makes attacking FedAvg scenarios feasible by generating the intermediate model updates of the clients' local training processes. Then, we design a layer-wise weighted loss function to improve the quality of the reconstructed data. We assign different weights to the model updates of different layers according to the neural network structure, with the weights tuned by Bayesian optimization. Finally, experimental results validate the superiority of our proposed approximate and weighted attack (AWA) method over other state-of-the-art methods, as demonstrated by substantial improvements in different evaluation metrics for image data reconstruction.
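One simple way to picture the interpolation step is to place intermediate models on the straight line between the weights a client received and the weights it sent back. The sketch below assumes linear interpolation over flattened weight lists; the abstract only says the method is interpolation-based, so the linear form, the function name, and the even spacing are all assumptions.

```python
def interpolate_weights(w_start, w_end, num_steps):
    """Approximate the models after each local step by linear interpolation.

    w_start: flattened weights the client received from the server.
    w_end:   flattened weights the client shared after `num_steps` local steps.
    Returns the num_steps - 1 intermediate models, evenly spaced between the two.
    """
    intermediates = []
    for step in range(1, num_steps):
        t = step / num_steps  # fraction of the local training completed
        intermediates.append(
            [(1 - t) * a + t * b for a, b in zip(w_start, w_end)]
        )
    return intermediates


# Hypothetical two-parameter model trained for four local steps.
for weights in interpolate_weights([0.0, 2.0], [4.0, 2.0], 4):
    print(weights)
```

An attacker who only observes the start and end weights could use such approximated intermediate models as stand-ins for the unobserved local updates.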

2306.04321 2026-05-18 cs.AI cs.MM

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

生成语义通信:扩散模型超越位恢复

Eleonora Grassucci, Sergio Barbarossa, Danilo Comminiello

AI总结 本文提出一种新的生成扩散框架,利用扩散模型合成多媒体内容并保留语义特征,通过空间自适应归一化生成语义一致的场景,提升在信道噪声下的图像生成质量。

详情
Journal ref
IEEE Transactions on Cognitive Communication and Networking, 2026
AI中文摘要

语义通信被认为是下一代AI通信的核心之一。其可能使接收端能再生与传输内容语义等价的图像或视频,而无需恢复传输的位序列。当前解决方案仍缺乏从接收到的有限信息中构建复杂场景的能力。本文提出一种新的生成扩散指导框架,利用扩散模型在合成多媒体内容和保留语义特征方面的强大能力,通过发送高度压缩的语义信息来减少带宽使用。然后,扩散模型通过空间自适应归一化从去噪的语义信息中学习生成语义一致的场景。通过深入评估多个场景,证明我们的方法在接收到显著退化的内容时,仍能生成高质量的图像并保留语义信息。具体而言,即使在通信信道极其嘈杂的条件下,对象、位置和深度仍可识别。代码可在https://github.com/ispamm/GESCO获取。

英文摘要

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.

2210.13455 2026-05-18 cs.LG cs.AI

Epistemic Monte Carlo Tree Search

认知蒙特卡洛树搜索

Yaniv Oren, Viliam Vadocz, Matthijs T. J. Spaan, Wendelin Böhmer

AI总结 本文提出Epistemic MCTS,通过考虑认知不确定性提升搜索效率,在代码编写等稀疏奖励任务中表现更优。

详情
AI中文摘要

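AlphaZero/MuZero(A/MZ)系列算法将蒙特卡洛树搜索(MCTS)与学习得到的模型相结合,在多个具有挑战性的领域取得了显著成功。学习得到的模型会引入认知不确定性,这种不确定性源于从有限数据中学习,并有助于在稀疏奖励环境中进行探索。然而,MCTS并未考虑这种不确定性的传播。为此,我们提出认知MCTS(EMCTS):一种有理论依据的方法,在搜索中考虑认知不确定性,并利用搜索实现深度探索。在用汇编语言SUBLEQ编写代码这一具有挑战性的稀疏奖励任务中,AZ与我们的方法结合后,样本效率显著高于基线AZ。使用EMCTS的搜索能够求解常用的困难探索基准Deep Sea的多个变体(基线A/MZ实际上无法求解这些变体),且速度远快于不使用搜索进行不确定性估计的等价方法,这表明利用搜索进行认知不确定性估计具有显著优势。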

英文摘要

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse reward environments. However, MCTS does not account for the propagation of this uncertainty. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency than baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea, which baseline A/MZ are practically unable to solve, much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

2605.15769 2026-05-18 cs.RO cs.AI

Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics

动态环境中的拉马克继承:关键变量如何影响进化动态

K. Ege de Bruin, Kyrre Glette, Kai Olav Ellefsen

AI总结 本文研究动态环境中关键变量对进化动态的影响,通过虚拟软机器人和两种学习方法,发现拉马克继承在环境变化冲突且不可预测时表现欠佳,但添加环境感知传感器可恢复其优势。

详情
AI中文摘要

在动态环境中机器人身体与控制器的共优化是一个耦合挑战:形态约束了哪些控制策略有效,而控制则决定了形态的表现。为了解决这一问题,我们结合形态优化作为进化与控制器优化作为生命周期学习,利用拉马克继承将学习到的控制器参数从父代传递给子代。在动态环境中,现有文献呈现矛盾证据:虽然传统进化理论通常认为拉马克继承无益,但最近的进化机器人研究显示它可以提高性能。我们假设这是因为以前的研究没有包含所有与动态环境相关的变量。在本工作中,我们发现拉马克继承的益处取决于两个变量:环境变化对机器人控制的冲突程度,以及这些变化对机器人代理的可预测性。使用虚拟软机器人和两种不同的学习方法,贝叶斯优化和强化学习,我们发现拉马克继承只在环境变化既冲突又不可预测时表现欠佳。我们发现添加一个检测环境变化的传感器可以恢复拉马克继承在冲突环境中的优势,通过允许机器人代理预测需要不同行为的需要,从而泛化其控制。

英文摘要

The co-optimization of a robot's body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. In dynamic environments, existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables with dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how conflicting the environmental changes are to robot control, and the predictability of those changes for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits for Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.

2605.15764 2026-05-18 cs.CV cs.AI

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

GRASP:学习在多人非语言互动中进行有依据的社会推理

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

AI总结 GRASP将高层社会问答与细粒度的目光和指示性手势事件相连接,以提升多人非语言互动中的社会推理能力;数据集包含29万个问答对,并提出Social Grounding Reward以提升模型性能。

Comments Project page: https://social-reaoning.github.io/grasp/

详情
AI中文摘要

理解社会互动需要对微妙的非语言线索进行推理,但当前多模态大语言模型(MLLMs)在多人视频中常无法识别谁与谁互动。我们引入GRASP,一个大规模社会推理数据集,将高层社会问答与细粒度的目光和指示性手势事件连接起来。GRASP包含290K个问题-答案对,覆盖46K个视频(总时长749小时),按涵盖目光、手势及目光-手势联合推理的16类分类体系组织,并附带用于评估的GRASP-Bench。不同于以往仅关注孤立线索或高层社会问答的资源,GRASP基于身份一致的目光轨迹、指示性手势及二者的联合组合构建社会事件,并据此生成问题。此外,我们提出Social Grounding Reward(SGR),一种利用这些社会事件鼓励模型对每次互动的参与者进行推理的学习信号。实验显示,SGR在GRASP-Bench上提升了性能,同时在相关社会视频问答基准上保持了零样本性能。

英文摘要

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question--answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze--gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.