arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.14841 2026-05-15 cs.LG cs.AI

GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning

Paolo Mandica, Michał Brzozowski, Zuzanna Dubanowska, Neo Christopher Chung

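AI summary: This paper proposes GPart, a new parameter-efficient fine-tuning method that achieves end-to-end isometric fine-tuning through global parameter partitioning, addressing the way conventional low-rank adaptation (LoRA) methods break distance preservation in the parameter mapping. GPart uses a single isometric partition matrix to map a low-dimensional trainable vector directly into the model's full weight space, which removes the low-rank bottleneck entirely and markedly improves parameter efficiency. Experiments show that GPart performs well on natural language understanding, computer vision, and mathematical reasoning tasks, reaching the state of the art among current parameter-efficient fine-tuning methods.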

Abstract

Low-rank adaptation (LoRA) has become the dominant paradigm for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). However, its bilinear structure introduces a critical limitation: the mapping from trainable parameters to weight updates is not distance-preserving, distorting the optimization landscape. Methods that project a low-dimensional vector into LoRA's parameter space, such as Uni-LoRA, improve parameter efficiency, but the subsequent bilinear LoRA map breaks end-to-end isometry, leaving the core distance-preservation problem unresolved. We propose GPart (Global Partition fine-tuning), a highly parameter-efficient fine-tuning method which removes the low-rank bottleneck entirely. Our method uses a single isometric partition matrix to map a $d$-dimensional trainable vector directly into the full weight space of the model. The result is an extremely minimal fine-tuning pipeline: one random projection, end-to-end isometric, with a single clean hyperparameter ($d$) and storage cost of $d+1$ values (the trainable vector plus a random seed). GPart builds on the theoretical premise that effective fine-tuning can emerge from random low-dimensional subspaces of the full weight space, without imposing low-rank matrix structure. We empirically demonstrate the superior or comparable performance of GPart to existing PEFT methods on natural language understanding, computer vision tasks, and mathematical reasoning. Overall, GPart achieves state-of-the-art efficiency and performance by removing structural constraints, offering a straightforward and elegant path to PEFT.
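
A minimal sketch of why a partition-style map can preserve distances. The abstract does not spell out how the partition matrix is built, so the construction here is an assumption: each weight index is randomly assigned to one of $d$ groups, and each group's shared value is scaled by one over the square root of the group size. The names `make_partition` and `project` are hypothetical and do not come from the paper.

```python
import math
import random

def make_partition(num_weights, d, seed):
    """Randomly assign each weight index to one of d groups (hypothetical scheme)."""
    rng = random.Random(seed)
    # Seed every group with one index so no group is empty, then fill at random.
    groups = list(range(d)) + [rng.randrange(d) for _ in range(num_weights - d)]
    rng.shuffle(groups)
    counts = [0] * d
    for g in groups:
        counts[g] += 1
    return groups, counts

def project(v, groups, counts):
    """Weight update delta[i] = v[g(i)] / sqrt(size of group g(i))."""
    return [v[g] / math.sqrt(counts[g]) for g in groups]

def norm(x):
    return math.sqrt(sum(t * t for t in x))

num_weights, d, seed = 1000, 8, 0
groups, counts = make_partition(num_weights, d, seed)
rng = random.Random(1)
v = [rng.gauss(0.0, 1.0) for _ in range(d)]   # the d trainable values
delta = project(v, groups, counts)            # update over all 1000 weights
print(norm(v), norm(delta))                   # the two norms agree up to rounding
```

Because the map is linear and norm-preserving, it also preserves distances between any two trainable vectors. In this toy setup the partition can be regenerated from the seed alone, which is consistent with the $d+1$ storage cost quoted in the abstract.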

2605.14840 2026-05-15 cs.LG math.OC stat.ML

In-Context Learning for Data-Driven Censored Inventory Control

Sohom Mukherjee, Anh-Duy Pham, Richard Pibernik, Yunbei Xu

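AI summary: This paper studies inventory control with decision-dependent censoring in a data-driven setting and proposes in-context generative posterior sampling (ICGPS), which combines offline meta-training of a generative model with online autoregressive generation to handle the fact that the order quantity determines whether demand is fully observed. In theory, the Bayesian regret of the method exceeds that of a TS benchmark with the ideal completion kernel by only a penalty term that scales with the square root of time. In practice the method is implemented with ChronosFlow, is robust to prior mismatch and distribution shift, and outperforms traditional methods on simulated and real datasets.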

Abstract

We study inventory control with decision-dependent censoring, focusing on the censored or repeated newsvendor (R-NV), where each order quantity determines whether demand is fully observed or censored by sales. Existing approaches based on parametric Thompson sampling (TS) can be brittle under prior mismatch, while offline imputation methods need not transfer to online learning. Motivated by the predictive view of decision making, we combine these ideas by taking oracle actions on learned completions of latent demand. We propose in-context generative posterior sampling (ICGPS), which uses modern generative models that are meta-trained offline and deployed online by in-context autoregressive generation. Theoretically, we show that the Bayesian regret of ICGPS with a learned completion kernel is bounded by the Bayesian regret of a TS benchmark with the ideal completion kernel plus a deployment penalty scaling as $\sqrt{T}$ times the square root of the completion mismatch. This yields a plug-in template for operational problems with known TS regret bounds. For R-NV, we derive sublinear Bayesian regret by reducing censored feedback to bandit convex optimization feedback. We also show that, under reasonable coverage and stability assumptions, the online completion mismatch is controlled by the offline censored predictive mismatch, so offline predictive quality transfers to online performance. Practically, we instantiate ICGPS with ChronosFlow, which combines a frozen time-series transformer backbone with a trainable conditional normalizing-flow head for fast censoring-consistent sampling. In benchmark experiments, ChronosFlow-ICGPS matches correctly specified TS, outperforms myopic and UCB-style baselines, and is robust to prior mismatch and distribution shift. ChronosFlow-ICGPS also performs well for the real-world SuperStore dataset, especially under heavy censoring.
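
A minimal sketch of the two ingredients the abstract describes: sales-censored feedback, and an oracle newsvendor action taken on completed demand samples. The oracle step uses the standard newsvendor critical-ratio quantile. The hard-coded `completions` list is a stand-in for samples that a generative model would produce, and the cost values and function names are illustrative assumptions, not taken from the paper.

```python
import math

def observe(demand, order):
    """Return (sales, censored). Sales are capped by the order quantity.

    Ties (demand == order) are treated as censored here, since a stockout
    hides whether more demand existed.
    """
    sales = min(demand, order)
    return sales, demand >= order

def newsvendor_order(samples, underage_cost, overage_cost):
    """Smallest sample q whose empirical CDF reaches b / (b + h)."""
    ratio = underage_cost / (underage_cost + overage_cost)
    ordered = sorted(samples)
    k = max(1, math.ceil(ratio * len(ordered)))
    return ordered[k - 1]

# Hypothetical completed demand samples (stand-in for model output).
completions = [4, 7, 5, 9, 6, 8, 3, 10, 5, 7]
order = newsvendor_order(completions, underage_cost=3.0, overage_cost=1.0)
sales, censored = observe(demand=12, order=order)
print(order, sales, censored)
```

With an underage cost of 3 and an overage cost of 1, the critical ratio is 0.75, so the order is the 0.75 empirical quantile of the completions. A true demand of 12 is then observed only as censored sales equal to the order.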

2605.14839 2026-05-15 cs.LG eess.SP

GenAI for Energy-Efficient and Interference-Aware Compressed Sensing of GNSS Signals on a Google Edge TPU

Thorben Wegner, Lucas Heublein, Tobias Feigl, Felix Ott, Christopher Mutschler, Alexander Rügamer

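AI summary: This paper proposes a new generative AI (GenAI) approach for efficiently compressing global navigation satellite system (GNSS) signals and classifying interference on a Google Edge TPU. The method uses variational autoencoders (VAEs) to compress signals directly at the receiver while identifying jamming and spoofing attacks in real time, substantially reducing the energy spent on data transmission and processing. Experiments show a compression ratio above 42x while preserving signal characteristics, with accurate classification of approximately 72 interference types on the reconstructed signals, offering an efficient and practical solution for GNSS interference mitigation.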

Comments 12 pages

Journal ref
IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025
Abstract

Traditional methods for classifying global navigation satellite system (GNSS) jamming signals typically involve post-processing raw or spectral data streams, requiring complex and costly data transmission to cloud-based interference classification systems. In contrast, our proposed approach efficiently compresses GNSS data streams directly at the hardware receiver while simultaneously classifying jamming and spoofing attacks in real time. Given the growing prevalence of GNSS jamming, there is a critical need for real-time solutions suitable for power-constrained environments. This paper introduces a novel method for compressing and classifying GNSS jamming threats using generative artificial intelligence (GenAI), specifically variational autoencoders (VAEs), deployed on Google Edge tensor processing units (TPUs). The study evaluates various autoencoder (AE) architectures to compress and reconstruct GNSS signals, focusing on preserving interference characteristics while minimizing data size near the receiver hardware. The pipeline adapts large-scale AE models for Google Edge TPUs through 8-bit quantization to ensure energy-efficient deployment. Tests on raw in-phase and quadrature-phase (IQ) data, Fast Fourier Transform (FFT) data, and handcrafted features show the system achieves significant compression (>42x) and accurate classification of approximately 72 interference types on reconstructed signals (F2-score 0.915), closely matching the original signals (F2-score 0.923). The hardware-centric GenAI approach also substantially reduces jammer signal transmission costs, offering a practical solution for interference mitigation. Ablation studies on conditional and factorized VAEs (i.e., FactorVAE) explore latent feature disentanglement for data generation, enhancing model interpretability and fostering trust in machine learning (ML) solutions for sensitive interference applications.
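
For readers unfamiliar with the F2-score reported above, the general F-beta definition is $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, which weights recall more heavily than precision when $\beta = 2$. The short sketch below computes it. The precision and recall values are illustrative only and are not results from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# Illustrative values: a classifier with perfect recall but low precision.
f2 = f_beta(0.5, 1.0, beta=2.0)
f1 = f_beta(0.5, 1.0, beta=1.0)
print(f2, f1)  # F2 is higher than F1 because recall dominates
```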

2605.14838 2026-05-15 cs.CV cs.MM

Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

Bolin Zhang, Chao Yang, Bin Jiang, Takahiro Komamizu, Ichiro Ide

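AI summary: This paper studies weakly-supervised video moment retrieval (VMR), which aims to find the moment in an untrimmed video that is semantically similar to a given query using only video-level correspondences, without temporal annotations. To address the shortcomings of existing methods in generating high-quality temporal proposals, distinguishing misaligned moments within the same video, and model stability, it proposes MCMT, a new weakly-supervised method based on multi-proposal collaboration and multi-task training. MCMT generates multiple proposals, combines their learnable Gaussian masks into a high-quality positive sample mask, and introduces forward and inverse masked query reconstruction tasks to improve robustness and retrieval performance. Experiments show strong results on two standard benchmarks.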

Comments 26 pages, 4 figures. Preprint version of the article published in International Journal of Machine Learning and Cybernetics

Journal ref
International Journal of Machine Learning and Cybernetics 16, 4509-4524 (2025)
Abstract

This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
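
A minimal sketch of the mask-combination idea, under stated assumptions. Each proposal is taken to be a (center, width) pair in normalized video time, its Gaussian mask is evaluated over a fixed number of clips, and the masks are combined by simple averaging. The abstract does not give the combination rule or the width-to-sigma mapping, so both are assumptions. In the paper the mask parameters are learned, whereas here they are fixed numbers.

```python
import math

def gaussian_mask(center, width, num_clips, sharpness=3.0):
    """Gaussian weight per clip; sigma = width / sharpness is an assumed mapping."""
    sigma = width / sharpness
    mask = []
    for t in range(num_clips):
        pos = t / (num_clips - 1)          # clip position in [0, 1]
        mask.append(math.exp(-((pos - center) ** 2) / (2 * sigma ** 2)))
    return mask

def combine(masks):
    """Average the masks clip by clip (assumed combination rule)."""
    return [sum(col) / len(masks) for col in zip(*masks)]

# Three hypothetical proposals that roughly agree on the middle of the video.
proposals = [(0.5, 0.3), (0.55, 0.2), (0.45, 0.4)]
masks = [gaussian_mask(c, w, num_clips=11) for c, w in proposals]
positive = combine(masks)
best_clip = max(range(len(positive)), key=lambda i: positive[i])
print(best_clip, [round(m, 3) for m in positive])
```

Clips with low weight in the combined mask would play the role of the easy negatives described in the abstract.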

2605.14833 2026-05-15 cs.AI cs.HC

Emotion-Attended Stateful Memory (EASM): The Architecture for Hyper-Personalization at Scale

Vineet Kotecha, Vansh Gupta

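AI summary: Current language model systems are fundamentally stateless across sessions, which limits their ability to personalize interactions over time. This paper proposes an emotion-attended stateful memory (EASM) architecture that dynamically constructs user-specific conversational context at inference time from long-term history, emotional signals, and inferred intent. Experiments show marked gains in memory grounding, plan clarity, and emotional validation across conversations in several emotional categories, with stable results in difficult emotional scenarios such as grief and anxiety, suggesting a new infrastructure approach for highly personalized AI systems.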

Comments 18 pages, 3 figures, 3 tables. Industry research whitepaper. Includes controlled A/B evaluation across 30 scenarios and 6 emotional categories

Abstract

Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary.

2605.14832 2026-05-15 cs.RO cs.CV

Learning Direct Control Policies with Flow Matching for Autonomous Driving

Marcello Ceresini, Federico Pirazzoli, Andrea Bertogalli, Lorenzo Cipelli, Filippo D'Addeo, Anthony Dell'Eva, Alessandro Paolo Capasso, Alberto Broggi

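AI summary: This paper proposes a flow-matching planner for autonomous driving that directly generates executable control trajectories defined by acceleration and curvature. The method is conditioned on a bird's-eye-view (BEV) input and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency real-time closed-loop re-planning. The model is trained on urban scenarios based on real streets of the city of Parma, Italy, collected from a 2D traffic simulator, and is tested in closed loop in both in-distribution and markedly out-of-distribution environments. It maintains stable control and completes tasks in unseen scenarios, which the authors attribute mainly to the robustness of the BEV representation and the flow-matching formulation to distribution shift.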

Comments 16 pages, 6 figures, 2 tables. Accepted at IEEE ITSC 2026

Abstract

We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of ordinary differential equation (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real urban city streets, intersections and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at https://marcelloceresini.github.io/DirectControlFlowMatching.
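
A minimal sketch of few-step ODE sampling as used in flow matching: start from noise and take a small number of Euler steps along a velocity field. The learned, BEV-conditioned network is replaced here by a toy field, `toy_velocity`, which follows a straight path toward a fixed target. That field, the target values, and the step count are all assumptions for illustration and are not the authors' model.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for the learned field: velocity of the straight path to `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t never reaches 1, so 1 - t stays positive
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical control sequence: four accelerations followed by four curvatures.
target = [0.5, 0.4, 0.2, 0.0, 0.01, 0.02, 0.02, 0.01]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
controls = euler_sample(noise, lambda x, t: toy_velocity(x, t, target), num_steps=4)
print([round(c, 3) for c in controls])
```

For a perfectly straight path, even a handful of Euler steps lands on the target, which illustrates why a small step count can be enough for low-latency re-planning when the learned field is close to straight.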

2605.14831 2026-05-15 cs.AI cs.LG

Interestingness as an Inductive Heuristic for Future Compression Progress

Vincent Herrmann, Jürgen Schmidhuber

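AI summary: This paper studies interestingness as an inductive heuristic for future compression progress, targeting a bottleneck in recursively self-improving systems: identifying which tasks or data hold potential for progress. Using tools from Algorithmic Statistics and Kolmogorov Complexity, the authors show that interestingness is theoretically viable and empirically supported, and find that expected future progress depends exponentially on the recency of the last breakthrough. The study also shows that the Algorithmic Prior gives more optimistic estimates of expected discovery than the Length Prior, with experimental confirmation across three different computational paradigms.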

Abstract

One of the bottlenecks on the way towards recursively self-improving systems is the challenge of interestingness: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, we demonstrate that the inductive property of interestingness -- the capacity for past progress to signal future discovery -- is theoretically viable and empirically supported. We prove that expected future progress depends exponentially on the recency of the last observed breakthrough. Furthermore, we show that the Algorithmic Prior is significantly more optimistic than the Length Prior, yielding a quadratic increase in expected discovery for the same observed profile. These findings are experimentally confirmed across three diverse universal computational paradigms.

2605.14824 2026-05-15 cs.LG math.AT

ToMAToMP: Robust and Multi-Parameter Topological Clustering

Ludo Andrianirina, Mathieu Carrière

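AI summary: This paper proposes ToMAToMP, a new topological clustering method that addresses limitations of the classical ToMATo algorithm: it cannot handle several functions at once, it is sensitive to outliers, and it depends on a graph-structure parameter. Built on the MMA decomposition from multi-parameter persistent homology, the method handles several functions simultaneously and comes with theoretical robustness guarantees. Experiments show that ToMAToMP yields better clusterings than both non-topological and topological baselines on several datasets.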

Abstract

Topological clustering, and its main algorithm ToMATo, is a clustering method from Topological Data Analysis (TDA) which has been applied successfully in several applications during the last few years. This is due to its high versatility, as clusters are detected from the persistent components in the sublevel sets of any user-defined function (gene expression, pixel values, etc), and efficiency, as topological clustering enjoys robustness guarantees. However, ToMATo is also limited in several ways. First, a graph on the data points needs to be provided as a hyper-parameter of the method (whose fine-tuning is left to the user). Second, ToMATo is known to be very sensitive to outlier values in the function range. Finally, and most importantly, ToMATo can only handle one function at a time, whereas it is critical to use several functions in various applications. In this article, we introduce ToMAToMP: the first topological clustering method able to handle several functions at the same time with theoretical guarantees. More specifically, we leverage a recent tool from multi-parameter persistent homology, called MMA decomposition, to design our clustering algorithm, and prove that it enjoys robustness properties. As corollaries, we show that it can be used to make ToMATo independent of graph tuning, and robust to outliers. Finally, we provide a set of numerical experiments showcasing the efficiency and quality of the clusterings produced by ToMAToMP, by showing strong improvement over non-topological and topological baselines for various datasets.

2605.14821 2026-05-15 cs.CV

HDRFace: Rethinking Face Restoration with High-Dimensional Representation

Zirui Wang, Xianhui Lin, Yi Dong, Bo Wei, Gangjian Zhang, Siteng Ma, Zebiao Zheng, Xing Liu, Hong Gu, Minjing Dong

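AI summary: Face restoration under complex degradations remains an ill-posed inverse problem with severe information loss. This paper proposes HDRFace, a face restoration framework based on high-dimensional representations that improves restoration quality by introducing semantically rich priors without changing the generative backbone. The method first obtains a structurally reliable intermediate result from an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from the input and the intermediate result, and injects them into the generation process as additional conditions. It also proposes a structure-detail aware adaptive fusion mechanism (SDFM) that emphasizes global constraints during structure modeling and strengthens representation guidance during detail generation, balancing structural consistency and detail fidelity.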

Abstract

Face restoration under complex degradations still remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.

2605.14819 2026-05-15 cs.CV

The Velocity Deficit: Initial Energy Injection for Flow Matching

Linze Li, Zong-Wei Hong, Shen Zhang, Bo Lin, Jinglun Li, Yao Tang, Jiajun Liang

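AI summary: This paper identifies a phenomenon called the Velocity Deficit: in high-dimensional flow matching, the mean squared error (MSE) objective systematically underestimates velocity magnitude, so generated samples fail to reach the data manifold, producing integration lag. To address this, the authors propose Initial Energy Injection, comprising the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC), and reveal that velocity contraction has asymmetric effects at the start and the end of the trajectory. Experiments show that SSC markedly improves generation quality and speeds up generation on tasks such as ImageNet-1k, and the methods also apply to text-to-image and high-resolution image generation.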

Comments Accepted by ICML2026

Abstract

While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
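
The headline numbers in the abstract can be checked with a few lines of arithmetic. This sketch assumes that the 44.6% figure is the relative reduction in FID and that the 5x speedup is the ratio of sampling steps (250 versus 50). Both readings are consistent with the quoted values but are interpretations, not statements from the paper.

```python
baseline_fid, ssc_fid = 13.68, 7.58          # FID values quoted in the abstract
relative_improvement = (baseline_fid - ssc_fid) / baseline_fid
step_speedup = 250 / 50                      # 250-step baseline vs 50-step generator

print(f"relative FID reduction: {relative_improvement:.1%}")
print(f"step-count speedup: {step_speedup:.0f}x")
```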

2605.14816 2026-05-15 cs.CL

Conversion of Lexicon-Grammar tables to LMF. Application to French

Eric Laporte, Elsa Tolone, Mathieu Constant

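AI summary: This paper describes the first experiment in converting the Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar is a major source of lexical and syntactic information for French, and converting it into an LMF-compliant format helps improve the standardization and interoperability of natural language processing dictionaries. The paper analyzes the main difficulties encountered during the conversion and describes the resulting resource.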

Journal ref
LMF. Lexical Markup Framework, 2013, ISTE - Wiley, pp. 157-187
Abstract

We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.

2605.14815 2026-05-15 cs.CV

Probing into Camera Control of Video Models

Chen Hou, Christian Rupprecht

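AI summary: This paper studies camera control in video generation models, with the goal of enabling models to produce geometrically meaningful content. Unlike previous methods that rely on additional modules and paired data, the authors treat camera control as a form of geometric guidance, implemented through differentiable resampling of latent features during denoising. The method needs no additional training, applies to most video diffusion models, and can be used to probe the camera control capabilities of base models, revealing shared biases and performance disparities among existing models on multi-view generation.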

Abstract

Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.

2605.14810 2026-05-15 cs.RO

CaMeRL: Collision-Aware and Memory-Enhanced Reinforcement Learning for UAV Navigation in Multi-Scale Obstacle Environments

Hong Hong, Feiyu Liao, Yongheng Liang, Boning Zhang, Haitao Wang, Hejun Wu

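AI summary: In UAV obstacle-avoidance navigation, variation in obstacle scale is often overlooked, and existing methods usually rely only on geometric features from single-frame depth observations, which makes it hard to cope with multi-scale obstacle environments. This paper proposes CaMeRL, a reinforcement learning framework that combines collision awareness with memory enhancement: it encodes fine-grained obstacle structure to improve sensitivity to small obstacles and uses a temporal memory module to mitigate the partial observability caused by occlusion from large obstacles. Experiments show that CaMeRL outperforms existing methods in both ultra-small and extra-large obstacle settings and navigates reliably in complex outdoor environments.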

Comments 8 pages, 7 figures. Submitted to IEEE Robotics and Automation Letters

Abstract

In obstacle avoidance navigation of unmanned aerial vehicles (UAVs), variations in obstacle scale have received surprisingly less attention than obstacle number or density. Existing methods typically extract purely geometric features from single-frame depth observations. Such representations tend to neglect small obstacles and lose spatial context under occlusions caused by large obstacles, leading to noticeable degradation in environments with multi-scale obstacles. To address this issue, we propose CaMeRL, a Collision-aware and Memory-enhanced Reinforcement Learning framework for UAV navigation. The collision-aware latent representation encodes risk-sensitive depth cues to preserve fine-grained obstacle structures, thereby improving sensitivity to small obstacles. The temporal memory module integrates observations across frames, mitigating partial observability caused by large-obstacle occlusions. We evaluate CaMeRL with multi-scale obstacles, including ultra-small and extra-large obstacle settings. Results show that CaMeRL outperforms state-of-the-art baselines across all scales, with success rate gains of 0.48 and 0.28 in the ultra-small and extra-large settings, respectively. More importantly, CaMeRL achieves reliable navigation in cluttered outdoor environments.

2605.14808 2026-05-15 cs.CV

SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track

Lukas Roming, Felix Lehnerer, Jonas V. Funk, Andreas Michel, Georg Maier, Thomas Längle, Jürgen Beyerer

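AI summary: This paper proposes SuperADD, a training-free, class-agnostic industrial anomaly segmentation method that addresses the distribution shift caused by changing acquisition conditions in production environments. Building on SuperAD, the method introduces a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling, and iterative morphological closing, improving robustness and generalization under varying illumination. Experiments show that SuperADD achieves better segmentation performance than existing methods on the MVTec AD 2 dataset and suits industrial settings where product variants and appearance changes must be handled efficiently.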

Comments Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track

Abstract

Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of $62.61\%$, $57.42\%$, and $54.35\%$ on test public, private, and private mixed of MVTec AD 2 respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.

2605.14805 2026-05-15 cs.RO

Learning Cross-Coupled and Regime Dependent Dynamics for Aerial Manipulation

Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy, Wei Pan

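AI summary: This paper studies the need for accurate dynamics models when aerial manipulators perform complex tasks such as payload transport, and proposes a structured encoder-decoder framework for their nonstationary dynamics, which arise from strong coupling, delayed aerodynamic effects, and payload changes. A nonlinear encoder captures cross-coupling and temporal dependencies from state-input histories, and a lightweight linear decoder adapts online to nonstationary residual dynamics, improving prediction accuracy and trajectory tracking. Experiments on a real platform show better adaptation and control performance.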

Abstract

Accurate dynamics models are critical for aerial manipulators operating under complex tasks such as payload transport. However, modeling these systems remains fundamentally challenging due to strong quadrotor-manipulator coupling, delayed aerodynamic interactions, and regime-dependent dynamics variations arising from payload changes and manipulator reconfiguration. These effects produce residual dynamics that are simultaneously cross-coupled, history-dependent, and nonstationary, causing both analytical models and purely offline learned models to degrade during deployment. To address these challenges, we propose a structured encoder-decoder framework for adaptive residual dynamics learning in aerial manipulators. The proposed nonlinear latent encoder captures cross-variable coupling and temporal dependencies from state-input histories, while a lightweight linear latent decoder enables online adaptation under regime-dependent nonstationary dynamics. The linear-in-parameter decoder structure permits closed-form Bayesian adaptation together with consistency-driven covariance inflation, enabling rapid and stable adaptation to both transient and slowly varying dynamics changes while remaining compatible with real-time model predictive control (MPC). Experimental results on a real aerial manipulation platform demonstrate improved residual prediction accuracy, faster adaptation under changing operating conditions, and enhanced MPC-based trajectory tracking performance. These results highlight the importance of jointly modeling coupled temporal dynamics and deployment-time nonstationarity for reliable aerial manipulation.

2605.14802 2026-05-15 cs.AI

A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

Zhao Yang, Wang Huan, Li Yingshuo, Tu Haomiao, Lin Hujite

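AI summary: This study targets the fact loss, timeline confusion, persona drift, and reduced stability that large language models face in long-term interaction, and proposes ARPM, a heterogeneous temporal memory governance framework. The framework separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, and other techniques to provide traceable governance of continuity and persona consistency. Experiments show that ARPM maintains semantic continuity and persona consistency in high-noise settings, and that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.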

Comments 23 pages, 5 figures, 2 tables. Preprint version. Code for ARPM v4.0 is available at: https://github.com/Spirtxiaoqi7/ARPM

Abstract

Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
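
A minimal sketch of reciprocal rank fusion (RRF), the step the abstract names for merging vector-retrieval and BM25 results. Each document scores the sum of $1/(k + \text{rank})$ over the rankings in which it appears. The constant `k=60` is the conventional default for RRF, not a value reported for ARPM, and the memory identifiers are made up for illustration.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical memory identifiers returned by the two retrievers.
vector_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m9", "m3"]
fused = rrf([vector_hits, bm25_hits])
print(fused)
```

Items that both retrievers return rise to the top, while an item found only by BM25 still survives into the fused list. That is consistent with the ablation finding that pure semantic retrieval alone is insufficient.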

2605.14801 2026-05-15 cs.RO

Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN

Ziyi Xia, Chaoran Xiong, Litao Wei, Xinhao Hu, Ling Pei

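AI summary: This paper studies a key bottleneck in vision-and-language navigation (VLN) systems: the conflict between the pursuit of pixel-level accuracy by current 3D perception models and the real-time, compute-efficiency demands of navigation. Based on typical VLM-LLM frameworks, the authors propose statistical success rate upper bounds for two core subsystems and show that gains in perception accuracy beyond a certain point yield diminishing returns in navigation performance. The study argues that 3D scene understanding for VLN should focus on navigation-relevant vocabulary and bounding-box accuracy rather than pure pixel-level precision.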

Comments Accepted by ICRA Workshop MM-Spatial AI, Oral

Abstract

Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models (VLMs) and Large Language Models (LLMs), where VLMs construct 3D scene graphs while LLMs handle high-level reasoning and decision-making. However, a critical bottleneck exists in this system: current 3D perception models prioritize pixel-level accuracy, directly conflicting with the strict computational limits and real-time efficiency demanded by embodied navigation. To address this gap, this paper quantifies the actual impact of 3D scene understanding capability on VLN performance. Based on typical VLM-LLM frameworks, we propose statistical success rate (SR) upper bounds for two core subsystems: 1) the slow LLM planner, which relies on topological mapping semantics, and 2) the fast reactive navigator, which utilizes spatial coordinates and bounding boxes to execute LLM decisions. Evaluations using state-of-the-art 3D scene understanding models validate our proposed bounds and reveal a perception saturation phenomenon, indicating that improvements in perception accuracy beyond a certain threshold yield diminishing returns in navigation success. Our findings suggest that 3D scene understanding for VLN should pivot away from strict pixel-level precision, prioritizing instead navigation-relevant core vocabularies and accurate bounding box proportions.

2605.14795 2026-05-15 cs.CV

COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

Shukun Jia, Shiyu Hu, Yipei Wang, Ximeng Cheng, Yichao Cao, Xiaobo Lu

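AI summary This paper studies how to improve the discriminative ability of referring multi-object tracking (RMOT) under sparse semantic supervision. It proposes the COAL framework, which introduces explicit semantic injection and a counterfactual learning strategy to strengthen recognition of complex semantic structures. COAL combines a vision-language model and a large language model in a hierarchical multi-stream integration architecture, mitigating the overfitting and semantic collapse caused by sparse supervision. Experiments show significant gains on multiple benchmarks; on the challenging Refer-KITTI-V2 dataset in particular, the method surpasses the previous state of the art.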

Details
English abstract

Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.

2605.14790 2026-05-15 cs.CL cs.AI

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

Songyang Gao, Yinghui Xia, Siyi Liu, Hui Xiong

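AI summary This paper proposes Graphs of Research (GoR), a supervised fine-tuning method for improving LLM-based research idea generation. For each seed paper, the method builds a two-hop citation neighborhood and uses citation position, frequency, predecessor links, and publication time to derive a paper-evolution directed acyclic graph (DAG), which serves as the supervision signal for training. Experiments show that GoR achieves the best performance in comparisons against GPT-4o-based baselines, supporting citation-evolution graphs as a supervision signal for research idea generation.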

Details
English abstract

Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit ideas from LLMs through static retrieval of relevant literature or complex prompt engineering, discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation-evolution graphs as supervision, accelerating automated scientific innovation.

2605.14785 2026-05-15 cs.LG cs.CV

Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning

Alberto Tamajo, Srinandan Dasmahapatra, Rahman Attar

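AI summary In class-incremental learning (CIL), neural networks are prone to catastrophic forgetting. Rehearsal-based strategies mitigate the problem, but the degree of forgetting turns out to be uneven across classes. This paper systematically analyzes this imbalanced forgetting, proposes three last-layer coefficients that capture different gradient-level sources of interference affecting each class during incremental learning, and shows that these coefficients predict how much each class is forgotten. The self-induced interference coefficient is the strongest predictor of forgetting, which suggests new directions for mitigating imbalanced forgetting.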

Comments 37 pages; 24 tables; 7 figures; submitted to a journal

Details
English abstract

Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal (replaying a subset of past samples) is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient, capturing self-induced interference, emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.

2605.14781 2026-05-15 cs.CV

MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection

Leon Davies, Qinggang Meng, Mohamad Saada, Baihua Li, Simon Sølvsten

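AI summary Monocular 3D object detection is challenged by occlusion, truncation, and projection-induced scale-depth ambiguity, especially in unified multi-class settings, where class variability and partial visibility make size estimation less stable. The paper proposes MonoPRIO, which improves unified monocular 3D detection through adaptive prior conditioning in the size pathway. The method builds class-aware size prototypes, routes decoder queries to a soft mixture prior, and adds uncertainty-aware log-space conditioning and Cluster-Aligned Prior regularization, significantly improving detection accuracy and robustness. Experiments show that MonoPRIO achieves the strongest unified multi-class result reported so far on the KITTI test set and also performs strongly in the car-only setting, while using far less compute than existing methods.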

Comments 12 pages, 4 figures, 8 tables. Submitted to Pattern Recognition. Code and reproducibility material available at https://github.com/bigggs/MonoPRIO

Details
English abstract

Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.

2605.14779 2026-05-15 cs.LG

Peng's Q($λ$) for Conservative Value Estimation in Offline Reinforcement Learning

Byeongchan Kim, Min-hwan Oh

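AI summary This paper proposes Conservative Peng's Q(λ) (CPQL), a model-free offline multi-step reinforcement learning algorithm that adapts the Peng's Q(λ) operator for conservative value estimation in place of the standard Bellman operator. By leveraging offline trajectories, the work is the first to demonstrate, both theoretically and empirically, the effectiveness of multi-step conservative value estimation. The operator's fixed-point property naturally induces implicit behavior regularization, which mitigates over-pessimistic value estimates, yields performance at least as good as that of the behavior policy, and comes with near-optimal performance guarantees. Experiments show that CPQL significantly outperforms existing offline single-step methods on the D4RL benchmark and is also promising in the offline-to-online learning setting.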

Comments Accepted in ICLR 2026

Details
English abstract

We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($λ$) (CPQL). Our algorithm adapts the Peng's Q($λ$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees, a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline-to-online learning framework. Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at https://github.com/oh-lab/CPQL.
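To make the multi-step operator in the abstract above more concrete, the sketch below computes Peng's Q(λ) targets along one stored trajectory. It uses the textbook recursion G_t = r_t + γ[(1 − λ) max_a Q(s_{t+1}, a) + λ G_{t+1}]. This is only the generic PQL return, not CPQL itself: the conservative term is omitted because the abstract does not specify it, and the function name, argument layout, and default values are assumptions made for this example.

```python
def peng_q_lambda_targets(rewards, next_max_q, gamma=0.99, lam=0.7):
    """Peng's Q(lambda) targets for one trajectory.

    rewards[t]    : reward received at step t.
    next_max_q[t] : max_a Q(s_{t+1}, a); use 0.0 if s_{t+1} is terminal.
    Returns a list of targets G_t, computed backwards in time.
    """
    targets = [0.0] * len(rewards)
    # Past the end of the trajectory, fall back to the one-step bootstrap value.
    g_next = next_max_q[-1]
    for t in reversed(range(len(rewards))):
        mixed = (1.0 - lam) * next_max_q[t] + lam * g_next
        targets[t] = rewards[t] + gamma * mixed
        g_next = targets[t]
    return targets


# Toy trajectory with a terminal final state (values are made up).
one_step = peng_q_lambda_targets([1, 1, 1], [5, 5, 0], gamma=0.5, lam=0.0)
full_return = peng_q_lambda_targets([1, 1, 1], [5, 5, 0], gamma=0.5, lam=1.0)
print(one_step, full_return)
```

With λ = 0 the targets reduce to ordinary one-step Q-learning targets, and with λ = 1 and a terminal final state they become discounted returns of the logged trajectory. That second case is why a larger λ pulls the estimate toward the behavior policy's value.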

2605.14774 2026-05-15 cs.AI

Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation

Lata B T, Savitha N J

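AI summary This paper studies how deep reinforcement learning can improve the accuracy of identifying suspects in criminal investigations. The authors propose using the Deep Deterministic Policy Gradient (DDPG) algorithm, trained on datasets of crime scene material, witness statements, and suspect profiles, to improve identification efficiency and reduce misidentification. The reported results give an identification accuracy of 95%, higher than several existing methods, and point to new uses of AI in the judicial domain.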

Details
Journal ref
Mathematical Statistician and Engineering Applications, https://www.philstat.org/index.php/MSEA/article/view/2953, ISSN: 2094-0343
English abstract

In AI-assisted investigation, identifying a crime or a criminal remains a major problem. Conventional approaches to criminal investigation usually rely on limited data analysis. The challenge is to find an efficient method that identifies criminals from complex datasets while minimising false positives and false negatives. The main novelty of this work is an approach based on the Deep Deterministic Policy Gradient (DDPG) deep learning algorithm. We train the DDPG model on a dataset of crime scene material, witness statements, and suspect profiles. The algorithm uses features to maximise the likelihood of identifying the offender while minimising the impact of noise and irrelevant data. We show the efficacy of the proposed method: DDPG identified criminals with an accuracy of 95%, higher than several existing methods.

2605.14773 2026-05-15 cs.LG cs.AI

Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

Suorong Yang, Hanqi Zhu, Hai Gan, Fangjian Su, Guang Li, Furao Shen, Soujanya Poria

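AI summary This paper studies data selection for efficient model training. It observes that existing methods focus on which samples to select but usually fix the data-volume ratio, creating a mismatch between dynamic selection and static data volume. Taking an optimization perspective, the authors propose PODS, a plug-and-play oscillatory data-volume scheduling framework that dynamically adjusts the selection ratio to strengthen regularization while keeping optimization stable. Experiments show that PODS improves the trade-off between training efficiency and model performance across a range of datasets and tasks.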

Details
English abstract

Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
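The idea of alternating low-ratio and high-ratio phases around a target selection ratio, as described in the abstract above, can be pictured with a small schedule function. The sketch below is a hypothetical square-wave schedule whose average equals the target ratio. The waveform, the `amplitude` and `period` parameters, and the function name are assumptions made for illustration and are not the schedule defined by PODS.

```python
def oscillating_ratio(epoch, target=0.5, amplitude=0.2, period=4):
    """Return the fraction of data to select at a given epoch.

    The first half of each period uses a low ratio (stronger
    selection-induced regularization) and the second half a high ratio
    (recovery), so the mean over a full period equals `target`.
    """
    low, high = target - amplitude, target + amplitude
    return low if epoch % period < period // 2 else high


# Two full periods of the assumed schedule.
ratios = [oscillating_ratio(e) for e in range(8)]
mean_ratio = sum(ratios) / len(ratios)
print(ratios, mean_ratio)
```

Because the mean over a full period stays at the target ratio, the overall training budget matches a fixed-ratio run; only the distribution of data volume over time changes.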

2605.14772 2026-05-15 cs.CV cs.GR cs.LG

BioHuman: Learning Biomechanical Human Representations from Video

Yujun Huo, He Zhang, Chentao Song, Honglin Song, Zongyu Zuo, Tao Yu

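AI summary This work aims to learn biomechanical representations of the human body from video, going beyond conventional kinematic analysis to capture internal biomechanical states such as muscle activity. The authors propose a simulation-based framework that estimates muscle activations from existing motion capture data, build BioHuman10M, a large-scale dataset with synchronized video, motion, and activation information, and design BioHuman, an end-to-end model that jointly predicts human motion and muscle activations from monocular video. Experiments show strong performance on motion reconstruction and muscle activity prediction with good generalization, providing a new benchmark for video-based biomechanical understanding.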

Details
English abstract

Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

2605.14771 2026-05-15 cs.AI

MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

Shaoan Zhao, Huanlin Gao, Qiang Hui, Ting Lu, Xueqiang Guo, Yantao Li, Xinpei Su, Fuyuan Shi, Chao Tan, Fang Zhao, Kai Wang, Shiguo Lian

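AI summary MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. It targets practical deployment problems in AIGC applications, such as fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality workflows. The platform adopts a three-layer architecture of unified abstraction, plugin-based extension, and workflow orchestration, abstracts full-category AIGC capabilities into a unified invocation model, and uses task-oriented skill modules to make complex production processes reusable. The report focuses on MediaClaw's architectural design philosophy, the design logic of its core capability model, and key engineering trade-offs, offering a reusable practical reference for building multimodal capability platforms.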

Details
English abstract

MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. MediaClaw abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

2605.14766 2026-05-15 cs.CL cs.AI eess.AS

Streaming Speech-to-Text Translation with a SpeechLLM

Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen

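AI summary This paper proposes a real-time streaming speech-to-text translation system based on a large language model (LLM), addressing the slow response of existing SpeechLLM systems in practical use. The model learns not only to generate translated text but also to decide whether it has received enough audio to emit output, enabling more efficient streaming. Experiments show translation quality close to the non-streaming baseline with latency reduced to 1-2 seconds.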

Comments 9 pages of main text; 24 pages in total

Details
English abstract

Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.

2605.14765 2026-05-15 cs.SD cs.CL

Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani, Leili Javidpour, Mahdieh Soleymani Baghshah

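AI summary Addressing the lack of generative models for Persian music, this work builds a large-scale dataset of Persian music with more than 900 hours of high-quality audio spanning pop, traditional, and contemporary styles. The authors fine-tune MusicGen, a state-of-the-art generative model, on this dataset so that it better matches the modes, rhythms, and cultural characteristics of Persian music, and evaluate it with subjective and objective metrics. The work provides a new resource for Persian music generation research and shows the potential of music generation models to adapt to underrepresented cultural contexts.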

Comments 9 pages, 2 figures, 3 tables

Details
English abstract

Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.

2605.14764 2026-05-15 cs.LG cs.AI

Compositional Sparsity as an Inductive Bias for Neural Architecture Design

Hongyu Lin, Antonio Briola, Yuanrong Wang, Tomaso Aste

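AI summary This paper studies how deep neural networks overcome the curse of dimensionality through structural priors and proposes compositional sparsity as an inductive bias. The authors combine Information Filtering Networks (IFNs) with Homological Neural Networks (HNNs) to build an interpretable neural network design framework in which abstraction emerges through hierarchical composition. Experiments show that HNNs, with far fewer parameters than conventional deep networks, recover the sparse structure on synthetic tasks and match or outperform dense baselines on multiple real-world datasets, with greater stability.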

Details
English abstract

Identifying the structural priors that enable Deep Neural Networks (DNNs) to overcome the curse of dimensionality is a fundamental challenge in machine learning theory. Existing literature suggests that effective high-dimensional learning is driven by compositional sparsity, where target functions decompose into constituents supported on low-dimensional variable subsets. To investigate this hypothesis, we combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximisation, with Homological Neural Networks (HNNs), which map the inferred topology into fixed-wiring sparse neural graphs. We formalise the design principles underlying this construction and present an interpretable pipeline in which abstraction emerges through hierarchical composition. HNNs are orders of magnitude sparser than standard DNNs and require only minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover the underlying compositional structure and remain stable in regimes where dense alternatives degrade as dimensionality increases. Across a broad suite of real-world datasets, HNNs consistently match or outperform dense baselines while using far fewer parameters, exhibiting lower variance and showing reduced sensitivity to hyperparameters.

2605.14761 2026-05-15 cs.AI cs.HC

AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction

Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi

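AI summary This work addresses the fundamental challenge of having AI accurately predict an individual's aesthetic evaluation of images. It proposes an integrated system combining deep learning and large language models (LLMs) that actively elicits the user's aesthetic preferences through LLM-based semi-structured interviews and predicts evaluations from both low-level and high-level semantic features of images. Experiments show that the system outperforms conventional models, human predictors, and the user's own re-evaluation after a time interval, with particularly strong performance on highly rated images, suggesting that AI may have an advantage over humans in capturing individual aesthetic preferences.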

Comments 25 pages, 13 figures

Details
English abstract

Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.