arXivDaily: arXiv daily academic digest, updated Monday through Friday
2412.13111 2026-05-20 cs.CV cs.GR

Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations

Ruoxi Guo, Huaijin Pi, Zehong Shen, Qing Shuai, Zechen Hu, Zhumei Wang, Yajiao Dong, Ruizhen Hu, Taku Komura, Sida Peng, Xiaowei Zhou

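Affiliations Zhejiang University; Deep Glint; The University of Hong Kong; Shenzhen University

AI summary This paper proposes a method that uses motion data extracted from 2D videos to improve text-driven 3D motion generation. By disentangling local joint motion from global movement, it learns local motion priors effectively and improves the realism and diversity of the generated 3D human motion.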

Comments Project page: https://zju3dv.github.io/Motion-2-to-3/

Journal ref 2025 IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 2025, pp. 14305-14316

Abstract

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation. Our code is publicly available at https://zju3dv.github.io/Motion-2-to-3/.

2412.00404 2026-05-20 cs.CV

Hard-Label Black-Box Attacks on 3D Point Clouds

Daizong Liu, Yunbo Tao, Junhao Dong, Keke Tang, Pan Zhou, Wei Hu, Yew-Soon Ong

Affiliations Institute for Math & AI; Wuhan University; Huazhong University of Science and Technology; Shenzhen Huazhong University of Science and Technology Research Institute; College of Computing and Data Science; Nanyang Technological University; Cyberspace Institute of Advanced Technology; Guangzhou University; Wangxuan Institute of Computer Technology; Peking University

AI summary This paper proposes a hard-label black-box attack on 3D point clouds. It introduces a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples, improving both attack performance and adversarial quality.

Abstract

With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers simply follow the white-box or black-box setting to iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios (no model details are provided) as they severely rely on parameters or output logits of victim models. To this end, we propose point cloud attacks from a more practical setting, i.e., hard-label black-box attack, in which attackers can only access the prediction label of 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary, by developing a learnable spectrum-fusion strategy to adaptively fuse point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary for generating adversarial point clouds with trivial perturbations. Experiments demonstrate that our attack competitively outperforms existing white/black-box attackers in terms of attack performance and adversary quality.

2411.08982 2026-05-20 cs.LG cs.DC

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

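Affiliations Georgia Institute of Technology

AI summary This paper presents Lynx, a system that exploits a property of the load-balancing losses used in MoE training to reduce the total number of experts invoked. This enables efficient, workload-agnostic MoE inference with higher throughput and low accuracy loss.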

Abstract

Selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance in serving, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. LYNX leverages a key property of MoE training: load-balancing losses introduce batch-level expert activation skews and redundancy, which it exploits by remapping low-affinity token-to-expert assignments within each batch using a novel AffinityBinning technique that reduces the total experts invoked. Our evaluation of LYNX on four state-of-the-art model families across nine benchmarks shows that it achieves up to 1.30x improvement in throughput while keeping accuracy loss below 1 percentage point across tasks. Further, LYNX is complementary to existing techniques, additionally boosting their performance by up to 1.38x.

2410.18856 2026-05-20 cs.AI cs.CL

Entry-level guide to the use of large language models for medical research

Qiao Jin, Nicholas Wan, Robert Leaman, Shubo Tian, Zhizheng Wang, Yifan Yang, Zifeng Wang, Guangzhi Xiong, Po-Ting Lai, Qingqing Zhu, Benjamin Hou, Maame Sarfo-Gyamfi, Gongbo Zhang, Aidan Gilson, Balu Bhasuran, Zhe He, Aidong Zhang, Jimeng Sun, Chunhua Weng, Ronald M. Summers, Qingyu Chen, Yifan Peng, Zhiyong Lu

Affiliations National Library of Medicine (NLM), National Institutes of Health (NIH); Department of Computer Science, University of Illinois Urbana-Champaign; Department of Computer Science, University of Virginia; Department of Biomedical Informatics, Columbia University; School of Medicine, Yale University; School of Information, Florida State University; Department of Radiology and Imaging Sciences, NIH Clinical Center; Department of Population Health Sciences, Weill Cornell Medicine

AI summary This paper proposes an actionable guideline to help healthcare professionals use large language models (LLMs) more efficiently in medical research. It covers task formulation, model selection, prompt engineering, fine-tuning, and model deployment, so that LLMs are applied safely and reliably in clinical practice.

Abstract

Frontier large language models (LLMs), such as GPT-5, Claude 4.5, Gemini 3, Llama 4, and DeepSeek-R1, represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this paper, we propose an actionable guideline to help healthcare professionals more effectively and efficiently utilize LLMs in their work, along with a set of best practices. The overall workflow consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and model deployment. We start with the discussion of critical considerations in identifying medical tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this entry-level tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.

2410.15362 2026-05-20 cs.LG cs.AI cs.CL cs.CR

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

Xiao Li, Wei Zhang, Zhuhong Li, Qiongxiu Li, Shei PernChua, BingZe Lee, Jinghao Cui, Yifan Huang, Xiaolin Hu

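Affiliations Tsinghua University; Sea-Fill; Duke University; Aalborg University; Chinese Institute for Brain Research (CIBR)

AI summary This paper proposes Faster-GCG, which makes jailbreak attacks against aligned large language models more efficient through improved estimation, more efficient sampling, and avoidance of repeated evaluations. It achieves an 8x improvement in sample efficiency and a 7x reduction in time, with higher jailbreak success rates across multiple models.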

Comments 18 pages, new version

Abstract

Aligned Large Language Models (LLMs) have attracted significant attention for their safety, particularly in the context of jailbreak attacks that attempt to bypass guardrails via adversarial prompts. Among existing approaches, the Greedy Coordinate Gradient (GCG) attack pioneered automated jailbreaks through discrete token optimization; however, its low sample efficiency limits practical applicability. In particular, GCG requires approximately 256K evaluations per harmful behavior to achieve a satisfactory jailbreak success rate, due to the inherent difficulty of the underlying discrete optimization problem. In this work, we identify three key factors that limit the sample efficiency of GCG: inaccurate gradient-based estimation, inefficient uniform sampling, and repeated evaluation of previously explored suffixes. To address these issues, we propose Faster-GCG, a streamlined variant of GCG that incorporates distance-based regularization for improved estimation, temperature-controlled sampling for more effective exploration, and a visited-suffix marking mechanism to avoid redundant evaluations. Faster-GCG reduced the required evaluations to 32K, achieving up to an 8x improvement in sampling efficiency and a 7x reduction in wall-clock time compared to GCG. Under this reduced budget, Faster-GCG attained an average jailbreak success rate of 78.1% across five aligned LLMs, and achieved 88.7% against Qwen3.5-4B, outperforming state-of-the-art white-box jailbreak methods.

2409.08248 2026-05-20 cs.CV

TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation

NaHyeon Park, Kunhee Kim, Hyunjung Shim

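Affiliations KAIST

AI summary This paper proposes TextBoost, an efficient one-shot personalization method for text-to-image diffusion models. By fine-tuning only the text encoder, it improves computational and storage efficiency while preserving semantic integrity, giving faster convergence and lower storage requirements with high-quality generation.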

Comments Project page: https://textboost.github.io. Accepted to TMLR

Abstract

In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

2409.03192 2026-05-20 cs.CV

PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning

Bowen Tian, Songning Lai, Lujundong Li, Zhihao Shuai, Runwei Guan, Tian Wu, Yutao Yue

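Affiliations HKUST(GZ); Institute of Deep Perception Technology, JITRI; University of Liverpool; Nanchang University; DI^2 Lab

AI summary This paper proposes PEPL, which addresses the scarcity of labeled data in fine-grained image classification by generating high-quality pseudo-labels. It uses CAMs for semantic-mixed pseudo-label generation to improve classification accuracy and robustness.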

Comments Accepted by ICASSP 2025

Abstract

Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.

2403.07183 2026-05-20 cs.CL cs.AI cs.LG cs.SI

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou

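Affiliations Department of Computer Science, Stanford University; Machine Learning Department, NEC Labs America; Department of Biomedical Data Science, Stanford University; Department of Electrical Engineering, Stanford University; Graduate School of Education, Stanford University; Department of Sociology, Stanford University; Graduate School of Business, Stanford University; Department of Management Science and Engineering, Stanford University; Department of Computer Science, UC Santa Barbara

AI summary This paper presents an approach for estimating the fraction of text in a large corpus that is likely to have been substantially modified or produced by a large language model (LLM). Using expert-written and AI-generated reference texts, the maximum likelihood model efficiently examines real-world LLM use at the corpus level. In a case study of peer reviews from AI conferences held after the release of ChatGPT (ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023), it finds that between 6.5% and 16.9% of the submitted text could have been substantially modified by LLMs. The circumstances in which generated text occurs reveal user behavior: the estimated fraction of LLM-generated text is higher in reviews that report lower confidence, were submitted close to the deadline, or come from reviewers who are less likely to respond to author rebuttals. The paper also observes corpus-level trends that may be too subtle to detect at the individual level, discusses their implications for peer review, and calls for future interdisciplinary work on how LLM use is changing information and knowledge practices.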

Comments 46 pages, 31 figures, ICML '24

Abstract

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
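
The corpus-level estimate described in this abstract can be illustrated with a minimal sketch. It assumes each document already has a likelihood under a human-written reference distribution P and under an AI-generated reference distribution Q, and it fits the mixture weight alpha in (1 - alpha) P + alpha Q by maximum likelihood with a plain grid search. The function name, the grid search, and the toy likelihood values are illustrative assumptions, not the authors' implementation.

```python
import math


def estimate_llm_fraction(p_human, q_ai, steps=1000):
    """Grid-search MLE of alpha in the mixture (1 - alpha) * P + alpha * Q.

    p_human[i] and q_ai[i] are assumed, precomputed likelihoods of document i
    under the human-written and AI-generated reference distributions.
    """
    best_alpha, best_ll = 0.0, -math.inf
    for k in range(steps + 1):
        alpha = k / steps
        ll = 0.0
        for p, q in zip(p_human, q_ai):
            mix = (1.0 - alpha) * p + alpha * q
            if mix <= 0.0:
                # A zero-likelihood document rules this alpha out.
                ll = -math.inf
                break
            ll += math.log(mix)
        # Strict comparison keeps the smallest alpha among exact ties.
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


# Toy data (hypothetical): two documents look human-written, two look AI-generated.
p_human = [0.9, 0.9, 0.1, 0.1]
q_ai = [0.1, 0.1, 0.9, 0.9]
alpha_hat = estimate_llm_fraction(p_human, q_ai)
```

Because the log-likelihood is concave in alpha, a grid search is enough for a sketch; a real pipeline would more likely use a one-dimensional optimizer and likelihoods derived from the reference corpora.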

2310.11203 2026-05-20 cs.LG stat.ML

Federated Learning with Nonvacuous Generalisation Bounds

Pierre Jobic, Maxime Haddouche, Benjamin Guedj

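Affiliations Université Paris-Saclay CEA; Inria, CNRS, Ecole Normale Supérieure, PSL Research University; University College London

AI summary This paper proposes a new strategy for training randomised predictors in federated learning, in which each node preserves its privacy by releasing a local predictor while keeping its training data hidden from the other nodes. It builds a global randomised predictor that inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. Numerical experiments show predictive performance comparable to the batch approach while preserving privacy.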

Abstract

We introduce a novel strategy to train randomised predictors in federated learning, where each node of the network aims at preserving its privacy by releasing a local predictor while keeping its training dataset secret from the other nodes. We then build a global randomised predictor which inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. We consider the synchronous case where all nodes share the same training objective (derived from a generalisation bound), and the heterogeneous and homogeneous cases where each node may have its own personalised training objective. We show through a series of numerical experiments that our approach achieves a comparable predictive performance to that of the batch approach where all datasets are shared across nodes. Moreover, the predictors are supported by numerically nonvacuous generalisation bounds while preserving privacy for each node. We explicitly compute the increment on predictive performance and generalisation bounds for our two federated settings, highlighting the price to pay to preserve privacy.

2112.08507 2026-05-20 cs.LG stat.ML

Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Tong Li, Jacob Nogas, Haochen Song, Anna Rafferty, Eric M. Schwartz, Audrey Durand, Harsh Kumar, Nina Deliu, Sofia S. Villar, Dehan Kong, Joseph J. Williams

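Affiliations University of Toronto; Carleton College; University of Michigan; University of Cambridge

AI summary This paper proposes TS-PostDiff, a statistically sensitive algorithm that combines uniform random assignment with reward maximization to trade off statistical analysis against user reward, improving experimental efficiency and accuracy.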

Abstract

Traditional randomized A/B experiments assign arms with uniform random (UR) probability, such as 50/50 assignment to two versions of a website to discover whether one version engages users more. To more quickly and automatically use data to benefit users, multi-armed bandit algorithms such as Thompson Sampling (TS) have been advocated. While TS is interpretable and incorporates the randomization key to statistical inference, it can cause biased estimates and increase false positives and false negatives in detecting differences in arm means. We introduce a more Statistically Sensitive algorithm, TS-PostDiff (Posterior Probability of Small Difference), that mixes TS with traditional UR by using an additional adaptive step, where the probability of using UR (vs TS) is proportional to the posterior probability that the difference in arms is small. This allows an experimenter to define what counts as a small difference, below which a traditional UR experiment can obtain informative data for statistical inference at low cost, and above which using more TS to maximize user benefits is key. We evaluate TS-PostDiff against UR, TS, and two other TS variants designed to improve statistical inference. We consider results for the common two-armed experiment across a range of settings inspired by real-world applications. Our results provide insight into when and why TS-PostDiff or alternative approaches provide better tradeoffs between benefiting users (reward) and statistical inference (false positive rate and power). TS-PostDiff's adaptivity helps efficiently reduce false positives and increase statistical power when differences are small, while increasing reward more when differences are large. The work highlights important considerations for future Statistically Sensitive algorithm development that balances reward and statistical analysis in adaptive experimentation.
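
The mixing step described in this abstract can be sketched as follows for a two-armed experiment. The sketch assumes Bernoulli rewards with uniform Beta(1, 1) priors and decides between UR and TS from a single posterior draw, so UR is used with probability equal to the posterior probability that the arm difference is below a threshold `c`. The function names, the prior, and the single-draw check are assumptions made for illustration and may differ from the authors' procedure.

```python
import random


def ts_postdiff_choose(successes, failures, c, rng):
    """Choose arm 0 or 1 for the next participant (two-armed Bernoulli case)."""

    def posterior_draws():
        # Beta(1 + successes, 1 + failures) posterior under an assumed uniform prior.
        return [rng.betavariate(1 + successes[a], 1 + failures[a]) for a in (0, 1)]

    first = posterior_draws()
    if abs(first[0] - first[1]) < c:
        # The sampled difference is "small": use uniform random assignment.
        return rng.randrange(2)
    # Otherwise take a Thompson Sampling step with a fresh posterior draw.
    second = posterior_draws()
    return 0 if second[0] >= second[1] else 1


def simulate(true_means, n, c, seed=0):
    """Run n assignments and return per-arm success and failure counts."""
    rng = random.Random(seed)
    successes, failures = [0, 0], [0, 0]
    for _ in range(n):
        arm = ts_postdiff_choose(successes, failures, c, rng)
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


# Hypothetical settings: a small true difference and a threshold of 0.1.
successes, failures = simulate([0.50, 0.55], n=500, c=0.1, seed=1)
```

Setting `c` to 0 recovers plain Thompson Sampling, while a threshold of 1 makes essentially every assignment uniform random, which is one way to see the trade-off the experimenter controls.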

2105.00933 2026-05-20 cs.SD cs.AI cs.LG eess.AS

Deep Neural Network for Musical Instrument Recognition using MFCCs

Saranga Kingkor Mahanta, Abdullah Faiz Ur Rahman Khilji, Partha Pakray

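Affiliations Department of Electronics and Communication Engineering, National Institute of Technology, Silchar, Assam, India

AI summary This paper proposes a deep neural network model based on MFCCs to classify twenty classes of musical instruments, trained on the London Philharmonic Orchestra dataset to achieve high-accuracy recognition.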

Journal ref Computacion y Sistemas, Vol 25, No 2 (2021): 25(2) 2021

Abstract

The task of efficient automatic music classification is of vital importance and forms the basis for various advanced applications of AI in the musical domain. Musical instrument recognition is the task of identifying an instrument from its audio. The model uses this audio, i.e. the sound vibrations, to match a recording to an instrument class. In this paper, we use an artificial neural network (ANN) model trained to classify twenty different classes of musical instruments. We use only the mel-frequency cepstral coefficients (MFCCs) of the audio data. Our proposed model trains on the full London Philharmonic Orchestra dataset, which contains twenty classes of instruments belonging to four families, viz. woodwinds, brass, percussion, and strings. Based on experimental results, our model achieves state-of-the-art accuracy on this dataset.

2002.09053 2026-05-20 cs.CV

Adapted Center and Scale Prediction: More Stable and More Accurate

Wenhao Wang, Jusheng Zhang

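Affiliations University of Technology Sydney; Sun Yat-sen University

AI summary This paper proposes improvements to Center and Scale Prediction (CSP) that aim to combine the simplicity of anchor-free detectors with the accuracy of two-stage detectors. It makes CSP more robust, proposes a new width-prediction method called compressing width, achieves the second-best performance on the CityPersons benchmark, and explores capabilities of Switchable Normalization.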

Comments 14 pages, 7 figures

Abstract

Pedestrian detection has benefited from deep learning and developed rapidly in recent years. Most detectors follow the general object detection framework, i.e. default boxes and a two-stage process. Recently, anchor-free and one-stage detectors have been introduced into this area. However, their accuracy is unsatisfactory. Therefore, to enjoy the simplicity of anchor-free detectors and the accuracy of two-stage ones simultaneously, we propose some adaptations based on a detector, Center and Scale Prediction (CSP). The main contributions of our paper are: (1) We improve the robustness of CSP and make it easier to train. (2) We propose a novel method to predict width, namely compressing width. (3) We achieve the second best performance on the CityPersons benchmark, i.e. 9.3% log-average miss rate (MR) on the reasonable set, 8.7% MR on the partial set and 5.6% MR on the bare set, which shows that an anchor-free and one-stage detector can still have high accuracy. (4) We explore some capabilities of Switchable Normalization which are not mentioned in its original paper. The code is publicly available at https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction.

2605.19020 2026-05-20 cs.CV

A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

Rahul Anand, Siddharth Singh, Dileep A D, Mahadeva Prasanna, Raghavendra Ramachandra

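Affiliations Indian Institute of Technology, Dharwad, India; Indian Institute of Information Technology Dharwad, India; SAFE Center, Norwegian University of Science and Technology (NTNU)

AI summary This paper systematically analyzes vision foundation models for open-set iris presentation attack detection. It finds that they perform poorly on unseen presentation attack instruments and under cross-spectral transfer, underscoring the need for more robust PAD representations.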

Abstract

Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.

2605.19018 2026-05-20 cs.LG

LoRA vs. Full Fine-Tuning: A Theoretical Perspective

LoRA与全微调:一种理论视角

Ali Zindari, Rotem Mulayoff, Sebastian U. Stich

发表机构 * Universität des Saarlandes(萨尔大学) CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心)

AI总结 本文从理论角度在线性回归设置中比较了LoRA与全微调,指出在过定和欠定两种情形下都存在LoRA超额风险低于全微调的区间,尤其是当预训练任务与下游任务之间的差异实际上是低秩时;文中还说明了LoRA秩的选择如何影响泛化性能,实际任务上的实验表明这些结论可能适用于线性回归之外的场景。

Comments Preprint

详情
AI中文摘要

微调利用少量标注数据使预训练模型适应下游任务。低秩适应(LoRA)是一种高效的微调方法,它在降低内存和计算成本的同时,通常能取得接近全微调的性能。尽管LoRA已被广泛使用,其理论行为尚未得到充分理解。本文在简单的线性回归设置中研究LoRA,并将其超额风险与全微调进行比较。我们的分析指出,在过定和欠定两种设置下,都存在LoRA的超额风险低于全微调的区间。具体而言,我们的理论预测,当预训练任务与下游任务之间的差异实际上是低秩时,LoRA可以优于全微调。我们进一步说明了LoRA秩的选择如何影响泛化性能,解释了为什么在某些设置下使用非常小的秩可以提高测试准确率,尽管这限制了模型的表达能力。最后,我们用实际任务上的实验支持理论结果,这表明所识别的权衡和见解可以推广到线性回归之外。

英文摘要

Fine-tuning adapts a pre-trained model to downstream tasks using a small amount of labeled data. Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that reduces memory and computation costs while often achieving performance close to full fine-tuning. Despite its widespread use, the theoretical behavior of LoRA is not yet well understood. In this paper, we study LoRA in a simple linear regression setting and compare its excess risk with that of full fine-tuning. Our analysis identifies regimes in which LoRA achieves lower excess risk than full fine-tuning in both overdetermined and underdetermined settings. Specifically, our theory predicts that LoRA can outperform full fine-tuning when the difference between the pretraining and the downstream tasks is effectively low-rank. We further show how the choice of LoRA rank affects generalization performance, explaining why using a very small rank can improve test accuracy in certain settings, even though it limits model expressivity. Finally, we support our theoretical results with experiments on practical tasks, suggesting that the identified tradeoffs and insights extend beyond linear regression.
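The rank tradeoff described in this abstract can be made concrete with a minimal sketch of the LoRA parametrization for a single linear map. It assumes the commonly used form `W0 + B A` with `B` initialized to zero; the helper names `matvec`, `lora_forward`, and `param_counts` are illustrative and are not taken from the paper.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W0, B, A, x):
    """Compute (W0 + B A) x without forming the d_out x d_in product B A."""
    base = matvec(W0, x)            # frozen pretrained path
    low = matvec(B, matvec(A, x))   # trainable update of rank <= r
    return [b + l for b, l in zip(base, low)]

def param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA update."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # d_out = 2, d_in = 3
A = [[0.5, -0.5, 1.0]]                    # r = 1
x = [1.0, 2.0, 3.0]

print(lora_forward(W0, [[0.0], [0.0]], A, x))  # B = 0: same as the pretrained model
print(lora_forward(W0, [[1.0], [2.0]], A, x))  # a rank-1 correction is added
print(param_counts(1024, 1024, 8))
```

Because the update `B A` has rank at most `r`, a small `r` restricts which corrections the model can express; the abstract's claim is that this restriction can help when the true pretraining-to-downstream difference is itself effectively low-rank.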

2605.19014 2026-05-20 cs.LG econ.EM stat.ML

SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

SAGA:一种结合自适应时间共形预测的序列自适应生成架构,用于多预测期概率预测

Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Hafize Gonca Cömert

发表机构 * Department of Economics, Stockholm University(斯德哥尔摩大学经济系) Institute of Social Sciences, Faculty of Economics and Administrative Sciences, Süleyman Demirel University(苏莱曼·德米雷尔大学社会科学学院,经济学与行政科学学院)

AI总结 本文提出SAGA,一种面向不规则表格面板序列的仅解码器Transformer,并配以分割共形校准包装器,给出具有有限样本边际覆盖保证的个体层面预测区间。模型在瑞典LISA登记数据上训练,预测未来1至30年的年度劳动收入,并通过蒙特卡洛方法汇总为折现后的终身收入分布。相对于Guvenen、Karahan、Ozkan和Song(GKOS)的经典参数过程以及表格和循环基线,SAGA在10年预测期将连续分级概率评分(CRPS)降低31.9%,在20年预测期将平均绝对误差降低37.7%。共形区间的边际覆盖率与名义水平相差不超过0.4个百分点,在最差人口子群体上相差不超过2.4个百分点。重建的终身收入基尼系数为0.327,而部分观测的真实值为0.341,GKOS估计值为0.378。模型权重、校准表和一个合成等价数据集已发布,以便在受保护的SCB MONA环境之外进行复现。

Comments 14 pages, 3 figures, 12 tables, 5 appendices, 45 references. Submitted to IEEE TPAMI. Source code at https://github.com/olaflaitinen/saga (archived: doi:10.5281/zenodo.20260366). Synthetic equivalent dataset: doi:10.5281/zenodo.20260287. Empirical work conducted on the Swedish LISA register via SCB MONA (project SCB-MONA-2026-147); ethical approval Swedish Ethical Review Authority 2026-04127-01

详情
AI中文摘要

财政部和中央银行使用的微观模拟模型依赖终身收入的参数过程,这些过程只刻画条件分布的一阶和二阶矩,而遗漏了长程非线性结构。我们提出SAGA,一种面向不规则表格面板序列的仅解码器Transformer,并配以分割共形校准包装器,给出具有有限样本边际覆盖保证的个体层面预测区间。模型在1990年至2022年的瑞典LISA纵向登记数据上训练,数据包含2,143,817名个体和61,284,903个人年;模型预测未来1至30年的年度劳动收入,并通过蒙特卡洛方法汇总为折现后的终身收入分布。相对于Guvenen、Karahan、Ozkan和Song(GKOS)的经典参数过程以及表格和循环基线,SAGA在10年预测期将连续分级概率评分(CRPS)降低31.9%,在20年预测期将平均绝对误差降低37.7%。共形区间的边际覆盖率与名义水平相差不超过0.4个百分点,在最差人口子群体上相差不超过2.4个百分点。重建的终身收入基尼系数为0.327,而部分观测的真实值为0.341,GKOS估计值为0.378。模型权重、校准表和一个合成等价数据集已发布,以便在受保护的SCB MONA环境之外进行复现。

英文摘要

Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.
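For readers unfamiliar with split conformal calibration, the following is a minimal sketch of the textbook procedure with absolute-residual scores. It is not SAGA's adaptive temporal variant, and the function names and the toy calibration data are assumptions made for illustration only.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: the interval is unbounded.
        return float("inf")
    return sorted(scores)[k - 1]

def split_conformal_interval(point_pred, calib_preds, calib_truth, alpha=0.1):
    """Symmetric interval around a point forecast, calibrated on held-out data."""
    scores = [abs(y - p) for p, y in zip(calib_preds, calib_truth)]
    q = conformal_quantile(scores, alpha)
    return (point_pred - q, point_pred + q)

# Toy calibration set whose absolute residuals are 1, 2, ..., 9.
calib_preds = [10, 20, 30, 40, 50, 60, 70, 80, 90]
calib_truth = [11, 18, 33, 36, 55, 54, 77, 72, 99]
print(split_conformal_interval(100, calib_preds, calib_truth, alpha=0.5))
```

Under exchangeability of calibration and test points, this construction gives marginal coverage of at least `1 - alpha` in finite samples, which is the kind of guarantee the abstract refers to; subgroup coverage is a separate, empirical question.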

2605.19010 2026-05-20 cs.AI

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

AgentNLQ: 一个通用的自然语言到SQL代理

Olena Bogdanov, Yeunji Jung, Chandra Dhir, Pareekshitreddy Gaddam, Saurabh Jain, Lakshmi Tumati, Vijay Parthasarathy, Anup Shirgaonkar

发表机构 * JPMorganChase(摩根大通)

AI总结 本研究提出了一种多代理方法,用于改进自然语言到SQL的转换,该方法在BIRD基准测试中实现了78.1%的语义准确率,并通过优化的多代理解决方案、先进的模式增强方法以及跨不同领域和数据集的评估,展示了方法的准确性和泛化能力。

详情
AI中文摘要

自然语言到SQL(NL2SQL)转换是研究人员和企业关注的重要问题,因为关系数据库在广泛的实际问题中具有普遍的重要性。尽管大语言模型(LLMs)的能力迅速提升,NL2SQL尚未达到与人类专家SQL编写者同等的准确性,因此需要进一步改进NL2SQL算法。本研究提出了一种新的多代理方法用于NL2SQL,该方法在BIg Bench for LaRge-scale Database(BIRD)基准上实现了78.1%的语义准确性。我们的方法利用了用户提供的模式的语义丰富表示,添加了用户提供的业务规则,并生成了准确的SQL查询。本研究的主要贡献包括(a)我们设计了一种优化的多代理解决方案中的新调度器,该调度器利用LLMs进行计划、协调、反思和自我纠正以生成准确的SQL查询;(b)我们开发了一种先进的模式增强方法,创建了上下文感知的元数据以提高准确性;(c)我们通过在BIRD-SQL基准上评估该方法,展示了其在不同领域和数据集上的准确性和泛化能力。

英文摘要

Natural language to SQL (NL2SQL) conversion is an important problem for researchers and enterprises due to the ubiquitous importance of relational databases in broad-ranging practical problems. Despite the rapid advancements in the capabilities of LLMs, NL2SQL has not reached parity in accuracy with human expert SQL writers, hence needing additional improvements in NL2SQL algorithms. This study presents a new multi-agent method for NL2SQL that achieves 78.1% semantic accuracy on the BIg Bench for LaRge-scale Database (BIRD) benchmark. Our method leverages a semantically enriched representation of user-provided schema, adds user-provided business rules, and produces accurate SQL queries. The main contributions of this study are (a) We designed an optimized new orchestrator in a multi-agent solution that uses LLMs to plan, orchestrate, reflect, and self-correct to generate accurate SQL queries, (b) We developed an advanced schema enrichment method that creates context-aware metadata to improve accuracy, and (c) We demonstrated the accuracy and generalizability of the method across different domains and datasets by evaluating it on the BIRD-SQL benchmark.

2605.19009 2026-05-20 cs.RO cs.SY eess.SY

Adversarial Stress Testing of SPARK Humanoid Safety Filters

对SPARK人形机器人安全过滤器的对抗性压力测试

Saurav Ghosh, Abdou Sow, Luke Zhang

发表机构 * Department of Computer Science and Engineering, Washington University in St. Louis, Missouri, United States(美国密苏里州圣路易斯华盛顿大学计算机科学与工程系)

AI总结 本文通过复现和压力测试研究了SPARK人形机器人安全过滤器的鲁棒性,评估了多种方法在不同环境下的表现,揭示了安全行为在障碍物拥挤、距离估计含噪和障碍物信息延迟时会发生变化,强调在部署前应使用能暴露故障模式的指标进行评估。

Comments 5 pages, 7 figures, 1 table. Code available at https://github.com/ghoshsaurav/spark-adversarial-safety

详情
AI中文摘要

人形机器人具有高维身体和众多碰撞约束,并且必须在人和障碍物附近运行,因此难以安全部署。安全过滤器在名义控制动作可能违反避碰约束时对其进行修改,从而提供帮助。然而,名义基准分数并不能完全反映这些过滤器在更困难环境中的行为。在本工作中,我们通过复现和压力测试研究SPARK人形机器人安全过滤器的鲁棒性。我们在MuJoCo中复现了SPARK基准案例G1SportMode_D1_WG_SO_v1,并在受控随机种子下评估了RSSA、RSSS、SSA、CBF、PFM和SMA。我们还构建了一个后处理流程,将原始SPARK日志转换为目标跟踪、最小距离和碰撞步数指标。结果表明,一些方法能更紧密地跟踪目标,而另一些方法能更有效地减少碰撞步数。压力测试进一步表明,在障碍物拥挤、距离估计含噪和障碍物信息延迟的情况下,安全行为可能发生变化。这些发现表明,对人形机器人自主性的评估不应局限于名义性能,而应在部署前使用能暴露故障模式的指标。

英文摘要

Humanoid robots are difficult to deploy safely because they have high-dimensional bodies, many collision constraints, and must operate near people and obstacles. Safety filters help by modifying a nominal control action when it may violate collision-avoidance constraints. Still, nominal benchmark scores do not fully show how these filters behave in harder environments. In this work, we study the robustness of SPARK humanoid safety filters through replication and stress testing. We replicate the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluate RSSA, RSSS, SSA, CBF, PFM, and SMA under controlled random seeds. We also built a post-processing pipeline that converts raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics. Our results show that some methods track the goal more closely, while others reduce collision steps more effectively. The stress tests further indicate that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information. These findings suggest that humanoid autonomy should be evaluated beyond nominal performance, using metrics that expose failure modes before deployment.
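The three metrics named in this abstract (goal tracking, minimum distance, collision steps) could be derived from per-step logs roughly as sketched below. The log layout is hypothetical: the field names `robot_pos`, `goal_pos`, and `obstacle_dists` and the collision threshold are assumptions for illustration, not the actual SPARK log format or the authors' pipeline.

```python
import math

def summarize_run(steps, collision_threshold=0.0):
    """Reduce a non-empty list of per-step records to three scalar metrics.

    Each record is assumed to hold the robot and goal positions plus the
    signed distances to every obstacle at that step (hypothetical schema).
    """
    goal_errors = []
    min_dist = float("inf")
    collision_steps = 0
    for step in steps:
        goal_errors.append(math.dist(step["robot_pos"], step["goal_pos"]))
        nearest = min(step["obstacle_dists"])
        min_dist = min(min_dist, nearest)
        if nearest <= collision_threshold:
            collision_steps += 1  # count steps, not distinct collision events
    return {
        "mean_goal_error": sum(goal_errors) / len(goal_errors),
        "min_distance": min_dist,
        "collision_steps": collision_steps,
    }

log = [
    {"robot_pos": (0, 0, 0), "goal_pos": (3, 4, 0), "obstacle_dists": [0.5, 1.0]},
    {"robot_pos": (0, 0, 0), "goal_pos": (0, 0, 0), "obstacle_dists": [-0.1, 0.3]},
]
print(summarize_run(log))
```

Reporting the three numbers side by side makes the tradeoff in the abstract visible: a filter can score well on goal tracking while still accumulating collision steps.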

2605.19008 2026-05-20 cs.AI cs.CL cs.LG

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

Learn-by-Wire训练控制治理:面向稳定性与效率的压力下有界自主训练

Anis Radianis

发表机构 * Qluon Inc.(Qluon公司)

AI总结 本文提出Learn-by-Wire Guard(LBW-Guard),一个运行在AdamW之上的有界自主训练控制治理层,它在保持训练目标不变的前提下对优化器的执行施加有界控制,以提高大型语言模型训练在压力条件下的稳定性和效率。

详情
AI中文摘要

现代语言模型训练越来越容易出现不稳定、运行质量退化和计算浪费,在激进的学习率、规模和运行时压力条件下尤其如此。本文提出Learn-by-Wire Guard(LBW-Guard),一个运行在AdamW之上的有界自主训练控制治理层。LBW-Guard并不替换优化器的更新规则,而是观察训练遥测数据,识别对不稳定性敏感的状态,并在保持训练目标不变的前提下对优化器的执行施加有界控制。我们在以Qwen2.5为中心、基于WikiText-103的压力与鲁棒性测试套件中评估LBW-Guard:以Qwen2.5-7B作为实证基准,并包括与Qwen2.5-3B和Qwen2.5-14B的模型规模比较、学习率压力测试、梯度裁剪基线,以及不使用LoRA的TinyLlama-1B全参数合理性检查。在7B参考设置中,LBW-Guard将最终困惑度从13.21降至10.74,改善18.7%,同时将端到端时间从392.54秒缩短到357.02秒,加速比为1.10倍。在更强的学习率压力下,AdamW的最终困惑度在LR=3e-3时恶化到1885.24,在LR=1e-3时为659.76,而LBW-Guard仍可正常训练,困惑度分别为11.57和10.33。梯度裁剪基线无法复现这一效果。这些结果支持一个范围有限的系统层面结论:对稳定性敏感的LLM训练可以受益于优化器之上的治理平面。LBW-Guard提供的证据表明,有界的运行时控制能够在压力下保住有效计算,同时又不同于优化器替换和局部梯度抑制。

英文摘要

Modern language-model training is increasingly exposed to instability, degraded runs, and wasted compute, especially under aggressive learning-rate, scale, and runtime-stress conditions. This paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW. Rather than replacing the optimizer update rule, LBW-Guard observes training telemetry, interprets instability-sensitive regimes, and applies bounded control to optimizer execution while preserving fixed training objectives. We evaluate LBW-Guard in a Qwen2.5-centered stress-and-robustness suite using WikiText-103, with Qwen2.5-7B as the empirical anchor, model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, learning-rate stress tests, gradient-clipping baselines, and a no-LoRA TinyLlama-1B full-parameter sanity check. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7% improvement, while reducing end-to-end time from 392.54s to 357.02s, a 1.10x speedup. Under stronger learning-rate stress, AdamW degrades to 1885.24 final perplexity at LR=3e-3 and 659.76 at LR=1e-3, whereas LBW-Guard remains trainable at 11.57 and 10.33, respectively. Gradient-clipping baselines do not reproduce this effect. These results support a scoped systems conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer. LBW-Guard provides evidence that bounded runtime control can preserve productive compute under stress while remaining distinct from optimizer replacement and local gradient suppression.
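The headline figures for the 7B reference setting follow directly from the raw numbers quoted in the abstract. The short check below reproduces them; the helper names are illustrative.

```python
def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before

def speedup(time_before, time_after):
    """Wall-clock speedup factor."""
    return time_before / time_after

ppl_gain = relative_reduction(13.21, 10.74)
time_gain = speedup(392.54, 357.02)
print(f"perplexity improvement: {ppl_gain:.1%}")  # perplexity improvement: 18.7%
print(f"speedup: {time_gain:.2f}x")               # speedup: 1.10x
```

Note that the 1.10x speedup corresponds to roughly a 9% reduction in end-to-end time, so the two ways of stating the efficiency gain should not be confused.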

2605.19004 2026-05-20 cs.CV cs.LG cs.RO

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

EgoTraj: 面向多模态预测的真实世界第一人称人类轨迹数据集

Ahmad Yehia, Abduallah Mohamed, Tianyi Wang, Jiseop Byeon, Kun Qian, Junfeng Jiao, Christian Claudel

发表机构 * Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin(土木、建筑与环境工程系,德克萨斯大学奥斯汀分校) Meta Reality Labs(Meta现实实验室) School of Architecture, The University of Texas at Austin(建筑学院,德克萨斯大学奥斯汀分校)

AI总结 本文提出面向多模态预测的EgoTraj数据集,包含75段在真实城市环境中采集的人类导航序列,提供同步的RGB视频和真值数据(6自由度头部姿态、逐帧3D视线向量和场景标注),并展示了该数据集在基于AR的感知、导航和辅助系统中的应用价值。

Comments 21 pages, 14 figures. Project page: https://github.com/yehiahmad/EgoTraj

详情
AI中文摘要

从第一人称视角准确预测人类轨迹,在人形机器人、可穿戴传感系统和辅助导航等应用中起着核心作用。然而,由于在真实环境中采集的第一人称轨迹数据集十分稀缺,这一方向的进展仍然有限。为此,我们提出EgoTraj,一个使用Meta Quest Pro(MQPro)录制的第一人称多模态开放数据集。EgoTraj包含75段人类导航序列,由多名MQPro佩戴者在真实城市环境中采集。每段记录提供同步的RGB视频以及真值数据,包括连续且时间同步的6自由度头部姿态、逐帧3D视线向量和场景标注。据我们所知,EgoTraj与典型的第一人称轨迹数据集的不同之处在于,它记录了参与者在多样化城市路线上进行的长时程、自主导航,且参与者具有广泛的多样性。为展示该数据集的潜力,我们对几种最先进的第一人称轨迹预测方法进行了基准测试,并通过消融研究分析视线、场景和运动线索的贡献。结果凸显了EgoTraj在基于AR的感知、导航和辅助系统中的实用价值。EgoTraj数据集、代码和EgoViz Dashboard已在https://github.com/yehiahmad/EgoTraj公开。

英文摘要

Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.

2605.18999 2026-05-20 cs.LG

Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

Yury Demidovich, Abhishek Chakraborty, Grigory Malinovsky, Angelia Nedić, Peter Richtárik

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) Arizona State University(亚利桑那州立大学)

AI总结 本文研究了一般范数几何下Muon优化器的自适应步长缩放规则,提出Distance-Adaptive Muon、Scale-Calibrated Muon和Distance-Free Muon三种互补算法,给出了平稳性保证和目标间隙界,并在实验中降低了对手动尺度调节的敏感性。

详情
AI中文摘要

Muon及相关的归一化优化器将更新方向的选择与步长尺度的选择解耦,但其实际性能仍对归一化步长的尺度敏感。我们研究一般范数几何下Muon的自适应缩放规则,并提出三种互补的算法。对于光滑非凸目标,我们提出Distance-Adaptive Muon,其信任域半径根据轨迹已探索的半径设定,并在轨迹有界假设下证明了平稳性保证。随后我们转向星凸目标,这是一类易于处理的模型,刻画了常用于分析深度神经网络经验损失景观的良好全局几何,在该类目标上可以得到目标间隙保证。在此设置下,我们首先提出Scale-Calibrated Muon,它保留Muon的指数移动平均,但根据由当前梯度和动量计算出的局部下降证书来设定步长。对于该方法,我们在初始子水平集有界的假设下证明了最后迭代点的O(1/T)目标间隙界,其中相应的半径参数只出现在分析中,而不出现在算法中。最后,我们提出Distance-Free Muon,这是一种重新中心化的信任域方法,它使用标量距离证书和优超化(majorized)的一维搜索来选择信任域半径,而无需知道从初始点到全局最小点的未知距离。在Transformer语言建模(GPT-124M/WikiText-103)和图像分类(ViT-Tiny/CIFAR-100)上的实验表明,所提出的自适应缩放规则降低了对手动尺度调节的敏感性,并在所测试的预算下达到或超过经过调优的固定尺度Muon基线。

英文摘要

Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon's exponential moving average but sets the step length from a local descent certificate computed from the current gradient and momentum. For this method, we prove a last-iterate O(1/T) objective-gap bound under a bounded initial sublevel-set assumption, where the corresponding radius parameter appears only in the analysis and not in the algorithm. Finally, we develop Distance-Free Muon, a recentered trust-region method that uses a scalar distance certificate and a majorized one-dimensional search to select the trust-region radius without requiring the unknown distance from the initialization to a global minimizer. Experiments on Transformer language modeling (GPT-124M/WikiText-103) and image classification (ViT-Tiny/CIFAR-100) show that the proposed adaptive scaling rules reduce sensitivity to manual scale tuning and match or improve tuned fixed-scale Muon baselines under the tested budgets.

2605.18984 2026-05-20 cs.CV

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Artifact-Bench: 评估MLLMs检测和评估AI生成视频伪影的能力

Yuqi Tang, Yang Shi, Zhuoran Zhang, Qixun Wang, Xuehai Bai, Yue Ding, Ruizhe Chen, Bohan Zeng, Xinlong Chen, Xuanyu Zhu, Bozhou Li, Yuran Wang, Yifan Dai, Chengzhuo Tong, Xinyu Liu, Yiyan Ji, Yujie Wei, Yuhao Dong, Shilin Yan, Fengxiang Wang, Yi-Fan Zhang, Haotian Wang, Yuanxing Zhang, Pengfei Wan

发表机构 * Kling Team(Kling团队) Shanghai AI Lab(上海人工智能实验室)

AI总结 本文提出Artifact-Bench,一个用于评估多模态大语言模型在检测和分析AI生成视频伪影能力的基准,揭示了现有模型在伪影感知和推理上的显著局限性。

详情
AI中文摘要

近年来,视频生成模型在提高AI生成视频的真实感方面取得了显著进步,但其输出仍存在时间不一致、结构失真和语义不连贯等伪影。尽管多模态大语言模型(MLLMs)在视觉理解方面表现出色,但其感知和推理这些伪影的能力仍不明确。现有基准缺乏对伪影感知和细粒度诊断推理的系统评估,尤其是在超越逼真内容的多样化AI生成视频领域。为解决这一差距,我们引入Artifact-Bench,一个全面的基准,用于评估MLLMs在AI生成视频伪影检测和分析上的能力。我们首先建立了涵盖逼真、动画和CG风格视频的三级层次化伪影分类法。基于此分类法,Artifact-Bench定义了三个互补任务:真实与AI生成视频分类、成对真实感比较和细粒度伪影识别。在19种领先MLLMs上的实验揭示了伪影感知和推理的显著局限性,许多模型在挑战性设置中接近随机甚至低于随机表现。我们进一步观察到MLLM判断与人类感知偏好之间存在显著不一致,突显了其作为AI生成视频真实感一般评估者的有限可靠性。

英文摘要

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

2605.18979 2026-05-20 cs.LG

TabQL: In-Context Q-Learning with Tabular Foundation Models

TabQL: 基于表格基础模型的上下文Q学习

Qisai Liu, Zhanhong Jiang, Timilehin Ayanlade, Ashutosh Kumar Nirala, Yang Li, Aditya Balu, Soumik Sarkar

发表机构 * Department of Mechanical Engineering(机械工程系) Iowa State University(爱荷华州立大学) Translational AI Center(转化人工智能中心) Department of Computer Science(计算机科学系)

AI总结 本文提出TabQL,一种基于表格基础模型的强化学习框架,通过上下文学习能力替代传统参数Q网络,提升Q值表示的适应性与效率。

详情
AI中文摘要

我们提出表格Q学习(TabQL),一种强化学习框架,它用具备上下文学习能力的表格基础模型取代深度Q学习(DQN)中传统的参数化Q网络。其核心思想是:由一个序列到序列基础模型在状态-动作-Q值元组的表格化表示上运算来表示Q值,从而以近期经验为条件,利用有限的在线交互实现快速适应。TabQL与经典DQN的不同之处在于:(i)通过上下文更新进行零样本或少样本Q值推断;(ii)使用标准DQN的预热阶段来引导生成高质量的上下文。特别地,为了提升上下文质量,新的转移样本通过执行TabQL输出的动作生成,并配以DQN预测的Q值。我们形式化了TabQL,在温和假设下分析了其收敛性和样本复杂度,并证明带上下文学习的TabQL在原始Q学习与DQN之间插值。我们的分析表明,TabQL通过上下文学习摊销Bellman更新,从而取得比DQN更高的效率。在多个基准上的大量数值实验展示了所提TabQL的有效性。

英文摘要

We propose Tabular Q-Learning (TabQL), a reinforcement learning framework that replaces the conventional parametric Q-network in Deep Q-Learning (DQN) with a tabular foundation model endowed with in-context learning capabilities. The key idea is to represent Q-values through a sequence-to-sequence foundation model operating over a tabularized representation of state-action-Q-value tuples, enabling rapid adaptation from limited online interaction by conditioning on recent experience. TabQL departs from classical DQN by leveraging (i) zero- or few-shot Q-value inference via in-context updates, and (ii) a warm-up phase using standard DQN to bootstrap high-quality context. Particularly, to enhance the context quality, new transitions are generated by executing actions output by TabQL with predicted Q values from DQN. We formalize TabQL, analyze its convergence and sample complexity under mild assumptions, and show that TabQL interpolates between vanilla Q-learning and DQN with in-context learning. Our analysis demonstrates that TabQL achieves improved efficiency compared to DQN by amortizing Bellman updates through in-context learning. Extensive numerical experiments with several benchmarks showcase the effectiveness and efficacy of the proposed TabQL.
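As a point of reference for the Bellman updates that TabQL is said to amortize, here is a minimal sketch of the vanilla Q-learning update, together with a helper that lays the learned values out as (state, action, Q-value) rows. This is the classical baseline, not TabQL itself; how the foundation model consumes such rows is not specified here, and the names `q_learning_update` and `context_rows` are illustrative.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, lr=0.5, gamma=0.9):
    """One temporal-difference step toward the Bellman target."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += lr * (target - Q[(s, a)])
    return Q[(s, a)]

def context_rows(Q):
    """Tabularize the current estimates as sorted (state, action, q) tuples."""
    return sorted((s, a, q) for (s, a), q in Q.items())

Q = defaultdict(float)          # unseen state-action pairs default to 0.0
actions = ["left", "right"]
q_learning_update(Q, "s0", "right", 1.0, "s1", actions)
q_learning_update(Q, "s0", "right", 1.0, "s1", actions)
print(context_rows(Q))
```

Each call above performs one explicit Bellman backup; the abstract's claim is that conditioning a pretrained model on rows like these can replace many such per-sample updates.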

2605.18974 2026-05-20 cs.CV cs.AI cs.MM

Harnessing Self-Supervised Features for Art Classification

利用自监督特征进行艺术分类

Federico Melis, Davide Bilardello, Emanuele Prato, Evelyn Turri, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷吉奥艾米利亚大学)

AI总结 本文研究了监督和自监督主干作为特征提取器在艺术分类和检索中的有效性,特别是绘画,通过DINO家族和CLIP模型的实验评估,证明自监督主干在艺术分类中能带来一致的性能提升,并为现实应用如虚拟现实中的博物馆导航提供了见解。

Comments IRCDL 2026

详情
AI中文摘要

对艺术品进行分类是一项具有挑战性的任务,因为精细细节和抽象特征的复杂相互作用决定了艺术作品的风格或流派。本文系统地研究了监督和自监督主干作为特征提取器在艺术品分类和检索中的有效性,特别是绘画。我们通过DINO家族和CLIP模型进行了广泛的实验评估,评估了多种分类策略和特征表示。我们的结果表明,使用自监督主干在艺术品分类性能上产生了持续的改进。此外,我们的工作为现实应用中的分类和检索模块提供了见解,例如支持博物馆导航的虚拟现实(VR)应用。

英文摘要

Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.

2605.18971 2026-05-20 cs.LG cs.AI

Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality

塑造先验:合成任务分布如何决定表格基础模型的质量

Mohamed Bouadi, Nassim Bouarour, Varun Kulkarni, Shivam Dubey, Aditya Tanna, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文研究了合成任务分布对表格基础模型质量的影响,提出O'Prior方法,通过四个耦合组件构建更真实的先验,提升了下游任务的准确性和鲁棒性。

详情
AI中文摘要

表格基础模型的质量由什么决定?与语言或视觉模型不同,表格基础模型的归纳偏置几乎完全来自合成的预训练分布,但人们对这些分布的设计仍知之甚少。标准的合成先验过于规整:它们遗漏了决定部署鲁棒性的不规则性和失效模式。我们提出O'Prior,一种围绕四个耦合组件构建的组合式真实性先验:一个覆盖多种函数族的分层SCM元生成器;一个涵盖异质边际分布、缺失值和目标变换的模块化真实性引擎;一个注入混杂和支持集-查询集不匹配的显式压力模块;以及一个由课程控制、防止泄漏的生成协议。为了把先验设计单独作为科学变量,我们固定架构、优化器和计算预算,只改变合成任务分布。O'Prior在真实表格基准上带来了下游准确率和鲁棒性的持续且显著的提升,收益集中在以分布不规则性为特征的情形。消融实验证实,机制多样性、真实性组合和偏移感知压力各自独立地做出贡献,其作用不可互换。这些结果表明,合成先验的构建是表格基础模型质量的一个首要且在很大程度上被忽视的决定因素。

英文摘要

What determines the quality of a tabular foundation model? Unlike language or vision, tabular foundation models acquire their inductive biases almost entirely from synthetic pretraining distributions, yet the design of these distributions remains poorly understood. Standard synthetic priors are too well-behaved: they omit the irregularities and failure modes that determine deployment robustness. We introduce O'Prior, a compositional realism prior built around four coupled components: a hierarchical SCM meta-generator spanning diverse functional families; a modular realism engine covering heterogeneous marginals, missingness, and target transforms; an explicit stress module injecting confounding and support-query mismatch; and a curriculum-governed, leakage-safe generation protocol. To isolate prior design as the scientific variable, we hold architecture, optimizer, and compute budget fixed and vary only the synthetic task distribution. O'Prior yields consistent and substantial improvements in downstream accuracy and robustness across real tabular benchmarks, with gains concentrated in regimes characterized by distributional irregularities. Ablations confirm that mechanism diversity, realism composition, and shift-aware stress each contribute independently; their effects are not interchangeable. These results establish synthetic prior construction as a first-order and largely overlooked determinant of tabular foundation model quality.

2605.18956 2026-05-20 cs.CV

MotionMERGE: A Multi-granular Framework for Human Motion Editing, Reasoning, Generation, and Explanation

MotionMERGE: 一种用于人体动作编辑、推理、生成和解释的多粒度框架

Bizhu Wu, Jinheng Xie, Wenting Chen, Zhe Kong, Jianfeng Ren, Linlin Shen, Ruibin Bai, Rong Qu

发表机构 * Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机与软件学院计算机视觉研究所) Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University(深圳大学广东省智能信息处理重点实验室) School of Computer Science, University of Nottingham Ningbo China(宁波诺丁汉大学计算机科学学院) Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电气与计算机工程系) Department of Radiation Oncology, Stanford University(斯坦福大学放射肿瘤学系) Sun Yat-sen University(中山大学) School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院)

AI总结 本文提出MotionMERGE框架,通过细粒度语言引导的动作控制、跨粒度协同预训练和细粒度动作-语言对齐,实现了更精确的动作生成、理解和编辑,并建立了新的细粒度文本驱动动作编辑和动作引导推理基准。

详情
英文摘要

Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can't focus on motion's localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

2605.18933 2026-05-20 cs.LG

A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

对ReLU + RMSNorm块在三元量化下的符号幅度不对称性进行几何分析

Lei Dong

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过符号幅度分解解释了在三元量化下ReLU + RMSNorm块的符号幅度不对称性,揭示了ReLU和RMSNorm在权重扰动中的几何机制,并通过实验验证了这种不对称性在实际模型中的表现。

Comments 53 pages, 2 figures, 21 tables, 7 appendices

详情
AI中文摘要

采用RMSNorm的Pre-norm Transformer能够容忍三元{-1,0,+1}权重量化,损失出人意料地小(Ma等人,2024)。我们通过对权重扰动的符号-幅度分解给出几何解释。在权重为独立同分布高斯的两层ReLU + RMSNorm模型中,当翻转率p→0时,符号翻转产生的横向输出能量是同等Frobenius范数下保持符号的幅度扰动的π/(π-2)≈2.75倍(定理3)。其机制是:ReLU在两类扰动之间造成隐藏空间中的方向不对称,而RMSNorm的横向投影Fréchet导数有选择地暴露了这种不对称。符号量化误差本身是一种保持符号的扰动,其角度对齐满足cos²→2/π(定理4);它在ReLU之后的径向分数(0.365)与ReLU之前的值1-2/π相差在0.4%以内,因此ReLU对三元误差近似透明。2.75倍因子的多层累积没有得到实验支持;与真实模型符号敏感性之间的差距源于离群特征违反了去局域化假设。对于幅度为α的输入维度,单次符号翻转产生的ReLU后能量相对于去局域化的条目被放大R≈nα²倍。在TinyLlama-1.1B上,在线性响应区间(p≤0.5%)内,计数匹配的NLL杠杆稳定在约10×≈nE[α²],与逐条目理论一致;全列NLL比率5.0×落在R_col≤19的范围内(67×的PPL差距反映了度量的非线性)。在第12层测得的离群α(中位数0.024,最大值0.26)证实了重尾集中现象。Bussgang常数2/π、RMSNorm几何和ReLU半空间结构共同解释了Pre-norm模型中的符号-幅度不对称性,而R∝nα²解释了真实模型的偏差。

英文摘要

Pre-norm Transformers with RMSNorm tolerate ternary $\{-1,0,+1\}$ weight quantization with surprisingly small loss (Ma et al., 2024). We give a geometric explanation via sign-magnitude decomposition of weight perturbations. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce $\pi/(\pi-2) \approx 2.75$ times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, as the flip rate $p \to 0$ (Theorem 3). The mechanism: ReLU creates a hidden-space directional asymmetry between the two perturbation types, which RMSNorm's transverse-projection Fréchet derivative selectively exposes. Sign-quantization error is itself a sign-preserving perturbation with angular alignment $\cos^2 \to 2/\pi$ (Theorem 4); its post-ReLU radial fraction ($0.365$) matches the pre-ReLU value $1-2/\pi$ within $0.4\%$, so ReLU is approximately transparent to ternary error. Multi-layer compounding of the $2.75\times$ factor is not experimentally supported; the gap to real-model sign sensitivity arises from outlier features violating delocalization. For an input dimension with amplitude $\alpha$, a single sign-flip produces post-ReLU energy amplified by $R \approx n\alpha^2$ relative to a delocalized entry. On TinyLlama-1.1B, at linear response ($p \leq 0.5\%$), count-matched NLL leverage stabilizes at $\sim 10\times \approx n\mathbb{E}[\alpha^2]$, matching the per-entry theory; the all-column NLL ratio of $5.0\times$ falls within $R_{\mathrm{col}} \leq 19$ ($67\times$ PPL gap reflects metric nonlinearity). Measured outlier $\alpha$ at layer 12 (median $0.024$, max $0.26$) confirms heavy-tailed concentration. The Bussgang constant $2/\pi$, RMSNorm geometry, and ReLU half-space structure together explain sign-magnitude asymmetry in pre-norm models, with $R \propto n\alpha^2$ accounting for real-model deviations.
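The constant $2/\pi$ quoted in this abstract can be checked numerically with a few lines of Monte Carlo. The sketch assumes that the angular alignment in question is the squared cosine between an i.i.d. Gaussian weight vector and its sign vector; that reading, and the function name, are assumptions rather than the paper's definitions.

```python
import math
import random

def sign_alignment_cos2(n, seed=0):
    """Squared cosine between a Gaussian vector w and sign(w)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    dot = sum(abs(x) for x in w)      # <w, sign(w)> = sum of |w_i|
    norm_w2 = sum(x * x for x in w)   # ||w||^2
    return dot * dot / (norm_w2 * n)  # ||sign(w)||^2 = n

estimate = sign_alignment_cos2(200_000)
print(f"empirical cos^2: {estimate:.4f}, 2/pi: {2 / math.pi:.4f}")
print(f"pi/(pi-2): {math.pi / (math.pi - 2):.2f}")  # pi/(pi-2): 2.75
```

The limit follows from $(\mathbb{E}|w|)^2 / \mathbb{E}[w^2] = 2/\pi$ for a zero-mean Gaussian, which is also why the complementary radial fraction $1 - 2/\pi$ is about $0.363$.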

2605.18921 2026-05-20 cs.RO

Geo-Data-Driven HD Map Generation Workflow with Integrated Reference-Free Constraint-Based Verification

基于地理数据的高精地图生成工作流与集成的无参考约束验证

Ruidi He, Vaibhav Tiwari, Mohanad Al-Ghobari, Meng Zhang, Andreas Rausch

Affiliations * Institute for Software and Systems Engineering

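AI summary This paper proposes a geo-data-driven workflow for HD map generation with integrated reference-free, constraint-based verification. It reduces dependence on high-precision reference data and makes the approach more practical where specialised survey data or independent reference maps are unavailable.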

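Details
AI abstract (translated from Chinese)

HD maps are core artifacts for automated driving systems, but generating them usually relies on sensor-intensive mobile mapping campaigns, and quality assessment often depends on high-precision reference data. These dependencies make HD map engineering costly and hard to apply where specialised measurement data or independently measured reference maps are unavailable. This paper presents an engineering-oriented, geo-data-driven workflow for HD map generation with integrated representation-level verification. The workflow takes openly available geo-engineering datasets as its primary input and, through explicit intermediate representations and processing stages, transforms them into lane-level HD map representations of existing road environments. To assess the generated representations without external reference maps, the workflow integrates executable constraint-based verification into the engineering process. The selected constraints are derived from specifications relevant to automated driving and from road-design guidelines. They are evaluated directly on the generated lanelet-based representation to detect geometric, topological, and elevation-related inconsistencies. The workflow is evaluated on real-world shapefile-based road-network data from four cities in Lower Saxony, Germany, together with controlled defect-injection scenarios. The real-world evaluation shows that the generated map representations satisfy the selected constraints in the evaluated scenarios, while the defect-injection study demonstrates complete detection of the considered defect types with no observed false positives. The results indicate that geo-data-driven HD map generation with integrated executable verification can provide a modular and inspectable complement to sensor-intensive mapping workflows when sensing and reference data are less available.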

English abstract

High-definition (HD) maps are core artifacts for automated driving systems, but their generation commonly relies on sensor-intensive mobile mapping campaigns, while quality assessment often depends on high-precision reference data. These dependencies make HD map engineering costly and difficult to apply in settings where specialised measurement data or independently measured reference maps are unavailable. This paper presents an engineering-oriented geo-data-driven workflow for HD map generation with integrated representation-level verification. The workflow uses openly available geo-engineering datasets as the primary input source and transforms them into lane-level HD map representations of existing road environments through explicit intermediate representations and processing stages. To assess the generated representations without external reference maps, the workflow integrates executable constraint-based verification into the engineering process. Selected constraints are derived from specifications relevant to automated driving and road-design guidelines. They are evaluated directly on the generated lanelet-based representation to detect geometric, topological, and elevation-related inconsistencies. The workflow is evaluated using real-world shapefile-based road-network data from four cities in Lower Saxony, Germany, and controlled defect-injection scenarios. The real-world evaluation shows that the generated map representations satisfy the selected constraints in the evaluated scenarios, while the defect-injection study demonstrates complete detection of the considered defect types without observed false positives. The results indicate that geo-data-driven HD map generation with integrated executable verification can provide a modular and inspectable complement to sensor-intensive mapping workflows under reduced sensing and reference-data availability.
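To make the idea of executable, reference-free constraints concrete, here is a minimal sketch. Everything in it is hypothetical: the `Lanelet` record, the three thresholds, and the four constraint functions are illustrative stand-ins, not the paper's data model or rule set. It runs geometric, topological, and elevation checks on a tiny map and then repeats them on copies with one injected defect each.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Lanelet:
    """Hypothetical, heavily simplified lanelet record."""
    id: int
    width_m: float
    length_m: float
    start_elev_m: float
    end_elev_m: float
    successors: Tuple[int, ...] = ()


LaneletMap = Dict[int, Lanelet]

# Illustrative thresholds only; not taken from any guideline.
MIN_WIDTH_M = 2.5
MAX_GRADE = 0.12
ELEV_TOL_M = 0.05


def min_width(ll: Lanelet, m: LaneletMap) -> bool:
    return ll.width_m >= MIN_WIDTH_M


def successors_exist(ll: Lanelet, m: LaneletMap) -> bool:
    return all(s in m for s in ll.successors)


def max_grade(ll: Lanelet, m: LaneletMap) -> bool:
    return abs(ll.end_elev_m - ll.start_elev_m) <= MAX_GRADE * ll.length_m


def elevation_continuity(ll: Lanelet, m: LaneletMap) -> bool:
    # Missing successors are reported by successors_exist, so skip them here.
    return all(abs(ll.end_elev_m - m[s].start_elev_m) <= ELEV_TOL_M
               for s in ll.successors if s in m)


CONSTRAINTS: Dict[str, Callable[[Lanelet, LaneletMap], bool]] = {
    "min_width": min_width,
    "successors_exist": successors_exist,
    "max_grade": max_grade,
    "elevation_continuity": elevation_continuity,
}


def verify(m: LaneletMap) -> List[Tuple[int, str]]:
    """Return (lanelet id, constraint name) for every violated constraint."""
    return [(ll.id, name)
            for ll in m.values()
            for name, check in CONSTRAINTS.items()
            if not check(ll, m)]


clean: LaneletMap = {
    1: Lanelet(1, 3.5, 50.0, 100.0, 101.0, (2,)),
    2: Lanelet(2, 3.5, 40.0, 101.0, 101.5, (3,)),
    3: Lanelet(3, 3.25, 60.0, 101.5, 100.5),
}

# Defect injection: copy the clean map and corrupt one attribute per scenario.
narrow = dict(clean); narrow[2] = replace(clean[2], width_m=1.0)
dangling = dict(clean); dangling[3] = replace(clean[3], successors=(99,))
step = dict(clean); step[2] = replace(clean[2], start_elev_m=103.0)

for label, scenario in [("clean", clean), ("narrow", narrow),
                        ("dangling", dangling), ("step", step)]:
    print(label, verify(scenario))
```

Because each constraint is a plain predicate over the map itself, no external reference map is needed, and a new rule is added by registering one more function in `CONSTRAINTS`.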

2605.18905 2026-05-20 cs.LG cs.AI cs.NA cs.NE math.NA

Stability and Discretization Error of State Space Model Neural Operators

状态空间模型神经算子的稳定性与离散化误差

Abderrahim Bendahi, Adrien Fradin, Johan Peralez, Julie Digne, Madiha Nadri

Affiliations * École polytechnique; Université Claude Bernard Lyon 1; CNRS; LAGEPP UMR 5007 Université Lyon 1; INSA Lyon; LIRIS

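AI summary This paper studies the stability and discretization error of state space model neural operators. It establishes theoretical guarantees on the discretization error and stability of neural operator approximation schemes, gives a new discretization error theorem for SS-NOs and FNOs, and experimentally confirms the robustness of SS-NOs across resolutions.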

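Details
AI abstract (translated from Chinese)

Neural operators have emerged as a powerful, discretization-invariant framework for solving partial differential equations (PDEs). Established approaches such as the Deep Operator Network (DeepONet) achieve universal approximation of operators, and architectures such as Fourier Neural Operators (FNOs) have shown algebraic convergence rates, yet a precise theoretical link between the continuous theory and its discrete numerical implementation remains a challenge. In particular, the relationship between the continuous formulation and discrete numerical stability has not been fully explored. In this paper, we fill this gap by establishing theoretical guarantees on the discretization error and stability of neural operator approximation schemes. We prove analytical bounds that link solution regularity to input discretization, giving a formal quantification of neural operator accuracy under realistic numerical constraints. We derive these bounds for the specific cases of State Space Model-based Neural Operators (SS-NOs) and FNOs, which yields a new discretization error theorem for these models. In addition, through an input-to-state stability (ISS) analysis, we formally assess how discretization affects the stability results obtained for SS-NOs in the continuous domain. Our experiments on 1D and 2D benchmarks confirm the theoretical bounds and demonstrate the robustness of SS-NOs across resolutions.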

English abstract

Neural operators have emerged as a powerful, discretization-invariant framework for solving partial differential equations (PDEs). Although established approaches like the Deep Operator Network (DeepONet) have successfully achieved universal approximation for operators, and architectures such as Fourier Neural Operators (FNOs) have shown algebraic convergence rates, a precise theoretical connection between the continuous theory and its discrete numerical implementation remains a challenge. Specifically, the relationship between the continuous formulation and the discrete numerical stability has yet to be fully explored. In this paper, we address this gap by establishing theoretical guarantees for the discretization error and stability of neural operator approximation schemes. We prove analytical bounds that link solution regularity to input discretization, providing a formal quantification of neural operator accuracy under real-world numerical constraints. We specialize these bounds to the cases of State Space Model-based Neural Operators (SS-NOs) and FNOs, thus providing a new discretization error theorem for these models. Additionally, through an input-to-state stability (ISS) analysis, we formally assess the impact of discretization on the stability results obtained for SS-NOs in the continuous domain. Our empirical experiments on 1D and 2D benchmarks validate our theoretical bounds and show the robustness of SS-NOs under varying resolutions.
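A toy calculation shows how discretization error depends on input resolution. The sketch below is not the paper's bound or architecture: it assumes a scalar linear state space model x'(t) = a x(t) + b u(t) with u(t) = sin(t) and x(0) = 0, discretized with a zero-order hold (ZOH) on the input, and compares the result with the closed-form solution. The function name `zoh_ssm_error` and the parameter values are illustrative.

```python
import math


def zoh_ssm_error(a: float, b: float, t_end: float, n_steps: int) -> float:
    """Max error on the grid of a ZOH-discretized scalar SSM (requires a != 0).

    Continuous model: x' = a*x + b*u, u(t) = sin(t), x(0) = 0.
    ZOH update: x_{k+1} = a_bar * x_k + b_bar * u(k*dt), with
    a_bar = exp(a*dt) and b_bar = (a_bar - 1) / a * b.
    """
    dt = t_end / n_steps
    a_bar = math.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b
    x, worst = 0.0, 0.0
    for k in range(n_steps):
        x = a_bar * x + b_bar * math.sin(k * dt)
        t = (k + 1) * dt
        # Closed-form solution of the continuous model at time t.
        exact = b * (-a * math.sin(t) - math.cos(t) + math.exp(a * t)) / (1.0 + a * a)
        worst = max(worst, abs(x - exact))
    return worst


resolutions = [50, 100, 200, 400]
errors = [zoh_ssm_error(a=-1.0, b=1.0, t_end=5.0, n_steps=n) for n in resolutions]
for n, e in zip(resolutions, errors):
    print(f"n_steps = {n:4d}   max error = {e:.5f}")
```

Holding the input constant over each step makes the scheme first-order accurate for a smooth input, so the error should shrink roughly in proportion to the step size. With a < 0 the discrete state also stays bounded for bounded input, since |a_bar| < 1, which is the scalar version of the stability property the abstract refers to.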

2605.18904 2026-05-20 cs.LG cs.AI cs.CL

Dynamic Model Merging Made Slim

动态模型合并的轻量级方法

Guodong Du, Wanyu Lin

Affiliations * The Hong Kong Polytechnic University

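AI summary This paper proposes DiDi-Merging, which balances shared and expert parameters through differentiable rank allocation to make dynamic model merging more efficient, and is substantially more compact in parameter count than existing methods.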

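Details
AI abstract (translated from Chinese)

Model merging makes it possible to reuse fine-tuned models without joint training or access to the original data. Dynamic merging adds flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model plus small experts or allocate too much capacity to the experts, which leads to a suboptimal accuracy-efficiency trade-off. We therefore propose DiDi-Merging, a slim dynamic merging framework that uses differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step that recovers task fidelity, DiDi-Merging matches existing dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, making it substantially more compact than methods that require more than 2x storage. DiDi-Merging applies to vision, language, and multimodal tasks.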

English abstract

Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy-efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring > 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.
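The storage multipliers in this abstract come down to rank bookkeeping, which the sketch below illustrates. It is not the DiDi-Merging algorithm. It assumes, purely for illustration, one dense base matrix plus a low-rank shared module and one low-rank module per expert, and it assumes the rank is relaxed with sigmoid gates so that it becomes a differentiable quantity. The names `soft_rank` and `storage_ratio` and the example sizes are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_rank(gate_logits: Sequence[float]) -> float:
    """Differentiable surrogate for rank: the sum of per-component gates.

    Assumption: each rank-1 component has a learnable logit; a gate near 1
    keeps the component and a gate near 0 drops it.
    """
    return sum(sigmoid(g) for g in gate_logits)


def storage_ratio(d_out: int, d_in: int, shared_rank: int,
                  expert_ranks: Sequence[int]) -> float:
    """Parameters stored, relative to one dense d_out x d_in matrix.

    Assumed layout: one dense matrix, plus rank-r factor pairs that each
    cost r * (d_out + d_in) parameters.
    """
    dense = d_out * d_in
    low_rank = (d_out + d_in) * (shared_rank + sum(expert_ranks))
    return (dense + low_rank) / dense


# Example: a 1024 x 1024 layer, 8 experts of rank 8, shared module of rank 59.
ratio = storage_ratio(1024, 1024, shared_rank=59, expert_ranks=[8] * 8)
print(f"storage ratio = {ratio:.4f}")

# Three confidently kept components and two confidently dropped ones.
print(f"soft rank     = {soft_rank([6.0, 6.0, 6.0, -6.0, -6.0]):.3f}")
```

Under this assumed layout, moving rank between the shared module and the experts leaves the total unchanged, so the budget constrains only the sum of ranks. That is the trade-off a differentiable allocation would tune.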

2605.18903 2026-05-20 cs.LG cs.CV

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

推理可移植性:引导MLLMs在RLVR时代的持续学习

Qiuhe Hong, Yuyang Liu, Shuo Yang, Tiantian Peng, Fei Zhu, Yonghong Tian

Affiliations * Shenzhen Graduate School of Peking University; Centre for Artificial Intelligence and Robotics, HKISI, CAS; Peng Cheng Laboratory

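AI summary This paper introduces Reasoning Portability (RP), a sample-level measure that makes it possible to impose reasoning-level constraints in continual learning and improves how multimodal large language models adapt under RLVR. Experiments show that RDB-CL outperforms baselines on Last accuracy.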

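Details
AI abstract (translated from Chinese)

Vision-Language Models in Continual Learning (VLM-CL) aim to keep adapting to new multimodal tasks while retaining prior knowledge. The emerging paradigm that combines Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. With advances in reasoning capability, constraints can now be imposed at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and show empirically that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples allows exploration of new reasoning pathways. Experiments show that RDB-CL outperforms baselines on Last accuracy, improving on the vanilla RLVR baseline by +12.0%.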

English abstract

Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy's behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
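The per-sample modulation described here can be sketched in a few lines. This is an illustration under stated assumptions, not the RDB-CL implementation: RP is assumed to be a score in [0, 1], the mapping from RP to the KL coefficient is assumed to be linear between two made-up bounds `beta_min` and `beta_max`, and policies are represented as small categorical distributions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for categorical distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_coefficient(rp: float, beta_min: float = 0.01, beta_max: float = 0.2) -> float:
    """Map a portability score to a KL weight (linear mapping is an assumption).

    High RP gives a tight anchor (large weight); low RP gives a relaxed one.
    """
    rp = min(1.0, max(0.0, rp))
    return beta_min + (beta_max - beta_min) * rp


def regularized_objective(reward: float, rp: float,
                          policy: Sequence[float],
                          reference: Sequence[float]) -> float:
    """Per-sample objective: verifiable reward minus the RP-weighted KL penalty."""
    return reward - kl_coefficient(rp) * kl_divergence(policy, reference)


reference = [0.7, 0.2, 0.1]  # previous policy on some sample
drifted = [0.2, 0.3, 0.5]    # current policy after moving away from it

high_rp = regularized_objective(1.0, rp=0.9, policy=drifted, reference=reference)
low_rp = regularized_objective(1.0, rp=0.1, policy=drifted, reference=reference)
print(f"objective with high RP = {high_rp:.4f}")
print(f"objective with low RP  = {low_rp:.4f}")
```

For the same amount of drift from the previous policy, the high-RP sample is penalized more, which discourages overwriting reasoning that still transfers. The low-RP sample is left freer to explore.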