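arXivDaily daily academic digest: synchronized with the full arXiv data set, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.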

2506.08618 2026-05-20 cs.LG cond-mat.mes-hall cond-mat.other cs.AI cs.CV

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

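AI summary: This paper presents HSG-12M, a dataset of 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs for studying the intricate geometries that arise in non-Hermitian quantum physics, filling the gap left by existing graph benchmarks in learning spatial multi-edges.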

Comments Accepted to ICLR 2026, OpenReview: [https://openreview.net/forum?id=YxuKCME576]. 49 pages, 13 figures, 14 tables. Code & pipeline: [https://github.com/sarinstein-yan/Poly2Graph] Dataset: [https://github.com/sarinstein-yan/HSG-12M] Dataset released under CC BY 4.0.

Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane -- termed as Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs -- graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics, new opportunities in geometry-aware graph learning and beyond.
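To make the term "spatial multigraph" concrete, the following minimal sketch stores two geometrically distinct edges between the same pair of nodes and keeps both. The class and method names (`SpatialMultigraph`, `add_edge`, `edge_lengths`) are hypothetical and are not taken from Poly2Graph or the HSG-12M loaders; the coordinates are invented and stand in for points on the complex energy plane.

```python
import math
from collections import defaultdict


class SpatialMultigraph:
    """Toy spatial multigraph: nodes have 2-D positions, parallel edges keep their own geometry."""

    def __init__(self):
        self.pos = {}                    # node id -> (x, y)
        self.edges = defaultdict(list)   # frozenset({u, v}) -> list of polylines

    def add_node(self, node, xy):
        self.pos[node] = xy

    def add_edge(self, u, v, interior=()):
        # Each call adds a separate edge; `interior` holds the points between u and v.
        polyline = [self.pos[u], *interior, self.pos[v]]
        self.edges[frozenset((u, v))].append(polyline)

    def edge_lengths(self, u, v):
        # Arc length of every parallel edge between u and v.
        return [
            sum(math.dist(p, q) for p, q in zip(line, line[1:]))
            for line in self.edges[frozenset((u, v))]
        ]


g = SpatialMultigraph()
g.add_node("a", (0.0, 0.0))
g.add_node("b", (2.0, 0.0))
g.add_edge("a", "b")                         # straight segment
g.add_edge("a", "b", interior=[(1.0, 1.0)])  # detour through (1, 1)
lengths = g.edge_lengths("a", "b")
```

A simple-graph representation would collapse these two trajectories into one edge; keeping one polyline per edge is what preserves the geometric information the abstract refers to.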

2506.05317 2026-05-20 cs.CV

ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta

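AI summary: This paper proposes ProJo4D, a progressive joint optimization framework for estimating inverse physical parameters from sparse views. By gradually expanding the set of jointly optimized parameters, it improves the accuracy of 4D future state prediction and physical parameter estimation, with up to a 10x improvement in geometric accuracy.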

Comments TMLR 2026

Abstract

Neural rendering has advanced significantly in 3D reconstruction and novel view synthesis, and integrating physics into these frameworks opens new applications such as physically accurate digital twins for robotics and XR. However, the inverse problem of estimating physical parameters from visual observations remains challenging. Existing physics-aware neural rendering methods typically require dense multi-view videos, making them impractical for scalable, real-world deployment. Under sparse-view settings, the sequential optimization strategies employed by current approaches suffer from severe error accumulation: inaccuracies in initial 3D reconstruction propagate to subsequent stages, degrading physical state and material parameter estimates. On the other hand, simultaneous optimization of all parameters fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually expands the set of jointly optimized parameters. This design enables physics-informed gradients to refine geometry while avoiding the instability of direct joint optimization over all parameters. Evaluations on synthetic and real-world datasets demonstrate that ProJo4D substantially outperforms prior work in 4D future state prediction and physical parameter estimation, achieving up to 10x improvement in geometric accuracy while maintaining computational efficiency. Please visit the project webpage: https://daniel03c1.github.io/ProJo4D/

2506.01529 2026-05-20 cs.LG

Learning Abstract World Models with a Group-Structured Latent Space

Thomas Delliaux, Nguyen-Khanh Vu, Vincent François-Lavet, Elise van der Pol, Emmanuel Rachelson

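AI summary: This work imposes geometric priors on the low-dimensional representation manifold to improve the learning of abstract models of Markov decision processes, which improves generalization from limited data and leads to more effective learning on downstream reinforcement learning tasks in environments with rotational and translational features.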

Comments 20 pages, 18 figures

Abstract

Learning meaningful abstract models of Markov Decision Processes (MDPs) is crucial for improving generalization from limited data. In this work, we show how geometric priors can be imposed on the low-dimensional representation manifold of a learned transition model. We incorporate known symmetric structures via appropriate choices of the latent space and the associated group actions, which encode prior knowledge about invariances in the environment. In addition, our framework allows the embedding of additional unstructured information alongside these symmetries. We show experimentally that this leads to better predictions of the latent transition model than fully unstructured approaches, as well as better learning on downstream RL tasks, in environments with rotational and translational features, including in first-person views of 3D environments. Additionally, our experiments show that this leads to simpler and more disentangled representations. The full code is available on GitHub to ensure reproducibility.
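As a rough illustration of a latent transition defined by a group action, the sketch below assumes a 2-D latent on which actions act as SO(2) rotations. The action names, the angle table `ACTION_ANGLE`, and the function names are hypothetical; the paper's actual latent spaces, group choices, and learned components are not described in the abstract.

```python
import math


def rotate(z, theta):
    """Apply the SO(2) group action: rotate the 2-D latent z by angle theta."""
    x, y = z
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


# Hypothetical mapping from discrete actions to group elements (rotation angles).
ACTION_ANGLE = {"left": math.pi / 2, "right": -math.pi / 2, "stay": 0.0}


def latent_transition(z, action):
    """Predict the next latent state by acting with the group element tied to the action."""
    return rotate(z, ACTION_ANGLE[action])


z0 = (1.0, 0.0)
z1 = latent_transition(z0, "left")
z2 = latent_transition(z1, "right")   # inverse action returns to the start
```

Because the transition is a group action, composing an action with its inverse returns the starting latent by construction, which is the kind of structure an unstructured transition model would have to learn from data.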

2506.01418 2026-05-20 cs.RO cs.CV

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre

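AI summary: This paper proposes SEMNAV, a method that uses semantic segmentation as the main visual input representation of the environment to strengthen a robotic agent's perception and decision-making. By introducing high-level semantic information, it improves the model's generalization to unseen environments; the paper also introduces the SEMNAV dataset for training.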

DOI: 10.1007/s10489-026-07275-1
Journal ref: Applied Intelligence, 2026

Abstract

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

2506.00286 2026-05-20 cs.LG cs.AI math.OC stat.ML

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Oliver Mortensen, Mohammad Sadegh Talebi

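AI summary: This paper studies risk-sensitive reinforcement learning in finite discounted Markov decision processes (MDPs) with recursive entropic risk measures (ERM). It introduces a model-based algorithm, Model-Based ERM Q-Value Iteration (MB-RS-QVI), derives PAC-type sample complexity bounds for value learning and policy learning, and proves that the worst-case sample complexity depends exponentially on |β|/(1-γ), giving the first rigorous sample complexity guarantees for recursive ERM in both the risk-averse and risk-seeking regimes.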

Abstract

We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $β\neq 0$ controls the agent's risk attitude: $β>0$ for risk-averse and $β<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|β|/(1-γ)$, where $γ$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|β|/(1-γ)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.
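The recursive entropic backup can be illustrated with a small value-iteration sketch on a known model. It assumes the reward-based convention ERM_β(X) = -(1/β) log E[exp(-βX)], under which β > 0 is risk-averse and β < 0 is risk-seeking, consistent with the sign usage in the abstract; the paper's exact definition may differ. This is not MB-RS-QVI itself, which would estimate the transition model from generative-model samples, and the two-state MDP is invented.

```python
import math


def erm(values, probs, beta):
    """Entropic risk of a discrete random variable (assumed reward convention)."""
    return -math.log(sum(p * math.exp(-beta * v) for v, p in zip(values, probs))) / beta


def recursive_erm_q_iteration(P, R, gamma, beta, iters=200):
    """Iterate Q(s,a) = R[s][a] + gamma * ERM_beta of V(s'), with V(s') = max_a Q(s',a)."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [
            [R[s][a] + gamma * erm(V, P[s][a], beta) for a in range(n_actions)]
            for s in range(n_states)
        ]
    return Q


# Invented 2-state, 2-action MDP: P[s][a] is a distribution over next states.
P = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
R = [[0.0, 0.0], [1.0, 0.0]]
q_averse = recursive_erm_q_iteration(P, R, gamma=0.9, beta=1.0)
q_seeking = recursive_erm_q_iteration(P, R, gamma=0.9, beta=-1.0)
```

For the uncertain action in state 0, the risk-averse value falls below the risk-neutral value and the risk-seeking value rises above it, while deterministic transitions are unaffected by β.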

2505.23747 2026-05-20 cs.CV cs.AI cs.LG

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

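AI summary: This paper proposes Spatial-MLLM, a framework for visual-based spatial reasoning from purely 2D observations. A dual-encoder architecture and a space-aware frame sampling strategy improve spatial understanding, and experiments show state-of-the-art performance on a range of visual-based spatial tasks.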

Comments 22 pages

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

2505.17726 2026-05-20 cs.CV cs.AI

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim

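AI summary: This paper proposes Slot-MLLM, an object-centric visual tokenization method. Its Slot Attention-based tokenizer encodes local visual details while preserving high-level semantics, improving multimodal large language models on visual content understanding and generation.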

Abstract

Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
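Residual vector quantization, one of the components named in the abstract, can be sketched in a few lines: each stage quantizes whatever the previous stage left unexplained, and the discrete token is the list of chosen indices. The codebooks below are tiny and hand-made for illustration; they are not the Slot-MLLM tokenizer, whose codebooks would be learned.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one."""
    residual, codes = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hand-made codebooks: a coarse stage followed by a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes, residual = rvq_encode([1.25, 1.0], codebooks)
recon = rvq_decode(codes, codebooks)
```

Stacking stages lets a short sequence of small-codebook indices approximate a vector more finely than a single codebook of the same size would.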

2505.12217 2026-05-20 cs.CV

HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models

Aryan Das, Tanishq Rachamalla, Pravendra Singh, Koushik Biswas, Vinay Kumar Verma, Salvador Garcia, Antonio Plaza, Swalpa Kumar Roy

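AI summary: This paper presents the HyperCap dataset, which combines spectral data with pixel-wise textual annotations to improve model performance in remote sensing applications and to serve as a foundational resource for future research.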

Comments Accepted for publication in IEEE Geoscience and Remote Sensing Magazine (GRSM), 2026

DOI: 10.1109/MGRS.2026.3693613

Abstract

We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) benchmarks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field. The code and dataset are available at https://github.com/arya-domain/HyperCap.

2505.04588 2026-05-20 cs.CL

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, Jingren Zhou

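AI summary: This work proposes the ZeroSearch framework, which improves the search capabilities of large language models through simulated search, addressing the uncontrolled document quality and high API costs of real search engines. Experiments show strong performance across models of different parameter sizes.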

Abstract

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
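The curriculum-based rollout strategy can be pictured as a schedule for the probability that the simulated retriever returns a noisy document. The exponential shape and the parameter names below (`p_start`, `p_end`, `base`) are assumptions chosen for illustration; the abstract only states that document quality is degraded incrementally during training.

```python
def noise_probability(step, total_steps, p_start=0.0, p_end=0.5, base=4.0):
    """Probability of serving a noisy document at a given training step.

    Grows slowly at first and faster later; equals p_start at step 0
    and p_end at the final step.
    """
    frac = step / total_steps
    ramp = (base ** frac - 1.0) / (base - 1.0)   # 0 at frac=0, 1 at frac=1
    return p_start + ramp * (p_end - p_start)


# Sample the schedule at five evenly spaced points of a 100-step run.
schedule = [noise_probability(s, 100) for s in range(0, 101, 25)]
```

A schedule of this kind keeps early rollouts mostly clean so the policy first learns the basic search-and-answer format, then exposes it to harder retrieval conditions.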

2504.09188 2026-05-20 cs.RO cs.SY eess.SY

Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators

Yaashia Gautam, Gilberto Briscoe-Martinez, Adhitya Mohan, Nataliya Nechyporenko, Alessandro Roncone, Marco M. Nicotra

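AI summary: This paper presents the Compliant Explicit Reference Governor (CERG), a modular reference management scheme that lets a robot physically interact with its environment under guaranteed conditions. The CERG acts as an intermediate layer between the high-level planner and the low-level controller, enforcing operational constraints and enabling smooth transitions between free motion and contact operations. It ensures safety by limiting the total energy available to the manipulator at the time of contact. In the absence of contact, the CERG does not penalize system performance.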

Comments Updated paper with current contributions and author list, accepted at IFAC World Congress, Busan, 2026

2504.07756 2026-05-20 cs.AI cs.CY

Artificial Intelligence, conceptual metaphors and conceptual engineering: Are AI-based framings of human behaviour and cognition successful?

Warmhold Jan Thomas Mollema, Thomas Wachter

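AI summary: This paper examines how successful it is to apply AI concepts to the domains of human behaviour and cognition, analyses whether these framings are conceptual metaphors or conceptual engineering, and points out the potential challenges they face from ethics and reductionism.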

Abstract

Understanding human behaviour, neuroscience and psychology using concepts from the domain of AI is increasing in popularity. Given the massive integration of AI technologies into our daily lives, AI-related concepts are being used to compare AI systems with human behaviour, brain functions, and cognitive abilities like language acquisition. But scientists and philosophers are also increasingly tempted to take the AI-framing of the human conceptual domain as a literal one. This paper investigates the epistemic and practical success of these 'AI-framings': What does it mean to apply the conceptual constellation of AI to the human conceptual domain? We consider and compare two possible answers: either these examples are conceptual metaphors, or they are attempts at conceptual engineering. Firstly, we argue that when viewed as conceptual metaphors, the AI-framed descriptions risk committing the ''map-territory fallacy''. Secondly, we argue the comparisons also contain a misleading 'double metaphor' because of the metaphorical connection between human psychology and computation at the conceptual foundation of computation. But we also argue that there is a possible semantic catch to the AI-framing, which is captured by the conceptual engineering view. This is that the AI-framings point towards avenues for forms of conceptual engineering. If the challenges of conceptual ethics and reductionism are overcome, some AI-framings might enrich our epistemic and practical lives. So, at its worst - as implicit conceptual metaphor - the AI-framing leads us completely astray; at its best, it prompts us to reflect anew on how the boundaries of our current concepts serve us and how they could be improved.

2504.04065 2026-05-20 cs.CV cs.IR cs.MM

Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu

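AI summary: This paper proposes a unified retrieval-augmented visual question answering framework that uses collaborative parametric knowledge calibration to fully exploit the cross-task synergy in KB-VQA, improving answer accuracy.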

Comments 10 pages, 5 figures, Under Review

DOI: 10.1016/j.knosys.2026.116157
Journal ref: Knowledge-Based Systems, 8 July 2026, Volume 346

Abstract

Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. The tasks of knowledge retrieval and answer generation both require precise multimodal understanding of question context and external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing, ultimately leading to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
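A late interaction mechanism is commonly understood in the ColBERT sense: query and document are kept as sets of token embeddings, each query token is matched to its most similar document token, and the per-token maxima are summed. The sketch below assumes that variant, which the abstract does not spell out, and uses invented 2-D embeddings.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim scoring: for each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Invented 2-D token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # covers both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]               # covers only the first
score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Because matching happens per token rather than on a single pooled vector, a document that addresses every part of the question scores higher than one that matches only a fragment of it.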

2504.00470 2026-05-20 cs.LG cs.CV

Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection

Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao

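AI summary: This paper proposes LiMA, an efficient black-box attribution method that recasts the attribution of important regions as a submodular subset selection problem, giving more accurate explanations with fewer regions and showing significant improvements on multiple foundation models.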

Abstract

Developing a trustworthy AI system requires identifying the input regions that most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to rank input sub-regions efficiently by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the highest confidence achieved by our method is, on average, 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
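To show what ranking regions by submodular subset selection looks like, here is a minimal sketch using plain forward greedy search on a coverage function, a textbook submodular objective. It is not LiMA's bidirectional search or its scoring function; the region names and feature sets are invented.

```python
def coverage(selected, region_features):
    """Submodular set function: number of distinct features covered by the chosen regions."""
    covered = set()
    for r in selected:
        covered |= region_features[r]
    return len(covered)


def greedy_rank(region_features):
    """Order regions by marginal gain, always adding the region that raises coverage most."""
    remaining, order = set(region_features), []
    while remaining:
        base = coverage(order, region_features)
        best = max(
            sorted(remaining),  # sorted so that ties break deterministically
            key=lambda r: coverage(order + [r], region_features) - base,
        )
        order.append(best)
        remaining.remove(best)
    return order


# Invented example: each image region "explains" a set of evidence features.
regions = {
    "sky": {1},
    "head": {1, 2, 3},
    "wing": {3, 4},
    "grass": set(),
}
ranking = greedy_rank(regions)
```

The first two regions in the ranking already cover all four features, which mirrors the abstract's point that a small, well-chosen subset of regions can carry the explanation.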

2503.19877 2026-05-20 cs.CL

Scaling Evaluation-time Compute with Reasoning Models as Evaluators

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck

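AI summary: This paper explores improving language-model evaluation by spending more compute at evaluation time, using reasoning models as evaluators that assess both the response as a whole and each step within it.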

Comments ACL 2026 Findings

Abstract

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
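Reranking multiple generations with an evaluator that gives both an outcome judgment and per-step judgments might look like the sketch below. The combination rule (equal weights, with the weakest step standing in for the process score) and all score values are hypothetical; the abstract does not state how the two kinds of evaluation are aggregated.

```python
def combined_score(outcome_score, step_scores):
    """Blend an outcome judgment with a process judgment (the weakest step dominates)."""
    process_score = min(step_scores) if step_scores else outcome_score
    return 0.5 * outcome_score + 0.5 * process_score


def rerank(candidates):
    """Sort candidate responses from best to worst by the combined evaluator score."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c["outcome"], c["steps"]),
        reverse=True,
    )


# Invented evaluator outputs in [0, 1] for three sampled responses.
candidates = [
    {"id": "A", "outcome": 0.9, "steps": [0.9, 0.2, 0.9]},   # one flawed step
    {"id": "B", "outcome": 0.8, "steps": [0.8, 0.8, 0.9]},
    {"id": "C", "outcome": 0.3, "steps": [0.9, 0.9]},
]
best = rerank(candidates)[0]["id"]
```

Response A has the highest outcome score but drops to last place once its flawed intermediate step is taken into account, which is the kind of correction process evaluation is meant to provide.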

2503.13868 2026-05-20 cs.LG cs.AI

Out-of-Distribution Generalization in Time Series: A Survey

Xin Wu, Fei Teng, Xingwang Li, Ji Zhang, Tianrui Li, Qiang Duan

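AI summary: This survey reviews methods for out-of-distribution generalization in time series, analysing them along three dimensions (data distribution, representation learning, and OOD evaluation), summarising mainstream algorithms, identifying application scenarios and open challenges, and proposing future research directions.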

Comments Work in Progress

DOI: 10.1016/j.inffus.2026.104336
Journal ref: Information Fusion 133, 104336 (2026)

AI中文摘要

时间序列经常表现出分布偏移、多样化的潜在特征和非平稳学习动态，特别是在开放和演变的环境中。这些特性对分布外（OOD）泛化提出了重大挑战。尽管已有显著进展，但系统性综述仍缺乏。为填补这一空白，我们首次全面回顾了时间序列中OOD泛化方法，旨在阐明该领域的发展轨迹和当前研究现状。我们的分析分为三个基础维度：数据分布、表示学习和OOD评估。在每个维度中，我们详细介绍了几种流行的算法。此外，我们强调了关键的应用场景，突显其实际影响。最后，我们识别了持续存在的挑战并提出了未来的研究方向。时间序列中OOD泛化方法的详细总结可通过https://tsood-generalization.com获取。

英文摘要

Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose significant challenges for out-of-distribution (OOD) generalization. While substantial progress has been made, a systematic synthesis of advancements remains lacking. To address this gap, we present the first comprehensive review of OOD generalization methodologies for time series, organized to delineate the field's evolutionary trajectory and contemporary research landscape. We organize our analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, we present several popular algorithms in detail. Furthermore, we highlight key application scenarios, emphasizing their real-world impact. Finally, we identify persistent challenges and propose future research directions. A detailed summary of the reviewed methods for OOD generalization in time series can be accessed at https://tsood-generalization.com.

2503.12172 2026-05-20 cs.LG cs.CR cs.CV

SEAL: Semantic Aware Image Watermarking

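Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen

AI summary: This paper proposes a new watermarking method that embeds semantic information about the generated image directly into the watermark, yielding a distortion-free watermark that can be verified without a database of key patterns. The key pattern is inferred from the image's semantic embedding via locality-sensitive hashing, and conditioning detection on the original image content improves robustness against forgery attacks.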

Details

Abstract

Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method's increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.
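To make the locality-sensitive hashing step concrete, here is a minimal sketch of random-hyperplane LSH (SimHash) over an embedding vector. It is a generic textbook construction, not the SEAL implementation: the function names, the bit count, and the use of a seed as a secret are assumptions for illustration, and how the resulting bits would be turned into an initial-noise key pattern is left out.

```python
import random
from typing import List, Sequence


def make_hyperplanes(num_bits: int, dim: int, seed: int) -> List[List[float]]:
    """Draw num_bits random Gaussian hyperplanes; the seed plays the role of a secret."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_bits(embedding: Sequence[float], hyperplanes: List[List[float]]) -> List[int]:
    """One bit per hyperplane: which side of the plane the embedding falls on."""
    return [
        1 if sum(e * h for e, h in zip(embedding, plane)) >= 0.0 else 0
        for plane in hyperplanes
    ]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy 4-dimensional "semantic embedding" (real embeddings are much larger).
planes = make_hyperplanes(num_bits=16, dim=4, seed=1234)
embedding = [0.3, -1.2, 0.8, 0.1]
bits = lsh_bits(embedding, planes)
rescaled_bits = lsh_bits([2.0 * x for x in embedding], planes)
```

Because each bit depends only on the sign of a projection, embeddings pointing in similar directions tend to share most bits, which is the property that lets a detector re-derive a key from image content instead of searching a key database.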

2503.02170 2026-05-20 cs.CV cs.AI

Adaptive Camera Sensor for Vision Models

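Eunsu Baek, Sunghwan Han, Taesik Gong, Hyung-Sin Kim

AI summary: This paper proposes Lens, an adaptive camera sensor control method inspired by human visual perception. It improves model performance by capturing images that are high quality from the model's perspective, adapts to specific models and scenes in real time, and is validated on the new ImageNet-ES Diverse dataset.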

Comments The International Conference on Learning Representations (ICLR 2025)

Details

Abstract

Domain shift remains a persistent challenge in deep-learning-based computer vision, often requiring extensive model modifications or large labeled datasets to address. Inspired by human visual perception, which adjusts input quality through corrective lenses rather than over-training the brain, we propose Lens, a novel camera sensor control method that enhances model performance by capturing high-quality images from the model's perspective rather than relying on traditional human-centric sensor control. Lens is lightweight and adapts sensor parameters to specific models and scenes in real-time. At its core, Lens utilizes VisiT, a training-free, model-specific quality indicator that evaluates individual unlabeled samples at test time using confidence scores without additional adaptation costs. To validate Lens, we introduce ImageNet-ES Diverse, a new benchmark dataset capturing natural perturbations from varying sensor and lighting conditions. Extensive experiments on both ImageNet-ES and our new ImageNet-ES Diverse show that Lens significantly improves model accuracy across various baseline schemes for sensor control and model modification while maintaining low latency in image captures. Lens effectively compensates for large model size differences and integrates synergistically with model improvement techniques. Our code and dataset are available at github.com/Edw2n/Lens.git.
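The selection principle described here (score each candidate capture with a model-side confidence signal and keep the best one) can be sketched as follows. This is only an illustration under stated assumptions: maximum softmax probability stands in for the confidence score, the sensor settings and logits are made up, and `capture_and_infer` is a hypothetical callable, not part of the Lens code base.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Setting = Tuple[str, str]  # hypothetical (ISO, shutter speed) pair


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits: Sequence[float]) -> float:
    """Maximum softmax probability, used here as an assumed quality proxy."""
    return max(softmax(logits))


def select_setting(settings: Sequence[Setting],
                   capture_and_infer: Callable[[Setting], Sequence[float]]) -> Setting:
    """Pick the sensor setting whose capture the model is most confident about."""
    return max(settings, key=lambda s: confidence(capture_and_infer(s)))


# Made-up logits standing in for "capture an image, then run the model".
fake_logits: Dict[Setting, List[float]] = {
    ("iso100", "1/60"): [0.2, 0.1, 0.3],
    ("iso400", "1/250"): [4.0, 0.5, 0.1],
    ("iso1600", "1/1000"): [1.0, 0.9, 0.8],
}
chosen = select_setting(list(fake_logits), lambda s: fake_logits[s])
```

No labels are needed at any point, which matches the abstract's description of a training-free indicator evaluated on unlabeled samples at test time.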

2502.20981 2026-05-20 cs.CV

Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection

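Fuyun Wang, Tong Zhang, Yuanzhi Wang, Yide Qiu, Xin Liu, Xu Guo, Zhen Cui

AI summary: This paper proposes a distribution prototype diffusion learning method that builds learnable Gaussian prototypes to form a latent representation space, tightening the discriminative boundary around normal samples. A Schrödinger bridge drives normal samples toward the prototypes while steering anomalous samples away, improving anomaly detection performance.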

Comments Accepted by CVPR 2025

Details

Abstract

In Open-set Supervised Anomaly Detection (OSAD), existing methods typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples, while overlooking critical priors of normal samples, leading to less effective discriminative boundaries. To address this issue, we propose a Distribution Prototype Diffusion Learning (DPDL) method aimed at enclosing normal samples within a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space for abundant and diverse normal samples and learn a Schrödinger bridge to facilitate a diffusive transition toward these prototypes for normal samples while steering anomaly samples away. Moreover, to enhance inter-sample separation, we design a dispersion feature learning scheme in hyperspherical space, which benefits the identification of out-of-distribution anomalies. Experimental results demonstrate the effectiveness and superiority of our proposed DPDL, achieving state-of-the-art performance on 9 public datasets.

2501.09203 2026-05-20 cs.CV cs.RO

3D Modeling and Automated Measurement of Concrete Cracks via Segment Anything Refinement and Visual Inertial LiDAR Fusion

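Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He

AI summary: This paper proposes a framework that combines computer vision techniques with multi-modal simultaneous localization and mapping (SLAM) for 2D crack detection, 3D reconstruction, and automatic 3D crack measurement, addressing the limited adaptability and robustness of existing methods, particularly on curved or complex geometries.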

Comments Title and author list updated

Details

DOI: 10.1016/j.cacaie.2026.100019
Journal ref: Computer-Aided Civil and Infrastructure Engineering, Volume 45, 2026, 100019, ISSN 1093-9687

Abstract

Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement by integrating computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements that use the Segment Anything Model (SAM) foundation model, we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the geometric attributes of cracks were measured automatically and directly within 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

2412.13111 2026-05-20 cs.CV cs.GR

Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations

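Ruoxi Guo, Huaijin Pi, Zehong Shen, Qing Shuai, Zechen Hu, Zhumei Wang, Yajiao Dong, Ruizhen Hu, Taku Komura, Sida Peng, Xiaowei Zhou

AI summary: This paper proposes using motion data extracted from 2D videos to improve text-driven 3D motion generation. By disentangling local joint motion from global movement, the method learns local motion priors efficiently, improving the realism and diversity of the generated 3D human motion.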

Comments Project page: https://zju3dv.github.io/Motion-2-to-3/

Details

DOI: 10.1109/ICCV51701.2025.01327
Journal ref: 2025 IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 2025, pp. 14305-14316

Abstract

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation. Our code is publicly available at https://zju3dv.github.io/Motion-2-to-3/.

2412.00404 2026-05-20 cs.CV

Hard-Label Black-Box Attacks on 3D Point Clouds

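Daizong Liu, Yunbo Tao, Junhao Dong, Keke Tang, Pan Zhou, Wei Hu, Yew-Soon Ong

AI summary: This paper proposes a hard-label black-box attack on 3D point clouds that uses a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples, improving both attack performance and adversarial quality.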

Details

Abstract

With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers simply follow the white-box or black-box setting to iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios (where no model details are provided) because they rely heavily on the parameters or output logits of victim models. To this end, we propose point cloud attacks in a more practical setting, i.e., the hard-label black-box attack, in which attackers can only access the prediction label of the 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary by developing a learnable spectrum-fusion strategy to adaptively fuse point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary for generating adversarial point clouds with trivial perturbations. Experiments demonstrate that our attack competitively outperforms existing white/black-box attackers in terms of attack performance and adversary quality.

2411.08982 2026-05-20 cs.LG cs.DC

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

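Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

AI summary: This paper presents Lynx, a system that exploits a property of load-balancing losses in MoE training to reduce the total number of experts invoked, enabling efficient, workload-agnostic MoE inference with higher throughput and low accuracy loss.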

Details

Abstract

Selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundation models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance in serving, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. LYNX leverages a key property of MoE training: load-balancing losses introduce batch-level expert activation skews and redundancy, which it exploits by remapping low-affinity token-to-expert assignments within each batch using a novel AffinityBinning technique that reduces the total experts invoked. Our evaluation of LYNX on four state-of-the-art model families across nine benchmarks shows that it achieves up to 1.30x improvement in throughput while keeping accuracy loss below 1 percentage point across tasks. Further, LYNX is complementary to existing techniques, additionally boosting their performance by up to 1.38x.

2410.18856 2026-05-20 cs.AI cs.CL

Entry-level guide to the use of large language models for medical research

Qiao Jin, Nicholas Wan, Robert Leaman, Shubo Tian, Zhizheng Wang, Yifan Yang, Zifeng Wang, Guangzhi Xiong, Po-Ting Lai, Qingqing Zhu, Benjamin Hou, Maame Sarfo-Gyamfi, Gongbo Zhang, Aidan Gilson, Balu Bhasuran, Zhe He, Aidong Zhang, Jimeng Sun, Chunhua Weng, Ronald M. Summers, Qingyu Chen, Yifan Peng, Zhiyong Lu

AI summary: This paper offers an actionable guideline to help healthcare professionals use large language models (LLMs) more effectively in medical research, covering task formulation, model selection, prompt engineering, fine-tuning, and deployment, so that LLMs are applied safely and reliably in clinical practice.

Details

Abstract

Frontier large language models (LLMs), such as GPT-5, Claude 4.5, Gemini 3, Llama 4, and DeepSeek-R1, represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this paper, we propose an actionable guideline to help healthcare professionals more effectively and efficiently utilize LLMs in their work, along with a set of best practices. The overall workflow consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and model deployment. We start with the discussion of critical considerations in identifying medical tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this entry-level tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.

2410.15362 2026-05-20 cs.LG cs.AI cs.CL cs.CR

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

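Xiao Li, Wei Zhang, Zhuhong Li, Qiongxiu Li, Shei PernChua, BingZe Lee, Jinghao Cui, Yifan Huang, Xiaolin Hu

AI summary: This paper proposes Faster-GCG, which makes jailbreak attacks against aligned large language models more efficient through improved gradient-based estimation, more efficient sampling, and avoidance of repeated evaluations. It reports up to an 8x gain in sample efficiency and a 7x reduction in wall-clock time, along with higher jailbreak success rates across several models.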

Comments 18 pages, new version

Details

Abstract

Aligned Large Language Models (LLMs) have attracted significant attention for their safety, particularly in the context of jailbreak attacks that attempt to bypass guardrails via adversarial prompts. Among existing approaches, the Greedy Coordinate Gradient (GCG) attack pioneered automated jailbreaks through discrete token optimization; however, its low sample efficiency limits practical applicability. In particular, GCG requires approximately 256K evaluations per harmful behavior to achieve a satisfactory jailbreak success rate, due to the inherent difficulty of the underlying discrete optimization problem. In this work, we identify three key factors that limit the sample efficiency of GCG: inaccurate gradient-based estimation, inefficient uniform sampling, and repeated evaluation of previously explored suffixes. To address these issues, we propose Faster-GCG, a streamlined variant of GCG that incorporates distance-based regularization for improved estimation, temperature-controlled sampling for more effective exploration, and a visited-suffix marking mechanism to avoid redundant evaluations. Faster-GCG reduced the required evaluations to 32K, achieving up to an 8x improvement in sampling efficiency and a 7x reduction in wall-clock time compared to GCG. Under this reduced budget, Faster-GCG attained an average jailbreak success rate of 78.1% across five aligned LLMs, and achieved 88.7% against Qwen3.5-4B, outperforming state-of-the-art white-box jailbreak methods.
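Two of the ingredients named in the abstract, temperature-controlled sampling and visited-suffix marking, are generic discrete-search techniques that can be sketched in isolation. The sketch below is an assumption-laden illustration, not the Faster-GCG code: the scores are made-up numbers standing in for per-token loss estimates (lower is better), token ids are arbitrary integers, and no language model or gradient is involved.

```python
import math
import random
from typing import List, Optional, Sequence, Set, Tuple


def sample_index(scores: Sequence[float], temperature: float,
                 rng: random.Random) -> int:
    """Sample an index with probability proportional to exp(-score / temperature)."""
    best = min(scores)
    weights = [math.exp(-(s - best) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def propose_suffix(suffix: Tuple[int, ...], position: int, scores: Sequence[float],
                   visited: Set[Tuple[int, ...]], temperature: float,
                   rng: random.Random, max_tries: int = 50) -> Optional[Tuple[int, ...]]:
    """Propose a one-token substitution that has not been evaluated before."""
    for _ in range(max_tries):
        candidate: List[int] = list(suffix)
        candidate[position] = sample_index(scores, temperature, rng)
        key = tuple(candidate)
        if key not in visited:
            visited.add(key)  # mark so the same suffix is never scored twice
            return key
    return None  # every sampled candidate had already been visited


rng = random.Random(0)
visited: Set[Tuple[int, ...]] = set()
scores = [3.0, 0.5, 2.0]  # hypothetical per-token loss estimates
first = propose_suffix((7, 7, 7), 1, scores, visited, 1e-3, rng)
second = propose_suffix((7, 7, 7), 1, scores, visited, 1e-3, rng)
```

At a very low temperature the sampler is effectively greedy, so the second call keeps hitting the already-visited suffix and returns `None`. Raising the temperature spreads probability over the other tokens and lets the search move on. The budget figures in the abstract are consistent with the stated gain: 256K / 32K = 8.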

2409.08248 2026-05-20 cs.CV

TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation

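NaHyeon Park, Kunhee Kim, Hyunjung Shim

AI summary: This paper proposes TextBoost, an efficient one-shot personalization method for text-to-image diffusion models. Fine-tuning only the text encoder improves computational and storage efficiency while preserving semantic integrity, giving faster convergence and lower storage requirements with high-quality generation.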

Comments Project page: https://textboost.github.io. Accepted to TMLR

Details

Abstract

In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

2409.03192 2026-05-20 cs.CV

PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning

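Bowen Tian, Songning Lai, Lujundong Li, Zhihao Shuai, Runwei Guan, Tian Wu, Yutao Yue

AI summary: This paper proposes PEPL, which addresses the scarcity of labeled data in fine-grained image classification by generating high-quality pseudo-labels, using class activation maps (CAMs) for semantic-mixed pseudo-label generation to improve accuracy and robustness.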

Comments Accepted by ICASSP 2025

Details

DOI: 10.1109/ICASSP49660.2025.10889037

Abstract

Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.

2403.07183 2026-05-20 cs.CL cs.AI cs.LG cs.SI

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

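Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou

AI summary: This paper presents a method for estimating the fraction of text in a large corpus that is likely to have been substantially modified or generated by a large language model (LLM). Using expert-written and AI-generated reference texts, the maximum likelihood model examines real-world LLM use efficiently at the corpus level. In a case study of peer reviews from AI conferences held after the release of ChatGPT (ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023), it finds that between 6.5% and 16.9% of submitted review text may have been substantially modified by LLMs. The circumstances in which generated text appears shed light on user behavior: the estimated fraction of LLM-generated text is higher in reviews that report lower confidence, were submitted close to the deadline, or come from reviewers less likely to respond to author rebuttals. The paper also observes corpus-level trends that may be too subtle to detect at the individual level, discusses their implications for peer review, and calls for future interdisciplinary work on how LLM use is changing information and knowledge practices.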

Comments 46 pages, 31 figures, ICML '24

Details

Abstract

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
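The corpus-level estimate described here can be read as fitting the mixing weight of a two-component mixture. The sketch below shows that idea under simplifying assumptions: each document's likelihood under the human-written and AI-generated reference distributions is taken as already computed (the numbers are invented), and the mixing fraction `alpha` is found by ternary search, which is valid because the log-likelihood is concave in `alpha`. How the paper actually builds those reference distributions is not reproduced.

```python
import math
from typing import Sequence


def log_likelihood(alpha: float, p_human: Sequence[float],
                   p_ai: Sequence[float]) -> float:
    """Log-likelihood of the corpus under the mixture (1 - alpha) * P + alpha * Q."""
    return sum(math.log((1.0 - alpha) * ph + alpha * pa)
               for ph, pa in zip(p_human, p_ai))


def estimate_alpha(p_human: Sequence[float], p_ai: Sequence[float],
                   iters: int = 100) -> float:
    """Maximize the concave log-likelihood over alpha in [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, p_human, p_ai) < log_likelihood(m2, p_human, p_ai):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


# Invented per-document likelihoods: 8 documents look human-written, 2 look AI-generated.
p_human = [0.9] * 8 + [0.1] * 2
p_ai = [0.1] * 8 + [0.9] * 2
alpha_hat = estimate_alpha(p_human, p_ai)
```

For this toy corpus the maximizer can be checked by hand: setting the derivative to zero gives 8 / (0.9 - 0.8 alpha) = 2 / (0.1 + 0.8 alpha), so alpha = 0.125. The estimate is lower than the naive 2-in-10 count because the likelihoods are soft evidence rather than hard labels, which is also why the approach speaks about the corpus and not about any single document.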

2310.11203 2026-05-20 cs.LG stat.ML

Federated Learning with Nonvacuous Generalisation Bounds

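Pierre Jobic, Maxime Haddouche, Benjamin Guedj

AI summary: This paper proposes a new strategy for training randomised predictors in federated learning, in which each node preserves privacy by releasing a local predictor while keeping its training data hidden from other nodes. A global randomised predictor is built that inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. Numerical experiments show predictive performance comparable to the batch approach while preserving privacy.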

Details

Abstract

We introduce a novel strategy to train randomised predictors in federated learning, where each node of the network aims at preserving its privacy by releasing a local predictor but keeping its training dataset secret from the other nodes. We then build a global randomised predictor which inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. We consider the synchronous case where all nodes share the same training objective (derived from a generalisation bound), and the heterogeneous and homogeneous cases where each node may have its own personalised training objective. We show through a series of numerical experiments that our approach achieves a comparable predictive performance to that of the batch approach where all datasets are shared across nodes. Moreover, the predictors are supported by numerically nonvacuous generalisation bounds while preserving privacy for each node. We explicitly compute the increment on predictive performance and generalisation bounds for our two federated settings, highlighting the price to pay to preserve privacy.

2112.08507 2026-05-20 cs.LG stat.ML

Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Tong Li, Jacob Nogas, Haochen Song, Anna Rafferty, Eric M. Schwartz, Audrey Durand, Harsh Kumar, Nina Deliu, Sofia S. Villar, Dehan Kong, Joseph J. Williams

AI summary: This paper proposes TS-PostDiff, a statistically sensitive algorithm that combines uniform random assignment with reward maximization to trade off statistical analysis against user reward in adaptive experiments.

详情

Abstract

Traditional randomized A/B experiments assign arms with uniform random (UR) probability, such as 50/50 assignment to two versions of a website to discover whether one version engages users more. To more quickly and automatically use data to benefit users, multi-armed bandit algorithms such as Thompson Sampling (TS) have been advocated. While TS is interpretable and incorporates the randomization key to statistical inference, it can cause biased estimates and increase false positives and false negatives in detecting differences in arm means. We introduce a more Statistically Sensitive algorithm, TS-PostDiff (Posterior Probability of Small Difference), that mixes TS with traditional UR by using an additional adaptive step, where the probability of using UR (vs TS) is proportional to the posterior probability that the difference in arms is small. This allows an experimenter to define what counts as a small difference, below which a traditional UR experiment can obtain informative data for statistical inference at low cost, and above which using more TS to maximize user benefits is key. We evaluate TS-PostDiff against UR, TS, and two other TS variants designed to improve statistical inference. We consider results for the common two-armed experiment across a range of settings inspired by real-world applications. Our results provide insight into when and why TS-PostDiff or alternative approaches provide better tradeoffs between benefiting users (reward) and statistical inference (false positive rate and power). TS-PostDiff's adaptivity helps efficiently reduce false positives and increase statistical power when differences are small, while increasing reward more when differences are large. The work highlights important considerations for future Statistically Sensitive algorithm development that balances reward and statistical analysis in adaptive experimentation.
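
The adaptive step described in the abstract can be illustrated with a minimal sketch for a two-armed experiment with binary rewards. It is not the authors' implementation. The function names, the Beta(1, 1) priors, and the use of a single posterior draw to decide between UR and TS are assumptions made for illustration. With a single draw, the chance of falling back to UR equals the posterior probability that the arms differ by less than the threshold `c`.

```python
import random


def ts_postdiff_choose(successes, failures, c, rng):
    """Pick an arm (0 or 1) for the next participant.

    successes, failures: per-arm counts observed so far.
    c: threshold below which a difference in arm means counts as small.
    """
    def draw():
        # Assumed Beta(1, 1) prior on each arm's success rate.
        return [rng.betavariate(1 + successes[a], 1 + failures[a])
                for a in (0, 1)]

    p = draw()
    if abs(p[0] - p[1]) < c:
        # The draw suggests a small difference: use uniform random assignment.
        return rng.randrange(2)
    # Otherwise take a Thompson Sampling step with a fresh posterior draw.
    p = draw()
    return 0 if p[0] > p[1] else 1


def run_experiment(true_rates, n_participants, c, seed=0):
    """Simulate one experiment and return per-arm success/failure counts."""
    rng = random.Random(seed)
    successes, failures = [0, 0], [0, 0]
    for _ in range(n_participants):
        arm = ts_postdiff_choose(successes, failures, c, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    print(run_experiment(true_rates=(0.5, 0.6), n_participants=500, c=0.1))
```

In this sketch, setting `c = 0` never triggers the UR branch and so reduces to plain TS, while `c = 1` makes essentially every assignment uniform random. Intermediate values let the experimenter state what counts as a small difference.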

2105.00933 2026-05-20 cs.SD cs.AI cs.LG eess.AS

Deep Neural Network for Musical Instrument Recognition using MFCCs

Saranga Kingkor Mahanta, Abdullah Faiz Ur Rahman Khilji, Partha Pakray

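AI summary: This paper proposes an MFCC-based deep neural network for classifying twenty classes of musical instruments, reporting high-accuracy recognition on the London Philharmonic Orchestra dataset.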

Journal ref: Computacion y Sistemas, Vol 25, No 2 (2021): 25(2) 2021

Abstract

The task of efficient automatic music classification is of vital importance and forms the basis for various advanced applications of AI in the musical domain. Musical instrument recognition is the task of identifying an instrument from its audio. This audio, also termed sound vibrations, is leveraged by the model to match against the instrument classes. In this paper, we use an artificial neural network (ANN) model that was trained to perform classification on twenty different classes of musical instruments. Here we use only the mel-frequency cepstral coefficients (MFCCs) of the audio data. Our proposed model trains on the full London Philharmonic Orchestra dataset, which contains twenty classes of instruments belonging to the four families, viz. woodwinds, brass, percussion, and strings. Based on experimental results, our model achieves state-of-the-art accuracy on the same dataset.
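
For readers unfamiliar with MFCCs, the following is a minimal sketch of how such coefficients can be computed for a single audio frame: windowed power spectrum, triangular mel filterbank, log energies, then a DCT-II. It is not the paper's feature pipeline. The function names, the frame length, the number of filters and coefficients, and the common 2595 and 700 mel-scale constants are all assumptions chosen for illustration, and a practical system would use an FFT-based audio library instead of the naive DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling edge of the triangle
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    """Return n_coeffs MFCCs for one frame of audio samples."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small constant avoids log(0) for empty or silent filters.
    log_e = [math.log(sum(w * s for w, s in zip(row, spectrum)) + 1e-10)
             for row in bank]
    # DCT-II decorrelates the log mel energies.
    return [sum(e * math.cos(math.pi * m * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for m in range(n_coeffs)]


if __name__ == "__main__":
    sr = 8000
    frame = [math.sin(2 * math.pi * 440 * i / sr) for i in range(256)]
    print(mfcc(frame, sr))
```

A classifier like the one described in the abstract would take vectors of this kind, typically aggregated over many frames of a recording, as its input features.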
