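arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.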

2605.20544 2026-05-21 cs.RO cs.CV

Salil Parth Tripathi, Bertrand Chapron, Fabrice Collard, Nicolas Courty, Ronan Fablet

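AI summary: This paper proposes intent-controlled partial optimal transport (IC-POT), which replaces the global rejection mechanism with pointwise rejection costs to serve applications that need more structured, pointwise rejection, and demonstrates its practical value in positive-unlabeled learning and open-partial domain adaptation.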

Abstract

While optimal transport (OT) enforces a rigid constraint by requiring two measures to be matched exactly, partial optimal transport relaxes this requirement by allowing mass to remain unmatched through a global budget, scalar rebate, or uniform rejection rule. However, many applications call for more structured, pointwise rejection mechanisms, where the decision to leave mass unmatched depends on side-specific reliability, support geometry, or external information about which components should participate in the comparison. We introduce intent-controlled partial optimal transport (IC-POT), a targeted generalization of partial transport that replaces the global rejection paradigm with pointwise rejection costs over both measures. We show that the resulting optimization problem admits a dual interpretation in terms of local acceptance thresholds and can be solved by recasting it as a balanced Kantorovich OT problem on an augmented support. Beyond theoretical analysis, we demonstrate the practical relevance of IC-POT in settings where rejection is driven by side information. In positive-unlabeled learning and open-partial domain adaptation, incorporating pointwise rejection rules that encode statistical structure improves fixed baseline pipelines. Finally, we motivate the use of IC-POT with a practical geophysical case: multi-modal satellite ocean measurements, for which physical and sensor priors naturally inform the rejection mechanism and define the retrieved comparable signal information.
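The recasting mentioned in the abstract can be sketched for the discrete case. The sketch below is an illustration under stated assumptions, not the paper's formulation: it assumes that leaving one unit of mass unmatched at source point i costs `reject_a[i]` and at target point j costs `reject_b[j]`, and it adds one dummy point per side so that the problem becomes balanced. The names `augment_problem` and `can_be_matched` are hypothetical.

```python
def augment_problem(cost, a, b, reject_a, reject_b):
    """Build a balanced OT problem with one dummy point on each side.

    Assumed formulation (not taken from the paper): unmatched mass at
    source i costs reject_a[i] per unit, at target j reject_b[j] per unit.
    """
    n = len(a)
    # Real-to-real block keeps the original costs; the extra column is the
    # dummy target that absorbs rejected source mass.
    cost_aug = [list(cost[i]) + [reject_a[i]] for i in range(n)]
    # Extra row: the dummy source supplies rejected target mass.
    # Dummy-to-dummy transport is free.
    cost_aug.append(list(reject_b) + [0.0])
    # Dummy masses are large enough to absorb everything on the other side,
    # which makes both marginals sum to sum(a) + sum(b).
    a_aug = list(a) + [sum(b)]
    b_aug = list(b) + [sum(a)]
    return cost_aug, a_aug, b_aug


def can_be_matched(cost, reject_a, reject_b):
    """Local threshold test: cost[i][j] <= reject_a[i] + reject_b[j]."""
    return [
        [cost[i][j] <= reject_a[i] + reject_b[j] for j in range(len(reject_b))]
        for i in range(len(reject_a))
    ]


# Toy data (hypothetical): only the pair (0, 0) is cheap enough to match.
cost = [[0.1, 9.0], [9.0, 9.0]]
a = [0.5, 0.5]
b = [0.5, 0.5]
reject_a = [1.0, 1.0]
reject_b = [1.0, 1.0]

cost_aug, a_aug, b_aug = augment_problem(cost, a, b, reject_a, reject_b)
matchable = can_be_matched(cost, reject_a, reject_b)
```

The triple `(cost_aug, a_aug, b_aug)` can be handed to any balanced OT solver. Under the assumed formulation, a pair whose cost exceeds `reject_a[i] + reject_b[j]` never carries mass in an optimal plan, because rejecting both ends is cheaper.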

2605.19776 2026-05-21 cs.CV

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

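Yuanpei Zhao, Jie Lin, Chao Zhang, Yilin Wang, Mao Li, Chenhui Li, Jie Hou, Tangjie Lv

AI summary: This paper presents the PPaint benchmark, which fuses expert preference and rating data to improve image aesthetic assessment models, and uses self-distillation to obtain a more accurate single-pass aesthetic scorer that outperforms open-source baselines and comes within 0.04 SRCC of closed-source Gemini-3.1-Pro.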

Comments 27 pages, 7 pages

Abstract

Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
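To make the preference-to-score idea concrete, here is a minimal sketch of the standard Elo update applied to pairwise judgments, followed by a min-max rescaling to a score range. It is an illustration only: the K-factor, the starting rating of 1500, the 0 to 10 range, and the rescaling step are assumptions, and the abstract does not specify how PSDistill builds or calibrates its reference pool.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(ratings, winner, loser, k=32.0):
    """Update ratings in place after `winner` is preferred over `loser`."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def pseudo_scores(ratings, low=0.0, high=10.0):
    """Min-max rescale ratings to [low, high] (an assumed calibration step)."""
    r_min, r_max = min(ratings.values()), max(ratings.values())
    if r_max == r_min:
        # All items tied: place everything at the midpoint.
        return {name: (low + high) / 2.0 for name in ratings}
    scale = (high - low) / (r_max - r_min)
    return {name: low + (r - r_min) * scale for name, r in ratings.items()}


# Hypothetical pool of three images and one pairwise judgment.
ratings = {"img_a": 1500.0, "img_b": 1500.0, "img_c": 1500.0}
elo_update(ratings, winner="img_a", loser="img_b")
scores = pseudo_scores(ratings)
```

With more judgments, the ordering comes from the preferences, while the rescaling step is where an absolute scale would be anchored.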

2605.19649 2026-05-21 cs.CV

CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

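Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

AI summary: This paper proposes a NeRF-based image augmentation method that removes the need for large sets of CAD-rendered images when learning spacecraft pose estimators: a few tens to a few hundreds of realistic images suffice to train an accurate estimator, with improved robustness to real on-orbit conditions.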

Comments This work has been submitted to the IEEE for possible publication

Abstract

Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically-consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied on large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

2605.19624 2026-05-21 cs.CV cs.AI

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

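Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

AI summary: This paper proposes a component-aware, structure-preserving style transfer framework for synthetic-to-real data construction in satellite vision. It extracts part-wise style codes from real images and injects them into synthetic images, improving annotation-preserving Sim2Real data generation.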

Abstract

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.

2605.19537 2026-05-21 cs.LG

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

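David Pape, Jonathan Evertz, Lea Schönherr

AI summary: This paper studies how inference backends affect LLM benchmark results, finding that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement, which makes the inference backend a consequential hyperparameter.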

Abstract

Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama.cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
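One reason numerically different kernels can change outputs is that floating-point addition is not associative, so a different reduction order can move a logit by a few units in the last place, and a near-tie between two tokens may then resolve differently. The sketch below shows that effect in plain Python and adds a simple disagreement-rate helper. The helper name and the toy outputs are hypothetical, and the abstract does not say how the paper measures disagreement.

```python
def disagreement_rate(outputs_a, outputs_b):
    """Fraction of prompts on which two backends returned different outputs."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both backends must be run on the same prompts")
    if not outputs_a:
        return 0.0
    differing = sum(1 for x, y in zip(outputs_a, outputs_b) if x != y)
    return differing / len(outputs_a)


# The same three terms summed in two different orders, as two kernels might.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
orders_agree = left_to_right == right_to_left  # False in IEEE-754 doubles

# Hypothetical answers from two backends on three prompts.
backend_x = ["4", "Paris", "B"]
backend_y = ["4", "Paris", "C"]
rate = disagreement_rate(backend_x, backend_y)
```

Comparing final answers per prompt, as here, is the coarsest possible check; comparing token sequences or logits would expose divergence earlier in the generation.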

2605.19503 2026-05-21 cs.RO cs.AI cs.LG

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

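Carlo Romeo, Andrew D. Bagdanov

AI summary: This paper presents ARC-RL, a reinforcement learning playground of four MuJoCo continuous-control environments whose robot morphologies are inspired by the bestiary of ARC Raiders. With a unified observation template, action convention, and reward function, it studies how reinforcement learning algorithms perform under morphological diversity and animation-style constraints.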

Abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.
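As a rough illustration of how such a closed-form reward could be assembled, the sketch below combines three of the listed terms: a velocity-tracking tent, a survive bonus, and an action regulariser. Reading "tent" as a triangular kernel is an assumption, and the function names, weight keys, and values are hypothetical. The gait-compliance, safety, and posture terms are omitted.

```python
def tent(value, target, half_width):
    """Triangular kernel: 1 at the target, falling linearly to 0 at half_width."""
    return max(0.0, 1.0 - abs(value - target) / half_width)


def reward(forward_velocity, target_velocity, is_healthy, action, weights):
    """Weighted sum of three terms; per-morphology variation lives in `weights`."""
    tracking = tent(forward_velocity, target_velocity, weights["tent_half_width"])
    survive = 1.0 if is_healthy else 0.0
    action_cost = sum(u * u for u in action)  # quadratic action regulariser
    return (
        weights["tracking"] * tracking
        + weights["survive"] * survive
        - weights["action"] * action_cost
    )


# Hypothetical weights for one morphology.
weights = {"tent_half_width": 0.5, "tracking": 1.0, "survive": 0.5, "action": 0.1}
example = reward(1.0, 1.0, True, [1.0, 0.0], weights)
```

Keeping the functional form fixed and moving all per-robot differences into the weight dictionary mirrors the abstract's description of a single reward shared across morphologies.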

2605.19376 2026-05-21 cs.AI

Generative Recursive Reasoning

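Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

AI summary: This paper proposes the GRAM framework, which turns recursive latent reasoning into probabilistic multi-trajectory computation, addressing the determinism of existing recursive reasoning models and supporting both conditional reasoning and unconditional generation.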

Abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. See https://ahn-ml.github.io/gram-website

2605.19138 2026-05-21 cs.RO cs.AI cs.LG

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

AI summary: This paper presents COBALT, a cloud-based teleoperation platform that uses smartphones and other common devices to collect high-quality robot learning data at scale, both in simulation and in the real world.

Abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.

2605.18860 2026-05-21 cs.LG cs.CV

Spectral structural distortion reveals redundant neurons in neural networks

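Yongyu Wang

AI summary: This paper proposes a criterion for neuron redundancy based on spectral structural distortion: by analyzing the relational structure before and after each layer's transformation, it identifies removable neurons while preserving task performance.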

Abstract

Overparameterized neural networks often contain many removable neurons, yet what makes a neuron redundant remains poorly understood. Existing pruning criteria commonly rely on local quantities such as weight magnitude, activation strength, or gradient sensitivity, but these measures provide limited insight into the structural role of a neuron in the transformation performed by a layer. Here we show that neuronal redundancy can be characterized by weak participation in the spectral structural distortion induced by layer-wise representation transformations. For each hidden layer of a trained network, we record pre-activation and post-activation hidden states, model neurons as graph nodes, and construct input-side and output-side graphs that describe neuron-level relational structure before and after the layer transformation. We then define a spectral structural importance score that measures the contribution of each neuron to the dominant graph-spectral distortion between these two relational structures. Low-participation neurons are treated as structurally redundant and removed through an iterative pruning process in which scores are recomputed after each structural change. No parameter updates are performed during intermediate pruning rounds; after the target parameter reduction is reached, a single recovery fine-tuning stage is applied to the compact model. Direct ablation analysis and experiments across conventional neural networks, encoder-only Transformers, and decoder-only language models show that this graph-spectral criterion identifies removable neurons and Transformer units while preserving task performance after compression. These results suggest that neural redundancy is not merely a consequence of small weights or weak activations, but can be understood through weak participation in the spectral distortion of layer-wise relational structure.

2605.18833 2026-05-21 cs.LG cs.AI

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

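Hadi Fadlallah, Rima Kilany, Mitri Haber, Ali Jaber

AI summary: This paper proposes an automated big data quality assessment approach based on knowledge graph embeddings, which integrates diverse representations within a knowledge graph and uses contextual information to generate a comprehensive, context-specific data quality assessment plan.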

Comments 17 pages, 10 figures

DOI: 10.1504/IJDMMM.2025.150987
Journal ref: International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405

Abstract

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.
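Predicting missing edges with knowledge graph embeddings can be illustrated with TransE, one common scoring function, in which a triple (head, relation, tail) is plausible when head + relation lies close to tail. The sketch below is a toy illustration only: the abstract does not say which embedding model the authors use, and the node names, relation, and two-dimensional vectors are made up rather than learned.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance ||h + r - t||: higher means a more plausible edge."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_candidates(context_vec, relation_vec, candidates):
    """Rank candidate tail nodes for the edge (context, relation, ?)."""
    scored = [
        (name, transe_score(context_vec, relation_vec, vec))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings for a dataset context and three quality rules.
dataset_context = [1.0, 0.0]
requires_rule = [0.0, 1.0]
quality_rules = {
    "range_check": [1.0, 1.0],
    "timestamp_format": [3.0, 1.0],
    "completeness": [1.0, 2.0],
}

ranked = rank_candidates(dataset_context, requires_rule, quality_rules)
```

In a real pipeline the vectors would come from a trained embedding model, and the ranked scores are where per-rule weights for an assessment plan could be derived.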

2605.18743 2026-05-21 cs.AI

SVFSearch: A Multimodal, Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

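AI summary: This paper introduces SVFSearch, the first multimodal, knowledge-intensive benchmark for short-video frame search in the Chinese gaming domain. Using 5,000 four-choice test examples and 4,198 auxiliary training examples, it evaluates approaches ranging from direct QA to Plan-Act-Replan agents.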

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17776 2026-05-21 cs.RO

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

AI summary: This paper presents the CosFly-Track dataset for UAV visual tracking, which uses multi-constraint trajectory optimization to generate large-scale multimodal data and improve dynamic target tracking performance.

Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

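AI summary: This paper proposes DepthVLM, a simple yet effective framework that turns a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM produces full-resolution depth maps alongside language outputs in a single forward pass. It also introduces a unified indoor-outdoor metric depth benchmark; experiments show that DepthVLM outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning.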

Comments Project Page: https://depthvlm.github.io/

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

2605.15691 2026-05-21 cs.LG

SEED: Targeted Data Selection by Weighted Independent Set

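Yuan Zhang, Lifeng Guo, Junwen Pan, Wenzhao Zheng, Wen Zhou, Kuan Cheng, Kurt Keutzer, Shanghang Zhang

AI summary: This paper proposes SEED, which formulates data selection as a Weighted Independent Set (WIS) problem on a similarity graph to balance sample quality against diversity, and introduces node value calibration and local scale normalization to make the selection more robust and scalable.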

Comments 20 pages

Abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) \textit{node value calibration} that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) \textit{local scale normalization} that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct \texttt{Honeybee-Remake-SEED-200K}, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
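The Weighted Independent Set view can be sketched with a simple greedy heuristic: visit samples from highest to lowest weight, keep a sample if none of its redundant neighbours has been kept, and block its neighbours once it is kept. This is an illustration of the formulation only. The greedy rule is an assumption (the abstract does not describe the solver SEED uses), and the sample names, weights, and edges are made up.

```python
def greedy_weighted_independent_set(weights, edges):
    """Greedy WIS heuristic: no two selected nodes share an edge."""
    neighbors = {node: set() for node in weights}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected = []
    blocked = set()
    # Visit nodes from most to least influential.
    for node in sorted(weights, key=lambda n: weights[n], reverse=True):
        if node in blocked:
            continue
        selected.append(node)
        # Keeping this sample rules out everything redundant with it.
        blocked.add(node)
        blocked.update(neighbors[node])
    return selected


# Hypothetical influence weights and redundancy edges.
sample_weights = {"s1": 0.9, "s2": 0.8, "s3": 0.5, "s4": 0.4}
redundant_pairs = [("s1", "s2"), ("s3", "s4")]
subset = greedy_weighted_independent_set(sample_weights, redundant_pairs)
```

Here `s2` is dropped despite its high weight because it is redundant with `s1`, which is how the formulation trades quality against diversity.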
