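arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.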

2605.18059 2026-05-19 cs.RO

Efficient 3D Content Reconstruction and Generation

Jiahao Li

AI summary: This thesis proposes efficient methods for 3D content generation and reconstruction. It combines multi-view diffusion with sparse-view 3D reconstruction to generate high-quality 3D assets, and develops FastMap, a structure-from-motion pipeline that speeds up 3D reconstruction while maintaining comparable pose accuracy.

Abstract

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over prior state-of-the-art by using first-order optimization with fused GPU kernels extensively, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

2605.18048 2026-05-19 cs.AI

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang, Haifeng Wang

AI summary: This paper introduces Proactive Document-Guided Action, a paradigm in which GUI agents search for relevant documentation to resolve long-tailed tasks in dynamic, open-web environments. Its main contribution is DocOS, a benchmark for evaluating document-guided problem solving.

Abstract

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

2605.18045 2026-05-19 cs.RO cs.AI

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle

AI summary: This paper studies the role of uncertainty in robot autonomy decisions. It finds that once the base model is sufficiently competent, simple uncertainty proxies suffice for selective gating, but not for semantic novelty detection.

Comments: ICRA 2026 workshop paper

Abstract

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine-grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.
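The evaluation quantities named in this abstract can be made concrete with a small sketch. The following plain-Python example shows threshold gating, an act/defer agreement rate between two uncertainty estimators, and Spearman rank correlation. The function names, the convention that low uncertainty means "act", and the toy numbers are assumptions for illustration only, not the authors' code.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def gate(uncertainty, threshold):
    """True means act autonomously, False means defer to the fallback."""
    return [u <= threshold for u in uncertainty]


def act_defer_agreement(unc_a, unc_b, thr_a, thr_b):
    """Fraction of samples on which two estimators make the same decision."""
    decisions_a = gate(unc_a, thr_a)
    decisions_b = gate(unc_b, thr_b)
    same = sum(a == b for a, b in zip(decisions_a, decisions_b))
    return same / len(decisions_a)


# Hypothetical uncertainty scores from two estimators on four samples.
softmax_unc = [0.1, 0.4, 0.8, 0.2]
ensemble_unc = [0.2, 0.5, 0.9, 0.6]
rho = spearman(softmax_unc, ensemble_unc)
agreement = act_defer_agreement(softmax_unc, ensemble_unc, 0.5, 0.5)
```

In this toy setting the two estimators can rank samples similarly yet still disagree on individual act/defer decisions, which is why the abstract treats agreement as a separate quantity from rank correlation.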

2605.18041 2026-05-19 cs.CV

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang, Juntao Li, Yihang Lou, Siwei Feng, Peifeng Li

AI summary: This paper proposes OmniSelect, a training-free, modality-adaptive token pruning framework that improves the efficiency of omni-modal large language models by dynamically selecting a compression strategy. A lightweight AudioCLIP model estimates cross-modal relevance, and the relevance scores drive fine-grained token pruning within each temporal group, so token compression comes at no additional training cost.

Abstract

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
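As a rough illustration of the regime selection and per-group pruning described above, here is a minimal Python sketch. The abstract does not give the decision rule, so the margin-based rule, the `margin` value, the function names, and the top-k pruning by score are all assumptions, and the relevance scores are taken as given rather than computed by AudioCLIP.

```python
def choose_regime(audio_relevance, video_relevance, margin=0.1):
    """Pick one of three pruning regimes from two relevance scores.

    Hypothetical rule: a modality wins only if it leads by more than `margin`.
    """
    if audio_relevance - video_relevance > margin:
        return "audio-centric"
    if video_relevance - audio_relevance > margin:
        return "video-centric"
    return "uniform"


def prune_group(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens of one temporal group, in original order."""
    keep_count = max(1, round(len(tokens) * keep_ratio))
    by_score = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(by_score[:keep_count])
    return [tokens[i] for i in kept_indices]


# Toy usage with made-up scores.
regime = choose_regime(audio_relevance=0.8, video_relevance=0.3)
kept = prune_group(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], keep_ratio=0.5)
```

A regime such as "audio-centric" would then translate into a larger `keep_ratio` for audio tokens than for video tokens; how the ratios are actually allocated is not specified in the abstract.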

2605.18039 2026-05-19 cs.CV

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

Soyeon Yoon, Chang Wook Seo, Hyunjung Shim

AI summary: This paper proposes SGSoft, which learns fused semantic-geometric features from template-guided soft signals for 3D shape correspondence. It addresses structural variability, non-isometric deformation, and inconsistent topology, and reports state-of-the-art inter-category generalization with the best accuracy-efficiency trade-off among the compared methods.

Abstract

Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
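Step (iii), retrieving correspondences by nearest-neighbor search in descriptor space, can be sketched in a few lines. This is a brute-force illustration in plain Python with made-up 2-D descriptors; the function name and the squared Euclidean distance are assumptions, and a real system would likely use learned high-dimensional descriptors with an accelerated search.

```python
def nearest_neighbor_correspondence(source_descriptors, target_descriptors):
    """For each source point, return the index of the closest target descriptor."""
    matches = []
    for source in source_descriptors:
        best_index = min(
            range(len(target_descriptors)),
            key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(source, target_descriptors[j])
            ),
        )
        matches.append(best_index)
    return matches


# Toy descriptors: source point 0 is closest to target 1, and vice versa.
source = [[0.0, 0.0], [1.0, 1.0]]
target = [[0.9, 1.1], [0.1, -0.1]]
matches = nearest_neighbor_correspondence(source, target)
```

Because matching reduces to a lookup over descriptors, no pairwise optimization between the two shapes is needed at inference time, which is consistent with the single feed-forward pass the abstract describes.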

2605.18038 2026-05-19 cs.CV

Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

AI summary: This paper proposes a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. Lateral-line prediction is used to extract texture-anchored patches and patch slices, and a multi-camera setup is used to build a cross-camera test set. The method outperforms the full-image baseline in both same-trajectory validation and cross-camera testing, showing better generalization and robustness.

Comments: Accepted to the 2026 IEEE International Conference on Image Processing (ICIP)

Abstract

Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon's lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: https://github.com/espenbh/salmon-reid-patch-ensemble.

2605.18035 2026-05-19 cs.AI cs.LG

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

Xinzhe Yuan, William de Vazelhes, Bin Gu, Huan Xiong

AI summary: This paper proposes a generalized variance-reduced zeroth-order hard-thresholding algorithm. By considering the role of variance, it mitigates the conflict between zeroth-order gradients and the hard-thresholding operator, removing the restriction on the number of random directions and improving both convergence rate and applicability.

Comments: Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong

Journal ref: International Conference on Learning Representations (ICLR), 2024

Abstract

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$ constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, which normally can be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
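The two ingredients named in this abstract, a zeroth-order gradient estimate and the hard-thresholding operator, can be sketched as follows. This is a generic illustration, not the paper's algorithm: the estimator is a standard two-point Gaussian random-direction scheme (SZOHT's own sampling may differ), no variance reduction is shown, and the function names, step size, and objective are assumptions.

```python
import random


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(x)]


def zo_gradient(f, x, num_directions=10, smoothing=1e-4, rng=None):
    """Estimate the gradient of f at x from function values only.

    Averages (f(x + mu * u) - f(x)) / mu * u over random Gaussian directions u.
    """
    rng = rng or random.Random(0)
    base_value = f(x)
    estimate = [0.0] * len(x)
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + smoothing * ui for xi, ui in zip(x, direction)]
        slope = (f(shifted) - base_value) / smoothing
        for i, ui in enumerate(direction):
            estimate[i] += slope * ui / num_directions
    return estimate


def objective(x):
    """Toy black-box objective (sum of squares)."""
    return sum(value * value for value in x)


# One iteration: ZO gradient step followed by projection onto k-sparse vectors.
x = [1.0, -2.0, 0.5, 0.0]
gradient = zo_gradient(objective, x, num_directions=20)
x_next = hard_threshold([xi - 0.1 * gi for xi, gi in zip(x, gradient)], k=2)
```

With few random directions the estimate is noisy, and hard thresholding can amplify that error when it selects the wrong support; this is the tension the abstract refers to.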

2605.18032 2026-05-19 cs.CL cs.AI cs.HC cs.SE

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

Kazuki Kawamura, Satoshi Waki, Kei Tateno

AI summary: This paper presents PROTEA, an interface for offline evaluation and iterative refinement of multi-agent LLM workflows. Configurable rubrics and per-node states overlaid on the workflow graph help developers localize bottlenecks and improve workflow performance.

Comments: 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations

Abstract

Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.

2605.18029 2026-05-19 cs.CV

What Matters for Grocery Product Retrieval with Open Source Vision Language Models

Emmanuel G. Maminta, Rowel O. Atienza

AI summary: This paper studies how open-source vision-language models perform on grocery product retrieval. It finds that data quality matters more than scale, that efficient models can win, and that a gap remains between Recall@5 and Recall@1.

Comments: Accepted in the 28th International Conference on Pattern Recognition (ICPR 2026)

Abstract

Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density ($\phi$), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
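The Recall@5 versus Recall@1 gap reported above is easier to read with the metric written out. The sketch below computes Recall@k for single-label retrieval in plain Python; the function name and the toy SKU identifiers are assumptions, and this is not the evaluation script from the linked repository.

```python
def recall_at_k(ranked_results, true_ids, k):
    """Fraction of queries whose true item appears among the top-k results."""
    hits = sum(
        1 for ranked, true_id in zip(ranked_results, true_ids)
        if true_id in ranked[:k]
    )
    return hits / len(true_ids)


# Three hypothetical queries, each with a ranked list of candidate SKUs.
ranked_results = [
    ["sku_a", "sku_b", "sku_c"],
    ["sku_x", "sku_y", "sku_z"],
    ["sku_p", "sku_q", "sku_r"],
]
true_ids = ["sku_a", "sku_y", "sku_s"]

top1 = recall_at_k(ranked_results, true_ids, k=1)
top2 = recall_at_k(ranked_results, true_ids, k=2)
```

The second query illustrates the reported failure mode: the correct SKU is retrieved near the top, but a visually similar item is ranked above it, so it counts as a hit at k=2 and a miss at k=1.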

2605.18028 2026-05-19 cs.LG cs.AI

FedSDR: Federated Self-Distillation with Rectification

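Ziheng Ren, Zhanming Shen, Hao Wang, Ning Liu, You Song

AI summary: This paper proposes FedSDR, an improved federated self-distillation method that introduces a dual-stream mechanism to address data-distribution mismatch and hallucination in federated learning, improving model accuracy and consistency.

Comments: Accepted by ICML 2026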

2605.18026 2026-05-19 cs.RO

See What I Mean: Aligning Vision and Language Representations for Fine-Grained Object Understanding in Videos

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

AI summary: This paper proposes SWIM, which aligns vision and language representations to enable fine-grained object understanding from text prompts alone, removing the need for explicit visual prompts. It builds the NL-Refer dataset and uses multi-layer cross-attention maps to improve text-visual alignment.

Journal ref: CVPR 2026

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

2605.18015 2026-05-19 cs.LG cs.DB cs.SE

Bridging the Gap: Converting Read Text into Conversational Speech

Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

AI summary: This paper proposes PACC, a method that uses deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm, converting read speech into more natural conversational speech for virtual assistants, customer service, and language learning tools.

Comments: 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems

DOI: 10.1007/978-981-96-6681-2_38
Journal ref: Innovative Computing and Communications (ICICC 2025), Lecture Notes in Networks and Systems, Springer Nature, 2025, pp. 543-556

Abstract

In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.

2605.17999 2026-05-19 cs.AI

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

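Z. Jiang

AI summary: This paper proposes a shared-backbone PPO algorithm in which the actor and critic networks share a base module, giving efficient training and improved performance. The algorithm is applied to a connection-preserving communication coverage task for a multi-UAV swarm and compared against standard PPO, where it performs better. A graph information aggregation module is also integrated to adapt to the communication conditions between agents; with this module the algorithm remains effective and the trained swarm shows a higher level of cooperation.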

2605.17997 2026-05-19 cs.LG cs.AI cs.CV

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

Le Su, Xing Luo, Zhi Jin

AI summary: This paper proposes MARR, a module-adaptive residual reconstruction method that assigns each module its own scaling coefficient to balance residual-related HA bias against accumulated-error correction, improving performance under low-bit quantization.

Abstract

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods. Code will be made publicly available upon acceptance.
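The PID-based update mentioned above can be illustrated with a generic discrete PID rule applied to a single scalar coefficient. This is a sketch under stated assumptions: the class name, the gains, the clamping range, and the convention that a positive error signal lowers the coefficient are all hypothetical, and the abstract does not specify how MARR maps reconstruction error to the feedback signal.

```python
class PIDCoefficient:
    """Scalar coefficient adjusted by a discrete PID rule, clamped to [lo, hi]."""

    def __init__(self, init=1.0, kp=0.5, ki=0.0, kd=0.0, lo=0.0, hi=1.0):
        self.value = init
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lo, self.hi = lo, hi
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        """Apply one PID step; a positive error lowers the coefficient (assumed)."""
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        step = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.value = min(self.hi, max(self.lo, self.value - step))
        return self.value


# One coefficient per module; the feedback values here are made up.
coefficient = PIDCoefficient(init=1.0, kp=0.5)
first = coefficient.update(0.2)
second = coefficient.update(0.2)
```

Compared with a grid search over coefficients for every module, a feedback rule of this kind needs only one pass of error measurements per update, which matches the stated motivation of avoiding expensive per-module search.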
