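arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.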

2605.19727 2026-05-20 cs.CV

Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Zebin He, Mingxin Yang, Shuhui Yang, Hanxiao Sun, Xintong Han, Chunchao Guo, Wenhan Luo

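AI summary: This paper proposes Tango3D, a 3D foundation model that unifies dense correspondence and global retrieval. A geometry-aware 2D visual backbone and a pretrained 3D VAE encode images into 2D patches and point clouds into 3D tokens, which are mapped into a shared space for both local pixel-to-point alignment and global semantic alignment.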

Abstract

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.
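
To make the two alignment levels concrete, here is a minimal sketch of how features in a shared space could be compared. It assumes cosine similarity for both levels and mean pooling for the global vector; neither choice is stated in the abstract, and the function names are hypothetical rather than taken from Tango3D.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def mean_pool(features):
    """Average a list of feature vectors into one vector (assumed global pooling)."""
    return [sum(column) / len(features) for column in zip(*features)]

def dense_matches(patch_features, point_features):
    """Local alignment: for each 2D patch, the index of the most similar 3D token."""
    matches = []
    for patch in patch_features:
        best = max(range(len(point_features)),
                   key=lambda j: cosine(patch, point_features[j]))
        matches.append(best)
    return matches

def global_similarity(patch_features, point_features):
    """Global alignment: similarity of the pooled image and pooled shape vectors."""
    return cosine(mean_pool(patch_features), mean_pool(point_features))

# Toy features already living in a shared 2-D space (illustrative values only).
patches = [[1.0, 0.0], [0.0, 1.0]]
points = [[0.0, 2.0], [3.0, 0.1]]
print(dense_matches(patches, points))  # [1, 0]
```

The same feature space serves both queries: per-patch nearest tokens give pixel-to-point correspondence, while pooled vectors give a single retrieval score.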

2605.19726 2026-05-20 cs.CV

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

Wenhu Zhang, Yiming Wu, Huanyu Wang, Yaoyang Liu, Huanzhang Dou, Senqiao Yang, Sitong Wu, Hanbin Zhao, Jiaya Jia

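AI summary: This paper proposes the Block Approximate Sparse Attention framework (BA-Att), which uses a block-wise pre-downsampling operation to identify informative regions and avoid reliance on brittle positional priors, improving computational efficiency while preserving accuracy. Experiments show up to 6.95x faster attention computation than FlashAttention and near full-attention performance at 50% sparsity.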

Comments CVPR 2026 Findings paper

Abstract

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, but scaling them to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with a block-wise pre-downsampling operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.
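
The pre-downsampling idea can be illustrated with a small pure-Python sketch. It mean-pools queries and keys per block, scores block pairs in that compact space, and runs exact softmax attention only over the top-scoring key blocks. Mean pooling and top-k selection are assumptions made for illustration; the norm-sorting module and covariance-compensated correction described in the abstract are not reproduced here, and all names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool_blocks(rows, block):
    """Pre-downsample: average each run of `block` consecutive rows."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def attend(query, keys, values):
    """Exact softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, key) * scale for key in keys]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def block_sparse_attention(Q, K, V, block, keep):
    """Bidirectional attention restricted to the `keep` best key blocks per query block."""
    q_blocks = mean_pool_blocks(Q, block)
    k_blocks = mean_pool_blocks(K, block)
    output = []
    for qb_index, q_block in enumerate(q_blocks):
        # Score every key block in the downsampled space, then keep the best ones.
        scores = [dot(q_block, k_block) for k_block in k_blocks]
        ranked = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)
        selected = sorted(ranked[:keep])
        token_ids = [t for j in selected
                     for t in range(j * block, min((j + 1) * block, len(K)))]
        keys = [K[t] for t in token_ids]
        values = [V[t] for t in token_ids]
        for query in Q[qb_index * block:(qb_index + 1) * block]:
            output.append(attend(query, keys, values))
    return output
```

With `keep` equal to the number of key blocks the function reduces to dense attention; setting it to half the blocks corresponds to the 50% sparsity regime mentioned above.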

2605.19723 2026-05-20 cs.CL cs.AI

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

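AI summary: This survey reviews recent progress in mathematical reasoning with large language models. By analyzing datasets, architectures, training strategies, and evaluation protocols, it examines benchmarks, architecture design, evaluation methods, and open research challenges.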

Abstract

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

2605.19721 2026-05-20 cs.AI cs.LG cs.NI

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

Franco Terranova, Guillermo Bernardez, Albert Cabellos-Aparicio, Nina Miolane, Abdelkader Lahmadi

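AI summary: This paper proposes a new RL-GCO approach that operates directly in a continuous GNN-based action embedding space to solve graph combinatorial optimization problems efficiently, improving generalization and scalability.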

Comments Preprint

Abstract

Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.
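
The decoding step mentioned in the abstract, "simple nearest-neighbor decoding", can be sketched as follows. The sketch assumes Euclidean distance and a boolean validity mask over candidate actions; both are illustrative choices, and the function is hypothetical rather than part of LaGCO-RL.

```python
import math

def decode_action(latent_action, action_embeddings, valid_mask):
    """Return the index of the valid discrete action whose embedding is closest
    to the predicted latent action (Euclidean distance, assumed here)."""
    best_index, best_distance = None, math.inf
    for index, (embedding, is_valid) in enumerate(zip(action_embeddings, valid_mask)):
        if not is_valid:
            continue  # skip actions that are infeasible in the current state
        distance = math.dist(latent_action, embedding)
        if distance < best_distance:
            best_index, best_distance = index, distance
    if best_index is None:
        raise ValueError("no valid action available")
    return best_index

# Three candidate actions embedded in 2-D (illustrative values only).
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(decode_action([0.9, 0.1], embeddings, [True, True, True]))   # 1
print(decode_action([0.9, 0.1], embeddings, [True, False, True]))  # 0
```

The policy only needs one forward pass to produce `latent_action`; the cost of decoding then depends on the nearest-neighbor search rather than on scoring every action with the network.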

2605.19718 2026-05-20 cs.CL

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

Kylian Ronfleux-Corail, Guillaume Bernard, Mickaël Coustaty, Nicolas Sidère

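AI summary: This paper introduces the DocQT dataset. By comparing architectures trained under different quantization tables, it shows that standard quality factor augmentation does not represent real-world compression diversity, and that architectures which explicitly take the quantization table into account are more robust in real-world deployment.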

Abstract

Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training (restricted to standard libjpeg quality factors) and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness (FFDN [2] and Mesorch [20]), each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
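
To see why standard quality factors span only a narrow family of tables, the sketch below reproduces the well-known libjpeg quality scaling applied to the JPEG Annex K luminance table. Every table in this family is a scaled copy of one base table, so at most 100 luminance tables exist. The function name is hypothetical; the code is not taken from the DocQT repository, and the Real-QT tables themselves are not reproduced.

```python
# JPEG Annex K example luminance quantization table, in row-major order.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def quality_to_table(quality, base=BASE_LUMA):
    """Scale a base table the way libjpeg does for a quality factor in 1..100."""
    quality = max(1, min(100, quality))
    # Below 50 the scale grows as 5000/quality; from 50 upward it falls linearly.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Round, then clamp to the baseline JPEG range 1..255.
    return [max(1, min(255, (value * scale + 50) // 100)) for value in base]

print(quality_to_table(75)[:8])  # [8, 6, 5, 8, 12, 20, 26, 31]
```

A table produced by a camera, scanner, or document-management tool need not be any scaled copy of this base, which is the mismatch the abstract describes.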

2605.19678 2026-05-20 cs.RO

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

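AI summary: This paper proposes RoVLA, a framework that improves the robustness of vision-language-action models through multi-consistency constraints. It enforces consistency under three complementary transformations (instruction semantics, trajectory evolution, and observation perturbation) to strengthen stability and generalization.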

Abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

2605.19677 2026-05-20 cs.LG q-bio.QM

Agentic Discovery of Cryomicroneedle Formulations

Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu

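AI summary: This study presents a closed-loop workflow for discovering cryoprotectant formulations for cryomicroneedles, combining literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation, with iterative wet-lab validation improving the accuracy and effectiveness of the resulting formulations.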

Abstract

Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI-assisted, closed-loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation. A curated dataset of 198 mesenchymal stem-cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty-aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet-lab correction. Across ten validation iterations and 106 wet-lab observations, the model progressively adapted to cryomicroneedle-specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later-stage rank correlations became consistently positive, and the cumulative wet-lab predicted-versus-measured summary reached R^2 = 0.942. The best validated formulation achieved 95.15% post-thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi-objective optimization. These results demonstrate that agent-assisted computational infrastructure can make data-efficient formulation discovery more accessible to labs with minimal data expertise in-house. Project code is available at https://github.com/baitmeister/ML-for-CryoMN.
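
In a Gaussian-process and Bayesian-optimization loop like the one described, each round needs a rule for choosing which formulation to test next from the surrogate's posterior mean and standard deviation. The sketch below uses the standard expected-improvement criterion for maximization. Whether the authors use expected improvement is an assumption, and the candidate names and numbers are invented for illustration.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best, xi=0.0):
    """Expected improvement over `best` for a Gaussian posterior (maximization)."""
    gain = mean - best - xi
    if std <= 0.0:
        return max(gain, 0.0)  # no uncertainty: improvement is deterministic
    z = gain / std
    return gain * normal_cdf(z) + std * normal_pdf(z)

def pick_next(candidates, best):
    """candidates: list of (name, posterior mean, posterior std) tuples."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]

# Hypothetical posterior predictions of post-thaw viability (%), best seen so far 90.
candidates = [("formulation_A", 88.0, 0.5), ("formulation_B", 85.0, 10.0)]
print(pick_next(candidates, best=90.0))  # formulation_B
```

The uncertain candidate wins despite its lower mean, which is how such a loop trades exploration against exploitation when the literature prior is unreliable.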

2605.19671 2026-05-20 cs.AI

Transforming Constraint Programs to Input for Local Search

Jo Devriendt, Patrick De Causmaecker, Marc Denecker

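AI summary: This paper links symmetry properties of constraint optimization problems to local search neighborhoods, automatically derives neighborhoods from constraint specifications for use by metaheuristics in the IDP system, and evaluates the generated neighborhoods on six classical optimization problems.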

Comments Unpublished paper accepted and presented at the Fourteenth International Workshop on Constraint Modelling and Reformulation (ModRef) in 2015

2605.19663 2026-05-20 cs.AI

Benchmarking and Evolving Reflection-Based Generation

Junjie Wang, Xinghua Lou, Jason Li, Ye Tian, Keyu Chen, Yulin Li, Bin Kang, Jacky Mai, Yanwei Li, Zhuotao Tian, Liqiang Nie

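AI summary: This paper proposes the R^3-Bench benchmark and the R^3-Refiner framework for evaluating and improving reflective visual generation, raising the generation quality of text-to-image models by strengthening iterative reasoning and rectification.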

Abstract

Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at https://github.com/xiaomoguhz/R3-Bench.

2605.19634 2026-05-20 cs.CV cs.AI

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

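AI summary: This paper proposes the P2DNav framework, which addresses directional reasoning and local grounding in zero-shot vision-and-language navigation through panorama-to-downview decomposition, a sliding-window dialogue memory, and a reflective reorientation mechanism. Experiments show strong performance on the R2R-CE benchmark.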

Abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.
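
The Sliding-Window Dialogue Memory can be pictured as a history that keeps every text turn but retains visual observations only for the most recent steps. The class below is a minimal sketch under that reading; the class name, field names, and context layout are hypothetical and not taken from the P2DNav code.

```python
from collections import deque

class SlidingWindowDialogueMemory:
    """Keeps all dialogue turns as text, but images only for the last `window` steps."""

    def __init__(self, window):
        self.turns = []                            # full text history, one entry per step
        self.recent_observations = deque(maxlen=window)  # (step index, observation)

    def add_step(self, text, observation):
        self.turns.append(text)
        # Appending to a full deque silently drops the oldest observation.
        self.recent_observations.append((len(self.turns) - 1, observation))

    def build_context(self):
        """Multi-turn context: older turns carry text only, recent ones also carry images."""
        observation_by_step = dict(self.recent_observations)
        return [{"text": text, "observation": observation_by_step.get(step)}
                for step, text in enumerate(self.turns)]

memory = SlidingWindowDialogueMemory(window=2)
for step in range(3):
    memory.add_step(f"step {step}: moved forward", f"image_{step}")
context = memory.build_context()
print([turn["observation"] for turn in context])  # [None, 'image_1', 'image_2']
```

Bounding the number of retained images keeps the prompt size roughly constant on long trajectories while the text history still records where the agent has been.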

2605.19633 2026-05-20 cs.CL cs.AI cs.LG cs.NE cs.SE

optimize_anything: A Universal API for Optimizing any Text Parameter

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

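AI summary: This paper presents a general-purpose LLM-based optimization system that optimizes text parameters across different domains. It reports state-of-the-art performance on six diverse tasks and achieves efficient optimization through multi-task search and cross-problem transfer.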

Comments 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

DOI: 10.1145/3786335.3813167
Journal ref: Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA

Abstract

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
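
The formulation "a text artifact evaluated by a scoring function" can be sketched as a small search loop. The code below is a generic greedy loop written for illustration: it is not the optimize_anything or GEPA API, the proposer is a deterministic stand-in for an LLM call, and all names are hypothetical. The evaluator returns both a score and a piece of side information, mirroring the "actionable side information" discussed in the abstract.

```python
from typing import Callable, Tuple

Evaluator = Callable[[str], Tuple[float, str]]  # text -> (score, side information)
Proposer = Callable[[str, str], str]            # (text, side information) -> new text

def optimize_text(seed: str, evaluate: Evaluator, propose: Proposer,
                  budget: int) -> Tuple[str, float]:
    """Greedy search: keep a candidate only if it scores higher than the best so far."""
    best = seed
    best_score, feedback = evaluate(best)
    for _ in range(budget):
        candidate = propose(best, feedback)
        score, candidate_feedback = evaluate(candidate)
        if score > best_score:
            best, best_score, feedback = candidate, score, candidate_feedback
    return best, best_score

TARGET = "hello"  # toy objective: match this string

def toy_evaluate(text: str) -> Tuple[float, str]:
    """Score is the count of matching characters; feedback names the first mismatch."""
    score = float(sum(a == b for a, b in zip(text, TARGET)))
    for index, (a, b) in enumerate(zip(text, TARGET)):
        if a != b:
            return score, f"{index}:{b}"
    return score, ""

def toy_propose(text: str, feedback: str) -> str:
    """Stand-in for an LLM: apply the fix described in the feedback string."""
    if not feedback:
        return text
    position, char = feedback.split(":")
    index = int(position)
    return text[:index] + char + text[index + 1:]

print(optimize_text("xxxxx", toy_evaluate, toy_propose, budget=5))  # ('hello', 5.0)
```

With score-only feedback the proposer would have to guess which character to change, which is a small-scale picture of why side information can speed up convergence.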

2605.19631 2026-05-20 cs.RO cs.CV

Bézier Degradation Modeling for LiDAR-Based Human Motion Capture

Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang

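AI summary: This paper proposes the BMLiCap framework, which models human motion with temporally compressible Bézier curves and reduces control points through a trajectory-preserving strategy. A progressive motion-reconstruction module uses a Time-scale Motion Transformer and a Multi-level Motion Aggregator to fuse multi-scale curves, improving motion reconstruction accuracy and temporal continuity in complex scenes.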

Comments Accepted by CVPR 2026

Abstract

LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks (LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D), BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.
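
The appeal of a Bézier motion representation is that a handful of control points define a smooth trajectory over many frames. The sketch below evaluates a curve with De Casteljau's algorithm and samples it at frame times; it assumes one curve per joint coordinate and uniform time sampling, and it does not reproduce the paper's control-point reduction strategy or its network modules.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] by repeated linear interpolation
    (De Casteljau's algorithm)."""
    points = [list(p) for p in control_points]
    while len(points) > 1:
        points = [[(1.0 - t) * a + t * b for a, b in zip(p, q)]
                  for p, q in zip(points, points[1:])]
    return points[0]

def sample_trajectory(control_points, num_frames):
    """Expand a compact control-point representation into per-frame positions."""
    if num_frames == 1:
        return [bezier_point(control_points, 0.0)]
    return [bezier_point(control_points, frame / (num_frames - 1))
            for frame in range(num_frames)]

# Four control points (a cubic curve) stand in for a 30-frame joint trajectory.
controls = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
trajectory = sample_trajectory(controls, 30)
print(len(trajectory), bezier_point(controls, 0.5))  # 30 [0.5, 0.75]
```

Because every sampled frame is a smooth function of the same few control points, a noisy or missing observation at one frame cannot produce an isolated jump in the reconstructed trajectory.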
