arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13442 2026-05-14 cs.RO

Asymptotically Optimal Ergodic Coverage on Generalized Motion Fields

Christian Hughes, Yilang Liu, Yanis Lahrach, Julia Engdahl, Houston Warren, Darrick Lee, Fabio Ramos, Travis Miles, Ian Abraham

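AI summary: This paper studies asymptotically optimal ergodic coverage in dynamic flow-field environments. Because conventional methods cannot guarantee coverage quality in non-static environments, it proposes a flow-adaptive ergodic coverage method. The method uses maximum mean discrepancy (MMD) as the ergodic metric and couples it with the environment dynamics to plan robust exploration paths in underactuated and open-loop settings. Experiments show that the method works across diverse spatiotemporal processes, including ocean exploration and tracking human and cattle movement, and physical tests on aerial and legged robots demonstrate its feasibility in non-convex, flow-restricted environments.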

Comments 13 pages, 9 figures, 6 tables, Robotics: Science and Systems 2026

Abstract

Autonomous robotic exploration in remote and extreme environments allows scientists to model complex transport phenomena and collective behaviors described by continuously deforming flow fields. Although these environments are naturally modeled as time-varying domains, most adaptive exploration methods assume static environments and fail to provide adequate coverage or satisfy any formal guarantees. This is especially the case in oceanography, where autonomous underwater systems (UxS) have highly restrictive compute and payload requirements that necessitate path planning methods that yield robust data collection strategies in open-loop and underactuated settings. To address these issues, we formulate adaptive search as an ergodic coverage problem and investigate certifying coverage in the ergodic sense over evolving domains with flow-induced dynamics. We expand upon recent work demonstrating maximum mean discrepancy (MMD) as a functional ergodic metric, and derive a flow-adaptive formulation that explicitly accounts for domain evolution within the coverage objective. We show that this approach preserves ergodic coverage guarantees in ambient flows and enables effective exploration in underactuated and even open-loop planning settings by integrating environment dynamics. Experiments validate that our method generalizes to diverse spatiotemporal processes, including ocean exploration and tracking human and cattle movement. Physical experiments on aerial and legged robotic platforms validate our ability to obtain ergodic coverage in non-convex, flow-restricted environments while respecting robot dynamics.
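
To make the MMD-as-coverage-metric idea concrete, here is a minimal sketch of a squared-MMD estimate between the points a trajectory visits and samples from a target distribution. It covers only the static case: the flow-adaptive formulation from the abstract is not reproduced. The Gaussian kernel, the bandwidth, the function names, and the sample points are all illustrative assumptions, not the authors' code.

```python
import math

def gaussian_kernel(x, y, bandwidth=0.5):
    """RBF kernel between two points given as tuples (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(traj, target, bandwidth=0.5):
    """Biased (V-statistic) estimate of squared MMD between two point sets."""
    def mean_k(A, B):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in A for b in B)
        return total / (len(A) * len(B))
    return mean_k(traj, traj) - 2.0 * mean_k(traj, target) + mean_k(target, target)

# Hypothetical data: target samples on a 5x5 grid over the unit square.
target = [(i / 4, j / 4) for i in range(5) for j in range(5)]
spread = [(0.1, 0.1), (0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.9, 0.9)]
clumped = [(0.1, 0.1)] * 5  # a trajectory that never leaves one corner

score_spread = mmd_squared(spread, target)
score_clumped = mmd_squared(clumped, target)
```

A lower value means the visited points match the target distribution more closely, so the spread-out trajectory scores better than the one that stays in a corner.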

2605.13436 2026-05-14 cs.CL cs.LG

Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

Ruan Visser, Trienko Grobler, Marcel Dunaiski

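AI summary: This paper investigates whether applying BPE dropout during pretraining improves downstream performance on low-resource NLP tasks. The authors train monolingual and bilingual BERT models on downsampled subsets of several languages and evaluate them on multiple benchmarks. The best results come from using stochastic tokenization during both pretraining and fine-tuning, and pretraining-time BPE dropout helps most when data is scarce. The experiments also suggest that stochastic tokenization during pretraining exposes the model more consistently to morphologically aligned segmentations, which may improve its representations.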

Comments 12 pages, 8 figures, 5 tables

Abstract

Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining.
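
For readers unfamiliar with BPE dropout, the following toy sketch shows the mechanism: at each merge step, every applicable merge is skipped with probability `dropout`, so the same word can be segmented differently on different passes. The merge table and function name are made up for illustration, and the sketch is not tied to any particular tokenizer library or to the authors' setup.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Apply ranked BPE merges; each candidate merge is skipped with prob. `dropout`."""
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # highest-priority surviving merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Hypothetical merge table, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

deterministic = bpe_encode("lower", merges, dropout=0.0)
characters = bpe_encode("lower", merges, dropout=1.0)
stochastic = bpe_encode("lower", merges, dropout=0.5, rng=random.Random(0))
```

With `dropout=0.0` this reduces to ordinary deterministic BPE, and with `dropout=1.0` the word stays split into characters. Intermediate values produce the mixed segmentations that the abstract refers to as stochastic tokenization.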

2605.13435 2026-05-14 cs.LG cs.AI

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

JaeHyeok Doo, Byeongguk Jeon, Seonghyeon Ye, Kimin Lee, Minjoon Seo

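AI summary: This paper proposes Q-Flow, a reinforcement learning framework that aims to exploit the high expressive capacity of flow-based policies while addressing their optimization instability during value maximization. It uses the deterministic dynamics of the flow to propagate terminal trajectory value directly to intermediate latent states, enabling stable policy optimization without unrolling the numerical solver. Experiments show that Q-Flow clearly outperforms state-of-the-art baselines in offline learning and supports stable online adaptation within the same framework.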

Comments 27 pages

Abstract

There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.

2605.13434 2026-05-14 cs.LG cs.DC math.OC stat.ML

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

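AI summary: This paper studies asynchronous stochastic gradient descent (ASGD) for distributed learning under data and system heterogeneity. Vanilla ASGD ignores differences in worker speed, so its updates are biased toward a frequency-weighted average of the local objectives rather than the global objective. The proposed method, Rescaled ASGD, scales each worker's stepsize in proportion to its computation time so that every worker contributes the same aggregate learning rate over a cycle, which restores optimization of the correct global objective. Theory shows convergence to stationary points of the global objective in the non-convex setting, with time complexity matching the known lower bound in the leading term, and experiments confirm that the method is effective and competitive with state-of-the-art baselines.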

Abstract

Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.
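
The frequency bias and its correction can be seen in a small simulation. The sketch below assumes two workers with fixed computation times and quadratic local objectives f_i(x) = 0.5 (x - c_i)^2, so the global optimum is the plain average of the centers. Normalizing the rescaled stepsizes by the mean computation time is an illustrative choice. The abstract states only that stepsizes are proportional to computation times, and this is not the authors' implementation.

```python
import heapq

def run_asgd(centers, times, base_lr, horizon, rescale):
    """Event-driven ASGD on f_i(x) = 0.5 * (x - c_i)^2 with fixed compute times."""
    x = 0.0
    # Each event: (finish_time, worker_id, model snapshot the gradient is computed on).
    events = [(t, i, x) for i, t in enumerate(times)]
    heapq.heapify(events)
    mean_time = sum(times) / len(times)
    while events[0][0] <= horizon:
        now, i, snapshot = heapq.heappop(events)
        grad = snapshot - centers[i]  # stale gradient of worker i's local objective
        # Assumed rescaling: stepsize proportional to the worker's computation time.
        lr = base_lr * times[i] / mean_time if rescale else base_lr
        x -= lr * grad
        heapq.heappush(events, (now + times[i], i, x))  # worker restarts on new model
    return x

centers = [0.0, 10.0]  # local minimizers; the global objective is minimized at 5.0
times = [1.0, 4.0]     # worker 0 is four times faster than worker 1

x_vanilla = run_asgd(centers, times, base_lr=0.01, horizon=5000.0, rescale=False)
x_rescaled = run_asgd(centers, times, base_lr=0.01, horizon=5000.0, rescale=True)
```

In this toy setting the vanilla run settles near 2.0, the frequency-weighted average (4 * 0 + 1 * 10) / 5, while the rescaled run settles near the true average 5.0. Both oscillate slightly because the stepsize is constant.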

2605.13431 2026-05-14 cs.SD

Text2Score: Generating Sheet Music From Textual Prompts

Keshav Bhandari, Sungkyun Chang, Abhinaba Roy, Francesca Ronchini, Emmanouil Benetos, Dorien Herremans, Simon Colton

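AI summary: This paper presents Text2Score, a two-stage framework for generating sheet music from natural language prompts, aimed at the data scarcity and unreliable automated captioning that hamper text-driven symbolic music generation. By deriving supervision signals directly from symbolic XML data, it sidesteps noisy and sparse text-music pairs. In the planning stage, a large language model produces a structured score plan. In the execution stage, a generative model produces ABC notation that conforms to the plan. Experiments show that Text2Score outperforms existing methods on playability, readability, and other evaluation dimensions. The dataset, code, and evaluation tools are open-sourced.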

Comments 8 pages including references, 1 figure

Abstract

Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).

2605.13429 2026-05-14 cs.CL

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

Chong Li, Yingzhuo Deng, Wen Yang, Jiajun Zhang, Chengqing Zong

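AI summary: This paper proposes TokAlign++, which improves vocabulary adaptation for large language models by learning a better token alignment lexicon. It treats the source and target vocabularies as two different languages, learns a bilingual alignment lexicon from monolingual token representations, rearranges model parameters accordingly for the new vocabulary, and then adapts the model through progressive fine-tuning. Experiments on 15 languages show improved multilingual text compression rates, recovery of the original model's performance in relatively few training steps, and effective support for token-level distillation.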

Comments Paper under review

Abstract

Tokenization is a foundational step in the text processing pipeline of Large Language Models (LLMs). Text must first be tokenized into token IDs, which are then fed to the LLM. Inefficient tokenization produces long token-ID sequences and slows down LLM training and inference. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce TokAlign++, a method that improves vocabulary adaptation by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged according to this bilingual lexicon for the new vocabulary and then progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts multilingual text compression rates and preserves most of the multilingual ability of the vanilla models. It takes as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.

2605.13428 2026-05-14 cs.RO

SID: Sliding into Distribution for Robust Few-Demonstration Manipulation

Yicheng Ma, Wei Yu, Zhian Su, Xidan Zhang, Huixu Dong

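AI summary: This paper proposes SID, a framework for robust robotic manipulation from only a few demonstrations. SID learns an object-centric motion field that iteratively guides the system toward the demonstration manifold and into the reliable operating region of a lightweight egocentric execution policy, reducing out-of-distribution execution. Across several real-world tasks, it reaches roughly 90% success under out-of-distribution initial conditions with only two demonstrations, offering a new paradigm for few-shot manipulation.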

Comments 20 pages, 14 figures. Project website: https://sliding-into-distribution.github.io/

Abstract

Generalizing robotic manipulation across object poses, viewpoints, and dynamic disturbances is difficult, especially with only a few demonstrations. End-to-end visuomotor policies are expressive but data-hungry, while planning and optimization satisfy explicit constraints but do not directly capture the interaction strategies demonstrated by humans. We propose Sliding into Distribution (SID), a structured framework that learns an object-centric motion field from canonicalized demonstrations to iteratively slide the system toward the demonstrated manifold and into the reliable operating region of a lightweight egocentric execution policy, mitigating out-of-distribution (OOD) execution. The motion field provides large corrective motions when far from the demonstration manifold and naturally vanishes near convergence, enabling robust reaching under substantial pose and viewpoint shifts. Within the reached regime, an egocentric policy trained with conditioned flow matching performs task-specific manipulation, supported by kinematically consistent point-cloud reprojection augmentation that preserves action-observation consistency. Across six real-world tasks, SID achieves approximately 90% success under OOD initializations with only two demonstrations, with under a 10% drop under distractors and external disturbances. Overall, SID provides a new paradigm for few-shot manipulation: explicitly managing distribution shift via online distribution recovery.

2605.13424 2026-05-14 cs.LG cs.CL

LIFT: Last-Mile Fine-Tuning for Table Explicitation

Divij Khaitan, Ashish Tiwari

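AI summary: This paper proposes LIFT (Last-Mile Fine-Tuning), a fine-tuning approach for extracting tables from unstructured clipboard text and repairing extraction errors. It pairs a pre-trained large language model with a fine-tuned small language model (1B-24B parameters), improving robustness to input format variation without sacrificing accuracy, and it outperforms end-to-end fine-tuning with only 1,000 training examples. The study indicates that LIFT is more efficient and more adaptable for table extraction.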

Comments 9 pages, 1 figure, 3 tables

Abstract

We propose last-mile fine-tuning, or Lift, a pipeline in which a pre-trained large language model extracts an initial table from unstructured clipboard text, and a fine-tuned small language model (SLM, 1B-24B parameters) repairs errors in the extracted table. On a benchmark of 2,596 tables from three datasets, Lift matches or exceeds end-to-end SLM fine-tuning on the tree-edit-distance-based similarity (TEDS) metric while requiring as few as 1,000 training examples, a regime in which it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. We also show that Lift is more robust to input format variability. Comparisons with self-debug and end-to-end fine-tuning approaches show that last-mile fine-tuning provides an attractive option when training data is limited or when robustness to input variation is sought without compromising on accuracy.

2605.13418 2026-05-14 cs.LG

DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning

Marc Molina Van den Bosch, Riccardo Taiello, Albert Sund Aillet, Andrea Protani, Miguel Angel Gonzalez Ballester, Luigi Serio

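AI summary: This paper proposes DP-KFC, a data-free preconditioning method that improves optimization in privacy-preserving deep learning. It probes the network with structured synthetic noise to estimate curvature without any private or public data, addressing the geometric mismatch in differentially private optimization between anisotropic loss landscapes and isotropic noise. Experiments show that DP-KFC outperforms DP-SGD and adaptive baselines under strong privacy guarantees, and it shows promise for data-scarce domains such as medicine.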

Comments Accepted at the International Conference on Machine Learning (ICML 2026). 9 pages main text + appendix, 5 figures, 2 tables. Code: https://github.com/molinamarcvdb/DP-KFC

Abstract

Differentially private optimization suffers from a fundamental geometric mismatch: deep networks have highly anisotropic loss landscapes, yet DP-SGD injects isotropic noise. Second-order preconditioning can resolve this, but estimating curvature typically requires private data (consuming privacy budget) or public data (introducing distribution shift). We show that the Fisher Information Matrix decouples into architectural sensitivity, recoverable via synthetic noise, and input correlations, approximable from modality-specific frequency statistics. We propose DP-KFC, which constructs KFAC preconditioners by probing networks with structured synthetic noise, requiring neither private nor public data. Empirically, DP-KFC consistently outperforms DP-SGD and adaptive baselines across diverse modalities in strong privacy regimes ($\varepsilon \leq 3$). DP-KFC matches private-data preconditioners while public-data variants degrade by up to $4.8\%$, showing that curvature can be estimated without consuming privacy budget or introducing distribution shift. This enables privacy-preserving learning in specialized domains (e.g., medical applications) where regulatory constraints make data scarce.
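
As background for the "isotropic noise" remark, this is a minimal sketch of one standard DP-SGD gradient aggregation step: clip each per-example gradient, sum, add Gaussian noise with the same standard deviation in every coordinate, and average. It illustrates the baseline only. The DP-KFC preconditioner itself is not shown, and the function name and toy gradients are assumptions made for illustration.

```python
import math
import random

def dp_sgd_update(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum, add isotropic Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        # Scale down only gradients whose L2 norm exceeds the clipping threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += scale * g[k]
    sigma = noise_multiplier * clip_norm  # identical std per coordinate: isotropic
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]

# Hypothetical per-example gradients: one large (clipped), one small (untouched).
grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dp_sgd_update(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
private = dp_sgd_update(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

Because the noise scale does not depend on the coordinate, directions in which the loss is flat receive as much noise as directions in which it is steep. This is the mismatch that the abstract says preconditioning is meant to address.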

2605.13414 2026-05-14 cs.AI

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Zabir Al Nazi, Shubhashis Roy Dipta

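AI summary: This paper introduces TRIAGE, an evaluation framework for prospective metacognitive control in large language models: the ability to select, order, and allocate compute to upcoming tasks under resource constraints. The model receives a task pool and a preset token budget and must commit to a single plan covering task selection, ordering, and resource allocation. Plans are scored using the model's own solvability and cost on each task, yielding a triage efficiency ratio. Experiments show that current mainstream language models fall well short on this ability, revealing a key, previously unmeasured capability dimension for resource-efficient deployment.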

Abstract

Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.

2605.13412 2026-05-14 cs.CL cs.AI

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

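AI summary: This study examines the performance and errors of off-the-shelf large language models (LLMs) when automatically annotating credibility assessments in Danish asylum decision texts. It introduces RAB-Cred, a high-quality Danish legal text classification dataset, and systematically evaluates many model and prompt combinations in zero-shot and few-shot settings. The analysis exposes inconsistencies and error patterns among top-performing models, underscores the limits of relying on a single model's predictions, and concludes that LLM annotators remain imperfect in specialized domains such as law and should be combined with human judgment and finer-grained evaluation.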

Comments Accepted at the 20th Linguistic Annotation Workshop (LAW XX), co-located with ACL 2026 (https://sigann.github.io/LAW-XX-2026/)

Abstract

Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred

2605.13408 2026-05-14 cs.CL

From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

Neh Majmudar, Anne Huang, Jinfan Frank Hu, Elena Filatova

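AI summary: This paper examines two puzzle formats common in high school linguistics competitions, Rosetta Stone and Match-Up, and proposes a systematic, efficient procedure for converting the former into the latter, which speeds up the creation of new puzzles. Testing the converted puzzle pairs on human experts and large language models (LLMs) shows that both display an all-or-nothing pattern on Match-Up puzzles: they either solve a puzzle completely or fail entirely. The work contributes a new dataset of paired puzzles and analyzes how difficulty differs across formats, offering new insight into human and machine linguistic reasoning.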

Comments Proceedings of the Fifteenth Language Resources and Evaluation Conference

Abstract

In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.

2605.13407 2026-05-14 cs.LG cs.CE q-fin.ST

Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction

Namhyoung Kim, Jae Wook Song

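AI summary: This paper proposes PRISM-VQ, a dynamic factor framework that addresses the low signal-to-noise ratios and shifting market regimes that make cross-sectional stock return prediction difficult. It combines expert prior factors, vector-quantized discrete latent factors learned from cross-sectional structure, and a structure-conditioned Mixture-of-Experts network that generates time-varying factor loadings. Experiments on the CSI 300 and S&P 500 show improved return prediction and portfolio performance while preserving interpretability.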

Comments IJCAI 2026 Accepted Paper including Technical Appendix

Abstract

Predicting cross-sectional stock returns is challenging due to low signal-to-noise ratios and evolving market regimes. Classical factor models offer interpretability but limited flexibility, while deep learning models achieve strong performance yet often underutilize financial priors. We address this gap with PRISM-VQ (PRior-Informed Stock Model with Vector Quantization), a dynamic factor framework that integrates expert prior factors, vector-quantized discrete latent factors learned from cross-sectional structure, and a structure-conditioned Mixture-of-Experts to generate time-varying factor loadings. Vector quantization acts as an information bottleneck that suppresses noise while capturing robust market structure, with discrete codes serving both as latent factors and as routing signals for temporal expert specialization. Experiments on CSI 300 and S&P 500 show consistent improvements in cross-sectional return prediction and portfolio performance over strong baselines while preserving interpretability. Our code is available at https://github.com/finxlab/PRISM-VQ.

2605.13405 2026-05-14 cs.LG

When is Warmstarting Effective for Scaling Language Models?

Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Josif Grabocka, Frank Hutter, Aaron Klein

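AI summary: This paper studies when warmstarting is effective for scaling up language models. The authors note that, despite its potential resource savings, warmstarting is rarely used in practical large-scale training, which they attribute mainly to an overemphasis on preserving model performance at initialization and to insufficient analysis of growth strategies. They find that preserving the base model's initial performance is unnecessary, that simple, architecture-agnostic growth strategies are often more effective, and that there is an upper bound on the growth factor beyond which training from scratch is more efficient. A 2x growth factor most reliably yields convergence speedups, and the results offer practical guidance and empirical limits for model growth.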

Abstract

Model growth from a given checkpoint aims to accelerate training of a larger model, offering potential resource savings. Despite recent interest, warmstarting has seen limited practical adoption in large-scale training. We attribute this to two underexplored factors: (1) an overemphasis on preserving the smaller model's performance at initialization, which constrains operator design for new architectures, and (2) insufficient analysis of how growth interacts with hyperparameters and scaling behavior, compounded by inconsistent growth factors across the literature. We show that preserving the base model's initial post-growth performance is not necessary for strong final performance, and that simple, architecture-agnostic growth strategies can outperform more complex warmstarting operators. Crucially, we empirically identify an upper bound on the growth factor $g$ beyond which training from scratch is more efficient. We observe this across multiple ablation setups. Notably, this limit is also present, but unreported, in prior published results. Across our experiments on dense MLPs and dense language models, we find that a $2\times$ growth factor is the most reliable in yielding convergence speedups, with gains most pronounced under 20 tokens/parameter budgets and diminishing as budget increases. We fit scaling laws over these observations to provide predictive guidance for practitioners deciding when and how much to grow. Together, our analysis provides practical guidelines and empirical limits for model growth.

2605.13404 2026-05-14 cs.SD

Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

Konstantinos Soiledis, Maximos Kaliakatsos Papakostas, Dimos Makris, Konstantinos Tsamis

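AI summary: This study presents Sec2Drum-DAC, a conditional latent diffusion model that generates drum audio from symbolic control information. The model samples event features at physical time points and predicts principal-component coordinates of frozen DAC codebook embeddings rather than waveform samples, producing realistic audio while preserving rhythm and dynamics. Experiments show that it outperforms deterministic PCA regression and a symbolic rendering baseline on several metrics, most clearly on spectral and transient ones.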

Abstract

Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features sampled in physical time at codec-frame locations and predicts standardized principal-component coordinates of frozen DAC summed-codebook embeddings rather than waveform samples. In the evaluated DAC configuration, 72 principal components capture the observed training-frame summed-latent subspace under the stated SVD threshold, yielding a compact continuous denoising target with a deterministic reconstruction path to the 1024-dimensional DAC latent space before waveform decoding. Across 1,733 held-out four-beat windows, PCA diffusion improves paired spectral and transient metrics over deterministic PCA regression and a symbolic rendering baseline, while direct regression remains stronger on phase-sensitive waveform L1. Auxiliary RVQ cross-entropy improves short-step diffusion on mel error, onset-flux cosine, and waveform L1, with the most favorable trade-offs occurring at 6-25 denoising steps depending on the metric.

2605.13403 2026-05-14 cs.RO cs.CV

RotVLA: Rotational Latent Action for Vision-Language-Action Model

Qiwei Li, Xicheng Gong, Xinghang Li, Peiyan Li, Quanyun Zhou, Hangjun Ye, Jiahuan Zhou, Yadong Mu

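AI summary: This paper proposes RotVLA, a vision-language-action (VLA) framework built on a continuous rotational latent action representation. It targets the trivial reconstruction behavior and limited expressiveness caused by discretized action representations in existing latent action models. RotVLA models latent actions as elements of SO(n), giving them continuity, compositionality, and structured geometry aligned with real-world action dynamics, and uses a triplet-frame learning framework to enforce meaningful temporal dynamics. Experiments show strong results on multiple benchmarks, outperforming existing VLA models.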

Abstract

Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and decode pipelines, which can lead to trivial frame reconstruction behavior, limited representational capacity, and a lack of physically meaningful structure. We introduce RotVLA, a VLA framework built on a continuous rotational latent action representation. Latent actions are modeled as elements of SO(n), providing continuity, compositionality, and structured geometry aligned with real-world action dynamics. A triplet frame learning framework further enforces meaningful temporal dynamics while avoiding degeneration. RotVLA consists of a VLM backbone and a flow-matching action head, pretrained on large-scale cross-embodiment robotic datasets and human videos with latent-action supervision. For downstream robot control, the flow-matching head is extended into a unified action expert that jointly denoises latent and robot actions. Here, latent actions serve as a latent planner, providing high-level guidance that conditions action generation. With only 1.7B parameters and 1700+ hours of pretraining data, RotVLA achieves 98.2% on LIBERO and 89.6% / 88.5% on RoboTwin2.0 under clean and randomized settings, respectively. It also demonstrates strong real-world performance on manipulation tasks, consistently outperforming existing VLA models.
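
The continuity and compositionality attributed to SO(n) can be checked numerically. The sketch below builds rotation matrices as matrix exponentials of skew-symmetric matrices and composes them by matrix multiplication, and the product is again a rotation. This parametrization is a standard construction chosen for illustration. The abstract does not say how RotVLA parametrizes its latent actions, and all names and parameter values here are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def skew(params, n):
    """Fill an n x n skew-symmetric matrix from n*(n-1)/2 free parameters."""
    S = [[0.0] * n for _ in range(n)]
    values = iter(params)
    for i in range(n):
        for j in range(i + 1, n):
            v = next(values)
            S[i][j], S[j][i] = v, -v
    return S

def expm(S, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small-norm inputs)."""
    n = len(S)
    result, term = identity(n), identity(n)
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, S)]  # S^k / k!
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def orthogonality_error(R):
    """Largest entry of |R^T R - I|; near zero for an orthogonal matrix."""
    n = len(R)
    RtR = matmul([list(col) for col in zip(*R)], R)
    return max(abs(RtR[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))

# Two hypothetical latent actions in SO(4), each with 6 free parameters.
R1 = expm(skew([0.3, -0.2, 0.1, 0.5, 0.0, -0.4], 4))
R2 = expm(skew([-0.1, 0.4, 0.2, -0.3, 0.6, 0.1], 4))
composed = matmul(R1, R2)  # applying one action after the other stays in SO(4)
```

Small changes in the parameters produce small changes in the matrix, which is the continuity that discrete codebook indices lack.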

2605.13402 2026-05-14 cs.CV cs.DS

Fast and Compact Graph Cuts for the Boykov-Kolmogorov Algorithm

Christian Møller Mikkelstrup, Anders Bjorholm Dahl, Philip Bille, Vedrana Andersen Dahl, Inge Li Gørtz

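AI summary: This paper revisits the Boykov-Kolmogorov (BK) algorithm for computing minimum $s$-$t$ cuts. It tightens the analysis of BK's time complexity to $O(mn|C|)$ and proposes a new fast and compact algorithm (fcBK) that runs in $O(m|C|)$. The authors also design a compact graph representation that lets the implementation handle graphs with upwards of $10^9$ vertices and $10^{10}$ edges within 128 GB of memory. Experiments show the implementation to be the fastest available for the BK algorithm, highlighting the importance of memory efficiency in large-scale graph-cut computation.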

Comments 15 pages, 6 figures, submitted to the IEEE for possible publication

Abstract

Computing a minimum $s$-$t$ cut in a graph is a solution to a wide range of computer vision problems, and is often done using the Boykov-Kolmogorov (BK) algorithm. In this paper, we revisit the BK algorithm from both a theoretical and practical point of view. We improve the analysis of the time complexity of the BK algorithm to $O(mn|C|)$ and propose a new algorithm, the fast and compact BK (fcBK) algorithm, with a time complexity of $O(m|C|)$, where $m$, $n$, and $|C|$ are the number of edges, number of vertices, and the capacity of the cut, respectively. We additionally propose a compact graph representation that allows our implementation to find a minimum $s$-$t$ cut in a graph with upwards of $10^9$ vertices and $10^{10}$ edges on a machine with 128 GB of memory. We find our implementation of the BK algorithm to be the fastest available implementation of the BK algorithm when evaluating on a comprehensive set of benchmark datasets, highlighting the importance of memory-efficient implementations. We make our implementations publicly available for further research and implementation development within minimum $s$-$t$ cut algorithms.
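
For orientation, here is a small runnable example of the minimum $s$-$t$ cut problem itself. It uses the textbook Edmonds-Karp augmenting-path method on a dense capacity matrix, not the BK or fcBK algorithm and not the compact representation from the paper. The graph and function name are made up for illustration.

```python
from collections import deque

def min_st_cut(n, edges, s, t):
    """Return (cut capacity, source-side vertex set) via Edmonds-Karp max-flow."""
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            # No path left: vertices still reachable from s form the source side.
            return flow, {v for v in range(n) if parent[v] != -1}
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical 4-vertex graph: (from, to, capacity), source 0, sink 3.
edges = [(0, 1, 3), (0, 2, 2), (1, 2, 1), (1, 3, 2), (2, 3, 3)]
cut_value, source_side = min_st_cut(4, edges, s=0, t=3)
```

With integer capacities, every augmentation raises the flow by at least one unit, so the number of augmentations is at most the cut capacity $|C|$. This is the general reason $|C|$ appears in bounds for augmenting-path methods.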

2605.13401 2026-05-14 cs.LG cs.RO stat.ML

Trajectory-Level Data Augmentation for Offline Reinforcement Learning

Tobias Schmähling, Matthias Burkhardt, Tobias Windisch

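AI summary: This paper proposes a trajectory-level data augmentation method for offline reinforcement learning, aimed at training policies from a small number of suboptimal trajectories in tasks such as active positioning. The method exploits task structure and the geometric relationship between rewards, value functions, and logging policies to improve data quality through trajectory-level augmentation, which in turn improves offline reinforcement learning performance. The authors provide theoretical justification and validate the method across settings of varying dimensionality and under partial observability.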

Comments 26 pages, 25 figures, Accepted at ICML 2026

Abstract

We propose a data augmentation method for offline reinforcement learning, motivated by active positioning problems. Particularly, our approach enables the training of off-policy models from a limited number of suboptimal trajectories. We introduce a trajectory-based augmentation technique that exploits task structure and the geometric relationship between rewards, value functions, and mathematical properties of logging policies. During data collection, our augmentation supports suboptimal logging policies, leading to higher data quality and improved offline reinforcement learning performance. We provide theoretical justification for these strategies and validate them empirically across positioning tasks of varying dimensionality and under partial observability.

2605.13399 2026-05-14 cs.LG cs.IT math.IT

The Diffusion Encoder

Akhil Premkumar, Sarah Lucioni

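AI summary: This paper constructs a new kind of encoder that uses the expressive power of diffusion models in place of the encoder in a traditional variational autoencoder. Because the diffusion model and the decoder tend to update their estimates of the latent in opposing directions, the authors design an alternating training scheme inspired by the expectation-maximization algorithm. The scheme synchronizes encoder and decoder more reliably while keeping the simple, efficient training objective of standard diffusion models.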

Comments 22 pages + references, 10 figures

Abstract

We construct a new kind of encoder, leveraging the expressive power of diffusion models. In a traditional variational autoencoder, the encoder and decoder jointly negotiate a latent representation of the input. This is made possible by the reparameterization trick, which simplifies training at the cost of restricting the encoder to a simple family of distributions. Replacing this encoder with a diffusion model requires rethinking how the decoder pressure can be transmitted back to the encoder, given that they tend to update their internal estimates of the latent in opposing directions. We solve this problem with an alternating training scheme, inspired by the expectation-maximization algorithm. Our method enables more reliable synchronization between encoder and decoder, while preserving the simple and efficient training objective of standard diffusion models.

2605.13396 2026-05-14 cs.CV

PreFIQs: Face Image Quality Is What Survives Pruning

Jan Niklas Kolf, Guray Ozgur, Andrea Atzori, Žiga Babnik, Vitomir Štruc, Naser Damer, Fadi Boutros

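AI summary This paper proposes PreFIQs, a training-free and unsupervised face image quality assessment framework. Building on the Pruning Identified Exemplar (PIE) hypothesis, it measures image quality through the Euclidean distance between the embeddings produced by a pre-trained face recognition model and by its pruned counterpart. The method is supported theoretically by a Jacobian-vector product analysis and achieves competitive or superior performance relative to existing methods on multiple benchmarks, validating parameter pruning as an effective signal for assessing face image quality.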

Comments Accepted at CVPR 2026 Workshops

Abstract

Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
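
The scoring rule described in the abstract can be sketched as follows, assuming the embeddings from the dense and pruned models have already been computed. The function names `l2_normalize` and `embedding_drift` and the toy vectors are hypothetical, and how the drift is mapped to a final quality score (for example by negating it) is not stated in the abstract.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def embedding_drift(emb_dense, emb_pruned):
    """Euclidean distance between L2-normalized embeddings, one value per image.

    emb_dense:  embeddings from the pre-trained FR model, shape (n, d)
    emb_pruned: embeddings of the same images from the pruned model, shape (n, d)
    Under the hypothesis in the abstract, larger drift suggests lower utility.
    """
    a = l2_normalize(emb_dense)
    b = l2_normalize(emb_pruned)
    return np.linalg.norm(a - b, axis=-1)


if __name__ == "__main__":
    # Hypothetical toy embeddings for two images.
    dense = [[1.0, 0.0], [1.0, 0.0]]
    pruned = [[2.0, 0.0], [0.0, 3.0]]
    print(embedding_drift(dense, pruned))
```

Because both embeddings are normalized first, the drift depends only on the change in direction, so it always lies between 0 and 2.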

2605.13395 2026-05-14 cs.LG cs.CV

Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation

Lilin Zhang, Yimo Guo, Yue Li, Jiancheng Shi, Xianggen Liu

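AI summary This paper studies adversarial training of deep neural networks on long-tailed data. It points out that conventional adversarial training on class-imbalanced data suffers from a skewed training objective and an unstable adversarial distribution. The authors propose adaptively adjusting adversarial perturbations to improve robustness and class balance at the same time, and design a plug-and-play framework named RobustLT. Experiments show that the method improves adversarial robustness and class balance on multiple long-tailed datasets.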

Comments accepted by CVPR 2026

Abstract

Deep neural networks are highly vulnerable to adversarial examples, i.e., small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose RobustLT, a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RobustLT consistently enhances adversarial robustness and class balance on long-tailed datasets. The code is available at https://github.com/zhang-lilin/RobustLT.

2605.13391 2026-05-14 cs.AI

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

Liangtian Liu, Zeyuan Wang, Ziyu Li, Kai Ouyang, Zichao Tang, Chengfu Liu, Haifeng Li, Hanwen Yu, Wentao Yang, Cheng Yang, Dongyang Hou

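AI summary With the development of multimodal large language models, remote sensing intelligence is shifting from "perception" to "action", but existing remote sensing agents still select tools passively and struggle to balance context load against toolset completeness in complex tasks. To address this, the paper proposes RS-Claw, an active-exploration architecture based on hierarchical skill trees. Skill encapsulation gives the tools hierarchical descriptions, so the agent can load tool information progressively and on demand, which frees substantial context space and raises the hit rate of critical tools. Experiments show that RS-Claw performs strongly on the Earth-Bench benchmark, compressing input tokens effectively and outperforming existing methods.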

Abstract

The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate large numbers of RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive, multi-source, heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, and thus have inherent limitations: full tool registration exhausts the context space during long-horizon tasks, whereas RAG retrieval may omit critical tools at essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By applying Skill encapsulation at the tool end, this architecture structures tool descriptions hierarchically, enabling the agent to make on-demand sequential decisions: it first selects relevant skill branches by reading only tool summaries, then dynamically loads detailed descriptions, and finally invokes the tool precisely. This active paradigm not only frees much of the agent's context space but also maintains the hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86% and outperforming existing Flat and RAG baselines across complex reasoning evaluations.

2605.13386 2026-05-14 cs.LG stat.ML

Support-Conditioned Flow Matching Is Kernel Smoothing

Daniel Matsui Smola

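AI summary This paper studies how cross-attention-based generative models generate when conditioned on a finite support set. It shows that the velocity field is a Nadaraya-Watson kernel smoother whose bandwidth shrinks as flow time advances, moving from global averaging at early steps to nearest-neighbor behavior at late steps. The work connects cross-attention to classical kernel methods and identifies three failure regimes. Experiments confirm the theoretical predictions and show that IP-Adapter's cross-attention implements approximate kernel smoothing.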

Comments Submitted to NeurIPS 2026. 18 pages, 10 figures, 1 table. Code at https://github.com/BaroqueObama/kernel-flow-matching-code

Abstract

Generative models are often conditioned on a small set of examples via cross-attention. Under the Gaussian optimal-transport path, we show that the exact velocity field induced by a finite support set is a Nadaraya--Watson kernel smoother whose bandwidth decreases with flow time, from broad averaging at early steps to nearest-neighbor at late steps. A single Gaussian-kernel attention head exactly computes this field, connecting cross-attention conditioning to classical kernel theory. The theory predicts three failure regimes: nearest-neighbor collapse of the kernel at high dimension, mismatch between the isotropic kernel and the data geometry, and insufficient support for nonparametric estimation. Experiments on Gaussian mixtures, spherical shells, and DINOv2 ImageNet features confirm that learned conditioning improves in precisely these regimes, and that IP-Adapter's cross-attention implements approximate NW smoothing in practice.
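
To make the kernel-smoothing claim concrete, here is a minimal sketch under an assumed parameterization that the abstract does not spell out: the path $x_t = (1-t)x_0 + t x_1$ with $x_0 \sim \mathcal{N}(0, I)$ and $x_1$ drawn uniformly from the support set. Under that assumption the posterior weight of support point $y_i$ is proportional to $\exp(-\|x/t - y_i\|^2 / (2h^2))$ with bandwidth $h = (1-t)/t$, which shrinks as $t$ grows, and the velocity is $(\hat{x}_1 - x)/(1-t)$. The function name `nw_velocity` is hypothetical.

```python
import numpy as np


def nw_velocity(x, t, support):
    """Velocity at point x and time t in (0, 1) for a finite support set.

    Assumes x_t = (1 - t) * x_0 + t * x_1 with x_0 ~ N(0, I) and x_1 uniform
    over the rows of `support` (an assumed convention, not the paper's code).
    Returns (velocity, weights); the weights are a softmax over support points.
    """
    x = np.asarray(x, dtype=float)
    support = np.asarray(support, dtype=float)
    sigma = 1.0 - t
    # Gaussian log-likelihood of x under N(t * y_i, sigma^2 I), up to a constant.
    sq_dist = np.sum((x - t * support) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max()  # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    # Nadaraya-Watson estimate of E[x_1 | x_t = x].
    x1_hat = weights @ support
    velocity = (x1_hat - x) / sigma
    return velocity, weights


if __name__ == "__main__":
    pts = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]]
    for t in (0.05, 0.5, 0.95):
        _, w = nw_velocity([0.9, 0.0], t, pts)
        print(t, np.round(w, 3))
```

The softmax over negative squared distances has the same form as a single attention head with Gaussian-kernel scores, which is the connection to cross-attention that the abstract draws.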

2605.13384 2026-05-14 cs.LG

Teaching and Learning under Deductive Errors

Jan Arne Telle, Brigt Håvardstun, Jose Hernandez-Orallo

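AI summary This paper studies a framework for machine teaching and learning when the learner makes deductive errors. Traditional models assume the learner makes no inference errors, but humans and large language models in few-shot regimes often make inconsistent or stochastic errors. The authors therefore propose a new teaching framework and, in a modified PAC setting, analyze how a teacher can find an approximately correct teaching set for a given estimated error level. They also examine the complexity of the related computational problems, giving parameterized algorithms and an experimental study.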

Comments 15 pages, preprint neurips

Abstract

Most models of machine teaching and learning assume the learner makes no errors in its internal deductive inference. However, humans and large language models in few-shot learning regimes are two important examples of learners for which this does not hold. They fail some consistency checks, and they can fail stochastically. In this paper we introduce a teaching and learning framework that takes these deductive errors into account. We specifically study the case of machine teaching, as different characterizations of the teacher can account for both machine teaching and learning. In an overhauled Probably Approximately Correct (PAC) setting, we study theoretically the problem in which, for some estimated error level, the teacher must find a PAC teaching set that with high probability will lead the learner to guess a hypothesis that is approximately correct. We study the computational complexity of six different problems related to computing optimal PAC teaching sets. We give XP algorithms parameterized by the size of the teaching set, with tight runtime bounds under standard complexity assumptions such as ETH. These results are complemented by a small experimental study of which teaching and learning protocols best represent the observed behavior in some LLM teaching sessions.
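
As a point of reference for what "XP in the size of the teaching set" means, the sketch below brute-forces the classical, error-free version of the problem: it enumerates all example subsets of size at most `max_size`, which takes on the order of $n^k$ subsets for $n$ examples and size bound $k$. It deliberately ignores the deductive errors that the paper models, and the function name `smallest_teaching_set` and the input format are assumptions made for illustration.

```python
from itertools import combinations


def smallest_teaching_set(hypotheses, target, max_size):
    """Find a smallest set of example indices that singles out `target`.

    hypotheses: dict mapping a hypothesis name to a tuple of labels,
                one label per example index
    target:     name of the hypothesis the teacher wants to convey
    Classical error-free setting only; returns None if no set of size
    <= max_size works.
    """
    target_labels = hypotheses[target]
    n_examples = len(target_labels)
    for size in range(max_size + 1):
        for subset in combinations(range(n_examples), size):
            # Hypotheses that agree with the target on every shown example.
            consistent = [
                name
                for name, labels in hypotheses.items()
                if all(labels[i] == target_labels[i] for i in subset)
            ]
            if consistent == [target]:
                return subset
    return None


if __name__ == "__main__":
    hyps = {"h1": (0, 0, 1), "h2": (0, 1, 1), "h3": (1, 1, 1)}
    print(smallest_teaching_set(hyps, "h2", max_size=3))
```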

2605.13383 2026-05-14 cs.LG

Beyond Oversquashing: Understanding Signal Propagation in GNNs Via Observables

Eden Nagar, Ya-Wei Eileen Lin, Ron Levie

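AI summary This paper studies signal propagation in graph neural networks (GNNs), noting that propagation under conventional methods tends to lose information, which shows up as oversmoothing and oversquashing. Inspired by quantum mechanics, the authors propose a new observable-based model that characterizes where a signal sits in the graph, how concentrated it is, and how it propagates, and they prove that standard spectral GNNs have poor signal propagation capabilities. On this basis they propose a new spectral GNN, the Schrödinger GNN, which routes signals across the graph more effectively.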

Abstract

Graph Neural Networks (GNNs) perform computations on graphs by routing the signal between graph regions using a graph shift operator or a message passing scheme. Often, the propagation of the signal leads to a loss of information, where the signal tends to diffuse across the graph instead of being deliberately routed between regions of interest. Two notions that depict this phenomenon are oversmoothing and oversquashing. In this paper, we propose an alternative approach for modeling signal propagation, inspired by quantum mechanics, using the notion of observables. Specifically, we model the place in the graph where the signal lies, how much the signal is concentrated there, and how much of the signal is propagated towards a location of interest when applying a GNN. Using these new concepts, we prove that standard spectral GNNs have poor signal propagation capabilities. We then propose a new type of spectral GNN, termed Schrödinger GNN, which we show has a superior capacity to route the signal across the graph.

2605.13382 2026-05-14 cs.RO

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu

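AI summary This paper proposes BlockVLA, a framework for accelerating inference of autoregressive (AR) vision-language-action (VLA) models in robotic tasks. Using a block diffusion paradigm, BlockVLA converts a pretrained AR model into an efficient discrete diffusion policy that keeps autoregressive dependencies between blocks while denoising in parallel within each block, combining global causal coherence with local parallel generation. Experiments on the LIBERO and SimplerEnv benchmarks show a 3.3× inference speedup over standard discrete diffusion baselines, along with better training efficiency and performance gains on complex long-horizon tasks.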

Abstract

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation during long-horizon execution. Discrete Diffusion Language Models (dLLMs) provide a promising alternative through parallel token refinement, but their practical deployment in robotics remains limited by repeated denoising function evaluations (NFEs) and the difficulty of directly applying standard KV caching to bidirectional iterative decoding. To bridge these paradigms, we propose BlockVLA, a framework that adapts pretrained AR backbones into an efficient discrete diffusion policy through a block diffusion paradigm. BlockVLA maintains autoregressive dependencies at the block level while enabling parallel denoising within each block, thereby combining global causal coherence with local parallel generation. This design enables prefix KV-cache reuse across completed blocks, reduces the effective cost of iterative denoising, and provides a smoother transition from AR pretraining to diffusion-based policy fine-tuning. We conduct extensive evaluations on the LIBERO and SimplerEnv benchmarks. Experimental results demonstrate that our BlockVLA achieves a 3.3$\times$ inference acceleration over standard discrete diffusion baselines. Furthermore, our model exhibits superior training efficiency, with success rates converging substantially faster than baselines, a gain that is particularly pronounced in complex, long-horizon tasks, where BlockVLA achieves significant performance gains in the early stages of training. This work establishes Block Diffusion as a robust bridge between large-scale pretrained AR models and efficient, high-frequency real-time robotic control.
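
One way to picture "autoregressive across blocks, parallel within a block" is a block-causal attention mask, sketched below. This layout is an assumption made for illustration, since the abstract does not specify BlockVLA's exact attention pattern, and the function name `block_causal_mask` is hypothetical.

```python
import numpy as np


def block_causal_mask(seq_len, block_size):
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Tokens see every token in their own block (bidirectional, so a block can
    be denoised in parallel) and every token in earlier blocks (causal across
    blocks). Illustrative layout only, not taken from the paper.
    """
    block_id = np.arange(seq_len) // block_size
    return block_id[:, None] >= block_id[None, :]


if __name__ == "__main__":
    print(block_causal_mask(6, 2).astype(int))
```

Under such a mask no token in a completed block ever attends to a later block, so the keys and values of completed blocks stay fixed, which is consistent with the prefix KV-cache reuse mentioned in the abstract.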

2605.13381 2026-05-14 cs.CV cs.MM

Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics

Chiara Musso, Joy Battocchio, Andrea Montibeller, Giulia Boato

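AI summary As AI-generated images become increasingly realistic, Vision Transformers (ViTs) have become a core technology in modern deepfake detection. However, existing methods commonly rely on frozen pre-trained backbones, which introduces a subtle but critical vulnerability. The paper proposes the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that uses only knowledge of the target detector's ViT backbone and crafts effective adversarial examples within the detector's feature space. Experiments show high success rates across multiple scenarios, often approaching white-box performance, revealing that backbone knowledge alone can seriously undermine detector reliability and underscoring the need for more robust defenses in adversarial multimedia forensics.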

Abstract

As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector's ViT backbone alone and operates entirely within the target detector's feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.

2605.13380 2026-05-14 cs.RO

Exploring Human-Robot Collaboration: Analysis of Interaction Modalities in Challenging Tasks

Simone Arreghini, Cristina Iani, Alessandro Giusti, Valeria Villani, Lorenzo Sabattini, Antonio Paolillo

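AI summary This paper studies three interaction modalities in human-robot collaboration (passive, reactive, and proactive) and analyzes experimentally how they perform on a challenging task. Participants assembled a seven-layer colored tower from memory under each modality. Although robot assistance increased task time, most participants preferred collaboration, especially the modality in which the robot proactively offered help. The study suggests that timely proactive support can improve user experience in controlled collaborative tasks.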

Abstract

This work compares three interaction modalities for human-robot collaboration: passive, reactive, and proactive. We studied 18 participants assembling a seven-layer colored tower from memory while using nearby and distant blocks. In the passive modality participants worked alone; in the reactive modality a mobile robot helped only upon request; in the proactive modality it initiated brick delivery and error signaling without explicit requests. Although robot assistance increased completion time, most participants preferred collaboration: 67% preferred proactive behavior and 78% judged it most useful. These results suggest that timely proactive support can improve user experience in controlled collaborative tasks.

2605.13375 2026-05-14 cs.CV cs.AI

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao

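AI summary In vision-language models (VLMs), processing large numbers of visual tokens incurs high computational cost. To address this, the paper proposes GRIP-VLM, a reinforcement-learning-based Group-Relative Importance Pruning framework that models pruning as a Markov decision process and explores the discrete selection space directly through Group Relative Policy Optimization (GRPO) anchored by a supervised warm-up, avoiding the sub-optimal solutions of continuous approximations. Combined with a budget-aware scorer, the method evaluates token importance dynamically and adapts to different compression ratios without retraining. Experiments show that it outperforms heuristic and supervised-learning baselines on multiple multimodal benchmarks, delivering up to a 15% inference speedup at equal accuracy.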

Comments 10 pages, 11 figures

Abstract

In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
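
The "group-relative" part of GRPO can be illustrated with the usual advantage normalization: several candidate actions (here, hypothetically, several sampled pruning masks for the same input) are scored, and each reward is standardized against its own group. This is a minimal sketch of that general idea, not GRIP-VLM's training code. The function name is hypothetical, and the choice of population standard deviation is an assumption, since implementations differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its group.

    rewards: rewards of candidates sampled for the same input
    Returns one advantage per candidate: positive means better than the
    group average, negative means worse.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed convention
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for three sampled pruning masks.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Because the baseline is the group mean, no separate value network is needed to judge whether a sampled candidate was better or worse than its peers.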

2605.13373 2026-05-14 cs.CL

Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing

Daniel Fernández-González, Cristina Outeiriño Cid

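AI summary This paper studies how to use pre-trained encoder-decoder Transformer models for sequence-to-sequence constituent parsing. The authors extend the existing sequence-to-sequence framework by building parsers on pre-trained encoder-decoder models such as BART, mBART, and T5, fine-tuning them and evaluating them under different linearization strategies. Experiments on continuous treebanks and more complex discontinuous benchmarks show that the approach outperforms previous sequence-to-sequence models and is competitive with state-of-the-art task-specific parsers on continuous constituent parsing.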

Comments Preliminary version

Abstract

To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models are typically initialized with pre-trained encoder-only language models like BERT or RoBERTa. However, the use of pre-trained encoder-decoder language models for constituency parsing has not been thoroughly explored. To bridge this gap, we extend the sequence-to-sequence framework by investigating parsers built on pre-trained encoder-decoder architectures, including BART, mBART, and T5. We fine-tune them to generate linearized parse trees and extensively evaluate them on different linearization strategies across both continuous treebanks and more complex discontinuous benchmarks. Our results demonstrate that our approach outperforms all prior sequence-to-sequence models and performs competitively with leading task-specific constituent parsers on continuous constituent parsing.
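
Treating parsing as translation requires turning each tree into a flat target sequence. The sketch below shows one common option, a top-down bracketed linearization, as an illustration only: the abstract does not say which linearization strategies the authors compare, and the function name `linearize` and the nested-tuple tree format are assumptions.

```python
def linearize(tree):
    """Flatten a constituency tree into a list of bracket and word tokens.

    tree: either a word (str) or a tuple (label, child, child, ...)
    Illustrative top-down bracketing, one of several possible strategies.
    """
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens


if __name__ == "__main__":
    tree = ("S", ("NP", "She"), ("VP", "sleeps"))
    print(" ".join(linearize(tree)))  # prints "(S (NP She ) (VP sleeps ) )"
```

A sequence-to-sequence model fine-tuned on such pairs reads the sentence and emits the token sequence, from which the tree can be rebuilt by matching brackets.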